One of my favorite games growing up was definitely Minecraft. To this day, I still remember meeting up with a couple of friends after school and figuring out what new, odd redstone contraption we would build next. That’s why, when Oasis, a playable, fully AI-generated open-world model, was released in October 2024, I was flabbergasted! Building reactive world models finally seemed within reach with current technology, and soon enough we might have fully AI-generated environments.
World models [3], introduced back in 2018 by David Ha and Jürgen Schmidhuber, are machine learning models capable of both simulating a fully virtual environment and letting a player interact with it. Their main limitation has been computational inefficiency, which made real-time interaction with the model a significant challenge.
In this blog post, we will introduce MineWorld [1], the first open-source Minecraft world model, developed by Microsoft. It is capable of fast real-time interaction and high controllability while using fewer resources than its closed-source counterpart, Oasis [2]. The paper’s contribution lies in three main points:
- MineWorld: an open-source, real-time, interactive world model with high controllability.
- A parallel decoding algorithm that speeds up the generation process, increasing the number of frames generated per second.
- A novel evaluation metric designed to measure a world model’s controllability.
Paper link: https://arxiv.org/abs/2504.08388
Code: https://github.com/microsoft/mineworld
Released: 11th of April 2025
MineWorld, Simplified
To accurately explain MineWorld and its approach, we will divide this section into three subsections:
- Problem Formulation: where we define the problem and establish some ground rules for both training and inference
- Model Architecture: An overview of the models used for generating tokens and output images.
- Parallel Decoding: A look into how the authors tripled the number of frames generated per second using a novel diagonal decoding algorithm [8].
Problem Formulation
There are two types of input to the world model: video game footage and player actions taken during gameplay. Each of these requires a different type of tokenization to be correctly utilized.
Given a clip of Minecraft video 𝑥, containing 𝑛 states/frames, image tokenization can be formulated as follows:
$$x = (x_{1}, \ldots, x_{n})$$
$$t = (t_{1}, \ldots, t_{c}, t_{c+1}, \ldots, t_{2c}, t_{2c+1}, \ldots, t_{N})$$
Each frame x_i contains c patches, and each patch is represented by a token t_j. A single frame x_i can therefore be described by its set of c quantized tokens, where each token corresponds to a distinct patch, capturing its own set of pixels.
Since every frame contains c tokens, the total number of tokens in one video clip is N = n · c.
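To make the bookkeeping concrete, here is a minimal sketch (with hypothetical numbers and a random stand-in for the tokenizer, not the paper’s actual VQ-VAE) of how a clip of n frames flattens into N = n · c tokens:

```python
import numpy as np

# Hypothetical numbers: n frames per clip, c patch tokens per frame,
# and a random stand-in for the visual tokenizer's codebook IDs.
n, c = 16, 336
codebook_size = 8192                     # assumed tokenizer vocabulary size

frame_tokens = np.random.randint(0, codebook_size, size=(n, c))  # one token ID per patch

# Flatten frame by frame -> (t_1, ..., t_c, t_{c+1}, ..., t_N)
t = frame_tokens.reshape(-1)
assert t.shape[0] == n * c               # N = n * c
```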
In addition to tokenizing video input, player actions must also be tokenized. These tokens need to capture variations such as changes in camera perspective, keyboard input, and mouse movements. This is achieved using 11 distinct tokens that represent the full range of input features:
- 7 tokens for seven mutually exclusive action groups; related actions are grouped into the same class (the grouping is shown in Table 1).
- 2 tokens to encode camera angles following [5]
- 2 tokens marking the beginning and end of the action sequence: [aBOS] and [aEOS].
Thus, a flat sequence capturing all game states and actions can be represented as follows:
$$t = (t_{i \cdot c + 1}, \ldots, t_{(i+1) \cdot c}, [\mathrm{aBOS}], t_{1}^{a_{i}}, \ldots, t_{9}^{a_{i}}, [\mathrm{aEOS}])$$
For each frame, we list the quantized IDs of its patches (as shown in the previous equation), followed by the beginning-of-action token [aBOS], the nine action tokens describing the player’s input for that frame, and the closing [aEOS] token; this pattern repeats for every frame in the clip.
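The interleaving itself is simple to sketch. Assuming hypothetical token IDs for [aBOS] and [aEOS] (the real IDs depend on the tokenizer’s vocabulary), the layout looks roughly like this:

```python
import numpy as np

A_BOS, A_EOS = 10000, 10001              # assumed IDs for the [aBOS] / [aEOS] tokens

def interleave(frame_tokens, action_tokens):
    """frame_tokens: (n, c) visual token IDs; action_tokens: (n, 9) action token IDs."""
    sequence = []
    for patches, actions in zip(frame_tokens, action_tokens):
        sequence.extend(patches.tolist())   # t_{i*c+1}, ..., t_{(i+1)*c}
        sequence.append(A_BOS)              # [aBOS]
        sequence.extend(actions.tolist())   # the frame's 9 action tokens
        sequence.append(A_EOS)              # [aEOS]
    return sequence

n, c = 16, 336
seq = interleave(np.zeros((n, c), dtype=int), np.zeros((n, 9), dtype=int))
assert len(seq) == n * (c + 11)             # 336 patch + 9 action + 2 special tokens per frame
```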
Model Architecture
Two main models were used in this work: a Vector Quantized Variational Autoencoder (VQ-VAE)[6] and a Transformer decoder based on the LLaMA architecture[7].
Although traditional Variational Autoencoders (VAEs) were once the go-to architecture for image generation (especially before the wide adoption of diffusion models), they have some limitations: they struggle with data that is discrete in nature (such as words or tokens) and with tasks that demand sharp, high-fidelity outputs. VQ-VAEs address these shortcomings by moving from a continuous latent space to a discrete one, making the representation more structured and better suited to downstream tasks.
In this paper, a VQ-VAE is used as the visual tokenizer, converting each image frame x into its quantized ID representation t. Input images of size 224×384 are divided into a 14×24 grid of patches (each patch covering 16×16 pixels), which results in a sequence of 336 discrete tokens representing the visual information in a single frame.
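The patch-grid arithmetic is easy to verify; the 16× spatial downsampling below is inferred from the numbers above rather than quoted from the paper:

```python
# Patch-grid arithmetic for a 224x384 frame; the 16x spatial downsampling
# is inferred from the numbers in the text, not quoted from the paper.
height, width = 224, 384
downsample = 16

grid_h, grid_w = height // downsample, width // downsample
tokens_per_frame = grid_h * grid_w
print(grid_h, grid_w, tokens_per_frame)   # 14 24 336
```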
On the other hand, a LLaMA transformer decoder was employed to predict each token conditioned on all previous tokens.
$$f_{\theta}(t)=\prod_{i=1}^{N} p\left(t_{i} \mid t_{<i}\right)$$
The Transformer function processes not only visual-based tokens but also action tokens. This enables modeling of the relationship between the two modalities, allowing it to be used as both a world model (as intended in the paper) and as a policy model capable of predicting actions based on preceding tokens.
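As a rough sketch of what this factorization means in practice, the loop below samples tokens one at a time from a toy decoder; a GRU stands in for LLaMA’s causal self-attention, and the vocabulary size is an assumption, so this is illustrative rather than the paper’s implementation:

```python
import torch

vocab_size = 10002                        # assumed: visual codebook + action/special tokens

class ToyDecoder(torch.nn.Module):
    """Stand-in for the LLaMA decoder: a GRU plays the role of causal self-attention."""
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        self.rnn = torch.nn.GRU(dim, dim, batch_first=True)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, tokens):            # tokens: (batch, length)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)               # next-token logits at every position

model = ToyDecoder(vocab_size)
tokens = torch.zeros(1, 1, dtype=torch.long)          # start from some initial token
for _ in range(8):                                    # autoregressive generation loop
    logits = model(tokens)[:, -1]                     # p(t_i | t_<i)
    next_token = torch.distributions.Categorical(logits=logits).sample()
    tokens = torch.cat([tokens, next_token[:, None]], dim=1)
```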
Parallel Decoding
The authors set a clear requirement for a game to be considered “playable” under normal settings: it must generate enough frames per second for the player to comfortably perform an average number of actions per minute (APM). Based on their analysis, an average player performs about 150 APM, i.e., roughly 2.5 actions per second, so the environment needs to run at 2-3 frames per second at a minimum.
To meet this requirement, the authors had to move away from the typical raster-scan generation order (left to right, top to bottom, one token at a time) and instead adopt diagonal decoding.
Diagonal decoding generates several image patches in parallel during a single step. For example, if patch x(i, j) is processed at step t, then patches x(i+1, j) and x(i, j+1) are both processed at step t+1. This method leverages the spatial and temporal dependencies between neighboring patches and consecutive frames, enabling faster generation. The effect is illustrated in more detail in Figure 2.
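A small sketch of the scheduling idea (decoding order only; the batched token prediction and the modified attention mask discussed next are omitted):

```python
# At step s, every patch (i, j) on the anti-diagonal i + j == s is generated in parallel.
def diagonal_schedule(rows, cols):
    steps = []
    for s in range(rows + cols - 1):
        steps.append([(i, s - i) for i in range(rows) if 0 <= s - i < cols])
    return steps

# For a 14 x 24 patch grid, raster-scan decoding needs 336 sequential steps,
# while diagonal decoding needs only 14 + 24 - 1 = 37.
schedule = diagonal_schedule(14, 24)
print(len(schedule), sum(len(step) for step in schedule))   # 37 336
```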
However, switching from sequential to parallel generation introduces some performance degradation. This is due to a mismatch between the training and inference processes (as parallel generation is necessary during inference) and to the sequential nature of LLaMA’s causal attention mask. The authors mitigate this issue by fine-tuning using a modified attention mask that is more suitable for their parallel decoding strategy.
Key Findings & Analysis
For evaluation, MineWorld uses the VPT dataset [5], which consists of recorded gameplay clips paired with their corresponding actions. VPT contains 10M video clips, each comprising 16 frames. As previously mentioned, each frame (224×384 pixels) is split into 336 patches, each represented by a separate token. Together with the 11 action tokens, this yields up to 347 tokens per frame, summing to roughly 55B tokens (10M × 16 × 347) for the entire dataset.
Quantitative Results
MineWorld’s results are primarily compared to Oasis using two categories of metrics: visual quality and controllability.
To accurately measure controllability, the authors introduce a novel approach: they train an Inverse Dynamics Model (IDM) [5] to predict the action taken between two consecutive frames. Beyond reaching 90.6% accuracy, the IDM was further validated by showing 20 game clips, annotated with the IDM’s predicted actions, to 5 experienced players. Each prediction was scored from 1 to 5, and computing the Pearson correlation coefficient gave a value of 0.56, indicating a clear positive correlation.
With the Inverse Dynamics Model providing reliable predictions, it can be used to compute metrics such as accuracy, F1 score, or L1 loss by treating the input action as the ground truth and the IDM’s prediction as the action produced by the world model. Because the types of actions vary, this evaluation is further divided into two categories (a minimal sketch of the computation follows the list below):
- Discrete Action Classification: Precision, Recall, and F1 scores for the 7 action classes described in Figure 1.
- Camera Movement: By dividing rotation around the X and Y axes into 11 discrete bins, an L1 score can be calculated using the IDM predictions.
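Here is a minimal sketch of that computation; the arrays are made-up examples, and in practice the IDM’s predictions would come from the generated video:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up example data: the actions fed to the world model are the ground
# truth, and the actions the IDM recovers from the generated frames are the
# predictions being scored.
input_actions = np.array([0, 3, 3, 1, 6, 2])   # ground-truth discrete classes (0-6)
idm_actions   = np.array([0, 3, 2, 1, 6, 2])   # classes inferred by the IDM

precision, recall, f1, _ = precision_recall_fscore_support(
    input_actions, idm_actions, average="macro", zero_division=0
)

# Camera movement: rotations binned into 11 discrete values per axis, scored with L1.
input_cam = np.array([[5, 6], [4, 5], [5, 5]])  # ground-truth (x-bin, y-bin) per transition
idm_cam   = np.array([[5, 7], [4, 5], [6, 5]])  # IDM-predicted bins
camera_l1 = np.abs(input_cam - idm_cam).mean()
```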
Examining the results in Table 2, we observe that MineWorld, despite having only 300M parameters, outperforms Oasis on all reported metrics, for both controllability and visual quality. The most interesting metric is frames per second: MineWorld delivers more than twice as many frames, enabling a smoother interactive experience that can handle 354 APM, far exceeding the 150 APM requirement.
While scaling MineWorld to 700M or 1.2B parameters improves image quality, it comes at the cost of speed, with FPS dropping to 3.01. This slowdown can negatively impact the user experience, though it still supports a playable 180 APM.
Qualitative Results
Further qualitative analysis was conducted to evaluate MineWorld’s ability to generate fine details, follow action instructions, and understand and regenerate contextual information. The initial game state was provided, along with a predefined list of actions for the model to execute.
Looking at Figure 3, we can draw three conclusions:
- Top Panel: Given an image of a player in the house and instructions to move towards the door and open it, the model successfully generated the desired sequence of actions.
- Middle Panel: In a wood-chopping scenario, the model demonstrated the ability to generate fine-grained visual details, correctly rendering the wood destruction animation.
- Bottom Panel: A case of high fidelity and context awareness. When the camera pans away and then back, the house leaves the frame and reappears with the same details intact.
These three cases show the power of MineWorld not only in generating high-quality gameplay content but also in following the desired actions and consistently regenerating contextual information, something Oasis struggles with.
In a second set of results, the authors focused on evaluating the controllability of the model by providing the exact same input scene alongside three different sets of actions. The model successfully generated three distinct output sequences, each one leading to a completely different final state.
Conclusion
In this blog post, we explored MineWorld, the first open-source world model for Minecraft. We have discussed their approach to tokenizing each frame/state into several tokens and combining them with 11 additional tokens representing both discrete actions and camera movement. We have also highlighted their innovative use of an Inverse Dynamics Model to compute controllability metrics, alongside their novel parallel decoding algorithm that triples inference speed, reaching an average of 3 frames per second.
In the future, it would be valuable to extend testing beyond the 16-frame window. Longer horizons would more rigorously test MineWorld’s ability to regenerate specific objects consistently, a challenge that, in my opinion, will remain a major obstacle to the wide adoption of such models.
Thanks for reading!
Interested in trying a Minecraft world model in your browser? Try Oasis[2] here.
References
[1] J. Guo, Y. Ye, T. He, H. Wu, Y. Jiang, T. Pearce and J. Bian, MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft (2025), arXiv preprint arXiv:2504.08388v1
[2] R. Wachen and D. Leitersdorf, Oasis (2024), https://oasis-ai.org/
[3] D. Ha and J. Schmidhuber, World Models (2018), arXiv preprint arXiv:1803.10122
[4] J. Guo, Y. Ye, T. He, H. Wu, Y. Jiang, T. Pearce and J. Bian, MineWorld (2025), GitHub repository: https://github.com/microsoft/mineworld
[5] B. Baker, I. Akkaya, P. Zhokhov, J. Huizinga, J. Tang, A. Ecoffet, B. Houghton, R. Sampedro and J. Clune, Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos (2022), arXiv preprint arXiv:2206.11795
[6] A. van den Oord, O. Vinyals and K. Kavukcuoglu, Neural Discrete Representation Learning (2017), arXiv preprint arXiv:1711.00937
[7] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Joulin, E. Grave and G. Lample, LLaMA: Open and Efficient Foundation Language Models (2023), arXiv preprint arXiv:2302.13971
[8] Y. Ye, J. Guo, H. Wu, T. He, T. Pearce, T. Rashid, K. Hofmann and J. Bian, Fast Autoregressive Video Generation with Diagonal Decoding (2025), arXiv preprint arXiv:2503.14070