Only Text of MineDojo Wiki Dataset

• minecraft-utility-agent

MineDojo has compiled an incredibly large dataset of Minecraft wiki pages. The team behind the dataset put a lot of work into recording the full state of each wiki page, including its images and full-page screenshots. However, for LLM fine-tuning purposes, keeping all the images results in a very large dataset, with most of the space taken up by files which won't be used.

minedojo-dataset-original-size.png

I wrote a script which recursively deleted every single image in the wiki folders, leaving only the JSON files with the text components inside.
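The original script isn't reproduced here, but a minimal sketch of that kind of cleanup might look like the following. The function name `delete_images`, the list of image extensions, and the `dry_run` flag are all assumptions for illustration, not details of the actual script or of the MineDojo folder layout.

```python
from pathlib import Path

# Assumed set of image extensions; the formats actually present in the
# dataset may differ, so check before deleting anything.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg", ".bmp"}


def delete_images(root, dry_run=True):
    """Find image files anywhere under `root` and optionally delete them.

    Returns the list of matching paths. With dry_run=True nothing is removed,
    which makes it possible to review the list first.
    """
    matched = []
    # sorted() materializes the full listing before any file is deleted.
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            matched.append(path)
            if not dry_run:
                path.unlink()
    return matched
```

Calling `delete_images("path/to/wiki", dry_run=True)` first (the path is a placeholder) only lists what would be removed; passing `dry_run=False` performs the deletion. JSON files are left alone because their extension is not in the set.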

minedojo-textonly-dataset-size.png

I've uploaded this text-only dataset to my GitHub in case it's of use to anyone else. I'm planning on using it to generate a Minecraft question-answer pair dataset for fine-tuning.

Link: https://github.com/Nicolas-Gatien/minedojo-text-only