Text-Only Version of the MineDojo Wiki Dataset

MineDojo has compiled an incredibly large dataset of Minecraft wiki pages. The team behind the dataset put a lot of work into recording the full state of each wiki page, including its images and full-page screenshots. However, for LLM fine-tuning purposes, keeping all the images makes the dataset far bigger than it needs to be, with most of the space taken up by files that won't be used.

I wrote a script which recursively deleted every single image in the wiki folders, leaving only the JSON files with the text components inside.
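For anyone who wants to do the same on their own copy of the dataset, here is a minimal sketch of that kind of cleanup. It is not the exact script I ran: the `delete_images` name, the `dry_run` flag, and the list of image extensions are all assumptions made for illustration, so adjust them to whatever your copy of the wiki folders actually contains.

```python
from pathlib import Path

# Assumed set of image extensions; extend or trim it to match the dataset.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg", ".bmp"}


def delete_images(root, dry_run=True):
    """Recursively find image files under `root` and return their paths.

    With dry_run=True (the default) nothing is deleted, so the list can be
    reviewed first. With dry_run=False each matching file is removed.
    """
    # Collect the matches first so the tree is not modified while it is walked.
    targets = [
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    ]
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets
```

Calling `delete_images("path/to/wiki")` only lists what would be removed, and passing `dry_run=False` actually deletes the files. Anything whose extension is not in the set, including the JSON files, is left alone.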

I've uploaded this text-only dataset to my GitHub in case it's of use to anyone else. I'm planning on using it to generate a Minecraft question-answer pair dataset for fine-tuning.

Link: https://github.com/Nicolas-Gatien/minedojo-text-only