Wikipedia opens structured access to its data to train AI models

A Wikipedia dataset for AI training is now available on Kaggle.

In response to intensive scraping, Wikimedia has published an optimized Wikipedia dataset on Kaggle, intended for artificial-intelligence researchers and developers.

Wikimedia Enterprise recently published a large, structured Wikipedia dataset on Kaggle, the Google-owned data science platform. The goal is to provide researchers, developers, and AI professionals with clean, up-to-date, and easily exploitable access to encyclopedic content. This initiative also addresses the growing pressure from intensive scraping of “free encyclopedia” content.

Wikimedia Enterprise aims to make Wikipedia data easier for AI

Kaggle is a well-known online platform for data scientists, offering machine learning competitions, open datasets, and a collaborative environment for developing AI models. By making a large dataset available, Wikimedia Enterprise aims to encourage responsible and accessible use of Wikipedia content and reduce the significant burden on its own infrastructure.

This bundle simplifies access to clean, pre-parsed article data that can be used immediately for modeling, benchmarking, fine-tuning, and exploratory analysis, Wikimedia Enterprise explains.

The announcement comes as massive automated scraping of Wikipedia generates considerable, and sometimes problematic, traffic. Much of this data collection is carried out by actors training language models at scale, without necessarily respecting good technical or ethical practices. “We discovered that at least 65% of this resource-intensive traffic on our site comes from bots,” Wikimedia explained in early April 2025; since January 2024, it has also noted a 50% increase in the bandwidth used to download content from its servers.

A dataset designed for training and analyzing AI models

The dataset provided by Wikimedia contains a compressed and structured version of Wikipedia content, updated monthly. It focuses on the English and French versions of the encyclopedia, with enriched metadata (page IDs, version timestamps, section structures, internal links, etc.), in JSON format optimized for automated analysis.

Instead of extracting or parsing the raw text of articles, Kaggle users can work directly with well-structured JSON representations of Wikipedia content, which is ideal for training models, Wikimedia Enterprise said.
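To illustrate, here is a minimal sketch of how such a structured record might be consumed. The field names below (`identifier`, `abstract`, `sections`, and so on) are hypothetical, loosely modeled on the metadata the article mentions; the actual Wikimedia Enterprise schema may differ.

```python
import json

# A hypothetical article record, modeled on the fields the article lists
# (page ID, version timestamp, segmented sections). Not the real schema.
record_line = json.dumps({
    "name": "Kaggle",
    "identifier": 12345,
    "version": {"date_modified": "2025-04-01T00:00:00Z"},
    "abstract": "Kaggle is an online platform for data scientists.",
    "sections": [
        {"name": "History", "text": "Kaggle was founded in 2010."},
        {"name": "Competitions", "text": "Kaggle hosts ML competitions."},
    ],
})

def extract_text(line: str) -> str:
    """Concatenate the abstract and section texts of one article record."""
    article = json.loads(line)
    parts = [article.get("abstract", "")]
    parts += [s.get("text", "") for s in article.get("sections", [])]
    return "\n".join(p for p in parts if p)

print(extract_text(record_line))
```

The point of such a format is that no wikitext or HTML parsing is needed: each line is a self-contained JSON object, so building a training corpus reduces to streaming records and selecting fields.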

The dataset also contains “summaries, descriptions, infobox data, image links, and clearly segmented article sections,” excluding non-text elements. Furthermore, the content is available under free licenses (Creative Commons and GFDL). Finally, the project goes beyond simple distribution: it is accompanied by detailed documentation, an associated GitHub repository, and a community forum on Kaggle for discussing possible uses.
