Overview
- The Wikimedia Foundation has partnered with Kaggle to release a beta dataset of structured Wikipedia content in English and French, designed for AI workflows.
- The dataset includes abstracts, short descriptions, infobox data, image links, and segmented article sections, but excludes references and multimedia files.
- Released under Creative Commons and similar free licenses, the dataset aims to support smaller companies and independent researchers by giving them accessible data they can use lawfully.
- The initiative seeks to deter unauthorized scraping by AI bots, which has significantly increased Wikimedia's server costs and disrupted access for human users.
- Wikimedia will monitor the dataset's impact on reducing server strain while continuing to explore broader solutions to protect its open knowledge mission.