Wikipedia has announced a partnership with Kaggle, a Google-owned data science community platform, to create a machine-readable dataset of its content specifically designed for training artificial intelligence models. This initiative comes in response to a significant increase in non-human traffic due to bots scraping the site for AI training, with bandwidth consumption rising by 50% since January 2024. The new dataset will initially focus on English and French, providing stripped-down versions of Wikipedia articles that exclude references and markdown code. As the Wikimedia Foundation seeks to manage costs associated with this surge in traffic, it emphasizes the importance of protecting contributors’ rights by adhering to Creative Commons licensing terms. The dataset is expected to enhance accessibility for AI developers while addressing the ongoing challenges of content scraping from the platform.
A concerning trend has emerged in which users employ OpenAI’s latest models, o3 and o4-mini, to run reverse location searches on photographs. The models’ advanced image-analysis capabilities let them identify cities, landmarks, and even specific venues from visual clues. The trend has gained traction on social media, where users share examples of the models successfully identifying locations from various types of images. In one example, a user showed the model accurately identifying a location from a seemingly random photo taken in a library. Experts warn that this capability poses significant privacy risks, since malicious actors could use it to uncover personal information. OpenAI has yet to address these potential dangers in its safety reports for the new models.
Why do we care?
Wikipedia’s move is a calculated pivot: if AI models are going to ingest your data anyway, better to shape how it happens. It tackles two core issues: runaway scraping costs and contributor rights under Creative Commons licensing. Expect more structured open data offerings, and note that you can publish one for your own site too. In the webinar I hosted today, Srinivas Krishnaswamy offered ready-to-use schema templates for your website. Those helping clients with AI integration will need to track these new official pipelines, which will often be more cost-effective and compliant than unstructured scraping.
I included the OpenAI insights to provide perspective on unanticipated safety risks.

