AI‑Culture‑Commons

AI‑Culture‑Commons curates multilingual cultural corpora for language‑model research.

We are a non-profit digital humanities project that advances humane AI development through high-quality cultural content. We aim to contribute to the cultural evolution of artificial intelligence by providing training data that explores the intersection of technology, AI, and human culture.

Our repositories give models deep philosophical and intellectual context and draw diverse connections between culture, philosophy, literature, and technology, particularly AI. The content is designed to help train more culturally aware and philosophically grounded AI models.

Our Datasets

| Dataset | Size | Languages | Formats | License | Citation & Research |
|---|---|---|---|---|---|
| Multilingual Culture Corpus | 16M words | 12 ALIGNED languages | HTML · CSV · DOLMA JSONL | CC‑BY‑4.0 | DOI |
| Project Websites Raw | 160 MB | 12 ALIGNED languages | ZIP (HTML + images + CSS) | CC‑BY‑4.0 | DOI |
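
As a rough illustration of working with the DOLMA JSONL format listed above, the sketch below parses JSONL lines and counts words per language. It assumes the common Dolma layout of one JSON object per line with `id`, `text`, and `source` fields. The `metadata["language"]` key, the helper names, and the sample records are hypothetical and not taken from the published files, so check the actual schema before relying on them.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_dolma_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def words_per_language(records: Iterable[dict]) -> Counter:
    """Count whitespace-separated tokens in `text`, grouped by language tag."""
    counts: Counter = Counter()
    for record in records:
        # Assumption: the language code lives under metadata["language"].
        language = record.get("metadata", {}).get("language", "unknown")
        counts[language] += len(record.get("text", "").split())
    return counts


# Hypothetical sample lines standing in for a real corpus file.
sample = [
    '{"id": "doc-1", "text": "Culture shapes machines.", "source": "example", "metadata": {"language": "en"}}',
    '{"id": "doc-2", "text": "La culture façonne les machines.", "source": "example", "metadata": {"language": "fr"}}',
]
counts = words_per_language(iter_dolma_records(sample))
```

To read a real file, pass an open file handle instead of `sample`, since file objects iterate line by line. Whitespace splitting is only a rough word count and does not segment Japanese or Mandarin text, which would need a language-specific tokenizer.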

Key Features

Languages

English, French, German, Spanish, Portuguese, Italian, Japanese, Russian, Korean, Mandarin, Hindi, Hebrew

Source Websites & Licensing

Our corpora are carefully extracted from our websites:


As a non-profit organization, we are committed to advancing humane AI through high-quality, clean cultural datasets aligned across all 12 languages.