AI‑Culture‑Commons

AI‑Culture‑Commons curates multilingual cultural corpora for language‑model research.

We are a non-profit digital humanities project that advances humane AI development through high-quality, rich cultural content. We aim to contribute to the cultural evolution of artificial intelligence by providing sophisticated training data that explores the intersection of technology, artificial intelligence, and human culture.

Our repositories give models deep philosophical and intellectual context and diverse connections between culture, philosophy, literature, and technology, particularly AI. The content is designed specifically to help train more culturally aware and philosophically grounded AI models.

Our Datasets

Dataset	Size	Languages	Formats	License	Citation & Research
Multilingual Culture Corpus	16M words	12 ALIGNED languages	HTML · CSV · DOLMA JSONL	CC‑BY‑4.0
Project Websites Raw	160 MB	12 ALIGNED languages	ZIP (HTML + images + CSS)	CC‑BY‑4.0
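
As a rough illustration of working with the DOLMA JSONL format listed above, the sketch below reads one JSON record per line and counts words. The field names (`id`, `text`, `source`) follow the usual Dolma document layout but are an assumption here, and the sample records are hypothetical; check the dataset card for the actual schema and file names.

```python
import io
import json
from typing import Iterable, Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines between records
            yield json.loads(line)


def count_words(records: Iterable[dict]) -> int:
    """Whitespace word count over the assumed `text` field of each record."""
    return sum(len(rec.get("text", "").split()) for rec in records)


# Hypothetical sample records, standing in for a real corpus file.
SAMPLE = (
    '{"id": "doc-1", "text": "Culture shapes technology.", "source": "example"}\n'
    "\n"
    '{"id": "doc-2", "text": "Technology shapes culture in turn.", "source": "example"}\n'
)

records = list(iter_jsonl(io.StringIO(SAMPLE)))
print(len(records), count_words(records))
```

For a file on disk, the same function can be given `open(path, encoding="utf-8")` in place of the in-memory stream. Whitespace splitting undercounts words in languages written without spaces, such as Japanese and Mandarin, so treat the count as approximate.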

Key Features

Perfect Alignment: All 12 languages contain identical content with the exact same complex HTML structure. All datasets include both pure text and HTML source files.
AI-Optimized: Designed specifically for training multilingual AI systems
Truly Open: CC-BY-4.0 license - use freely, even commercially, with attribution
Content Quality: Sophisticated content with intellectual depth, authored by a group of academics and writers
Completely Clean Data: No user comments, scraped texts, or unwanted content - pure, high-quality, carefully edited content
Full Documentation: Complete pipeline description and documentation in dataset cards. All datasets are versioned and archived for research reproducibility
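
One way a reader might spot-check the alignment claim is to compare the tag skeleton of the same page in two languages. The sketch below uses only Python's standard `html.parser`; the helper names (`TagSequence`, `tag_skeleton`, `same_structure`) and the sample snippets are hypothetical, and this is not the project's own validation pipeline.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Record the order of opening and closing tags, ignoring text and attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(("open", tag))

    def handle_endtag(self, tag):
        self.tags.append(("close", tag))


def tag_skeleton(html: str) -> list[tuple[str, str]]:
    """Return the sequence of tag events for an HTML string."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def same_structure(html_a: str, html_b: str) -> bool:
    """True when both documents have the same tag sequence."""
    return tag_skeleton(html_a) == tag_skeleton(html_b)


# Hypothetical snippets: two aligned versions and one that lost an <em> element.
EN = "<article><h1>Culture</h1><p>On <em>machines</em>.</p></article>"
FR = "<article><h1>Culture</h1><p>Sur les <em>machines</em>.</p></article>"
BROKEN = "<article><h1>Culture</h1><p>Sur les machines.</p></article>"

print(same_structure(EN, FR))      # aligned pair
print(same_structure(EN, BROKEN))  # structure differs
```

This comparison ignores attributes and text, so it checks structure only; it says nothing about whether the translated content matches.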

Languages

English, French, German, Spanish, Portuguese, Italian, Japanese, Russian, Korean, Mandarin, Hindi, Hebrew

Source Websites & Licensing

Our corpora are carefully extracted from our websites:

Original Project: https://hitdarderut-haaretz.org - Cultural analysis
- License Terms: CC-BY-4.0
Multicultural Project: https://degeneration-of-nation.org - Critical philosophical commentary
- License Terms: CC-BY-4.0

As a non-profit organization, we're committed to advancing humane AI through high-quality, clean cultural datasets with perfect multilingual alignment.