Community AI, built open
An end-to-end platform for ethically sourcing conversational data from under-represented languages and fine-tuning multilingual models — shipped as an open-source Digital Public Good anyone can deploy anywhere.
Where we are
The data collection platform is live and actively receiving contributions. Here's what's running today versus what's ahead.
Data collection platform
- ✓ Submission form live, currently open for Javanese and Sundanese content
- ✓ Automated processing pipeline operational (PII removal, deduplication, quality scoring)
- ✓ Public stats dashboard tracking submissions and word counts in real time
- ✓ Community contributor recruitment and onboarding underway
What's next
- ● Expand collection to additional Indonesian languages
- ● Open-weight multilingual model fine-tuning, beginning Q2 2026
- ● Community chat interface for local-language AI access
- ● Open-source release of the full pipeline toolkit
How the platform works
From community data collection to model serving — every stage is modular, designed for scale and replication on any infrastructure.
Community data collection
Contributors submit conversations, stories, and text via web app, WhatsApp, and in-person community hubs. The platform is designed to cover Indonesia's 700+ languages; submissions are currently open for Javanese and Sundanese. Each submission is tagged by language, dialect, tone, and topic.
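To make the tagging concrete, here is a minimal sketch of what a tagged submission record could look like. The `Submission` class, its field names, the channel labels, and the `validate` helper are all hypothetical illustrations; the platform's actual schema is not described on this page.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical channel labels for the three collection routes named above.
ALLOWED_CHANNELS = {"web", "whatsapp", "hub"}


@dataclass(frozen=True)
class Submission:
    """One contributed text plus the four tags described above (illustrative)."""
    text: str
    language: str  # e.g. "jv" for Javanese, "su" for Sundanese
    dialect: str
    tone: str
    topic: str
    channel: str


def validate(sub: Submission) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if not sub.text.strip():
        errors.append("text is empty")
    if sub.channel not in ALLOWED_CHANNELS:
        errors.append(f"unknown channel: {sub.channel}")
    for tag in ("language", "dialect", "tone", "topic"):
        if not getattr(sub, tag).strip():
            errors.append(f"missing tag: {tag}")
    return errors
```

A record with all four tags and a known channel passes with no errors, while one with a blank tag is reported by name, so incomplete submissions can be sent back to the contributor rather than silently dropped.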
Data processing pipeline
PII Removal
Names, phone numbers, and identifiers are automatically detected and scrubbed before data reaches storage.
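As a rough illustration of the scrubbing step, the sketch below masks phone numbers and email addresses with regular expressions. It is an assumption-laden example, not the platform's implementation: the patterns target Indonesian-style mobile numbers only, the `[PHONE]` and `[EMAIL]` placeholders are invented here, and detecting personal names would need a named-entity model that this sketch does not attempt.

```python
import re

# Indonesian-style mobile numbers: +62 / 62 / 0 prefix, then 8xx and two
# further digit groups, optionally separated by spaces or hyphens.
PHONE_RE = re.compile(
    r"(?<!\w)(?:\+62|62|0)[\s-]?8\d{1,2}[\s-]?\d{3,4}[\s-]?\d{3,5}(?!\d)"
)
# A deliberately simple email pattern; it will miss unusual addresses.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub_pii(text: str) -> str:
    """Replace phone numbers and email addresses with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(scrub_pii("Hubungi 0812-3456-7890 atau budi@example.com"))
```

Running the scrub before anything is written to storage matches the ordering described above: the raw text never has to be persisted.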
Deduplication
Exact and fuzzy matching across submissions removes repeated and near-identical entries, protecting dataset quality and preventing redundancy.
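One way to combine exact and fuzzy matching is sketched below: a hash of the normalized text catches exact repeats, and Jaccard similarity over character trigrams catches near-duplicates. The technique, the `0.8` threshold, and every function name are assumptions chosen for illustration; the page does not say which matching method the pipeline uses.

```python
from __future__ import annotations

import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, n: int = 3) -> set[str]:
    """Set of character n-grams of the normalized text."""
    t = normalize(text)
    return {t[i:i + n] for i in range(max(len(t) - n + 1, 1))}


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each text; drop exact and near duplicates."""
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    seen_hashes: set[str] = set()
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(text)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near duplicate of something already kept
        seen_hashes.add(digest)
        kept.append(text)
        kept_shingles.append(sh)
    return kept
```

The pairwise comparison is quadratic in the number of kept texts, which is fine for a sketch but would likely need an approximate index such as MinHash with locality-sensitive hashing at larger scale.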
Quality Scoring
Each submission is scored on length, coherence, cultural relevance, and topical diversity.
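The four criteria above could be combined into a single score as a weighted average, sketched below. The weights, the 50-word length target, and the function names are hypothetical. Only the length component is computed here; coherence, cultural relevance, and topical diversity are passed in as values between 0 and 1 because the page does not say how they are measured.

```python
# Hypothetical weights for the four criteria; they sum to 1.0.
WEIGHTS = {
    "length": 0.2,
    "coherence": 0.3,
    "cultural_relevance": 0.3,
    "topical_diversity": 0.2,
}


def length_score(text: str, target_words: int = 50) -> float:
    """Grow linearly with word count and saturate at 1.0 once the target is met."""
    return min(len(text.split()) / target_words, 1.0)


def quality_score(
    text: str,
    coherence: float,
    cultural_relevance: float,
    topical_diversity: float,
) -> float:
    """Weighted average of the four sub-scores, each expected in [0, 1]."""
    parts = {
        "length": length_score(text),
        "coherence": coherence,
        "cultural_relevance": cultural_relevance,
        "topical_diversity": topical_diversity,
    }
    for name, value in parts.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)
```

Keeping the sub-scores separate until the final sum makes it easy to assign submissions to quality tiers later, or to re-weight the criteria without re-scoring everything.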
Model training & fine-tuning
Dataset Packaging
Data mixture recipes balance the training set across languages, topics, and quality tiers so that no single language or topic dominates training.
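A common way to balance a multilingual mixture is exponent-scaled sampling, where each language's sampling probability is proportional to its example count raised to a power `alpha` below 1. The sketch below shows that general technique under stated assumptions; the function name, the default `alpha`, and the example counts are illustrative and are not taken from the project's recipes.

```python
from __future__ import annotations


def mixture_weights(counts: dict[str, int], alpha: float = 0.5) -> dict[str, float]:
    """Sampling probability per language, proportional to count ** alpha.

    alpha = 1.0 reproduces the raw proportions; smaller values shift
    probability toward languages with fewer examples.
    """
    scaled = {lang: n ** alpha for lang, n in counts.items() if n > 0}
    total = sum(scaled.values())
    if total == 0:
        raise ValueError("at least one language needs a positive count")
    return {lang: value / total for lang, value in scaled.items()}


# Illustrative counts only, not real platform statistics.
weights = mixture_weights({"jv": 900, "su": 100}, alpha=0.5)
print(weights)
```

With these made-up counts, the smaller language's share rises from 10% of the raw data to 25% of the sampled mixture, which is the kind of rebalancing a mixture recipe is meant to encode. The same function could be applied per topic or per quality tier.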
Multilingual Fine-Tuning
Fine-tuning recipes for open-weight multilingual base models — producing both cloud-scale and small on-device variants.
Community chat interface
AI that speaks local languages, deployed via low-bandwidth endpoints accessible on basic smartphones and 2G connections.
Open-source toolkit
The entire stack is being open-sourced so that any organization can fork it, deploy it, and serve new language communities worldwide.
What we're building
Four core components — each open-sourced per our agreement with UNICEF Ventures — that together form a replicable Community AI toolkit.
Data Ingestion Pipeline
Multi-channel collection from web submissions, WhatsApp messages, and bulk uploads from field community hubs. Automated PII detection and removal ensures personal information never reaches storage.
Training Infrastructure
Data mixture recipes for balancing across languages and topics. Fine-tuning pipelines for open-weight multilingual models — targeting both cloud-serving and small on-device variants for low-connectivity rural deployment.
Inference & Chat
Serving endpoints and a community-facing chat interface in local languages. Designed for low-bandwidth access on basic smartphones — works on 2G networks with minimal data consumption.
Open-Source Toolkit
The complete pipeline — collection, processing, training notebooks, data recipes, and inference config — released as open-source. Any organization can fork and deploy for a new language group or region.
A Digital Public Good
As part of our UNICEF Ventures investment, the entire software stack is open-sourced as a Digital Public Good — designed so any organization can replicate the Community AI model for new languages, regions, and infrastructure.
Data Submission Pipeline
Collection endpoints, validation logic, PII detection and scrubbing, deduplication, and quality scoring — from ingestion to cleaned dataset.
Training Notebooks & Recipes
Jupyter notebooks for data mixture configuration, fine-tuning scripts, and hyperparameter documentation for reproducing model training runs.
Inference Configuration
Model serving setup, quantization configs for on-device deployment, and API endpoint specifications for both cloud and edge inference.
Web Application
The full contributor-facing web app — submission forms, dashboards, community leaderboards, and admin tools — ready to deploy on any container platform.
Funded by UNICEF Ventures
Bahasa Ibu is a portfolio investment of the UNICEF Venture Fund, which invests in early-stage, open-source technology solutions that benefit children and communities worldwide. All platform deliverables are open-sourced as a Digital Public Good.
UNICEF Venture Fund Portfolio Company
Selected from hundreds of applicants for building open-source AI infrastructure that serves under-represented communities. Investment is scoped specifically to the Bahasa Ibu initiative and its Community AI platform, with all pipeline deliverables open-sourced as a Digital Public Good.
View UNICEF Portfolio
Let's build Community AI together.
Whether you're a contributor, a partner, or a funder — there's a place for you in this project.