Community AI, Built Open
An end-to-end platform for ethically sourcing conversational data from under-represented languages and fine-tuning multilingual models. Runs natively on Google Cloud — and ships as an open-source Digital Public Good anyone can deploy anywhere.
How the Platform Works
From community data collection to model serving — every stage runs on Google Cloud infrastructure, designed for scale and replication.
Community Data Collection
Contributors submit conversations, stories, and text in any of Indonesia's 700+ languages via a web app, WhatsApp, and in-person community hubs. Each submission is tagged by language, dialect, tone, and topic.
Cloud Run
Automated Data Processing Pipeline
Cloud Workflows + Functions
PII Removal
Names, phone numbers, and identifiers are automatically detected and scrubbed before data reaches storage.
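As a rough illustration of the scrubbing step, the sketch below masks phone numbers and e-mail addresses with regular expressions. The patterns, the placeholder tokens, and the `scrub_pii` name are assumptions made for this example and are not taken from the platform's code. Detecting personal names would typically need a named-entity model, which is not shown here.

```python
import re

# Hypothetical patterns for illustration only: Indonesian-style mobile numbers
# and e-mail addresses. Names and other identifiers are not handled here.
PHONE_RE = re.compile(r"(?<!\w)(?:\+62|62|0)8\d{1,2}[\s-]?\d{3,4}[\s-]?\d{3,5}(?!\w)")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub_pii(text: str) -> str:
    """Replace matched e-mail addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(scrub_pii("Hubungi saya di 0812-3456-7890 atau budi@example.com"))
```

Running the scrub before anything is written to storage is what keeps raw identifiers out of the dataset. A regex pass like this one is only a first filter and would miss many identifier formats.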
Deduplication
Exact and fuzzy matching across submissions removes duplicates and near-duplicates, keeping the dataset free of redundancy.
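One common way to combine exact and fuzzy matching is to hash a normalized copy of each submission and then compare character n-gram sets with Jaccard similarity. The sketch below follows that approach. The function names, the 3-character shingle size, and the 0.8 threshold are assumptions chosen for illustration, not the platform's actual settings.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, n: int = 3) -> set:
    """Set of overlapping character n-grams (the whole text if shorter than n)."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(submissions, threshold: float = 0.8):
    """Keep the first copy of each submission, dropping exact and near duplicates."""
    seen_hashes = set()
    kept = []  # (original text, shingle set) pairs for submissions kept so far
    for text in submissions:
        norm = normalize(text)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(norm)
        if any(jaccard(sh, other) >= threshold for _, other in kept):
            continue  # near duplicate of something already kept
        seen_hashes.add(digest)
        kept.append((text, sh))
    return [text for text, _ in kept]
```

The pairwise comparison is quadratic in the number of kept submissions, so a larger corpus would likely need an approximate method such as MinHash with locality-sensitive hashing.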
Quality Scoring
Each submission is scored on length, coherence, cultural relevance, and topical diversity.
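A minimal sketch of how the four criteria could be combined into one score is shown below. It assumes each criterion is first reduced to a value between 0 and 1 and then weighted. The weights, the 50-word length target, and all names are hypothetical. Coherence, cultural relevance, and topical diversity are taken as inputs because they would normally come from a classifier or a human reviewer.

```python
from dataclasses import dataclass

# Hypothetical weights for illustration; they sum to 1.0.
WEIGHTS = {
    "length": 0.2,
    "coherence": 0.3,
    "cultural_relevance": 0.3,
    "topical_diversity": 0.2,
}


@dataclass
class SubScores:
    """Scores in [0, 1], assumed to come from upstream models or reviewers."""
    coherence: float
    cultural_relevance: float
    topical_diversity: float


def length_score(text: str, target_words: int = 50) -> float:
    """Grow linearly with word count and saturate at the target length."""
    return min(len(text.split()) / target_words, 1.0)


def quality_score(text: str, sub: SubScores) -> float:
    """Weighted sum of the four criteria, giving a value in [0, 1]."""
    parts = {
        "length": length_score(text),
        "coherence": sub.coherence,
        "cultural_relevance": sub.cultural_relevance,
        "topical_diversity": sub.topical_diversity,
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)
```

A single scalar makes it straightforward to sort submissions into quality tiers, though the thresholds for those tiers would still need to be chosen separately.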
Model Training & Fine-Tuning
Vertex AI
Dataset Packaging
Data mixture recipes balance the training set across languages, topics, and quality tiers.
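One widely used way to balance a multilingual mixture is exponent-scaled sampling, where each language's share of the data is raised to a power below 1 so that smaller languages are sampled more often than their raw share. The sketch below shows that technique as an assumption about what a recipe might contain; it is not a description of the platform's recipes. The `mixture_weights` name, the `alpha` value, and the example counts are hypothetical.

```python
def mixture_weights(counts: dict, alpha: float = 0.5) -> dict:
    """Turn per-language example counts into sampling probabilities.

    alpha = 1.0 keeps the raw proportions; alpha = 0.0 samples every
    language equally; values in between up-weight smaller languages.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must contain at least one example")
    scaled = {lang: (n / total) ** alpha for lang, n in counts.items()}
    norm = sum(scaled.values())
    return {lang: value / norm for lang, value in scaled.items()}


if __name__ == "__main__":
    # Hypothetical counts keyed by ISO 639-3 code (Javanese, Sundanese).
    print(mixture_weights({"jav": 9000, "sun": 1000}))
```

The same weighting could be applied per topic or per quality tier by swapping the dictionary keys.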
Fine-Tuning on Gemma 3
Fine-tuning builds on Gemma's multilingual tokenizer to produce both cloud-scale and small on-device model variants.
Community Chat Interface
AI that speaks local languages, deployed via low-bandwidth endpoints accessible on basic smartphones and 2G connections.
Cloud Run
Open-Source Toolkit
The entire stack is open-sourced for any organization to fork, deploy, and serve new language communities worldwide.
GitHub
What We're Building
Four core components — each open-sourced per our agreement with UNICEF Ventures — that together form a replicable Community AI toolkit.
Data Ingestion Pipeline
Multi-channel collection from web submissions, WhatsApp messages, and bulk uploads from field community hubs. Automated PII detection and removal ensures personal information never reaches storage.
Training Infrastructure
Data mixture recipes for balancing across languages and topics. Fine-tuning pipelines on Gemma 3 via Vertex AI, targeting both cloud-serving and small on-device models for low-connectivity rural deployment.
Inference & Chat
Serving endpoints and a community-facing chat interface in local languages. Designed for low-bandwidth access on basic smartphones — works on 2G networks with minimal data consumption.
Open-Source Toolkit
The complete pipeline — collection, processing, training notebooks, data recipes, and inference config — released as open-source. Any organization can fork and deploy for a new language group or region.
Two Ways to Deploy
We run on Google Cloud. But the entire stack is open-sourced as a Digital Public Good — deploy it on any infrastructure that fits your context.
Google Cloud
The Baibu production platform runs natively on GCP, optimized for Indonesia's infrastructure and Google's AI tooling.
Deploy Anywhere
Fork the open-source toolkit and run it on any cloud or on-premise infrastructure. The platform is designed to be cloud-agnostic — swap in your own storage, compute, and model serving layer.
A Digital Public Good
As part of our UNICEF Ventures investment, the entire software stack is open-sourced as a Digital Public Good — designed so any organization can replicate the Community AI model for new languages, regions, and infrastructure.
Data Submission Pipeline
Collection endpoints, validation logic, PII detection and scrubbing, deduplication, and quality scoring — from ingestion to cleaned dataset.
Training Notebooks & Recipes
Jupyter notebooks for data mixture configuration, fine-tuning scripts, and hyperparameter documentation for reproducing model training runs.
Inference Configuration
Model serving setup, quantization configs for on-device deployment, and API endpoint specifications for both cloud and edge inference.
Web Application
The full contributor-facing web app — submission forms, dashboards, community leaderboards, and admin tools — ready to deploy on Cloud Run.
Funded by UNICEF Ventures
Bahasa Ibu is a portfolio investment of the UNICEF Venture Fund, which invests in early-stage, open-source technology solutions that benefit children and communities worldwide. All platform deliverables are open-sourced as a Digital Public Good.
UNICEF Venture Fund Portfolio Company
Selected from hundreds of applicants for building open-source AI infrastructure that serves under-represented communities. Investment is scoped specifically to the Bahasa Ibu initiative and its Community AI platform, with all pipeline deliverables open-sourced as a Digital Public Good.
Let's Build Community AI Together
Whether you're a contributor, a partner, or a funder — there's a place for you in this project.