OPEN-SOURCE AI INFRASTRUCTURE

Community AI, built open

An end-to-end platform for ethically sourcing conversational data from under-represented languages and fine-tuning multilingual models — shipped as an open-source Digital Public Good anyone can deploy anywhere.

LIVE NOW

Where we are

The data collection platform is live and actively receiving contributions. Here's what's running today versus what's ahead.

Live

Data collection platform

  • Submission form live — currently open for Javanese and Sundanese content
  • Automated processing pipeline operational (PII removal, deduplication, quality scoring)
  • Public stats dashboard tracking submissions and word counts in real time
  • Community contributor recruitment and onboarding underway

View live stats

Roadmap

What's next

  • Expand collection to additional Indonesian languages
  • Open-weight multilingual model fine-tuning — beginning Q2 2026
  • Community chat interface for local-language AI access
  • Open-source release of the full pipeline toolkit

ARCHITECTURE

How the platform works

From community data collection to model serving — every stage is modular, designed for scale and replication on any infrastructure.

Layer 1
Web · WhatsApp · Hubs

Community data collection

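Contributors submit conversations, stories, and other text through the web app, WhatsApp, and in-person community hubs. The layer is designed to cover Indonesia's 700+ languages; collection is currently open for Javanese and Sundanese. Each submission is tagged by language, dialect, tone, and topic.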
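
As an illustration of the tagging described above, a submission could be represented as a small record like the one below. This is a sketch, not the platform's actual schema: the class name, the `channel` field, and the use of ISO 639-1 codes ("jv" for Javanese) are assumptions made for the example.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Submission:
    """One contributed text plus the four tags named in the text above."""
    text: str
    language: str          # assumed convention: ISO 639-1 code, e.g. "jv" or "su"
    dialect: str
    tone: str
    topic: str
    channel: str = "web"   # hypothetical field: "web", "whatsapp", or "hub"

    def word_count(self) -> int:
        # Whitespace-separated tokens; a real pipeline may tokenize differently.
        return len(self.text.split())


example = Submission(
    text="Piye kabare?",
    language="jv",
    dialect="Central Javanese",
    tone="informal",
    topic="greetings",
)
```

Calling `asdict(example)` turns the record into a plain dictionary that can be stored or sent on to the processing layer.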

Layer 2
Automated workflows

Data processing pipeline

PII Removal

Names, phone numbers, and identifiers are automatically detected and scrubbed before data reaches storage.
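
A minimal sketch of the scrubbing step, assuming a simple regex pass. The patterns below (Indonesian-style mobile numbers and email addresses) and the placeholder tokens are assumptions for illustration. Detecting personal names usually needs a named-entity model or word lists and is not covered here.

```python
import re

# Assumed pattern: mobile numbers starting with +62, 62, or 0, followed by 8,
# with optional spaces or hyphens between the remaining digits.
PHONE_RE = re.compile(r"(?<!\d)(?:\+62|62|0)8\d(?:[\s-]?\d){6,10}(?!\d)")

# Assumed pattern: a deliberately loose email matcher.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub_pii(text: str) -> str:
    """Replace emails and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


print(scrub_pii("Hubungi 0812-3456-7890 atau budi@example.com"))
# prints: Hubungi [PHONE] atau [EMAIL]
```

Running the scrubber before anything is written to storage matches the ordering described above, where data is cleaned before it is persisted.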

Deduplication

Exact and fuzzy matching across submissions removes repeated and near-identical text, so the dataset is not padded with redundant entries.
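
The two kinds of matching can be sketched with the Python standard library: a hash of the normalized text catches exact repeats, and a similarity ratio catches near-duplicates. The normalization rule and the 0.9 threshold are assumptions, and the pairwise comparison is only practical for small batches; a production pipeline would more likely use something like MinHash.

```python
import hashlib
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(texts, fuzzy_threshold=0.9):
    """Return texts with exact and near-duplicate entries removed, keeping first occurrences."""
    seen_hashes = set()
    kept_normalized = []
    result = []
    for text in texts:
        norm = normalize(text)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        if any(
            SequenceMatcher(None, norm, other).ratio() >= fuzzy_threshold
            for other in kept_normalized
        ):
            continue  # near-duplicate of something already kept
        seen_hashes.add(digest)
        kept_normalized.append(norm)
        result.append(text)
    return result
```

For example, `deduplicate(["Piye kabare?", "piye   kabare?", "Piye kabare??", "Wilujeng enjing"])` keeps only the first and last entries.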

Quality Scoring

Each submission is scored on length, coherence, cultural relevance, and topical diversity.
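
One way to combine the four criteria is a weighted mean of per-criterion scores. The sketch below assumes each criterion is already expressed as a number between 0 and 1. Only the length score is computed here, with an assumed 50-word target; coherence, cultural relevance, and topical diversity would come from models or reviewers that this example does not attempt to reproduce.

```python
def length_score(text: str, target_words: int = 50) -> float:
    """Score in [0, 1] that rises linearly until target_words is reached."""
    return min(len(text.split()) / target_words, 1.0)


def quality_score(subscores: dict, weights: dict) -> float:
    """Weighted mean of sub-scores, each expected to lie in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[name] * subscores[name] for name in weights) / total_weight


# Hypothetical weights and sub-scores, for illustration only.
weights = {"length": 1.0, "coherence": 1.0, "cultural": 1.0, "diversity": 1.0}
subscores = {
    "length": length_score("kata " * 25),  # 25 words against a 50-word target
    "coherence": 1.0,
    "cultural": 0.5,
    "diversity": 1.0,
}
score = quality_score(subscores, weights)
```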

Layer 3
Open-weight models

Model training & fine-tuning

Dataset Packaging

Data mixture recipes balance the training set across languages, topics, and quality tiers so that no single language or topic dominates training.
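
As one possible way to balance across languages, the sketch below uses exponent-scaled sampling, a common choice in multilingual training. It is not confirmed as the recipe used here: the function name and the `alpha` value are assumptions. An exponent below 1 gives smaller languages a larger share than their raw counts would.

```python
def mixture_weights(counts: dict, alpha: float = 0.5) -> dict:
    """Sampling probability per language, proportional to count ** alpha.

    alpha = 1.0 reproduces the raw proportions; smaller values flatten them.
    """
    scaled = {lang: n ** alpha for lang, n in counts.items()}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


# Hypothetical submission counts, for illustration only.
weights = mixture_weights({"jv": 900, "su": 100}, alpha=0.5)
# jv drops from 90% of the raw data to 75% of the mixture; su rises from 10% to 25%.
```

The same idea extends to topics or quality tiers by computing weights over those categories instead of languages.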

Multilingual Fine-Tuning

Fine-tuning recipes for open-weight multilingual base models — producing both cloud-scale and small on-device variants.

Output

Community chat interface

AI that speaks local languages, deployed via low-bandwidth endpoints accessible on basic smartphones and 2G connections.

Output

Open-source toolkit

The entire stack is open-sourced for any organization to fork, deploy, and serve new language communities worldwide.

THE STACK

What we're building

Four core components — each open-sourced per our agreement with UNICEF Ventures — that together form a replicable Community AI toolkit.

1

Data Ingestion Pipeline

Multi-channel collection from web submissions, WhatsApp messages, and bulk uploads from field community hubs. Automated PII detection and removal ensures personal information never reaches storage.

Web Forms · WhatsApp · Bulk Import

2

Training Infrastructure

Data mixture recipes for balancing across languages and topics. Fine-tuning pipelines for open-weight multilingual models — targeting both cloud-serving and small on-device variants for low-connectivity rural deployment.

Open Weights · LoRA / PEFT · Quantization

3

Inference & Chat

Serving endpoints and a community-facing chat interface in local languages. Designed for low-bandwidth access on basic smartphones — works on 2G networks with minimal data consumption.

Low Bandwidth · On-Device · 2G Friendly

4

Open-Source Toolkit

The complete pipeline — collection, processing, training notebooks, data recipes, and inference config — released as open-source. Any organization can fork and deploy for a new language group or region.

Apache 2.0 · GitHub · Cloud-Agnostic

OPEN SOURCE

A Digital Public Good

As part of our UNICEF Ventures investment, the entire software stack is open-sourced as a Digital Public Good — designed so any organization can replicate the Community AI model for new languages, regions, and infrastructure.

Data Submission Pipeline

Collection endpoints, validation logic, PII detection and scrubbing, deduplication, and quality scoring — from ingestion to cleaned dataset.

Training Notebooks & Recipes

Jupyter notebooks for data mixture configuration, fine-tuning scripts, and hyperparameter documentation for reproducing model training runs.
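
The stack lists LoRA / PEFT among its training techniques. As a rough illustration of why rank is a hyperparameter worth documenting, the sketch below counts trainable parameters for a single weight matrix. LoRA freezes the original matrix and trains two low-rank factors, so the trainable count is rank × (rows + columns). The 4096 × 4096 layer size and rank 8 are assumed values, not figures from this project.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense d_out x d_in weight matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two LoRA factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


# Assumed example dimensions.
dense = full_params(4096, 4096)                      # 16,777,216
adapter = lora_trainable_params(4096, 4096, rank=8)  # 65,536
fraction = adapter / dense                           # 1/256 of the dense matrix
```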

Inference Configuration

Model serving setup, quantization configs for on-device deployment, and API endpoint specifications for both cloud and edge inference.
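
To show what a quantization config controls, here is a minimal sketch of symmetric 8-bit quantization on a list of weights. The scheme is an assumption chosen for simplicity; the source does not say which quantization method the on-device variants will use.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers and scale."""
    return [q * scale for q in quantized]


q, scale = quantize_int8([254.0, -128.0, 0.0, 6.0])
# q is [127, -64, 0, 3] and scale is 2.0
restored = dequantize(q, scale)
```

Each weight is stored in one byte instead of four, at the cost of rounding error that depends on the scale.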

Web Application

The full contributor-facing web app — submission forms, dashboards, community leaderboards, and admin tools — ready to deploy on any container platform.

BACKED BY

Funded by UNICEF Ventures

Bahasa Ibu is a portfolio investment of the UNICEF Venture Fund, which invests in early-stage, open-source technology solutions that benefit children and communities worldwide. All platform deliverables are open-sourced as a Digital Public Good.

UNICEF Venture Fund Portfolio Company

Selected from hundreds of applicants for building open-source AI infrastructure that serves under-represented communities. The investment is scoped specifically to the Bahasa Ibu initiative and its Community AI platform, and all pipeline deliverables are open-sourced as a Digital Public Good.

View UNICEF Portfolio

Let's build Community AI together.

Whether you're a contributor, a partner, or a funder — there's a place for you in this project.