Open-Source AI Infrastructure

Community AI, Built Open

An end-to-end platform for ethically sourcing conversational data in under-represented languages and fine-tuning multilingual models on it. It runs natively on Google Cloud and ships as an open-source Digital Public Good that anyone can deploy anywhere.

Architecture

How the Platform Works

From community data collection to model serving, every stage runs on Google Cloud infrastructure and is designed for scale and replication.

Layer 1

Community Data Collection

Contributors submit conversations, stories, and text in any of Indonesia's 700+ languages via the web app, WhatsApp, and in-person community hubs. Each submission is tagged by language, dialect, tone, and topic.
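To make the tagging concrete, here is a minimal sketch of what a tagged submission record and its validation might look like. The `Submission` class, its field names, the channel identifiers, and the use of ISO 639-3 language codes are all assumptions for illustration, not the platform's actual schema.

```python
from dataclasses import dataclass

# Hypothetical channel identifiers for the three collection routes.
ALLOWED_CHANNELS = {"web", "whatsapp", "hub"}


@dataclass(frozen=True)
class Submission:
    text: str
    language: str  # assumed convention: ISO 639-3 code, e.g. "jav" for Javanese
    dialect: str
    tone: str
    topic: str
    channel: str


def validate(sub: Submission) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if not sub.text.strip():
        errors.append("text is empty")
    if sub.channel not in ALLOWED_CHANNELS:
        errors.append(f"unknown channel: {sub.channel}")
    for field in ("language", "dialect", "tone", "topic"):
        if not getattr(sub, field).strip():
            errors.append(f"missing tag: {field}")
    return errors


good = Submission("Sugeng enjing", "jav", "Central Javanese", "formal", "greeting", "web")
bad = Submission("  ", "jav", "", "formal", "greeting", "sms")
```

Returning a list of problems rather than raising on the first one lets a submission form show every issue to the contributor at once.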

Cloud Run

↓

Layer 2

Automated Data Processing Pipeline

Cloud Workflows + Functions

PII Removal

Names, phone numbers, and identifiers are automatically detected and scrubbed before data reaches storage.
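As a rough illustration of the scrubbing step, the sketch below masks email addresses, Indonesian-style mobile numbers, and a caller-supplied list of names. The regular expressions, the placeholder tokens, and the `scrub_pii` function are assumptions made for this example. Reliable name detection in particular would likely need a dedicated PII detection model or service rather than a fixed list.

```python
import re

# Assumed pattern: Indonesian mobile numbers starting with +62, 62, or 0,
# followed by 8 and nine to eleven more digits, with optional separators.
PHONE_RE = re.compile(
    r"(?<!\w)(?:\+62|62|0)8\d{1,2}[\s-]?\d{3,4}[\s-]?\d{3,5}(?!\d)"
)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub_pii(text, known_names=()):
    """Replace emails, phone numbers, and listed names with placeholders."""
    # Emails go first so a name inside an address is not masked separately.
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    for name in known_names:
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text, flags=re.IGNORECASE)
    return text


cleaned = scrub_pii(
    "Hubungi Budi di 0812-3456-7890 atau budi@example.com",
    known_names=["Budi"],
)
```

Running the scrubber before anything is written to storage matches the guarantee described above: the raw text never has to be persisted.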

Deduplication

Exact and fuzzy matching across submissions removes duplicates and near-duplicates, keeping the dataset free of redundancy.
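One simple way to combine exact and fuzzy matching is sketched below: a hash of the normalized text catches exact repeats, and Jaccard similarity over character trigrams catches near-duplicates. The choice of trigrams, the 0.8 threshold, and all function names are assumptions for illustration. The pairwise comparison is quadratic, so a large corpus would likely need an approximate method such as MinHash.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text, n=3):
    """Set of character n-grams; short texts yield a single shingle."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(texts, threshold=0.8):
    """Keep the first occurrence of each text, dropping exact and near duplicates."""
    seen_hashes = set()
    kept = []  # (original text, shingle set) pairs
    for text in texts:
        norm = normalize(text)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        sh = shingles(norm)
        if any(jaccard(sh, other) >= threshold for _, other in kept):
            continue  # near duplicate of something already kept
        seen_hashes.add(digest)
        kept.append((text, sh))
    return [text for text, _ in kept]


unique = deduplicate([
    "Selamat pagi, apa kabar?",
    "selamat  pagi, apa kabar?",   # differs only in case and spacing
    "Selamat pagi, apa kabar??",   # one extra character
    "Terima kasih banyak",
])
```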

Quality Scoring

Each submission is scored on length, coherence, cultural relevance, and topical diversity.
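A scoring step like this is often a weighted combination of per-dimension scores. The sketch below assumes each dimension is expressed in the range 0 to 1. Only the length score is computed here; coherence, cultural relevance, and topical diversity are passed in, since how they are measured is not described. The weights, the 50-word target, and the function names are hypothetical.

```python
# Hypothetical weights; they sum to 1 so the result stays in [0, 1].
WEIGHTS = {
    "length": 0.2,
    "coherence": 0.3,
    "cultural_relevance": 0.3,
    "topical_diversity": 0.2,
}


def length_score(text, target_words=50):
    """Scale word count linearly up to an assumed target, capped at 1.0."""
    return min(len(text.split()) / target_words, 1.0)


def quality_score(text, coherence, cultural_relevance, topical_diversity):
    """Weighted average of the four dimensions, each expected in [0, 1]."""
    parts = {
        "length": length_score(text),
        "coherence": coherence,
        "cultural_relevance": cultural_relevance,
        "topical_diversity": topical_diversity,
    }
    for name, value in parts.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * value for name, value in parts.items())


# A 25-word text reaches half of the assumed 50-word target.
score = quality_score("kata " * 25, coherence=1.0, cultural_relevance=1.0, topical_diversity=0.5)
```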

↓

Layer 3

Model Training & Fine-Tuning

Vertex AI

Dataset Packaging

Data mixture recipes balance the training set across languages, topics, and quality tiers so that no single language or topic dominates training.
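One common way to balance a multilingual mixture is exponent-scaled sampling, where each language's sampling weight is proportional to its example count raised to a power between 0 and 1. This is a general technique, not necessarily the recipe used here, and it covers only the language dimension. The function name, the `alpha` value, and the example counts are assumptions.

```python
def mixture_weights(counts, alpha=0.5):
    """Sampling weight per language, proportional to count ** alpha.

    alpha=1.0 reproduces the raw proportions; smaller values shift
    probability mass toward low-resource languages.
    """
    scaled = {lang: n ** alpha for lang, n in counts.items() if n > 0}
    total = sum(scaled.values())
    return {lang: value / total for lang, value in scaled.items()}


# Hypothetical counts: Javanese ("jav") has 100x the data of Sundanese ("sun").
weights = mixture_weights({"jav": 10_000, "sun": 100}, alpha=0.5)
```

With these counts, Sundanese makes up about 1% of the raw data but about 9% of the samples drawn, which is the kind of rebalancing a mixture recipe aims for.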

Fine-Tuning on Gemma 3

Fine-tuning builds on Gemma's multilingual tokenizer to produce both cloud-scale model variants and small on-device variants.

↓

Output

Community Chat Interface

AI that speaks local languages, deployed via low-bandwidth endpoints accessible on basic smartphones and 2G connections.

Cloud Run

Output

Open-Source Toolkit

The entire stack is open-sourced for any organization to fork, deploy, and serve new language communities worldwide.

GitHub

The Stack

What We're Building

Four core components — each open-sourced per our agreement with UNICEF Ventures — that together form a replicable Community AI toolkit.

Data Ingestion Pipeline

Multi-channel collection from web submissions, WhatsApp messages, and bulk uploads from field community hubs. Automated PII detection and removal ensure that personal information never reaches storage.

Cloud Run, Cloud Workflows, GCS

Training Infrastructure

Data mixture recipes for balancing across languages and topics. Fine-tuning pipelines on Gemma 3 via Vertex AI, targeting both cloud-serving and small on-device models for low-connectivity rural deployment.

Vertex AI, Gemma 3

Inference & Chat

Serving endpoints and a community-facing chat interface in local languages. Designed for low-bandwidth access on basic smartphones — works on 2G networks with minimal data consumption.

Cloud Run, Cloud Functions

Open-Source Toolkit

The complete pipeline — collection, processing, training notebooks, data recipes, and inference config — released as open-source. Any organization can fork and deploy for a new language group or region.

Apache 2.0, GitHub

Two Ways to Deploy

We run on Google Cloud. But the entire stack is open-sourced as a Digital Public Good — deploy it on any infrastructure that fits your context.

Our Deployment

Google Cloud

The Baibu production platform runs natively on GCP — optimised for Indonesia's infrastructure and Google's AI tooling.

Cloud Run, Cloud Storage, Vertex AI, Cloud Workflows, Cloud Functions, Gemma 3

Your Deployment

Deploy Anywhere

Fork the open-source toolkit and run it on any cloud or on-premises infrastructure. The platform is designed to be cloud-agnostic: swap in your own storage, compute, and model-serving layers.
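Swapping a layer usually comes down to coding against a small interface instead of a specific cloud SDK. The sketch below shows the idea for storage only, with an in-memory backend and a local-filesystem backend behind one interface. The `BlobStore` protocol, both backend classes, and `save_submission` are hypothetical names, not the toolkit's actual API.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class BlobStore(Protocol):
    """Minimal storage interface the pipeline code would depend on."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class MemoryStore:
    """In-memory backend, handy for tests."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]


class LocalStore:
    """Filesystem backend for on-premises or single-machine deployments."""

    def __init__(self, root):
        self.root = Path(root)

    def put(self, key, data):
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key):
        return (self.root / key).read_bytes()


def save_submission(store: BlobStore, submission_id: str, text: str) -> str:
    """Pipeline code sees only the interface, never a concrete backend."""
    key = f"submissions/{submission_id}.txt"
    store.put(key, text.encode("utf-8"))
    return key


memory = MemoryStore()
memory_key = save_submission(memory, "abc123", "Sugeng enjing")

with tempfile.TemporaryDirectory() as tmp:
    local = LocalStore(tmp)
    local_key = save_submission(local, "abc123", "Sugeng enjing")
    local_text = local.get(local_key).decode("utf-8")
```

A backend for a particular cloud's object storage would implement the same two methods, leaving `save_submission` and the rest of the pipeline unchanged.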

AWS, Azure, On-Premises, Docker, Kubernetes, Any LLM

Open Source

A Digital Public Good

As part of our UNICEF Ventures investment, the entire software stack is open-sourced as a Digital Public Good — designed so any organization can replicate the Community AI model for new languages, regions, and infrastructure.

Data Submission Pipeline

Collection endpoints, validation logic, PII detection and scrubbing, deduplication, and quality scoring — from ingestion to cleaned dataset.

Training Notebooks & Recipes

Jupyter notebooks for data mixture configuration, fine-tuning scripts, and hyperparameter documentation for reproducing model training runs.

Inference Configuration

Model serving setup, quantization configs for on-device deployment, and API endpoint specifications for both cloud and edge inference.

Web Application

The full contributor-facing web app — submission forms, dashboards, community leaderboards, and admin tools — ready to deploy on Cloud Run.

Backed By

Funded by UNICEF Ventures

Bahasa Ibu is a portfolio investment of the UNICEF Venture Fund, which invests in early-stage, open-source technology solutions that benefit children and communities worldwide. All platform deliverables are open-sourced as a Digital Public Good.

UNICEF Venture Fund Portfolio Company

Selected from hundreds of applicants for building open-source AI infrastructure that serves under-represented communities. The investment is scoped specifically to the Bahasa Ibu initiative and its Community AI platform.

VIEW UNICEF PORTFOLIO ↗

Let's Build Community AI Together

Whether you're a contributor, a partner, or a funder — there's a place for you in this project.

JOIN IN