DATA COLLECTION APPROACH

Chat as a data flywheel

Forms collect strings. Conversations collect language. We replace rigid data-entry workflows with guided multi-turn chats that capture the full texture of how a community actually speaks — topics, registers, dialects, and code-switching included.

THE PROBLEM

Why forms fail low-resource languages

Most language datasets are built from translation pairs and scraped text. For Indonesia's 700+ mother tongues, many of them primarily spoken rather than written, that approach breaks down quickly: there is little written text to scrape or translate in the first place.

🥱

Contributors quit forms

Static prompts feel like homework. Drop-off is steep, especially among older speakers and in rural communities where the registers people read and write in differ from the language of the prompts.

📋

Single-turn data is thin

Translation pairs miss how speakers actually use a language — pragmatics, turn-taking, repair, honorifics, the moment someone switches dialect mid-sentence.

🎭

Registers go missing

Formal address, talk between friends, elder to child, market haggling, prayer, gossip: each register carries its own vocabulary. Forms collect one register, usually the wrong one.

THE FLYWHEEL

How a chat collects what a form can't

Each conversation is steered by a topic prompt and adapts in real time. The contributor talks; the platform listens, branches, and asks follow-ups designed to surface the linguistic features researchers actually need.

Topic: Market memories · Javanese (ngoko)
Coba ceritakan, pasar apa yang paling sering kamu kunjungi waktu kecil?
Pasar Beringharjo, bareng mbah putri. Mesti tuku jajan pasar.
Jajan pasar apa sing paling mbok elingi? Sopo sing biasane nawar regane?
Klepon karo cenil. Mbah putri sing nawar — galak banget, hahaha.
Kepriye carane mbah nawar? Coba tulis persis tembung sing dianggo.
Turns
12
Register
Ngoko
Code-switch
1

Topic & register selected

The contributor picks a topic (market, family, work, ritual, news) and a register (formal, conversational, child-directed). The platform tracks both as structured metadata on every turn.
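As a minimal sketch of what "structured metadata on every turn" could look like, the Python below copies a session's topic and register onto each turn. The `Topic` and `Register` values come from the lists above; the `Turn` fields, the `tag_turns` helper, and the speaker labels are hypothetical names chosen for illustration, not the platform's actual schema.

```python
from dataclasses import dataclass, asdict
from enum import Enum


class Topic(str, Enum):
    MARKET = "market"
    FAMILY = "family"
    WORK = "work"
    RITUAL = "ritual"
    NEWS = "news"


class Register(str, Enum):
    FORMAL = "formal"
    CONVERSATIONAL = "conversational"
    CHILD_DIRECTED = "child-directed"


@dataclass(frozen=True)
class Turn:
    """One chat turn plus the session-level choices copied onto it."""
    index: int
    speaker: str  # "platform" or "contributor" (hypothetical labels)
    text: str
    topic: Topic
    register: Register


def tag_turns(texts, topic, register):
    """Attach the session's topic and register to every (speaker, text) pair."""
    # Enum lookup raises ValueError for a topic or register outside the lists.
    topic, register = Topic(topic), Register(register)
    return [
        Turn(i, speaker, text, topic, register)
        for i, (speaker, text) in enumerate(texts, start=1)
    ]


session = tag_turns(
    [
        ("platform", "Which market did you visit most as a child?"),
        ("contributor", "Pasar Beringharjo, bareng mbah putri."),
    ],
    topic="market",
    register="conversational",
)
rows = [asdict(turn) for turn in session]  # plain dicts, ready to serialize
```

Storing the tags on each turn rather than once per session keeps every row self-describing, which matters if a contributor is later asked to change register mid-thread.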

2

Conversation branches on signal

Follow-up prompts are adaptive — pushing for specifics, eliciting direct quotes, asking the contributor to switch register or dialect mid-thread. Every branch is logged for researcher review.
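One way to picture branching with a review log is a small rule-based chooser. This is a sketch under stated assumptions, not the platform's real logic: the cue word set, the word-count threshold, the rule names, and the prompt wording are all hypothetical, and a production system would more likely use a model than keyword rules.

```python
import string
from dataclasses import dataclass, field


@dataclass
class BranchLog:
    """Collects every branching decision so researchers can review it later."""
    entries: list = field(default_factory=list)

    def record(self, turn_index, rule, prompt):
        self.entries.append({"turn": turn_index, "rule": rule, "prompt": prompt})


QUOTE_CUES = frozenset({"nawar"})  # hypothetical cue: the reply mentions haggling
MIN_WORDS = 6                      # hypothetical threshold for a "thin" reply


def choose_follow_up(reply, turn_index, log):
    """Pick the next prompt from simple signals in the reply and log the branch."""
    words = [w.strip(string.punctuation).lower() for w in reply.split()]
    words = [w for w in words if w]
    if QUOTE_CUES & set(words):
        rule, prompt = "elicit_direct_quote", "Please write the exact words that were used."
    elif len(words) < MIN_WORDS:
        rule, prompt = "push_for_specifics", "Can you describe that in more detail?"
    else:
        rule, prompt = "switch_register", "How would you say the same thing to an elder?"
    log.record(turn_index, rule, prompt)
    return prompt


log = BranchLog()
next_prompt = choose_follow_up(
    "Klepon karo cenil. Mbah putri sing nawar, galak banget.", 4, log
)
```

Because the rule name is stored alongside the prompt, a reviewer can see why a given follow-up was asked, not just what it said.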

3

Output feeds back into prompts

As models improve from collected data, the next generation of prompts gets sharper: better at eliciting rare dialects, closing harder coverage gaps, and reaching underrepresented speaker demographics. The flywheel compounds.

STRUCTURED SIGNAL

What every conversation gives us

A single contribution is not a sentence — it's a multi-dimensional record. Each turn is tagged across language, register, topic, and pragmatic context, then PII-scrubbed before it reaches storage.
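To make "tagged, then PII-scrubbed before it reaches storage" concrete, here is a deliberately small sketch. The two regular expressions cover only email addresses and phone-like digit runs; real PII scrubbing would also need to handle names, addresses, and ID numbers. The `store_turn` helper, the field names, and the tag values are assumptions for illustration.

```python
import re

# Illustrative patterns only; they are not a complete PII definition.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def scrub_pii(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def store_turn(storage, text, *, language, register, topic, pragmatics):
    """Tag one turn and append it to storage; only scrubbed text is kept."""
    record = {
        "text": scrub_pii(text),
        "language": language,
        "register": register,
        "topic": topic,
        "pragmatics": pragmatics,
    }
    storage.append(record)
    return record


storage = []
store_turn(
    storage,
    "Hubungi saya di 0812-3456-7890 atau budi@example.com",
    language="id",
    register="conversational",
    topic="market",
    pragmatics="request",
)
```

Scrubbing inside `store_turn` means there is no code path that writes raw text, which is the property the paragraph above describes.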

🌐

Multilingual context

Code-switches between Indonesian, regional languages, and dialects within a single thread, labeled per turn.
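Given one language label per turn, the switch points in a thread can be found by comparing neighbouring turns. The sketch below assumes labels are short codes such as `id` (Indonesian) and `jv` (Javanese); the function name and the sample label sequence are made up for illustration. It only detects switches between turns, not switches inside a single sentence.

```python
def code_switch_points(turn_languages):
    """Return indices of turns whose language label differs from the previous turn."""
    return [
        i
        for i in range(1, len(turn_languages))
        if turn_languages[i] != turn_languages[i - 1]
    ]


labels = ["id", "jv", "jv", "id", "jv"]  # hypothetical per-turn labels
switches = code_switch_points(labels)    # [1, 3, 4]
switch_count = len(switches)             # 3
```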

🗣️

Linguistic registers

Ngoko, krama, halus, gaul, child-directed, elder-formal — captured natively rather than reverse-engineered.

💬

Multi-turn pragmatics

How speakers repair, hedge, joke, defer, and disagree — context that single-sentence corpora never see.

🎯

Topic diversity

Coverage tracked across daily life, work, ritual, civic, and news — so model training isn't biased toward whichever topics happened to scrape well.
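Coverage tracking can be sketched as a share-per-topic count with a floor below which a topic is flagged. The topic names come from the list above; the `coverage_gaps` function and the 10% floor are hypothetical choices, not figures from the platform.

```python
from collections import Counter

TOPICS = ("daily life", "work", "ritual", "civic", "news")


def coverage_gaps(turn_topics, min_share=0.1):
    """Return the topics whose share of collected turns falls below min_share."""
    counts = Counter(turn_topics)
    total = sum(counts[t] for t in TOPICS)
    if total == 0:
        return list(TOPICS)  # nothing collected yet, so every topic is a gap
    return [t for t in TOPICS if counts[t] / total < min_share]


collected = ["daily life"] * 6 + ["work"] * 3 + ["news"] * 1
gaps = coverage_gaps(collected)  # ["ritual", "civic"]
```

The same function works for dialects or registers by swapping the label set, so one coverage report can span all three dimensions.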

THE COMPOUND EFFECT

Why this compounds

Better data trains better models. Better models run better chats. Better chats collect rarer data. Every iteration tightens coverage gaps and surfaces underrepresented voices.

Stage 1

Guided chat

Contributors converse on chosen topics in their language.

Stage 2

Tagged corpus

Turns enriched with register, topic, dialect, pragmatics.

Stage 3

Model training

Open-weight multilingual fine-tunes built on richer signal.

Stage 4

Smarter prompts

Better models produce sharper follow-ups, which surface harder coverage gaps.

Each loop tightens dialect, topic, and register coverage.

Partner with us on the flywheel.

Funders, researchers, and community partners — talk to us about co-designing topic packs, target dialects, or shared deployments.