Chat as a data flywheel
Forms collect strings. Conversations collect language. We replace rigid data-entry workflows with guided multi-turn chats that capture the full texture of how a community actually speaks — topics, registers, dialects, and code-switching included.
Why forms fail low-resource languages
Most language datasets are built from translation pairs and scraped text. For Indonesia's 700+ mother tongues — many primarily spoken, not written — that approach hits a wall fast.
Contributors quit forms
Static prompts feel like homework. Drop-off is steep, especially among older speakers and in rural communities where the language people read and write every day differs from the language of the prompt.
Single-turn data is thin
Translation pairs miss how speakers actually use a language — pragmatics, turn-taking, repair, honorifics, the moment someone switches dialect mid-sentence.
Registers go missing
From formal to friendly, elder-to-child, market haggling, prayer, gossip: different registers carry different vocabulary. Forms collect one register, usually the wrong one.
How a chat collects what a form can't
Each conversation is steered by a topic prompt and adapts in real time. The contributor talks; the platform listens, branches, and asks follow-ups designed to surface the linguistic features researchers actually need.
Topic & register selected
The contributor picks a topic (market, family, work, ritual, news) and a register (formal, conversational, child-directed). The platform tracks both as structured metadata on every turn.
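A minimal sketch of what "structured metadata on every turn" could look like, using the topics and registers listed above. The field names, the `Turn` record, and the `make_turn` helper are assumptions made for illustration; the platform's actual schema is not described here.

```python
from dataclasses import dataclass, asdict
from enum import Enum


class Topic(str, Enum):
    MARKET = "market"
    FAMILY = "family"
    WORK = "work"
    RITUAL = "ritual"
    NEWS = "news"


class Register(str, Enum):
    FORMAL = "formal"
    CONVERSATIONAL = "conversational"
    CHILD_DIRECTED = "child-directed"


@dataclass(frozen=True)
class Turn:
    """One conversational turn plus the metadata chosen at the start (hypothetical schema)."""
    conversation_id: str
    index: int      # position of the turn within the conversation
    speaker: str    # "contributor" or "platform"
    text: str
    topic: Topic
    register: Register


def make_turn(conversation_id, index, speaker, text, topic, register):
    # Topic(...) and Register(...) raise ValueError for unknown labels,
    # so an untracked topic or register cannot slip into the record.
    return Turn(conversation_id, index, speaker, text, Topic(topic), Register(register))


turn = make_turn("conv-001", 0, "contributor", "(contributor utterance)", "market", "formal")
record = asdict(turn)  # plain dict, ready to serialize
```

Because the topic and register live on the turn rather than on the conversation, a mid-thread register switch can be recorded simply by giving later turns a different `register` value.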
Conversation branches on signal
Follow-up prompts are adaptive — pushing for specifics, eliciting direct quotes, asking the contributor to switch register or dialect mid-thread. Every branch is logged for researcher review.
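As a rough illustration of branching with a logged decision trail, the sketch below picks a follow-up from a small rule table and records which rule fired. The rules, prompts, and the `BranchLog` class are hypothetical stand-ins; a real deployment would likely use a model rather than keyword rules to choose the follow-up.

```python
from dataclasses import dataclass, field


@dataclass
class BranchLog:
    """Append-only record of branching decisions, kept for researcher review."""
    entries: list = field(default_factory=list)

    def record(self, turn_index, rule, prompt):
        self.entries.append({"turn_index": turn_index, "rule": rule, "prompt": prompt})


# Hypothetical rules: (name, predicate over the contributor's text, follow-up prompt).
RULES = [
    ("ask_for_specifics", lambda text: len(text.split()) < 5,
     "Can you tell me more about what happened?"),
    ("elicit_quote", lambda text: "said" in text.lower().split(),
     "What were their exact words?"),
]
# Fallback when no rule matches: ask for a register switch.
DEFAULT = ("switch_register", "How would you say that to an elder?")


def next_prompt(turn_index, text, log):
    """Return the follow-up for this turn and log which branch was taken."""
    for name, matches, prompt in RULES:
        if matches(text):
            log.record(turn_index, name, prompt)
            return prompt
    name, prompt = DEFAULT
    log.record(turn_index, name, prompt)
    return prompt


log = BranchLog()
next_prompt(0, "It was fine", log)
next_prompt(1, "My aunt said she would come to the market tomorrow", log)
next_prompt(2, "We went to the market early in the morning", log)
```

Logging the rule name alongside the prompt is the design point: a reviewer can see why each follow-up was asked, not just what was asked.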
Output feeds back into prompts
As models improve on the collected data, the next generation of prompts gets sharper: better at eliciting rare dialects, closing hard coverage gaps, and reaching underrepresented speaker demographics. The flywheel compounds.
What every conversation gives us
A single contribution is not a sentence — it's a multi-dimensional record. Each turn is tagged across language, register, topic, and pragmatic context, then PII-scrubbed before it reaches storage.
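The ordering matters here: scrubbing happens before anything is written. The sketch below shows that ordering with two simple regular expressions for email addresses and phone-like digit runs. Which PII categories are covered and how they are detected is not stated above, so the patterns, placeholder tokens, and the `store_turn` helper are assumptions; regexes alone would miss names and addresses.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"(?<!\w)\+?\d[\d\s-]{7,}\d")


def scrub(text):
    """Replace email addresses and phone-like digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


def store_turn(text, storage):
    """Scrub first, then store, so raw text never reaches storage."""
    clean = scrub(text)
    storage.append(clean)
    return clean


storage = []
store_turn("Call me at +62 812-3456-7890 or mail ani@example.com", storage)
store_turn("We met at the market", storage)
```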
Multilingual context
Code-switches between Indonesian, regional language, and dialects within a single thread — labeled per turn.
Linguistic registers
Ngoko, krama, halus, gaul, child-directed, elder-formal — captured natively rather than reverse-engineered.
Multi-turn pragmatics
How speakers repair, hedge, joke, defer, and disagree — context that single-sentence corpora never see.
Topic diversity
Coverage tracked across daily life, work, ritual, civic, and news — so model training isn't biased toward whichever topics happened to scrape well.
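One simple way to track coverage is to compute each topic's share of collected turns and flag topics that fall below a floor. This is a sketch under assumptions: the 10% floor and the function names are illustrative, not the platform's actual thresholds or API.

```python
from collections import Counter

TOPICS = ["daily life", "work", "ritual", "civic", "news"]


def coverage(turn_topics):
    """Return each tracked topic's share of the collected turns."""
    counts = Counter(turn_topics)
    total = sum(counts[t] for t in TOPICS)
    return {t: (counts[t] / total if total else 0.0) for t in TOPICS}


def gaps(shares, floor=0.10):
    """List topics whose share is below the floor (hypothetical threshold)."""
    return [t for t in TOPICS if shares[t] < floor]


collected = ["daily life"] * 6 + ["work"] * 3 + ["news"] * 1
shares = coverage(collected)
missing = gaps(shares)  # topics to prioritize in the next round of prompts
```

With the sample counts above, "ritual" and "civic" have no turns at all and are flagged, while "news" sits exactly at the floor and is not.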
Why this compounds
Better data trains better models. Better models run better chats. Better chats collect rarer data. Every iteration tightens coverage gaps and surfaces underrepresented voices.
Guided chat
Contributors converse on chosen topics in their language.
Tagged corpus
Turns enriched with register, topic, dialect, pragmatics.
Model training
Open-weight multilingual fine-tunes built on richer signal.
Smarter prompts
Better models produce sharper follow-ups, which surface harder coverage gaps.
Partner with us on the flywheel.
Funders, researchers, and community partners — talk to us about co-designing topic packs, target dialects, or shared deployments.