2026 · Multilingual language models
Sweta-Hi and Sweta-Kn
Hindi and Kannada language models trained from scratch — a technical tribute to the languages closest to my life.
Why I built this
Hindi is my mother tongue. I moved to Bengaluru for work, and the city and its people welcomed me warmly. Training a Kannada language model felt like a small technical contribution to the community that gave me opportunity. Both languages are underrepresented in open-source LLM research relative to their speaker populations.
Architecture
Multilingual Corpus
Sangraha: Hindi verified (34.5 GB) and Kannada verified (14 GB). Samanantar: 49.6M English↔Indic sentence pairs. IndicCorp v2 for Indian-context English. Aya dataset (8K Hindi + Kannada instruction pairs). KCC agriculture Q&A for domain coverage.
Custom 64K BPE Tokenizer
Trained jointly on English, Hindi, and Kannada corpora. The vocabulary is 64K (vs 32K for Phoenix) so that both Indic scripts are covered without excessive token fragmentation. The corpus used to train the tokenizer itself is balanced across the three languages.
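One way to picture the language-balanced tokenizer corpus is equal-count sampling per language. The sketch below is an illustration only, not the project's actual pipeline: `balanced_sample`, the `per_language` parameter, and the toy sentences are all hypothetical, and the real balancing may have been done by bytes or tokens rather than by lines.

```python
import random

def balanced_sample(corpora, per_language, seed=0):
    """Draw the same number of lines from each language's corpus.

    corpora: dict mapping a language code to a list of text lines.
    per_language: lines to draw per language, capped at the size of the
    smallest corpus so that no language has to be oversampled.
    """
    rng = random.Random(seed)
    cap = min(per_language, min(len(lines) for lines in corpora.values()))
    sample = []
    for lang in sorted(corpora):
        # rng.sample draws without replacement within one language.
        sample.extend((lang, line) for line in rng.sample(corpora[lang], cap))
    rng.shuffle(sample)  # interleave languages before tokenizer training
    return sample

# Toy stand-ins for the real corpora (hypothetical data).
toy = {
    "en": ["the cat sat", "hello world", "good morning", "thank you"],
    "hi": ["नमस्ते दुनिया", "आप कैसे हैं", "धन्यवाद"],
    "kn": ["ನಮಸ್ಕಾರ", "ಧನ್ಯವಾದ"],
}
mixed = balanced_sample(toy, per_language=3)
counts = {lang: sum(1 for l, _ in mixed if l == lang) for lang in toy}
print(counts)
```

Capping at the smallest corpus keeps the mix exactly even; with corpora as unequal as the Hindi and Kannada sets, an alternative is to upsample the smaller language instead, at the cost of repeated lines.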
Model
Same LLaMA-style decoder architecture as Phoenix 125M: RoPE, SwiGLU, RMSNorm, FlashAttention, bf16. 134M parameters. The only architectural difference is the larger 64K embedding table.
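Since the embedding table is the only part that changes, its size can be estimated from the vocabulary size and hidden width alone. The snippet below is a rough sketch: `embedding_params` is a hypothetical helper, and `D_MODEL`, the exact vocabulary sizes, and whether input and output embeddings are tied are all assumptions that the description above does not state.

```python
def embedding_params(vocab_size, d_model, tied=True):
    """Parameter count of the token embedding table.

    With tied=False the output projection is a separate matrix of the
    same shape, so the count doubles.
    """
    table = vocab_size * d_model
    return table if tied else 2 * table

# Placeholder values only: not the model's actual configuration.
D_MODEL = 512
small = embedding_params(32_000, D_MODEL)
large = embedding_params(64_000, D_MODEL)
extra = large - small
print(f"extra embedding parameters: {extra:,}")
```

The extra cost grows linearly with both vocabulary size and hidden width, which is why a larger vocabulary is relatively expensive for a small model.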
Per-Language Evaluation
Perplexity is evaluated separately for each language on held-out sets drawn from each corpus. EN, HI, and KN PPL are tracked independently so that language forgetting during training shows up as a rise in one language's PPL.
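Per-language tracking can be sketched as follows, using the standard definition of perplexity as the exponential of the mean per-token negative log-likelihood (in nats). The helper names, the 5% tolerance, and the toy loss values are hypothetical; the project's actual evaluation code is not shown here.

```python
import math

def perplexity(token_nlls):
    """exp(mean negative log-likelihood per token), with NLL in nats."""
    if not token_nlls:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nlls) / len(token_nlls))

def per_language_ppl(nlls_by_lang):
    """Compute perplexity independently for each language's held-out set."""
    return {lang: perplexity(nlls) for lang, nlls in nlls_by_lang.items()}

def regressed(previous, current, tolerance=0.05):
    """Languages whose PPL rose by more than `tolerance` (relative)."""
    return sorted(
        lang for lang in current
        if lang in previous and current[lang] > previous[lang] * (1 + tolerance)
    )

# Toy per-token losses for two checkpoints (hypothetical numbers).
earlier = per_language_ppl({"en": [3.0, 3.2], "hi": [2.8, 2.6], "kn": [3.6, 3.4]})
later = per_language_ppl({"en": [3.4, 3.6], "hi": [2.7, 2.6], "kn": [3.5, 3.5]})
flagged = regressed(earlier, later)
print(flagged)
```

Averaging losses over all languages together would hide this kind of regression, because an improvement in one language can mask a decline in another.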
Key highlights
Proof points
- 01. Custom 64K BPE tokenizer trained jointly on English, Hindi, and Kannada. It covers both Indic scripts without excessive fragmentation.
- 02. Hindi perplexity of 14.5: a strong signal that the model has absorbed Hindi language structure at 134M parameters.
- 03. Kannada perplexity of 34.0: meaningful open-source coverage for a language with limited LLM representation.
- 04. Reuses the full Phoenix 125M pipeline, demonstrating that the architecture generalises cleanly to multilingual training.
- 05. Released on HuggingFace as a contribution to Indian language NLP.
Benchmark results
Hindi PPL: 14.5 (134M params, step 2250)
Kannada PPL: 34.0 (134M params, step 2250)
English PPL (secondary language)