2026 · Multilingual language models

Sweta-Hi and Sweta-Kn

Hindi and Kannada language models trained from scratch, a technical tribute to the languages closest to my life.

Released 2026

Why I built this

Hindi is my mother tongue. I moved to Bengaluru for work, and the city and its people welcomed me warmly. Training a Kannada language model felt like a small technical contribution to the community that gave me that opportunity. Both languages are underrepresented in open-source LLM research relative to their speaker populations.

134M Parameters

64K Vocab size

3 Languages (EN + HI + KN)

Architecture

Multilingual Corpus

Sangraha Hindi · 34.5 GB verified Hindi text
Sangraha Kannada · 14 GB verified Kannada text
Samanantar · 49.6M English-Indic sentence pairs
IndicCorp v2 · Indian-context English
Aya Dataset · 8K Hindi and Kannada instruction pairs
KCC Agriculture Q&A · domain coverage

Custom 64K BPE Tokenizer

Joint training · English, Hindi, and Kannada trained together on a balanced corpus
64K vocab · double Phoenix's 32K, to cover both Indic scripts without excessive fragmentation
Language-balanced · tokenizer training corpus balanced across all three languages
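
One way to read "language-balanced" is an equal line quota per language when assembling the tokenizer training text. The sketch below illustrates that idea only; `balance_corpus`, the quota, and the toy data are hypothetical, and the project's actual balancing rule (by lines, bytes, or tokens) is not stated here.

```python
import random

def balance_corpus(lines_by_lang, per_lang, seed=0):
    """Sample the same number of lines from every language, then shuffle.

    Hypothetical helper: the equal-quota rule is an assumption for illustration.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for lang, lines in sorted(lines_by_lang.items()):
        if len(lines) < per_lang:
            raise ValueError(f"{lang}: only {len(lines)} lines, need {per_lang}")
        # sample without replacement so no line is duplicated
        balanced.extend((lang, line) for line in rng.sample(lines, per_lang))
    rng.shuffle(balanced)  # interleave languages before tokenizer training
    return balanced

# Toy corpus with deliberately unequal sizes
corpus = {
    "en": [f"english line {i}" for i in range(50)],
    "hi": [f"हिंदी पंक्ति {i}" for i in range(30)],
    "kn": [f"ಕನ್ನಡ ಸಾಲು {i}" for i in range(20)],
}
sample = balance_corpus(corpus, per_lang=10)
```

Capping every language at the same quota keeps the largest corpus from dominating the learned merges, at the cost of discarding data from the larger languages.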

Model

RoPE · rotary positional encoding
SwiGLU · gated activation, no ReLU
RMSNorm · pre-norm, no bias
FlashAttention · PyTorch 2.x kernel
134M params · 64K embedding table
bf16 · same architecture as Phoenix
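
Two of the components listed above, RMSNorm and the SwiGLU feed-forward block, are small enough to sketch numerically. This is written in plain Python for readability, not as the model's code: a real implementation would use PyTorch modules, and the function names, the 2-dimensional inputs, and the identity weights are illustrative assumptions.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root mean square, apply a learned gain.

    No mean subtraction and no bias term.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish): v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise gating
    return matvec(w_down, hidden)

# Toy inputs (hypothetical sizes and weights)
normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
identity = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu_ffn([1.0, -1.0], identity, identity, identity)
```

In a pre-norm block, `rms_norm` is applied to the input of each attention and feed-forward sublayer before the residual addition.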

Per-Language Evaluation

Perplexity is evaluated separately for each language on held-out sets drawn from each corpus. EN, HI, and KN PPL are tracked independently to detect language forgetting during training.
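
Per-language perplexity is the exponential of the mean per-token negative log-likelihood, accumulated separately for each language. A minimal sketch, assuming the evaluation loop yields one (language, summed NLL in nats, token count) record per held-out batch; `perplexity_by_language` and the toy records are hypothetical.

```python
import math
from collections import defaultdict

def perplexity_by_language(records):
    """Return {lang: exp(total_nll / total_tokens)} from per-batch records.

    Each record is (lang, summed_nll_in_nats, token_count).
    """
    nll = defaultdict(float)
    tokens = defaultdict(int)
    for lang, batch_nll, n_tokens in records:
        nll[lang] += batch_nll
        tokens[lang] += n_tokens
    return {lang: math.exp(nll[lang] / tokens[lang]) for lang in nll}

# Toy records, not measured results: every token in a batch has
# probability 1/4 (hi), 1/16 (kn), or 1/64 (en).
records = [
    ("hi", 4 * math.log(4), 4),
    ("hi", 4 * math.log(4), 4),
    ("kn", 2 * math.log(16), 2),
    ("en", 3 * math.log(64), 3),
]
ppl = perplexity_by_language(records)
```

Summing NLL and token counts before dividing weights every token equally, so batches of different lengths do not skew the result. These are per-token values under a shared tokenizer, so comparisons across languages depend on how finely each language is segmented.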

Tech stack

Technologies used

core

PyTorch 2.x · HuggingFace Transformers · BPE Tokenizer (64K) · FlashAttention · RoPE · SwiGLU

data

Sangraha (Hindi 34.5 GB, Kannada 14 GB) · Samanantar (49.6M pairs) · IndicCorp v2 · Aya Dataset

tools

XLM-RoBERTa (lang detection) · MinHash LSH dedup · Ray (distributed preprocessing)

Key highlights

Proof points

01 · Custom 64K BPE tokenizer trained jointly on English, Hindi, and Kannada, covering both Indic scripts without excessive fragmentation.
02 · Hindi perplexity of 14.5: a strong signal that the model has absorbed Hindi language structure at 134M parameters.
03 · Kannada perplexity of 34.0: meaningful open-source coverage for a language with limited LLM representation.
04 · Reuses the full Phoenix 125M pipeline, demonstrating that the architecture generalises cleanly to multilingual training.
05 · Released on HuggingFace as a contribution to Indian language NLP.

Benchmark results

Hindi PPL · 14.5 · 134M params, step 2250
Kannada PPL · 34.0 · 134M params, step 2250
English PPL · 215.7 · secondary language

Focus areas

Multilingual NLP · Data engineering · Custom tokenizers · Model evaluation
