
2026 · Multilingual language models

Sweta-Hi and Sweta-Kn

Hindi and Kannada language models trained from scratch — a technical tribute to the languages closest to my life.

Released 2026

Why I built this

Hindi is my mother tongue. I moved to Bengaluru for work, and the city and its people welcomed me warmly. Training a Kannada language model felt like a small technical contribution to the community that gave me that opportunity. Both languages are underrepresented in open-source LLM research relative to their speaker populations.

134M Parameters
64K Vocab size
3 Languages (EN + HI + KN)

Architecture

Multilingual Corpus

Sangraha: Hindi verified (34.5 GB) and Kannada verified (14 GB). Samanantar: 49.6M English↔Indic sentence pairs. IndicCorp v2 for Indian-context English. Aya dataset (8K Hindi + Kannada instruction pairs). KCC agriculture Q&A for domain coverage.

Custom 64K BPE Tokenizer

Trained jointly on English, Hindi, and Kannada corpora. 64K vocab (vs 32K for Phoenix) to cover both Indic scripts without excessive token fragmentation. Language-balanced training corpus for the tokenizer itself.
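The page does not show how the tokenizer corpus was balanced or how fragmentation was measured, so the following is only a minimal sketch under stated assumptions. The helper names `balanced_sample` and `fertility` are hypothetical. Equal per-language document counts are one possible balancing rule, and tokens per whitespace-separated word is one common proxy for fragmentation. The whitespace tokenizer in the demo is a stand-in for the real 64K BPE tokenizer.

```python
import random
from collections import defaultdict


def balanced_sample(docs, per_lang, seed=0):
    """Draw up to `per_lang` documents for each language code.

    `docs` is an iterable of (lang, text) pairs. A language with fewer
    than `per_lang` documents contributes everything it has.
    """
    by_lang = defaultdict(list)
    for lang, text in docs:
        by_lang[lang].append(text)

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for lang in sorted(by_lang):
        texts = by_lang[lang]
        k = min(per_lang, len(texts))
        sample.extend((lang, t) for t in rng.sample(texts, k))
    return sample


def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word.

    Lower values mean less fragmentation. `tokenize` is any callable
    that maps a string to a list of tokens.
    """
    words = sum(len(t.split()) for t in texts)
    tokens = sum(len(tokenize(t)) for t in texts)
    return tokens / words if words else 0.0


# Toy demo: the whitespace tokenizer is a placeholder, not the real BPE model.
toy_docs = (
    [("en", "hello world")] * 5
    + [("hi", "नमस्ते दुनिया")] * 2
    + [("kn", "ನಮಸ್ಕಾರ ಜಗತ್ತು")] * 3
)
subset = balanced_sample(toy_docs, per_lang=2)
hindi_texts = [text for lang, text in subset if lang == "hi"]
print(len(subset), fertility(hindi_texts, str.split))
```

Comparing fertility per language between a 32K and a 64K vocabulary would be one way to check the fragmentation claim directly, by passing each trained tokenizer's encode function as `tokenize`.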

Model

Same LLaMA-style decoder architecture as Phoenix 125M: RoPE, SwiGLU, RMSNorm, FlashAttention, bf16. 134M parameters. The only architectural difference is the larger 64K embedding table.

Per-Language Evaluation

Separate perplexity evaluation for each language using held-out sets from each corpus. Tracks EN, HI, and KN PPL independently to detect language forgetting during training.
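As an illustration of per-language tracking, here is a minimal sketch, not the project's actual evaluation code. It assumes each held-out batch has already been reduced to a hypothetical record of `(lang, total_nll, n_tokens)`, with the negative log-likelihood summed in nats. Perplexity is then the exponential of the mean per-token NLL. The `forgetting_alerts` helper and its 5% tolerance are illustrative choices, not values from the project.

```python
import math
from collections import defaultdict


def perplexity_by_language(records):
    """Compute token-level perplexity separately for each language.

    `records` is an iterable of (lang, total_nll, n_tokens) tuples,
    where total_nll is the summed negative log-likelihood in nats.
    """
    nll = defaultdict(float)
    tokens = defaultdict(int)
    for lang, total_nll, n_tokens in records:
        nll[lang] += total_nll
        tokens[lang] += n_tokens
    # PPL = exp(mean NLL per token); skip languages with no tokens.
    return {
        lang: math.exp(nll[lang] / tokens[lang])
        for lang in nll
        if tokens[lang] > 0
    }


def forgetting_alerts(previous, current, tolerance=0.05):
    """Return languages whose perplexity rose by more than `tolerance`
    (relative) between two checkpoints."""
    return sorted(
        lang
        for lang, ppl in current.items()
        if lang in previous and ppl > previous[lang] * (1 + tolerance)
    )


# Toy numbers, unrelated to the reported results.
earlier = {"en": 100.0, "hi": 15.0, "kn": 36.0}
later = {"en": 120.0, "hi": 15.2, "kn": 35.0}
print(forgetting_alerts(earlier, later))
```

Aggregating NLL and token counts before exponentiating, rather than averaging per-batch perplexities, keeps the result independent of batch size.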

Tech stack

Technologies used

core

PyTorch 2.x · HuggingFace Transformers · BPE Tokenizer (64K) · FlashAttention · RoPE · SwiGLU

data

Sangraha (Hindi 34.5 GB, Kannada 14 GB) · Samanantar (49.6M pairs) · IndicCorp v2 · Aya Dataset

tools

XLM-RoBERTa (lang detection) · MinHash LSH dedup · Ray (distributed preprocessing)
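The deduplication step is only named above, so what follows is a standard-library sketch of the general MinHash LSH technique rather than the project's pipeline. Every parameter here is an assumption chosen for illustration: character 5-gram shingles, 64 hash functions, 16 bands, and BLAKE2b with a per-function salt. A production run would more likely use a dedicated library and distribute the work.

```python
import hashlib


def shingles(text, n=5):
    """Set of character n-grams after whitespace normalisation."""
    text = " ".join(text.split())
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(text, num_perm=64, n=5):
    """One minimum 64-bit hash per salted hash function."""
    grams = shingles(text, n)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=16):
    """Pairs of document ids that agree on at least one whole band."""
    if not signatures:
        return set()
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

Candidate pairs from the banding step would normally be confirmed with `estimated_jaccard` against a threshold before one document of each pair is dropped.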

Key highlights

Proof points

  1. Custom 64K BPE tokenizer trained jointly on English, Hindi, and Kannada; covers both Indic scripts without excessive fragmentation.

  2. Hindi perplexity of 14.5: strong signal the model has absorbed Hindi language structure at 134M parameters.

  3. Kannada perplexity of 34.0: meaningful open-source coverage for a language with limited LLM representation.

  4. Reuses the full Phoenix 125M pipeline, demonstrating the architecture generalises cleanly to multilingual training.

  5. Released on HuggingFace as a contribution to Indian language NLP.

Benchmark results

Hindi PPL: 14.5 (134M params, step 2250)

Kannada PPL: 34.0 (134M params, step 2250)

English PPL: 215.7 (secondary language)

Focus areas

Multilingual NLP · Data engineering · Custom tokenizers · Model evaluation