🧠
Aakash Meghwar
Computational Linguist ·
NLP Engineer ·
Low-Resource Language Researcher
About
Location
🇷🇺 Nizhny Novgorod, Russia ↔ Pakistan 🇵🇰
Education
MSc @ Higher School of Economics (HSE) — Nizhny Novgorod (2024–)
Mission
Bridge linguistics and AI for low-resource languages — starting with Sindhi, spoken by 80M people with zero NLP tooling
Research Areas
Languages
Sindhi — Native 🌙
Urdu — Native ✨
English — C2 🔥
Russian — A2 📖
Fun fact
Building NLP tools for 80 million Sindhi speakers — from scratch 🌙
Flagship Projects
SindhiLM
Sindhi language model: fine-tuning Qwen2.5-0.5B on a 505M-token Sindhi corpus. First transformer dedicated to Sindhi.
In Training
SindhiLM-Tokenizer
Morpheme-aware BPE tokenizer merged into Qwen2.5. v1 is live with 7,978 Sindhi tokens. v2, with SindhiNLTK pre-segmentation, is coming.
v1 Live
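For readers curious what "merged into Qwen2.5" can mean in practice, here is a minimal sketch of vocabulary merging in plain Python. It is an illustration under stated assumptions, not the project's actual pipeline: `merge_vocab` and the toy vocabulary are hypothetical names, and the real tokenizer also carries BPE merge rules that this sketch ignores.

```python
def merge_vocab(base_vocab: dict[str, int], new_tokens: list[str]) -> dict[str, int]:
    """Return a copy of base_vocab with unseen tokens appended.

    Hypothetical helper: new tokens get IDs after the highest existing ID,
    and tokens already in the base vocabulary keep their original ID.
    """
    merged = dict(base_vocab)  # copy so the base vocabulary stays untouched
    next_id = max(merged.values(), default=-1) + 1
    for token in new_tokens:
        if token not in merged:
            merged[token] = next_id
            next_id += 1
    return merged


# Toy base vocabulary standing in for the (much larger) base model vocabulary.
base = {"a": 0, "b": 1}
merged = merge_vocab(base, ["سنڌ", "b", "ي"])
print(len(merged))
```

Appending IDs at the end leaves every existing token ID unchanged, so the base model's learned embeddings stay valid. With Hugging Face Transformers, the model's embedding matrix would also need to grow to match, typically via `model.resize_token_embeddings(len(tokenizer))`.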
sindhinltk
First open-source Python NLP library for Sindhi. Zero dependencies. Tokenizer · Normalizer · Stemmer · Stopwords · Sentiment.
pip install sindhinltk
Sindhi Corpus 505M
Largest open-source deduplicated Sindhi pretraining corpus. 742K docs · ~505M tokens · 11 source datasets.
Live on HuggingFace
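As a rough illustration of what "deduplicated" involves, the sketch below does exact-match deduplication with the Python standard library. It assumes one simple approach (Unicode NFC normalization, whitespace collapsing, SHA-256 hashing); the corpus's real pipeline may differ, for example by using near-duplicate detection. The names `normalize` and `deduplicate` are hypothetical.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Canonicalize Unicode (NFC) and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)  # keep the original text, not the normalized form
    return unique


docs = ["سنڌي  ٻولي", "سنڌي ٻولي", "ٻيو جملو"]
print(len(deduplicate(docs)))
```

Hashing the normalized text keeps memory use bounded by the number of unique documents rather than their total size, which matters at the scale of hundreds of thousands of documents.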
Aurat March Sentiment
MiniLM fine-tuned for sentiment analysis of feminist discourse. English · Urdu · Sindhi. Linked to the 2025 SSRN publication.
Live
Urdu Poetry Transformer
Compact transformer for classical Urdu poetry generation and stylistics. Related to the 2026 Springer publication.
Live
Publications & Research
Springer · Language Resources and Evaluation · 2026
Compact Transformer Models for Classical Urdu Poetry: A Computational Stylistics Approach
SSRN eJournal · 2025
The Comparative Analysis of How the Aurat March Movement is Represented in British and Pakistani Pro-Government Press
78th Herzen Readings · St. Petersburg, Russia · April 2025
Sentiment Analysis of International Students' Online Reviews: A Case Study of Experiences and Adaptation in Russian Universities
Tech Stack
🐍 Python
🔥 PyTorch
🤗 Transformers
⛓️ LangChain
📊 scikit-learn
🧮 TensorFlow
📓 Jupyter
☁️ Azure
🤖 IBM Watson
📈 Pandas
🔢 NumPy
🔤 spaCy
📦 PyPI
🐙 Git
📝 LaTeX
📊 R
Open To