Natural Language Processing for Pashto: Challenges, Methods, and Opportunities in the Context of Low-Resource Languages

Haqmal L. U. R., GÜNAY M.

5th International Conference on Informatics and Software Engineering, IISEC 2026, Ankara, Türkiye, 5-6 February 2026, pp. 52-57, (Full Text Paper)

Publication Type: Conference Paper / Full Text Paper
DOI Number: 10.1109/iisec69317.2026.11418495
City of Publication: Ankara
Country of Publication: Türkiye
Pages: pp. 52-57
Keywords: Data Scarcity, Large Language Models, Low-Resource Languages, Multilingual Models, Natural Language Processing, Pashto, Transfer Learning, Transformers
Affiliated with Akdeniz University: Yes

Abstract

While Natural Language Processing (NLP) has progressed considerably in recent years, the Pashto language remains significantly underrepresented in both academic and industrial NLP research. This paper presents a critical survey of the current state of NLP for Pashto. We analyze existing models, datasets, and methods for tasks such as part-of-speech tagging, named entity recognition, optical character recognition, and text classification. Our review reveals a lack of standard datasets, limited publicly available tools, and underexplored dialectal diversity. We also examine recent techniques, such as multilingual transformer-based models and transfer learning, which have shown considerable potential to bridge the resource gap with high-resource languages. Ethical concerns in low-resource settings, such as dialect marginalization, privacy in data collection, and fair language representation, are also covered. This work aims to serve as a foundational reference for scholars and practitioners interested in developing Pashto NLP in an informed, culturally aware, and technically robust manner.