A Novel Parts of Speech (POS) Tagset for morphological, syntactic and lexical annotations of Saraiki language

Muhammad Nabeel Asghar, Farrukh Javed Saleemi, Sajid Iqbal, Muhammad Umar Chaudhry, Muhammad Yasir, Sibghat Ullah Bazai, Muhammad Qasim Khan


Annotated text corpora are an important resource for Natural Language Processing (NLP) applications such as machine translation, information retrieval and text mining. Annotating a corpus requires part-of-speech (POS) tags, which mark units of text with grammatical annotations that identify the linguistic properties of a word, sentence or discourse. Text items are marked on the basis of two main features: 1) grammatical category and 2) the context of the text (word, sentence or discourse), i.e. its relationship with adjacent and related text.

Saraiki, although one of the oldest languages, is still a resource-scarce language, both in recorded literature and in computational contexts. According to our study, no tagset has yet been defined for the Saraiki language. This work presents the first hierarchical POS tagset (MPOST) for the Saraiki language, designed for morphological, syntactic and lexical annotation of Saraiki corpora.


Corpora; Parts of Speech (POS); Saraiki; Tag set; Tagging


DOI: http://dx.doi.org/10.36785/jaes.111459

Journal of Applied and Emerging Sciences by BUITEMS is licensed under a Creative Commons Attribution 4.0 International License.
Based on a work at www.buitms.edu.pk.
Permissions beyond the scope of this license may be available at http://journal.buitms.edu.pk/j/index.php/bj
