Sarwar, Raheem ORCID: https://orcid.org/0000-0002-0640-807X and Hassan, Saeed-Ul (2022) UrduAI: writeprints for Urdu authorship identification. ACM Transactions on Asian and Low-Resource Language Information Processing, 21 (2). 34. ISSN 2375-4699
Accepted Version
Available under License In Copyright.
Abstract
The authorship identification task aims to identify the original author of an anonymous text sample from a set of candidate authors. It has several application domains, such as digital text forensics and information retrieval, and these domains are not limited to a specific language. However, most authorship identification studies focus on English, and limited attention has been paid to Urdu. Moreover, the accuracy of existing Urdu authorship identification solutions drops as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. In addition, due to the unavailability of reliable POS taggers and sentence segmenters, all existing authorship identification studies on Urdu text are limited to word n-gram features. To overcome these limitations, we formulate a stylometric feature space that is not limited to word n-gram features. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on a nearest neighbors classifier to predict the original author of the anonymous text sample. To evaluate our solution, we create a significantly larger corpus than those used in existing studies and conduct several experiments, which show that our solution overcomes the limitations of existing studies and achieves an accuracy of 94.03%, which is higher than that of all previous authorship identification works.
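The pipeline outlined in the abstract (text sample to point set, then nearest-neighbor prediction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the stylometric features, the fragment size, or the distance between point sets, so the three toy features, the whitespace tokenisation, the `fragment_size` parameter, and the symmetrised nearest-point distance below are all assumptions chosen for illustration. The candidate retrieval step is omitted, and every training sample is compared directly.

```python
import math
from collections import Counter


def fragment_features(tokens):
    """Map one fragment (a non-empty list of tokens) to a small feature vector.

    Assumed toy features: average token length, type-token ratio, and the
    share of tokens that occur exactly once in the fragment.
    """
    n = len(tokens)
    counts = Counter(tokens)
    avg_len = sum(len(t) for t in tokens) / n
    type_token_ratio = len(counts) / n
    hapax_ratio = sum(1 for c in counts.values() if c == 1) / n
    return (avg_len, type_token_ratio, hapax_ratio)


def to_point_set(text, fragment_size=4):
    """Split a text into fixed-size token fragments; each fragment is one point.

    Whitespace tokenisation is an assumption; real Urdu text may need more care.
    """
    tokens = text.split()
    if not tokens:
        raise ValueError("text sample must contain at least one token")
    fragments = [tokens[i:i + fragment_size]
                 for i in range(0, len(tokens), fragment_size)]
    return [fragment_features(f) for f in fragments]


def set_distance(a, b):
    """Assumed point-set distance: for each point, the Euclidean distance to
    its nearest point in the other set, averaged, then symmetrised with max."""
    def directed(p, q):
        return sum(min(math.dist(x, y) for y in q) for x in p) / len(p)
    return max(directed(a, b), directed(b, a))


def predict_author(anonymous_text, training_samples, k=1):
    """Predict the author by majority vote over the k nearest training samples.

    training_samples is a list of (author, text) pairs.
    """
    query = to_point_set(anonymous_text)
    scored = sorted(
        (set_distance(query, to_point_set(text)), author)
        for author, text in training_samples
    )
    votes = Counter(author for _, author in scored[:k])
    return votes.most_common(1)[0][0]
```

For example, calling `predict_author(sample, [("author_1", text_1), ("author_2", text_2)])` with hypothetical texts returns the label of the training sample whose point set lies closest to the anonymous sample. Representing a text as a set of points, one per fragment, means a single sample contributes several observations of an author's style, which is one plausible reason such a representation could help when few training samples per author are available.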