UrduAI: writeprints for Urdu authorship identification

Sarwar, Raheem ORCID: https://orcid.org/0000-0002-0640-807X and Hassan, Saeed-Ul (2022) UrduAI: writeprints for Urdu authorship identification. ACM Transactions on Asian and Low-Resource Language Information Processing, 21 (2). 34. ISSN 2375-4699

Accepted Version
Available under License In Copyright.

Official URL: http://dx.doi.org/10.1145/3476467

Abstract

The authorship identification task aims to identify the original author of an anonymous text sample from a set of candidate authors. It has several application domains, such as digital text forensics and information retrieval, and these domains are not limited to a specific language. However, most authorship identification studies focus on English, and limited attention has been paid to Urdu. Existing Urdu authorship identification solutions lose accuracy as the number of training samples per candidate author decreases and as the number of candidate authors increases. Consequently, these solutions are inapplicable to real-world cases. Moreover, due to the unavailability of reliable POS taggers or sentence segmenters, all existing authorship identification studies on Urdu text are limited to word n-gram features. To overcome these limitations, we formulate a stylometric feature space that is not limited to word n-gram features. Based on this feature space, we use an authorship identification solution that transforms each text sample into a point set, retrieves candidate text samples, and relies on a nearest neighbors classifier to predict the original author of the anonymous text sample. To evaluate our solution, we create a significantly larger corpus than those used in existing studies and conduct several experiments, which show that our solution overcomes the limitations of existing studies and achieves an accuracy of 94.03%, which is higher than that of all previous authorship identification works.
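
The pipeline outlined in the abstract (text sample to point set, candidate retrieval, nearest neighbors prediction) can be illustrated with a minimal Python sketch. The abstract does not specify the stylometric features, the distance between point sets, or how candidates are retrieved, so everything here is an assumption made for illustration: the three toy features, the fixed-size token windows, the average nearest-point distance, the exhaustive ranking that stands in for retrieval, and all function names are hypothetical and are not the authors' method.

```python
import math
from collections import Counter


def features(tokens):
    """Map one token window to a toy stylometric vector (assumed features)."""
    n = len(tokens)
    avg_len = sum(len(t) for t in tokens) / n              # average word length
    type_token_ratio = len(set(tokens)) / n                # vocabulary richness
    short_rate = sum(1 for t in tokens if len(t) <= 2) / n  # share of very short words
    return (avg_len, type_token_ratio, short_rate)


def to_point_set(text, window=5):
    """Turn a text sample into a point set: one feature vector per token window."""
    tokens = text.split()  # whitespace tokenisation, no POS tagger or segmenter needed
    if not tokens:
        raise ValueError("text sample must contain at least one token")
    return [features(tokens[i:i + window]) for i in range(0, len(tokens), window)]


def set_distance(a, b):
    """Average distance from each point in a to its nearest point in b (assumed measure)."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)


def predict_author(anonymous_text, labelled_samples, k=1):
    """Rank all labelled samples by distance to the query and vote among the k nearest."""
    query = to_point_set(anonymous_text)
    ranked = sorted(
        labelled_samples,
        key=lambda sample: set_distance(query, to_point_set(sample[1])),
    )
    votes = Counter(author for author, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy English placeholders stand in for Urdu text samples.
samples = [
    ("A", "extraordinary considerations necessitate comprehensive documentation everywhere"),
    ("B", "it is so we go to do it as we be"),
]
guess = predict_author("magnificent institutions demonstrate considerable responsibilities", samples)
```

Because the sketch relies only on whitespace tokenisation, it does not need the POS taggers or sentence segmenters that the abstract describes as unavailable for Urdu. A realistic system would use a far richer feature space and an index to retrieve candidates rather than ranking every sample.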

Item Type:	Article (Article)
Peer-reviewed:	Yes
Date Deposited:	15 Dec 2023 12:50
Publisher:	Association for Computing Machinery (ACM)
Additional Information:	© ACM 2021. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Asian and Low-Resource Language Information Processing, http://dx.doi.org/10.1145/3476467.
Divisions:	Faculties > Business and Law
URI:	https://mmu-uat.leaf.cosector.com/id/eprint/633539
DOI:	https://doi.org/10.1145/3476467
ISSN:	2375-4699
e-ISSN:	2375-4702
