Crossing linguistic barriers: authorship attribution in Sinhala texts

Sarwar, Raheem ORCID: https://orcid.org/0000-0002-0640-807X, Perera, Maneesha ORCID: https://orcid.org/0009-0000-0684-726X, Teh, Pin Shen ORCID: https://orcid.org/0000-0002-0607-2617, Nawaz, Raheel ORCID: https://orcid.org/0000-0001-9588-0052 and Hassan, Muhammad Umair ORCID: https://orcid.org/0000-0001-7607-5154 (2024) Crossing linguistic barriers: authorship attribution in Sinhala texts. ACM Transactions on Asian and Low-Resource Language Information Processing, 23 (5). pp. 1-14. ISSN 2375-4699

Accepted Version
Available under License In Copyright.

Official URL: http://dx.doi.org/10.1145/3655620

Abstract

Authorship attribution involves determining the original author of an anonymous text from a pool of potential authors. The author attribution task has applications in several domains, such as plagiarism detection, digital text forensics, and information retrieval. While these applications extend beyond any single language, existing research has predominantly centered on English, posing challenges for application in languages such as Sinhala due to linguistic disparities and a lack of language processing tools. We present the first comprehensive study on cross-topic authorship attribution for Sinhala texts and propose a solution that can effectively perform the authorship attribution task even if the topics within the test and training samples differ. Our solution consists of three main parts: (i) extraction of topic-independent stylometric features, (ii) generation of a small candidate author set with the help of similarity search, and (iii) identification of the true author. Several experimental studies were carried out to demonstrate that the proposed solution can effectively handle real-world scenarios involving a large number of candidate authors and a limited number of text samples for each candidate author.
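The three-part pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature set (punctuation rates and a word-length histogram), the cosine similarity measure, the similarity-weighted vote, and the names style_vector, cosine, and attribute are all assumptions chosen for illustration. The abstract does not specify which features, index, or classifier the paper uses.

```python
import math
from collections import Counter

# Assumed, illustrative choices; the paper's actual feature set is not given here.
PUNCT = ".,;:!?"
BUCKETS = [(1, 3), (4, 6), (7, 9), (10, None)]  # word-length bins, in code points


def style_vector(text):
    """Part (i): map a text to topic-independent style features.

    Note: len() counts Unicode code points, so Sinhala vowel signs count
    separately; a real system would likely segment text more carefully.
    """
    words = text.split()
    n_words = max(len(words), 1)
    marks = Counter(ch for ch in text if ch in PUNCT)
    punct_rates = [marks[p] / n_words for p in PUNCT]
    lengths = [len(w.strip(PUNCT)) for w in words]
    length_shares = [
        sum(1 for n in lengths if n >= lo and (hi is None or n <= hi)) / n_words
        for lo, hi in BUCKETS
    ]
    return punct_rates + length_shares


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def attribute(query_text, samples, k=3):
    """Parts (ii) and (iii): shortlist candidates, then pick one author.

    samples is a list of (author, text) pairs with known authorship.
    Returns (predicted_author, candidate_set).
    """
    query = style_vector(query_text)
    # Part (ii): brute-force similarity search; keep the k most similar samples.
    scored = sorted(
        ((cosine(query, style_vector(text)), author) for author, text in samples),
        reverse=True,
    )
    top = scored[:k]
    candidates = {author for _, author in top}
    # Part (iii): similarity-weighted vote among the shortlisted samples.
    votes = Counter()
    for score, author in top:
        votes[author] += score
    return votes.most_common(1)[0][0], candidates
```

Calling attribute(anonymous_text, samples, k=3) returns the predicted author together with the shortlisted candidate set. The brute-force scan over all samples is only workable for small collections; with a large number of candidate authors, the similarity search in part (ii) would normally be backed by an index.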

Item Type:	Article
Peer-reviewed:	Yes
Date Deposited:	17 May 2024 11:27
Publisher:	Association for Computing Machinery (ACM)
Additional Information:	© Authors 2024. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ACM Transactions on Asian and Low-Resource Language Information Processing, http://dx.doi.org/10.1145/3655620.
Divisions:	Faculties > Business and Law
URI:	https://mmu-uat.leaf.cosector.com/id/eprint/634675
DOI:	https://doi.org/10.1145/3655620
ISSN:	2375-4699
e-ISSN:	2375-4702
