Sarwar, Raheem ORCID: https://orcid.org/0000-0002-0640-807X, Perera, Maneesha ORCID: https://orcid.org/0009-0000-0684-726X, Teh, Pin Shen ORCID: https://orcid.org/0000-0002-0607-2617, Nawaz, Raheel ORCID: https://orcid.org/0000-0001-9588-0052 and Hassan, Muhammad Umair ORCID: https://orcid.org/0000-0001-7607-5154 (2024) Crossing linguistic barriers: authorship attribution in Sinhala texts. ACM Transactions on Asian and Low-Resource Language Information Processing, 23 (5). pp. 1-14. ISSN 2375-4699
Accepted Version
Available under License In Copyright.
Abstract
Authorship attribution involves determining the original author of an anonymous text from a pool of potential authors. The author attribution task has applications in several domains, such as plagiarism detection, digital text forensics, and information retrieval. While these applications extend beyond any single language, existing research has predominantly centered on English, posing challenges for application in languages such as Sinhala due to linguistic disparities and a lack of language processing tools. We present the first comprehensive study on cross-topic authorship attribution for Sinhala texts and propose a solution that can effectively perform the authorship attribution task even if the topics within the test and training samples differ. Our solution consists of three main parts: (i) extraction of topic-independent stylometric features, (ii) generation of a small candidate author set with the help of similarity search, and (iii) identification of the true author. Several experimental studies were carried out to demonstrate that the proposed solution can effectively handle real-world scenarios involving a large number of candidate authors and a limited number of text samples for each candidate author.
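The three-part pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the feature set (punctuation rates and a word-length histogram), the cosine similarity measure, the similarity-sum decision rule, and all function names are assumptions chosen only to show how the stages fit together.

```python
import math
from collections import Counter, defaultdict

# Assumed feature alphabet; the paper's actual stylometric features are not given here.
PUNCT = ".,;:!?-\"'()"


def features(text):
    """Stage (i): hypothetical topic-independent style vector.

    Punctuation rates per character, followed by a word-length histogram
    (lengths 1..10, with longer words capped at 10). Lengths are counted
    in Unicode code points, a simplification for Sinhala script.
    """
    words = text.split()
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    counts = Counter(ch for ch in text if ch in PUNCT)
    punct_rates = [counts[p] / n_chars for p in PUNCT]
    lengths = Counter(min(len(w.strip(PUNCT)), 10) for w in words)
    length_hist = [lengths[i] / n_words for i in range(1, 11)]
    return punct_rates + length_hist


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def candidate_authors(query_vec, indexed, k=3):
    """Stage (ii): keep the k training samples most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), author) for author, vec in indexed),
        reverse=True,
    )
    return scored[:k]


def attribute(text, training, k=3):
    """Stage (iii): choose the candidate author with the largest summed similarity."""
    indexed = [(author, features(sample)) for author, sample in training]
    totals = defaultdict(float)
    for score, author in candidate_authors(features(text), indexed, k):
        totals[author] += score
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Toy English samples stand in for Sinhala texts.
    training = [
        ("A", "The rain, the wind, the sea; all cold, all grey."),
        ("B", "Extraordinary!!! Unbelievable performances everywhere!!!"),
    ]
    print(attribute("The rain, the wind, the sea; all cold, all grey.", training, k=1))
```

The linear scan in `candidate_authors` is adequate for a toy example; with a large pool of candidate authors, an indexed similarity search would take its place in stage (ii).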