A Heuristic Based Pre-processing Methodology for Short Text Similarity Measures in Microblogs

Alnajran, Noufa, Crockett, Keeley ORCID: https://orcid.org/0000-0003-1941-6201, McLean, David and Latham, Annabel ORCID: https://orcid.org/0000-0002-8410-7950 (2018) A Heuristic Based Pre-processing Methodology for Short Text Similarity Measures in Microblogs. In: IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 23 June 2018 - 30 June 2018, Exeter, UK.

Accepted Version

Official URL: https://ieeexplore.ieee.org/document/8623003

Abstract

Short text similarity measures have many applications in online social networks (OSNs), where they are increasingly integrated into machine learning algorithms. However, data quality is a major challenge in most OSNs, particularly Twitter. The sparse, ambiguous, informal, and unstructured nature of the medium makes it difficult to capture the underlying semantics of the text. Text pre-processing is therefore a crucial phase in similarity identification applications such as clustering and classification, because selecting appropriate data processing methods helps increase the correlations achieved by the similarity measure. This research proposes a novel heuristic-driven pre-processing methodology for enhancing the performance of similarity measures in the context of Twitter tweets. The components of the proposed pre-processing methodology are discussed and evaluated on an annotated dataset that was published as part of a SemEval-2014 shared task. An experimental analysis was conducted using the cosine angle as the similarity measure to assess the effect of our method against a baseline (C-Method). Experimental results indicate that our approach outperforms the baseline in terms of correlations and error rates.
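
As a rough illustration of the kind of pipeline the abstract describes, the sketch below cleans two tweets and compares them with cosine similarity over bag-of-words counts. The cleaning rules in `preprocess` (lowercasing, dropping URLs and user mentions, keeping hashtag words) and both function names are assumptions made for this example; they are not the heuristics proposed in the paper, nor the C-Method baseline.

```python
import math
import re
from collections import Counter


def preprocess(tweet: str) -> list[str]:
    """Illustrative tweet cleanup (assumed rules, not the paper's method)."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop user mentions
    text = text.replace("#", " ")              # keep hashtag words, drop the symbol
    return re.findall(r"[a-z0-9']+", text)     # split into simple word tokens


def cosine_similarity(tokens_a: list[str], tokens_b: list[str]) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    t1 = "Loving the new phone! #tech https://t.co/abc"
    t2 = "@sam the new phone is great #Tech"
    print(cosine_similarity(preprocess(t1), preprocess(t2)))
```

Without the cleanup step, tokens such as the URL or the `@sam` mention would add dimensions that the two tweets can never share, which is one simple way pre-processing choices can shift a similarity score.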

Item Type:	Conference or Workshop Item (Paper)
Peer-reviewed:	Yes
Date Deposited:	07 Nov 2018 12:57
Publisher:	IEEE
Divisions:	Faculties > Science and Engineering Research Centres > Centre for Advanced Computational Science
URI:	https://mmu-uat.leaf.cosector.com/id/eprint/621804
DOI:	https://doi.org/10.1109/HPCC/SmartCity/DSS.2018.00265
