Regional Information Center for Science and Technology

DocumentCode :

2030479

Title :

Lexical normalisation of Twitter Data

Author :

Ahmed, Bilal

Author_Institution :

Dept. of Comput. & Inf. Syst., Univ. of Melbourne, Melbourne, VIC, Australia

fYear :

2015

fDate :

28-30 July 2015

Firstpage :

326

Lastpage :

328

Abstract :

Twitter, with over 500 million users globally, generates over 100,000 tweets per minute. The 140-character limit per tweet has, perhaps unintentionally, encouraged users to adopt shorthand notations and to strip spellings down to their bare minimum “syllables” or elisions, e.g. “srsly”. The analysis of Twitter messages, which typically contain misspellings, elisions, and grammatical errors, poses a challenge to established Natural Language Processing (NLP) tools, which are generally designed on the assumption that the data conforms to the basic grammatical structure commonly used in the English language. To make sense of Twitter messages, it is necessary first to transform them into a canonical form consistent with the dictionary or grammar. This process, performed at the level of individual tokens (“words”), is called lexical normalisation. This paper investigates various techniques for lexical normalisation of Twitter data and presents the findings obtained when these techniques are applied to raw data from Twitter.
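As an illustration of token-level lexical normalisation, the following is a minimal sketch that maps a token to its nearest dictionary word by Levenshtein distance, one of the techniques listed in the keywords. It is not the paper's implementation: the function names, the toy vocabulary, and the `max_distance` threshold are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalise_token(token, vocabulary, max_distance=4):
    """Return the closest vocabulary word, or the lower-cased token if none is close.

    `vocabulary` and `max_distance` are hypothetical parameters for this
    sketch; a real system would use a full dictionary and a tuned threshold.
    """
    token = token.lower()
    if not vocabulary or token in vocabulary:
        return token
    best = min(vocabulary, key=lambda word: levenshtein(token, word))
    return best if levenshtein(token, best) <= max_distance else token


# Toy vocabulary, assumed purely for illustration.
VOCABULARY = ["seriously", "tomorrow", "please", "people"]
normalised = normalise_token("srsly", VOCABULARY)
```

With this toy vocabulary, “srsly” is four insertions away from “seriously” and further from every other entry, so it is mapped to “seriously”. Edit distance alone is a crude signal, which is presumably why the keywords also list phonetic matching and n-gram approaches.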

Keywords :

grammars; natural language processing; social networking (online); text analysis; English language; NLP; Twitter data; Twitter message analysis; grammatical structure; lexical normalisation; natural language processing; Approximation algorithms; Arrays; Context; Dictionaries; Pattern matching; Twitter; Vocabulary; Levenshtein distance; Lexical Normalisation; N-Gram; Peter Norvig's Algorithm; Phonetic Matching; Refined Soundex; Twitter Data;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Science and Information Conference (SAI), 2015

Conference_Location :

London

Type :

conf

DOI :

10.1109/SAI.2015.7237164

Filename :

7237164

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2030479