N-grams Based Features for Indonesian Tweets Classification Problems

Publication Name : 2017 INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING AND INFORMATICS (ICELTICS)

DOI :

Date : 2017


Twitter is one of the popular microblogging services that allow users to write short messages of up to 140 characters. The number of active Twitter users in Indonesia reached 29.4 million in 2017, and these users have created an enormous number of tweets, a potential data source for supervised learning. In this work, six different sets of word n-gram dictionaries were generated and used as references for creating numerical features of the tweets. We classified the tweets using k-Nearest Neighbors (k-NN) and the Naive Bayes Classifier and compared their performance using the F-measure. We also observed the classification time of each algorithm. The results show that the k-NN algorithm performed better than the Naive Bayes Classifier, reaching an F-measure of 81.2% with k=7. However, in terms of classification time, the Naive Bayes Classifier was faster than k-NN for all values of k.
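To make the pipeline in the abstract concrete, the following is a minimal sketch of dictionary-based n-gram features fed to a k-NN classifier and scored with the F-measure. It is an illustration under stated assumptions, not the authors' implementation: the tweets, the `pos`/`neg` labels, the use of raw n-gram counts as feature values, the Euclidean distance, and all function names are hypothetical, and the Naive Bayes Classifier is omitted for brevity.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a text as a list of tuples."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_dictionary(texts, n):
    """Collect every distinct n-gram in the training texts, sorted for a stable order."""
    return sorted({gram for text in texts for gram in word_ngrams(text, n)})


def to_features(text, dictionary, n):
    """Map a text to a count vector with one entry per dictionary n-gram."""
    counts = Counter(word_ngrams(text, n))
    return [counts[gram] for gram in dictionary]


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training vectors nearest to x (Euclidean distance)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def f_measure(y_true, y_pred, positive):
    """Harmonic mean of precision and recall for the given positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data; the paper's tweets and class labels are not given here.
train_texts = [
    "saya suka film ini",
    "film ini bagus sekali",
    "saya benci macet",
    "macet parah sekali hari ini",
]
train_labels = ["pos", "pos", "neg", "neg"]

N = 1  # unigram dictionary; other values of N give bigram, trigram, ... dictionaries
K = 3  # number of neighbours (hypothetical choice for this toy set)

dictionary = build_dictionary(train_texts, N)
train_X = [to_features(text, dictionary, N) for text in train_texts]
prediction = knn_predict(train_X, train_labels, to_features("film ini bagus", dictionary, N), K)
print(prediction)
```

Changing `N` swaps in a different n-gram dictionary, which is one plausible reading of how several dictionaries could be compared; how the six dictionaries in the paper were actually constructed is not stated in the abstract.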

Type
Book in series
ISSN
2155-6822
EISSN
Page
307 - 310