
Abstract: In the past, only multinational companies (MNCs) generated and exploited data at scale. Today, every individual creates and consumes large volumes of data. For example, on a popular website such as Amazon, large numbers of people buy products and readily post reviews of them. Data is generated in exactly this way. Google produces more than 20 PB of data, while Facebook generates more than 5 PB of messages, and so on. Analyzing such a huge amount of data is troublesome for humans. Sentiment analysis addresses this challenge. In this work, we develop a new way of performing sentiment analysis at the document level using SVM and HARN’s algorithm. In our experiments, it proved to be an effective way to analyze customers’ opinions of a product at the document level.

Keywords: SVM, TextBlob, NV-Dictionary

Introduction

Sentiment analysis, also known as opinion mining, is the process of computationally analyzing the reviews or comments written by customers and determining the sentiment, opinion or attitude expressed in the text.

We have moved from the olden days to golden days, experiencing technical advancements day by day. In the olden days, humans themselves acted as computers, and each individual had to work hard in their field because technology was poor. Today, machines have replaced humans in many tasks and made human work easier. This explosive growth in technology has led people to invent many new things and has brought progress to society.

Owing to these advancements, many products have been released to make life more comfortable in fields such as the Internet of Things (IoT). With highly automated systems, people can now control a number of appliances in the home.
With this technology, users can study online, buy and sell products, make voice and video calls with friends or relatives, take or conduct interviews, and write or conduct exams online. To save time and energy, people depend on the Internet to make their work easier. To provide all these facilities on the Internet, many companies and organizations create their own websites and compete with one another. There are websites such as HackerRank, CodeChef and TechGig for coding practice, Amazon and Flipkart for online shopping, and many others that provide video lectures or games.

Nowadays, many products are marketed entirely through e-commerce sites, and customers can buy or sell almost any item online. Users’ valuable suggestions are collected on these websites in the form of reviews. If the number of reviews is small, it is easy to read them all and reach a quick conclusion. But as the number of users and websites grows day by day, the number of reviews becomes so large that reaching a quick conclusion about a product is difficult. Likewise, if a government wants to conduct a survey on some piece of work, the large number of responses makes the task troublesome.

Sentiment analysis came into existence to overcome these difficulties with large datasets. As the name suggests, it analyzes the sentiment hidden in a piece of text and arrives at a single opinion about it, which may be positive, negative or neutral. Sentiment analysis helps greatly in identifying which reviews are positive, negative or neutral, and product developers can use it to identify the faults in their products easily.
The evolution of sentiment analysis is shown in Fig. 1.

Fig. 1. Sentiment Analysis

In our research, we observed that sentiment analysis has many limitations. To overcome one of them, we implemented a new method within the dictionary-based approach.

After several stages of work, we adopted SVM from the machine-learning approach, since machines work faster than humans. Machine-learning techniques now play a major role, especially in sentiment analysis. SVM is one of the best supervised classification approaches, and it becomes more accurate as it is trained on larger datasets. SVM requires two datasets, a training dataset and a test dataset, and both play a major part in classification. The training set is used to train the system to detect the opinion in reviews accurately, since the system initially has no knowledge with which to predict whether a text is positive or negative. After training, the system has acquired knowledge that helps it predict future results. The test set is used to test the system, and from the output of testing we calculate the accuracy of our algorithm. Machine learning offers many approaches to classifying data, such as Naive Bayes, Maximum Entropy (ME) and Support Vector Machines (SVM), all of which have produced satisfactory results in text classification.

In this context, we combine the machine-learning approach and the lexicon-based approach into a new algorithm that increases recall even when the training set is small. The algorithm also increases the accuracy of finding polarity at the document level.

Related Work

A considerable amount of research is under way in the field of sentiment analysis (also known as opinion mining). In our research, we encountered many methods for sentiment analysis, but we found limitations in those methods.
In this paper, we want to maximize recall even when the training set is small.

There are many algorithms in the machine-learning and lexicon-based approaches, but we chose SVM and HARN’s algorithm [2]. In [1], the SVM algorithm is used to obtain polarity at the aspect level. In [2], the authors proposed a new algorithm, called HARN’s algorithm, to find polarity at the sentence level. In [3], the authors proposed machine-learning approaches for sentiment classification and showed that these techniques can yield good results compared with other techniques.

Wang et al. [4] proposed supervised learning methods, which are very popular and have proved effective in sentiment classification. However, supervised methods are difficult to work with because they are expensive and time-consuming, and researchers are still looking for better techniques.

Opinion mining aims to identify polarity as positive or negative. Saleh et al. [5] extracted polarity using Support Vector Machines (SVM) with various datasets and weighting schemes.

Challenges in Sentiment Analysis

Sarcastic Sentences

Detecting sarcasm in a piece of text and finding its correct meaning is a challenging task for anyone. Sarcastic sentences can be difficult to recognize, and they lead to a wrong orientation and a deceptive overall result in opinion mining. Such sentences appear to praise someone but are actually taunting or cutting.

Poor Spellings

It is very troublesome to find the meaning of text that contains abbreviations, misspellings, poor punctuation, poor grammar and so on. The tool used for POS tagging assigns the noun category to any meaningless word it encounters during execution.

Polarity for Message Language

Identifying the parts of speech of the keywords in a sentence written in message language (also known as shortcut language) is a troublesome task.
Without parts-of-speech (POS) tagging, it is impossible to obtain the desired result.

Improving the Precision of the Algorithm

The result depends entirely on the accuracy of the algorithm. Improving precision is therefore necessary, since a more accurate algorithm reduces the human effort required.

The Proposed Method

There are only a few methods for finding the sentiment of a sentence using dictionaries. Our method combines the machine-learning approach with the dictionary-based variant of the lexicon-based approach to obtain polarity at the document level.

The machine-learning approach needs a large training set to produce accurate results; if the training set is too small, the results are not accurate. To overcome this problem, we combine the lexicon-based approach with the SVM algorithm to maximize recall for a small training set.

We divide the dataset into two parts, one for training and one for testing. In our setting, the training set is too small for the SVM model to perform well on its own. After training the SVM model, we use it to predict polarity. We then apply HARN’s algorithm to the same reviews and compare the results obtained from SVM and from HARN’s algorithm. The results show that the combination is more accurate than either SVM or HARN’s algorithm used independently, even though the training set is small.

Datasets

A dataset is a collection of reviews or comments provided by users. There are two types of dataset: the training set and the testing set. The training set contains reviews collected from e-commerce sites, social websites and so on, and is used to train the algorithm, after which the algorithm can predict results for new reviews. The testing set also contains reviews and is used to test how effectively the algorithm works.
For this task, we use reviews from five Amazon domains (Baby, Garden, Health, Music and Video) as the training set, and a set of reviews whose polarities are known as the testing set.

Fig. 2. Splitting of Datasets

Proposed Algorithm

1. Take the document as input.
2. Split the document into sentences.
3. Apply SVM to the dataset and store the result.
4. Perform POS tagging on each sentence in the document.
5. Trim each sentence by removing stop words.
6. Extract features such as adverbs, verbs and adjectives.
7. Get the polarity of each feature from the NV-Dictionary and SentiWordNet.
8. Calculate the result and store it.
9. Draw a bar graph to represent the result at the document level.

Support Vector Machine

A Support Vector Machine (SVM) is a discriminative classifier defined by a separating hyperplane. Support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. They avoid major difficulties of using linear functions in a high-dimensional feature space, and the optimization problem is transformed into a dual convex quadratic program. SVM is used for classification problems with k classes, where k >= 2. Each review consists of a set of features and is labeled as positive, negative or neutral. A notable property of SVM is that it works efficiently with large amounts of data for text classification, because it handles a large number of features.
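
For reference, the dual convex quadratic program mentioned in this section can be written out explicitly for the two-class case. The following is the standard soft-margin formulation from the general SVM literature, not necessarily the exact variant used in this work. The symbols are notation introduced here for illustration only: x_n is the feature vector of the nth training review, y_n in {-1, +1} is its label, N is the number of training reviews, w and b define the hyperplane, C is the regularization constant, xi_n are slack variables and alpha_n are the dual variables.

```latex
% Primal problem (soft margin), notation assumed for illustration
\[
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\lVert w\rVert^{2} + C\sum_{n=1}^{N}\xi_n \\
\text{s.t.}\quad & y_n\,(w^{\top}x_n + b) \ge 1 - \xi_n,\qquad \xi_n \ge 0,\qquad n = 1,\dots,N
\end{aligned}
\]

% Corresponding dual problem, a convex quadratic program in \alpha
\[
\begin{aligned}
\max_{\alpha}\quad & \sum_{n=1}^{N}\alpha_n
  - \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m\,y_n y_m\,x_n^{\top}x_m \\
\text{s.t.}\quad & 0 \le \alpha_n \le C,\qquad \sum_{n=1}^{N}\alpha_n y_n = 0
\end{aligned}
\]
```

In the dual, the training data appear only through the inner products x_n^T x_m, which is why the inner product can be replaced by a kernel function without changing the form of the problem.
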
SVM performs linear classification efficiently, and it can also perform non-linear classification using the kernel trick, which implicitly maps the inputs into a high-dimensional feature space.

Each review is represented as

Vk (Fi, Oij, Pij, Nij)

where Vk is the kth review, Fi is the ith feature in review k, Oij is the jth opinionated word for the ith feature, Pij is the positive score of the jth opinion word of the ith feature, and Nij is the negative score of the jth opinion word of the ith feature.

HARN’s Algorithm

In HARN’s algorithm, sentences are given as input and passed to TextBlob, a tool that performs parts-of-speech (POS) tagging on every word of each sentence. After POS tagging, we remove stop words such as is, of, on and the. We then extract the main parts of speech: verbs, adverbs and adjectives. We choose these parts of speech because they carry the actual meaning of the sentence. We look up the extracted words in the NV-Dictionary, which was designed for our algorithm. Words found in the NV-Dictionary are passed directly to the polarity calculation. For words that are not found, the polarity is taken from SentiWordNet and appended to the NV-Dictionary, and then the polarity of the sentence is calculated.

Extraction of Polarity of a Particular Product

To determine the user’s overall opinion of a particular product, we first consider all the available reviews.
Let us assume that a particular review has ‘z’ features. Expressions (1) and (2) below describe a particular review, whereas (3) and (4) give the positive score and the negative score of the particular product from the SVM model.

The final Positive score for that particular review: (1)

The final Negative score for a particular review: (2)

The final Positive score for that particular product: (3)

The final Negative score for the particular product: (4)

TextBlob

TextBlob is a Python library for processing textual data. It provides a simple programming interface to common natural language processing (NLP) tasks, and its built-in features make it convenient for programmers. Its features include noun phrase extraction, parts-of-speech tagging, sentiment analysis, classification (Naive Bayes, Decision Tree), language translation and detection, tokenization (splitting text into words and sentences), parsing, n-grams, word inflection (pluralization and singularization) and spelling correction. These features become available as soon as the library is imported into the workspace.

NV-Dictionary

The Noun-Verb (NV) Dictionary is a tool that provides the polarities of extracted features. The noun serves as the dictionary name, the remaining parts of speech act as keywords, and their polarities act as values. We implemented the NV-Dictionary to obtain the desired results when the same keyword occurs with different meanings in different sentences, distinguished by the unique noun in each sentence.

Extraction of the Whole Polarity

To find the polarity of the whole product, we first extract the polarity from SVM and compare it with the result of HARN’s algorithm.
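
As a rough illustration of the NV-Dictionary lookup, the Python sketch below stores polarities in a nested mapping from noun to keyword to polarity. When a keyword is missing, it falls back to a general lexicon and caches the value, in the way HARN’s algorithm appends SentiWordNet polarities to the NV-Dictionary. The names nv_dictionary, FALLBACK_LEXICON and lookup_polarity, the example words and all scores are hypothetical. The small in-memory lexicon merely stands in for SentiWordNet, and returning 0.0 for a word found in neither source is an assumption.

```python
# Hypothetical stand-in for SentiWordNet; the scores are illustrative only.
FALLBACK_LEXICON = {"good": 0.75, "bad": -0.625, "quickly": 0.25}

# NV-Dictionary: noun -> {keyword -> polarity}. Entries are illustrative.
nv_dictionary = {
    "battery": {"long": 0.5},   # "long" is favorable for a battery
    "wait": {"long": -0.5},     # "long" is unfavorable for a wait
}


def lookup_polarity(noun, keyword):
    """Return the polarity of `keyword` in the context of `noun`.

    A miss falls back to the general lexicon (0.0 if the word is unknown
    there too, which is an assumption) and the value is cached in the
    NV-Dictionary so that later lookups find it directly.
    """
    entries = nv_dictionary.setdefault(noun, {})
    if keyword not in entries:
        entries[keyword] = FALLBACK_LEXICON.get(keyword, 0.0)
    return entries[keyword]


print(lookup_polarity("battery", "long"))  # stored under the noun "battery"
print(lookup_polarity("wait", "long"))     # same keyword, different noun
print(lookup_polarity("battery", "good"))  # miss: taken from fallback, then cached
```

The keyword long resolves to a different polarity under battery than under wait, which is the situation the NV-Dictionary is meant to handle.
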
In HARN’s algorithm, we identify every feature in the review and find its polarity from the NV-Dictionary and SentiWordNet. After extracting all polarities at the aspect level, we take the product of the polarities of the features to obtain the polarity at the sentence level and compare it with the TextBlob polarity. We then conclude whether the sentence is positive, negative or neutral. After finding the polarities of all reviews in the document, we count the positive, negative and neutral reviews and predict the polarity at the document level.

Results

Numerous algorithms have been used for sentiment analysis. In our work, we initially used HARN’s algorithm and obtained 70-75% accuracy. We then went a step further and used SVM from machine learning. In SVM, if the recall value exceeds 90%, the precision value is also large, which is considered the best fit. With small datasets, SVM does not work accurately. To overcome this problem, we combined HARN’s algorithm with SVM and obtained 80-85% accuracy even with small datasets. Table 2 shows the performance of the individual algorithms.

In this way, we improve the performance of SVM and HARN’s algorithm by combining the two. The combination also automates HARN’s algorithm, which was not automated previously. HARN’s algorithm was designed for sentence-level polarity, but here we extend it to the document level; the results are shown in Fig. 3.

Fig. 3. Polarity for each Domain

Evaluation

We evaluate the performance of our algorithm using precision, recall and accuracy, computed from the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) counts shown in Table 1.

Table 1. Contingency Matrix

                          Correct Labels
Predicted Labels    Positive               Negative
Positive            TP (True Positive)     FP (False Positive)
Negative            FN (False Negative)    TN (True Negative)

\[ \text{Precision} = \frac{TP}{TP + FP} \qquad (5) \]

\[ \text{Recall} = \frac{TP}{TP + FN} \qquad (6) \]

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (7) \]

Using formulas (5), (6) and (7), we calculate the performance of the three algorithms, shown in Table 2.

Table 2. Comparison of Three Algorithms

Algorithm     Precision   Recall    Accuracy
HARN          88.73%      56.88%    55.4%
SVM           50.00%      33.33%    4.33%
SVM + HARN    91.56%      99.77%    91.7%

Conclusion and Future Work

HARN’s algorithm is not automated by nature, but we have automated it. Because SVM was trained with only a small amount of data, it produced a low recall value, which in turn led to low precision and, consequently, lower accuracy. Used independently, neither HARN’s algorithm nor SVM achieved the expected accuracy. To overcome this, we combined HARN’s algorithm with SVM and obtained higher precision and, in turn, higher accuracy, even with small datasets.

We provide only the sentiment of a sentence and do not identify the domain to which it belongs, so identifying the domain remains a challenging task. Another important task in sentiment analysis is to correct misspelled words and assign polarity to them.
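
The precision, recall and accuracy measures used in the Evaluation section can be computed directly from the four counts of the contingency matrix. The following is a minimal Python sketch of the standard definitions. The function name evaluate and the example counts are hypothetical and are not taken from our experiments, and returning 0.0 when a denominator is zero is an assumption made to keep the sketch safe to run.

```python
def evaluate(tp, tn, fp, fn):
    """Return (precision, recall, accuracy) from confusion-matrix counts.

    A ratio whose denominator is zero is reported as 0.0 (an assumption).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Hypothetical counts, for illustration only.
precision, recall, accuracy = evaluate(tp=40, tn=30, fp=10, fn=20)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Keeping the three measures in one function makes it easy to apply the same counts-based evaluation to each algorithm being compared.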