Fundamental Tasks of NLP:
Sentiment Analysis:
Sentiment analysis is a natural language processing (NLP) technique used to determine the sentiment expressed in a piece of text. Its goal is to automatically extract and quantify subjective information, such as opinions, attitudes, emotions and feelings, from text data.
Let's say we have a dataset containing customer reviews of a product, and we want to analyze the sentiment expressed in each review. The sentiment could be positive, negative or neutral.
For example, given the review "I absolutely love this product! It's amazing!", the sentiment analysis model might classify it as positive.
Similarly, for the review "This product is terrible. I would not recommend it to anyone.", the model might classify it as negative.
And for the review "The product arrived on time, but it was not what I expected.", the model might classify it as neutral.
Steps of Sentiment Analysis:
Here's how sentiment analysis typically works:
Text Input: The input to sentiment analysis is a piece of text which could be a sentence, paragraph, document or even a social media post.
Preprocessing: The text is preprocessed to remove any noise or irrelevant information, such as punctuation, special characters and stopwords (common words like "and", "the", "is" that do not carry much meaning).
Feature Extraction: Next, features are extracted from the preprocessed text. Common techniques include bag-of-words, TF-IDF (Term Frequency-Inverse Document Frequency), word embeddings (e.g., Word2Vec, GloVe) or contextual embeddings (e.g., BERT, GPT).
Sentiment Classification: Once the features are extracted, a machine learning model or a pre-trained deep learning model is used to classify the sentiment of the text into predefined categories, such as positive, negative or neutral. Some models provide more granular output, such as sentiment scores ranging from strongly negative to strongly positive.
Evaluation and Optimization: The performance of the sentiment analysis model is evaluated using metrics such as accuracy, precision, recall and F1-score for classification, or Mean Squared Error (MSE) when the model predicts a continuous sentiment score. The model may then be improved with techniques such as hyperparameter tuning, with cross-validation used to estimate how well each setting generalizes.
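The preprocessing and feature-extraction steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the stopword list is a tiny hypothetical one, the helper names `preprocess` and `tf_idf` are made up for this example, and the TF-IDF weighting uses the basic `tf * log(N / df)` form (libraries typically apply smoothed variants).

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real pipelines use a fuller one).
# Negations such as "not" are deliberately kept, since dropping them can flip
# the meaning of a review.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "it's", "this", "to", "i", "would"}


def preprocess(text):
    """Lowercase the text, drop punctuation and special characters, remove stopwords."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


def tf_idf(docs_tokens):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs_tokens:
        counts = Counter(tokens)  # the bag-of-words counts for this document
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


reviews = [
    "I absolutely love this product! It's amazing!",
    "This product is terrible. I would not recommend it to anyone.",
]
tokens = [preprocess(r) for r in reviews]
weights = tf_idf(tokens)

print(tokens[0])   # ['absolutely', 'love', 'product', 'amazing']
print(weights[0])  # "product" gets weight 0.0 because it occurs in both reviews
```

The zero weight for "product" shows why TF-IDF is often preferred over raw counts: words that appear in every document carry no information for telling the documents apart.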
Example:
Movie reviews help users decide whether a movie is worth watching. A summary of a movie's reviews lets a user decide quickly instead of reading many individual reviews. Sentiment analysis rates how positive or negative a review is, so the task of deciding whether a review is positive or negative can be automated with Natural Language Processing techniques.
The dataset contains 10,000 movie reviews. The objective is to perform sentiment analysis (positive/negative) on the reviews using both supervised and unsupervised methods, and to compare which gives the most accurate results.
- Supervised models - Some popular techniques used for encoding text:
**Bag of Words**
**TF-IDF** (**T**erm **F**requency - **I**nverse **D**ocument **F**requency)
- Unsupervised models - Some popular techniques used for unsupervised Sentiment Analysis:
**TextBlob**
**VADER Sentiment**
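TextBlob and VADER score text against a sentiment lexicon rather than learning from labeled reviews. The sketch below is a toy illustration of that general idea only; it is not either library's actual algorithm or lexicon. The word scores, the negation list, the `0.05` threshold and the function names are all hypothetical.

```python
import re

# Hypothetical toy lexicon: word -> polarity in [-1, 1].
# Real tools ship far larger, carefully curated lexicons.
LEXICON = {
    "love": 0.8, "amazing": 0.9, "good": 0.6, "great": 0.8,
    "terrible": -0.9, "boring": -0.6, "bad": -0.7, "waste": -0.7,
}
NEGATIONS = {"not", "never", "no"}


def lexicon_score(text):
    """Average the polarity of known words; a negation flips the next lexicon word."""
    scores = []
    negate = False
    for tok in re.findall(r"[a-z]+", text.lower()):
        if tok in NEGATIONS:
            negate = True
            continue
        if tok in LEXICON:
            polarity = LEXICON[tok]
            scores.append(-polarity if negate else polarity)
            negate = False
    return sum(scores) / len(scores) if scores else 0.0


def label(score, threshold=0.05):
    """Map a score to a class; the threshold is an illustrative choice."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


for review in [
    "I absolutely love this product! It's amazing!",
    "This product is terrible. I would not recommend it to anyone.",
    "The product arrived on time, but it was not what I expected.",
]:
    print(label(lexicon_score(review)), "|", review)
```

Because no training is involved, such a scorer can be applied to the reviews directly and its predictions compared against the labeled `sentiment` column.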
Data Dictionary:
- review: reviews of the movies.
- sentiment: indicates the sentiment of the review, 0 or 1 (0 for a negative review, 1 for a positive review).
Dataset source:
- IMDB Movie Ratings Sentiment Analysis: https://www.kaggle.com/datasets/yasserh/imdb-movie-ratings-sentiment-analysis
Sample reviews:
Here, a sentiment value of 0 is negative, and 1 represents a positive sentiment.
A sample wordcloud after segregating negative and positive sentiment:
Words such as even, bad, never, little, least, maybe, instead, waste, terrible, still and boring recurred frequently in the negative reviews.
Wordcloud for positive sentiment:
Words such as well, good, best, great, enjoy, interesting, wonderful, much, fun and beautiful were prominent in the positive reviews.
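A word cloud is essentially a picture of per-class word frequencies. The following sketch shows one way those frequencies could be computed after splitting reviews by the `sentiment` column. The four rows are made-up stand-ins shaped like the dataset (not actual reviews from it), and the stopword list and `top_words` helper are hypothetical.

```python
import re
from collections import Counter

# Illustrative stopword list (an assumption).
STOPWORDS = {"a", "and", "the", "of", "to"}

# Made-up rows shaped like the dataset: (review, sentiment).
rows = [
    ("A great film, great cast and a wonderful story", 1),
    ("Boring plot and a terrible waste of time", 0),
    ("Wonderful and fun, great to watch", 1),
    ("Terrible acting, boring and bad", 0),
]


def top_words(reviews, n=5):
    """Count non-stopword tokens across reviews and return the n most frequent."""
    counts = Counter()
    for review in reviews:
        counts.update(
            tok for tok in re.findall(r"[a-z]+", review.lower())
            if tok not in STOPWORDS
        )
    return counts.most_common(n)


positive = [text for text, sentiment in rows if sentiment == 1]
negative = [text for text, sentiment in rows if sentiment == 0]

print("positive:", top_words(positive, 2))
print("negative:", top_words(negative, 2))
```

A word-cloud library can then size each word according to these counts.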
Training a supervised model on this dataset with Bag of Words (BoW) features, built with CountVectorizer, yields an accuracy score of 82%.
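A Bag of Words model of this kind could be set up along the following lines with scikit-learn. This is a sketch under several assumptions: the source does not say which classifier was used, so `LogisticRegression` is a stand-in; the eight reviews are made-up toy data, so the printed accuracy says nothing about the 82% figure; and `max_features=40` is only one possible way to restrict the vocabulary to 40 terms.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy data shaped like the dataset: 1 = positive, 0 = negative.
toy_reviews = [
    "A wonderful film with a great cast",
    "Great fun and a beautiful story",
    "I enjoyed every minute, wonderful acting",
    "An interesting plot and great music",
    "A boring film and a terrible waste of time",
    "Terrible acting and a bad script",
    "Boring story, I never enjoyed it",
    "A bad plot and a waste of money",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]


def train_bow_classifier(reviews, labels, max_features=40):
    """Fit CountVectorizer + classifier on a train split; return model and test accuracy."""
    train_x, test_x, train_y, test_y = train_test_split(
        reviews, labels, test_size=0.25, stratify=labels, random_state=0
    )
    model = make_pipeline(
        # max_features keeps only the most frequent terms in the training corpus.
        CountVectorizer(stop_words="english", max_features=max_features),
        LogisticRegression(max_iter=1000),  # classifier choice is an assumption
    )
    model.fit(train_x, train_y)
    accuracy = accuracy_score(test_y, model.predict(test_x))
    return model, accuracy


model, accuracy = train_bow_classifier(toy_reviews, toy_labels)
vocabulary = model.named_steps["countvectorizer"].vocabulary_

print(f"held-out accuracy on toy data: {accuracy:.2f}")
print("vocabulary size:", len(vocabulary))
```

Swapping `CountVectorizer` for `TfidfVectorizer` in the same pipeline gives the TF-IDF variant, which makes the two encodings easy to compare on the same split.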
A wordcloud based on top 40 features from the CountVectorizer model:
Based on this model, the movie appears to have received a mixed response, with negative sentiment probably having the upper hand.
Because Bag of Words (CountVectorizer) was the best-scoring approach, the model features are built with it. The other approaches tested were TF-IDF and the unsupervised tools TextBlob and VADER.