
Natural Language Processing - II - Sentiment Analysis

 Fundamental Tasks of NLP: 

Sentiment Analysis:

Sentiment analysis is a natural language processing (NLP) technique used to determine the sentiment expressed in a piece of text. The goal of sentiment analysis is to automatically extract and quantify subjective information from text data such as opinions, attitudes, emotions and feelings.

Let's say we have a dataset containing customer reviews of a product, and we want to analyze the sentiment expressed in each review. The sentiment could be positive, negative or neutral.

For example, given the review "I absolutely love this product! It's amazing!", the sentiment analysis model might classify it as positive.

Similarly, for the review "This product is terrible. I would not recommend it to anyone.", the model might classify it as negative.

And for the review "The product arrived on time, but it was not what I expected.", the model might classify it as neutral.

Steps of Sentiment Analysis:  

Here's how sentiment analysis typically works:

  1. Text Input: The input to sentiment analysis is a piece of text, which could be a sentence, a paragraph, a document or a social media post.

  2. Preprocessing: The text is preprocessed to remove any noise or irrelevant information, such as punctuation, special characters and stopwords (common words like "and", "the", "is" that do not carry much meaning).

  3. Feature Extraction: Next, features are extracted from the preprocessed text. Common techniques include bag-of-words, TF-IDF (Term Frequency-Inverse Document Frequency), word embeddings (e.g., Word2Vec, GloVe) or contextual embeddings (e.g., BERT, GPT).

  4. Sentiment Classification: Once the features are extracted, a machine learning model or a pre-trained deep learning model classifies the sentiment of the text into predefined categories, such as positive, negative or neutral. Some models provide more granular output, such as sentiment scores ranging from strongly negative to strongly positive.

  5. Evaluation and Optimization: The performance of the sentiment analysis model is evaluated using metrics such as accuracy, precision, recall, F1-score or Mean Squared Error (MSE), depending on the task. The model may be fine-tuned and optimized using techniques such as hyperparameter tuning or cross-validation to improve its performance.
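The five steps above can be sketched end to end in plain Python. This is a minimal illustration, not a production pipeline: the stopword list is a tiny placeholder, the classifier is a hand-written multinomial Naive Bayes (an assumption, chosen only because it is short), and the second and fourth training sentences and both test sentences are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "i", "was"}

def preprocess(text):
    """Step 2: lowercase, drop punctuation/special characters, remove stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def bag_of_words(tokens):
    """Step 3: map each token to the number of times it occurs."""
    return Counter(tokens)

class NaiveBayes:
    """Step 4: multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(bag_of_words(preprocess(doc)))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        features = bag_of_words(preprocess(doc))
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for word, count in features.items():
                p = (self.word_counts[label][word] + 1) / (total_words + len(self.vocab))
                score += count * math.log(p)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Step 1: text input (two reviews from the article plus two made-up ones).
train_docs = [
    "I absolutely love this product! It's amazing!",
    "Great quality, wonderful experience.",
    "This product is terrible. I would not recommend it to anyone.",
    "Boring and a waste of money.",
]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_docs, train_labels)

# Step 5: evaluate on held-out (made-up) examples.
test_docs = ["What an amazing, wonderful product", "Terrible, such a waste"]
test_labels = ["positive", "negative"]
predictions = [model.predict(d) for d in test_docs]
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

With only four training sentences the accuracy figure says nothing about real performance; the point is the shape of the pipeline.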

 

Example: 


 

Movie reviews help users decide whether a movie is worth watching. A summary of the reviews lets a user decide quickly instead of reading many individual reviews. Sentiment analysis rates how positive or negative each review is, so the task of judging whether a review is positive or negative can be automated with techniques from Natural Language Processing.

The dataset contains 10,000 movie reviews. The objective is to perform sentiment analysis (positive/negative) on the movie reviews using both supervised and unsupervised methods, and to compare which approach gives the most accurate results.

  1. Supervised models - Some popular techniques used for encoding text:
    •       **Bag of Words**
    •       **TF-IDF** (**T**erm **F**requency - **I**nverse **D**ocument **F**requency)
  2. Unsupervised models - Some popular techniques used for unsupervised Sentiment Analysis:
    •       **TextBlob**
    •       **VADER Sentiment**

Data Dictionary:

    • review: reviews of the movies.
    • sentiment: indicates the sentiment of the review, 0 or 1 (0 for a negative review and 1 for a positive review).

Dataset source:

    • IMDB Movie Ratings Sentiment Analysis: https://www.kaggle.com/datasets/yasserh/imdb-movie-ratings-sentiment-analysis
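To make the TF-IDF encoding listed above concrete, here is a small hand-rolled sketch. It assumes one common textbook variant (term frequency as count divided by document length, inverse document frequency as log(N / df)); library implementations typically add smoothing and normalization, so their numbers will differ. The three-document corpus is made up.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

corpus = [
    "great movie great acting",
    "boring movie",
    "great fun",
]
weights = tf_idf(corpus)
for doc, w in zip(corpus, weights):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

In the second document, "boring" gets a higher weight than "movie" because "movie" also appears in another document. That is the effect TF-IDF adds over raw Bag of Words counts.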

 Sample reviews:


 
 

 Here, a sentiment value of 0 is negative, and 1 represents a positive sentiment.

A sample wordcloud after segregating negative and positive sentiment: 


 
Words such as even, bad, never, little, least, maybe, instead, waste, terrible, still and boring were some of the important recurring words observed in the negative reviews.

Wordcloud for positive sentiment: 


 
Words such as well, good, best, great, enjoy, interesting, wonderful, much, fun and beautiful were some of the important words observed in the positive reviews.
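Word lists like the two above come from counting token frequencies separately for each sentiment label. A minimal sketch, assuming the dataset's `review` and `sentiment` columns have been loaded into two parallel lists; the four rows and the stopword set here are made up for illustration.

```python
import re
from collections import Counter

def top_words(reviews, sentiments, label, n=5, stopwords=frozenset()):
    """Return the n most frequent tokens among reviews with the given label."""
    counts = Counter()
    for review, sentiment in zip(reviews, sentiments):
        if sentiment == label:
            tokens = re.findall(r"[a-z]+", review.lower())
            counts.update(t for t in tokens if t not in stopwords)
    return [word for word, _ in counts.most_common(n)]

# Made-up rows in the dataset's format: 0 = negative, 1 = positive.
reviews = [
    "A boring film, a terrible waste of time",
    "Boring plot and terrible acting",
    "Great fun, a wonderful and beautiful film",
    "Great cast, great fun",
]
sentiments = [0, 0, 1, 1]
stop = {"a", "and", "of", "film"}

negative_top = top_words(reviews, sentiments, label=0, n=2, stopwords=stop)
positive_top = top_words(reviews, sentiments, label=1, n=2, stopwords=stop)
print(negative_top, positive_top)
```

A word cloud is a visual rendering of the same counts, with font size proportional to frequency.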

After building a supervised model on the above dataset with Bag of Words (BoW) features, produced with CountVectorizer, an accuracy score of 82% can be obtained.
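For reference, accuracy is the fraction of reviews whose predicted label matches the true label. The sketch below computes it together with precision, recall and F1 for the dataset's 0/1 labels. The label lists are made up and are not the article's results.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical true and predicted labels (0 = negative, 1 = positive).
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```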

A wordcloud based on the top 40 features from the CountVectorizer model:

Based on this model, the reviews show a mixed response, probably with negative sentiment having the upper hand.

Because Bag of Words (CountVectorizer) was the best-scoring approach, the model features were built on it. The other approaches tested were TF-IDF and the unsupervised techniques TextBlob and VADER.
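Tools such as TextBlob and VADER need no labeled training data because they score text against a prebuilt sentiment lexicon. The sketch below illustrates only that general idea and is not how either library is implemented: the lexicon is a toy one assembled from words the word clouds surfaced, and the scoring rule is a simplifying assumption.

```python
import re

# Toy lexicon taken from the word-cloud terms; real lexicons are far larger
# and assign graded scores, with extra rules for negation and intensity.
POSITIVE = {"good", "best", "great", "enjoy", "interesting", "wonderful", "fun", "beautiful"}
NEGATIVE = {"bad", "waste", "terrible", "boring"}

def lexicon_score(text):
    """Return a score in [-1, 1]: (positive hits - negative hits) / total hits."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0

def lexicon_label(text):
    """Map the score to the dataset's labels (1 = positive, 0 = negative).

    Scores of exactly 0 fall into class 0 here, which is an arbitrary choice.
    """
    return 1 if lexicon_score(text) > 0 else 0

print(lexicon_score("A wonderful, fun and beautiful movie"))
print(lexicon_score("Boring and a terrible waste of time"))
print(lexicon_label("Good start but boring ending"))
```

A scorer this simple misses negation ("not good") and context, which is one plausible reason a trained Bag of Words model can outperform lexicon-based methods on a labeled dataset.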



 

 
