
Natural Language Processing - II - Sentiment Analysis

 Fundamental Tasks of NLP: 

Sentiment Analysis:

Sentiment analysis is a natural language processing (NLP) technique used to determine the sentiment expressed in a piece of text. The goal of sentiment analysis is to automatically extract and quantify subjective information from text data such as opinions, attitudes, emotions and feelings.

Let's say we have a dataset containing customer reviews of a product, and we want to analyze the sentiment expressed in each review. The sentiment could be positive, negative or neutral.

For example, given the review "I absolutely love this product! It's amazing!", the sentiment analysis model might classify it as positive.

Similarly, for the review "This product is terrible. I would not recommend it to anyone.", the model might classify it as negative.

And for the review "The product arrived on time, but it was not what I expected.", the model might classify it as neutral, since the positive and negative remarks roughly balance each other.

Steps of Sentiment Analysis:  

Here's how sentiment analysis typically works:

  1. Text Input: The input to sentiment analysis is a piece of text which could be a sentence, paragraph, document or even a social media post.

  2. Preprocessing: The text is preprocessed to remove any noise or irrelevant information, such as punctuation, special characters and stopwords (common words like "and", "the", "is" that do not carry much meaning).

  3. Feature Extraction: Next, features are extracted from the preprocessed text. Common techniques include bag-of-words, TF-IDF (Term Frequency-Inverse Document Frequency), word embeddings (e.g., Word2Vec, GloVe) or contextual embeddings (e.g., BERT, GPT).

  4. Sentiment Classification: Once the features are extracted, a machine learning model or a pre-trained deep learning model classifies the sentiment of the text into predefined categories, such as positive, negative or neutral. Some models provide more granular output, such as sentiment scores ranging from strongly negative to strongly positive.

  5. Evaluation and Optimization: The performance of the sentiment analysis model is evaluated using metrics that depend on the task: accuracy, precision, recall and F1-score when the model predicts categories, or Mean Squared Error (MSE) when it predicts a continuous sentiment score. The model may then be fine-tuned using techniques such as hyperparameter tuning or cross-validation to improve its performance.
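The steps above can be sketched end to end in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the stopword list is a tiny made-up sample, the training sentences are invented (two of them are the example reviews from earlier), and the classifier is a multinomial Naive Bayes model chosen here for brevity, not a method named in this post.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "i", "was", "of"}


def preprocess(text):
    """Step 2: lowercase, drop punctuation, and remove stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def train_naive_bayes(texts, labels):
    """Steps 3-4: count words per label (bag of words) for Naive Bayes."""
    word_counts = {}              # label -> Counter of word frequencies
    doc_counts = Counter(labels)  # label -> number of training texts
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = preprocess(text)
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in preprocess(text):
            if token not in vocab:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data for illustration only.
train_texts = [
    "I absolutely love this product! It's amazing!",
    "Great quality, works wonderfully.",
    "This product is terrible. I would not recommend it to anyone.",
    "Awful experience, a complete waste of money.",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(train_texts, train_labels)
print(predict(model, "What an amazing, great product"))  # positive
print(predict(model, "Terrible, a waste"))               # negative
```

With only four training sentences the model simply memorises a handful of words; a real system would train on thousands of labelled reviews and hold out a test set for step 5.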

 

Example: 


 

Movie reviews help users decide whether a movie is worth watching. A summary of the reviews lets a user decide quickly instead of spending time reading many individual reviews. Sentiment analysis rates how positive or negative a movie review is, so the task of deciding whether a review is positive or negative can be automated with techniques from Natural Language Processing.

The dataset contains 10,000 movie reviews. The objective is to classify each review as positive or negative using both supervised and unsupervised learning methods, and to compare which approach gives the most accurate results.

  1. Supervised models - Some popular techniques used for encoding text:
    •       **Bag of Words**
      
    •       **TF-IDF** (**T**erm  **F**requency - **I**nverse **D**ocument **F**requency)
      
  2. Unsupervised models - Some popular techniques used for unsupervised Sentiment Analysis:
    •       **TextBlob**         
      
    •       **VADER Sentiment**
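To make the two text encodings listed above concrete, here is a small hand-rolled sketch of Bag of Words and TF-IDF. The three documents are invented, and the TF-IDF formula is the plain textbook form (term frequency times the log of inverse document frequency); library implementations such as scikit-learn's `CountVectorizer` and `TfidfVectorizer` apply additional smoothing and normalisation, so their numbers will differ.

```python
import math
from collections import Counter

# Invented mini-corpus for illustration only.
docs = [
    "good movie good plot",
    "bad movie",
    "good acting",
]
tokenized = [d.split() for d in docs]

# The vocabulary fixes the order of the feature columns.
vocab = sorted({w for doc in tokenized for w in doc})


def bag_of_words(tokens):
    """Raw count of each vocabulary word in one document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


def document_frequency(word):
    """Number of documents that contain the word at least once."""
    return sum(1 for doc in tokenized if word in doc)


def tf_idf(tokens):
    """Textbook TF-IDF: (count / doc length) * ln(N / document frequency)."""
    counts = Counter(tokens)
    n_docs = len(tokenized)
    return [
        (counts[w] / len(tokens)) * math.log(n_docs / document_frequency(w))
        for w in vocab
    ]


print(vocab)                       # ['acting', 'bad', 'good', 'movie', 'plot']
print(bag_of_words(tokenized[0]))  # [0, 0, 2, 1, 1]
print([round(x, 3) for x in tf_idf(tokenized[0])])
```

In the first document "good" appears twice, yet "plot" gets the higher TF-IDF weight because it occurs in only one document. That down-weighting of common words is the main difference from raw counts.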

Data Dictionary:

    • review: the text of the movie review.
    • sentiment: the sentiment of the review, 0 or 1 (0 for a negative review and 1 for a positive review).

Dataset source:

    • IMDB Movie Ratings Sentiment Analysis: https://www.kaggle.com/datasets/yasserh/imdb-movie-ratings-sentiment-analysis

 Sample reviews:


 
 

 Here, a sentiment value of 0 is negative, and 1 represents a positive sentiment.

A sample wordcloud after segregating negative and positive sentiment: 


 
Words such as even, bad, never, little, least, maybe, instead, waste, terrible, still and boring were among the important recurring words observed in the negative reviews.

Wordcloud for positive sentiment: 


 
Words such as well, good, best, great, enjoy, interesting, wonderful, much, fun and beautiful were among the important recurring words observed in the positive reviews.
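A word cloud is essentially a picture of word frequencies, so the counts behind it can be sketched without any plotting library. The snippet below is an illustration only: the four reviews and the stopword list are invented, and only the (review, sentiment) layout follows the data dictionary above.

```python
from collections import Counter

# Hypothetical mini-sample in the dataset's (review, sentiment) layout.
reviews = [
    ("a great film with a wonderful cast", 1),
    ("great fun and a beautiful story", 1),
    ("boring plot and a terrible waste of time", 0),
    ("terrible acting, boring from start to finish", 0),
]

# Tiny illustrative stopword list (an assumption).
STOPWORDS = {"a", "and", "of", "with", "from", "to"}


def top_words(rows, sentiment, n=3):
    """Most frequent non-stopwords among reviews with the given sentiment."""
    counts = Counter()
    for text, label in rows:
        if label == sentiment:
            words = (w.strip(".,!?") for w in text.lower().split())
            counts.update(w for w in words if w and w not in STOPWORDS)
    return counts.most_common(n)


print(top_words(reviews, 1))  # most frequent words in positive reviews
print(top_words(reviews, 0))  # most frequent words in negative reviews
```

A word cloud library would then scale each word's font size by these counts.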

After building a supervised model on the above dataset with Bag of Words (BoW) features from CountVectorizer, an accuracy score of 82% can be obtained.
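For reference, this is how an accuracy score and the related metrics are computed from true and predicted labels. The label lists below are invented for illustration and are not the results of the model described in this post; the function name is likewise hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary sentiment labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented labels: 1 = positive review, 0 = negative review.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(classification_metrics(y_true, y_pred))
```

Here 8 of the 10 predictions match, so the accuracy is 0.8. Accuracy alone can be misleading when one class dominates, which is why precision, recall and F1 are usually reported alongside it.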

A wordcloud based on top 40 features from the CountVectorizer model: 

Based on this model, the movies received a mixed response, probably with negative sentiment having the upper hand.

Because Bag of Words (CountVectorizer) is the best-scoring approach, the model features are built on it. The other approaches tested were TF-IDF and the unsupervised techniques TextBlob and VADER.
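Unsupervised tools of this kind need no labelled training data because they score text against a predefined sentiment lexicon. The toy scorer below sketches only that general idea. It is not the actual lexicon, rule set or API of TextBlob or VADER: the word scores, the negation list and the function names are all made up for illustration.

```python
# Toy sentiment lexicon (hypothetical scores, not taken from any library).
LEXICON = {
    "good": 1.0, "great": 2.0, "wonderful": 2.0, "fun": 1.0,
    "bad": -1.0, "boring": -1.5, "terrible": -2.0, "waste": -1.5,
}
NEGATIONS = {"not", "never", "no"}


def lexicon_score(text):
    """Sum the lexicon scores; a negation flips the next lexicon word."""
    score = 0.0
    negate = False
    for raw in text.lower().split():
        word = raw.strip(".,!?")
        if word in NEGATIONS:
            negate = True
        elif word in LEXICON:
            value = LEXICON[word]
            score += -value if negate else value
            negate = False
    return score


def label(text):
    """Map the score to the dataset's labels: 1 = positive, 0 = negative."""
    return 1 if lexicon_score(text) > 0 else 0


print(lexicon_score("A great and fun movie"))      # 3.0
print(lexicon_score("Not good"))                   # -1.0
print(label("Boring and a terrible waste of time"))  # 0
```

Because the dataset has only two labels, a score of exactly zero is assigned to the negative class here. That is an arbitrary choice, and one reason lexicon methods can trail a supervised model trained on the same domain.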



 

 
