
Mathematics for Artificial Intelligence: Multivariate Analysis

A simplified guide to brushing up on Mathematics for Artificial Intelligence, Machine Learning and Data Science: Multivariate Analysis (important pointers only)

 

Module VIII: Multivariate Analysis

Multivariate analysis is a branch of statistics that deals with the observation and analysis of more than one statistical outcome variable at a time. It is used to understand the relationships between multiple variables simultaneously and to model their interactions.

I. Principal Component Analysis (PCA).

Principal Component Analysis (PCA) is a statistical technique used to simplify a dataset by reducing its dimensions while retaining most of the variance in the data.

Important Concepts:

  1. Dimensionality Reduction:

    PCA reduces the number of dimensions (features) in the dataset while preserving as much variability (information) as possible.
  2. Principal Components:

    These are new, uncorrelated variables formed from linear combinations of the original variables. The first principal component captures the maximum variance, the second captures the next highest variance, and so on.
  3. Eigenvalues and Eigenvectors:

    Eigenvectors determine the directions of the principal components, and eigenvalues determine their magnitude (variance explained).
  4. Variance Explained:

    Each principal component explains a certain percentage of the total variance in the data. The cumulative variance helps in deciding how many components to retain.

 Example: PCA with Iris Dataset

The Iris dataset contains 150 samples of iris flowers with four features: sepal length, sepal width, petal length, and petal width.

Sample Code (in Python):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Apply PCA
pca = PCA(n_components=2)  # Reduce to 2 dimensions
X_pca = pca.fit_transform(X_scaled)

# 4. Visualize the results
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=100)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset')
plt.colorbar()
plt.show()
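
To decide how many components to retain (see point 4 above), it helps to inspect the explained-variance ratios. A minimal sketch on the same standardized Iris data (assuming scikit-learn, as in the example above):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Fit PCA with all components to see how much variance each one explains
X_scaled = StandardScaler().fit_transform(load_iris().data)
pca_full = PCA().fit(X_scaled)

print("Variance explained per component:", pca_full.explained_variance_ratio_)
print("Cumulative variance:", np.cumsum(pca_full.explained_variance_ratio_))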

 

Applications

  • Data Visualization:
    PCA helps in visualizing high-dimensional data in 2D or 3D.
  • Noise Reduction:
    By reducing dimensions, PCA can filter out noise and improve model performance.
  • Feature Extraction:
    PCA can create new features that are combinations of the original features, which might be more informative for machine learning models.

 

II. Factor Analysis.

 Factor Analysis (FA) is a statistical method used to identify underlying relationships between measured variables.

Important Concepts:

  1. Latent Variables:

    These are unobserved variables that are inferred from the observed data. They represent the underlying structure that influences the observed variables.
  2. Factors:

    Factors are the latent variables that explain the correlations among observed variables.
  3. Loadings:

    Factor loadings are the coefficients that indicate the relationship between observed variables and the latent factors.
  4. Communalities:

    The proportion of each variable's variance that can be explained by the factors.

Example: Factor Analysis in Python using the Iris dataset (note that FA is more commonly applied to psychometric data, e.g., survey responses).

 import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer

# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply Factor Analysis
fa = FactorAnalyzer(n_factors=2, rotation='varimax')
fa.fit(X_scaled)

# Get factor loadings
loadings = fa.loadings_
print("Factor Loadings:\n", loadings)

# Get communalities
communalities = fa.get_communalities()
print("\nCommunalities:\n", communalities)

# Explained variance
explained_variance = fa.get_factor_variance()
print("\nExplained Variance:\n", explained_variance)

# Plot the factor loadings
plt.figure(figsize=(10, 6))
plt.scatter(loadings[:, 0], loadings[:, 1])
for i, txt in enumerate(iris.feature_names):
    plt.annotate(txt, (loadings[i, 0], loadings[i, 1]))
plt.xlabel('Factor 1')
plt.ylabel('Factor 2')
plt.title('Factor Loadings Plot')
plt.grid()
plt.show()


III. Multivariate Normal Distribution.

 The multivariate normal distribution is a generalization of the one-dimensional normal distribution to higher dimensions. It describes a set of variables that each have a normal distribution and where there may be correlations between the variables.

 Important Concepts

  1. Mean Vector (μ):

    A vector representing the means of each variable.
  2. Covariance Matrix (Σ):

    A matrix representing the variances and covariances between pairs of variables.
  3. Probability Density Function:

    The function that defines the likelihood of observing a particular set of values.

Properties

  1. Symmetry: The distribution is symmetric about the mean vector.

  2. Marginal Distributions: Any subset of the variables follows a multivariate normal distribution.

  3. Linear Combinations: Any linear combination of the variables is also normally distributed.
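
The linear-combination property can be checked numerically by sampling. A small sketch (the mean vector, covariance matrix and coefficients below are illustrative values chosen for this check):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
a = np.array([0.5, 2.0])  # coefficients of the linear combination a^T X

# Sample from the multivariate normal and form the linear combination
samples = rng.multivariate_normal(mu, sigma, size=100_000)
combo = samples @ a

# The combination should be normal with mean a^T mu and variance a^T Sigma a
print("empirical mean:", combo.mean(), "theoretical:", a @ mu)
print("empirical var :", combo.var(), "theoretical:", a @ sigma @ a)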

Probability Density Function

For a k-dimensional random vector X, the probability density function is given by:

f(\mathbf{X}) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{X} - \mu)^T \Sigma^{-1} (\mathbf{X} - \mu) \right)

Example: Multivariate Normal Distribution in Python using a sample from a bivariate normal distribution (2-dimensional case).

 import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Mean vector
mu = np.array([0, 0])

# Covariance matrix
sigma = np.array([[1, 0.5], [0.5, 1]])

# Generate a sample of 1000 points
sample = np.random.multivariate_normal(mu, sigma, 1000)

# Plot the sample
plt.figure(figsize=(8, 6))
plt.scatter(sample[:, 0], sample[:, 1], c='blue', s=10, alpha=0.5)
plt.title('Bivariate Normal Distribution')
plt.xlabel('X1')
plt.ylabel('X2')
plt.grid(True)
plt.axis('equal')
plt.show()
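
The density formula above can also be evaluated directly and compared against scipy's multivariate_normal. A short sketch using the same mean vector and covariance matrix as in the sample code (the evaluation point x is an arbitrary illustrative choice):

import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0, 0])
sigma = np.array([[1, 0.5], [0.5, 1]])
x = np.array([0.5, -0.2])  # arbitrary point at which to evaluate the density

# Library evaluation of the density
print("pdf via scipy  :", multivariate_normal(mean=mu, cov=sigma).pdf(x))

# Direct evaluation of the formula above
k = len(mu)
diff = x - mu
manual = np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff) / np.sqrt((2 * np.pi) ** k * np.linalg.det(sigma))
print("pdf via formula:", manual)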


 Applications

  1. Multivariate Data Analysis: Used to model and analyze data with multiple correlated variables.
  2. Financial Modeling: Used in portfolio theory to model the returns of multiple assets.
  3. Machine Learning: Used in various algorithms, including Gaussian Mixture Models and Bayesian networks.
  4. Statistics: Foundation for various statistical techniques, including hypothesis testing and confidence intervals for multivariate data.

 

IV. Covariance and Correlation Matrices.

 Covariance and correlation matrices are fundamental tools in statistics for understanding the relationships between multiple variables.

1. Covariance Matrix

The covariance matrix provides a measure of the joint variability of multiple variables. Each element in the covariance matrix represents the covariance between a pair of variables.

For a dataset with n variables, the covariance matrix Σ is an n × n matrix where the element Σ_ij is the covariance between the i-th and j-th variables.

\Sigma_{ij} = \text{Cov}(X_i, X_j) = \mathbb{E}[(X_i - \mu_i)(X_j - \mu_j)]

where μ_i and μ_j are the means of variables X_i and X_j, respectively.

Properties

  • Symmetry: The covariance matrix is symmetric (Σ_ij = Σ_ji).
  • Diagonal Elements: The diagonal elements represent the variances of the variables (Σ_ii = Var(X_i)).

2. Correlation Matrix

The correlation matrix standardizes the covariances by the variances of the variables, providing a scale-free measure of the linear relationship between variables.

The correlation matrix P is an n × n matrix where the element P_ij is the correlation between the i-th and j-th variables.

P_{ij} = \text{Corr}(X_i, X_j) = \frac{\text{Cov}(X_i, X_j)}{\sqrt{\text{Var}(X_i)\,\text{Var}(X_j)}} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\,\Sigma_{jj}}}
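
This relationship can be verified numerically: dividing each covariance by the product of the corresponding standard deviations should reproduce the correlation matrix. A small sketch (the random data below is purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 3))  # correlated toy data

Sigma = np.cov(data, rowvar=False)
d = np.sqrt(np.diag(Sigma))              # standard deviations
P_from_cov = Sigma / np.outer(d, d)      # Sigma_ij / sqrt(Sigma_ii * Sigma_jj)

print(np.allclose(P_from_cov, np.corrcoef(data, rowvar=False)))  # expected: True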

Example: Python code using Iris dataset.

 import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Covariance matrix
cov_matrix = np.cov(X_scaled, rowvar=False)

# Correlation matrix
corr_matrix = np.corrcoef(X_scaled, rowvar=False)

# Convert to DataFrame for better visualization
cov_df = pd.DataFrame(cov_matrix, index=iris.feature_names, columns=iris.feature_names)
corr_df = pd.DataFrame(corr_matrix, index=iris.feature_names, columns=iris.feature_names)

# Plot covariance matrix
plt.figure(figsize=(10, 8))
sns.heatmap(cov_df, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Covariance Matrix')
plt.show()

# Plot correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_df, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()


V.  Canonical Correlation Analysis (CCA).

 Canonical Correlation Analysis (CCA) is a statistical method used to understand the relationship between two sets of variables.

 Important Concepts

  1. Two Sets of Variables:

    Set 1 (X variables): X_1, X_2, ..., X_p
    Set 2 (Y variables): Y_1, Y_2, ..., Y_q
  2. Canonical Variates:

    Linear combinations of the X variables: U = a_1 X_1 + a_2 X_2 + ... + a_p X_p
    Linear combinations of the Y variables: V = b_1 Y_1 + b_2 Y_2 + ... + b_q Y_q
  3. Objective:

    Find coefficients a_1, a_2, ..., a_p and b_1, b_2, ..., b_q such that the correlation between U and V is maximized.

Steps in Canonical Correlation Analysis

  1. Standardize the Variables:

    Convert all variables to have a mean of 0 and a standard deviation of 1.
  2. Compute the Covariance Matrices:

    Compute the covariance matrix for the X variables: Σ_XX
    Compute the covariance matrix for the Y variables: Σ_YY
    Compute the cross-covariance matrix between the X and Y variables: Σ_XY
  3. Solve the Generalized Eigenvalue Problem:

    The canonical correlations are obtained by solving the generalized eigenvalue problems for the matrices Σ_XX⁻¹ Σ_XY Σ_YY⁻¹ Σ_YX and Σ_YY⁻¹ Σ_YX Σ_XX⁻¹ Σ_XY (a minimal NumPy sketch of this computation follows the list below).
  4. Canonical Correlations:

    The eigenvalues from the above problem give the squared canonical correlations.
    The eigenvectors provide the coefficients for the linear combinations of the X and Y variables.
  5. Interpret the Results:

    The canonical correlations indicate the strength of the relationship between the canonical variates.
    The coefficients (eigenvectors) indicate how much each original variable contributes to the canonical variates.
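
As referenced in step 3, the canonical correlations can be computed directly from the covariance matrices. A minimal NumPy sketch of that computation (the toy data below is illustrative; the scikit-learn example that follows is the more convenient route in practice):

import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 3))
Y = X @ rng.standard_normal((3, 2)) + 0.5 * rng.standard_normal((n, 2))  # Y related to X

# Center the variables and form the covariance blocks
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
Sxx = Xc.T @ Xc / (n - 1)
Syy = Yc.T @ Yc / (n - 1)
Sxy = Xc.T @ Yc / (n - 1)

# Eigenvalues of Sxx^{-1} Sxy Syy^{-1} Syx are the squared canonical correlations
M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
eigvals = np.linalg.eigvals(M)
r_squared = np.sort(eigvals.real)[::-1][: min(X.shape[1], Y.shape[1])]
print("Canonical correlations:", np.sqrt(r_squared))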

 

Sample Code (in Python):

import numpy as np
import pandas as pd
from sklearn.cross_decomposition import CCA

# Example data
np.random.seed(0)
X = np.random.rand(100, 3)  # 100 samples, 3 X variables
Y = np.random.rand(100, 3)  # 100 samples, 3 Y variables

# Initialize CCA
n_components = 2
cca = CCA(n_components=n_components)

# Fit the model
cca.fit(X, Y)

# Transform the data
X_c, Y_c = cca.transform(X, Y)

# Canonical correlations
correlations = [np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(n_components)]
print("Canonical Correlations:", correlations)

# Coefficients of the linear combinations
print("X coefficients:\n", cca.x_weights_)
print("Y coefficients:\n", cca.y_weights_)


VI. Multidimensional Scaling.

 Multidimensional Scaling (MDS) is a technique used to visualize the level of similarity or dissimilarity of individual cases in a dataset. It is used in exploratory data analysis to detect patterns in the presence of high-dimensional data.

Sample Code (in Python):

 import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist, squareform

# Example data
np.random.seed(0)
data = np.random.rand(10, 5)  # 10 samples, 5 features

# Compute the distance matrix
dist_matrix = squareform(pdist(data, 'euclidean'))

# Initialize MDS
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)

# Fit the model
mds_transformed = mds.fit_transform(dist_matrix)

# Plot the results
plt.scatter(mds_transformed[:, 0], mds_transformed[:, 1], c='blue', marker='o')
for i in range(len(mds_transformed)):
    plt.text(mds_transformed[i, 0], mds_transformed[i, 1], str(i))
plt.title('MDS Plot')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.show()

 VII. Discriminant Analysis.

Discriminant Analysis is a statistical technique used to classify observations into predefined classes. It is particularly useful when you have categorical dependent variables and continuous independent variables. The two main types of discriminant analysis are:

  1. Linear Discriminant Analysis (LDA)
  2. Quadratic Discriminant Analysis (QDA)

Linear Discriminant Analysis (LDA)

LDA assumes that the data from each class is drawn from a Gaussian distribution with the same covariance matrix. It aims to find a linear combination of features that best separate two or more classes.

Quadratic Discriminant Analysis (QDA)

QDA is similar to LDA but assumes that each class has its own covariance matrix. This allows for more flexible decision boundaries but requires more parameters to be estimated.
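
A minimal QDA sketch on the Iris data, mirroring the LDA example below (the 70/30 split and random_state are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load data and split into training and testing sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

# Fit QDA (a separate covariance matrix is estimated per class) and evaluate
qda = QuadraticDiscriminantAnalysis()
qda.fit(X_train, y_train)
print("QDA Accuracy Score:", accuracy_score(y_test, qda.predict(X_test)))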

 

Sample Code using LDA (in Python):

 import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Initialize LDA
lda = LinearDiscriminantAnalysis()

# Fit the model
lda.fit(X_train, y_train)

# Predict the classes for the test set
y_pred = lda.predict(X_test)

# Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Accuracy Score
print("Accuracy Score:", accuracy_score(y_test, y_pred))

# Classification Report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Visualizing the LDA
lda_transformed = lda.transform(X)
plt.scatter(lda_transformed[:, 0], lda_transformed[:, 1], c=y, cmap='viridis')
plt.xlabel('LD1')
plt.ylabel('LD2')
plt.title('LDA: Iris Data')
plt.show()

 VIII. Cluster Analysis.

Cluster analysis, also known as clustering, is a technique used to group a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. There are several clustering algorithms, but some of the most commonly used include:

  1. K-means Clustering
  2. Hierarchical Clustering
  3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

K-means Clustering

K-means is one of the simplest and most popular clustering algorithms. It partitions the data into K clusters, where each data point belongs to the cluster with the nearest mean.

Sample Code (in Python):

 import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate sample data
np.random.seed(0)
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Initialize KMeans with the number of clusters
kmeans = KMeans(n_clusters=4)

# Fit the model
kmeans.fit(X)

# Predict the clusters for each data point
y_kmeans = kmeans.predict(X)

# Plot the clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

# Plot the cluster centers
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X')
plt.title('K-means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

Hierarchical Clustering

Hierarchical clustering seeks to build a hierarchy of clusters. It can be either:

  1. Agglomerative: A bottom-up approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  2. Divisive: A top-down approach where all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
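
A short agglomerative (bottom-up) sketch on the same kind of blob data used in the K-means example (the choice of 4 clusters and Ward linkage is illustrative):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Generate sample data and merge points bottom-up into 4 clusters
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
agg = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels = agg.fit_predict(X)
print("Cluster labels of the first 10 points:", labels[:10])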

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

DBSCAN is a clustering algorithm that groups together points that are close to each other based on a distance measurement and a minimum number of points. It is especially useful for identifying outliers or noise in data.
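
A minimal DBSCAN sketch (the two-moons data and the eps and min_samples values below are illustrative; points labelled -1 are treated as noise):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Non-spherical data where density-based clustering works well
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Noise points :", list(labels).count(-1))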




