A simplified guide to preparing in Mathematics for Artificial Intelligence, Machine Learning and Data Science: Multivariate Analysis (Important Pointers only)
Module VIII : Multivariate Analysis
Multivariate analysis is a branch of statistics that deals with the observation and analysis of more than one statistical outcome variable at a time. It is used to understand the relationships between multiple variables simultaneously and to model their interactions.
I. Principal Component Analysis (PCA).
Principal Component Analysis (PCA) is a statistical technique used to simplify a dataset by reducing its dimensions while retaining most of the variance in the data.
Important Concepts:
Dimensionality Reduction:
PCA reduces the number of dimensions (features) in the dataset while preserving as much variability (information) as possible.
Principal Components:
These are new, uncorrelated variables formed from linear combinations of the original variables. The first principal component captures the maximum variance, the second captures the next highest variance, and so on.
Eigenvalues and Eigenvectors:
Eigenvectors determine the directions of the principal components, and eigenvalues determine their magnitude (the variance explained along each direction).
Variance Explained:
Each principal component explains a certain percentage of the total variance in the data. The cumulative variance helps in deciding how many components to retain.
Example: PCA with Iris Dataset
The Iris dataset contains 150 samples of iris flowers with four features: sepal length, sepal width, petal length, and petal width.
Sample Code (in Python):
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# 2. Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# 3. Apply PCA
pca = PCA(n_components=2) # Reduce to 2 dimensions
X_pca = pca.fit_transform(X_scaled)
# 4. Visualize the results
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=100)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset')
plt.colorbar()
plt.show()
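To connect the example to the Variance Explained concept above, the fitted PCA object also reports the proportion of variance each component captures (a short follow-up to the code above, using scikit-learn's explained_variance_ratio_ attribute):
# Proportion of total variance explained by each retained component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cumulative variance explained:", pca.explained_variance_ratio_.cumsum())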
Applications
- Data Visualization:
PCA helps in visualizing high-dimensional data in 2D or 3D.
- Noise Reduction:
By reducing dimensions, PCA can filter out noise and improve model performance.
- Feature Extraction:
PCA can create new features that are combinations of the original features, which might be more informative for machine learning models.
II. Factor Analysis.
Factor Analysis (FA) is a statistical method used to identify underlying relationships between measured variables.
Important Concepts:
Latent Variables:
These are unobserved variables that are inferred from the observed data. They represent the underlying structure that influences the observed variables.
Factors:
Factors are the latent variables that explain the correlations among observed variables.
Loadings:
Factor loadings are the coefficients that indicate the relationship between observed variables and the latent factors.
Communalities:
The proportion of each variable's variance that can be explained by the factors.
Example: Factor Analysis in Python using the Iris dataset (although FA is more commonly applied to psychometric data, e.g., survey responses).
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer
# Load the Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply Factor Analysis
fa = FactorAnalyzer(n_factors=2, rotation='varimax')
fa.fit(X_scaled)
# Get factor loadings
loadings = fa.loadings_
print("Factor Loadings:\n", loadings)
# Get communalities
communalities = fa.get_communalities()
print("\nCommunalities:\n", communalities)
# Explained variance
explained_variance = fa.get_factor_variance()
print("\nExplained Variance:\n", explained_variance)
# Plot the factor loadings
plt.figure(figsize=(10, 6))
plt.scatter(loadings[:, 0], loadings[:, 1])
for i, txt in enumerate(iris.feature_names):
    plt.annotate(txt, (loadings[i, 0], loadings[i, 1]))
plt.xlabel('Factor 1')
plt.ylabel('Factor 2')
plt.title('Factor Loadings Plot')
plt.grid()
plt.show()
III. Multivariate Normal Distribution.
The multivariate normal distribution is a generalization of the one-dimensional normal distribution to higher dimensions. It describes a set of variables that each have a normal distribution and where there may be correlations between the variables.
Important Concepts
Mean Vector (μ):
A vector representing the means of each variable.
Covariance Matrix (Σ):
A matrix representing the variances and covariances between pairs of variables.
Probability Density Function:
The function that defines the likelihood of observing a particular set of values.
Properties
Symmetry: The distribution is symmetric about the mean vector.
Marginal Distributions: Any subset of the variables follows a multivariate normal distribution.
Linear Combinations: Any linear combination of the variables is also normally distributed.
Probability Density Function
For a k-dimensional random vector X with mean vector μ and covariance matrix Σ, the probability density function is given by:
f(x) = (2π)^(-k/2) |Σ|^(-1/2) exp( -(1/2) (x - μ)^T Σ^(-1) (x - μ) )
Example: Multivariate Normal Distribution in Python using a sample from a bivariate normal distribution (2-dimensional case).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
# Mean vector
mu = np.array([0, 0])
# Covariance matrix
sigma = np.array([[1, 0.5], [0.5, 1]])
# Generate a sample of 1000 points
sample = np.random.multivariate_normal(mu, sigma, 1000)
# Plot the sample
plt.figure(figsize=(8, 6))
plt.scatter(sample[:, 0], sample[:, 1], c='blue', s=10, alpha=0.5)
plt.title('Bivariate Normal Distribution')
plt.xlabel('X1')
plt.ylabel('X2')
plt.grid(True)
plt.axis('equal')
plt.show()
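Since the code above imports multivariate_normal from SciPy, it can also be used to evaluate the density function itself at specific points (a minimal follow-up sketch, reusing mu and sigma from above):
# Freeze the distribution with the mean vector and covariance matrix defined above
rv = multivariate_normal(mean=mu, cov=sigma)
print("Density at the mean (0, 0):", rv.pdf([0, 0]))
print("Density at (1, 1):", rv.pdf([1, 1]))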
Applications
- Multivariate Data Analysis: Used to model and analyze data with multiple correlated variables.
- Financial Modeling: Used in portfolio theory to model returns of multiple assets.
- Machine Learning: Used in various algorithms, including Gaussian Mixture Models and Bayesian networks.
- Statistics: Foundation for various statistical techniques, including hypothesis testing and confidence intervals for multivariate data.
IV. Covariance and Correlation Matrices.
Covariance and correlation matrices are fundamental tools in statistics for understanding the relationships between multiple variables.
1. Covariance Matrix
The covariance matrix provides a measure of the joint variability of multiple variables. Each element in the covariance matrix represents the covariance between a pair of variables.
For a dataset with n variables, the covariance matrix Σ is an n × n matrix where the (i, j) element is the covariance between the i-th and j-th variables:
Cov(X_i, X_j) = E[(X_i - μ_i)(X_j - μ_j)]
where μ_i and μ_j are the means of variables X_i and X_j, respectively.
Properties
- Symmetry: The covariance matrix is symmetric (Σ_ij = Σ_ji).
- Diagonal Elements: The diagonal elements represent the variances of the variables (Σ_ii = Var(X_i)).
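As a quick illustration of the definition above, the covariance matrix of the four Iris features can be computed directly with NumPy (a minimal sketch, assuming NumPy and scikit-learn as in the earlier examples):
import numpy as np
from sklearn.datasets import load_iris
X = load_iris().data  # 150 samples x 4 features
# np.cov treats rows as variables by default, so set rowvar=False for samples-in-rows data
cov_matrix = np.cov(X, rowvar=False)
print("Covariance matrix (4 x 4):\n", cov_matrix)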
2. Correlation Matrix
The correlation matrix standardizes the covariances by the standard deviations of the variables, providing a scale-free measure of the linear relationship between variables.
The correlation matrix R is an n × n matrix where the (i, j) element is the correlation between the i-th and j-th variables:
ρ_ij = Cov(X_i, X_j) / (σ_i σ_j)
where σ_i and σ_j are the standard deviations of X_i and X_j.
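A matching sketch for the correlation matrix of the same features, using np.corrcoef:
import numpy as np
from sklearn.datasets import load_iris
X = load_iris().data
# np.corrcoef also treats rows as variables by default
corr_matrix = np.corrcoef(X, rowvar=False)
print("Correlation matrix (4 x 4):\n", corr_matrix)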
V. Canonical Correlation Analysis.
Canonical Correlation Analysis (CCA) is a statistical method used to understand the relationship between two sets of variables.
Concepts
Two Sets of Variables:
Set 1 (X variables): X = (X_1, X_2, ..., X_p)
Set 2 (Y variables): Y = (Y_1, Y_2, ..., Y_q)
Canonical Variates:
Linear combinations of the X variables: U = a_1 X_1 + a_2 X_2 + ... + a_p X_p
Linear combinations of the Y variables: V = b_1 Y_1 + b_2 Y_2 + ... + b_q Y_q
Objective:
Find coefficients a and b such that the correlation between U and V is maximized.
Steps in Canonical Correlation Analysis
Standardize the Variables:
Convert all variables to have a mean of 0 and a standard deviation of 1.
Compute the Covariance Matrices:
Compute the covariance matrix for the X variables: Σ_XX
Compute the covariance matrix for the Y variables: Σ_YY
Compute the cross-covariance matrix between the X and Y variables: Σ_XY
Solve the Generalized Eigenvalue Problem:
The canonical correlations are obtained by solving the generalized eigenvalue problem for the matrices Σ_XX^(-1) Σ_XY Σ_YY^(-1) Σ_YX and Σ_YY^(-1) Σ_YX Σ_XX^(-1) Σ_XY.
Canonical Correlations:
The eigenvalues from the above problem give the squared canonical correlations.
The eigenvectors provide the coefficients for the linear combinations of the X and Y variables.
Interpret the Results:
The canonical correlations indicate the strength of the relationship between the canonical variates.
The coefficients (eigenvectors) indicate how much each original variable contributes to the canonical variates.
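To make these steps concrete, here is a minimal sketch using scikit-learn's CCA on the Iris data; the split of the four features into an X set and a Y set is arbitrary and purely for illustration:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cross_decomposition import CCA
data = load_iris().data
X = data[:, :2]  # sepal length, sepal width
Y = data[:, 2:]  # petal length, petal width
# Standardize both sets of variables
X = StandardScaler().fit_transform(X)
Y = StandardScaler().fit_transform(Y)
# Fit CCA and obtain the canonical variates U and V
cca = CCA(n_components=2)
U, V = cca.fit_transform(X, Y)
# Correlation between each pair of canonical variates
for i in range(2):
    r = np.corrcoef(U[:, i], V[:, i])[0, 1]
    print(f"Canonical correlation {i + 1}: {r:.3f}")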
VI. Multidimensional Scaling (MDS).
Multidimensional Scaling (MDS) is a technique used to visualize the level of similarity or dissimilarity of individual cases in a dataset. It is used in exploratory data analysis to detect patterns in high-dimensional data.
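A minimal MDS sketch with scikit-learn, embedding the Iris samples into two dimensions based on their pairwise distances:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import MDS
iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
# Embed the 4-dimensional samples into 2 dimensions
mds = MDS(n_components=2, random_state=42)
X_mds = mds.fit_transform(X_scaled)
plt.scatter(X_mds[:, 0], X_mds[:, 1], c=iris.target, cmap='viridis', edgecolor='k')
plt.xlabel('MDS Dimension 1')
plt.ylabel('MDS Dimension 2')
plt.title('MDS of Iris Dataset')
plt.show()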
VII. Discriminant Analysis.
Discriminant Analysis is a statistical technique used to classify observations into predefined classes. It is particularly useful when you have a categorical dependent variable and continuous independent variables. The two main types of discriminant analysis are:
- Linear Discriminant Analysis (LDA)
- Quadratic Discriminant Analysis (QDA)
Linear Discriminant Analysis (LDA)
LDA assumes that the data from each class is drawn from a Gaussian distribution with the same covariance matrix. It aims to find a linear combination of features that best separate two or more classes.
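A minimal LDA sketch on the Iris dataset with scikit-learn (training accuracy only, purely for illustration):
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
iris = load_iris()
# Fit LDA as a classifier on all 150 samples
lda = LinearDiscriminantAnalysis()
lda.fit(iris.data, iris.target)
print("LDA training accuracy:", lda.score(iris.data, iris.target))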
Quadratic Discriminant Analysis (QDA)
QDA is similar to LDA but assumes that each class has its own covariance matrix. This allows for more flexible decision boundaries but requires more parameters to be estimated.
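The corresponding sketch for QDA, which estimates a separate covariance matrix for each class:
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
iris = load_iris()
# Fit QDA as a classifier; each class gets its own covariance estimate
qda = QuadraticDiscriminantAnalysis()
qda.fit(iris.data, iris.target)
print("QDA training accuracy:", qda.score(iris.data, iris.target))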
VIII. Cluster Analysis.
Cluster analysis, also known as clustering, is a technique used to group a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. There are several clustering algorithms, but some of the most commonly used include:
- K-means Clustering
- Hierarchical Clustering
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
K-means Clustering
K-means is one of the simplest and most popular clustering algorithms. It partitions the data into K clusters, where each data point belongs to the cluster with the nearest mean.
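A minimal K-means sketch with scikit-learn, clustering the Iris samples into three groups (K = 3 is chosen only because the dataset has three species):
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
X = load_iris().data
# Partition the samples into K = 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print("Cluster sizes:", [list(labels).count(c) for c in range(3)])
print("Cluster centers:\n", kmeans.cluster_centers_)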
Hierarchical Clustering
Hierarchical clustering seeks to build a hierarchy of clusters (a short agglomerative example follows the list below). It can be either:
- Agglomerative: A bottom-up approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Divisive: A top-down approach where all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
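Following the agglomerative (bottom-up) approach, a minimal scikit-learn sketch using Ward linkage:
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
X = load_iris().data
# Merge samples bottom-up into 3 clusters using Ward linkage
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)
print("Cluster sizes:", [list(labels).count(c) for c in range(3)])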
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is a clustering algorithm that groups together points that are close to each other based on a distance measurement and a minimum number of points. It is especially useful for identifying outliers or noise in data.
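A minimal DBSCAN sketch on standardized Iris data; the eps and min_samples values below are illustrative and would normally be tuned:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
X = StandardScaler().fit_transform(load_iris().data)
# Points with at least min_samples neighbours within eps form dense regions;
# points belonging to no dense region are labelled -1 (noise)
db = DBSCAN(eps=0.8, min_samples=5)
labels = db.fit_predict(X)
print("Cluster labels found:", set(labels))
print("Number of noise points:", list(labels).count(-1))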