A simplified guide on how to prep up on Mathematics for Artificial Intelligence, Machine Learning and Data Science: Boolean Algebra (Important Pointers only)
Module VI : Boolean Algebra
I. Information Theory.
Information theory, a mathematical framework developed by Claude Shannon in the mid-20th century, quantifies information and provides tools for analyzing communication systems and processes.
It lies at the intersection of mathematics, statistics, computer science, physics, neurobiology, and electrical engineering.
Information:
- Information measures the reduction of uncertainty. When an event is uncertain, the receipt of a message about the event reduces this uncertainty.
- Information is quantified in terms of bits (binary digits).
Abstractly, information can be thought of as the resolution of uncertainty. Shannon's main result, the noisy-channel coding theorem, showed that, in the limit of many channel uses, the rate of information that is asymptotically achievable is equal to the channel capacity, a quantity dependent merely on the statistics of the channel over which the messages are sent.
Cryptographic algorithms (both codes and ciphers) form another important class of information-theoretic codes. Concepts, methods and results from coding theory and information theory are widely used in cryptography and cryptanalysis, such as the ban, a unit of information.
A key measure in information theory is entropy, which quantifies the uncertainty or unpredictability of a random variable.
Joint Entropy:
- Joint entropy $H(X, Y)$ of two random variables $X$ and $Y$ is the entropy of their combined system.
- It measures the total uncertainty of the pair $(X, Y)$: $H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y)$
Conditional Entropy:
- Conditional entropy $H(Y \mid X)$ is the amount of uncertainty remaining about $Y$ given that $X$ is known.
- It is defined as: $H(Y \mid X) = -\sum_{x, y} p(x, y) \log_2 p(y \mid x)$
Mutual Information:
- Mutual information $I(X; Y)$ quantifies the amount of information obtained about one random variable through another random variable: $I(X; Y) = H(X) - H(X \mid Y)$
Relative Entropy (Kullback-Leibler Divergence):
- Relative entropy $D_{KL}(P \,\|\, Q)$ measures the difference between two probability distributions $P$ and $Q$: $D_{KL}(P \,\|\, Q) = \sum_x P(x) \log_2 \frac{P(x)}{Q(x)}$
Channel Capacity:
- Channel capacity is the maximum rate at which information can be reliably transmitted over a communication channel.
- It is determined by the channel's noise characteristics and is given by the Shannon-Hartley theorem for a continuous channel: $C = B \log_2\!\left(1 + \frac{S}{N}\right)$
- Where $B$ is the bandwidth, $S$ is the signal power, and $N$ is the noise power.
Applications
- Data Compression
- Error Correction
- Cryptography
- Machine Learning and Statistics
- Network Information Theory
II. Entropy and Properties.
Entropy is a central concept in information theory that quantifies the amount of uncertainty or unpredictability in a random variable.
For a discrete random variable $X$ with possible outcomes $x_1, \ldots, x_n$ and corresponding probabilities $p_1, \ldots, p_n$, the entropy is defined as: $H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$
Entropy is measured in bits when the logarithm is base 2.
Properties of Entropy
Non-negativity:
Entropy is always non-negative: $H(X) \ge 0$
$H(X) = 0$ if and only if $X$ is a certain event (i.e., one outcome has probability 1 and all others have probability 0).
Symmetry:
Entropy is symmetric with respect to the probabilities of the outcomes. The order of outcomes does not affect the value of entropy.
Maximum Entropy:
For a random variable with $n$ possible outcomes, entropy is maximized when all outcomes are equally likely, i.e., $p_i = \frac{1}{n}$ for all $i$.
In this case, the maximum entropy is: $H(X) = \log_2 n$
Additivity (for Independent Variables):
For two independent random variables $X$ and $Y$, the joint entropy is the sum of the individual entropies: $H(X, Y) = H(X) + H(Y)$
Subadditivity (for Dependent Variables):
For two random variables $X$ and $Y$, the joint entropy is less than or equal to the sum of the individual entropies: $H(X, Y) \le H(X) + H(Y)$
Conditional Entropy:
The conditional entropy $H(Y \mid X)$ is the entropy of $Y$ given that $X$ is known. It quantifies the remaining uncertainty about $Y$ after knowing $X$: $H(Y \mid X) = H(X, Y) - H(X)$
- Conditional entropy is always non-negative: $H(Y \mid X) \ge 0$.
Chain Rule:
The entropy of a joint distribution can be decomposed using the chain rule: $H(X, Y) = H(X) + H(Y \mid X)$
- This property helps in breaking down complex distributions into simpler parts.
Data Processing Inequality:
If $X \to Y \to Z$ forms a Markov chain, then the mutual information satisfies $I(X; Z) \le I(X; Y)$.
- This implies that processing data cannot increase the amount of information.
Eg: Fair Die.
- For a fair six-sided die, each outcome has probability $\frac{1}{6}$, so the entropy is: $H(X) = -\sum_{i=1}^{6} \frac{1}{6} \log_2 \frac{1}{6} = \log_2 6 \approx 2.585$ bits.
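A quick numeric check of this example (a minimal Python sketch; the helper name entropy_bits is just an illustrative choice):
import numpy as np
def entropy_bits(probs):
    # Shannon entropy H(X) = -sum p_i log2 p_i, skipping zero-probability outcomes.
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))
print(entropy_bits([1/6] * 6))      # fair die: ~2.585 bits (= log2 6)
print(entropy_bits([1.0, 0, 0]))    # certain event: 0 bits
print(entropy_bits([0.5, 0.5]))     # two equally likely outcomes: 1 bit (the maximum)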
III. Information Gain.
Information gain is a metric used in information theory and machine learning to measure the reduction in entropy or uncertainty about a random variable $Y$ given knowledge of another variable $A$.
Information gain is defined as the difference between the entropy of a variable before and after the observation of another variable. For a random variable $Y$ and an attribute $A$, the information gain is calculated as: $IG(Y, A) = H(Y) - H(Y \mid A)$
Where:
- $H(Y)$ is the entropy of $Y$.
- $H(Y \mid A)$ is the conditional entropy of $Y$ given $A$.
Steps to compute
Consider a dataset to predict a binary outcome (e.g., whether a customer will buy a product: Yes/No) based on several attributes (e.g., Age, Income, etc.).
- Calculate the entropy of the target variable $Y$: $H(Y) = -\sum_{y} p(y) \log_2 p(y)$
- For each attribute $A$, calculate the conditional entropy: $H(Y \mid A) = \sum_{a} p(a)\, H(Y \mid A = a)$
- Compute the information gain for each attribute: $IG(Y, A) = H(Y) - H(Y \mid A)$
- Choose the attribute with the highest information gain to split the dataset.
Eg: Consider a dataset with a given distribution for the target variable $Y$ (e.g., counts of Yes and No outcomes).
The entropy $H(Y)$ is computed from these class proportions.
If we consider an attribute $A$ with values $a_1, \ldots, a_k$, the conditional entropies $H(Y \mid A = a_i)$ are computed for each value.
And the probabilities $p(a_i)$ of each attribute value are taken from the data.
Then the conditional entropy is: $H(Y \mid A) = \sum_{i} p(a_i)\, H(Y \mid A = a_i)$
The information gain is: $IG(Y, A) = H(Y) - H(Y \mid A)$
Repeat this process for other attributes to find the one with the highest information gain for splitting the data.
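A minimal Python sketch of these steps; the class counts and attribute splits below are made-up illustrative values, not taken from a specific dataset:
import numpy as np
def entropy(probs):
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))
# Hypothetical target distribution: 9 "Yes" and 5 "No" out of 14 samples.
h_y = entropy([9/14, 5/14])
# Hypothetical attribute A with two values a1, a2 and their class splits.
p_a = [8/14, 6/14]                               # p(A = a1), p(A = a2)
h_y_given_a = [entropy([6/8, 2/8]),              # H(Y | A = a1)
               entropy([3/6, 3/6])]              # H(Y | A = a2)
h_cond = sum(p * h for p, h in zip(p_a, h_y_given_a))   # H(Y | A)
info_gain = h_y - h_cond                                # IG(Y, A)
print(f"H(Y) = {h_y:.3f}, H(Y|A) = {h_cond:.3f}, IG = {info_gain:.3f}")  # IG ~ 0.048 here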
Properties of Information Gain
Non-negativity: Information gain is always non-negative, since knowing more information never increases uncertainty.
Applications
- Decision Trees: Information gain is used to select the best attribute to split the data at each node in the tree, such as in algorithms like ID3, C4.5, and CART.
- Feature Selection: In machine learning, information gain helps in selecting the most informative features for building models.
IV. Mutual Information.
Mutual information is a measure from information theory that quantifies the amount of information obtained about one random variable through another random variable. It is a symmetric measure and provides insights into the dependency between two variables.
For two discrete random variables $X$ and $Y$, the mutual information $I(X; Y)$ is defined as: $I(X; Y) = \sum_{x} \sum_{y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}$
Where:
- $p(x, y)$ is the joint probability distribution function of $X$ and $Y$.
- $p(x)$ and $p(y)$ are the marginal probability distribution functions of $X$ and $Y$, respectively.
Mutual information measures the reduction in uncertainty of one variable due to the knowledge of another. If $X$ and $Y$ are independent, knowing $X$ does not provide any information about $Y$, and vice versa, so their mutual information is zero. Conversely, if $X$ and $Y$ are strongly dependent, knowing $X$ reduces the uncertainty about $Y$, resulting in higher mutual information.
Relationship with Entropy
Mutual information can also be expressed in terms of entropy: $I(X; Y) = H(X) + H(Y) - H(X, Y)$
Where:
- $H(X)$ is the entropy of $X$.
- $H(Y)$ is the entropy of $Y$.
- $H(X, Y)$ is the joint entropy of $X$ and $Y$.
Another useful form is: $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$
Where:
- $H(X \mid Y)$ is the conditional entropy of $X$ given $Y$.
- $H(Y \mid X)$ is the conditional entropy of $Y$ given $X$.
Example : Consider two binary random variables $X$ and $Y$ with the following joint probability distribution:
| $x$ | $y$ | $p(x, y)$ |
| --- | --- | --- |
| 0 | 0 | 0.1 |
| 0 | 1 | 0.4 |
| 1 | 0 | 0.2 |
| 1 | 1 | 0.3 |
The marginal probabilities are: $p(X=0) = 0.5$, $p(X=1) = 0.5$, $p(Y=0) = 0.3$, $p(Y=1) = 0.7$.
The mutual information can be calculated as: $I(X; Y) = 0.1 \log_2 \frac{0.1}{0.5 \cdot 0.3} + 0.4 \log_2 \frac{0.4}{0.5 \cdot 0.7} + 0.2 \log_2 \frac{0.2}{0.5 \cdot 0.3} + 0.3 \log_2 \frac{0.3}{0.5 \cdot 0.7} \approx 0.035$
So, the mutual information is approximately 0.035 bits, indicating a small amount of dependency between $X$ and $Y$.
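This value can be verified numerically (a minimal Python sketch using the joint table above):
import numpy as np
# Joint distribution p(x, y) from the table: rows index x, columns index y.
p_xy = np.array([[0.1, 0.4],
                 [0.2, 0.3]])
p_x = p_xy.sum(axis=1)             # marginal p(x) = [0.5, 0.5]
p_y = p_xy.sum(axis=0)             # marginal p(y) = [0.3, 0.7]
mi = sum(p_xy[i, j] * np.log2(p_xy[i, j] / (p_x[i] * p_y[j]))
         for i in range(2) for j in range(2))
print(f"I(X; Y) = {mi:.3f} bits")  # ~0.035 bits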
Properties
- Non-negativity: Mutual information is always non-negative: $I(X; Y) \ge 0$
- Symmetry: $I(X; Y) = I(Y; X)$.
- Zero Mutual Information: If $X$ and $Y$ are independent, $I(X; Y) = 0$.
- Bounds: $I(X; Y) \le \min\big(H(X), H(Y)\big)$.
Applications
- Feature Selection: In machine learning, mutual information can be used to select features that have the most information about the target variable.
- Clustering: Mutual information is used to measure the similarity between clusters.
- Dependency Detection: It helps in detecting and quantifying dependencies between variables in statistical analysis.
V. Kullback-Leibler (KL) divergence.
The Kullback-Leibler (KL) divergence, also known as relative entropy, is a measure from information theory that quantifies how one probability distribution diverges from a second, reference probability distribution.
For two probability distributions $P$ and $Q$ defined on the same probability space, the KL divergence of $P$ from $Q$ is defined as: $D_{KL}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log_2 \frac{P(x)}{Q(x)}$
In the continuous case, the KL divergence is defined as: $D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log_2 \frac{p(x)}{q(x)}\, dx$
Where:
- $P$ and $Q$ (or the densities $p$ and $q$ in the continuous case) are the probability distributions.
- $\mathcal{X}$ is the set of possible outcomes.
KL divergence measures the expected number of extra bits required to code samples from $P$ using a code optimized for $Q$ rather than for the true distribution $P$. It can be interpreted as a measure of information loss when $Q$ is used to approximate $P$.
Example : Consider two discrete probability distributions $P$ and $Q$ over a binary variable $x$:
- $P(x = 0) = 0.8$, $P(x = 1) = 0.2$
- $Q(x = 0) = 0.5$, $Q(x = 1) = 0.5$
The KL divergence of $P$ from $Q$ is: $D_{KL}(P \,\|\, Q) = 0.8 \log_2 \frac{0.8}{0.5} + 0.2 \log_2 \frac{0.2}{0.5} \approx 0.542 - 0.264 = 0.278$
So, the KL divergence is approximately 0.278 bits, indicating that there is an information loss when using to approximate .
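A quick numeric check of this example, and of the asymmetry property discussed below (a minimal Python sketch):
import numpy as np
p = np.array([0.8, 0.2])   # "true" distribution P
q = np.array([0.5, 0.5])   # approximating distribution Q
kl_pq = float(np.sum(p * np.log2(p / q)))   # D_KL(P || Q)
kl_qp = float(np.sum(q * np.log2(q / p)))   # D_KL(Q || P) differs from D_KL(P || Q)
print(f"D_KL(P||Q) = {kl_pq:.3f} bits")     # ~0.278 bits
print(f"D_KL(Q||P) = {kl_qp:.3f} bits")     # ~0.322 bits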
Properties
- Non-negativity: KL divergence is always non-negative, i.e., $D_{KL}(P \,\|\, Q) \ge 0$, with equality if and only if $P = Q$ almost everywhere.
- Asymmetry: In general, $D_{KL}(P \,\|\, Q) \ne D_{KL}(Q \,\|\, P)$. This means the divergence of $P$ from $Q$ is not the same as the divergence of $Q$ from $P$.
- Not a True Metric: Since KL divergence is asymmetric and does not satisfy the triangle inequality, it is not a true metric.
Applications
- Machine Learning: KL divergence is used in various machine learning algorithms, including variational inference and training of generative models like variational autoencoders.
- Information Retrieval: It is used to compare the similarity between different probability distributions, such as in document classification and clustering.
- Signal Processing: KL divergence helps in measuring the difference between actual and predicted signal distributions.
- Statistics: It is used to measure the goodness of fit of a model and to perform hypothesis testing.
VI. Applications of KL Divergence in Feature Selection.
KL divergence is a versatile tool in feature selection, particularly effective in text classification, image processing, and biomedical data analysis.
By quantifying the divergence between probability distributions of features across different classes, it identifies the most informative features. For instance, in text classification, KL divergence helps select terms that distinguish between document classes, while in image processing, it aids in identifying features that differentiate textures or objects.
Methods Using KL Divergence for Feature Selection
Univariate Feature Selection:
For each feature, compute the KL divergence between the distributions of the feature values in different classes.
Rank the features based on their KL divergence scores and select the top features with the highest scores.
Multivariate Feature Selection:
Compute the joint probability distributions of multiple features and use KL divergence to measure the divergence between these joint distributions across different classes.
Select combinations of features that collectively maximize the KL divergence, thus providing the most information about class distinctions.
Wrapper Methods:
Integrate KL divergence into wrapper methods, where a predictive model (e.g., a classifier) is trained iteratively with different subsets of features.
Use KL divergence to evaluate and rank the importance of features based on their impact on the model's performance.
Workflow Sample
Data Preprocessing:
Normalize or standardize the features to ensure they are on a comparable scale.
Discretize continuous features if necessary to estimate probability distributions.
Probability Distribution Estimation:
Estimate the probability distributions of each feature for different classes. This can be done using histograms, kernel density estimation, or other methods.
KL Divergence Calculation:
For each feature, compute the KL divergence between the distributions of the feature values in different classes.
Feature Ranking and Selection:
Rank the features based on their KL divergence scores.
Select the top features with the highest KL divergence scores for use in the predictive model.
Sample Code (using Python):
import numpy as np
from sklearn.feature_selection import mutual_info_classif
# Sample data: 100 samples, 10 features
X = np.random.rand(100, 10)
y = np.random.randint(2, size=100)
# Calculate KL divergence (using mutual information as an approximation)
kl_scores = mutual_info_classif(X, y, discrete_features='auto')
# Select top features based on KL divergence scores
top_k = 5
top_features = np.argsort(kl_scores)[-top_k:]
print("Top features based on KL divergence:", top_features)
mutual_info_classif from scikit-learn is used here to approximate KL-divergence-based feature scoring (mutual information is itself the KL divergence between the joint distribution and the product of the marginals).
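The workflow described above can also be implemented directly with class-conditional histograms rather than the mutual-information proxy. The following is a minimal sketch under simple assumptions (binary classes, equal-width bins shared across classes, a small smoothing constant to avoid division by zero); the function names and the bin count are illustrative choices:
import numpy as np
def kl_divergence(p, q, eps=1e-10):
    # Discrete KL divergence D_KL(P || Q) in bits, with light smoothing.
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log2(p / q)))
def kl_feature_scores(X, y, bins=10):
    # Score each feature by the KL divergence between its class-0 and class-1 histograms.
    scores = []
    for j in range(X.shape[1]):
        edges = np.histogram_bin_edges(X[:, j], bins=bins)
        h0, _ = np.histogram(X[y == 0, j], bins=edges)
        h1, _ = np.histogram(X[y == 1, j], bins=edges)
        scores.append(kl_divergence(h0.astype(float), h1.astype(float)))
    return np.array(scores)
X = np.random.rand(100, 10)
y = np.random.randint(2, size=100)
scores = kl_feature_scores(X, y)
print("Top features by KL score:", np.argsort(scores)[-5:])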
VII. Shannon's Theorem.
Shannon's theorem, also known as the Shannon-Hartley theorem or Shannon's channel capacity theorem, is a fundamental principle in information theory. It provides a formula for the maximum rate at which information can be transmitted over a communication channel of a specified bandwidth in the presence of noise, with an arbitrarily low probability of error.
The Shannon-Hartley theorem states that the channel capacity $C$, in bits per second (bps), of a communication channel is given by: $C = B \log_2\!\left(1 + \frac{S}{N}\right)$
Where:
- $C$ is the channel capacity in bits per second.
- $B$ is the bandwidth of the channel in hertz (Hz).
- $S$ is the average signal power.
- $N$ is the average noise power.
- $\frac{S}{N}$ is the signal-to-noise ratio (SNR).
Implications
- Maximum Data Rate: Shannon's theorem defines the theoretical upper limit on the data rate that can be achieved over a noisy channel. No matter how advanced the encoding or modulation techniques are, the data rate cannot exceed this limit.
- Error-Free Communication: The theorem assures that it is possible to transmit information over a noisy channel with an arbitrarily low error rate, as long as the transmission rate does not exceed the channel capacity.
- Impact of Bandwidth and SNR: Increasing the bandwidth or improving the signal-to-noise ratio will increase the channel capacity. This highlights the importance of both factors in designing communication systems.
Example : Consider a communication channel with a bandwidth of 3 kHz (3000 Hz) and a signal-to-noise ratio of 30 dB.
First, convert the SNR from decibels to a linear scale: $\text{SNR} = 10^{30/10} = 1000$
Then, apply the Shannon-Hartley theorem: $C = 3000 \log_2(1 + 1000) = 3000 \log_2(1001)$
Using the approximation $\log_2(1001) \approx 9.97$: $C \approx 3000 \times 9.97 \approx 29{,}910$ bps
So, the maximum achievable data rate for this channel is approximately 29.91 kbps.
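The same calculation in Python (a minimal sketch; the function name channel_capacity_bps is illustrative):
import math
def channel_capacity_bps(bandwidth_hz, snr_db):
    # Shannon-Hartley capacity C = B * log2(1 + S/N), with the SNR given in dB.
    snr_linear = 10 ** (snr_db / 10)
    return bandwidth_hz * math.log2(1 + snr_linear)
print(f"{channel_capacity_bps(3000, 30):,.0f} bps")   # ~29,902 bps, i.e. ~29.9 kbps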
Shannon's theorem guides the design and evaluation of communication systems, ensuring efficient and reliable data transmission even in the presence of noise.
VIII. Coding Theory Basics.
Coding theory is a branch of mathematics and computer science that deals with the design of error-correcting codes for reliable data transmission and storage. The primary goal of coding theory is to detect and correct errors introduced during the transmission or storage of data.
Key Concepts
Codes:
A code is a set of strings (called codewords) over an alphabet. The alphabet is typically binary (0, 1), but it can be any set of symbols.
A code can be used to represent data in a way that allows for error detection and correction.
Encoding and Decoding:
Encoding: The process of converting data into a codeword using a specific algorithm.
Decoding: The process of interpreting a received codeword, possibly correcting any errors, to recover the original data.
Error Detection and Correction:
Error Detection: Identifying whether an error has occurred during data transmission or storage.
Error Correction: Identifying and correcting errors to retrieve the original data.
Types of Codes
Block Codes:
Definition: A block code encodes a fixed number of information bits into a fixed number of code bits.
Examples: Hamming codes, Reed-Solomon codes, BCH codes.
Parameters: A block code is characterized by $(n, k)$, where $n$ is the length of the codeword and $k$ is the number of information bits.
Convolutional Codes:
Definition: A convolutional code encodes data by applying a set of linear operations to a sequence of input bits, producing a sequence of output bits.
Usage: Commonly used in real-time communication systems.
Parameters: Defined by the code rate $(k/n)$ and the constraint length.
Linear Codes:
Definition: A linear code is a block code where any linear combination of codewords is also a codeword.
Examples: Hamming codes, Cyclic codes.
Properties: Easy to encode and decode using linear algebra techniques.
Key Metrics
Hamming Distance:
The number of positions at which two codewords differ.
The minimum Hamming distance of a code determines its error-detecting and error-correcting capabilities.
Code Rate:
The ratio of the number of information bits to the total number of bits in the codeword (k/n).
Higher code rates are more efficient but provide less error protection.
Error Detection and Correction Capability:
Error Detection: A code with minimum Hamming distance $d$ can detect up to $d - 1$ errors.
Error Correction: A code with minimum Hamming distance $d$ can correct up to $\lfloor (d - 1)/2 \rfloor$ errors (a small numeric sketch follows).
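A small Python sketch of these metrics, using a toy (3, 1) repetition code as the example:
def hamming_distance(a, b):
    # Number of positions at which two equal-length codewords differ.
    return sum(x != y for x, y in zip(a, b))
codebook = ["000", "111"]          # the two codewords of the (3, 1) repetition code
d_min = min(hamming_distance(a, b)
            for i, a in enumerate(codebook) for b in codebook[i + 1:])
n, k = 3, 1
print("code rate k/n:", k / n)                       # ~0.33
print("minimum distance d:", d_min)                  # 3
print("detects up to", d_min - 1, "errors")          # 2
print("corrects up to", (d_min - 1) // 2, "errors")  # 1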
Example: Hamming Code (7,4)
Consider a simple Hamming code with parameters (7, 4), which encodes 4 information bits into 7-bit codewords. The code can detect and correct single-bit errors.
- Encoding: Given a 4-bit message, the encoder adds 3 parity bits to form a 7-bit codeword.
- Decoding: The receiver checks the parity bits to detect and correct any single-bit errors in the received codeword.
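A minimal Python sketch of this encode/decode cycle, using one common systematic form of the (7, 4) Hamming code (the generator and parity-check matrices below are an illustrative convention, not the only possible bit ordering):
import numpy as np
# Systematic Hamming(7,4): codeword = [d1 d2 d3 d4 p1 p2 p3]
P = np.array([[1, 1, 0],
              [1, 0, 1],
              [0, 1, 1],
              [1, 1, 1]])
G = np.hstack([np.eye(4, dtype=int), P])        # 4x7 generator matrix
H = np.hstack([P.T, np.eye(3, dtype=int)])      # 3x7 parity-check matrix
def encode(data_bits):
    # Encode 4 data bits into a 7-bit codeword (mod-2 arithmetic).
    return np.dot(data_bits, G) % 2
def decode(received):
    # Correct a single-bit error (if any) and return the 4 data bits.
    syndrome = np.dot(H, received) % 2
    if syndrome.any():
        # The syndrome equals the column of H at the error position.
        error_pos = np.where((H.T == syndrome).all(axis=1))[0][0]
        received = received.copy()
        received[error_pos] ^= 1
    return received[:4]
msg = np.array([1, 0, 1, 1])
corrupted = encode(msg)
corrupted[2] ^= 1                               # flip one bit to simulate channel noise
print("decoded:", decode(corrupted))            # recovers [1 0 1 1]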
Applications
- Digital Communication: Ensuring reliable data transmission over noisy channels (e.g., satellite communication, mobile networks).
- Data Storage: Protecting data on storage devices (e.g., hard drives, CDs, DVDs) from corruption.
- Cryptography: Enhancing security by detecting and correcting errors in cryptographic algorithms.
- Barcodes and QR Codes: Enabling error detection and correction in optical data storage and retrieval.