
Mathematics for Artificial Intelligence : Probability and Statistics

 A simplified guide to preparing in Mathematics for Artificial Intelligence, Machine Learning and Data Science: Probability and Statistics (important pointers only)

 

Module - III : Probability and Statistics

I. Probability Axioms and Rules.

1. Probability Axioms (Kolmogorov Axioms)

  • Non-negativity: For any event A, the probability of A is a non-negative number.
P(A) \geq 0
  • Normalization: The probability of the entire sample space S is 1.
P(S) = 1
  • Additivity: For any two mutually exclusive (disjoint) events A and B (i.e., events that cannot both occur at the same time), the probability of their union is the sum of their probabilities.
P(A \cup B) = P(A) + P(B) \quad \text{if} \quad A \cap B = \emptyset

 2. Derived Rules

  • Complement Rule: The probability of the complement of an event A (i.e., the event that A does not occur) is given by:
P(A^c) = 1 - P(A)
  • Union of Two Events: For any two events A and B, the probability of their union is given by:
P(A \cup B) = P(A) + P(B) - P(A \cap B)
  • Conditional Probability: The probability of event A given that event B has occurred is defined as:
P(A|B) = \frac{P(A \cap B)}{P(B)} \quad \text{if} \quad P(B) > 0
  • Multiplication Rule: For any two events A and B, the probability of both A and B occurring is:
P(A \cap B) = P(A|B) \cdot P(B)
  • Total Probability Theorem: If B_1, B_2, \ldots, B_n are mutually exclusive and exhaustive events, then for any event A:
P(A) = \sum_{i=1}^{n} P(A|B_i) \cdot P(B_i)
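As a quick sanity check, the total probability theorem can be sketched in a few lines of Python. The setup below (two machines producing 60% and 40% of output, with assumed defect rates of 2% and 5%) is a made-up example, not from the text:

```python
# Total probability: P(A) = sum_i P(A|B_i) * P(B_i)
# Hypothetical example: two machines produce 60% and 40% of all items,
# with defect rates of 2% and 5% respectively.
priors = [0.6, 0.4]          # P(B_i): share of output from each machine
likelihoods = [0.02, 0.05]   # P(A|B_i): defect rate of each machine

# Overall probability that a randomly chosen item is defective
p_defect = sum(l * p for l, p in zip(likelihoods, priors))
print(round(p_defect, 3))  # 0.032
```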

II. Conditional Probability and Bayes' Theorem.

The probability of an event occurring given that another event has already occurred. It is denoted as P(A|B), which reads as "the probability of A given B."

It is defined as:

P(A|B) = \frac{P(A \cap B)}{P(B)}

where:

  • P(A \cap B) is the probability that both events A and B occur.
  • P(B) is the probability that event B occurs, provided P(B) > 0.

Example: Drawing Cards from a Deck

Suppose you have a standard deck of 52 cards. Let A be the event that the card drawn is an ace, and let B be the event that the card drawn is a spade.

  • There are 4 aces in the deck.
  • There are 13 spades in the deck.
  • There is 1 ace of spades in the deck.

To find P(A|B), the probability that the card drawn is an ace given that it's a spade:

P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{\text{Number of aces that are spades}}{\text{Number of spades}} = \frac{1}{13} \approx 0.0769
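The card example can be verified by simply enumerating the deck (a small Python sketch; the rank and suit labels are just illustrative):

```python
from fractions import Fraction
from itertools import product

# Build a standard 52-card deck as (rank, suit) pairs
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))  # 52 cards

spades = [c for c in deck if c[1] == "spades"]
ace_and_spade = [c for c in deck if c[0] == "A" and c[1] == "spades"]

# P(A|B) = |A ∩ B| / |B| when every card is equally likely
p_ace_given_spade = Fraction(len(ace_and_spade), len(spades))
print(p_ace_given_spade)  # 1/13
```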

Properties:

  • Non-negativity: 0 \leq P(A|B) \leq 1
  • Normalization: P(A|B) + P(A^c|B) = 1
  • Multiplication Rule: P(A \cap B) = P(A|B) \cdot P(B)
Bayes' Theorem

Bayes' Theorem relates the conditional probability of event A given B to the conditional probability of event B given A, as well as the probabilities of A and B. The formula is as follows:

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

where:

P(A|B) is the conditional probability of A given B.
P(B|A) is the conditional probability of B given A.
P(A) is the prior probability of A.
P(B) is the prior probability of B.

Aspects of Bayes' Theorem

Prior Probability (P(A)): This is the initial probability of event A before any new evidence is considered.

Likelihood (P(B|A)): This is the probability of observing event B given that A is true.

Marginal Probability (P(B)): This is the total probability of observing event B, which can be found by considering all possible ways B can occur. If we have a set of mutually exclusive and exhaustive events A_1, A_2, \ldots, A_n, the marginal probability P(B) can be calculated as:
P(B) = \sum_{i=1}^{n} P(B|A_i) \cdot P(A_i)

Posterior Probability (P(A|B)): This is the updated probability of event A after considering the new evidence B.
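All four quantities can be seen together in a short Python sketch. The numbers below describe a hypothetical diagnostic test (the prevalence, sensitivity, and false-positive rate are assumed for illustration, not from the text):

```python
# Hypothetical diagnostic-test example of Bayes' theorem.
prior = 0.01            # P(A): disease prevalence (prior)
sensitivity = 0.99      # P(B|A): test is positive given disease (likelihood)
false_positive = 0.05   # P(B|A^c): test is positive given no disease

# Marginal probability of a positive test: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
evidence = sensitivity * prior + false_positive * (1 - prior)

# Posterior: P(A|B) = P(B|A)P(A) / P(B)
posterior = sensitivity * prior / evidence
print(round(posterior, 4))  # 0.1667
```

Even with a 99%-sensitive test, a positive result only raises the probability of disease to about 17%, because the prior is so low — which is exactly the intuition Bayes' theorem formalizes.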

 

III. Random Variables and Probability Distributions.

 1. Random Variables

A random variable is a numerical outcome of a random phenomenon. There are two types:

  • Discrete Random Variables: These take on a countable number of distinct values.
  • Continuous Random Variables: These take on uncountably many possible values within a given range.

 2. Probability Distributions

A probability distribution describes how the values of a random variable are distributed, providing the probabilities of occurrence of the different possible outcomes.

(i) Discrete Probability Distributions

For a discrete random variable, the probability distribution is described by a probability mass function (PMF), P(X = x), which gives the probability that the random variable X equals a specific value x.

Example:

Consider a fair six-sided die. Let X be the random variable representing the outcome of a roll. The PMF is:

P(X = x) = \begin{cases} \frac{1}{6} & \text{if } x \in \{1, 2, 3, 4, 5, 6\} \\ 0 & \text{otherwise} \end{cases}

(ii) Continuous Probability Distributions

For a continuous random variable, the probability distribution is described by a probability density function (PDF), f(x), which gives the relative likelihood of the random variable taking on a particular value. The probability that X lies within an interval [a, b] is given by the integral of the PDF over that interval:

P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx

Example:

Consider a continuous random variable X that is uniformly distributed between 0 and 1. The PDF is:

f(x) = \begin{cases} 1 & \text{if } 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases}

3. Cumulative Distribution Function (CDF)

The cumulative distribution function (CDF) of a random variable X is a function F(x) that gives the probability that X will take a value less than or equal to x:

F(x) = P(X \leq x)

For a discrete random variable, the CDF is the sum of the probabilities up to x:

F(x) = \sum_{t \leq x} P(X = t)

For a continuous random variable, the CDF is the integral of the PDF up to x:

F(x) = \int_{-\infty}^{x} f(t) \, dt
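The discrete case can be sketched in Python for the fair-die PMF from earlier, using exact fractions:

```python
from fractions import Fraction

# PMF of a fair six-sided die: P(X = x) = 1/6 for x in 1..6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x): sum the PMF over all outcomes t <= x."""
    return sum(p for t, p in pmf.items() if t <= x)

print(cdf(4))  # 2/3
print(cdf(0))  # 0  (no outcome is <= 0)
print(cdf(6))  # 1  (all outcomes are <= 6)
```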

4. Common Probability Distributions

  1. Discrete Distributions:

    Binomial Distribution: Describes the number of successes in a fixed number of independent Bernoulli trials (e.g., flipping a coin n times).
    Poisson Distribution: Describes the number of events occurring in a fixed interval of time or space (e.g., number of emails received in an hour).
  2. Continuous Distributions:

    Normal Distribution: Describes a continuous random variable with a bell-shaped probability density function (e.g., heights of people).
    Exponential Distribution: Describes the time between events in a Poisson process (e.g., time until the next phone call at a call center).

5. Expected Value and Variance

  • Expected Value (Mean): The expected value of a random variable X provides a measure of the center of its distribution. For a discrete random variable, it is calculated as:
    E(X) = \sum_{x} x \cdot P(X = x)

    For a continuous random variable, it is calculated as:

    E(X) = \int_{-\infty}^{\infty} x \cdot f(x) \, dx
  • Variance: The variance of a random variable X measures the spread of its distribution. For a discrete random variable, it is calculated as:

    \text{Var}(X) = E[(X - E(X))^2] = \sum_{x} (x - E(X))^2 \cdot P(X = x)

    For a continuous random variable, it is calculated as:

    \text{Var}(X) = \int_{-\infty}^{\infty} (x - E(X))^2 \cdot f(x) \, dx
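For the fair six-sided die used earlier, both formulas can be evaluated exactly with Python's fractions module:

```python
from fractions import Fraction

# PMF of a fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E(X) = sum over x of x * P(X = x)
mean = sum(x * p for x, p in pmf.items())

# Var(X) = sum over x of (x - E(X))^2 * P(X = x)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(mean)  # 7/2
print(var)   # 35/12
```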

IV. Descriptive Statistics.

 Descriptive statistics summarize the main features of a dataset. Three key measures of central tendency are the mean, median, and mode.

1. Mean

The mean (or average) is the sum of all the values in a dataset divided by the number of values. It provides a measure of the central location of the data.

For a dataset with n values x_1, x_2, \ldots, x_n, the mean \mu is calculated as:

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i

2. Median

The median is the middle value of a dataset when the values are arranged in ascending or descending order. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.

Steps to find the median:

  1. Arrange the data in ascending order.
  2. If the number of observations n is odd, the median is the middle value.
  3. If n is even, the median is the average of the two middle values.

3. Mode

The mode is the value that appears most frequently in a dataset. A dataset can have more than one mode if multiple values have the same highest frequency, or no mode if all values are unique.

Comparison

  • Mean: Sensitive to outliers (extreme values) because it considers all values in the dataset. It is a good measure of central tendency for symmetric distributions without outliers.
  • Median: Not sensitive to outliers. It is a better measure of central tendency for skewed distributions or datasets with outliers.
  • Mode: Useful for categorical data or when identifying the most common value in a dataset. It can be less informative for continuous data or datasets with no repeated values.

Each measure of central tendency provides different insights, and the choice of which to use depends on the nature of the data and the specific context of the analysis.
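Python's standard statistics module computes all three measures directly. The small dataset below is hypothetical, with one outlier (100) included to show the sensitivity difference described above:

```python
import statistics

# Hypothetical dataset with a single large outlier
data = [2, 3, 3, 5, 7, 100]

# Mean is pulled strongly toward the outlier
print(statistics.mean(data))    # 20

# Median stays near the bulk of the data
print(statistics.median(data))  # 4.0

# Mode is the most frequent value
print(statistics.mode(data))    # 3
```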

 

V. Point Estimation and Confidence Intervals.

 1. Point Estimation

Point estimation involves the use of sample data to calculate a single value (known as a statistic) which serves as the best guess or estimate of an unknown population parameter.

Common Point Estimators

  1. Mean (μ): The sample mean (\bar{x}) is used to estimate the population mean.

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
  2. Proportion (p): The sample proportion (\hat{p}) is used to estimate the population proportion.

    \hat{p} = \frac{x}{n}

    where x is the number of successes in the sample and n is the sample size.
  3. Variance (σ²): The sample variance (s^2) is used to estimate the population variance.

    s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

Properties of Point Estimators:

  • Unbiasedness: An estimator is unbiased if the expected value of the estimator equals the population parameter.
  • Consistency: An estimator is consistent if it converges to the true parameter value as the sample size increases.
  • Efficiency: An estimator is efficient if it has the smallest variance among all unbiased estimators.
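Unbiasedness is why the sample variance divides by n-1 rather than n. A small Monte Carlo sketch can illustrate this (the sample size, trial count, and standard-normal population with σ² = 1 are all assumptions chosen for illustration):

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

# Draw many small samples from a standard normal (true variance = 1)
# and compare the n-divisor and (n-1)-divisor variance estimators.
n, trials = 5, 20000
biased, unbiased = [], []
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    m = statistics.fmean(sample)
    ss = sum((x - m) ** 2 for x in sample)
    biased.append(ss / n)          # divides by n: systematically too small
    unbiased.append(ss / (n - 1))  # divides by n-1: unbiased

# The biased estimator averages near (n-1)/n = 0.8; the unbiased near 1.0
print(round(statistics.fmean(biased), 2))
print(round(statistics.fmean(unbiased), 2))
```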

2. Confidence Intervals

Confidence intervals provide a range of values, derived from the sample data, that is likely to contain the value of an unknown population parameter. The interval is associated with a confidence level, typically expressed as a percentage (e.g., 95%, 99%).

General Form of a Confidence Interval

A confidence interval for a population parameter \theta is generally given by:

\text{Point Estimate} \pm \text{Margin of Error}

The margin of error reflects the precision of the estimate and is affected by the variability in the data and the sample size.

A 95% confidence interval means that if we were to take 100 different samples and compute a confidence interval for each sample, we would expect about 95 of the intervals to contain the true population parameter. Confidence intervals provide a range of plausible values for the parameter, giving a sense of the precision and reliability of the estimate.
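A 95% interval for a mean can be sketched with the standard library, using the normal critical value z ≈ 1.96. The sample below is hypothetical, and for a sample this small a t critical value would be more exact; the normal value is used here only to keep the sketch simple:

```python
import math
import statistics

# Hypothetical sample of 8 measurements
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
n = len(sample)

xbar = statistics.fmean(sample)   # point estimate of the mean
s = statistics.stdev(sample)      # sample standard deviation (n-1 divisor)

# Margin of error: critical value * standard error of the mean
margin = 1.96 * s / math.sqrt(n)

lower, upper = xbar - margin, xbar + margin
print(round(lower, 2), round(upper, 2))  # 11.88 12.22
```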


VI. Hypothesis Testing.

 A statistical method used to make decisions or inferences about a population based on sample data.

 Key Concepts in Hypothesis Testing

  • Null Hypothesis (H_0): The statement being tested, usually a statement of no effect or no difference. It is assumed true until evidence indicates otherwise.
  • Alternative Hypothesis (H_1 or H_a): The statement we want to test against the null hypothesis. It represents a new effect or difference.
  • Test Statistic: A standardized value calculated from sample data, used to decide whether to reject the null hypothesis. The form of the test statistic depends on the type of data and the hypothesis test being performed.
  • P-value: The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis.
  • Significance Level (\alpha): The threshold for rejecting the null hypothesis. Common choices are 0.05, 0.01, or 0.10.
  • Decision Rule: A rule based on the p-value and the significance level. If \text{p-value} \leq \alpha, reject H_0; otherwise, fail to reject H_0.

Steps in Hypothesis Testing

  1. State the Hypotheses: Formulate the null and alternative hypotheses.

    H_0: \text{parameter} = \text{value} \quad H_1: \text{parameter} \neq \text{value} \quad \text{(or >, <)}
  2. Choose the Significance Level (\alpha): Decide on the level of significance for the test (e.g., \alpha = 0.05).

  3. Calculate the Test Statistic: Compute the test statistic based on the sample data.

  4. Determine the P-value: Find the p-value corresponding to the test statistic.

  5. Make a Decision: Compare the p-value to the significance level and decide whether to reject or fail to reject the null hypothesis.

Types of Hypothesis Tests

(i) One-Sample Z-Test for Means

Used when the population standard deviation is known and the sample size is large (n \geq 30).

Example: Testing whether the mean height of a population is 170 cm based on a sample of 50 people.

H_0: \mu = 170 \quad H_1: \mu \neq 170

Test statistic:

z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
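Plugging hypothetical numbers into this statistic (the sample mean of 172 cm and σ = 6 cm are assumed for illustration; only n = 50 and μ₀ = 170 come from the example above):

```python
import math

# Hypothetical data for the height example
n, xbar, mu0, sigma = 50, 172.0, 170.0, 6.0

# z = (x̄ - μ0) / (σ / √n)
z = (xbar - mu0) / (sigma / math.sqrt(n))
print(round(z, 2))  # 2.36

# Two-sided p-value under the standard normal, via the complementary
# error function: p = P(|Z| >= |z|) = erfc(|z| / √2)
p_value = math.erfc(abs(z) / math.sqrt(2))
print(p_value < 0.05)  # True: reject H0 at alpha = 0.05
```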

(ii) One-Sample T-Test for Means

Used when the population standard deviation is unknown and the sample size is small (n < 30).

Example: Testing whether the mean score of students is 75 based on a sample of 15 students.

H_0: \mu = 75 \quad H_1: \mu \neq 75

Test statistic:

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
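With a hypothetical list of 15 scores, the statistic can be computed directly (the scores are made up; only n = 15 and μ₀ = 75 come from the example above):

```python
import math
import statistics

# Hypothetical scores of 15 students
scores = [72, 78, 74, 80, 76, 73, 77, 79, 71, 75, 74, 78, 76, 72, 73]
n = len(scores)
mu0 = 75

xbar = statistics.fmean(scores)   # sample mean
s = statistics.stdev(scores)      # sample standard deviation (n-1 divisor)

# t = (x̄ - μ0) / (s / √n)
t = (xbar - mu0) / (s / math.sqrt(n))
print(round(t, 2))  # 0.28 — a small t, so H0 would not be rejected
```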

(iii) One-Sample Z-Test for Proportions

Used to test hypotheses about population proportions.

Example: Testing whether the proportion of defective items is 0.05 based on a sample of 200 items.

H_0: p = 0.05 \quad H_1: p \neq 0.05

Test statistic:

z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}

(iv) Two-Sample T-Test for Means

Used to compare the means of two independent groups.

Example: Testing whether the mean scores of two classes are different.

H_0: \mu_1 = \mu_2 \quad H_1: \mu_1 \neq \mu_2

Test statistic:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
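A sketch with two hypothetical classes of scores; note this uses the unpooled (Welch-style) standard error, matching the statistic above:

```python
import math
import statistics

# Hypothetical scores for two independent classes
class1 = [78, 85, 82, 88, 75, 80, 84]
class2 = [72, 79, 74, 70, 76, 73, 77]

m1, m2 = statistics.fmean(class1), statistics.fmean(class2)
v1, v2 = statistics.variance(class1), statistics.variance(class2)  # s^2, n-1 divisor
n1, n2 = len(class1), len(class2)

# t = (x̄1 - x̄2) / sqrt(s1²/n1 + s2²/n2)
t = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)
print(round(t, 2))  # 3.57 — a large t, suggesting the means differ
```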


VII. Expectation and Variance.

Expectation and variance were already covered as a subsection of random variables, so here is a summary of the equations:

Expected Value (Mean):

  • Discrete: E(X) = \sum_{i} x_i \cdot p_i
  • Continuous: E(X) = \int_{-\infty}^{\infty} x \cdot f(x) \, dx

Variance:

  • Discrete: \text{Var}(X) = \sum_{i} (x_i - \mu)^2 \cdot p_i or \text{Var}(X) = E(X^2) - [E(X)]^2
  • Continuous: \text{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x) \, dx or \text{Var}(X) = E(X^2) - [E(X)]^2

VIII. Bayesian Inference.

Bayesian inference allows us to incorporate prior knowledge and update our beliefs in light of new evidence.

It is a method of statistical inference in which Bayes' theorem is used to update the probability of a hypothesis as more evidence or information becomes available. It combines prior beliefs with new data to form a posterior belief.

Key Concepts

  • Prior Probability (Prior): The initial belief about the probability of a hypothesis before new data is considered, denoted as P(H).
  • Likelihood: The probability of observing the data given the hypothesis, denoted as P(D \mid H).
  • Posterior Probability (Posterior): The updated belief about the probability of a hypothesis after considering new data, denoted as P(H \mid D).
  • Marginal Likelihood (Evidence): The total probability of the data under all possible hypotheses, denoted as P(D).

To recall Bayes' Theorem:

P(H \mid D) = \frac{P(D \mid H) \cdot P(H)}{P(D)}

where:

  • P(H \mid D) is the posterior probability.
  • P(D \mid H) is the likelihood.
  • P(H) is the prior probability.
  • P(D) is the marginal likelihood.

The marginal likelihood is calculated as:

P(D) = \sum_{i} P(D \mid H_i) \cdot P(H_i)

for discrete hypotheses, or:

P(D) = \int P(D \mid \theta) \cdot P(\theta) \, d\theta

for continuous hypotheses.

Steps in Bayesian Inference

  • Specify the Prior: Determine the prior distribution P(H) based on previous knowledge or beliefs about the hypothesis.
  • Collect Data: Obtain new data D.
  • Compute the Likelihood: Calculate the likelihood P(D \mid H) of observing the data given the hypothesis.
  • Apply Bayes' Theorem: Use Bayes' theorem to update the prior distribution with the likelihood to obtain the posterior distribution P(H \mid D).
  • Make Inferences: Use the posterior distribution to make probabilistic statements or decisions about the hypothesis.
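The steps above can be sketched end to end for a toy problem: three candidate coin biases with a uniform prior, updated after observing 8 heads in 10 flips (all numbers assumed for illustration):

```python
from math import comb

# Step 1: specify the prior — three candidate values of P(heads),
# each believed equally likely a priori
hypotheses = [0.3, 0.5, 0.7]
prior = {h: 1 / 3 for h in hypotheses}

# Step 2: collect data — 8 heads observed in 10 flips
heads, flips = 8, 10

# Step 3: compute the binomial likelihood P(D | H) for each hypothesis
likelihood = {h: comb(flips, heads) * h**heads * (1 - h)**(flips - heads)
              for h in hypotheses}

# Step 4: apply Bayes' theorem — normalize by the marginal likelihood P(D)
evidence = sum(likelihood[h] * prior[h] for h in hypotheses)
posterior = {h: likelihood[h] * prior[h] / evidence for h in hypotheses}

# Step 5: make inferences — the data strongly favor a bias of 0.7
for h in hypotheses:
    print(h, round(posterior[h], 3))
```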

     

     

     
