A simplified guide to preparing in Mathematics for Artificial Intelligence, Machine Learning, and Data Science: Probability and Statistics (Important Pointers only)
Module - III : Probability and Statistics
I. Probability Axioms and Rules.
1. Probability Axioms (Kolmogorov Axioms)
- Non-negativity: For any event $A$, the probability of $A$ is a non-negative number: $P(A) \geq 0$.
- Normalization: The probability of the entire sample space $S$ is 1: $P(S) = 1$.
- Additivity: For any two mutually exclusive (disjoint) events $A$ and $B$ (i.e., events that cannot both occur at the same time), the probability of their union is the sum of their probabilities: $P(A \cup B) = P(A) + P(B)$.
2. Derived Rules
- Complement Rule: The probability of the complement of an event $A$ (i.e., the event that $A$ does not occur) is given by: $P(A^c) = 1 - P(A)$
- Union of Two Events: For any two events $A$ and $B$, the probability of their union is given by: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
- Conditional Probability: The probability of event $A$ given that event $B$ has occurred is defined as: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, provided $P(B) > 0$
- Multiplication Rule: For any two events $A$ and $B$, the probability of both $A$ and $B$ occurring is: $P(A \cap B) = P(A \mid B) \, P(B)$
- Total Probability Theorem: If $B_1, B_2, \dots, B_n$ are mutually exclusive and exhaustive events, then for any event $A$: $P(A) = \sum_{i=1}^{n} P(A \mid B_i) \, P(B_i)$
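These rules are easy to sanity-check numerically. The sketch below simulates rolls of a fair die with two illustrative events (A = "even roll", B = "roll of 5 or 6"; both are assumptions chosen for this example) and compares Monte Carlo estimates against the union and complement rules:

```python
import random

random.seed(0)
trials = 100_000
count_A = count_B = count_A_and_B = count_A_or_B = 0

for _ in range(trials):
    roll = random.randint(1, 6)
    in_A = roll % 2 == 0       # A: even roll -> P(A) = 3/6
    in_B = roll >= 5           # B: roll of 5 or 6 -> P(B) = 2/6
    count_A += in_A
    count_B += in_B
    count_A_and_B += in_A and in_B
    count_A_or_B += in_A or in_B

p_A = count_A / trials
p_B = count_B / trials
p_A_and_B = count_A_and_B / trials

# Union rule: P(A or B) = P(A) + P(B) - P(A and B)
print("P(A or B) simulated:      ", count_A_or_B / trials)
print("P(A) + P(B) - P(A and B):", p_A + p_B - p_A_and_B)

# Complement rule: P(not A) = 1 - P(A), should be close to 0.5
print("P(not A):", 1 - p_A)
```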
II. Conditional Probability and Bayes' Theorem.
It is defined as:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$
where:
- $P(A \cap B)$ is the probability that both events $A$ and $B$ occur.
- $P(B)$ is the probability that event $B$ occurs, provided $P(B) > 0$.
Example: Drawing Cards from a Deck
Suppose you have a standard deck of 52 cards. Let $A$ be the event that the card drawn is an ace, and let $B$ be the event that the card drawn is a spade.
- There are 4 aces in the deck.
- There are 13 spades in the deck.
- There is 1 ace of spades in the deck.
To find $P(A \mid B)$, the probability that the card drawn is an ace given that it's a spade:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1/52}{13/52} = \frac{1}{13}$$
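As a quick check, here is a small simulation sketch, assuming a standard 52-card deck encoded as (rank, suit) tuples, that estimates $P(A \mid B)$ by conditioning on spade draws:

```python
import random

random.seed(1)
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = [(rank, suit) for rank in ranks for suit in suits]

spade_draws = 0
ace_given_spade = 0
for _ in range(100_000):
    rank, suit = random.choice(deck)
    if suit == "spades":          # condition on event B: card is a spade
        spade_draws += 1
        if rank == "A":           # event A: card is an ace
            ace_given_spade += 1

print("Estimated P(ace | spade):", ace_given_spade / spade_draws)
print("Exact value 1/13:        ", 1 / 13)
```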
Properties:
- Non-negativity: $P(A \mid B) \geq 0$
- Normalization: $P(S \mid B) = 1$, where $S$ is the sample space
- Multiplication Rule: $P(A \cap B) = P(A \mid B) \, P(B)$
Bayes' Theorem relates the conditional probability of event $A$ given $B$ to the conditional probability of event $B$ given $A$, as well as the probabilities of $A$ and $B$. The formula is as follows:
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}$$
where:
- $P(B \mid A)$ is the conditional probability of $B$ given $A$.
- $P(A)$ is the prior probability of $A$.
- $P(B)$ is the prior probability of $B$.
Aspects of Bayes' Theorem
Prior Probability ($P(A)$): This is the initial probability of event $A$ before any new evidence is considered.
Likelihood ($P(B \mid A)$): This is the probability of observing event $B$ given that $A$ is true.
Marginal Probability ($P(B)$): This is the total probability of observing event $B$, which can be found by considering all possible ways $B$ can occur. If we have a set of mutually exclusive and exhaustive events $A_1, A_2, \dots, A_n$, the marginal probability can be calculated as:
$$P(B) = \sum_{i=1}^{n} P(B \mid A_i) \, P(A_i)$$
Posterior Probability ($P(A \mid B)$): This is the updated probability of event $A$ after considering the new evidence $B$.
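To see all four pieces together, the sketch below applies Bayes' theorem to a diagnostic-test setup; the prevalence and accuracy numbers are illustrative assumptions, not real data:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B), with the marginal
# P(B) expanded over the exhaustive events A and not-A.
p_disease = 0.01            # prior P(A): assumed prevalence
p_pos_given_disease = 0.95  # likelihood P(B|A): assumed sensitivity
p_pos_given_healthy = 0.05  # P(B|not A): assumed false-positive rate

# Marginal probability P(B) via the total probability theorem
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B): probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # ≈ 0.161
```

Note how a positive result from a fairly accurate test still yields a modest posterior, because the low prior dominates the update.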
III. Random Variables and Probability Distributions.
1. Random Variables
A random variable is a numerical outcome of a random phenomenon. There are two types:
- Discrete Random Variables: These take on a countable number of distinct values.
- Continuous Random Variables: These take on an infinite number of possible values within a given range.
2. Probability Distributions
A probability distribution describes how the values of a random variable are distributed, providing the probabilities of occurrence of the different possible outcomes.
(i) Discrete Probability Distributions
For a discrete random variable, the probability distribution is described by a probability mass function (PMF), $p_X(x) = P(X = x)$, which gives the probability that the random variable $X$ equals a specific value $x$.
Example:
Consider a fair six-sided die. Let $X$ be the random variable representing the outcome of a roll. The PMF is:
$$P(X = x) = \frac{1}{6}, \quad x \in \{1, 2, 3, 4, 5, 6\}$$
(ii) Continuous Probability Distributions
For a continuous random variable, the probability distribution is described by a probability density function (PDF), $f_X(x)$, which gives the relative likelihood of the random variable taking on a particular value. The probability that $X$ lies within an interval $[a, b]$ is given by the integral of the PDF over that interval:
$$P(a \leq X \leq b) = \int_a^b f_X(x) \, dx$$
Example:
Consider a continuous random variable $X$ that is uniformly distributed between 0 and 1. The PDF is:
$$f_X(x) = \begin{cases} 1 & 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases}$$
3. Cumulative Distribution Function (CDF)
The cumulative distribution function (CDF) of a random variable $X$ is a function that gives the probability that $X$ will take a value less than or equal to $x$:
$$F_X(x) = P(X \leq x)$$
For a discrete random variable, the CDF is the sum of the probabilities up to $x$:
$$F_X(x) = \sum_{t \leq x} p_X(t)$$
For a continuous random variable, the CDF is the integral of the PDF up to $x$:
$$F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt$$
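A short sketch tying the PMF, PDF, and CDF together, using scipy.stats (assumed available) for the fair-die and Uniform(0, 1) examples above:

```python
from scipy import stats

# Discrete case: a fair six-sided die. stats.randint(low, high) is the
# discrete uniform distribution on {low, ..., high - 1}.
die = stats.randint(low=1, high=7)
print("PMF  P(X = 3): ", die.pmf(3))      # 1/6 ≈ 0.1667
print("CDF  P(X <= 3):", die.cdf(3))      # 3/6 = 0.5

# Continuous case: Uniform(0, 1). The PDF is 1 on [0, 1].
u = stats.uniform(loc=0, scale=1)
print("PDF  f(0.4):     ", u.pdf(0.4))    # 1.0
print("CDF  P(X <= 0.4):", u.cdf(0.4))    # 0.4

# P(a <= X <= b) is a difference of CDF values: F(b) - F(a)
a, b = 0.2, 0.7
print("P(0.2 <= X <= 0.7):", u.cdf(b) - u.cdf(a))  # 0.5
```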
4. Common Probability Distributions
Discrete Distributions:
Binomial Distribution: Describes the number of successes in a fixed number of independent Bernoulli trials (e.g., flipping a coin $n$ times).
Poisson Distribution: Describes the number of events occurring in a fixed interval of time or space (e.g., number of emails received in an hour).
Continuous Distributions:
Normal Distribution: Describes a continuous random variable with a bell-shaped probability density function (e.g., heights of people).
Exponential Distribution: Describes the time between events in a Poisson process (e.g., time until the next phone call at a call center).
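A quick tour of these four distributions via scipy.stats (assumed available); all parameter values are illustrative:

```python
from scipy import stats

# Binomial: number of heads in n = 10 fair coin flips
print(stats.binom(n=10, p=0.5).pmf(5))         # P(exactly 5 heads)

# Poisson: emails per hour with an assumed rate of 4 per hour
print(stats.poisson(mu=4).pmf(6))              # P(exactly 6 emails)

# Normal: heights with assumed mean 170 cm, std dev 10 cm
print(stats.norm(loc=170, scale=10).cdf(180))  # P(height <= 180 cm)

# Exponential: waiting time in a Poisson process with rate 4 per hour
# (scipy parameterizes by scale = 1 / rate)
print(stats.expon(scale=1 / 4).cdf(0.5))       # P(wait <= half an hour)
```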
5. Expected Value and Variance
- Expected Value (Mean): The expected value of a random variable provides a measure of the center of its distribution. For a discrete random variable, it is calculated as:
$$E[X] = \sum_x x \, p_X(x)$$
For a continuous random variable, it is calculated as:
$$E[X] = \int_{-\infty}^{\infty} x \, f_X(x) \, dx$$
- Variance: The variance of a random variable measures the spread of its distribution. For a discrete random variable, it is calculated as:
$$\mathrm{Var}(X) = \sum_x (x - E[X])^2 \, p_X(x)$$
For a continuous random variable, it is calculated as:
$$\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - E[X])^2 \, f_X(x) \, dx$$
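As a worked example, the sketch below evaluates the discrete formulas directly on the fair die's PMF:

```python
# Computing E[X] and Var(X) from the fair die's PMF.
values = [1, 2, 3, 4, 5, 6]
pmf = {x: 1 / 6 for x in values}

mean = sum(x * pmf[x] for x in values)                    # E[X] = 3.5
variance = sum((x - mean) ** 2 * pmf[x] for x in values)  # 35/12

print("E[X]   =", mean)        # 3.5
print("Var(X) =", variance)    # ≈ 2.9167

# Equivalent shortcut: Var(X) = E[X^2] - (E[X])^2
e_x2 = sum(x ** 2 * pmf[x] for x in values)
print("E[X^2] - E[X]^2 =", e_x2 - mean ** 2)
```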
IV. Descriptive Statistics.
Descriptive statistics summarize the main features of a dataset. Three key measures of central tendency are the mean, median, and mode.
1. Mean
The mean (or average) is the sum of all the values in a dataset divided by the number of values. It provides a measure of the central location of the data.
For a dataset with values $x_1, x_2, \dots, x_n$, the mean is calculated as:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
2. Median
The median is the middle value of a dataset when the values are arranged in ascending or descending order. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.
Steps to find the median:
- Arrange the data in ascending order.
- If the number of observations $n$ is odd, the median is the middle value.
- If $n$ is even, the median is the average of the two middle values.
3. Mode
The mode is the value that appears most frequently in a dataset. A dataset can have more than one mode if multiple values have the same highest frequency, or no mode if all values are unique.
Comparison
- Mean: Sensitive to outliers (extreme values) because it considers all values in the dataset. It is a good measure of central tendency for symmetric distributions without outliers.
- Median: Not sensitive to outliers. It is a better measure of central tendency for skewed distributions or datasets with outliers.
- Mode: Useful for categorical data or when identifying the most common value in a dataset. It can be less informative for continuous data or datasets with no repeated values.
Each measure of central tendency provides different insights, and the choice of which to use depends on the nature of the data and the specific context of the analysis.
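A minimal demonstration of that comparison, using Python's standard statistics module on made-up salary figures: a single extreme value shifts the mean noticeably while the median and mode barely move.

```python
from statistics import mean, median, mode

salaries = [40, 42, 42, 45, 48, 50]     # illustrative values
with_outlier = salaries + [400]         # add one extreme value

print("mean:  ", mean(salaries), "->", mean(with_outlier))      # 44.5 -> ≈95.3
print("median:", median(salaries), "->", median(with_outlier))  # 43.5 -> 45
print("mode:  ", mode(salaries), "->", mode(with_outlier))      # 42 -> 42
```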
V. Point Estimation and Confidence Intervals.
1. Point Estimation
Point estimation involves the use of sample data to calculate a single value (known as a statistic) which serves as the best guess or estimate of an unknown population parameter.
Common Point Estimators
Mean (μ): The sample mean ($\bar{x}$) is used to estimate the population mean:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Proportion (p): The sample proportion ($\hat{p}$) is used to estimate the population proportion:
$$\hat{p} = \frac{x}{n}$$
where $x$ is the number of successes in the sample and $n$ is the sample size.
Variance (σ²): The sample variance ($s^2$) is used to estimate the population variance:
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Properties of Point Estimators:
- Unbiasedness: An estimator is unbiased if the expected value of the estimator equals the population parameter.
- Consistency: An estimator is consistent if it converges to the true parameter value as the sample size increases.
- Efficiency: An estimator is efficient if it has the smallest variance among all unbiased estimators.
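Unbiasedness in particular can be illustrated by simulation: averaging the sample variance (with its $n - 1$ divisor) over many samples should approach the true population variance. A sketch, assuming a standard normal population so the true variance is 1:

```python
import random
import statistics

random.seed(42)
n = 10
estimates = []
for _ in range(20_000):
    sample = [random.gauss(0, 1) for _ in range(n)]
    estimates.append(statistics.variance(sample))  # divides by n - 1

print("Average of s^2 over samples:", sum(estimates) / len(estimates))
print("True population variance:   ", 1.0)
```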
2. Confidence Intervals
Confidence intervals provide a range of values, derived from the sample data, that is likely to contain the value of an unknown population parameter. The interval is associated with a confidence level, typically expressed as a percentage (e.g., 95%, 99%).
General Form of a Confidence Interval
A confidence interval for a population parameter is generally given by:
$$\text{point estimate} \pm \text{margin of error}$$
For example, a confidence interval for a mean with known standard deviation $\sigma$ takes the form $\bar{x} \pm z_{\alpha/2} \, \frac{\sigma}{\sqrt{n}}$.
The margin of error reflects the precision of the estimate and is affected by the variability in the data and the sample size.
A 95% confidence interval means that if we were to take 100 different samples and compute a confidence interval for each sample, we would expect about 95 of the intervals to contain the true population parameter. Confidence intervals provide a range of plausible values for the parameter, giving a sense of the precision and reliability of the estimate.
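A sketch of computing a 95% confidence interval for a mean with scipy.stats; the sample here is simulated height data, an assumption made purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=170, scale=10, size=50)  # assumed heights, cm

n = len(sample)
x_bar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)             # standard error
t_crit = stats.t.ppf(0.975, df=n - 1)            # critical value

margin = t_crit * se                             # margin of error
print(f"95% CI: ({x_bar - margin:.2f}, {x_bar + margin:.2f})")

# scipy can produce the same interval directly:
print(stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=se))
```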
VI. Hypothesis Testing.
A statistical method used to make decisions or inferences about a population based on sample data.
Key Concepts in Hypothesis Testing
- Null Hypothesis ($H_0$): The statement being tested, usually a statement of no effect or no difference. It is assumed true until evidence indicates otherwise.
- Alternative Hypothesis ($H_1$ or $H_a$): The statement we want to test against the null hypothesis. It represents a new effect or difference.
- Test Statistic: A standardized value calculated from sample data, used to decide whether to reject the null hypothesis. The form of the test statistic depends on the type of data and the hypothesis test being performed.
- P-value: The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis.
- Significance Level ($\alpha$): The threshold for rejecting the null hypothesis. Common choices are 0.05, 0.01, or 0.10.
- Decision Rule: A rule based on the p-value and the significance level. If $p \leq \alpha$, reject $H_0$; otherwise, fail to reject $H_0$.
Steps in Hypothesis Testing
State the Hypotheses: Formulate the null and alternative hypotheses.
Choose the Significance Level ($\alpha$): Decide on the level of significance for the test (e.g., $\alpha = 0.05$).
Calculate the Test Statistic: Compute the test statistic based on the sample data.
Determine the P-value: Find the p-value corresponding to the test statistic.
Make a Decision: Compare the p-value to the significance level and decide whether to reject or fail to reject the null hypothesis.
Types of Hypothesis Tests
(i) One-Sample Z-Test for Means
Used when the population standard deviation $\sigma$ is known and the sample size is large ($n \geq 30$).
Example: Testing whether the mean height of a population is 170 cm based on a sample of 50 people.
Test statistic:
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
(ii) One-Sample T-Test for Means
Used when the population standard deviation is unknown and the sample size is small ($n < 30$).
Example: Testing whether the mean score of students is 75 based on a sample of 15 students.
Test statistic:
$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$
(iii) One-Sample Z-Test for Proportions
Used to test hypotheses about population proportions.
Example: Testing whether the proportion of defective items is 0.05 based on a sample of 200 items.
Test statistic:
$$z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}}$$
(iv) Two-Sample T-Test for Means
Used to compare the means of two independent groups.
Example: Testing whether the mean scores of two classes are different.
Test statistic:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$
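The one-sample and two-sample t-tests above map directly onto scipy.stats helpers. A sketch on simulated scores (all group means and sizes below are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# (ii) One-sample t-test: is the mean score 75?
scores = rng.normal(loc=78, scale=8, size=15)
t_stat, p_value = stats.ttest_1samp(scores, popmean=75)
print(f"one-sample: t = {t_stat:.3f}, p = {p_value:.3f}")

# (iv) Two-sample t-test: do two classes differ in mean score?
# equal_var=False gives Welch's test, matching the statistic above.
class_a = rng.normal(loc=72, scale=10, size=30)
class_b = rng.normal(loc=76, scale=12, size=35)
t_stat, p_value = stats.ttest_ind(class_a, class_b, equal_var=False)
print(f"two-sample: t = {t_stat:.3f}, p = {p_value:.3f}")

# Decision rule at alpha = 0.05
alpha = 0.05
print("reject H0" if p_value <= alpha else "fail to reject H0")
```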
VII. Expectation and Variance.
Expectation and variance were already covered as a subsection of random variables, so only the equations are summarized here:
Expected Value (Mean):
- Discrete: $E[X] = \sum_x x \, p_X(x)$
- Continuous: $E[X] = \int_{-\infty}^{\infty} x \, f_X(x) \, dx$
Variance:
- Discrete: $\mathrm{Var}(X) = \sum_x (x - E[X])^2 \, p_X(x)$ or $\mathrm{Var}(X) = E[X^2] - (E[X])^2$
- Continuous: $\mathrm{Var}(X) = \int_{-\infty}^{\infty} (x - E[X])^2 \, f_X(x) \, dx$ or $\mathrm{Var}(X) = E[X^2] - (E[X])^2$
VIII. Bayesian Inference.
Bayesian inference allows us to incorporate prior knowledge and update our beliefs in light of new evidence.
It is a method of statistical inference in which Bayes' theorem is used to update the probability of a hypothesis as more evidence or information becomes available. It combines prior beliefs with new data to form a posterior belief.
Key Concepts
- Prior Probability (Prior): The initial belief about the probability of a hypothesis before new data is considered, denoted as $P(H)$.
- Likelihood: The probability of observing the data given the hypothesis, denoted as $P(D \mid H)$.
- Posterior Probability (Posterior): The updated belief about the probability of a hypothesis after considering new data, denoted as $P(H \mid D)$.
- Marginal Likelihood (Evidence): The total probability of the data under all possible hypotheses, denoted as $P(D)$.
To recall Bayes' Theorem:
$$P(H \mid D) = \frac{P(D \mid H) \, P(H)}{P(D)}$$
Where:
- $P(H \mid D)$ is the posterior probability.
- $P(D \mid H)$ is the likelihood.
- $P(H)$ is the prior probability.
- $P(D)$ is the marginal likelihood.
The marginal likelihood is calculated as:
$$P(D) = \sum_{i} P(D \mid H_i) \, P(H_i)$$
for discrete hypotheses, or:
$$P(D) = \int P(D \mid H) \, P(H) \, dH$$
for continuous hypotheses.
Steps in Bayesian Inference
- Specify the Prior: Determine the prior distribution $P(H)$ based on previous knowledge or beliefs about the hypothesis.
- Collect Data: Obtain new data $D$.
- Compute the Likelihood: Calculate the likelihood $P(D \mid H)$ of observing the data given the hypothesis.
- Apply Bayes' Theorem: Use Bayes' theorem to update the prior distribution with the likelihood to obtain the posterior distribution $P(H \mid D)$.
- Make Inferences: Use the posterior distribution to make probabilistic statements or decisions about the hypothesis.
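A minimal sketch of these steps, assuming grid-based inference over a coin's heads probability $\theta$, with a uniform prior and made-up flip data:

```python
import numpy as np

theta = np.linspace(0, 1, 101)             # hypotheses: candidate biases
prior = np.ones_like(theta) / len(theta)   # step 1: uniform prior

heads, flips = 7, 10                       # step 2: observed data D

# Step 3: binomial likelihood P(D | theta) for each hypothesis
likelihood = theta**heads * (1 - theta)**(flips - heads)

# Step 4: Bayes' theorem; normalizing divides by the marginal likelihood
posterior = likelihood * prior
posterior /= posterior.sum()

# Step 5: inferences from the posterior distribution
print("Posterior mean of theta:", np.sum(theta * posterior))
print("Most probable theta:    ", theta[np.argmax(posterior)])
```

The grid approach is deliberately simple; it makes the prior-times-likelihood update visible at the cost of scaling poorly beyond a few parameters.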