
Mathematics for Artificial Intelligence : Probability and Statistics

 A simplified guide to preparing in Mathematics for Artificial Intelligence, Machine Learning and Data Science: Probability and Statistics (important pointers only)

 

Module - III : Probability and Statistics

I. Probability Axioms and Rules.

1. Probability Axioms (Kolmogorov Axioms)

  • Non-negativity: For any event A, the probability of A is a non-negative number.
P(A) \geq 0
  • Normalization: The probability of the entire sample space S is 1.
P(S) = 1
  • Additivity: For any two mutually exclusive (disjoint) events A and B (i.e., events that cannot both occur at the same time), the probability of their union is the sum of their probabilities.
P(A \cup B) = P(A) + P(B) \quad \text{if} \quad A \cap B = \emptyset

 2. Derived Rules

  • Complement Rule: The probability of the complement of an event A (i.e., the event that A does not occur) is given by:
P(A^c) = 1 - P(A)
  • Union of Two Events: For any two events A and B, the probability of their union is given by:
P(A \cup B) = P(A) + P(B) - P(A \cap B)
  • Conditional Probability: The probability of event A given that event B has occurred is defined as:
P(A|B) = \frac{P(A \cap B)}{P(B)} \quad \text{if} \quad P(B) > 0
  • Multiplication Rule: For any two events A and B, the probability of both A and B occurring is:
P(A \cap B) = P(A|B) \cdot P(B)
  • Total Probability Theorem: If B_1, B_2, \ldots, B_n are mutually exclusive and exhaustive events, then for any event A:
P(A) = \sum_{i=1}^{n} P(A|B_i) \cdot P(B_i)
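As a quick sanity check, the total probability theorem can be sketched in a few lines of Python. The setup below (two machines producing 60% and 40% of output, with assumed defect rates of 2% and 5%) is a made-up example, not from the text:

```python
# Total probability: P(A) = sum_i P(A|B_i) * P(B_i)
# Hypothetical example: two machines produce 60% and 40% of all items,
# with defect rates of 2% and 5% respectively.
priors = [0.6, 0.4]          # P(B_i): share of output from each machine
likelihoods = [0.02, 0.05]   # P(A|B_i): defect rate of each machine

# Overall probability that a randomly chosen item is defective
p_defect = sum(l * p for l, p in zip(likelihoods, priors))
print(round(p_defect, 3))  # 0.032
```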

II. Conditional Probability and Bayes' Theorem.

The probability of an event occurring given that another event has already occurred. It is denoted as P(A|B), which reads as "the probability of A given B."

It is defined as:

P(A|B) = \frac{P(A \cap B)}{P(B)}

where:

  • P(A \cap B) is the probability that both events A and B occur.
  • P(B) is the probability that event B occurs, provided P(B) > 0.

Example: Drawing Cards from a Deck

Suppose you have a standard deck of 52 cards. Let A be the event that the card drawn is an ace, and let B be the event that the card drawn is a spade.

  • There are 4 aces in the deck.
  • There are 13 spades in the deck.
  • There is 1 ace of spades in the deck.

To find P(A|B), the probability that the card drawn is an ace given that it's a spade:

P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{\text{Number of aces that are spades}}{\text{Number of spades}} = \frac{1}{13} \approx 0.0769
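The card example can be verified by simply enumerating the deck (a small Python sketch; the rank and suit labels are just illustrative):

```python
from fractions import Fraction
from itertools import product

# Build a standard 52-card deck as (rank, suit) pairs
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))  # 52 cards

spades = [c for c in deck if c[1] == "spades"]
ace_and_spade = [c for c in deck if c[0] == "A" and c[1] == "spades"]

# P(A|B) = |A ∩ B| / |B| when every card is equally likely
p_ace_given_spade = Fraction(len(ace_and_spade), len(spades))
print(p_ace_given_spade)  # 1/13
```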

Properties:

  • Non-negativity: 0 \leq P(A|B) \leq 1
  • Normalization: P(A|B) + P(A^c|B) = 1
  • Multiplication Rule: P(A \cap B) = P(A|B) \cdot P(B)
Bayes' Theorem

Bayes' Theorem relates the conditional probability of event A given B to the conditional probability of event B given A, as well as the probabilities of A and B. The formula is as follows:

P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}

where:

P(A|B) is the conditional probability of A given B.
P(B|A) is the conditional probability of B given A.
P(A) is the prior probability of A.
P(B) is the prior probability of B.

Aspects of Bayes' Theorem

Prior Probability (P(A)): This is the initial probability of event A before any new evidence is considered.

Likelihood (P(B|A)): This is the probability of observing event B given that A is true.

Marginal Probability (P(B)): This is the total probability of observing event B, which can be found by considering all possible ways B can occur. If we have a set of mutually exclusive and exhaustive events A_1, A_2, \ldots, A_n, the marginal probability P(B) can be calculated as:
P(B) = \sum_{i=1}^{n} P(B|A_i) \cdot P(A_i)

Posterior Probability (P(A|B)): This is the updated probability of event A after considering the new evidence B.
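All four quantities can be seen together in a short Python sketch. The numbers below describe a hypothetical diagnostic test (the prevalence, sensitivity, and false-positive rate are assumed for illustration, not from the text):

```python
# Hypothetical diagnostic-test example of Bayes' theorem.
prior = 0.01            # P(A): disease prevalence (prior)
sensitivity = 0.99      # P(B|A): test is positive given disease (likelihood)
false_positive = 0.05   # P(B|A^c): test is positive given no disease

# Marginal probability of a positive test: P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)
evidence = sensitivity * prior + false_positive * (1 - prior)

# Posterior: P(A|B) = P(B|A)P(A) / P(B)
posterior = sensitivity * prior / evidence
print(round(posterior, 4))  # 0.1667
```

Even with a 99%-sensitive test, a positive result only raises the probability of disease to about 17%, because the prior is so low — which is exactly the intuition Bayes' theorem formalizes.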

 

III. Random Variables and Probability Distributions.

 1. Random Variables

A random variable is a numerical outcome of a random phenomenon. There are two types:

  • Discrete Random Variables: These take on a countable number of distinct values.
  • Continuous Random Variables: These take on uncountably many possible values within a given range.

 2. Probability Distributions

A probability distribution describes how the values of a random variable are distributed, providing the probabilities of occurrence of the different possible outcomes.

(i) Discrete Probability Distributions

For a discrete random variable, the probability distribution is described by a probability mass function (PMF), P(X = x), which gives the probability that the random variable X equals a specific value x.

Example:

Consider a fair six-sided die. Let X be the random variable representing the outcome of a roll. The PMF is:

P(X = x) = \begin{cases} \frac{1}{6} & \text{if } x \in \{1, 2, 3, 4, 5, 6\} \\ 0 & \text{otherwise} \end{cases}

(ii) Continuous Probability Distributions

For a continuous random variable, the probability distribution is described by a probability density function (PDF), f(x), which gives the relative likelihood of the random variable taking on a particular value. The probability that X lies within an interval [a, b] is given by the integral of the PDF over that interval:

P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx

Example:

Consider a continuous random variable X that is uniformly distributed between 0 and 1. The PDF is:

f(x) = \begin{cases} 1 & \text{if } 0 \leq x \leq 1 \\ 0 & \text{otherwise} \end{cases}

3. Cumulative Distribution Function (CDF)

The cumulative distribution function (CDF) of a random variable X is a function F(x) that gives the probability that X will take a value less than or equal to x:

F(x) = P(X \leq x)

For a discrete random variable, the CDF is the sum of the probabilities up to x:

F(x) = \sum_{t \leq x} P(X = t)

For a continuous random variable, the CDF is the integral of the PDF up to x:

F(x) = \int_{-\infty}^{x} f(t) \, dt
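The discrete case can be sketched in Python for the fair-die PMF from earlier, using exact fractions:

```python
from fractions import Fraction

# PMF of a fair six-sided die: P(X = x) = 1/6 for x in 1..6
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F(x) = P(X <= x): sum the PMF over all outcomes t <= x."""
    return sum(p for t, p in pmf.items() if t <= x)

print(cdf(4))  # 2/3
print(cdf(0))  # 0  (no outcome is <= 0)
print(cdf(6))  # 1  (all outcomes are <= 6)
```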

4. Common Probability Distributions

  1. Discrete Distributions:

    Binomial Distribution: Describes the number of successes in a fixed number of independent Bernoulli trials (e.g., flipping a coin n times).
    Poisson Distribution: Describes the number of events occurring in a fixed interval of time or space (e.g., number of emails received in an hour).
  2. Continuous Distributions:

    Normal Distribution: Describes a continuous random variable with a bell-shaped probability density function (e.g., heights of people).
    Exponential Distribution: Describes the time between events in a Poisson process (e.g., time until the next phone call at a call center).

5. Expected Value and Variance

  • Expected Value (Mean): The expected value of a random variable X provides a measure of the center of its distribution. For a discrete random variable, it is calculated as:
    E(X) = \sum_{x} x \cdot P(X = x)

    For a continuous random variable, it is calculated as:

    E(X) = \int_{-\infty}^{\infty} x \cdot f(x) \, dx
  • Variance: The variance of a random variable X measures the spread of its distribution. For a discrete random variable, it is calculated as:

    \text{Var}(X) = E[(X - E(X))^2] = \sum_{x} (x - E(X))^2 \cdot P(X = x)

    For a continuous random variable, it is calculated as:

    \text{Var}(X) = \int_{-\infty}^{\infty} (x - E(X))^2 \cdot f(x) \, dx
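For the fair six-sided die used earlier, both formulas can be evaluated exactly with Python's fractions module:

```python
from fractions import Fraction

# PMF of a fair six-sided die
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E(X) = sum over x of x * P(X = x)
mean = sum(x * p for x, p in pmf.items())

# Var(X) = sum over x of (x - E(X))^2 * P(X = x)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(mean)  # 7/2
print(var)   # 35/12
```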

IV. Descriptive Statistics.

 Descriptive statistics summarize the main features of a dataset. Three key measures of central tendency are the mean, median, and mode.

1. Mean

The mean (or average) is the sum of all the values in a dataset divided by the number of values. It provides a measure of the central location of the data.

For a dataset with n values x_1, x_2, \ldots, x_n, the mean \mu is calculated as:

\mu = \frac{1}{n} \sum_{i=1}^{n} x_i

2. Median

The median is the middle value of a dataset when the values are arranged in ascending or descending order. If the dataset has an odd number of values, the median is the middle value. If the dataset has an even number of values, the median is the average of the two middle values.

Steps to find the median:

  1. Arrange the data in ascending order.
  2. If the number of observations n is odd, the median is the middle value.
  3. If n is even, the median is the average of the two middle values.

3. Mode

The mode is the value that appears most frequently in a dataset. A dataset can have more than one mode if multiple values have the same highest frequency, or no mode if all values are unique.

Comparison

  • Mean: Sensitive to outliers (extreme values) because it considers all values in the dataset. It is a good measure of central tendency for symmetric distributions without outliers.
  • Median: Not sensitive to outliers. It is a better measure of central tendency for skewed distributions or datasets with outliers.
  • Mode: Useful for categorical data or when identifying the most common value in a dataset. It can be less informative for continuous data or datasets with no repeated values.

Each measure of central tendency provides different insights, and the choice of which to use depends on the nature of the data and the specific context of the analysis.
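Python's standard statistics module computes all three measures directly. The small dataset below is hypothetical, with one outlier (100) included to show the sensitivity difference described above:

```python
import statistics

# Hypothetical dataset with a single large outlier
data = [2, 3, 3, 5, 7, 100]

# Mean is pulled strongly toward the outlier
print(statistics.mean(data))    # 20

# Median stays near the bulk of the data
print(statistics.median(data))  # 4.0

# Mode is the most frequent value
print(statistics.mode(data))    # 3
```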

 

V. Point Estimation and Confidence Intervals.

 1. Point Estimation

Point estimation involves the use of sample data to calculate a single value (known as a statistic) which serves as the best guess or estimate of an unknown population parameter.

Common Point Estimators

  1. Mean (μ): The sample mean (\bar{x}) is used to estimate the population mean.

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
  2. Proportion (p): The sample proportion (\hat{p}) is used to estimate the population proportion.

    \hat{p} = \frac{x}{n}

    where x is the number of successes in the sample and n is the sample size.
  3. Variance (σ²): The sample variance (s^2) is used to estimate the population variance.

    s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

Properties of Point Estimators:

  • Unbiasedness: An estimator is unbiased if the expected value of the estimator equals the population parameter.
  • Consistency: An estimator is consistent if it converges to the true parameter value as the sample size increases.
  • Efficiency: An estimator is efficient if it has the smallest variance among all unbiased estimators.
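Unbiasedness is why the sample variance divides by n-1 rather than n. A small Monte Carlo sketch can illustrate this (the sample size, trial count, and standard-normal population with σ² = 1 are all assumptions chosen for illustration):

```python
import random
import statistics

random.seed(0)  # fixed seed so the simulation is reproducible

# Draw many small samples from a standard normal (true variance = 1)
# and compare the n-divisor and (n-1)-divisor variance estimators.
n, trials = 5, 20000
biased, unbiased = [], []
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    m = statistics.fmean(sample)
    ss = sum((x - m) ** 2 for x in sample)
    biased.append(ss / n)          # divides by n: systematically too small
    unbiased.append(ss / (n - 1))  # divides by n-1: unbiased

# The biased estimator averages near (n-1)/n = 0.8; the unbiased near 1.0
print(round(statistics.fmean(biased), 2))
print(round(statistics.fmean(unbiased), 2))
```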

2. Confidence Intervals

Confidence intervals provide a range of values, derived from the sample data, that is likely to contain the value of an unknown population parameter. The interval is associated with a confidence level, typically expressed as a percentage (e.g., 95%, 99%).

General Form of a Confidence Interval

A confidence interval for a population parameter \theta is generally given by:

\text{Point Estimate} \pm \text{Margin of Error}

The margin of error reflects the precision of the estimate and is affected by the variability in the data and the sample size.

A 95% confidence interval means that if we were to take 100 different samples and compute a confidence interval for each sample, we would expect about 95 of the intervals to contain the true population parameter. Confidence intervals provide a range of plausible values for the parameter, giving a sense of the precision and reliability of the estimate.
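A 95% interval for a mean can be sketched with the standard library, using the normal critical value z ≈ 1.96. The sample below is hypothetical, and for a sample this small a t critical value would be more exact; the normal value is used here only to keep the sketch simple:

```python
import math
import statistics

# Hypothetical sample of 8 measurements
sample = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]
n = len(sample)

xbar = statistics.fmean(sample)   # point estimate of the mean
s = statistics.stdev(sample)      # sample standard deviation (n-1 divisor)

# Margin of error: critical value * standard error of the mean
margin = 1.96 * s / math.sqrt(n)

lower, upper = xbar - margin, xbar + margin
print(round(lower, 2), round(upper, 2))  # 11.88 12.22
```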


VI. Hypothesis Testing.

 A statistical method used to make decisions or inferences about a population based on sample data.

 Key Concepts in Hypothesis Testing

  • Null Hypothesis (H_0): The statement being tested, usually a statement of no effect or no difference. It is assumed true until evidence indicates otherwise.
  • Alternative Hypothesis (H_1 or H_a): The statement we want to test against the null hypothesis. It represents a new effect or difference.
  • Test Statistic: A standardized value calculated from sample data, used to decide whether to reject the null hypothesis. The form of the test statistic depends on the type of data and the hypothesis test being performed.
  • P-value: The probability of obtaining a test statistic at least as extreme as the one observed, assuming the null hypothesis is true. A smaller p-value indicates stronger evidence against the null hypothesis.
  • Significance Level (\alpha): The threshold for rejecting the null hypothesis. Common choices are 0.05, 0.01, or 0.10.
  • Decision Rule: A rule based on the p-value and the significance level. If \text{p-value} \leq \alpha, reject H_0; otherwise, fail to reject H_0.

Steps in Hypothesis Testing

  1. State the Hypotheses: Formulate the null and alternative hypotheses.

    H_0: \text{parameter} = \text{value} \quad H_1: \text{parameter} \neq \text{value} \quad \text{(or >, <)}
  2. Choose the Significance Level (\alpha): Decide on the level of significance for the test (e.g., \alpha = 0.05).

  3. Calculate the Test Statistic: Compute the test statistic based on the sample data.

  4. Determine the P-value: Find the p-value corresponding to the test statistic.

  5. Make a Decision: Compare the p-value to the significance level and decide whether to reject or fail to reject the null hypothesis.

Types of Hypothesis Tests

(i) One-Sample Z-Test for Means

Used when the population standard deviation is known and the sample size is large (n \geq 30).

Example: Testing whether the mean height of a population is 170 cm based on a sample of 50 people.

H_0: \mu = 170 \quad H_1: \mu \neq 170

Test statistic:

z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
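Plugging hypothetical numbers into this statistic (the sample mean of 172 cm and σ = 6 cm are assumed for illustration; only n = 50 and μ₀ = 170 come from the example above):

```python
import math

# Hypothetical data for the height example
n, xbar, mu0, sigma = 50, 172.0, 170.0, 6.0

# z = (x̄ - μ0) / (σ / √n)
z = (xbar - mu0) / (sigma / math.sqrt(n))
print(round(z, 2))  # 2.36

# Two-sided p-value under the standard normal, via the complementary
# error function: p = P(|Z| >= |z|) = erfc(|z| / √2)
p_value = math.erfc(abs(z) / math.sqrt(2))
print(p_value < 0.05)  # True: reject H0 at alpha = 0.05
```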

(ii) One-Sample T-Test for Means

Used when the population standard deviation is unknown and the sample size is small (n < 30).

Example: Testing whether the mean score of students is 75 based on a sample of 15 students.

H_0: \mu = 75 \quad H_1: \mu \neq 75

Test statistic:

t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
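With a hypothetical list of 15 scores, the statistic can be computed directly (the scores are made up; only n = 15 and μ₀ = 75 come from the example above):

```python
import math
import statistics

# Hypothetical scores of 15 students
scores = [72, 78, 74, 80, 76, 73, 77, 79, 71, 75, 74, 78, 76, 72, 73]
n = len(scores)
mu0 = 75

xbar = statistics.fmean(scores)   # sample mean
s = statistics.stdev(scores)      # sample standard deviation (n-1 divisor)

# t = (x̄ - μ0) / (s / √n)
t = (xbar - mu0) / (s / math.sqrt(n))
print(round(t, 2))  # 0.28 — a small t, so H0 would not be rejected
```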

(iii) One-Sample Z-Test for Proportions

Used to test hypotheses about population proportions.

Example: Testing whether the proportion of defective items is 0.05 based on a sample of 200 items.

H_0: p = 0.05 \quad H_1: p \neq 0.05

Test statistic:

z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}

(iv) Two-Sample T-Test for Means

Used to compare the means of two independent groups.

Example: Testing whether the mean scores of two classes are different.

H_0: \mu_1 = \mu_2 \quad H_1: \mu_1 \neq \mu_2

Test statistic:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
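A sketch with two hypothetical classes of scores; note this uses the unpooled (Welch-style) standard error, matching the statistic above:

```python
import math
import statistics

# Hypothetical scores for two independent classes
class1 = [78, 85, 82, 88, 75, 80, 84]
class2 = [72, 79, 74, 70, 76, 73, 77]

m1, m2 = statistics.fmean(class1), statistics.fmean(class2)
v1, v2 = statistics.variance(class1), statistics.variance(class2)  # s^2, n-1 divisor
n1, n2 = len(class1), len(class2)

# t = (x̄1 - x̄2) / sqrt(s1²/n1 + s2²/n2)
t = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)
print(round(t, 2))  # 3.57 — a large t, suggesting the means differ
```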


VII. Expectation and Variance.

Expectation and variance were already covered as a subsection of random variables, so here is a summary of the equations:

Expected Value (Mean):

  • Discrete: E(X) = \sum_{i} x_i \cdot p_i
  • Continuous: E(X) = \int_{-\infty}^{\infty} x \cdot f(x) \, dx

Variance:

  • Discrete: \text{Var}(X) = \sum_{i} (x_i - \mu)^2 \cdot p_i or \text{Var}(X) = E(X^2) - [E(X)]^2
  • Continuous: \text{Var}(X) = \int_{-\infty}^{\infty} (x - \mu)^2 \cdot f(x) \, dx or \text{Var}(X) = E(X^2) - [E(X)]^2

VIII. Bayesian Inference.

Bayesian inference allows us to incorporate prior knowledge and update our beliefs in light of new evidence.

It is a method of statistical inference in which Bayes' theorem is used to update the probability of a hypothesis as more evidence or information becomes available. It combines prior beliefs with new data to form a posterior belief.

Key Concepts

  • Prior Probability (Prior): The initial belief about the probability of a hypothesis before new data is considered, denoted as P(H).
  • Likelihood: The probability of observing the data given the hypothesis, denoted as P(D \mid H).
  • Posterior Probability (Posterior): The updated belief about the probability of a hypothesis after considering new data, denoted as P(H \mid D).
  • Marginal Likelihood (Evidence): The total probability of the data under all possible hypotheses, denoted as P(D).

To recall Bayes' Theorem:

P(H \mid D) = \frac{P(D \mid H) \cdot P(H)}{P(D)}

where:

  • P(H \mid D) is the posterior probability.
  • P(D \mid H) is the likelihood.
  • P(H) is the prior probability.
  • P(D) is the marginal likelihood.

The marginal likelihood is calculated as:

P(D) = \sum_{i} P(D \mid H_i) \cdot P(H_i)

for discrete hypotheses, or:

P(D) = \int P(D \mid \theta) \cdot P(\theta) \, d\theta

for continuous hypotheses.

Steps in Bayesian Inference

  • Specify the Prior: Determine the prior distribution P(H) based on previous knowledge or beliefs about the hypothesis.
  • Collect Data: Obtain new data D.
  • Compute the Likelihood: Calculate the likelihood P(D \mid H) of observing the data given the hypothesis.
  • Apply Bayes' Theorem: Use Bayes' theorem to update the prior distribution with the likelihood to obtain the posterior distribution P(H \mid D).
  • Make Inferences: Use the posterior distribution to make probabilistic statements or decisions about the hypothesis.
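The steps above can be sketched end to end for a toy problem: three candidate coin biases with a uniform prior, updated after observing 8 heads in 10 flips (all numbers assumed for illustration):

```python
from math import comb

# Step 1: specify the prior — three candidate values of P(heads),
# each believed equally likely a priori
hypotheses = [0.3, 0.5, 0.7]
prior = {h: 1 / 3 for h in hypotheses}

# Step 2: collect data — 8 heads observed in 10 flips
heads, flips = 8, 10

# Step 3: compute the binomial likelihood P(D | H) for each hypothesis
likelihood = {h: comb(flips, heads) * h**heads * (1 - h)**(flips - heads)
              for h in hypotheses}

# Step 4: apply Bayes' theorem — normalize by the marginal likelihood P(D)
evidence = sum(likelihood[h] * prior[h] for h in hypotheses)
posterior = {h: likelihood[h] * prior[h] / evidence for h in hypotheses}

# Step 5: make inferences — the data strongly favor a bias of 0.7
for h in hypotheses:
    print(h, round(posterior[h], 3))
```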

     

     

     
