Bayesian Statistics: From Intuition to Practice
Introduction: Model-Based Thinking
Statistics shapes how we understand uncertainty. For much of the 20th century, the frequentist approach dominated scientific research. This framework treats probability as the long-run frequency of events. A coin has a 50% chance of heads because, over many flips, it lands heads about half the time.
Bayesian statistics offers a fundamentally different perspective. Rather than defining probability as long-run frequency, Bayesians treat probability as a degree of belief — a tool for reasoning about uncertainty itself. This shift in interpretation unlocks a powerful modeling philosophy: model-based thinking.
In the model-based view, every statistical analysis begins with an explicit generative model — a story about how the data could have been produced. You write down the process that generated your observations, assign probability distributions to each unknown quantity, and then use Bayesian inference to learn from data. The model is not just a convenience; it is the entire analysis.
As Richard McElreath emphasizes in Statistical Rethinking, the goal is not to test whether a model is "true" — no model is true. The goal is to understand how different assumptions produce different inferences, and to compare models by how well they predict. This philosophy makes Bayesian methods uniquely suited for modern data science: small datasets, complex hierarchical structures, and the need to quantify uncertainty in predictions all arise naturally within this framework.
The Garden of Forking Data
Before formalizing Bayes' theorem, it helps to think about inference in a way McElreath calls the Garden of Forking Data. Imagine all the possible ways a study could have turned out. Each possible dataset is a "path" through this garden. The data you actually observed is just one path.
The key insight: a good model should have produced your observed data as a plausible path. If your model assigns very low probability to the data you actually saw, something is wrong — either the model or the data. This idea underpins both prior and posterior predictive checks.
Consider tossing a small globe and noting whether it lands on water (W) or land (L). Suppose we believe the true proportion of water is 70%. How surprising would it be to observe the sequence W, W, L, W? Tracing the branching paths, that sequence has probability 0.7 × 0.7 × 0.3 × 0.7 ≈ 0.103 under this assumption.
The probability of any particular sequence depends on the parameter value we assume. Different parameter values produce different probability distributions over the "garden" of possible datasets. Bayesian inference works backward: given the observed path, which parameter values are most consistent with it?
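To make this backward reasoning concrete, here is a small Python sketch (the candidate values of p are chosen arbitrarily for illustration) comparing how probable the observed sequence is under different parameter values:

```python
# Probability of the observed sequence W, W, L, W (3 water, 1 land)
# under several candidate values of p
def sequence_prob(p):
    return p ** 3 * (1 - p)

for p in [0.3, 0.5, 0.7, 0.9]:
    print(f"p = {p:.1f}: P(W,W,L,W) = {sequence_prob(p):.4f}")
```

Among these candidates, p = 0.7 makes the observed path most plausible, which is exactly the intuition Bayesian inference formalizes.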
Bayes' Theorem: The Mathematical Foundation
At the heart of Bayesian statistics lies Bayes' theorem, a simple equation with profound implications:
P(theta | data) = P(data | theta) * P(theta) / P(data)
This formula tells us how to update our beliefs about a parameter theta after observing data. Let's unpack each component:
- P(theta | data) is the posterior probability — our updated belief about theta after seeing the data
- P(data | theta) is the likelihood — how probable the observed data is, given a particular value of theta
- P(theta) is the prior probability — what we believed about theta before collecting data
- P(data) is the marginal likelihood or evidence — a normalizing constant that ensures the posterior sums to one
The theorem follows directly from the definition of conditional probability. The joint probability of A and B can be written two ways: P(A and B) = P(A | B) P(B) = P(B | A) P(A). Setting these equal and dividing by P(B) gives Bayes' theorem.
In practice, the denominator is often hard to compute directly. Instead, we work with the unnormalized posterior:
Posterior ~ Likelihood * Prior
This proportionality is all we need for most computational approaches, including grid approximation and MCMC sampling.
A Concrete Example: Medical Testing
Consider a disease that affects 1% of the population. A test for this disease has 95% sensitivity (correctly identifies diseased patients) and 90% specificity (correctly identifies healthy patients). You test positive. What is the probability you actually have the disease?
Define:
- D = having the disease
- T+ = testing positive
We know:
- P(D) = 0.01 (prevalence, our prior)
- P(T+ given D) = 0.95 (sensitivity)
- P(T+ given not D) = 0.10 (false positive rate)
Using Bayes' theorem:
P(D | T+) = P(T+ | D) * P(D) / P(T+)
The denominator expands to:
P(T+) = P(T+ | D) * P(D) + P(T+ | not D) * P(not D)
= 0.95 * 0.01 + 0.10 * 0.99
= 0.0095 + 0.099
= 0.1085
Therefore:
P(D | T+) = 0.0095 / 0.1085 = 0.088
Despite testing positive with a reasonably accurate test, you have only about an 8.8% chance of actually having the disease. This counterintuitive result stems from the low base rate — the prior probability is doing heavy lifting. Bayes' theorem forces us to account for it.
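The arithmetic above can be checked with a few lines of Python:

```python
# Bayes' theorem for the medical testing example
prevalence = 0.01          # P(D), the prior
sensitivity = 0.95         # P(T+ | D)
false_positive = 0.10      # P(T+ | not D)

# Total probability of a positive test (the denominator)
p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)

# Posterior probability of disease given a positive test
posterior = sensitivity * prevalence / p_positive
print(f"P(D | T+) = {posterior:.3f}")  # 0.088
```

Doubling the prevalence to 2% roughly doubles the posterior, which makes the base rate's influence easy to explore.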
Prior, Likelihood, and Posterior: The Three Pillars
The Prior Distribution
The prior distribution encodes what we know about a parameter before observing data. In the model-based framework, the prior is part of the generative model — it is not an optional add-on but a statement about what parameter values are plausible.
Priors come in several flavors:
Informative priors incorporate substantive knowledge. If previous studies suggest a drug reduces blood pressure by 10-15 mmHg, we might use a normal prior centered at 12.5 with appropriate uncertainty.
Weakly informative priors provide mild regularization without strong assumptions. A normal distribution centered at zero with a standard deviation of 10 on a standardized effect size says large effects are unlikely but does not rule them out.
Flat or non-informative priors attempt minimal influence. A uniform prior across all possible values lets the data speak loudest. However, truly non-informative priors rarely exist, and flat priors can sometimes be surprisingly informative on transformed scales.
Critics often object that priors introduce subjectivity. Yet all statistical methods involve assumptions — Bayesian analysis simply makes these explicit rather than hiding them. Moreover, as data accumulates, the prior's influence diminishes. The honest approach is to choose priors thoughtfully, justify them, and check sensitivity.
The Likelihood Function
The likelihood function connects our model to observed data. For a parameter theta and data y, the likelihood quantifies how well different parameter values explain the observations. It is the probability of the data given the parameter, viewed as a function of the parameter.
Consider tossing a globe 10 times and observing 7 water landings. If p represents the proportion of water on the globe, the likelihood follows a binomial distribution:
L(p | data) = C(10,7) * p^7 * (1-p)^3
This function peaks at p = 0.7, the maximum likelihood estimate. Values near 0.7 explain the data well; extreme values like p = 0.1 or p = 0.9 explain it poorly.
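A short Python sketch makes the shape of this likelihood visible (the evaluation points are chosen arbitrarily):

```python
from math import comb

# Binomial likelihood for 7 water in 10 tosses,
# evaluated at several values of p
def likelihood(p, k=7, n=10):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

for p in [0.1, 0.5, 0.7, 0.9]:
    print(f"p = {p:.1f}: L = {likelihood(p):.6f}")
```

The function attains its maximum at p = 0.7 and falls off steeply toward the extremes.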
The Posterior Distribution
The posterior distribution combines prior and likelihood through Bayes' theorem:
Posterior ~ Likelihood * Prior
The posterior represents our updated knowledge. It balances what we knew before with what the data tells us. With limited data, the prior matters more. With abundant data, the likelihood dominates.
Continuing the globe-tossing example, suppose we use a Beta(2, 2) prior (a slight belief the globe is half water). The posterior becomes Beta(2+7, 2+3) = Beta(9, 5). The prior pulled our estimate slightly toward 0.5, but the data dominated.
Bayesian Inference by Simulation: Grid Approximation
Before diving into MCMC, it is worth understanding the simplest approach to Bayesian inference: grid approximation. The idea is straightforward: evaluate the posterior at a grid of parameter values, then normalize. This makes the logic of Bayesian updating completely transparent.
For the globe-tossing example, suppose we want to infer the proportion of water p. We define a grid of possible values and compute the posterior at each point:
# Grid approximation of globe-tossing posterior
# Data: 6 water, 3 land in 9 tosses
# Define the grid of possible p values
p_grid <- seq(from = 0, to = 1, length.out = 100)
# Define the prior: uniform (Beta(1,1))
prior <- rep(1, 100)
# Define the likelihood: Binomial(9, p)
likelihood <- dbinom(6, size = 9, prob = p_grid)
# Compute the unstandardized posterior
unstd_posterior <- likelihood * prior
# Standardize so it sums to 1
posterior <- unstd_posterior / sum(unstd_posterior)
# Plot
plot(p_grid, posterior, type = "l",
     xlab = "proportion of water (p)",
     ylab = "posterior probability")
Grid approximation is pedagogically valuable because every step is visible. You can see exactly how the prior and likelihood combine. In practice, grid approximation scales poorly to many parameters — the curse of dimensionality means the grid grows exponentially. But for one- or two-parameter problems, it works well and builds intuition.
A Worked Example: Globe Tossing
Let us work through a complete example following McElreath's globe-tossing setup. We toss a small globe 10 times. It lands on water 7 times and land 3 times. We want to infer the true proportion of water, p.
Specifying the Model
We write down a complete generative model:
W ~ Binomial(n=10, p) # the data-generating process
p ~ Beta(2, 2) # the prior
The Beta(2, 2) prior is weakly informative — it expresses a mild preference for p near 0.5 but allows values across the full range. The choice of Beta distribution is convenient because it is conjugate with the binomial, giving us a closed-form posterior.
Computing the Posterior
With Beta-Binomial conjugacy, the posterior is:
p | data ~ Beta(2 + 7, 2 + 3) = Beta(9, 5)
The posterior mean is:
E[p | data] = 9 / (9 + 5) = 0.643
The posterior mode is:
Mode[p | data] = (9 - 1) / (9 + 5 - 2) = 8/12 = 0.667
Credible Intervals
Bayesian uncertainty quantification uses credible intervals. An 89% credible interval contains 89% of the posterior probability. We can directly state: "There is an 89% probability the parameter lies in this interval."
Why 89%? McElreath advocates this as a default because it avoids the magical thinking around 95% that pervades frequentist practice. There is nothing special about 95% — it is a historical accident. Using 89% makes it clear that the threshold is a choice, not a law of nature. It also produces slightly narrower intervals that are more informative for model comparison.
For our Beta(9, 5) posterior, we compute the 89% interval using quantiles:
qbeta(c(0.055, 0.945), shape1 = 9, shape2 = 5)
# Returns approximately [0.44, 0.83]
This means: given our model and data, there is an 89% probability that the true proportion of water on the globe lies between about 0.44 and 0.83. Compare this to the maximum likelihood estimate of 0.7 — the interval tells a richer story.
Prior Predictive Checks
Before fitting a model, we should check whether our prior produces plausible data. A prior predictive check simulates data from the model using only the prior, without conditioning on the observed data. If the simulated data look ridiculous, the prior is poorly specified.
# Prior predictive check for the globe-tossing model
set.seed(42)
n_sim <- 1000
n_tosses <- 10
# Sample p from the prior
p_prior <- rbeta(n_sim, shape1 = 2, shape2 = 2)
# Simulate data for each prior draw
w_sim <- rbinom(n_sim, size = n_tosses, prob = p_prior)
# Visualize the distribution of simulated water counts
hist(w_sim, breaks = -0.5:10.5,
     main = "Prior predictive: water counts",
     xlab = "Number of water landings (out of 10)",
     col = "steelblue", border = "white")
If we instead used a very strong prior like Beta(100, 100) — which concentrates probability tightly around 0.5 — the prior predictive distribution would cluster around 5 water out of 10. If we then observed 7 water, the prior would be in tension with the data. Prior predictive checks surface these tensions before you waste time fitting a model.
Posterior Predictive Checks
After fitting, we verify that the model can reproduce the observed data patterns. A posterior predictive check simulates new data from the posterior — plugging in posterior samples of p and generating replicate datasets. We then compare these replicates to what we actually observed.
# Posterior predictive check
set.seed(42)
n_sim <- 10000
# Sample from the posterior: Beta(9, 5)
p_post <- rbeta(n_sim, shape1 = 9, shape2 = 5)
# Simulate replicate datasets
w_rep <- rbinom(n_sim, size = 10, prob = p_post)
# How often does the simulation match our observation (7 water)?
mean(w_rep == 7)
# Roughly 0.2 here: the observed count is typical of the replicates
# Compare the distribution of replicates to observed
hist(w_rep, breaks = -0.5:10.5,
     main = "Posterior predictive: water counts",
     xlab = "Number of water landings (out of 10)",
     col = "steelblue", border = "white")
abline(v = 7, col = "red", lwd = 2)
# The red line marks the observed count
If the observed data fall in the tails of the posterior predictive distribution, the model is inadequate — it cannot reproduce what you actually saw. This might indicate a missing predictor, a wrong likelihood family, or a poorly specified prior. Posterior predictive checks are one of the most important diagnostic tools in Bayesian data analysis.
Conjugate Priors: Common Families
Conjugate priors simplify Bayesian analysis. When the prior and posterior belong to the same distribution family, calculations become tractable. Here are three essential conjugate pairs:
Beta-Binomial Conjugacy
For binomial data (successes out of n trials), the Beta distribution serves as a conjugate prior for the success probability p.
- Prior: Beta(alpha, beta)
- Likelihood: Binomial(n, p) with k successes
- Posterior: Beta(alpha + k, beta + n - k)
If we start with Beta(1, 1), a uniform prior, and observe 7 water in 10 globe tosses, the posterior is Beta(8, 4). The mean shifts from 0.5 to 8/(8+4) = 0.67.
Normal-Normal Conjugacy
For normally distributed data with known variance, the normal distribution is conjugate for the mean mu.
- Prior: Normal(mu_0, sigma_0^2)
- Likelihood: Normal(mu, sigma^2) with n observations
- Posterior: Normal(mu_n, sigma_n^2)
The posterior mean is a weighted average of prior mean and sample mean, with weights proportional to their precisions (inverse variances).
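As a sketch, here is the update in Python, with hypothetical prior settings and data:

```python
# Normal-Normal conjugate update with known data variance.
# The posterior precision is the sum of prior and data precisions;
# the posterior mean is the precision-weighted average.
mu0, sigma0 = 0.0, 10.0                 # prior mean and sd (hypothetical)
sigma = 2.0                             # known data sd
data = [4.8, 5.1, 5.3, 4.9, 5.2]        # made-up observations
n = len(data)
ybar = sum(data) / n

prior_precision = 1 / sigma0 ** 2
data_precision = n / sigma ** 2
post_precision = prior_precision + data_precision

post_mean = (prior_precision * mu0 + data_precision * ybar) / post_precision
post_sd = post_precision ** -0.5
print(f"posterior: Normal({post_mean:.3f}, {post_sd:.3f}^2)")
```

With five observations and a diffuse prior, the posterior mean sits very close to the sample mean, pulled only slightly toward the prior mean of zero.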
Gamma-Poisson Conjugacy
For count data following a Poisson distribution, the Gamma distribution is conjugate for the rate parameter lambda.
- Prior: Gamma(alpha, beta)
- Likelihood: Poisson(lambda) with n observations summing to k
- Posterior: Gamma(alpha + k, beta + n)
This pairing appears in modeling event rates, such as customer arrivals or website clicks.
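A minimal Python sketch of the update, using made-up counts:

```python
# Gamma-Poisson conjugate update for an event rate.
# Hypothetical prior Gamma(2, 1) and hypothetical counts per interval.
alpha, beta = 2.0, 1.0
counts = [3, 5, 4, 6, 2]          # observed counts over 5 intervals
k, n = sum(counts), len(counts)

alpha_post = alpha + k            # posterior: Gamma(alpha + k, beta + n)
beta_post = beta + n
post_mean = alpha_post / beta_post
print(f"posterior: Gamma({alpha_post}, {beta_post}), mean = {post_mean:.3f}")
```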
Bayes Factors
The Bayes factor quantifies evidence for one model over another, independent of prior probabilities:
BF = P(data | H1) / P(data | H0)
A Bayes factor of 10 means the data is 10 times more likely under model H1 than under model H0. Common interpretation guidelines (following Jeffreys) suggest:
- 1 to 3: barely worth mentioning
- 3 to 10: substantial
- 10 to 30: strong
- 30 to 100: very strong
- above 100: decisive
Bayes factors naturally penalize model complexity, addressing overfitting concerns. A more complex model must substantially improve the likelihood to overcome the penalty of its broader prior. However, Bayes factors can be sensitive to prior specification, especially for the alternative hypothesis. They are most useful when the models under comparison are well-motivated scientifically, not when used as an automatic model selection procedure.
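As an illustration (not from the text), here is a Python sketch comparing a point null p = 0.5 against a uniform-prior alternative for 7 water in 10 globe tosses:

```python
from math import comb

# Bayes factor for the globe data: 7 water in 10 tosses.
# H0: p = 0.5 exactly.  H1: p ~ Uniform(0, 1).
k, n = 7, 10

# Marginal likelihood under H0: a plain binomial at p = 0.5
m0 = comb(n, k) * 0.5 ** n

# Marginal likelihood under H1: the binomial likelihood integrated
# over a uniform prior, which equals 1 / (n + 1) for any k
m1 = 1 / (n + 1)

bf10 = m1 / m0
print(f"BF10 = {bf10:.3f}")  # below 1: the data mildly favor the point null
```

The result, a Bayes factor slightly below 1, shows the complexity penalty at work: the diffuse alternative spreads its prior mass over many parameter values that predict the data poorly.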
MCMC Methods: Sampling from Posteriors
Analytical solutions exist only for simple models. Real-world problems require numerical approximation. Markov Chain Monte Carlo (MCMC) methods generate samples from the posterior distribution, which can then be summarized to produce estimates, intervals, and predictions.
Metropolis-Hastings Algorithm
The Metropolis-Hastings algorithm constructs a Markov chain that converges to the target posterior:
- Start at an initial parameter value theta_0
- Propose a new value theta_prime from a proposal distribution
- Calculate the acceptance ratio r = P(theta_prime | data) / P(theta | data) (for a symmetric proposal; the general Metropolis-Hastings ratio also multiplies by the ratio of proposal densities)
- Accept theta_prime with probability min(1, r); otherwise stay at theta
- Repeat
The algorithm explores the parameter space, spending more time in high-probability regions. After a burn-in period, samples approximate the posterior.
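The algorithm is short enough to sketch in plain Python. This toy implementation targets the globe-tossing posterior (7 water in 10 tosses with a Beta(2, 2) prior) and works on the log scale for numerical stability; the proposal scale and iteration counts are arbitrary choices:

```python
import math
import random

random.seed(42)

def log_post(p):
    # Unnormalized log posterior: binomial likelihood (constant dropped)
    # times a Beta(2, 2) prior gives p^8 * (1-p)^4, i.e. Beta(9, 5)
    if not 0 < p < 1:
        return float("-inf")
    return 8 * math.log(p) + 4 * math.log(1 - p)

samples = []
p = 0.5
for _ in range(60000):
    proposal = p + random.gauss(0, 0.1)           # symmetric proposal
    log_r = log_post(proposal) - log_post(p)      # log acceptance ratio
    if random.random() < math.exp(min(0.0, log_r)):
        p = proposal                              # accept; otherwise stay
    samples.append(p)

posterior = samples[10000:]                       # discard burn-in
mean_p = sum(posterior) / len(posterior)
print(f"posterior mean ~= {mean_p:.3f}")          # analytic Beta(9, 5) mean: 0.643
```

Even this crude sampler recovers the analytic posterior mean to two decimal places, which is a useful sanity check before trusting it on models without closed forms.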
Gibbs Sampling
Gibbs sampling simplifies Metropolis-Hastings for multi-parameter models. Instead of proposing joint updates, it samples each parameter from its full conditional distribution (given all other parameters and data).
For parameters (theta_1, theta_2, theta_3):
- Sample theta_1 from P(theta_1 | theta_2, theta_3, data)
- Sample theta_2 from P(theta_2 | theta_1, theta_3, data)
- Sample theta_3 from P(theta_3 | theta_1, theta_2, data)
- Repeat
Gibbs sampling always accepts proposals but requires tractable conditional distributions.
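As a sketch, here is Gibbs sampling for normal data with unknown mean and precision, a standard conjugate setup; the priors and simulated data below are hypothetical:

```python
import random

random.seed(1)

# Simulated data: 200 draws from Normal(5, 2^2)
y = [random.gauss(5, 2) for _ in range(200)]
n = len(y)
ybar = sum(y) / n

# Hypothetical weak priors: mu ~ Normal(mu0, 1/tau0), tau ~ Gamma(a0, b0)
mu0, tau0 = 0.0, 0.01
a0, b0 = 1.0, 1.0

mu, tau = 0.0, 1.0          # initial values
mu_draws = []
for _ in range(5000):
    # Full conditional for mu: Normal with precision-weighted mean
    prec = tau0 + n * tau
    mean = (tau0 * mu0 + tau * n * ybar) / prec
    mu = random.gauss(mean, prec ** -0.5)
    # Full conditional for tau: Gamma(a0 + n/2, rate = b0 + SS/2)
    ss = sum((yi - mu) ** 2 for yi in y)
    tau = random.gammavariate(a0 + n / 2, 1 / (b0 + ss / 2))  # shape, scale
    mu_draws.append(mu)

kept = mu_draws[500:]
post_mu = sum(kept) / len(kept)
print(f"posterior mean of mu ~= {post_mu:.2f}")  # close to the sample mean
```

Note that each update draws from an exact conditional distribution, so no proposals are ever rejected.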
Practical Considerations
MCMC introduces diagnostic challenges:
- Convergence: Has the chain reached the target distribution?
- Mixing: How efficiently does the chain explore the space?
- Autocorrelation: How many samples are effectively independent?
Tools like trace plots, the Gelman-Rubin statistic (R-hat), and effective sample size calculations help assess MCMC quality. Modern probabilistic programming frameworks like Stan handle most of these diagnostics automatically.
Multilevel Models
Bayesian multilevel models (also called hierarchical models or mixed-effects models) are one of the greatest strengths of the Bayesian framework. They capture structure in grouped data by allowing parameters to vary across groups while sharing information between groups.
The Problem: Pooled vs. Unpooled Estimates
Consider test scores from students across multiple schools. We could pool all data and estimate one overall mean, ignoring school differences. Or we could estimate each school separately (no pooling), but small schools would have wildly unstable estimates. Multilevel models offer a middle path: partial pooling.
The Hierarchical Structure
score_ij ~ Normal(mu_j, sigma^2) # student-level model
mu_j ~ Normal(mu, tau^2) # school-level model (prior)
mu ~ Normal(0, 10) # group-level prior
tau ~ Exponential(1) # prior on variation between schools
School-level parameters mu_j share information through the group-level prior. Schools with more data get estimates closer to their own sample mean, while schools with less data are pulled toward the group average. This partial pooling is automatic — the model figures out how much to shrink each estimate based on the data.
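The shrinkage mechanics can be sketched directly. Treating sigma, tau, and the grand mean as known (a simplification; a full model estimates them), each school's conditional posterior mean is a precision-weighted average. The numbers below are invented for illustration:

```python
# Partial pooling as a precision-weighted average.
# Three hypothetical schools, all with sample mean 60, but different sizes:
# the smaller the school, the stronger the pull toward the grand mean.
sigma = 10.0         # student-level sd (assumed known)
tau = 5.0            # between-school sd (assumed known)
grand_mean = 50.0

schools = {          # name: (sample mean, number of students)
    "A": (60.0, 100),
    "B": (60.0, 10),
    "C": (60.0, 2),
}

estimates = {}
for name, (ybar, n) in schools.items():
    data_prec = n / sigma ** 2
    prior_prec = 1 / tau ** 2
    estimates[name] = (data_prec * ybar + prior_prec * grand_mean) \
        / (data_prec + prior_prec)
    print(f"school {name}: n = {n:3d}, sample mean = {ybar}, "
          f"partially pooled estimate = {estimates[name]:.1f}")
```

All three schools report the same sample mean, yet the two-student school is shrunk much closer to the grand mean than the hundred-student school, exactly the behavior described above.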
Why Multilevel Models Matter
Multilevel models shine whenever data has a grouped or clustered structure: students within schools, patients within hospitals, repeated measures within individuals, or observations across geographic regions. They provide better estimates for small groups, naturally quantify variation at each level, and avoid the ad-hoc decisions that plague traditional approaches (such as deciding whether to pool or not).
Applications in AI and Machine Learning
Naive Bayes Classifiers
Naive Bayes applies Bayes' theorem to classification with a simplifying assumption: features are conditionally independent given the class. For text classification:
P(class | words) ~ P(class) * product(P(word_i | class))
Despite the unrealistic independence assumption, Naive Bayes performs remarkably well for spam filtering, sentiment analysis, and document categorization.
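A toy sketch in Python (the word probabilities are invented for illustration); working with log probabilities avoids numerical underflow when many words multiply together:

```python
import math

# Toy Naive Bayes spam filter with hypothetical per-class word probabilities
p_spam = 0.4
p_word_given_spam = {"free": 0.30, "meeting": 0.01, "winner": 0.20}
p_word_given_ham = {"free": 0.02, "meeting": 0.15, "winner": 0.001}

def log_score(words, p_class, p_word):
    # log P(class) + sum of log P(word_i | class)
    return math.log(p_class) + sum(math.log(p_word[w]) for w in words)

message = ["free", "winner"]
spam_score = log_score(message, p_spam, p_word_given_spam)
ham_score = log_score(message, 1 - p_spam, p_word_given_ham)
print("spam" if spam_score > ham_score else "ham")
```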
Bayesian Optimization
Bayesian optimization finds optimal inputs for expensive black-box functions. It maintains a probabilistic surrogate model (often a Gaussian process) and uses an acquisition function to balance exploration and exploitation.
This approach powers hyperparameter tuning in machine learning, where each evaluation requires training a model. Bayesian optimization finds good configurations with fewer evaluations than grid or random search.
Bayesian Neural Networks
Bayesian neural networks place probability distributions over weights instead of point estimates. This provides:
- Uncertainty quantification: Predictions include confidence estimates
- Regularization: The prior prevents overfitting
- Model averaging: Predictions average over plausible weight configurations
Variational inference and MCMC enable training, though computational costs remain high for large networks.
Tools and Software
Modern Bayesian modeling is made practical by a rich ecosystem of tools. Here are the three most important ones, each with a different philosophy:
rethinking (R)
McElreath's rethinking package accompanies Statistical Rethinking and is designed for teaching. It exposes the mechanics of Bayesian computation rather than hiding them:
library(rethinking)
# Globe-tossing model using map (quadratic approximation;
# renamed quap in rethinking 2.x)
data <- list(
  w = 6,  # water count
  n = 9   # total tosses
)
fit <- map(
  alist(
    w ~ dbinom(n, p),  # likelihood
    p ~ dbeta(2, 2)    # prior
  ),
  data = data
)
# Extract 89% credible interval
precis(fit, prob = 0.89)
# Or use MCMC with map2stan (renamed ulam in rethinking 2.x)
fit_mcmc <- map2stan(
  alist(
    w ~ dbinom(n, p),
    p ~ dbeta(2, 2)
  ),
  data = data,
  iter = 4000, warmup = 1000, chains = 4
)
# Posterior predictive check
postcheck(fit_mcmc)
brms (R)
The brms package provides a formula interface for Bayesian regression models powered by Stan. It feels familiar to users of R's lm() and lmer() functions:
library(brms)
# Fit a multilevel regression
fit <- brm(
  score ~ 1 + (1 | school),
  data = student_data,
  family = gaussian(),
  prior = c(
    prior(normal(0, 10), class = Intercept),
    prior(cauchy(0, 2), class = sd),
    prior(exponential(1), class = sigma)
  ),
  iter = 4000, warmup = 1000, chains = 4,
  seed = 42
)
# Summary with 89% credible intervals
summary(fit, prob = 0.89)
# Posterior predictive checks
pp_check(fit)
# Extract posterior samples for custom summaries
posterior_samples <- as_draws_df(fit)
PyMC (Python)
PyMC brings probabilistic programming to Python with a clean, intuitive API and efficient NUTS sampling:
import pymc as pm
import arviz as az

# Globe-tossing model
water_observed = 6
total_tosses = 9

with pm.Model() as globe_model:
    # Prior
    p = pm.Beta("p", alpha=2, beta=2)
    # Likelihood
    w = pm.Binomial("w", n=total_tosses, p=p,
                    observed=water_observed)
    # Sample from the posterior
    trace = pm.sample(4000, tune=1000, chains=4,
                      cores=4, random_seed=42)

# Summary with 89% credible intervals
az.summary(trace, hdi_prob=0.89)

# Posterior predictive check: add replicate data to the trace,
# then compare replicates to the observation
pm.sample_posterior_predictive(trace, model=globe_model,
                               extend_inferencedata=True)
az.plot_ppc(trace)
Other notable tools include Stan (the underlying engine for rethinking and brms, usable from R, Python, and Julia), Turing.jl (probabilistic programming in Julia), and NumPyro (fast MCMC in Python built on JAX).
Common Misconceptions
"Bayesian methods are subjective"
While priors encode subjective knowledge, this transparency is a strength. Different analysts can use different priors and compare results. Moreover, with sufficient data, priors matter less. The real question is whether assumptions are reasonable, not whether they exist.
"Bayesian computation is too slow"
MCMC does require more computation than point estimates, but modern tools like Stan, PyMC, and Turing make this manageable. Approximate methods like variational inference trade some accuracy for speed. For many problems, the benefits outweigh computational costs.
"Priors bias results"
Priors influence posteriors, but this is the point. Ignoring relevant prior information wastes data. Weakly informative priors provide regularization without strong assumptions. Sensitivity analysis — trying different priors — assesses robustness.
"Bayesian methods don't control error rates"
Bayesian inference focuses on posterior probabilities, not long-run error rates. This is not a bug but a feature. When error rate control matters (for example, in clinical trials), Bayesian designs can incorporate frequentist operating characteristics through simulation.
Practical Considerations: When to Use Bayesian Methods
Choose Bayesian approaches when:
- Prior information exists: Previous studies, expert knowledge, or physical constraints inform parameters
- Uncertainty quantification matters: Decisions require full posterior distributions, not just point estimates
- Data is limited: Priors stabilize estimates when data is sparse
- Complex models are needed: Hierarchical structures, missing data, or latent variables fit naturally
- Sequential updating is required: Posteriors become priors for new data
Frequentist methods may suffice when:
- Data is abundant: Likelihood dominates, making prior choice irrelevant
- Computational simplicity is critical: Quick analyses with standard methods
- Regulatory requirements demand it: Some fields mandate frequentist approaches
- Only point estimates matter: No need for full uncertainty quantification
Many practitioners use both frameworks, selecting tools based on the problem at hand.
Further Reading and Resources
Books
- Statistical Rethinking by Richard McElreath — The best starting point. A model-based introduction to Bayesian statistics with R and Stan, full of intuition and practical wisdom
- Bayesian Data Analysis by Gelman et al. — The comprehensive reference for practitioners
- Doing Bayesian Data Analysis by Kruschke — Gentle introduction for beginners with detailed visual explanations
- Probabilistic Programming and Bayesian Methods for Hackers by Davidson-Pilon — Practical, code-focused approach using PyMC
Online Resources
- Statistical Rethinking (YouTube): McElreath's lecture series — pedagogically outstanding
- Bayesian Methods for Machine Learning (Coursera): HSE University course
- Stan Documentation: Extensive examples and case studies
- ArviZ Documentation: Visualization and diagnostics for Bayesian models
Bayesian statistics offers a coherent framework for reasoning under uncertainty. The model-based philosophy — writing explicit generative models, checking them against data, and using posterior distributions to quantify what we know — provides clarity that few other approaches can match. By combining prior knowledge with observed data, Bayesian methods produce principled probability statements about parameters and predictions. Whether analyzing clinical trials, building machine learning systems, or making business decisions, Bayesian thinking helps us navigate an uncertain world with honesty and precision.