Hypergeometric Distribution
The hypergeometric distribution models the number of successes in a sample drawn without replacement from a finite population. Unlike the binomial distribution, where each trial is independent, the hypergeometric accounts for the changing composition of the population as items are drawn. It is used extensively in quality control, ecological capture-recapture studies, and combinatorial probability problems like card-drawing scenarios.
Formula
P(X = k) = [C(K, k) · C(N - K, n - k)] / C(N, n), where C(a, b) = a! / (b!(a - b)!)
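The formula translates almost directly into code. The following is a minimal sketch using only Python's standard library; the function name hypergeom_pmf and its argument order are illustrative choices, not part of any particular library.

```python
from math import comb

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) when n items are drawn without replacement from N items,
    K of which are successes. (Hypothetical helper for illustration.)"""
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    # Values of k outside the support have probability zero.
    if k < max(0, n + K - N) or k > min(n, K):
        return 0.0
    # comb works on exact integers, so only the final division is floating point.
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

print(round(hypergeom_pmf(2, 52, 13, 5), 4))
```

Keeping the numerator and denominator as integers until the last step avoids the rounding error that accumulates when factorials are converted to floats early.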
Mean (Expected Value)
nK/N
Variance
n · (K/N) · (1 - K/N) · (N - n)/(N - 1)
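One way to sanity-check the closed-form mean and variance is to compare them with the moments obtained by summing over the probability mass function. The sketch below does that; both function names are hypothetical, and it assumes N ≥ 2 so that the division by N - 1 is defined.

```python
from math import comb

def closed_form_moments(N: int, K: int, n: int) -> tuple[float, float]:
    """Mean and variance from the closed-form expressions (assumes N >= 2)."""
    if N < 2:
        raise ValueError("this sketch assumes N >= 2")
    p = K / N
    mean = n * p
    variance = n * p * (1 - p) * (N - n) / (N - 1)
    return mean, variance

def summed_moments(N: int, K: int, n: int) -> tuple[float, float]:
    """Mean and variance computed directly from the PMF over its support."""
    lo, hi = max(0, n + K - N), min(n, K)
    denom = comb(N, n)
    pmf = {k: comb(K, k) * comb(N - K, n - k) / denom for k in range(lo, hi + 1)}
    mean = sum(k * prob for k, prob in pmf.items())
    variance = sum((k - mean) ** 2 * prob for k, prob in pmf.items())
    return mean, variance

print(closed_form_moments(52, 13, 5))
print(summed_moments(52, 13, 5))
```

The two results should agree up to floating-point rounding for any valid choice of N, K, and n.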
Parameters
N: The total number of items in the population. Must be a non-negative integer.
K: The total number of success items in the population. Must satisfy 0 ≤ K ≤ N.
n: The number of items drawn without replacement from the population. Must satisfy 0 ≤ n ≤ N.
Key Properties
- Models sampling without replacement from a finite population, so trials are not independent
- X can range from max(0, n + K - N) to min(n, K)
- The factor (N - n)/(N - 1) in the variance is called the finite population correction factor; it is at most 1, so the hypergeometric variance never exceeds the corresponding binomial variance
- As N → ∞ with n fixed and K/N → p, the hypergeometric converges to the binomial Bin(n, p)
- When n/N < 0.05 (sample is less than 5% of population), the binomial is a good approximation
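The range of X is easy to get wrong when the sample is large relative to the number of failures. The sketch below enumerates the support for a hypothetical population (N = 10, K = 7, n = 5, values chosen only for illustration) and checks that the probabilities over that range sum to 1.

```python
from math import comb

def support(N: int, K: int, n: int) -> range:
    """All values of k with nonzero probability (hypothetical helper)."""
    lo = max(0, n + K - N)  # the sample must contain at least this many successes
    hi = min(n, K)          # it cannot contain more than n or more than K
    return range(lo, hi + 1)

# Illustrative numbers: only 3 failures exist, so a sample of 5
# must contain at least 2 successes.
N, K, n = 10, 7, 5
ks = list(support(N, K, n))
print(ks)  # [2, 3, 4, 5]

total = sum(comb(K, k) * comb(N - K, n - k) for k in ks) / comb(N, n)
print(total)  # 1.0
```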
Example
A deck of 52 cards contains 13 hearts. You draw 5 cards without replacement. What is the probability of getting exactly 2 hearts?
Here N = 52, K = 13, n = 5, k = 2. P(X = 2) = [C(13, 2) · C(39, 3)] / C(52, 5) = [78 · 9139] / 2598960 = 712842 / 2598960 ≈ 0.2743.
Result: P(X = 2) ≈ 0.2743, or about 27.43%
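There is about a 27.43% chance of drawing exactly 2 hearts in a 5-card hand. The expected number of hearts is nK/N = 5 × 13/52 = 1.25. Getting 2 hearts is above average, but it is not the most likely individual outcome: exactly 1 heart is more probable, with P(X = 1) = [C(13, 1) · C(39, 4)] / C(52, 5) = 1069263 / 2598960 ≈ 0.4114.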
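The card example can be reproduced in a few lines. This sketch uses only the standard library and tabulates the probability of every possible heart count, so the result for 2 hearts can be compared with the other outcomes; the variable names are illustrative.

```python
from math import comb

N, K, n = 52, 13, 5          # deck size, hearts in the deck, cards drawn
total_hands = comb(N, n)     # 2598960 possible 5-card hands

dist = {}
for k in range(0, min(n, K) + 1):
    # Ways to choose k hearts times ways to choose the remaining non-hearts.
    ways = comb(K, k) * comb(N - K, n - k)
    dist[k] = ways / total_hands
    print(f"{k} hearts: {ways} / {total_hands} = {dist[k]:.4f}")

most_likely = max(dist, key=dist.get)
print("most likely count:", most_likely)
```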
When to Use
- When sampling without replacement from a finite population with two categories (defective/non-defective, tagged/untagged, hearts/non-hearts)
- In quality control when inspecting a batch of items without replacement and counting defectives
- In ecological studies using capture-recapture methods to estimate population sizes
- When the sample size is a substantial fraction of the population (n/N > 0.05), making the binomial approximation inaccurate
Common Mistakes
- Using the binomial distribution when sampling without replacement from a small population. The binomial assumes independence, which is violated without replacement. Use hypergeometric when n/N > 0.05.
- Getting the parameters confused. N is the population size, K is the number of success items in the population, n is the sample size, and k is the number of successes in the sample.
- Forgetting the constraints on k: it must satisfy max(0, n + K - N) ≤ k ≤ min(n, K). Not all values from 0 to n are possible.
- Neglecting the finite population correction when computing the variance. The hypergeometric variance is always less than or equal to the binomial variance for the same n and p = K/N.
FAQs
Common questions about Hypergeometric Distribution
The binomial is a good approximation to the hypergeometric when the sample size is small relative to the population size, specifically when n/N < 0.05 (the 5% rule). In this case, removing items from the population barely changes the composition, so the trials are approximately independent. Set p = K/N and use Bin(n, p). For example, sampling 10 items from a population of 10,000 can safely use the binomial, but sampling 10 from 50 should use the hypergeometric.
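To see the 5% rule in action, the sketch below measures the largest pointwise gap between the hypergeometric and binomial probabilities for the two sampling scenarios mentioned above. The success counts (K = 2,000 and K = 10, both giving p = 0.2) are assumptions added for illustration, since the text does not specify them.

```python
from math import comb

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    if k < max(0, n + K - N) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def binom_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def max_abs_gap(N: int, K: int, n: int) -> float:
    """Largest |hypergeometric - binomial| over k = 0..n, with p = K/N."""
    p = K / N
    return max(abs(hypergeom_pmf(k, N, K, n) - binom_pmf(k, n, p))
               for k in range(n + 1))

# Assumed success counts: 20% successes in both populations.
gap_large_pop = max_abs_gap(10_000, 2_000, 10)  # n/N = 0.001
gap_small_pop = max_abs_gap(50, 10, 10)         # n/N = 0.2

print(f"10 from 10,000: max gap = {gap_large_pop:.6f}")
print(f"10 from 50:     max gap = {gap_small_pop:.6f}")
```

The gap for the large population should come out far smaller than the gap for the small one, which is the practical content of the rule.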
The finite population correction (FPC) factor is (N - n)/(N - 1), which appears in the hypergeometric variance formula. It adjusts for the fact that sampling without replacement from a finite population reduces variability compared to sampling with replacement. When N is much larger than n, the FPC is close to 1 and can be ignored. When n is a substantial fraction of N, the FPC significantly reduces the variance. In the extreme case where n = N (you sample the entire population), the FPC equals 0 and there is no variability at all.
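A quick numerical look at the FPC shows how it behaves across the cases described above. The helper name fpc is hypothetical, and the sketch assumes N ≥ 2 so the denominator is nonzero.

```python
def fpc(N: int, n: int) -> float:
    """Finite population correction (N - n)/(N - 1); assumes N >= 2."""
    if N < 2 or not (0 <= n <= N):
        raise ValueError("this sketch assumes N >= 2 and 0 <= n <= N")
    return (N - n) / (N - 1)

# Small sample from a large population: the factor is close to 1.
print(round(fpc(10_000, 10), 4))  # 0.9991
# Sample is 20% of the population: the variance shrinks noticeably.
print(round(fpc(50, 10), 4))      # 0.8163
# Sampling the whole population leaves no variability.
print(fpc(50, 50))                # 0.0
```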
Fisher's exact test uses the hypergeometric distribution to test for association in a 2x2 contingency table. Under the null hypothesis of no association, the cell counts follow a hypergeometric distribution with the row and column totals fixed. The test computes the exact probability of observing data as extreme as (or more extreme than) the actual table. It is preferred over the chi-square test when sample sizes are small or expected frequencies are below 5, because it does not rely on any large-sample approximation.
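The connection to the hypergeometric distribution can be made concrete with a small sketch. It treats the top-left cell of a 2x2 table [[a, b], [c, d]] as the hypergeometric count, with N the grand total, K the first row total, and n the first column total. The function name is hypothetical, and the two-sided p-value uses one common convention (summing all tables no more probable than the observed one); other conventions exist. The example table is made up for illustration.

```python
from math import comb

def fisher_exact_p(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Return (one-sided 'greater' p-value, two-sided p-value) for the
    2x2 table [[a, b], [c, d]] with all margins held fixed."""
    N = a + b + c + d   # grand total
    K = a + b           # first row total plays the role of success items
    n = a + c           # first column total plays the role of the sample
    lo, hi = max(0, n + K - N), min(n, K)
    denom = comb(N, n)
    pmf = {k: comb(K, k) * comb(N - K, n - k) / denom for k in range(lo, hi + 1)}

    observed = pmf[a]
    # One-sided: tables with a top-left count at least as large as observed.
    p_greater = sum(prob for k, prob in pmf.items() if k >= a)
    # Two-sided: tables no more probable than the observed one
    # (small relative tolerance guards against floating-point ties).
    p_two_sided = sum(prob for prob in pmf.values()
                      if prob <= observed * (1 + 1e-9))
    return p_greater, p_two_sided

p_greater, p_two_sided = fisher_exact_p(3, 1, 1, 3)
print(round(p_greater, 4), round(p_two_sided, 4))
```

For real analyses an established statistics library is usually the better choice; this sketch is only meant to show where the hypergeometric probabilities enter the test.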