>[!abstract]
>The birthday paradox is the counterintuitive probability result that in a group of just 23 people, there is about a 50% chance that at least two share the same birthday (same day and month, ignoring leap years).
>
>Intuition may suggest that the probability should rise linearly, such that $365 / 2 = 183$ people might be needed for a 50% chance of a birthday collision.
>
>The paradox arises because the number of possible pairwise comparisons grows surprisingly rapidly (at the rate $\frac{n (n-1)}{2}$ for $n$ people). This means that even small groups yield many opportunities for matches.
>
>The birthday paradox illustrates how human intuition struggles with combinatorial growth and probabilistic reasoning, and has applications in cryptography, such as in hash collision analysis.
## Problem statement
Let $p$ be the probability that any two of $n$ people share the same birthday (same day and month, regardless of year). It helps to start from the reverse probability that no two birthdays coincide, i.e., $\bar{p} = 1 - p$.
For example, the probability of two people ($n = 2$) *not* sharing the same birthday depends on the second person having its birthday in the remaining $364$ days other than the birthday of the first person, i.e. $\bar{p} = \frac{365}{365} \times \frac{364}{365} \approx 0.997$So to avoid a birthday collision, the first person has $365$ days available to them, the second person has $364$, the third person $363$, etc. Generalizing, we can extend this equation to larger values of $n$: $\bar{p} = \frac{365}{365} \times \frac{364}{365} \times \dots \times \frac{365 - (n - 1)}{365}$Eventually, once $n \gt 365$ (i.e., there are more people than days in the year), then at least two people will necessarily share the same birthday with absolute certainty, such that $\bar{p} = 0$. This is called the [[Pigeonhole principle]].
We are interested in finding the first $n$ at which $\bar{p} \leq \frac{1}{2}$, i.e., the number of people at which it becomes *more likely than not* that two of them share a birthday.
We'll explore two approaches: computational and analytical.
## Computational approach
This is essentially a brute-force, trial-and-error method of feeding the previous equation with increasingly larger values of $n$. Let’s write some R code that does exactly that.
``` r
days <- 365
p <- numeric()
for (n in 1:days) {
p_bar <- prod((days:(days - n + 1)) / days)
p[n] <- 1 - p_bar
}
```
In the code block above, we iterate from $n = 1$ to $n = 365$. The line starting with `p_bar` is where the previous equation is expressed. In the last row, the function just returns the probability $p$ for each value of $n$ (which is equal to $1 - \bar{p}$) as a vector for later analysis and plotting.
Let’s check which is the first value of $n$ for which $p \gt 0.5$:
``` r
which(p > 0.5)[1]
```
[1] 23
**In a group of 23 people, it is very slightly more likely than not that two of them will share a birthday — QED.**
``` r
which(p > 0.99)[1]
```
[1] 57
Just as counterintuitively, with only 57 people in the group, there is a 99% chance that two of them share a birthday.
Lastly, we can plot the results in a graph:
``` r
plot(
p,
type = "l",
xlab = "n",
ylab = "p",
ylim = c(0, 1)
)
abline(h = 0.5, lty = 2)
abline(v = which(p > 0.5)[1], lty = 2)
```
![[Birthday paradox.png]]
We can see a very rapid increase in the probability $p$ as $n$ increases, with the curve flattening out as the probability of a birthday collision nears certainty starting with around 57 people. $p = 1$ is only reached when $n = 366$.
## Analytical approach
Let's try to find an upper bound on $\bar{p}$, which is also a lower bound on $n$.
Generalizing the previous equation, the probability that $n$ people don't share a birthday can be written as the following product, in which $k = n - 1$ : $\bar{p} = \prod\limits_{k=0}^{n-1} (\frac{365 - k}{365}) $We are interested in the nearest value of $k$ for which $\bar{p} = \frac{1}{2}$, i.e., when the probability of $k$ people not sharing a birthday reaches 50%. This means that with $k + 1$ people, it will be more likely than not that two of them share a birthday, which is the bound we are looking for. $\frac{1}{2} = \prod\limits_{k=0}^{n-1} (\frac{365 - k}{365}) $We can also factor the right-hand side of that equation given the presence of $365$ in both numerator and denominator: $\frac{1}{2} = \prod\limits_{k=0}^{n-1} (1 - \frac{k}{365})$Now, we can simplify the product into a sum by first taking the natural logarithm of both terms: $\ln(\frac{1}{2}) = \ln(\prod\limits_{k=0}^{n-1} (1 - \frac{k}{365}))$And then applying the property of logarithms that $\ln(ab) = \ln(a) + \ln(b)$ to the right-hand side of the previous equation: $\ln(\frac{1}{2}) = \sum_{k=0}^{n-1} \ln(1 - \frac{k}{365})$In the above equation, if we replace the term $-\frac{k}{365}$ with $x$, we surface the term $\ln(1 + x)$ after the sum sign, which can conveniently be expanded using the Taylor-Maclaurin power series of the form: $\ln(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \dots$The only condition to use that expansion is that $-1 \lt x \leq 1$. Is it the case here? As a reminder, $x$ here is just shorthand for $\frac{-k}{365}$, in which $k$ is the number of people above which it will be more likely than not that two of them share a birthday. It should be obvious that in a group of $365$ people, it is very much more likely than not than at least two of them share a birthday (said differently, it would be extremely unlikely that all $365$ have a unique birthday). Therefore, $k \lt 365$, and therefore $-1 \lt \frac{-k}{365} \leq 1$. The condition is satisfied and we can use the power series expansion.
Knowing that the condition $-1 \lt x \leq 1$ is true, it should also be clear that in the power series expansion of $\ln(1 + x)$ above, only the first term $x$ is meaningful; the remaining terms get exponentially smaller. Therefore, we can discard them such that: $\ln(1 + x) \approx x$This enables us to revisit the earlier equation: $\ln(\frac{1}{2}) = \sum_{k=0}^{n-1} \ln(1 - \frac{k}{365})$and simplify it into the following approximation: $\ln(\frac{1}{2}) \approx \sum_{k=0}^{n-1} \frac{-k}{365}$Since $\ln(\frac{1}{2}) = -\ln(2)$, leaving negatives on both sides of the equation that can be removed, we get: $\ln(2) \approx \sum_{k=0}^{n-1} \frac{k}{365}$Which transforms into: $\ln(2) \times 365 \approx \sum_{k=0}^{n-1} k$The left-hand side simplifies to $253$ and the right-hand side is the Gaussian sum of integers from $0$ to $n - 1$, which can be rewritten as $\frac{n(n-1)}{2}$, leaving us with: $253 \approx \frac{n(n-1)}{2}$$253 \times 2 \approx n(n-1)$
$506 \approx n^2 - n$
Which we can finally rewrite as: $n^2 - n - 506 = 0$The only two factors of $506$ that allow us to factorize this equation are $22$ and $23$, such that: $(n - 23)(n + 22) = 0$From here, the only two roots are $n = 23$ and $n = -22$. Since we are referring to people, the negative root is nonsensical. **Therefore, the number of people for which it becomes more likely than not that two of them share a birthday is $n = 23$ — QED.**
>[!related]
>- **North** (upstream): [[Probability theory]], [[Veridical paradox]]
>- **West** (similar): [[Monty Hall problem]], [[Three prisoners problem]]
>- **East** (different): [[Intuitive reasoning]]
>- **South** (downstream): [[Collision probability]]
![[related.base|no-toolbar]]