>[!abstract] >The birthday paradox is the counterintuitive probability result that in a group of just 23 people, there is about a 50% chance that at least two share the same birthday (same day and month, ignoring leap years). > >Intuition may suggest that the probability should rise linearly, such that $365 / 2 = 183$ people might be needed for a 50% chance of a birthday collision. > >The paradox arises because the number of possible pairwise comparisons grows surprisingly rapidly (at the rate $\frac{n (n-1)}{2}$ for $n$ people). This means that even small groups yield many opportunities for matches. > >The birthday paradox illustrates how human intuition struggles with combinatorial growth and probabilistic reasoning, and has applications in cryptography, such as in hash collision analysis. ## Problem statement Let $p$ be the probability that any two of $n$ people share the same birthday (same day and month, regardless of year). It helps to start from the reverse probability that no two birthdays coincide, i.e., $\bar{p} = 1 - p$. For example, the probability of two people ($n = 2$) *not* sharing the same birthday depends on the second person having its birthday in the remaining $364$ days other than the birthday of the first person, i.e. $\bar{p} = \frac{365}{365} \times \frac{364}{365} \approx 0.997$So to avoid a birthday collision, the first person has $365$ days available to them, the second person has $364$, the third person $363$, etc. Generalizing, we can extend this equation to larger values of $n$: $\bar{p} = \frac{365}{365} \times \frac{364}{365} \times \dots \times \frac{365 - (n - 1)}{365}$Eventually, once $n \gt 365$ (i.e., there are more people than days in the year), then at least two people will necessarily share the same birthday with absolute certainty, such that $\bar{p} = 0$. This is called the [[Pigeonhole principle]]. We are interested in finding the first $n$ at which $\bar{p} \leq \frac{1}{2}$, i.e., the number of people at which it becomes *more likely than not* that two of them share a birthday. We'll explore two approaches: computational and analytical. ## Computational approach This is essentially a brute-force, trial-and-error method of feeding the previous equation with increasingly larger values of $n$. Let’s write some R code that does exactly that. ``` r days <- 365 p <- numeric() for (n in 1:days) { p_bar <- prod((days:(days - n + 1)) / days) p[n] <- 1 - p_bar } ``` In the code block above, we iterate from $n = 1$ to $n = 365$. The line starting with `p_bar` is where the previous equation is expressed. In the last row, the function just returns the probability $p$ for each value of $n$ (which is equal to $1 - \bar{p}$) as a vector for later analysis and plotting. Let’s check which is the first value of $n$ for which $p \gt 0.5$: ``` r which(p > 0.5)[1] ``` [1] 23 **In a group of 23 people, it is very slightly more likely than not that two of them will share a birthday — QED.** ``` r which(p > 0.99)[1] ``` [1] 57 Just as counterintuitively, with only 57 people in the group, there is a 99% chance that two of them share a birthday. Lastly, we can plot the results in a graph: ``` r plot( p, type = "l", xlab = "n", ylab = "p", ylim = c(0, 1) ) abline(h = 0.5, lty = 2) abline(v = which(p > 0.5)[1], lty = 2) ``` ![[Birthday paradox.png]] We can see a very rapid increase in the probability $p$ as $n$ increases, with the curve flattening out as the probability of a birthday collision nears certainty starting with around 57 people. $p = 1$ is only reached when $n = 366$. ## Analytical approach Let's try to find an upper bound on $\bar{p}$, which is also a lower bound on $n$. Generalizing the previous equation, the probability that $n$ people don't share a birthday can be written as the following product, in which $k = n - 1$ : $\bar{p} = \prod\limits_{k=0}^{n-1} (\frac{365 - k}{365}) $We are interested in the nearest value of $k$ for which $\bar{p} = \frac{1}{2}$, i.e., when the probability of $k$ people not sharing a birthday reaches 50%. This means that with $k + 1$ people, it will be more likely than not that two of them share a birthday, which is the bound we are looking for. $\frac{1}{2} = \prod\limits_{k=0}^{n-1} (\frac{365 - k}{365}) $We can also factor the right-hand side of that equation given the presence of $365$ in both numerator and denominator: $\frac{1}{2} = \prod\limits_{k=0}^{n-1} (1 - \frac{k}{365})$Now, we can simplify the product into a sum by first taking the natural logarithm of both terms: $\ln(\frac{1}{2}) = \ln(\prod\limits_{k=0}^{n-1} (1 - \frac{k}{365}))$And then applying the property of logarithms that $\ln(ab) = \ln(a) + \ln(b)$ to the right-hand side of the previous equation: $\ln(\frac{1}{2}) = \sum_{k=0}^{n-1} \ln(1 - \frac{k}{365})$In the above equation, if we replace the term $-\frac{k}{365}$ with $x$, we surface the term $\ln(1 + x)$ after the sum sign, which can conveniently be expanded using the Taylor-Maclaurin power series of the form: $\ln(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \dots$The only condition to use that expansion is that $-1 \lt x \leq 1$. Is it the case here? As a reminder, $x$ here is just shorthand for $\frac{-k}{365}$, in which $k$ is the number of people above which it will be more likely than not that two of them share a birthday. It should be obvious that in a group of $365$ people, it is very much more likely than not than at least two of them share a birthday (said differently, it would be extremely unlikely that all $365$ have a unique birthday). Therefore, $k \lt 365$, and therefore $-1 \lt \frac{-k}{365} \leq 1$. The condition is satisfied and we can use the power series expansion. Knowing that the condition $-1 \lt x \leq 1$ is true, it should also be clear that in the power series expansion of $\ln(1 + x)$ above, only the first term $x$ is meaningful; the remaining terms get exponentially smaller. Therefore, we can discard them such that: $\ln(1 + x) \approx x$This enables us to revisit the earlier equation: $\ln(\frac{1}{2}) = \sum_{k=0}^{n-1} \ln(1 - \frac{k}{365})$and simplify it into the following approximation: $\ln(\frac{1}{2}) \approx \sum_{k=0}^{n-1} \frac{-k}{365}$Since $\ln(\frac{1}{2}) = -\ln(2)$, leaving negatives on both sides of the equation that can be removed, we get: $\ln(2) \approx \sum_{k=0}^{n-1} \frac{k}{365}$Which transforms into: $\ln(2) \times 365 \approx \sum_{k=0}^{n-1} k$The left-hand side simplifies to $253$ and the right-hand side is the Gaussian sum of integers from $0$ to $n - 1$, which can be rewritten as $\frac{n(n-1)}{2}$, leaving us with: $253 \approx \frac{n(n-1)}{2}$$253 \times 2 \approx n(n-1)$ $506 \approx n^2 - n$ Which we can finally rewrite as: $n^2 - n - 506 = 0$The only two factors of $506$ that allow us to factorize this equation are $22$ and $23$, such that: $(n - 23)(n + 22) = 0$From here, the only two roots are $n = 23$ and $n = -22$. Since we are referring to people, the negative root is nonsensical. **Therefore, the number of people for which it becomes more likely than not that two of them share a birthday is $n = 23$ — QED.** >[!related] >- **North** (upstream): [[Probability theory]], [[Veridical paradox]] >- **West** (similar): [[Monty Hall problem]], [[Three prisoners problem]] >- **East** (different): [[Intuitive reasoning]] >- **South** (downstream): [[Collision probability]] ![[related.base|no-toolbar]]