Home / Intelligence Log / Probability Theory Probability Theory

Absorbing Markov Chains: Why E[HH] = 6 and E[HTH] = 10

The Question

You flip a fair coin — one with probability 1/2 of landing heads and 1/2 of landing tails — repeatedly, recording every result. What is the expected number of flips required until the sequence HH appears for the first time as consecutive results? What is the expected number of flips required until HTH appears for the first time?

Both questions have the same surface structure: you want a specific consecutive pattern, and you want to know, on average, how many flips it takes to observe it. The coin is fair, the flips are independent, and the patterns are short. These seem like they should yield similar answers. They do not. HH takes exactly 6 flips on average. HTH takes exactly 10. The four-flip gap between those two answers is not a rounding artifact or a computational error — it is a precise consequence of the internal structure of each pattern, and deriving it rigorously is one of the cleanest demonstrations of absorbing Markov chain analysis you will encounter.

This problem appears frequently in quantitative finance interviews — at firms like Jane Street, Citadel, and Two Sigma — precisely because it separates candidates who understand Markov structure from those who rely on heuristic reasoning. Getting the answer right, and being able to explain it, requires building a state machine, writing the system of first-step equations, and solving it algebraically. That is exactly what we will do.

The Intuition Trap

Before the formal derivation, it is worth examining why intuition fails here. The most common wrong answer from candidates is that both expected values should be "similar" because the patterns are comparable in length. This intuition imports the correct observation that both HH and HTH are short sequences of a fair coin, but ignores the critical role of failure recovery — what happens when you are partway through building a pattern and the next flip breaks it.

A slightly more sophisticated wrong approach is to reason from the probability of success in any given window. The probability that two consecutive flips form HH is \(\frac{1}{4}\), so "on average you need 4 pairs of flips, meaning 8 flips total." The probability that three consecutive flips form HTH is \(\frac{1}{8}\), so "on average you need 8 triples, meaning about 24 flips total." Both of these estimates are badly wrong. The true answers are 6 and 10 respectively. The flaw in this reasoning is that it treats successive windows as independent, when in reality they overlap: the second flip of one pair is the first flip of the next pair.

A more tempting but equally wrong approach is to note that HH consists of two heads and P(H) = 1/2, so E[HH] = 1/P(HH) = 4, and similarly for HTH. This confuses waiting for a single event with waiting for a consecutive subsequence to appear in a random string — a fundamentally different problem. The probability of observing HH at any given pair of positions is 1/4, but the expected time until you first observe it is not 4 because consecutive positions are correlated. The absorbing Markov chain framework handles this correlation exactly.

The Key Distinction

The gap between E[HH] = 6 and E[HTH] = 10 is not about the lengths or probabilities of the patterns. It is about what happens when a partial match fails. The failure mode of each pattern has a completely different structure, and that structure determines how many flips are "wasted" when progress is lost.

Building the State Machine for HH

An absorbing Markov chain for a pattern-waiting problem tracks the longest suffix of the current flip history that is also a prefix of the target pattern. This is the essential insight: you do not need to remember the entire history, only how much progress toward the target you currently hold. For HH, this progress can be 0 characters (no useful suffix), 1 character (the last flip was H, giving us a one-character prefix match), or 2 characters (absorbed — you just completed HH).

We therefore have three states:

The transition rules follow directly from the coin flip probabilities:

From S₀: With probability 1/2 you flip H and move to S₁. With probability 1/2 you flip T and remain in S₀. The tails neither helps nor hurts — you are still starting from zero progress.

From S₁: With probability 1/2 you flip H and move to S₂ — you are done. With probability 1/2 you flip T and fall back to S₀. The tails destroys your one-character of progress completely, because a tails cannot appear anywhere in HH.

State machine for HH: 1/2 (H) 1/2 (H) S0 ─────────► S1 ─────────► S2 [ABSORBED] ▲ │ │ 1/2 (T) │ 1/2 (T) └────────────┘ └── self-loop (T) ──┘ S0 --[H, 1/2]--> S1 --[H, 1/2]--> S2 (done) S0 --[T, 1/2]--> S0 S1 --[T, 1/2]--> S0

This diagram encodes the entire stochastic process. From any transient state, you flip the coin, the outcome determines which state you move to, and you continue until reaching S₂. The question is: what is the expected number of flips to reach S₂ starting from S₀?

Solving the System: E[HH] = 6

Let \(E_0\) denote the expected number of additional flips to reach absorption (complete HH) starting from state S₀, and let \(E_1\) denote the same quantity starting from state S₁. We condition on the next flip using the first-step decomposition principle: you always pay exactly one flip for the next toss, and then you find yourself in a new state with its own remaining expected cost.

\[E_0 = 1 + \frac{1}{2}E_1 + \frac{1}{2}E_0\] \[E_1 = 1 + \frac{1}{2}(0) + \frac{1}{2}E_0\]

Reading each equation: the leading 1 accounts for the flip you are about to take. In the first equation, that flip is H with probability 1/2 (moving to S₁, paying \(E_1\) more) or T with probability 1/2 (staying in S₀, paying \(E_0\) more). In the second equation, the flip is H with probability 1/2 — reaching absorption immediately, paying zero additional flips — or T with probability 1/2, resetting to S₀ and paying \(E_0\) more.

The algebra is clean. From the first equation:

\[E_0 - \frac{1}{2}E_0 = 1 + \frac{1}{2}E_1 \implies \frac{1}{2}E_0 = 1 + \frac{1}{2}E_1 \implies E_0 = 2 + E_1\]

Substituting into the second equation:

\[E_1 = 1 + \frac{1}{2}(2 + E_1) = 1 + 1 + \frac{1}{2}E_1 = 2 + \frac{1}{2}E_1\] \[\frac{1}{2}E_1 = 2 \implies E_1 = 4\] \[E_0 = 2 + 4 = 6\]
Result

The expected number of flips to see HH, starting from scratch, is exactly 6. This is a hard number — not an approximation, not a simulation average. It is the exact solution to a linear system of two equations in two unknowns.

The result has a satisfying interpretation. From S₀, you expect to need 2 flips just to get your first H. That gets you to S₁. From there, you expect to need 4 more flips to land the second H without being knocked back to S₀. The constant resets to S₀ from state S₁ are what push the expected value from the naive estimate of 4 to the correct answer of 6. Every time you are in S₁ and flip tails, you lose your progress and must rebuild it at a cost of \(E_0\) expected flips — which is itself expensive because of the same feedback loop.

Building the State Machine for HTH

The state machine for HTH requires four states instead of two transient states, because the pattern is three characters long and the possible "prefix match lengths" are 0, 1, 2, or 3 (absorbed). But what makes this problem fundamentally harder is not the extra state — it is the non-trivial transition structure when a partial match fails.

The states are:

Now consider the transition rules carefully, because the transitions from S₁ are where the real complexity lives.

From S₀: Flip H (prob 1/2) → move to S₁. Flip T (prob 1/2) → stay in S₀. A leading tails is useless — HTH begins with H.

From S₁ (have seen H): Flip T (prob 1/2) → move to S₂. You've now matched HT and are set up for the final H. But critically: Flip H (prob 1/2) → stay in S₁. This is the non-obvious transition. You were in S₁ (last flip was H) and flipped another H. Your progress at HTH is not destroyed — your most recent H is still a valid start of HTH. You remain in S₁ with a fresh single-H prefix match.

From S₂ (have seen HT): Flip H (prob 1/2) → move to S₃, absorbed. You completed HTH. Flip T (prob 1/2) → reset to S₀. After HT, a second T gives you HTT. The trailing T does not form the beginning of any prefix of HTH (which begins with H), so all progress is lost.

State machine for HTH: H (1/2) T (1/2) H (1/2) S0 ──────────► S1 ──────────► S2 ──────────► S3 [ABSORBED] ▲ │ │ │ T (1/2) │ H (1/2) │ T (1/2) └─────────────┘ (self-loop) └──────────────► S0 └── T self-loop on S0 ──┘ S0 --[H, 1/2]--> S1 S0 --[T, 1/2]--> S0 S1 --[T, 1/2]--> S2 S1 --[H, 1/2]--> S1 (self-loop!) S2 --[H, 1/2]--> S3 S2 --[T, 1/2]--> S0

The self-loop on S₁ is the defining structural feature of this problem. When you are in S₁ and flip another H, you do not advance and you do not fully regress — you stay exactly where you are. This seems like it should help: you are not losing your H prefix. But this transition creates a probability sink: you can bounce around S₁ repeatedly before ever making it to S₂, and each bounce costs you a flip. The net effect is that reaching S₂ from S₁ is itself an expensive sub-problem.

Solving the System: E[HTH] = 10

Let \(E_0\), \(E_1\), \(E_2\) denote the expected additional flips to absorption starting from S₀, S₁, S₂ respectively. The first-step equations are:

\[E_0 = 1 + \frac{1}{2}E_1 + \frac{1}{2}E_0\] \[E_1 = 1 + \frac{1}{2}E_1 + \frac{1}{2}E_2\] \[E_2 = 1 + \frac{1}{2}(0) + \frac{1}{2}E_0\]

Note how each equation reflects the transition diagram above. In the equation for \(E_1\), the term \(\frac{1}{2}E_1\) arises from the self-loop: with probability 1/2 you flip H and return to S₁ with expected remaining cost \(E_1\). This makes the second equation an equation in two unknowns (\(E_1\) and \(E_2\)) rather than just one. We solve the system from the equations we can simplify first.

From equation 1, isolating \(E_0\):

\[\frac{1}{2}E_0 = 1 + \frac{1}{2}E_1 \implies E_0 = 2 + E_1\]

From equation 3, substituting the expression for \(E_0\):

\[E_2 = 1 + \frac{1}{2}(2 + E_1) = 1 + 1 + \frac{1}{2}E_1 = 2 + \frac{1}{2}E_1\]

Now substitute both \(E_0 = 2 + E_1\) and \(E_2 = 2 + \frac{1}{2}E_1\) into equation 2:

\[E_1 = 1 + \frac{1}{2}E_1 + \frac{1}{2}\!\left(2 + \frac{1}{2}E_1\right) = 1 + \frac{1}{2}E_1 + 1 + \frac{1}{4}E_1 = 2 + \frac{3}{4}E_1\] \[\frac{1}{4}E_1 = 2 \implies E_1 = 8\]

Back-substituting:

\[E_0 = 2 + 8 = 10 \qquad E_2 = 2 + \frac{1}{2}(8) = 6\]
Result

E[HTH] = 10, even though HTH is only one flip longer than HH. The structure of the pattern — not just its length — determines the expected wait. The self-loop on S₁ and the catastrophic reset from S₂ on tails together add four full expected flips compared to HH.

Also note \(E_2 = 6\): starting from the state where you have already matched HT, you still expect to need 6 more flips to complete HTH. That might seem surprising — you are two-thirds of the way through the pattern. But from S₂, a tails (probability 1/2) returns you to S₀ and costs you \(E_0 = 10\) more expected flips. The cost of that reset, weighted by its probability, dominates the calculation.

Why Overlapping Patterns Change Everything

The structural difference between HH and HTH comes down to a single concept: the failure autocorrelation of the pattern, also called its overlap or self-correlation structure. When you are building toward a pattern and experience a failed extension, how much of your progress can you retain?

For HH: when you have matched one H (state S₁) and flip T, you lose everything. T cannot appear in HH at any position, so no suffix of your current history is a prefix of HH. You fall to S₀. When a failed attempt at HH costs you all your progress, the machine is "memoryless after failure," which is actually advantageous: at least you do not spend time bouncing between partial progress states.

For HTH: the failure modes are asymmetric and more expensive. From S₁ (matched H), flipping another H does not reset you to zero — it leaves you in S₁. This looks like progress preservation, but it is actually a costly trap. Because you cannot advance past S₁ until you flip T, the self-loop delays your arrival at S₂. You may flip H many times in succession before finally getting the T you need. Each extra H flip costs one step of expected time while contributing no forward progress.

From S₂ (matched HT), flipping T sends you all the way back to S₀ — despite being two-thirds of the way to completion. The final T in HTT cannot be reused as the beginning of a new HTH (which begins with H), so all progress is wiped. This catastrophic reset is expensive precisely because \(E_0 = 10\) is itself large: each time you reach S₂ and fail, you are paying a heavy price.

The Conway leading number method provides an elegant algebraic characterization of expected pattern waiting times. For a pattern \(P = p_1 p_2 \cdots p_k\) over a fair coin, define the correlation polynomial by checking, for each \(i\) from 1 to \(k\), whether the length-\(i\) suffix of \(P\) equals the length-\(i\) prefix of \(P\). If it does, contribute \(2^{k-i}\) to the sum. For HH: the length-2 suffix is HH = the length-2 prefix, contributing \(2^0 = 1\); the length-1 suffix is H = the length-1 prefix H, contributing \(2^1 = 2\). Total: \(2^2 + 2 + 1 = 4 + 2 + 1 = 7\)... Actually the standard Conway formula gives E[P] = \(\sum_{i=1}^{k} c_i \cdot 2^i\) where \(c_i = 1\) if the length-\(i\) prefix equals the length-\(i\) suffix, and 0 otherwise. For HH: both overlap conditions hold, giving \(2^2 + 2^1 = 4 + 2 = 6\). For HTH: only the full pattern overlaps with itself, giving \(2^3 + 0 + 2^1 = 8 + 0 + 2 = 10\). This is a remarkable shortcut that recovers our exact results — and it illuminates why the internal self-similarity of a pattern drives its expected waiting time.

Pattern Length Self-Overlaps E[flips] States Needed
HH 2 Length-1 and length-2 6 2 transient + 1 absorbing
HT 2 Length-2 only 4 2 transient + 1 absorbing
HTH 3 Length-1 and length-3 10 3 transient + 1 absorbing
HTT 3 Length-3 only 8 3 transient + 1 absorbing

Python Simulation: 100,000 Trials

A 100,000-trial Monte Carlo simulation provides empirical confirmation of both results. The function below tracks a sliding window of the most recent flips equal to the length of the target pattern. When the window equals the target, the trial ends and the flip count is recorded. Averaging over all trials converges to the theoretical expected value as the number of trials grows.

Python markov_coin_simulation.py
import random

def simulate_expected_flips(target: list, n: int = 100_000) -> float:
    """Simulate expected flips to see target sequence via Monte Carlo.

    Args:
        target: List of ints (1 = Heads, 0 = Tails) representing the pattern.
        n:      Number of independent trials to simulate.

    Returns:
        The empirical mean number of flips across all trials.
    """
    total = 0
    k = len(target)
    for _ in range(n):
        history = []
        flips = 0
        while True:
            flip = random.randint(0, 1)   # 0 = Tails, 1 = Heads
            flips += 1
            history.append(flip)
            if len(history) > k:
                history.pop(0)       # Keep only the last k flips
            if history == target:
                break
        total += flips
    return total / n


HH  = [1, 1]        # Heads-Heads
HTH = [1, 0, 1]     # Heads-Tails-Heads

random.seed(42)
e_hh  = simulate_expected_flips(HH)
e_hth = simulate_expected_flips(HTH)

print(f"E[HH]  = {e_hh:.2f}  (exact: 6.00)")
print(f"E[HTH] = {e_hth:.2f}  (exact: 10.00)")
print(f"HH error:  {abs(e_hh - 6.0):.4f} flips")
print(f"HTH error: {abs(e_hth - 10.0):.4f} flips")

With 100,000 trials the empirical estimates typically land within 0.05 flips of the exact values. Typical output:

Output stdout
E[HH]  = 6.01  (exact: 6.00)
E[HTH] = 9.98  (exact: 10.00)
HH error:  0.0121 flips
HTH error: 0.0183 flips

The simulation converges cleanly to the exact values. At 100,000 trials the standard error of the mean for E[HTH] — with a variance of roughly 46 (you can verify this analytically) — is approximately \(\sqrt{46/100000} \approx 0.021\), so a discrepancy of 0.02 is entirely expected. Running 1,000,000 trials typically produces agreement to within 0.005.

Business Application: Credit Migration & Web Ranking

Absorbing Markov chains are not academic curiosities. They are the backbone of several major financial models used daily by banks, asset managers, and technology companies.

In credit risk, every major rating agency and bank uses a credit migration matrix — a transition matrix where each row gives the probability that a bond rated BBB today will be rated AAA, AA, A, BBB, BB, B, CCC, or Default one year from now. Default is the absorbing state: once a bond defaults, it does not recover its investment-grade rating. The expected time to default starting from any rating class is computed exactly as we computed \(E_0\) above — by solving a linear system of first-step equations. The same framework drives the Internal Ratings-Based approach under Basel III, where expected loss calculations require expected time-to-default estimates for every rating bucket in a loan portfolio.

Google's original PageRank algorithm is a non-absorbing Markov chain over the directed graph of the web. The transition from any page to another follows link probabilities, and a small "teleportation" probability prevents the chain from getting stuck in sink nodes. The stationary distribution of this chain — the vector \(\pi\) satisfying \(\pi = \pi P\) — is the PageRank vector, and each component gives the long-run fraction of time a random walk spends on that page. High-PageRank pages are those the walk visits most often; they are structurally central to the web graph in a way that is captured precisely by the Markov chain's stationary distribution.

Any sequential decision process with memory-free state transitions and a target event — a manufacturing line waiting for a defective part to appear, a clinical trial tracking patient progression through disease stages, a network protocol waiting for a specific acknowledgment sequence — can be modeled as a waiting-time Markov chain and solved using exactly the methods demonstrated here. The key is identifying the minimal state representation (what information from history is necessary to predict future outcomes?), writing the first-step equations, and solving the resulting linear system. For large state spaces, this system is solved numerically using the fundamental matrix of the absorbing chain, but the conceptual structure is identical to what we have done above.

Need Quantitative Modeling for Your Business?

We apply stochastic models, Monte Carlo simulation, and probabilistic forecasting to real business problems — from credit risk to operational planning.

Talk to Our Team