A Visual History — From Euclid's Ropes to the Transformer
Antiquity before AD 1
12 algorithms
~600 BC: Words, Letters, Codes
Anagram Check
The Galileo Anagram
Greek poetry to Padua
Made it practical to compare letter structure instead of reading meaning.
For instance: A spell checker can tell that “listen” and “silent” use the same letters.
s ← "listen"
t ← "silent"
IF length(s) != length(t) THEN
    RETURN FALSE
ENDIF
count ← empty map<char, int>
FOR EACH c IN s
    count[c] ← count.get(c, 0) + 1
ENDFOR
FOR EACH c IN t
    IF c NOT IN count THEN
        RETURN FALSE
    ENDIF
    count[c] ← count[c] - 1
    IF count[c] < 0 THEN
        RETURN FALSE
    ENDIF
ENDFOR
RETURN TRUE
Before computers, anagrams were literary games and secret codes. Modern computing transformed the same idea into frequency-count algorithms used in indexing and text analysis.
Humans have long enjoyed rearranging letters into meaningful patterns. Computers later needed fast ways to compare word structure for spell checkers, dictionaries, cryptography, and search systems.
Teaches: Order doesn't matter; count structure, not position
The Idea
Forget the letters' positions and just count how many of each letter the first word has. Then walk through the second word and decrement those counts, one letter at a time. If at any point a count goes negative — or a letter shows up that wasn't in the first word — the words can't be anagrams. If the strings have different lengths, they obviously can't match either.
The invariant that makes this work: at every step during the second walk, count[c] equals the number of copies of letter c from s that have not yet been matched by a letter in t. If both strings hold the same letter multiset, every letter in t finds a partner, and because the lengths are equal, every count ends at exactly zero. The whole thing runs in O(n) time: one pass to build the map, one pass to drain it.
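The counting idea above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function name `is_anagram` is borrowed from the prompt later in this section, and it assumes exact character comparison with no case folding or Unicode normalization. It tests the count before decrementing, which is equivalent to the "goes negative" check in the pseudocode.

```python
from collections import Counter


def is_anagram(s: str, t: str) -> bool:
    """Return True if s and t use the same letters the same number of times."""
    if len(s) != len(t):
        return False
    # count[c] = copies of c from s that have not yet been matched by t
    count = Counter(s)
    for c in t:
        if count[c] == 0:  # c is missing from s, or t has too many copies
            return False
        count[c] -= 1
    return True


print(is_anagram("listen", "silent"))  # True
print(is_anagram("aab", "abb"))        # False
```

`Counter` returns 0 for a letter it has never seen, so the "letter not in the map" case and the "too many copies" case collapse into one test.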
Trace
| letter | l | i | s | t | e | n |
|---|---|---|---|---|---|---|
| count | 1 | 1 | 1 | 1 | 1 | 1 |
Where It's Used Today
Spell-checkers — many spell engines look for words with the same letter multiset as a misspelled candidate (e.g., teh → the).
Word games — Scrabble, Words with Friends, and crossword tools all use anagram checks to find every valid play from a tile rack.
Anagram solvers and word-puzzle helpers — apps that take a tile rack or letter set and list every valid word use this exact check against a dictionary.
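Cryptanalysis education — early ciphers and scientific announcements used anagrams to hide messages (Galileo circulated his observation of Saturn's odd, triple-bodied shape, later understood to be its rings, as a scrambled string of letters). A letter count cannot unscramble such a message, but it instantly confirms whether a proposed solution uses exactly the same letters.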
Hash-based grouping — many string-grouping algorithms hash each word by its sorted-letter signature, which is the same idea written differently — letters compared by count, not order.
When NOT to Use
When letter order matters, such as comparing exact words or sentences.
When case, spaces, accents, or punctuation must be treated carefully but you have not normalized them.
When the text is huge and memory is very limited.
Common Mistakes
Forgetting to check lengths first.
Sorting the strings when a frequency map would be clearer or faster.
Ignoring Unicode/case rules and getting surprising results.
Try It with an AI Assistant
short
Write is_anagram(a, b) returning true if strings a and b are anagrams of each other.
behavior
Write a function that, given two strings, first checks they are the same length; then builds a map from letter to count using the first string, and walks through the second string decrementing the count for each letter — returning false if any letter is missing or the count goes negative, and true if the walk completes.
Made systematic selection possible without missing or repeating choices.
For instance: A coach can list every possible 3-player team from 10 students.
n ← 4
k ← 2
result ← empty list
FUNCTION backtrack(start, path)
    IF length(path) = k THEN
        APPEND copy(path) TO result
        RETURN
    ENDIF
    FOR i FROM start TO n
        APPEND i TO path
        backtrack(i + 1, path)
        remove last element FROM path
    ENDFOR
END FUNCTION
backtrack(1, empty list)
RETURN result
Gambling, astronomy, and military planning all pushed mathematicians to understand “how many ways” choices could occur. Recursive generation later became fundamental in programming and search problems.
People needed systematic ways to count selections, probabilities, gambling outcomes, and arrangements without listing everything manually.
Teaches: Build possibilities systematically without repetition
The Idea
The trick is to make every choice only from things to the right of the previous choice — that one rule is enough to generate every combination exactly once.
Use backtracking. Maintain a growing path list — the items chosen so far — and a start index — the smallest item we're still allowed to pick. At each step, walk forward from start, append the current item to path, and recurse with start = i + 1. When the path's length reaches k, record a copy and return. After the recursive call, remove the last item — that's the "back" in backtracking — so the next iteration of the loop tries a different choice at this depth.
Why does this avoid duplicates? Because the recursion only ever picks indices greater than the previously chosen one. The set {2, 3} is reachable as path = [2, 3] but never as [3, 2] — by the time you've picked 3, you can no longer go back to 2. That single rule — "always move forward" — guarantees each combination appears exactly once, in lexicographic order, with no extra bookkeeping.
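The "always move forward" rule can be sketched in Python as below. It is an illustrative sketch: the name `combinations` follows the prompt later in this section, the nested helper `backtrack` mirrors the pseudocode, and it does not prune branches where too few items remain.

```python
def combinations(n: int, k: int) -> list[list[int]]:
    """Return every k-element combination of 1..n in lexicographic order."""
    result: list[list[int]] = []
    path: list[int] = []  # items chosen so far

    def backtrack(start: int) -> None:
        if len(path) == k:
            result.append(path.copy())  # record a copy, not the live list
            return
        for i in range(start, n + 1):
            path.append(i)    # choose i
            backtrack(i + 1)  # later picks must come from the right of i
            path.pop()        # undo the choice: the "back" in backtracking

    backtrack(1)
    return result


print(combinations(4, 2))
# [[1, 2], [1, 3], [1, 4], [2, 3], [2, 4], [3, 4]]
```

Recording `path.copy()` matters: appending `path` itself would store the same list object every time, and later pops would empty every recorded result.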
Trace
| step | start | path | action | result so far |
|---|---|---|---|---|
| 1 | 1 | [] | i=1: append 1, recurse | |
| 2 | 2 | [1] | i=2: append 2, recurse | |
| 3 | 3 | [1,2] | length=k → record [1,2], pop | [[1,2]] |
| 4 | 2 | [1] | i=3: append 3, recurse, record [1,3] | [[1,2], [1,3]] |
| 5 | 2 | [1] | i=4: append 4, recurse, record [1,4] | [[1,2], [1,3], [1,4]] |
| 6 | 1 | [] | i=2: append 2, recurse | |
| 7 | 3 | [2] | i=3 → record [2,3]; i=4 → record [2,4] | [..., [2,3], [2,4]] |
| 8 | 1 | [] | i=3: append 3, recurse, record [3,4] | [..., [3,4]] |
| 9 | 1 | [] | i=4: appended, but not enough remain | done |
Where It's Used Today
Lottery and gambling odds — counting and enumerating winning ticket combinations relies directly on this algorithm.
Sports rosters and team selection — listing every possible lineup of k players from a squad of n.
Combinatorial test design — software testers generate all k-way combinations of feature flags to find bugs from interactions.
Drug discovery — pharmacologists screen candidate combinations of compounds to find pairs or triples with synergistic effects.
Backtracking solvers — Sudoku solvers, knapsack problems, and many puzzle-solving algorithms internally use this same "choose forward" pattern.
When NOT to Use
When order matters; then you need permutations, not combinations.
When the number of results is enormous and you only need a count or a sample.
When duplicates in the input require special handling.
Common Mistakes
Generating the same combination in different orders.
Forgetting to undo the last choice during backtracking.
Not pruning branches where too few items remain.
Try It with an AI Assistant
short
Write combinations(n, k) that returns all distinct k-element combinations of {1..n} in lexicographic order.
behavior
Write a recursive function that takes integers n and k and returns every list of length k whose elements are chosen from 1..n, in increasing order. Maintain a growing 'path' and a 'start' index. At each step, walk forward from start: append the current value, recurse with start = current + 1, then remove it before trying the next value. When the path length equals k, record a copy of it.
Made exact simplification of ratios fast and reliable.
For instance: A baker can reduce 48/180 cups to its simplest ratio.
a ← 60
b ← 48
WHILE a != b        // keep going until both numbers are equal
    IF a > b THEN
        a ← a - b   // shrink a if a is larger
    ELSE
        b ← b - a   // shrink b if b is larger
    ENDIF
ENDWHILE
PRINT a             // a and b are now equal, so print either one
Over 2,300 years ago, in the city of Alexandria in ancient Egypt, a Greek mathematician named Euclid wrote one of the most famous books in history — the Elements. Euclid is often called the Father of Geometry, and he worked in Alexandria around 300 BC.
Euclid's procedure can be understood through a practical puzzle. Imagine you have two ropes of different lengths — say, one 60 feet long and one 48 feet long. You want a measuring stick that fits a whole number of times into both ropes, with no leftover piece. What is the longest such stick?
Euclid came up with a clever trick: keep cutting the shorter length off the longer one. When both ropes finally end up the same length, that length is your measuring stick.
This same idea, written today with numbers instead of ropes, is the algorithm we are about to code. It is one of the oldest algorithms still in use anywhere in the world.
Teaches: Reduce using remainders; preserve what stays invariant
Anecdote
Euclid never presented it in modern arithmetic notation. In the Elements, numbers are drawn as line segments, and the procedure repeatedly "cuts off" the smaller segment from the larger (Book VII states it for whole numbers, Book X for general magnitudes). Treating the GCD as symbolic arithmetic came much later; Euclid reasoned with lengths.
The Idea
Take two numbers, a and b. While they are not equal, subtract the smaller from the larger. When they finally become equal, that number is the GCD.
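Why does this work? If some number g divides both a and b, then g also divides their difference a − b. The reverse holds too: anything that divides both b and a − b also divides their sum, which is a. So the pairs (a, b) and (a − b, b) have exactly the same common divisors, and replacing the larger number with the difference doesn't change the GCD; it just makes the numbers smaller and easier to work with. Eventually the two numbers meet, and that meeting point is our answer.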
A faster version of this algorithm uses modulo (a mod b) instead of repeated subtraction — but they're the same idea: modulo is repeated subtraction in one jump. Subtracting b from a over and over until a < b is exactly what a mod b computes.
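Both forms can be sketched side by side in Python. The names `gcd_subtract` and `gcd_mod` are illustrative, not from the original text. The subtraction form assumes both inputs are positive, since with a zero input the loop would never end. The remainder form also accepts zero and negative inputs, treating gcd(0, b) as |b|.

```python
def gcd_subtract(a: int, b: int) -> int:
    """Subtraction form: assumes a and b are both positive."""
    if a <= 0 or b <= 0:
        raise ValueError("gcd_subtract expects positive integers")
    while a != b:
        if a > b:
            a -= b  # shrink the larger number; common divisors are unchanged
        else:
            b -= a
    return a


def gcd_mod(a: int, b: int) -> int:
    """Remainder form: one % replaces a whole run of subtractions."""
    a, b = abs(a), abs(b)
    while b != 0:
        a, b = b, a % b
    return a


print(gcd_subtract(60, 48))  # 12
print(gcd_mod(60, 48))       # 12
print(gcd_mod(0, 5))         # 5
```

The difference shows up on lopsided inputs: for the pair 1,000,000 and 1, the subtraction form runs 999,999 iterations, while the remainder form finishes in one.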
Trace
| step | a | b | what happens |
|---|---|---|---|
| 0 | 60 | 48 | a is larger → a = 60 − 48 = 12 |
| 1 | 12 | 48 | b is larger → b = 48 − 12 = 36 |
| 2 | 12 | 36 | b is larger → b = 36 − 12 = 24 |
| 3 | 12 | 24 | b is larger → b = 24 − 12 = 12 |
| 4 | 12 | 12 | equal, so stop |
Where It's Used Today
Reducing fractions — math software simplifies 60/48 to 5/4 by dividing both by the GCD (12).
Cryptography — the GCD is part of the math behind RSA: it validates key candidates and is the foundation of the modular-inverse step that makes signatures work.
Computer graphics — finding the largest square tile that fits both the width and height of an image.
Music software — lining up two rhythms by finding the longest note value that divides both.
Engineering — gear design, where you need to know when two rotating gears will return to their starting positions together.
When NOT to Use
When inputs are not integers.
When using subtraction GCD on very large numbers with very different sizes.
When zero and negative inputs have not been defined.
Common Mistakes
Using slow repeated subtraction instead of modulo for large values.
Forgetting that gcd(0, b) should be |b|.
Not taking absolute values for negative inputs.
Try It with an AI Assistant
short
Write gcd(a, b) that returns the greatest common divisor of two non-negative integers.
behavior
Write a function that, given two positive integers, repeatedly replaces the larger one with the difference of the two until they are equal, and then returns that final value. (With a zero input the subtraction loop never terminates, so handle zero separately or reject it.)
Made synchronization of repeating cycles calculable.
For instance: If one bus comes every 12 minutes and another every 18 minutes, find when both arrive together.
a ← 4
b ← 6
RETURN (a * b) / gcd(a, b)
The LCM is the twin of the GCD we met in the last chapter. Where the GCD asks "what's the longest measuring stick that fits a whole number of times into both lengths?", the LCM asks the opposite: "what's the shortest length that BOTH ropes can measure into a whole number of pieces?"
The Greeks used this constantly. Music was studied as the science of ratios — the lengths of vibrating strings producing harmonious notes. (Pythagoras of Samos, around 530 BC, had figured out that strings in a 2:3 ratio sounded like a perfect fifth.) To make multiple instruments sing together, the LCM tells you when the rhythms align. Calendars used it too: the Athenian calendar reconciled lunar months with solar years on an eight-year cycle (the Octaeteris) — the LCM of two natural cycles. And early Greek machines and gears used LCMs to predict when two interlocking wheels would return to their starting positions.
The clever trick that makes the LCM easy was already known to Euclid himself, around 300 BC: once you have the GCD, the LCM falls out almost for free.
Teaches: Align independent cycles into a shared rhythm
Anecdote
The Athenian calendar reconciled the lunar month (29.53 days) with the solar year (365.24 days) on an **eight-year cycle called the *Octaeteris*** — the smallest period over which both cycles approximately resync. That practical calendar problem was the everyday LCM in Greek life: when do the next New Moon and Spring Equinox land on the same day again? Music and gear-design pulled on the same trick.
The Idea
Here is the trick. For any two positive numbers a and b,
```
gcd(a, b) × lcm(a, b) = a × b
```
So once we know the GCD, the LCM is just a × b / gcd(a, b). Two lines of code. The hard work was already done in the last chapter.
Why does dividing by the gcd work? Multiplying a × b counts every shared factor twice — once from each number. The gcd is exactly the product of those shared factors, so dividing by it removes the duplicate. What remains is each prime factor counted as many times as the larger of the two — which is what an LCM is.
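A minimal Python sketch of the identity follows. It leans on the standard library's `math.gcd`. The zero-input convention lcm(0, b) = 0 is an assumption taken from the Common Mistakes list in this section.

```python
from math import gcd


def lcm(a: int, b: int) -> int:
    """Least common multiple via gcd(a, b) * lcm(a, b) = a * b."""
    if a == 0 or b == 0:
        return 0  # assumed convention: lcm(0, b) = 0
    g = gcd(a, b)
    # Divide before multiplying so the intermediate value stays small.
    return abs(a // g * b)


print(lcm(4, 6))    # 12
print(lcm(12, 18))  # 36
```

Python integers grow as needed, so the divide-first order is not required here; it is shown because the same code in a fixed-width language could overflow on a × b even when the final LCM fits.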
Trace
| step | value | meaning |
|---|---|---|
| 1 | gcd(4, 6) = 2 | the largest common divisor |
| 2 | 4 × 6 = 24 | the product of the two numbers |
| 3 | 24 / 2 = 12 | divide the product by the gcd → the LCM |
Where It's Used Today
Adding fractions — to add 1/4 and 1/6 you need a common denominator; LCM(4, 6) = 12 gives you 3/12 + 2/12.
Calendars and astronomy — when will two cycles (lunar and solar, or the orbits of two planets) line up again?
Music software — finding the smallest unit of time that contains a whole number of two different rhythms.
Gears and engineering — when will two interlocking gears return to their starting positions together?
Computer scheduling — operating systems use LCM math to predict when two periodic tasks (one running every 4 ms, one every 6 ms) will collide.
When NOT to Use
When inputs are not periodic/cyclic or integer-based.
When either input is zero and you have not defined the expected result.
When direct a*b may overflow before division.
Common Mistakes
Computing a*b/gcd in an overflow-prone order.
Forgetting lcm(0,b)=0.
Confusing GCD and LCM roles.
Try It with an AI Assistant
short
Write lcm(a, b) returning the least common multiple of two positive integers.
behavior
Write a function that, given two positive integers a and b, first computes their greatest common divisor g, then returns (a × b) / g. To avoid overflow on large inputs, divide before multiplying: return (a / g) × b.
For instance: A computer can find the word “apple” inside a paragraph.
text ← "BANANA"
pattern ← "ANA"
n ← length(text)
m ← length(pattern)
FOR i FROM 0 TO n - m
    match ← TRUE
    FOR j FROM 0 TO m - 1
        IF text[i + j] != pattern[j] THEN
            match ← FALSE
            BREAK
        ENDIF
    ENDFOR
    IF match THEN
        RETURN i
    ENDIF
ENDFOR
RETURN -1
Before the printing press, before spreadsheets, before Ctrl-F — every reader looking up a word in a manuscript did exactly this algorithm by eye.
- A scribe in Alexandria's Great Library scanning a scroll for a citation.
- A Greek philosopher hunting for a quotation in a copy of Homer.
- A medieval monk searching a Bible for the word amen.
Their procedure was always the same: try every starting position in the text. At each, compare letter by letter to the pattern. If you find a complete match, you're done. If you exhaust all positions, the pattern isn't there.
It is called naive because much faster algorithms exist today (Knuth-Morris-Pratt, 1977; Boyer-Moore, 1977), but the naive version is roughly what a reader does by eye, and it remains the baseline that simple search tools fall back on and that faster methods are measured against. The procedure itself has gone unchanged for at least 2,300 years, and it still works on every piece of text you'll ever see.
Teaches: Try every alignment; correctness before efficiency
Anecdote
Before the printing press, every reader looking up a word in a manuscript was running this algorithm by eye — try every starting position, compare letter by letter. The naive substring algorithm is the procedure your brain still runs when you scan a page. It survived 2,300 years because the brain still runs it.
The Idea
Imagine placing a small stencil over the text and sliding it one character at a time. At each position, you check whether the cut-out letters in the stencil match what's underneath. If yes, you've found it. If no, slide the stencil one character right and check again.
Two nested loops:
- Outer loop: try every possible starting position i in the text where the pattern could begin (positions 0 through n − m, where n is the text length and m is the pattern length).
- Inner loop: at each starting position, compare text[i + j] to pattern[j] for every position j in the pattern. If any character disagrees, this starting position fails — break and try the next i. If all m characters agree, return i.
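The two nested loops described above can be sketched in Python as follows. The name `find` follows the prompt later in this section. Returning 0 for an empty pattern is a choice this sketch makes, matching Python's built-in `str.find`; the original text leaves that case undefined.

```python
def find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    # The last valid start is n - m; range() excludes its upper bound.
    for i in range(n - m + 1):
        match = True
        for j in range(m):
            if text[i + j] != pattern[j]:
                match = False
                break  # this alignment fails; slide one position right
        if match:
            return i
    return -1


print(find("BANANA", "ANA"))  # 1
print(find("BANANA", "XYZ"))  # -1
```

When the pattern is longer than the text, `n - m + 1` is zero or negative, the outer loop never runs, and the function returns -1 without a special case.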
Trace
| i | text[i…i+2] | match? | action |
|---|---|---|---|
| 0 | "BAN" | B ≠ A | move on |
| 1 | "ANA" | YES | return 1 |
Where It's Used Today
Ctrl-F / Cmd-F in any text editor — the simplest form of "find" is exactly this loop. (Modern editors and grep typically use faster algorithms like Boyer-Moore for longer patterns.)
Spam filters — scanning email for trigger phrases.
Configuration parsing — looking up a key in an INI file, env-var list, or small text-based config.
Teaching string algorithms — naive search is the canonical first algorithm, the baseline against which KMP, Boyer-Moore, and Rabin-Karp are measured.
Single-pattern scans of small text — when the pattern is short and the text fits in memory, the simple version is fast enough that no library bothers with anything fancier.
When NOT to Use
When searching huge text repeatedly; use KMP/Boyer-Moore/Rabin-Karp or indexes.
When pattern matching needs wildcards or regular expressions.
When Unicode normalization or case-insensitive search matters.
Common Mistakes
Forgetting the last valid start index is n-m.
Not defining what an empty pattern returns.
Returning only one match when the task asks for all matches.
Try It with an AI Assistant
short
Write find(text, pattern) returning the index of the first occurrence, or -1.
behavior
Write a function that, given a long string and a shorter pattern, tries every possible starting position in the long string, and at each starting position compares characters one by one against the pattern; returns the first starting position where every character matches, or -1 if no position fully matches.
Laws of Computational Thinking
Primary: Explore Systematically (10)
Secondary: Use Structure, Not Brute Force (8) [by contrast]
Made symmetry in text easy to detect mechanically.
For instance: A program can test whether “racecar” reads the same backward.
s ← "racecar"
i ← 0
j ← length(s) - 1
WHILE i < j
    IF s[i] != s[j] THEN
        RETURN FALSE
    ENDIF
    i ← i + 1
    j ← j - 1
ENDWHILE
RETURN TRUE
Sotades of Maroneia was an Alexandrian poet who built reversible verses for sport. Legend has him drowned in a lead-weighted sack by Ptolemy II for satirizing the king — though probably for a satirical hexameter, not a true palindrome.
The check itself is simple. Walk one finger from the front, one from the back. They meet in the middle. If every pair of letters matches on the way in, the word reads the same both ways.
Teaches: Compare symmetry from both ends inward
The Idea
Use two pointers: i starting at the left end of the string and j starting at the right end. Compare s[i] to s[j]. If they don't match, the string isn't a palindrome — return FALSE immediately. If they do, step i one to the right and j one to the left, and check the next pair. Stop when the pointers meet or cross — that means every pair has matched, so the string is a palindrome.
Why does this work? A string of length n is a palindrome exactly when s[k] = s[n − 1 − k] for every k. Walking inward from both ends checks each mirror pair exactly once, so it needs at most ⌊n/2⌋ comparisons; the middle character of an odd-length string is never compared with anything. The invariant: every pair of mirror positions outside the current (i, j) window has already been verified equal. The first mismatch ends the search early, so you never need to look at the inside if the outside already disagrees.
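A short Python sketch of the two-pointer check follows. The name `is_palindrome` comes from the prompt later in this section. It compares characters exactly, so any handling of case, spaces, or punctuation is assumed to happen before the call.

```python
def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards."""
    i, j = 0, len(s) - 1
    while i < j:
        if s[i] != s[j]:
            return False  # first mismatch settles the answer
        i += 1
        j -= 1
    return True  # pointers met or crossed: every mirror pair matched


print(is_palindrome("racecar"))  # True
print(is_palindrome("banana"))   # False
```

Empty and single-character strings never enter the loop and are reported as palindromes.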
Trace
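| step | i | j | s[i] | s[j] | match? | action |
|---|---|---|---|---|---|---|
| 0 | 0 | 6 | r | r | yes | i=1, j=5 |
| 1 | 1 | 5 | a | a | yes | i=2, j=4 |
| 2 | 2 | 4 | c | c | yes | i=3, j=3 |
| 3 | 3 | 3 | - | - | - | i not < j, stop |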
Where It's Used Today
DNA analysis — biologists scan genomes for palindromic sequences, which often mark binding sites for proteins like restriction enzymes.
Word puzzles and games — Scrabble helpers, Wordle solvers, and crossword tools all need a fast palindrome check on the dictionary.
Programming-language parsers — palindrome detection is the textbook example of a symmetry check and shows up in compiler tests.
Number-theory recreations — palindromic numbers, palindromic primes, and Lychrel-number searches all begin with this exact check on the digit string.
Coding-interview warm-ups — palindrome check is one of the most common first programming exercises, taught alongside loops and string indexing.
When NOT to Use
When spaces, punctuation, and case should be ignored but the string is not normalized.
When the data is a stream and you cannot access both ends.
When you need all palindromic substrings, not just yes/no.
Common Mistakes
Comparing the full reversed string unnecessarily.
Off-by-one pointer movement.
Forgetting odd-length strings have a harmless middle character.
Try It with an AI Assistant
short
Write is_palindrome(s) — true if s reads the same forwards and backwards.
behavior
Write a function that, given a string, places one cursor at the leftmost character and another at the rightmost. Compare the two characters; if they differ, return false. If they match, move the left cursor one step right and the right cursor one step left, and compare again. Continue until the cursors meet or cross. If every pair matched, return true.
For instance: A student can list all primes below 100 without testing every number separately.
n ← 30
primes ← [TRUE] × (n + 1)
primes[0] ← FALSE
primes[1] ← FALSE
FOR i FROM 2 TO sqrt(n)
    IF primes[i] THEN
        FOR j FROM i * i TO n STEP i
            primes[j] ← FALSE
        ENDFOR
    ENDIF
ENDFOR
RETURN primes
Eratosthenes ran the Library of Alexandria. He measured Earth's circumference (with remarkable accuracy for the third century BC), gave geography its modern name with his work Geographika, and reformed the calendar. His peers called him Beta — second-best at everything.
The sieve was a side project. Cross out the multiples of 2, then 3, then 5, then 7. What survives is prime. It is, fittingly, a librarian’s procedure for tidying up the integers.
Teaches: Eliminate impossibilities to reveal hidden structure
Anecdote
Eratosthenes was the librarian at Alexandria, not a "number theorist." His work Geographika gave geography its modern name. The sieve was a side tool, not his main work.
The Idea
Make a boolean array primes[0..n], marking every entry "true" except 0 and 1. Then walk i from 2 upward. If primes[i] is still true, i is prime, and we can cross out i*i, i*i + i, i*i + 2i, ... as multiples of i, marking them false. Stop when i exceeds √n.
Why stop at √n? Any composite number c ≤ n must have at least one prime factor p ≤ √n (otherwise its two smallest factors would multiply to more than n). So by the time i reaches √n, every composite up to n has already been touched by some earlier prime. There's nothing left to cross out.
**Why start crossing out at i*i and not 2i?** Because every smaller multiple of i was already crossed out by an even smaller prime: 2i was caught when p = 2 ran, 3i when p = 3 ran, and so on. The first multiple of i that hasn't yet been touched by any smaller prime is i*i.
The invariant: when i reaches a still-unmarked entry, it must be a real prime — every composite below it is already gone.
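The steps above can be sketched in Python as follows. The name `sieve` follows the prompt later in this section. `math.isqrt` (Python 3.8+) supplies the integer square root, and returning all-False flags when n is below 2 is an assumption of this sketch.

```python
from math import isqrt


def sieve(n: int) -> list[bool]:
    """Return flags where flags[i] is True exactly when i is prime (0 <= i <= n)."""
    if n < 2:
        return [False] * max(n + 1, 0)
    primes = [True] * (n + 1)
    primes[0] = primes[1] = False
    for i in range(2, isqrt(n) + 1):
        if primes[i]:
            # Smaller multiples of i were already crossed out by smaller primes.
            for j in range(i * i, n + 1, i):
                primes[j] = False
    return primes


flags = sieve(30)
print([i for i, is_prime in enumerate(flags) if is_prime])
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

Using `isqrt` rather than a floating-point square root keeps the loop bound exact for large n.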
Where It's Used Today
Cryptography pre-computation — generating tables of small primes used to seed and test candidates for RSA and Diffie-Hellman keys.
Coding theory — pre-computing primes for Galois fields used in error-correcting codes (CDs, QR codes, LTE/5G).
Project Euler and competitive programming — the sieve is the go-to technique for any problem that needs many primes fast.
Hash function design — picking prime moduli for hash tables and cyclic redundancy checks.
Number theory research — sieves are still used (in much-evolved forms) to study prime distributions and find new large twin primes.
When NOT to Use
When n is extremely large and memory for an array 0..n is too big.
When you only need to test one number for primality.
When you need cryptographic-size primes directly.
Common Mistakes
Starting multiples at 2i instead of i*i, causing extra work.
Looping factor past sqrt(n) unnecessarily.
Forgetting 0 and 1 are not prime.
Try It with an AI Assistant
short
Write sieve(n) returning a boolean array marking primes from 0 to n.
behavior
Write a function that, given an integer n, creates a boolean array of length n+1 marked all true except indices 0 and 1. For each i from 2 up to √n, if entry i is still true, mark every multiple of i starting at i·i as false. Return the final boolean array — true entries are the primes.
For instance: A teacher can show that 84 is really 2 × 2 × 3 × 7.
n ← 84
factors ← []
d ← 2
WHILE n > 1
    WHILE n MOD d = 0
        APPEND d TO factors
        n ← n / d
    ENDWHILE
    d ← d + 1
ENDWHILE
RETURN factors
Greek mathematicians since Pythagoras saw primes as the atoms of arithmetic — the indivisible building blocks from which every other integer is constructed. The procedure for breaking a number down is so natural it has no inventor.
Try 2 as long as it divides. Then 3, then 5. Stop when your divisor exceeds √n. The same trial division a 7th grader does today; the same procedure a Greek scholar did on wax 2,300 years ago.
Teaches: Break complex things into irreducible building blocks
The Idea
Start with the smallest possible divisor d = 2. Try dividing n by d as many times as it divides cleanly — each successful division contributes one copy of d to the factor list and shrinks n. When d no longer divides, increment d and try again. Stop when n becomes 1.
Why does this only ever find primes? Because at the moment we test d, every prime smaller than d has already been pulled out — so any composite d like 4 or 6 can't divide what's left (its prime factors would have been removed already). And why does the leftover n keep dropping? Because we only divide n by something at least 2, so each division at least halves it; the algorithm always terminates. A common optimization is to stop testing once d * d > n, because by then the remaining n is itself prime.
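Trial division with the d * d > n cutoff mentioned above might look like this in Python. The name `prime_factors` follows the prompt later in this section. Rejecting inputs below 1 is a choice made for this sketch.

```python
def prime_factors(n: int) -> list[int]:
    """Return the prime factors of n in ascending order, with multiplicity."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    factors: list[int] = []
    d = 2
    while d * d <= n:
        while n % d == 0:  # pull out every copy of d
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)  # whatever is left has no divisor <= sqrt(n): prime
    return factors


print(prime_factors(84))  # [2, 2, 3, 7]
print(prime_factors(97))  # [97]
```

Compared with the plain loop that runs until n becomes 1, the cutoff stops after d = 3 on the input 84 and records the leftover 7 directly, instead of testing 4, 5, and 6 first.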
Trace
| step | n | d | n MOD d = 0? | action | factors so far |
|---|---|---|---|---|---|
| 1 | 84 | 2 | yes | append 2; n = 42 | [2] |
| 2 | 42 | 2 | yes | append 2; n = 21 | [2, 2] |
| 3 | 21 | 2 | no | inner loop ends; d = 3 | [2, 2] |
| 4 | 21 | 3 | yes | append 3; n = 7 | [2, 2, 3] |
| 5 | 7 | 3 | no | inner loop ends; d = 4 | [2, 2, 3] |
| 6 | 7 | 4 | no | d = 5 | [2, 2, 3] |
| 7 | 7 | 5 | no | d = 6 | [2, 2, 3] |
| 8 | 7 | 6 | no | d = 7 | [2, 2, 3] |
| 9 | 7 | 7 | yes | append 7; n = 1; outer loop ends | [2, 2, 3, 7] |
Where It's Used Today
Cryptography — RSA's security rests on the fact that factoring a large number (hundreds of digits) is hard, even though the algorithm itself is simple.
Reducing fractions and simplifying radicals — math software factors numerators, denominators, and arguments of square roots to put expressions in simplest form.
Number-theory homework — the standard tool for finding GCDs, LCMs, and divisor counts in school problems.
Hashing and randomized algorithms — choosing prime moduli and ruling out small prime factors when designing hash functions.
Music and rhythm — composers use the prime factorization of meter (e.g., 12 = 2 × 2 × 3) to find which polyrhythms divide a measure cleanly.
When NOT to Use
When n is hundreds of digits long — trial division is hopelessly slow; use Pollard rho or specialized sieves.
When you only need to know whether n is prime — a primality test like Miller-Rabin is far faster than full factorization.
When you need every prime up to a bound rather than the factors of one number — use the Sieve of Eratosthenes instead.
Common Mistakes
Continuing to increment d past sqrt(n) instead of stopping and recording the leftover n as a final prime factor.
Skipping the inner WHILE loop and missing repeated prime factors, returning [2, 3, 7] instead of [2, 2, 3, 7] for 84.
Starting d at 1 or forgetting to handle n = 1, producing an infinite loop or an empty result on edge inputs.
Try It with an AI Assistant
short
Write prime_factors(n) returning the prime factorization with multiplicity.
behavior
Write a function that takes a positive integer n and returns a list of integers whose product is n. Start with a divisor d = 2. While n is greater than 1, divide n by d as long as d divides cleanly, appending d each time. When d no longer divides, increment d and repeat.
For instance: A lesson can model growth where each month depends on the previous two months.
n ← 10
a ← 0
b ← 1
FOR i FROM 1 TO n
    t ← a + b
    a ← b
    b ← t
ENDFOR
RETURN a
In India around 200 BC, Pingala wrote Chandahsastra, a treatise on Sanskrit poetic meter. He counted the ways to fill a line of length n with short syllables (length 1) and long syllables (length 2).
The answers were 1, 1, 2, 3, 5, 8, 13... — each the sum of the previous two. Leonardo of Pisa met the same sequence again 1,400 years later via breeding rabbits, and the rabbits stuck. The poetry came first.
Teaches: Reuse previous results to build the next
Anecdote
Pingala described the sequence while analyzing poetry rhythms (long/short syllables). Rabbits came 1,400 years later with Fibonacci — the sequence's name is a thousand-year accident.
The Idea
Keep just two variables: a (the previous number) and b (the current number). Each step, compute the next number t = a + b, then slide the window forward — a becomes the old b, and b becomes t. Repeat n times, and a ends up holding F(n).
Why does this work? It captures the recurrence F(n) = F(n−1) + F(n−2) without ever calling itself recursively. The invariant: after iteration i, a = F(i) and b = F(i+1). That's the whole trick — a always lags b by exactly one position, and adding them gives the next Fibonacci number. A naive recursive version recomputes the same Fibonacci numbers exponentially many times; this iterative version does it in n simple additions and uses constant memory.
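The sliding two-variable window can be sketched in Python as below. The name `fibonacci` follows the prompt later in this section, and the sketch uses the indexing F(0) = 0, F(1) = 1 from the text. Tuple assignment stands in for the temporary variable t.

```python
def fibonacci(n: int) -> int:
    """Return F(n), where F(0) = 0 and F(1) = 1, using n additions."""
    if n < 0:
        raise ValueError("n must be non-negative")
    a, b = 0, 1  # invariant after i iterations: a = F(i), b = F(i + 1)
    for _ in range(n):
        a, b = b, a + b  # slide the window forward one position
    return a


print(fibonacci(10))  # 55
print([fibonacci(i) for i in range(8)])  # [0, 1, 1, 2, 3, 5, 8, 13]
```

Python integers are arbitrary precision, so large n does not wrap around here; in a fixed-width language the same loop would need a big-integer type.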
Trace
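| i | a (before) | b (before) | t = a + b | a after | b after |
|---|---|---|---|---|---|
| 1 | 0 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 2 | 1 | 2 |
| 3 | 1 | 2 | 3 | 2 | 3 |
| 4 | 2 | 3 | 5 | 3 | 5 |
| 5 | 3 | 5 | 8 | 5 | 8 |
| 6 | 5 | 8 | 13 | 8 | 13 |
| 7 | 8 | 13 | 21 | 13 | 21 |
| 8 | 13 | 21 | 34 | 21 | 34 |
| 9 | 21 | 34 | 55 | 34 | 55 |
| 10 | 34 | 55 | 89 | 55 | 89 |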
Where It's Used Today
Teaching dynamic programming — Fibonacci is the canonical introduction to memoization and bottom-up DP in every CS course.
Computer graphics and design — the golden ratio (closely related to consecutive Fibonacci numbers) shows up in layout grids, image cropping, and procedural tree generation.
Financial markets — "Fibonacci retracement" levels, while not mathematically grounded, are widely used by traders to mark support and resistance.
Algorithm design — the Fibonacci heap data structure (used in some shortest-path algorithms) gets its amortized bounds from the sequence's growth rate.
Biology — sunflower seed spirals, pinecone scales, and pineapple rings exhibit Fibonacci patterns because the golden angle packs seeds most efficiently.
When NOT to Use
When you only need a single very large F(n): fast doubling reaches it in about log n steps instead of n additions. (Binet's closed-form formula also exists, but floating-point rounding makes it inexact for large n.)
When the growth model isn't actually F(n) = F(n-1) + F(n-2) — picking Fibonacci because it "looks like growth" misses the real recurrence (e.g., compound interest is geometric, not Fibonacci).
When n is so large the result overflows a fixed-width integer and you haven't switched to big integers — silent wrap-around will return garbage.
Common Mistakes
Writing the naive recursive fib(n-1) + fib(n-2) and watching it take exponential time on n = 40 because nothing is memoized.
Off-by-one indexing — returning b instead of a, or starting the loop at 0 instead of 1, so fibonacci(10) gives 89 instead of 55.
Forgetting the base cases F(0) = 0 and F(1) = 1, or hard-coding the wrong starting pair like 1, 1 and shifting every answer by one.
Try It with an AI Assistant
short
Write fibonacci(n) returning the n-th Fibonacci number using iteration.
behavior
Write a function that, given a non-negative integer n, returns the n-th term of the sequence that starts 0, 1, where each subsequent term is the sum of the previous two. Use a loop with two variables holding the last two terms; do not use recursion.
Laws of Computational Thinking
Primary: Reuse Work Aggressively (4)
Secondary: Local Decisions Shape Global Outcomes (17)
Made combination counts and binomial expansion easy to generate.
For instance: A student can find coefficients of (a+b)^5 without multiplying it out.
n ← 4
row ← array[0..n]
row[0] ← 1
FOR i FROM 1 TO n
    row[i] ← 1
    FOR j FROM i - 1 TO 1 STEP -1
        row[j] ← row[j] + row[j - 1]
    ENDFOR
ENDFOR
RETURN row
Around the 2nd century BCE the Indian prosodist Pingala described the binomial coefficients in his Chandaḥśāstra — a treatise on Sanskrit metres that needed to count how many syllable-patterns of a given length contained a given number of long syllables. The 10th-century commentator Halayudha drew them out as the explicit triangular figure called Meru-prastāra. Independently, Persian mathematicians (al-Karaji, Omar Khayyam) and Chinese mathematicians (Jia Xian, Yang Hui) rediscovered the same pattern by the 13th century. Blaise Pascal's Traité du triangle arithmétique (1654) tied the triangle to probability via his correspondence with Fermat — and although the figure was already centuries old, his name stuck in Europe.
Teaches: Construct answers from overlapping smaller subproblems
Anecdote
Blaise Pascal thought he was inventing something new, but similar versions had appeared in China, Persia, and India for centuries. In India it was traced back to Pingala (~2nd century BCE) and called Meru-prastāra ("the staircase of Mount Meru"); in China it was studied by Yang Hui in the 13th century and is still known as Yang Hui's Triangle. Pascal's name stuck only in Europe.
The Idea
Build the row in place. Start with row = [1]. To go from row i − 1 to row i, append a new 1 at the right end, then walk right-to-left updating each interior entry by adding the value to its left: row[j] ← row[j] + row[j − 1]. The walk has to go right-to-left so each addition reads the old value of row[j − 1] instead of the freshly updated one.
This works because every interior entry in Pascal's Triangle equals the sum of the two numbers directly above it. Reusing the row buffer makes the algorithm use only O(n) memory — instead of storing the whole triangle, we only ever keep the row we're currently building.
Trace
i | start row | after row[i] ← 1 | inner updates (right-to-left) | end-of-step row
1 | [1] | [1, 1] | (no interior j) | [1, 1]
2 | [1, 1] | [1, 1, 1] | j=1: row[1] = 1 + 1 = 2 | [1, 2, 1]
3 | [1, 2, 1] | [1, 2, 1, 1] | j=2: 1+2=3; j=1: 2+1=3 | [1, 3, 3, 1]
4 | [1, 3, 3, 1] | [1, 3, 3, 1, 1] | j=3: 1+3=4; j=2: 3+3=6; j=1: 3+1=4 | [1, 4, 6, 4, 1]
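The in-place row update traced above can be sketched in Python as follows. This is an illustrative version, assuming rows are numbered from 0 so that row 0 is [1]; the name pascal_row comes from the prompt later in this section.

```python
def pascal_row(n: int) -> list[int]:
    """Return row n of Pascal's triangle, reusing a single list."""
    row = [1]
    for i in range(1, n + 1):
        row.append(1)  # new trailing 1 for row i
        # Walk right-to-left so row[j - 1] still holds the previous row's value.
        for j in range(i - 1, 0, -1):
            row[j] += row[j - 1]
    return row


print(pascal_row(4))  # [1, 4, 6, 4, 1]
```

Reversing the inner loop direction would read already-updated entries, which is the first mistake listed under Common Mistakes.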
Where It's Used Today
Probability and statistics — binomial distribution probabilities use these coefficients to compute things like "what's the chance of exactly 3 heads in 5 coin flips?"
Algebra classrooms — expanding (a + b)^n without multiplying step by step.
Combinatorics — counting committees, lottery picks, or any "choose k of n" question (C(n, k)).
Computer graphics — Bézier curves of degree n are weighted sums of control points, and the weights (Bernstein polynomials) carry the row-n Pascal coefficients C(n, k).
Error-correcting codes — some classical codes (Reed-Muller) have generator matrices built from rows of Pascal's Triangle modulo 2.
When NOT to Use
When you only need a single binomial coefficient C(n, k) for large n — a direct factorial-ratio formula is faster than building the whole row.
When n is huge and the entries overflow native integers — switch to a big-integer type or compute modulo a prime.
When you need many rows at once for repeated lookup — precompute the full triangle once instead of recomputing each row.
Common Mistakes
Walking the inner loop left-to-right, which overwrites row[j-1] before it's read and corrupts every later entry.
Forgetting to append the trailing 1 before the inner update, leaving the row one element short.
Using C(n, k) = n! / (k!·(n-k)!) naively for large n and overflowing instead of using the additive recurrence.
Try It with an AI Assistant
short
Write pascal_row(n) returning the n-th row of Pascal's triangle as a list of integers.
behavior
Start with a list containing just the number 1. Repeat n times: append a 1 at the end, then walk from the second-to-last entry back to the second entry, replacing each entry with itself plus the entry just before it. Return the final list.
Made huge power calculations possible on small machines.
For instance: Encryption can compute 7^560 mod 561 without writing the giant number.
b ← 3
e ← 13
m ← 7
r ← 1
b ← b MOD m
WHILE e > 0
    IF e MOD 2 = 1 THEN
        r ← (r × b) MOD m
    ENDIF
    b ← (b × b) MOD m
    e ← e / 2
ENDWHILE
RETURN r
In the same Sanskrit treatise, Pingala worked out a binary scheme to enumerate every meter of length n — the same binary representation Leibniz would “invent” nineteen centuries later in Europe.
Square-and-multiply: walk the binary digits of the exponent, square at each step, multiply by the base when the bit is one. Without this trick, RSA would take longer than the universe has existed to encrypt a single message.
Teaches: Replace repeated work with divide-and-square reuse
The Idea
Write the exponent e in binary. For example, 13 is 1101 in binary, meaning 13 = 8 + 4 + 1. So b^13 = b^8 · b^4 · b^1. You can produce b^1, b^2, b^4, b^8, b^16, … by repeatedly squaring: each is the previous one squared. Then multiply together exactly the powers whose binary digit is 1.
Why does this work? Two reasons. First, squaring k times reaches b^(2^k) rather than b^k: exponential progress for linear effort. Second, modular arithmetic lets you reduce by m after every multiplication, keeping numbers from blowing up. The invariant: at every step, r × b^e (using the current values of b and e) is congruent modulo m to the original b^e. The loop maintains this by folding the current b into r when the low bit of e is 1, then squaring b and halving e to move to the next bit. After ⌊log₂ e⌋ + 1 iterations, e reaches 0 and r holds the answer.
Trace
step | e (current) | e MOD 2 | r before | b before | r after | b after (= b² mod 7) | new e (= e/2)
1 | 13 | 1 | 1 | 3 | (1 × 3) mod 7 = 3 | 9 mod 7 = 2 | 6
2 | 6 | 0 | 3 | 2 | 3 (unchanged) | 4 mod 7 = 4 | 3
3 | 3 | 1 | 3 | 4 | (3 × 4) mod 7 = 5 | 16 mod 7 = 2 | 1
4 | 1 | 1 | 5 | 2 | (5 × 2) mod 7 = 3 | 4 mod 7 = 4 | 0
5 | 0 | - | 3 | - | loop ends | - | -
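The square-and-multiply loop traced above can be sketched in Python like this. It is a minimal illustration, not a hardened cryptographic routine: the name mod_pow follows the prompt later in this section, and the input checks and the m = 1 shortcut are added assumptions.

```python
def mod_pow(base: int, exp: int, m: int) -> int:
    """Compute (base ** exp) % m by scanning exp from its lowest bit up."""
    if exp < 0 or m <= 0:
        raise ValueError("exp must be >= 0 and m must be > 0")  # assumption
    if m == 1:
        return 0  # everything is congruent to 0 modulo 1
    result = 1
    base %= m  # reduce once up front so the first square stays small
    while exp > 0:
        if exp & 1:  # current bit is 1: fold this power into the result
            result = (result * base) % m
        base = (base * base) % m  # next power: b, b^2, b^4, ...
        exp >>= 1  # move to the next bit
    return result


print(mod_pow(3, 13, 7))     # 3
print(mod_pow(7, 560, 561))  # 1
```

In everyday Python code the built-in three-argument pow(base, exp, m) computes the same value.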
Where It's Used Today
RSA encryption — RSA key exchange and certificate signatures in HTTPS rely on modular exponentiation with very large exponents.
Diffie-Hellman key exchange — two parties agree on a shared secret using two modular-exponentiation calls each.
Digital signatures — RSA-PSS and DSA are built directly on this primitive; ECDSA uses the elliptic-curve analogue (double-and-add) alongside ordinary modular arithmetic.
Primality testing — Miller-Rabin and Fermat tests rely on fast modular exponentiation as their inner loop.
Cryptocurrency — Bitcoin and Ethereum signatures use the elliptic-curve analogue of this loop (double-and-add in place of square-and-multiply) on every transaction.
When NOT to Use
When you need the actual value of b^e (no modulus) — the result blows up exponentially and a big-integer power routine is the right tool.
When the modulus is 1 — the answer is always 0 and the loop wastes work; short-circuit instead.
When the exponent is small (say under 20) — a plain loop is clearer and the squaring trick adds no real speedup.
Common Mistakes
Forgetting to reduce b modulo m at the start, so the first squaring overflows on large bases.
Squaring b after the loop already exited, wasting one round and sometimes corrupting the answer in buggy variants.
Returning 0 for e = 0 instead of 1 — the empty product is 1, even when b = 0, in most conventions.
Try It with an AI Assistant
short
Write mod_pow(base, exp, m) using fast exponentiation by squaring.
behavior
Write a function that computes b raised to the e, modulo m, by walking through the binary digits of e from least significant to most significant. Keep a running result starting at 1 and a running base starting at b mod m. At each step, if the current bit of e is 1, multiply the result by the running base and reduce mod m; then square the running base mod m and shift e one bit right.
Made secret message transformation simple enough for humans.
For instance: A child can shift A→D, B→E to hide a note.
text ← "HELLO"
k ← 3
out ← ""
FOR EACH char c IN text
    IF c is letter THEN
        c ← shift(c, k)
    ENDIF
    append(out, c)
ENDFOR
RETURN out
Julius Caesar, on military campaign, shifted every letter of his dispatches three places forward in the alphabet. VENI became YHQL. Suetonius reports it as a curiosity in his life of Caesar.
Three letters — that is the whole cipher. Trivially broken today by anyone counting letter frequencies, it nonetheless gave the field of cryptography its starting point and its first villain to defeat.
Teaches: Transform data predictably using reversible rules
Anecdote
Suetonius reports Julius Caesar used a fixed shift of three in his military dispatches — VENI became YHQL. The cipher's value was speed; encoding and decoding could be done by anyone trained in five minutes, while modern frequency analysis would crack it in seconds.
The Idea
Treat the alphabet like a 26-hour clock. Shifting by k rotates every letter forward by k hours; shifting past Z wraps back around to A — exactly like 11 PM + 3 hours wraps to 2 AM.
Walk through the message one character at a time. If the character is a letter, replace it with the letter k positions later in the alphabet, wrapping around: new = ((c − 'A') + k) mod 26 + 'A'. If it's a space or punctuation, leave it alone. Append the result to your output string and move on.
Why does this work? The shift is a one-to-one rule on the 26 letters: every input letter maps to a different output letter, and shifting by −k undoes shifting by k, because adding k modulo 26 and then subtracting k modulo 26 returns the original position. It's also famously easy to break: there are only 25 useful keys (k = 1..25), so an attacker can simply try all of them and pick the version that looks like English. But the underlying pattern, a fixed and reversible transformation of each symbol, is the seed of every cipher that came after.
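The walk described above can be sketched in Python as follows. This is a minimal illustration using the formula new = ((c − 'A') + k) mod 26 + 'A'; the name caesar follows the prompt later in this section, and shifting lowercase letters within a–z is an added assumption.

```python
def caesar(text: str, k: int) -> str:
    """Shift ASCII letters by k positions, wrapping within the alphabet."""
    out = []
    for c in text:
        if "A" <= c <= "Z":
            base = ord("A")
        elif "a" <= c <= "z":
            base = ord("a")  # assumption: lowercase is shifted too
        else:
            out.append(c)  # spaces, digits, punctuation pass through
            continue
        out.append(chr((ord(c) - base + k) % 26 + base))
    return "".join(out)


print(caesar("HELLO", 3))   # KHOOR
print(caesar("KHOOR", -3))  # HELLO
```

Decryption is the same function with the key negated, since Python's % always returns a non-negative remainder for a positive modulus.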
Trace
step | c | is letter? | shift(c, 3) | out so far
1 | H | yes | (7 + 3) mod 26 = 10 → K | "K"
2 | E | yes | (4 + 3) mod 26 = 7 → H | "KH"
3 | L | yes | (11 + 3) mod 26 = 14 → O | "KHO"
4 | L | yes | (11 + 3) mod 26 = 14 → O | "KHOO"
5 | O | yes | (14 + 3) mod 26 = 17 → R | "KHOOR"
Where It's Used Today
ROT13 — a Caesar cipher with k = 13 is built into many forums, email clients, and Unix tools to lightly hide spoilers and joke punchlines.
Teaching cryptography — many introductory cryptography courses open with the Caesar cipher to introduce keys, encryption, and ciphertext attacks.
Captcha and puzzle games — Wordle-like puzzles, escape rooms, and treasure hunts hide clues with simple shift ciphers because the user can decode by hand.
Programming exercises — Caesar is the "Hello, World!" of crypto coding interviews and AP Computer Science problem sets.
Steganography helpers — light obfuscation of strings inside binaries (e.g. constants in older malware analysis labs) often turns out to be a Caesar shift.
When NOT to Use
For real security; it is a teaching cipher, not safe encryption.
When the alphabet is not fixed or has accents/symbols.
When attackers can see many examples of encrypted text.
Common Mistakes
Forgetting wrap-around from Z to A.
Changing punctuation when you intended to preserve it.
Treating it as secure encryption.
Try It with an AI Assistant
short
Write caesar(text, k) shifting each letter by k positions, wrapping A–Z.
behavior
Write a function that takes a string and an integer k. Walk through the string one character at a time. If the character is an A–Z letter, replace it with the letter k positions later in the alphabet, wrapping past Z back to A. Leave spaces and punctuation alone. Return the new string.
Made square-root approximation practical before calculators.
For instance: A builder can estimate the side of a square floor from its area.
n ← 30
IF n = 0 THEN RETURN 0 ENDIF
x ← n
REPEAT
    new_x ← (x + n / x) / 2
    IF new_x >= x THEN BREAK
    x ← new_x
UNTIL stable
RETURN x
Heron of Alexandria ran the most famous engineering school in the ancient world — he designed automatic temple doors, vending machines, and a steam-powered toy. The integer-square-root algorithm was a tool he needed for his treatise Metrica on practical surveying.
The “average of x and n/x” insight was his way of telling stoneworkers how to figure out the side of a square equal in area to any given rectangle. The same iteration, in floating-point form, still sits inside many modern sqrt routines.
Teaches: Iteratively refine guesses using feedback from error
The Idea
Start with a guess x — any guess works, but a safe choice is x = n itself. Now look at n / x. If x is too high, then n / x is too low; if x is too low, then n / x is too high. So the true square root must lie between them. The trick: average them. The average (x + n/x) / 2 is closer to the answer than x was.
Repeat. Each application of Heron's update brings you closer to the true square root. With integer division, the sequence eventually stops shrinking: that's when the next guess new_x is no smaller than x. At that moment, x is exactly floor(sqrt(n)). The invariant is simple: starting from x = n, x always stays at or above floor(sqrt(n)), and it strictly decreases until it lands on the floor.
Trace
step | x | n / x | new_x = (x + n/x) / 2 | what happens
0 | 30 | 1 | (30 + 1) / 2 = 15 | new_x < x → x ← 15
1 | 15 | 2 | (15 + 2) / 2 = 8 | new_x < x → x ← 8
2 | 8 | 3 | (8 + 3) / 2 = 5 | new_x < x → x ← 5
3 | 5 | 6 | (5 + 6) / 2 = 5 | new_x ≥ x → stop
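The iteration traced above can be sketched in Python as follows. It is a minimal illustration assuming integer (floor) division throughout and a starting guess of x = n; the name isqrt follows the prompt later in this section, and rejecting negative input is an added assumption.

```python
def isqrt(n: int) -> int:
    """Return floor(sqrt(n)) using Heron's update with integer division."""
    if n < 0:
        raise ValueError("n must be non-negative")  # assumption: reject bad input
    if n == 0:
        return 0  # guard: avoids dividing by x = 0 below
    x = n  # starting at n keeps x at or above the true root
    while True:
        new_x = (x + n // x) // 2  # integer average of x and n / x
        if new_x >= x:  # stopped shrinking: x is the floor of the root
            return x
        x = new_x


print(isqrt(30))  # 5
```

Python 3.8 and later also ship math.isqrt, which returns the same value for non-negative integers.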
Where It's Used Today
Computer graphics — distance comparisons in 2D and 3D engines often need an integer-only square root (think: how far is the enemy from the player?).
Embedded systems — microcontrollers without a floating-point unit use this exact iteration to compute sqrt.
Cryptography — trial-division primality checks and factoring routines need isqrt(n) to bound their search range.
Physics simulations — software floating-point square roots often take a rough initial guess and polish it with a few Heron (Newton) iterations.
Construction and surveying — Heron's original use: given a square plot of area n, what's the side length? Builders still ask the same question today.
When NOT to Use
When you need the exact decimal square root, not the floor — use floating-point sqrt or a fixed-point variant instead.
When n is negative — Heron's iteration assumes non-negative input; for n < 0 it will not converge to a meaningful answer, so reject such input up front.
When the hardware already has a fast FPU sqrt and you don't need bit-exact integer answers — the hardware is faster.
Common Mistakes
Starting with x = 0 or x = 1 for large n — n / 0 is undefined, and from x = 1 the first update gives new_x = (1 + n) / 2 ≥ x, so the stopping test fires immediately and returns 1.
Forgetting the n = 0 guard — the first division n / x becomes 0 / 0.
Using the wrong stopping condition (new_x == x instead of new_x >= x) — for some n the sequence oscillates between two values and never becomes equal.
Try It with an AI Assistant
short
Write isqrt(n) returning the integer square root of n using Heron’s method.
behavior
Write a function that, given a non-negative integer n, starts with x = n and repeatedly replaces x with the integer average of x and n divided by x. Stop as soon as the new value is no smaller than the old one, and return the old one.
Made multiple cycle constraints solvable together.
For instance: Find a number that leaves remainder 2 by 3, 3 by 5, and 2 by 7.
N ← product of moduli
x ← 0
FOR EACH (r_i, n_i)
    M_i ← N / n_i
    y_i ← inv(M_i, n_i)
    x ← x + r_i * M_i * y_i
ENDFOR
RETURN x MOD N
The original problem in Sunzi Suanjing is phrased as a riddle about remainders, not a theorem: “There is a number; when divided by 3 the remainder is 2; by 5, the remainder is 3; by 7, the remainder is 2. What is the number?”
It was a puzzle for merchants and officials, not abstract math — the algebraic theory came centuries later. The answer is 23, or 128, or any 23 + 105k. The same trick now powers RSA-CRT, accelerating decryption by 3–4×.
Teaches: Solve globally by combining consistent local constraints
Anecdote
The original problem in Sunzi Suanjing is phrased as a riddle about remainders, not a theorem. It was a puzzle for merchants and officials, not abstract math; the theory came centuries later.
The Idea
For each remainder rule (r_i, n_i), build a "selector" that contributes the right remainder modulo n_i and contributes zero modulo every other modulus. Multiply by M_i = N / n_i (so it's already a multiple of every other n_j), then multiply by the modular inverse of M_i modulo n_i to make it equal to 1 there. Multiply by r_i to get the wanted remainder, sum them all up, and reduce mod N.
Why does it work? Each term in the sum is "invisible" to every modulus except its own — it's a multiple of all the other n_j. So when you reduce the total mod n_i, only the i-th term survives, and we built it to equal r_i. The construction needs the moduli to share no common factor (pairwise coprime), so each modular inverse exists.
Trace
i | r_i | n_i | M_i = N/n_i | y_i = inv(M_i, n_i) | r_i · M_i · y_i
1 | 2 | 3 | 35 | inv(35 mod 3) = inv(2,3) = 2 | 2·35·2 = 140
2 | 3 | 5 | 21 | inv(21 mod 5) = inv(1,5) = 1 | 3·21·1 = 63
3 | 2 | 7 | 15 | inv(15 mod 7) = inv(1,7) = 1 | 2·15·1 = 30
Sum: 140 + 63 + 30 = 233, and 233 MOD 105 = 23.
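The construction traced above can be sketched in Python as follows. This is a minimal illustration that assumes the moduli are pairwise coprime; the name crt follows the prompt later in this section, and inv(M_i, n_i) is realized here with the built-in pow(M_i, -1, n_i), available in Python 3.8 and later.

```python
def crt(remainders: list[int], moduli: list[int]) -> int:
    """Return x in [0, N) with x % n_i == r_i for every pair (r_i, n_i)."""
    if len(remainders) != len(moduli):
        raise ValueError("need one remainder per modulus")
    N = 1
    for n_i in moduli:
        N *= n_i
    x = 0
    for r_i, n_i in zip(remainders, moduli):
        M_i = N // n_i  # integer division: a multiple of every other modulus
        y_i = pow(M_i, -1, n_i)  # raises ValueError if M_i is not invertible
        x += r_i * M_i * y_i  # equals r_i mod n_i and 0 mod the other moduli
    return x % N


print(crt([2, 3, 2], [3, 5, 7]))  # 23
```

If two moduli share a factor, pow raises ValueError, which doubles as the coprimality check mentioned under Common Mistakes.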
Where It's Used Today
RSA-CRT decryption — every modern crypto library splits the big modular exponentiation into two smaller ones using CRT, speeding up RSA decryption 3–4×.
Error-correcting codes — Reed-Solomon codes (used on CDs, DVDs, QR codes, and deep-space telemetry) rely on CRT-style reconstruction.
Secret sharing — Mignotte's and Asmuth-Bloom's schemes split a secret into shares using coprime moduli; you need enough shares to recombine via CRT.
Calendar puzzles — aligning lunar months, solar years, and planting cycles is exactly the original Sunzi problem in disguise.
Big-number multiplication — number-theoretic transforms multiply huge integers by working modulo several small primes and then stitching the answer back together with CRT.
When NOT to Use
When the moduli share a common factor — the modular inverses don't exist and the standard CRT construction breaks; you need the generalized Bezout-based variant instead.
When you only need one congruence — CRT is overkill; just take r mod n directly.
When the moduli are huge and you need only an approximate answer — a numerical method may be cheaper than computing exact modular inverses.
Common Mistakes
Forgetting to check that the moduli are pairwise coprime before calling inv(M_i, n_i), then crashing on a non-invertible value.
Returning the raw sum of the r_i · M_i · y_i terms without the final MOD N, leaving an answer that can be larger than N.
Computing M_i = N / n_i with floating-point division instead of integer division, introducing rounding errors on large moduli.
Try It with an AI Assistant
short
Write crt(remainders, moduli) that returns the unique x mod product(moduli) satisfying all the congruences, assuming pairwise coprime moduli.
behavior
Write a function that, given a list of remainders and a list of pairwise coprime moduli, finds an integer between 0 and the product of all moduli minus 1 that leaves each remainder when divided by its corresponding modulus. For each rule, build a term that is zero modulo every other modulus and equals the wanted remainder modulo its own; sum the terms and reduce.
For instance: Cryptography can undo multiplication even when working only with remainders.
(g, x, y) ← ext_gcd(a, m)
IF g != 1 THEN RETURN NONE ENDIF
RETURN x MOD m
FUNCTION ext_gcd(a, b)
    IF b = 0 THEN
        RETURN (a, 1, 0)
    ENDIF
    (g, x1, y1) ← ext_gcd(b, a MOD b)
    RETURN (g, y1, x1 - (a/b)*y1)
END FUNCTION
Aryabhata called it the pulverizer — kuttaka. The name is literal: you grind numbers down through repeated division, each step reducing the problem. A vivid Sanskrit metaphor lost in modern translation.
Aryabhata used it to solve calendar-alignment problems where lunar months had to be reconciled with solar years. The same algorithm runs in every RSA key generation today, finding the private exponent d such that e·d ≡ 1 (mod φ(n)).
Teaches: Undo operations by solving backward constraints
Anecdote
Aryabhata called it the "pulverizer" (kuttaka). The name is literal — you grind numbers down through repeated division — a vivid metaphor, lost in modern translation.
The Idea
The trick is the Extended Euclidean algorithm: while you're computing gcd(a, m) by remainders, you also keep track of how to write that gcd as a combination a·x + m·y. If the gcd turns out to be 1, then a·x + m·y = 1, which means a·x ≡ 1 (mod m) — and that x is exactly the inverse you wanted.
Why does it work? Each recursive call shrinks the problem the same way ordinary GCD does (replace (a, b) with (b, a mod b)), but it also unwinds on the way back, threading the bookkeeping coefficients through. When the recursion bottoms out at b = 0, you know a·1 + 0·0 = a — that's the seed. Each return step adjusts the coefficients so the equation stays true. If the final gcd isn't 1, no inverse exists.
Trace
step | call | unwinds to (g, x, y)
1 | ext_gcd(3, 11) | calls ext_gcd(11, 3)
2 | ext_gcd(11, 3) | calls ext_gcd(3, 2)
3 | ext_gcd(3, 2) | calls ext_gcd(2, 1)
4 | ext_gcd(2, 1) | calls ext_gcd(1, 0)
5 | ext_gcd(1, 0) | base: returns (1, 1, 0)
4 | ext_gcd(2, 1) | returns (1, 0, 1)
3 | ext_gcd(3, 2) | returns (1, 1, −1)
2 | ext_gcd(11, 3) | returns (1, −1, 4)
1 | ext_gcd(3, 11) | returns (1, 4, −1)
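The recursion traced above can be sketched in Python as follows. This is a minimal illustration assuming non-negative integer inputs; ext_gcd mirrors the pseudocode, and mod_inverse is the name used in the prompt later in this section.

```python
def ext_gcd(a: int, b: int) -> tuple[int, int, int]:
    """Return (g, x, y) with g = gcd(a, b) and a*x + b*y == g."""
    if b == 0:
        return a, 1, 0  # seed: a*1 + 0*0 == a
    g, x1, y1 = ext_gcd(b, a % b)
    # b*x1 + (a % b)*y1 == g and a % b == a - (a // b)*b,
    # so a*y1 + b*(x1 - (a // b)*y1) == g.
    return g, y1, x1 - (a // b) * y1


def mod_inverse(a: int, m: int):
    """Return x in [0, m) with (a * x) % m == 1, or None if none exists."""
    g, x, _ = ext_gcd(a, m)
    if g != 1:
        return None  # a and m share a factor, so no inverse exists
    return x % m  # reduce: the raw coefficient may be negative


print(ext_gcd(3, 11))      # (1, 4, -1)
print(mod_inverse(3, 11))  # 4
print(mod_inverse(4, 6))   # None
```

The recursion depth grows only logarithmically with the inputs, so Python's default recursion limit is not a practical concern for typical sizes.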
Where It's Used Today
RSA key generation — every RSA key pair behind a secure web connection (HTTPS, online banking) gets its private exponent d as the modular inverse of the public exponent e modulo φ(n).
Cryptographic signatures — DSA and ECDSA signatures call modular inverse on every signature, billions of times a day.
Hash table tricks — some perfect-hash and primes-based hashing schemes need modular inverses to undo the hash.
Computer algebra systems — solving linear equations over modular arithmetic (used in coding theory, error-correcting codes).
Calendar arithmetic — Aryabhata's original problem: finding the day of the week, or aligning lunar cycles to solar years, both reduce to modular inverse.
When NOT to Use
When gcd(a, m) != 1 — no inverse exists, so the algorithm must report failure rather than return a number.
When the modulus is a prime and you can use Fermat's little theorem (a^(m-2) mod m), which is shorter to code.
When you only need ordinary division over rationals — modular inverse is for clock arithmetic, not real numbers.
Common Mistakes
Returning x directly instead of x mod m — recursion can produce a negative coefficient that needs reducing.
Forgetting to check the returned gcd is 1 before treating x as a valid inverse.
Using plain Euclid (which only returns the gcd) instead of the extended version that tracks the coefficients.
Try It with an AI Assistant
short
Write mod_inverse(a, m) returning x such that a·x ≡ 1 (mod m), or report that no inverse exists.
behavior
Write a function that, given two positive integers a and m, finds an integer x between 1 and m−1 such that the remainder of a·x divided by m is 1. If no such x exists, report that. Use the extended remainder process: track how each remainder can be written as a combination of the original two numbers, and unwind to recover the coefficient on a.
For instance: A teacher can generate every seating arrangement for 5 students.
items ← [A, B, C]
k ← 0
IF k = length(items) - 1 THEN
    EMIT(items)
    RETURN
ENDIF
FOR i FROM k TO length(items) - 1
    swap(items, k, i)
    permute(items, k + 1)
    swap(items, k, i)
ENDFOR
Lilavati, the 1150 book that contains this recursive permutation procedure, was written by Bhāskara II and named after his daughter. According to a later legend, her horoscope allowed only one auspicious moment for her wedding; the water clock meant to mark it failed, the moment passed, and he wrote her this elegant math textbook as consolation.
Whether or not the legend is true, the algorithm we now use for n! arrangements lives in a book named after a daughter, written for someone the author loved.
Teaches: Explore all outcomes by fixing one choice at a time
The Idea
Fix the first seat, then permute everything that comes after. To fix the first seat, try each item in turn: swap it into position k, recursively permute the tail starting at k + 1, then swap it back so the list looks the same as before. When k reaches the last position, you've fixed every seat — emit the current arrangement.
Why does this work? At every level of the recursion, the prefix items[0..k−1] is locked in and unique to this branch. The recursive call below it is responsible for producing every ordering of the rest. Because the swap-and-unswap pair leaves the list unchanged, sibling branches see the same starting state, and no permutation is generated twice or missed. The total number of leaves is exactly n!.
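The swap, recurse, swap-back pattern described above can be sketched in Python as follows. This is a minimal illustration written as a generator; the name permutations follows the prompt later in this section, and yielding a tuple copy at each leaf is an added choice so callers never share the working list.

```python
def permutations(items):
    """Yield every ordering of items as a tuple, using swap and backtrack."""
    work = list(items)  # private copy: the caller's list is never mutated
    n = len(work)

    def permute(k):
        if k >= n - 1:  # every position is fixed
            yield tuple(work)  # emit a copy, not the shared list
            return
        for i in range(k, n):
            work[k], work[i] = work[i], work[k]  # choose work[i] for slot k
            yield from permute(k + 1)  # permute the tail
            work[k], work[i] = work[i], work[k]  # restore for the next sibling

    yield from permute(0)


for p in permutations(["A", "B", "C"]):
    print("".join(p))  # ABC, ACB, BAC, BCA, CBA, CAB (one per line)
```

For production use, the standard library's itertools.permutations covers the same need, though it emits the orderings in a different sequence.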
Where It's Used Today
Brute-force puzzle solving — anagram solvers, crossword fillers, and Sudoku checkers try permutations of letters or candidates.
Routing and scheduling — the travelling-salesman problem on small instances enumerates permutations of city orders to find the shortest tour.
Cryptography testing — checking that a cipher behaves correctly across every permutation of a small alphabet.
Statistics — permutation tests in data science shuffle labels to estimate how likely an observed effect is by chance.
Game design — generating every seating arrangement, every dealing order, or every move-order in turn-based games for AI search.
When NOT to Use
When n is more than about 10 — n! explodes (10! = 3.6M, 13! = 6 billion) and enumeration becomes infeasible.
When you only need a random ordering — use Fisher-Yates shuffle instead of generating all permutations.
When duplicates exist in the input and you need distinct orderings — naive recursion will emit the same arrangement multiple times.
Common Mistakes
Forgetting to swap back after the recursive call, so the list gets corrupted across sibling branches.
Emitting a reference to the same list at every leaf instead of a copy, so all results end up identical.
Using the wrong base case (k = length(items) vs k = length(items) - 1) and missing or duplicating arrangements.
Try It with an AI Assistant
short
Write permutations(items) that yields every permutation of the input list.
behavior
Write a function that takes a list and prints every possible reordering of it. Do this by fixing one position at a time: for each choice of what goes in position k, swap that item into position k, recursively reorder the items after it, then swap back to restore the list before the next choice.
Made repeated-shift encryption stronger than one fixed shift.
For instance: A keyword can change how each letter of a message is encrypted.
j ← 0
FOR EACH char c IN text
    IF c is letter THEN
        s ← key[j MOD len(key)]
        out.append(shift(c, s))
        j ← j + 1
    ELSE
        out.append(c)
    ENDIF
ENDFOR
RETURN out
Giovan Battista Bellaso published the cipher in 1553. Blaise de Vigenère didn’t invent it but did publish a stronger variant 33 years later — and history attached his name to the wrong version.
The cipher Vigenère actually deserves credit for (the “autokey” cipher) is more secure than the polyalphabetic shift everyone calls the Vigenère cipher. The misnomer has stuck for centuries.
Teaches: Strengthen patterns by varying transformation context
The Idea
Walk through the message one character at a time. For each letter, look at the next letter of the key (wrapping around to the start when you run out). Shift the message letter forward in the alphabet by the value of the key letter — A=0, B=1, ..., Z=25. Non-letters (spaces, punctuation) pass through unchanged, and the key counter j only advances when you actually encrypt a letter.
Why does this work? A single Caesar shift is easy to break because the most common letter in English (E) stays the most common in the cipher, so you can spot it. Vigenère breaks that pattern: the same plaintext letter is encrypted differently depending on where it falls under the key. The invariant is simple: the j-th letter of the message (counting letters only, from 0) always lines up with position j mod len(key) in the key. Decryption uses the same loop with subtraction instead of addition.
Trace
step | c | j | s = key[j mod 3] | shift | output letter
0 | H | 0 | K | +10 | R
1 | E | 1 | E | +4 | I
2 | L | 2 | Y | +24 | J
3 | L | 3 | K | +10 | V
4 | O | 4 | E | +4 | S
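The loop traced above can be sketched in Python as follows. This is a minimal illustration; the key "KEY" is inferred from the trace rather than stated in the pseudocode, the name vigenere follows the prompt later in this section, and the decrypt flag and lowercase handling are added assumptions.

```python
def vigenere(text: str, key: str, decrypt: bool = False) -> str:
    """Shift each ASCII letter by the matching key letter (A=0 ... Z=25)."""
    if not key or not all("A" <= ch <= "Z" for ch in key.upper()):
        raise ValueError("key must be a non-empty string of A-Z letters")
    shifts = [ord(ch) - ord("A") for ch in key.upper()]
    sign = -1 if decrypt else 1
    out = []
    j = 0  # counts letters only, so punctuation never shifts the alignment
    for c in text:
        if "A" <= c <= "Z":
            base = ord("A")
        elif "a" <= c <= "z":
            base = ord("a")  # assumption: lowercase is shifted within a-z
        else:
            out.append(c)  # non-letters pass through, j stays put
            continue
        s = sign * shifts[j % len(shifts)]
        out.append(chr((ord(c) - base + s) % 26 + base))
        j += 1
    return "".join(out)


print(vigenere("HELLO", "KEY"))                # RIJVS
print(vigenere("RIJVS", "KEY", decrypt=True))  # HELLO
```

Because Python's % returns a non-negative remainder for a positive modulus, decryption needs no explicit "add 26" step.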
Where It's Used Today
Teaching cryptography — Vigenère is a standard introduction to polyalphabetic ciphers in introductory security courses.
Capture-the-flag puzzles — beginner CTF challenges hide flags behind Vigenère, often broken with Kasiski's method.
Stream ciphers — ciphers such as RC4 (now retired) and ChaCha20 generalize the same idea: a key stream that varies per position.
One-time pads — when the key is truly random, as long as the message, and never reused, Vigenère becomes provably unbreakable.
Historical document analysis — codebreakers still recognize Vigenère in 18th- and 19th-century diplomatic correspondence.
When NOT to Use
When you need real security — Vigenère falls to Kasiski examination and frequency analysis once the key length is guessed.
When the same key is reused across many messages — that pattern leaks the key entirely, so use a modern stream cipher instead.
When the key is only a few characters long — a tiny key collapses Vigenère into a small set of Caesar shifts that are trivially broken.
Common Mistakes
Advancing the key index j on non-letter characters, which throws the alignment off after the first space or comma.
Forgetting to handle uppercase versus lowercase — mixing cases shifts letters by the wrong amount or out of the alphabet entirely.
Using subtraction without adding 26 before the modulo, producing negative indices when decrypting.
Try It with an AI Assistant
short
Write vigenere(text, key) that shifts each letter by the corresponding letter of the cyclically-repeated key.
behavior
Write a function that takes a message and a keyword. Walk through the message one letter at a time. For each letter, take the next letter of the keyword (wrapping back to the start when you run out), convert it to a number 0–25, and shift the message letter forward in the alphabet by that amount. Leave non-letters alone. Return the shifted message.
Made solving hard equations practical by approximation.
For instance: A computer can approximate where x²−2=0 without knowing √2 exactly.
x0 ← 1.5
iterations ← 4
// f(x) = x² − 2, fprime(x) = 2x
x ← x0
FOR i FROM 1 TO iterations
    x ← x - f(x) / fprime(x)
ENDFOR
RETURN x
In late-1660s Cambridge, Newton was deep in the calculus he had just invented and was wrestling with planetary orbits — problems where polynomial roots arose constantly and no closed form existed. His insight was simple but powerful: at any point on a smooth curve, the tangent line is an excellent local approximation, so following the tangent down to the x-axis gives a much better guess than any algebraic trick. The method converges so quickly — roughly doubling the number of correct digits each step — that 5–10 iterations are usually enough for full double-precision accuracy.
Teaches: Use local linear approximations to reach global solutions
Anecdote
Isaac Newton didn't publish it. It circulated in private letters, and the version we use today was actually clarified and extended by others (notably Raphson). Exactly how much of the method to attribute to Newton himself, who applied it to his astronomy problems, remains hazy.
The Idea
Imagine standing on a curve at the point (x, f(x)). Draw the tangent line — the straight line that just barely kisses the curve at that point. Slide down that tangent until it crosses the x-axis. That crossing point is your next, better guess. Repeat. The formula for the next guess is x ← x − f(x) / f'(x), where f' is the derivative (the slope of the tangent).
Why does it work? Near any smooth curve, a straight tangent line is a great local approximation. If your current guess is close to the true root, the tangent crosses the x-axis very near the root too — and each step roughly doubles the number of correct digits. After only a handful of iterations, you typically have more precision than a calculator can display.
Trace
i | x (before) | f(x) | f'(x) | x ← x − f(x)/f'(x)
1 | 1.5 | 0.25 | 3.0 | 1.41666666…
2 | 1.41666666… | 0.00694444… | 2.833… | 1.41421568…
3 | 1.41421568… | 0.00000601… | 2.828… | 1.41421356…
4 | 1.41421356… | ≈ 0 | 2.828… | 1.41421356…
Where It's Used Today
Calculator square roots and logarithms — when you press √ on a phone, Newton's method (or a close cousin) is often what refines the answer.
Computer graphics — finding where a ray hits a curved surface in ray tracing reduces to root-finding on the surface equation.
Physics simulators — solving equations of motion when no closed-form solution exists (orbital mechanics, fluid dynamics).
Machine learning — second-order optimizers like Newton's method for logistic regression find the parameters that minimize loss.
Engineering — circuit simulators (SPICE) solve nonlinear current-voltage equations at every time step using Newton iterations.
When NOT to Use
When the derivative is zero or near-zero at the guess — the tangent is flat and the step blows up to infinity.
When the function has multiple roots and you need a specific one; Newton can jump unpredictably to a different root or diverge.
When you cannot compute the derivative cheaply or analytically — use the secant method or bisection instead.
Common Mistakes
Forgetting to check f'(x) ≈ 0 before dividing, which crashes the program or returns NaN.
Choosing a starting guess too far from the root — Newton can oscillate forever or shoot off to infinity.
Iterating a fixed count instead of stopping when |f(x)| is below a tolerance, wasting work or quitting too early.
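The three mistakes above suggest a more defensive loop. The sketch below is one possible way to guard against them; the name `newton_safe` and the default values for `tol`, `max_iter`, and `eps` are assumptions chosen for illustration, not recommended settings.

```python
def newton_safe(f, fprime, x0, tol=1e-12, max_iter=50, eps=1e-14):
    """Newton iteration with a tolerance stop and a flat-tangent guard."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            # Close enough to a root: stop instead of iterating a fixed count.
            return x
        slope = fprime(x)
        if abs(slope) < eps:
            # A nearly flat tangent would send the next guess far away.
            raise ValueError("derivative too close to zero at x = %r" % x)
        x = x - fx / slope
    raise RuntimeError("no convergence within max_iter steps")


print(newton_safe(lambda x: x * x - 2, lambda x: 2 * x, 1.5))
```

Raising an error on a flat tangent or on non-convergence is a design choice; a caller could instead fall back to bisection.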
Try It with an AI Assistant
short
Approximate equation roots using tangent-line iterative refinement.
behavior
Write a function that, given a function f, its derivative fprime, an initial guess x0, and a number of iterations, repeatedly replaces the current guess with the guess minus f(guess) divided by fprime(guess), and returns the final guess as an approximate root of f.
Made simple probabilistic classification practical.
For instance: An email app can label a message as spam based on words it contains.
best ← NULL
best_score ← -infinity
FOR EACH class C
    s ← log(prior[C])
    FOR EACH feature f IN doc
        s ← s + log(p[C][f])
    ENDFOR
    IF s > best_score THEN
        best ← C
        best_score ← s
    ENDIF
ENDFOR
RETURN best
Thomas Bayes never published the theorem during his life. After Bayes died in 1761, his friend Richard Price found the manuscript among his papers and submitted it to the Royal Society in 1763 — two years after Bayes was already buried.
The Reverend’s posthumous theorem went on to do work he never imagined: Bayes never knew it would become the math behind spam filters.
Teaches: Combine weak independent signals into strong predictions
The Idea
Each class C has a base prior — how common that class is overall — and each word f has a learned probability p[C][f] of appearing in class C. To score a document, multiply the prior by every word's probability under that class. The class with the largest product wins. Because multiplying many small probabilities underflows quickly, we add logs instead.
The "naive" assumption is that words act independently, given the class. That's not really true — click and here tend to come together — but pretending they're independent works astonishingly well in practice, because each word still casts a small vote in the right direction. With enough training data, the right class accumulates more evidence than any wrong one. Add Laplace smoothing so a single never-seen word doesn't drive a probability to zero.
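As a rough illustration of the scoring rule and Laplace smoothing described above, here is a small Python sketch. It assumes documents arrive as lists of word tokens and passes the trained model around as a tuple; both choices, and the add-one smoothing denominator (total words in the class plus vocabulary size), are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def naive_bayes_train(docs, labels):
    """Count how often each class and each (class, word) pair occurs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def naive_bayes_predict(model, doc):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c in class_counts:
        s = math.log(class_counts[c] / total_docs)  # log prior
        total_words = sum(word_counts[c].values())
        for w in doc:
            # Laplace (add-one) smoothing keeps unseen words from zeroing a class.
            s += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        if s > best_score:
            best, best_score = c, s
    return best


docs = [["free", "money"], ["free", "offer"], ["meeting", "today"], ["lunch", "today"]]
labels = ["spam", "spam", "ham", "ham"]
model = naive_bayes_train(docs, labels)
print(naive_bayes_predict(model, ["free", "money"]))
```

The toy training set is made up for the example; real filters learn from far more text.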
Trace
C | start s | + log p[C]["free"] | + log p[C]["money"] | best_score
spam | -0.7 | -1.9 | -3.4 | -3.4 (best ← spam)
ham | -0.7 | -4.2 | -7.0 | -3.4 (no change)
Where It's Used Today
Email spam filters — Gmail's earliest filters and many self-hosted spam tools use Naive Bayes on word features.
News topic classification — sorting articles into sports, politics, technology, and so on.
Sentiment analysis — labeling product reviews as positive or negative based on word frequencies.
Medical screening — combining independent symptom indicators into a probability of a diagnosis.
Language detection — picking the most likely language for a short string of text from character or word patterns.
When NOT to Use
When features are strongly correlated and order matters, like phrase-level meaning — the independence assumption hides the real signal.
When a class has very little training data — the learned word probabilities are too noisy even with smoothing.
When you need calibrated probabilities (not just the top class) — Naive Bayes scores are notoriously over-confident.
Common Mistakes
Skipping Laplace smoothing — a single unseen word zeroes out an entire class probability.
Multiplying raw probabilities instead of summing logs — long documents underflow to zero and ties become random.
Forgetting to add the log prior, so a rare class is judged on the same footing as a common one.
Try It with an AI Assistant
short
Write naive_bayes_train(docs, labels) and naive_bayes_predict(doc) for text classification with Laplace smoothing.
behavior
Write two functions. The first reads labeled documents and learns, for every class, how common it is and how often each word appears in that class. The second takes a new document and, for every class, adds up the log of the class's frequency plus the log of how often each word in the document appeared in that class. Return the class with the highest total.
Made randomness useful for estimating hard quantities.
For instance: Throw random points at a square and estimate the circle’s area.
inside ← 0
FOR i FROM 1 TO n
    x ← uniform(0, 1)
    y ← uniform(0, 1)
    IF x*x + y*y <= 1 THEN
        inside ← inside + 1
    ENDIF
ENDFOR
RETURN 4 * inside / n
Georges-Louis Leclerc, Comte de Buffon, was 70 years old and director of the Royal Botanical Garden when he published the needle problem. He was famous for calculating Earth was at least 75,000 years old (the Bible said 6,000) and got hauled in front of religious authorities for the math.
The needle problem was a casual aside in a paper otherwise about probability and morality, but it is the ancestor of every Monte Carlo method invented since.
Teaches: Approximate truth through random sampling
The Idea
Throw a lot of random points into the unit square and count what fraction land inside the quarter-circle of radius 1. A point (x, y) is inside the quarter-circle exactly when x*x + y*y <= 1, which Pythagoras already told us. Count those points, divide by the total n, multiply by 4, and you have an estimate of π.
Why does it work? Because area is what probability picks out. Each random dart is equally likely to land anywhere in the square, so on average the fraction inside the quarter-circle equals the quarter-circle's area divided by the square's area — which is π/4. The estimate is noisy for small n and the error shrinks like 1/√n, so doubling your accuracy means quadrupling your darts. The point isn't speed; it's that randomness can compute things you don't know how to integrate.
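The dart-throwing estimate can be sketched as below. The optional `seed` parameter is an addition for reproducibility and is not part of the original pseudocode; Python's `random.random()` draws from [0, 1), which is close enough to the uniform(0, 1) the text assumes.

```python
import random


def estimate_pi(n, seed=None):
    """Estimate pi from the fraction of random points inside the quarter-circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x = rng.random()
        y = rng.random()
        if x * x + y * y <= 1:
            inside += 1
    # The fraction inside approximates pi / 4, so scale by 4.
    return 4 * inside / n


print(estimate_pi(100_000, seed=1))
```

Because the error shrinks like 1/√n, running this with 100 times more points should tighten the estimate by roughly one decimal digit.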
Trace
i | x | y | x*x + y*y | inside? | inside count
1 | 0.10 | 0.20 | 0.05 | yes | 1
2 | 0.80 | 0.90 | 1.45 | no | 1
3 | 0.50 | 0.30 | 0.34 | yes | 2
4 | 0.95 | 0.20 | 0.94 | yes | 3
5 | 0.70 | 0.80 | 1.13 | no | 3
6 | 0.40 | 0.40 | 0.32 | yes | 4
7 | 0.60 | 0.85 | 1.08 | no | 4
8 | 0.20 | 0.95 | 0.94 | yes | 5
Where It's Used Today
Physics simulations — particle collisions and neutron transport at places like Los Alamos and CERN use Monte Carlo to integrate over high-dimensional spaces nobody can solve in closed form.
Finance — pricing complex options and stress-testing portfolios by simulating thousands of random market futures.
Computer graphics — most modern film renderers and many games use Monte Carlo ray tracing to estimate how light bounces around a scene.
Risk analysis — insurance companies, climate modelers, and engineers run Monte Carlo simulations to estimate the chance of rare events.
Statistics teaching — the dart-in-circle demo is the classic first lesson in randomized estimation, used in classrooms worldwide.
When NOT to Use
When you need π to many digits — series like Machin's or BBP converge in tens of terms; with error shrinking like 1/√n, Monte Carlo needs on the order of trillions of samples for six digits.
When the problem has a closed-form integral or a low-dimensional grid will do — deterministic quadrature beats random sampling.
When your random number generator is biased or correlated — the estimate inherits the bias and quietly converges to the wrong number.
Common Mistakes
Forgetting to multiply by 4 — the count gives π/4, not π, because we sample only the unit square (one quadrant).
Using < instead of <= and worrying about it — the boundary has measure zero, but mixing strict/non-strict between trace and code confuses readers.
Expecting the error to halve when you double n — error shrinks like 1/√n, so you need 4× the samples for 2× the accuracy.
Try It with an AI Assistant
short
Write estimate_pi(n) that uses Monte Carlo sampling in the unit square to estimate π.
behavior
Write a function that takes an integer n, generates n random points with both coordinates between 0 and 1, counts how many of them satisfy x*x + y*y <= 1, and returns four times that count divided by n.
Made polygon area computable directly from coordinates.
For instance: A surveyor can compute land area from boundary points.
sum ← 0
FOR i FROM 0 TO n - 1
    j ← (i + 1) MOD n
    sum ← sum + v[i].x * v[j].y
    sum ← sum - v[j].x * v[i].y
ENDFOR
RETURN abs(sum) / 2
Gauss was 18 when he derived the shoelace formula for polygon area as part of triangulation work on land surveys for the Duke of Brunswick. The same year he proved the constructibility of the 17-gon, and a few years later wrote the Disquisitiones Arithmeticae.
The “shoelace” mnemonic — visualize lacing left, right, left, right — is much later. Gauss just used the alternating-sign sum directly.
Teaches: Turn geometry into simple accumulations
The Idea
Walk around the polygon, vertex by vertex. For each edge from vertex v[i] to its neighbor v[j], add v[i].x · v[j].y and subtract v[j].x · v[i].y. After visiting all edges, take the absolute value and divide by 2.
Why does it work? Each v[i].x · v[j].y − v[j].x · v[i].y term is twice the signed area of the triangle formed by the origin and that one edge. Summing over all edges, the inside of the polygon gets counted once and everything outside cancels out. The trick is the alternating signs: they're what makes the cancellation work, no matter how complicated the polygon's shape. Take half of the absolute value at the end and you have the area.
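A compact Python sketch of the edge-by-edge sum follows. It assumes the vertices are given as (x, y) tuples in boundary order for a simple (non-self-intersecting) polygon; the function name `polygon_area` is taken from the prompt later in this section.

```python
def polygon_area(vertices):
    """Shoelace formula: half the absolute value of the summed cross terms."""
    n = len(vertices)
    total = 0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap the last vertex back to the first
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


print(polygon_area([(0, 0), (4, 0), (0, 3)]))  # right triangle with legs 4 and 3
```

Listing the same triangle clockwise flips the sign of the running sum, which is why the absolute value is taken at the end.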
Trace
i | j | v[i] | v[j] | v[i].x · v[j].y | v[j].x · v[i].y | sum
0 | 1 | (0,0) | (4,0) | 0·0 = 0 | 4·0 = 0 | 0
1 | 2 | (4,0) | (0,3) | 4·3 = 12 | 0·0 = 0 | 12
2 | 0 | (0,3) | (0,0) | 0·0 = 0 | 0·3 = 0 | 12
Where It's Used Today
Land surveying — computing acreage of a parcel from its GPS-traced boundary points.
Geographic Information Systems (GIS) — measuring the area of a country, lake, or wildfire perimeter from a polygon outline.
Computer graphics — finding the area of a polygon to decide how brightly it should be lit, or to test if it's degenerate.
3D printing and CAD — computing cross-sectional area of a part for material estimates.
Game engines — fast area checks for collision shapes and field-of-view polygons.
When NOT to Use
When the polygon is self-intersecting (a figure-eight) — the formula returns the signed difference of lobes, not the visible area.
When edges are curved (arcs, splines) — the shoelace formula assumes straight segments and will undercount or overcount.
When points are given as latitude/longitude on a sphere for large regions — you need a spherical-area formula, not the planar shoelace.
Common Mistakes
Forgetting to wrap the last vertex back to the first; the closing edge v[n-1] → v[0] is essential to the cancellation.
Skipping the absolute value and reporting a negative area when the vertices happen to be listed clockwise.
Forgetting to divide by 2 at the end and reporting double the true area.
Try It with an AI Assistant
short
Write polygon_area(vertices) that returns the area of a simple polygon using the shoelace formula.
behavior
Write a function that takes a list of (x, y) corner points of a closed polygon, walks once around the boundary, and for each consecutive pair of corners adds the product of the first x with the second y and subtracts the product of the second x with the first y. Return half the absolute value of that running sum.
For instance: A cryptography system can check whether a number behaves like a square mod p.
r ← mod_pow(a, (p-1)/2, p)
IF r = 1 THEN
    RETURN TRUE
ENDIF
IF r = p - 1 THEN
    RETURN FALSE
ENDIF
RETURN "p NOT prime"
Quadratic reciprocity, the theorem most closely tied to this test, was something Gauss called the Theorema Aureum (“the Golden Theorem”). He proved it eight different ways during his lifetime; each new proof was an obsession that delighted him.
The first proof appears in Disquisitiones Arithmeticae in 1801, when Gauss was 24. Today the same test sits inside Solovay-Strassen, and a strengthened relative of it sits inside Miller-Rabin.
Teaches: Test properties using compact mathematical shortcuts
The Idea
Euler's criterion says: for an odd prime p and a not divisible by p, compute r = a^((p−1)/2) mod p. The answer is always either 1 or p − 1. If r = 1, then a is a quadratic residue — some x exists with x² ≡ a. If r = p − 1 (which is the same as −1 mod p), then a is a non-residue.
Why does it work? Fermat's little theorem says a^(p−1) ≡ 1 (mod p). So a^((p−1)/2) squared is 1, which means a^((p−1)/2) itself must be a square root of 1 — and modulo a prime, the only square roots of 1 are +1 and −1. The exponent (p−1)/2 is just big enough to land you on whichever of those two marks tells you whether a is a square. If you ever get a third value, your "prime" p wasn't really prime — which is why Miller-Rabin uses the same machinery.
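In Python the criterion is nearly a one-liner, because the built-in three-argument `pow` performs modular exponentiation. The sketch below assumes p is meant to be an odd prime; how it handles a ≡ 0 and a "third value" (by raising `ValueError`) is a choice made for this example.

```python
def is_quadratic_residue(a, p):
    """Euler's criterion for an odd prime p and a not divisible by p."""
    a %= p
    if a == 0:
        raise ValueError("a is divisible by p: neither residue nor non-residue")
    r = pow(a, (p - 1) // 2, p)  # modular exponentiation, integer exponent
    if r == 1:
        return True
    if r == p - 1:  # p - 1 plays the role of -1 mod p
        return False
    raise ValueError("got a value other than 1 or p - 1, so p is not prime")


print(is_quadratic_residue(2, 7))  # 3 * 3 = 9, and 9 mod 7 = 2
print(is_quadratic_residue(3, 7))
```

For a = 2 and p = 7 this matches the trace: 2³ = 8 ≡ 1 (mod 7), so 2 is a residue.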
Trace
step | computation | r
0 | start: r = 1 | 1
1 | multiply by 2: 1·2 | 2
2 | multiply by 2: 2·2 | 4
3 | multiply by 2: 4·2 = 8 ≡ 1 mod 7 | 1
Where It's Used Today
Primality testing — Solovay-Strassen uses Euler's criterion as its inner check, and Miller-Rabin uses a strengthened relative of the same idea.
Cryptographic key generation — RSA key generators repeatedly test candidate primes with probabilistic checks from this family.
Square roots modulo a prime — algorithms like Tonelli-Shanks first call Euler's criterion to confirm a square root exists before computing it.
Coding theory — quadratic residue codes (used in error correction) are built directly from the residue/non-residue structure of a prime.
Pseudo-random number generation — Blum-Blum-Shub and similar generators rely on the difficulty of distinguishing residues from non-residues modulo a composite.
When NOT to Use
When p is even or composite — Euler's criterion assumes an odd prime modulus, and the "+1 / -1" dichotomy breaks for composites (use Jacobi symbol instead).
When a is divisible by p — the criterion returns 0, which means neither residue nor non-residue, and code that only checks ==1 will misclassify it.
When you actually need the square root, not just the yes/no answer — Euler's criterion confirms existence but doesn't produce x; reach for Tonelli-Shanks.
Common Mistakes
Using ordinary pow(a, (p-1)//2) and then taking % p at the end — the intermediate value is astronomically large; you must use modular exponentiation throughout.
Comparing the result to -1 instead of p - 1 — in modular arithmetic the answer is p - 1, and r == -1 will always be false.
Forgetting integer division in (p-1)/2 — using floating-point division on a large prime loses precision and the exponent becomes wrong.
Try It with an AI Assistant
short
Write is_quadratic_residue(a, p) using Euler’s criterion (assume p is an odd prime).
behavior
Write a function that, given a number a and an odd prime p, decides whether some integer squared can leave the remainder a when divided by p. Compute a raised to the power (p−1)/2, taken mod p. If the result is 1, return TRUE. If it is p − 1, return FALSE.
Made best-fit prediction from noisy data practical.
For instance: A scientist can estimate a trend line through imperfect measurements.
n ← 0
sx ← sy ← sxx ← sxy ← 0
FOR EACH (x, y) IN points
    n ← n + 1
    sx ← sx + x; sy ← sy + y
    sxx ← sxx + x*x
    sxy ← sxy + x*y
ENDFOR
m ← (n*sxy - sx*sy) / (n*sxx - sx*sx)
b ← (sy - m*sx) / n
RETURN (m, b)
Legendre published least squares in 1805. Gauss claimed in 1809 that he’d had it since 1795 — setting off a quiet priority dispute that lasted decades.
The technique became indispensable when Gauss used it to predict where to find Ceres — a “lost” dwarf planet only spotted briefly in 1801 — within half a degree of the actual sky position. Legendre published the method, but Gauss’s prediction made it famous.
Teaches: Summarize data incrementally without storing everything
Anecdote
Legendre's 1805 publication set off a quiet priority dispute: Gauss had used the method privately for years but hadn't published. It became famous when Gauss used it to predict the orbit of the dwarf planet Ceres with stunning accuracy.
The Idea
Try a line; every point pulls it up or down. Least-squares chooses the line where the total squared vertical mistake is smallest.
"Best fit" has to mean something precise. Legendre's choice — the least-squares rule — is to minimize the total of the squared vertical distances from every point to the line. Why squared? Squaring the gaps means a point twice as far away counts four times as much, so the line refuses to leave any single point too lonely; it also makes the math come out to a clean closed-form solution. Calculus then gives a single formula for the slope m and intercept b that minimize this total.
The clever practical trick is that you don't need to remember every point to compute the answer. You only need five running totals: how many points you've seen (n), the sum of x, the sum of y, the sum of x·x, and the sum of x·y. As each new point streams in, you update these five numbers, and at the end a small formula gives you slope and intercept. This one-pass structure is why the same algorithm runs on a billion-row dataset just as easily as on five points.
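The five running totals translate directly into Python. This is a minimal sketch using plain floats; the zero-denominator check (raising `ValueError` when all x values coincide or the input is empty) is an addition for this example.

```python
def linear_fit(points):
    """One-pass least-squares line: returns (slope, intercept)."""
    n = 0
    sx = sy = sxx = sxy = 0.0
    for x, y in points:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    denom = n * sxx - sx * sx
    if denom == 0:
        # All x values equal (or no points): the slope is undefined.
        raise ValueError("cannot fit a line: x values do not vary")
    m = (n * sxy - sx * sy) / denom
    b = (sy - m * sx) / n
    return m, b


print(linear_fit([(1, 2), (2, 3), (3, 5), (4, 4), (5, 6)]))
```

With the five points from the trace, the totals are n = 5, Σx = 15, Σy = 20, Σxx = 55, Σxy = 69, which gives a slope of 45/50 = 0.9 and an intercept of 6.5/5 = 1.3 up to floating-point rounding.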
Trace
step | (x, y) | n | sx | sy | sxx | sxy
0 | start | 0 | 0 | 0 | 0 | 0
1 | (1, 2) | 1 | 1 | 2 | 1 | 2
2 | (2, 3) | 2 | 3 | 5 | 5 | 8
3 | (3, 5) | 3 | 6 | 10 | 14 | 23
4 | (4, 4) | 4 | 10 | 14 | 30 | 39
5 | (5, 6) | 5 | 15 | 20 | 55 | 69
Where It's Used Today
Spreadsheets — the SLOPE and INTERCEPT functions in Excel and Google Sheets run exactly this calculation.
Science labs — fitting a line through experimental measurements to extract a physical constant from noisy data.
Sports analytics — relating practice time to performance, or shot distance to accuracy.
Economics and forecasting — predicting demand from price, or sales from advertising spend.
Machine learning — least-squares is the simplest baseline model, and the foundation that more complex regressors are measured against.
When NOT to Use
When the relationship between x and y is clearly curved (parabolic, exponential, periodic) — a straight line will misrepresent every prediction.
When the data has heavy outliers — squaring the residuals lets one bad point swing the whole line; use a robust regressor (Huber, RANSAC) instead.
When all x values are equal or nearly so — the denominator n·Σxx − (Σx)² collapses to zero and the slope is undefined.
Common Mistakes
Computing Σ(x·y) as Σx · Σy — these are very different quantities and the second is wrong.
Accumulating sums in 32-bit floats over millions of points — precision degrades and the slope drifts; use 64-bit or a numerically stable form (Welford-style).
Reporting a slope without checking the fit — the formula always returns numbers, even when the line explains almost nothing about the data.
Try It with an AI Assistant
short
Write linear_fit(points) returning (slope, intercept) using least-squares — implement the formulas directly, don't call a library.
behavior
Write a function that takes a list of (x, y) points and, in a single pass, accumulates five running totals: the count of points, the sum of x, the sum of y, the sum of x·x, and the sum of x·y. After the pass, use these to compute the slope m = (n·Σxy − Σx·Σy) / (n·Σxx − Σx·Σx) and the intercept b = (Σy − m·Σx) / n. Return (m, b).
For instance: A programmer can count valid ways to parenthesize expressions.
c[0] ← 1
FOR i FROM 1 TO n
    c[i] ← 0
    FOR j FROM 0 TO i - 1
        c[i] ← c[i] + c[j] * c[i-1-j]
    ENDFOR
ENDFOR
RETURN c[n]
Euler discovered Catalan numbers in 1751 while counting triangulations of polygons — 87 years before Eugène Catalan. The numbers are named after Catalan, who in 1838 used them for a different problem (counting balanced parentheses).
By the time Catalan published, Euler had been dead 55 years. The credit landed on the wrong mathematician through the same priority accident that misnamed the Vigenère cipher.
Teaches: Count complex structures via recursive decomposition
The Idea
The trick is recursive decomposition. Pick any balanced object of "size" i, say a string of i matched pairs of parentheses. Look at the first opening parenthesis and the close that matches it. Inside that pair sits a balanced object of some size j, and to its right sits another balanced object of size i − 1 − j. Sum over every possible split j = 0, 1, …, i − 1, and you get the total count for size i.
That's the convolution recurrence: c[i] = c[0]·c[i−1] + c[1]·c[i−2] + … + c[i−1]·c[0]. The invariant is simple — c[i] is the number of balanced structures of size i. We seed c[0] = 1 (one way to do nothing) and build up. Because each entry depends only on smaller entries, the table fills cleanly in O(n²) time.
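The convolution recurrence can be sketched as a small table-filling loop. Python integers are arbitrary precision, so this version does not overflow; the function name `catalan` follows the prompt later in this section.

```python
def catalan(n):
    """n-th Catalan number via c[i] = sum of c[j] * c[i-1-j] for j in 0..i-1."""
    c = [0] * (n + 1)
    c[0] = 1  # one way to build the empty structure
    for i in range(1, n + 1):
        for j in range(i):
            c[i] += c[j] * c[i - 1 - j]
    return c[n]


print([catalan(k) for k in range(6)])  # [1, 1, 2, 5, 14, 42]
```

Each entry only reads smaller entries, which is what makes the bottom-up fill order safe.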
Where It's Used Today
Computer algebra — counting binary trees, which appear everywhere from expression trees to syntax trees.
Combinatorics on the page — answering "how many ways can 2n people seated at a round table shake hands in pairs without any handshakes crossing."
Computational biology — counting RNA secondary structures, which are essentially nested matchings.
Probability and random walks — the n-th Catalan number counts lattice paths from (0,0) to (n,n) that never cross the diagonal.
When NOT to Use
When n is large (say above a few hundred) — the numbers explode in size and the O(n²) DP becomes both slow and big-integer heavy; use the closed-form C(2n, n)/(n+1) with arbitrary-precision arithmetic.
When you only need a single value — the multiplicative recurrence c[i] = c[i-1] · 2(2i-1)/(i+1) is O(n) and skips the table entirely.
When you actually want to enumerate the structures, not count them — Catalan numbers tell you "how many," not "which ones."
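The multiplicative recurrence mentioned above can be sketched as follows. The name `catalan_fast` is hypothetical; the integer division is exact at every step because each intermediate value is itself a Catalan number.

```python
def catalan_fast(n):
    """n-th Catalan number via c[i] = c[i-1] * 2 * (2i - 1) / (i + 1)."""
    c = 1
    for i in range(1, n + 1):
        # Multiply first, then divide, so the division stays exact.
        c = c * 2 * (2 * i - 1) // (i + 1)
    return c


print(catalan_fast(10))  # 16796
# Compare against the largest signed 64-bit integer to see where fixed-width ints give out.
print(catalan_fast(35) <= 2**63 - 1, catalan_fast(36) <= 2**63 - 1)
```

This runs in O(n) multiplications and needs no table, at the cost of producing only a single value.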
Common Mistakes
Forgetting to seed c[0] = 1, which silently gives 0 for every later term.
Off-by-one in the inner sum — using c[i-j] instead of c[i-1-j] shifts the recurrence and produces wrong values starting at c[1], because the j = 0 term then multiplies by the still-zero c[i].
Using plain int for the result — c[36] already overflows 64-bit signed; switch to big integers or modular arithmetic.
Try It with an AI Assistant
short
Write catalan(n) returning the n-th Catalan number using the convolution recurrence.
behavior
Write a function that, given a non-negative integer n, builds a table c[0..n] where c[0] = 1 and each later c[i] is the sum, over j from 0 to i−1, of c[j] · c[i−1−j]. Return c[n].
For instance: A clockmaker can choose gear ratios close to a desired value.
path ← []
left ← (0, 1); right ← (1, 0)
WHILE TRUE
    m ← left + right  // component-wise (vector) sum
    IF m = (p, q) THEN
        BREAK
    ENDIF
    IF p/q < m_x/m_y THEN
        right ← m
        path.append('L')
    ELSE
        left ← m
        path.append('R')
    ENDIF
ENDWHILE
RETURN path
Stern (a German mathematician) and Brocot (a French clockmaker) discovered the same fraction tree independently, three years apart. Brocot’s motivation was practical — he needed gear ratios closest to a desired ratio, with constraints on tooth counts.
The world’s most elegant rational-approximation tool came from a working clockmaker’s shop, not an academic paper. Every Stern-Brocot fraction is automatically in lowest terms.
Teaches: Generate all possibilities through ordered mediants
The Idea
Keep two "boundary" fractions, left = 0/1 and right = 1/0 (think of 1/0 as "infinity"). The mediant of two fractions a/b and c/d is (a+c)/(b+d). At each step, compute the mediant m of left and right. If the mediant equals our target, we're done. If our target is smaller than the mediant, the target lives in the left half — so set right = m and record 'L'. Otherwise set left = m and record 'R'.
Why does this work? The mediant always falls strictly between left and right, and it is automatically already in lowest terms. So each step narrows the "fraction interval" containing our target while maintaining the invariant that left < target < right. Because every reduced fraction sits at a unique node, the search must terminate exactly when the mediant matches.
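The mediant search can be sketched in Python as below. Fractions are stored as (numerator, denominator) tuples, and the comparison uses cross-multiplication so that everything stays in exact integer arithmetic. The up-front check that p/q is positive and reduced is an addition for this example, since a non-reduced target would never match a mediant.

```python
from math import gcd


def stern_brocot_path(p, q):
    """Return the L/R path from the root of the Stern-Brocot tree to p/q."""
    if p <= 0 or q <= 0 or gcd(p, q) != 1:
        raise ValueError("p/q must be a positive fraction in lowest terms")
    left, right = (0, 1), (1, 0)  # 1/0 stands in for infinity
    path = []
    while True:
        m = (left[0] + right[0], left[1] + right[1])  # mediant
        if m == (p, q):
            break
        if p * m[1] < m[0] * q:  # p/q < m_x/m_y without division
            right = m
            path.append("L")
        else:
            left = m
            path.append("R")
    return "".join(path)


print(stern_brocot_path(3, 5))  # LRL
```

For 3/5 the mediants visited are 1/1, 1/2, 2/3, and finally 3/5, matching the trace.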
Trace
step | left | right | m (mediant) | compare | action
0 | (0,1) | (1,0) | (1,1) | 3/5 < 1/1 | right ← m, path += 'L'
1 | (0,1) | (1,1) | (1,2) | 3/5 > 1/2 | left ← m, path += 'R'
2 | (1,2) | (1,1) | (2,3) | 3/5 < 2/3 | right ← m, path += 'L'
3 | (1,2) | (2,3) | (3,5) | m = (p,q) | break
Where It's Used Today
Clockmaking and gear design — Brocot's original use: pick a gear ratio with a small tooth count that approximates a target value such as π or 365.2425/365 (calendar drift).
Music tuning — finding rational approximations to irrational frequency ratios for just-intonation instruments.
Computer graphics and CAD — choosing pixel-aligned slopes and rational approximations for line drawing.
Continued-fraction algorithms — the Stern-Brocot tree is a visual cousin of continued fractions; both produce the best rational approximations to a real number.
Number theory pedagogy — every reduced fraction appears exactly once in the tree, which makes it a beautiful proof that the rationals are countable.
When NOT to Use
When the input fraction is not in lowest terms — the search will overshoot the target and never converge; reduce with gcd first.
When you only need a decimal approximation of an irrational — a continued-fraction expansion or floating-point math is more direct.
When the target denominator is huge (millions) — the path can be very long, so iterative gcd-style approaches are faster.
Common Mistakes
Comparing p/q to the mediant with floating-point division — use cross-multiplication (p·m_y vs m_x·q) to stay exact.
Forgetting that right starts as 1/0 (not 1/1) — using 1/1 cuts off all fractions greater than 1.
Appending the move after checking equality without breaking — produces an extra trailing letter on the returned path.
Try It with an AI Assistant
short
Write stern_brocot_path(p, q) returning the L/R path to fraction p/q in the Stern-Brocot tree.
behavior
Write a function that, given a reduced fraction p/q, starts with two boundary fractions 0/1 and 1/0, repeatedly forms the mediant (a+c)/(b+d) of the boundaries, and records 'L' if p/q is smaller than the mediant (and replaces the right boundary) or 'R' if larger (and replaces the left boundary). Stop when the mediant equals p/q and return the recorded string.
For instance: A math class can compare two sequences with the same rule but different starts.
a ← 2
b ← 1
FOR i FROM 1 TO n
    t ← a + b
    a ← b
    b ← t
ENDFOR
RETURN a
Édouard Lucas was far more famous for his recreational creations (Tower of Hanoi puzzle, the Récréations Mathématiques book series) than rigorous theorems. He died at 49 from septicemia caused by — improbably — a banquet plate that broke and cut his cheek.
The Lucas numbers were a footnote to his work on prime testing. They survived because the Lucas-Lehmer test for Mersenne primes still relies on closely related Lucas sequences. Same Fibonacci recurrence, different seeds: 2, 1, 3, 4, 7, 11…
Teaches: Small rule changes create whole new sequences
The Idea
Keep two running variables, a and b, holding the previous and current Lucas numbers. Start with a = 2 and b = 1. Each step, compute t = a + b (the next number), then slide forward: a takes the old value of b, and b takes the new value t. After n steps, a holds the answer.
Why does this work? The invariant is that after i steps, a is lucas(i) and b is lucas(i+1). The update preserves this exactly: the new a becomes the old b (which is lucas(i+1)), and the new b becomes a + b = lucas(i) + lucas(i+1) = lucas(i+2). We use only two variables, so the whole calculation is fast and uses constant memory.
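The two-variable loop can be sketched in Python, where tuple assignment replaces the temporary t. The function name `lucas_number` follows the prompt later in this section.

```python
def lucas_number(n):
    """n-th Lucas number using two rolling variables and seeds (2, 1)."""
    a, b = 2, 1
    for _ in range(n):
        # Invariant after i steps: a == lucas(i), b == lucas(i + 1).
        a, b = b, a + b
    return a


print([lucas_number(i) for i in range(6)])  # [2, 1, 3, 4, 7, 11]
```

The simultaneous assignment evaluates the right-hand side first, so the old value of a is still available when computing a + b.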
Trace
i | a | b | t = a + b | what happens
start | 2 | 1 | - | seeds
1 | 1 | 3 | 3 | a ← 1, b ← 3
2 | 3 | 4 | 4 | a ← 3, b ← 4
3 | 4 | 7 | 7 | a ← 4, b ← 7
4 | 7 | 11 | 11 | a ← 7, b ← 11
5 | 11 | 18 | 18 | a ← 11, b ← 18
Where It's Used Today
Mersenne prime testing — the Lucas-Lehmer test, the standard way to certify huge primes of the form 2^p − 1, runs on a Lucas-style recurrence.
Cryptography — Lucas pseudoprimes appear in primality tests like Baillie-PSW, used by libraries that generate prime numbers for RSA.
Number theory teaching — Lucas numbers are the standard "second example" after Fibonacci, used to show that the same recurrence with different seeds produces a whole new sequence.
Combinatorics — Lucas numbers count specific tilings (for example, ways to tile a circular strip of length n with squares and dominoes).
Algorithm classrooms — a clean, two-line example of dynamic-programming-with-rolling-variables, often used to teach the technique before tackling harder problems.
When NOT to Use
When you actually want Fibonacci numbers — the seeds (0, 1) give a different sequence; don't conflate them.
When n is huge (millions) — the loop is fine but the numbers themselves grow exponentially and need big-integer arithmetic; consider matrix exponentiation for O(log n) instead.
When you need a closed-form value modulo something specific — there are direct identities like lucas(n) = phi^n + psi^n that may suit better.
Common Mistakes
Seeding with (1, 2) instead of (2, 1) — yields 1, 2, 3, 5, 8, … (a shifted Fibonacci sequence) rather than the Lucas numbers, so values are wrong from index 0.
Updating a and b in the wrong order, e.g. setting a ← b before computing t, which destroys the previous value needed for the sum.
Returning b instead of a after the loop — gives lucas(n+1) rather than lucas(n).
Try It with an AI Assistant
short
Write lucas_number(n) using the same Fibonacci recurrence with seeds (2, 1).
behavior
Write a function that, given a non-negative integer n, returns the n-th term of a sequence that starts with 2 and 1, where every later term is the sum of the two terms before it. Use only two rolling variables, not recursion.
For instance: A maze solver can follow a path until stuck, then backtrack.
FUNCTION dfs(g, v, seen)
    seen.add(v)
    visit(v)
    FOR EACH n IN g[v]
        IF n NOT IN seen THEN
            dfs(g, n, seen)
        ENDIF
    ENDFOR
END FUNCTION
g ← {1: [2, 3], 2: [4], 3: [], 4: []}
start ← 1
seen ← empty set
dfs(g, start, seen)
RETURN seen
Charles Pierre Trémaux invented the rule to solve mazes in the Paris sewer system — exactly the DFS we know today, but disguised. His rule: mark every passage you enter; never enter a fully-marked passage if there’s an unmarked one.
The rule came 70 years before formal graph theory. The same recursion now drives topological sort, cycle detection, connected components, and every backtracking solver.
Teaches: Explore deeply before considering alternatives
Anecdote
Charles Pierre Trémaux invented it to solve mazes in the Paris sewer system — exactly the DFS we know today, in disguise.
The Idea
Keep a set called seen of vertices you've already visited. From the current vertex v, mark it seen, then look at each neighbor n. If you haven't seen n yet, recurse into it — make n the new "current" vertex and repeat the same rule. When all neighbors of v have been handled, the recursion unwinds and you back up to where you came from.
Why does it work? The seen set is the invariant: a vertex enters the set the moment we arrive at it, and we never recurse into a vertex already in the set. So the recursion can't loop forever, and it can't miss a reachable vertex either, because every neighbor is either already seen or visited next by a recursive call. When the call stack finally drains, seen contains exactly the vertices reachable from your initial vertex (in an undirected graph, that is its connected component).
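A recursive Python sketch of this rule is shown below. It assumes the graph is a dict mapping each vertex to a list of neighbors, as in the pseudocode; returning the visit order as a list (rather than calling a separate visit function) is a choice made for this example.

```python
def dfs(graph, start):
    """Return the vertices reachable from start, in depth-first visit order."""
    seen = set()
    order = []

    def walk(v):
        seen.add(v)        # mark on arrival, before exploring neighbors
        order.append(v)
        for n in graph[v]:
            if n not in seen:
                walk(n)

    walk(start)
    return order


g = {1: [2, 3], 2: [4], 3: [], 4: []}
print(dfs(g, 1))  # [1, 2, 4, 3]
```

Marking the vertex before the loop is what keeps the recursion from cycling on graphs such as {1: [2], 2: [1]}.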
Trace
(This trace uses a lettered graph: A → B, C; B → D; C → E; D and E have no neighbors.)
step | call | seen (after) | visit | what happens
1 | dfs(g, A) | {A} | A | first neighbor B not seen → recurse
2 | dfs(g, B) | {A, B} | B | neighbor D not seen → recurse
3 | dfs(g, D) | {A, B, D} | D | no neighbors, return
4 | back in B | {A, B, D} | - | no more neighbors, return
5 | back in A | {A, B, D} | - | next neighbor C not seen → recurse
6 | dfs(g, C) | {A, B, D, C} | C | neighbor E not seen → recurse
7 | dfs(g, E) | {A, B, D, C, E} | E | no neighbors, return
Where It's Used Today
Compilers — topological sort of import and include statements (and detecting circular imports) is a DFS at heart.
Web crawlers — many crawlers explore "depth-first" within a single domain to map a site before moving on.
Maze and puzzle solvers — Sudoku, N-queens, and crossword fillers all use DFS with backtracking.
Network analysis — finding connected components in social networks (e.g., who can reach whom) and detecting cycles in dependency graphs.
File system tools — find on UNIX walks a directory tree depth-first; so do "delete folder and everything inside" operations.
When NOT to Use
When you need the shortest path between two nodes — DFS may find a long route first; use BFS instead.
When the graph is very deep and recursion would blow the call stack; switch to an iterative version with an explicit stack.
When you need to process nodes in order of distance from the start (level-by-level), which is BFS's job.
Common Mistakes
Forgetting to mark a vertex as seen before recursing, causing infinite loops on graphs with cycles.
Marking only after the recursive call returns, so the same neighbor gets visited multiple times.
Treating the visit order as a shortest-path order — DFS gives reachability, not distance.
Try It with an AI Assistant
short
Write dfs(graph, start) that visits every reachable vertex via depth-first search.
behavior
Write a function that, starting from one vertex of a graph, walks as far along edges as it can, marking every vertex it touches; whenever it reaches a vertex with no unmarked neighbors, it backs up to the most recent vertex that still has one and continues from there. The walk ends when no marked vertex has an unmarked neighbor.
Made sorting by digits possible without comparisons.
For instance: Postal codes can be sorted digit by digit.
nums ← [170, 45, 75, 90, 2, 802, 24, 66]
max_digits ← 3
FOR pos FROM 0 TO max_digits - 1
    bins ← array of 10 empty lists
    FOR EACH n IN nums
        d ← floor(n / 10^pos) MOD 10
        append(bins[d], n)
    ENDFOR
    nums ← concat(bins)
ENDFOR
RETURN nums
Herman Hollerith built machines to sort punch cards for the US Census. His company later became IBM — radix sort literally helped launch the modern computing industry.
The 1890 Census was expected to take 10 years to tabulate by hand; with Hollerith’s machines, the total population count was ready after about six weeks of processing. The trick: sort by least-significant digit first, then by next, with a stable sort each pass.
Teaches: Sort by parts instead of whole comparisons
Anecdote
Herman Hollerith built machines to sort punch cards for the US Census. His company later became IBM — radix sort literally helped launch modern computing companies.
The Idea
Pretend you have ten labeled bins, one for each digit 0 through 9. Make a pass over the numbers, and drop each one into the bin matching its ones digit. Empty the bins in order back into the list. Now repeat using the tens digit, then the hundreds digit, all the way up to the longest number's most significant digit.
Why does this work? After pass pos, the list is sorted by its last pos + 1 digits, with ties broken by their original order. Because each bin pass is stable (items keep their relative order when going into the same bin), the work done on earlier digits is never undone. By the final pass, every digit has been considered, and the list is fully sorted. No comparison between whole numbers is ever needed.
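As a rough sketch of the bin-and-concatenate passes described above, the following Python function sorts non-negative integers least-significant digit first. The function name `radix_sort`, the `base` parameter, and the choice to return a new list are assumptions for illustration.

```python
def radix_sort(nums, base=10):
    """LSD radix sort for non-negative integers; returns a new sorted list."""
    if not nums:
        return []
    result = list(nums)
    largest = max(result)
    place = 1  # 1, 10, 100, ... kept as an integer to avoid float rounding
    while place <= largest:
        bins = [[] for _ in range(base)]
        for n in result:
            # Appending preserves arrival order, so each pass is stable.
            bins[(n // place) % base].append(n)
        result = [n for b in bins for n in b]  # empty the bins in order
        place *= base
    return result


print(radix_sort([170, 45, 75, 90, 2, 802, 24, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```

Using integer division (`//`) rather than floating-point division sidesteps the rounding problem that can appear when dividing by a power of ten.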
Trace

| pos | digit looked at | nums after this pass |
|---|---|---|
| 0 | ones | [170, 90, 2, 802, 24, 45, 75, 66] |
| 1 | tens | [2, 802, 24, 45, 66, 170, 75, 90] |
| 2 | hundreds | [2, 24, 45, 66, 75, 90, 170, 802] |
Where It's Used Today
Postal mail sorting — automated mail centers route letters by ZIP code one digit at a time, exactly the way Hollerith's machines did.
Database engines — sorting fixed-width integer keys (timestamps, IDs) where comparisons would be slow.
Graphics pipelines — sorting pixels or particles by depth or screen position when the values are bounded integers.
Suffix array construction — bioinformatics and full-text search use radix-style passes to order genomic suffixes for fast lookup.
Network packet processing — routers sort millions of packets per second by IP address fields using digit-bucket techniques.
When NOT to Use
When the keys are floating-point numbers or arbitrary objects — there's no natural digit decomposition to bin them by.
When the maximum value is huge relative to the list size (e.g., 10 numbers up to 10^18) — comparison sorts are faster.
When memory is tight — radix sort needs O(n + base) extra space for the bins, unlike in-place sorts like heapsort.
Common Mistakes
Sorting from most-significant digit first without recursing into buckets — the result is not actually sorted.
Using an unstable per-pass sort, which destroys the ordering work done by previous digit passes.
Using floor division by 10^pos with floating-point arithmetic and losing low-order digits to rounding.
Try It with an AI Assistant
short
Write radix_sort(numbers) that sorts non-negative integers using LSD radix sort.
behavior
Write a function that sorts a list of non-negative integers without ever comparing two numbers directly. Instead, repeatedly drop each number into one of ten buckets based on its current digit (start with the ones place, then the tens, then the hundreds, and so on up to the longest number), and after each pass concatenate the buckets back into the list, preserving their order.
For instance: A teacher can see how test scores are spread across ranges.
min_v ← min(values)
max_v ← max(values)
w ← (max_v - min_v) / k
bins ← k zeros
FOR EACH x IN values
    i ← floor((x - min_v) / w)
    i ← min(k - 1, i)
    bins[i] ← bins[i] + 1
ENDFOR
RETURN bins
Karl Pearson coined the word “histogram” in lectures at University College London in 1891. He took it from Greek histos (mast, web) + gramma (writing) — literally writing about masts, because the bars looked like a row of upright posts.
Pearson coined “standard deviation,” “moment,” and “histogram” all in the same productive decade. He was inventing modern statistics on the fly.
Teaches: Summarize data by grouping into meaningful bins
The Idea
First, find the range of the data — the smallest value min_v and the largest max_v. Divide that range into k equal-width slices; each slice has width w = (max_v − min_v) / k. Make an array bins of k zeros. Then walk through every value x, compute which bin it falls in by i = floor((x − min_v) / w), and increment bins[i].
Why does it work? Each value lands in exactly one bin because the bins partition the range without overlap. The one edge case is x = max_v itself: (max_v − min_v) / w equals k, so i comes out as k, one past the last valid index (floating-point rounding can push values just below max_v there too). The line i ← min(k - 1, i) clamps that case so the maximum lands cleanly in the last bin. The invariant: after processing the first t values, the bins sum to exactly t.
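A minimal Python sketch of the binning rule above might look like this. The function name `histogram` and the handling of the all-values-equal case (everything goes into the first bin) are assumptions chosen for the example; the text itself does not prescribe them.

```python
import math


def histogram(values, k):
    """Count how many values fall into each of k equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: with zero range, put every value in the first bin
        # instead of dividing by a zero bin width.
        return [len(values)] + [0] * (k - 1)
    w = (hi - lo) / k
    bins = [0] * k
    for x in values:
        i = math.floor((x - lo) / w)
        i = min(k - 1, i)  # clamp so the maximum lands in the last bin
        bins[i] += 1
    return bins


scores = [55, 62, 67, 70, 71, 75, 78, 82, 88, 95]
print(histogram(scores, 4))  # [2, 3, 3, 2]
```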
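Trace
(values = [55, 62, 67, 70, 71, 75, 78, 82, 88, 95], k = 4, so min_v = 55 and w = 10)

| step | x | i = floor((x−55)/10) | clamped i | bins after |
|---|---|---|---|---|
| 0 | 55 | 0 | 0 | [1, 0, 0, 0] |
| 1 | 62 | 0 | 0 | [2, 0, 0, 0] |
| 2 | 67 | 1 | 1 | [2, 1, 0, 0] |
| 3 | 70 | 1 | 1 | [2, 2, 0, 0] |
| 4 | 71 | 1 | 1 | [2, 3, 0, 0] |
| 5 | 75 | 2 | 2 | [2, 3, 1, 0] |
| 6 | 78 | 2 | 2 | [2, 3, 2, 0] |
| 7 | 82 | 2 | 2 | [2, 3, 3, 0] |
| 8 | 88 | 3 | 3 | [2, 3, 3, 1] |
| 9 | 95 | 4 | 3 | [2, 3, 3, 2] |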
Where It's Used Today
Photo editing — every camera and phone shows a brightness histogram so you can see if the photo is too dark or blown out.
Grading and education — teachers plot test-score distributions to decide where to set the curve.
Image processing — histogram equalization stretches a histogram across the full range to improve contrast in medical scans and satellite images.
Quality control — factories plot histograms of part dimensions to spot drift in a manufacturing process.
Data science onboarding — usually the first chart you draw on a new dataset, before any model.
When NOT to Use
When the data is categorical (colors, country names) — use a bar chart of counts per category instead, since there is no numeric range to slice.
When the dataset has heavy outliers stretching the range — equal-width bins leave most of the chart empty; switch to log bins or a box plot.
When you need exact percentile or rank information — a histogram smears values within each bin, so use a sorted list or empirical CDF.
Common Mistakes
Forgetting to clamp the maximum value into the last bin, so it overflows to index k and crashes or undercounts.
Picking too few bins (everything looks flat) or too many (every bar is height 1), hiding the actual shape of the distribution.
Computing bin width from (max - min) / k without handling the case where all values are equal, which causes a divide-by-zero.
Try It with an AI Assistant
short
Write histogram(values, n_bins) returning a list of n_bins counts.
behavior
Write a function that takes a list of numbers and a positive integer k. Find the minimum and maximum of the list, divide that range into k equal-width slices, and return a list of k counts where each count is how many input numbers fell into that slice. Make sure the maximum value lands in the last slice rather than overflowing.
Made fair comparison across different scales possible.
For instance: Compare a math score and a reading score by standard deviations.
values ← [2, 4, 4, 4, 5, 5, 7, 9]
n ← length(values)
mean ← sum(values) / n
var ← sum((x - mean)^2 FOR EACH x IN values) / n
std ← sqrt(var)
result ← empty list
FOR EACH x IN values
    append(result, (x - mean) / std)
ENDFOR
RETURN result
The z-score is so fundamental that no single person gets credit. The earliest fully-recognizable use is in Pearson’s 1894 paper on biological measurement — but the underlying idea emerged from astronomy and physics in the 19th century, where measurement errors had to be standardized to combine observations from different instruments.
Quetelet’s “average man” (1835) is an ancestor: every property tracked relative to its population mean. Today every distance-based ML algorithm needs standardized inputs — otherwise one feature dominates.
Teaches: Compare fairly by removing scale and centering
The Idea
Compute the mean (the average) of the values. Subtract the mean from every value — that centers the data around zero. Compute the standard deviation (the typical distance from the mean), then divide every centered value by that standard deviation — that rescales the spread to 1.
After this two-step shift-and-stretch, the new list has mean 0 and standard deviation 1, regardless of what units the original numbers were in. This is what "removing scale" means: a transformed value of +2 always means "two standard deviations above average," whether the original measurement was in dollars, kilograms, or test points. The procedure is purely mechanical, but the effect is profound — every distance-based machine-learning algorithm depends on it, because otherwise one feature with a large numeric range would dominate every distance computation.
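The shift-and-stretch procedure can be sketched in Python as below. It divides by n (the population standard deviation), matching the pseudocode; the function name `standardize` and the decision to raise an error for constant input are assumptions made for this example.

```python
import math


def standardize(values):
    """Return the z-score of each value, using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n  # divide by n, not n - 1
    std = math.sqrt(var)
    if std == 0:
        # Assumption: refuse constant input rather than return NaN values.
        raise ValueError("all values are equal; z-scores are undefined")
    return [(x - mean) / std for x in values]


print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
# [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Here the mean is 5 and the standard deviation is 2, so the value 9 maps to +2: two standard deviations above average.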
Where It's Used Today
Machine learning preprocessing — almost every classifier (k-NN, SVM, neural networks) standardizes features so that "income" in dollars doesn't drown out "age" in years.
Standardized testing — the SAT and GRE convert raw scores into a fixed mean-and-spread scale so results are comparable across test years.
Medical lab results — your blood-test report often shows how far a value sits from the typical population, expressed in standard-deviation units.
Finance — risk models report asset moves in standard deviations ("a 3-sigma event") to flag unusual price swings.
Quality control — factories use z-scores on measurements (bolt diameters, fill weights) to detect when a manufacturing process drifts off-target.
When NOT to Use
When the data is highly skewed or has heavy outliers — the mean and standard deviation get pulled around, so prefer median/IQR-based scaling.
When features are sparse counts or one-hot vectors and zero has a real meaning — standardizing destroys sparsity and the meaning of zero.
When you only need values bounded to [0, 1] for a sigmoid or image pixel — use min-max scaling instead, which doesn't assume a bell shape.
Common Mistakes
Computing mean and standard deviation on the full dataset before splitting train/test, leaking test information into the model.
Dividing by a standard deviation of zero when a feature is constant, producing NaN throughout the column.
Re-fitting the standardizer on each new batch in production instead of saving the train-time mean and std and reusing them.
Try It with an AI Assistant
short
Write standardize(values) that returns the z-scores of the input.
behavior
Write a function that takes a list of numbers, computes their mean and standard deviation, then returns a new list where each value is replaced by its distance from the mean divided by the standard deviation.
For instance: A chess program assumes the opponent will make the best reply.
FUNCTION minimax(node, depth, maximizing)
    IF depth = 0 OR terminal(node) THEN
        RETURN evaluate(node)
    ENDIF
    IF maximizing THEN
        best ← -∞
        FOR EACH child IN children(node)
            best ← max(best, minimax(child, depth - 1, FALSE))
        ENDFOR
        RETURN best
    ELSE
        best ← +∞
        FOR EACH child IN children(node)
            best ← min(best, minimax(child, depth - 1, TRUE))
        ENDFOR
        RETURN best
    ENDIF
END FUNCTION

node ← root          // root of the game tree
depth ← 2            // search to depth 2 (reaches leaves)
maximizing ← TRUE    // X to move
RETURN minimax(node, depth, maximizing)
In 1928, John von Neumann published Zur Theorie der Gesellschaftsspiele ("On the Theory of Parlor Games") at the University of Berlin, proving the famous minimax theorem for two-player zero-sum games. He later carried the idea to the Institute for Advanced Study at Princeton, where it became the foundation of game theory and, by the 1950s, the core search procedure for the first chess and checkers programs. Claude Shannon's 1950 paper on chess and Arthur Samuel's checkers player both used minimax search trees with depth-limited evaluation, the same recipe that powered Stockfish decades later.
Teaches: Plan by assuming the worst response from your opponent
The Idea
Imagine the game as a tree. The current position is the root. Each child is a position you could reach in one move. The leaves are positions where the game has ended (or where you've decided to stop and just evaluate the board). To score the root, score the leaves directly, then bubble values up: at each internal node, take the max of the children's values if it's your turn, or the min if it's the opponent's. The value at the root is what perfect play yields.
This works because both sides are playing optimally inside the model — you assume the worst-case opponent, and the value you compute is the score you can guarantee. The depth parameter limits how deep the recursion goes; for chess you can't reach the leaves, so you stop at some depth and call evaluate(node) to estimate who's winning.
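To make the max/min bubbling concrete, here is a small Python sketch. The tree encoding is an assumption for illustration only: an internal node is a list of children and a leaf is a plain number, and the caller supplies a hypothetical `evaluate` function that scores a node when the search stops.

```python
def minimax(node, depth, maximizing, evaluate):
    """Return the value of node assuming both players play optimally."""
    # Assumption: internal nodes are lists of children; anything else is a leaf.
    children = node if isinstance(node, list) else []
    if depth == 0 or not children:
        return evaluate(node)
    # Each recursive call flips whose turn it is and uses up one ply of depth.
    values = [minimax(c, depth - 1, not maximizing, evaluate) for c in children]
    return max(values) if maximizing else min(values)


# Root with three children A, B, C whose leaf values are 3,5 / 2,9 / 1.
tree = [[3, 5], [2, 9], [1]]
print(minimax(tree, 2, True, evaluate=lambda leaf: leaf))  # 3
```

With depth 2 the search reaches the leaves, so `evaluate` just returns the leaf's own value. In a real game the evaluation would estimate who is winning at a cut-off position.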
Trace

| node | maximizing? | children values | best |
|---|---|---|---|
| A | no (min) | 3, 5 | 3 |
| B | no (min) | 2, 9 | 2 |
| C | no (min) | 1 | 1 |
| root | yes (max) | 3, 2, 1 | 3 |
Where It's Used Today
Chess engines — Stockfish and many predecessors search a minimax tree (with alpha-beta pruning), using handcrafted or neural-network evaluation at the leaves.
Tic-tac-toe and Connect Four — small games where Minimax can search to the end and play perfectly.
Game AI for board games — Othello, Checkers, and most two-player turn-based programs.
Adversarial decision making — robust planning where you assume an adversary (a competitor, weather, hardware failure) plays the worst response.
Economics and game theory — von Neumann's original setting; pricing and bidding strategies use minimax-style reasoning.
When NOT to Use
When the game is not zero-sum or has more than two players — the max/min duality stops describing the right thing.
When the branching factor is huge and search time is limited, like full Go — pure minimax explores billions of useless nodes.
When moves involve hidden information or randomness, like poker — minimax assumes both sides see the same board.
Common Mistakes
Forgetting to flip the maximizing flag in the recursive call, so both sides act like the same player.
Returning the score but losing track of which move produced it, leaving the engine unable to actually play.
Treating depth as plies for one side instead of one ply per call — the search ends in the middle of an exchange.
Try It with an AI Assistant
short
Write minimax(node, depth, max_player) returning the optimal value of a two-player zero-sum game.
behavior
Write a recursive function on a game tree. If the position is a leaf or you've recursed deep enough, return the position's evaluation. Otherwise, recurse into every child position. If it's the maximizer's turn, return the largest child value; if it's the minimizer's turn, return the smallest.
For instance: A card game can shuffle so every deck order is equally likely.
arr ← [A, B, C, D]
n ← length(arr)
FOR i FROM n - 1 DOWN TO 1
    j ← random_int(0, i)
    swap(arr[i], arr[j])
ENDFOR
RETURN arr
Earlier manual shuffling methods often introduced bias. Fisher and Yates designed a systematic process for generating every permutation with equal probability.
Needed a mathematically fair way to randomize lists and statistical samples.
Teaches: Ensure fairness by swapping with shrinking random choices
Anecdote
Originally designed for manual shuffling using random number tables, not computers. The "modern" version (in-place swap) was later popularized by Donald Knuth — many still incorrectly credit him as the inventor.
The Idea
Walk the array from the back to the front. At each position i, pick a random index j between 0 and i (inclusive), and swap arr[i] with arr[j]. After the swap, position i is "locked in" — its value will never move again. Then move on to i - 1 and pick from a smaller pool of remaining slots.
The invariant is: after the swap at position i, every value that could end up at position i had an equal 1/(i+1) chance of getting there. Multiply those probabilities down the array and every permutation comes out with probability exactly 1/n!. The trick that makes this work — and that everyone gets wrong on their first try — is that the random index j must be drawn from 0 to i, not from 0 to n-1. Drawing from the whole array biases the result.
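The back-to-front walk can be sketched in Python as follows. Two details are assumptions for illustration: the function returns a shuffled copy rather than shuffling in place, and it accepts an optional `rng` argument so a seeded `random.Random` can make a run repeatable.

```python
import random


def fisher_yates_shuffle(items, rng=None):
    """Return a uniformly shuffled copy of items."""
    rng = rng or random.Random()
    arr = list(items)
    for i in range(len(arr) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends: 0 <= j <= i
        arr[i], arr[j] = arr[j], arr[i]
        # Position i is now locked in; later steps only touch indices below i.
    return arr


print(fisher_yates_shuffle(["A", "B", "C", "D"]))
```

Note that `random.randint(a, b)` includes both endpoints, which is exactly the 0..i range the algorithm needs. For a security-sensitive shuffle, `random.SystemRandom()` could be passed as `rng` instead.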
Trace

| i | range for j | j picked | swap | arr after |
|---|---|---|---|---|
| 3 | 0..3 | 1 | swap arr[3], arr[1] | [A, D, C, B] |
| 2 | 0..2 | 0 | swap arr[2], arr[0] | [C, D, A, B] |
| 1 | 0..1 | 0 | swap arr[1], arr[0] | [D, C, A, B] |
Where It's Used Today
Online card games — poker, solitaire, and trading-card-game servers typically use Fisher-Yates to deal a fair hand.
Music players — Spotify and Apple Music use shuffle variants of Fisher-Yates to randomize a playlist without repeats.
Statistics and machine learning — randomly shuffling a dataset before splitting it into training and test sets, or before each epoch of training.
Cryptography test vectors — generating random permutations for testing cipher and hash function behavior.
Election audits and survey sampling — randomly ordering voters or respondents so the audit pool is unbiased.
When NOT to Use
When you need a cryptographically unpredictable shuffle — pair Fisher-Yates with a CSPRNG, not the default Math.random or rand().
When you need the same shuffle repeated across machines without sharing state — use a seeded RNG or a permutation derived from a hash.
When the array is enormous and lives on disk — the in-place swap pattern assumes random access; for sequential storage use a different shuffle.
Common Mistakes
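Drawing j from 0..n-1 instead of 0..i — every position then has n choices, giving n^n equally likely swap sequences. For n > 2, n^n is not a multiple of n!, so those sequences cannot map evenly onto the n! orderings and some orderings come up more often than others.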
Walking front-to-back without swapping the right range — the classic "naive shuffle" looks fine but produces some permutations more often than others.
Using a poor-quality RNG with too small a state — you can't generate all n! permutations if the RNG has fewer than n! possible seeds (e.g., 32-bit seed and n = 13).
Try It with an AI Assistant
short
Randomly shuffle array elements by swapping each position with a random index at or before it.
behavior
Write a function that takes a list and shuffles it in place. Walk from the last index down to index 1. At each index i, pick a uniformly random integer j between 0 and i inclusive, and swap the elements at positions i and j. Return the modified list.
Made unique Fibonacci-based number representation possible.
For instance: Represent 100 as non-neighboring Fibonacci numbers.
n ← 100
fibs ← [1, 2]
WHILE last(fibs) <= n
    append(fibs, fibs[-1] + fibs[-2])
ENDWHILE
result ← empty list
FOR i FROM length(fibs) - 1 DOWN TO 0
    IF fibs[i] <= n THEN
        append(result, fibs[i])
        n ← n - fibs[i]
    ENDIF
ENDFOR
RETURN result
Zeckendorf was a Belgian Army medic who pursued mathematics as a hobby. Working alone in the 1930s, he proved that every positive integer admits a unique decomposition into non-consecutive Fibonacci numbers — a theorem that revealed unexpected structure hidden inside the most familiar sequence in mathematics. The result quietly seeded a whole family of numeral systems and coding schemes that would matter decades later.
Teaches: Express uniquely using non-overlapping building blocks
Anecdote
Édouard Zeckendorf was an amateur mathematician and Belgian Army medic who proved his theorem in 1939 — but didn't publish it until 1972, 33 years later, after he'd retired. By then a Dutch mathematician had independently proved the same result in 1952. Zeckendorf's slow publication cost him primary credit on a theorem that's now in every introductory number theory course.
The Idea
Be greedy. Build a list of Fibonacci numbers fibs up to (and just past) n. Then walk that list from the largest down. At each step, if the current Fibonacci number fits into the remaining n, take it — add it to the answer and subtract it from n. Move to the next-smaller Fibonacci number. Stop when n reaches 0.
Why does it always produce non-consecutive Fibonacci numbers? Because once you take fib[i], the leftover is strictly less than fib[i-1]. (If it weren't, then fib[i] + fib[i-1] = fib[i+1] would have been the smarter choice — but you already passed fib[i+1].) So the next Fibonacci you can possibly take is fib[i-2] or smaller — never the immediate neighbor. This is the invariant that makes the representation unique.
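The greedy walk can be sketched in Python like this. The function name `zeckendorf` and the choice to raise an error for non-positive input are assumptions for the example.

```python
def zeckendorf(n):
    """Return the non-consecutive Fibonacci numbers that sum to n, largest first."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    fibs = [1, 2]  # start at 1, 2 so the value 1 appears only once
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    result = []
    for f in reversed(fibs):  # largest Fibonacci number first
        if f <= n:
            result.append(f)
            n -= f
    return result


print(zeckendorf(100))  # [89, 8, 3]
```

No explicit neighbor check is needed: after taking a Fibonacci number, the remainder is always too small for the next one down, so the greedy choice skips it automatically.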
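Trace

| step | n | fib being checked | take it? | result so far |
|---|---|---|---|---|
| 0 | 100 | 144 | no (too big) | [] |
| 1 | 100 | 89 | yes | [89] |
| 2 | 11 | 55 | no | [89] |
| 3 | 11 | 34 | no | [89] |
| 4 | 11 | 21 | no | [89] |
| 5 | 11 | 13 | no | [89] |
| 6 | 11 | 8 | yes | [89, 8] |
| 7 | 3 | 5 | no | [89, 8] |
| 8 | 3 | 3 | yes | [89, 8, 3] |
| 9 | 0 | (done) | — | [89, 8, 3] |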
Where It's Used Today
Fibonacci coding — a variable-length code used to compress data with small integers, where the "no two consecutive" rule lets a decoder spot the boundary between numbers.
Number theory teaching — a clean example of a "non-positional" numeral system, used in introductory courses on representations.
Combinatorial game theory — Wythoff's game and other Fibonacci-based games use Zeckendorf decompositions to describe winning positions.
Hashing tricks — Fibonacci hashing uses related properties of the golden ratio for spreading keys across hash tables.
Recreational math and puzzles — competition problems often hinge on the uniqueness of the Zeckendorf representation.
When NOT to Use
When you need a positional binary or decimal representation for arithmetic — Zeckendorf form makes addition and multiplication awkward.
When n is zero or negative — the theorem is defined only for positive integers, so the algorithm has no meaning here.
When you need every Fibonacci-sum decomposition of n — Zeckendorf gives the unique non-consecutive one, not all of them.
Common Mistakes
Including 1 twice at the start of fibs ([1, 1, 2, 3, ...]) — Zeckendorf uses each Fibonacci value once, so start with [1, 2].
Walking the Fibonacci list from smallest to largest instead of largest to smallest — the greedy property only holds top-down.
Allowing the algorithm to pick fib[i] and fib[i-1] (neighbors) by skipping the implicit invariant check, breaking uniqueness.
Try It with an AI Assistant
short
Write zeckendorf(n) returning the unique list of non-consecutive Fibonacci numbers whose values sum to n.
behavior
Write a function that, given a positive integer n, builds the Fibonacci numbers up to n and then greedily subtracts the largest Fibonacci number that still fits, repeating until n reaches zero. Return the list of Fibonacci numbers used; no two of them should be neighbors in the Fibonacci sequence.
For instance: A computer can sort a huge list by splitting and merging.
FUNCTION mergeSort(arr)
    IF length(arr) <= 1 THEN
        RETURN arr
    ENDIF
    mid ← length(arr) DIV 2
    left ← mergeSort(arr[0..mid-1])
    right ← mergeSort(arr[mid..END])
    RETURN merge(left, right)
END FUNCTION

arr ← [5, 2, 8, 1]
RETURN mergeSort(arr)
Computers were becoming powerful enough to process large datasets, but simple quadratic sorts became too slow. Merge sort's divide-and-conquer strategy scaled dramatically better.
Needed an efficient sorting algorithm for early stored-program computers.
Teaches: Divide problems to conquer them
Anecdote
John von Neumann designed it for magnetic tape storage, where random access was expensive. Merge sort is fundamentally a sequential-access algorithm, which is why it still dominates external sorting today.
The Idea
Merge sort is built on a simple recursive recipe. Divide the list in half. Sort each half by calling merge sort on it. Merge the two sorted halves into one sorted list by repeatedly taking whichever front item is smaller. The base case is a list of one element — already sorted, by definition.
Why does it work? The merge step is the heart. If left and right are both already sorted, then the smallest item in the combined result must be the smaller of left[0] and right[0]. Take it, advance that side's pointer, and repeat. This invariant — both inputs to merge are sorted — is exactly what the recursion guarantees. Splitting halves the problem each time, so a list of a million items takes only about 20 levels of splitting, and each level does total work proportional to the list size. That's why it runs in time roughly n log n instead of n².
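Since the merge step carries most of the logic, a short Python sketch may help. The names `merge` and `merge_sort` follow the pseudocode; returning new lists (rather than sorting in place) is an assumption made to keep the example short.

```python
def merge(left, right):
    """Merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # <= takes from the left on ties, keeping the sort stable
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two has leftovers
    merged.extend(right[j:])
    return merged


def merge_sort(arr):
    """Return a new sorted list using divide-and-conquer."""
    if len(arr) <= 1:
        return list(arr)
    mid = len(arr) // 2
    return merge(merge_sort(arr[:mid]), merge_sort(arr[mid:]))


print(merge_sort([5, 2, 8, 1]))  # [1, 2, 5, 8]
```

The two `extend` calls handle the leftover items once one half runs out, a step that is easy to forget.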
Trace

| step | call | result |
|---|---|---|
| 1 | mergeSort([5, 2, 8, 1]) | split at mid = 2 |
| 2 | mergeSort([5, 2]) | split at mid = 1 |
| 3 | mergeSort([5]) → [5] | base case |
| 4 | mergeSort([2]) → [2] | base case |
| 5 | merge([5], [2]) → [2, 5] | left = [2, 5] |
| 6 | mergeSort([8, 1]) | split at mid = 1 |
| 7 | mergeSort([8]) → [8] | base case |
| 8 | mergeSort([1]) → [1] | base case |
| 9 | merge([8], [1]) → [1, 8] | right = [1, 8] |
| 10 | merge([2, 5], [1, 8]) | take 1, 2, 5, 8 |
Where It's Used Today
External sorting — sorting files too large to fit in memory (database indexes, log files) still uses merge sort, because it streams data sequentially and never needs random access.
Java's Arrays.sort for objects — uses Timsort, a merge-sort variant that detects already-sorted runs and merges them.
Python's sorted() and list.sort() — also Timsort, the same merge-sort variant.
Database query engines — sort-merge joins use merge sort to align two tables on a key before merging matching rows.
Inversion counting — counting how out-of-order a list is (used in recommendation systems and statistics) is a tiny tweak to the merge step.
When NOT to Use
When memory is extremely limited and extra arrays are a problem.
When the dataset is tiny; insertion sort may be simpler and faster.
When in-place sorting is required and implementation complexity matters.
Common Mistakes
Forgetting the merge step is where sorted order is created.
Not handling leftover items after one half is empty.
Copying too much data without noticing memory cost.
Try It with an AI Assistant
short
Write merge_sort(a) returning a new list, sorted ascending; recursive and stable.
behavior
Write a function that sorts a list by splitting it in half, recursively sorting each half, then walking down the two sorted halves with two pointers and repeatedly taking whichever front item is smaller until both halves are exhausted. Return the combined result. A list of length 0 or 1 is already sorted.
For instance: Find a name in a phone book by opening near the middle repeatedly.
low ← 0
high ← length(arr) - 1
WHILE low <= high
    mid ← (low + high) DIV 2
    IF arr[mid] = target THEN
        RETURN mid
    ENDIF
    IF arr[mid] < target THEN
        low ← mid + 1
    ELSE
        high ← mid - 1
    ENDIF
ENDWHILE
RETURN -1
Instead of checking items one by one, binary search repeatedly halves the search space. It became one of the defining examples of logarithmic efficiency.
Needed fast lookup in sorted collections.
Teaches: Ask smarter questions to eliminate possibilities
Anecdote
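Early published versions were subtly wrong. According to Donald Knuth, the first binary search was published in 1946, but the first version that worked correctly for all list sizes did not appear until 1962. Off-by-one errors and overflow bugs in computing the midpoint, (low + high) / 2, survived in widely used library code into the 2000s.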
The Idea
Look at the middle of the list. If that value is the target, you're done. If the target is smaller, the target — if it's there at all — must be in the left half, so throw the right half away. If the target is larger, throw the left half away. Repeat on whichever half remains.
Why this works: because the list is sorted, comparing the target to the middle tells you exactly which half it could possibly live in. The "could-possibly-be" range, tracked by low and high, is the invariant — and that range cuts in half every step. Starting with a million entries, after one step you have 500,000 left to consider; after twenty steps, you're down to one. That's why binary search can find a name in a million-entry phone book in roughly 20 comparisons instead of a million.
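A compact Python sketch of the halving loop is shown below. The sample array is hypothetical: only the values 14, 19, and 23 at indices 4, 5, and 6 come from the trace, and the remaining entries are filled in for illustration.

```python
def binary_search(arr, target):
    """Return the index of target in sorted arr, or -1 if it is absent."""
    low, high = 0, len(arr) - 1
    while low <= high:  # the range low..high is where target could still be
        # Written this way to avoid overflow in fixed-width integer languages;
        # in Python, (low + high) // 2 would work equally well.
        mid = low + (high - low) // 2
        if arr[mid] == target:
            return mid
        if arr[mid] < target:
            low = mid + 1   # discard the left half, including mid
        else:
            high = mid - 1  # discard the right half, including mid
    return -1


# Hypothetical sorted data; indices 4..6 match the trace (14, 19, 23).
arr = [2, 5, 8, 11, 14, 19, 23, 31, 42]
print(binary_search(arr, 19))  # 5
print(binary_search(arr, 20))  # -1
```

Moving `low` and `high` past `mid` on every step is what guarantees the range shrinks and the loop terminates.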
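Trace
(target = 19; arr has 9 sorted entries, with arr[4] = 14, arr[5] = 19, arr[6] = 23)

| step | low | high | mid | arr[mid] | action |
|---|---|---|---|---|---|
| 0 | 0 | 8 | 4 | 14 | 14 < 19 → low = mid + 1 = 5 |
| 1 | 5 | 8 | 6 | 23 | 23 > 19 → high = mid − 1 = 5 |
| 2 | 5 | 5 | 5 | 19 | match → return 5 |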
Where It's Used Today
Database indexes — looking up a row by primary key in a sorted B-tree uses binary search at every level.
Phone contacts and dictionaries — finding a name in a contact list, or a word in a digital dictionary, when the entries are kept in sorted order.
Version control — git bisect does a binary search through commits to find which one introduced a bug.
Numerical methods — locating the root of a continuous function (the bisection method) is binary search on real numbers.
Game programming — finding a frame in an animation by timestamp, or picking the right item by score.
When NOT to Use
When the data is not sorted.
When data changes so often that maintaining order costs more than the search saves.
When the dataset is tiny and simple linear search is clearer.
Common Mistakes
Off-by-one errors in the low/high bounds.
Infinite loops from not moving low or high past mid.
Using it on unsorted input.
Try It with an AI Assistant
short
Write binary_search(a, x) over a sorted list, returning the index of x or -1 if not found.
behavior
Write a function that, given a sorted list and a target value, repeatedly looks at the middle element of the still-possible range. If the middle equals the target, return its index. If the middle is too small, restrict the range to the right half; if too large, restrict to the left half. When the range becomes empty, return −1.
Though simple, insertion sort works extremely well for small or nearly sorted datasets and remains important inside modern hybrid sorting algorithms. Decades after Mauchly's lecture, when designers built Python's Timsort and Java's Dual-Pivot Quicksort, they discovered that the fastest "modern" sort is really a hybrid: divide the list with a fancy strategy down to runs of about 32 elements, then finish each tiny run with the same insertion sort that humans have used to organize playing cards for centuries. The old algorithm never left — it just got tucked inside the new ones.
Teaches: Maintain a growing sorted prefix incrementally
Anecdote
The earliest formal description appears in John Mauchly's 1946 lecture notes for the Moore School Lectures — the same series that launched modern computing as a discipline. Mauchly demonstrated insertion sort because it was the algorithm humans already used when sorting cards by hand, and he was teaching engineers to think about computer programs as formalizations of human procedures. Most of computing's first algorithms were just human routines written down precisely enough for a machine.
The Idea
Walk through the list from left to right. At step i, the prefix arr[0..i−1] is already sorted — that's the invariant. Pick up arr[i] (call it the key), then slide it leftward past every sorted element bigger than it, finally dropping it into the gap. After the drop, the prefix arr[0..i] is sorted, and we move on.
Why does it work? Each iteration grows the sorted prefix by exactly one element while preserving the order of everything inside it. By the time i reaches the last index, the entire array is the sorted prefix. The inner WHILE loop runs only as far as it needs to — so on a list that's already nearly sorted, almost nothing moves and the algorithm runs in essentially linear time.
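The walk described above can be sketched in a few lines of Python. This is a minimal illustration, not a library-grade sort; the function name insertion_sort and the choice to return the same list are assumptions made for the example.

```python
def insertion_sort(arr):
    """Sort arr in place (ascending) and return it for convenience."""
    for i in range(1, len(arr)):       # invariant: arr[0..i-1] is already sorted
        key = arr[i]                   # read the key once, before any shifting
        j = i - 1
        # Slide every sorted element bigger than key one slot to the right.
        while j >= 0 and arr[j] > key:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key               # drop the key into the gap
    return arr


print(insertion_sort([5, 2, 4, 6, 1, 3]))  # [1, 2, 3, 4, 5, 6]
```

Because the inner loop uses a strict comparison (arr[j] > key), equal elements never pass each other, which is what keeps the sort stable.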
Trace
i | key | arr before insert | arr after insert
1 | 2 | [5, 2, 4, 6, 1, 3] | [2, 5, 4, 6, 1, 3]
2 | 4 | [2, 5, 4, 6, 1, 3] | [2, 4, 5, 6, 1, 3]
3 | 6 | [2, 4, 5, 6, 1, 3] | [2, 4, 5, 6, 1, 3]
4 | 1 | [2, 4, 5, 6, 1, 3] | [1, 2, 4, 5, 6, 3]
5 | 3 | [1, 2, 4, 5, 6, 3] | [1, 2, 3, 4, 5, 6]
Where It's Used Today
Hybrid sorting libraries — Python's Timsort and Java's library sort use insertion sort for small subarrays (typically fewer than 32 elements) because it has tiny overhead.
Online sorting — when items arrive one at a time (live leaderboards, streaming sensor readings) and the existing sorted list needs a single new entry inserted.
Spreadsheet sort-on-edit — when a user changes one cell, the row often needs to slide a few positions; insertion sort fits perfectly.
Embedded systems — small microcontrollers sorting tiny lists (sensor calibration tables) where simplicity beats asymptotic speed.
Teaching and interviews — it's the canonical example of a stable, in-place, intuitive sort, used in every introductory algorithms course.
When NOT to Use
When the list is large and randomly ordered — the O(n^2) cost crushes you; merge sort or quicksort wins by orders of magnitude past a few hundred elements.
When the data lives in a linked list with no random access — the "slide right" step costs O(n) per shift, killing the algorithm's main advantage.
When you need a stable parallel sort across many cores — insertion sort is inherently serial because each step depends on the previous sorted prefix.
Common Mistakes
Starting the outer loop at i = 0 instead of i = 1: the first element is already a trivially sorted prefix, so that pass is wasted work, and in a version whose inner loop lacks the j >= 0 guard it reads arr[-1], which is out of bounds (or, in Python, silently the last element).
Forgetting the j >= 0 guard in the inner WHILE and walking off the left edge of the array.
Overwriting key before placing it — reading arr[i] once into key, then shifting and finally dropping key into arr[j+1] is essential; skipping the temp variable destroys the value being inserted.
Try It with an AI Assistant
short
Write insertion_sort(a) that sorts a list in place using insertion sort.
behavior
Write a function that sorts a list in place by walking from left to right; at each position i, take the element there as a key, then slide every earlier element that is larger one slot to the right, and drop the key into the resulting gap so the prefix up to and including i is sorted.
For instance: A browser can go back through pages in reverse order visited.
stack ← empty list

// push(x): append x onto the top
append(stack, x)

// pop(): remove and return the top element
IF length(stack) = 0 THEN
    RETURN NULL
ENDIF
x ← last element of stack
remove last element of stack
RETURN x

// peek(): look at the top without removing
IF length(stack) = 0 THEN
    RETURN NULL
ENDIF
RETURN last element of stack
In the late 1940s, the first programmable computers were just learning how to call subroutines, and the question of where to put the return address was wide open. Turing's 1945 ACE report described a "bury and unbury" mechanism, a disciplined LIFO store for nested calls. By 1957 the Bauer-Samelson "Kellerprinzip" paper made the idea explicit, and within a decade the call stack was hardware in essentially every CPU on Earth.
Alan Turing didn't call it a stack. He called the operations "bury" and "unbury" — you bury the return address before a subroutine call, then unbury it on return. Konrad Zuse independently invented the same idea in Germany during the war. The names "push" and "pop" came later, in IBM literature; Turing's gravedigger metaphor is older and arguably better.
The Idea
Use a list (or array). Always work on the end of the list. To push(x), append x to the end. To pop(), remove the last element and return it. To peek(), just read the last element. Three short operations, each one constant-time.
Why is this so important? Because the world is full of nested things — a function calls another, which calls another, which calls another. To get back where you came from, you need to remember the trail in reverse. A stack does exactly that. The invariant: the top of the stack is always the most recent thing you haven't yet finished with. This makes stacks the natural fit for matching parentheses, undoing actions, and tracking function calls.
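A minimal Python sketch of the three operations follows, together with the bracket-matching use mentioned above. The class name Stack and the helper brackets_balanced are illustrative names chosen for this example, and returning None on an empty stack mirrors the NULL convention of the pseudocode.

```python
class Stack:
    """LIFO container backed by a Python list; the top is the end of the list."""

    def __init__(self):
        self._items = []

    def push(self, x):
        self._items.append(x)          # O(1) amortized

    def pop(self):
        if not self._items:
            return None                # empty stack: return a sentinel
        return self._items.pop()       # removes and returns the last element

    def peek(self):
        if not self._items:
            return None
        return self._items[-1]

    def __len__(self):
        return len(self._items)


def brackets_balanced(text):
    """Return True if (), [] and {} in text are properly nested."""
    closing_to_opening = {")": "(", "]": "[", "}": "{"}
    stack = Stack()
    for ch in text:
        if ch in "([{":
            stack.push(ch)
        elif ch in closing_to_opening:
            # The most recent unfinished opener must match this closer.
            if stack.pop() != closing_to_opening[ch]:
                return False
    return len(stack) == 0             # leftover openers mean imbalance


print(brackets_balanced("{[()]}"))  # True
print(brackets_balanced("(]"))      # False
```

One design caveat: using None as the empty-stack sentinel makes a stored None indistinguishable from "empty"; raising an exception instead is an equally reasonable choice.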
Trace
step | operation | stack (bottom → top) | returns
0 | new() | [ ] | -
1 | push(1) | [1] | -
2 | push(2) | [1, 2] | -
3 | push(3) | [1, 2, 3] | -
4 | peek() | [1, 2, 3] | 3
5 | pop() | [1, 2] | 3
6 | pop() | [1] | 2
7 | pop() | [ ] | 1
Where It's Used Today
Function calls — every running program uses a call stack to remember which function called which, so it knows where to return.
Browser back button — each page you visit is pushed onto a stack; pressing "back" pops the top page.
Undo in text editors — every action you take is pushed onto a stack; Ctrl-Z pops the most recent action and reverses it.
Compilers and parsers — checking that brackets (), [], {} match in source code is a textbook stack problem.
Calculator engines — Reverse Polish Notation calculators (3 4 + instead of 3 + 4) work entirely on a stack of operands.
When NOT to Use
When you need first-in-first-out order (print queues, BFS frontiers, request handlers) — a stack reverses arrival order; use a queue.
When you need to access elements by position or in the middle — stacks expose only the top; use an array or list.
When recursion depth would overflow the language's call stack — switch to an explicit heap-allocated stack you control.
Common Mistakes
Calling pop on an empty stack and crashing — every pop and peek must check size or return a sentinel.
Using pop(0) (remove from the front) on a Python list and getting accidental O(n) behavior — push and pop must both touch the same end.
Forgetting to push the return state (not just the value) when simulating recursion, so the rebuilt walk loses its place after each pop.
Try It with an AI Assistant
short
Write a class Stack with push(x), pop(), and peek().
behavior
Write a class for a container that supports three operations: adding an item to the top, removing and returning the item that was added most recently, and peeking at the most recent item without removing it. The container should remember everything you've added but not yet removed.
Made fair first-come-first-served processing programmable.
For instance: A printer can process jobs in arrival order.
queue ← empty list

// enqueue(x): add to back
append(queue, x)

// dequeue(): remove and return front, or NULL if empty
IF length(queue) = 0 THEN
    RETURN NULL
ENDIF
x ← queue[0]
remove first element FROM queue
RETURN x

// peek(): look at front, or NULL if empty
IF length(queue) = 0 THEN
    RETURN NULL
ENDIF
RETURN queue[0]
The queue arrived alongside the stack in the 1940s and 1950s as engineers built the first batch operating systems and message buffers. Print spoolers, teletype message switches, and ticket-request handlers all needed the same primitive — process jobs in the order they came in — and the data structure crystallized so naturally that no single inventor is credited. By the time Knuth surveyed it in The Art of Computer Programming (1968), it had been in continuous use under a dozen different names: FIFO buffer, wait list, channel, mailbox.
Teaches: First-in, first-out preserves arrival order
Anecdote
The queue's name comes from British etiquette — the orderly line at a bus stop or shop counter. Early computer scientists in the UK borrowed the everyday word; American computer scientists used "FIFO buffer" or "wait list." It's the only major data structure named after a national stereotype: Britons stand patiently in queues, and so does this data.
The Idea
Keep a list. To enqueue(x), append x to the back. To dequeue(), remove and return the front element. To peek(), look at the front element without removing it. If the list is empty when you ask for an item, return NULL — the queue cannot give you something that isn't there.
Why does this work? The invariant is simple: items leave the queue in the same order they entered. Adding to one end and removing from the other guarantees that whatever was added first is also removed first. This is the structural opposite of a stack, which adds and removes at the same end (last in, first out). The queue's fairness — never letting a newcomer cut the line — is what makes it the right tool whenever order of arrival has to be respected.
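The same three operations can be sketched in Python. Note one deliberate substitution: instead of the plain list used in the pseudocode, this sketch stores items in collections.deque, because removing the first element of a Python list costs O(n) while deque.popleft() is O(1). The class name Queue and the None-on-empty convention are assumptions for the example.

```python
from collections import deque


class Queue:
    """FIFO container: add at the back, remove from the front."""

    def __init__(self):
        self._items = deque()

    def enqueue(self, x):
        self._items.append(x)              # add to the back

    def dequeue(self):
        if not self._items:
            return None                    # empty queue: nothing to hand out
        return self._items.popleft()       # remove from the front in O(1)

    def peek(self):
        if not self._items:
            return None
        return self._items[0]              # look at the front without removing


q = Queue()
for job in ["A", "B", "C"]:
    q.enqueue(job)
print(q.dequeue(), q.peek(), q.dequeue())  # A B B
```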
Trace
step | operation | queue before | queue after | returns
1 | enqueue('A') | [] | [A] | -
2 | enqueue('B') | [A] | [A, B] | -
3 | enqueue('C') | [A, B] | [A, B, C] | -
4 | dequeue() | [A, B, C] | [B, C] | A
5 | peek() | [B, C] | [B, C] | B
6 | dequeue() | [B, C] | [C] | B
Where It's Used Today
Print spoolers — every printer queues your documents and prints them in the order you sent them.
Operating system task scheduling — round-robin schedulers cycle through processes via a queue of "ready to run" tasks.
Network packet buffers — routers and switches enqueue packets when traffic spikes, then dequeue them as bandwidth frees up.
Breadth-first search — exploring a graph or maze level-by-level relies on a queue of "frontier" nodes.
Customer-service systems — call centers, online support chats, and ticket-tracking systems are all literal queues holding people in line.
When NOT to Use
When ordering should depend on priority rather than arrival time — use a priority queue (heap) instead so urgent items jump ahead.
When you need last-in-first-out semantics, like undo history or recursive call frames — that's a stack, not a queue.
When you need random access into the middle (peek the third item, remove a specific element) — a queue only exposes the front; switch to a deque or list.
Common Mistakes
Using a plain Python list and calling pop(0) to dequeue — that's O(n) per call; use collections.deque for O(1) at both ends.
Forgetting to handle dequeue() on an empty queue, so the call crashes instead of returning a sentinel like NULL.
Pushing and popping at the same end by accident, turning the queue into a stack and silently breaking arrival order.
Try It with an AI Assistant
short
Write a class Queue with enqueue(x) and dequeue() that adds at the back and removes from the front, returning NULL on an empty dequeue.
behavior
Write a class that holds a list of items, with two operations: an 'add' that appends an item to the back of the list, and a 'remove' that takes the item at the front of the list, returns it, and removes it from the list. If 'remove' is called when the list is empty, return null.
Made fast pseudo-random numbers available to computers.
For instance: A simulation can generate repeatable random-looking values.
seed ← 7
a ← 5
c ← 3
m ← 16
n ← 6
x ← seed
result ← empty list
FOR i FROM 1 TO n
    x ← (a * x + c) MOD m
    append(result, x)
ENDFOR
RETURN result
Early machines lacked hardware randomness. Derrick Lehmer, working with ENIAC at the Institute for Numerical Analysis, proposed in 1949 to generate long sequences using simple modular arithmetic. ENIAC had no spare memory for storing pre-computed random tables, so Lehmer needed numbers produced on the fly — and a single multiply-and-modulus per draw was about as cheap as arithmetic gets. His recipe ran for decades inside scientific simulations, slot machines, and game engines before stronger generators like Mersenne Twister and PCG took over.
Teaches: Generate sequences using simple deterministic recurrence
Anecdote
Derrick Lehmer designed it for ENIAC, the first general-purpose electronic computer. ENIAC had no memory for storing random tables, so Lehmer needed a way to generate randomness on the fly. The beauty of the linear congruential generator (LCG) was that it required only one multiplication and one modulus per call, both fast on ENIAC's hardware. For decades afterward, much of the random-feeling behavior in video games traced back to variants of Lehmer's recurrence.
The Idea
Pick three constants — multiplier a, increment c, modulus m — and a seed x. To get the next number, compute x ← (a · x + c) mod m. Output that, then feed it back as the input for the next call. One multiplication, one addition, one modulus per draw. That's it.
Why does it work as "random-looking"? With well-chosen constants the recurrence visits every value in 0…m−1 before ever repeating itself, so the output cycles through the whole range. Tiny changes in x blow up after the multiplication and get scrambled by the modulus, hiding the underlying simplicity. The sequence is not truly random — it's fully predictable if you know the constants — but that's exactly why it's perfect for reproducible simulations.
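The recurrence can be sketched directly in Python. The function name lcg and its default arguments are assumptions for illustration; the defaults reuse the toy constants from the pseudocode (a = 5, c = 3, m = 16), which are far too small for real use.

```python
def lcg(seed, a=5, c=3, m=16, n=6):
    """Return n successive values of x <- (a*x + c) mod m, starting from seed."""
    x = seed
    result = []
    for _ in range(n):
        x = (a * x + c) % m
        result.append(x)
    return result


print(lcg(7))                                   # [6, 1, 8, 11, 10, 5]
print(sorted(lcg(7, n=16)) == list(range(16)))  # True: every value 0..15 appears once
```

The second line checks the full-period claim for these particular constants: 16 draws visit all 16 residues before the sequence repeats.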
Trace
i | x (before) | a·x + c | x ← … MOD 16 | result so far
1 | 7 | 5·7 + 3 = 38 | 6 | [6]
2 | 6 | 5·6 + 3 = 33 | 1 | [6, 1]
3 | 1 | 5·1 + 3 = 8 | 8 | [6, 1, 8]
4 | 8 | 5·8 + 3 = 43 | 11 | [6, 1, 8, 11]
5 | 11 | 5·11 + 3 = 58 | 10 | [6, 1, 8, 11, 10]
6 | 10 | 5·10 + 3 = 53 | 5 | [6, 1, 8, 11, 10, 5]
Where It's Used Today
Game development — many older video games used an LCG behind every "random" enemy spawn, loot drop, or shuffle.
Embedded systems — microcontrollers without crypto hardware still ship LCGs because they cost only a few CPU cycles.
Scientific simulations — Monte Carlo experiments use seeded LCGs so a colleague on another machine can reproduce the same result.
Glibc and Java's java.util.Random — both expose LCG-family generators (with carefully chosen constants) for everyday non-cryptographic randomness.
Procedural content generation — terrain, dungeon layouts, and noise textures often start from a seeded LCG so the same map can be regenerated from a 32-bit seed.
When NOT to Use
For cryptography, password generation, or session tokens — LCG output is fully predictable from a few samples; use a CSPRNG instead.
When you need high-dimensional uniformity (Monte Carlo with many parameters) — LCGs fall on hyperplanes; use Mersenne Twister or PCG.
When you need a very long period — common LCG moduli (2^32) wrap after only ~4 billion draws, which a fast simulation exhausts in seconds.
Common Mistakes
Picking a, c, m arbitrarily — bad constants give short cycles or visible patterns; only Hull-Dobell-satisfying triples reach full period.
Using the low bits of an LCG output as a coin flip — for power-of-two moduli the low bits have very short cycles; use the high bits.
Forgetting that the seed must be in 0…m−1 — passing a negative or oversized seed silently produces a different sequence than intended.
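The low-bit pitfall is easy to see with the toy constants used in this section (a = 5, c = 3, m = 16, seed 7). The sketch below is illustrative only; lcg_values is a hypothetical helper name, and the behavior shown is specific to power-of-two moduli.

```python
def lcg_values(seed, a, c, m, n):
    """Generate n values of the recurrence x <- (a*x + c) mod m."""
    x = seed
    out = []
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x)
    return out


values = lcg_values(7, 5, 3, 16, 16)
low_bits = [v & 1 for v in values]    # lowest bit of each 4-bit output
high_bits = [v >> 3 for v in values]  # highest bit of each 4-bit output

print(low_bits)   # [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(high_bits)
```

With a and c both odd and m even, the parity of x flips on every step, so the lowest bit simply alternates with period 2. The highest bit takes the full 16 draws to repeat, which is why coin flips should come from the high end of the output.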
Try It with an AI Assistant
short
Write a class LCG(seed) implementing a linear congruential RNG; method next() returns the next sample.
behavior
Write a function that, given a starting integer and three fixed constants a, c, m, repeatedly updates the integer with the rule x = (a*x + c) mod m, yielding n successive values of x. The function should be deterministic — same starting integer, same output sequence.
For instance: Randomly pick points but keep only those inside a circle.
WHILE TRUE
    x ← proposal_sampler()
    u ← uniform(0, 1)
    IF u <= target_pdf(x) / (M * proposal_pdf(x)) THEN
        RETURN x
    ENDIF
ENDWHILE
The shift was conceptual: instead of inventing a new generator for every awkward distribution, statisticians realized they could propose and filter — draw from a simple distribution, then accept or reject based on a comparison with the target. Once enshrined in textbooks and the Monte Carlo literature, the technique seeded an entire family of methods (importance sampling, slice sampling, Metropolis-Hastings) that today underpin Bayesian inference, computational physics, and modern probabilistic programming.
Teaches: Discard invalid samples until one fits constraints
Anecdote
John von Neumann formalized rejection sampling at Los Alamos while developing Monte Carlo simulations for nuclear reactor design. The simulations needed random numbers from non-uniform distributions, and von Neumann's trick — sample from a uniform box, throw away points that don't fit the target shape — was simple enough to run on early computers but powerful enough to model neutron physics. Rejection sampling is in some sense a child of the bomb.
The Idea
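Loop forever. Each pass, draw a candidate x from a proposal distribution that's easy to sample (often uniform), then decide whether to keep it. In the general form shown in the pseudocode, draw u uniformly from [0, 1] and accept x when u <= target_pdf(x) / (M * proposal_pdf(x)), where M is a constant large enough that M * proposal_pdf(x) is never below target_pdf(x). In the simplest case the test is a plain yes-or-no rule, such as "is the point inside the circle?". If the candidate passes, return it. If not, discard it and loop again.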
Why does it work? If candidates are drawn uniformly from a region, and you keep only the ones that fall inside a sub-region, the keepers are uniformly distributed over that sub-region; the rejection step introduces no bias. The invariant: every accepted sample has the right distribution, no matter how many times we had to try. The cost is efficiency: if the valid region is tiny compared to the candidate region, you reject most of them. The acceptance rate is the ratio of the two areas; for a circle inside its bounding square, that's π/4 ≈ 78.5%.
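Both flavors of the loop can be sketched in Python: the yes-or-no version for the unit circle, and the density-ratio version from the pseudocode. The function names, the explicit proposal_pdf parameter, and the example target density f(x) = 2x on [0, 1] are assumptions chosen for illustration.

```python
import random


def sample_in_unit_circle(rng=random):
    """Uniform point in the unit disc, proposing from the square [-1, 1] x [-1, 1]."""
    while True:
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:       # the validity rule: inside the circle?
            return (x, y)


def rejection_sample(target_pdf, proposal_sampler, proposal_pdf, M, rng=random):
    """Draw one sample from target_pdf; requires target_pdf(x) <= M * proposal_pdf(x)."""
    while True:
        x = proposal_sampler()
        u = rng.uniform(0.0, 1.0)
        if u <= target_pdf(x) / (M * proposal_pdf(x)):
            return x


rng = random.Random(0)                 # seeded so the run is repeatable
print(sample_in_unit_circle(rng))

# Hypothetical target: density 2x on [0, 1], uniform proposal (density 1), envelope M = 2.
draws = [
    rejection_sample(lambda x: 2.0 * x, rng.random, lambda x: 1.0, 2.0, rng)
    for _ in range(20000)
]
print(sum(draws) / len(draws))         # should land near the true mean, 2/3
```

In the second example roughly half of the candidates are rejected, since the acceptance rate of the general method is 1/M.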
Trace
step | x | y | x² + y² | valid? | action
1 | 0.81 | 0.72 | 1.175 | no | reject, loop
2 | −0.94 | 0.55 | 1.186 | no | reject, loop
3 | 0.30 | −0.40 | 0.250 | yes | RETURN (0.30, −0.40)
Where It's Used Today
Monte Carlo physics simulations — von Neumann's original use, now standard for neutron transport, particle physics, and weather modeling.
Bayesian statistics — when a posterior distribution has no closed-form sampler, rejection sampling (or its cousin, MCMC) draws samples from it.
Computer graphics — generating uniform points on a sphere or inside any odd-shaped region for ray tracing and texture synthesis.
Game development — placing trees, rocks, or enemies inside a complex map by sampling a bounding box and rejecting points outside the playable area.
Machine learning — training data filtering, where you reject candidates that fail label or quality constraints.
When NOT to Use
When the acceptance region is tiny relative to the proposal region — you'll reject 99.99% of samples and waste enormous compute; switch to MCMC or importance sampling.
When the target distribution is high-dimensional — acceptance rates drop exponentially with dimension, making the method useless past a few dozen dimensions.
When you need a guaranteed runtime — rejection sampling has unbounded worst-case time; an unlucky run can loop arbitrarily long.
Common Mistakes
Using a proposal distribution that doesn't fully cover the target's support, so some valid regions can never be sampled.
Forgetting the envelope constant M and accepting samples with the wrong probability ratio, biasing the output distribution.
Treating consecutive rejected samples as correlated and discarding good candidates — each draw is independent; just keep trying.
Try It with an AI Assistant
short
Write rejection_sample(target_pdf, proposal_sampler, M) returning one sample from target_pdf using rejection sampling with envelope constant M.
behavior
Write a function that, in a loop, generates a random candidate from an easy distribution, tests whether the candidate satisfies a given rule, and returns it as soon as one passes. Discard the candidates that fail and try again. Show how to use it to draw a uniform random point inside the unit circle.
For instance: A fruit can be labeled apple or orange by comparing nearby examples.
points ← [((1,1), A), ((2,3), A), ((3,4), A), ((5,5), O), ((6,2), O)]
query ← (3, 3)
k ← 3
distances ← empty list
FOR EACH p IN points
    d ← distance(p.point, query)
    append(distances, (d, p.label))
ENDFOR
sort distances by d
counts ← empty map
FOR i FROM 0 TO k - 1
    label ← distances[i].label
    counts[label] ← counts.get(label, 0) + 1
ENDFOR
RETURN label with maximum count
In 1951 at Berkeley, statisticians Evelyn Fix and Joseph Hodges were asked by the US Air Force to study non-parametric classification — recognizing patterns without assuming a probability model. Their answer, written up only as an internal technical report, was disarmingly simple: ask the nearest known examples how they were labeled and take a majority vote. The idea sat almost unread for decades while pattern-recognition researchers independently rediscovered it; today k-NN is the textbook starting point for machine-learning classification because of its sheer transparency.
Teaches: Classify using similarity to known examples
Anecdote
Evelyn Fix and Joseph Hodges Jr. wrote it as an unpublished USAF technical report at Berkeley. The report sat in a drawer for 38 years before being formally published in 1989. By then k-NN had been independently rediscovered, named, and become a textbook algorithm; the original 1951 report is a piece of pre-history that hardly anyone read until after the algorithm was famous.
The Idea
Compute the distance from the query point to every training point (Euclidean distance is the usual choice). Sort the training points by distance, smallest first. Look at the top k of them, count how many carry each label, and return the label with the highest count.
Why does this work? It rests on a simple assumption: points that are close in feature space tend to share a label. If your features are well-chosen, similar inputs really do have similar outputs, and the local majority is a good guess. There is no "training" beyond memorizing the data — the work happens at query time. The choice of k matters: too small and a single noisy neighbor can mislead you; too large and faraway points start drowning out the relevant ones.
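The procedure can be sketched in Python using the five labeled points from the example. The function name knn_classify is an assumption; ties in distance are broken by original list order (the sort is stable), and ties in the vote fall to whichever label Counter saw first, so a real implementation would want an explicit tie rule.

```python
import math
from collections import Counter


def knn_classify(points, query, k):
    """points is a list of ((x, y), label) pairs; returns the majority label of the k nearest."""
    distances = [(math.dist(p, query), label) for p, label in points]
    distances.sort(key=lambda pair: pair[0])          # nearest first, stable on ties
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


points = [((1, 1), "A"), ((2, 3), "A"), ((3, 4), "A"), ((5, 5), "O"), ((6, 2), "O")]
print(knn_classify(points, (3, 3), k=3))  # A
```

Every call scans the whole training list and sorts it, so each query costs O(n log n). That is the price of having no training step.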
Trace
p (point) | label | d (distance to query)
(1, 1) | A | 2.83
(2, 3) | A | 1.00
(3, 4) | A | 1.00
(5, 5) | O | 2.83
(6, 2) | O | 3.16
Where It's Used Today
Recommendation systems — "users who watched what you just watched also liked these other movies" is a nearest-neighbor query in viewer-similarity space.
Handwriting recognition — early postal-code readers compared each digit image to a database of labeled examples.
Medical diagnosis support — matching a patient's lab results against historical records of patients with confirmed diagnoses.
Image search — "find similar pictures" features in photo libraries find the nearest matches in a learned feature space.
Anomaly detection — credit-card fraud systems flag a transaction whose nearest neighbors in feature space are all confirmed fraud.
When NOT to Use
When the training set is huge and queries must be fast — every prediction scans all data; use a tree-based or learned model instead.
When features have very different scales (income in dollars, age in years) without normalization — distance becomes meaningless.
When the data lives in very high dimensions — the "curse of dimensionality" makes all points roughly equidistant and the vote uninformative.
Common Mistakes
Picking an even k that lets the vote tie, with no rule for breaking ties.
Skipping feature scaling, so one large-magnitude feature drowns out all the others in the distance calculation.
Including the query point itself in its own neighbor list when running on training data, inflating accuracy.
Try It with an AI Assistant
short
Write a class KNN(k) with fit(X, y) (just stores) and predict(x) returning the majority label among the k closest training points.
behavior
Write a class that, given a set of labeled training points, can classify a new point by computing its distance to every training point, picking the k closest ones, and returning the label that appears most often among those k. There is no training step beyond memorizing the data.
For instance: Common letters like E can get shorter codes than rare letters like Z.
pq ← priority queue of (freq, node)
FOR EACH symbol s IN freq
    push (freq[s], new node(s)) into pq
ENDFOR
WHILE size(pq) > 1
    (f1, n1) ← extractMin(pq)
    (f2, n2) ← extractMin(pq)
    parent ← new node(NULL)
    parent.left ← n1
    parent.right ← n2
    push (f1 + f2, parent) into pq
ENDWHILE
RETURN extractMin(pq).node
In 1951, MIT professor Robert Fano gave his information-theory class a choice: take the final exam, or write a term paper on optimal binary coding. Graduate student David Huffman picked the paper, struggled for months trying to extend Fano's own top-down approach, and was about to give up when he flipped the problem upside-down — building the tree from the leaves rather than the root. The greedy bottom-up merge worked, was provably optimal, and instantly outdid the method his own professor had been teaching.
Teaches: Assign shorter codes to more frequent items
Anecdote
David A. Huffman invented it as a last-minute term paper. His professor had assigned a coding problem; Huffman tried one approach after another, failed, and then, just before giving up, found the greedy solution. It beat every other student submission and turned out to be provably optimal.
The Idea
Build a binary tree from the bottom up. Put each symbol in its own tiny tree, labelled with its frequency. Then, repeatedly: take the two trees with the smallest frequencies, merge them under a new parent whose frequency is their sum, and put the merged tree back in the pool. When only one tree is left, that's your Huffman tree. Read each symbol's code by walking from the root to its leaf — left child means 0, right child means 1.
Why is this optimal? Because rare symbols end up deep in the tree (long codes) and frequent symbols stay near the root (short codes), and the two-smallest greedy choice can be proven to never "waste" a bit. The prefix-free property is automatic: every symbol sits at a leaf, so no symbol's code can be the prefix of another's. The total bit count is the sum of (symbol frequency × depth in the tree), and Huffman's tree minimizes this sum.
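The bottom-up merge and the code assignment can be sketched with Python's heapq module. This is one possible arrangement, not a reference implementation: the tuple encoding of tree nodes, the insertion counter used to break frequency ties, and giving a lone symbol the code "0" are all choices made for this example.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Map each symbol in freqs (symbol -> count) to a prefix-free bitstring."""
    order = count()                    # tie-breaker so the heap never compares nodes
    heap = [(f, next(order), ("leaf", s)) for s, f in freqs.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:                 # single symbol: the tree has no edges
        return {heap[0][2][1]: "0"}
    while len(heap) > 1:
        f1, _, n1 = heapq.heappop(heap)     # two smallest frequencies
        f2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(order), ("node", n1, n2)))
    root = heap[0][2]

    codes = {}
    pending = [(root, "")]             # explicit stack for the tree walk
    while pending:
        node, prefix = pending.pop()
        if node[0] == "leaf":
            codes[node[1]] = prefix
        else:
            pending.append((node[1], prefix + "0"))   # left edge is 0
            pending.append((node[2], prefix + "1"))   # right edge is 1
    return codes


freqs = {"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}
codes = huffman_codes(freqs)
print(codes["f"])                                     # 0
print(sum(freqs[s] * len(codes[s]) for s in freqs))   # 224
```

With these frequencies a fixed 3-bit code would need 300 bits for 100 symbols, so the Huffman tree saves 76 bits.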
Trace
step | extracted (f1, f2) | new parent freq | pq after step
0 | - | - | 5(a), 9(b), 12(c), 13(d), 16(e), 45(f)
1 | 5(a), 9(b) | 14 | 12(c), 13(d), 14, 16(e), 45(f)
2 | 12(c), 13(d) | 25 | 14, 16(e), 25, 45(f)
3 | 14, 16(e) | 30 | 25, 30, 45(f)
4 | 25, 30 | 55 | 45(f), 55
5 | 45(f), 55 | 100 | 100 (root)
Where It's Used Today
ZIP and gzip files — the DEFLATE format combines Huffman coding with a dictionary scheme to compress almost every file you download.
JPEG images — after the colors and frequencies are quantized, the resulting numbers are squeezed further with Huffman codes.
MP3, AAC, and MPEG video — the final stage of audio and video compression is a Huffman pass.
PNG image format — uses Huffman as part of its lossless compression pipeline.
Text compression in protocols — HTTP/2 header compression (HPACK) uses a static Huffman code to shrink common header strings.
When NOT to Use
When all symbols have nearly equal frequency — Huffman provides little benefit over a fixed-length encoding.
When the symbol frequencies change rapidly during the stream — adaptive arithmetic coding compresses better.
When you must support random-access reads into the compressed data — variable-length codes force you to decode from the start.
Common Mistakes
Using a regular queue or list instead of a min-heap, making each merge O(n) instead of O(log n).
Forgetting to send the tree (or the frequency table) alongside the encoded stream — the decoder cannot reconstruct codes without it.
Mishandling the single-symbol case where the tree has only one leaf and no code bits get assigned.
Try It with an AI Assistant
short
Write huffman_codes(freqs) where freqs maps symbol → count; return a dict mapping symbol → bitstring such that codes are prefix-free and minimize expected bit length.
behavior
Write a function that, given a mapping of symbols to frequencies, repeatedly takes the two least-frequent items, merges them under a new parent whose weight is their sum, and puts the parent back in the pool until only one tree remains. Then walk the tree to assign each symbol a binary string — left edges are 0, right edges are 1 — and return that mapping.
For instance: A dictionary app can find a word without scanning every word.
// insert(key, value): probe from the home slot until an empty slot or the same key
i ← hash(key) MOD size(table)
WHILE table[i] occupied AND table[i].key != key
    i ← (i + 1) MOD size(table)
ENDWHILE
table[i] ← (key, value)

// lookup(key): follow the same probe sequence; an empty slot means "not present"
i ← hash(key) MOD size(table)
WHILE table[i] NOT empty
    IF table[i].key = key THEN
        RETURN table[i].value
    ENDIF
    i ← (i + 1) MOD size(table)
ENDWHILE
RETURN NULL
The first hash table appeared in an internal IBM memo in January 1953, written by Hans Peter Luhn while exploring ways to speed up sorting and lookup on the IBM 701. Luhn's "scatter storage" idea — compute an address from the data itself rather than searching for it — was so unusual that it took years to spread; Arnold Dumey published a similar scheme in 1956, and Robert Morris's 1968 CACM paper finally gave the technique its modern name. By the 1970s, hash tables were the default associative structure in nearly every programming language and database engine.
Teaches: Trade collisions for speed using computed locations
Anecdote
Modern hash tables (like Swiss tables) are tuned to CPU cache lines, not just theory. Performance now comes from understanding hardware architecture, not just asymptotic ideas.
The Idea
A hash function turns any key into a number. Take that number mod size(table) and you have an array index. Insert puts the pair at that slot; lookup goes directly there. The catch is collisions: two different keys can hash to the same slot. Open addressing handles this with linear probing — if slot i is taken, try i+1, then i+2, and so on (wrapping around) until we find an empty slot or the matching key.
This works because both insert and lookup follow the same probe sequence starting from hash(key) mod size. As long as the table isn't too full and the hash function spreads keys evenly, the probe chain stays short. Average lookup time stays close to one slot regardless of how many keys are stored.
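A fixed-size linear-probing table can be sketched in Python as follows. The class name LinearProbingTable, the default capacity of 8, and the policy of refusing inserts that would fill the last empty slot are assumptions for this example; a production table would resize instead. Integer keys are used in the demo because Python hashes small non-negative integers to themselves, which makes the collisions easy to follow.

```python
class LinearProbingTable:
    """Open-addressing hash table with linear probing and no resizing."""

    def __init__(self, capacity=8):
        self._slots = [None] * capacity    # each slot is None or a (key, value) pair
        self._count = 0

    def _probe(self, key):
        """Index of the slot holding key, or of the first empty slot on its probe path."""
        i = hash(key) % len(self._slots)
        while self._slots[i] is not None and self._slots[i][0] != key:
            i = (i + 1) % len(self._slots)
        return i

    def insert(self, key, value):
        i = self._probe(key)
        if self._slots[i] is None:
            # Keep at least one slot empty so every probe loop is sure to stop.
            if self._count + 1 >= len(self._slots):
                raise RuntimeError("table full; a real table would resize here")
            self._count += 1
        self._slots[i] = (key, value)

    def lookup(self, key):
        slot = self._slots[self._probe(key)]
        return None if slot is None else slot[1]


table = LinearProbingTable(capacity=8)
for key, value in [(4, "apple"), (12, "banana"), (20, "cherry")]:
    table.insert(key, value)               # 4, 12 and 20 all start at slot 4
print(table.lookup(20))  # cherry
print(table.lookup(28))  # None
```

All three keys share home slot 4, so they end up in slots 4, 5 and 6, and looking up 20 walks that same chain before finding its entry.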
Trace
step | action | i | table state (slot: value)
1 | insert apple | 4 | 4: apple→10
2 | insert banana | 3 | 3: banana→20, 4: apple→10
3 | insert cherry; 4 occupied, probe | 5 | 3: banana→20, 4: apple→10, 5: cherry→30
Where It's Used Today
Programming language built-ins — Python dict, JavaScript Map/objects, Java HashMap, C++ unordered_map are all hash tables.
Database indexes — hash indexes power equality lookups (WHERE id = 42) in PostgreSQL, MySQL, and most key-value stores.
Caches — Redis, Memcached, and in-process LRU caches rely on hash tables for instant key lookup.
Compilers and interpreters — symbol tables that map variable names to types and locations.
Network routing — flow tables in switches and load balancers map packet headers to destinations using hardware-friendly hash structures.
When NOT to Use
When you need keys retrieved in sorted order or by range — hash tables scatter keys randomly; use a balanced BST or skip list instead.
When inputs are adversarially chosen and the hash function is exposed — attackers can force every key into one slot, turning lookup into O(n).
When the table is tiny (a handful of items) — the hash overhead and collision logic are slower than a plain array scan.
Common Mistakes
Forgetting to resize when the load factor approaches 1.0 — the probe chain grows until insert and lookup degrade to linear time.
Marking deleted slots as empty instead of as tombstones, which breaks the probe chain so existing keys can no longer be found.
Using a weak hash function like key.length or the first character — most keys collide into the same few slots and performance collapses.
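The tombstone point is the subtlest of the three, so here is a small Python sketch of deletion in a linear-probing table. Everything named here (TombstoneTable, the TOMBSTONE marker object, the fixed capacity) is a hypothetical construction for illustration, and the sketch never resizes or cleans up old tombstones.

```python
EMPTY = None
TOMBSTONE = object()    # marks a slot whose entry was deleted


class TombstoneTable:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity    # EMPTY, TOMBSTONE, or a (key, value) pair

    def _home(self, key):
        return hash(key) % len(self.slots)

    def insert(self, key, value):
        i = self._home(key)
        first_free = None                  # first reusable slot seen on the probe path
        for _ in range(len(self.slots)):
            slot = self.slots[i]
            if slot is EMPTY:
                if first_free is None:
                    first_free = i
                break                      # key cannot be stored past an empty slot
            if slot is TOMBSTONE:
                if first_free is None:
                    first_free = i
            elif slot[0] == key:
                self.slots[i] = (key, value)   # key already present: overwrite
                return
            i = (i + 1) % len(self.slots)
        if first_free is None:
            raise RuntimeError("table is full")
        self.slots[first_free] = (key, value)

    def lookup(self, key):
        i = self._home(key)
        for _ in range(len(self.slots)):
            slot = self.slots[i]
            if slot is EMPTY:
                return None                # a truly empty slot ends the chain
            if slot is not TOMBSTONE and slot[0] == key:
                return slot[1]
            i = (i + 1) % len(self.slots)  # tombstones are skipped, not stopped at
        return None

    def delete(self, key):
        i = self._home(key)
        for _ in range(len(self.slots)):
            slot = self.slots[i]
            if slot is EMPTY:
                return False
            if slot is not TOMBSTONE and slot[0] == key:
                self.slots[i] = TOMBSTONE  # keep the probe chain intact
                return True
            i = (i + 1) % len(self.slots)
        return False


t = TombstoneTable()
for k, v in [(4, "a"), (12, "b"), (20, "c")]:   # all three start at slot 4
    t.insert(k, v)
t.delete(12)
print(t.lookup(20))  # c
```

Had delete written EMPTY into slot 5, the lookup for key 20 would have stopped there and reported the key as missing even though it still sits in slot 6.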
Try It with an AI Assistant
short
Store and retrieve key-value pairs using hashed array positions and collision probing.
behavior
Keep a fixed-size array. To insert a (key, value) pair, compute an integer from the key, take it modulo the array size to get a starting slot, and walk forward (wrapping around) until you find an empty slot or the same key, then store the value there. To look up a key, walk the same way until you find the key — return its value — or hit an empty slot, in which case the key isn't present.
Made linear-time sorting possible for small integer ranges.
For instance: Sort exam scores from 0 to 100 by counting each score.
arr ← [4, 2, 2, 0, 3, 2, 1]
k ← 4
count ← array[0..k] filled with 0
FOR EACH x IN arr
    count[x] ← count[x] + 1
ENDFOR
index ← 0
FOR i FROM 0 TO k
    WHILE count[i] > 0
        arr[index] ← i
        index ← index + 1
        count[i] ← count[i] - 1
    ENDWHILE
ENDFOR
RETURN arr
Instead of comparing values, counting sort simply counts occurrences. This idea revealed that some sorting problems could bypass the famous n log n lower bound for comparison sorts. When values live in a small known range — exam scores, employee numbers, byte values — tallying is dramatically faster than any comparison-based method, and the linear-time payoff justified the extra memory for a count array.
Teaches: Sort by counting occurrences, not comparisons
Anecdote
Harold H. Seward devised counting sort as part of his MIT master's thesis on early data-processing systems. The thesis was on practical sorting for IBM's commercial customers — counting was attractive because it didn't require expensive comparisons, and the customers' data (employee numbers, product IDs) usually had bounded value ranges. Counting sort is one of the few sorting algorithms invented for billing systems, not for theoretical purity.
The Idea
Make a count array with one slot per possible value, all starting at zero. Walk through the input once and bump count[x] for each x you see — that's a tally. Then walk the count array from 0 to k, and for each value i, write i back into the output count[i] times. Done.
Why does it work? Because the tally tells you exactly how many copies of each value you owe, and writing them back in order (all the 0s first, then all the 1s, then all the 2s) is by definition sorted order. There's no comparison anywhere, which is why the textbook lower bound of n log n for comparison sorts doesn't apply: counting sort runs in O(n + k). The catch is the trade-off: you need a count array with k + 1 slots, one per possible value, so it's only practical when k is small (think 0–255 for a byte, or 0–100 for exam scores). Sorting a million 64-bit integers this way would need a count array with 2^64 slots, far more memory than any machine has.
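The tally-then-write-back idea can be sketched in a few lines of Python. This version returns a new list instead of overwriting the input, which is a choice made here for clarity; the function name is illustrative.

```python
def counting_sort(values, k):
    """Return a sorted copy of `values`, assuming every value is an int in [0, k]."""
    count = [0] * (k + 1)          # one slot per possible value, 0..k inclusive
    for x in values:
        count[x] += 1              # tally pass: O(n)
    out = []
    for i, c in enumerate(count):  # write-back pass: O(n + k)
        out.extend([i] * c)        # emit value i exactly count[i] times
    return out


print(counting_sort([4, 2, 2, 0, 3, 2, 1], 4))  # [0, 1, 2, 2, 2, 3, 4]
```

No two input values are ever compared with each other; the only work is indexing into the count array.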
Trace
value i | count[i]
0 | 1
1 | 1
2 | 3
3 | 1
4 | 1
Where It's Used Today
Radix sort — counting sort is the inner loop of radix sort, which is how databases and big-data systems sort billions of integers and strings.
Histogram construction — image-processing pipelines compute pixel-intensity histograms with the same tally pass.
Bucket layout in suffix arrays — bioinformatics tools that build suffix arrays for DNA use counting sort to bucket characters in linear time.
Grade reports — ranking exam scores in [0, 100] is a textbook counting-sort use case.
Network packet sorting — routers binning packets by priority class (a tiny range like 0–7) use counting sort because it's branch-free and cache-friendly.
When NOT to Use
When the value range k is much larger than n — allocating a count array for 4-byte integers needs gigabytes of memory for almost no payoff.
When the items are floating-point numbers, strings, or arbitrary objects — counting sort needs integer values that index directly into a slot.
When you must sort by a custom comparator (case-insensitive strings, locale ordering) — counting sort has no comparison hook to plug into.
Common Mistakes
Allocating the count array of size k instead of k + 1, missing the largest value and writing out of bounds.
Forgetting to clear or reset the count array between calls when reusing buffers, mixing in stale tallies from the previous run.
Claiming "stable" while writing values back from the count array — this loses original order; for stability you need the prefix-sum variant that places items by cumulative count.
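The prefix-sum variant mentioned in the last mistake can be sketched as follows. It is a minimal illustration, assuming each record exposes an integer key in [0, k] through a caller-supplied key function; the name counting_sort_stable is hypothetical.

```python
def counting_sort_stable(items, k, key=lambda item: item):
    """Stable counting sort; key(item) must be an int in [0, k]."""
    count = [0] * (k + 1)
    for item in items:
        count[key(item)] += 1
    # Prefix sums: start[v] is the output index of the first item with key v.
    start = [0] * (k + 1)
    for v in range(1, k + 1):
        start[v] = start[v - 1] + count[v - 1]
    out = [None] * len(items)
    for item in items:             # left-to-right scan keeps equal keys in input order
        v = key(item)
        out[start[v]] = item
        start[v] += 1
    return out


records = [(2, "a"), (1, "b"), (2, "c"), (0, "d"), (1, "e")]
print(counting_sort_stable(records, 2, key=lambda r: r[0]))
# [(0, 'd'), (1, 'b'), (1, 'e'), (2, 'a'), (2, 'c')]
```

Records with equal keys come out in the order they went in, which is the property radix sort relies on when it uses counting sort for each digit.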
Try It with an AI Assistant
short
Write counting_sort(a, k) for non-negative integers in [0, k]; return a new sorted list.
behavior
Write a function that sorts a list of non-negative integers whose values are at most k. First make an array of size k+1 filled with zeros and increment position x for every x in the input. Then walk that array from 0 to k and, for each index i, write i into the output as many times as the count says. Return the output.
For instance: A chess engine skips moves that cannot affect the final choice.
node ← root
depth ← 2
α ← -∞
β ← +∞
maximizing ← TRUE
FUNCTION alphabeta(node, depth, α, β, maximizing)
    IF depth = 0 OR terminal(node) THEN
        RETURN evaluate(node)
    ENDIF
    IF maximizing THEN
        value ← -∞
        FOR EACH child IN children(node)
            value ← max(value, alphabeta(child, depth - 1, α, β, FALSE))
            α ← max(α, value)
            IF α >= β THEN
                BREAK
            ENDIF
        ENDFOR
        RETURN value
    ELSE
        value ← +∞
        FOR EACH child IN children(node)
            value ← min(value, alphabeta(child, depth - 1, α, β, TRUE))
            β ← min(β, value)
            IF β <= α THEN
                BREAK
            ENDIF
        ENDFOR
        RETURN value
    ENDIF
END FUNCTION
RETURN alphabeta(node, depth, α, β, maximizing)
Researchers realized many branches could never influence the final decision and could safely be skipped. This dramatically accelerated chess-playing programs.
Full minimax search explored far too many game positions.
Teaches: Stop exploring once a branch cannot change the answer
The Idea
Carry two running bounds through the search: α is the best score the maximizing player can already guarantee, and β is the best the minimizing player can already guarantee. When you're about to explore a child node, check whether anything you find there could possibly improve the final answer. If α >= β, the answer is already pinned down — the opponent will never let you reach this branch — so you can stop exploring (the cutoff).
Why does it work? Because in minimax, both players play optimally. If the maximizer has already found a move worth at least α, and the search of another branch shows that the minimizer can hold the score there to at most β with β <= α, the maximizer would never choose that branch. Its remaining children become irrelevant and can be pruned without changing the answer. Same final move, fewer nodes touched.
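A small runnable sketch makes the cutoff concrete. It assumes a game tree written as nested Python lists whose leaves are plain numbers, so the depth parameter and evaluate function of the general algorithm are left out. The tree is a MAX root over two MIN nodes with leaves (3, 5) and (2, 9); the 9 is an arbitrary stand-in for a leaf that should never be looked at.

```python
import math


def alphabeta(node, alpha, beta, maximizing, visited):
    """Minimax value of `node` with alpha-beta cutoffs.

    A node is either a number (leaf score) or a list of child nodes.
    `visited` collects the leaves that were actually evaluated.
    """
    if not isinstance(node, list):
        visited.append(node)
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False, visited))
            alpha = max(alpha, value)
            if alpha >= beta:      # the minimizer above will never allow this line
                break
        return value
    value = math.inf
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True, visited))
        beta = min(beta, value)
        if beta <= alpha:          # the maximizer above already has something better
            break
    return value


tree = [[3, 5], [2, 9]]            # root is MAX; its two children are MIN nodes
seen = []
best = alphabeta(tree, -math.inf, math.inf, True, seen)
print(best, seen)                  # 3 [3, 5, 2]
```

The leaf 9 never appears in the visited list: once the second MIN node sees the 2, its β drops to 2, which is already below the root's α of 3.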
Trace
step | node | α | β | what happens
1 | A | -∞ | +∞ | enter A as MIN; look at leaf 3 → value=3, β=3
2 | A | -∞ | 3 | look at leaf 5; min(3,5)=3, β stays 3 → return 3
3 | root | 3 | +∞ | back at root; α=3 because we have a guarantee of 3
4 | B | 3 | +∞ | enter B as MIN; look at leaf 2 → value=2, β=2
5 | B | 3 | 2 | β=2 ≤ α=3 → cutoff! skip the ?? leaf entirely
6 | root | 3 | +∞ | B returns 2; max(3,2)=3 → root's value is 3
Where It's Used Today
Chess engines — Stockfish and other top engines use alpha-beta as the backbone of their search, with many extensions stacked on top.
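Checkers and early Go programs — classic board-game programs were built on alpha-beta-style search, although the strongest Go programs had moved to Monte Carlo tree search even before AlphaGo.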
Tic-tac-toe and Connect Four teaching demos — the cleanest small example most computer science classes use.
Decision-tree pruning in operations research — searching plans where the cost of an action is bounded.
Adversarial game AI in video games — pathfinding bots that need to decide their move while assuming a clever opponent.
When NOT to Use
When children are explored in random order — pruning barely helps; you need a good move-ordering heuristic to see the speedup.
When the evaluation function is noisy or non-monotonic, like games with chance nodes — the bounds aren't valid and cuts become wrong.
When the game tree is small enough that plain minimax already finishes in milliseconds — alpha-beta only adds bookkeeping.
Common Mistakes
Writing the cutoff test backwards (alpha <= beta instead of alpha >= beta), which prunes almost everything and silently produces wrong values.
Updating alpha or beta globally instead of passing them down by value, leaking bounds across unrelated subtrees.
Pruning before initializing value from at least one child, so the function returns -infinity when every child got cut.
Try It with an AI Assistant
short
Write alphabeta(node, depth, alpha, beta, max_player) minimax with α–β cutoffs.
behavior
Write a recursive function for a two-player game tree. At each node, pass two running bounds: the best score the maximizer can already guarantee and the best the minimizer can already guarantee. Update them as children are evaluated, and stop exploring siblings whenever the bounds cross — the rest cannot change the final value.
For instance: Reverse a playlist stored as linked songs.
head ← node(1) → node(2) → node(3) → NULL
prev ← NULL
curr ← head
WHILE curr != NULL
    next ← curr.next
    curr.next ← prev
    prev ← curr
    curr ← next
ENDWHILE
RETURN prev
Reversing a linked list became the canonical test of pointer reasoning almost as soon as Newell, Shaw, and Simon's IPL gave programmers their first dynamic chains of nodes. The exercise asks for nothing fancy — no new data structure, no clever math — just the discipline to keep three pointers straight and never lose the thread back to the rest of the list. Decades later, it became one of the most-asked questions in software interviews precisely because failing it reveals exactly the kind of off-by-one pointer bug that crashes real systems.
Teaches: Rewire pointers to invert structure in-place
Anecdote
Allen Newell, Cliff Shaw, and Herbert Simon invented the linked list at RAND while building IPL (Information Processing Language), the world's first AI programming language. They needed dynamic data structures to represent symbolic logic, and pointer-based lists let them grow and shrink memory on demand. Only a few years later, John McCarthy borrowed the idea for LISP, so modern functional programming traces directly back to a 1956 RAND project.
The Idea
Walk down the list with three pointers: prev (the part we've already reversed), curr (the box we're working on now), and next (a temporary holder so we don't lose the rest of the list). At each step: remember next = curr.next, flip curr's arrow with curr.next = prev, then advance prev = curr and curr = next. Stop when curr is NULL.
The invariant is the key: everything to the left of curr is already reversed, and prev is its new head. Saving next before flipping the arrow is crucial — once we overwrite curr.next, the path forward is gone if we didn't stash it. When the loop ends, prev points at the last node we processed, which is the original tail — that's our new head.
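The three-pointer walk translates almost line for line into Python. This is a minimal sketch: the Node class and the to_list helper are assumptions added here so the example is self-contained.

```python
class Node:
    """Hypothetical singly linked node."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def reverse_list(head):
    prev = None
    curr = head
    while curr is not None:
        nxt = curr.next    # save the rest of the list before overwriting
        curr.next = prev   # flip this node's arrow
        prev = curr        # the reversed part grows by one node
        curr = nxt         # step forward using the saved pointer
    return prev            # original tail, now the new head


def to_list(head):
    """Collect node values into a Python list for easy inspection."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


head = Node(1, Node(2, Node(3)))
print(to_list(reverse_list(head)))  # [3, 2, 1]
```

An empty list (head is None) skips the loop entirely and returns None, so no special case is needed.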
Trace
step | prev | curr | next | action
0 | NULL | 1 → 2 → 3 | — | start
1 | NULL | 1 → 2 → 3 | 2 → 3 | save next; set 1.next = NULL
2 | 1 → NULL | 2 → 3 | 3 | save next; set 2.next = 1; advance
3 | 2 → 1 → NULL | 3 | NULL | save next; set 3.next = 2; advance
4 | 3 → 2 → 1 → NULL | NULL | — | curr is NULL → stop
Where It's Used Today
Music apps — reversing a playlist that's stored as linked songs (so "play in reverse order" doesn't need a copy).
Undo stacks — many editors and browsers store actions as linked nodes; reversing or splicing them is the same primitive.
Operating systems — kernel data structures (process lists, free-block lists) are often singly linked, and reversal shows up in queue-to-stack conversions.
Functional programming — Lisp, Scheme, and Haskell are built on linked lists; reverse is a basic library function used everywhere.
Coding interviews — this is one of the most-asked questions in software hiring, because it tests pointer reasoning in just a few lines.
When NOT to Use
When the data is in an array or vector — arr[::-1] or a two-pointer swap is faster and uses no extra pointer overhead.
When the list is doubly linked — you only need to swap each node's prev and next and flip the head/tail pointers.
When you need the reversed view without mutating the original — build an iterator that walks backward, or copy into a new list.
Common Mistakes
Forgetting to save next = curr.next before overwriting curr.next = prev — the rest of the list is now unreachable and the loop dies.
Returning head instead of prev at the end — head still points at the original first node, which is now the tail with a NULL next.
Initializing prev to head instead of NULL — the original head ends up pointing at itself, creating an infinite loop on traversal.
Try It with an AI Assistant
short
Write reverse_list(head) reversing a singly linked list in place; return the new head.
behavior
Write a function that walks down a singly linked list with three pointers — previous, current, and next — and at each step saves the current node's next pointer, then flips the current node's next pointer to point at the previous node, then advances the previous and current pointers. When the walk ends, return the previous pointer.
For instance: Connect villages with minimum road cost and no loops.
edges ← sort(E by weight)
uf ← UnionFind(V)
mst ← []
FOR EACH (u, v, w) IN edges
    IF uf.find(u) != uf.find(v) THEN
        uf.union(u, v)
        mst.append((u, v, w))
    ENDIF
ENDFOR
RETURN mst
In 1956, Bell Labs and AT&T were spending fortunes on long-distance telephone cable, and the question of which town-to-town links to lay first was a real business problem. Joseph Kruskal, who had recently earned his doctorate at Princeton, published a beautifully simple answer: sort every candidate link by cost, then accept each cheapest link unless it would close a loop. The same greedy idea now shows up everywhere from electrical-grid planning to image segmentation in computer vision.
Teaches: Build globally cheap from locally safe choices
The Idea
Sort every candidate edge by cost, cheapest first. Then walk down the sorted list and add an edge if and only if it joins two pieces that aren't already connected — adding an edge that closes a loop would be wasteful and is skipped. Stop when every point is in the same connected piece.
The bookkeeping trick is a union-find data structure. Each point starts in its own group. Every time you accept an edge, you union the two groups it touches. Before accepting an edge (u, v), you find whether u and v are already in the same group — if so, that edge would form a cycle and you skip it. Why does taking the cheapest safe edge always work? Because at every step, an MST must contain the cheapest edge that crosses some cut separating the two pieces — and the cheapest unused, non-cycling edge is exactly such an edge.
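The sort-then-union loop can be sketched in Python as below. The union-find here is a small hand-rolled version with path compression and union by rank, written as nested helper functions for brevity. The sample graph is a five-node example chosen for illustration.

```python
def kruskal(nodes, edges):
    """Return (mst_edges, total_weight) for an undirected weighted graph.

    `edges` is a list of (u, v, w) tuples.
    """
    parent = {n: n for n in nodes}
    rank = {n: 0 for n in nodes}

    def find(x):
        # Path compression: point every node on the path straight at the root.
        root = x
        while parent[root] != root:
            root = parent[root]
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra == rb:
            return False               # same group: the edge would close a cycle
        if rank[ra] < rank[rb]:        # union by rank: shorter tree goes underneath
            ra, rb = rb, ra
        parent[rb] = ra
        if rank[ra] == rank[rb]:
            rank[ra] += 1
        return True

    mst, total = [], 0
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        if union(u, v):
            mst.append((u, v, w))
            total += w
            if len(mst) == len(nodes) - 1:
                break                  # spanning tree is complete
    return mst, total


edges = [("A", "B", 1), ("B", "C", 2), ("A", "C", 3),
         ("C", "D", 4), ("D", "E", 5), ("B", "E", 6)]
print(kruskal("ABCDE", edges))
# ([('A', 'B', 1), ('B', 'C', 2), ('C', 'D', 4), ('D', 'E', 5)], 12)
```

The edge (A, C, 3) is rejected because A and C already share a root after the first two unions, and the loop stops before (B, E, 6) is even examined.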
Trace
(u, v, w) | uf.find(u) vs find(v) | action | mst total
(A, B, 1) | A, B different | union, add edge | 1
(B, C, 2) | B, C different | union, add edge | 3
(A, C, 3) | A, C same | skip (would form cycle) | 3
(C, D, 4) | C, D different | union, add edge | 7
(D, E, 5) | D, E different | union, add edge | 12
(B, E, 6) | B, E same | skip | 12
Where It's Used Today
Network design — laying internet backbones, fiber-optic cables, or electrical grids with minimum total wire length.
Image segmentation — grouping similar pixels together by treating pixels as points and color differences as edge weights.
Cluster analysis — finding natural groups in data by building an MST and then cutting the longest edges.
Approximating the traveling salesman problem — the MST's total weight is a lower bound on the best tour, and the tree itself is a starting point for shortest-tour heuristics.
Maze and dungeon generation — many roguelike games carve a minimum spanning tree through a grid to create connected, loop-free corridors.
When NOT to Use
When the graph is dense (E ≈ V²) — Prim's algorithm is usually faster because it avoids sorting every edge; its simple array-based form runs in O(V²).
When you need a shortest path between two nodes — an MST minimizes total edge weight, not point-to-point distances; use Dijkstra.
When edges are directed — MST is defined for undirected graphs; for directed graphs you need a minimum arborescence (Edmonds' algorithm).
Common Mistakes
Using a naive find without path compression or union by rank, turning each lookup linear and ruining the near-linear runtime.
Forgetting to skip an edge whose endpoints are already connected, producing a graph with cycles that is not a tree.
Stopping after examining a fixed number of edges instead of after accepting V - 1 edges, leaving the spanning tree incomplete.
Try It with an AI Assistant
short
Build minimum spanning tree by greedily adding smallest non-cycling edges.
behavior
Write a function that takes a list of weighted edges over a set of nodes. Sort the edges by weight, smallest first. Use a union-find data structure that starts with every node in its own group. Walk the sorted edges; for each edge, if its two endpoints are in different groups, accept it into the result and merge the groups. Stop when every node is in the same group, and return the accepted edges.
For instance: Repeatedly swap neighboring students until heights are ordered.
arr ← [5, 1, 4, 2]
n ← length(arr)
FOR i FROM 0 TO n - 1
    FOR j FROM 0 TO n - i - 2
        IF arr[j] > arr[j + 1] THEN
            swap(arr[j], arr[j + 1])
        ENDIF
    ENDFOR
ENDFOR
RETURN arr
The name "bubble sort" appeared in IBM's 1956 internal report on sorting methods, alongside a dozen other early techniques being evaluated for the first generation of business mainframes. Though slow, its visual story — larger values rising one position at a time like bubbles in a glass — made it a permanent fixture in classrooms. Generations of computer-science students have met sorting for the first time through this algorithm, even as practitioners have spent the same decades urging them never to actually use it in real code.
Teaches: Repeated small improvements can create order
Anecdote
The name "bubble sort" appears in IBM's 1956 Sorting on Electronic Computer Systems report. Donald Knuth later wrote that bubble sort "seems to have nothing to recommend it, except a catchy name and the fact that it leads to some interesting theoretical problems." It remains one of the most widely taught algorithms anyway, because computer-science teachers couldn't resist starting with it. Bubble sort survives in textbooks as the algorithm computer scientists love to hate.
The Idea
Walk down the list comparing each item to the one beside it. If they are out of order, swap them. After one full walk, the largest item has been carried all the way to the end. Now ignore that last position and walk again — the second-largest will end up just before it. Repeat until the list is sorted.
Why does it work? After the i-th pass, the last i positions contain the i largest values, in the right order — that's the invariant. Each pass extends the sorted suffix by exactly one. Since one pass guarantees one more position is correct, n passes guarantee the whole list is correct. The cost is that each pass walks the whole unsorted prefix, so on a list of n items you do roughly n²/2 comparisons. That's fine for short lists; ruinous for long ones.
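A short Python sketch shows the shrinking inner loop and lets the comparison count be checked directly. The early-exit flag is an optional refinement added here; without it the function would always perform n(n − 1)/2 comparisons.

```python
def bubble_sort(arr):
    """Sort `arr` in place; return the number of comparisons made."""
    n = len(arr)
    comparisons = 0
    for i in range(n - 1):             # n - 1 passes are enough
        swapped = False
        for j in range(n - 1 - i):     # the last i slots are already final
            comparisons += 1
            if arr[j] > arr[j + 1]:
                arr[j], arr[j + 1] = arr[j + 1], arr[j]
                swapped = True
        if not swapped:                # a pass with no swaps means sorted
            break
    return comparisons


data = [5, 1, 4, 2]
print(bubble_sort(data), data)         # 6 [1, 2, 4, 5]
```

For four items the full cost is 3 + 2 + 1 = 6 comparisons, matching n(n − 1)/2. On an already sorted list the early exit stops after a single pass of n − 1 comparisons.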
Trace
pass i | j sweeps | array after pass
0 | (5,1)→swap, (5,4)→swap, (5,2)→swap | [1, 4, 2, 5]
1 | (1,4)→ok, (4,2)→swap | [1, 2, 4, 5]
2 | (1,2)→ok | [1, 2, 4, 5]
3 | nothing left to do | [1, 2, 4, 5]
Where It's Used Today
Computer-science classrooms — bubble sort is the first sorting algorithm most students see, because each step is easy to draw on a whiteboard.
Sorting visualizations — animation channels and educational sites show bubble sort because the "bubbling" motion is so visible.
Tiny embedded systems — when only a handful of items need sorting and code size matters more than speed (firmware, sensors), the simplest sort wins.
Detecting "almost sorted" lists — bubble sort with an early-exit flag can confirm a list is already sorted in a single pass.
Interview problems — bubble sort is a common warm-up question and a stepping stone to more sophisticated sorts like insertion, merge, and quicksort.
When NOT to Use
When the list has more than a few dozen elements — its O(n^2) swap-heavy behavior is wasteful even compared to insertion sort, which moves data far less.
When writes are expensive (flash memory, network-replicated arrays) — bubble sort performs many more swaps than insertion or selection sort for the same data.
When you need anything resembling production performance — every standard library sort (Timsort, introsort) beats bubble sort by huge margins; using it outside teaching contexts is almost always wrong.
Common Mistakes
Looping the inner index up to n - 2 on every pass instead of n - i - 2 — you re-compare the already-sorted suffix each time, roughly doubling the comparisons (and looping up to n - 1 reads arr[j + 1] past the end of the array).
Forgetting the early-exit "no swaps this pass" flag — the algorithm keeps grinding through every remaining pass even when the list became sorted on pass 2.
Comparing arr[j] with arr[j-1] instead of arr[j+1] — the off-by-one flip causes either out-of-bounds reads or a sort that runs in the wrong direction.
Try It with an AI Assistant
short
Write bubble_sort(a) that sorts a list in place using bubble sort.
behavior
Write a function that sorts a list by repeatedly walking from the start to the end, comparing each pair of neighbors and swapping them when they are out of order. After each full walk, the largest remaining item ends up at the right edge of the unsorted region. Repeat until no swaps happen during a walk.
Made growing a cheapest network from one point practical.
For instance: Expand an electrical grid by adding the cheapest nearby connection.
visited ← empty set
pq ← priority queue with (0, start, NULL)
mst ← empty list
total ← 0
WHILE pq NOT empty
    (w, node, from) ← extractMin(pq)
    IF node IN visited THEN
        CONTINUE
    ENDIF
    add node TO visited
    IF from != NULL THEN
        append (from, node, w) TO mst
        total ← total + w
    ENDIF
    FOR EACH (neighbor, cost) IN graph[node]
        IF neighbor NOT IN visited THEN
            insert (cost, neighbor, node) into pq
        ENDIF
    ENDFOR
ENDWHILE
RETURN mst, total
Robert Prim was a Bell Labs researcher in 1957, working on the practical problem of laying down telephone cable cheaply between switching stations. He rediscovered the "grow one cheapest edge at a time" technique that the Czech mathematician Vojtěch Jarník had published in 1930 — but Prim's clearer formulation, in the language of graph algorithms, is what stuck. Edsger Dijkstra independently wrote down the same algorithm in 1959, which is why textbooks sometimes call it the Prim-Jarník-Dijkstra algorithm.
Teaches: Grow outward by always picking the cheapest crossing edge
The Idea
Start at any single node — call it your "island." At each step, look at every edge that has one foot on the island and one foot outside it. Pick the cheapest such crossing edge, add the new node to the island, and repeat. Stop when every node belongs to the island. A priority queue lets you always grab the cheapest crossing edge in logarithmic time.
Why does this work? At every step, the partial tree is a subset of some minimum spanning tree (this is the cut property of MSTs). The cheapest edge crossing from "in" to "out" is always safe to add — refusing it would mean accepting a more expensive edge later to cross the same boundary. So at every step the invariant holds: the tree we've built so far is part of an optimal solution, and after n − 1 edges we've connected all n nodes.
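The island-growing loop can be sketched with Python's heapq module as the priority queue. This is a lazy-deletion version: stale queue entries are skipped when popped instead of being removed early. The graph layout (a dict of adjacency lists) and the four-node sample are assumptions made for illustration, and node labels must be mutually comparable so that heap ties can be broken.

```python
import heapq


def prim(graph, start):
    """Return (mst_edges, total) for the component containing `start`.

    `graph` maps each node to a list of (neighbor, cost) pairs and must list
    every undirected edge in both directions.
    """
    visited = set()
    mst, total = [], 0
    pq = [(0, start, None)]            # (edge cost, node, node we came from)
    while pq:
        w, node, src = heapq.heappop(pq)
        if node in visited:
            continue                   # stale entry: node already on the island
        visited.add(node)
        if src is not None:
            mst.append((src, node, w))
            total += w
        for neighbor, cost in graph[node]:
            if neighbor not in visited:
                heapq.heappush(pq, (cost, neighbor, node))
    return mst, total


graph = {
    "A": [("B", 2), ("C", 3)],
    "B": [("A", 2), ("C", 1), ("D", 4)],
    "C": [("A", 3), ("B", 1), ("D", 5)],
    "D": [("B", 4), ("C", 5)],
}
print(prim(graph, "A"))
# ([('A', 'B', 2), ('B', 'C', 1), ('B', 'D', 4)], 7)
```

The entry (3, C, A) is popped after C has already joined through the cheaper edge from B, so it is discarded without touching the tree.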
Trace
step | pq (sorted) | extract (w, node) | visited | action
0 | [(0, A)] | (0, A) | {A} | push A's edges
1 | [(2, B), (3, C)] | (2, B) | {A, B} | push B's edges
2 | [(1, C), (3, C), (4, D)] | (1, C) | {A, B, C} | push C's edges
3 | [(3, C), (4, D), (5, D)] | (3, C) | {A, B, C} | skip (already visited)
4 | [(4, D), (5, D)] | (4, D) | {A, B, C, D} | done
Where It's Used Today
Electrical grids — utility companies plan the cheapest set of new power lines that connect every neighborhood substation back to the main plant.
Computer networks — laying out fiber-optic backbones between data centers so that every site is reachable with the least cable.
Road and rail planning — choosing which highway segments to fund first when you must connect a set of towns within a budget.
Cluster analysis — image-segmentation and biology tools build an MST over data points to find natural groupings (cut the most expensive edges and clusters fall out).
Approximate Traveling Salesman — many delivery-route heuristics start by building an MST and then walking around it, getting within a factor of 2 of the best possible tour when distances obey the triangle inequality.
When NOT to Use
When the graph is sparse and you want all-edges-sorted simplicity — Kruskal's algorithm with union-find is often easier to reason about.
When you need shortest paths from a source, not a minimum-cost spanning tree — these are different problems; use Dijkstra.
When the graph is disconnected — Prim's only spans the component containing start; you'll need to run it from each component or switch to Kruskal.
Common Mistakes
Skipping the IF node IN visited THEN CONTINUE check after extraction, so already-visited nodes get processed again and the MST gains duplicate or wrong edges.
Pushing every neighbor unconditionally (including visited ones) and ballooning the priority queue — filter at insertion or accept the lazy-deletion overhead.
Confusing the priority-queue weight with the destination node's distance from start — Prim's compares only the single crossing edge, not a cumulative path cost.
Try It with an AI Assistant
short
Write prim(graph, start) returning the MST edges using a priority queue.
behavior
Write a function that, given a weighted graph and a starting node, grows a connected set of nodes by repeatedly adding the cheapest edge that has one end in the set and one end outside it. Stop when every node is included, and return the chosen edges.
For instance: Pick the smallest book each time and place it next on the shelf.
arr ← [64, 25, 12, 22, 11]
n ← length(arr)
FOR i FROM 0 TO n - 1
    minIdx ← i
    FOR j FROM i + 1 TO n - 1
        IF arr[j] < arr[minIdx] THEN
            minIdx ← j
        ENDIF
    ENDFOR
    swap(arr[i], arr[minIdx])
ENDFOR
RETURN arr
Instead of repeated local swaps, selection sort repeatedly finds the smallest remaining element. It reflects how humans often organize physical objects.
Needed a simple in-place sorting strategy with minimal swaps.
Teaches: Pick the smallest remaining element each step
Anecdote
Selection sort never had a single "inventor" — by the late 1950s it appears as the simplest possible sorting algorithm in lecture notes from MIT, Carnegie Tech, and IBM training programs simultaneously. It's the sorting algorithm a kid would invent if you asked them to sort their LEGO bricks by size. Its persistence in education comes from being the sort that makes a new student want to learn a better one.
The Idea
Walk an index i from the front of the array to the back. At each i, scan the rest of the array (from i + 1 to the end) to find the index minIdx of the smallest remaining element. Swap arr[i] with arr[minIdx]. After the swap, position i holds its final, correctly-sorted value.
Why does it work? After the i-th pass, the prefix arr[0..i] contains the i + 1 smallest values in sorted order, and they are guaranteed never to move again. The unsorted suffix gets one element shorter every pass. The invariant: the prefix is always sorted, and every element in the prefix is no larger than every element in the suffix. When i reaches n − 1, the prefix is the whole array. Selection sort uses at most n − 1 swaps total — fewer than bubble sort — but always does about n²/2 comparisons.
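The scan-and-swap loop can be sketched in Python as below. Two small departures from the pseudocode are made here as deliberate choices: the outer loop stops one position early, since the last element falls into place on its own, and a swap of an element with itself is skipped so that the returned count reflects only swaps that move data.

```python
def selection_sort(arr):
    """Sort `arr` in place; return the number of swaps that moved data."""
    n = len(arr)
    swaps = 0
    for i in range(n - 1):             # the final slot needs no pass of its own
        min_idx = i
        for j in range(i + 1, n):      # scan the unsorted suffix
            if arr[j] < arr[min_idx]:
                min_idx = j            # remember the index, not the value
        if min_idx != i:               # skip the no-op self-swap
            arr[i], arr[min_idx] = arr[min_idx], arr[i]
            swaps += 1
    return swaps


data = [64, 25, 12, 22, 11]
print(selection_sort(data), data)      # 3 [11, 12, 22, 25, 64]
```

Only three swaps are needed for five items here, which stays under the n − 1 bound, while the comparison count is always n(n − 1)/2 regardless of the input order.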
Trace
i | scan range | minIdx (value) | swap | array after
0 | indices 1–4 | 4 (11) | 0 ↔ 4 | [11, 25, 12, 22, 64]
1 | indices 2–4 | 2 (12) | 1 ↔ 2 | [11, 12, 25, 22, 64]
2 | indices 3–4 | 3 (22) | 2 ↔ 3 | [11, 12, 22, 25, 64]
3 | index 4 | 3 (25) | 3 ↔ 3 | [11, 12, 22, 25, 64]
4 | (empty) | 4 (64) | 4 ↔ 4 | [11, 12, 22, 25, 64]
Where It's Used Today
Teaching sorting — selection sort is one of the first three algorithms in nearly every intro CS course because the invariant is so visible.
Embedded systems with limited writes — flash memory wears out with each write, so the minimal-swap behavior of selection sort can be preferable to bubble sort.
Sorting tiny arrays — when n ≤ 10, a simple quadratic loop wins on cache and instruction count, which is why library sorts often fall back to a simple quadratic sort (usually insertion sort rather than selection sort) at the leaves.
Real-world physical sorting — sorting cards, books, or files by hand is closer to selection sort than to any other algorithm; you scan, pick the smallest, place it.
Online "find the top-k" — partial selection sort runs the outer loop only k times, finding the k smallest values in O(nk) time.
When NOT to Use
When n is more than a few thousand — the O(n²) comparison count is brutally slow compared to merge sort or Timsort.
When you need a stable sort that preserves the order of equal keys — selection sort's swap moves elements past equal values and breaks ties unpredictably.
When the data is already nearly sorted — selection sort does the full O(n²) scan anyway, while insertion sort or Timsort finishes in nearly O(n).
Common Mistakes
Starting the inner loop at i instead of i + 1, which compares arr[i] to itself and just wastes a comparison.
Tracking the minimum value instead of its index, then being unable to swap it back into position i.
Swapping inside the inner loop on every new minimum — the algorithm only swaps once per outer pass, after the inner loop ends.
Try It with an AI Assistant
short
Write selection_sort(a) that sorts a list in place using selection sort.
behavior
Write a function that sorts a list in place by repeatedly finding the smallest element in the unsorted portion of the list and swapping it into the next position. Start at index 0; on each pass scan from the current position to the end to find the index of the minimum, then swap it into the current position.
Made automatic grouping of unlabeled data practical.
For instance: Group customers into clusters based on shopping behavior.
centroids ← pick k random points
FOR t FROM 1 TO iterations
    groups ← array of k empty lists
    FOR EACH p IN points
        idx ← index of closest centroid TO p
        append(groups[idx], p)
    ENDFOR
    FOR i FROM 0 TO k - 1
        centroids[i] ← mean(groups[i])
    ENDFOR
ENDFOR
RETURN centroids
Stuart Lloyd worked out the algorithm at Bell Labs in 1957 while studying how to quantize signals for pulse-code modulation — squeezing a continuous voltage range into a small set of representative values. His write-up sat as an internal Bell memo for over twenty years before being formally published in 1982, by which time others had independently rediscovered the same iterative assign-then-recenter loop. Today it's the most-taught unsupervised learning algorithm and the default first pass on almost every clustering problem.
Teaches: Iteratively refine groups by minimizing internal distance
Anecdote
Despite its simplicity, k-means is still used in large systems. Often it's just an initialization step for more complex models — a humble algorithm sitting at the start of deep pipelines.
The Idea
Pick k random points to be the centroids — the imaginary "centers" of your clusters. Now repeat two steps until nothing changes:
1. Assign: send every data point to whichever centroid it is closest to. This forms k groups.
2. Update: move each centroid to the average position of the points that joined its group.
Why does this work? Neither step can ever raise the total squared distance from points to their own centroid. The "assign" step cannot raise it because every point switches to its closest center; the "update" step cannot raise it because the mean is the position that minimizes the total squared distance to a group. The total can't keep dropping forever, because there are only finitely many ways to split the points into k groups, so eventually the centroids stop moving and the clusters are stable. That stable arrangement is your answer.
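The assign-then-update loop is short enough to sketch for one-dimensional points in plain Python. Several choices here are assumptions made for illustration: the starting centroids are passed in explicitly instead of being picked at random, an empty group keeps its old centroid, and the loop stops as soon as no centroid moves.

```python
def kmeans_1d(points, centroids, max_iters=100):
    """Lloyd's iteration on 1-D points; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iters):
        # Assign: each point goes to its nearest centroid (squared distance).
        labels = [
            min(range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2)
            for p in points
        ]
        # Update: each centroid moves to the mean of its group.
        new_centroids = []
        for i, old in enumerate(centroids):
            group = [p for p, lab in zip(points, labels) if lab == i]
            # Assumption: an empty group keeps its old centroid.
            new_centroids.append(sum(group) / len(group) if group else old)
        if new_centroids == centroids:
            break                      # nothing moved: converged
        centroids = new_centroids
    return centroids, labels


print(kmeans_1d([1, 2, 3, 10, 11, 12], [1, 12]))
# ([2.0, 11.0], [0, 0, 0, 1, 1, 1])
```

Starting from centroids 1 and 12, the first update moves them to 2 and 11, and the second iteration changes nothing, so the loop stops after two rounds.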
Trace
t | centroids | groups (assigned) | new centroids = mean(group)
1 | [1, 12] | {1,2,3} → idx 0; {10,11,12} → idx 1 | [2, 11]
2 | [2, 11] | {1,2,3} → idx 0; {10,11,12} → idx 1 | [2, 11]
Where It's Used Today
Customer segmentation — retailers group shoppers into clusters by age, spending, and visit frequency to target promotions.
Image compression — replacing thousands of pixel colors with k representative colors shrinks file size while keeping the picture recognizable.
Document organization — news websites cluster articles into topic groups so readers see related stories together.
Anomaly detection — fraud systems flag transactions that don't fit cleanly into any normal cluster.
Initialization for bigger models — modern deep-learning pipelines often start with k-means to give a smarter starting point than random.
When NOT to Use
When the clusters are non-spherical, elongated, or vary wildly in size — k-means assumes round, equal-radius blobs and will split or merge them wrongly.
When you don't know k — the algorithm will happily produce whatever number you ask for, even if there are no real groups.
When features have very different scales (income in dollars vs. age in years) — Euclidean distance is dominated by the larger scale; standardize first or use a different metric.
Common Mistakes
Picking initial centroids uniformly at random — bad seeds give terrible local minima; use k-means++ or run multiple restarts.
Forgetting to handle empty clusters mid-iteration — the mean of an empty set is undefined and crashes the update step.
Treating the result as the global optimum — k-means is greedy and converges to a local minimum that depends on the starting centroids.
Try It with an AI Assistant
short
Write a class KMeans(k) with fit(X) running Lloyd's iteration until centroids stabilize, and predict(x) returning the nearest cluster index.
behavior
Write a function that, given a list of points and a number k, picks k initial centers, then repeatedly assigns each point to its nearest center and moves each center to the average of its assigned points, until the centers stop moving. Return the final centers and the cluster index for each point.
For instance: Calculate Fibonacci once per value instead of thousands of repeated calls.
memo ← map with memo[0]=0, memo[1]=1
FUNCTION f(x)
    IF x IN memo THEN
        RETURN memo[x]
    ENDIF
    memo[x] ← f(x - 1) + f(x - 2)
    RETURN memo[x]
ENDFUNCTION
RETURN f(n)
In the early 1950s, working at the RAND Corporation in Santa Monica, Richard Bellman gave a name to the family of techniques that solve a problem by remembering the answers to its smaller pieces: dynamic programming. His book of that title followed in 1957. He chose the name partly to make the work palatable to a research-averse defense budget; the technique itself was a serious intellectual leap, identifying the property of overlapping subproblems that memoization exploits. Memoized Fibonacci is the canonical teaching example because the savings are so dramatic: an exponential blow-up of redundant calls collapses to a linear number of computations.
Teaches: Cache results to avoid repeated computation
The Idea
Keep a memo table — a dictionary mapping x to its answer f(x). Seed it with the two base cases memo[0] = 0 and memo[1] = 1. When f(x) is called, look in the memo first: if the answer is already there, return it. If not, compute f(x−1) + f(x−2), store the result in memo[x], and return it.
Why does it work? Each value of x gets computed exactly once, because the second time we see it, the memo answers immediately. The same recursion that did exponential work now does linear work: computing f(n) takes n − 1 fresh computations (one for each x from 2 to n) and about n cache hits, each costing a single table lookup. This is the central trick of dynamic programming: trade memory for time, storing the answers to overlapping subproblems instead of recomputing them.
Trace
call | memo before | action | memo after
f(6) | {0:0, 1:1} | needs f(5) + f(4); recurse on f(5) | -
f(5) | {0:0, 1:1} | needs f(4) + f(3); recurse on f(4) | -
f(4) | {0:0, 1:1} | needs f(3) + f(2); recurse on f(3) | -
f(3) | {0:0, 1:1} | needs f(2) + f(1); recurse on f(2) | -
f(2) | {0:0, 1:1} | f(1) + f(0) = 1 + 0 = 1; store | {…, 2:1}
f(3) | {…, 2:1} | memo[2]=1, memo[1]=1 → 2; store | {…, 2:1, 3:2}
f(4) | {…, 3:2} | memo[3]=2, memo[2]=1 → 3; store | {…, 4:3}
f(5) | {…, 4:3} | memo[4]=3, memo[3]=2 → 5; store | {…, 5:5}
f(6) | {…, 5:5} | memo[5]=5, memo[4]=3 → 8; store | {…, 6:8}
Where It's Used Today
Spell checkers and DNA aligners — edit-distance and sequence-alignment algorithms are memoized recursions over (i, j) pairs.
Compiler optimizers — the common subexpression elimination pass is memoization on parsed expressions.
Game AI — chess and Go engines memoize positions in a transposition table to skip re-analyzing the same board.
Web frameworks — React's useMemo and Vue's computed properties cache expensive derived values so they're not recomputed on every render.
Python's functools.lru_cache — a one-line decorator that memoizes any pure function whose arguments are hashable; a common way to speed up repeated calls in Python code.
When NOT to Use
When the function isn't pure — memoizing a function that depends on global state or time will return stale or wrong answers.
When n is large enough that the recursion depth blows the stack — switch to bottom-up iteration with two rolling variables.
When subproblems don't actually overlap — a one-pass loop or formula is simpler and the cache is dead weight.
Common Mistakes
Forgetting to seed the base cases memo[0] = 0 and memo[1] = 1, causing infinite recursion or KeyError lookups.
Using a fresh memo on every top-level call instead of sharing one across calls — defeats the entire point of memoization.
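Using mutable types (lists, dicts) as cache keys: in Python they are unhashable and raise TypeError, and in languages that do allow them, a key mutated after insertion no longer matches its entry. Convert to tuples or other immutable keys first.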
Try It with an AI Assistant
short
Write fib_memo(n) returning the n-th Fibonacci number using top-down memoization.
behavior
Write a function that computes the n-th term of the sequence where the first two terms are 0 and 1 and every later term is the sum of the previous two. Before computing each term, check a lookup table; if the answer is already there, return it; otherwise compute it once, store it in the lookup table, and return it.
Made best use of divisible resources provably optimal.
For instance: Fill a bag with gold dust first if it has highest value per pound.
items ← [(10, 60), (20, 100), (30, 120)]
capacity ← 50
sort items by value/weight descending
total ← 0
FOR EACH item IN items
    IF capacity = 0 THEN
        BREAK
    ENDIF
    take ← min(item.weight, capacity)
    total ← total + take * (item.value / item.weight)
    capacity ← capacity - take
ENDFOR
RETURN total
The general "knapsack problem" was named in the 1950s by mathematicians studying postwar logistics — bomber bays, supply trucks, even safe-cracking puzzles, all variations on packing finite capacity to maximize value. The 0/1 version turned out to be brutally hard, but the fractional relaxation cracked open immediately under Dantzig's then-new linear-programming framework: the greedy rule was suddenly provably optimal, and a problem that had felt combinatorial became one short loop.
Teaches: Greedily take highest value density first
Anecdote
George Dantzig invented linear programming and the simplex method in 1947 to optimize military supply chains for the U.S. Air Force. The fractional knapsack is the simplest case where Dantzig's framework gives a greedy solution. He famously arrived late to a statistics class at Berkeley in 1939, copied two open problems from the blackboard thinking they were homework, and solved both; his professor, Jerzy Neyman, later told him they were unsolved problems. The man casually solved problems other people thought were impossible; the knapsack was easy by comparison.
The Idea
Compute each item's value per pound (its "value density"), and sort items from highest density to lowest. Walk down the sorted list. For each item, take as much as the remaining capacity allows: if the whole item fits, take it; if not, take just the fraction that fits and stop.
Why does this work? The greedy choice is provably optimal here, by an exchange argument. Suppose, for contradiction, that the best packing held a pound of lower-density material while a pound of some higher-density item was left outside. Swapping one for the other keeps the weight the same and increases the total value, so that packing was not the best after all. The invariant is that at every step, every pound already in the bag has at least as high a value density as every pound left outside it. The fractional split is what lets the greedy rule work: it can always top the bag up exactly. In the 0/1 version, where items cannot be split, the same rule can strand capacity and give a suboptimal answer, and the problem becomes NP-hard.
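The density-first rule can be sketched in Python as below. This assumes items are (weight, value) pairs with strictly positive weights, as in the example data; the function name knapsack_fractional follows the prompt later in this section.

```python
def knapsack_fractional(items, capacity):
    """Greedy fractional knapsack.

    items: list of (weight, value) pairs, each weight > 0 (assumed).
    Returns the maximum total value that fits in `capacity`.
    """
    total = 0.0
    remaining = capacity
    # Visit items from highest value density (value / weight) to lowest.
    for weight, value in sorted(items, key=lambda it: it[1] / it[0], reverse=True):
        if remaining <= 0:
            break
        take = min(weight, remaining)     # whole item, or the fraction that fits
        total += take * (value / weight)  # true division keeps fractional densities
        remaining -= take
    return total


print(knapsack_fractional([(10, 60), (20, 100), (30, 120)], 50))  # 240.0
```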
Trace
step | item (weight, value) | density | take | total | capacity left
1 | (10, 60) | 6.0 | 10 | 0 + 10·6 = 60 | 40
2 | (20, 100) | 5.0 | 20 | 60 + 20·5 = 160 | 20
3 | (30, 120) | 4.0 | 20 | 160 + 20·4 = 240 | 0
Where It's Used Today
Cargo and shipping — loading a truck or container with bulk goods (grain, ore, liquid) to maximize freight value within the weight limit.
Investment portfolio building — allocating a fixed budget across divisible assets to chase the best return-per-dollar ratio.
Cloud computing budgets — distributing a fixed compute budget across workloads with different value-per-CPU-hour, where partial allocations are allowed.
Bandwidth and CDN allocation — splitting limited bandwidth across video streams or downloads ranked by priority per byte.
Energy grid dispatch — choosing which power plants to draw from when each has a different cost-per-megawatt-hour and a maximum capacity.
When NOT to Use
When items cannot be split (a laptop, a book, a person) — that's the 0/1 knapsack and greedy gives wrong answers; use dynamic programming.
When values depend on which other items you take (synergies, discounts) — greedy ignores combinations entirely.
When weights or values can be negative or zero — the value-density sort breaks down and may divide by zero.
Common Mistakes
Sorting by value alone instead of value-per-weight, which loses to a smaller but denser item.
Taking the full last item when only a fraction fits, exceeding the capacity.
Computing value / weight with integer division, throwing away the fractional part of every density.
Try It with an AI Assistant
short
Write knapsack_fractional(items, capacity) where items are (weight, value) and items can be split; return the maximum total value.
behavior
Write a function that takes a list of items, each with a weight and a value, and a capacity. Items can be split into fractions. Sort the items by value-per-weight from highest to lowest. Walk down the list, taking each whole item that fits, and a fraction of the next one to fill any leftover capacity. Return the total value.
For instance: A statistics program can simulate heights or measurement noise.
u1 ← random(0, 1)
u2 ← random(0, 1)
r ← sqrt(-2 * log(u1))
theta ← 2 * π * u2
z0 ← r * cos(theta)
z1 ← r * sin(theta)
RETURN z0, z1
In 1958, statistician George Box and graduate student Mervin Muller — working in Princeton's Department of Statistics — published a two-page note in The Annals of Mathematical Statistics showing that two independent uniform samples could be turned into two independent Gaussian samples by a single change of variables. Until then, simulation programs had to generate normals by approximate sums of uniforms (using the Central Limit Theorem) or by costly inverse-CDF tables. Box-Muller replaced both with a closed-form formula simple enough to fit on a punch card, and within a decade it was the default Gaussian generator in nearly every scientific subroutine library.
Teaches: Transform uniform randomness into normal distribution
Anecdote
As the story goes, George Box and Mervin Muller submitted the algorithm to The Annals of Mathematical Statistics and got it back with a reviewer comment: "too simple to publish." It was published anyway. The algorithm, or a close variant, is now in the random-number libraries of many programming languages; the reviewer is forgotten. The "too simple to publish" pattern recurs: many of computing's most-used algorithms felt embarrassingly trivial to their inventors.
The Idea
Take two independent uniform numbers u1 and u2 between 0 and 1. Turn them into polar coordinates: a radius r = sqrt(-2 · log(u1)), where log is the natural logarithm, and an angle theta = 2π · u2. The two Cartesian coordinates z0 = r · cos(theta) and z1 = r · sin(theta) come out as independent samples from the standard normal distribution, with mean 0 and standard deviation 1.
Why does this work? The 2-D normal distribution is rotationally symmetric, so its cloud of points looks the same from every angle. Sampling a random angle uniformly in [0, 2π) and a random radius whose squared length is exponentially distributed (which is what -2 · log(u1) produces) gives back exactly that cloud. The two coordinates we read off are independent and standard-normal — two for the price of one log and one square root.
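The change of variables can be sketched with Python's standard library. The function names below are illustrative. The sketch assumes u1 is drawn from (0, 1], obtained here as 1 − random(), because random.random() returns values in [0.0, 1.0) and log(0) is undefined.

```python
import math
import random


def box_muller_transform(u1, u2):
    """Map uniforms u1 in (0, 1] and u2 in [0, 1) to two standard-normal values."""
    r = math.sqrt(-2.0 * math.log(u1))  # math.log is the natural logarithm
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)


def sample_normal_pair(rng=random):
    """Draw two independent standard-normal samples from an rng with .random()."""
    u1 = 1.0 - rng.random()  # shifts [0, 1) to (0, 1], so log(u1) is finite
    u2 = rng.random()
    return box_muller_transform(u1, u2)


z0, z1 = box_muller_transform(0.5, 0.25)
print(round(z0, 3), round(z1, 3))  # about 0.0 and 1.177, as in the trace
```

Returning both values keeps the second sample that each call has already paid for.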
Trace
step | variable | value
1 | u1 | 0.5
2 | u2 | 0.25
3 | r | sqrt(-2 · log(0.5)) ≈ sqrt(1.386) ≈ 1.177
4 | theta | 2π · 0.25 = π/2 ≈ 1.5708
5 | z0 | 1.177 · cos(π/2) ≈ 0.000
6 | z1 | 1.177 · sin(π/2) ≈ 1.177
Where It's Used Today
Monte Carlo simulations — physics, chemistry, and finance use bell-curve noise to model real-world variation.
Machine learning initialization — neural network weights are seeded from a normal distribution generated this way (or a close variant).
Computer graphics — depth-of-field blur, motion blur, and noise textures sample from Gaussians for realism.
Signal processing — adding Gaussian noise to test the robustness of audio and image filters.
Statistics teaching — generating synthetic data sets with known mean and standard deviation for classroom labs.
When NOT to Use
When you need a non-normal distribution (exponential, Poisson, beta) — Box-Muller only produces standard Gaussians.
When transcendental functions (log, sin, cos) are expensive on your hardware — the Marsaglia polar method or Ziggurat algorithm is faster.
When the underlying uniform generator has known low-bit defects — those defects get amplified through log(u1) near zero.
Common Mistakes
Allowing u1 = 0, which makes log(u1) = -infinity and crashes; you must sample from (0, 1].
Using log10 instead of natural log ln, producing samples whose variance is wrong by a factor of ln(10).
Discarding z1 and only returning z0, doubling the cost since each call already pays for two samples.
Try It with an AI Assistant
short
Convert uniform random numbers into normally distributed random values.
behavior
Take two random numbers between 0 and 1. Compute a radius as the square root of negative two times the natural log of the first number. Compute an angle as two pi times the second number. Return the radius times the cosine of the angle, and the radius times the sine of the angle.
Made teaching machines from labeled examples possible.
w ← [0, 0]
b ← 0
x ← [2, 1]
y ← +1
lr ← 1
pred ← sign(w · x + b)
IF pred != y THEN
    w ← w + lr * y * x
    b ← b + lr * y
ENDIF
RETURN (w, b)
At the Cornell Aeronautical Laboratory in the late 1950s, Frank Rosenblatt was trying to build a learning machine for image recognition. The U.S. Office of Naval Research funded the work, and the press treated his perceptron demonstrations as the dawn of thinking machines. Minsky and Papert's 1969 book Perceptrons then showed that a single-layer perceptron cannot learn XOR, which helped trigger the first "AI winter". Even so, the idea behind Rosenblatt's rule, adjusting weights in response to errors, lives on in the backpropagation training used by modern neural networks.
Teaches: Adjust weights based on prediction errors
Anecdote
Frank Rosenblatt built a physical machine (the Mark I Perceptron) with motors and wires. Press photos of him posing next to it helped fuel huge hype, followed by a backlash when its limits were exposed.
The Idea
For each labeled example (x, y), compute the predicted label as sign(w · x + b). If the prediction matches y, do nothing: the rule already gets this example right. If the prediction is wrong, push the weights in the direction of the correct answer: add lr · y · x to w and lr · y to b. The learning rate lr controls how big each nudge is.
Why does it work? Each update on a misclassified example moves the score for that very example toward the correct side. Add y · x to w (taking lr = 1), and the new w · x becomes (w + y · x) · x = w · x + y · (x · x): a nudge in the direction of y, by an amount proportional to the squared length of x. The famous Perceptron Convergence Theorem (Novikoff, 1962) proves that if the data is linearly separable, this loop is guaranteed to stop after a finite number of mistakes, no matter how you order the examples. If the data isn't separable, the loop never settles, which is exactly the limitation Minsky and Papert pointed out in 1969.
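A single update step can be sketched in Python with plain lists. The helper sign and the exact signature of perceptron_update are choices made for this illustration; sign(0) returns 0, so a zero score never matches a label of +1 or -1 and always triggers an update.

```python
def sign(v):
    """Return +1, 0, or -1 according to the sign of v."""
    return (v > 0) - (v < 0)


def perceptron_update(w, b, x, y, lr=1):
    """One perceptron step on example (x, y) with y in {-1, +1}.

    Returns the (possibly updated) weights and bias as a new pair.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    if sign(score) != y:
        # Mistake: nudge the weights and bias toward the correct label.
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        b = b + lr * y
    return w, b


w, b = perceptron_update([0, 0], 0, [2, 1], +1)
print(w, b)  # [2, 1] 1
```

Calling it again on the same example leaves w and b unchanged, because the score 2·2 + 1·1 + 1 = 6 now has the right sign.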
Trace
step | computation | value
1 | w · x + b = 0·2 + 0·1 + 0 | 0
2 | pred = sign(0) = 0 | 0
3 | pred != y (0 ≠ +1), so update | yes
4 | w ← w + lr · y · x = [0,0] + 1·1·[2,1] | [2, 1]
5 | b ← b + lr · y = 0 + 1·1 | +1
Where It's Used Today
Online learning systems — fraud detection and ad-click prediction systems use perceptron-style updates to learn from one example at a time as new data streams in.
Spam filters — early text classifiers (SpamAssassin and similar) used perceptron and its averaged variant to weight features like word counts.
Building block for deep learning — every modern neural network is a stack of "neurons" that are direct descendants of this single update rule.
Sentiment classifiers — averaged perceptrons remain a fast, surprisingly strong baseline in NLP for tasks like positive/negative review classification.
Hardware demos — FPGA and embedded AI tutorials still implement the perceptron update because it fits in a few lines of integer math.
When NOT to Use
When the data is not linearly separable (the classic XOR problem) — the perceptron will loop forever without converging; use a multi-layer network or kernel method.
When you need calibrated probabilities, not just a label — perceptron outputs +1/-1, so use logistic regression for probability scores.
When classes are heavily imbalanced — a perceptron will learn to always predict the majority class; reweight examples or use a margin-based loss instead.
Common Mistakes
Updating the weights on every example, even when the prediction was already correct — that introduces noise and slows or prevents convergence.
Treating sign(0) as a valid match for either label — pick a tie-breaking convention (e.g., treat 0 as wrong) or the loop never starts learning.
Using a learning rate that doesn't fit the feature scale — huge x values combined with lr = 1 overshoot, so normalize features first.
Try It with an AI Assistant
short
Write perceptron_update(w, x, y, lr) returning the updated weight vector after one perceptron step on labeled example (x, y) with y in {-1, +1}.
behavior
Write a function that takes a weight vector w, a bias b, a feature vector x, a label y that is either +1 or -1, and a learning rate lr. Compute the sign of w · x + b. If that sign matches y, return w and b unchanged. Otherwise return w + lr · y · x and b + lr · y.
Made finding good travel routes through many cities practical.
tour ← nearest_insertion(cities)
REPEAT
    improved ← FALSE
    FOR i, j IN edge_pairs(tour)
        IF dist(i, i+1) + dist(j, j+1) > dist(i, j) + dist(i+1, j+1) THEN
            tour ← reverse(tour, i+1, j)
            improved ← TRUE
        ENDIF
    ENDFOR
UNTIL NOT improved
RETURN tour
Georges Croes proposed 2-opt as a heuristic at a time when computer scientists were still hopeful that the traveling salesman problem might have a polynomial solution. The 2-opt swap — pick two edges, swap their endpoints, accept if the new tour is shorter — is so simple a child can do it on a napkin. Sixty years later, with TSP proven NP-hard, 2-opt is still the first move every modern TSP solver makes.
Teaches: Improve solutions by locally swapping connections
The Idea
Two stages. First, build any reasonable tour by insertion: start with two cities, then insert each remaining city at whichever position adds the least extra distance (nearest insertion picks the city closest to the current tour each time). This gets you a starting tour quickly.
Second, run 2-opt: scan all pairs of edges in the tour. If removing two edges and reconnecting the route the other way makes the total distance shorter (in the plane, this is always the case when the two edges cross), do the swap. Keep sweeping until no swap improves the tour. The invariant is that the tour length is strictly decreasing: each accepted swap shortens the route, and there are only finitely many possible tours, so the loop must terminate. The result is a "locally optimal" tour: no single edge swap can make it shorter, even if a larger reshuffle could.
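The second stage can be sketched in Python as follows. This covers only the 2-opt improvement loop, not the insertion stage, and assumes symmetric Euclidean distances. The names two_opt and tour_length, the dictionary of coordinates, and the small tolerance used to ignore floating-point noise are assumptions made for this example.

```python
import math


def tour_length(tour, pts):
    """Length of the closed tour (the last city connects back to the first)."""
    n = len(tour)
    return sum(math.dist(pts[tour[k]], pts[tour[(k + 1) % n]]) for k in range(n))


def two_opt(tour, pts):
    """Apply improving 2-opt swaps until a full sweep finds none."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a city; nothing to swap
                a, b = pts[tour[i]], pts[tour[i + 1]]
                c, d = pts[tour[j]], pts[tour[(j + 1) % n]]
                # Full delta: new edges (a,c) + (b,d) minus old edges (a,b) + (c,d).
                delta = math.dist(a, c) + math.dist(b, d) - math.dist(a, b) - math.dist(c, d)
                if delta < -1e-12:
                    # Reverse the segment between the two removed edges.
                    tour[i + 1 : j + 1] = reversed(tour[i + 1 : j + 1])
                    improved = True
    return tour


corners = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (0, 1)}
print(two_opt(["A", "C", "B", "D"], corners))  # ['A', 'B', 'C', 'D']
```

On the four corners of a unit square, the crossed tour of length about 4.83 becomes the perimeter tour of length 4.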
Trace
step | tour | length | what happens
0 | A → C → B → D → A | 4.83 | initial (crossed) tour
1 | A → B → C → D → A | 4.00 | check edges (A,C) and (B,D); swapping them by reversing the segment between them is an improvement, so accept the swap
2 | A → B → C → D → A | 4.00 | sweep again; no further swap shortens it
3 | (done) | 4.00 | locally optimal
Where It's Used Today
Delivery and logistics — UPS, Amazon, and grocery delivery apps run 2-opt-style heuristics on every truck's daily route.
Drilling and PCB manufacturing — 2-opt minimizes the path of a drill head moving between holes on a circuit board.
DNA sequencing — finding a short ordering of fragments that resembles a TSP tour.
Tourist trip planners — apps that sequence sightseeing stops use TSP heuristics.
Robotics and warehouse picking — Kiva/Amazon-Robotics-style fulfillment robots schedule pick paths with TSP solvers.
When NOT to Use
When you need a provably optimal tour — 2-opt only finds a local optimum and can be a few percent off; use exact ILP or Concorde for ground-truth answers.
When the distances are not symmetric (one-way streets, asymmetric travel times) — the textbook 2-opt swap reverses a segment, which only preserves length on symmetric instances.
When the city count is tiny (under ~10) — exact dynamic programming (Held-Karp) finishes instantly and gives the true optimum.
Common Mistakes
Comparing only the new edge lengths instead of the full delta d(i,j) + d(i+1,j+1) − d(i,i+1) − d(j,j+1), accepting swaps that actually make the tour longer.
Forgetting to reverse the in-between segment after the swap — leaving the tour disconnected or no longer a valid cycle.
Stopping after a single sweep instead of looping until no improving swap is found, leaving easy gains on the table.
Try It with an AI Assistant
short
Write tsp_tour_insertion_2_opt(...) that builds a TSP tour by insertion and then improves it with 2-opt.
behavior
Write a function that, given a list of city coordinates, builds a starting tour by repeatedly inserting each unvisited city into the position of the current tour that minimizes the added distance, then improves the tour by repeatedly scanning every pair of edges and reversing the segment between them whenever the swap would shorten the tour. Stop when no swap helps.
For instance: Find routes where some edges give credits or discounts.
n ← 4
source ← 0
edges ← [(0, 1, 4), (0, 2, 5), (1, 2, -3), (2, 3, 4)]
dist ← array[0..n-1] filled with ∞
dist[source] ← 0
FOR i FROM 1 TO n - 1
    FOR EACH (u, v, w) IN edges
        IF dist[u] + w < dist[v] THEN
            dist[v] ← dist[u] + w
        ENDIF
    ENDFOR
ENDFOR
// Extra n-th pass detects a negative cycle
FOR EACH (u, v, w) IN edges
    IF dist[u] + w < dist[v] THEN
        RETURN "negative cycle detected"
    ENDIF
ENDFOR
RETURN dist
Lester Ford published the relaxation idea in 1956 while working on flow problems at RAND; Richard Bellman gave the technique its now-standard form in 1958 as a flagship example for the dynamic-programming framework he was developing at the same institution. Economic planning and routing problems sometimes carried negative costs (rebates, refunds, profitable currency conversions), and Dijkstra's algorithm, published a year later in 1959, assumes non-negative weights and can silently return wrong answers on them. Bellman-Ford handles negative weights safely and, as a bonus, detects impossible "free money" cycles along which a route would otherwise keep getting cheaper forever.
Teaches: Repeated relaxation absorbs negative weights and cycles
The Idea
Keep an array dist recording the best-known distance from the source to every node. Start with dist[source] = 0 and everything else as ∞. Then perform a relaxation pass over every edge (u, v, w): if going to u and then taking the edge is cheaper than your current dist[v], update dist[v]. Repeat the pass n − 1 times, where n is the number of nodes.
Why n − 1 passes? Any shortest path can have at most n − 1 edges (a longer path would revisit a node). After one pass, dist is correct for every node whose shortest path uses at most one edge. After two passes, the same holds for paths of at most two edges. After n − 1 passes, every node that has a shortest path holds its final distance. If a final, n-th pass still finds an improvement, the graph contains a negative cycle reachable from the source, and no finite shortest path exists for the nodes that cycle can reach.
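The relaxation passes and the extra detection pass can be sketched in Python. Two choices here are assumptions for the illustration rather than part of the description above: the function returns None when a negative cycle is found, and it stops early once a whole pass changes nothing.

```python
import math


def bellman_ford(n, edges, source):
    """Shortest distances from source; edges is a list of (u, v, w) triples.

    Returns a list of distances, or None if a negative cycle is reachable.
    """
    dist = [math.inf] * n
    dist[source] = 0
    for _ in range(n - 1):
        changed = False
        for u, v, w in edges:
            # math.inf + w is still inf, so unreachable nodes never "improve".
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break  # optional shortcut: nothing left to relax
    # Extra pass: any remaining improvement means a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            return None
    return dist


print(bellman_ford(4, [(0, 1, 4), (0, 2, 5), (1, 2, -3), (2, 3, 4)], 0))  # [0, 4, 1, 5]
```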
Trace
pass | edge processed | check | dist after
1 | (0→1, 4) | 0 + 4 < ∞ → update dist[1] | [0, 4, ∞, ∞]
1 | (0→2, 5) | 0 + 5 < ∞ → update dist[2] | [0, 4, 5, ∞]
1 | (1→2, −3) | 4 + (−3) = 1 < 5 → update | [0, 4, 1, ∞]
1 | (2→3, 4) | 1 + 4 < ∞ → update dist[3] | [0, 4, 1, 5]
2 | all four edges | no improvement | [0, 4, 1, 5]
3 | all four edges | no improvement | [0, 4, 1, 5]
Where It's Used Today
Internet routing — the RIP (Routing Information Protocol) inside many networks uses Bellman-Ford to compute distance vectors between routers.
Currency arbitrage detection — if a sequence of currency trades produces a negative cycle (free money), Bellman-Ford spots it.
Game economy balancing — quest reward systems where some edges carry refunds or "energy" gains can be analyzed by Bellman-Ford to find exploits.
Constraint solving — many scheduling and timing-analysis problems reduce to shortest paths in graphs that may have negative edge weights.
Robot path planning — when terrain "rewards" exist (downhill segments, charging stations), the path-cost graph has negative edges and Bellman-Ford applies.
When NOT to Use
When all weights are non-negative — Dijkstra runs in O((V+E) log V) versus Bellman-Ford's O(V·E) and is dramatically faster.
When the graph is dense and you need all-pairs distances — Floyd-Warshall is simpler and runs in O(V³), while running Bellman-Ford from every source costs O(V²·E).
When you only need to know whether any path exists — a plain BFS or DFS settles it in linear time without arithmetic.
Common Mistakes
Doing only n - 2 passes (off-by-one) so paths of length n - 1 never finish relaxing.
Skipping the extra n-th pass, leaving negative cycles undetected and reporting bogus finite distances.
Adding dist[u] + w when dist[u] is still infinity, producing arithmetic overflow that looks like a valid update.
Try It with an AI Assistant
short
Relax all graph edges repeatedly to compute shortest paths with negative weights.
behavior
Write a function that, given a list of weighted directed edges, a node count n, and a source node, initializes a distance array to infinity (zero at the source) and then repeats n − 1 times: for every edge (u, v, w), if dist[u] + w is less than dist[v], replace dist[v]. Return the distance array.
Made sorting moderate-sized lists fast in tiny embedded code.
gap ← floor(n / 2)
WHILE gap > 0
    FOR i FROM gap TO n - 1
        t ← a[i]
        j ← i
        WHILE j >= gap AND a[j-gap] > t
            a[j] ← a[j-gap]
            j ← j - gap
        ENDWHILE
        a[j] ← t
    ENDFOR
    gap ← floor(gap / 2)
ENDWHILE
RETURN a
Donald Shell published it in just two pages in Communications of the ACM. The mystery is the gap sequence — Shell's original choice of n/2, n/4, n/8, … is not the best, and 60 years of optimization research has produced better sequences (Sedgewick's, Pratt's, Tokuda's) — but no one has proven the optimal sequence. Shell sort is the most-studied algorithm whose precise complexity is still unknown.
Teaches: Sort distant elements first to reduce disorder
The Idea
Pick a gap, say half the array length. Treat positions 0, gap, 2·gap, … as one little group and sort it with insertion sort. Then 1, 1 + gap, 1 + 2·gap, … as another group, and so on. When every gap-spaced group is sorted, halve the gap and repeat. The final pass uses gap = 1, which is plain insertion sort, but by then the array is almost sorted, so insertion sort flies through it.
Why does this work? A small element trapped at the end of the array would take many one-step swaps to reach the front under plain insertion sort. With a large gap, it leaps most of the distance in a single comparison. Each pass leaves the array more nearly sorted than the last, so the cheap final pass has very little real work to do.
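The gapped insertion sort can be sketched in Python using Shell's original halving sequence. Integer division (//) is what lets the gap reach 0 and end the loop; the function name shell_sort follows the prompt later in this section.

```python
def shell_sort(a):
    """Sort list a in place with gaps n//2, n//4, ..., 1 and return it."""
    n = len(a)
    gap = n // 2
    while gap > 0:
        # Insertion sort over elements that are `gap` positions apart.
        for i in range(gap, n):
            t = a[i]
            j = i
            while j >= gap and a[j - gap] > t:
                a[j] = a[j - gap]  # shift the larger element forward by one gap
                j -= gap
            a[j] = t
        gap //= 2  # the last pass runs with gap = 1, then gap becomes 0
    return a


print(shell_sort([8, 2, 4, 1, 9, 3, 6, 7]))  # [1, 2, 3, 4, 6, 7, 8, 9]
```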
Trace
gap | pass result | what happens
4 | [8, 2, 4, 1, 9, 3, 6, 7] | compare/swap pairs 4 apart
2 | [4, 1, 6, 2, 8, 3, 9, 7] | sort the even-indexed and odd-indexed groups
1 | [1, 2, 3, 4, 6, 7, 8, 9] | final insertion-sort pass on a near-sorted array
0 | stop | gap reached 0, loop ends
Where It's Used Today
Embedded systems — uClibc, a compact C library for embedded Linux systems, implements qsort with Shell sort because it's short, in-place, and needs no extra memory.
Microcontrollers — small devices (smart thermostats, fitness trackers) sort short sensor logs with Shell sort to avoid the recursion stack of quicksort.
Older C libraries — some BSD qsort fallback paths and the bzip2 compressor use Shell sort for medium-sized arrays.
Compiler bootstrapping — early-stage compilers that can't yet allocate memory use Shell sort in their symbol-table routines.
Teaching — every algorithms class still covers it as the bridge between simple O(n²) sorts and clever O(n log n) sorts.
When NOT to Use
When you have plenty of memory and want guaranteed O(n log n) — merge sort or heap sort beat Shell sort on large inputs.
When stability matters (preserving original order of equal keys) — Shell sort is not stable; the long-distance swaps reorder equal elements.
When the list is already nearly sorted and small — plain insertion sort skips the gap overhead and finishes in essentially one pass.
Common Mistakes
Picking a bad gap sequence (e.g., consecutive even numbers) so half the array never compares to the other half — the final pass then has to do all the work.
Letting the inner WHILE step j by 1 instead of gap — that quietly reverts the algorithm to plain insertion sort.
Stopping the outer loop when gap = 1 instead of after running the gap = 1 pass — without that final insertion sort pass the array is not actually sorted.
Try It with an AI Assistant
short
Write shell_sort(a) that sorts a list in place using Shell sort with the gap sequence n/2, n/4, …, 1.
behavior
Write a function that sorts a list in place by repeatedly choosing a gap (start with half the length, then halve it each round), and for each gap performs an insertion sort that compares and shifts elements that are exactly gap positions apart, finishing with a regular pass when the gap is 1.
For instance: Sort a list by picking a pivot and partitioning around it.
arr ← [3, 6, 1, 4, 2]
low ← 0
high ← 4
FUNCTION quickSort(arr, low, high)
    IF low >= high THEN
        RETURN
    ENDIF
    pivot ← arr[high]
    i ← low
    FOR j FROM low TO high - 1
        IF arr[j] <= pivot THEN
            swap(arr[i], arr[j])
            i ← i + 1
        ENDIF
    ENDFOR
    swap(arr[i], arr[high])
    quickSort(arr, low, i - 1)
    quickSort(arr, i + 1, high)
ENDFUNCTION
quickSort(arr, low, high)
RETURN arr
Tony Hoare invented quicksort in 1959 while working as a young exchange student in Moscow on a Russian-English machine translation project. Sorting a long list of words was the bottleneck, and the in-place partition trick (pick a pivot, swing everything smaller to one side and larger to the other, then recurse) beat every alternative he could code on the available hardware. He published it in 1961; more than sixty years on, quicksort and its hybrids still sit at the core of many standard-library sorts.
Teaches: Partition around a pivot, then recurse
Anecdote
Despite its worst-case O(n²) running time, quicksort dominates real systems. Why? Careful pivot choices and randomization make the bad case vanishingly rare, even on simple inputs such as almost-sorted data.
The Idea
Pick one element as the pivot — here we use arr[high], the rightmost element. Walk through the rest with a marker i that tracks the boundary of the "small-or-equal" zone. Every time you find an element ≤ pivot, swap it into the small zone and bump i forward. When the walk ends, swap the pivot itself into position i. Now everything left of i is ≤ pivot, everything right is > pivot, and the pivot itself is in its final sorted spot.
The invariant during partitioning is exactly that: at every step, arr[low..i−1] ≤ pivot < arr[i..j−1]. Once the partition finishes, recurse on the two sides. When the pivot lands near the middle, each level of recursion roughly halves the subproblem size, and since the partition itself is O(n), the average running time is O(n log n), fast enough for everyday sorting on millions of elements.
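The partition-then-recurse scheme can be sketched in Python with the rightmost element as pivot, as in the description above. The name quick_sort and the default arguments are conveniences for this example; a fixed rightmost pivot still degrades to O(n²) on already-sorted input.

```python
def quick_sort(arr, low=0, high=None):
    """Sort arr[low..high] in place (rightmost-element pivot) and return arr."""
    if high is None:
        high = len(arr) - 1
    if low >= high:
        return arr
    pivot = arr[high]
    i = low  # boundary of the "less than or equal to pivot" zone
    for j in range(low, high):
        if arr[j] <= pivot:
            arr[i], arr[j] = arr[j], arr[i]
            i += 1
    # Put the pivot in its final sorted position.
    arr[i], arr[high] = arr[high], arr[i]
    quick_sort(arr, low, i - 1)
    quick_sort(arr, i + 1, high)
    return arr


print(quick_sort([3, 6, 1, 4, 2]))  # [1, 2, 3, 4, 6]
```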
Trace
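j | arr[j] | arr[j] ≤ 2? | i before | swap? | arr after
0 | 3 | no | 0 | no | [3, 6, 1, 4, 2]
1 | 6 | no | 0 | no | [3, 6, 1, 4, 2]
2 | 1 | yes | 0 | swap arr[0], arr[2] | [1, 6, 3, 4, 2]
3 | 4 | no | 1 | no | [1, 6, 3, 4, 2]
end | 2 (pivot) | - | 1 | swap arr[1], arr[4] | [1, 2, 3, 4, 6]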
Where It's Used Today
Standard library sorts — C's qsort, the heart of countless programs since 1979, uses quicksort or a hybrid based on it.
Database query engines — PostgreSQL and MySQL use quicksort variants when sorting query results that fit in memory.
Data processing pipelines — sorting log entries, telemetry events, and analytics records before aggregation.
3D graphics — sorting transparent polygons by depth before rendering, so far-away objects draw before near ones.
Computational genomics — sorting genomic intervals or read positions, where billions of small records need ordering quickly.
When NOT to Use
When you must guarantee O(n log n) worst-case time (real-time or adversarial inputs) — use merge sort or heap sort instead.
When you need a stable sort that preserves the relative order of equal keys — quicksort is not stable; merge sort is.
When sorting linked lists or external/disk-resident data — partition-in-place breaks down without random access; merge sort fits better.
Common Mistakes
Always picking arr[0] or arr[high] as the pivot — already-sorted input then degrades to O(n²); use a random or median-of-three pivot.
Using < instead of <= in the partition (or vice versa) and creating an empty side, then recursing on the same range and looping forever.
Recursing on the wrong subranges — quickSort(arr, low, i) instead of quickSort(arr, low, i - 1) reprocesses the pivot endlessly.
Try It with an AI Assistant
short
Partition array around pivot and recursively sort smaller subarrays.
behavior
Write a recursive function that, given an array slice between two indices, picks the rightmost element as a pivot, walks through the slice moving every element less than or equal to the pivot to the front, then puts the pivot just after that front block, and finally recurses on the slice before the pivot and the slice after it.
Made shortest routes in weighted networks practical.
For instance: A GPS can find the fastest route when roads have different travel times.
graph ← {A: [(B,1), (C,4)], B: [(C,2), (D,5)], C: [(D,1)], D: []}
source ← A
dist ← map of every node → ∞
dist[source] ← 0
pq ← priority queue containing (0, source)
WHILE pq is NOT empty
    (d, node) ← extract_min(pq)
    IF d > dist[node] THEN CONTINUE ENDIF
    FOR EACH (neighbor, weight) IN graph[node]
        nd ← d + weight
        IF nd < dist[neighbor] THEN
            dist[neighbor] ← nd
            insert (nd, neighbor) into pq
        ENDIF
    ENDFOR
ENDWHILE
RETURN dist
In 1956, Edsger Dijkstra was a 26-year-old programmer at the Mathematisch Centrum in Amsterdam, asked to demonstrate the new ARMAC computer at its public unveiling. To make the demo intuitive, he picked the question "what's the shortest way to drive from Rotterdam to Groningen?" and designed the algorithm in about twenty minutes over coffee with his fiancée Maria. He didn't bother to publish it for three years — he thought it was too simple to be worth a paper — yet it became one of the most-cited results in computing.
Teaches: Expand closest nodes first to find shortest paths
Anecdote
Edsger Dijkstra refused to use cryptic variables like x, y. He published his work with clear, almost prose-like notation, because he believed code should be read by humans, not just executed by machines.
The Idea
Keep a tentative best-known distance to every node. The source starts at 0, every other node starts at ∞. Then, repeatedly, pick the unsettled node with the smallest tentative distance and "settle" it — its distance is now final. From that newly settled node, look at every neighbor and update their tentative distance if going through this node is cheaper.
A priority queue makes "pick the smallest" fast. Why does this work? Because when you settle the closest unsettled node, no future path can do better — all unsettled nodes are at least that far away, and edge weights are non-negative, so taking a detour can only add cost. That's the invariant: at every settle step, the distance picked is provably the true shortest. Dijkstra's algorithm fails if you allow negative edge weights, because then a detour might subtract.
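The settle-and-relax loop above can be sketched with Python's standard heapq module. This is one possible rendering, assuming the graph is a dict that maps each node to a list of (neighbor, weight) pairs with non-negative weights; the function name dijkstra is chosen for the example.

```python
import heapq
import math


def dijkstra(graph, source):
    """Return a dict of shortest distances from source to every node in graph."""
    dist = {node: math.inf for node in graph}
    dist[source] = 0
    pq = [(0, source)]  # min-heap of (tentative distance, node)
    while pq:
        d, node = heapq.heappop(pq)
        if d > dist[node]:
            continue  # stale entry: node was already settled with a smaller distance
        for neighbor, weight in graph[node]:
            nd = d + weight
            if nd < dist[neighbor]:
                dist[neighbor] = nd
                heapq.heappush(pq, (nd, neighbor))
    return dist


graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
print(dijkstra(graph, "A"))  # {'A': 0, 'B': 1, 'C': 3, 'D': 4}
```

Because heapq has no decrease-key operation, the sketch pushes a fresh entry whenever a distance improves and skips outdated entries when they surface, which is the same stale-entry check the pseudocode uses.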
Trace
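| step | extracted (d, node) | dist[A] | dist[B] | dist[C] | dist[D] | updates |
|---|---|---|---|---|---|---|
| 0 | start | 0 | ∞ | ∞ | ∞ | pq = [(0, A)] |
| 1 | (0, A) | 0 | 1 | 4 | ∞ | relax A→B, A→C |
| 2 | (1, B) | 0 | 1 | 3 | 6 | relax B→C (1+2 < 4), B→D |
| 3 | (3, C) | 0 | 1 | 3 | 4 | relax C→D (3+1 < 6) |
| 4 | (4, D) | 0 | 1 | 3 | 4 | no neighbors to improve |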
Where It's Used Today
GPS navigation — Google Maps, Waze, and your phone's directions app all run a variation of Dijkstra (often A*) to compute the fastest route through a road network.
Internet routing — link-state routing protocols like OSPF use Dijkstra to determine how packets should hop between routers.
Game AI pathfinding — characters in strategy and role-playing games use Dijkstra (or A*, its descendant) to walk around obstacles.
Logistics and shipping — package delivery, ride-sharing, and trucking software route vehicles by shortest weighted path.
Network reliability analysis — finding the cheapest way to reach a destination in a power grid, water network, or telecom backbone.
When NOT to Use
When any edge weight is negative — Dijkstra's "settled is final" invariant breaks because a later detour through a negative edge can beat a settled distance; use Bellman-Ford instead.
When all edge weights are equal — plain BFS finds shortest paths in O(V + E) without the priority queue overhead.
When you need shortest paths between every pair of nodes on a small dense graph (running binary-heap Dijkstra from each source costs O(V · E log V), which is O(V^3 log V) when the graph is dense); Floyd-Warshall's O(V^3) is simpler and competitive.
Common Mistakes
Marking a node visited the moment it's pushed onto the priority queue rather than when it's popped — stale longer-distance entries then get accepted as final.
Using a regular queue or stack instead of a priority queue — you no longer extract the minimum, so the settled-is-correct invariant collapses and answers become wrong.
Forgetting to skip stale entries (if d > dist[node]: continue) when the same node appears multiple times in the heap — you re-relax neighbors unnecessarily and slow the algorithm dramatically.
Try It with an AI Assistant
short
Find shortest paths from source node using priority queue and greedy distance updates.
behavior
Write a function that takes a weighted graph and a starting node, and returns the smallest total weight needed to reach every other node. Keep a tentative-distance table starting at 0 for the source and infinity for everyone else. Use a priority queue to repeatedly pull out the unsettled node with the smallest tentative distance, then for each of its neighbors check whether going through this node would lower their tentative distance. Stop when the queue is empty.
Made shortest steps in unweighted networks easy to find.
For instance: Find the fewest moves from one word to another in a word ladder.
graph ← {A: [B, C], B: [A, D], C: [A, D, E], D: [B, C], E: [C]}
start ← A
visited ← set containing start
queue ← [start]
WHILE queue NOT empty
    node ← dequeue(queue)
    FOR EACH neighbor IN graph[node]
        IF neighbor NOT IN visited THEN
            add neighbor TO visited
            enqueue(queue, neighbor)
        ENDIF
    ENDFOR
ENDWHILE
RETURN visited
BFS mirrors ripple expansion in water, visiting all nearby nodes before moving farther away. Edward Moore's 1959 paper at Bell Labs framed it as a maze-running procedure for relay-circuit "robots," but the same shape had already been devised by Konrad Zuse in the 1940s, and C.Y. Lee arrived at it independently in 1961 for routing wires on printed circuit boards. The pattern is so fundamental that today social-network "degrees of separation" features, web-crawler frontiers, and shortest-path solvers for unweighted graphs all build on BFS.
Teaches: Explore layer by layer outward from the start
Anecdote
Edward F. Moore used BFS to solve maze navigation for robots. The algorithm was born from the question: how should a machine explore space layer by layer without getting lost?
The Idea
BFS uses a queue — a first-in-first-out line, like waiting at a bakery counter. Add the starting node. Then repeatedly: take the front node off the queue, look at its neighbors, and add any unseen neighbor to the back of the queue. A visited set keeps us from going in circles.
Why does this guarantee layer-by-layer order? Because the queue is FIFO. The starting node enters first, so it's processed first. Its neighbors enter next, so they're all processed before any of their neighbors enter. The front of the queue always holds nodes at the smallest unprocessed distance. That invariant is what gives BFS its other superpower: when used to find a path from start to goal in an unweighted graph, the first time you dequeue the goal, you've found a shortest path in number of edges.
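The shortest-path property described above can be sketched by recording each node's layer as it is discovered. This is a minimal illustration, assuming the graph is a dict of adjacency lists; the name bfs_distances is chosen for the example, and the dist dict doubles as the visited set.

```python
from collections import deque


def bfs_distances(graph, start):
    """Return the number of edges on a shortest path from start to each reachable node."""
    dist = {start: 0}       # also serves as the visited set
    queue = deque([start])  # FIFO queue: popleft() is O(1)
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:  # mark on enqueue, not on dequeue
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D", "E"],
    "D": ["B", "C"],
    "E": ["C"],
}
print(bfs_distances(graph, "A"))  # {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 2}
```

Since nodes leave the queue in non-decreasing distance order, the first distance written for a node is never improved later, which is why no relaxation step is needed.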
Trace
| step | dequeue node | neighbors of node | queue after | visited |
|---|---|---|---|---|
| 0 | (start) | - | [A] | {A} |
| 1 | A | B, C | [B, C] | {A, B, C} |
| 2 | B | A (seen), D | [C, D] | {A, B, C, D} |
| 3 | C | A, D (seen), E | [D, E] | {A, B, C, D, E} |
| 4 | D | B, C (both seen) | [E] | {A, B, C, D, E} |
| 5 | E | C (seen) | [] | {A, B, C, D, E} |
Where It's Used Today
Social-network "degrees of separation" — LinkedIn shows whether someone is a 1st, 2nd, or 3rd-degree connection by running BFS on the friendship graph.
GPS and maze routing — when all moves cost the same (one square, one step), BFS finds the shortest route, exactly as Edward Moore originally used it for robots.
Web crawlers — Google's earliest crawler walked the web layer by layer from a seed of URLs, BFS-style, so popular pages were indexed first.
Compilers and dependency tools — npm, pip, and build systems use BFS-style traversal to expand a package's dependency tree level by level.
Puzzle solvers — Rubik's cube and 15-puzzle solvers use BFS to find the minimum number of moves between two configurations.
When NOT to Use
When edges have differing weights and you want the cheapest path — BFS counts edges, not costs; use Dijkstra.
When you want to detect cycles, find topological order, or explore a tree's full depth — DFS is the natural fit and uses less memory on long, narrow graphs.
When the graph is enormous and you only need to confirm reachability between two specific nodes — bidirectional search can cut the work dramatically.
Common Mistakes
Marking a node as visited only when it's dequeued instead of when it's enqueued — the same node enters the queue many times and the work blows up.
Using a Python list with pop(0) instead of collections.deque — the dequeue is O(n) and BFS becomes quadratic on large graphs.
Treating the first path discovered to a node as the shortest in a weighted graph — BFS guarantees fewest edges, not lowest weight.
Try It with an AI Assistant
short
Write bfs(graph, start) that returns the visit order using level-by-level exploration with a queue and visited set.
behavior
Write a function that, given a graph and a starting node, visits every reachable node by maintaining a list of nodes still to process. Take a node off the front of the list, look at its neighbors, and add any never-seen neighbor to the back of the list. Keep a set of seen nodes so nothing is processed twice. Return the set of all visited nodes.
For instance: Autocomplete can find all words starting with “pre”.
// insert word
node ← root
FOR EACH c IN word
    IF c NOT IN node.children THEN
        node.children[c] ← new trie node
    ENDIF
    node ← node.children[c]
ENDFOR
node.END ← TRUE

// look up word
node ← root
FOR EACH c IN word
    IF c NOT IN node.children THEN RETURN FALSE ENDIF
    node ← node.children[c]
ENDFOR
RETURN node.END
Edward Fredkin introduced the trie in 1960 while at BBN (Bolt, Beranek and Newman) in Cambridge, Massachusetts, the same lab that would later help build ARPANET. He coined the name from "retrieval" — and then promptly told everyone to pronounce it "tree," a pronunciation joke that has been confusing students ever since. Fredkin's paper showed that a tree of single-letter edges turned an O(dictionary-size) word lookup into an O(word-length) walk, opening the door to fast spell-check and the autocomplete features your phone keyboard now uses on every keystroke.
Teaches: Index by prefixes to share common beginnings
Anecdote
Edward Fredkin pronounced it "tree", not "try." The spelling comes from "retrieval," but the pronunciation joke stuck — and still confuses students today.
The Idea
Each node in the trie holds a small map from letter to child node, plus a flag (call it END) marking whether some word ends right there. To insert a word, walk down from the root: if the next letter has a child, follow it; if not, create a new child. When you finish the word, set END = TRUE on the last node. To look up a word, do the same walk but never create new nodes — if any letter is missing, the word isn't there; if you reach the end, return whatever END says.
Why does it work? The invariant is that the path from the root to any node spells out exactly that node's prefix. Words that share a prefix share a path; they only fork when they actually differ. That's why prefix queries (like "all words starting with pre") are simply "walk to the node for pre, then list everything underneath."
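Insert, lookup, and the "walk to the prefix, then list everything underneath" query can be sketched as a small Python class. The class and method names (TrieNode, Trie, contains, with_prefix) are illustrative choices for this example, not a reference implementation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # a fresh dict per node: letter -> child TrieNode
        self.end = False    # True if some inserted word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for c in word:
            if c not in node.children:
                node.children[c] = TrieNode()
            node = node.children[c]
        node.end = True

    def _walk(self, text):
        """Follow text from the root; return the node reached, or None."""
        node = self.root
        for c in text:
            node = node.children.get(c)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.end

    def with_prefix(self, prefix):
        """Return every stored word that starts with prefix, sorted."""
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, spelled = stack.pop()
            if node.end:
                found.append(spelled)
            for c, child in node.children.items():
                stack.append((child, spelled + c))
        return sorted(found)


trie = Trie()
for w in ["cat", "car", "cart"]:
    trie.insert(w)
print(trie.contains("car"), trie.contains("cab"), trie.contains("ca"))  # True False False
print(trie.with_prefix("ca"))  # ['car', 'cart', 'cat']
```

The lookup of "ca" returns False even though the path exists, because no inserted word set the end flag on that node.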
Trace
| step | action | node we are on | what happens |
|---|---|---|---|
| 1 | insert c | root → c | c not in root.children → create node, descend |
| 2 | insert a | c → a | a not in c.children → create node, descend |
| 3 | insert t | a → t | t not in a.children → create node, descend |
| 4 | end of "cat" | t | set t.END = TRUE |
| 5 | insert c | root → c | c already exists → descend |
| 6 | insert a | c → a | a already exists → descend |
| 7 | insert r | a → r | r not in a.children → create, descend |
| 8 | end of "car" | r | set r.END = TRUE |
| 9 | lookup "car" | root → c → a → r | r.END = TRUE → returns TRUE |
| 10 | lookup "cab" | root → c → a → ? | b not in a.children → returns FALSE |
Where It's Used Today
Autocomplete — phone keyboards and search bars store the dictionary as a trie so they can list completions for whatever you've typed in microseconds.
Spell checkers — word processors use tries to confirm a word exists, and to suggest near-matches by exploring nearby paths.
IP routing — routers store the prefixes of every IP block in a trie (called a radix trie) so they can match an incoming packet to the right outgoing route.
Text editors and code completion — IDEs use tries (or related structures) to surface variable and function names as you type their first few characters.
DNA pattern search — bioinformatics tools store genome substrings in tries to find every occurrence of a pattern across long sequences.
When NOT to Use
When you only need exact-match lookup with no prefix queries — a hash set is simpler, faster, and uses far less memory.
When the alphabet is huge (e.g. full Unicode) and the dataset is small — each node carries a sparse map and the per-node overhead dwarfs the actual data.
When memory is tight relative to dictionary size — a trie can use many times the bytes of the raw word list; a sorted array with binary search may be better.
Common Mistakes
Forgetting the END flag — without it, "car" looks present whenever "cart" is inserted, because the path exists.
Using a fixed 26-slot array per node for case-insensitive ASCII, then crashing the moment a digit, hyphen, or non-Latin character arrives.
Sharing a single child-map object across nodes by accident (a Python default-argument bug), so every insert mutates the same map.
Try It with an AI Assistant
short
Store words character-by-character in prefix tree for fast lookup.
behavior
Build a tree where each node holds a small map from a single character to a child node and a flag marking the end of a word. To insert a word, walk character by character from the root, creating new children when needed, and set the end flag on the final node. To look up a word, walk the same path; if any character is missing, return false; otherwise return the end flag.
Made finding the k-th smallest value fast without sorting.
FUNCTION quickselect(a, lo, hi, k)
    IF lo = hi THEN
        RETURN a[lo]
    ENDIF
    p ← partition(a, lo, hi)
    IF k = p THEN
        RETURN a[p]
    ENDIF
    IF k < p THEN
        RETURN quickselect(a, lo, p-1, k)
    ELSE
        RETURN quickselect(a, p+1, hi, k)
    ENDIF
END FUNCTION
Tony Hoare created it as a side-effect of Quicksort. He realized you don't need to sort everything — just recurse into one side. A classic case of an optimization becoming a separate algorithm.
Teaches: Find order statistics without fully sorting
The Idea
Borrow the partition step from Quicksort: pick one element as the pivot, then rearrange the list so everything smaller sits to the left of the pivot and everything larger sits to the right. After this, the pivot is in its final sorted position — at index p.
Now compare p to your target k. If k = p, you're done — a[p] is the answer. If k < p, the answer must be somewhere in the left half, so recurse there and ignore the right half entirely. If k > p, recurse only on the right half. Each call throws away roughly half the remaining elements, so the total work averages out to linear time — much faster than sorting the whole list, which costs O(n log n).
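The "partition, then keep only one side" loop can be sketched in Python. This version is iterative rather than recursive and uses a random pivot, which is an assumption added here to avoid the sorted-input worst case; the function name quickselect and the 0-indexed k follow the text.

```python
import random


def quickselect(a, k):
    """Return the k-th smallest element (0-indexed) of a. Reorders a in place."""
    if not 0 <= k < len(a):
        raise IndexError("k out of range")
    lo, hi = 0, len(a) - 1
    while True:
        if lo == hi:
            return a[lo]
        # Random pivot (an assumption of this sketch), moved to the end of the range.
        r = random.randint(lo, hi)
        a[r], a[hi] = a[hi], a[r]
        pivot = a[hi]
        p = lo  # boundary: a[lo..p-1] holds the elements < pivot
        for j in range(lo, hi):
            if a[j] < pivot:
                a[p], a[j] = a[j], a[p]
                p += 1
        a[p], a[hi] = a[hi], a[p]  # pivot lands at its final sorted index p
        if k == p:
            return a[p]
        if k < p:
            hi = p - 1  # answer is in the left part; discard the right
        else:
            lo = p + 1  # answer is in the right part; discard the left


values = [7, 2, 9, 4, 1, 6, 3]
print(quickselect(list(values), 2))  # 3
```

The call passes a copy because the function rearranges its input; if the caller does not need the original order, the copy can be skipped.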
Trace
| call | a (after partition) | p | decision |
|---|---|---|---|
| quickselect(a, 0, 6, 2) | [2, 1, 3, 4, 9, 6, 7] | 3 | k=2 < p=3 → left |
| quickselect(a, 0, 2, 2) | [1, 2, 3] | 1 | k=2 > p=1 → right |
| quickselect(a, 2, 2, 2) | [3] | - | lo = hi → return a[2] |
Where It's Used Today
Computing medians — statisticians and database engines find the median of a column without sorting it.
Top-k queries — search engines pull the top 10 results from millions of candidates, never bothering to sort the rest.
Percentiles in monitoring — server dashboards compute the 99th-percentile latency over millions of requests in real time.
Image processing — median filters (which remove salt-and-pepper noise) call quickselect on each pixel's neighborhood.
Standard library nth_element — C++'s STL ships std::nth_element, a quickselect variant used everywhere from games to compilers.
When NOT to Use
When you need many order statistics (the 10th, 20th, 30th… percentile) — sorting once is cheaper than running quickselect repeatedly.
When you need a hard worst-case time bound — adversarial input drives quickselect to O(n^2); use median-of-medians or introselect.
When the data lives on disk or in a stream — partition-in-place needs random access, so a heap-based top-k or reservoir method fits better.
Common Mistakes
Always picking the first element as pivot — already-sorted input degrades to O(n^2); use random or median-of-three pivots.
Recursing into both halves like Quicksort — defeats the whole point; only recurse into the side containing k.
Off-by-one in the index check — confusing 0-indexed k with 1-indexed rank silently returns the neighbor instead of the target.
Try It with an AI Assistant
short
Write quickselect(a, k) returning the k-th smallest element (0-indexed) of an unsorted list in average O(n).
behavior
Write a function that, given a list and an index k, picks a pivot, rearranges the list so values smaller than the pivot come first and larger ones come last, then recurses only into the side that contains position k. Return the value that ends up at index k.
It made expression parsing practical for calculators, compilers, spreadsheets, and interpreters.
output ← []
stack ← []
FOR EACH token t
    IF t is number THEN
        output.append(t)
    ELIF t is operator
        WHILE stack NOT empty AND top != '(' AND prec(top) >= prec(t)
            output.append(stack.pop())
        ENDWHILE
        stack.push(t)
    ELIF t = '('
        stack.push(t)
    ELIF t = ')'
        WHILE top != '('
            output.append(stack.pop())
        ENDWHILE
        stack.pop()    // discard the matching '('
    ENDIF
ENDFOR
WHILE stack NOT empty
    output.append(stack.pop())
ENDWHILE
RETURN output
Edsger Dijkstra needed a clean way for computers to understand ordinary mathematical expressions such as 3 + 4 × 5. Humans write infix notation, but machines prefer a stricter order. The railroad-yard metaphor fits: operators wait on a stack like train cars waiting to be routed.
Teaches: A stack transforms notation without full parsing
The Idea
Read the input tokens left to right. Numbers go straight to the output queue — they're already in the right place. Operators go onto a stack, but before pushing a new operator, first pop off any operators on the stack that have equal or higher precedence and append them to the output. Parentheses are special: ( always goes on the stack as a marker, and ) pops everything back to the matching (, discarding both parens. When all input is read, drain the remaining stack into the output.
Why does it work? The stack remembers operators that are "waiting for their right operand." The precedence rule guarantees that when an operator finally gets emitted, both of its operands have already been emitted ahead of it, which is exactly what postfix demands. The invariant: the output queue is always a prefix of the final postfix expression, and between any two parentheses the stack holds operators in strictly increasing precedence order from bottom to top (popping on equal precedence is what rules out ties).
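The token loop can be sketched in Python as follows. The precedence table and the set of right-associative operators are assumptions made for this example (the text only requires that each operator has a precedence), and the function name to_postfix is illustrative. Tokens are assumed to be pre-split strings.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2, "^": 3}  # assumed operator table
RIGHT_ASSOCIATIVE = {"^"}                              # assumed: only ^ groups right-to-left


def to_postfix(tokens):
    """Convert a list of infix tokens to a list of postfix tokens."""
    output, stack = [], []
    for t in tokens:
        if t in PRECEDENCE:
            # Pop operators that must be applied before t; stop at a '(' marker.
            while stack and stack[-1] != "(" and (
                PRECEDENCE[stack[-1]] > PRECEDENCE[t]
                or (PRECEDENCE[stack[-1]] == PRECEDENCE[t]
                    and t not in RIGHT_ASSOCIATIVE)
            ):
                output.append(stack.pop())
            stack.append(t)
        elif t == "(":
            stack.append(t)
        elif t == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced ')'")
            stack.pop()  # discard the matching '('
        else:
            output.append(t)  # anything else is treated as a number/operand
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced '('")
        output.append(op)
    return output


print(to_postfix(["3", "+", "4", "*", "5"]))            # ['3', '4', '5', '*', '+']
print(to_postfix(["(", "3", "+", "4", ")", "*", "5"]))  # ['3', '4', '+', '5', '*']
```

For left-associative operators the loop pops on equal precedence, so 8 - 3 - 2 becomes 8 3 - 2 -; for the right-associative ^ it pops only on strictly greater precedence, so 2 ^ 3 ^ 2 becomes 2 3 2 ^ ^.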
Trace
| step | token | action | stack | output |
|---|---|---|---|---|
| 0 | 3 | number → output | [] | [3] |
| 1 | + | stack empty → push | [+] | [3] |
| 2 | 4 | number → output | [+] | [3, 4] |
| 3 | * | prec(+) < prec(*), don't pop → push | [+, *] | [3, 4] |
| 4 | 5 | number → output | [+, *] | [3, 4, 5] |
| 5 | end | drain stack: pop *, then + | [] | [3, 4, 5, *, +] |
Where It's Used Today
Spreadsheet formulas — Excel, Google Sheets, and LibreOffice Calc all parse =A1 + B1 * 2 with a shunting-yard variant.
Pocket calculators — every scientific calculator with parentheses uses this idea internally to handle precedence.
Compiler front-ends — many small expression parsers in compilers and interpreters skip a full grammar parse and use shunting-yard for the arithmetic part.
Database query engines — SQL WHERE clauses with mixed AND/OR/NOT are converted to postfix for evaluation.
Custom DSLs and rule engines — when you need a tiny expression language for a config file, shunting-yard is the smallest correct parser.
When NOT to Use
When the grammar has function calls, ternaries, or unary minus — these need extra rules (or a Pratt parser) on top; pure shunting-yard handles only binary operators and parens.
When you already need a full AST for type checking or codegen — go straight to a recursive-descent parser; postfix is a poor intermediate form for those passes.
When the input isn't pre-tokenized — shunting-yard assumes tokens; you still need a separate lexer for the raw character stream.
Common Mistakes
Treating right-associative operators (like ^) the same as left-associative ones — should pop only on strictly greater precedence, not equal.
Forgetting to drain the stack after the input ends, leaving operators sitting on the stack and missing from the output.
Failing to discard the matching ( after popping to it on ), leaving stray parens that break later precedence checks.
Try It with an AI Assistant
short
Write shunting_yard_algorithm(tokens) implementing the Shunting-Yard Algorithm to convert an infix expression to postfix.
behavior
Write a function that takes a list of tokens — numbers, operators, and parentheses — in standard mathematical (infix) order, and returns the same expression in postfix order. Use a stack to hold operators. Send numbers straight to the output. When you see a new operator, first pop any operators on top of the stack that have equal or higher precedence. Treat ( and ) as a delimited group.
For instance: Find the longest period of increasing performance inside noisy scores.
tails ← empty list
FOR EACH x IN arr
    i ← lowerBound(tails, x)
    IF i = length(tails) THEN
        APPEND x TO tails
    ELSE
        tails[i] ← x
    ENDIF
ENDFOR
RETURN length(tails)
In 1961, mathematician Craige Schensted noticed that the simple solitaire game Patience — deal cards left to right onto sorted piles — secretly counts the longest increasing subsequence of the deck. The procedure he wrote up turned out to encode a deep bijection between sequences and pairs of Young tableaux, now called the Robinson-Schensted-Knuth correspondence, which became foundational in combinatorics. The same idea was later sharpened with binary search to produce the O(n log n) LIS algorithm used in scheduling, diff tools, and DNA aligners today.
Teaches: Track optimal endings to extend sequences efficiently
Anecdote
Craige Schensted developed it via a card game called Patience (Solitaire) — the algorithm is exactly what you'd do if you were trying to play patience with the longest possible run. The mathematical structure he discovered (now called the Robinson-Schensted-Knuth correspondence) connects sorting, combinatorics, and representation theory. A card game led to a foundational bijection in modern combinatorics.
The Idea
Imagine playing the card game Patience: you sweep through the array, and for each value x you place it on the leftmost pile whose top card is ≥ x (start a new pile if no such pile exists). The number of piles at the end equals the length of the longest increasing subsequence.
The algorithm keeps a list tails where tails[k] is the smallest possible last value of any increasing subsequence of length k+1 seen so far. For each new x, find the first slot i whose value is ≥ x (binary search; that's lowerBound). If i is past the end, x extends the longest run; append it. Otherwise, x is an ending at least as good (no larger) for runs of length i+1, so replace tails[i]. This invariant (tails is sorted, and entry k is the best possible ending for that length) is what lets binary search replace a linear scan, cutting the running time from O(n²) to O(n log n).
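The tails update can be sketched with Python's standard bisect module, where bisect_left plays the role of lowerBound. The function name lis_length is chosen for the example; this version counts strictly increasing subsequences.

```python
from bisect import bisect_left


def lis_length(arr):
    """Return the length of the longest strictly increasing subsequence of arr."""
    tails = []  # tails[k] = smallest last value of any increasing run of length k+1
    for x in arr:
        i = bisect_left(tails, x)  # first slot whose value is >= x
        if i == len(tails):
            tails.append(x)        # x extends the longest run seen so far
        else:
            tails[i] = x           # x is a smaller-or-equal ending for length i+1
    return len(tails)


print(lis_length([10, 9, 2, 5, 3, 7, 101, 18]))  # 4
```

Swapping bisect_left for bisect_right would count non-decreasing subsequences instead, since equal values would then extend a run rather than replace its ending.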
Trace
| step | x | i = lowerBound(tails, x) | action | tails after |
|---|---|---|---|---|
| 1 | 10 | 0 (empty) | append | [10] |
| 2 | 9 | 0 (10 ≥ 9) | replace tails[0] = 9 | [9] |
| 3 | 2 | 0 (9 ≥ 2) | replace tails[0] = 2 | [2] |
| 4 | 5 | 1 (past end) | append | [2, 5] |
| 5 | 3 | 1 (5 ≥ 3) | replace tails[1] = 3 | [2, 3] |
| 6 | 7 | 2 (past end) | append | [2, 3, 7] |
| 7 | 101 | 3 (past end) | append | [2, 3, 7, 101] |
| 8 | 18 | 3 (101 ≥ 18) | replace tails[3] = 18 | [2, 3, 7, 18] |
Where It's Used Today
Stock analysis — finding the longest stretch of monotonically rising closing prices in a noisy time series.
Aircraft scheduling — assigning planes to landings so that no two paths cross requires LIS-like reasoning on arrival times.
DNA and protein analysis — locating regions where one strand stays "in order" with another (a building block of bioinformatics aligners).
Cable-pulling and circuit routing — minimizing wire crossings reduces to finding the LIS of a permutation.
Diff tools — diff and git diff find a longest common subsequence between file versions; LIS is a key subroutine in that family.
When NOT to Use
When you need the actual subsequence, not just its length — the tails array isn't a real subsequence, so you must store back-pointers.
When the sequence must be contiguous (a substring or window) — that's a different problem with a simpler scan.
When ties matter and you want "non-decreasing" instead of "strictly increasing" — change lowerBound to upperBound or you'll undercount.
Common Mistakes
Reading the final tails array as the answer subsequence — its values are best endings, not consecutive picks.
Using upperBound when the problem says strictly increasing, which permits duplicates and overcounts.
Falling back to the O(n²) dp[i] = max(dp[j]+1) version on long inputs, then timing out instead of using the binary-search variant.
Try It with an AI Assistant
short
Write lis(a) returning the length of the longest strictly increasing subsequence of list a.
behavior
Walk through a list of numbers left to right. Keep a sorted side-list tails. For each new number x, find the leftmost slot in tails whose value is ≥ x. If there is no such slot, append x; otherwise replace that slot with x. After all numbers are processed, return the length of tails.
For instance: Solve small traveling-salesman problems using visited-city bits.
n ← 3
dist ← [[0, 1, 3], [1, 0, 2], [3, 2, 0]]
ALL_VISITED ← (1 << n) - 1
memo ← empty map
FUNCTION tsp(mask, pos)
    IF mask = ALL_VISITED THEN
        RETURN dist[pos][0]
    ENDIF
    IF memo[mask][pos] exists THEN
        RETURN memo[mask][pos]
    ENDIF
    ans ← ∞
    FOR city FROM 0 TO n - 1
        IF mask does NOT contain city THEN
            ans ← min(ans, dist[pos][city] + tsp(mask | (1 << city), city))
        ENDIF
    ENDFOR
    memo[mask][pos] ← ans
    RETURN ans
END FUNCTION
RETURN tsp(1, 0)
In 1962, Michael Held and Richard Karp at IBM noticed that the Traveling Salesman Problem, which naively means exhaustive search over n! orderings, could be reorganized so that all tours sharing the same visited set and current city were solved only once. By encoding "visited set" as the bits of a single integer, the work collapses from factorial to roughly n² · 2ⁿ. It was one of the first exact TSP algorithms with a proven running time far below brute force, and it remains a standard exact method for n up to about 20: small but useful for vehicle routing, drilling schedules, and contest problems.
Teaches: Encode subsets as bits to memoize over exponential states
The Idea
Encode "the set of cities already visited" as the bits of a single integer mask. Bit i set means city i has been visited. The function tsp(mask, pos) returns the cheapest way to visit the remaining cities starting from pos and ending back at city 0. With n cities there are 2ⁿ possible masks and n possible positions, so a memo table of size 2ⁿ · n is enough to store every subproblem and each is solved once.
The recurrence reads: if every city is visited, return the trip home; otherwise, try every unvisited city as the next stop, recurse on the smaller problem, and keep the minimum. This works because the cost of the rest of the tour depends only on which cities remain and where you are now, not on the order in which the earlier cities were visited. Memoization saves repeated work, turning a factorial blow-up into roughly n² · 2ⁿ steps.
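The recurrence can be sketched in Python with functools.lru_cache standing in for the memo table. The wrapper name tsp_cost is chosen for this example; the inner tsp(mask, pos) mirrors the state described in the text, with city 0 as the fixed start.

```python
from functools import lru_cache


def tsp_cost(dist):
    """Return the cost of the cheapest tour that starts and ends at city 0."""
    n = len(dist)
    all_visited = (1 << n) - 1

    @lru_cache(maxsize=None)  # caches one result per (mask, pos) state
    def tsp(mask, pos):
        if mask == all_visited:
            return dist[pos][0]  # every city seen: pay for the trip home
        best = float("inf")
        for city in range(n):
            if not mask & (1 << city):  # test: is city still unvisited?
                best = min(
                    best,
                    dist[pos][city] + tsp(mask | (1 << city), city),  # set its bit
                )
        return best

    return tsp(1, 0)  # mask = 1: city 0 is already visited


print(tsp_cost([[0, 1, 3], [1, 0, 2], [3, 2, 0]]))  # 6
```

The visited test uses & while the recursive call uses |; keeping those two roles separate is what lets every branch make progress toward the all-visited base case.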
Trace
| call | mask | pos | tries city | cost so far | result |
|---|---|---|---|---|---|
| tsp(001, 0) | 001 | 0 | 1 | 1 + tsp(011,1) | 6 |
| └ tsp(011, 1) | 011 | 1 | 2 | 2 + tsp(111,2) | 5 |
| └└ tsp(111, 2) | 111 | 2 | (all set) | dist[2][0] = 3 | 3 |
| tsp(001, 0) | 001 | 0 | 2 | 3 + tsp(101,2) | 6 (tie) |
| └ tsp(101, 2) | 101 | 2 | 1 | 2 + tsp(111,1) | 3 |
| └└ tsp(111, 1) | 111 | 1 | (all set) | dist[1][0] = 1 | 1 |
Where It's Used Today
Vehicle routing — small delivery and service-truck schedules where the optimum tour really matters and the city count is modest.
PCB drilling — laying out the order in which a circuit-board drill visits hole positions to minimize travel.
Genome assembly — variant problems (shortest superstring) reduce to TSP on small fragment sets.
Competitive programming — bitmask DP is a textbook trick for any "visit a subset, choose a permutation" problem with n ≤ 20.
Scheduling with prerequisites — picking an order over a small set of tasks where pairwise switch costs differ.
When NOT to Use
When n exceeds about 20 — the 2ⁿ · n memo table no longer fits in memory and the runtime explodes.
When you only need a "good enough" tour for hundreds of cities — heuristics like 2-opt, Lin-Kernighan, or Christofides scale far better.
When the cost between cities depends on path history (fatigue, time-of-day pricing) — the subset-only state breaks the recurrence's correctness.
Common Mistakes
Starting with mask = 0 instead of mask = 1 (start city already visited), so city 0 gets revisited.
Forgetting to add dist[pos][0] at the base case — the algorithm finds the cheapest path, not the cheapest tour.
Mixing up the visited test mask & (1 << city) with the set operation mask | (1 << city): testing with | makes every city look visited, and recursing with & never marks the new city, so the recursion may never reach the base case.
Try It with an AI Assistant
short
Write tsp(mask, pos) returning optimal TSP tour cost using bitmask DP with memoization.
behavior
Use a recursive function whose state is a set of already-visited cities (encoded as the bits of an integer) and the current city. If every city is visited, return the distance back to the start. Otherwise, try every unvisited city, add the distance to it, recurse with that city marked visited, and return the minimum total. Cache each result keyed on the (set, current city) pair.
Made drawing straight lines on integer grids fast and exact.
dx ← x1 - x0; dy ← y1 - y0
err ← 2*dy - dx
y ← y0
FOR x FROM x0 TO x1
    plot(x, y)
    IF err > 0 THEN
        y ← y + 1
        err ← err - 2*dx
    ENDIF
    err ← err + 2*dy
ENDFOR
// each plot(x, y) call lights one pixel; the loop emits the line
Jack Bresenham developed the algorithm in 1962 while at IBM in San Jose, where he was working on driving a Calcomp digital plotter, a pen machine whose stepper motors could only travel in whole-step increments. The earlier line routines all used floating-point divisions inside the inner loop, far too slow for plotting hundreds of vectors per second on the period's hardware. Bresenham's incremental error-term trick eliminated every multiply and divide from that loop, and the same idea was soon adapted to circles, ellipses, and antialiased line variants used in graphics chips through today.
Teaches: Approximate continuous lines using integer decisions
Anecdote
Jack Bresenham designed it for plotters that could only move in integer steps. The brilliance: no floating-point math at all — just addition and comparison.
The Idea
Walk one column at a time from x0 to x1, plotting one pixel per column. The only question at each step is: do we keep y the same, or step up by one? Bresenham keeps a running error term err that measures how far the ideal line is above or below the current pixel's row. If err > 0 the line has drifted up enough that we should step y up by one and subtract 2*dx from the error. Either way, we add 2*dy for the next column.
Why does it work? The error term is 2*dy*(x - x0) - 2*dx*(y - y0) plus a constant, which is a scaled version of the line's signed vertical distance from the current pixel. Multiplying by 2*dx clears the fraction, so we never need division or floating-point. Each step the error grows by 2*dy; whenever it becomes positive, we step y and pay back 2*dx. The loop uses only integer addition, subtraction, and comparison (the two doublings can be computed once before the loop), which made it perfect for old plotters and keeps it fast even today.
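The loop described above can be sketched in Python as follows. This is a minimal sketch that assumes the same first-octant case as the pseudocode (x0 ≤ x1 and 0 ≤ dy ≤ dx); the function name bresenham_first_octant and the choice to return a list instead of calling plot are illustrative assumptions, not part of the original.

```python
def bresenham_first_octant(x0, y0, x1, y1):
    """Integer pixels on the line (x0, y0) -> (x1, y1).

    Assumption for this sketch: x0 <= x1 and 0 <= y1 - y0 <= x1 - x0.
    """
    dx = x1 - x0
    dy = y1 - y0
    if dx < 0 or dy < 0 or dy > dx:
        raise ValueError("this sketch only handles 0 <= dy <= dx")

    err = 2 * dy - dx          # scaled gap between the ideal line and the pixel row
    y = y0
    pixels = []
    for x in range(x0, x1 + 1):
        pixels.append((x, y))  # stands in for plot(x, y)
        if err > 0:            # line has drifted far enough: step up one row
            y += 1
            err -= 2 * dx
        err += 2 * dy          # advance one column
    return pixels


print(bresenham_first_octant(0, 0, 8, 3))
# [(0, 0), (1, 0), (2, 1), (3, 1), (4, 1), (5, 2), (6, 2), (7, 3), (8, 3)]
```

Only additions, subtractions, and comparisons run inside the loop; the multiplications by 2 could be hoisted out if the target hardware makes them expensive.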
Trace
x | y | plot | err before | err > 0? | action | err after
0 | 0 | (0,0) | -2 | no | err += 2*dy=6 | 4
1 | 0 | (1,0) | 4 | yes | y=1, err-=2*dx=16; +=6 | -6
2 | 1 | (2,1) | -6 | no | += 6 | 0
3 | 1 | (3,1) | 0 | no | += 6 | 6
4 | 1 | (4,1) | 6 | yes | y=2, err-=16; += 6 | -4
5 | 2 | (5,2) | -4 | no | += 6 | 2
6 | 2 | (6,2) | 2 | yes | y=3, err-=16; += 6 | -8
7 | 3 | (7,3) | -8 | no | += 6 | -2
8 | 3 | (8,3) | -2 | no | += 6 | 4
Where It's Used Today
Game and graphics engines — many 2D games that draw a laser beam or a wireframe edge in software rely on a descendant of Bresenham's line.
CAD and CNC machines — plotters, laser cutters, and 3D printers still use Bresenham (and its circle and ellipse cousins) to drive integer stepper motors.
Robotics and autonomous vehicles — raycasting a sensor line across an occupancy grid uses Bresenham to enumerate the cells the ray crosses.
Image-processing libraries — OpenCV's cv::line and Pillow's ImageDraw.line use Bresenham-style rasterization under the hood.
Embedded displays — microcontroller code that drives small LCDs and OLEDs uses Bresenham because many of these chips have no floating-point unit.
When NOT to Use
When you need anti-aliased lines for high-quality 2D rendering — Bresenham draws a hard staircase; use Wu's algorithm for smooth edges instead.
When the line is steeper than 45° (|dy| > |dx|) without swapping axes — the basic loop produces gaps because it raises y by at most one row per column, while a steep line needs several rows per column.
When you only need a few points sampled along the line — direct floating-point parametric evaluation (x0 + t*dx, y0 + t*dy) is simpler and accurate enough.
Common Mistakes
Hard-coding the loop for one octant only — drawing right-to-left gives a blank line because a loop from x0 up to x1 never runs, and drawing downward gives a flat line because the error term never turns positive.
Initializing err as dy - dx (without the factor of 2), which biases the staircase one pixel up or down.
Using floating-point slope = dy/dx inside the loop, which throws away the entire performance and exactness benefit of Bresenham.
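The one-octant and steep-line problems listed above go away if the loop tracks a step direction for each axis. The sketch below is a commonly used all-octant integer variant, not the loop shown earlier in this section; the name line_pixels and the single combined error term are choices made for this illustration.

```python
def line_pixels(x0, y0, x1, y1):
    """All-octant integer line: works for any direction and any slope."""
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)             # kept negative so one error term serves both axes
    sx = 1 if x0 < x1 else -1      # step direction along x
    sy = 1 if y0 < y1 else -1      # step direction along y
    err = dx + dy
    pixels = []
    while True:
        pixels.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:               # error says: take a step in x
            err += dy
            x0 += sx
        if e2 <= dx:               # error says: take a step in y
            err += dx
            y0 += sy
    return pixels


print(line_pixels(0, 0, 1, 3))     # steep line
print(line_pixels(3, 0, 0, 0))     # right-to-left line
```

Both axes are handled symmetrically, so no axis swap is needed for steep lines, and the sign variables sx and sy cover right-to-left and downward drawing.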
Try It with an AI Assistant
short
Write bresenham(x0, y0, x1, y1) returning the list of integer pixels on the line from (x0,y0) to (x1,y1), using only integer arithmetic.
behavior
Write a function that returns the integer pixel positions of a line from (x0, y0) to (x1, y1). Step x from x0 to x1. At each step, plot the current (x, y). Maintain an integer error term initialized to 2*dy - dx. Whenever the error is positive, increment y and subtract 2*dx from the error. After every step, add 2*dy to the error.
Made computing mean and variance numerically stable in one pass.
n ← 0; mean ← 0; M2 ← 0
FOR EACH x IN stream
  n ← n + 1
  delta ← x - mean
  mean ← mean + delta / n
  M2 ← M2 + delta * (x - mean)
ENDFOR
variance ← M2 / n
RETURN (mean, variance)
B. P. Welford published the method in 1962 as a short note in Technometrics. The reason it is still cited: the common one-pass textbook formula for variance, Σx² − (Σx)²/n, subtracts two large, nearly equal numbers and can lose most of its precision, and Welford's reformulation avoids that subtraction entirely. Many statistics libraries and streaming systems use Welford's recurrence, or a close variant of it, under the hood.
Teaches: Update mean and variance incrementally with numerical stability
The Idea
Keep three running numbers: n (count so far), mean (running mean), and M2 (running sum of squared deviations from the current mean). When a new value x arrives, compute delta = x - mean (how far the new point is from the old mean), nudge the mean by delta / n, then update M2 using delta * (x - new_mean). The population variance at any moment is M2 / n (divide by n - 1 instead for the sample variance).
Why does it work? The clever bit is the use of two deltas, one before the mean update and one after. Their product, summed over all points, is mathematically identical to the textbook "sum of squared deviations from the final mean," but it never subtracts two huge similar numbers. The textbook formula Σx² − (Σx)²/n can lose almost all precision when the mean is large compared with the spread of the data, because both terms are then huge and nearly equal. Welford's recurrence works with deviations from the running mean, which stay at the scale of the data's spread, so even billions of data points stay numerically stable.
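The recurrence above can be sketched in Python as follows. The function names welford and naive_variance are illustrative; naive_variance implements the textbook formula Σx² − (Σx)²/n purely for comparison, and the shifted data set is a made-up example chosen to expose the cancellation.

```python
def welford(stream):
    """Return (mean, population_variance) of an iterable in one pass."""
    n = 0
    mean = 0.0
    m2 = 0.0                      # running sum of squared deviations
    for x in stream:
        n += 1
        delta = x - mean          # gap to the OLD mean
        mean += delta / n
        m2 += delta * (x - mean)  # second factor uses the NEW mean
    if n == 0:
        raise ValueError("need at least one value")
    return mean, m2 / n


def naive_variance(values):
    """Textbook one-pass formula, shown only for comparison."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return (total_sq - total * total / n) / n


data = [4.0, 7.0, 13.0, 16.0]
shifted = [1e9 + v for v in data]   # same spread, huge offset

print(welford(data))                # (10.0, 22.5)
print(welford(shifted)[1])          # 22.5
print(naive_variance(shifted))      # far from 22.5 because of cancellation
```

Shifting every value by a constant should not change the variance. The online update reproduces 22.5 on the shifted data, while the naive formula has to subtract two numbers near 4·10¹⁸ and loses the answer in rounding error.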
Trace
step | x | n | delta = x − mean(old) | mean (new) | x − mean(new) | M2 (new)
1 | 4 | 1 | 4 − 0 = 4 | 0 + 4/1 = 4 | 4 − 4 = 0 | 0 + 4·0 = 0
2 | 7 | 2 | 7 − 4 = 3 | 4 + 3/2 = 5.5 | 7 − 5.5 = 1.5 | 0 + 3·1.5 = 4.5
3 | 13 | 3 | 13 − 5.5 = 7.5 | 5.5 + 7.5/3 = 8.0 | 13 − 8 = 5 | 4.5 + 7.5·5 = 42
4 | 16 | 4 | 16 − 8 = 8 | 8 + 8/4 = 10 | 16 − 10 = 6 | 42 + 8·6 = 90
Where It's Used Today
NumPy and pandas — pandas' rolling-window variance uses a Welford-style online update; numpy.var takes the two-pass route instead (mean first, then averaged squared deviations), which is just as stable when the whole array is in memory.
Streaming analytics — engines such as Apache Spark, Flink, and Kafka Streams can compute running means and variances over billions of events with this recurrence or its mergeable parallel variant.
Sensor pipelines — phones, drones, and industrial sensors use Welford to track gyro/accelerometer drift without storing every reading.
Machine-learning batch normalization — neural-network layers maintain running means and variances of activations during training; a Welford-like update keeps them stable.
Finance — rolling volatility (variance of returns) for trading dashboards is computed online so each new price tick updates the chart instantly.
When NOT to Use
When you need a windowed statistic (last 100 values only) — Welford only adds points, it doesn't subtract them; use a deque-based running sum for sliding windows.
When the dataset is small and fits in memory — the classic two-pass formula is just as accurate and easier to read for a list of 50 numbers.
When you need higher moments like skewness or kurtosis — basic Welford only tracks mean and M2; the higher-order recurrences are different and more delicate.
Common Mistakes
Computing both factors of the M2 update from the same mean (for example delta * delta) instead of delta * (x - new_mean) — this breaks the identity that makes M2 equal the sum of squared deviations.
Returning M2 directly as the variance, forgetting to divide by n (or n-1 for the sample variance).
Initializing mean and M2 to a non-zero value, contaminating the recurrence from the first observation onward.
Try It with an AI Assistant
short
Write welford(stream) implementing Welford's online statistics — return the running mean and variance after one pass.
behavior
Write a function that reads numbers one at a time and keeps a running mean and a running 'sum of squared deviations from the current mean.' For each new value, compute the gap from the old mean, nudge the mean by gap divided by the new count, then update the running sum using the gap times the new gap to the updated mean. Return mean and the sum divided by count as variance.
Made deciding whether a point lies inside any shape systematic.
count ← 0
FOR EACH edge (a, b) IN polygon
  IF (a.y > py) != (b.y > py) THEN
    x ← (b.x - a.x) * (py - a.y) / (b.y - a.y) + a.x
    IF px < x THEN
      count ← count + 1
    ENDIF
  ENDIF
ENDFOR
RETURN (count MOD 2 = 1)
By the early 1960s, computer graphics and CAD were emerging fields, and engineers needed a reliable way to ask "is this point inside that shape?" The ray-casting trick is older than computers — it was used by surveyors and topologists — but Shimrat was the first to write it down as a tight, publishable subroutine. Communications of the ACM in 1962 ran a regular "Algorithm" department where short, numbered procedures were submitted in ALGOL; Shimrat's contribution, half a page long, became the seed for almost every "point in polygon" test that has shipped since.
Teaches: Determine inclusion by counting boundary crossings
Anecdote
Mort Shimrat published it in CACM under the unassuming title "Algorithm 112"; algorithms in 1960s journals were just numbered. Shimrat's whole paper is half a page. Descendants of its short ALGOL routine have shipped in computer-aided-design packages, geographic information systems, and 2D game engines for the past sixty years.
The Idea
Imagine standing at the test point and shooting a horizontal ray off to the right, like an arrow pointing at the eastern horizon. Count how many edges of the polygon the ray crosses. If the count is odd, you're inside. If it's even, you're outside.
Why does this work? Each time the ray crosses a boundary, you switch sides — outside becomes inside, inside becomes outside. Far to the right (past every edge), you're definitely outside. Walking the ray backward toward your point, every crossing flips you. So an odd number of flips means you ended up inside; an even number means you flipped back out. The pseudocode does this without actually drawing a ray: for each edge, it checks whether the edge straddles the horizontal line y = py, and if so, computes the x of the intersection and counts it only when the intersection is to the right of px.
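The crossing count described above can be sketched in Python as follows. The function name point_in_polygon and the tuple-based point representation are assumptions made for this illustration; the sketch flips a boolean at each crossing, which is equivalent to counting and testing for an odd total.

```python
def point_in_polygon(p, poly):
    """Even-odd ray casting.

    p is an (x, y) pair; poly is a list of (x, y) corners in boundary order.
    """
    px, py = p
    inside = False
    n = len(poly)
    for i in range(n):
        ax, ay = poly[i]
        bx, by = poly[(i + 1) % n]       # wrap around to close the polygon
        if (ay > py) != (by > py):       # edge straddles the line y = py
            # x where the edge meets y = py; by != ay is guaranteed here,
            # so horizontal edges never reach this division
            x_cross = (bx - ax) * (py - ay) / (by - ay) + ax
            if px < x_cross:             # crossing lies to the right of the point
                inside = not inside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon((2, 2), square))   # True
print(point_in_polygon((5, 2), square))   # False
```

Points that lie exactly on an edge are not treated specially in this sketch; applications that care about the boundary usually add an explicit on-edge test.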
Trace
edge (a → b) | a.y > py? | b.y > py? | straddles? | intersection x | px < x? | count
(0,0) → (4,0) | 0 > 2 = F | 0 > 2 = F | no | - | - | 0
(4,0) → (4,4) | 0 > 2 = F | 4 > 2 = T | yes | 4 | 2 < 4 ✓ | 1
(4,4) → (0,4) | 4 > 2 = T | 4 > 2 = T | no | - | - | 1
(0,4) → (0,0) | 4 > 2 = T | 0 > 2 = F | yes | 0 | 2 < 0 ✗ | 1
Where It's Used Today
Geographic information systems — deciding whether a GPS point lies inside a country, school district, or delivery zone (Uber, DoorDash, ArcGIS).
2D game engines — checking whether the cursor or a bullet lies inside an irregular hit-box on screen.
CAD software — when you click "fill" on an enclosed shape, the program runs ray casting at every pixel to decide what to paint.
Lasso selection in image editors — Photoshop and Figma's freehand selection use this to determine which pixels fall inside the user's loop.
Election mapping and census tools — assigning an address to a precinct or census tract requires testing the address point against thousands of boundary polygons.
When NOT to Use
When the polygon is convex — a much simpler "point lies on the same side of every edge" test is faster and avoids ray edge cases.
When you'll test millions of points against the same shape — pre-compute a spatial index (BVH, grid, trapezoid map) instead.
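When the polygon self-intersects — even-odd ray casting and the nonzero winding rule disagree on doubly enclosed regions, so decide which rule you need first; holes are handled correctly only if their edges are included in the edge list.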
Common Mistakes
Mixing >= and > between the two endpoint tests of the straddle check (a.y > py) != (b.y > py) — a ray that passes exactly through a vertex is then counted twice or not at all.
Dividing by (b.y - a.y) without ruling out horizontal edges first, causing a divide-by-zero on flat segments.
Counting the intersection regardless of side, instead of only when it lies to the right of the test point.
Try It with an AI Assistant
short
Write point_in_polygon(p, poly) returning true if 2D point p lies inside polygon poly, using ray casting.
behavior
Write a function that, given a 2D point and a list of polygon corners (in order around the boundary), decides whether the point is inside. For each polygon edge, check whether the horizontal line through the test point crosses that edge; if it does, find the x-coordinate of the crossing and tally it only when the crossing is to the right of the point. Return true when the tally is odd.
For instance: Search a changing set of numbers by branching left or right.
// Tree built from inserting [50, 30, 70, 20, 40, 60, 80]
node ← root
key ← 40
WHILE node != NULL
  IF key = node.key THEN
    RETURN TRUE
  ENDIF
  IF key < node.key THEN
    node ← node.left
  ELSE
    node ← node.right
  ENDIF
ENDWHILE
RETURN FALSE
By the early 1960s, programmers had two unhappy choices for keeping a sorted collection on which to do lookups: a sorted array (fast search, painful insert) or a linked list (fast insert, painful search). The binary search tree, refined in Thomas Hibbard's 1962 paper, made both operations roughly logarithmic on average, and gave us the deletion-by-in-order-successor technique that every undergraduate still rewrites by hand. Within a few years, BSTs were the default mental model for "ordered map" and the seed from which AVL trees, red-black trees, B-trees, and most modern indexes grew.
Teaches: A good question can eliminate many possibilities
Anecdote
Thomas Hibbard wrote a 1962 paper that introduced both the binary search tree and the deletion algorithm everyone still uses: replacing the deleted node with its in-order successor. Hibbard's deletion is famously biased: repeated insertion-deletion cycles cause the tree to lean left. The imbalance is known as the Hibbard deletion problem; the usual suggested fix is to choose randomly between successor and predecessor, and the question has been a homework exercise in algorithms courses for 60 years.
The Idea
Each node holds a key and two child pointers, left and right. The BST invariant says: every key in a node's left subtree is smaller than the node, and every key in its right subtree is larger. To search for key, start at the root; if key matches, you're done; if key is smaller, walk left; otherwise walk right. Stop when you fall off the tree (NULL) — that means the key is absent.
Why does it work? Because the invariant guarantees that at each step you eliminate one entire subtree from consideration — exactly the same divide-by-two move binary search makes on a sorted array. If the tree is balanced (roughly the same height on both sides), n items take only about log₂ n comparisons to search, insert, or delete. The catch: if you insert items in a bad order (already-sorted), the tree can degenerate into a linked list. That's the problem balanced trees like AVL and red-black trees are built to fix.
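The search walk, and the degenerate sorted-insert case mentioned above, can be sketched in Python as follows. The Node class, the helper names, and the extra step counter returned by search are assumptions made for this illustration.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(node, key):
    """Plain (unbalanced) BST insert; returns the subtree root."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return node


def search(root, key):
    """Iterative lookup; returns (found, nodes_visited)."""
    node = root
    steps = 0
    while node is not None:
        steps += 1
        if key == node.key:
            return True, steps
        # each comparison discards one whole subtree
        node = node.left if key < node.key else node.right
    return False, steps


balanced = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    balanced = insert(balanced, k)
print(search(balanced, 40))      # (True, 3): visits 50, 30, 40

chain = None
for k in [1, 2, 3, 4, 5, 6, 7]:  # sorted input degenerates into a list
    chain = insert(chain, k)
print(search(chain, 7))          # (True, 7): visits every node
```

Both trees hold seven keys, but the insertion order alone decides whether a lookup costs about log₂ n steps or n steps.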
Trace
step | node.key | comparison | next move
0 | 50 | 40 < 50 | node ← node.left
1 | 30 | 40 > 30 | node ← node.right
2 | 40 | 40 = 40 → return TRUE | found
Where It's Used Today
Database indexes — many databases use B-trees, the disk-friendly cousin of BSTs, to look up rows by key in milliseconds.
In-memory ordered maps — std::map in C++ and TreeMap in Java are red-black trees, a self-balancing BST.
File system directories — many filesystems store directory entries in BST-like structures so listing and searching are fast.
Auto-complete and spellcheck — sorted dictionary lookups in editors often sit on top of a balanced BST.
Range queries — "find all events between 9am and 11am" naturally becomes a BST in-order traversal between two keys.
When NOT to Use
When inserts arrive in sorted (or nearly sorted) order — the plain BST degenerates into a linked list with O(n) lookups; use a self-balancing tree (red-black, AVL) instead.
When you only need exact-match lookup and key order doesn't matter — a hash table gives O(1) average instead of O(log n).
When the data lives on disk — node-per-key trees thrash the cache; B-trees pack many keys per page and are dramatically faster.
Common Mistakes
Implementing insertion without a tie-breaking rule, then crashing or duplicating when the same key arrives twice.
Using Hibbard deletion (replace with in-order successor) without rebalancing — repeated insert/delete cycles lean the tree left and degrade to O(√n) height.
Comparing keys with == on objects in languages where that compares references — every search returns false even when the key is present.
Try It with an AI Assistant
short
Traverse left or right branches based on key comparisons to locate value.
behavior
Write a function that, given the root of a tree where every node has a key and two children — left for smaller keys, right for larger keys — and a target key, walks down from the root: if the current key equals the target, return true; if the target is smaller, move to the left child; otherwise move to the right child; if you fall off the tree, return false.
For instance: Decide course order when some classes require prerequisites.
graph ← {A: [C, S], C: [P], S: [P], P: []}
indegree ← count incoming edges  // {A: 0, C: 1, S: 1, P: 2}
queue ← all nodes with indegree 0  // [A]
order ← empty list
WHILE queue NOT empty
  node ← dequeue(queue)
  append(order, node)
  FOR EACH neighbor IN graph[node]
    indegree[neighbor] ← indegree[neighbor] - 1
    IF indegree[neighbor] = 0 THEN
      enqueue(queue, neighbor)
    ENDIF
  ENDFOR
ENDWHILE
RETURN order
Arthur Kahn published his algorithm in 1962 while looking for a clean way to schedule the thousands of interdependent tasks in large engineering projects, exactly the kind of work that PERT charts were trying to formalize at the time. Pull a node with no remaining prerequisites, mark it done, and watch its dependents become eligible: the procedure predates the phrase "dependency graph" in ordinary engineering vocabulary, yet it was so direct that it now underlies everything from make and package managers to spreadsheet recalculation.
Teaches: Order tasks based on dependencies first
Anecdote
Arthur B. Kahn developed it for job scheduling in large projects. It answered a practical question: what can we do next if some tasks depend on others? — long before "dependency graphs" became standard.
The Idea
For every node, count its in-degree — how many things must come before it. Tasks with in-degree zero have no prerequisites at all and are safe to do right now. Put them in a queue. Then repeatedly take a task off the queue, output it, and "release" each of its dependents by decrementing their in-degree. When a dependent's in-degree hits zero, all its prerequisites are now done, so it joins the queue.
Why does this work? At every moment, the queue holds exactly the tasks whose prerequisites have all been emitted. Pulling one out and emitting it never violates any rule, because by definition nothing is still required of it. If at the end you've emitted every node, you have a valid order. If some nodes never make it to in-degree zero, the graph contains a cycle — there's no valid order at all, and you've detected the impossibility for free.
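The queue-draining procedure and the free cycle check described above can be sketched in Python as follows. The function name topological_order, the dict-of-lists graph representation, and the choice to raise ValueError on a cycle are assumptions made for this illustration.

```python
from collections import deque


def topological_order(graph):
    """Kahn's algorithm.

    graph maps each node to the list of nodes that depend on it.
    Raises ValueError if the graph contains a cycle.
    """
    # Count incoming edges in a local map so the caller's data is not mutated.
    indegree = {node: 0 for node in graph}
    for node in graph:
        for neighbor in graph[node]:
            indegree[neighbor] = indegree.get(neighbor, 0) + 1

    queue = deque(node for node, d in indegree.items() if d == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            indegree[neighbor] -= 1          # one prerequisite is now done
            if indegree[neighbor] == 0:
                queue.append(neighbor)

    if len(order) != len(indegree):          # some node never reached in-degree 0
        raise ValueError("graph has a cycle; no valid order exists")
    return order


courses = {"A": ["C", "S"], "C": ["P"], "S": ["P"], "P": []}
print(topological_order(courses))            # ['A', 'C', 'S', 'P']
```

Comparing the output length with the node count is the whole cycle test: nodes on a cycle never reach in-degree zero, so they are never emitted.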
Trace
step | dequeued node | order so far | indegree updates | queue after
0 | start | [] | A=0, C=1, S=1, P=2 | [A]
1 | A | [A] | C: 1→0 (enqueue), S: 1→0 (enqueue) | [C, S]
2 | C | [A, C] | P: 2→1 | [S]
3 | S | [A, C, S] | P: 1→0 (enqueue) | [P]
4 | P | [A, C, S, P] | (no neighbors) | []
Where It's Used Today
Build systems — make, Bazel, Gradle, and similar tools decide which targets to rebuild, and in what order, by walking the dependency graph (#include and import edges) topologically.
Package managers — npm, pip, apt, and Homebrew install dependencies before the packages that need them.
Spreadsheet recalculation — when one cell depends on another, the spreadsheet evaluates them in topological order so that updates propagate correctly.
Course planning — university degree planners suggest a feasible class schedule that respects prerequisites.
Task pipelines — workflow engines like Airflow, Luigi, and CI/CD systems run jobs in topological order.
When NOT to Use
When the dependency graph contains cycles (mutual recursion, circular imports) — no valid order exists; you need cycle detection or SCC instead.
When edges are undirected — "comes before" needs direction; topological sort is meaningless on plain graphs.
When you need the unique or optimal ordering — many topological orders exist; if you need a specific one (shortest schedule, lex-smallest), add tie-breaking or use specialized scheduling.
Common Mistakes
Forgetting to detect cycles — if the output has fewer than V nodes, the graph has a cycle and the result is invalid; many implementations silently return a partial list.
Computing in-degrees incorrectly by counting outgoing edges instead of incoming ones, producing a reversed (or nonsense) order.
Mutating the original indegree map without restoring it, breaking subsequent runs of the algorithm on the same graph.
Try It with an AI Assistant
short
Order directed acyclic graph nodes so dependencies appear before dependents.
behavior
Write a function that takes a directed graph of tasks. For every node, count how many incoming arrows it has. Put all nodes with zero incoming arrows into a queue. Repeatedly pull one out, append it to the result, and for each of its outgoing arrows, reduce the target's incoming-count by one; if the count reaches zero, enqueue the target. Return the result list when the queue is empty.
For instance: Compute shortest travel time between every pair of cities.
n ← 3
dist ← [[0, 4, 10],
        [4, 0, 1],
        [10, 1, 0]]
FOR k FROM 0 TO n - 1
  FOR i FROM 0 TO n - 1
    FOR j FROM 0 TO n - 1
      d ← dist[i][k] + dist[k][j]
      IF d < dist[i][j] THEN
        dist[i][j] ← d
      ENDIF
    ENDFOR
  ENDFOR
ENDFOR
RETURN dist
Instead of solving one route at a time, the algorithm gradually allowed more intermediate nodes until all-pairs shortest paths emerged naturally.
Needed shortest paths between every pair of nodes simultaneously.
Teaches: Allow each vertex as midpoint to reveal all distances
The Idea
Start with dist[i][j] set to the direct edge length (or ∞ if no edge exists). Then ask: for each possible "middle" vertex k, can routing through k shorten the trip from i to j? If so, replace dist[i][j] with dist[i][k] + dist[k][j]. Do this for every k, every i, every j — three nested loops.
The invariant is beautiful: after the outer iteration for vertex k, dist[i][j] holds the length of the shortest path from i to j using only vertices 0, 1, ..., k as possible intermediate stops. Each outer pass adds one more vertex to the "allowed midpoints" set. After looping through all n vertices, every shortcut has been considered, so dist[i][j] is the true shortest distance. Total cost: O(n³).
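The three nested loops can be sketched in Python as follows. The function names are illustrative, missing edges are assumed to be encoded as math.inf, and the sketch works on a copy of the matrix (a design choice made here so the caller's input is left intact).

```python
import math


def floyd_warshall(dist):
    """All-pairs shortest path lengths.

    dist[i][j] is the direct edge length, or math.inf if there is no edge.
    """
    n = len(dist)
    d = [row[:] for row in dist]          # work on a copy
    for k in range(n):                    # k must be the OUTERMOST loop
        for i in range(n):
            for j in range(n):
                through_k = d[i][k] + d[k][j]
                if through_k < d[i][j]:
                    d[i][j] = through_k
    return d


def has_negative_cycle(d):
    """After the run, a negative diagonal entry signals a negative cycle."""
    return any(d[i][i] < 0 for i in range(len(d)))


cities = [[0, 4, 10],
          [4, 0, 1],
          [10, 1, 0]]
print(floyd_warshall(cities))   # [[0, 4, 5], [4, 0, 1], [5, 1, 0]]
```

Because math.inf plus any finite number is still math.inf, a missing edge can never produce a false shortcut, which is why infinity is the right sentinel.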
Trace
k | what's checked | update?
0 | does going through 0 shorten anything? | no: the only candidate, 1 → 0 → 2, costs 4 + 10 = 14 against a direct 1
1 | check dist[0][2] vs dist[0][1]+dist[1][2] = 4+1 = 5 | yes! dist[0][2] drops from 10 to 5
2 | does going through 2 shorten anything? | no: 0 → 2 → 1 costs 5 + 1 = 6 against a direct 4, so the table is already optimal
Where It's Used Today
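Network routing — distance-vector protocols such as RIP maintain routing tables with the closely related Bellman-Ford relaxation, the same "is it shorter through this intermediate node?" reasoning that Floyd-Warshall applies to all pairs at once.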
Travel planners — pre-computing all-pairs distances between airports so a search can answer queries instantly.
Game AI — pathfinding in small grid worlds where you want every NPC's shortest path to every key location.
Bioinformatics — finding shortest distances between every pair of nodes in protein interaction networks.
Operations research — supply-chain analysis where every pairwise transport cost matters for planning.
When NOT to Use
When the graph is large and sparse — O(V^3) time and O(V^2) space dominate; running Dijkstra from each source via a heap is much faster on sparse networks.
When you only need shortest paths from one source — Dijkstra with a binary heap is O((V + E) log V) and skips most of the work; if edges can be negative, use Bellman-Ford, which is O(V·E).
When the graph contains a negative cycle and you don't detect it — the table fills with meaningless decreasing values; check dist[i][i] < 0 after the run to flag this.
Common Mistakes
Putting k as the innermost loop instead of the outermost — the "vertices 0..k allowed as midpoints" invariant breaks and you compute wrong distances.
Initializing missing edges with 0 or a small sentinel instead of infinity — then dist[i][k] + dist[k][j] "shortcuts" through nonexistent edges.
Forgetting dist[i][i] = 0 — without the diagonal, paths that revisit i get penalized and answers come out too large.
Try It with an AI Assistant
short
Compute shortest paths between all node pairs using dynamic intermediate nodes.
behavior
Write a function that, given a distance matrix where dist[i][j] is the direct edge length between i and j (or infinity if no edge), runs three nested loops over an intermediate vertex k, source i, and destination j. For each combination, replace dist[i][j] with dist[i][k] + dist[k][j] if that's shorter. Return the updated matrix.
Made keeping data sorted while inserting and deleting fast.
// insert / find / delete on BST
FUNCTION insert(node, k)
  IF node = NULL THEN
    RETURN Node(k)
  ENDIF
  IF k < node.key THEN
    node.left ← insert(node.left, k)
  ELIF k > node.key THEN
    node.right ← insert(node.right, k)
  ENDIF
  RETURN node  // an equal key falls through both tests, so duplicates are ignored
END FUNCTION
When Thomas Hibbard published his 1962 analysis of binary search trees, the slick part wasn't insertion or lookup — both were already obvious — but deletion. Removing a node with two children leaves a hole that neither child alone can fill without violating the BST rule. Hibbard's solution was to find the in-order successor (the smallest key in the right subtree), copy its value into the deleted slot, and then recursively delete the successor instead. Every textbook BST in use today, from std::map to Java's TreeMap, is a descendant of that 1962 paper's three-operation interface.
Teaches: Maintain invariants while inserting, finding, deleting
Anecdote
Same Hibbard 1962 paper — the operations are the BST, just framed differently for Theme 7. The deletion case is the genuinely subtle one (which child takes the deleted node's place?), and Hibbard's solution (in-order successor with substitution) is what every CS101 course teaches.
The Idea
Each operation is a guided walk down the tree. To insert key k: at each node, if k is smaller go left, if larger go right; when you reach a NULL slot, plant a new node there. To find key k: same walk, return success when you land on k or failure when you fall off the tree. To delete key k: locate the node, then patch the gap — if the node has at most one child, splice it out; if it has two, replace the key with its in-order successor (the smallest key in the right subtree) and then delete that successor.
The invariant that makes this work is the BST rule itself: at every node, left-subtree keys < node key < right-subtree keys. Each operation either preserves the rule directly or, in the delete case, restores it by careful substitution. Because each step descends one level, all three operations cost O(h) where h is the tree's height.
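All three operations, including the two-child deletion case, can be sketched in Python as follows. This is a minimal unbalanced sketch; the Node class and the function names are assumptions made for illustration, and duplicates are silently ignored as one possible tie-breaking rule.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(node, key):
    """Return the subtree root so the parent link is always updated."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return node                      # equal key: duplicate ignored


def find(node, key):
    while node is not None and node.key != key:
        node = node.left if key < node.key else node.right
    return node is not None


def delete(node, key):
    """Hibbard-style deletion; returns the new subtree root."""
    if node is None:
        return None
    if key < node.key:
        node.left = delete(node.left, key)
    elif key > node.key:
        node.right = delete(node.right, key)
    else:
        if node.left is None:        # zero or one child: splice the node out
            return node.right
        if node.right is None:
            return node.left
        succ = node.right            # two children: find the in-order successor
        while succ.left is not None:
            succ = succ.left
        node.key = succ.key          # copy the successor's key up...
        node.right = delete(node.right, succ.key)  # ...then remove the original
    return node


def inorder(node):
    if node is None:
        return []
    return inorder(node.left) + [node.key] + inorder(node.right)


root = None
for k in [5, 3, 7, 1, 4]:
    root = insert(root, k)
print(inorder(root))                 # [1, 3, 4, 5, 7]
root = delete(root, 5)               # root has two children
print(root.key, inorder(root))       # 7 [1, 3, 4, 7]
```

Every function returns the (possibly new) subtree root, which is what lets a parent see a planted or spliced node without holding a pointer to its own parent.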
Trace
step | k | walk | result
1 | 5 | tree empty → place 5 at root | 5
2 | 3 | 3 < 5 → go left, NULL → plant 3 | 5(L:3)
3 | 7 | 7 > 5 → go right, NULL → plant 7 | 5(L:3, R:7)
4 | 1 | 1 < 5 → left to 3; 1 < 3 → left, plant | 5(L:3(L:1), R:7)
5 | 4 | 4 < 5 → left to 3; 4 > 3 → right, plant | 5(L:3(L:1, R:4), R:7)
Where It's Used Today
Database indexes — B-trees and B+-trees, the workhorses behind every SQL index, are direct descendants of these BST operations.
In-memory key-value stores — std::map in C++ and TreeMap in Java use balanced BSTs (red-black trees) for ordered lookup.
File systems — directories that need to keep entries sorted (NTFS, HFS+) use B-tree variants under the hood.
Auto-complete and dictionaries — many spell-checkers and predictive text tools store their lexicon in a balanced BST for fast range queries.
Computer-algebra systems — symbolic math engines maintain expression trees with insertion, lookup, and deletion as the core operations.
When NOT to Use
When inputs may arrive in sorted or near-sorted order — an unbalanced BST degrades into a linked list with O(n) lookups; use a self-balancing variant (red-black, AVL).
When you only need fast membership testing without ordering — a hash set gives O(1) lookups without any tree-walk cost.
When keys are stored on disk in large blocks — a binary tree wastes I/O; B-trees with high fan-out are the right choice.
Common Mistakes
During delete-with-two-children, copying the in-order successor's key but forgetting to recursively remove the successor — the same key now appears twice.
Mishandling duplicates by inserting them on whichever side the comparison falls through, producing two valid locations and breaking later finds.
Updating only the local pointer inside insert instead of returning the (possibly new) subtree root — the parent never sees the new node and inserts silently fail.
Try It with an AI Assistant
short
Write a class BinaryTree with insert(value) and inorder() returning sorted values; behave as a binary search tree.
behavior
Write a class for a node-and-pointer tree where each node has a key, a left child, and a right child. Provide an insert method that, given a key, walks from the root taking the left branch when the new key is smaller and the right branch when larger, planting a new node at the first empty spot. Provide a method that returns the keys visited by a left-then-self-then-right traversal.
It made number sequences visually explorable instead of only one-dimensional lists.
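// Ulam spiral: index → (x, y), with 1 at (0, 0) and 2 at (1, 0)
i ← 10
m ← floor(sqrt(i))
IF m MOD 2 = 0 THEN
  m ← m - 1
ENDIF
t ← i - m*m
half ← (m + 1) / 2
side ← m + 1  // each arm of the ring holds m + 1 cells
IF t = 0 THEN  // i is an odd square: the corner that closes the previous ring
  x ← half - 1
  y ← 1 - half
ELSEIF t <= side THEN  // right arm, going up
  x ← half
  y ← t - half
ELSEIF t <= 2*side THEN  // top arm, going left
  x ← half - (t - side)
  y ← half
ELSEIF t <= 3*side THEN  // left arm, going down
  x ← -half
  y ← half - (t - 2*side)
ELSE  // bottom arm, going right
  x ← (t - 3*side) - half
  y ← -half
ENDIF
RETURN (x, y)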
Stanislaw Ulam noticed patterns while doodling numbers in a spiral. When primes were marked on the spiral, diagonal structures appeared. Placing integers on a grid made hidden number patterns visible.
Teaches: Map sequences to space using structured growth patterns
The Idea
Notice that the perfect odd squares (1, 9, 25, 49) sit on the bottom-right diagonal of the spiral. Each of these closes a complete "ring" around the center. So given any i, we first find the largest odd m with m*m ≤ i. If i equals m*m it is that closing corner; otherwise i lies somewhere on the next ring, and t = i - m*m tells us how far around that ring we've walked.
Each ring has four arms (right side going up, top going left, left going down, bottom going right), and each arm holds m + 1 cells. We figure out which arm t lies on by comparing t against m + 1, 2(m + 1), 3(m + 1), and 4(m + 1), and within that arm we work out the offset. The radius of the ring is half = (m+1)/2. Add or subtract half and the offset along the arm, and we have (x, y) directly, with no walking required.
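The closed form can be sketched in Python as follows. The sketch assumes the usual convention of 1 at (0, 0), 2 at (1, 0), and counter-clockwise growth, with each arm of a ring holding m + 1 cells; the function name ulam_index_to_xy is illustrative, and math.isqrt is used so that large i does not suffer floating-point rounding.

```python
import math


def ulam_index_to_xy(i):
    """Map a 1-based spiral index to (x, y).

    Convention assumed here: 1 at (0, 0), 2 at (1, 0), then counter-clockwise.
    """
    if i < 1:
        raise ValueError("index is 1-based")
    m = math.isqrt(i)            # exact integer square root
    if m % 2 == 0:
        m -= 1                   # largest odd m with m*m <= i
    t = i - m * m                # steps walked past the closing corner m*m
    half = (m + 1) // 2          # radius of the ring that i sits on when t > 0
    side = m + 1                 # cells per arm of that ring
    if t == 0:                   # i is an odd square: corner of the previous ring
        return (half - 1, 1 - half)
    if t <= side:                # right arm, going up
        return (half, t - half)
    if t <= 2 * side:            # top arm, going left
        return (half - (t - side), half)
    if t <= 3 * side:            # left arm, going down
        return (-half, half - (t - 2 * side))
    return ((t - 3 * side) - half, -half)   # bottom arm, going right


print([ulam_index_to_xy(i) for i in range(1, 11)])
```

A quick sanity check for any variant of this formula is that consecutive indices must land on neighbouring cells and no two indices may share a cell.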
Trace
variable | value | what happens
i | 10 | input
m | 3 | floor(sqrt(10)) = 3, already odd
t | 1 | t = 10 − 9 = 1
half | 2 | half = (3 + 1)/2 = 2
arm | right | t (=1) ≤ m + 1 (=4) → right arm, going up
x | 2 | x = half = 2
y | −1 | y = t − half = 1 − 2 = −1
Where It's Used Today
Number-theory visualizations — when a researcher wants to plot millions of primes on a grid to see diagonal striping, the Ulam spiral lookup gives the pixel for each prime in O(1).
Image kernels and signal processing — sampling a grid in spiral order (often for radial scans or tile prioritization) uses the same index→(x,y) trick.
Game level generation — procedural games place tiles outward from a starting room in spiral order; this formula maps room number to map coordinate.
Recreational math software — Wolfram Mathematica's UlamMatrix and similar tools in SageMath expose this exact mapping.
Memory layouts — some cache-friendly 2D array traversals walk in a spiral; the inverse formula tells the loader where each item should sit.
When NOT to Use
When you actually want to iterate the spiral in order — a simple direction-stepping loop (right, up, left, down with growing leg lengths) is clearer than the closed form.
When your spiral starts from a different corner or rotates the other way — this formula bakes in one convention and silently produces wrong coordinates for a different one.
When i = 0 is your starting index — the formula assumes 1-based indexing; floor(sqrt(0)) = 0 is even, so the odd-m adjustment sets m to −1 and the result is meaningless.
Common Mistakes
Forgetting to subtract 1 when m = floor(sqrt(i)) is even, so m*m is not a ring corner, (m + 1)/2 is not a whole number, and the point lands on the wrong ring.
Hard-coding only one or two arms of the ring — the algorithm needs all four arm cases (right, top, left, bottom, split at multiples of m + 1) or it returns garbage on most inputs.
Using floating-point sqrt for huge i without a correction step, picking up a bad m due to rounding.
Try It with an AI Assistant
short
Write ulam_spiral_index_xy(i) implementing Ulam Spiral — index → (x, y).
behavior
Given a positive integer i, return the (x, y) grid coordinates of i if the integers 1, 2, 3, … are written in an outward square spiral starting at the origin. Use a closed-form approach: find the largest odd m with m² ≤ i, work out which side of the ring i is on by comparing i − m² against m + 1, 2(m + 1), 3(m + 1), 4(m + 1), and compute x and y from there.
It made spiral grids usable in simulations, puzzles, visualizations, and coordinate-based indexing without storing the whole grid.
x ← 2
y ← 1
// (x, y) → Ulam spiral index
IF x = 0 AND y = 0 THEN
    RETURN 1
ENDIF
m ← max(|x|, |y|)
side ← 2 * m
base ← (2 * m - 1) * (2 * m - 1)
// right arm: x = m, y runs from -m+1 up to m
IF x = m AND y > -m THEN
    RETURN base + (y + m)
ENDIF
// top arm: y = m, x runs from m-1 down to -m
IF y = m THEN
    RETURN base + side + (m - x)
ENDIF
// left arm: x = -m, y runs from m-1 down to -m
IF x = -m THEN
    RETURN base + 2 * side + (m - y)
ENDIF
// bottom arm: y = -m, x runs from -m+1 up to m
RETURN base + 3 * side + (x + m)
Once numbers could be placed on a spiral, the reverse question became useful: given a grid coordinate, what number lives there? This turned the picture into a computable coordinate system.
Teaches: Invert spatial structures into linear order
The Idea
Every Ulam spiral cell lives on a square ring numbered m = max(|x|, |y|). Ring 0 is just (0,0). Ring m ≥ 1 is a square of side 2m + 1, holds 8m cells, and starts at index (2m − 1)² + 1. Once you know which ring you're on, you only need to figure out how far around the ring you've walked from its starting cell.
That's the "4-arm dispatch": each ring has four straight arms (right, top, left, bottom, in the order the spiral walks them). Compute m, compute the ring's base index, then check which arm (x, y) lies on and add the offset along that arm. The whole computation is constant-time arithmetic: no loops, no grid stored. The invariant: cells on ring m get indices in the range ((2m−1)², (2m+1)²], and within that range the offset is determined by the arm and the coordinate.
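The dispatch described above can be sketched in Python as follows. This is a minimal illustration that mirrors the pseudocode, assuming the counter-clockwise convention with 1 at (0, 0) and 2 at (1, 0); the name `ulam_index` is chosen here for illustration and does not come from the source.

```python
def ulam_index(x: int, y: int) -> int:
    """Return the 1-based spiral index of cell (x, y).

    Assumes a counter-clockwise spiral: 1 at (0, 0), 2 at (1, 0), 3 at (1, 1).
    """
    if x == 0 and y == 0:
        return 1                       # ring 0 is the single origin cell
    m = max(abs(x), abs(y))            # ring number (Chebyshev distance)
    side = 2 * m                       # number of cells on each arm of ring m
    base = (2 * m - 1) ** 2            # last index used by ring m - 1
    if x == m and y > -m:              # right arm, walking upward
        return base + (y + m)
    if y == m:                         # top arm, walking left
        return base + side + (m - x)
    if x == -m:                        # left arm, walking downward
        return base + 2 * side + (m - y)
    return base + 3 * side + (x + m)   # bottom arm, walking right


print(ulam_index(2, 1))  # 12, matching the trace
```

Because every branch is plain integer arithmetic, the same function works for arbitrarily distant cells without building any grid.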
Trace
| step | computation | value |
|---|---|---|
| 1 | m = max(\|2\|, \|1\|) | 2 |
| 2 | side = 2·m | 4 |
| 3 | base = (2·m − 1)² = 3² | 9 |
| 4 | which arm? x = m = 2 and y > -m, so right arm | right arm |
| 5 | offset along right arm: y + m = 1 + 2 | 3 |
| 6 | index = base + offset = 9 + 3 | 12 |
Where It's Used Today
Visualizing primes — plotting the indices that happen to be prime on a spiral grid is the classic way to see Ulam's diagonal patterns.
Tile-based games — some roguelikes lay out infinite worlds as a spiral, addressing tiles by (x, y) and computing the seed for each.
Sparse coordinate systems — when you need a deterministic ID for any (x, y) cell without storing the grid (e.g., procedural galaxies in space sims).
Cellular-automaton experiments — recording the order in which cells were visited in spiral order during a simulation.
Math education and puzzles — the spiral makes pattern-finding tactile, and the (x, y) → index map turns "where is 41?" into ordinary arithmetic.
When NOT to Use
When your grid uses a different spiral convention (clockwise, starts going up) — the arm formulas hardcode one orientation and silently mislabel cells.
When you want the inverse direction (index → coordinate) — solve m = ⌈(√k − 1)/2⌉ and walk the ring; the (x,y) → index code doesn't run backward.
When you only need a few coordinates inside a small bounded grid — building a lookup table once is simpler and avoids all the arm-dispatch logic.
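For the inverse direction mentioned in the list above, a minimal Python sketch is shown here. It assumes the same counter-clockwise convention (1 at the origin, 2 at (1, 0)) and computes the ring m = ⌈(√k − 1)/2⌉ with integer arithmetic; the name `ulam_xy` is hypothetical.

```python
import math


def ulam_xy(k: int):
    """Return the (x, y) cell holding spiral index k (1-based).

    Assumes a counter-clockwise spiral with 1 at (0, 0) and 2 at (1, 0).
    """
    if k < 1:
        raise ValueError("spiral indices start at 1")
    if k == 1:
        return (0, 0)
    # Ring m = ceil((sqrt(k) - 1) / 2), computed without floating point.
    m = (math.isqrt(k - 1) + 1) // 2
    side = 2 * m                      # cells per arm on ring m
    base = (2 * m - 1) ** 2           # last index of the previous ring
    t = k - base                      # position along ring m, from 1 to 8m
    if t <= side:                     # right arm: y climbs from -m+1 to m
        return (m, t - m)
    if t <= 2 * side:                 # top arm: x falls from m-1 to -m
        return (m - (t - side), m)
    if t <= 3 * side:                 # left arm: y falls from m-1 to -m
        return (-m, m - (t - 2 * side))
    return ((t - 3 * side) - m, -m)   # bottom arm: x climbs from -m+1 to m


print(ulam_xy(12))  # (2, 1)
```

Using `math.isqrt` sidesteps the floating-point rounding that a plain `sqrt` can introduce for very large k.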
Common Mistakes
Using m = |x| + |y| (Manhattan distance) instead of max(|x|, |y|) (Chebyshev) — picks the wrong ring for off-diagonal cells.
Forgetting that ring 0 is the single cell (0, 0) with index 1 — the general formula divides by zero or returns 0.
Mixing up which corner each arm starts at — produces an off-by-2m jump where one arm meets the next.
Try It with an AI Assistant
short
Write ulam_spiral_x_y_index(x, y) returning the integer index of the cell (x, y) on the Ulam spiral, where index 1 sits at the origin.
behavior
Numbers 1, 2, 3, … are written on an infinite grid in a counter-clockwise square spiral starting at the origin: 1 at (0,0), 2 at (1,0), 3 at (1,1), 4 at (0,1), 5 at (-1,1), and so on. Given (x, y), return the number written there. Do it in constant time using the ring m = max(|x|, |y|) and a base index (2m−1)² + 1, then add the offset along whichever arm of ring m the cell sits on.
Made guaranteed efficient in-place sorting possible.
For instance: Rank tasks by priority and repeatedly remove the highest.
arr ← [3, 1, 4, 1, 5, 9, 2, 6]
buildMaxHeap(arr)
FOR END FROM length(arr) - 1 DOWN TO 1
    SWAP arr[0], arr[END]
    siftDown(arr, 0, END - 1)
ENDFOR
RETURN arr
In 1964, English computer scientist J. W. J. Williams introduced the heap — a binary tree laid flat in an array — as a way to keep one extreme element instantly accessible. He showed it could power both an in-place sort and the priority queues used by operating systems. The structure works like a tournament bracket encoded in an array: every parent is the "winner" against its children, so the global champion always sits at index 0; remove it, replay the tournament, and the next champion floats up in O(log n) time.
Teaches: Use a partial order to repeatedly extract extremes
Anecdote
J. W. J. Williams introduced heaps as a way to simulate a tournament tree. Think: repeatedly finding the winner among players — the heap is literally a tournament bracket encoded in an array.
The Idea
A max-heap is a binary tree (laid out inside an array) where every parent is at least as large as its children. The biggest element is therefore always at index 0. buildMaxHeap runs once and turns any array into a heap. Then we repeat: swap arr[0] (the current maximum) with arr[END] (the last position not yet sorted); shrink the heap by one; and sift down the new top element until the heap property holds again. The largest values pile up at the right end of the array, in sorted order.
Why does this work? siftDown only needs to push one element down a tree of height log n, taking O(log n) time. We do that n − 1 times after a build step that costs no more than O(n log n), so the whole sort costs O(n log n), and crucially that bound never depends on input order. Every comparison and every swap happens inside the original array, so only O(1) extra memory is used. The heap is essentially a tournament bracket: each siftDown is one round of the tournament finding a new winner.
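The loop described above might look like this in Python. This is a sketch, not a reference implementation: the source does not say how buildMaxHeap works, so the version here assumes the common bottom-up construction, which means the intermediate arrays can differ from the trace even though the sorted result is the same. As in the pseudocode, the third argument of `sift_down` is the last valid index, inclusive.

```python
def sift_down(arr, i, end):
    """Push arr[i] down until it is at least as large as its children.

    `end` is the last index that still belongs to the heap (inclusive).
    """
    while True:
        left, right, largest = 2 * i + 1, 2 * i + 2, i
        if left <= end and arr[left] > arr[largest]:
            largest = left
        if right <= end and arr[right] > arr[largest]:
            largest = right
        if largest == i:
            return
        arr[i], arr[largest] = arr[largest], arr[i]
        i = largest


def build_max_heap(arr):
    """Assumed bottom-up build: sift down every internal node, last first."""
    for i in range(len(arr) // 2 - 1, -1, -1):
        sift_down(arr, i, len(arr) - 1)


def heap_sort(arr):
    """Sort arr in place in ascending order and return it."""
    build_max_heap(arr)
    for end in range(len(arr) - 1, 0, -1):
        arr[0], arr[end] = arr[end], arr[0]   # move current maximum into place
        sift_down(arr, 0, end - 1)            # repair the shrunken heap
    return arr


print(heap_sort([3, 1, 4, 1, 5, 9, 2, 6]))  # [1, 1, 2, 3, 4, 5, 6, 9]
```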
Trace
| END | swap arr[0] ↔ arr[END] | after swap | after siftDown(0, END-1) on the prefix |
|---|---|---|---|
| 7 | swap 9 with 1 | [1, 6, 4, 3, 5, 1, 2, 9] | [6, 5, 4, 3, 1, 1, 2, 9] |
| 6 | swap 6 with 2 | [2, 5, 4, 3, 1, 1, 6, 9] | [5, 3, 4, 2, 1, 1, 6, 9] |
| 5 | swap 5 with 1 | [1, 3, 4, 2, 1, 5, 6, 9] | [4, 3, 1, 2, 1, 5, 6, 9] |
| 4 | swap 4 with 1 | [1, 3, 1, 2, 4, 5, 6, 9] | [3, 2, 1, 1, 4, 5, 6, 9] |
| 3 | swap 3 with 1 | [1, 2, 1, 3, 4, 5, 6, 9] | [2, 1, 1, 3, 4, 5, 6, 9] |
| 2 | swap 2 with 1 | [1, 1, 2, 3, 4, 5, 6, 9] | [1, 1, 2, 3, 4, 5, 6, 9] |
| 1 | swap 1 with 1 | [1, 1, 2, 3, 4, 5, 6, 9] | done |
Where It's Used Today
Embedded systems — heap sort's predictable, in-place behavior makes it a safe choice when memory is tight and worst-case timing matters.
Linux kernel — the generic sort() routine in lib/sort.c is a heapsort, chosen for guaranteed O(n log n) time without extra memory.
Real-time scheduling — operating systems and game engines maintain priority queues (the heap, without the sort step) to pick the next task to run.
Top-K queries — "give me the 10 largest values from a stream of millions" is solved by maintaining a small heap.
Graph algorithms — Dijkstra's shortest-path and Prim's minimum spanning tree both use a heap (priority queue) as the inner data structure.
When NOT to Use
When you need a stable sort — heap sort reorders equal elements unpredictably; use merge sort or Timsort instead.
When cache performance matters — heap sort jumps around the array and runs noticeably slower than quicksort or Timsort on modern hardware despite the same O(n log n) bound.
When the data is mostly sorted — insertion sort or Timsort run in near-linear time on such input; heap sort still does the full n log n work.
Common Mistakes
Skipping buildMaxHeap and trusting the input — without an initial heap structure, every later siftDown is meaningless.
Off-by-one in the loop bounds, calling siftDown(0, END) instead of siftDown(0, END - 1), which re-includes the just-placed maximum.
Implementing siftDown to swap with the first child that beats the parent instead of the larger of the two children, breaking the heap property silently.
Try It with an AI Assistant
short
Write heap_sort(a) that sorts a list in place by building a max-heap and repeatedly extracting.
behavior
Write a function that sorts an array in place. First, rearrange it so every parent at index i is at least as large as its children at indices 2i+1 and 2i+2. Then, repeatedly swap the first element with the last unsorted element, shrink the unsorted region by one, and re-establish the parent-at-least-as-large-as-children property over what remains.
Made priority queues fast to update when new items arrive.
heap ← [10, 8, 6, 5, 3, 9]   // 9 was just appended at index 5
i ← 5
WHILE i > 0
    parent ← (i - 1) DIV 2
    IF heap[parent] >= heap[i] THEN
        BREAK
    ENDIF
    swap(heap[parent], heap[i])
    i ← parent
ENDWHILE
RETURN heap
J. W. J. Williams introduced sift-up and sift-down as the two invariant-restoring operations for a binary heap — a single concept (a partial order) maintained by two mirror procedures. Williams' 1964 paper is barely two pages and contains both heap sort and the priority queue, ideas that would dominate CS for decades. He effectively invented an entire family of algorithms in the time most papers spend on a single result.
Teaches: Restore order by bubbling elements upward
The Idea
Start at index i (the position where the new value just landed). Compute its parent's index parent = (i − 1) / 2, using integer division. If the parent is already greater than or equal to heap[i], the heap property holds, so stop. Otherwise, swap the two values and set i ← parent. Repeat until either the parent dominates or you reach index 0.
Why does it work? Before the insertion, every parent–child relationship in the heap was already correct. The only relationship that can be wrong is the one between the new element and its parent — and after each swap, the only relationship that can still be wrong is the one between the now-promoted value and its new parent. The fix walks up at most log₂(n) levels because the tree's height is logarithmic in the array size. The invariant: every parent–child pair except possibly the one at index i satisfies the max-heap property.
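The bubbling step described above can be sketched in Python. The function name `sift_up` follows the prompt later in this entry; the code assumes a 0-indexed max-heap stored in a list.

```python
def sift_up(heap, i):
    """Move heap[i] toward the root until its parent is at least as large.

    Assumes every other parent-child pair already satisfies max-heap order.
    """
    while i > 0:
        parent = (i - 1) // 2          # integer division locates the parent
        if heap[parent] >= heap[i]:
            break                      # parent dominates, heap order restored
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent                     # keep following the promoted value
    return heap


print(sift_up([10, 8, 6, 5, 3, 9], 5))  # [10, 8, 9, 5, 3, 6]
```

The loop condition `i > 0` is what stops the walk at the root, where there is no parent left to compare against.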
Trace
| step | i | parent | heap[parent] | heap[i] | swap? | heap after |
|---|---|---|---|---|---|---|
| 0 | 5 | 2 | 6 | 9 | yes | [10, 8, 9, 5, 3, 6] |
| 1 | 2 | 0 | 10 | 9 | no | [10, 8, 9, 5, 3, 6] (stop) |
Where It's Used Today
Priority queues — every time you push a job into a heap-backed task scheduler (operating systems, build systems, work queues), sift-up runs.
Dijkstra's shortest path — maintaining the frontier of "best known distance" is one sift-up per relaxed edge.
Heapsort — in Williams' original formulation the build-heap phase is n sift-up calls (Floyd's later variant builds with sift-down instead); the sort phase mirrors the operation with sift-down.
A* pathfinding in games — the open set is a min-heap, and every newly discovered node is sifted up to find its place.
Event-driven simulations — when an event is scheduled at a future time, it's inserted into a min-heap of events and sift-up positions it correctly by timestamp.
When NOT to Use
When you're removing the root (or any non-leaf) — sift-up is for newly added leaves; sift-down is the right mirror operation.
When you're building a heap from scratch — calling sift-up n times costs O(n log n); Floyd's bottom-up sift-down build is O(n).
When the underlying array isn't already a heap below index i — sift-up only fixes one broken parent-child link, not arbitrary disorder.
Common Mistakes
Computing the parent as i/2 instead of (i-1)/2, which works for 1-indexed heaps but corrupts a 0-indexed array.
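Ending the loop when parent reaches 0 instead of when i reaches 0, which skips the needed comparison against the root.
Mixing up > and >= in the stop test, so a value equal to its parent is swapped anyway; the loop still terminates at the root, but it does needless work and reorders equal priorities.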
Try It with an AI Assistant
short
Write sift_up(heap, i) restoring max-heap order by moving the element at index i upward until heap order is satisfied.
behavior
Write a function that takes an array and an index i. Treat the array as a binary tree where index i's parent is at (i − 1) / 2. While i > 0 and heap[i] is greater than heap[parent], swap them and move i to parent. Stop as soon as the parent is at least as large as the current value.
Made extracting the largest item from a heap fast.
heap ← [1, 9, 5, 7, 3]
n ← 5
i ← 0
WHILE TRUE
    l ← 2*i + 1
    r ← 2*i + 2
    m ← i
    IF l < n AND heap[l] > heap[m] THEN
        m ← l
    ENDIF
    IF r < n AND heap[r] > heap[m] THEN
        m ← r
    ENDIF
    IF m = i THEN
        BREAK
    ENDIF
    swap(heap[i], heap[m])
    i ← m
ENDWHILE
RETURN heap
J. W. J. Williams introduced the heap data structure in his 1964 Communications of the ACM paper "Algorithm 232: Heapsort." The paper was barely two pages long, but its ideas have proven astonishingly durable: std::priority_queue in the C++ standard library, Python's heapq module, and many task schedulers still lean on the same array-indexed binary tree he sketched. Robert Floyd refined the construction phase a few months later, using sift-down to build the heap bottom-up, and the underlying sift-down loop is essentially unchanged six decades on.
Teaches: Restore order by pushing elements downward
Anecdote
Sift-down is the operation that powers extract-max (and heap-sort's main loop). Williams' choice to use a complete binary tree stored in an array, with child indices computed arithmetically (2i and 2i+1 in a 1-indexed array, 2i+1 and 2i+2 in the 0-indexed form used here), is one of computing's quietly perfect designs: no pointers, no allocations, just integer math. Modern heaps in production code, decades later, are still the same idea.
The Idea
Williams' beautiful trick is to store the binary tree in a flat array. The children of index i live at indices 2i + 1 and 2i + 2 — no pointers, no extra memory, just integer math. To sift down from index i: compare heap[i] against its left child heap[l] and right child heap[r], find the largest of the three (call its index m), and if m is not i, swap heap[i] with heap[m] and continue from index m. Stop the moment the parent is already the largest.
Why does this work? The invariant is that once we've stopped swapping at some index, the subtree rooted there is a valid max-heap. Each swap moves the violating value strictly downward; since the tree has height about log₂(n), the loop runs at most that many times. That's why heap operations cost O(log n) — fast enough that millions of insert/extract operations finish in a blink.
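A Python sketch of the loop described above follows. It uses the `sift_down(heap, i, n)` signature from the prompt later in this entry, where n is the number of elements that belong to the heap; a 0-indexed max-heap is assumed.

```python
def sift_down(heap, i, n):
    """Push heap[i] down through the first n elements until it dominates
    both children. Assumes both subtrees of i are already valid max-heaps.
    """
    while True:
        left, right, largest = 2 * i + 1, 2 * i + 2, i
        if left < n and heap[left] > heap[largest]:
            largest = left
        if right < n and heap[right] > heap[largest]:
            largest = right
        if largest == i:
            break                       # parent is already the largest of the three
        heap[i], heap[largest] = heap[largest], heap[i]
        i = largest                     # continue from the child we swapped into
    return heap


print(sift_down([1, 9, 5, 7, 3], 0, 5))  # [9, 7, 5, 1, 3]
```

Passing n separately, rather than always using `len(heap)`, is what lets heap sort shrink the heap while keeping the sorted tail in the same list.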
Trace
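| step | i | l | r | heap[i] | heap[l] | heap[r] | m | swap? | heap after |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 1 | 2 | 1 | 9 | 5 | 1 | swap 0 ↔ 1 | [9, 1, 5, 7, 3] |
| 2 | 1 | 3 | 4 | 1 | 7 | 3 | 3 | swap 1 ↔ 3 | [9, 7, 5, 1, 3] |
| 3 | 3 | 7 | 8 | 1 | — | — | 3 | l ≥ n, no swap, m = i → break | [9, 7, 5, 1, 3] |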
Where It's Used Today
Priority queues — operating systems pick which process runs next, networks pick which packet to send first; sift-down keeps the queue ordered after each pop.
Heap-sort — every extract-max call in heap-sort is a sift-down on the root; the entire sort is built on this one operation.
Dijkstra's shortest path — GPS apps and network routers use a min-heap of frontier nodes, and sift-down runs each time the closest node is extracted.
Top-K queries — finding the 100 most relevant search results, the top trending hashtags, or the highest-value transactions all use a heap sized to K, repaired by sift-down.
Event simulation — game engines and discrete simulators schedule events in a min-heap by timestamp; sift-down fires after every event is consumed.
When NOT to Use
When you need to repair after inserting at the bottom — that's sift-up, which moves the value toward the root instead.
When the underlying structure is a search tree (BST, AVL) — heaps don't support ordered traversal or find(value) cheaply.
When the array doesn't already satisfy the heap property in both subtrees of i — sift-down assumes the violation is local to i.
Common Mistakes
Comparing the parent only against the left child and forgetting the right one — the right subtree's heap order silently breaks.
Swapping with a child but then continuing from i instead of m, so the value stops moving down after one swap.
Using 2i and 2i + 1 (1-indexed convention) on a 0-indexed array, addressing the wrong children entirely.
Try It with an AI Assistant
short
Write sift_down(heap, i, n) restoring max-heap order by moving the element at index i downward through the first n elements.
behavior
Given an array treated as a binary tree where the children of index i are at indices 2i+1 and 2i+2, repeatedly compare the value at i with its two children. If a child is larger, swap with the largest child and continue from that child's index. Stop when the value is at least as large as both children, or has no children inside the first n positions.
For instance: Add a new emergency case into a hospital priority queue.
heap ← [15, 10, 8, 7, 5]
x ← 20
append(heap, x)
i ← length(heap) - 1
WHILE i > 0
    parent ← (i - 1) DIV 2
    IF heap[parent] >= heap[i] THEN
        BREAK
    ENDIF
    swap(heap[parent], heap[i])
    i ← parent
ENDWHILE
RETURN heap
In 1964, J. W. J. Williams at Elliott Brothers (London) Ltd. in Borehamwood, England, published Algorithm 232 ("Heapsort") in Communications of the ACM. The two-page note slipped in a quiet revolution: by storing a binary tree implicitly inside an array (parent of i at (i-1)/2 in the 0-indexed form used here), every priority-queue operation could be done in O(log n) with no pointers at all. Within a year Robert Floyd had refined the algorithm with a faster bottom-up heap construction, and within a decade the heap had become the engine behind Dijkstra, A*, Huffman coding, and event-driven simulators of every kind.
Teaches: Insert while maintaining highest-priority access
Anecdote
Insert appends to the end and sifts up — three lines of code. The genius is that the heap's invariant survives the operation in O(log n), so insert and extract together give O(n log n) sorting. Dijkstra, A*, Huffman codes, k-way merge sorts — all rest on this simple primitive Williams cracked in two pages.
The Idea
Append the new value x at the end of the heap array — the next leaf position. That keeps the shape of the tree (a complete binary tree) correct, but it might violate the heap property if x is bigger than its parent. So bubble it up: while the new element is larger than its parent, swap them. Stop as soon as the parent is larger or equal — or when you reach the root.
The parent of index i lives at index (i - 1) // 2, which is the bit of arithmetic that lets a heap sit inside a flat array with no pointers. Why does the bubble-up work? Because the heap property only fails along the path from the new leaf to the root — every other parent-child pair was already fine before the insert, and stays fine after, since none of them changed. Each swap moves the violation up one level, and the height of the tree is log n, so the operation takes at most log n swaps.
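The append-then-bubble procedure above can be sketched in Python as shown here. The names `heap_insert` and `is_max_heap` are illustrative, not from the source; the checker is included only to make the invariant explicit.

```python
def heap_insert(heap, x):
    """Append x to a 0-indexed max-heap and bubble it up into place."""
    heap.append(x)                     # new leaf keeps the tree complete
    i = len(heap) - 1
    while i > 0:
        parent = (i - 1) // 2
        if heap[parent] >= heap[i]:
            break                      # parent dominates, nothing left to fix
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent
    return heap


def is_max_heap(heap):
    """Check that every non-root element is no larger than its parent."""
    return all(heap[(i - 1) // 2] >= heap[i] for i in range(1, len(heap)))


queue = heap_insert([15, 10, 8, 7, 5], 20)
print(queue)               # [20, 10, 15, 7, 5, 8]
print(is_max_heap(queue))  # True
```

For production use in Python, the standard library's `heapq` module offers the same operation for min-heaps via `heapq.heappush`.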
Trace
| step | i | parent = (i-1)//2 | heap[parent] | heap[i] | swap? | heap after |
|---|---|---|---|---|---|---|
| 1 | 5 | 2 | 8 | 20 | yes | [15, 10, 20, 7, 5, 8] |
| 2 | 2 | 0 | 15 | 20 | yes | [20, 10, 15, 7, 5, 8] |
| 3 | 0 | — | — | — | stop | [20, 10, 15, 7, 5, 8] |
Where It's Used Today
Operating system schedulers — many real-time and deadline schedulers use heaps to pick the next process to run (Linux's CFS, by contrast, keeps runnable tasks in a red-black tree).
Dijkstra and A* path-finding — shortest-path computations in maps, games, and routers typically use a min-heap to fetch the next node.
Event-driven simulators — physics engines, network simulators, and discrete-event simulations all push timed events into a heap and pop them in time order.
Hospital triage and emergency dispatch — software for ambulance and ER queueing uses priority-queue inserts to slot the most urgent case.
Compression and encoding — Huffman code construction repeatedly inserts and extracts the lowest-frequency tree from a heap.
When NOT to Use
When you need to update an item's priority frequently — plain heaps don't support efficient decrease-key; use an indexed heap or Fibonacci heap.
When you only ever need the maximum once on a small batch — a single linear scan is simpler than building a heap.
When you need items in fully sorted order with random access — a heap only guarantees the root is largest; sort the array directly instead.
Common Mistakes
Computing the parent as i / 2 instead of (i - 1) / 2 for a 0-indexed array — every comparison hits the wrong slot.
Forgetting to break out of the loop when the parent already dominates — the loop keeps swapping past the correct position and the heap property breaks.
Mixing up min-heap and max-heap comparisons — one wrong inequality silently turns your priority queue inside-out.
Try It with an AI Assistant
short
Write a class PriorityQueue with insert(x, priority) using a binary max-heap (higher priority served first).
behavior
Write a function that takes an array heap and a value x, appends x to the end of heap, then repeatedly compares the new element to its parent at index (i - 1) // 2. Whenever the parent is smaller than the element, swap them and continue from the parent's index. Stop when the parent is at least as large, or when you reach index 0.
For instance: Always process the task with earliest deadline.
heap ← [2, 5, 4, 9, 7, 6]
IF length(heap) = 0 THEN
    RETURN NULL
ENDIF
root ← heap[0]
heap[0] ← last element
remove last element
i ← 0
WHILE TRUE
    left ← 2*i + 1
    right ← 2*i + 2
    smallest ← i
    IF left < length(heap) AND heap[left] < heap[smallest] THEN
        smallest ← left
    ENDIF
    IF right < length(heap) AND heap[right] < heap[smallest] THEN
        smallest ← right
    ENDIF
    IF smallest = i THEN
        BREAK
    ENDIF
    SWAP heap[i], heap[smallest]
    i ← smallest
ENDWHILE
RETURN root
J. W. J. Williams introduced the heap in 1964 as the data structure behind his new sorting algorithm, Heapsort. The "extract" operation was the engine: pull the root, fill its slot with the last leaf, sift the new root downward, repeat until the array is sorted. The same three lines that powered Heapsort soon turned up at the core of Dijkstra's shortest paths, event-driven simulators, and operating-system schedulers — wherever a program has to keep grabbing "the next smallest thing" from an ever-changing pile.
Teaches: Remove top element while restoring heap structure efficiently
Anecdote
Extract swaps the root with the last element, shrinks the array, and sifts the new root down — also three lines of code. Williams' four heap operations (insert, extract, sift up, sift down) are the most economical "complete data structure" in computing's history: 12 lines of code, supporting decades of algorithms downstream.
The Idea
Three steps: save the root (the smallest), move the last element to position 0, and sift down until the heap property is restored. Sifting down means: at each step, look at the current node's two children, and if either child is smaller, swap with the smaller one. Continue until the node is smaller than both children, or it has no children.
This works because of the heap invariant: every parent is ≤ its children. Replacing the root with the last element breaks the invariant only at the root itself. The sift-down walk fixes one violation at a time, and the violation can only travel downward, so after at most log n swaps the entire heap is valid again. Extract is therefore O(log n).
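The three steps above might be written in Python like this. It is a sketch over a plain list treated as a 0-indexed min-heap; returning `None` for an empty heap mirrors the pseudocode's NULL and is a choice made here, not the only reasonable one.

```python
def extract_min(heap):
    """Remove and return the smallest element of a 0-indexed min-heap."""
    if not heap:
        return None                    # empty heap, nothing to extract
    root = heap[0]
    last = heap.pop()                  # shrink the array by one
    if heap:                           # skip the repair if that was the only item
        heap[0] = last
        i, n = 0, len(heap)
        while True:
            left, right, smallest = 2 * i + 1, 2 * i + 2, i
            if left < n and heap[left] < heap[smallest]:
                smallest = left
            if right < n and heap[right] < heap[smallest]:
                smallest = right
            if smallest == i:
                break                  # node is no larger than either child
            heap[i], heap[smallest] = heap[smallest], heap[i]
            i = smallest
    return root


tasks = [2, 5, 4, 9, 7, 6]
print(extract_min(tasks))  # 2
print(tasks)               # [4, 5, 6, 9, 7]
```

Calling it repeatedly until the list is empty yields the values in ascending order, which is the heap-sort connection the entry mentions.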
Trace
| step | i | left | right | heap[i], heap[left], heap[right] | smallest | action |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 6, 5, 4 | 2 (right) | swap heap[0] and heap[2] |
| 1 | 2 | 5 | 6 | 6, —, — | 2 (i) | i has no children, break |
Where It's Used Today
Dijkstra's shortest path — repeatedly extract the closest unvisited node; this is the inner loop of every map-routing service.
Operating system schedulers — the kernel pulls the next thread to run from a priority queue keyed on priority and deadline.
Event-driven simulators — physics engines and network simulators always process the earliest event next.
A* search in games and robotics — extract the lowest-cost-plus-heuristic node from the open set.
Heap-sort — repeatedly extract the smallest from a heap to produce a sorted list.
When NOT to Use
When you need to remove or update an arbitrary element by key — a binary heap can't find a non-root item in O(log n); use an indexed heap or a balanced BST.
When the priority of items changes constantly — repeated removals and re-inserts thrash the heap; a Fibonacci heap or pairing heap handles decrease-key better.
When the queue is tiny (a handful of items) — a sorted array or even a linear scan beats the constant-factor overhead of heap arithmetic.
Common Mistakes
Forgetting the empty-heap case, returning heap[0] from an array of length 0 and crashing on the index access.
Sifting down by always swapping with the left child instead of the smaller of the two children, breaking the heap invariant silently.
Not shrinking the array after moving the last element to the root, so the now-duplicated last value reappears in later operations.
Try It with an AI Assistant
short
Add extract_min() to PriorityQueue (a min-heap) that removes and returns the smallest element by sifting down from the root.
behavior
Given an array that obeys the rule that every parent's value is no larger than the values of its two children, save the value at index 0, move the last element to index 0, shrink the array by one, and repeatedly swap the element at the current index with the smaller of its two children until both children are at least as large or it has no children. Return the saved value.
For instance: Check whether two computers are already in the same network group.
n ← 5
parent ← [0, 1, 2, 3, 4]
FUNCTION find(x)
    IF parent[x] != x THEN
        parent[x] ← find(parent[x])
    ENDIF
    RETURN parent[x]
END FUNCTION
// union(a, b): merge the groups of a and b
ra ← find(a)
rb ← find(b)
IF ra != rb THEN
    parent[rb] ← ra
ENDIF
Bernard Galler and Michael Fischer introduced the disjoint-set data structure in a 1964 paper at the University of Michigan, originally to handle equivalence declarations in early Fortran-like compilers. The real breakthrough came about a decade later, when Robert Tarjan analyzed the combined effect of union by rank and path compression and proved the amortized cost per operation is essentially constant, bounded by the inverse Ackermann function. That analysis remains one of the most celebrated results in algorithm theory, and the structure became indispensable in everything from Kruskal's MST to modern image segmentation.
Teaches: Merge sets quickly by tracking representative roots
Anecdote
The amortized cost per operation is bounded by the inverse Ackermann function — a function that grows so slowly it stays below 5 for any input that fits in the observable universe. Engineers joke: "It's basically constant," and in practice nobody bothers calling it anything else.
The Idea
Each group is represented by a tree where every item points to a "parent." The root of a tree (the item whose parent is itself) is the group's representative. To answer find(x), walk up the parent chain until you hit a root. To union(a, b), find both roots and make one of them point to the other.
The cleverness is path compression: every time you do a find, you flatten the chain by re-pointing each visited node directly to the root. This way, future find calls on those nodes are nearly instant. Combined with union-by-rank (always linking the shallower tree under the deeper one, a step the pseudocode above leaves out for brevity), the amortized cost per operation becomes essentially O(1): it is bounded by the inverse Ackermann function, which is less than 5 for any input that fits in the universe. The invariant is simple: items in the same group always reach the same root.
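Both optimizations described above can be combined in a short Python class. This is a sketch under stated assumptions: the class name `DisjointSet` and the `rank` list are introduced here for illustration, and `find` is written iteratively (two passes) rather than recursively to avoid deep recursion on long chains.

```python
class DisjointSet:
    """Union-find over items 0..n-1 with path compression and union by rank."""

    def __init__(self, n):
        self.parent = list(range(n))   # every item starts as its own root
        self.rank = [0] * n            # upper bound on each root's tree height

    def find(self, x):
        root = x
        while self.parent[root] != root:   # first pass: locate the root
            root = self.parent[root]
        while self.parent[x] != root:      # second pass: re-point the path
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        """Merge the groups of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra                # make ra the deeper (or equal) tree
        self.parent[rb] = ra               # hang the shallower tree under it
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


ds = DisjointSet(5)
ds.union(0, 1)
ds.union(2, 3)
ds.union(0, 2)
print(ds.find(3) == ds.find(1))  # True
print(ds.find(4) == ds.find(0))  # False
```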
Trace
| step | operation | parent array | what happens |
|---|---|---|---|
| 0 | (initial) | [0, 1, 2, 3, 4] | every item is its own root |
| 1 | union(0, 1) | [0, 0, 2, 3, 4] | ra=0, rb=1; set parent[1] = 0 |
| 2 | union(2, 3) | [0, 0, 2, 2, 4] | ra=2, rb=3; set parent[3] = 2 |
| 3 | union(0, 2) | [0, 0, 0, 2, 4] | ra=0, rb=2; set parent[2] = 0 |
| 4 | find(3) | [0, 0, 0, 0, 4] | walks 3 → 2 → 0; path-compresses parent[3] = 0 |
| 5 | find(1) | [0, 0, 0, 0, 4] | walks 1 → 0; returns 0 |
Where It's Used Today
Kruskal's minimum spanning tree — every step asks "are these two endpoints already connected?" and unions them if not.
Network connectivity tracking — internet routers and peer-to-peer systems use union-find to detect partitions.
Image segmentation — grouping connected pixels of similar color into regions for computer vision.
Online social-network analysis — quickly answering "are these two people in the same friend cluster?"
Type inference in compilers — unifying type variables during program analysis.
When NOT to Use
When you need to split a group back apart — union-find supports merge, not separation; use a different structure for dynamic disconnect.
When you need to enumerate the members of a group quickly — the parent array only encodes roots, not membership lists.
When the universe of items isn't known up front and items aren't easily indexed — the array-based representation falls apart.
Common Mistakes
Skipping path compression in find, so the parent chain grows long and operations degrade to O(n).
Setting parent[a] = b directly without first calling find on both, breaking the invariant that only roots get re-parented.
Forgetting union-by-rank or union-by-size — without it, repeated unions can build a worst-case linear chain.
Try It with an AI Assistant
short
Maintain disjoint sets with fast merge and representative lookup operations.
behavior
Implement a data structure for n items where each item has a 'parent' pointer that initially points to itself. Provide a find(x) operation that walks parent pointers up to the root and re-points every visited node directly to the root, and a union(a, b) operation that finds both roots and makes one root the parent of the other.
For instance: Audio software can separate sound into frequencies quickly.
a ← [1, 2, 3, 4]
FUNCTION fft(a)
    n ← length(a)
    IF n = 1 THEN
        RETURN a
    ENDIF
    even ← fft(elements at even indices)
    odd ← fft(elements at odd indices)
    angle ← 2 * π / n
    w ← 1
    wn ← cos(angle) + i*sin(angle)
    y ← array[n]
    FOR k FROM 0 TO n/2 - 1
        t ← w * odd[k]
        y[k] ← even[k] + t
        y[k + n/2] ← even[k] - t
        w ← w * wn
    ENDFOR
    RETURN y
END FUNCTION
RETURN fft(a)
FFT reduced computation dramatically from O(n²) to O(n log n), transforming signal processing, audio, telecommunications, imaging, and scientific simulation.
Direct Fourier transforms became too slow for large scientific and engineering computations.
Teaches: Split into halves, solve each, recombine with twist factors
The Idea
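Cooley and Tukey's insight is divide-and-conquer with a clever twist. Split the input array into two halves: the even-indexed values and the odd-indexed values. Compute the FFT of each half recursively. Then combine the two half-results into one full result using a special "twist factor" (usually called a twiddle factor) built from a root of unity, wn = cos(2π/n) + i·sin(2π/n). This is the positive-exponent sign convention; many libraries, NumPy among them, use the conjugate e^(−2πi/n) for the forward transform, which for real input yields the complex conjugate of the outputs shown here.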
Why does this work? The DFT of an n-length signal can be algebraically rewritten as a sum of two DFTs on length n/2, plus a small correction at each output index k. The correction is w · odd[k], where w rotates around the unit circle. The trick: output k and output k + n/2 use the same even/odd values but with opposite signs of the correction — so each pair of outputs costs only one multiplication, not two. This halving repeats log n times, which is where the n log n cost comes from. The invariant is that at every level, both halves are correct DFTs of their respective sub-signals.
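The recursion described above can be sketched in Python with the standard `cmath` module. The sketch deliberately keeps this entry's positive-exponent twiddle factor, so for real input its outputs are the complex conjugates of what a library using the negative-exponent convention returns. The power-of-two check is an added guard, not part of the pseudocode.

```python
import cmath


def fft(a):
    """Recursive radix-2 Cooley-Tukey transform of a power-of-two-length list.

    Uses wn = exp(+2*pi*i/n), the sign convention of this entry's pseudocode.
    """
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(a[0])]
    even = fft(a[0::2])                  # transform of even-indexed samples
    odd = fft(a[1::2])                   # transform of odd-indexed samples
    wn = cmath.exp(2j * cmath.pi / n)    # principal n-th root of unity
    w = 1 + 0j
    y = [0j] * n
    for k in range(n // 2):
        t = w * odd[k]                   # one multiplication serves two outputs
        y[k] = even[k] + t
        y[k + n // 2] = even[k] - t
        w *= wn                          # rotate to the next twiddle factor
    return y


spectrum = fft([1, 2, 3, 4])
# Approximately [10, -2-2j, -2, -2+2j], up to floating-point rounding.
print([complex(round(v.real, 9), round(v.imag, 9)) for v in spectrum])
```

Accumulating `w *= wn` matches the pseudocode; for long inputs a precomputed twiddle table is usually the more accurate choice.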
Trace
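Top-level call on a = [1, 2, 3, 4], after the recursive calls have returned even = fft([1, 3]) = [4, −2] and odd = fft([2, 4]) = [6, −2], with wn = i:
| k | w | t = w · odd[k] | y[k] = even[k] + t | y[k + 2] = even[k] − t |
|---|---|---|---|---|
| 0 | 1 | 1·6 = 6 | 4 + 6 = 10 | 4 − 6 = −2 |
| 1 | i | i·(−2) = −2i | −2 + (−2i) = −2−2i | −2 − (−2i) = −2+2i |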
Where It's Used Today
Audio compression — MP3, AAC, and Opus split sound into frequency bands using the MDCT, a transform normally computed with an FFT, and discard detail the ear can't hear.
Image and video compression — JPEG and MPEG use a close relative (the discrete cosine transform, computed with FFT-style fast algorithms) to compress photos and frames.
Wi-Fi and 4G/5G — modern radios encode bits onto thousands of frequency tones at once and decode them with an FFT in every chip in your phone.
Medical imaging — MRI machines collect raw data in the frequency domain and apply FFT to reconstruct the human-readable image.
Astronomy and physics — pulsar searches, gravitational-wave detectors (LIGO), and seismic analysis all run FFTs on long streams of data to spot periodic signals.
When NOT to Use
When the input length is tiny (say, n ≤ 16) — the constant factors in FFT (recursion, complex arithmetic, twiddle setup) dominate; the direct DFT or hand-coded butterflies are faster.
When you only need a few specific frequency bins — Goertzel's algorithm computes one bin in O(n) without producing the whole spectrum.
When the input length is not a power of 2 and you're using a radix-2 implementation — either zero-pad (which changes the frequency grid) or use a mixed-radix or Bluestein's FFT.
Common Mistakes
Mixing up the sign convention or forgetting the 1/n normalization on the inverse — the round-trip then doesn't recover the original signal.
Splitting the array by the first/second half instead of even/odd indices — the recursion is no longer the Cooley-Tukey decomposition and the combine step is wrong.
Recomputing cos(2π·k/n) and sin(2π·k/n) from scratch inside the inner loop is slow, while accumulating w *= wn saves trig calls but lets rounding error drift; a precomputed twiddle table is both fast and more accurate at scale.
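The sign and normalization pitfall can be made concrete with a round trip. The sketch below is illustrative only: the sign parameter and the ifft helper are hypothetical names, power-of-two lengths are assumed, and twiddles are recomputed per level for clarity rather than speed. The forward transform uses the positive exponent from the text, so the inverse uses the negative exponent and divides by n.

```python
import cmath

def fft(a, sign=+1):
    """Radix-2 transform; sign=+1 is the forward convention used in the text."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    even = fft(a[0::2], sign)
    odd = fft(a[1::2], sign)
    y = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        y[k] = even[k] + t
        y[k + n // 2] = even[k] - t
    return y

def ifft(y):
    """Inverse: opposite sign in the exponent, then the 1/n normalization."""
    n = len(y)
    return [v / n for v in fft(y, sign=-1)]
```

Dropping the division by n returns the signal scaled by n, and reusing the forward sign returns it in reversed order, which are the two symptoms described above.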
Try It with an AI Assistant
short
Write fft(a) returning the DFT using the recursive Cooley-Tukey method.
behavior
Write a recursive function that, given a list of numbers whose length is a power of two, splits it into the even-indexed and odd-indexed halves, computes the same function on each half, then combines the two half-results into one full-length output where output k is even[k] + w·odd[k] and output k + n/2 is even[k] − w·odd[k], with w rotating around the unit circle by an angle of 2π/n per step.
For instance: A phone keeps recently opened apps ready and removes old ones first.
capacity ← 3
map ← empty hash map // key → list node
list ← empty doubly linked list
// get(key)
IF key NOT IN map THEN
  RETURN -1
ENDIF
node ← map[key]
moveToFront(node)
RETURN node.value
// put(key, value)
IF key IN map THEN
  node ← map[key]
  node.value ← value
  moveToFront(node)
  RETURN
ENDIF
IF size(map) = capacity THEN
  remove tail node FROM map AND list
ENDIF
node ← newNode(key, value)
insertAtFront(node)
map[key] ← node
LRU emerged from operating-systems folklore in the mid-1960s as designers wrestled with virtual memory: when RAM filled up, some page had to be swapped to drum or disk, and choosing the wrong one wrecked performance. Researchers studying program behavior noticed that real workloads display temporal locality — a page touched recently is far more likely to be touched again than one untouched for hours. LRU codified that intuition, and within a decade every major OS, database, and CPU cache had adopted it or a close cousin.
Teaches: Combine two structures so each operation stays constant time
The Idea
Combine two data structures: a hash map from key to a node, plus a doubly linked list that keeps nodes in order from most-recently-used (front) to least-recently-used (back).
- get(key): look up the node in the map (O(1)), move it to the front of the list (O(1)), return its value.
- put(key, value): if the key exists, update its value and move its node to the front. Otherwise, if the cache is full, remove the tail node (the least recently used) from both the list and the map, then insert the new node at the front.
Why does this combination give O(1)? The hash map answers "where is this key" instantly, without scanning. The doubly linked list lets you splice a node out and reinsert it at the front in constant time, because each node holds direct pointers to its neighbors. The list ranks recency for free — anything near the front is fresh, anything near the back is stale.
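The map-plus-list combination can be sketched in Python like this. It is one possible layout, not a reference implementation: the class and method names are hypothetical, the sentinel head and tail nodes are my own addition to avoid empty-list special cases, and capacity is assumed to be at least 1.

```python
class Node:
    def __init__(self, key=None, value=None):
        self.key, self.value = key, value
        self.prev = self.next = None

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity          # assumed >= 1
        self.map = {}                     # key -> Node
        # Sentinels (an assumption of this sketch): head.next is the most
        # recently used node, tail.prev is the least recently used one.
        self.head, self.tail = Node(), Node()
        self.head.next, self.tail.prev = self.tail, self.head

    def _unlink(self, node):
        node.prev.next, node.next.prev = node.next, node.prev

    def _insert_at_front(self, node):
        node.prev, node.next = self.head, self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        if key not in self.map:
            return -1
        node = self.map[key]
        self._unlink(node)                # splice out in O(1)
        self._insert_at_front(node)       # mark as just used
        return node.value

    def put(self, key, value):
        if key in self.map:
            node = self.map[key]
            node.value = value
            self._unlink(node)
            self._insert_at_front(node)
            return
        if len(self.map) == self.capacity:
            lru = self.tail.prev          # least recently used node
            self._unlink(lru)
            del self.map[lru.key]         # remove from the map as well
        node = Node(key, value)
        self._insert_at_front(node)
        self.map[key] = node

    def keys_front_to_back(self):
        """Helper for inspection only: keys from most to least recent."""
        out, node = [], self.head.next
        while node is not self.tail:
            out.append(node.key)
            node = node.next
        return out
```

Each node stores its own key so that eviction can delete the matching map entry without a search.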
Trace
step | operation | list (front → back) | map keys | returns
1 | put(1, "A") | [1] | {1} | -
2 | put(2, "B") | [2, 1] | {1, 2} | -
3 | put(3, "C") | [3, 2, 1] | {1, 2, 3} | -
4 | get(1) | [1, 3, 2] | {1, 2, 3} | "A"
5 | put(4, "D") | [4, 1, 3] | {1, 3, 4} | -
6 | get(2) | [4, 1, 3] | {1, 3, 4} | -1
Where It's Used Today
Web browsers — recently visited pages, images, and JavaScript bundles are kept in an LRU cache so the back button feels instant.
Databases — MySQL's InnoDB buffer pool uses an LRU variant, PostgreSQL's shared buffers use a clock-sweep approximation, and Redis can evict keys with an approximate LRU policy.
Operating systems — page replacement when RAM fills up uses LRU or close approximations to decide which memory pages to swap to disk.
CPUs — many processors use LRU approximations (such as pseudo-LRU) to evict cache lines from L1/L2/L3 caches.
Mobile phones — when you switch between apps, the OS keeps recent ones suspended in memory and kills the least-recently-used app first when memory pressure rises.
When NOT to Use
When access patterns are dominated by one-time scans (a backup or full-table read) — LRU evicts the useful hot data; use LFU, ARC, or scan-resistant policies.
When most items are equally likely to be accessed (uniform random) — recency carries no signal; a random eviction policy performs just as well with less bookkeeping.
When threads access the cache concurrently and you cannot tolerate the locking overhead — sharded caches or cheaper approximations (CLOCK, sampled LRU) usually win.
Common Mistakes
Using a singly linked list — you cannot splice a middle node out in O(1) without backward pointers, so get quietly becomes O(n).
Forgetting to delete the evicted key from the hash map after removing the tail node, leaking entries until the map outgrows the list.
Failing to move a node to the front on get — the cache then evicts based on insertion order, not recency, which is FIFO, not LRU.
Try It with an AI Assistant
short
Write an LRU cache with get(key) and put(key, value) in O(1).
behavior
Build a fixed-capacity key-value store. Looking up a key marks it as just-used. Inserting a new key, when the store is at capacity, removes whichever existing key has gone the longest without being looked up or written. Make every operation run in constant time, regardless of the number of stored keys.
For instance: Auto-correct can see “kitten” is close to “sitting”.
a ← "cat"
b ← "cut"
n ← length(a)
m ← length(b)
dp ← matrix(n+1, m+1)
FOR i FROM 0 TO n
  dp[i][0] ← i
ENDFOR
FOR j FROM 0 TO m
  dp[0][j] ← j
ENDFOR
FOR i FROM 1 TO n
  FOR j FROM 1 TO m
    IF a[i-1] = b[j-1] THEN
      dp[i][j] ← dp[i-1][j-1]
    ELSE
      dp[i][j] ← 1 + min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1])
    ENDIF
  ENDFOR
ENDFOR
RETURN dp[n][m]
Vladimir Levenshtein was a Soviet mathematician working on coding theory at the Keldysh Institute in Moscow, where the question of "how many bit-flips can a noisy channel inflict before the message becomes unrecoverable?" was central. His 1965 paper defined the edit distance as a clean numerical answer for binary codes — never imagining that decades later the same metric would underpin spell-checkers, DNA alignment, plagiarism detection, and "did you mean…?" boxes used billions of times a day.
Teaches: Measure difference via minimal sequence of edits
Anecdote
Vladimir Levenshtein worked in Soviet information theory, not linguistics. The algorithm was meant for error correction in communication, not spelling mistakes — autocorrect came decades later.
The Idea
Build a grid with the first string a along the rows and the second string b along the columns. The cell dp[i][j] answers: "what is the minimum cost to turn the first i characters of a into the first j characters of b?" The first row and column are easy — turning an empty string into a j-character string takes j insertions, and vice versa.
For each interior cell, look at three neighbors. If the two current characters match, the cost is exactly the diagonal neighbor dp[i−1][j−1] (no edit needed). Otherwise, it's 1 + min(left, above, diagonal) — 1 plus the cheapest of insertion (left), deletion (above), or substitution (diagonal). The invariant is that every dp[i][j] truly is the minimum cost for those prefixes; it follows by induction from the base cases. The final answer sits in dp[n][m], the bottom-right corner.
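The table fill described above can be sketched in Python as follows. The function name edit_distance is my own, and the sketch keeps the full (n+1) × (m+1) table for readability rather than the memory-saving two-row variant.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i              # delete i characters to reach ""
    for j in range(m + 1):
        dp[0][j] = j              # insert j characters starting from ""
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]        # match: no edit needed
            else:
                dp[i][j] = 1 + min(
                    dp[i - 1][j],      # above: deletion
                    dp[i][j - 1],      # left: insertion
                    dp[i - 1][j - 1],  # diagonal: substitution
                )
    return dp[n][m]
```

With the strings from the pseudocode, edit_distance("cat", "cut") returns 1, and the "kitten" and "sitting" pair from the introduction gives 3.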
Trace
a \ b | "" | "c" | "cu" | "cut"
"" | 0 | 1 | 2 | 3
"c" | 1 | 0 | 1 | 2
"ca" | 2 | 1 | 1 | 2
"cat" | 3 | 2 | 2 | 1
Where It's Used Today
Spell-checkers and autocorrect — when you type "teh," your phone ranks candidate words by edit distance and offers "the."
DNA and protein alignment — bioinformatics tools (BLAST, sequence aligners) extend this idea to compare genomes and find evolutionary differences.
Plagiarism detection — comparing student submissions or scientific papers to find suspicious near-duplicates.
Fuzzy search — search bars that tolerate typos ("did you mean…") use edit distance to rank suggestions.
Version control diffs — git diff and similar tools use a related dynamic-programming approach (longest common subsequence) born from the same idea.
When NOT to Use
When the strings are very long (megabytes of text) and only a few edits separate them — the O(nm) table is wasteful; use Myers' diff or Ukkonen's O(nd) algorithm instead.
When you need transposition ("acb" vs "abc" should cost 1, not 2) — use Damerau-Levenshtein, which adds a swap operation.
When edit operations have very different costs (insert is cheap, substitute is expensive) — the "1 + min(...)" rule assumes equal costs; switch to weighted edit distance.
Common Mistakes
Indexing with a[i] and b[j] instead of a[i-1] and b[j-1] — the dp table is shifted by one because row/column 0 represents the empty prefix, and the off-by-one returns wrong answers or crashes at the last cell.
Initializing the first row and column to all zeros — the base case is dp[i][0] = i and dp[0][j] = j, the cost of deleting i characters to reach the empty string or inserting j characters starting from it.
Taking the minimum of only two neighbors (left and above) and forgetting the diagonal — substitutions are then never considered and distances inflate.
Try It with an AI Assistant
short
Compute minimum insertions, deletions, and substitutions between two strings.
behavior
Write a function that, given two strings, builds a table where cell (i, j) holds the smallest number of single-character changes needed to turn the first i characters of one string into the first j characters of the other. Initialize the first row and column with the lengths, then fill each interior cell as either the diagonal neighbor when the two characters match, or one plus the smallest of the three neighbors otherwise. Return the bottom-right cell.
Made splitting a sorted list into thirds a workable alternative to halving it.
lo ← 0; hi ← n - 1
WHILE lo <= hi
  m1 ← lo + (hi - lo) / 3 // integer division
  m2 ← hi - (hi - lo) / 3
  IF a[m1] = x THEN
    RETURN m1
  ENDIF
  IF a[m2] = x THEN
    RETURN m2
  ENDIF
  IF x < a[m1] THEN
    hi ← m1 - 1
  ELIF x > a[m2] THEN
    lo ← m2 + 1
  ELSE
    lo ← m1 + 1
    hi ← m2 - 1
  ENDIF
ENDWHILE
RETURN -1
Ternary search has no clean attribution because Soviet competitive-programming culture kept it as a folk algorithm for decades. It became a standard interview tool only in the 2000s when the IOI and ACM ICPC routes brought it into Western pedagogy. Russian programming traditions had ternary search a generation before American CS textbooks did.
Teaches: Narrow a sorted range by splitting it into thirds at each step
The Idea
Pick two dividers — m1 one-third of the way from lo to hi, and m2 two-thirds of the way. Check both. If a[m1] or a[m2] equals the target, return that index. Otherwise the comparison tells you which third to keep: if x < a[m1] the answer is in the left third, if x > a[m2] it's in the right third, and if it's in between, it's in the middle third. Discard the other two thirds and repeat.
Why does this work? Because the array is sorted, the values at m1 and m2 divide the search range into three ordered regions. The target can only live in one of them. Each iteration shrinks the range to about one third of its previous size, so the search finishes in O(log n) steps, the same big-O as binary search with a different base on the logarithm.
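A minimal Python sketch of the two-divider search follows. The function name matches the one suggested later in the exercise prompt; the sample array in the usage note is an assumption reconstructed from the values shown in the trace.

```python
def ternary_search(a, x):
    """Return an index of x in the sorted list a, or -1 if x is absent."""
    lo, hi = 0, len(a) - 1
    while lo <= hi:
        third = (hi - lo) // 3          # integer division keeps m1, m2 in range
        m1, m2 = lo + third, hi - third
        if a[m1] == x:
            return m1
        if a[m2] == x:
            return m2
        if x < a[m1]:
            hi = m1 - 1                 # keep the left third
        elif x > a[m2]:
            lo = m2 + 1                 # keep the right third
        else:
            lo, hi = m1 + 1, m2 - 1     # keep the middle third
    return -1
```

For a = [1, 3, 5, 7, 9, 11, 13, 15, 17], ternary_search(a, 11) returns 5 and ternary_search(a, 4) returns -1.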
Trace
step | lo | hi | m1 | m2 | a[m1] | a[m2] | decision
0 | 0 | 8 | 2 | 6 | 5 | 13 | x=11 between a[m1]=5 and a[m2]=13 → narrow to middle third
1 | 3 | 5 | 3 | 5 | 7 | 11 | (hi − lo) / 3 = 0, so m2 = hi = 5; a[m2] = x → return 5
Where It's Used Today
Competitive programming — ternary search is a staple in Codeforces and ICPC contests, both for sorted-array lookup and for finding extrema of unimodal functions (the same shape, applied to f instead of a).
Educational comparisons — it's the canonical example for showing that any constant base (2, 3, k) gives O(log n) and that the choice between bases is a matter of constant-factor trade-offs.
Database B-tree variants — node fan-outs greater than 2 (ternary, k-ary trees) descend from the same divide-into-more-than-halves idea and shrink tree height for large indexes.
Skip-list and probabilistic structures — generalizations of the "split into more than two zones" trick reduce expected hops on long sequences.
Block-aware search — when reading from cache lines, splitting a range into more than two pieces per probe sometimes fits hardware better than strict halving.
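The unimodal-function use mentioned in the competitive-programming item can be sketched as below. This is an illustrative version under stated assumptions: f is assumed to rise strictly to a single peak and then fall on [lo, hi], and the name ternary_max and the fixed iteration count are my own choices.

```python
def ternary_max(f, lo, hi, iterations=100):
    """Approximate the x in [lo, hi] that maximizes a unimodal function f."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1     # the peak cannot lie to the left of m1
        else:
            hi = m2     # the peak cannot lie to the right of m2
    return (lo + hi) / 2
```

Each iteration keeps two thirds of the interval, so 100 iterations shrink it far below floating-point resolution for typical ranges. For example, ternary_max(lambda x: -(x - 2) ** 2, 0, 5) returns a value very close to 2.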
When NOT to Use
When you have plain binary search available and ample CPU — binary search needs only one comparison per step instead of two and is faster in practice for sorted-array lookup.
When the data is unsorted — ternary search relies on the sorted invariant to know which third contains the target.
When the data is on a one-way medium (tape, stream) where you can't bounce around — sequential or jump search is safer.
Common Mistakes
Using < instead of <= (or vice versa) in the elimination step, accidentally discarding the third that contains the target.
Computing m1 = (lo + hi) / 3 instead of lo + (hi - lo) / 3 — the first form can land below lo, outside the current range, so the search probes the wrong elements.
Forgetting to check both a[m1] and a[m2] for equality before narrowing — many one-bug versions narrow first and then lose the answer that was sitting on a divider.
Try It with an AI Assistant
short
Write ternary_search(a, x) over a sorted list using two midpoints to narrow the range to a third each step.
behavior
Write a function that, given a sorted array and a target, picks two dividers one-third and two-thirds of the way through the current range, checks each, and on a miss keeps only the third where the target must lie. Repeat until you find it or the range is empty.
Made detecting cycles in linked structures possible with constant memory.
// Floyd's tortoise & hare
slow ← head; fast ← head
WHILE fast AND fast.next
  slow ← slow.next
  fast ← fast.next.next
  IF slow = fast THEN // cycle found
    slow ← head
    WHILE slow != fast
      slow ← slow.next
      fast ← fast.next
    ENDWHILE
    RETURN slow
  ENDIF
ENDWHILE
RETURN NULL
The trick first appears in print in Donald Knuth's The Art of Computer Programming Volume 2 (1969), where Knuth credits Robert Floyd for it without citing a paper — Floyd seems to have shared the idea verbally at Stanford and Knuth simply wrote it down. The Aesop fable language ("tortoise" for the slow pointer, "hare" for the fast) was attached to the algorithm by later teachers; in Knuth's original presentation, the pointers are just p and q. The clean, picture-book metaphor is part of why the algorithm is now taught in nearly every undergraduate data-structures course.
Teaches: Detect cycles using different traversal speeds
Anecdote
Robert W. Floyd never actually published it in a standalone paper. The "tortoise and hare" explanation became popular later — the metaphor is more famous than the original source.
The Idea
On a circular running track, a fast runner who laps a slow runner can only catch up from behind, so they meet inside the loop. That's the whole insight: if a list has a cycle, two pointers moving at different speeds must eventually collide, and the collision can only happen within the loop. If there's no cycle, the fast pointer simply runs off the end.
Phase one — detect the cycle. Race two pointers from the head: slow advances one step at a time, fast advances two. If the list ends (fast hits null), there's no cycle. Otherwise both pointers eventually enter the cycle, and fast keeps gaining one node per step until it laps slow — they meet at some node inside the cycle.
Phase two — find where the cycle begins. Reset slow back to the head and let both pointers move at speed one. The math says they will meet again exactly at the cycle's entry node. The invariant: if μ is the distance from head to cycle start, λ is the cycle length, and they first met k steps inside the cycle, then μ + k ≡ 0 (mod λ). Walking μ more steps from the meeting point lands you at the entry — same time the new slow pointer arrives there from the head.
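Both phases can be sketched in Python as follows. The ListNode class and the build_list helper are hypothetical scaffolding for the example; only find_cycle_start corresponds to the algorithm described above.

```python
class ListNode:
    def __init__(self, value):
        self.value = value
        self.next = None

def build_list(values, cycle_index=None):
    """Hypothetical helper: link values in order; optionally point the last
    node back at the node at position cycle_index to create a cycle."""
    nodes = [ListNode(v) for v in values]
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
    if cycle_index is not None and nodes:
        nodes[-1].next = nodes[cycle_index]
    return nodes[0] if nodes else None

def find_cycle_start(head):
    """Return the node where the cycle begins, or None if the list ends."""
    slow = fast = head
    # Phase one: advance at speeds 1 and 2 until they meet or fast runs out.
    while fast is not None and fast.next is not None:
        slow = slow.next
        fast = fast.next.next
        if slow is fast:
            # Phase two: restart slow at the head, move both at speed 1.
            slow = head
            while slow is not fast:
                slow = slow.next
                fast = fast.next
            return slow
    return None
```

For a list 1 → 2 → 3 → 4 → 5 whose last node points back to node 2, build_list([1, 2, 3, 4, 5], cycle_index=1) followed by find_cycle_start returns the node with value 2. In that list μ = 1 and λ = 4, and the first meeting happens at node 5, which is k = 3 steps into the cycle, so μ + k = 4 ≡ 0 (mod 4).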
Trace
step | slow | fast | meet?
start | 1 | 1 | no
1 | 2 | 3 | no
2 | 3 | 5 | no
3 | 4 | 3 | no
4 | 5 | 5 | yes
Where It's Used Today
Linked-list bug detection — finding accidental cycles in pointer-based data structures during testing.
Pseudo-random number generators — measuring the cycle length of a PRNG (when its output starts repeating) by treating "next state" as a next pointer.
Cryptanalysis — Pollard's rho factoring algorithm uses Floyd's tortoise-and-hare on the iteration x ← f(x) mod n to find non-trivial factors.
State-machine analysis — verifying that an automaton can't get stuck in an infinite loop.
Garbage collectors and serialization — detecting cycles in object reference graphs to avoid infinite recursion when traversing.
When NOT to Use
When the structure has multiple successors per node (a general graph) — next.next is undefined; use DFS with a visited set instead.
When you can afford a hash set of visited nodes and the list is small — the set-based approach is shorter, easier to debug, and just as correct.
When you also need to know the cycle's length or the full ring of nodes — Brent's algorithm is often faster, and a single visited-set pass gives more diagnostics.
Common Mistakes
Checking only fast instead of both fast and fast.next before stepping, then dereferencing null on a list of odd length (where fast stops on the last node).
Comparing slow and fast before moving them, so the loop returns "cycle found" on step 0 from the shared starting head.
In phase 2, stepping fast by two instead of one — the entry-finding math only works when both pointers move at speed one.
Try It with an AI Assistant
short
Write a function using Floyd's tortoise-and-hare to detect a cycle in a singly linked list and return the node where the cycle starts, or null if there's no cycle.
behavior
Walk two pointers along a linked list from the head: one moves a single step per iteration, the other moves two. If the fast one reaches null, there is no cycle and you return null. If the two pointers ever land on the same node, you've found a cycle; reset the slow pointer to the head and now move both pointers one step at a time until they meet again. Return the node where they meet.
Made trimming line segments to a screen rectangle fast.
// Cohen-Sutherland clipping
WHILE TRUE
  c0 ← outcode(p0)
  c1 ← outcode(p1)
  IF (c0 OR c1) = 0 THEN
    RETURN visible
  ENDIF
  IF (c0 AND c1) != 0 THEN
    RETURN clipped
  ENDIF
  IF c0 != 0 THEN
    c ← c0
  ELSE
    c ← c1
  ENDIF
  p ← intersect(line, window, c)
  IF c = c0 THEN
    p0 ← p
  ELSE
    p1 ← p
  ENDIF
ENDWHILE
Ivan Sutherland built Sketchpad in 1963, a pioneering interactive graphical interface, for his MIT PhD thesis. Cohen-Sutherland clipping, which he developed with Danny Cohen a few years later during flight-simulator work, answered the need to draw lines on a finite screen without overflowing the framebuffer. Sutherland later won the Turing Award (1988) and is sometimes called the father of computer graphics. Graphics pipelines still use the same outcode idea as a fast accept/reject test.
Teaches: Reject segments early using region codes
The Idea
Tag each endpoint with a 4-bit outcode that says which side of the window it's on: bit 1 = above top, bit 2 = below bottom, bit 4 = right of right edge, bit 8 = left of left edge. A point inside the window has outcode 0000. Now two cheap checks decide most cases instantly: if c0 OR c1 == 0, both endpoints are inside — draw the whole line. If c0 AND c1 != 0, both share an "outside" bit (e.g. both above the top), so the line cannot intersect the window — reject it.
Why does this work? Sharing an outside bit is the geometric proof that the segment lies entirely in one half-plane outside the rectangle. Otherwise, pick whichever endpoint is outside, intersect the segment with that edge of the window, and replace the endpoint with the crossing point. The outcode shrinks toward zero on every iteration, so the loop terminates in at most four steps — one per window edge.
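The outcode tests and the clip loop can be sketched in Python like this. Several details are assumptions of the sketch rather than anything the text fixes: the window is given as (xmin, ymin, xmax, ymax), y increases upward so that "above top" means y > ymax, and the function names are my own. The bit values follow the assignment given above.

```python
TOP, BOTTOM, RIGHT, LEFT = 1, 2, 4, 8   # bit values as listed in the text

def outcode(x, y, xmin, ymin, xmax, ymax):
    code = 0
    if y > ymax:
        code |= TOP
    elif y < ymin:
        code |= BOTTOM
    if x > xmax:
        code |= RIGHT
    elif x < xmin:
        code |= LEFT
    return code

def clip(p0, p1, window):
    """Return the clipped segment as two points, or None if fully outside."""
    xmin, ymin, xmax, ymax = window
    (x0, y0), (x1, y1) = p0, p1
    while True:
        c0 = outcode(x0, y0, *window)   # recomputed after every clip
        c1 = outcode(x1, y1, *window)
        if (c0 | c1) == 0:
            return (x0, y0), (x1, y1)   # both endpoints inside
        if (c0 & c1) != 0:
            return None                 # both share an outside side
        clip_first = c0 != 0            # pick an endpoint that is outside
        c = c0 if clip_first else c1
        # The shared-bit test above guarantees the divisor is non-zero.
        if c & TOP:
            x, y = x0 + (x1 - x0) * (ymax - y0) / (y1 - y0), ymax
        elif c & BOTTOM:
            x, y = x0 + (x1 - x0) * (ymin - y0) / (y1 - y0), ymin
        elif c & RIGHT:
            x, y = xmax, y0 + (y1 - y0) * (xmax - x0) / (x1 - x0)
        else:
            x, y = xmin, y0 + (y1 - y0) * (xmin - x0) / (x1 - x0)
        if clip_first:
            x0, y0 = x, y
        else:
            x1, y1 = x, y
```

With a window of (0, 0, 10, 10), which is an assumed size, clip((-2, 5), (8, 5), (0, 0, 10, 10)) moves the left endpoint to x = 0 and keeps the right one.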
Trace
step | p0 | p1 | c0 | c1 | what happens
0 | (-2, 5) | (8, 5) | 1000 | 0000 | not both 0, not sharing a bit — pick c (= c0 = LEFT)
1 | - | - | - | - | intersect line with x = 0 → crossing at (0, 5)
2 | (0, 5) | (8, 5) | 0000 | 0000 | c0 OR c1 == 0 → segment is fully visible, return
Where It's Used Today
GPU rasterization pipelines — every triangle rendered on a phone or game console is clipped against the screen rectangle, often using outcode tests as a fast reject path.
CAD software — AutoCAD, SolidWorks, and similar tools clip line segments to the viewport when you zoom and pan, hiding geometry outside the visible area.
Mapping apps — Google Maps and OpenStreetMap clip road segments at the edge of the visible map tile so only on-screen pixels get drawn.
Plotting libraries — matplotlib, D3, and ggplot clip data lines at the axis box so curves don't leak past the chart.
Vector printers and PDF rendering — line clipping decides which segments of a long vector path actually need ink on the page.
When NOT to Use
When the clip region is not axis-aligned (an arbitrary polygon, a circle) — the outcode test assumes boundaries parallel to the axes; use Cyrus-Beck for convex polygonal windows.
When you're clipping filled polygons rather than line segments — clipping each edge piecewise leaves gaps; use a polygon-clipping algorithm.
When most lines lie partly inside the window — Liang-Barsky's parametric form does fewer intersection computations on average.
Common Mistakes
Computing the intersection but forgetting to recompute the outcode of the new endpoint — the loop sees the stale code and exits early or loops forever.
Picking a non-zero outcode endpoint without checking which one — using c1 when only c0 is non-zero produces a bogus crossing.
Using floating-point intersection on integer pixel coordinates without rounding — clipped endpoints can sit half-a-pixel outside the window.
Try It with an AI Assistant
short
Write cohen_sutherland(p0, p1, window) implementing Cohen-Sutherland line clipping. Return the clipped segment or null if it lies entirely outside.
behavior
Write a function that takes two endpoints of a line and a rectangle. Tag each endpoint with bits saying whether it is left, right, above, or below the rectangle. If both codes are zero, return the line as-is. If both endpoints share an outside bit, return null. Otherwise, replace the outside endpoint with the point where the line crosses that edge of the rectangle, and repeat.
Made compressing long runs of repeated values easy.
// run-length encode
out ← []
i ← 0
WHILE i < n
  j ← i
  WHILE j < n AND a[j] = a[i]
    j ← j + 1
  ENDWHILE
  out.append((a[i], j - i))
  i ← j
ENDWHILE
RETURN out
The idea of writing "fifteen of these in a row" instead of fifteen separate symbols is older than computing — bookkeepers and copyists have abbreviated repeats for centuries. What changed in the 1960s was that engineers needed to squeeze long monotone scans (radar traces, weather images, eventually fax pages) into expensive storage and slow telephone lines. RLE was the obvious first move: dirt cheap to encode, dirt cheap to decode, and dramatic savings whenever the input has even modest stretches of repetition.
Teaches: Compress by summarizing consecutive repetitions
Anecdote
RLE was widely used in IBM mainframe storage utilities by the late 1960s, and it powered the CCITT Group 3 fax compression standard adopted in 1980 — billions of fax pages around the world were squeezed almost entirely by counting runs of black and white pixels. It's one of the oldest forms of data compression still in everyday use, and almost certainly the simplest one ever standardized.
The Idea
Walk through the array with two indices. The outer index i marks the start of a run. The inner index j walks forward as long as a[j] equals a[i]. When j finally lands on a different character (or the end of the input), the run length is j − i. Append the pair (a[i], j − i) to the output, then jump i to j and start a new run.
Why does it work? Every position in the array belongs to exactly one run, and runs don't overlap, so the total work is linear in the input length: the inner loop visits each character once, and the outer loop runs once per run. The invariant: after i advances, out contains a complete encoding of a[0..i−1], and i always points to the start of a new run. Decoding is even simpler: for each pair (c, k), write c exactly k times.
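The two-index walk translates almost line for line into Python. This sketch returns (value, count) pairs as in the pseudocode rather than the count-then-character string form; the function name is taken from the exercise prompt.

```python
def rle_encode(a):
    """Run-length encode a sequence into a list of (value, count) pairs."""
    out = []
    i, n = 0, len(a)
    while i < n:
        j = i
        # Advance j while it still points at the value that started the run.
        while j < n and a[j] == a[i]:
            j += 1
        out.append((a[i], j - i))   # run of length j - i
        i = j                       # the next run starts where this one ended
    return out
```

On the trace's input, rle_encode("aaabbcdddd") returns [('a', 3), ('b', 2), ('c', 1), ('d', 4)]. Because the final run is appended inside the loop, this form needs no separate flush after it.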
Trace
step | i | j | a[i] | run | j − i | out after
0 | 0 | 3 | a | "aaa" | 3 | [(a, 3)]
1 | 3 | 5 | b | "bb" | 2 | [(a, 3), (b, 2)]
2 | 5 | 6 | c | "c" | 1 | [(a, 3), (b, 2), (c, 1)]
3 | 6 | 10 | d | "dddd" | 4 | [(a, 3), (b, 2), (c, 1), (d, 4)]
4 | 10 | — | (i = n, stop)
Where It's Used Today
Fax machines and TIFF images — black-and-white documents have huge runs of white pixels; CCITT Group 3/4 fax compression is RLE plus a Huffman pass.
PCX, BMP, and old game graphics — sprite-art file formats with flat color regions used pure RLE for decades.
PNG (sub-step) — the DEFLATE algorithm inside PNG starts with a sliding-window match step that behaves like RLE on long repeats.
Genome storage — long stretches of "no variation from reference" in DNA sequences are stored as run-length deltas in BAM/CRAM files.
Spreadsheet "merge cells" — internally, a long row of identical values is often stored as a single value with a span count.
When NOT to Use
When the data has few or no runs (random text, photos, encrypted bytes) — RLE expands the output, often doubling its size.
When you need strong compression on natural language or general-purpose data — DEFLATE, LZ77, or Brotli will beat RLE by an order of magnitude.
When run lengths can exceed the count field (e.g. an 8-bit count on runs > 255) — you must split runs and the gain shrinks; pick a format that matches your data.
Common Mistakes
Encoding lone characters as 1c and forgetting that a single byte is now two bytes — penalizes inputs with few repeats.
Using a digit (like '3') as the count when the input alphabet also contains digits, making the output ambiguous to decode.
Forgetting to flush the final run after the loop, so the last group of characters is silently dropped.
Try It with an AI Assistant
short
Write rle_encode(s) where each run of identical characters becomes <count><char> (e.g. 'aaabb' → '3a2b').
behavior
Write a function that walks through a string left to right. Each time it sees a new character, it counts how many times that same character repeats in a row, then emits the count followed by the character, then jumps to the next different character. Continue until the string is consumed.
Made expanding compressed runs back to original data fast.
// run-length decode
out ← []
FOR EACH (val, cnt) IN encoded
  FOR k FROM 1 TO cnt
    out.append(val)
  ENDFOR
ENDFOR
RETURN out
Decoding is the inverse of encoding — read a count, write that many copies, repeat. The 1967 IBM patent covered both. The asymmetry is interesting: encoding has to decide whether a run is worth compressing, but decoding is mechanical — read, expand, read, expand. Most compression algorithms have this asymmetric shape.
Teaches: Reconstruct a stream from its compact run description
The Idea
Walk through the encoded list one pair at a time. Each pair has a value val and a count cnt. Append val to the output cnt times. Move to the next pair. Repeat until the encoded list is exhausted.
Why does this work? The encoding promised that every original element shows up exactly once in the count of some run, and each pair captures exactly the consecutive occurrences of that value. If we append val exactly cnt times for every pair, in order, we reconstruct the original sequence character by character. There's no decision-making here — that all happened on the encoding side. Decoding is a pure mechanical expansion, which is why it's so fast.
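The expansion can be sketched in Python as below. It takes (value, count) pairs as in the pseudocode; rejecting counts below 1 is an assumption of this sketch (some formats may define a zero count differently), chosen so that corrupted input fails loudly instead of being skipped.

```python
def rle_decode(encoded):
    """Expand a list of (value, count) pairs back into the original sequence."""
    out = []
    for val, cnt in encoded:
        if cnt < 1:
            # Assumption: a run must contain at least one element.
            raise ValueError(f"invalid run length {cnt} for value {val!r}")
        out.extend([val] * cnt)     # append val exactly cnt times
    return out
```

For the trace's input, rle_decode([("A", 4), ("B", 2), ("C", 3)]) returns a list of nine items: four "A", two "B", three "C". The output length is always the sum of the counts.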
Trace
step | (val, cnt) | k loop | out
0 | start | — | []
1 | (A, 4) | k = 1, 2, 3, 4 | [A, A, A, A]
2 | (B, 2) | k = 1, 2 | [A, A, A, A, B, B]
3 | (C, 3) | k = 1, 2, 3 | [A, A, A, A, B, B, C, C, C]
4 | done | — | return out
Where It's Used Today
Fax machines — every fax transmission compresses scanned pages with a run-length scheme; the receiving machine decodes runs back into pixel rows.
Bitmap and TIFF images — the BMP and PackBits TIFF formats store pixels as runs; image viewers run RLE decoding on every load.
Graphics card memory — sprite and texture formats in game engines often use RLE for transparent regions; the GPU decodes runs at draw time.
PDF files — the /RunLengthDecode filter is a built-in PDF stream decoder used for scanned-document pages.
Printer protocols — laser printer command languages (PCL, PostScript) ship rasters as runs and the printer firmware decodes them onto the drum.
When NOT to Use
When the encoding format mixes literal runs and copy-runs (PackBits, LZ-style) — plain RLE decoding will misread the headers.
When the data was never RLE-encoded — running decode on arbitrary bytes produces garbage with no error.
When counts can be huge (millions) and the output won't fit in memory — stream the expansion instead of materializing the full list.
Common Mistakes
Allowing a count of 0 to mean "skip" silently when the encoding spec says it's invalid, masking corrupted streams.
Reading the count and value in the wrong order (<char><count> vs <count><char>) — output looks plausible but is shifted.
Forgetting that a run of length 1 still needs to be appended once, not skipped.
Try It with an AI Assistant
short
Write rle_decode(s) reversing rle_encode — read each <count><char> pair and expand it.
behavior
Write a function that takes a list of (value, count) pairs and returns a single sequence built by appending each value to the output exactly count times, in order. The output's length is the sum of the counts.
For instance: A game character finds a path while avoiding exploring the whole map.
pq ← priority queue with (0, start)
g ← map with g[start] = 0
WHILE pq NOT empty
  (_, node) ← extractMin(pq)
  IF node = goal THEN
    RETURN g[node]
  ENDIF
  FOR EACH (neighbor, cost) IN graph[node]
    ng ← g[node] + cost
    IF neighbor NOT IN g OR ng < g[neighbor] THEN
      g[neighbor] ← ng
      f ← ng + heuristic(neighbor)
      insert (f, neighbor) into pq
    ENDIF
  ENDFOR
ENDWHILE
RETURN ∞
Earlier searches explored too many unnecessary paths. A* combined known distance with estimated remaining distance, becoming foundational in robotics, GPS, and video games.
It answered the need for faster pathfinding by intelligently guiding the search toward the goal.
Teaches: Guide search using cost plus best guess
Anecdote
Developed at SRI for robot navigation and pathfinding. The name "A*" was chosen almost casually: the * just meant "this is the best version so far," not a deep mathematical symbol.
The Idea
For every node we know about, track two numbers: g[node] — the actual cheapest cost from the start to that node so far — and f = g + heuristic(node) — that cost plus a guess of the remaining distance to the goal. A priority queue always serves the node with the smallest f next. We expand that node, look at its neighbors, and if we can reach a neighbor more cheaply than before, we update g[neighbor] and re-insert it with the new f. When we extract the goal, the answer is g[goal].
Why does this give the correct shortest path? As long as the heuristic is admissible (it never overestimates the true remaining distance), A* will not commit to a node until its f value is the lowest possible. Straight-line distance always satisfies this because the actual road can never be shorter than the bird's-eye line. With heuristic = 0, A* degenerates to plain Dijkstra; the better the heuristic, the less of the graph A* needs to touch.
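The loop described above can be sketched in Python roughly as follows. This is a minimal illustration rather than a reference implementation: the name a_star, the adjacency-list format, and the example graph (edge costs and heuristic values inferred from the trace in this section) are assumptions made for the example.

```python
import heapq
from math import inf

def a_star(graph, start, goal, heuristic):
    """Return the cheapest cost from start to goal, or inf if unreachable.

    graph maps node -> list of (neighbor, cost); heuristic maps node -> estimate.
    """
    g = {start: 0}
    pq = [(heuristic(start), start)]
    while pq:
        f, node = heapq.heappop(pq)
        if node == goal:
            return g[node]
        # Skip stale queue entries left over from an earlier, costlier path.
        if f > g[node] + heuristic(node):
            continue
        for neighbor, cost in graph.get(node, []):
            ng = g[node] + cost
            if neighbor not in g or ng < g[neighbor]:
                g[neighbor] = ng
                heapq.heappush(pq, (ng + heuristic(neighbor), neighbor))
    return inf

# Assumed example graph, reconstructed from the trace's g and f values.
graph = {
    "A": [("B", 1), ("D", 3)],
    "B": [("C", 2)],
    "C": [("E", 1)],
    "D": [("E", 1)],
    "E": [],
}
h = {"A": 4, "B": 3, "C": 1, "D": 1, "E": 0}
print(a_star(graph, "A", "E", h.get))  # 4
```

Passing lambda node: 0 as the heuristic turns the same function into Dijkstra's algorithm, which is a handy sanity check for any heuristic you plug in.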
Trace
step | extracted node | g[node] | examined neighbors → updates
1 | A (f=4) | 0 | B: g=1, f=1+3=4 ; D: g=3, f=3+1=4
2 | B (f=4) | 1 | C: g=3, f=3+1=4
3 | C (f=4) | 3 | E: g=4, f=4+0=4
4 | D (f=4) | 3 | E already at g=4, new path g=3+1=4; no improvement
5 | E (f=4) | 4 | goal; return 4
Where It's Used Today
GPS navigation — Google Maps, Waze, and Apple Maps use A* (and faster successors built on it) to plan driving directions.
Video game AI — many NPCs that walk around obstacles to reach you are running A* on the level's tile grid or navigation mesh.
Robot pathfinding — warehouse robots (Amazon Kiva, factory AGVs) use A* on their floor maps to avoid each other.
Puzzle solvers — A* solves sliding-tile puzzles, Sokoban, and Rubik's cube using cleverly designed heuristics.
Network routing — some routing protocols use A*-like algorithms when traffic and latency vary across the graph.
When NOT to Use
When no useful heuristic exists (e.g., abstract graphs with no geometry) — A* degrades to Dijkstra plus extra overhead.
When you need the shortest paths to all nodes — Dijkstra computes them in one sweep; A* would have to be re-run per goal.
When edge weights can be negative — A* (like Dijkstra) assumes nonnegative costs; use Bellman-Ford instead.
Common Mistakes
Using a heuristic that overestimates (e.g., scaled-up Euclidean distance) — A* may return a suboptimal path.
Forgetting to update g[neighbor] and re-push when a cheaper path is discovered — the search settles on a worse cost.
Returning g[goal] the first time goal is pushed rather than extracted — the value may not yet be optimal.
Try It with an AI Assistant
short
Find shortest path using actual distance plus heuristic estimate to goal.
behavior
Write a function that, given a graph, a start node, a goal node, and a heuristic that estimates remaining distance, uses a priority queue ordered by 'cost so far plus heuristic' to expand nodes one at a time. Whenever a cheaper path to a neighbor is found, record the new cost and push the neighbor with its updated priority. Return the recorded cost when the goal is extracted.
For instance: A browser can quickly check if a URL is definitely not malicious.
filter ← bit array of size 16, all FALSE
hashes ← [h1, h2]
inserts ← ["apple", "banana"]
query_item ← "cherry"
// insert phase: stamp the bits for every item
FOR EACH item IN inserts
    FOR EACH hash h IN hashes
        filter[h(item)] ← TRUE
    ENDFOR
ENDFOR
// query phase: check every bit for query_item
result ← TRUE
FOR EACH hash h IN hashes
    IF filter[h(query_item)] = FALSE THEN
        result ← FALSE
    ENDIF
ENDFOR
RETURN result
Burton H. Bloom invented the filter in 1970 while at Computer Usage Company, originally to save disk seeks in a hyphenation dictionary: most words split predictably, and only the rare exceptions needed a full lookup, so a probabilistic "is this word an exception?" check was perfect. The structure was largely a curiosity for two decades, then exploded in popularity when builders of web-scale systems realized they could front a giant remote dataset with a tiny in-memory filter and skip the network round-trip for items that definitely weren't there. Today Cassandra, Bitcoin SPV clients, and many CDNs rely on Bloom-filter variants.
Teaches: Trade certainty for space-efficient probabilistic membership
Anecdote
Bloom filters saw a huge resurgence with web-scale systems. They became critical in databases and caches because they answer: "Should I even check disk?" — saving massive I/O at scale.
The Idea
Keep a bit array filter of size m, all bits initially 0. Pick k independent hash functions. To insert an item, compute its k hashes and set those k bits to 1. To query, compute the same k hashes and check those bits. If any of them is still 0, the item was definitely never inserted. If they are all 1, the item is probably present (but it might just be that other items happened to set the same bits).
This works because of asymmetric error: a true member's bits are guaranteed to have been set when it was inserted, so the query can't say no by accident. A non-member can only sneak through if every one of its k bits was coincidentally set by other inserts — a probability you can drive arbitrarily low by choosing m and k well. The filter trades a small false-positive rate for huge memory savings.
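The insert and query steps might look like this in Python. It is a sketch under stated assumptions: the class name BloomFilter, the method names, and the choice to derive the k positions by salting SHA-256 with the hash index are illustrative choices, not something the original design prescribes.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash positions per item."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [False] * m

    def _positions(self, item):
        # Assumed hashing scheme: salt SHA-256 with the index i to get
        # k differently-seeded hashes, then reduce each one mod m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely never inserted"; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=64, k=3)
bf.add("apple")
bf.add("banana")
print(bf.might_contain("apple"))   # True: inserted items are never missed
```

A query for an item that was never added usually returns False, but it may return True if its positions happen to collide with bits set by other items.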
Trace
step | action | bits set | filter (bits 0..15)
0 | (initial) | - | 0000000000000000
1 | insert "apple" | 3, 11 | 0001000000010000
2 | insert "banana" | 7, 11 | 0001000100010000
Where It's Used Today
Web browsers — Chrome's Safe Browsing once used a Bloom filter to check whether a URL might be on the malicious-site list before reaching out to Google.
Databases — Apache Cassandra and HBase keep per-table Bloom filters so a SELECT can skip disk reads when a key definitely isn't there.
Spell checkers — early versions stored the dictionary as a Bloom filter to fit on tight-memory machines.
Distributed caches — content delivery networks check a Bloom filter before fetching an object from a peer cache.
Network packet routing — Bloom filters help identify duplicate or already-seen packets at line rate.
When NOT to Use
When false positives are unacceptable — a Bloom filter can wrongly say "yes," so safety-critical checks (auth, payments) need an exact set.
When you also need to delete items — standard Bloom filters can't remove without breaking other items; use a counting Bloom filter or cuckoo filter.
When you need to retrieve the items themselves, not just check membership — Bloom filters store no keys at all, only bits.
Common Mistakes
Reusing one hash function and just multiplying its output — the resulting k "hashes" are correlated, driving the false-positive rate way up.
Sizing the bit array too small for the expected number of inserts — once the filter saturates near all-1s, every query returns "probably yes."
Forgetting to use mod m on the hash before indexing, causing array-out-of-bounds errors when hash values exceed the filter size.
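To make the sizing advice concrete, here is a small sketch of the standard approximations, assuming independent and uniformly distributed hashes: with n inserted items, the false-positive rate is roughly (1 - e^(-kn/m))^k, which is minimized near k = (m/n) ln 2. The function names are hypothetical.

```python
import math

def bloom_parameters(n, p):
    """Suggest a bit-array size m and hash count k for n items and target rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = max(1, round(m / n * math.log(2)))
    return m, k

def false_positive_rate(n, m, k):
    """Approximate false-positive rate after n inserts into m bits with k hashes."""
    return (1 - math.exp(-k * n / m)) ** k

m, k = bloom_parameters(n=1000, p=0.01)
print(m, k, false_positive_rate(1000, m, k))
```

For 1,000 items and a 1% target this works out to a little under 10 bits per item and 7 hash functions. The estimate also shows how quickly the rate climbs if far more items are inserted than the filter was sized for.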
Try It with an AI Assistant
short
Probabilistically test set membership using multiple hash-based bit positions.
behavior
Keep a bit array of length m, all zeros. Pick k different hash functions. To add an item, compute its k hashes (mod m) and set those bits to 1. To check if an item is present, compute the same k hashes; if any of those bits is 0, return definitely-not-present, otherwise return probably-present.
Ford and Fulkerson had introduced augmenting paths in 1956, but their algorithm could be tricked: with badly chosen paths and irrational capacities, it might never terminate. In 1972, Jack Edmonds and Richard Karp showed that one tiny tweak — always pick the shortest augmenting path, found by BFS — turned the algorithm into a strict O(VE²) procedure regardless of capacity values. The same paper introduced the concept of strongly polynomial algorithms, a framework that has shaped network-flow research ever since.
Teaches: Push along shortest improving paths until none remain
The Idea
Start with zero flow. Repeatedly use BFS to find the shortest path (fewest edges) from source to sink on which every edge still has leftover capacity (an augmenting path). On that path, find the smallest leftover capacity: that's the bottleneck. Push that much flow along the path, and subtract that much from the residual capacity of every edge on it. Keep finding augmenting paths until none remain. The total pushed is the max flow.
There's one subtle ingredient: when you push flow forward across an edge, you also add the same amount to the residual capacity of the opposite (backward) edge. This lets later iterations "cancel" previous bad choices. Because Edmonds and Karp insist on BFS (the shortest augmenting path in number of edges, not an arbitrary one), the lengths of augmenting paths never decrease over time, and the algorithm always terminates in O(VE²) operations regardless of the actual capacities, a guarantee Ford-Fulkerson without BFS doesn't give you.
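A compact Python sketch of this procedure follows. The function name max_flow comes from the prompt later in this section; the nested-dict graph format and the example capacities (inferred from the trace) are assumptions made for illustration.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """capacity maps u -> {v: capacity of edge u->v}. Returns the max flow value."""
    if source == sink:
        return 0
    # Residual graph: forward capacities plus zero-capacity backward edges.
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        residual.setdefault(u, {})
        for v, c in edges.items():
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + c
            residual[v].setdefault(u, 0)

    total = 0
    while True:
        # BFS finds a shortest augmenting path; parent records how we got there.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Bottleneck: the smallest residual capacity along the path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the flow: shrink forward residuals, grow backward ones.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck

# Assumed example network, with capacities read off the trace.
network = {"s": {"A": 10, "B": 5}, "A": {"t": 7, "B": 4}, "B": {"t": 8}}
print(max_flow(network, "s", "t"))  # 15
```

The backward edges are created up front with capacity 0, so the "undo" step is just ordinary arithmetic on the same residual table.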
Trace
iter | BFS augmenting path | pathFlow (min residual) | flow after | residual notes
1 | s → A → t | min(10, 7) = 7 | 7 | s→A drops to 3, A→t drops to 0
2 | s → B → t | min(5, 8) = 5 | 12 | s→B drops to 0, B→t drops to 3
3 | s → A → B → t | min(3, 4, 3) = 3 | 15 | s→A drops to 0, A→B to 1, B→t to 0
4 | (no path found) | - | 15 | done
Where It's Used Today
Internet routing — backbone networks compute the maximum throughput between data centers to plan capacity upgrades.
Airline scheduling — assigning crews to flights becomes a max-flow / bipartite-matching problem solved exactly by this algorithm.
Image segmentation — computer-vision systems separate foreground from background by reducing the problem to min-cut on a pixel graph (max-flow's dual).
Sports tournament elimination — deciding when a team is mathematically eliminated from playoff contention reduces to a max-flow check on a remaining-games network.
Project staffing and matching markets — pairing students to schools, donors to recipients, ride requests to drivers all use max-flow as a core building block.
When NOT to Use
When the graph is huge and dense — Edmonds-Karp's O(VE²) is too slow; reach for Dinic's algorithm or push-relabel for production-scale flows.
When edge capacities can be negative or the graph has costs to minimize — this is min-cost flow, not max flow, and needs a different algorithm.
When you only need an s–t reachability check (yes/no) — plain BFS without the residual bookkeeping is far simpler and faster.
Common Mistakes
Forgetting to add the backward residual edge when pushing flow — without it the algorithm cannot undo bad choices and can return a sub-optimal answer.
Using DFS instead of BFS to find augmenting paths — that's Ford-Fulkerson, which loses the O(VE²) guarantee and may not terminate on irrational capacities.
Treating each undirected edge as a single residual instead of two opposite-direction edges with their own capacities, miscounting the available flow.
Try It with an AI Assistant
short
Write max_flow(source, sink) using BFS to find augmenting paths until none exist.
behavior
Write a function that, given a directed graph with edge capacities, a source, and a sink, repeatedly finds the shortest path (in number of edges) from source to sink whose every edge still has leftover capacity. Push the smallest leftover capacity along that path as flow, subtract that amount from the leftover capacity of each edge on the path (and add it to the reverse edge), and repeat until no such path exists. Return the total flow.
Made one-pass detection of directed cycles practical.
For instance: Find groups of webpages where each can reach the others.
sccs ← empty list
stack ← empty stack
time ← 0
FUNCTION tarjan(node)
    index[node] ← time
    low[node] ← time
    time ← time + 1
    push(stack, node)
    onStack[node] ← TRUE
    FOR EACH neighbor IN graph[node]
        IF neighbor NOT visited THEN
            tarjan(neighbor)
            low[node] ← min(low[node], low[neighbor])
        ELSE IF onStack[neighbor] THEN
            low[node] ← min(low[node], index[neighbor])
        ENDIF
    ENDFOR
    IF low[node] = index[node] THEN
        component ← empty list
        REPEAT
            x ← pop(stack)
            onStack[x] ← FALSE
            append(component, x)
        UNTIL x = node
        append(sccs, component)
    ENDIF
END FUNCTION
FOR EACH node IN graph
    IF node NOT visited THEN
        tarjan(node)
    ENDIF
ENDFOR
RETURN sccs
Robert Tarjan published the algorithm in 1972 while at Stanford, in the same paper that introduced the low-link concept — a single number per node that captures "the oldest ancestor I can reach without leaving my still-open subtree." The trick collapsed what had been a multi-pass problem into one elegant DFS, and Tarjan went on to win the Turing Award in 1986 partly for this and a sequence of similarly tight graph algorithms. Decades later, low-link bookkeeping is still the standard way to find SCCs, articulation points, and bridges.
Teaches: Track how far back each node can reach to expose cycles
The Idea
Run a depth-first search and record two numbers per node: index[v] is the order in which the search first arrives at v, and low[v] is the smallest index reachable by descending from v (possibly hopping back through one cross-edge to a node still on the search stack). After visiting each neighbor, update low[v] accordingly. Push every newly discovered node onto a stack.
The key insight: when a node v finishes with low[v] = index[v], it is the root of an SCC — there's no way to escape from v to anything older still on the stack. Pop the stack down to and including v; everything popped is exactly that component. The invariant is "the stack always holds nodes whose component is not yet decided." Because each node is pushed and popped at most once, and each edge is examined once, the total cost is O(V + E).
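The bookkeeping above translates fairly directly into Python. The sketch below is illustrative only: the function name tarjan_scc, the dict-of-lists graph format, and the example graph (edges inferred from the trace) are assumptions, and the recursive form will hit Python's recursion limit on very deep graphs.

```python
def tarjan_scc(graph):
    """graph maps node -> list of successors. Returns a list of SCCs (lists of nodes)."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []

    def visit(v):
        # len(index) serves as the running visit counter.
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
            # Visited but not on the stack: w is in a finished SCC, ignore it.
        if low[v] == index[v]:
            # v is the root of a component: pop down to and including v.
            component = []
            while True:
                x = stack.pop()
                on_stack.discard(x)
                component.append(x)
                if x == v:
                    break
            sccs.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

# Assumed example graph matching the trace: a 3-cycle plus a chain 4 -> 5.
example = {1: [2], 2: [3], 3: [1], 4: [5], 5: []}
print(tarjan_scc(example))  # [[3, 2, 1], [5], [4]]
```

Components come out in the order their roots finish, so a component is always emitted before any component that can reach it.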
Trace
step | visit | index | low | stack | action
1 | 1 | 0 | 0 | [1] | enter 1
2 | 2 | 1 | 1 | [1, 2] | enter 2 from 1
3 | 3 | 2 | 2 | [1, 2, 3] | enter 3 from 2
4 | (3→1) | - | 0 | [1, 2, 3] | 1 on stack; low[3] = 0
5 | back to 2 | - | 0 | [1, 2, 3] | low[2] = min(1, 0) = 0
6 | back to 1 | - | 0 | [1, 2, 3] | low[1] = 0; low[1] == index[1]
7 | pop 3, 2, 1 | - | - | [] | emit SCC {1, 2, 3}
8 | start at 4 | 3 | 3 | [4] | enter 4
9 | 5 | 4 | 4 | [4, 5] | enter 5; no out-edges; pop SCC {5}
10 | back to 4 | - | 3 | [4] | low[4] == index[4]; pop SCC {4}
Where It's Used Today
Compiler optimization — finding mutually recursive function clusters so they can be analyzed together for inlining and dead-code removal.
Web link analysis — Google's old PageRank pre-processing identified SCCs of the web graph to handle "spider traps" cleanly.
Social network analysis — detecting tightly knit groups on Twitter or LinkedIn where everyone can reach everyone else through chains of follows.
2-SAT solvers — boolean satisfiability problems with two-literal clauses reduce directly to finding SCCs in an "implication graph."
Deadlock detection — operating systems can spot deadlocks by finding cycles (SCCs of size > 1) in a wait-for graph.
When NOT to Use
When the graph is undirected — connected components only need a flat BFS/DFS or union-find, not the low-link machinery.
When you only need to detect whether a cycle exists — a coloring DFS that flags back-edges is shorter and uses no auxiliary stack.
When the graph is so deep that recursion blows the call stack — convert to an iterative DFS or use Kosaraju's two-pass version.
Common Mistakes
Updating low[v] from a neighbor that is visited but not on the stack — the neighbor belongs to a finished SCC, so it must be ignored.
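Using low[neighbor] instead of index[neighbor] for edges to nodes still on the stack: the components still come out right, but low[v] is no longer a true low-link, which breaks the bridge and articulation-point variants that depend on it.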
Forgetting to clear onStack[x] when popping, so later searches treat the popped node as still-in-progress.
Try It with an AI Assistant
short
Write tarjan(graph) returning strongly connected components in a single DFS pass.
behavior
Write a depth-first search on a directed graph that, for each node, records the time of first visit and the smallest visit time it can reach by descending plus one back-edge to a node still on a working stack. Push each new node onto that stack. When a node's recorded reach equals its own visit time, pop everything down to and including it as one component.
For instance: Choose the most meetings that fit in one room.
activities ← [(1,4), (3,5), (0,6), (5,7), (3,9), (5,9), (6,10), (8,11), (8,12), (2,14), (12,16)]
sort activities by finish time
result ← [activities[0]]
lastFinish ← activities[0].finish
FOR EACH activity IN activities[1..]
    IF activity.start >= lastFinish THEN
        append(result, activity)
        lastFinish ← activity.finish
    ENDIF
ENDFOR
RETURN result
By the early 1970s, operations-research textbooks were collecting a growing menagerie of greedy algorithms — small procedures that just walk through the data once and make a locally best choice. Activity selection became the textbook example because the greedy rule is so counter-intuitive: not "shortest first," not "earliest start," but earliest finish. The proof is a clean exchange argument — replacing any optimal first pick with the earliest-finishing one never makes the schedule worse — which is why the problem is now a standard introduction to "greedy is provably optimal."
Teaches: Choose earliest finishing tasks to maximize total selections
The Idea
Sort all activities by their finish time. Pick the very first one — the activity that finishes earliest. Then walk through the rest in order and greedily pick every activity whose start is at or after the previous pick's finish.
Why does the earliest finish rule work? Because finishing early leaves the most room for later activities. Suppose some optimal schedule starts with activity A instead of the earliest-finishing one E. Since E finishes no later than A, we can swap A for E without breaking anything — the rest of the schedule still fits. So there's always an optimal schedule that starts with E. After we commit to E, the problem reduces to the same problem on the activities that start after E finishes — and the same argument applies again. This is the classic exchange argument used to prove greedy algorithms optimal.
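As a rough Python sketch of the greedy rule (the function name select_activities is an assumption; the sample data is the list used in this section's pseudocode):

```python
def select_activities(activities):
    """Pick a maximum-size set of non-overlapping (start, finish) activities."""
    chosen = []
    last_finish = float("-inf")
    # Earliest finish first; an activity fits if it starts at or after last_finish.
    for start, finish in sorted(activities, key=lambda a: a[1]):
        if start >= last_finish:
            chosen.append((start, finish))
            last_finish = finish
    return chosen

meetings = [(1, 4), (3, 5), (0, 6), (5, 7), (3, 9), (5, 9),
            (6, 10), (8, 11), (8, 12), (2, 14), (12, 16)]
print(select_activities(meetings))  # [(1, 4), (5, 7), (8, 11), (12, 16)]
```

Starting last_finish at negative infinity lets the first activity go through the same test as all the others, so the empty list needs no special case.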
Trace
activity (start, finish) | start >= lastFinish? | action | result so far | lastFinish
(3, 5) | 3 >= 4? no | skip | [(1,4)] | 4
(0, 6) | 0 >= 4? no | skip | [(1,4)] | 4
(5, 7) | 5 >= 4? yes | pick | [(1,4), (5,7)] | 7
(3, 9) | 3 >= 7? no | skip | [(1,4), (5,7)] | 7
(5, 9) | 5 >= 7? no | skip | [(1,4), (5,7)] | 7
(6, 10) | 6 >= 7? no | skip | [(1,4), (5,7)] | 7
(8, 11) | 8 >= 7? yes | pick | [(1,4), (5,7), (8,11)] | 11
(8, 12) | 8 >= 11? no | skip | [(1,4), (5,7), (8,11)] | 11
(2, 14) | 2 >= 11? no | skip | [(1,4), (5,7), (8,11)] | 11
(12, 16) | 12 >= 11? yes | pick | [(1,4), (5,7), (8,11), (12,16)] | 16
Where It's Used Today
Conference and classroom scheduling — software like Google Calendar's room booking and university timetable solvers use this greedy as a building block.
Air traffic control — runway slot assignment uses earliest-finish-time scheduling so the runway is freed up for the next plane as soon as possible.
CPU job scheduling — non-preemptive batch schedulers schedule fixed-time jobs by deadlines using this exact rule.
Manufacturing and factory floors — assigning jobs to a single machine to maximize throughput.
Interview question favorite — almost every introductory algorithms course and tech interview includes activity selection because it's the cleanest example of "greedy is optimal."
When NOT to Use
When activities have different values or weights and you want to maximize total value — this becomes weighted interval scheduling, which needs DP, not the greedy.
When you have multiple rooms or machines — it becomes interval graph coloring or k-machine scheduling; greedy on finish time alone is no longer optimal.
When activities can be split, paused, or have setup costs between them — the simple "fits or doesn't fit" test breaks down.
Common Mistakes
Sorting by start time instead of finish time — picks the earliest meeting, but a long one can block many shorter ones; the count is no longer maximal.
Picking the shortest activity first — feels intuitive but fails on inputs where one short meeting straddles a useful boundary; only finish-time order is provably optimal.
Using > instead of >= when comparing start to lastFinish — back-to-back meetings (one ends at 7, next starts at 7) get incorrectly rejected.
Try It with an AI Assistant
short
Write activity_select(activities) where activities are (start, end) intervals; return the maximum number of mutually non-overlapping activities.
behavior
Write a function that takes a list of (start, finish) pairs. Sort them by finish. Walk through the sorted list keeping a running lastFinish. Pick the first activity, then for each later activity pick it whenever its start is at least lastFinish, and update lastFinish to that activity's finish. Return the picked activities.
It made digital maps, GPS traces, and vector drawings smaller and faster while preserving visible shape.
// Ramer-Douglas-Peucker
FUNCTION rdp(pts, eps)
    len ← length(pts)
    d_max ← 0; index ← 0
    FOR i FROM 1 TO len-2
        d ← perp_dist(pts[i], pts[0], pts[len-1])
        IF d > d_max THEN
            index ← i; d_max ← d
        ENDIF
    ENDFOR
    IF d_max > eps THEN
        L ← rdp(pts[0..index], eps)
        R ← rdp(pts[index..END], eps)
        RETURN L + R[1..]
    ENDIF
    RETURN [pts[0], pts[len-1]]
END FUNCTION
Urs Ramer published the recursive perpendicular-distance procedure in 1972 (in Computer Graphics and Image Processing), and David Douglas and Thomas Peucker independently rediscovered the same idea in 1973 in The Canadian Cartographer — Douglas was at the University of Ottawa, working on automated digitization of the Canadian National Atlas. Cartographers had been hand-thinning coastlines and rivers for centuries, and the problem grew acute once digital maps appeared: a single faithfully-traced shoreline could carry tens of thousands of redundant vertices. RDP cut those by 90% or more without visibly changing the map, and remains the default simplifier in QGIS, Mapbox, and most fitness apps.
Teaches: Simplify shapes by removing insignificant detail recursively
The Idea
Draw a straight line from the first point to the last. Then look at every point in between and ask: how far does it sit, perpendicularly, from that line? Find the point that's farthest. If even that worst offender is within tolerance eps, every point is close enough — throw them all away and keep only the two endpoints. If the worst offender is too far, split the curve at that point and recurse on the two halves.
Why does this work? The recursion preserves a guarantee: every kept point is either an endpoint of the original curve or a corner whose perpendicular distance to its surrounding segment exceeded eps. So no removed point is more than eps away from the simplified line — the visible shape is preserved within the tolerance you chose. Tighten eps for more detail; loosen it for fewer points.
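The recursion can be sketched in Python as follows. The names rdp and perp_dist mirror the pseudocode in this section; the fallback for a zero-length baseline and the sample polyline (taken from the trace) are assumptions added for the example.

```python
from math import hypot

def perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = hypot(dx, dy)
    if length == 0:
        # Assumed fallback: a and b coincide, so use plain distance to a.
        return hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length

def rdp(pts, eps):
    """Simplify a polyline, keeping points that deviate by more than eps."""
    if len(pts) < 3:
        return list(pts)
    d_max, index = 0.0, 0
    for i in range(1, len(pts) - 1):
        d = perp_dist(pts[i], pts[0], pts[-1])
        if d > d_max:
            d_max, index = d, i
    if d_max > eps:
        left = rdp(pts[:index + 1], eps)
        right = rdp(pts[index:], eps)
        return left + right[1:]  # drop the duplicated split point
    return [pts[0], pts[-1]]

line = [(0, 0), (1, 0.1), (2, 0), (3, 5), (4, 6), (5, 7), (6, 8.1), (7, 9)]
print(rdp(line, 1.0))  # [(0, 0), (2, 0), (3, 5), (7, 9)]
```

Both halves share the split point, which is why the right half's first element is dropped when the results are joined.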
Trace
call | pts subset | d_max at index | action
rdp(full, 1.0) | (0,0)…(7,9) | ~1.58 at i=2 | split at index 2 (point (2,0))
rdp(L1, 1.0) | (0,0), (1,0.1), (2,0) | 0.1 at i=1 | within eps → keep [(0,0),(2,0)]
rdp(R1, 1.0) | (2,0), (3,5), (4,6), (5,7), (6,8.1), (7,9) | ~1.55 at i=1 | split at index 1 (point (3,5))
rdp(R1.L, 1.0) | (2,0), (3,5) | only 2 points | keep both
rdp(R1.R, 1.0) | (3,5), (4,6), (5,7), (6,8.1), (7,9) | ~0.07 at i=3 | within eps → keep [(3,5),(7,9)]
Where It's Used Today
Digital map rendering — Google Maps, OpenStreetMap, and Mapbox simplify coastlines, rivers, and roads when you zoom out, so the screen doesn't try to draw a million invisible vertices.
GPS track sharing — Strava and similar fitness apps shrink your saved run from thousands of points to a few hundred without changing the route's appearance.
Vector graphics editing — Adobe Illustrator and Inkscape use it to "smooth" or simplify hand-drawn paths.
Robotics and navigation — robots simplify recorded paths or sensor readings before planning movements.
Computational geometry — preprocessing polygons before running expensive algorithms like collision detection or polygon clipping.
When NOT to Use
When the curve has sharp peaks that matter (medical signals, seismograms) — RDP can drop a one-sample spike whose perpendicular distance is small relative to neighbors.
When you need to simplify multiple polylines while preserving shared boundaries (a country border between two states) — RDP processed independently produces gaps; use a topology-preserving simplifier.
When the curve self-intersects or simplification might introduce a self-intersection — RDP makes no guarantee against creating crossings between non-adjacent simplified segments.
Common Mistakes
Using point-to-point distance from the line's endpoints instead of perpendicular distance to the line segment, so collinear-but-distant points are kept needlessly.
Duplicating the split point when concatenating the recursive halves, producing a polyline with adjacent identical vertices.
Running with eps = 0 and expecting "no simplification": exactly collinear points have distance 0, which is not greater than eps, so they are still dropped; pass through the original list instead.
Try It with an AI Assistant
short
Write ramer_douglas_peucker(pts, eps) implementing Ramer-Douglas-Peucker line simplification.
behavior
Write a recursive function that takes a list of (x, y) points and a tolerance eps. Compute the perpendicular distance from each interior point to the straight line between the first and last points. Find the point with maximum distance. If that distance exceeds eps, split the list at that point and recurse on each half, then concatenate the results without duplicating the shared midpoint. Otherwise, return only the two endpoints.
For instance: A vending machine chooses fewest coins for 87 cents.
amount ← 6
coins ← [1, 3, 4]
dp ← array[0..amount] filled with ∞
dp[0] ← 0
FOR i FROM 1 TO amount
    FOR EACH c IN coins
        IF c <= i THEN
            dp[i] ← min(dp[i], dp[i - c] + 1)
        ENDIF
    ENDFOR
ENDFOR
RETURN dp[amount]
The Coin Change problem became a classroom favorite once Richard Bellman's dynamic-programming framework matured in the 1950s and 60s. By the early 1970s, undergraduate textbooks were using it to introduce two ideas at once: that "obvious" greedy algorithms can be subtly wrong, and that filling in a small table of subproblems gives a guaranteed-optimal answer in modest time. It became a canonical exercise for explaining the leap from greedy to DP — and it's still the first DP problem most CS students meet.
Teaches: Build solutions to a problem from optimal answers to its smaller subproblems
The Idea
Build the answer for every amount from 0 up to amount, in order. Keep an array dp where dp[i] will hold the minimum number of coins needed to make amount i. Start with dp[0] = 0 (zero coins needed to make zero).
To compute dp[i], try each coin c. If c <= i, then one possible plan is "use one coin of value c, plus the best plan for the remaining i - c." That costs dp[i - c] + 1 coins. Take the minimum over all coins. Why does this work? Any optimal plan for i must end with some coin c; if we knew the right last coin, we'd be done. Since we don't, try them all and keep the best. By the time we ask about dp[i], every smaller dp[i - c] has already been computed correctly — the invariant of dynamic programming.
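In Python, the table fill might look like the sketch below. The function name coin_change is an assumption; returning -1 for an unreachable amount follows the convention used in this section's prompts.

```python
from math import inf

def coin_change(amount, coins):
    """Return the fewest coins summing to amount, or -1 if it cannot be made."""
    dp = [0] + [inf] * amount  # dp[i] = fewest coins that make amount i
    for i in range(1, amount + 1):
        for c in coins:
            # Using coin c last costs one coin plus the best plan for i - c.
            if c <= i and dp[i - c] + 1 < dp[i]:
                dp[i] = dp[i - c] + 1
    return dp[amount] if dp[amount] != inf else -1

print(coin_change(6, [1, 3, 4]))  # 2 (3 + 3)
```

Because inf + 1 is still inf, unreachable sub-amounts never win the comparison, so no separate reachability check is needed inside the loop.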
Trace
i | candidates (dp[i-c] + 1) | dp[i]
0 | (base case) | 0
1 | from coin 1: dp[0] + 1 = 1 | 1
2 | from coin 1: dp[1] + 1 = 2 | 2
3 | from coin 1: dp[2]+1=3; from coin 3: dp[0]+1=1 | 1
4 | from coin 1: dp[3]+1=2; from coin 3: dp[1]+1=2; from coin 4: dp[0]+1=1 | 1
5 | from coin 1: dp[4]+1=2; from coin 3: dp[2]+1=3; from coin 4: dp[1]+1=2 | 2
6 | from coin 1: dp[5]+1=3; from coin 3: dp[3]+1=2; from coin 4: dp[2]+1=3 | 2
Where It's Used Today
Vending machines and cash registers — making change with the fewest coins, especially when the currency is unusual.
Currency design — economists check whether a proposed coin denomination set behaves greedily (so cashiers don't need a computer).
Resource allocation — packaging items into containers of a few fixed sizes with minimum waste.
Tax and tariff schedules — choosing the smallest set of fixed payment increments that sum to a required amount.
Compiler scheduling — emitting the fewest instructions that together accomplish a target effect.
When NOT to Use
When the denominations form a "canonical" system (US coins, euro coins) and you only need a fast answer — plain greedy "biggest coin first" is provably optimal there and skips the DP table.
When amount is huge (billions) and coin denominations are few — the O(amount * len(coins)) table is too big; switch to a coin-frequency / number-theory approach.
When you need the actual coin list and not just the count — the basic DP only stores counts; you must also keep a parent array, or you'll have to reconstruct by re-running.
Common Mistakes
Using the greedy version on coins = [1, 3, 4] and trusting it — greedy returns 3 coins for amount 6 while the true minimum is 2; greedy is only safe for canonical coin sets.
Initializing dp with 0 instead of infinity — every amount reports 0 coins because the min(dp[i], dp[i-c]+1) never improves on the false zero.
Returning dp[amount] without checking it's still infinity — when no combination of coins sums to amount, you return a garbage sentinel like 9999 instead of -1.
Try It with an AI Assistant
short
Write coin_change_dp(amount, coins) returning the smallest number of coins summing to amount using dynamic programming over dp[0..amount]; return -1 if impossible.
behavior
Write a function that, given a target amount and a list of coin denominations, builds an array dp where dp[i] is the minimum number of coins summing to i. Initialize dp[0] = 0 and dp[i] = infinity otherwise. For each i from 1 to the amount, try each coin c; if c ≤ i, set dp[i] = min(dp[i], dp[i-c] + 1). Return dp[amount], or -1 if it's still infinity.
For instance: Find the fence needed around scattered trees.
points ← [(0,0), (1,1), (2,0), (2,2), (0,2)]
start ← point with smallest y (ties: smallest x)
hull ← [start]
current ← start
REPEAT
    next ← any point ≠ current
    FOR EACH q IN points
        IF q = current THEN CONTINUE
        IF next = current OR cross(current, next, q) < 0 THEN
            next ← q
        ENDIF
    ENDFOR
    current ← next
    IF current ≠ start THEN
        append(hull, current)
    ENDIF
UNTIL current = start
RETURN hull
R. A. Jarvis described the gift-wrapping algorithm in a 1973 paper that gave the convex-hull problem its first widely taught solution. The procedure mirrors what you would do by hand: start at the lowest point, then repeatedly "wrap" around to whichever next point keeps every other point on your left. Within a few years convex hulls would become a foundation of computational geometry — used in collision detection, computer graphics, robotics, and any system that needs to summarize a cloud of points by its outer boundary.
Teaches: Wrap outermost points by following boundary turns
The Idea
Find the lowest point: the one with the smallest y (breaking ties by smallest x). It must be on the hull, so start there. Now "wrap" outward: from the current point, look at every other point and pick the one that keeps all the others on your left (i.e., for any other candidate q, the cross product (current → next, current → q) is non-negative, so every other point lies to the left of or on the line through current and next). Move to that point. Repeat until you arrive back at the starting point; the hull comes out in counter-clockwise order.
Why does it work? The invariant is that the next point chosen is always a vertex of the hull, because every other point lies on the same side of the chosen edge. Each iteration adds one hull vertex, so the algorithm runs in O(n · h) time, where h is the number of hull points. It's slow when the hull is large, but its picture is the cleanest in computational geometry — you literally wrap a string around the points.
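A Python sketch of the wrapping loop is shown below. The names jarvis_march, cross, and dist2 are assumptions, as is the tie-break that prefers the farther of two collinear candidates (added so intermediate points on a hull edge are skipped); the sample points are the ones used in this section.

```python
def cross(o, a, b):
    """Positive if o->a->b turns left, negative if right, zero if collinear."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def dist2(a, b):
    """Squared distance between a and b."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def jarvis_march(points):
    """Return hull vertices in counter-clockwise order, starting at the lowest point."""
    pts = list(set(points))  # duplicates would confuse the wrap
    if len(pts) < 3:
        return sorted(pts)
    start = min(pts, key=lambda p: (p[1], p[0]))
    hull = []
    current = start
    while True:
        hull.append(current)
        nxt = pts[0] if pts[0] != current else pts[1]
        for q in pts:
            if q == current:
                continue
            turn = cross(current, nxt, q)
            # Take q if it lies to the right of current->nxt,
            # or is collinear with it but farther away.
            if turn < 0 or (turn == 0 and dist2(current, q) > dist2(current, nxt)):
                nxt = q
        current = nxt
        if current == start:
            return hull

trees = [(0, 0), (1, 1), (2, 0), (2, 2), (0, 2)]
print(jarvis_march(trees))  # [(0, 0), (2, 0), (2, 2), (0, 2)]
```

Each pass of the outer loop scans all n points to find one hull vertex, which is where the O(n · h) running time comes from.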
Trace
step | current | candidate next | hull so far
1 | (0,0) | (2,0): every other point is to the left of edge (0,0)→(2,0) | [(0,0), (2,0)]
2 | (2,0) | (2,2) | [(0,0), (2,0), (2,2)]
3 | (2,2) | (0,2) | [(0,0), (2,0), (2,2), (0,2)]
4 | (0,2) | (0,0): back to start; stop | [(0,0), (2,0), (2,2), (0,2)]
Where It's Used Today
Computer graphics — game engines compute the hull of a 3D mesh's projected vertices to draw the silhouette of a building or character.
Collision detection — physics engines wrap each object in its convex hull first, because two convex shapes can be checked for overlap far faster than two arbitrary blobs.
Geographic mapping — apps that draw a "fence" around a customer's check-ins, or the territory covered by a delivery fleet, compute the convex hull of those GPS points.
Robotics path planning — a robot's reachable workspace is summarized by the convex hull of its arm-tip positions, used for safe motion planning.
Image recognition — handwriting and shape recognizers extract the hull of a stroke or contour to normalize its outline before classification.
When NOT to Use
When the shape you really need is concave (a coastline, a star polygon) — the convex hull will smooth over every inward dent; use an alpha-shape or concave-hull algorithm.
When you only need the bounding box or the diameter — far simpler O(n) sweeps give those without computing the full hull.
When the points live in high dimensions (more than 3D) — Jarvis march and monotone chain don't generalize; use Quickhull or specialized libraries.
Common Mistakes
Using floating-point cross products with no tolerance and treating nearly-collinear points as left turns one frame and right turns the next, producing a flickering hull.
Forgetting to handle duplicate points or all-collinear inputs — the standard sweep can return a degenerate two-point or zero-area "polygon".
Stopping after one wrap step instead of continuing until you return to the starting point — you'd get a single edge, not a closed polygon.
Try It with an AI Assistant
short
Write convex_hull(points) returning the convex hull of a set of 2D points using Jarvis march, in counter-clockwise order.
behavior
Given a list of 2D points, start at the lowest one (smallest y, breaking ties by smallest x). Then repeatedly look at every other point and pick whichever one leaves every other candidate to the left of (or on) the new edge. Move there, append it to the hull, and stop when you wrap back to the start.
For instance: Version control can compare two edited documents.
a ← "AGCAT"
b ← "GAC"
n ← length(a)
m ← length(b)
dp ← matrix(n+1, m+1) filled with 0
FOR i FROM 1 TO n
  FOR j FROM 1 TO m
    IF a[i-1] = b[j-1] THEN
      dp[i][j] ← dp[i-1][j-1] + 1
    ELSE
      dp[i][j] ← max(dp[i-1][j], dp[i][j-1])
    ENDIF
  ENDFOR
ENDFOR
RETURN dp[n][m]
The algorithm became extremely important in genetics and software version control because it captures structural similarity rather than exact matching.
Needed to identify shared ordered patterns between sequences.
Teaches: Build the answer to a big problem from a table of answers to smaller subproblems
The Idea
Build a small grid dp where dp[i][j] holds the LCS length of the first i characters of a against the first j characters of b. Walk the grid row by row. If a[i-1] = b[j-1], you've found a matching letter — extend the diagonal answer: dp[i][j] = dp[i-1][j-1] + 1. Otherwise the best you can do is whatever was already best when you ignored one character: dp[i][j] = max(dp[i-1][j], dp[i][j-1]).
Why does this work? Each cell answers a strictly smaller version of the same question, and the recurrence covers every case: either the last letters match (so they belong in the LCS together), or one of them doesn't, so you drop it. The bottom-right cell dp[n][m] holds the answer for the full strings.
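As a rough sketch of the recurrence just described, the table fill translates almost line for line into Python. The function name lcs_length is chosen here for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of a[:i] and b[:j]; row 0 and column 0 stay 0.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1      # extend the diagonal
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])  # drop one character
    return dp[n][m]
```

For the strings used in this entry, lcs_length("AGCAT", "GAC") evaluates to 2.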
Trace
dp | ε | G | A | C
ε | 0 | 0 | 0 | 0
A | 0 | 0 | 1 | 1
G | 0 | 1 | 1 | 1
C | 0 | 1 | 1 | 2
A | 0 | 1 | 2 | 2
T | 0 | 1 | 2 | 2
Where It's Used Today
Version control diffs — git diff and similar tools use LCS to align unchanged lines between two file versions and show only what changed.
DNA and protein comparison — bioinformatics tools score how similar two genetic sequences are by computing their longest shared subsequence.
Plagiarism detection — comparing two essays for shared word-order patterns even when sentences have been edited.
Spell check and autocorrect — variants of the same DP table measure how close one word is to a dictionary entry.
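File synchronization — delta-transfer tools send only the changed parts of a file, a goal closely related to LCS-style alignment (rsync itself matches unchanged blocks with rolling checksums rather than a full LCS table).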
When NOT to Use
When you need a contiguous match (substring), not a scattered subsequence — use the longest common substring DP or suffix-array techniques instead.
When both strings are very long (millions of characters) — the O(n·m) table blows past memory; switch to Hunt-Szymanski or a diff algorithm tuned for sparse matches.
When you only need to know whether a sequence is a subsequence of another — a two-pointer scan answers that in O(n+m) without any DP table.
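The two-pointer scan mentioned in the last bullet might look like the following. It is a small sketch with a name (is_subsequence) picked for this example; it checks whether needle can be read off haystack in order, without building any table.

```python
def is_subsequence(needle, haystack):
    """True if needle's characters appear in haystack in the same order."""
    i = 0                                   # next character of needle to match
    for ch in haystack:
        if i < len(needle) and ch == needle[i]:
            i += 1
    return i == len(needle)
```

With the example strings, "GC" is a subsequence of "AGCAT" but "GAC" is not, because no C follows the second A.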
Common Mistakes
Off-by-one indexing — confusing dp[i][j] (lengths used) with a[i]/b[j] (zero-based characters), so the wrong characters get compared.
Using dp[i-1][j-1] + 1 even when the characters don't match, double-counting non-matches up the diagonal.
Returning the LCS string instead of the length (or vice versa) without the explicit backtracking step that walks the table from the bottom-right back to the origin.
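The backtracking step named in the last mistake can be sketched as follows. The function name lcs_string and the tie-breaking rule (prefer moving up when the two neighbours are equal) are choices made for this illustration; other tie-breaks return a different but equally long subsequence.

```python
def lcs_string(a, b):
    """One longest common subsequence of a and b, recovered by backtracking."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Walk from the bottom-right cell back toward the origin.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])            # this character is part of the LCS
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1                          # the answer came from the cell above
        else:
            j -= 1                          # the answer came from the cell to the left
    return "".join(reversed(out))
```

With this tie-break, lcs_string("AGCAT", "GAC") returns "AC".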
Try It with an AI Assistant
short
Write lcs(a, b) returning the length of the longest common subsequence of strings a and b.
behavior
Write a function that, given two strings, fills an (n+1) × (m+1) grid where each cell records the longest match length using the first i characters of one string and the first j of the other. When the latest characters match, take the diagonal cell plus one; otherwise take the max of the cell above and the cell to the left. Return the bottom-right cell.
It made fast weighted sampling practical for games, simulations, randomized algorithms, recommender systems, and probabilistic models.
items ← ["A", "B", "C", "D"]
weights ← [1, 1, 4, 2]
n ← length(items)
// Walker's alias method
build:
  scale weights so sum = n
  split into small / large queues
  pair them into prob[i], alias[i]
sample:
  i ← rand_int(0, n-1)
  IF rand() < prob[i] THEN
    RETURN i
  ELSE
    RETURN alias[i]
  ENDIF
Random choice is easy when all outcomes are equal, but harder when outcomes have different weights. In 1974, Alastair Walker — a New Zealand statistician — published the surprising trick that any discrete distribution over n outcomes can be repackaged into two short tables so that each draw costs only one fair die roll plus one biased coin flip, no matter how skewed the weights are. Michael Vose later cleaned up the construction so the tables can be built in linear time, and today many Monte Carlo simulators, video-game loot systems, and other samplers over fixed distributions rely on the same two-table sleight of hand.
Teaches: Trade preprocessing for constant-time weighted sampling via alias tables
The Idea
Picture n slots, each of capacity 1, and scale every weight so the weights sum to n (the average is 1). Some items are now bigger than 1 ("large"), some smaller ("small"). The trick: take a small item i, which fills only prob[i] of its slot, and top the slot up by cutting a piece of size 1 − prob[i] off a large item. The slot now has total area exactly 1, and the large item, reduced by that piece, goes back into whichever queue matches its new size. Each slot stores prob[i], the share that belongs to its own item, and alias[i], the donor it borrowed from.
To sample, roll a fair die to pick a slot i (one of n slots, equally likely), then flip a biased coin with bias prob[i]: heads return i, tails return alias[i]. Two random calls and you're done. Why does it work? The setup ensures each item ends up with total area equal to its weight — the original probabilities are exactly preserved, but every draw runs in O(1).
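The build and sample steps can be sketched in Python along the lines of Vose's linear-time construction. The names build_alias and sample are illustrative, and this version pairs items from the back of each list, so the intermediate pairings may differ from the trace while the resulting distribution is the same.

```python
import random


def build_alias(weights):
    """Return (prob, alias) tables for the given non-negative weights."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]      # now the average is 1
    prob = [1.0] * n
    alias = list(range(n))
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]                        # share kept by the slot's own item
        alias[s] = l                               # donor that fills the rest
        scaled[l] -= 1.0 - scaled[s]               # donor gives away the missing piece
        (small if scaled[l] < 1.0 else large).append(l)
    # Anything left over keeps prob 1.0, which also absorbs rounding error.
    return prob, alias


def sample(prob, alias, rng=random):
    """Draw one index in O(1): one fair slot pick plus one biased coin flip."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

A typical use is to call build_alias once, then call sample as often as needed and index into the items list with the result.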
Trace
step | small queue | large queue | action | prob | alias
1 | [A, B] | [C, D] | pair A(0.5) with C; C → 2.0 − 0.5 = 1.5 | prob[A]=0.5 | alias[A]=C
2 | [B] | [C(1.5), D] | pair B(0.5) with C; C → 1.0 | prob[B]=0.5 | alias[B]=C
3 | [] | [C(1.0), D] | both ≥1 — set prob[C]=1, prob[D]=1 | prob[C]=1, prob[D]=1 | alias[C]=C, alias[D]=D
Where It's Used Today
Loot tables in video games — drawing the next dropped item from a long list of weighted possibilities.
Recommender systems — sampling content for "shuffle" or "explore" modes where some items should appear more often than others.
Monte Carlo simulations — picking events in physics, finance, or epidemiology models where outcome probabilities differ.
A/B testing — splitting traffic 70/20/10 across three variants, drawing each visitor in O(1).
Natural language generation — sampling the next word from a probability distribution over a vocabulary, the inner loop of every old-school n-gram language model.
When NOT to Use
When weights change every draw — rebuilding the alias table costs O(n); a Fenwick-tree weighted sample handles updates in O(log n).
When you only sample a handful of times — building the table costs more than a simple cumulative-sum + binary-search approach.
When weights are extreme (some 10^15, some 1) — floating-point rounding can leave probability mass orphaned; use exact rational or integer alias tables.
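For comparison, the cumulative-sum plus binary-search alternative mentioned in the second bullet fits in a few lines. This is a sketch using the standard library's itertools.accumulate and bisect.bisect_right; the rng parameter is an assumption added here so the random source can be swapped out.

```python
import bisect
import itertools
import random


def weighted_pick(items, weights, rng=random):
    """Pick one item with probability proportional to its weight, in O(n) per call."""
    cumulative = list(itertools.accumulate(weights))   # running totals
    r = rng.random() * cumulative[-1]                  # point in [0, total)
    # First position whose running total exceeds r; clamp guards against rounding.
    idx = min(bisect.bisect_right(cumulative, r), len(items) - 1)
    return items[idx]
```

If many draws share the same weights, the cumulative list can be built once and reused, which brings each draw down to O(log n).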
Common Mistakes
Forgetting to scale weights so they sum to n (not 1) — the small/large split misclassifies items and the table sums no longer match the input weights.
Reusing one random number for both the slot and the coin flip — the two need to be independent or the sample distribution skews.
Mishandling the boundary case where a "large" item becomes exactly 1 — leaving it in the large queue causes an infinite loop in the build.
Try It with an AI Assistant
short
Write weighted_pick(items, weights) returning one item chosen with probability proportional to its weight.
behavior
Preprocess weighted outcomes into alias and probability tables, then sample one outcome in O(1) time.
For instance: Find the longest mirrored substring in a long text.
s ← "abaab"
t ← "^#" + join(chars of s with "#") + "#$"
n ← length(t)
p ← array[n] filled with 0
center ← 0
right ← 0
FOR i FROM 1 TO n-2
  mirror ← 2*center - i
  IF i < right THEN
    p[i] ← min(right - i, p[mirror])
  ENDIF
  WHILE t[i + 1 + p[i]] = t[i - 1 - p[i]]
    p[i] ← p[i] + 1
  ENDWHILE
  IF i + p[i] > right THEN
    center ← i
    right ← i + p[i]
  ENDIF
ENDFOR
RETURN p
Glenn Manacher published the algorithm in 1975 in the Journal of the ACM, originally to attack a more specific problem: finding the smallest initial palindrome (the shortest palindromic prefix) of a string in linear time. Later researchers noticed his trick — reusing the symmetry inside an already-known palindrome to skip redundant comparisons — generalized to the full longest-palindrome problem on any input. Today many competitive programmers keep a copy in their template library; the algorithm is short, but the symmetry argument behind its O(n) runtime is one of the most elegant in string processing.
Teaches: Reuse symmetry already discovered to skip redundant comparisons
The Idea
First, transform the input by inserting a separator character (#) between every pair of letters and adding sentinels (^, $) at the ends. This trick makes both odd-length and even-length palindromes look the same — every palindrome is now centered on some single position of the new string. Then walk through the new string left to right, building array p where p[i] is the radius of the longest palindrome centered at i.
The key insight: while you're inside a known palindrome (the current center/right window), the position to the left of center — the mirror — already knows its palindrome length. You can copy that as a starting estimate for p[i] instead of comparing from scratch. Then you only ever attempt to extend past the right edge of the current best palindrome — which means each character of the string only causes one new successful comparison total. That's why the whole algorithm runs in linear time.
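A direct Python rendering of the idea might look like this. It is a sketch that assumes the input contains none of the helper characters ^, # and $; the helper longest_palindrome, which maps the largest radius back to a substring of the original text, is an addition made for this example.

```python
def manacher(s):
    """Palindrome radii over the transformed string ^#c1#c2#...#$."""
    t = "^#" + "#".join(s) + "#$"
    n = len(t)
    p = [0] * n
    center = right = 0
    for i in range(1, n - 1):
        if i < right:
            # Reuse the mirrored radius, capped at the known right edge.
            p[i] = min(right - i, p[2 * center - i])
        # Extend past what is already known; the sentinels stop the loop.
        while t[i + 1 + p[i]] == t[i - 1 - p[i]]:
            p[i] += 1
        if i + p[i] > right:
            center, right = i, i + p[i]
    return p


def longest_palindrome(s):
    """Longest palindromic substring of s, read off the radii."""
    p = manacher(s)
    radius, c = max((r, i) for i, r in enumerate(p))
    start = (c - radius) // 2       # map transformed index back to s
    return s[start:start + radius]
```

In the transformed string a radius equals the length of the palindrome in the original text, which is why longest_palindrome can slice with it directly; for "abaab" it yields "baab".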
Trace
i | t[i] | mirror | starting p[i] | extends to | new center, right
1 | # | — | 0 | 0 | center=1, right=1
2 | a | — | 0 | 1 | center=2, right=3
3 | # | 1 | 0 | 0 | —
4 | b | 0 | 0 | 3 | center=4, right=7
5 | # | 3 | min(2, p[3]=0) = 0 | 0 | —
6 | a | 2 | min(1, p[2]=1) = 1 | 1 | —
7 | # | 1 | min(0, p[1]=0) = 0 | 4 | center=7, right=11
8 | a | 6 | min(3, p[6]=1) = 1 | 1 | —
9 | # | 5 | min(2, p[5]=0) = 0 | 0 | —
10 | b | 4 | min(1, p[4]=3) = 1 | 1 | —
11 | # | 3 | min(0, p[3]=0) = 0 | 0 | —
Where It's Used Today
Bioinformatics — DNA contains palindromic regions (sequences that match their reverse complement) that mark restriction-enzyme sites; Manacher-style scans help locate them in long genomes.
Programming-contest libraries — competitive programmers ship Manacher's algorithm as a pre-written tool because palindrome problems are common.
Plagiarism detection — some text-similarity tools look for unusual mirrored substrings as fingerprints.
Compiler symbol analysis — some specialized compilers and linkers detect mirrored patterns in symbol tables for diagnostic checks.
Puzzle generators — crossword and word-game tools use it to find or avoid long palindromic patterns automatically.
When NOT to Use
When the strings are short (a few hundred characters) — the simpler "expand around each center" approach is O(n^2) but easier to write and debug, and is usually fast enough at that size.
When you need all palindromes, not the longest — Manacher gives radii at each center but doesn't enumerate them; eertree (palindromic tree) is the better tool.
When the alphabet is huge or the comparison is expensive (e.g. Unicode normalization) — the constant factor swamps the linear-time advantage.
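The simpler "expand around each center" approach from the first bullet can be sketched as below. The function name longest_palindrome_simple is illustrative; the code tries every odd and even center and grows outward while the two ends match.

```python
def longest_palindrome_simple(s):
    """Longest palindromic substring by expanding around each center, O(n^2)."""
    best = ""
    for center in range(len(s)):
        # (center, center) covers odd lengths, (center, center + 1) even lengths.
        for lo, hi in ((center, center), (center, center + 1)):
            while lo >= 0 and hi < len(s) and s[lo] == s[hi]:
                lo -= 1
                hi += 1
            # The loop overshoots by one on each side, so the match is s[lo+1:hi].
            if hi - lo - 1 > len(best):
                best = s[lo + 1:hi]
    return best
```

No separator characters or sentinels are needed here, which is much of why this version is easier to debug.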
Common Mistakes
Skipping the # separator transformation and then trying to handle even and odd palindromes with two separate loops — code becomes a tangle.
Forgetting the sentinel characters ^ and $ at the ends, so the inner WHILE loop walks off the array and crashes.
Updating center and right only when strictly greater (>) but using mirror values that depend on the boundary case — produces subtle off-by-one errors near the right edge.
Try It with an AI Assistant
short
Write manacher(s) returning palindrome radii at every position in O(n).
behavior
Write a function that, given a string, inserts a separator between every two characters so that even-length and odd-length palindromes look the same, then walks through the string keeping track of the rightmost palindrome found so far. At each new position, use the mirrored position inside that palindrome as a starting estimate, then try to extend further. Return the array of radii; the largest radius marks the longest palindromic substring.
Made nearest-neighbor queries on multidimensional points fast.
FUNCTION build(pts, depth)
  IF pts empty THEN RETURN NULL ENDIF
  axis ← depth MOD k
  sort pts by axis
  m ← len(pts) / 2                          // integer division
  node ← Node(pts[m])
  node.left ← build(pts[0..m], depth+1)     // indices 0 to m-1, median excluded
  node.right ← build(pts[m+1..], depth+1)
  RETURN node
END FUNCTION
Jon Louis Bentley designed the KD-tree as a Stanford graduate student in 1975, publishing it in Communications of the ACM under the title "Multidimensional Binary Search Trees Used for Associative Searching." His motivation was practical: databases were starting to store geographic and scientific data with multiple coordinates, and one-dimensional B-trees couldn't answer questions like "what's near this point?" Bentley's idea — alternate the splitting axis level by level — turned out to generalize to any number of dimensions, and the data structure became standard equipment in graphics, robotics, and machine learning libraries.
Teaches: Partition space recursively for efficient multidimensional queries
The Idea
Pick an axis to split on (alternating each level: x at depth 0, y at depth 1, x at depth 2, …). Sort the points by that coordinate, take the median point as the current tree node, then recursively build the left subtree from points smaller on that axis, and the right subtree from points larger on that axis. With k dimensions, the axis at depth d is d mod k.
Why does this work? At every node, the splitting plane divides space into two half-spaces, with the left subtree's points all on one side and the right subtree's all on the other. That spatial guarantee is what makes later queries fast: when you search, you can prove an entire subtree is too far away and skip it. Picking the median keeps the tree balanced — depth roughly log n — so neither side becomes a tall, thin spike.
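The recursive construction can be sketched in Python as follows. The Node dataclass and the build signature are assumptions made for this example; sorting at every level keeps the code short at the cost of O(n log² n) build time.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    point: Point
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(pts, depth=0, k=2):
    """Build a KD-tree over k-dimensional points by median splitting."""
    if not pts:
        return None
    axis = depth % k                              # cycle x, y, x, ...
    pts = sorted(pts, key=lambda p: p[axis])      # sorts a copy, input untouched
    m = len(pts) // 2                             # median index
    return Node(
        pts[m],
        build(pts[:m], depth + 1, k),             # points before the median
        build(pts[m + 1:], depth + 1, k),         # points after the median
    )
```

Built over the six points used in the trace, the root is (7, 2) with children (5, 4) and (9, 6).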
Trace
call | depth | axis (depth % 2) | sorted pts on axis | m | node (pts[m]) | left half | right half
build(all 6 pts, 0) | 0 | 0 (x) | [(2,3),(4,7),(5,4),(7,2),(8,1),(9,6)] | 3 | (7,2) | (2,3),(4,7),(5,4) | (8,1),(9,6)
build(left 3 pts, 1) | 1 | 1 (y) | [(2,3),(5,4),(4,7)] | 1 | (5,4) | (2,3) | (4,7)
build([(2,3)], 2) | 2 | 0 (x) | [(2,3)] | 0 | (2,3) | empty | empty
build([(4,7)], 2) | 2 | 0 (x) | [(4,7)] | 0 | (4,7) | empty | empty
build(right 2 pts, 1) | 1 | 1 (y) | [(8,1),(9,6)] | 1 | (9,6) | (8,1) | empty
build([(8,1)], 2) | 2 | 0 (x) | [(8,1)] | 0 | (8,1) | empty | empty
Where It's Used Today
Nearest-neighbor lookup in maps — "find the closest coffee shop to me" in apps like Yelp or Google Maps is the kind of query that a KD-tree (or a related spatial index) over business locations can answer.
Robotics and self-driving cars — LiDAR returns millions of 3D points per second; KD-trees let the robot find the nearest obstacle in milliseconds.
Computer graphics ray tracing — hit-testing a ray against a scene full of triangles is sped up by a KD-tree (or its cousin, the BVH).
Image processing — k-nearest-neighbor classifiers and color quantization (reducing a photo to 256 colors) both rely on KD-tree queries over feature vectors.
Recommendation systems — searching among millions of user-embedding vectors for the most similar ones uses KD-trees (or close relatives like ball trees) at modest dimensions.
When NOT to Use
When dimensionality is high (say > 20) — axis-aligned splits stop pruning, and queries degrade to scanning every point.
When the point set changes constantly — KD-trees rebalance poorly under inserts and deletes; use an R-tree or rebuild periodically.
When you need exact distances for non-Euclidean metrics with no axis structure — use a metric tree like a VP-tree or ball tree.
Common Mistakes
Picking the median by value rather than the median index of sorted points, producing an unbalanced tree.
Forgetting to alternate the splitting axis, building a tree that splits on x at every level.
Splitting on a copy of the point list at each level instead of partitioning in place, blowing up memory on big inputs.
Try It with an AI Assistant
short
Write build_kdtree(points) returning a balanced 2D kd-tree from a list of (x, y) points; alternate splitting axes per depth.
behavior
Build a KD-tree by recursively choosing an axis, sorting points by that axis, and storing the median as the node.
It made fast nearest-neighbor lookup practical for geometry, machine learning, image search, and location-based systems.
target ← (6, 5)
k ← 2
FUNCTION nn(node, best, depth)
  IF node = NULL THEN
    RETURN best
  ENDIF
  IF dist(node.p, target) < dist(best, target) THEN
    best ← node.p
  ENDIF
  axis ← depth MOD k
  IF target[axis] < node.p[axis] THEN
    near ← node.left
    far ← node.right
  ELSE
    near ← node.right
    far ← node.left
  ENDIF
  best ← nn(near, best, depth + 1)
  IF |target[axis] - node.p[axis]| < dist(best, target) THEN
    best ← nn(far, best, depth + 1)
  ENDIF
  RETURN best
END FUNCTION
After spatial data was organized as a KD-tree, the next breakthrough was searching it intelligently. Instead of checking every point, the search visits promising regions first and prunes regions that cannot contain a closer point.
Teaches: Prune search using spatial bounds
The Idea
Walk down the tree recursively. At each node, the splitting axis cycles through depth MOD k (so depth 0 splits on x, depth 1 on y, depth 2 on x again, ...). Compare the target's coordinate on that axis with the node's: that tells you which side the target is on — descend that near branch first. As you visit nodes, keep track of the closest point seen so far in best.
After the near branch returns, ask: could the far branch possibly hold something even closer? Yes, only if the perpendicular distance from the target to the splitting line is less than the current best distance. If so, recurse into the far branch too; otherwise prune it. The invariant: best is always the closest point among all nodes visited so far. When the recursion finishes, best is the global nearest neighbor.
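Here is a compact Python sketch of the descent and the prune test. Nodes are represented as (point, left, right) tuples and the tree is built with a small helper, both assumptions made to keep the example self-contained; best starts as None instead of a sentinel point. It relies on math.dist, available in Python 3.8 and later.

```python
import math


def build(pts, depth=0, k=2):
    """Helper: median-split KD-tree as nested (point, left, right) tuples."""
    if not pts:
        return None
    axis = depth % k
    pts = sorted(pts, key=lambda p: p[axis])
    m = len(pts) // 2
    return (pts[m], build(pts[:m], depth + 1, k), build(pts[m + 1:], depth + 1, k))


def nearest(node, target, best=None, depth=0, k=2):
    """Closest stored point to target, pruning subtrees that cannot beat best."""
    if node is None:
        return best
    point, left, right = node
    if best is None or math.dist(point, target) < math.dist(best, target):
        best = point
    axis = depth % k
    # Descend first into the side of the splitting line that holds the target.
    near, far = (left, right) if target[axis] < point[axis] else (right, left)
    best = nearest(near, target, best, depth + 1, k)
    # Visit the far side only if the splitting line is closer than the best so far.
    if abs(target[axis] - point[axis]) < math.dist(best, target):
        best = nearest(far, target, best, depth + 1, k)
    return best
```

On the five-point tree from the trace, nearest(build(points), (6, 5)) returns (5, 4).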
Trace
step | node | axis | dist(node, target) | best after | near / far decision
1 | (5, 4) | x | √2 ≈ 1.41 | (5, 4) | target.x = 6 ≥ 5, so near = right, far = left
2 | (9, 6) | y | √10 ≈ 3.16 | (5, 4) | target.y = 5 < 6, so near = left = (8,1), far = NULL
3 | (8, 1) | x | √20 ≈ 4.47 | (5, 4) | leaf
4 | back at (5,4): far = left subtree; axis distance 6 − 5 = 1 < 1.41 → recurse
5 | (4, 7) | y | √8 ≈ 2.83 | (5, 4) | target.y = 5 < 7, so near = left = (2,3), far = NULL
6 | (2, 3) | x | √20 ≈ 4.47 | (5, 4) | leaf
Where It's Used Today
Maps and "find nearest" — Google Maps, Yelp, and Uber dispatch all use spatial indexes (KD-trees, R-trees) to find the nearest restaurant, driver, or charging station.
k-NN classifiers in machine learning — finding the nearest training points for a new query is the core operation of k-nearest-neighbor classification and regression.
Computer vision — feature matching between two photos (SIFT, ORB) is built on nearest-neighbor lookups in high-dimensional descriptor space.
Robotics and path planning — sample-based planners like RRT use KD-trees to find the nearest existing tree node when extending a path.
Particle simulations and games — finding the nearest particle in fluid simulations, or the nearest enemy to a unit, is exactly this query.
When NOT to Use
When the dimension k is large (say > 20) — pruning becomes ineffective and search degrades to a linear scan; use HNSW or LSH.
When the point set changes constantly — KD-trees don't rebalance gracefully; an R-tree or grid index handles dynamic data better.
When you need the nearest along a non-Euclidean metric (cosine, Hamming) — the splitting-plane prune relies on Euclidean geometry.
Common Mistakes
Comparing only the splitting coordinate instead of the full Euclidean distance when updating best — wrong nearest is reported.
Always recursing into both children regardless of the prune test — the search becomes O(n) instead of average O(log n).
Forgetting to cycle axis = depth MOD k, so the search compares a different coordinate from the one the tree was split on, descends the wrong branch, and may prune the subtree that holds the true nearest point.
Try It with an AI Assistant
short
Given a 2D kd-tree, write nearest(tree, p) returning the closest stored point to query p.
behavior
Search a KD-tree for the nearest point by descending the likely branch first and pruning branches whose bounding distance is too large.
For instance: Find the maximum number of meetings happening at once.
intervals ← [(1, 4), (2, 5), (7, 9), (3, 6)]
events ← empty list
FOR EACH (l, r) IN intervals
  append(events, (l, +1))
  append(events, (r, -1))
ENDFOR
sort events
active ← 0
best ← 0
FOR EACH (_, delta) IN events
  active ← active + delta
  best ← max(best, active)
ENDFOR
RETURN best
The line-sweep paradigm crystallized in computational geometry in the mid-1970s, when Michael Shamos and Dan Hoey showed how to detect whether any two of n line segments intersect by sweeping a vertical line and maintaining only the segments it currently crosses. Three years later, Jon Bentley and Thomas Ottmann generalized the idea into their famous algorithm that reports every intersection, and the technique then spread far beyond geometry, into interval scheduling, calendar overlap, and event simulation. The unifying insight: replace O(n²) pairwise checks with one sorted walk through O(n) events.
Teaches: Replace pairwise checks with one sorted walk through events
The Idea
Turn each interval into two events: a +1 at its left endpoint (a meeting starts) and a −1 at its right endpoint (a meeting ends). Sort all events by time. Then sweep through them in order, keeping a running counter active that goes up and down as meetings start and finish. The largest value active ever reaches is the answer.
Why does it work? Imagine moving a vertical line from left to right across the time axis. Every interval the line currently crosses is "active." The set of active intervals only changes at endpoints — at every other moment the count is constant. So checking the count at each event is enough; nothing in between matters. Sorting the events takes O(n log n), and the sweep is O(n). The same idea generalizes to many computational geometry problems — finding line segment intersections, computing area unions, building Voronoi diagrams.
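The event walk can be sketched in Python as below. The closed flag is an assumption added for this example: it controls whether an interval ending at time t overlaps one starting at t, which comes down to how ties are ordered in the sort.

```python
def max_overlap(intervals, closed=False):
    """Maximum number of intervals active at the same moment."""
    events = []
    for left, right in intervals:
        events.append((left, +1))     # interval starts
        events.append((right, -1))    # interval ends
    if closed:
        # Closed intervals [l, r]: at equal times, process starts before ends.
        events.sort(key=lambda e: (e[0], -e[1]))
    else:
        # Half-open intervals [l, r): plain tuple order puts ends (-1) first.
        events.sort()
    active = best = 0
    for _, delta in events:
        active += delta
        best = max(best, active)
    return best
```

For the four meetings in this entry the result is 3 under either convention, since no meeting starts exactly when another ends.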
Trace
event | active (after delta) | best
(1, +1) | 1 | 1
(2, +1) | 2 | 2
(3, +1) | 3 | 3
(4, -1) | 2 | 3
(5, -1) | 1 | 3
(6, -1) | 0 | 3
(7, +1) | 1 | 3
(9, -1) | 0 | 3
Where It's Used Today
Calendar and meeting-room software — Outlook and Google Calendar use line-sweep ideas to count concurrent meetings and detect double bookings.
Hospital staffing and ICU planning — counting peak concurrent patients to size staff and bed capacity.
Network monitoring — tracking the maximum number of concurrent TCP connections or active phone calls to size servers and switches.
Computational geometry — Shamos-Hoey segment-intersection detection (1976) and Bentley-Ottmann line-segment intersection reporting (1979) are built on the same sweep template, and many polygon-overlap algorithms in CAD use it today.
Skyline rendering — computing the silhouette of overlapping buildings or charts uses a sweep that tracks the highest active rectangle.
When NOT to Use
When you only have a handful of intervals — the sort cost dominates; a simple O(n²) pairwise check is faster and easier to read.
When intervals change often (insertions and deletions during querying) — line sweep needs all events upfront; use an interval tree instead.
When you need to know which intervals overlap, not just the count — line sweep loses identity information; track active IDs in a set as you sweep.
Common Mistakes
Sorting events only by time and not by type — a -1 (end) processed before a +1 (start) at the same instant gives the wrong overlap count for closed intervals.
Updating best before adding the delta, which records the count from before the new interval became active.
Treating (end, -1) as exclusive when intervals are inclusive (or vice versa), producing off-by-one results at boundaries.
Try It with an AI Assistant
short
Write max_overlap(intervals) returning the maximum number of overlapping intervals.
behavior
Write a function that takes a list of (left, right) intervals. For each interval, emit two events: (left, +1) and (right, -1). Sort all events by their first coordinate. Walk through them, keeping a running sum of the deltas. Return the largest value the running sum ever reaches.
It made large geometric proximity problems practical without quadratic explosion.
FUNCTION closest_pair(pts)
  sort pts by x
  n ← length(pts)
  RETURN dnc(pts, 0, n-1)
END FUNCTION
FUNCTION dnc(pts, lo, hi)
  IF hi - lo <= 3 THEN
    RETURN brute_force(pts, lo, hi)
  ENDIF
  m ← (lo + hi) / 2
  d ← min(dnc(pts, lo, m), dnc(pts, m+1, hi))
  strip ← pts[i] for lo <= i <= hi where |pts[i].x - pts[m].x| < d
  RETURN min(d, strip_check(strip, d))
END FUNCTION
In the mid-1970s, Michael Shamos and Jon Bentley were laying the foundations of computational geometry — turning fuzzy questions about shapes and distances into precise algorithms with provable running times. The closest-pair problem was a showcase: an obvious quadratic brute force compared against a clever divide-and-conquer that achieved O(n log n) by recursing on x-sorted halves and merging through a narrow vertical strip. The deep insight that made the merge step linear — only seven nearby strip points need to be checked per candidate — became a template for many later geometric algorithms.
Teaches: Divide space and combine answers to nearby candidates
The Idea
First, sort the points by x-coordinate. Then split them into a left half and a right half at the median x. Recursively find the closest pair in the left half (call its distance dL) and the closest pair in the right half (dR). Let d = min(dL, dR). The true closest pair is either one of those two — or it straddles the dividing line, with one point on each side.
The clever part is the strip check: only points within horizontal distance d of the dividing line could possibly form a smaller pair. Sort that strip by y and, for each point, you only need to compare it to the next 6 or 7 points by y, because geometry guarantees no closer pair can hide farther away. The already-known bound d limits how many neighbors you must check, which keeps the merge step linear as long as the strip arrives already ordered by y (re-sorting it at every level adds a log factor) and gives the overall O(n log n) time.
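A short Python sketch of the divide, recurse, and strip steps follows. The names are chosen for this example, and the ranges are half-open slices instead of the inclusive lo and hi of the pseudocode. For brevity it re-sorts the strip by y at every level, so as written it runs in O(n log² n); a version that merges by y during the recursion would reach O(n log n).

```python
import math
from itertools import combinations


def brute_force(pts):
    """Smallest pairwise distance by checking every pair."""
    return min((math.dist(p, q) for p, q in combinations(pts, 2)), default=math.inf)


def closest_pair(points):
    """Smallest distance between any two points, by divide and conquer."""
    pts = sorted(points)                       # by x, then y

    def dnc(lo, hi):                           # works on pts[lo:hi]
        if hi - lo <= 3:
            return brute_force(pts[lo:hi])
        mid = (lo + hi) // 2
        mid_x = pts[mid][0]                    # the dividing line
        d = min(dnc(lo, mid), dnc(mid, hi))
        # Only points within d of the dividing line can form a closer pair.
        strip = sorted(
            (p for p in pts[lo:hi] if abs(p[0] - mid_x) < d),
            key=lambda p: p[1],
        )
        for i, p in enumerate(strip):
            for q in strip[i + 1:i + 8]:       # at most the next 7 points by y
                if q[1] - p[1] >= d:
                    break
                d = min(d, math.dist(p, q))
        return d

    return dnc(0, len(pts))
```

On the five points from the trace the result is √5 ≈ 2.236.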
Trace
step | call | range | action
1 | dnc(0, 4) | 5 pts | hi − lo = 4 > 3, so split. m = 2.
2 | dnc(0, 2) | 3 pts | brute_force on (0,0),(1,2),(3,6) → closest = (0,0)-(1,2), d ≈ 2.236
3 | dnc(3, 4) | 2 pts | brute_force on (4,1),(5,5) → d ≈ 4.123
4 | back at dnc(0,4) | — | d = min(2.236, 4.123) = 2.236
5 | strip check | — | strip = points with x within 2.236 of pts[2].x = 3 → check (1,2),(3,6),(4,1),(5,5); no closer pair found (the best straddling pair, (3,6)-(5,5), only ties at ≈ 2.236)
6 | return | — | closest distance = 2.236
Where It's Used Today
Air-traffic control — finding the two aircraft closest to each other for collision-avoidance alerts.
Robotics and self-driving cars — proximity checks among detected obstacles.
Computational chemistry — finding the closest pair of atoms in a large molecular structure.
Geographic information systems — proximity queries over millions of map features.
Computer graphics and physics engines — broad-phase collision detection uses closest-pair-style tricks to shortlist candidates before the precise overlap test.
When NOT to Use
When the point set is small (a few dozen points) — the O(n²) brute-force check is simpler, has lower constants, and avoids the strip-merge bookkeeping.
When you need all near pairs within radius r, not just the single closest — use a grid bucket or a k-d tree's range-search instead.
When the points live in high dimensions (3D and above with many neighbors) — the strip trick relies on the planar 7-neighbor bound; use k-d trees or locality-sensitive hashing.
Common Mistakes
Re-sorting the strip by y from scratch at every recursion level, turning the merge step into O(n log n) and inflating the total to O(n log² n).
Comparing each strip point against all others in the strip instead of stopping at the next ~7 by y, losing the linear-merge guarantee.
Using a strict < instead of ≤ when picking strip candidates within d of the dividing line, missing pairs exactly on the boundary.
Try It with an AI Assistant
short
Write closest_pair(pts) returning the smallest distance between any two points using the standard O(n log n) divide-and-conquer method.
behavior
Sort the points by x. Recursively split them into halves at the median x, find the smallest distance in each half, and call the smaller of the two d. Then check only the points whose x is within d of the dividing line; for each such point, compare it to the next few points by y. Return the smallest distance found across left half, right half, and strip.
For instance: Cryptography can quickly test candidates for large primes.
witnesses ← [2, 3, 5, 7]
FUNCTION is_probably_prime(n)
  IF n < 2 THEN RETURN false ENDIF
  FOR EACH a IN witnesses        // small cases, so no witness is a multiple of n below
    IF n MOD a = 0 THEN RETURN n = a ENDIF
  ENDFOR
  // write n - 1 = 2^s * d (d odd)
  FOR EACH a IN witnesses
    x ← mod_pow(a, d, n)
    IF x = 1 OR x = n - 1 THEN CONTINUE ENDIF
    REPEAT s - 1 times
      x ← (x * x) MOD n
      IF x = n - 1 THEN BREAK ENDIF
    ENDREPEAT
    IF x != n - 1 THEN RETURN false ENDIF
  ENDFOR
  RETURN true
END FUNCTION
FUNCTION next_prime(n)
  candidate ← n + 1
  WHILE NOT is_probably_prime(candidate)
    candidate ← candidate + 1
  ENDWHILE
  RETURN candidate
END FUNCTION
In 1976, Gary Miller gave a deterministic primality test that ran in polynomial time — but only assuming the still-unproven Generalized Riemann Hypothesis. Four years later Michael Rabin made it unconditional by switching the witness from "all values up to a bound" to "several random values," giving the now-standard probabilistic Miller-Rabin test. RSA had just been invented in 1977 and was hungry for fast primality testing on very large candidates; Miller-Rabin became an engine that makes practical public-key cryptography possible.
Teaches: Use randomness to test properties faster than certainty allows
The Idea
Two layers. The outer layer scans candidates n+1, n+2, ... until one tests prime. The inner layer is Miller-Rabin: a fast probabilistic primality test.
Miller-Rabin works like a courtroom. Write n - 1 = 2^s · d where d is odd. Pick a random "witness" a between 2 and n-2. If n were prime, then by Fermat's little theorem a^(n-1) ≡ 1 (mod n), and the only square roots of 1 modulo a prime are ±1. So compute x = a^d mod n and keep squaring: for a prime n, either x is 1 right away, or n - 1 (that is, -1) appears at x or within the next s - 1 squarings. If neither happens, a is a witness that n is composite. One witness can be unlucky, but trying several independent witnesses makes the chance of a false positive astronomically small. The invariant: a single contradicting witness proves n composite; agreement of many witnesses makes n probably prime with overwhelming confidence.
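The two layers can be sketched in Python as follows. This is an illustration under stated assumptions, not a vetted cryptographic routine: the rounds parameter and the short trial-division list are choices made here, and the witnesses come from the random module, which is not suitable for security-sensitive use. Python's built-in three-argument pow performs the modular exponentiation.

```python
import random


def is_probably_prime(n, rounds=20):
    """Miller-Rabin test: False means composite, True means probably prime."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):           # quick trial division for small cases
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:                         # write n - 1 = 2^s * d with d odd
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)        # random witness in 2..n-2
        x = pow(a, d, n)                      # modular exponentiation
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False                      # never reached n - 1: a proves n composite
    return True


def next_prime(n):
    """Smallest probable prime strictly greater than n."""
    candidate = n + 1
    while not is_probably_prime(candidate):
        candidate += 1
    return candidate
```

For example, next_prime(100) returns 101, the number used in the trace for this entry.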
Trace
step | x | check | verdict
0 | x = 2^25 mod 101 | compute mod_pow(2, 25, 101) = 10 | not 1, not 100 — keep going
1 | x = 10² mod 101 = 100 | x = 100 = n - 1 | break — passes for a=2
Where It's Used Today
RSA key generation — the private key behind an RSA web certificate is built from two huge primes, found by running a Miller-Rabin-style test on random candidates until one passes.
Diffie-Hellman key exchange — needs large safe primes; Miller-Rabin tests candidates fast enough to be practical.
Hash table sizes — many hash table implementations resize to the next prime to spread keys evenly.
Lottery and gaming software — random number generators sometimes use prime moduli for better statistical properties.
Coding theory — error-correcting codes over GF(p) need the next prime past the alphabet size.
When NOT to Use
When n is small (say, under a million) — a sieve of Eratosthenes precomputes every prime far faster than scanning candidates.
When you need a certified prime for legal or audit reasons — Miller-Rabin is probabilistic; use AKS or ECPP for deterministic proof.
When you need primes with special structure (safe primes, strong primes) — plain next_prime ignores those constraints.
Common Mistakes
Stepping by 1 from n+1 instead of skipping even candidates after the first odd one — half the work is on numbers obviously composite.
Picking a single witness a = 2 and calling it prime — strong pseudoprimes such as 2047 = 23 · 89 pass the base-2 test; use several bases.
Computing a^d mod n with regular exponentiation instead of modular exponentiation, blowing up to enormous integers before the mod.
Try It with an AI Assistant
short
Write next_prime(n) that returns the smallest prime strictly greater than n.
behavior
Write a function that, given a positive integer n, scans the integers n+1, n+2, ... and returns the first one that survives several rounds of a probabilistic primality test. The test should write n−1 as 2^s · d with d odd, then for several random witnesses a, compute a^d mod n, and check whether repeated squaring produces 1 or n−1; if not, n is composite.
For instance: Find a DNA pattern without restarting at every mismatch.
text ← "ABABCABAB"
pattern ← "ABABC"
lps ← computeLPS(pattern) // [0, 0, 1, 2, 0]
i ← 0
j ← 0
WHILE i < length(text)
    IF text[i] = pattern[j] THEN
        i ← i + 1
        j ← j + 1
    ENDIF
    IF j = length(pattern) THEN
        RETURN i - j
    ELSEIF i < length(text) AND text[i] != pattern[j] THEN
        IF j > 0 THEN
            j ← lps[j - 1]
        ELSE
            i ← i + 1
        ENDIF
    ENDIF
ENDWHILE
RETURN -1
Donald Knuth (Stanford), James Morris (Berkeley), and Vaughan Pratt (also Berkeley) discovered the algorithm independently in the early 1970s and published their joint paper Fast Pattern Matching in Strings in SIAM Journal on Computing in 1977. The trio reportedly converged on the same failure-table idea within months of each other while exploring linear-time string-matching bounds. KMP was the first widely-known string search guaranteed to run in linear time on any input — closing a longstanding gap between worst-case and average-case bounds — and it became a textbook example of how preprocessing the pattern (not the text) can sidestep apparent quadratic behaviour.
Teaches: Reuse partial matches to avoid rechecking characters
Anecdote
Although elegant, KMP is often not used in production. Simpler heuristics (like Boyer-Moore variants) are often faster in practice — a reminder that theoretically optimal ≠ practically dominant.
The Idea
Before scanning the text, build a small failure table (called lps, the "longest proper prefix that is also a suffix") for the pattern. For each position j in the pattern, lps[j] says: if I've matched the first j+1 characters and then mismatch, how many characters at the start of the pattern can I keep already matched without rechecking the text? This table only depends on the pattern, not the text.
Then walk through the text with two pointers, i (text) and j (pattern). On a match, advance both. On a mismatch with j > 0, slide the pattern forward smartly using j ← lps[j − 1] — never moving i backward. If j reaches the pattern length, you've found a match at i − j. Because i only moves forward and j decreases at most as many times as it increased, the total work is proportional to the lengths of text and pattern combined — O(n + m).
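A short Python sketch of both phases, under the assumption that the search returns the index of the first match or -1. The names compute_lps and kmp_search are illustrative, and the empty-pattern convention (return 0) is an assumption of the sketch.

```python
def compute_lps(pattern: str) -> list[int]:
    """lps[j] = length of the longest proper prefix of pattern[:j+1] that is also its suffix."""
    lps = [0] * len(pattern)
    length = 0                           # length of the current matched prefix
    for i in range(1, len(pattern)):
        # Fall back through shorter borders until one can be extended.
        while length > 0 and pattern[i] != pattern[length]:
            length = lps[length - 1]
        if pattern[i] == pattern[length]:
            length += 1
        lps[i] = length
    return lps


def kmp_search(text: str, pattern: str) -> int:
    """Index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0                         # assumed convention for an empty pattern
    lps = compute_lps(pattern)
    j = 0                                # number of pattern characters matched so far
    for i, ch in enumerate(text):        # i only ever moves forward
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]               # slide the pattern, keep the text pointer
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1
    return -1
```

Note that compute_lps uses the same fallback trick on the pattern itself, which keeps table construction linear in the pattern length.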
Trace
step | i | j | text[i] vs pattern[j] | action
0 | 0 | 0 | A == A | match: i=1, j=1
1 | 1 | 1 | B == B | match: i=2, j=2
2 | 2 | 2 | A == A | match: i=3, j=3
3 | 3 | 3 | B == B | match: i=4, j=4
4 | 4 | 4 | C == C | match: i=5, j=5
— | 5 | 5 | j = length(pattern) | return i − j = 0
Where It's Used Today
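DNA and protein matching — bioinformatics pipelines scan genomes for known motifs with exact string matching; KMP-style linear-time search is the textbook starting point, while large tools such as BLAST and Bowtie rely on indexed approaches (k-mer seeding, the FM-index).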
Intrusion-detection and antivirus — scanning network packets or files for known signatures relies on linear-time multi-pattern matching (Aho-Corasick, a generalization of KMP).
Plagiarism detectors — academic tools scan submitted papers against huge corpora using linear-time substring search as the inner loop.
Streaming data — real-time log monitors and SIEM tools need to spot trigger phrases in a never-ending stream without buffering, which requires the no-backtrack property KMP provides.
Text editors and grep — modern grep uses Boyer-Moore for the common case but falls back to KMP-style automata when patterns make Boyer-Moore inefficient.
When NOT to Use
When the alphabet is large and the pattern long — Boyer-Moore's bad-character rule skips ahead by big jumps and outperforms KMP on natural-language text.
When you're searching for many patterns at once — Aho-Corasick generalizes the failure idea to a trie and matches all patterns in one pass.
When the pattern is a regex or has wildcards — KMP only handles fixed strings; build an NFA/DFA from the regex instead.
Common Mistakes
Building the LPS table by checking proper prefixes brute-force in O(m²) — defeats the whole point; the table must be built in O(m) using the same failure trick on the pattern itself.
Resetting i (the text index) on a mismatch instead of using lps[j-1] — that's exactly the naive search KMP exists to avoid.
Confusing "longest proper prefix that is also a suffix" with "longest prefix that is a suffix" — including the whole string makes lps[j] = j+1, and the failure jump becomes a no-op infinite loop.
Try It with an AI Assistant
short
Search substring efficiently by reusing previously matched prefix information.
behavior
Write a function that finds a pattern inside a text. First, precompute a small lookup table for the pattern: for each pattern position, the length of the longest prefix that also occurs as a suffix of the prefix ending at that position. Then walk the text with two pointers, advancing both on a character match. On a mismatch, never move the text pointer backward; instead, use the lookup table to slide the pattern pointer forward to the next viable starting alignment.
It made fast circle drawing practical on limited hardware, enabling early graphics, games, CAD, and plotting systems.
// Midpoint circle algorithm
x ← r; y ← 0
err ← 1 - r
WHILE x >= y
    plot 8-symmetric points
    y ← y + 1
    IF err < 0 THEN
        err ← err + 2*y + 1
    ELSE
        x ← x - 1
        err ← err + 2*(y - x + 1)
    ENDIF
ENDWHILE
// each "plot 8-symmetric points" lights 8 pixels mirrored across the circle's axes
Early screens and plotters could not afford expensive floating-point trigonometry for every circle pixel. The midpoint method used integer decisions to draw smooth circles efficiently.
Teaches: Choose pixels using incremental integer error tracking
The Idea
A circle has eight-way symmetry: if you know the pixels in one octant (one-eighth of the circle, say from the rightmost point (r, 0) up to the 45° line), you can mirror them to draw the other seven octants for free. So we only need to trace one slim slice.
In that octant, walk y upward one pixel at a time. At each step, decide whether x stays the same or drops by one, and base that decision on a running integer error term err that measures how far off the true circle the current (x, y) is. If the chosen midpoint between the two candidate pixels lies inside the circle (err < 0), keep x. Otherwise, drop x by one. The update formulas use only addition and multiplication by small constants, which is fast even on 1970s hardware. The invariant: at every step, (x, y) is the integer pixel closest to the true circle on this row.
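A minimal Python sketch of the octant walk described above. The function name midpoint_circle and the choice to return a sorted list of (x, y) offsets from the center are assumptions for illustration. The drop-x update is written as 2*(y - x) + 1, the form found in common textbook derivations of the midpoint test; treat that exact constant as an assumption of this sketch.

```python
def midpoint_circle(r: int) -> list[tuple[int, int]]:
    """Integer pixel offsets (relative to the center) of a circle of radius r."""
    points = set()
    x, y = r, 0
    err = 1 - r                          # integer stand-in for the midpoint test
    while x >= y:
        # Mirror the octant pixel into all eight octants.
        for px, py in ((x, y), (y, x)):
            points.update({(px, py), (-px, py), (px, -py), (-px, -py)})
        y += 1
        if err < 0:
            err += 2 * y + 1             # midpoint inside the circle: keep x
        else:
            x -= 1                       # midpoint outside: step x inward
            err += 2 * (y - x) + 1
    return sorted(points)
```

Using a set removes the duplicate pixels that the mirroring produces on the axes and on the diagonals. For r = 5 the first-octant pixels are (5, 0), (5, 1), (5, 2), and (4, 3).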
Trace
step | x | y | err | what happens
0 | 5 | 0 | −4 | plot (5, 0) and 7 mirrors; err < 0
1 | 5 | 1 | −1 | plot (5, 1) and 7 mirrors; err < 0
2 | 5 | 2 | 4 | plot (5, 2) and 7 mirrors; err ≥ 0 → drop x
3 | 4 | 3 | 4 | plot (4, 3) and 7 mirrors; err ≥ 0 → drop x
4 | 3 | 4 | — | x = 3 < y = 4 next step; loop ends
Where It's Used Today
Embedded displays — microcontrollers driving small LCDs use midpoint circle to draw dials, gauges, and rounded UI elements without floating-point hardware.
Retro and 2D games — drawing circular projectiles, explosion radii, and round sprites on pixel-art canvases.
CAD software — quickly rendering arcs, fillets, and rounded corners in technical drawings.
Plotters and printers — physical pen plotters used integer step decisions for the same reason early screens did.
Computer vision — drawing detection circles around faces, balls, or coins in annotated images uses the same eight-way symmetric pixel walk.
When NOT to Use
When you need an anti-aliased (smooth-edged) circle for high-resolution displays — midpoint produces hard pixel staircases; use Wu's algorithm or supersampling.
When you need to draw an ellipse or rotated arc — eight-way symmetry no longer holds; use the midpoint ellipse variant or a different formulation.
When the radius is very small (r < 3) — the integer rounding produces visibly lopsided circles; precomputed pixel templates look better.
Common Mistakes
Plotting only the first octant and forgetting the other seven mirrored points, drawing a thin arc instead of a full circle.
Updating err with the wrong increment when x is dropped (err += 2(y - x + 1) is easy to mistype as 2(y - x)), shifting the entire circle by one pixel.
Looping WHILE x > y instead of WHILE x >= y, missing the diagonal pixel where x = y and leaving up to 4 pinprick gaps on the diagonals of the rendered circle.
Try It with an AI Assistant
short
Write midpoint_circle(r) implementing the midpoint circle algorithm; return the list of pixels.
behavior
Write a function that, given a radius r, prints the integer pixels of a circle on a grid. Use only integer arithmetic. Walk one octant from the top to the 45° line, and at each step decide whether the next pixel sits at the same x or drops by one based on a running error value. For every pixel found, plot the eight symmetric points around the center.
For instance: Reverse a graph to reveal strongly linked groups.
graph ← {1: [2], 2: [3], 3: [1], 4: [5], 5: []}
visited ← empty set
order ← empty stack
sccs ← empty list
FOR EACH node IN graph
    dfs1(node) // post-order push onto `order`
ENDFOR
reverse_graph ← reverse(graph)
clear visited
WHILE order is NOT empty
    node ← pop(order)
    IF node NOT IN visited THEN
        component ← empty list
        dfs2(reverse_graph, node) // walks component, appends each visited node
        append(sccs, component)
    ENDIF
ENDWHILE
RETURN sccs
S. Rao Kosaraju described the algorithm in unpublished lecture notes in the late 1970s; Micha Sharir independently rediscovered it in 1981, and most textbooks credit them jointly. What made it stick — even though Tarjan's earlier algorithm was a single pass — was clarity: the two-DFS structure is so easy to explain and prove correct that it became the standard way to teach strongly connected components, especially in introductory algorithms courses and competitive programming.
Teaches: Reverse the arrows to expose hidden symmetric structure
The Idea
Pass 1: do a DFS on the original graph and push each node onto an order stack as it finishes (post-order). This gives a topological-ish order where SCC "sinks" appear at the bottom and SCC "sources" appear at the top.
Pass 2: build the reverse graph (every edge flipped). Pop nodes from order one at a time; each unvisited node starts a new DFS in the reversed graph, and everything that DFS reaches forms a single SCC.
Why does it work? In the reverse graph, an SCC stays an SCC (cycles flip but remain cycles), but the connections between SCCs reverse direction. Starting from a top-of-order node in the reversed graph, you can reach exactly its SCC and nothing else — the connections back to other SCCs have been flipped away. The invariant is that every popped node, when unvisited in pass 2, sits at the top of an undiscovered SCC. Total cost is O(V + E).
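The two passes can be sketched in Python as below. This is an illustrative version that assumes the graph is a dict mapping each node to a list of successors. The function name kosaraju and the recursive helpers are choices of the sketch, and deep graphs would need an iterative DFS to stay within Python's recursion limit.

```python
def kosaraju(graph: dict[int, list[int]]) -> list[list[int]]:
    """Strongly connected components of a directed graph given as adjacency lists."""
    visited: set[int] = set()
    order: list[int] = []                # nodes in order of DFS finish time

    def dfs1(u: int) -> None:
        visited.add(u)
        for v in graph.get(u, []):
            if v not in visited:
                dfs1(v)
        order.append(u)                  # pushed on finish (post-order)

    for node in graph:
        if node not in visited:
            dfs1(node)

    # Flip every edge u -> v into v -> u.
    reverse: dict[int, list[int]] = {u: [] for u in graph}
    for u, neighbours in graph.items():
        for v in neighbours:
            reverse.setdefault(v, []).append(u)

    visited.clear()
    sccs: list[list[int]] = []

    def dfs2(u: int, component: list[int]) -> None:
        visited.add(u)
        component.append(u)
        for v in reverse.get(u, []):
            if v not in visited:
                dfs2(v, component)

    while order:
        node = order.pop()               # latest-finishing node first
        if node not in visited:
            component: list[int] = []
            dfs2(node, component)
            sccs.append(component)
    return sccs
```

On the five-node example graph used in this section, the sketch yields the cycle {1, 2, 3} as one component and the nodes 4 and 5 as two singleton components.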
Trace
visit | finishes | order stack (top right)
1 | — | []
2 | — | []
3 | 3 | [3]
— | 2 | [3, 2]
— | 1 | [3, 2, 1]
4 | — | [3, 2, 1]
5 | 5 | [3, 2, 1, 5]
— | 4 | [3, 2, 1, 5, 4]
Where It's Used Today
Static program analysis — finding mutually recursive call clusters in compilers and code-quality tools, just like Tarjan's algorithm but easier to teach.
Data-flow systems — frameworks such as Apache Spark build directed dependency graphs of computation stages; SCC detection is a standard way to find cyclic regions in graphs of that kind.
Reachability databases — services that answer "can A reach B?" pre-process the graph by collapsing each SCC into a single super-node.
Dependency resolution — package managers detect circular dependencies (an SCC of size > 1) and refuse to install them.
Education and competitive programming — Kosaraju's two-pass version is the most-taught SCC algorithm because the proof of correctness is shorter and clearer than Tarjan's.
When NOT to Use
When the graph is undirected — "strongly connected" collapses to "connected"; just run a single DFS or union-find instead.
When you can't afford to materialize the reverse graph — on huge graphs (billions of edges) Tarjan's single-pass algorithm uses half the memory because it skips the edge-flip step.
When the graph is streamed or stored in a way that makes edge reversal expensive (e.g., row-major adjacency on disk) — building the transpose dominates the runtime.
Common Mistakes
Pushing onto the order stack on first visit instead of on finish — the post-order property is lost and pass 2 explores SCCs in the wrong order, merging components that should stay separate.
Forgetting to clear visited between pass 1 and pass 2 — pass 2 then skips every node and returns no SCCs.
Reversing the wrong adjacency list (e.g., reversing each list's contents instead of flipping every edge u->v to v->u) — pass 2 walks the original graph and produces incorrect components.
Try It with an AI Assistant
short
Write kosaraju(graph) returning SCCs via two DFS passes (graph + reverse graph).
behavior
Write a function that, given a directed graph, runs a depth-first search on it and pushes each node onto a stack the moment it finishes. Then build the same graph with every edge reversed. Pop nodes off the stack one by one; each unvisited node starts a new DFS in the reversed graph, and the set of nodes reached by that DFS is one component.
For instance: Find F(1,000,000) using logarithmic recursion.
n ← 10
FUNCTION fib(n)
    IF n = 0 THEN
        RETURN (0, 1)
    ENDIF
    (a, b) ← fib(n DIV 2)
    c ← a * (2*b - a)
    d ← a*a + b*b
    IF (n MOD 2) = 0 THEN
        RETURN (c, d)
    ELSE
        RETURN (d, c + d)
    ENDIF
END FUNCTION
(result, _) ← fib(n)
RETURN result
The fast-doubling identities for Fibonacci numbers come from rewriting the matrix-power formulation [[1,1],[1,0]]^n in scalar form — no single inventor; the trick has circulated as folklore among number theorists since at least the 1980s. It became indispensable once cryptography needed to compute F(n) mod p for n with hundreds of digits, and competitive-programming problems started asking for F(10^18). The plain iterative loop simply cannot finish; doubling reduces the work from n steps to log₂(n) and turns a non-starter into a sub-millisecond computation.
Teaches: Use algebraic identities to double progress instead of stepping
The Idea
Two algebraic identities do all the work. If a = F(k) and b = F(k+1), then:
- F(2k) = a · (2b − a)
- F(2k+1) = a² + b²
So given the answer at index k, we can jump to index 2k (or 2k+1) in one shot. Recursively halve n down to 0, and on the way back up double.
Why does this work? Each recursive call halves n, so the depth is log₂(n). At each level we do a handful of multiplications and additions: a constant number of arithmetic operations per level, although each big-integer multiplication gets more expensive as the numbers grow unless you reduce modulo some m. Plain iterative Fibonacci needs n additions; fast doubling needs about log n levels of cheap algebra. The identities themselves come from matrix exponentiation: [[1,1],[1,0]]^n produces the Fibonacci numbers, and squaring a matrix is doubling the index.
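The identities translate almost line for line into Python, whose integers are arbitrary precision. The helper name fib_pair is an assumption of this sketch; it returns the pair (F(n), F(n+1)) so that each level has both values it needs.

```python
def fib_pair(n: int) -> tuple[int, int]:
    """Return (F(n), F(n+1)) using the fast-doubling identities."""
    if n == 0:
        return (0, 1)
    a, b = fib_pair(n // 2)              # a = F(k), b = F(k+1) with k = n // 2
    c = a * (2 * b - a)                  # F(2k)
    d = a * a + b * b                    # F(2k+1)
    if n % 2 == 0:
        return (c, d)
    return (d, c + d)                    # (F(2k+1), F(2k+2))


def fib(n: int) -> int:
    """F(n) for n >= 0, in O(log n) recursion depth."""
    return fib_pair(n)[0]
```

The recursion depth is only about log₂(n), so even n around 10^18 needs roughly 60 nested calls. For indices that large, the multiplications would need to be reduced modulo some m to keep the numbers small.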
Trace
n | recurse on | a (=F(n//2)) | b (=F(n//2+1)) | c = a·(2b−a) | d = a²+b² | n even? | return (F(n), F(n+1))
0 | base | — | — | — | — | — | (0, 1)
1 | n=0 | 0 | 1 | 0 | 1 | no | (1, 1)
2 | n=1 | 1 | 1 | 1 | 2 | yes | (1, 2)
5 | n=2 | 1 | 2 | 3 | 5 | no | (5, 8)
10 | n=5 | 5 | 8 | 55 | 89 | yes | (55, 89)
Where It's Used Today
Cryptography — Lucas probable-prime tests (one half of the Baillie-PSW test) evaluate Lucas sequences, a generalization of the Fibonacci numbers, modulo the candidate n using the same index-doubling identities.
Big-integer libraries — Python's sympy.fibonacci(n) and many GMP-based Fibonacci routines in C/C++ use fast doubling for large n.
Competitive programming — Fibonacci-mod-p problems on Codeforces or LeetCode that allow n up to 10^18 need an O(log n) method such as this one or the equivalent matrix exponentiation.
Number-theory research — checking conjectures about Fibonacci divisibility for huge indices needs F(n) mod something, computed by fast doubling.
Procedural generation — some games seed grid layouts or sequences with very-large-index Fibonacci values to get spread-out, non-repeating numbers.
When NOT to Use
When n is small (say, under a few hundred) — the simple iterative loop has lower constant overhead and no recursion cost.
When you need every Fibonacci number up to F(n) — the iterative method gives you the whole sequence in one pass; fast doubling skips intermediate values.
When the language lacks big integers and F(n) overflows — F(93) already exceeds a signed 64-bit integer, so beyond that you need a big-int type or modular arithmetic regardless of which method you pick.
Common Mistakes
Returning only F(n) from the recursion and recomputing F(n+1) separately — the pair (F(n), F(n+1)) is what makes the doubling work; splitting it doubles the work.
Swapping the formulas c = a·(2b − a) and d = a² + b² — both look symmetric but they are not interchangeable; c is F(2k) and d is F(2k+1).
Using a memoized linear recursion and calling it "fast doubling" — memoization helps but is still O(n); the doubling identities are what give true O(log n).
Try It with an AI Assistant
short
Write fib(n) returning F(n) using the fast-doubling identity in O(log n).
behavior
Write a recursive function that returns the pair (F(n), F(n+1)). For n=0 return (0, 1). Otherwise compute (a, b) for n//2, then form c = a·(2b − a) and d = a² + b². If n is even return (c, d); if n is odd return (d, c + d).
Made prefix-match information available in linear time.
For instance: Find all places a pattern begins inside a string.
s ← "aabaabcab"
n ← length(s)
z ← array[0..n-1] filled with 0
l ← 0
r ← 0
FOR i FROM 1 TO n-1
    IF i <= r THEN
        z[i] ← min(r - i + 1, z[i - l])
    ENDIF
    WHILE i + z[i] < n AND s[z[i]] = s[i + z[i]]
        z[i] ← z[i] + 1
    ENDWHILE
    IF i + z[i] - 1 > r THEN
        l ← i
        r ← i + z[i] - 1
    ENDIF
ENDFOR
RETURN z
The Z-array idea took shape in the 1980s and was popularised by string-algorithm textbooks (notably Gusfield's 1997 Algorithms on Strings, Trees, and Sequences) and later by competitive programmers, who reorganised an idea implicit in the older Knuth-Morris-Pratt machinery. Where KMP builds a failure function you have to read backwards, the Z-array is the same information laid out forwards: easier to teach, easier to implement, and a natural stepping stone toward the suffix-tree and suffix-array algorithms taught after it.
Teaches: Reuse a known matched window instead of recomparing characters
The Idea
Walk through the string from left to right, keeping a window [l, r] — the rightmost block we've already verified matches the prefix. When we reach position i, two cases occur. If i falls inside the current window, we already know what s looks like there — it mirrors s[i − l] — so we can copy that as a starting estimate for z[i] (capped by how much of the window remains). Then we try to extend z[i] further, comparing characters one by one. If we extended past r, we update the window to the new rightmost match.
Why is this linear? Each successful character comparison either lives inside the existing window (cost amortized to zero, because we already paid for it) or extends the window to the right. Since the window can only move right, total comparisons across the whole pass are at most 2n. The result is a complete prefix-match map in O(n) time and O(n) space.
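A Python sketch of the window-based pass, followed by a small pattern-search helper built on it. The name find_all and the default separator "\x00" are assumptions of the sketch; the separator must be a character that occurs in neither the pattern nor the text.

```python
def z_array(s: str) -> list[int]:
    """z[i] = length of the longest substring starting at i that matches a prefix of s."""
    n = len(s)
    z = [0] * n
    l = r = 0                            # rightmost verified window [l, r]
    for i in range(1, n):
        if i <= r:
            # Inside the window: start from the mirrored answer, capped at the window edge.
            z[i] = min(r - i + 1, z[i - l])
        # Extend by direct comparison, starting from the estimate, not from 0.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] - 1 > r:
            l, r = i, i + z[i] - 1       # window moved right
    return z


def find_all(pattern: str, text: str, sep: str = "\x00") -> list[int]:
    """Start indices of every occurrence of pattern in text (sep must appear in neither)."""
    m = len(pattern)
    z = z_array(pattern + sep + text)
    # Positions after the separator whose Z-value equals the pattern length are matches.
    return [i - m - 1 for i in range(m + 1, len(z)) if z[i] == m]
```

The separator caps every Z-value in the text part at the pattern length, which is what makes the equality test in find_all safe.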
Trace
i | inside [l, r]? | starting z[i] | extended z[i] | window after
0 | — | 0 | 0 | l = 0, r = 0
1 | no | 0 | 1 (a = a, then a ≠ b) | l = 1, r = 1
2 | no | 0 | 0 (b ≠ a) | unchanged
3 | no | 0 | 3 (matches aab, then a ≠ c) | l = 3, r = 5
4 | yes, mirror = 1 | min(5−4+1, z[1]=1) = 1 | 1 (then a ≠ b) | unchanged
5 | yes, mirror = 2 | min(5−5+1, z[2]=0) = 0 | 0 (b ≠ a) | unchanged
6 | no | 0 | 0 (c ≠ a) | unchanged
7 | no | 0 | 1 (a = a, then a ≠ b) | l = 7, r = 7
8 | no | 0 | 0 (b ≠ a) | unchanged
Where It's Used Today
Pattern matching — finding all occurrences of a search query inside a long text in linear time, often used in editors and grep-like tools.
Bioinformatics — locating short DNA motifs (transcription-factor binding sites, primer sequences) inside long genomes.
Plagiarism and duplicate detection — comparing documents by finding repeated prefix matches between texts.
Compression preprocessing — some compression schemes use Z-arrays to detect repetition that can be encoded more compactly.
Programming-contest libraries — competitive programmers ship z_array as one of the standard linear-time string tools.
When NOT to Use
When you only need a single yes/no "does pattern P occur in T?" — str.find or KMP is simpler and uses less memory than building a full Z-array.
When the alphabet is huge or comparisons are expensive (e.g. comparing whole objects) — Z-algorithm assumes O(1) character comparison; otherwise the linear-time bound disappears.
When searching across multiple patterns simultaneously — Aho-Corasick handles many patterns in one pass; running Z-algorithm per pattern is wasteful.
Common Mistakes
Forgetting the separator when concatenating P + "#" + T for substring search — without it, a partial overlap of P and T can produce a false z[i] = |P|.
Initializing l = r = -1 but then comparing i <= r without guarding against negatives, breaking the very first iteration.
Re-comparing characters from index 0 instead of from z[i] when extending — turns the algorithm from O(n) into O(n²) on strings like "aaaaa…".
Try It with an AI Assistant
short
Write z_array(s) returning the Z-array of a string in O(n).
behavior
Write a function that, for each position i in a string, computes how many characters starting at i match the start of the string. Maintain a sliding window of the rightmost prefix-match found so far; when the next position falls inside that window, reuse the mirrored answer as a starting guess instead of comparing from scratch, and only extend by direct comparison when necessary.
It made simple search loops faster and cleaner in low-level code where every branch mattered.
// Sentinel linear search
a[n] ← key // sentinel at end
i ← 0
WHILE a[i] != key
    i ← i + 1
ENDWHILE
IF i < n THEN
    RETURN i
ENDIF
RETURN -1
Linear search checks each item and also checks whether the end has been reached. The sentinel trick places the target at the end temporarily, eliminating one repeated boundary check.
Teaches: Remove boundary checks by embedding a guaranteed stopping condition
The Idea
Reserve one extra slot at the end of the array, beyond the real n elements. Place key in that extra slot — that's the sentinel. Now run a tight loop: WHILE a[i] != key, i ← i + 1. The loop has no bounds check at all. It always halts, because in the worst case i reaches n and finds the planted key.
Why is this safe? Because the loop is guaranteed to terminate as soon as it sees the first occurrence of key — and we've ensured at least one occurrence exists. After the loop, just check whether i landed inside the real data (i < n → found at index i) or on the sentinel (i == n → not found, return -1). The invariant: a[i] != key for every i already scanned, and the sentinel guarantees the loop will eventually find a match. It saves one comparison per iteration — small per loop, but multiplied by billions of iterations over a system's lifetime, it adds up.
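A small Python sketch of the same loop, assuming the caller passes a list with at least n + 1 slots so that a[n] is free scratch space. The function name is illustrative. Python checks list bounds on every access anyway, so this shows the control flow rather than a real speed-up; the saving matters in languages like C or assembly.

```python
def sentinel_linear_search(a: list, n: int, key) -> int:
    """Index of key in a[0..n-1], or -1. Requires len(a) >= n + 1; a[n] is overwritten."""
    a[n] = key                           # plant the sentinel one past the real data
    i = 0
    while a[i] != key:                   # no bounds test: the sentinel guarantees a stop
        i += 1
    return i if i < n else -1            # landing on the sentinel means "not found"
```

For example, with a = [3, 1, 4, 7, None] and n = 4, searching for 7 stops at index 3, while searching for 9 runs to the sentinel at index 4 and reports -1.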
Trace
step | i | a[i] | a[i] != key? | action
0 | 0 | 3 | yes | i ← 1
1 | 1 | 1 | yes | i ← 2
2 | 2 | 4 | yes | i ← 3
3 | 3 | 7 | no | exit
Where It's Used Today
Embedded firmware — microcontrollers with no branch predictor benefit from removing per-iteration boundary checks in tight inner loops.
C string handling — the terminating '\0' of a C string acts as a built-in sentinel, so loops like strlen need no separate length check; many K&R-era utilities plant explicit sentinels in the same spirit.
Database scan loops — some columnar database scanners place a sentinel at the end of a scanned page to avoid a per-row bounds check.
Linked-list searches — a "dummy tail node" holding the search key is the linked-list version of the same trick.
Performance teaching — sentinel search is the canonical example of trading a small amount of memory for fewer instructions per loop iteration.
When NOT to Use
When the array is sorted — binary search is O(log n); sentinel search is still O(n) and saves only a constant factor.
When you can't write past index n−1 — read-only buffers, memory-mapped data, or shared arrays make placing a sentinel unsafe.
When the data is concurrent or shared — overwriting a[n] from one thread breaks a reader on another thread.
Common Mistakes
Allocating exactly n slots, then writing the sentinel into a[n] — that's a buffer overrun, not a sentinel; allocate n + 1.
Forgetting to restore the original a[n] value afterwards — the trick assumes that slot is scratch space, otherwise it corrupts the next call.
Returning i without the i < n check — when the key isn't present the function happily returns n as if it were a real match.
Try It with an AI Assistant
short
Write sentinel_linear_search(a, n, key) returning the index of key in a[0..n-1] using a sentinel placed at a[n].
behavior
Write a function that searches an array of n items for a key. To avoid checking the array bound on every iteration, first store the key itself in slot a[n] (one position past the real data). Then loop forward from index 0, advancing while the current cell doesn't equal the key. After the loop, return the index if it's less than n, otherwise return -1.
It made ray tracing, collision detection, selection picking, and 3D acceleration structures dramatically faster.
// Ray-AABB slab method
t_min ← -infinity
t_max ← +infinity
FOR EACH axis a IN (x, y, z)
    inv_d ← 1 / ray.dir[a]
    t1 ← (box.min[a] - ray.o[a]) * inv_d
    t2 ← (box.max[a] - ray.o[a]) * inv_d
    IF t1 > t2 THEN swap(t1, t2) ENDIF
    t_min ← max(t_min, t1)
    t_max ← min(t_max, t2)
    IF t_min > t_max THEN
        RETURN miss
    ENDIF
ENDFOR
RETURN hit at t_min
Computer graphics needed to know quickly whether a ray might hit an object. Testing against complex shapes was expensive, so objects were wrapped in axis-aligned boxes first.
Teaches: Intersect ranges independently across dimensions
The Idea
Think of an axis-aligned box as the intersection of three "slabs" — one slab between the box's min and max along x, another along y, another along z. The ray hits the box only if it is inside all three slabs at once at some moment.
For each axis, compute the two parameter values t1 and t2 where the ray enters and leaves that slab. Sort them so t1 ≤ t2, then maintain t_min (the latest entry across axes seen so far) and t_max (the earliest exit). After processing every axis, if t_min ≤ t_max the ray pierced all three slabs simultaneously: a hit at parameter t_min. If at any point t_min > t_max, the slab intervals don't overlap and you can stop early: a miss. The whole test is just one reciprocal, two multiplies, and a few comparisons per axis.
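A Python sketch of the slab test, with rays and boxes passed as plain 3-tuples. The function name, the tuple-based signature, and two extra guards are assumptions of this sketch rather than part of the pseudocode: axes with zero direction are handled by a direct containment check, and a box lying entirely behind the origin (t_max < 0) is reported as a miss.

```python
import math


def ray_aabb_intersection(origin, direction, box_min, box_max):
    """Entry parameter t_min if the ray hits the box, else None."""
    t_min, t_max = -math.inf, math.inf
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray parallel to this slab: it must already start inside it.
            if o < lo or o > hi:
                return None
            continue
        inv_d = 1.0 / d
        t1 = (lo - o) * inv_d
        t2 = (hi - o) * inv_d
        if t1 > t2:
            t1, t2 = t2, t1              # negative direction: swap entry and exit
        t_min = max(t_min, t1)           # latest entry so far
        t_max = min(t_max, t2)           # earliest exit so far
        if t_min > t_max:
            return None                  # slab intervals no longer overlap
    if t_max < 0:
        return None                      # whole box is behind the ray origin
    return t_min
```

A negative return value with a non-negative t_max would mean the origin is inside the box; callers that only want forward hits can clamp the result to zero.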
Trace
axis | inv_d | t1 | t2 | after swap (t1, t2) | t_min | t_max
start | — | — | — | — | −∞ | +∞
x | 1.0 | 2 | 3 | (2, 3) | 2 | 3
y | ±∞ | ±∞ | ±∞ | (−∞, +∞) | 2 | 3
z | ±∞ | ±∞ | ±∞ | (−∞, +∞) | 2 | 3
Where It's Used Today
Ray tracers — every pixel of a Pixar or game-engine ray-traced frame fires rays that hit AABBs first to skip whole regions of the scene.
Game collision detection — checking whether a bullet, projectile, or character ray crosses an enemy's bounding box before doing per-triangle math.
3D selection / picking — clicking a 3D model in Blender, Unity, or AutoCAD shoots a ray from your cursor and tests AABBs to find what you clicked.
BVH and octree traversal — 3D acceleration structures use AABB tests at every internal node to skip subtrees that the ray can't reach.
Robotics and self-driving cars — sensor rays (LiDAR-style) tested against AABBs around obstacles for fast nearby-object filtering.
When NOT to Use
When the bounding box is rotated relative to the world axes — the slab method only works on axis-aligned boxes; use OBB tests with the separating-axis theorem instead.
When the actual geometry is nearly box-shaped — testing the AABB plus the geometry is wasted work; just test the geometry.
When you need the exit point or full segment overlap — the standard variant returns only t_min; the back exit needs a few more lines.
Common Mistakes
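Not handling rays parallel to a slab (ray.dir[a] = 0) — 1 / 0 raises an error in some languages; under IEEE floating point it yields ±∞, and 0 · ∞ produces NaN when the origin sits exactly on a slab plane, which then contaminates t_min and t_max.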
Forgetting to swap t1 and t2 when the ray direction is negative on that axis, so entry and exit get reversed.
Returning t_min without checking signs — if t_max < 0 the whole box is behind the ray origin and should usually count as a miss for visibility tests; a negative t_min with t_max ≥ 0 means the origin is inside the box.
Try It with an AI Assistant
short
Write ray_aabb_intersection_slab_method(ray, box) that returns the entry distance t_min if the ray hits the box, or None if it misses.
behavior
Write a function that, given a ray and an axis-aligned box, computes for each of the x, y, z axes the two distances along the ray where it enters and exits that axis's slab. Track the latest entry and earliest exit across all three axes. If the latest entry is at most the earliest exit, return that entry distance; otherwise report a miss.
It made repeated patterns computable in mathematics, simulations, dynamic systems, random generators, and sequence prediction.
// matrix exponentiation for
// linear recurrence f(n) = c1*f(n-1) + c2*f(n-2) + ...
M ← companion_matrix(coefs)
result ← M^n applied TO seeds
RETURN result[0]
The companion-matrix view of linear recurrences is a piece of nineteenth-century linear algebra — Cayley and Frobenius worked out the theory long before computers existed. The algorithmic trick of using fast matrix exponentiation to jump to the n-th term emerged as folklore in competitive-programming circles in the 1980s, when contest setters realised they could ask for f(10^18) and force solvers to find the O(log n) method instead of plain iteration. The technique is now standard in any contest grader's toolkit and shows up in cryptography and population modelling whenever a sequence has to be projected far into the future.
Teaches: Turn iteration into fast exponentiation of transformations
The Idea
Pack the last k values into a column vector. Build a k × k companion matrix M whose top row holds the coefficients c₁, c₂, …, c_k and whose subdiagonal is all 1s. Then M times the vector [f(n−1), f(n−2), …, f(n−k)] is exactly [f(n), f(n−1), …, f(n−k+1)]. One matrix multiply advances the sequence by one step.
Now the speed-up: M applied n times is M^n, and we can compute M^n by repeated squaring in just O(log n) matrix multiplications. The invariant is that the vector always holds a window of k consecutive sequence terms; M^n slides the window forward by n steps in one operation. For Fibonacci this turns "make a billion additions" into "do thirty 2×2 matrix multiplies."
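A compact Python sketch of the whole pipeline using plain nested lists. The helper names mat_mul and mat_pow, the optional mod parameter, and the indexing convention (init holds f(0) through f(k−1), and coeffs[0] multiplies f(n−1)) are assumptions made for illustration.

```python
def mat_mul(A, B, mod=None):
    """Product of two matrices stored as lists of rows, optionally reduced mod `mod`."""
    inner, cols = len(B), len(B[0])
    C = []
    for row in A:
        out = []
        for j in range(cols):
            total = sum(row[t] * B[t][j] for t in range(inner))
            out.append(total % mod if mod else total)
        C.append(out)
    return C


def mat_pow(M, e, mod=None):
    """M raised to the power e by repeated squaring: O(log e) multiplications."""
    k = len(M)
    result = [[int(i == j) for j in range(k)] for i in range(k)]  # identity matrix
    while e > 0:
        if e & 1:
            result = mat_mul(result, M, mod)
        M = mat_mul(M, M, mod)
        e >>= 1
    return result


def linear_recurrence(coeffs, init, n, mod=None):
    """n-th term of f(i) = coeffs[0]*f(i-1) + ... + coeffs[k-1]*f(i-k), init = [f(0..k-1)]."""
    k = len(coeffs)
    if n < k:
        return init[n] % mod if mod else init[n]
    # Companion matrix: coefficients on the top row, 1s on the subdiagonal.
    M = [list(coeffs)] + [[int(j == i) for j in range(k)] for i in range(k - 1)]
    P = mat_pow(M, n - k + 1, mod)
    state = init[::-1]                   # [f(k-1), ..., f(0)]
    value = sum(P[0][j] * state[j] for j in range(k))
    return value % mod if mod else value
```

The exponent is n − k + 1 because the seed vector already ends at f(k−1), which is the usual place where off-by-one errors creep in. With coeffs [1, 1] and init [0, 1] the function reproduces the Fibonacci numbers.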
Trace
k | M^k
1 | [[1, 1], [1, 0]]
2 | M·M = [[2, 1], [1, 1]]
4 | M²·M² = [[5, 3], [3, 2]]
5 = 4+1 | M⁴·M = [[8, 5], [5, 3]]
Where It's Used Today
Competitive programming — finding the n-th Fibonacci or Tribonacci number for n = 10¹⁸ shows up in contest problems all the time.
Cryptography and hashing — fast term computation for sequences used inside stream ciphers and pseudorandom generators.
Population and economic models — Leslie matrices project age-structured populations many generations forward, exactly this technique.
Signal processing — IIR (infinite impulse response) filters are linear recurrences; understanding their long-term behavior reduces to powers of a companion matrix.
Markov chains — the state distribution after n steps is M^n · π₀, computed identically by repeated squaring.
When NOT to Use
When the recurrence is non-linear (f(n) = f(n-1)^2 + 1) — the matrix trick only works for linear combinations of past terms.
When n is small (a few thousand) — plain iteration is simpler and avoids the constant-factor cost of k³ matrix multiplies.
When k (the depth of the recurrence) is large — the matrix is k × k, so k³ log n may be slower than direct iteration.
Common Mistakes
Building the companion matrix with the coefficients in the wrong row, producing a different sequence with the same first few terms.
Forgetting to apply modular reduction when intermediate matrix entries exceed 2^63 and silently overflowing.
Using M^n with the seed vector positioned wrongly (off by one), so you compute f(n+1) or f(n-1) instead of f(n).
Try It with an AI Assistant
short
Write linear_recurrence(coeffs, init, n) that, given k coefficients and k initial values, returns the n-th term of the recurrence using matrix exponentiation.
behavior
Define a sequence by the rule that each term is a fixed linear combination of the previous k terms, given the first k terms as seeds. Compute the n-th term efficiently for large n by representing one step forward as a k × k matrix acting on a length-k vector, raising that matrix to the n-th power by repeated squaring, and reading off the appropriate entry.
Binary search is powerful, but in real systems data access is often block-like or sequential — disk sectors, magnetic tape, or paged-in pages of memory. Jump search emerged as the natural compromise: leap ahead in fixed-size blocks until the block containing the key is found, then linearly scan inside it. The square-root step size minimizes total work, and the technique is still the textbook example used to show that not every "slower than O(log n)" algorithm is bad — sometimes the access pattern matters more than the asymptote.
Teaches: Skip ahead, then refine locally
The Idea
Pick a STEP size — the classic choice is floor(sqrt(n)), which balances the number of jumps against the size of the linear scan. Phase one: jump forward by STEP indices at a time, checking the value at each landing spot. Stop the moment that value is at least the key — you've now bracketed the key inside the block ending at this jump. Phase two: walk backward (or scan from the previous jump position prev forward) until you either find the key or pass it.
Why does it work? Because the array is sorted, the key, if present, must lie in the first block whose right end is ≥ key. The invariant is that every position before prev holds a value strictly less than the key, so we never miss anything by skipping them. With STEP = √n, the worst case is √n jumps plus √n walk steps, about 2√n comparisons in total: faster than linear search and gentler on sequential storage than binary search.
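Here is one way the two phases might look in Python. It is a sketch under simple assumptions: a is an in-memory sorted list, the forward-scan variant of phase two is used, and the landing index is clamped so the last (possibly short) block is handled safely.

```python
import math


def jump_search(a, x):
    """Return an index of x in the sorted list a, or -1 if x is absent."""
    n = len(a)
    if n == 0:
        return -1
    step = max(1, math.isqrt(n))  # floor(sqrt(n)) block size
    prev = 0
    # Phase 1: jump block by block while the block's last element is < x.
    while prev < n and a[min(prev + step, n) - 1] < x:
        prev += step
    # Phase 2: linear scan inside the bracketed block [prev, prev + step).
    for i in range(prev, min(prev + step, n)):
        if a[i] == x:
            return i
        if a[i] > x:
            break  # passed the spot where x would be
    return -1
```

With the hypothetical array [1, 3, 5, 7, 9, 11, 13, 15, 17] (chosen to agree with the values in the trace below) and key 13, the jumps inspect a[2], a[5], and a[8], then the scan finds 13 at index 6.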
Trace
step | prev | STEP | check a[STEP-1] | action
1 | 0 | 3 | a[2] = 5 < 13 | prev = 3, STEP = 6
2 | 3 | 6 | a[5] = 11 < 13 | prev = 6, STEP = 9
3 | 6 | 9 | a[8] = 17 ≥ 13 | exit phase 1
Where It's Used Today
Tape and sequential storage — when data is read in one direction and seeking back is expensive (legacy backup tapes, log files), jump search beats binary search's random access.
Database index pages — older index designs and some embedded databases scan sorted blocks of a page using a jump-then-linear pattern that fits CPU cache lines well.
String matching helpers — when scanning a sorted list of byte offsets, jump search homes in on the right region quickly without expensive midpoint computations.
Embedded systems — microcontrollers searching a small sorted lookup table prefer jump search's simple loop over the recursion or pointer math of binary search.
Educational comparisons — jump search is the canonical example for showing why O(√n) sits between linear O(n) and binary O(log n), and why memory-access patterns matter.
When NOT to Use
When the data is unsorted — jump search relies on monotonicity to know it can skip preceding blocks safely.
When random access is cheap and the array fits in memory — binary search's O(log n) strictly beats O(√n).
When the data is stored in a linked list — there's no constant-time random jump, so the per-jump cost destroys the speedup.
Common Mistakes
Jumping past the end without clamping the landing index to min(STEP, n), then reading off the end of the array, causing a crash or false negative.
Choosing a fixed step size (like 100) instead of √n — the worst case becomes n/STEP + STEP, much worse than 2√n.
Returning -1 as soon as a jump lands above the key, instead of scanning the bracket between prev and that jump.
Try It with an AI Assistant
short
Write jump_search(a, x) over a sorted list using jump-then-linear search; jump size = √n.
behavior
Search a sorted array for a key by leaping forward in fixed-size blocks of size floor(sqrt(n)) until the value at the block's end is at least the key, then linearly scan within that block until you find the key or pass it. Return the index, or -1 if not found.
Made majority detection possible with constant memory.
For instance: Find if one candidate received more than half the votes.
arr ← [3, 3, 4, 2, 4, 4, 2, 4, 4]
count ← 0
candidate ← NULL
FOR EACH x IN arr
    IF count = 0 THEN
        candidate ← x
    ENDIF
    IF x = candidate THEN
        count ← count + 1
    ELSE
        count ← count - 1
    ENDIF
ENDFOR
RETURN candidate
Robert Boyer and J Strother Moore — already famous for the Boyer-Moore string-search algorithm — invented the majority-vote trick in 1981 while working at SRI International on automated theorem proving. They needed it for an internal verification tool: a way to check that a given value really was the dominant one in a list, without allocating the giant counter table the obvious approach demands. The algorithm sat as an internal SRI memo for ten years before being widely published, and is now standard interview fare and a textbook example of streaming algorithms with O(1) memory.
Teaches: Cancel opposing votes; a true majority always survives
The Idea
Keep two variables: a current candidate and a count. Walk through the array. If count is zero, adopt the current element as a fresh candidate and set count to 1. Otherwise, increment count if the element matches the candidate, decrement it if it differs. Think of it as pairing off: each non-candidate vote cancels one candidate vote.
This works because if a true majority exists, it has more votes than all other values combined. So no matter how the cancellations pair up, at least one majority vote always survives, and that surviving candidate is what the algorithm reports. Note: if no strict majority exists, the algorithm may report any value — so a real implementation often does a verification pass to count how many times the returned candidate actually appears.
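The cancellation pass and the verification pass mentioned above can be put together as follows. This is a small sketch that assumes the input is a list that can be read twice; returning None when no strict majority exists is a choice made for this example.

```python
def majority(arr):
    """Return the strict-majority element of arr, or None if there is none."""
    candidate, count = None, 0
    # Pass 1: pair off differing votes; a strict majority always survives.
    for x in arr:
        if count == 0:
            candidate = x
        count += 1 if x == candidate else -1
    # Pass 2: verify, because pass 1 yields some value even with no majority.
    occurrences = sum(1 for x in arr if x == candidate)
    if arr and occurrences * 2 > len(arr):
        return candidate
    return None
```

On the array from the example, [3, 3, 4, 2, 4, 4, 2, 4, 4], the value 4 appears five times out of nine, so the function returns 4. On a true one-pass stream the second pass is not available, and the candidate can only be trusted if a majority is known to exist.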
Trace
step | x | candidate | count | what happens
0 | 3 | 3 | 1 | count = 0 → candidate ← 3, count ← 1
1 | 3 | 3 | 2 | x = candidate → count ← 2
2 | 4 | 3 | 1 | x ≠ candidate → count ← 1
3 | 2 | 3 | 0 | x ≠ candidate → count ← 0
4 | 4 | 4 | 1 | count = 0 → candidate ← 4, count ← 1
5 | 4 | 4 | 2 | x = candidate → count ← 2
6 | 2 | 4 | 1 | x ≠ candidate → count ← 1
7 | 4 | 4 | 2 | x = candidate → count ← 2
8 | 4 | 4 | 3 | x = candidate → count ← 3
Where It's Used Today
Streaming systems — finding heavy hitters in network traffic or log streams when you can't store every value.
Distributed consensus — quorum-style voting where a single value must "win" with limited bookkeeping.
Election counting — pre-tally checks for strict majority in tabulation pipelines.
Sensor fusion — picking the dominant reading from a noisy redundant sensor array.
Coding interviews and competitive programming — the canonical "O(n) time, O(1) space" majority-element question.
When NOT to Use
When no strict majority is guaranteed — the algorithm returns some element regardless, so the candidate may be meaningless without a verification pass.
When you need the top-k frequent items (not just the single majority) — use a frequency map or a Misra-Gries / Boyer-Moore generalization with k-1 counters.
When the input is already grouped or sorted — a single grouped scan is clearer and gives exact counts without the cancellation reasoning.
Common Mistakes
Skipping the verification pass and trusting the candidate even when no majority exists — the function will confidently return the wrong element.
Setting count to 0 instead of incrementing it after adopting a new candidate, so that candidate is dropped on the very next mismatch.
Comparing x = candidate before the count = 0 check — the first element gets compared against an uninitialized candidate.
Try It with an AI Assistant
short
Write majority(arr) returning the majority element in O(n) time and O(1) space.
behavior
Walk through an array once, keeping a candidate value and a count. If count is zero, set the candidate to the current element and count to 1. Otherwise, increment the count if the current element matches the candidate, decrement it if it doesn't. Return the final candidate.
For instance: Query total cable length between two nodes in a network tree.
tree ← {0: [1, 2], 1: [3, 4], 2: [5], 3: [6], 4: [], 5: [], 6: []}
size ← array[7]
heavy ← array[7] // -1 means "no heavy child"
chainHead ← array[7]
pos ← array[7]
currentPos ← 0
FUNCTION computeSize(v)
    size[v] ← 1
    heavy[v] ← -1
    maxChildSize ← 0
    FOR EACH child IN tree[v]
        computeSize(child)
        size[v] ← size[v] + size[child]
        IF size[child] > maxChildSize THEN
            maxChildSize ← size[child]
            heavy[v] ← child
        ENDIF
    ENDFOR
END FUNCTION
FUNCTION decompose(v, head)
    chainHead[v] ← head
    pos[v] ← currentPos
    currentPos ← currentPos + 1
    IF heavy[v] != -1 THEN
        decompose(heavy[v], head)
    ENDIF
    FOR EACH child IN tree[v]
        IF child != heavy[v] THEN
            decompose(child, child)
        ENDIF
    ENDFOR
END FUNCTION
computeSize(0)
decompose(0, 0)
RETURN (chainHead, pos)
In 1981, Daniel Sleator and Robert Tarjan, then at Bell Laboratories in Murray Hill, New Jersey, introduced link/cut trees for fast tree-path operations. The core idea is to split a tree into "heavy" paths through its largest subtrees, with the much rarer "light" edges in between. This guarantees that any leaf-to-root walk crosses only O(log n) light edges, since each crossing at least doubles the subtree size. Competitive programmers later popularized a static-tree variant called Heavy-Light Decomposition, which lays the heavy chains contiguously in an array so a segment tree can answer any path query in O(log² n).
Teaches: Decompose into long paths so any traversal crosses few of them
The Idea
For every node, look at its children. The child with the largest subtree is called the heavy child; all others are light. Connect each node to its heavy child, and chain after chain forms — the heavy chains. Walking from any node up to the root, every time you cross a light edge the subtree size at least doubles, so you can cross at most log₂ n light edges. Light edges = chain transitions, so any path touches O(log n) chains.
To make queries fast, we lay out the tree in a flat array via a DFS that always visits the heavy child first. That puts every heavy chain in a contiguous slice of the array, perfect for a segment tree. Why does it work? Because the heavy-edge rule guarantees the geometric subtree-doubling along every leaf-to-root path. Sleator and Tarjan's link/cut trees (1981) were the first to use this idea; modern competitive programming has popularized the array-based variant.
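To see how the flat layout turns a path into a few contiguous slices, here is a small Python sketch. Several things in it are assumptions for illustration: the names build_hld and path_sum, the per-node values (the cable example would put weights on edges instead), and the use of plain sum over list slices where a real implementation would query a segment tree. It also uses recursion, so very deep trees would need an iterative rewrite.

```python
def build_hld(tree, root=0):
    """Return (parent, depth, chain_head, pos) for a rooted tree given as
    {node: [children]} with nodes numbered 0..n-1."""
    n = len(tree)
    parent, depth = [-1] * n, [0] * n
    size, heavy = [1] * n, [-1] * n
    chain_head, pos = [0] * n, [0] * n

    def compute_size(v):
        largest = 0
        for c in tree[v]:
            parent[c], depth[c] = v, depth[v] + 1
            compute_size(c)
            size[v] += size[c]
            if size[c] > largest:      # strict: the first largest child wins ties
                largest, heavy[v] = size[c], c

    next_pos = 0

    def decompose(v, head):
        nonlocal next_pos
        chain_head[v], pos[v] = head, next_pos
        next_pos += 1
        if heavy[v] != -1:
            decompose(heavy[v], head)  # heavy child first keeps the chain contiguous
        for c in tree[v]:
            if c != heavy[v]:
                decompose(c, c)        # each light child starts a new chain

    compute_size(root)
    decompose(root, root)
    return parent, depth, chain_head, pos


def path_sum(u, v, flat, hld):
    """Sum node values on the u..v path; flat[pos[x]] holds the value of node x."""
    parent, depth, chain_head, pos = hld
    total = 0
    while chain_head[u] != chain_head[v]:
        # Lift the endpoint whose chain head is deeper.
        if depth[chain_head[u]] < depth[chain_head[v]]:
            u, v = v, u
        head = chain_head[u]
        total += sum(flat[pos[head]:pos[u] + 1])  # one contiguous chain slice
        u = parent[head]
    lo, hi = sorted((pos[u], pos[v]))
    return total + sum(flat[lo:hi + 1])


tree = {0: [1, 2], 1: [3, 4], 2: [5], 3: [6], 4: [], 5: [], 6: []}
hld = build_hld(tree)
values = [1, 2, 3, 4, 5, 6, 7]         # hypothetical weight of node i is i + 1
flat = [0] * len(values)
for node, p in enumerate(hld[3]):
    flat[p] = values[node]
```

On this tree the path from node 6 to node 5 runs 6, 3, 1, 0, 2, 5, and path_sum(6, 5, flat, hld) adds up two slices: the chain 2..5 and the chain segment 0..6. Swapping the sum calls for segment-tree range queries is what gives the O(log² n) bound.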
Trace
call | chainHead[v] | pos[v] | currentPos after
decompose(0, 0) | 0 | 0 | 1
decompose(1, 0) | 0 | 1 | 2
decompose(3, 0) | 0 | 2 | 3
decompose(6, 0) | 0 | 3 | 4
decompose(4, 4) | 4 | 4 | 5
decompose(2, 2) | 2 | 5 | 6
decompose(5, 2) | 2 | 6 | 7
Where It's Used Today
Competitive programming — heavy-light is the standard tool for any contest problem that asks for path sums, path maxima, or path updates on a tree.
Network analysis tools — measuring latency or bandwidth along the route between two routers in a tree-shaped network.
Phylogenetic and genealogy software — answering "common ancestor and path distance" queries on enormous evolution trees and family trees.
Game and graphics scene graphs — propagating transforms or visibility queries along a deep hierarchy of game objects.
Compilers and program analysis — dominator-tree queries used in optimization passes lean on heavy-path or related decompositions.
When NOT to Use
When the structure is a general graph, not a tree — heavy-light only exploits the unique-path property of trees; cycles break the chain decomposition.
When the tree changes shape (links and cuts) frequently — HLD assumes a static tree; use Sleator-Tarjan link/cut trees or Euler-tour trees for dynamic ones.
When there are only a handful of queries: building the chains and the segment tree adds setup work and a lot of code, and a simple O(n) walk along each path is hard to beat.
Common Mistakes
Picking the heavy child by depth instead of subtree size — the log n light-edge bound depends on the doubling argument over subtree sizes.
Forgetting to recurse into the heavy child before the light children, so chain positions are no longer contiguous in the flat array.
Querying a path without lifting the deeper chainHead first — leaving both endpoints in different chains and returning the wrong aggregate.
Try It with an AI Assistant
short
Write decompose(tree) assigning each node to a heavy chain so any path crosses O(log n) chains.
behavior
Write a function on a rooted tree that, for every node, identifies its child with the largest subtree as its 'heavy' child. Then run a DFS that always visits the heavy child first. Each time you cross a non-heavy edge, you start a new chain whose head is the new node. Record chainHead[v] and a flat position pos[v] for every node so each chain is a contiguous slice of the position array.
It made robust model fitting possible in messy data, especially computer vision, image stitching, 3D reconstruction, and sensor processing.
best ← NULL
best_inliers ← empty
FOR i FROM 1 TO max_iter
    sample ← random_subset(data, k)
    model ← fit(sample)
    inliers ← points within eps of model
    IF |inliers| > |best_inliers| THEN
        best ← model
        best_inliers ← inliers
    ENDIF
ENDFOR
RETURN refit(best, best_inliers)
Real-world measurements often contain bad outliers. RANSAC changed the strategy: instead of trusting all data, repeatedly sample small groups, fit a model, and keep the model that agrees with many points.
Teaches: Outliers can't vote when the majority agrees
The Idea
Repeat for many iterations: pick the smallest possible random subset of points (e.g., 2 points if you're fitting a line, 3 for a circle), fit a candidate model to that tiny sample, then count how many of the full dataset agree with the model — those are the inliers (within distance eps). The model with the most inliers wins.
Why does it work? If outliers are a small fraction, then a randomly chosen pair of points is probably "clean" — both inliers — and the model fit to them passes through the true cluster. Outliers can't vote against you in any meaningful way: they don't fit the candidate model that matches the real signal, so they get filtered out as non-inliers. After enough random tries, you almost certainly hit a clean sample at least once. The invariant: the best model so far is the one with the most agreeing points, full stop.
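The loop described above might look like this for fitting a line. Treat it as a sketch: the function names, the default eps and max_iter, the fixed seed, and the use of vertical residuals in place of true point-to-line distance are all assumptions made to keep the example short.

```python
import random


def fit_line(points):
    """Least-squares line y = m*x + b through two or more points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def ransac_line(points, eps=0.5, max_iter=100, seed=0):
    """Return (model, inliers) for the line with the largest consensus set."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(max_iter):
        p, q = rng.sample(points, 2)   # minimal sample for a line
        if p[0] == q[0]:
            continue                   # vertical pair: y = m*x + b cannot fit it
        m, b = fit_line([p, q])
        # Vertical residual |y - (m*x + b)| stands in for distance to the line.
        inliers = [(x, y) for x, y in points if abs(y - (m * x + b)) <= eps]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if not best_inliers:
        raise ValueError("no usable sample was drawn")
    # Refit to the whole consensus set rather than returning the 2-point fit.
    return fit_line(best_inliers), best_inliers


data = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (3, 10)]
model, inliers = ransac_line(data)
```

The data list matches the points in the trace below: six points on y = x and one outlier at (3, 10). Any sample made of two clean points produces the six-point consensus set, and the final refit uses all six.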
Trace
i | sample | candidate model | inliers within eps | best so far
1 | (3,10), (5,5) | y = -2.5x + 17.5 | only the two sampled = 2 | best=2
2 | (1,1), (4,4) | y = x | (1,1),(2,2),(3,3),(4,4),(5,5),(6,6) = 6 | best=6
3 | (2,2), (3,10) | y = 8x − 14 | only the two sampled = 2 | still 6
4 | (5,5), (6,6) | y = x | 6 inliers (matches iter 2) | still 6
Where It's Used Today
Image stitching (panoramas) — phones aligning two overlapping photos pick the rotation/translation that most matched feature points agree on.
3D reconstruction (SfM, photogrammetry) — recovering a 3D model from many photos depends on RANSAC at every step to reject mismatched points.
Self-driving cars — fitting road planes and lane markings from noisy LIDAR while ignoring rain, dust, and reflections.
Robotics SLAM — robots building a map of their surroundings use RANSAC to fit walls and reject sensor glitches.
Astronomy and physics — fitting orbital parameters or signal lines through measurements polluted with cosmic-ray hits.
When NOT to Use
When more than half the data are outliers — random samples are unlikely to be clean and the consensus disappears.
When the noise is Gaussian without gross errors — least squares is faster, gives a closed form, and handles it correctly.
When the model is high-dimensional and needs many points to fit — the chance of drawing a clean minimal sample drops exponentially.
Common Mistakes
Picking too few iterations for the outlier rate, so the algorithm rarely sees a clean sample and returns a junk model.
Setting the inlier threshold eps by guesswork instead of from the actual noise scale, either accepting outliers or rejecting good points.
Returning the candidate fit instead of refitting to all inliers at the end — losing the precision the inlier set could give.
Write a function that, given a set of points and a model class (like a line), repeats the following: pick the smallest random subset of points required to fit the model, fit a candidate model to that subset, then count how many points in the full dataset lie within a distance threshold of that model. Return the candidate with the most agreeing points, refit to all of them.
It made unsupervised visualization of complex data practical before modern deep learning tools became common.
inputs ← [1.0]
weights ← [0.0, 0.2, 0.7]
lr ← 0.5
radius ← 1
// Self-Organizing Map step
FOR EACH x IN inputs
    bmu ← argmin_node ||x - w_node||
    FOR EACH node n
        h ← neighborhood(bmu, n, radius)
        w_n ← w_n + lr * h * (x - w_n)
    ENDFOR
    decay(lr, radius)
ENDFOR
RETURN weights
Teuvo Kohonen’s self-organizing map offered a way for high-dimensional data to arrange itself on a low-dimensional grid. Nearby neurons learned to represent nearby patterns.
Teaches: Learn structure by adapting to input proximity
The Idea
Place a grid of "nodes," each carrying a weight vector w of the same dimension as your data. Initialize the weights randomly. Then for each input x: find the node whose weights are closest to x — the best-matching unit (BMU). Pull the BMU's weights a small step toward x. Pull the BMU's grid-neighbors toward x too, but by a smaller amount that fades with grid distance (a Gaussian "neighborhood function").
Why does this produce a nicely organized map? Because nearby grid nodes get pulled by the same input, they tend to develop similar weights — preserving topology: items that look alike in input space end up at nearby positions on the grid. Over time, both the learning rate lr and the neighborhood radius shrink, so early epochs spread coarse structure across the whole grid and later epochs sharpen the local detail. The invariant is gentle: each node's weights migrate toward inputs it is closest to, while staying coupled to its grid neighbors.
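A single update step on a tiny one-dimensional map can be written out directly. This sketch assumes scalar weights arranged in a chain and a Gaussian neighborhood exp(−d² / (2·radius²)) over grid distance d; the function name som_step is made up for the example, and decay of lr and radius between steps is left to the caller.

```python
import math


def som_step(weights, x, lr, radius):
    """Apply one SOM update for input x to a 1-D chain of scalar weights.

    Returns (bmu_index, new_weights)."""
    # Best-matching unit: the node whose weight is closest to x.
    bmu = min(range(len(weights)), key=lambda n: abs(x - weights[n]))
    new_weights = []
    for n, w in enumerate(weights):
        grid_dist = abs(n - bmu)
        # Gaussian neighborhood: 1 at the BMU, fading with grid distance.
        h = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        new_weights.append(w + lr * h * (x - w))
    return bmu, new_weights


bmu, updated = som_step([0.0, 0.2, 0.7], x=1.0, lr=0.5, radius=1)
```

With the numbers from the pseudocode, node 2 is the BMU and moves halfway toward the input, from 0.7 to 0.85. Nodes 1 and 0 are also pulled toward 1.0, each scaled by a smaller neighborhood factor.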
Trace
node | initial w | distance (x − w) | role
0 | 0.0 | 1.0 |
1 | 0.2 | 0.8 |
2 | 0.7 | 0.3 | BMU
Where It's Used Today
Customer segmentation — retailers use SOMs to group shoppers with similar purchase histories onto a 2D grid for marketing teams to inspect.
Fraud and intrusion detection — banks and security systems use SOMs to flag transactions that fall in sparsely populated regions of the map.
Genomics — biologists project gene-expression profiles onto SOMs to see which conditions group together.
Process monitoring in factories — sensor readings from a paper mill or steel plant get mapped onto a SOM; abnormal readings show up in unfamiliar grid cells.
Data visualization — before t-SNE and UMAP, SOMs were the standard way to give humans a flat picture of high-dimensional data; they're still used as a teaching tool and a baseline.
When NOT to Use
When you need a faithful low-dimensional embedding for downstream models — t-SNE, UMAP, or autoencoders preserve local structure better than a fixed grid.
When the data is labeled and the goal is classification — supervised methods (logistic regression, neural nets, gradient boosting) will beat an unsupervised SOM.
When the data is high-dimensional but very sparse (text, click streams) — Euclidean distance to a dense weight vector is a poor signal; use embeddings or cosine-similarity clustering.
Common Mistakes
Forgetting to decay the learning rate and neighborhood radius — the map keeps oscillating and never settles, or wipes out earlier organization with each new sample.
Initializing weights to all zeros (or all the same value) — every node is equally close to every input, so the first BMU is arbitrary and the map fails to spread.
Not normalizing features that live on different scales — a single large-magnitude feature dominates the BMU search and the map only organizes along that axis.
Try It with an AI Assistant
short
Write a class SOM(width, height, dim) with train(samples, epochs, lr0, sigma0) and bmu(x) returning the (i, j) of the best-matching unit; use a Gaussian neighborhood that decays each epoch.
behavior
Train a grid of nodes by moving the winning node and its neighbors closer to each input vector.
For instance: Find a good delivery route by sometimes accepting worse moves early.
state ← 5
temp ← 5.0
best ← state
WHILE temp > 0.001
    next ← randomNeighbor(state)
    delta ← cost(next) - cost(state)
    IF delta < 0 OR random() < exp(-delta / temp) THEN
        state ← next
    ENDIF
    IF cost(state) < cost(best) THEN
        best ← state
    ENDIF
    temp ← temp * 0.99
ENDWHILE
RETURN best
In 1983, Scott Kirkpatrick, Daniel Gelatt, and Mario Vecchi at IBM's T.J. Watson Research Center in Yorktown Heights published Optimization by Simulated Annealing in Science, applying a 1953 Monte Carlo trick from statistical physics — Metropolis et al.'s acceptance rule for sampling thermal equilibria — to combinatorial optimization. They demonstrated it on chip placement, the same hard layout problem IBM was wrestling with internally, and showed that gentle "cooling" let the search escape the local minima that crippled greedy methods. The paper's elegant physics analogy made the technique an instant favourite, and within years it was being used for everything from VLSI design to airline crew scheduling.
Teaches: Accept bad moves early to escape local traps, then settle
The Idea
Borrow the trick from metallurgy. When you cool molten metal slowly, the atoms have time to wiggle into their lowest-energy arrangement. Cool it too fast and the atoms freeze in a messy, suboptimal pattern. Simulated annealing applies that idea to a search:
Start at any solution and a high "temperature." Repeatedly pick a random neighbor and compute the change in cost, delta. If the neighbor is better (delta < 0), always move to it. If the neighbor is worse, sometimes still accept it — with probability exp(−delta / temp). At high temperatures, even bad jumps look acceptable, so you explore widely. As the temperature gradually drops, the algorithm becomes pickier and pickier; eventually it accepts only improvements, settling into the best valley it has wandered into. The key is to cool slowly so the search has time to escape shallow traps before locking in.
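The acceptance rule and cooling loop translate almost line for line into Python. The sketch below is only illustrative: the signature of anneal, the geometric cooling factor, the seeded generator, and the toy neighborhood (one step left or right along a line of eight positions) are assumptions, with the cost table taken from the trace that follows.

```python
import math
import random

COST = [0, 1, 2, 3, 2, 1, 2, 3]  # cost(x) for x = 0..7, as in the trace


def line_neighbors(x):
    """Neighbors of x on the line 0..7: one step left or right, if in range."""
    return [n for n in (x - 1, x + 1) if 0 <= n < len(COST)]


def anneal(start, cost, neighbors, temp=5.0, cooling=0.99,
           min_temp=0.001, seed=0):
    """Return the lowest-cost state visited during the cooling schedule."""
    rng = random.Random(seed)
    state = best = start
    while temp > min_temp:
        nxt = rng.choice(neighbors(state))
        delta = cost(nxt) - cost(state)
        # Always take improvements; take a worse move with prob exp(-delta/temp).
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            state = nxt
        if cost(state) < cost(best):
            best = state  # remember the best state ever seen
        temp *= cooling
    return best


best = anneal(5, lambda x: COST[x], line_neighbors)
```

Starting at x = 5 puts the search in a local dip of cost 1, separated from the global minimum at x = 0 by a ridge of cost 3. Because best only changes on a strict improvement, the returned state is never worse than the start, and while the temperature is high the walk usually crosses the ridge.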
Trace
x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7
cost(x) | 0 | 1 | 2 | 3 | 2 | 1 | 2 | 3
Where It's Used Today
Chip design — placing billions of transistors on a silicon die so wires are short and heat is balanced.
Vehicle routing and scheduling — finding good delivery routes, airline crew schedules, and shift assignments when exact optimization is infeasible.
Protein folding research — exploring many candidate molecular shapes to find low-energy configurations.
Machine learning — training certain types of neural networks (notably Boltzmann machines) and tuning hyperparameters.
Game and puzzle solving — solving large Sudoku grids, scheduling tournaments, and packing shapes into containers.
When NOT to Use
When the problem has a known polynomial-time exact algorithm (shortest path, MST, matching) — annealing is slower and gives no quality guarantee.
When the cost landscape is smooth and convex — gradient descent or Newton's method finds the minimum in a fraction of the time.
When you need a provably optimal solution (legal/financial settings) — annealing is heuristic; if it returns the wrong answer you have no certificate of optimality.
Common Mistakes
Cooling too quickly (temp *= 0.5): the search freezes before escaping the first local trap and ends up no better than greedy hill-climbing.
Forgetting to keep a separate best variable and returning the current state — the algorithm may end on an accepted worse move and never report the minimum it visited.
Computing exp(-delta / temp) with temp = 0 or extremely small temp — produces division-by-zero or underflow; clamp the temperature floor.
Try It with an AI Assistant
short
Write anneal(state, schedule) returning the best state found via simulated annealing.
behavior
Write a search procedure that starts from an initial solution with a high 'temperature' that gradually decreases. At each step, pick a random neighboring solution. Always accept it if it lowers the cost; if it raises the cost by an amount delta, accept it only with probability exp(−delta / temperature). Track the best solution ever seen and return it when the temperature drops below a small threshold.
For instance: Find the lowest temperature in any date range quickly.
arr ← [3, 1, 4, 1, 5, 9, 2, 6]
l ← 2
r ← 6
n ← length(arr)
FOR i FROM 0 TO n-1
    st[0][i] ← arr[i]
ENDFOR
j ← 1
WHILE 2^j <= n
    i ← 0
    WHILE i + 2^j <= n
        st[j][i] ← min(st[j-1][i], st[j-1][i + 2^(j-1)])
        i ← i + 1
    ENDWHILE
    j ← j + 1
ENDWHILE
k ← floor(log2(r-l+1))
RETURN min(st[k][l], st[k][r - 2^k + 1])
The basic idea — precomputing minima over power-of-two windows — circulated in the algorithms community for years, but Michael Bender and Martín Farach-Colton's 2000 paper The LCA Problem Revisited made the construction famous by showing it could reduce lowest common ancestor queries on trees to range-minimum queries on an Euler tour, giving O(1) LCA after linear preprocessing. The 1984 date in the literature points to earlier RMQ work; the technique itself is now a staple of competitive programming and a common interview building block.
Teaches: Trade preprocessing for instant repeated answers
The Idea
Build a 2-D table st[j][i] where each entry stores the minimum of the block of length 2^j starting at index i. Row j = 0 is just the array itself. Row j builds from row j − 1 by combining two half-blocks: st[j][i] = min(st[j−1][i], st[j−1][i + 2^(j−1)]). The whole table has n columns and log n rows.
To answer a query on [l, r], find the largest k such that 2^k ≤ r − l + 1. The two blocks st[k][l] and st[k][r − 2^k + 1] overlap and together cover exactly [l, r]. Because min is idempotent (min(x, x) = x), overlapping is harmless, so the answer is min(st[k][l], st[k][r − 2^k + 1]). The invariant: every query range is covered by two power-of-two blocks already in the table, found in constant time.
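Build and query fit in a dozen lines of Python. In this sketch the function names are assumptions, the table is a list of rows of shrinking length, and floor(log2(length)) is taken with the integer method bit_length rather than floating-point math.

```python
def build_sparse_table(arr):
    """st[j][i] = min(arr[i : i + 2**j]); row j has n - 2**j + 1 entries."""
    n = len(arr)
    st = [list(arr)]
    j = 1
    while (1 << j) <= n:
        prev, half = st[j - 1], 1 << (j - 1)
        # Each block of length 2**j is the min of its two halves.
        st.append([min(prev[i], prev[i + half])
                   for i in range(n - (1 << j) + 1)])
        j += 1
    return st


def range_min(st, l, r):
    """Minimum of arr[l..r] (inclusive, 0-based) using two table lookups."""
    k = (r - l + 1).bit_length() - 1  # integer floor(log2(length))
    return min(st[k][l], st[k][r - (1 << k) + 1])


st = build_sparse_table([3, 1, 4, 1, 5, 9, 2, 6])
```

For the example query l = 2, r = 6 the length is 5, so k = 2 and the answer is min(st[2][2], st[2][3]), which is 1.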
Trace
j | block length | st[j][0..n−2^j]
0 | 1 | [3, 1, 4, 1, 5, 9, 2, 6]
1 | 2 | [1, 1, 1, 1, 5, 2, 2]
2 | 4 | [1, 1, 1, 1, 2]
3 | 8 | [1]
Where It's Used Today
Competitive programming — the standard tool for range-min/max queries when the array doesn't change between queries.
Read-only data analysis — answering "minimum stock price in a given window" or "lowest sensor reading between timestamps" instantly across millions of queries.
Bioinformatics — RMQ underpins suffix-array-based string matching used in genome assembly and read alignment.
Lowest common ancestor (LCA) — Bender and Farach-Colton famously reduced LCA queries on trees to RMQ on an Euler-tour array, making LCA queries effectively O(1).
Image processing — answering "darkest pixel in this row segment" for static images during streaming filters.
When NOT to Use
When the underlying array changes between queries — sparse tables are static; even a single update forces a full rebuild. Use a segment tree or Fenwick tree instead.
When the operation isn't idempotent (sum, XOR, product) — overlapping the two power-of-two blocks double-counts the intersection. Sparse tables only work for min, max, gcd, and similar idempotent operations.
When n is small or queries are few — the O(n log n) build cost isn't justified; a simple loop over the range per query is fine.
Common Mistakes
Picking k as ceil(log2(r-l+1)) instead of floor — the chosen blocks then extend past r and you read minimums of regions that lie outside the query range.
Trying to use sparse tables for sum queries — the overlap of the two blocks gets counted twice; use prefix sums or a Fenwick tree instead.
Computing floor(log2(...)) with floating-point math.log2 on large lengths — rounding errors flip k by one; precompute a log[] array of integer logs to be safe.
Try It with an AI Assistant
short
Write build/query for a sparse table answering range-minimum queries in O(1).
behavior
Write a function that, given an array, builds a 2-D table where row j column i stores the minimum of the block of length 2^j starting at i. Then write a query function that, for a range [l, r], picks k = floor(log2(r − l + 1)) and returns the minimum of two table entries: row k starting at l, and row k starting at r − 2^k + 1.
Made fair sampling from unknown-size streams possible.
For instance: Pick 100 random tweets from a live stream without storing all tweets.
result ← first k items FROM stream
i ← k
WHILE stream has next item
    x ← next item
    j ← random_int(0, i)
    IF j < k THEN
        result[j] ← x
    ENDIF
    i ← i + 1
ENDWHILE
RETURN result
Jeffrey Vitter, then a young researcher at Brown University, formalized "Algorithm R" in 1985 — building on a folk technique used by tape-era statisticians who needed a uniform sample from data they could only read once. The proof is a small marvel of induction: each item that ever passes by ends up with the same probability k/n of sitting in the reservoir at the end, even though the algorithm never knew the value of n. Decades later, the same procedure samples log entries at Twitter, A/B-test users at any web platform, and rows from billion-row tables in BigQuery — anywhere the data is bigger than the memory you have to hold it.
Teaches: Sample uniformly without knowing total size
Anecdote
Jeffrey Scott Vitter refined earlier ideas into efficient forms. The algorithm became essential when data started arriving as streams you can't store — a problem that barely existed when it was first proposed.
The Idea
Fill result with the first k items as-is. Then, for the (i+1)-th item that arrives (with i starting at k), pick a random integer j in [0, i]. If j < k, replace result[j] with the new item; otherwise drop the new item. That's the entire algorithm — you never need to know how many items will eventually arrive.
Why does it work? You can prove by induction that after seeing n items, every one of them sits in the reservoir with exactly probability k / n. The key step: when the n-th item arrives, it is kept with probability k / n (because j < k happens k out of n times), and any specific older item survives this round with probability 1 − (1/n), which combined with its previous k/(n−1) chance gives k / n again. The result is uniform sampling from a stream of unknown length using only O(k) memory.
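The whole procedure is short enough to write out. This sketch assumes the stream is any Python iterable and takes an optional random.Random instance (a parameter added here so runs can be made repeatable); it follows the index convention above, where i counts the items seen before the current one.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of
    unknown length (or every item, if fewer than k arrive)."""
    if rng is None:
        rng = random.Random()
    result = []
    for i, x in enumerate(stream):   # i = number of items seen before x
        if i < k:
            result.append(x)         # seed the reservoir with the first k
        else:
            j = rng.randint(0, i)    # inclusive on both ends: [0, i]
            if j < k:
                result[j] = x        # keep x with probability k / (i + 1)
    return result
```

For example, reservoir_sample(iter(range(1000)), 5) returns five distinct numbers below 1000 while holding only five items at a time. A stream shorter than k simply comes back whole.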
Trace
step | x | i | j (random in [0..i]) | action | result
- | - | 3 | - | seed first 3 | [A, B, C]
1 | D | 3 | 1 | j < 3 → replace[1] | [A, D, C]
2 | E | 4 | 4 | j ≥ 3 → drop | [A, D, C]
3 | F | 5 | 0 | j < 3 → replace[0] | [F, D, C]
4 | G | 6 | 5 | j ≥ 3 → drop | [F, D, C]
5 | H | 7 | 2 | j < 3 → replace[2] | [F, D, H]
Where It's Used Today
Server log sampling — sites like Twitter and Cloudflare keep a uniform random sample of incoming requests for monitoring, without storing every request.
A/B testing pipelines — randomly selecting users to include in a metric without knowing the day's total user count up front.
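Database query results: Postgres, BigQuery, and Spark all offer TABLESAMPLE for cheap random samples over huge tables. Those clauses are generally Bernoulli or block based, so a reservoir is the usual alternative when a fixed-size sample of exactly k rows is wanted.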
Distributed systems — picking a random worker from a stream of heartbeats, or sampling errors from a Kafka topic with bounded memory.
ML training — when training data arrives as a stream too large for disk, reservoir sampling provides a uniform mini-set for validation or distillation.
When NOT to Use
When the stream length is known and fits in memory — Fisher-Yates shuffle with a slice is simpler and avoids per-item RNG calls.
When items have unequal weights — Algorithm R assumes uniform sampling; for weighted streams use A-Res or Chao's variant.
When the stream is huge and you want speed — Algorithm L (geometric skip) draws fewer random numbers and is much faster than calling rand() for every item.
Common Mistakes
Drawing j from [0, k) instead of [0, i], which makes every new item overwrite a slot and biases the sample toward the end of the stream.
Off-by-one on the index counter i — starting it at 0 instead of k makes early items more likely to be replaced than later ones.
Re-seeding the RNG inside the loop — repeated identical seeds make every replacement decision the same and destroy uniformity.
Try It with an AI Assistant
short
Write reservoir_sample(stream, k) returning a uniform random sample of k elements from a stream of unknown length.
behavior
Write a function that reads items from a stream and keeps an array of the first k. After that, for the i-th item seen (starting i = k for the (k+1)-th item), pick a random integer j between 0 and i inclusive. If j is less than k, overwrite the j-th slot with the new item; otherwise discard it. Return the array when the stream ends.
Made training multilayer neural networks practical.
For instance: A model learns which internal weights caused a wrong prediction.
pred ← sigmoid(dot(w, x))
error ← pred - y
FOR i FROM 0 TO length(w)-1
    grad ← error * pred * (1-pred) * x[i]
    w[i] ← w[i] - lr * grad
ENDFOR
RETURN w
Backpropagation finally allowed deep layered networks to adjust internal parameters systematically using gradient information, igniting modern machine learning.
Neural networks needed efficient learning for many interconnected weights.
Teaches: Send the error backward to assign blame to each parameter
The Idea
Compute the prediction first: pred = sigmoid(w · x). Compare it with the target: error = pred − y. Now figure out, for each weight w[i], how much a tiny change in that weight would change the loss, taken here to be the squared error ½ · error². The chain rule of calculus says: the gradient with respect to w[i] is error · pred · (1 − pred) · x[i]. Subtract a small step in that direction (w[i] ← w[i] − lr · grad), and the loss gets a little smaller.
Why does this work? Write z = w · x for the sigmoid's input. The chain rule multiplies three sensitivities: error is how fast the loss changes with pred, pred · (1 − pred) is the derivative of the sigmoid (how fast pred changes with z at the current point), and x[i] is how fast z changes with w[i]. Their product is the slope of the loss surface in the direction of w[i]. Walking downhill on that slope (a small step proportional to lr) lowers the loss. Repeat this on many examples and the network learns.
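The update can be sketched in a few lines of Python. This is only an illustration for a single sigmoid unit: the name train_step is hypothetical, w = [0.2, 0.4], x = [1.0, 0.5] and lr = 0.5 are the values the trace uses, and y = 1 is an assumption inferred from the trace's error of about −0.4013.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_step(x: List[float], y: float, w: List[float], lr: float) -> List[float]:
    """One gradient step for a single sigmoid unit under the loss 0.5 * (pred - y) ** 2."""
    pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    error = pred - y
    # delta is shared by every weight: d(loss)/dz = error * sigmoid'(z)
    delta = error * pred * (1.0 - pred)
    # Return a new list so every gradient is computed from the old weights.
    return [wi - lr * delta * xi for wi, xi in zip(w, x)]


new_w = train_step(x=[1.0, 0.5], y=1.0, w=[0.2, 0.4], lr=0.5)
print([round(v, 4) for v in new_w])
```

Both weights move upward here because the prediction was too low and both inputs are positive.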
Trace
| i | x[i] | grad = error · pred · (1−pred) · x[i] | new w[i] = w[i] − lr · grad |
|---|---|---|---|
| 0 | 1.0 | −0.4013 · 0.2403 · 1.0 ≈ −0.0964 | 0.2 − 0.5 · (−0.0964) ≈ 0.2482 |
| 1 | 0.5 | −0.4013 · 0.2403 · 0.5 ≈ −0.0482 | 0.4 − 0.5 · (−0.0482) ≈ 0.4241 |
Where It's Used Today
Image recognition — every convolutional neural network (the kind that powers face unlock, photo search, medical-image diagnosis) is trained by backpropagation.
Large language models — GPT, Claude, Gemini, Llama — all of them learn their billions of weights by running this same algorithm at massive scale.
Speech recognition — Siri, Alexa, and Google Voice use deep nets trained by backprop on millions of hours of speech.
Self-driving cars — Tesla, Waymo, and Cruise train their perception nets with backprop on road footage.
Recommendation engines — YouTube, Netflix, and TikTok learn what to show you next by training neural ranking models with backprop on click data.
When NOT to Use
When the model isn't differentiable — discrete decision trees, hard k-means, or symbolic rule systems can't carry a gradient backward; use evolutionary search or specialized fitting instead.
When the dataset is tiny (a dozen examples) — gradient updates from such weak signals overfit immediately; a regularized linear or logistic model is more honest.
When you need provable convergence to a global optimum — gradient descent driven by backprop is only guaranteed to approach a stationary point (typically a local minimum), and on non-convex losses the result depends heavily on initialization.
Common Mistakes
Forgetting the sigmoid derivative pred · (1 − pred), so the gradient has the wrong magnitude (and, with other activations, often the wrong sign).
Setting lr too large — weights overshoot the minimum and the loss explodes within a few steps; too small and training never moves.
Updating weights during the inner loop in multi-layer nets, so later gradients are computed against partially-updated weights and the chain rule breaks.
Try It with an AI Assistant
short
Write train(x, y, w, lr) doing one backprop step on a single-layer net with sigmoid activation.
behavior
Write a function that takes input vector x, target y, weight vector w, and learning rate lr. Compute pred = sigmoid(dot(w, x)) and error = pred − y. For each weight w[i], compute grad = error · pred · (1 − pred) · x[i] and update w[i] ← w[i] − lr · grad. Return the updated w.
It made interpretable classification practical: computers could learn rules that humans could inspect and explain.
// ID3 decision tree
FUNCTION build(rows, attrs)
    IF all rows same class THEN
        RETURN Leaf(class)
    ENDIF
    IF attrs is empty THEN
        RETURN Leaf(majority class of rows)
    ENDIF
    best ← argmax_a info_gain(rows, a)
    node ← Split(best)
    FOR EACH value v of best
        subset ← rows where best = v
        node.child[v] ← build(subset, attrs - {best})
    ENDFOR
    RETURN node
END FUNCTION
ID3 made machine learning feel like a sequence of understandable questions. It selected the attribute that best split examples, building a tree of decisions from data.
Teaches: Split data by maximizing information gain
The Idea
Build the tree top-down by repeatedly asking, "Which single feature, if I split the examples by its value, best separates the labels?" The answer comes from information gain — how much the entropy (the "messiness") of the labels drops when you split. Pick the feature with the highest gain, make a node for it, and recurse on each subset.
The recursion stops when a subset is pure (all examples have the same label), and that subset becomes a leaf. The invariant: every leaf corresponds to a path of questions, and along that path the answer is unanimous in the training data. Greedily choosing the most informative split at each step doesn't always yield the smallest possible tree, but it produces a sensible, human-readable one at a cost of roughly (rows × features) work per level of the tree.
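To make the entropy and information-gain steps concrete, here is a small Python sketch run on the four rows of the trace. The nested-dict tree representation and the helper names (entropy, info_gain, build, classify) are choices made for this illustration, not part of any standard library.

```python
import math
from collections import Counter
from typing import Dict, List, Union

Row = Dict[str, str]
Tree = Union[str, dict]   # a leaf is a class label; an internal node is a dict


def entropy(rows: List[Row], target: str) -> float:
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def info_gain(rows: List[Row], attr: str, target: str) -> float:
    # Entropy before the split minus the weighted entropy of the subsets after it.
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - remainder


def build(rows: List[Row], attrs: List[str], target: str) -> Tree:
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:            # pure subset: make a leaf
        return labels[0]
    if not attrs:                        # no features left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    children = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        children[value] = build(subset, rest, target)
    return {"attr": best, "children": children}


def classify(tree: Tree, row: Row) -> str:
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree


rows = [
    {"Outlook": "Sunny", "Humidity": "High", "Play": "No"},
    {"Outlook": "Sunny", "Humidity": "Normal", "Play": "Yes"},
    {"Outlook": "Overcast", "Humidity": "High", "Play": "Yes"},
    {"Outlook": "Rain", "Humidity": "High", "Play": "No"},
]
tree = build(rows, ["Outlook", "Humidity"], "Play")
print(tree)
```

On these rows Outlook has the higher gain (0.5 bits versus about 0.31 for Humidity), so it becomes the root, and only the Sunny branch needs a second question. The sketch branches only on values that actually occur in the rows, so classify raises a KeyError for a value it never saw during training.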
Trace
| row | Outlook | Humidity | Play |
|---|---|---|---|
| 1 | Sunny | High | No |
| 2 | Sunny | Normal | Yes |
| 3 | Overcast | High | Yes |
| 4 | Rain | High | No |
Where It's Used Today
Medical diagnosis support — interpretable trees flag patients for further screening, with rules a doctor can audit.
Credit scoring — banks use small decision trees to make initial loan-approval decisions because regulators require explainability.
Random forests and gradient-boosted trees — modern Kaggle-winning models (XGBoost, LightGBM) are ensembles of many decision trees, built with CART-style binary splits that descend from the same greedy split-and-recurse idea.
Customer-churn analysis — marketing teams build decision trees on usage features to identify which subscribers are about to cancel.
Game AI behavior trees — many NPCs use a tree of conditions to choose actions; the structure resembles ID3's output even when it is authored by hand.
When NOT to Use
When features are continuous (height, price, time) — pure ID3 only handles categorical splits; use C4.5 or CART for numeric thresholds.
When the table is small relative to feature count — ID3 will overfit, building a perfect tree on training data that fails on new examples.
When you need top-tier predictive accuracy — single trees lose to random forests and gradient boosting; ID3 is for interpretability.
Common Mistakes
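Using raw accuracy (majority-class error) as the split criterion — it cannot see splits that make the subsets purer without changing the majority label; use information gain, or gain ratio when some features have many distinct values, since plain information gain favors those.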
Forgetting to remove the chosen attribute from the recursive call — the same feature gets re-split forever down the tree.
Treating an empty subset (no rows for a particular feature value) as a bug — instead emit a leaf with the parent's majority class.
Try It with an AI Assistant
short
Write decision_tree_id3(rows, attributes) that learns a decision tree using ID3 with information gain on categorical features.
behavior
Given a table of training examples (each with categorical feature values and a class label), build a tree where each internal node tests one feature, each branch from that node corresponds to one value of the feature, and each leaf is a class label. At every node, choose the feature whose split most reduces the entropy of the labels. Stop and make a leaf when all remaining examples share a label, or when no features remain.
It made memory-efficient rearrangement practical for editors, buffers, sorting subroutines, and low-level systems.
a ← [1, 2, 3, 4, 5, 6, 7]
n ← 7
k ← 3

FUNCTION reverse(a, lo, hi)
    WHILE lo < hi
        swap(a[lo], a[hi])
        lo ← lo + 1
        hi ← hi - 1
    ENDWHILE
END FUNCTION

reverse(a, 0, k - 1)
reverse(a, k, n - 1)
reverse(a, 0, n - 1)
RETURN a
The three-reverse trick was popularized by Jon Bentley's Programming Pearls in the 1980s, where he held it up as a model of clean engineering: when copying to a temp buffer feels obvious, ask whether a clever sequence of in-place moves achieves the same result with constant memory. The same idea — rotate by reversing parts then the whole — had been folklore among Unix kernel and editor authors for years (vi and emacs both use it for cut-and-shift), but Bentley's column made it part of every algorithms textbook.
Teaches: Transform structure using reversible local operations
The Idea
Three reverses do the job. First, reverse the front block a[0 .. k−1]. Second, reverse the back block a[k .. n−1]. Third, reverse the entire array. After the third pass, the array has rotated left by exactly k. Each individual reverse uses only a single temp variable to swap a pair of elements.
Why does it work? Think of the array as two strings: A = a[0..k−1] and B = a[k..n−1]. We want BA. Reversing each part gives us A^R B^R. Reversing the whole gives (A^R B^R)^R = (B^R)^R (A^R)^R = B A, exactly what we wanted. The invariant after step i of any single reverse is that the elements outside the swap window are unchanged and the i outermost pairs of the range have been swapped. Total work is about n swaps in all (k/2 + (n−k)/2 + n/2 = n), with constant extra memory.
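The three reverses translate almost line for line into Python. The sketch below is one possible rendering; the name rotate_left and the choice to reduce k modulo n up front are assumptions made here for safety.

```python
from typing import List


def reverse(a: List[int], lo: int, hi: int) -> None:
    """Reverse a[lo..hi] in place by swapping the outer pair and stepping inward."""
    while lo < hi:
        a[lo], a[hi] = a[hi], a[lo]
        lo += 1
        hi -= 1


def rotate_left(a: List[int], k: int) -> None:
    """Rotate a left by k positions in place using three reverses."""
    n = len(a)
    if n == 0:
        return
    k %= n                   # a k larger than n wraps around
    reverse(a, 0, k - 1)     # A   -> A^R
    reverse(a, k, n - 1)     # B   -> B^R
    reverse(a, 0, n - 1)     # A^R B^R -> B A


a = [1, 2, 3, 4, 5, 6, 7]
rotate_left(a, 3)
print(a)   # [4, 5, 6, 7, 1, 2, 3]
```

When k is 0 the first reverse is a no-op and the last two cancel each other, so the array comes back unchanged.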
Trace
| step | call | what changes | array after |
|---|---|---|---|
| 1 | reverse(a, 0, 2) | swap a[0]↔a[2] (1↔3); skip middle a[1] | [3, 2, 1, 4, 5, 6, 7] |
| 2 | reverse(a, 3, 6) | swap a[3]↔a[6] (4↔7); swap a[4]↔a[5] (5↔6) | [3, 2, 1, 7, 6, 5, 4] |
| 3 | reverse(a, 0, 6) | swap a[0]↔a[6], a[1]↔a[5], a[2]↔a[4] | [4, 5, 6, 7, 1, 2, 3] |
Where It's Used Today
Text editors — moving a block of text up or down within a buffer (cut-and-shift) uses the three-reverse trick on the underlying character array.
Ring buffers — communication queues and audio pipelines occasionally rotate the contents to realign read/write pointers without copying to a new buffer.
Sorting subroutines — some in-place merge and partition steps rotate runs of elements to merge them without temporary storage.
Image processing — rotating a row of pixels for column-wise effects, or scrolling a tile across a fixed framebuffer in retro game engines.
Embedded systems — devices with tight RAM budgets (microcontrollers, sensor nodes) cannot afford a second array, so the three-reverse trick is the rotation of choice.
When NOT to Use
When extra memory is freely available and you only rotate occasionally — a single slice-and-concat (a[k:] + a[:k]) is shorter, clearer, and just as fast.
When the underlying structure is a linked list — there are no random-access slots to swap; rewire pointers instead.
When you need to rotate by a non-integer or fractional offset (e.g. circular-buffer reads with sub-element granularity) — the three-reverse trick is integer-only.
Common Mistakes
Forgetting to reduce k modulo n first, so a k larger than n rotates the wrong amount or addresses out of bounds in reverse(a, 0, k - 1).
Using reverse(a, k, n) instead of reverse(a, k, n - 1) — runs off the end and corrupts adjacent memory.
Performing only two reverses out of three and shipping it: the array ends up partially reversed instead of rotated.
Try It with an AI Assistant
short
Write rotate(a, k) that rotates the array a left by k positions in place using the three-reverse trick.
behavior
Given an array of length n and a number k, rearrange the array in place so the first k elements move to the end. Do it without allocating another array. Use a helper that reverses a slice by repeatedly swapping the outer pair and stepping inward. Reverse the first k elements, reverse the rest, then reverse the entire array.
Made fast multi-pattern text search with hashing possible.
For instance: Detect copied passages by comparing rolling fingerprints.
h_pat ← hash(pattern)
h ← hash(text[0..m-1])
FOR i FROM 0 TO n - m
    IF h = h_pat AND text[i..i+m-1] = pattern THEN
        RETURN i
    ENDIF
    IF i < n - m THEN
        h ← roll(h, text[i], text[i+m])
    ENDIF
ENDFOR
RETURN -1
Richard Karp (Berkeley) and Michael Rabin (Harvard / Hebrew University) introduced the algorithm in their 1987 IBM Journal paper "Efficient randomized pattern-matching algorithms." Their key insight was to apply randomization to a problem that had previously been treated as purely deterministic: by hashing windows instead of comparing characters, they reduced the expected work from O(nm) to O(n + m) while accepting a tiny probability of a false match whenever the final character check is skipped. The same rolling-hash trick later inspired rsync's delta transfers and modern content-defined chunking systems used in backup software.
Teaches: Rolling hash skips mismatched windows cheaply
The Idea
Choose a rolling hash — a hash function that, given the hash of text[i..i+m−1], can compute the hash of text[i+1..i+m] in constant time without re-reading the whole window. A common choice is a polynomial hash: h = (text[i]·b^(m−1) + text[i+1]·b^(m−2) + ... + text[i+m−1]) mod a prime, where b is a base like 256.
Compute h_pat = hash(pattern) once. Compute h = hash(text[0..m−1]) for the first window. Then for each window: if h == h_pat, double-check by character comparison (because hashes can collide) and return on a true match. If the window is not a match, roll the hash forward by subtracting the contribution of the leaving character and adding the contribution of the entering character. For the polynomial hash above this is h ← ((h − text[i]·b^(m−1))·b + text[i+m]) mod the same prime. The invariant: h always equals the hash of the current window text[i..i+m−1]. Expected runtime is O(n + m); the worst case (lots of collisions) is still O(n × m), but with good hash parameters that almost never happens.
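A compact Python sketch of the search with a polynomial rolling hash follows. The base of 256 matches the text; the modulus 1,000,000,007 and the convention that an empty pattern matches at index 0 are assumptions made for the example.

```python
def rabin_karp(text: str, pattern: str,
               base: int = 256, mod: int = 1_000_000_007) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)          # weight of the window's leading character
    h_pat = h = 0
    for i in range(m):                    # hash the pattern and the first window
        h_pat = (h_pat * base + ord(pattern[i])) % mod
        h = (h * base + ord(text[i])) % mod
    for i in range(n - m + 1):
        # Verify on a hash match, because different windows can collide.
        if h == h_pat and text[i:i + m] == pattern:
            return i
        if i < n - m:
            # Drop text[i], shift the rest up one power, append text[i + m].
            h = ((h - ord(text[i]) * high) * base + ord(text[i + m])) % mod
    return -1


print(rabin_karp("ABRACADABRA", "BRA"))   # 1
```

Python's % always returns a non-negative result for a positive modulus, so the subtraction in the roll step needs no extra correction; in C or Java the intermediate value would have to be normalized by hand.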
Trace
| i | window | h | h == h_pat? | char check | action |
|---|---|---|---|---|---|
| 0 | "ABR" | 21 | yes | "ABR" ≠ "BRA" | collision, roll |
| 1 | "BRA" | 21 | yes | "BRA" = "BRA" | RETURN 1 |
Where It's Used Today
Plagiarism detection — Turnitin, Moss, and similar tools fingerprint document windows with rolling hashes and compare against a database.
Multiple-pattern search — searching for thousands of patterns at once is one hash table lookup per window, a natural fit for signature scanning in tools such as network intrusion detection systems.
rsync's "delta" transfer — file-sync uses a rolling checksum, a close relative of the Rabin-Karp rolling hash, to find which chunks of a file have changed without sending the whole file.
Bioinformatics — DNA short-read alignment and seed-finding (e.g., minimizers in modern aligners) build on rolling-hash ideas.
Content-defined chunking — backup systems like Borg and restic split files at "fingerprint boundaries" computed with rolling hashes.
When NOT to Use
When you need guaranteed worst-case linear time on a single pattern — KMP or Boyer-Moore avoid the collision blow-up Rabin-Karp risks.
When the alphabet is tiny and the pattern is very short — naive scan is just as fast and avoids hash setup.
When you cannot tolerate even a one-in-a-billion false positive (e.g., crypto contexts) — without the verification step, hashes can collide.
Common Mistakes
Skipping the character-by-character verification after a hash match, accepting collisions as real matches.
Recomputing the hash from scratch on every window instead of rolling it, throwing away the algorithm's whole point.
Using a hash modulus that's too small (or a power of two), producing far more collisions than a prime would.
Try It with an AI Assistant
short
Search substring using rolling hash comparison over sliding text windows.
behavior
Write a function that searches for a pattern of length m inside a longer text. First compute a numeric fingerprint of the pattern, and a fingerprint of the first m-character window of the text. Slide the window forward one character at a time, updating the window fingerprint cheaply by removing the contribution of the character that leaves and adding the contribution of the character that enters. When fingerprints match, double-check by comparing characters. Return the first match index or -1.
For instance: Find someone’s 64th ancestor with precomputed jumps.
up[0] ← parent
FOR j FROM 1 TO LOG
    FOR EACH node v
        up[j][v] ← up[j-1][ up[j-1][v] ]
    ENDFOR
ENDFOR
FOR j FROM 0 TO LOG
    IF k & (1 << j) THEN
        v ← up[j][v]
    ENDIF
ENDFOR
RETURN v
Binary lifting emerged from late-1980s academic work on Lowest-Common-Ancestor data structures (Bender & Farach-Colton's surveys give the cleanest write-up) and was popularised in the 1990s and 2000s as the standard trick on competitive-programming circuits, where problems on trees with millions of nodes routinely turn up. The construction is now a one-screen template that every serious contestant has memorised, and it underlies LCA libraries shipped with most algorithm-competition toolkits.
Teaches: Precompute power-of-two jumps to leap by binary digits
The Idea
Build a 2D table up[j][v] meaning "the 2^j-th ancestor of node v." Row 0 is just each node's parent. Row j is computed from row j − 1 using a beautiful identity: the 2^j-th ancestor of v is the 2^(j−1)-th ancestor of the 2^(j−1)-th ancestor of v. So filling the table costs O(n log n) total.
To find the k-th ancestor of any v, write k in binary and consume one bit at a time. Each set bit at position j says "jump up by 2^j," which we read off the table in O(1). After at most ⌊log₂ k⌋ + 1 jumps we land on the correct ancestor. The same table also answers Lowest-Common-Ancestor queries in O(log n): lift the deeper node up to match the depth of the shallower one, then lift both in sync until just before they meet.
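The table build and the bit-by-bit query can be sketched in Python as follows. This is an illustration under stated assumptions: the root's parent is encoded as -1, -1 is also returned when no k-th ancestor exists, and the names build_up_table and kth_ancestor are chosen here. The example chain 7 → 6 → 5 → … → 0 is picked so that it agrees with the trace.

```python
from typing import List


def build_up_table(parent: List[int]) -> List[List[int]]:
    """up[j][v] is the 2**j-th ancestor of v, or -1 if it does not exist.

    parent[v] is v's parent, with -1 marking the root (an assumed convention).
    """
    n = len(parent)
    log = max(1, (n - 1).bit_length())    # enough rows for any k below n
    up = [parent[:]]
    for j in range(1, log):               # fill row j fully before row j + 1
        prev = up[j - 1]
        up.append([-1 if prev[v] == -1 else prev[prev[v]] for v in range(n)])
    return up


def kth_ancestor(up: List[List[int]], v: int, k: int) -> int:
    """Walk the set bits of k, jumping 2**j levels for each one."""
    j = 0
    while k and v != -1:
        if k & 1:
            if j >= len(up):              # k exceeds any possible depth
                return -1
            v = up[j][v]
        k >>= 1
        j += 1
    return v


parent = [-1, 0, 1, 2, 3, 4, 5, 6]        # a chain: node i hangs under node i - 1
up = build_up_table(parent)
print(up[0][7], up[1][7], up[2][7])       # 6 5 3
print(kth_ancestor(up, 7, 5))             # 2
```

Guarding every lookup against -1 keeps a query that runs past the root from indexing the table with the sentinel.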
Trace
| j (jump = 2^j) | up[j][7] |
|---|---|
| 0 (1) | 6 |
| 1 (2) | 5 (= up[0][6]) |
| 2 (4) | 3 (= up[1][5]) |
Where It's Used Today
Lowest Common Ancestor queries — finding the most recent shared ancestor of two nodes in a tree, used in version-control systems (the merge-base in Git is conceptually an LCA in the commit graph).
Genealogy software — answering "5 generations back" without walking the whole chain.
Hierarchical permissions — fast queries on access-control trees, where you ask whether a permission is inherited from some ancestor folder.
Network topology analysis — when networks are organized as trees (spanning trees, switch hierarchies), binary lifting answers reachability and routing distance queries quickly.
Programming-contest libraries — binary lifting is a standard tool, since many tree problems reduce to ancestor queries.
When NOT to Use
When the tree is mutable (parents change, nodes get re-parented) — the precomputed table goes stale and rebuilding is O(n log n).
When you only need a single ancestor query — plain parent-walking costs O(k) and skips the O(n log n) preprocessing.
When the structure is a general DAG rather than a tree — every node may have many parents and up[j][v] becomes ambiguous.
Common Mistakes
Sizing LOG too small (e.g., 20 for a tree of 10⁷ nodes when 24 is needed) — distant queries silently truncate.
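Setting up[0][root] = root and never checking depth, so kth_ancestor silently returns the root when k exceeds the node's depth; either compare k against the node's depth first, or use a sentinel like -1 and guard every table lookup against it.
Nesting the loop over j inside the loop over nodes, so up[j-1][ up[j-1][v] ] may not yet be filled when it is read; fill each row j completely before starting row j + 1.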
Try It with an AI Assistant
short
Write kth_ancestor(v, k) in a tree using binary lifting in O(log n).
behavior
Write code that, given a tree's parent array, precomputes a 2D table where row j gives every node's ancestor exactly 2^j steps up — each entry built from the previous row by jumping twice. Then, to find the k-th ancestor of any node, walk through the bits of k: for each set bit at position j, replace the current node with its 2^j-th ancestor.
Extended constant-time range queries on static data to any associative operation.
For instance: Answer repeated max/min/gcd range questions instantly.
arr ← [3, 1, 4, 1, 5, 9, 2, 6]
l ← 2
r ← 6
n ← length(arr)
levels ← ceil(log2(n))
table ← matrix(levels, n)
FOR level FROM 0 TO levels-1
    size ← 1 << (level + 1)
    FOR mid FROM size/2 TO n STEP size
        table[level][mid-1] ← arr[mid-1]
        FOR i FROM mid-2 DOWN TO mid-size/2
            table[level][i] ← combine(arr[i], table[level][i+1])
        ENDFOR
        IF mid < n THEN
            table[level][mid] ← arr[mid]
            FOR i FROM mid+1 TO min(n-1, mid+size/2-1)
                table[level][i] ← combine(table[level][i-1], arr[i])
            ENDFOR
        ENDIF
    ENDFOR
ENDFOR
IF l = r THEN
    RETURN arr[l]
ENDIF
k ← highestBit(l XOR r)
RETURN combine(table[k][l], table[k][r])
The classic sparse table (Bender and Farach-Colton's tool for range-minimum queries) only worked for operations like min, max, and gcd, where overlapping pieces are harmless. But programmers wanted the same O(1) query speed for operations that cannot tolerate overlap, like sum, product, or XOR, and the disjoint variant, passed around in competitive-programming circles in the 1990s without a single canonical paper, solved exactly that. The trick: instead of overlapping power-of-two windows, precompute the prefix and suffix combines around each block's midpoint, so any query splits cleanly into one left piece and one right piece.
Teaches: Split each query at a midpoint into two precomputed pieces
The Idea
Imagine cutting the array into blocks of size 2^(level+1). At each level, every block has a midpoint. Precompute, for each block, prefix combines from the midpoint to the right and suffix combines from the midpoint to the left. Now for any query [l, r] with l < r, find the highest bit position at which the binary representations of l and r differ; that is the right level. (For example, l = 2 = 010 and r = 6 = 110 differ at bit 2, so level = 2.) At that level l and r fall in the same block, with l left of the midpoint and r at or right of it, so the answer is just combine(table[level][l], table[level][r]): one suffix piece on the left of the midpoint plus one prefix piece on the right.
This works because the midpoint splits the query exactly into two precomputed pieces, and the operation is associative, meaning (a + b) + c equals a + (b + c), so left-of-midpoint combined with right-of-midpoint really does equal the whole. Unlike a regular sparse table, this idea handles operations that don't allow overlap (like sum or product, where counting an element twice gives the wrong answer) and still answers each query in O(1).
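Below is a minimal Python sketch of the construction and the query, following the pseudocode's layout. The class name DisjointSparseTable and the use of None for cells that no query can reach (which occur when n is not a power of two) are assumptions of this illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class DisjointSparseTable:
    """O(1) range queries for any associative combine on a static array."""

    def __init__(self, arr: Sequence[T], combine: Callable[[T, T], T]) -> None:
        self.arr = list(arr)
        self.combine = combine
        n = len(self.arr)
        levels = max(1, (n - 1).bit_length())
        self.table: List[list] = [[None] * n for _ in range(levels)]
        for level in range(levels):
            half = 1 << level                      # blocks have size 2 * half
            for mid in range(half, n + 1, 2 * half):
                # Suffix combines: walk left from mid - 1 to the block start.
                acc = self.arr[mid - 1]
                self.table[level][mid - 1] = acc
                for i in range(mid - 2, mid - half - 1, -1):
                    acc = combine(self.arr[i], acc)
                    self.table[level][i] = acc
                # Prefix combines: walk right from mid to the block end.
                if mid < n:
                    acc = self.arr[mid]
                    self.table[level][mid] = acc
                    for i in range(mid + 1, min(n, mid + half)):
                        acc = combine(acc, self.arr[i])
                        self.table[level][i] = acc

    def query(self, l: int, r: int) -> T:
        """Combine arr[l..r], both ends inclusive."""
        if l == r:
            return self.arr[l]
        k = (l ^ r).bit_length() - 1               # highest bit where l and r differ
        return self.combine(self.table[k][l], self.table[k][r])


dst = DisjointSparseTable([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a + b)
print(dst.table[2])       # [9, 6, 5, 1, 5, 14, 16, 22]
print(dst.query(2, 6))    # 21
```

The sketch always places the left operand first when combining, so it also works for associative operations that are not commutative, such as string concatenation.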
Trace
| i | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| table[2][i] | 9 | 6 | 5 | 1 | 5 | 14 | 16 | 22 |
Where It's Used Today
Competitive programming — fast range-min, range-gcd, and range-sum on static arrays for problems with millions of queries.
Read-only analytics — precomputing range aggregates over historical data so dashboards answer instantly.
Bioinformatics — range queries over fixed reference genomes (e.g., GC-content over windows).
Image processing — answering rectangular region statistics on a fixed image (a 2-D adaptation of the same idea).
Compiler optimizations — static analyses that need fast range queries over an immutable program representation.
When NOT to Use
When the array changes between queries — the table is built once and assumes immutability; use a Fenwick tree or segment tree for dynamic updates.
When the operation isn't associative (like average or median) — splitting at the midpoint and combining no longer gives the right answer.
When you only have a few queries on a small array — the O(n log n) build cost outweighs simply scanning each query directly.
Common Mistakes
Computing k = highestBit(l XOR r) but forgetting the l == r case, leading to XOR = 0 and an undefined highest bit.
Building only the prefix half of each block and querying with a single lookup — both pieces (left suffix and right prefix) are needed.
Confusing it with a regular sparse table and applying that overlapping structure to operations that break under overlap (like sum or product, where counting an element twice gives the wrong answer) without using the disjoint construction.
Try It with an AI Assistant
short
Write build/query for a disjoint sparse table answering associative range queries in O(1).
behavior
Precompute a 2-D table where row level covers blocks of size 2^(level+1). In each block, store prefix-combines starting from the midpoint going right and suffix-combines going left. To answer a range query (l, r), find the highest bit where l and r differ — call it k — and return combine(table[k][l], table[k][r]).
Made segment trees faster and simpler in practice.
For instance: Use an array-based tree for range sums without recursion.
arr ← [1, 3, 5, 7, 9, 11, 13, 15]
l ← 2
r ← 5

// build
n ← length(arr)
tree ← array[2*n]
FOR i FROM 0 TO n - 1
    tree[n + i] ← arr[i]
ENDFOR
FOR i FROM n - 1 DOWN TO 1
    tree[i] ← tree[2*i] + tree[2*i + 1]
ENDFOR

// point update: set arr[pos] ← value, then climb from the leaf fixing parents
// (call when needed; not run in the trace below)
//   i ← pos + n
//   tree[i] ← value
//   i ← i DIV 2
//   WHILE i >= 1
//       tree[i] ← tree[2*i] + tree[2*i + 1]
//       i ← i DIV 2
//   ENDWHILE

// range-sum query [l, r]
l ← l + n
r ← r + n
sum ← 0
WHILE l <= r
    IF l MOD 2 = 1 THEN
        sum ← sum + tree[l]
        l ← l + 1
    ENDIF
    IF r MOD 2 = 0 THEN
        sum ← sum + tree[r]
        r ← r - 1
    ENDIF
    l ← l DIV 2
    r ← r DIV 2
ENDWHILE
RETURN sum
Segment trees were a workhorse of computational geometry from the 1970s, but their classical recursive implementation carried real overhead — function-call costs, pointer chasing, awkward boundary handling. By the 1990s, competitive programmers (especially in the Russian and Eastern European training scenes) popularized a tight bottom-up form that stores the tree in a flat array of size 2n and replaces recursion with simple bit arithmetic. The same O(log n) operations now compile to a handful of integer instructions, making the iterative segment tree the standard contest tool for fast range queries.
Teaches: Flatten tree structure into an array and climb with arithmetic
The Idea
Lay out a complete binary tree in a flat array of size 2n, using indices 1 through 2n − 1 (index 0 stays unused). The leaves live at positions n through 2n − 1 and store the input values. Each internal node at position i is the sum of its two children at 2i and 2i + 1. Building the tree means filling the leaves and then walking i from n − 1 down to 1, summing.
For a range query [l, r] (inclusive), shift l and r into leaf indices by adding n. Then climb both pointers toward the root, picking up partial sums whenever a pointer is on the "wrong" side of its parent. Concretely: if l is odd, it's a right child, so its parent also covers the node to its left, which lies outside the range; add tree[l] to the sum and step l past it. If r is even, it's a left child, and the same idea applies on the other end. Then divide both pointers by 2 and repeat. Each iteration moves both pointers up one level, so the loop runs O(log n) times. No recursion, no pointer chasing: just integer arithmetic on a flat array.
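The build, point update, and inclusive range-sum query can be sketched as a small Python class. The class and method names are choices made for this illustration; the data is the array used in the trace.

```python
from typing import List, Sequence


class SegmentTree:
    """Bottom-up segment tree for sums, stored in a flat list of size 2n."""

    def __init__(self, arr: Sequence[int]) -> None:
        self.n = len(arr)
        self.tree: List[int] = [0] * (2 * self.n)
        self.tree[self.n:] = list(arr)             # leaves at n .. 2n - 1
        for i in range(self.n - 1, 0, -1):         # parents, from the bottom up
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, pos: int, value: int) -> None:
        """Set arr[pos] = value, then repair every ancestor of that leaf."""
        i = pos + self.n
        self.tree[i] = value
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def query(self, l: int, r: int) -> int:
        """Sum of arr[l..r], both ends inclusive."""
        l += self.n
        r += self.n
        total = 0
        while l <= r:
            if l % 2 == 1:        # l is a right child: take it and step right
                total += self.tree[l]
                l += 1
            if r % 2 == 0:        # r is a left child: take it and step left
                total += self.tree[r]
                r -= 1
            l //= 2
            r //= 2
        return total


st = SegmentTree([1, 3, 5, 7, 9, 11, 13, 15])
print(st.query(2, 5))   # 32
st.update(3, 0)
print(st.query(2, 5))   # 25
```

For sums the same code also works when n is not a power of two; the tree is then slightly lopsided, but each query still picks up every element of the range exactly once.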
Trace
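| step | l | r | l odd? | r even? | sum updates | sum | next l, r |
|---|---|---|---|---|---|---|---|
| 1 | 10 | 13 | no | no (odd) | - | 0 | 5, 6 |
| 2 | 5 | 6 | yes | yes | sum += tree[5]=12; sum += tree[6]=20 | 32 | 3, 2 |
| 3 | 3 | 2 | (l > r, exit) | - | - | 32 | - |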
Where It's Used Today
Competitive programming — the iterative form is the standard go-to for range-sum and range-min problems on Codeforces and at the IOI.
Database engines — column-store databases use segment-tree-like structures for fast range aggregates over time-series and analytics tables.
Game leaderboards — efficient rank and range queries over millions of player scores rely on segment-tree variants.
Real-time analytics dashboards — computing rolling sums and percentiles over event streams uses similar flat-array trees.
Computational biology — fast range queries over genome-coverage arrays in bioinformatics pipelines.
When NOT to Use
When you need lazy propagation for range updates (add v to every element of [l, r]) — the iterative form is awkward to extend; prefer the recursive segment tree.
When the array is static and you only do range queries — a precomputed prefix-sum array answers in O(1) with no log factor.
When the operation is non-associative (like "average") — segment trees only compose associative operations cleanly; pick a different structure.
Common Mistakes
Sizing the tree as n instead of 2*n, overwriting leaves and corrupting all internal sums.
Mixing inclusive and exclusive r between build and query, returning sums that are off by one element at the right boundary.
Forgetting to walk up from the changed leaf in update, leaving stale sums in every ancestor of the modified position.
Try It with an AI Assistant
short
Write build/query for an iterative segment tree with point update and range sum.
behavior
Write a class that stores an array of length n in a flat tree of size 2n, with leaves at positions n through 2n-1 and each internal node at position i equal to tree[2*i] + tree[2*i+1]. Provide a build, a point-update that walks from a leaf to the root fixing parents, and a range-sum query that climbs l and r in lock-step, adding tree[l] if l is odd and tree[r] if r is even before dividing both by 2.
Made fast ordered lookup possible without strict tree balancing.
node ← head
FOR lvl FROM max_level DOWNTO 0
    WHILE node.next[lvl] AND node.next[lvl].key < target
        node ← node.next[lvl]
    ENDWHILE
ENDFOR
node ← node.next[0]
IF node AND node.key = target THEN
    RETURN node.value
ENDIF
RETURN NONE
William Pugh invented skip lists partly out of frustration with balanced trees. His pitch was almost rebellious — skip strict structure entirely; let randomness give you the same expected performance. Many researchers initially thought it was a trick, not a serious data structure.
Teaches: Add randomness to achieve balanced structure without strict rules
The Idea
Stack several sorted linked lists on top of each other. The bottom level (level 0) contains every item; level 1 contains a random half of them; level 2 contains a random half of those; and so on. Higher levels are express lanes that let you skip over many items in one hop.
To search for target, start in the top-left corner. At the current level, walk right as long as the next node's key is still less than target. When you can't move right any more, drop down a level and continue. Eventually you fall off the bottom, and the next node at level 0 is either target or proves it's missing.
Why does it work? Each item climbs to its level by flipping a coin (with probability ½ to keep going up). On average, half the items live at level 1, a quarter at level 2, and so on, giving roughly log n levels and log n expected work per search. The randomness does not give a worst-case guarantee, but a badly unbalanced list is overwhelmingly unlikely, and the code is far simpler than strict balancing.
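Here is a small Python sketch of a skip list with insert and search, mirroring the search loop in the pseudocode. The class layout, the max_level default of 16, and the seed parameter are assumptions made for the example rather than anything prescribed by the original design.

```python
import random
from typing import Any, List, Optional


class _Node:
    def __init__(self, key: Any, value: Any, level: int) -> None:
        self.key = key
        self.value = value
        self.next: List[Optional["_Node"]] = [None] * (level + 1)


class SkipList:
    def __init__(self, max_level: int = 16, seed: Optional[int] = None) -> None:
        self.max_level = max_level
        self.head = _Node(None, None, max_level)   # sentinel present on every level
        self.rng = random.Random(seed)

    def _random_level(self) -> int:
        # Flip a fair coin; each heads promotes the node one level higher.
        level = 0
        while level < self.max_level and self.rng.random() < 0.5:
            level += 1
        return level

    def insert(self, key: Any, value: Any) -> None:
        update = [self.head] * (self.max_level + 1)
        node = self.head
        for lvl in range(self.max_level, -1, -1):
            while node.next[lvl] is not None and node.next[lvl].key < key:
                node = node.next[lvl]
            update[lvl] = node                     # last node before key on this level
        nxt = node.next[0]
        if nxt is not None and nxt.key == key:
            nxt.value = value                      # key already present: overwrite
            return
        new = _Node(key, value, self._random_level())
        for lvl in range(len(new.next)):           # splice in at every level it reaches
            new.next[lvl] = update[lvl].next[lvl]
            update[lvl].next[lvl] = new

    def search(self, key: Any) -> Any:
        node = self.head
        for lvl in range(self.max_level, -1, -1):
            while node.next[lvl] is not None and node.next[lvl].key < key:
                node = node.next[lvl]
        node = node.next[0]
        if node is not None and node.key == key:
            return node.value
        return None


sl = SkipList(max_level=4, seed=1)
for k in [30, 10, 42, 25, 7, 19]:
    sl.insert(k, str(k))
print(sl.search(30), sl.search(31))   # 30 None
```

Returning None for a missing key follows the pseudocode; a version that must store None as a value would need a different way to signal absence.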
Trace
| step | level | node | next.key | action |
|---|---|---|---|---|
| 1 | 2 | HEAD | 25 | 25 < 30 → walk right to 25 |
| 2 | 2 | 25 | NIL | can't go right → drop to level 1 |
| 3 | 1 | 25 | 42 | 42 ≥ 30 → drop to level 0 |
| 4 | 0 | 25 | 30 | 30 < 30? no, stop the inner loop |
| 5 | 0 | (after) | 30 | take node.next[0]; key matches → return |
Where It's Used Today
Redis sorted sets — Redis uses skip lists internally for ZSET, the data structure behind leaderboards and ranged queries.
Apache Lucene/Solr — skip lists speed up posting-list intersection during full-text search.
LevelDB and RocksDB — Google's and Facebook's key-value stores use skip lists in their in-memory tables for fast inserts.
Real-time game leaderboards — high-score tables that update live and need fast rank queries.
Concurrent data structures — skip lists are easier to lock per-node than balanced trees, useful in multi-threaded code.
When NOT to Use
When you need guaranteed worst-case O(log n) — skip lists are only O(log n) in expectation; a balanced tree is safer for hard real-time.
When data is small and fits in a sorted array — binary search on an array is cache-friendlier and avoids per-node pointer overhead.
When memory is tight — each node carries multiple next pointers, costing more space than a tree node or array slot.
Common Mistakes
Using a deterministic coin (e.g., always promote on every other insert) — adversarial input can then build a degenerate tower.
Forgetting to update next pointers at every level the new node reaches, leaving search paths that skip over it.
Hardcoding max_level too low for the dataset, so once n exceeds 2^max_level searches degrade toward O(n).
Try It with an AI Assistant
short
Write a SkipList class with insert(key, value) and search(key) that run in expected O(log n) time.
behavior
Implement a sorted-set data structure built from several stacked sorted linked lists. The bottom list holds every item; each item randomly decides (coin-flip) whether to also appear in the list above. To search, start at the top-left and walk right while the next key is smaller than the target; when you can't advance, drop down a level. Repeat until you fall off the bottom level.
Made fast string-to-integer fingerprinting practical for hash tables.
h ← FNV_offset
FOR EACH byte b IN data
    h ← h XOR b
    h ← (h * FNV_prime) MOD 2^32
ENDFOR
RETURN h
The early 1990s Unix world was full of programs that needed to hash millions of strings (symbol tables, configuration parsers, the DNS resolver), and existing options were either too slow (CRC32) or too complicated for a simple tight loop. The hash that Glenn Fowler and Phong Vo proposed in 1991, in reviewer comments to the IEEE POSIX P1003.2 committee, and that Landon Curt Noll refined in a later ballot round, was almost embarrassingly simple: one XOR and one multiply per byte. That simplicity was the point. Within a few years it had quietly become the default "decent fast hash" in DNS daemons, game engines, and scripting language runtimes: the kind of utility code most programmers use without ever knowing whose name is on it.
Teaches: Mix bits incrementally for fast, low-cost spread
Anecdote
Named after its creators: Glenn Fowler, Landon Curt Noll, and Phong Vo. It wasn't designed in academia — it came from real-world Unix systems work, where speed and simplicity mattered more than theory.
The Idea
Start with a fixed magic constant, the FNV offset basis (for 32-bit, 2166136261). For each byte of the input, do two operations: XOR the byte into the running hash, then multiply the hash by another magic constant, the FNV prime (for 32-bit, 16777619). Take the result modulo 2³² so it fits in a 32-bit register.
Why does this work? The XOR folds the new byte into the low eight bits of the running hash. The multiplication by a carefully chosen prime then spreads those bits upward across the 32 positions, so after a few more bytes a one-bit change in the input has flipped many bits of the output. Doing the XOR before the multiply (the "1a" in the name) gives slightly better dispersion than the original FNV-1, which multiplies first. The whole step is two cheap CPU instructions per byte, which is why FNV-1a is fast enough to use in a tight inner loop.
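As a concrete reference, here is a minimal Python sketch of the 32-bit loop described above. The function name `fnv1a_32` and the constant names are illustrative choices; the two constants are the 32-bit offset basis and prime quoted in the text. Python integers never overflow, so the mask stands in for the MOD 2³² step.

```python
FNV_OFFSET_32 = 2166136261  # 32-bit offset basis
FNV_PRIME_32 = 16777619     # 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """Return the 32-bit FNV-1a hash of a byte string."""
    h = FNV_OFFSET_32
    for b in data:
        h ^= b                               # fold the byte into the low bits
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # multiply, keep the low 32 bits
    return h


if __name__ == "__main__":
    print(fnv1a_32(b"AB"))
```

Hashing `b"A"` and then `b"AB"` with this function reproduces the two rows of the trace (3289118412 and 752165258).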
Trace
| step | byte b | h before XOR | h after XOR | h after × prime mod 2³² |
|---|---|---|---|---|
| 0 | (init) | - | 2166136261 | - |
| 1 | 'A' (65) | 2166136261 | 2166136196 | 3289118412 |
| 2 | 'B' (66) | 3289118412 | 3289118350 | 752165258 |
Where It's Used Today
DNS resolvers and load balancers — FNV-1a routes queries to backends because it's fast enough to run on every packet.
Hash tables in databases and compilers — FNV-1a is a common default for in-memory key hashing where speed beats cryptographic strength.
Bloom filters and set-membership structures — many implementations use FNV-1a as one of the cheap hash functions feeding the filter.
Game engines and asset systems — engines commonly use FNV-1a-style hashes to turn asset names into integer IDs, often at compile time.
Linux kernel and inotify paths — FNV-1a appears across systems software where a quick, decent-quality hash is needed without pulling in a crypto library.
When NOT to Use
When the hash needs to resist intentional collisions — FNV-1a is not cryptographic; an attacker can craft inputs that all land in the same bucket. Use SipHash, BLAKE3, or SHA-256.
When you're hashing for deduplication of large files or content addressing — FNV-1a's 32-/64-bit output collides too often; use SHA-256 or BLAKE3.
When the inner loop is bound by hash quality rather than speed — modern hashes like xxHash and wyhash are both faster and have better avalanche than FNV-1a on long strings.
Common Mistakes
Doing the multiply before the XOR (writing FNV-1 instead of FNV-1a) — preserves a small clustering bias on similar inputs and quietly weakens the hash.
Skipping the MOD 2^32 (or relying on the language's overflow) in a language with arbitrary-precision integers like Python — the value grows unbounded and never matches reference vectors.
Using the wrong offset basis or prime for the bit width — the 32-bit constants applied to a 64-bit hash give garbage that disagrees with every other implementation.
Try It with an AI Assistant
short
Write fnv_1a_hash(...) implementing FNV-1a Hash.
behavior
Write a function that, given a sequence of bytes, returns a 32-bit integer. Start with a fixed initial value. For each byte, XOR the byte into the running value, then multiply the running value by a fixed odd prime constant, keeping only the low 32 bits. Return the final running value.
For instance: Search large books or genomes by sorted suffixes.
s ← "banana"
n ← length(s)
rank ← relative order of characters in s   // a→1, b→2, n→3
sa ← [0..n-1]
k ← 1
WHILE k < n
    sort sa by (rank[i], rank[i+k])
    temp ← array[n]
    temp[sa[0]] ← 0
    FOR i FROM 1 TO n - 1
        temp[sa[i]] ← temp[sa[i-1]]
        IF pair(sa[i]) != pair(sa[i-1]) THEN
            temp[sa[i]] ← temp[sa[i]] + 1
        ENDIF
    ENDFOR
    rank ← temp
    k ← k * 2
ENDWHILE
RETURN sa
Suffix trees, invented by Weiner in 1973 and refined by McCreight in 1976, gave linear-time substring search but were memory-hungry, often 15-20 bytes per character of input. In 1990, Udi Manber and Gene Myers (then at the University of Arizona) showed you could get most of the search power from a much smaller structure: just the sorted starting positions of every suffix, requiring only 4-8 bytes per character. Their prefix-doubling construction (O(n log n) in the original paper; O(n log² n) in the simpler comparison-sort form shown here) made suffix arrays the building block of choice for genome aligners (BWA, Bowtie), compressors (bzip2's Burrows-Wheeler stage), and large-scale code search.
Teaches: Compare by ranks of doubling prefixes, not raw characters
The Idea
Naively sorting suffixes takes O(n² log n) because comparing two suffixes can scan up to n characters. The Manber-Myers doubling trick gets it down to O(n log² n) with an ordinary comparison sort in each round, or O(n log n) if each round uses a radix sort.
Give each position a starting rank equal to its character. Then, in stages, compare suffixes by their first k characters using only their existing ranks: at stage k, each suffix is keyed by the pair (rank[i], rank[i+k]). Sort by this pair, then assign new ranks based on the sorted order — equal pairs get the same rank, different pairs get successive ranks. Now each rank captures the first 2k characters. Double k each stage. After log n stages, each suffix has a unique rank that depends on the entire suffix, and the array is fully sorted. The trick: at every stage, each suffix's "key" is a constant-size pair of integers — no character-by-character comparison.
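The doubling stages can be sketched in Python as follows. This is a minimal illustration of the comparison-sort variant, so it runs in O(n log² n). The function name `suffix_array` and the use of `ord()` for the starting ranks are choices made for the example, and -1 serves as the out-of-bounds sentinel because it is smaller than every real rank.

```python
def suffix_array(s):
    """Return the start positions of all suffixes of s in sorted order."""
    n = len(s)
    if n == 0:
        return []
    rank = [ord(c) for c in s]  # starting rank: the character itself
    sa = list(range(n))
    k = 1
    while True:
        def pair(i):
            # Constant-size key: the rank of the first k characters and the
            # rank of the next k; -1 sorts before every real rank.
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=pair)
        # Re-rank: equal pairs share a rank, different pairs get the next one.
        new_rank = [0] * n
        for j in range(1, n):
            bump = 1 if pair(sa[j]) != pair(sa[j - 1]) else 0
            new_rank[sa[j]] = new_rank[sa[j - 1]] + bump
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # every rank is unique, so we are done
            return sa
        k *= 2


if __name__ == "__main__":
    print(suffix_array("banana"))
```

Stopping as soon as the largest rank reaches n - 1 avoids running extra stages on inputs that are fully discriminated early.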
Trace
| stage k | sa (after sort) | suffixes (in rank order) |
|---|---|---|
| start | [0..5] | banana, anana, nana, ana, na, a |
| k=1 | [5, 1, 3, 0, 2, 4] | sort by (rank[i], rank[i+1]): a < an < ba < na |
| k=2 | [5, 3, 1, 0, 4, 2] | sort by (rank[i], rank[i+2]): a < ana < anana < banana < na < nana |
| k=4 | [5, 3, 1, 0, 4, 2] | already fully discriminated; ranks unique |
Where It's Used Today
Bioinformatics — tools like BWA and Bowtie use suffix arrays (and related FM-indexes) to align billions of DNA reads against the human genome.
Full-text search — search engines and library systems build suffix arrays so that any phrase lookup turns into a binary search.
Data compression — bzip2 and similar compressors rely on the Burrows-Wheeler Transform, which can be read directly off a suffix array.
Plagiarism detection — suffix arrays make it efficient to find every long shared substring between two documents.
Code repository search — large source-code search tools (GitHub code search, Hound, Sourcegraph) build suffix-array or n-gram indexes to support arbitrary substring queries across millions of files.
When NOT to Use
When you need a single one-off substring search on a short string — KMP or Boyer-Moore is simpler and avoids the construction overhead.
When the text is dynamic and updates frequently — rebuilding the suffix array on every insert is expensive; use a suffix tree or wavelet structure that supports updates.
When you need the longest common substring across many texts — generalized suffix trees or FM-indexes handle this more directly.
Common Mistakes
Treating out-of-bounds rank[i+k] as zero — zero is a valid rank, so short suffixes get tied with real ones; use -1 or a sentinel smaller than every real rank.
Sorting by raw suffix strings during the doubling step instead of by the rank pair, collapsing the algorithm to O(n² log n).
Stopping the doubling loop after exactly log n iterations — stop instead when every rank is unique, which can happen sooner or (with ties) require one more pass.
Try It with an AI Assistant
short
Write suffix_array(s) returning sorted suffix indices via the doubling method in O(n log² n).
behavior
Write a function that takes a string and returns the starting positions of all its suffixes sorted in dictionary order. To make this fast: assign each position a rank equal to its character. Then, doubling k from 1, sort positions by the pair (rank[i], rank[i+k]) — treating any out-of-bounds rank as smaller than any real one. Re-rank by sorted order so each rank reflects the first 2k characters. Stop once every position has a unique rank, and return the sorted positions.
It made practical subword tokenization possible, letting language models handle rare words, new words, names, and fragments without needing a fixed word dictionary.
vocab ← all single bytes
WHILE |merges| < target
    pairs ← count_pairs(corpus)
    pair ← argmax(pairs)
    merges.append(pair)
    vocab.add(pair[0] + pair[1])
    replace pair IN corpus
ENDWHILE
RETURN merges, vocab
Philip Gage introduced byte-pair encoding as a compression idea: repeatedly replace the most common adjacent pair. Decades later, the same idea became central to AI tokenizers, where frequent character groups become reusable tokens.
Teaches: Build vocabulary by merging the most frequent pair
The Idea
Start with the smallest possible vocabulary: every single character (or byte) is its own token. Then, count every pair of adjacent tokens across the entire corpus, find the most frequent pair, and merge it into a brand new token. Repeat — recount pairs, pick the new winner, merge — until you've performed target merges (typically tens of thousands for a real LLM tokenizer).
The invariant after each step is that the corpus is still completely covered by the current vocabulary; the new merge is an additional token, not a replacement. Because the most frequent pair becomes one symbol, the next round's pair counts shift, and frequent letter sequences gradually grow into whole-word-sized tokens while rare ones stay split into small pieces. The greedy "merge the most common pair" rule is what makes BPE both simple and effective: it captures real linguistic regularity ("er" in bigger, bitter, litter) without needing any grammar.
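The merge loop can be sketched in Python like this. It is a toy version under stated assumptions: the corpus is a single string treated as one token sequence (no pre-splitting at spaces), ties go to the pair seen first, and the names `learn_bpe` and `merge_pair` are illustrative.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of pair, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(text, num_merges):
    """Learn up to num_merges merges; return (merges, vocab, final tokens)."""
    tokens = list(text)   # start with one token per character
    vocab = set(tokens)
    merges = []
    for _ in range(num_merges):
        # Recount adjacent pairs from scratch after every merge.
        counts = Counter(zip(tokens, tokens[1:]))
        if not counts:
            break         # fewer than two tokens left
        best = max(counts, key=counts.get)  # first-seen pair wins ties
        merges.append(best)
        vocab.add(best[0] + best[1])        # an additional token, not a replacement
        tokens = merge_pair(tokens, best)
    return merges, vocab, tokens


if __name__ == "__main__":
    print(learn_bpe("aaabdaaabac", 3)[0])
```

Run on `"aaabdaaabac"` with three merges, this learns `("a", "a")`, then `("aa", "a")`, then `("aaa", "b")`, the same sequence as the trace.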
Trace
| step | corpus | pair counts | best pair | merge |
|---|---|---|---|---|
| 0 | a a a b d a a a b a c | aa:4, ab:2, bd:1, da:1, ba:1, ac:1 | aa | aa |
| 1 | aa a b d aa a b a c | aa·a:2, a·b:2, b·d:1, ... | aa·a | aaa |
| 2 | aaa b d aaa b a c | aaa·b:2, b·d:1, d·aaa:1, b·a:1, a·c:1 | aaa·b | aaab |
Where It's Used Today
Large language models — ChatGPT, Claude, GPT-4, and most open-source LLMs use BPE or a close cousin (WordPiece, SentencePiece) to break input text into tokens.
Multilingual translation — neural machine-translation systems (Google Translate, DeepL) tokenize across hundreds of languages with one shared BPE vocabulary.
Code models — Copilot, Codex, and Claude tokenize source code with BPE so that snippets like def, function, or __init__ become single tokens.
Speech recognition — modern speech-to-text systems output BPE tokens instead of letters, dramatically improving accuracy on rare words.
File compression — the original 1994 use case: compressing files by replacing common byte pairs, still alive in some embedded codecs.
When NOT to Use
When the corpus is very small — there aren't enough repeated pairs to learn meaningful merges, and the resulting tokenizer overfits to your training words.
When you need linguistically meaningful units (morphemes, syllables) — BPE merges by frequency, not grammar, so it splits "running" as runn + ing or run + ning depending on the data.
When the input is binary data without statistical regularity (encrypted bytes, already-compressed files) — pair frequencies are uniform and BPE provides no compression or useful tokens.
Common Mistakes
Failing to recount pair frequencies after each merge — pair counts shift dramatically once you collapse aa into a new symbol, and using stale counts picks the wrong next merge.
Allowing merges to cross word boundaries — most BPE implementations pre-split text at spaces; ignoring this produces tokens like the_quick that fragment unpredictably at inference time.
Forgetting to apply learned merges in the same order at encode time — BPE merges are order-sensitive; applying them out of order tokenizes the same string differently than during training.
Try It with an AI Assistant
short
Write a byte_pair_encoding tokenizer that learns merges from a corpus and returns the merge list and final vocabulary.
behavior
Write a function that, given a corpus broken into individual characters, repeatedly counts every pair of adjacent symbols, finds the most frequent pair, replaces every occurrence of that pair in the corpus with a brand-new combined symbol, and records the merge. Stop after a fixed number of merges and return the merge list and the resulting symbol set.
Made tracking the median of a never-ending stream practical.
lo ← max_heap()
hi ← min_heap()
FOR EACH x IN stream
    IF lo empty OR x <= lo.top THEN
        lo.push(x)
    ELSE
        hi.push(x)
    ENDIF
    IF |lo| > |hi| + 1 THEN
        hi.push(lo.POP)
    ELIF |hi| > |lo| THEN
        lo.push(hi.POP)
    ENDIF
    YIELD median()   // running median after each new element
ENDFOR
This trick spread as folklore rather than a single paper. It became popular because of interview culture — many engineers first encounter it not in research, but in whiteboard interviews.
Teaches: Balance two heaps to track central tendency in a global stream
The Idea
Keep two heaps: lo is a max-heap holding the smaller half of the data (its top is the largest of those), and hi is a min-heap holding the larger half (its top is the smallest of those). Maintain two rules: every value in lo is ≤ every value in hi, and the sizes differ by at most 1. When a new x arrives, drop it into lo if it's small (or lo is empty), otherwise into hi. Then re-balance by moving the top of the bigger heap to the other side.
The median falls right between the two heaps. If lo has one extra element, the median is lo.top. If they're the same size, the median is the average of the two tops. Each heap operation is O(log n), so a billion incoming values still gives instant medians. The invariant — lo holds the lower half, hi holds the upper half, sizes balanced — is what makes the median always reachable in constant time.
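A minimal Python sketch of the two-heap structure follows. Python's `heapq` only provides a min-heap, so this version stores negated values in `lo` to simulate the max-heap; the class and method names (`MedianStream`, `add`, `median`) are illustrative.

```python
import heapq


class MedianStream:
    def __init__(self):
        self.lo = []  # max-heap of the lower half, stored as negated values
        self.hi = []  # min-heap of the upper half

    def add(self, x):
        # Route the new value to the half it belongs to.
        if not self.lo or x <= -self.lo[0]:
            heapq.heappush(self.lo, -x)
        else:
            heapq.heappush(self.hi, x)
        # Rebalance so that len(lo) is len(hi) or len(hi) + 1.
        if len(self.lo) > len(self.hi) + 1:
            heapq.heappush(self.hi, -heapq.heappop(self.lo))
        elif len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if not self.lo:
            raise ValueError("median of an empty stream")
        if len(self.lo) > len(self.hi):
            return -self.lo[0]                  # lo holds the extra element
        return (-self.lo[0] + self.hi[0]) / 2   # average of the two tops


if __name__ == "__main__":
    ms = MedianStream()
    for value in [5, 15, 1, 3, 8, 7, 9]:
        ms.add(value)
        print(value, ms.median())
```

Because rebalancing always leaves any extra element in `lo`, `median()` only has to distinguish two cases.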
Trace
| step | x | push into | rebalance? | lo (max-heap) | hi (min-heap) | median |
|---|---|---|---|---|---|---|
| 1 | 5 | lo (empty) | no | [5] | [] | 5 |
| 2 | 15 | hi (>5) | no | [5] | [15] | 10 |
| 3 | 1 | lo (≤5) | no | [5, 1] | [15] | 5 |
| 4 | 3 | lo (≤5) | \|lo\| > \|hi\| + 1 → move 5 to hi | [3, 1] | [5, 15] | 4 |
| 5 | 8 | hi (>3) | \|hi\| > \|lo\| → move 5 to lo | [5, 3, 1] | [8, 15] | 5 |
| 6 | 7 | hi (>5) | no | [5, 3, 1] | [7, 8, 15] | 6 |
| 7 | 9 | hi (>5) | \|hi\| > \|lo\| → move 7 to lo | [7, 5, 3, 1] | [8, 9, 15] | 7 |
Where It's Used Today
Latency dashboards — services like Datadog and New Relic report a "running p50" (median response time) over a stream of millions of requests using exactly this two-heap trick.
Sensor monitoring — fitness trackers and industrial sensors compute the median of incoming readings to filter out spikes more robustly than the mean.
Fraud detection — banks track the median transaction size per account in real time; sudden deviations trigger review.
Networking and SRE — packet-rate monitors, load balancers, and rate limiters use streaming medians to detect anomalous traffic without storing every packet.
Whiteboard interviews — it's a textbook coding-interview question at Google, Meta, and Amazon — generations of engineers have learned the heap balance trick this way.
When NOT to Use
When you only need an approximate quantile and memory is tight — sketches like t-digest or P² give p50/p99 in kilobytes regardless of stream size.
When elements expire (sliding-window median over the last N items) — heaps can't cheaply remove arbitrary old values; use an indexed multiset or a skip list.
When you actually need other percentiles too — running two heaps for p25, p50, p75, p99 separately is wasteful; one quantile sketch handles them all.
Common Mistakes
Using a min-heap for lo instead of a max-heap (Python's heapq is min-only) — without negating values, the "lower half" becomes the wrong end.
Skipping the rebalance step when sizes are equal but the new element belongs on the other side — the invariant every(lo) ≤ every(hi) quietly breaks.
Computing the median as (lo.top + hi.top) / 2 even when one heap is larger — the correct median is the top of whichever heap holds the extra element.
Try It with an AI Assistant
short
Write a class MedianStream with add(x) and median() using two heaps (max-heap of the lower half, min-heap of the upper half).
behavior
Build a class that accepts numbers one by one and can report the median at any moment. Keep two priority queues: one ordered so its top is the largest of the smaller half, the other ordered so its top is the smallest of the larger half. After each insert, move the top of the larger queue to the other side until their sizes differ by at most one. The median is read off the tops.
Made rolling maximum queries on streams practical in O(1) amortized per element.
dq ← deque()
out ← []
FOR i FROM 0 TO n - 1
    WHILE dq AND dq.front < i - k + 1
        dq.popleft()
    ENDWHILE
    WHILE dq AND a[dq.back] <= a[i]
        dq.pop()
    ENDWHILE
    dq.push(i)
    IF i >= k - 1 THEN
        out.append(a[dq.front])
    ENDIF
ENDFOR
RETURN out
The monotonic-deque trick has no single inventor. It spread through the competitive-programming and algorithmic communities in the 1990s, when streaming and online problems started showing up regularly in ACM ICPC and Eastern-European olympiad sets. By the early 2000s it had been folded into the "ascending-minima" technique used in standard problem-solving textbooks (Skiena, Halim) and was a staple of interview questions at Google, Amazon, and the like. The same problem was also solved inside the image-processing literature by the "van Herk / Gil-Werman" linear-time max-filter, a block-based method found independently by researchers who'd never heard of competitive programming.
Teaches: Discard what can never matter again
The Idea
Keep a deque (a queue you can push and pop from both ends) holding the indices of values that could still become the window's max. The trick: every time a new value a[i] comes in, throw away from the back of the deque every smaller-or-equal value already there. Why? Because once a bigger value sits to their right inside the same window, those smaller values can never be the max again — ever. They've been made permanently irrelevant.
Also drop the front of the deque if its index has slid out of the window (dq.front < i - k + 1). After both prunings, the front of the deque always holds the index of the current window's maximum. Each index enters and leaves the deque at most once, so the total work is O(n) — the deque "forgets" useless candidates immediately.
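The two prunings translate almost line for line into Python. This is a minimal sketch using `collections.deque`; the function name `sliding_window_maximum` matches the prompt later in this section, and the check on `k` is an added assumption.

```python
from collections import deque


def sliding_window_maximum(a, k):
    """Return the maximum of every length-k window of a, in O(n) total."""
    if k < 1:
        raise ValueError("k must be at least 1")
    dq = deque()  # indices; their values decrease strictly from front to back
    out = []
    for i, x in enumerate(a):
        # Drop the front if it has slid out of the window [i - k + 1, i].
        if dq and dq[0] < i - k + 1:
            dq.popleft()
        # Drop everything at the back that x makes permanently irrelevant.
        while dq and a[dq[-1]] <= x:
            dq.pop()
        dq.append(i)
        # Only report once the first full window has been seen.
        if i >= k - 1:
            out.append(a[dq[0]])
    return out


if __name__ == "__main__":
    print(sliding_window_maximum([1, 3, -1, -3, 5, 3, 6, 7], 3))
```

A single `if` is enough for the front check here, because the window advances one position per step and so at most one index can expire per iteration.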
Trace
| i | a[i] | dq before | actions | dq after | out (window max) |
|---|---|---|---|---|---|
| 0 | 1 | [] | push 0 | [0] | - |
| 1 | 3 | [0] | a[0]=1 ≤ 3 → pop; push 1 | [1] | - |
| 2 | -1 | [1] | a[1]=3 > -1 → push 2 | [1,2] | a[1]=3 |
| 3 | -3 | [1,2] | a[2]=-1 > -3 → push 3 | [1,2,3] | a[1]=3 |
| 4 | 5 | [1,2,3] | front 1 has slid out of window [2..4] → popleft; a[3]=-3, a[2]=-1 ≤ 5 → pop both; push 4 | [4] | a[4]=5 |
| 5 | 3 | [4] | a[4]=5 > 3 → push 5 | [4,5] | a[4]=5 |
| 6 | 6 | [4,5] | a[5]=3, a[4]=5 ≤ 6 → pop both; push 6 | [6] | a[6]=6 |
| 7 | 7 | [6] | a[6]=6 ≤ 7 → pop; push 7 | [7] | a[7]=7 |
Where It's Used Today
Stock trading — rolling-max indicators over the last k ticks for breakout signals on live price feeds.
Server monitoring — dashboards reporting the peak request rate inside the last 5 minutes, updated every second.
Image processing — morphological "max filters" run sliding-window maximum on each pixel row in linear time.
Audio and signal processing — peak detection for voice-activity detection or guitar-tuner envelope tracking.
Online competitive programming — a classic interview and contest problem, often the inner loop of more complex DP optimizations.
When NOT to Use
When the window is very small (k ≤ 3 or so) — a plain max(a[i:i+k]) per step is faster than the deque overhead.
When you need rolling median or arbitrary order statistics, not max — the monotonic-deque trick only works for a single extreme (max or min); use a sorted multiset, or two heaps with lazy deletion, instead.
When elements can be removed from arbitrary positions (not just the trailing edge) — that's a different problem; you need a balanced BST or a sorted multiset.
Common Mistakes
Storing values in the deque instead of indices, then losing the ability to tell when the front has slid out of the window.
Using < instead of <= when popping the back — strict inequality keeps stale equal values around and inflates the deque without changing the answer.
Recording the window's max before the window is full (i < k - 1), producing extra leading entries in the output.
Try It with an AI Assistant
short
Write sliding_window_maximum(a, k) returning a list of the maximum value in every window of size k, using a monotonic deque for O(n) total time.
behavior
Write a function that scans an array left to right, keeping a queue of indices for values that could still be the maximum of some window of size k. When a new value arrives, drop the front if it has slid out of the window, drop indices at the back whose values are no longer competitive, push the new index, then (once the first window is full) record the front's value as the current window's max.
MinHash (locality-sensitive hash for Jaccard similarity)
Two Sets, One Hash
Andrei Broder
It made large-scale similarity search practical for documents, webpages, plagiarism detection, and duplicate discovery.
sigs ← []
FOR i FROM 0 TO k - 1
    min_h ← infinity
    FOR EACH x IN S
        h ← h_i(x)
        IF h < min_h THEN
            min_h ← h
        ENDIF
    ENDFOR
    sigs.append(min_h)
ENDFOR
RETURN sigs
Comparing huge sets directly is expensive. MinHash found a clever fingerprint: under random permutations, the chance of matching minimum hashes estimates Jaccard similarity.
Teaches: Random permutations compress set similarity into one number
The Idea
Pick k independent hash functions h_1, h_2, …, h_k. For each one, compute the hash of every element in the set and remember the minimum hash value. That gives you a signature of k numbers per set. To estimate Jaccard similarity between two sets, count how often their signatures match in the same slot, divided by k.
Why does this work? Imagine the hash function as a random reordering of all possible elements. Whichever element of A ∪ B ends up "first" in this ordering has the minimum hash across both sets. The two sets share their minimum if and only if that first element lies in their intersection. The chance of that is exactly |A ∩ B| / |A ∪ B|, the Jaccard similarity. Each hash function gives one independent yes/no test of "do you share the min?" The fraction of yeses across k hashes converges to the true Jaccard similarity as k grows.
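Here is a small Python sketch of the signature and the same-slot comparison. The way the k hash functions are built is an assumption made for the example: each h_i is SHA-256 over the slot index and the element's string form, truncated to 64 bits. That is slow but deterministic; real systems usually use much cheaper hash families. The names `minhash` and `estimate_jaccard` are illustrative.

```python
import hashlib


def _h(i, x):
    """Hash function number i applied to element x (64-bit integer)."""
    digest = hashlib.sha256(f"{i}:{x}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(S, k):
    """Return a length-k signature: the minimum of h_i over S for each i."""
    if not S:
        raise ValueError("cannot sign an empty set")
    return [min(_h(i, x) for x in S) for i in range(k)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of slots where the two signatures agree at the same index."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


if __name__ == "__main__":
    A = {f"item{n}" for n in range(0, 150)}
    B = {f"item{n}" for n in range(50, 200)}  # true Jaccard = 100 / 200
    print(estimate_jaccard(minhash(A, 256), minhash(B, 256)))
```

Both sets must be signed with the same k functions in the same order; otherwise matching slots mean nothing.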
Trace
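| i | hash function h_i | h_i(a) | h_i(b) | h_i(c) | h_i(d) | min_h | sigs after append |
|---|---|---|---|---|---|---|---|
| 0 | h_0 | 7 | 4 | 9 | 2 | 2 | [2] |
| 1 | h_1 | 5 | 8 | 3 | 6 | 3 | [2, 3] |
| 2 | h_2 | 1 | 9 | 4 | 7 | 1 | [2, 3, 1] |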
Where It's Used Today
Web-scale duplicate detection — AltaVista originally used MinHash (from Broder's 1997 "shingling" work) to spot near-duplicate webpages so they don't all appear in search results.
Plagiarism detection — a checker can compare student essays against a huge corpus by MinHash signatures, not raw text.
Recommendation systems — MinHash can quickly find users with similar listening or watching histories.
Genome and bioinformatics — tools like mash use MinHash signatures to compare DNA sequences in seconds where naive comparison would take hours.
Log-data analysis — security tools scan billions of log lines for near-duplicate attack patterns using MinHash to cluster similar events.
When NOT to Use
When you need exact Jaccard similarity, not an estimate — MinHash always has sampling error proportional to 1/√k.
When elements have weights or counts (multisets, term-frequency vectors) — plain MinHash treats every element as 0/1; use weighted MinHash or SimHash for cosine similarity.
When the sets are very small (under ~50 elements each) — direct intersection is exact and faster than building signatures.
Common Mistakes
Reusing the same hash function with the same seed (or trivially related seeds) for all k slots — the slots are then not independent and the signature collapses; give each slot its own independently chosen seed or parameters.
Forgetting that the universe of elements must be hashed consistently across both sets — different hash functions on each side make matches meaningless.
Estimating similarity by counting matching elements in the signatures rather than matches at the same index — only same-slot matches estimate Jaccard.
Try It with an AI Assistant
short
Write minhash(S, k) returning a MinHash signature of size k for set S.
behavior
Given a set S and an integer k, produce a length-k signature as follows. Pick k independent hash functions. For each hash function, hash every element of S and record the minimum hash value. Return the list of k minimums. Two sets are similar in proportion to how many slots of their signatures match.
It made scalable distributed caching and sharding practical for large web systems.
ring ← sorted_map()
FOR EACH server s
    FOR i FROM 0 TO vnodes - 1
        h ← hash(s + ":" + i)
        ring[h] ← s
    ENDFOR
ENDFOR

lookup(key):
    h ← hash(key)
    e ← ring.first_above(h) OR ring.first()
    RETURN e.value
David Karger and collaborators at MIT introduced consistent hashing in 1997 to solve a very practical problem, one that became the founding idea of Akamai the following year: how do you keep a distributed web cache stable when servers come and go all day? Their paper, Consistent Hashing and Random Trees, showed that mapping both keys and servers onto a single circular hash space limits the damage of any change to a thin slice of the keyspace. Within a few years the same trick was powering peer-to-peer DHTs (Chord), distributed databases (Dynamo, Cassandra), and almost every modern CDN.
Teaches: A ring placement minimizes data movement on resize
The Idea
Imagine a giant circle — say 0 to 2³² − 1, with 0 and the maximum stitched together. Hash every server's name to a point on the ring. Then hash every key to a point on the same ring. To find a key's server, walk clockwise from the key's position until you hit the first server.
Why does this minimize disruption? When a server vanishes, only the keys that previously walked to that server need to find a new home — and they simply re-walk clockwise to the next server. Every other key stays exactly where it was. Adding a server is symmetric: only the keys between the new server's position and the previous one need to move. The ring also supports "virtual nodes" — hashing each server many times under suffixes — to keep the load balanced even when the number of real servers is small.
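The ring, the clockwise walk, and virtual nodes can be sketched with Python's `bisect` module. Everything specific here is an assumption for illustration: the class name `HashRing`, the default of 100 virtual nodes, and the choice of the first four bytes of MD5 as the 32-bit ring position. In the rare case that two virtual nodes hash to the same position, this sketch keeps the first owner.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a position on a 32-bit ring."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, vnodes=100):
        self.vnodes = vnodes
        self._positions = []  # sorted ring positions
        self._owner = {}      # position -> server name

    def add_server(self, server):
        for i in range(self.vnodes):
            pos = _hash(f"{server}:{i}")
            if pos in self._owner:
                continue      # position already taken; keep the first owner
            bisect.insort(self._positions, pos)
            self._owner[pos] = server

    def remove_server(self, server):
        self._positions = [p for p in self._positions
                           if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, key):
        if not self._positions:
            raise LookupError("ring has no servers")
        # First position at or after the key's hash, walking clockwise.
        idx = bisect.bisect_left(self._positions, _hash(key))
        if idx == len(self._positions):
            idx = 0           # wrap around past the largest position
        return self._owner[self._positions[idx]]


if __name__ == "__main__":
    ring = HashRing()
    for name in ["S1", "S2", "S3"]:
        ring.add_server(name)
    print(ring.lookup("user:42"))
```

Removing a server only reassigns the keys that server owned; every other key still finds the same first position clockwise from its hash.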
Trace
| server | hash position on the ring |
|---|---|
| S1 | 20 |
| S2 | 50 |
| S3 | 90 |
Where It's Used Today
CDNs — Akamai, Cloudflare, and Fastly route URLs to edge servers via consistent hashing so adding capacity doesn't blow the cache.
Distributed caches — Memcached clients (for example the ketama scheme) and Riak place keys on a ring so machines can come and go without rebalancing the whole keyspace. Redis Cluster takes a related approach with a fixed set of hash slots.
NoSQL databases — Cassandra and DynamoDB use consistent hashing (with virtual nodes) to assign rows to partitions.
Load balancers — sticky-session routing in tools like HAProxy and Envoy hashes the user ID onto a ring of backends.
Peer-to-peer file systems — Chord and other DHTs rely on the exact same ring construction to find which peer holds a file.
When NOT to Use
When the server set is fixed and never changes — a plain hash(key) mod N is simpler and gives perfect uniformity at no cost.
When you need range queries (e.g. "all keys between A and Z") — hashing destroys order; use a range-partitioned scheme instead.
When request loads are highly skewed to a few hot keys — consistent hashing balances keys, not traffic; you'll still hammer one server. Add a caching layer in front.
Common Mistakes
Using only one hash position per server (no virtual nodes) — load distribution becomes wildly uneven with a small number of servers.
Forgetting the wrap-around case in lookup — when hash(key) exceeds the largest server position, the answer must wrap to ring.first(), not return null.
Re-hashing the entire ring on every lookup instead of caching the sorted structure — turns an O(log N) operation into O(N log N) per query.
Try It with an AI Assistant
short
Write a small consistent-hashing class with add_server, remove_server, and lookup(key), using a sorted ring and vnodes replicas per server.
behavior
Maintain a sorted map from integers (positions on a circular range) to server names. To register a server, hash its name several times under different suffixes and insert each hash as a key pointing to that server. To look up a data key, hash the key, find the smallest map entry whose position is ≥ that hash (wrapping around to the first entry if none), and return that entry's server.
For instance: A search engine ranks pages by links, not just keywords.
graph ← {A: [B, C], B: [C], C: [A]}
d ← 0.85
iterations ← 20
n ← number of nodes
rank ← array[n] filled with 1/n
FOR t FROM 1 TO iterations
    next ← array[n] filled with (1 - d) / n
    FOR EACH node u
        share ← d * rank[u] / outdegree(u)
        FOR EACH v IN graph[u]
            next[v] ← next[v] + share
        ENDFOR
    ENDFOR
    rank ← next
ENDFOR
RETURN rank
Larry Page and Sergey Brin were Stanford PhD students in 1998 when they reframed web search around a circular question: a page is important if other important pages link to it. Their insight — that this self-referential definition has a clean mathematical answer (the principal eigenvector of the link matrix) — let them rank the entire web in one calculation, leaving keyword-only competitors looking primitive overnight. The paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine" launched Google later that year and reshaped the economics of the internet.
Teaches: Importance flows recursively through who endorses you
The Idea
Imagine a "random surfer" clicking through the web. Most of the time (probability d, typically 0.85) the surfer follows a random link from the current page. Sometimes (probability 1 − d) they teleport to a random page anywhere. After clicking long enough, the fraction of time the surfer spends on each page is that page's PageRank.
The algorithm computes this iteratively. Start with every page at equal rank 1/n. Each round, each page divides its current rank evenly among the pages it links to (its outdegree is the number of outgoing links), multiplied by d. Every page also receives a small uniform contribution (1 − d)/n from teleportation. After enough rounds, the ranks stop changing; that's the steady state. The invariant: as long as every page has at least one outgoing link, the ranks sum to 1 after every iteration, just like a probability distribution.
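The iteration can be sketched in Python as follows. Two things here are assumptions that go beyond the pseudocode: the graph is a dict whose keys include every node that appears as a link target, and a page with no outlinks spreads its damped rank uniformly over all pages (one common convention) so that the total stays at 1.

```python
def pagerank(graph, iterations=20, d=0.85):
    """Return {node: rank} after a fixed number of power iterations."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Build the new vector separately; never update rank in place.
        nxt = {u: (1.0 - d) / n for u in nodes}
        for u in nodes:
            out = graph[u]
            if out:
                share = d * rank[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # Dangling page: spread its rank uniformly (assumed convention).
                for v in nodes:
                    nxt[v] += d * rank[u] / n
        rank = nxt
    return rank


if __name__ == "__main__":
    web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for node, score in pagerank(web).items():
        print(node, round(score, 3))
```

With `iterations=1` on the three-page example, the result matches the first row of the trace: roughly 0.333, 0.192, and 0.475.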
Trace
| iter | rank[A] | rank[B] | rank[C] | computation |
|---|---|---|---|---|
| 0 | 0.333 | 0.333 | 0.333 | initial uniform |
| 1 | 0.333 | 0.192 | 0.475 | A gets 0.05 + 0.85·0.333 from C; B gets 0.05 + 0.85·(0.333/2); C gets 0.05 + 0.85·(0.333/2 + 0.333) |
| 2 | 0.454 | 0.192 | 0.355 | re-distribute using row 1 |
| 3 | 0.351 | 0.243 | 0.406 | continue |
| 4 | 0.395 | 0.199 | 0.406 | converging |
| 5 | 0.395 | 0.218 | 0.387 | nearly stable |
Where It's Used Today
Web search — Google's original ranker; modern search uses hundreds of signals, but PageRank-style link analysis is still in the mix.
Social network influence — Twitter's "Who to Follow" and LinkedIn's connection scoring use PageRank variants on the social graph.
Citation analysis — academic journals use eigenvector centrality (a generalized PageRank) to rank papers and journals by citation importance.
Biology — protein interaction networks use PageRank to identify central proteins; drug-target prediction uses it on disease-gene graphs.
Spam detection and fraud — propagating "trust" or "risk" through transaction or link graphs catches accounts cluster-connected to known bad actors.
When NOT to Use
When edges have meaningful weights you must respect — basic PageRank treats every outlink equally; you need a weighted variant.
When the graph is tiny and a direct centrality calculation (or just counting in-links) would tell you everything.
When the graph has many "dangling" nodes with no outlinks and you skip the special handling — rank leaks away each iteration.
Common Mistakes
Updating rank in place during one iteration so half the nodes use new values, half use old, breaking the math.
Forgetting the teleport term (1 − d)/n, so disconnected components or dead-ends collapse all rank to zero.
Dividing by outdegree(u) when u has no outlinks, hitting a divide-by-zero instead of redistributing the rank uniformly.
Try It with an AI Assistant
short
Write pagerank(graph, iterations, damping) returning the steady-state rank vector.
behavior
Write a function that takes a directed graph and computes a score for each node by simulating a random walker. Start with every node at equal score. In each iteration, every node sends a fraction d of its current score, divided equally among its outgoing edges, to each of its neighbors. Every node also receives a small uniform contribution (1 − d)/n. Repeat for some number of iterations and return the final score vector.
Made breaking structured markup into labeled pieces systematic.
input ← "<p class=\"hi\">Hello</p>"
pos ← 0
WHILE pos < length(input)
    IF input[pos] = '<' THEN
        IF input[pos+1] = '/' THEN
            emit_end_tag()
        ELSEIF input[pos..pos+3] = '<!--' THEN
            emit_comment()
        ELSE
            emit_start_tag()
        ENDIF
    ELSE
        emit_text()
    ENDIF
ENDWHILE
// each emit_*() call appends one token to the output stream and advances pos past it
When the W3C published XML 1.0 in February 1998, the working group set itself a hard goal: make the spec small enough that "an undergraduate could write a conforming parser in a week." They got close — XML's tokenizer is famously simple compared to SGML's — but the strictness has a sharp edge: a single missing quote or unescaped ampersand makes a document non-well-formed and every conforming parser must reject it. That brittleness, intentional in 1998, is why so many systems eventually moved to JSON, but the XML tokenizer still runs every time you open a .docx, an SVG, or an RSS feed.
Teaches: Parse structure by recognizing repeating syntactic patterns and states
Anecdote
XML parsing rules were influenced by SGML's extreme complexity. Designers intentionally made XML simpler — but still strict enough that a tokenizer must be precisely correct or everything breaks (famously brittle).
The Idea
Walk through the input one position at a time, deciding what kind of token starts there based on a small set of rules. If the current character is <, you're entering markup; otherwise you're inside text. If the next character is /, it's an end tag; if the next four characters are <!--, it's a comment; otherwise it's a start tag, possibly carrying attributes and possibly self-closing with />. After emitting each token, advance the position past it and repeat.
Why does it work? Because XML's grammar is regular enough that simple character lookahead — at most a handful of characters — disambiguates every case. Inside attribute values you have to remember the quote character (" or ') and keep reading until you see the matching close quote, but otherwise the tokenizer is a flat loop of peek, decide, consume, emit. This is the smallest finite-state machine you can imagine, and the same shape underlies tokenizers for JSON, CSS, programming-language source code, and every kind of structured text.
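The peek, decide, consume, emit loop can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the token tuple shapes and the helper name _read_start_tag are invented for the example, there is no whitespace around "=" in attributes, and CDATA sections, processing instructions, and entity decoding are deliberately left out.

```python
def _read_start_tag(s, pos):
    """Read '<name a="v" ...>' or '<name ... />' beginning at s[pos] == '<'."""
    n = len(s)
    i = pos + 1
    start = i
    while i < n and not s[i].isspace() and s[i] not in "/>":
        i += 1
    name, attrs = s[start:i], {}
    while True:
        while i < n and s[i].isspace():
            i += 1
        if i >= n:
            raise ValueError("unterminated start tag")
        if s[i] == ">":
            return i + 1, ("start-tag", name, attrs)
        if s.startswith("/>", i):
            return i + 2, ("self-closing-tag", name, attrs)
        attr_start = i
        while i < n and s[i] not in "=>/" and not s[i].isspace():
            i += 1
        if i + 1 >= n or s[i] != "=" or s[i + 1] not in "\"'":
            raise ValueError("malformed attribute")
        quote = s[i + 1]                  # remember which quote opened the value
        close = s.find(quote, i + 2)      # only the same quote can close it
        if close == -1:
            raise ValueError("unterminated attribute value")
        attrs[s[attr_start:i]] = s[i + 2:close]
        i = close + 1


def xml_tokenize(s):
    """Split a small XML fragment into (kind, ...) tuples."""
    tokens, pos, n = [], 0, len(s)
    while pos < n:
        if s.startswith("<!--", pos):
            end = s.find("-->", pos + 4)
            if end == -1:
                raise ValueError("unterminated comment")
            tokens.append(("comment", s[pos + 4:end]))
            pos = end + 3
        elif s.startswith("</", pos):
            end = s.find(">", pos)
            if end == -1:
                raise ValueError("unterminated end tag")
            tokens.append(("end-tag", s[pos + 2:end].strip()))
            pos = end + 1
        elif s[pos] == "<":
            pos, token = _read_start_tag(s, pos)
            tokens.append(token)
        else:
            end = s.find("<", pos)
            if end == -1:
                end = n
            tokens.append(("text", s[pos:end]))
            pos = end
    return tokens
```

Every branch advances pos past the token it emits, which is what guarantees the loop terminates.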
Trace
pos | input[pos] starts… | rule taken | token emitted
0 | <p (not </, not <!--) | start tag | start-tag(p, attrs={class:"hi"})
14 | H | text | text("Hello")
19 | </ | end tag | end-tag(p)
23 | end of input | loop terminates | —
Where It's Used Today
Web browsers — every browser begins by tokenizing HTML before building the DOM tree of the page.
RSS and Atom feeds — feed readers tokenize the XML to extract article titles, dates, and links.
Office file formats — .docx, .xlsx, and .pptx are zipped bundles of XML; opening any one of them runs an XML tokenizer.
SVG graphics — vector-image software tokenizes SVG to render shapes and animations.
Configuration files — many enterprise systems still ship XML config files; loading them starts with tokenization.
When NOT to Use
When you need to validate nesting, namespaces, or DTDs — that's a parser's job, not a tokenizer's; use lxml or expat.
When parsing HTML in the wild — real-world HTML is full of unclosed tags and quirks that a strict XML tokenizer rejects.
When a regex like <(\w+)>...</\1> is enough for your one-off extraction — building a tokenizer is overkill.
Common Mistakes
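Failing to track the active quote character inside attributes, so class='say "hi"' ends the value at the first inner double quote instead of at the closing single quote. (XML has no backslash escapes; a quote inside a value must be the other quote character or an entity such as &quot;.)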
Treating <![CDATA[ ... ]]> as a regular text run, mangling content that legitimately contains < and &.
Forgetting to detect self-closing /> separately from end tags </p>, producing two opens for <br/> or losing it entirely.
Try It with an AI Assistant
short
Write xml_tokenize(s) returning a list of tokens labeled as one of: tag-open, tag-close, tag-self-close, attribute, text, comment; handle quoted attribute values.
behavior
Write a function that walks through a string left to right. At each step, peek at the current character: if it is a less-than sign, decide whether the next characters spell an end tag, a comment, or a start tag (possibly with attributes and possibly self-closing); otherwise treat the next chunk as plain text. After each piece, label it and advance past it. Return the labeled list.
Count-Min Sketch (probabilistic frequency estimation)
Multiple Tables, Take the Min
Cormode & Muthukrishnan
It made frequency estimation possible for massive streams such as network traffic, search queries, and click logs using tiny memory.
d ← 2
w ← 5
C ← matrix(d, w) filled with 0
// update(x): increment counters
FOR i FROM 0 TO d - 1
    j ← h_i(x) MOD w
    C[i][j] ← C[i][j] + 1
ENDFOR
// query(x): minimum over all rows
best ← infinity
FOR i FROM 0 TO d - 1
    j ← h_i(x) MOD w
    best ← min(best, C[i][j])
ENDFOR
RETURN best
Data streams can be too large to store exact counts for every item. Count-Min Sketch used several hash tables to keep approximate counts with controlled overestimation.
Teaches: Take the minimum across hash collisions
The Idea
Keep a 2-D counter array C with d rows and w columns. Pick d independent hash functions, one per row. To update for an item x, hash x once per row to get a column index j, and increment C[i][j] for every row. To query the count of x, hash x again with the same d functions and return the minimum over the d cells visited.
Why does the minimum work? Each cell C[i][j] counts every item whose hash in row i lands at column j, so its value is at least the true count of x (collisions only push it higher). The smallest of the d cells is the tightest upper bound. With well-chosen d and w, the overestimate is small with high probability — and we never undercount.
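A compact Python sketch of the update and query steps might look like this. The class name and the use of a salted BLAKE2b digest as the per-row hash are assumptions made for illustration; the salting only approximates the independent hash functions the analysis calls for.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth, width):
        self.depth, self.width = depth, width
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, row, item):
        # One salted hash per row; the row number is the salt (an assumption,
        # standing in for d independent hash functions).
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._column(row, item)] += count

    def query(self, item):
        # Every visited cell is >= the true count, so the minimum is the
        # tightest upper bound available.
        return min(self.table[row][self._column(row, item)]
                   for row in range(self.depth))


sketch = CountMinSketch(depth=4, width=64)
sketch.update("login", 3)
sketch.update("signup", 2)
```

After these updates, sketch.query("login") is at least 3 and sketch.query("signup") is at least 2; collisions can only push the answers higher.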
Trace
step | action | C row 0 | C row 1
0 | (initial) | [0, 0, 0, 0, 0] | [0, 0, 0, 0, 0]
1 | update "login" ×3 | [0, 3, 0, 0, 0] | [0, 0, 0, 3, 0]
2 | update "signup" ×2 | [0, 5, 0, 0, 0] | [0, 0, 0, 3, 2]
Where It's Used Today
Network monitoring — Cisco and other routers use Count-Min variants to estimate per-flow packet counts at line rate.
Search engine analytics — keeping approximate frequencies of trillions of search queries.
Database query optimizers — cardinality estimates that guide which index or join order to pick.
Real-time abuse detection — counting how often a single IP or user has repeated an action without storing a row per IP.
Big-data systems — Apache Spark, Apache Storm, and Druid include Count-Min Sketch primitives for approximate streaming counts.
When NOT to Use
When overestimates are unacceptable — Count-Min never undercounts but can overcount by an unknown amount; if you need a guaranteed lower bound, use a different sketch.
When item counts can be negative (decrements allowed) — taking the min over rows breaks; use a Count-Mean-Min or Count Sketch instead.
When most items in the stream are rare — collisions with heavy hitters wildly overestimate light items; rare-item analysis needs a different approach.
Common Mistakes
Using d correlated hash functions (or just one hash with different seeds applied trivially) — collisions stop being independent and the error bound collapses.
Returning the sum across rows instead of the minimum — every collision then inflates the answer instead of being filtered out.
Sizing w based on the number of distinct items rather than the desired error tolerance, leading to a sketch that's either huge or useless.
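The sizing mistake above is easier to avoid with the standard bounds from Cormode and Muthukrishnan's analysis: with w = ⌈e/ε⌉ columns and d = ⌈ln(1/δ)⌉ rows, a query overestimates by at most ε times the total stream count with probability at least 1 − δ. A small helper (the function name is an assumption) makes the arithmetic concrete:

```python
import math


def cms_dimensions(epsilon, delta):
    """Return (depth, width) for error factor epsilon and failure probability delta."""
    width = math.ceil(math.e / epsilon)        # columns control the error size
    depth = math.ceil(math.log(1.0 / delta))   # rows control the failure odds
    return depth, width


depth, width = cms_dimensions(epsilon=0.001, delta=0.01)
```

For a 0.1% error factor and a 1% failure probability this gives 5 rows of 2719 counters, independent of how many distinct items the stream contains.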
Estimate item frequencies in a stream using multiple hash rows and return the minimum counter value across rows. Keep a 2-D table with d rows and w columns. To increment an item, hash it once per row, take each hash modulo w to pick a column, and add one to that cell. To query an item, hash it the same way and return the smallest of the d cells you read.
For instance: a Go program chooses moves by simulating many futures.
root ← current game position
n_iter ← 6
REPEAT n_iter times
    node ← select(root)
    child ← expand(node)
    result ← simulate(child)
    backpropagate(child, result)
RETURN best child of root
Rémi Coulom introduced MCTS in a 2006 paper while working on his Go program Crazy Stone in France. At the time, classical alpha-beta engines could not beat a strong amateur Go player, because the branching factor (~250) and the lack of a good evaluation function made the standard chess-style approach hopeless. Within a year, MCTS Go programs were beating every previous Go engine; within ten years, AlphaGo combined MCTS with deep neural networks and defeated world champion Lee Sedol 4–1. The same select / expand / simulate / backpropagate loop now drives AlphaZero, MuZero, and many modern game-playing systems.
Teaches: Sample random futures to grow the tree where it matters most
The Idea
Each iteration of MCTS does four steps:
1. Select — start at the root and walk down the tree, choosing the most promising child at each level using a formula that balances "win rate so far" with "we haven't tried this much yet" (the UCB1 score).
2. Expand — when you reach a node that hasn't been fully expanded, add one of its unseen children.
3. Simulate — from that new node, play random moves until the game ends and record who won.
4. Backpropagate — walk back up the path, incrementing each node's visit count and adjusting its win count.
After many thousands of iterations, return the root's child with the highest visit count (or highest win rate). Why does it work? Because UCB1 is a bandit algorithm that provably balances exploration and exploitation, and random playouts, averaged over many runs, give a noisy but useful estimate of how good each move is. The tree grows asymmetrically toward the moves that actually look promising and spends little effort on obviously bad branches. This is why MCTS scales to astronomical game trees that classical minimax cannot touch.
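The selection rule is the least obvious of the four steps, so here is a small Python sketch of it. The UCB1 score is win rate plus c·√(ln(parent visits) / child visits); the function names, the (wins, visits) tuple layout, and the default c = √2 are assumptions chosen for illustration, and real engines tune c.

```python
import math


def ucb1(wins, visits, parent_visits, c=math.sqrt(2)):
    """Win rate so far plus a bonus that shrinks as the child is visited more."""
    if visits == 0:
        return math.inf              # untried children are always picked first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits, c=math.sqrt(2)):
    """children maps move -> (wins, visits); return the move with the top score."""
    return max(children, key=lambda m: ucb1(*children[m], parent_visits, c))


# Statistics after the first three iterations of the trace: A 1/1, B 0/1, C 1/1.
after_three = {"A": (1, 1), "B": (0, 1), "C": (1, 1)}
choice = select_child(after_three, parent_visits=3)
```

With these statistics A and C tie and the first one, A, is returned, which matches iteration 4 of the trace. One round later (A 2/2, C 1/1, four root visits) C's larger exploration bonus wins, matching iteration 5.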
Trace
iter | select reaches | expand | simulated result | backpropagate updates
1 | root | A | win | A: 1/1, root: 1/1
2 | root | B | loss | B: 0/1, root: 1/2
3 | root | C | win | C: 1/1, root: 2/3
4 | root → A (UCB best) | A's child | win | A: 2/2, root: 3/4
5 | root → C | C's child | loss | C: 1/2, root: 3/5
6 | root → A | A's child | win | A: 3/3, root: 4/6
Where It's Used Today
Go and chess engines — AlphaGo, AlphaZero, Leela Chess Zero, and KataGo all use MCTS-style search guided by neural networks; Stockfish, by contrast, remains an alpha-beta engine.
Video game AI — strategy games like Civilization and card games like Hearthstone use MCTS for planning enemy moves under massive branching factors.
Robotics planning — sampling-based motion planners (like RRT) and task-and-motion planners share the explore-promising-paths idea with MCTS, and some planners use MCTS directly.
Drug discovery and chemistry — exploring huge spaces of possible molecules by treating each "build" decision as a move in a game.
AlphaZero's general framework — many reinforcement-learning systems that combine a neural network value estimate with planning (including DeepMind's MuZero) use MCTS as their search core.
When NOT to Use
When the game has a small branching factor and a good evaluation function (e.g., classical chess) — alpha-beta with a strong heuristic still wins on per-second compute.
When random rollouts are useless (the game's outcome is decided by deep tactical lines, not by typical play) — the simulation step gives noise rather than signal.
When the environment is stochastic and partially observable to a high degree — vanilla MCTS assumes deterministic, fully-observed states; you need POMCP or information-set MCTS variants.
Common Mistakes
Picking the move with the highest win rate at the root instead of the highest visit count — high-rate but rarely-tried branches are usually statistical flukes.
Forgetting to flip the win/loss perspective at each backpropagation step in a two-player game, so the opponent's wins get credited to you.
Hardcoding the UCB exploration constant c without tuning — too small and the tree never explores; too large and it never exploits the good moves it found.
Try It with an AI Assistant
short
Write mcts(root, n_iter) running select/expand/simulate/backpropagate iterations to choose the best action.
behavior
Write a function that, given a game position, repeats the following many times: walk down a tree of positions choosing at each level the child that maximizes win-rate plus an exploration bonus that decays as that child is visited; when a position is reached whose children aren't all in the tree, add one new child; play random moves from there until the game ends; then walk back up the path incrementing every node's visit count and recording who won. Return the move at the root with the highest visit count.
It made approximate unique counting practical for analytics systems: unique visitors, distinct users, unique queries, and large telemetry streams.
stream ← ["apple", "banana", "apple", "cherry", "banana", "apple"]
b ← 1
m ← 2
M ← array[0..m-1] filled with 0
FOR EACH x IN stream
    h ← hash(x)
    j ← top b bits of h
    rho ← position of leftmost 1 in remaining bits
    M[j] ← max(M[j], rho)
ENDFOR
RETURN alpha * m^2 / sum(2^(-M[j]))
Counting distinct items exactly at internet scale is memory-hungry. HyperLogLog used the signal in leading-zero patterns of hashes to estimate cardinality compactly.
Teaches: Leading zeros reveal hidden cardinality
The Idea
Hash each item to a long bit string. Random hashes look like coin flips: among n random bit strings, you expect to see one that starts with about log₂ n leading zeros. So if the longest run of leading zeros you've ever observed is R, then the number of distinct items you've seen is roughly 2^R. That's the seed insight.
To make the estimate more reliable, split items across many "registers" using the top few bits of the hash to pick a register j. Each register M[j] keeps the largest rho it ever saw, where rho is the position of the leftmost 1 in the remaining bits (the number of leading zeros plus one). The final estimate combines all m registers with a harmonic mean, multiplied by a correction constant alpha. The harmonic mean dampens the rare, extremely large readings that would otherwise dominate. The invariant: each register's value depends only on the distinct items mapped to it; duplicates of an already-seen item never raise it.
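The register update and the harmonic-mean estimate can be sketched in Python as below. Several details are assumptions made for the example: the class name, the use of the first 64 bits of SHA-1 as the hash, the default of 2^10 registers, and the alpha formula 0.7213 / (1 + 1.079/m), which is the usual constant for m ≥ 128. The small-range fallback to linear counting is included because the raw estimate is unreliable at low cardinalities.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, b=10):
        self.b = b                       # number of index bits (assumed >= 7)
        self.m = 1 << b                  # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")           # 64-bit hash value
        j = h >> (64 - self.b)                          # top b bits pick the register
        rest = h & ((1 << (64 - self.b)) - 1)           # remaining 64 - b bits
        rho = (64 - self.b) - rest.bit_length() + 1     # position of leftmost 1
        self.registers[j] = max(self.registers[j], rho)

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)   # small range: linear counting
        return raw


hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
approx = hll.estimate()
```

With 1024 registers the standard error is roughly 1.04/√1024, a little over 3%, so approx should land within a few percent of 10,000. Feeding the same items again leaves every register unchanged.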
Trace
x | hash (top bit / remaining) | j (top bit) | remaining bits | rho (leftmost-1 position)
"apple" | 0 / 0010110 | 0 | 0010110 | 3 (first 1 is at pos 3)
"banana" | 1 / 0001011 | 1 | 0001011 | 4
"apple" | 0 / 0010110 | 0 | 0010110 | 3 (no change)
"cherry" | 0 / 1000000 | 0 | 1000000 | 1 (no change, 3 > 1)
"banana" | 1 / 0001011 | 1 | 0001011 | 4 (no change)
"apple" | 0 / 0010110 | 0 | 0010110 | 3 (no change)
Where It's Used Today
Google BigQuery and Redshift — APPROX_COUNT_DISTINCT is implemented with HyperLogLog under the hood for fast SQL analytics.
Redis — the PFCOUNT command estimates unique-item counts in 12 KB of memory regardless of input size.
Network telemetry — counting unique IP addresses, domains, or flows on backbone routers without storing them all.
Web analytics — Google Analytics, Adobe Analytics, and similar tools estimate unique visitors per minute with HLL.
Distributed systems monitoring — Prometheus, Datadog, and other observability stacks use HLL-style counters to track cardinality of metric labels.
When NOT to Use
When the true cardinality is small (a few hundred) — a plain hash set is exact, simpler, and uses less memory than the register array.
When you need to enumerate the distinct items, not just count them — HLL throws away identities by design.
When even a percent or two of error is unacceptable, like billing or audit counts — the standard error is about 1.04/√m for m registers, so use exact counting with a hash set or a sorted index instead.
Common Mistakes
Using a weak hash like Java's default hashCode — biased leading-zero patterns make the estimate systematically wrong.
Counting the leading-zero position over the whole hash instead of the bits remaining after register selection, double-counting bits.
Skipping the small-range correction (linear counting) for low cardinalities — estimates near zero come out wildly off.
Implement an approximate distinct-count data structure. Maintain m registers, all zero. For each input value, hash it; use the top b bits to choose a register, and update that register with the maximum of its current value and the position of the leftmost 1-bit in the remaining bits. To estimate the cardinality, return alpha · m² divided by the sum of 2^(−register) across all registers.
It made leaderboards, search results, recommendation candidates, and streaming ranking efficient without full sorting.
heap ← min_heap()
FOR EACH x IN stream
    IF |heap| < k THEN
        heap.push(x)
    ELSEIF x > heap.top THEN
        heap.pop()
        heap.push(x)
    ENDIF
ENDFOR
RETURN heap.values
When a stream has millions of items, sorting everything just to find the top few is wasteful. A small min-heap keeps only the current best k candidates.
Teaches: Reject anything smaller than the smallest kept
The Idea
Keep a min-heap of size at most k. A min-heap is a data structure that always lets you peek at its smallest element in O(1) time and remove it in O(log k) time. As each new value x arrives: if the heap has fewer than k items, just push x in. Otherwise, compare x to the smallest item in the heap (the heap's top). If x is bigger, evict the top and push x. If x is smaller or equal, drop it.
The invariant is the whole point: at every moment, the heap holds the k largest values seen so far, with the smallest of those k sitting at the top — ready to be the gatekeeper. Any new arrival smaller than the gatekeeper can't possibly be in the top-k. The cost per item is O(log k), and we only ever store k items, regardless of stream length.
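In Python the gatekeeper pattern maps directly onto the standard heapq module. The function name and the decision to return the survivors sorted are assumptions for this sketch; the stream is the one used in the trace, with k = 3.

```python
import heapq


def top_k(stream, k):
    """Return the k largest values seen in `stream`, in ascending order."""
    if k <= 0:
        return []
    heap = []                            # min-heap: heap[0] is the smallest kept value
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)   # pop the smallest and push x in one step
    return sorted(heap)


largest = top_k([7, 3, 9, 1, 8, 6, 4, 5, 10, 2], k=3)
```

Here largest is [8, 9, 10], matching the final heap in the trace. heapq.heapreplace does the pop and push in a single sift, which is slightly cheaper than calling heappop followed by heappush.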
Trace
step | x | heap before | action | heap after
1 | 7 | [] | size < 3, push | [7]
2 | 3 | [7] | size < 3, push | [3, 7]
3 | 9 | [3, 7] | size < 3, push | [3, 7, 9]
4 | 1 | [3, 7, 9] | 1 ≤ top (3), drop | [3, 7, 9]
5 | 8 | [3, 7, 9] | 8 > top (3), pop+push | [7, 8, 9]
6 | 6 | [7, 8, 9] | 6 ≤ top (7), drop | [7, 8, 9]
7 | 4 | [7, 8, 9] | 4 ≤ top (7), drop | [7, 8, 9]
8 | 5 | [7, 8, 9] | 5 ≤ top (7), drop | [7, 8, 9]
9 | 10 | [7, 8, 9] | 10 > top (7), pop+push | [8, 9, 10]
10 | 2 | [8, 9, 10] | 2 ≤ top (8), drop | [8, 9, 10]
Where It's Used Today
Search-result ranking — Google, Elasticsearch, and Lucene keep the top k documents per query using a min-heap as scores stream in from the index.
Recommendation systems — YouTube and Spotify use top-k heaps to keep the k highest-scoring candidate videos or tracks per user request.
Real-time leaderboards — gaming services and analytics dashboards maintain top-k players or top-k events under continuous updates.
Network monitoring (heavy hitters) — routers and DDoS defenses identify the top-k busiest source IPs in a packet stream this way.
Streaming ML inference — beam-search decoders in machine translation and speech recognition keep the top-k partial hypotheses each step.
When NOT to Use
When k is comparable to n (say, k > n/2) — the heap is almost as big as the input; just sort the array, which is simpler and only O(n log n).
When you also need the top-k in sorted order at every step — a min-heap gives you the set, not the order; pop into a stack at the end, or use an order-statistics tree.
When the data supports random access (it is an in-memory array rather than a true stream) and you only need a one-shot answer — Quickselect partitions to find the k-th largest in expected O(n), faster than O(n log k).
Common Mistakes
Using a max-heap instead of a min-heap — you can no longer cheaply test "is the new value larger than the worst kept?" without scanning all k items.
Forgetting the size check (|heap| < k) and trying to compare against heap.top on an empty heap — crashes on the first item, or worse, returns nonsense.
Pushing every stream value and only trimming at the end — memory grows to O(n), defeating the whole point of bounded-memory streaming.
Try It with an AI Assistant
short
Write top_k_via_min_heap_streaming(...) implementing Top-k via Min-Heap (streaming).
behavior
Find the largest k items from a stream of values without storing the whole stream. Maintain a small collection of size at most k that always lets you find its smallest element quickly. For each new item, if the collection has room, add it; otherwise compare it to the collection's smallest element and only add it if it's larger, evicting the smallest first.
Made hashing fast enough to keep up with memory bandwidth.
data ← "hello"
seed ← 0
primes ← [p1, p2, p3, p4, p5] // five fixed odd primes from the xxHash32 spec
v1 ← seed + p1 + p2
v2 ← seed + p2
v3 ← seed
v4 ← seed - p1
WHILE remaining >= 16
    v1 ← round(v1, read32())
    v2 ← round(v2, read32())
    v3 ← round(v3, read32())
    v4 ← round(v4, read32())
ENDWHILE
IF length(data) >= 16 THEN
    h ← combine(v1, v2, v3, v4)
ELSE
    h ← seed + p5
ENDIF
h ← h + length(data) // the spec folds the total input length into the hash
h ← absorb_tail(h, tail)
RETURN finalize(h) // h ^= h>>15; h *= p2; h ^= h>>13; h *= p3; h ^= h>>16
Yann Collet created xxHash while working on data compression tools. The goal wasn't "best hash ever" — it was: "fast enough that hashing is no longer the bottleneck." Speed over elegance.
Teaches: Mix in parallel lanes, then avalanche aggressively
The Idea
Process the input in blocks of 16 bytes, but split each block across four parallel accumulators v1, v2, v3, v4, each handling 4 bytes. Each accumulator updates with a round step that multiplies the incoming 4 bytes by a large odd prime, adds the product to the accumulator, rotates, and multiplies by a second prime. Running four accumulators in parallel lets a modern CPU pipeline absorb 16 bytes per loop iteration, far faster than a single-stream hash.
When the input runs out of full 16-byte blocks, combine the four accumulators into one 32-bit hash by rotating each by a different amount and adding them together, then add the input length (the 64-bit variant follows the same shape with wider lanes). Next, absorb the leftover tail bytes four at a time and then one at a time, and apply a final avalanche step (XOR-shift, multiply, XOR-shift) so that every output bit depends on every input bit. The result: a hash that mixes thoroughly while running close to the speed at which the CPU can read memory.
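The steps above can be written out as a pure-Python sketch of the 32-bit variant. The constants are the five primes from the published xxHash32 specification; the helper names _rotl and _round are assumptions for the example, and a pure-Python version illustrates the structure only, since it comes nowhere near the speed of the C implementation.

```python
import struct

P1, P2, P3, P4, P5 = 2654435761, 2246822519, 3266489917, 668265263, 374761393
MASK = 0xFFFFFFFF                      # keep every intermediate value to 32 bits


def _rotl(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK


def _round(acc, lane):
    acc = (acc + lane * P2) & MASK
    return (_rotl(acc, 13) * P1) & MASK


def xxhash32(data, seed=0):
    """Hash a bytes object; words are read little-endian as the spec requires."""
    n, pos = len(data), 0
    if n >= 16:
        v1 = (seed + P1 + P2) & MASK
        v2 = (seed + P2) & MASK
        v3 = seed & MASK
        v4 = (seed - P1) & MASK
        while pos + 16 <= n:           # four 4-byte lanes per 16-byte block
            a, b, c, d = struct.unpack_from("<4I", data, pos)
            v1, v2, v3, v4 = _round(v1, a), _round(v2, b), _round(v3, c), _round(v4, d)
            pos += 16
        h = (_rotl(v1, 1) + _rotl(v2, 7) + _rotl(v3, 12) + _rotl(v4, 18)) & MASK
    else:
        h = (seed + P5) & MASK         # short input: accumulators are skipped
    h = (h + n) & MASK                 # fold in the total length
    while pos + 4 <= n:                # tail, four bytes at a time
        (word,) = struct.unpack_from("<I", data, pos)
        h = (_rotl((h + word * P3) & MASK, 17) * P4) & MASK
        pos += 4
    while pos < n:                     # tail, one byte at a time
        h = (_rotl((h + data[pos] * P5) & MASK, 11) * P1) & MASK
        pos += 1
    h ^= h >> 15                       # final avalanche
    h = (h * P2) & MASK
    h ^= h >> 13
    h = (h * P3) & MASK
    h ^= h >> 16
    return h


empty_hash = xxhash32(b"")
```

The empty input with seed 0 hashes to 0x02CC5D05, a handy first check when porting the algorithm to another language.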
Trace
step | what happens | state
0 | initialize v1..v4 from seed and primes p1, p2 | v1, v2, v3, v4 set
1 | remaining = 5 < 16 → skip main loop | v1..v4 unchanged
2 | short input, no combine: h = seed + p5 = 0 + p5, then h += length (5) | h initialized
3 | absorb tail: 4 bytes "hell" via 4-byte mixing step | h folded
4 | absorb tail: 1 byte "o" via 1-byte mixing step | h folded again
5 | finalize: h ^= h >> 15; h *= p2; h ^= h >> 13; h *= p3; h ^= h >> 16 | avalanche complete
Where It's Used Today
File system and archive integrity — Btrfs (as an optional checksum), the LZ4 frame format, and Zstandard use xxHash to detect data that has been silently corrupted.
Database hashing — RocksDB, ClickHouse, and many other storage engines use xxHash to spread keys across shards or buckets.
Build systems and caches — tools that fingerprint files to detect changes can use xxHash when speed matters more than tamper resistance; Bazel and Docker, for example, default to SHA-256 content digests instead.
Networking and packet inspection — high-throughput firewalls and load balancers use xxHash for fast flow-table lookups.
Big data pipelines — Spark, Hadoop, and Kafka use xxHash-style fast hashes for partitioning records across worker nodes.
When NOT to Use
When you need cryptographic security (passwords, signatures, tamper resistance) — xxHash is non-cryptographic and offers no resistance to deliberately crafted collisions; use SHA-256 or BLAKE3.
When you hand-roll an implementation and need stable hashes across architectures with different endianness — the xxHash specification reads multi-byte words as little-endian, so code that reads them in native byte order will diverge on big-endian machines.
When inputs are tiny and uniform (3-8 byte keys) — the constant cost of init/finalize dominates; a simpler integer hash like FNV or MurmurHash3's finalizer is competitive.
Common Mistakes
Picking your own "small" prime constants instead of the published xxHash primes — the avalanche guarantee disappears and similar inputs collide.
Forgetting the final avalanche step (h ^= h >> 15; h *= p2; ...) — the upper bits stay weakly mixed and modulo-by-table-size produces clustered keys.
Reading past the end of the input when handling the tail — xxHash explicitly processes 4-byte then 1-byte chunks; doing one bulk read corrupts the hash and risks a buffer overrun.
Try It with an AI Assistant
short
Write xxhash32(data, seed) implementing xxHash.
behavior
Write a fast non-cryptographic hash that processes input in 16-byte blocks split across four parallel 32-bit accumulators. Each accumulator's round step multiplies a 4-byte chunk by a large odd prime, adds it to the accumulator, rotates, and multiplies again by a second large odd prime. After all full blocks, combine the four accumulators by rotating each and adding them together, then add the input length. Then absorb any leftover tail bytes (4 at a time, then 1 at a time) and finish with a final avalanche step that multiplies and XOR-shifts so every output bit mixes thoroughly.
HNSW Search (Hierarchical Navigable Small World — search step)
Climb the Graph to Your Neighbor
Malkov & Yashunin
Made finding nearest neighbors in millions of vectors fast.
node ← entry_point
FOR level FROM top DOWNTO 1
    REPEAT
        next ← argmin neighbor by dist(query)
        IF dist(next) < dist(node) THEN
            node ← next
        ELSE
            BREAK
        ENDIF
    UNTIL no progress
ENDFOR
RETURN search_layer(node, query, k, level=0)
Yury Malkov and Dmitry Yashunin were inspired by small-world networks (like social graphs). The key intuition is that navigating the graph resembles hopping between friends-of-friends: a social metaphor turned into one of the most important search structures in use today.
Teaches: Navigate small worlds via greedy local moves at decreasing scale
The Idea
The graph has multiple layers. The top layer contains very few nodes connected by long-range edges, like flying over a country. Each layer below has more nodes and shorter edges, all the way down to layer 0, which contains every item. Crucially, nodes are linked to their nearest neighbors within each layer, so a greedy walk on any layer tends to move steadily toward the query, although it can stall in a local minimum.
Search starts at a fixed entry point on the top layer. At each layer: from the current node, look at every neighbor's distance to the query, jump to the neighbor closest to the query, and repeat until no neighbor is closer than where you are. Then drop down a layer and continue. At layer 0, perform a small expanded search to collect the k nearest items. The invariant is "the current node is the closest one I've seen at this layer." Because layers shrink the search space exponentially, the entire process is roughly O(log n) distance computations.
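The descent can be sketched in Python on a toy graph. Everything about the data layout here is an assumption for illustration: layers is a list of adjacency dicts with layers[0] at the bottom, the nodes are plain numbers so that distance is just an absolute difference, and search_layer is a simplified best-first search with a candidate budget ef, not the reference implementation.

```python
import heapq


def greedy_step(graph, node, query, dist):
    """Walk one layer greedily: hop to the closest neighbour until none is closer."""
    while True:
        best = min(graph.get(node, []), key=lambda v: dist(v, query), default=node)
        if dist(best, query) < dist(node, query):
            node = best
        else:
            return node


def search_layer(graph, entry, query, k, ef, dist):
    """Best-first search keeping the ef closest nodes seen so far."""
    visited = {entry}
    candidates = [(dist(entry, query), entry)]     # min-heap: closest first
    best = [(-dist(entry, query), entry)]          # max-heap via negated distance
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -best[0][0] and len(best) >= ef:
            break                                  # no remaining candidate can help
        for v in graph.get(node, []):
            if v in visited:
                continue
            visited.add(v)
            dv = dist(v, query)
            if len(best) < ef or dv < -best[0][0]:
                heapq.heappush(candidates, (dv, v))
                heapq.heappush(best, (-dv, v))
                if len(best) > ef:
                    heapq.heappop(best)            # drop the farthest kept node
    return [v for _, v in sorted((-nd, v) for nd, v in best)][:k]


def hnsw_search(layers, entry, query, k, ef=10, dist=lambda a, b: abs(a - b)):
    node = entry
    for level in range(len(layers) - 1, 0, -1):    # upper layers: greedy only
        node = greedy_step(layers[level], node, query, dist)
    return search_layer(layers[0], node, query, k, max(ef, k), dist)


# Toy index over the points 0..9: a chain at layer 0, sparser links above.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 9] for i in range(10)}
layers = [chain, {0: [4], 4: [0, 8], 8: [4]}, {0: [8], 8: [0]}]
nearest = hnsw_search(layers, entry=0, query=6.2, k=2)
```

On this toy index the top layer jumps from 0 to 8 in one hop, and the layer-0 search then returns [6, 7] as the two points nearest to 6.2. Raising ef widens the bottom-layer search and is the usual knob for trading speed against recall.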
Trace
level | node | candidate neighbors (with dist to query) | action
2 | A (d=8) | B (d=6), C (d=9) | move to B
2 | B (d=6) | A (d=8), D (d=7) | no closer → drop
1 | B (d=6) | E (d=4), F (d=7) | move to E
1 | E (d=4) | B (d=6), G (d=3) | move to G
1 | G (d=3) | E (d=4), H (d=4) | no closer → drop
0 | G (d=3) | (run search_layer to collect k nearest) | return result
Where It's Used Today
Vector databases — Pinecone, Weaviate, Milvus, Qdrant, and Chroma all use HNSW (or HNSW-like graphs) as their primary index for nearest-neighbor search.
Retrieval-augmented generation (RAG) — when an assistant such as ChatGPT or Claude pulls relevant documents from a knowledge base, the retrieval step is often an HNSW lookup over embeddings.
Image search — reverse-image search engines and visual recommendation systems (Pinterest, Shopify) use HNSW over image embeddings.
Recommendation systems — Spotify, Netflix, and YouTube use approximate nearest-neighbor search to find "songs similar to this one" or "users like you."
Face recognition — large-scale face-matching systems (security, photo tagging) use HNSW to search billions of facial embeddings.
When NOT to Use
When you need exact nearest neighbors — HNSW is approximate; greedy descent can miss the true closest item. Use brute-force or a tree-based exact index when correctness matters.
When the dataset fits in a few thousand vectors — brute-force scan is faster than building a layered graph, and the index memory overhead isn't worth it.
When vectors are constantly inserted and deleted — HNSW handles inserts well but deletions leave "tombstones" that degrade search quality; rebuild periodically or use a different index.
Common Mistakes
Returning the first node where no neighbor is closer as the final answer — that's only the entry point for the next layer down; you must keep descending and run an expanded search at layer 0.
Using too small an efSearch parameter — greedy descent gets stuck in local minima far from the true neighbor; recall drops sharply but the bug is silent.
Comparing distances inconsistently (cosine for build, Euclidean for query, or vice versa) — the graph's "closeness" relations no longer match the query metric and results become meaningless.
Try It with an AI Assistant
short
Write the search step of HNSW (Hierarchical Navigable Small World) — given a multi-layer graph and a query vector, return the k nearest items.
behavior
Write a function that searches a multi-layer graph by starting at a fixed entry node on the top layer. At each layer, from the current node look at the distance from the query to every neighbor and jump to the closest neighbor; repeat until no neighbor is closer than the current node. Then drop one layer and continue. At the bottom layer, run a small expanded search to collect the k nearest items.
Swiss Tables Lookup (open-addressing with control bytes)
Modern Buckets, Modern Probes
Google / Abseil
Made hash table lookups fast on modern cache-aware CPUs.
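group ← (hash >> 7) MOD ng // upper bits choose the starting group
ctrl ← hash AND 0x7F // low 7 bits form the fingerprint
WHILE TRUE
    base ← 16 * group // first slot of this group
    bytes ← table.ctrl[base .. base+15]
    matches ← simd_eq(bytes, ctrl)
    FOR EACH i IN matches
        IF table.keys[base+i] = key THEN
            RETURN table.vals[base+i]
        ENDIF
    ENDFOR
    IF any byte = empty THEN
        RETURN NONE
    ENDIF
    group ← (group + 1) MOD ng
ENDWHILE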
By the mid-2010s, Google's C++ services were spending a measurable fraction of their CPU time inside std::unordered_map, and most of that time was spent waiting for memory. Engineers on the Abseil team realized that the bottleneck wasn't hashing or comparison; it was cache misses, one per probe. Their fix, presented at CppCon 2017, married two old ideas in a new way: open-addressing hash tables (so probes are contiguous in memory) and SIMD instructions (so a single instruction can compare against sixteen one-byte fingerprints at once). The design later spread outside Google: into Rust's standard HashMap, into Go's runtime, and into many high-performance C++ codebases.
Teaches: Use metadata to filter probes and reduce cache misses
Anecdote
Developed inside Google's Abseil library. The breakthrough was using tiny "control bytes" + SIMD instructions to check many slots at once — turning hashing into a vectorized operation.
The Idea
Hash the key into a 64-bit number. Split it: the top bits pick a group of 16 slots in the table; the low 7 bits form a control byte ctrl, a tiny fingerprint of this key, with its top bit cleared so it can never be confused with the "empty" or "deleted" sentinel bytes (which have the top bit set). Each table slot has its own one-byte control byte stored in a parallel array.
To look up a key, load the group's 16 control bytes, and ask the CPU's SIMD unit to compare all 16 against ctrl in one instruction. The result is a 16-bit mask of candidate slots whose fingerprints match. For each candidate, do a real key comparison. If you find your key, return its value. If the group also contains an empty slot, the key isn't in the table — stop. Otherwise advance to the next group. Because the control bytes are packed and contiguous, one cache line covers a whole group — so most lookups touch slow memory just once.
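The lookup described above can be sketched in plain Python. This is an illustrative toy, not Abseil's implementation: the class name ToySwissTable, the EMPTY value 0x80, linear probing over groups, and the list comprehension standing in for the SIMD compare are all assumptions made for the example, and the table neither resizes nor deletes.

```python
EMPTY = 0x80   # assumed sentinel: top bit set, so no 7-bit fingerprint can equal it
GROUP = 16     # slots per group


class ToySwissTable:
    def __init__(self, num_groups=4):
        self.ng = num_groups
        size = num_groups * GROUP
        self.ctrl = [EMPTY] * size   # one control byte per slot
        self.keys = [None] * size
        self.vals = [None] * size

    @staticmethod
    def _split(key):
        h = hash(key) & 0xFFFFFFFFFFFFFFFF   # view the hash as 64 unsigned bits
        return h >> 7, h & 0x7F              # (group selector, 7-bit fingerprint)

    def insert(self, key, val):
        # Sketch only: no resizing or deletion, so the table must never fill up.
        selector, fp = self._split(key)
        g = selector % self.ng
        while True:
            for slot in range(g * GROUP, (g + 1) * GROUP):
                if self.ctrl[slot] == fp and self.keys[slot] == key:
                    self.vals[slot] = val          # overwrite an existing key
                    return
                if self.ctrl[slot] == EMPTY:
                    self.ctrl[slot] = fp
                    self.keys[slot] = key
                    self.vals[slot] = val
                    return
            g = (g + 1) % self.ng

    def lookup(self, key):
        selector, fp = self._split(key)
        g = selector % self.ng
        for _ in range(self.ng):                   # visit each group at most once
            base = g * GROUP
            group = self.ctrl[base:base + GROUP]
            # Stand-in for the SIMD compare: indices whose control byte equals fp.
            matches = [i for i, c in enumerate(group) if c == fp]
            for i in matches:
                if self.keys[base + i] == key:     # confirm with a real key comparison
                    return self.vals[base + i]
            if EMPTY in group:                     # an empty slot ends the probe
                return None
            g = (g + 1) % self.ng
        return None
```

A real implementation would replace the list comprehension with one vector compare plus a move-mask, which is where the speedup comes from; the control flow stays the same.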
Trace
step | bytes (slice) | matches mask | check key | result
1 | [0x12, 0x07, 0x80, 0x07, ...] | bits 1, 3 set | table.keys[1] = "alice"? | YES → return table.vals[1]
Where It's Used Today
Google's C++ Abseil library — absl::flat_hash_map, used inside Google Search, Maps, Gmail, and YouTube backends.
Rust standard library: std::collections::HashMap has been backed by hashbrown, a Rust port of the Swiss Tables design, since Rust 1.36 (2019).
Go runtime (since 1.24) — Go's built-in map switched to a Swiss-Tables-inspired layout for better cache behavior.
Compilers and linkers: symbol tables are large, lookup-heavy hash maps, so a cache-friendly Swiss-style layout can noticeably cut the time spent in big builds.
Game engines and databases — anywhere you do millions of key lookups per second, the SIMD-accelerated Swiss layout shows up.
When NOT to Use
When the table is tiny (under a few dozen entries) — the SIMD setup and group bookkeeping cost more than a plain linear scan over a small array.
When keys must stay in sorted order or you need range queries — Swiss Tables (like all hash maps) shuffle keys; use a B-tree or skip list.
When pointer stability is required — open addressing relocates entries on resize, invalidating any references; use a node-based map (std::unordered_map) instead.
Common Mistakes
Failing to clear the high bit of the control byte fingerprint, letting a real key's ctrl collide with the empty/deleted sentinel encodings (which Abseil distinguishes by the top bit being set).
Stopping the probe sequence at the first deleted slot instead of the first empty slot — keys placed after a tombstone become unreachable.
Forgetting to wrap the group index modulo the number of groups: the probe walks off the end of the control array as soon as it steps past the last group.
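The three mistakes above can be seen together in a small probe routine. This is a hedged sketch: find_slot and its group parameter are hypothetical names, the group size is shrunk to 4 to keep test data readable, and the sentinel values 0x80 and 0xFE are assumptions chosen only because both have the top bit set.

```python
EMPTY = 0x80     # assumed "never used" sentinel (top bit set)
DELETED = 0xFE   # assumed tombstone sentinel (top bit set)


def fingerprint(h):
    # Keep only the low 7 bits, so the top bit is clear and the result
    # can never equal EMPTY or DELETED.
    return h & 0x7F


def find_slot(ctrl, keys, key, h, group=4):
    """Return the slot index holding `key`, or None if it is absent."""
    ng = len(ctrl) // group
    g = (h >> 7) % ng
    fp = fingerprint(h)
    for _ in range(ng):                       # each group is visited at most once
        base = g * group
        chunk = ctrl[base:base + group]
        for i, c in enumerate(chunk):
            if c == fp and keys[base + i] == key:
                return base + i
        if EMPTY in chunk:                    # only a truly empty slot ends the probe
            return None
        # Tombstones (DELETED) do not stop the search; wrap with the modulo.
        g = (g + 1) % ng
    return None
```

With a first group made entirely of tombstones, the routine still reaches a key stored in the next group, which is the case the "stop at the first deleted slot" mistake breaks.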
Try It with an AI Assistant
short
Write a Swiss Tables lookup using open addressing with 1-byte control fingerprints and SIMD-style group matching, returning the value for a given key or None.
behavior
Write a hash-table lookup that splits the hash into a group index and a 7-bit fingerprint. Load 16 control bytes for that group, compare all of them to the fingerprint in parallel, and for each match check whether the stored key equals the search key. Stop on a hit, on an empty slot, or otherwise move to the next group.
In 2017, eight researchers, most of them at Google Brain and Google Research in Mountain View (Vaswani, Shazeer, Parmar, Uszkoreit and four collaborators), published "Attention Is All You Need." Their bet was radical: throw out recurrence entirely and let every token in a sequence attend directly to every other token through a few matrix multiplications. The resulting Transformer architecture trained much faster on GPUs than the recurrent networks of the time, handled long-range dependencies far better, and within five years powered essentially every major large language model, one of the fastest paradigm shifts in the history of computing.
Teaches: Weight pairwise interactions to focus relevance
Anecdote
The 2017 paper "Attention Is All You Need" by Vaswani and seven co-authors, most of them at Google, proposed a network architecture, the Transformer, that dispensed entirely with recurrence and relied solely on attention. Within five years it had become the foundation of essentially every large language model, making it one of the fastest-adopted ideas in computing history.
The Idea
Pack all the queries into a matrix Q, the keys into K, and the values into V. The matrix product Q · transpose(K) produces a table of scores: row i, column j is how well query i matches key j. Divide the scores by sqrt(dim) (where dim is the size of each query/key vector) so their magnitude does not grow with dim, apply softmax along each row to turn raw scores into weights that sum to 1, then multiply by V: each output row is a weighted average of the value vectors.
Why does it work? The dot product is a similarity score: large when two vectors point the same direction, small when they don't. Softmax sharpens that into a probability distribution in which the most relevant keys dominate the blend. The sqrt(dim) divisor matters because for large dim the dot products grow large, pushing softmax into a regime with vanishing gradients. The whole operation is two matrix multiplies with a row-wise softmax in between, which is why GPUs can run attention on thousands of tokens at once.
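A minimal sketch of the computation, written with plain Python lists so it runs without any numerical library. The function names softmax and attention are chosen for illustration, Q, K and V are assumed to be lists of equal-length rows, and a practical version would use a matrix library and batch the work.

```python
import math


def softmax(row):
    m = max(row)                           # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for Q, K of shape (n, dim) and V of shape (n, d_v)."""
    dim = len(Q[0])
    scale = math.sqrt(dim)
    out = []
    for q in Q:
        # One row of Q · transpose(K), divided by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = softmax(scores)          # row-wise softmax: weights sum to 1
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

A query of all zeros scores every key equally, so its output is the plain average of the rows of V; a query aligned with one key pulls the output toward that key's value.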
Where It's Used Today
Large language models: GPT, Claude, Gemini, Llama, and other modern chatbots are built from stacked Transformer layers, each centered on scaled dot-product attention; this is the heart of the technology.
Machine translation — Google Translate, DeepL, and every modern translator use attention so each output word can pull from any input word, regardless of position.
Image recognition — vision transformers (ViTs) treat image patches as tokens and apply attention to figure out which patches matter for the classification.
Speech recognition — Whisper and other transcription models use attention to align audio frames with the most likely words, even when they are far apart in time.
Protein folding — AlphaFold's structure module is an attention network; each amino acid attends to every other to predict the 3D shape of the protein.
When NOT to Use
When sequences are extremely long (tens of thousands of tokens) — the O(n²) score matrix dominates memory; use FlashAttention, sparse, or linear-attention variants instead.
When position matters and you have no positional encoding: plain self-attention is permutation-equivariant (shuffle the input tokens and the outputs are shuffled the same way), so without explicit positional information it cannot tell "dog bites man" from "man bites dog".
When training data is tiny — transformers need lots of examples to learn useful Q/K/V projections; a linear or recurrent model trains faster on small datasets.
Common Mistakes
Forgetting the sqrt(dim) divisor — for large dim, raw dot products grow huge, softmax saturates, and gradients vanish.
Applying softmax along the wrong axis (columns instead of rows), so the weights for one query no longer sum to 1 and the output blends nothing meaningful.
Skipping the causal mask in autoregressive decoders, letting future tokens leak into past predictions and silently breaking next-token generation.
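The softmax-axis and causal-mask points above can be illustrated with a small weight routine. It is a sketch under stated assumptions: causal_attention_weights is a hypothetical name, and scores is assumed to be a square list of lists where row i holds query i's scores against keys 0 to n-1.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax in which query i may only attend to keys 0..i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                  # keys at or before position i
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get weight exactly 0, so nothing leaks backwards.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights
```

Each row still sums to 1 because the softmax is taken along the row, over the visible keys only. Matrix libraries usually get the same effect by setting the masked scores to negative infinity before the softmax.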
Try It with an AI Assistant
behavior
Given three matrices Q, K, V of shape (n, dim), compute the matrix product of Q with the transpose of K, divide every entry by the square root of dim, apply softmax row by row to get a row-stochastic weight matrix, and finally return the weight matrix multiplied by V.
Random Integer in a Range (Lemire's fast unbiased rejection)
An Unbiased Coin from a Biased One
Montreal, Canada
It made fast unbiased bounded random numbers practical for simulations, games, randomized algorithms, and cryptographic-adjacent utilities.
m ← random_u32() * range
l ← m AND 0xFFFFFFFF
IF l < range THEN
    t ← (-range) MOD range
    WHILE l < t
        m ← random_u32() * range
        l ← m AND 0xFFFFFFFF
    ENDWHILE
ENDIF
RETURN m >> 32
Turning random bits into a fair bounded integer is trickier than using modulo, which can introduce bias. Daniel Lemire’s method used multiplication and rejection to generate unbiased bounded integers quickly.
Teaches: Eliminate modulo bias using rejection of regions
The Idea
Multiply your 32-bit random number by range to get a 64-bit product m. The upper 32 bits of m are exactly floor(random · range / 2³²), a candidate integer in [0, range). That's "Lemire's multiply-shift", and it is already almost uniform: each output is produced by either floor(2³² / range) of the 2³² possible random words or by one more than that.
The "almost" is the bias: 2³² is usually not a multiple of range, so some outputs get one extra word. The lower 32 bits of m (call them l) say where inside its output's interval the draw landed, and the surplus draws are those with l < t, where t = (−range) mod range computed in unsigned 32-bit arithmetic (equivalently 2³² mod range). Reject those draws and resample. After rejection every output is backed by the same number of words, floor(2³² / range), so the result is provably uniform on [0, range). Because t < range, the cheap fast-path check l ≥ range already settles most calls; the slow path with the (−range) mod range threshold runs only when l < range, which happens with probability of roughly range / 2³².
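The multiply, fast path, and rejection loop can be sketched as follows. The parameter names are illustrative: bound replaces range to avoid shadowing Python's built-in, random_u32 is an injectable word source, and bits is an assumption added so the same logic can be checked exhaustively on small word sizes. Python integers are unbounded, so the full-width product needs no special handling.

```python
import random


def random_in_range(bound, random_u32=None, bits=32):
    """Uniform integer in [0, bound) from uniform `bits`-bit words."""
    mask = (1 << bits) - 1
    if not 0 < bound <= mask:
        raise ValueError("bound must satisfy 0 < bound < 2**bits")
    if random_u32 is None:
        random_u32 = lambda: random.getrandbits(bits)

    m = random_u32() * bound        # full-width product (2*bits wide)
    low = m & mask                  # lower half: position inside the interval
    if low < bound:                 # slow path, rarely taken
        # (-bound) mod bound in unsigned arithmetic, i.e. 2**bits mod bound.
        threshold = ((1 << bits) - bound) % bound
        while low < threshold:      # reject the surplus draws
            m = random_u32() * bound    # redraw the whole word, not just `low`
            low = m & mask
    return m >> bits                # upper half is the result
```

Feeding the word 1,717,986,918 with bound 10 reproduces the value 3 from the trace in this section. With bits=8 and bound 10, walking through all 256 words rejects six of them and yields each output exactly 25 times.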
Trace
step | computation | value
1 | m = random_u32() * range = 1,717,986,918 · 10 | 17,179,869,180
2 | l = m AND 0xFFFFFFFF (lower 32 bits) | 4,294,967,292
3 | check l < range = 10? | no (fast accept)
4 | return m >> 32 (upper 32 bits) | 3
Where It's Used Today
Game shuffles — the "fair shuffle" in card games and matchmaking lobbies needs unbiased indices, not modulo-skewed ones.
Statistical simulations — Monte Carlo experiments at scale notice the bias when running billions of draws.
Modern language libraries: Go's math/rand/v2, Rust's rand crate, and NumPy's Generator all use Lemire-style multiply-and-reject integer generation.
Cryptographic-adjacent utilities — random nonces, salts, and shuffles where unbiased uniform draws are required for security analysis.
Reservoir sampling and randomized algorithms: any algorithm that needs an unbiased index in [0, n) benefits from the fast-path acceptance.
When NOT to Use
When range is a power of two — a single bit-mask is faster and unbiased; rejection isn't needed.
When the target language can't do a 64-bit multiply on 32-bit operands — the multiply-shift trick assumes wide arithmetic.
When the bias of random_u32() % range is small enough to ignore (a quick game shuffle, a UI animation) — the simpler modulo is good enough and one line shorter.
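To judge whether the modulo bias mentioned above is small enough to ignore, it helps to count it. The helper below is a hypothetical illustration (modulo_bias is not a library function) and assumes the input words are uniform over all 2**bits values.

```python
def modulo_bias(bound, bits=32):
    """Describe how `word % bound` spreads the 2**bits equally likely words.

    Returns (favoured, high, low): the first `favoured` outputs are each hit
    `high` times, and the remaining outputs are each hit `low` times.
    """
    words = 1 << bits
    low = words // bound            # every output gets at least this many words
    favoured = words % bound        # this many outputs get one extra word
    high = low + 1 if favoured else low
    return favoured, high, low
```

For 32-bit words and a bound of 10, outputs 0 to 5 each receive 429,496,730 words and outputs 6 to 9 receive 429,496,729, a relative skew of about 2.3e-9. The skew grows with the bound, so it matters far more for a bound near 2³² than for a ten-sided die.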
Common Mistakes
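Computing t = (-range) mod range with signed arithmetic: negating a signed range gives a negative number and the remainder comes out wrong. Do the computation in unsigned 32-bit arithmetic, where it equals (2^32 - range) mod range.
Computing the threshold unconditionally: it costs a division, and the fast-path check l >= range settles most calls without it, so skipping that check wastes the speedup that justifies the algorithm.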
Resampling only the lower 32 bits instead of generating a fresh random_u32() — the rejection loop must redraw the full word, otherwise it's still biased.
Try It with an AI Assistant
short
Write random_in_range(range) using Lemire's fast unbiased rejection: multiply a 32-bit random by range, take the upper 32 bits as the candidate, fast-accept when the lower 32 bits are at least range, and fall back to the threshold rejection only otherwise.
behavior
Given a uniform 32-bit random word and a target range smaller than 2³², produce a uniform integer in [0, range). Multiply the random word by range to get a 64-bit product. Take the upper 32 bits as a candidate. Take the lower 32 bits; if they're at least range, return the candidate. Otherwise compute the threshold (−range) mod range and resample until the lower 32 bits are at least that threshold.