What in the wordle?


Information theory defines the “entropy” or “information” of a random variable as the number of bits required to describe it. How does this apply to a guessing game? Let’s look at a simpler example.

Applying it to Wordle

So how does information theory apply to Wordle? The first question we can try to answer is: “ How many bits of information are required to guess a five-letter word?”. This is actually straightforward — Once we know how many five-letter words there are, by definition, the information content of choosing one of them is just log2(N) bits.

  1. words_alpha (contains 477k english words)
  2. top 10,000 most commonly searched english words according to Google

Wordle guesses

The next question to answer is how much information we gain from a Wordle guess. Calculating the information gain for a Twenty Questions guess was easy, since each question has exactly two outcomes (one bit). But you learn some more complex information from a Wordle guess: which letters are not in the word, and which are in the word at the right and wrong locations. How do we calculate the information gain for that?

A bit of inspiration

I left this question in the back of my mind for a few days, since I didn’t see an obvious approach. Then one day last week, I was driving home when I ran out of podcasts (I normally have them queued up and downloaded) so I turned my attention back to it.

Going deeper

Even though I had answered my original question, I wasn’t totally satisfied. I had a new question in my mind: What is the “best” word we can guess, and how much information are we able to get with it? To answer these we need to understand exactly what makes a guess “good”.

“Shut up and calculate”

It turns out that we can generalize the intuition above — a discrete random variable has maximum entropy when the outcomes are equally likely. This is exactly the same principle that rewarded us for choosing questions in Twenty Questions that evenly divide the space. For our purposes, the “best” guess is the one that most evenly divides the outcome space. If there are N words and M different colorings, we want the word that most closely puts N/M words in each coloring. In our case, with 12,500 words and 243 outcomes, we should aim for around 50 words per coloring.


So we answered the original question and also a couple more that came up through thinking about the problem. I’m pretty satisfied with this, but one question I’m still wondering about is whether I really need to compute the entropy for every word in order to get the best score, or if there’s a more efficient way to do it. This way works okay, but it took my computer a while to chug through the computations. I had to rewrite my code from python to C++ to get it to finish in a reasonable amount of time. If I increased the vocabulary size or allowed non-vocabulary words to be guessed, this approach wouldn’t scale well. If anyone has ideas on how to speed this up or take a less computational approach, I’d love to hear it!

Update 1 (01/08/21):

Looking around on the internet, I realized others like Tyler Glaiel have come up with different best first words than me. I’ve also gotten a lot of responses from friends telling me they only try “tares” now, which made me realize I had a responsibility to double-check my work, lest people underachieve on their wordles.

Update 2(01/17/21):

Some friends had a couple of interesting follow-up questions that I followed up on.



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store