Could static analysis provide a generic way to approach Wordle? James Handley uses simple command line tools in order to (hopefully) name that Wordle in four!
For many people (myself included) Wordle [Wordle] has become part of the daily routine. Wordle is a simple ‘Mastermind’-like game where you have to guess a five-letter word in six guesses or fewer. For each guess you make (each of which has to be a valid five-letter word), you get an indication of whether each letter appears in the solution, and if it does whether you have guessed it in the correct position. One important thing to note is that the spelling is American English.
After several weeks of solving it with random guessing, I decided to try to be a bit more scientific, in part thanks to a little extra prod from a video by Hannah Dee [Dee22].
Key to the examples:
First things first. If the answer didn’t have to be a valid word, you would be unlikely to solve it. There are 26 × 26 × 26 × 26 × 26 (over 11 million) possible five-letter strings. While you can rule out a lot of these after every guess, basic arithmetic exposes the problem. If we guess 5 different letters each turn, then we will have tried 25 different letters in 5 guesses. For our final guess, we will know which distinct letters are in the solution, but can’t be sure of their position, or whether any have been repeated. Or to look at it in a more Wordle way – on the first guess, say we happen to get four letters in the correct position. That still leaves 25 possibilities for the fifth letter (as one may be repeated), but only 5 guesses to find it.
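The arithmetic above can be checked directly in the shell. This is only a sketch of the sums; the variable names are made up for illustration.

```shell
# Number of five-letter strings over a 26-letter alphabet
total=$((26 * 26 * 26 * 26 * 26))
echo "$total"          # prints 11881376

# Distinct letters tried after five guesses of five different letters each
tried=$((5 * 5))
echo "$tried"          # prints 25, one short of the full alphabet
```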
Fortunately it does have to be a valid word, which helps enormously. There are ‘only’ 11,302 words in Linux’s ‘huge’ American English dictionary, from “aahed” to “zymix”. We can rule out a lot of these, but we’ll use this list as the worst case. Most of the analysis in this article uses two GNU commands: grep (a regular expression line matcher) and wc (a word counter). We will also pipe the output from one command to the next. There are of course multiple regular expressions to solve any given problem [RegEx] – I’m using ones which (hopefully) are easy to read.
Each word in the source dictionary file is on a separate line, but there are words with diacritics, apostrophes, and some proper nouns. First of all we’ll need to filter it to only ASCII characters, and then to lines with exactly five lower case letters. We’ll write the result into a file called wordle-words.txt to use throughout the rest of the analysis (see Listing 1).
$ grep -P "^[[:ascii:]]*$" /usr/share/dict/american-english-huge | \
    grep "^[a-z][a-z][a-z][a-z][a-z]$" > wordle-words.txt
$ wc -l wordle-words.txt
11302 wordle-words.txt
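As noted earlier, there is more than one regular expression for any given job. As a sketch, the five repeated bracket expressions in Listing 1 could be collapsed into a single interval expression using extended regular expressions (`grep -E`). The helper name `five_letters` is hypothetical; it reads words on standard input and assumes they have already been filtered to ASCII as in Listing 1.

```shell
# Hypothetical helper: keep only lines made of exactly five lower-case letters.
# "{5}" means "the preceding bracket expression, repeated five times".
five_letters() {
    grep -E '^[a-z]{5}$'
}

# Possible usage, following Listing 1:
#   grep -P "^[[:ascii:]]*$" /usr/share/dict/american-english-huge | \
#       five_letters > wordle-words.txt
```

The longer form in Listing 1 is arguably easier to read at a glance, which is the trade-off the article is making.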
The approach I am taking is to try to extract as much information as possible from each individual guess – the theory being that at each stage you are reducing the search space by the greatest amount. I am using guesses based on the letters which are most likely to occur in the solution, which provides positional information where there is a match, or significantly reduces the candidate word list where there isn’t. This is unlikely to be the optimal approach, but has the advantage of being easy to understand and implement.
We can use some grep magic to count how many times each letter of the alphabet appears in the full word list. The flag -o will output every match on a new line; for example, grep -o "." on “ABA” will output “A”, “B”, “A” on separate lines. We can then sort this output, and group/count it with uniq. The final two commands in Listing 2 aren’t strictly required – they sort the output numerically and output it in columns.
$ grep -o "." ./wordle-words.txt | sort | uniq -c | sort -g -r | column
5807 s    2828 l    1637 c    1194 k    228 j
5723 e    2722 t    1628 m     869 f     92 q
5242 a    2414 n    1531 y     856 w
3832 o    2013 d    1387 h     593 v
3636 r    1898 u    1334 g     340 z
2901 i    1661 p    1307 b     237 x
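To see what each stage of Listing 2 contributes, the “ABA” example can be run through the same pipeline. This is a small sketch; the function name `letter_counts` is an assumption, and it reads from standard input rather than from the word list.

```shell
# Hypothetical helper: the counting stages of Listing 2, without "column".
letter_counts() {
    grep -o "." |   # one character per line
        sort |      # bring identical characters together
        uniq -c |   # collapse each run into "count character"
        sort -g -r  # largest count first
}

printf 'ABA\n' | letter_counts
# Prints "2 A" then "1 B" (uniq pads the counts with leading spaces)
```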
The 5 most common letters are S, E, A, O and R, and AROSE is the only word we can make from those five letters. If we count all the words which have at least one of these letters, it’s an amazing 10,782 out of 11,302. This means we have a 95% chance of matching at least one letter, and if we don’t match any letters at all we have only 520 candidate words left to search.
$ grep "[arose]" wordle-words.txt | wc -l
10782
On the other hand if we used the least popular letters (say BUZZY), we only cover 4,665 words. Our odds of a match have gone down from 95% to less than 50%. What’s worse, if we don’t match any, we are still left with 6,637 words to search.
|First word||Number of matches||Number of non-matches|
|AROSE||10,782||520|
|BUZZY||4,665||6,637|
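The same comparison can be made for any candidate first word. The sketch below wraps the two counts in a function; the name `coverage` and its optional second argument are assumptions, with the word list defaulting to the file created in Listing 1.

```shell
# Hypothetical helper: for a guess, count the words sharing at least one
# letter with it, and the words sharing none.
coverage() {
    guess=$1
    words=${2:-wordle-words.txt}
    matches=$(grep -c "[$guess]" "$words")        # lines with any guess letter
    non_matches=$(grep -vc "[$guess]" "$words")   # lines with none of them
    echo "$guess $matches $non_matches"
}

# Possible usage:
#   coverage arose
#   coverage buzzy
```

`grep -c` counts matching lines directly, so the separate `wc -l` is not needed here.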
Ignoring the positional information for a moment, we can break down what happens when we match combinations of letters. If we only match “A” then we know we didn’t match any of “ROSE”, which only leaves 797 possible words:
$ grep "[arose]" wordle-words.txt | \
    grep "a" | \
    grep -v "[rose]" | wc -l
797
Do the above for all the possible combinations of letters (we’ve already seen that AROSE is the only word with all these letters) and you get the following:
Even without positional information, we have reduced the word list from 10,782 to a maximum of 894.
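Repeating the count by hand for every combination is tedious, so a loop helps. The following is a sketch, not the script used for the article: the function name `combo_counts` is hypothetical, and it walks through all 31 non-empty subsets of A, R, O, S and E using a bit mask, counting the words that contain every letter in the subset and none of the remaining letters.

```shell
# Hypothetical helper: print "letters count" for each non-empty subset of AROSE.
combo_counts() {
    words=${1:-wordle-words.txt}
    mask=1
    while [ "$mask" -le 31 ]; do
        present=""
        absent=""
        bit=1
        matched=$(cat "$words")
        for letter in a r o s e; do
            if [ $(( mask & bit )) -ne 0 ]; then
                # Letter is in this subset: keep only words containing it
                present="$present$letter"
                matched=$(printf '%s\n' "$matched" | grep "$letter")
            else
                absent="$absent$letter"
            fi
            bit=$(( bit * 2 ))
        done
        # Drop words containing any letter outside the subset
        if [ -n "$absent" ]; then
            matched=$(printf '%s\n' "$matched" | grep -v "[$absent]")
        fi
        # "grep -c ." counts the non-empty lines that remain
        echo "$present $(printf '%s\n' "$matched" | grep -c .)"
        mask=$(( mask + 1 ))
    done
}

# Possible usage, largest group first:
#   combo_counts | sort -k2 -g -r | head
```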
By taking into account positional information, we can reduce it even further. Say we need an “A” in the correct position:
$ grep "[arose]" wordle-words.txt | \
    grep "a" | \
    grep -v "[rose]" | \
    grep "a...." | wc -l
123
That means there are 123 five-letter words which begin with A, but don’t have the letters ROSE. Inverting the final grep would give us the count for “correct letter, wrong position”. Completing this analysis for our five starting letters:
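Both counts can be produced together. The sketch below assumes a hypothetical function `position_counts` that takes the matched letter, the guess letters that did not match, and a five-character pattern giving the guessed position; it prints the “correct position” count followed by the “wrong position” count. The initial `grep "[arose]"` is left out because `grep "a"` already implies it.

```shell
# Hypothetical helper: count candidates with the letter in the guessed
# position, and candidates with the letter elsewhere in the word.
position_counts() {
    letter=$1
    others=$2
    pattern=$3
    words=${4:-wordle-words.txt}
    right=$(grep "$letter" "$words" | grep -v "[$others]" | grep -c "$pattern")
    wrong=$(grep "$letter" "$words" | grep -v "[$others]" | grep -vc "$pattern")
    echo "$right $wrong"
}

# Possible usage for the "A" in AROSE:
#   position_counts a rose "a...."
```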
The chances are we will match more than one letter, and all the possible combinations would be too many to list here. But for the least specific combination (“AS”) we get:
So it seems reasonable to estimate that our new candidate word list has a maximum of 750 words. Not bad for one guess.
As we saw with SHAME/SHAPE/etc., changing only one letter is not a good strategy. The ‘best’ second guess will depend on the match of the first guess, but if we are taking a generic approach we will want to choose a set of letters distinct from the first guess. The next most frequent letters on the list are I, L, T, N, D, and U; leaving D aside, the other five make “UNTIL” (or “UNLIT” if you prefer). Our coverage with UNTIL against the full list is still pretty good at 8,690 matches vs. 2,612 non-matches. It also turns out there is only one five-letter word which does not have any letters from “AROSEUNTIL” in it, but that word is left as an exercise for the reader!
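For readers attempting the exercise, one way to list the words that share no letter with a set of guesses is sketched below. The function name `uncovered` and the `WORDS` variable are assumptions; the word list defaults to the file from Listing 1.

```shell
# Hypothetical helper: list words containing none of the letters in the
# guesses passed as arguments.
uncovered() {
    words=${WORDS:-wordle-words.txt}
    letters=$(printf '%s' "$@")    # join all guesses into one string of letters
    grep -v "[$letters]" "$words"
}

# Possible usage:
#   uncovered arose until
```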
The ‘best’ third guess depends even more on the first two matches, and you will have to at least re-use a vowel. The next most frequent letters are D, P, C, M, Y, and H, so trying to make a word from these letters with whichever vowels you already know is probably your best bet. Assuming some match with “AROSEUNTIL”, it turns out that the third most frequent letter set doesn’t actually change very much.
So, my approach to solving Wordle:
- Guess “AROSE”
- Guess “UNTIL”
If no match with either, there is only one possible word.
- Guess a word with the letters you know, also using as many letters as possible from DPCYM.
- You should now be able to work out the answer!
This is what happened for me with Wordle #244:
AROSE – 107 words left
UNTIL – 20 words left
DOPEY – 1 word left
DODGE
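Each guess in that game can be turned into a chain of greps over the candidate list. The sketch below assumes the feedback implied by the final answer DODGE (for AROSE: A, R and S absent, O present but not third, E correct in fifth place), and the function names are hypothetical. With the same word list, piping each stage into `wc -l` should give counts comparable to those quoted above, though that has not been checked here.

```shell
# After AROSE: no A, R or S; an O somewhere other than third; E in fifth place.
after_arose() {
    grep -v "[ars]" "${1:-wordle-words.txt}" |
        grep "o" | grep -v "..o.." | grep "....e"
}

# After UNTIL: none of its letters matched.
after_until() {
    after_arose "$1" | grep -v "[until]"
}

# After DOPEY: D and O in place; no P or Y; E present but not fourth.
after_dopey() {
    after_until "$1" | grep "do..." | grep -v "[py]" | grep -v "...e."
}

# Possible usage:
#   after_arose | wc -l
#   after_until | wc -l
#   after_dopey
```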
[Dee22] Hannah Dee – “A Linux refresher through the medium of Wordle” posted on 25th January 2022. https://www.youtube.com/watch?v=i4UipSGjaNQ
[RegEx] “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” Attributed to Jamie Zawinski – see for example http://regex.info/blog/2006-09-15/247
The Rev’d James Handley has an MEng in Software Engineering, a PhD in Computing and Medical Physics, and a BA in Theology. He is a Principal Software Developer at JBA Consulting as well as being an ordained priest in the Church of England. For the past 15 years or so he has specialised in GIS and mapping, and he is particularly interested in how software development can influence faith and ministry, and vice versa.