Published under: statistical analysis, data analysis, crosstabulation, games

Wordle is a very popular word game that has recently been purchased by the New York Times. Every morning, a new game is posted at https://www.nytimes.com/games/wordle/index.html. The goal of the game is to guess that day’s 5-letter word in no more than 6 tries. The instructions posted at that site give you a good idea of how it works:

There have been a number of mathematical analyses of the game. One of the most interesting articles describes a large computer simulation used to determine which starting words are best if you want to solve the puzzle as fast as possible or you want to be sure to guess the correct word within 6 tries (https://www.inverse.com/culture/wordle-top-3-start-words). Other studies look at the frequency of letters in English words containing exactly 5 letters.

Frequency Distribution of Letters

It turns out that you can download the words used in Wordle. There are 12,972 5-letter English words that are considered to be valid guesses, although only 2,315 of those are used by Wordle as possible answers. So as not to spoil the fun, I decided to analyze the larger set of possible guesses. I downloaded the words and put them in a file with 12,972 rows and 5 columns. Each of the 5 columns contained the single letter found at that position of the word.
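If you want to build the same kind of 5-column layout outside of Statgraphics, a minimal Python sketch might look like the following. The function name `words_to_rows` and the three sample words are my own placeholders for illustration; they are not the actual Wordle list, and this is not how the data file described above was produced.

```python
def words_to_rows(words):
    """Split each 5-letter word into a row of 5 single-letter columns."""
    rows = []
    for word in words:
        word = word.strip().upper()
        # Reject anything that is not exactly five alphabetic characters.
        if len(word) != 5 or not word.isalpha():
            raise ValueError(f"not a 5-letter word: {word!r}")
        rows.append(list(word))
    return rows


# Illustrative sample only (assumption): swap in the downloaded word list.
sample_words = ["crane", "slate", "toast"]
rows = words_to_rows(sample_words)

for row in rows:
    # One word per line, one letter per comma-separated column.
    print(",".join(row))
```

Each printed line is one row of the table, so the output can be saved as a CSV file and loaded into whatever statistics package you prefer.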

The first graph I decided to do shows the percentage of the 64,860 total letter positions (12,972 words × 5 letters) that each letter occupies:

The most frequently occurring letters are S, E and A, in that order. This is a little surprising, since according to published studies of the occurrence of letters in the Oxford English Dictionary, the most commonly occurring letters are E, A and R (https://www3.nd.edu/~busiforc/handouts/cryptography/letterfrequencies.html). In that analysis, the letter S ranks eighth with only about half the frequency of the letter E. Also, the standard order of letters used by typesetters is ETAOINSHRDLU. It seems that the distribution of letters in 5-letter words is different from that in all English words. I'm guessing one major difference occurs because the list of all 5-letter words contains plural forms as well as singular forms. In my experience, however, Wordle doesn't tend to use plural forms.
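The overall letter percentages can be approximated in a few lines of Python. This is only a sketch under stated assumptions: `letter_percentages` is a hypothetical helper name, and the four sample words are made up for illustration, so the numbers it prints are not the Wordle results discussed above.

```python
from collections import Counter


def letter_percentages(words):
    """Return each letter's share (in percent) of all letter positions."""
    counts = Counter("".join(word.upper() for word in words))
    total = sum(counts.values())  # number of words times 5 for 5-letter words
    # most_common() orders letters from most to least frequent.
    return {letter: 100.0 * n / total for letter, n in counts.most_common()}


# Illustrative sample only (assumption): swap in the downloaded word list.
sample_words = ["crane", "slate", "toast", "soles"]
pct = letter_percentages(sample_words)

for letter, share in pct.items():
    print(f"{letter}: {share:.1f}%")
```

Run against the full list of valid guesses, the same calculation should divide each letter count by the 64,860 total positions.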

Letter Distribution by Position

I then decided to look at the distribution of letters at each position. Performing a crosstabulation of letter by position generated the following graph:

The letter S is the most frequent letter to begin or end a word (by a large margin, with all those plural forms). A, E and I are much more frequent than S in positions 2, 3 and 4. The letter O is also an especially good choice for position 2. The letter Y occurs fairly often at the end of the word and occasionally at position 2.
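A crosstabulation of letter by position is simply a count of how often each letter shows up in each of the 5 columns. Here is a hedged Python sketch of that idea; `crosstab_by_position` is a name I made up, the sample words are illustrative rather than the Wordle list, and Statgraphics may compute and display its table differently.

```python
from collections import Counter


def crosstab_by_position(words):
    """Count letters separately for each of the 5 positions in a word."""
    table = [Counter() for _ in range(5)]
    for word in words:
        word = word.strip().upper()
        if len(word) != 5:
            raise ValueError(f"not a 5-letter word: {word!r}")
        for pos, letter in enumerate(word):
            table[pos][letter] += 1
    return table


# Illustrative sample only (assumption): swap in the downloaded word list.
sample_words = ["crane", "slate", "toast", "soles"]
table = crosstab_by_position(sample_words)

for pos, counts in enumerate(table, start=1):
    letter, n = counts.most_common(1)[0]
    share = 100.0 * n / len(sample_words)
    print(f"position {pos}: most frequent letter {letter} ({share:.0f}% of words)")
```

Keeping one counter per position makes it easy to ask questions such as which letter most often starts or ends a word.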

Note: I did a little more digging. The letter S is found at position 5 in 30.5% of the words in the 5-letter word list that Wordle accepts as valid guesses, but only 1.6% of the words it uses for answers. So I've stopped using plural forms for my guesses.
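The comparison in the note above amounts to computing, for each word list, the share of words with a given letter at a given position. A minimal sketch follows, assuming two lists loaded separately; the helper name `share_with_letter_at` and both tiny sample lists are hypothetical and will not reproduce the 30.5% and 1.6% figures.

```python
def share_with_letter_at(words, letter, position):
    """Percent of words that have `letter` at the 1-based `position`."""
    words = [w.strip().upper() for w in words]
    if not words:
        return 0.0
    hits = sum(1 for w in words if w[position - 1] == letter.upper())
    return 100.0 * hits / len(words)


# Illustrative samples only (assumption): swap in the two downloaded lists.
sample_guesses = ["crane", "slate", "toast", "soles", "cares"]
sample_answers = ["crane", "slate", "toast"]

guess_share = share_with_letter_at(sample_guesses, "S", 5)
answer_share = share_with_letter_at(sample_answers, "S", 5)

print(f"S at position 5, guesses: {guess_share:.1f}%")
print(f"S at position 5, answers: {answer_share:.1f}%")
```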

Conclusion

Analyzing the distribution of letters used by Wordle is a fun mathematical exercise. But you shouldn’t let it spoil your fun playing the game. For my first guess, I still use whatever 5-letter word comes to mind first.

Instructions

If you'd like to analyze the data yourself, download this zip file. It contains 2 Statgraphics data files: one with all the words Wordle accepts as guesses and the other with the words it uses for answers. Then download and install Statgraphics 19.4. Start Statgraphics and load either data file. Choose Describe - Categorical Data - Two Factors - Crosstabulation. Tell it you have multiple data columns and fill out the data input dialog box as shown below:

You'll get the tables and graphs that I've shown here.