How to Get a Bunch of New York Times Crossword Puzzle Data, and What Can Be Done Thereafter

Peter Shaffery

Introduction

I like the New York Times (NYT) crossword puzzle. I got started with it as a way to divert my attention from social media, but after about a year or two of pretty regular puzzling, I’ve started to grow an appreciation for the crossword as a craft (at least, the NYT crossword). Crossword constructors can control the puzzle-solver’s experience in a number of subtle ways, both in the choices which they make for an individual clue (eg. whether it’s phrased misleadingly or not) and for the puzzle as a whole (eg. the distribution of easy clues, the placement of black squares). The mixture of these factors can have a serious impact on how a puzzle “feels” to solve. A witty or punny clue that you spend 10 minutes trying to work out gives you a little reward at the end, whereas the intersection of two obscure trivia clues can be an immensely frustrating experience (referred to as a natick).

Understanding these principles, and using them to curate crossword puzzles, is therefore quite important, made even more so by the fact that the NYT adheres to a difficulty schedule. Mondays are the easiest, with difficulty increasing through to Saturday, where it (approximately) maxes out. Sunday crosswords are physically larger (21x21 squares, instead of the usual 15x15), but the difficulty is reduced to the level of a Wed-Fri puzzle to make the solve more pleasant.

I’d like to know more about crossword difficulty. Partly because I’d like to start making my own crosswords, but also because I’m running out of ways to procrastinate writing my thesis. Since I’m pretty handy with data analytics tools (and looking for a job 😉), I might as well see what I can pull out of some crossword data.

Getting Data

It turns out that actually getting crossword data is a little hard. Up until a few years ago some databases of crossword clues and solutions were maintained by single individuals, but it seems like as of 2020 those are no longer available. This is speculation, but my guess would be that the NYT doesn’t want their crossword data publicly available after a plagiarism scandal rocked the crossword puzzling world. I’m going to respect that and not include the dataset I’ve created in the GitHub repo which accompanies the project. However, I am going to show you how I got it. Important: this method requires a subscription to the online NYT crossword puzzle service. This service includes the bulk of the Shortz era on archive, and unlimited downloads of .puz crossword files. As far as I am aware simply automating this procedure doesn’t violate any terms of service, but use it at your own peril.

A friend of mine told me that any website is an API if you think about it, and this is absolutely true of the NYT crossword portal:

Notice that up in the right-hand corner there’s a download button?

This button is actually a link which directs you to a URL of the form https://www.nytimes.com/svc/crosswords/v2/puzzle/XXXXX.puz, where XXXXX is some crossword ID number. If you direct your internet browser to this URL (and you’re logged in to your NYT crossword subscription) it will hand over a .puz file containing a crossword. At the time of writing I can’t quite figure out the relationship between the ID number XXXXX and the returned puzzle. It doesn’t increase with the puzzle date, so maybe it’s a hash of the puzzle data or something.

Let’s take a closer look at the transaction which actually hands over the .puz file, starting from when you click the login button on the NYT crossword website:

  1. The NYT website uses some JavaScript (I believe) to prompt you for your login info.
  2. You enter the login info and complete a Captcha challenge, and then send your NYT Crossword username and password to the server hosting the crossword app.
  3. The server sends you back some authorization tokens (cookies) which certify that you are who you say you are, without having to log in again.
  4. You direct your browser to the URL https://www.nytimes.com/svc/crosswords/v2/puzzle/XXXXX.puz.
  5. Your browser bundles the relevant NYT authorization cookies with a GET request to the server at the above URL.
  6. If the NYT cookies are valid and have not expired, the server will then send back the file XXXXX.puz.

Automating steps 3-5 is fairly simple. Most popular languages (Python, R, JavaScript) have functionality (either base or with a package) to send GET requests. You could even use Bash tools like curl or wget wrapped in a little script to increment the puzzle’s ID number.
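For illustration, here’s a minimal Python sketch of that request loop (a rough sketch rather than the actual Go implementation in the repo; the cookie values, ID range, and output directory are placeholders):

```python
import os
import time

import requests

BASE_URL = "https://www.nytimes.com/svc/crosswords/v2/puzzle/{}.puz"

# Placeholder: paste the NYT cookies from a logged-in browser session here
# (one way to extract them from Firefox is sketched below).
COOKIES = {}

os.makedirs("puz", exist_ok=True)

for puzzle_id in range(1, 30001):           # sweep candidate ID numbers
    resp = requests.get(BASE_URL.format(puzzle_id), cookies=COOKIES)
    if resp.status_code == 200:             # a hit: the server handed back a .puz file
        with open(f"puz/{puzzle_id}.puz", "wb") as f:
            f.write(resp.content)
    time.sleep(1)                           # be polite to the server
```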

The difficulty is really in steps 1 and 2, and specifically getting past the Captcha. As you might expect, automating Captcha challenges is quite difficult, so we don’t have much hope of handling it within our scraper script. Instead what I ended up doing was simply logging in using the Firefox browser, and then raiding Firefox’s cookie database for everything which corresponded to the NYT website. By including all of those cookies with my automated GET requests I was able to successfully tap the endpoints. Because I’m trying to learn it, I implemented this process using Go, and you can find source code for that in the code/go directory of the accompanying GitHub repository.
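If you’re curious what “raiding the cookie database” looks like, here’s a rough Python sketch (assuming Firefox’s usual cookies.sqlite schema; the profile path is a placeholder, and my actual implementation of this step lives in the Go code mentioned above):

```python
import sqlite3

# Placeholder: substitute the path to your own Firefox profile directory.
# You may need to close Firefox (or copy the file) first, since the browser
# can hold a lock on the database.
COOKIE_DB = "/path/to/firefox/profile/cookies.sqlite"

def nyt_cookies(db_path=COOKIE_DB):
    """Return every cookie Firefox has stored for nytimes.com as a dict."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT name, value FROM moz_cookies WHERE host LIKE '%nytimes.com'"
    ).fetchall()
    conn.close()
    return dict(rows)

# These can then be bundled with the GET requests above, eg.
# requests.get(url, cookies=nyt_cookies())
```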

What Does That Get You?

I tried 30,000 ID numbers, and just over 10,200 of them returned a puzzle. This is approximately consistent with the number of puzzles published in the Will Shortz era (365 puzzles per year for Shortz’s tenure of 26 years would be 9,490 puzzles). These puzzles are stored in the .puz file format, which is a popular, open-source format used by most crossword apps and software. It includes all clues (and their number and direction), the puzzle’s layout, as well as the solutions (which are typically scrambled by some key phrase). To work with these files, I used the puzpy package for Python.

Processing the Data

For now I’m going to ignore the spatial information of the puzzle, eg. placement of the clues and black squares. While in my dream analysis I’d figure out a way to estimate the entropy for a given empty square when some nearby squares are filled, in reality that’s a lot to tackle in one blog post. So to start I used puzpy to pull the clue text, the date of the puzzle, the clue’s orientation (across or down), and how long the answer was, and stored it all in a .csv file. Besides these basic factors, I added two simple features to start with: one indicating whether the clue contained a “?” (what I’ll call a “pun clue”), and the second indicating whether it contained a proper noun (what I’ll call a “culture clue”).
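Here’s a minimal sketch of what that extraction might look like with puzpy (the column names, and stashing the date inside the title column, are illustrative choices rather than exactly what I ran):

```python
import csv
import glob

import puz  # the puzpy package

rows = []
for path in glob.glob("puz/*.puz"):
    p = puz.read(path)
    numbering = p.clue_numbering()          # clues with their numbers and lengths
    for direction, clues in (("across", numbering.across),
                             ("down", numbering.down)):
        for c in clues:
            rows.append({
                "title": p.title,           # NYT .puz titles usually embed the date
                "direction": direction,
                "number": c["num"],
                "clue": c["clue"],
                "answer_length": c["len"],
            })

with open("clues.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```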

In a pun clue, the last character of the clue is a question mark (“?”). This indicates that the clue text is not to be taken literally, ie. that it contains a pun, a joke, or a play on words, etc. For example, the clue “What’s up?” has the solution “SKY”. The clue “Going MY way?” has the solution “EGOTRIP”. Not all clues containing a pun end in question marks, but all clues ending in question marks are puns (with the rare exception that the clue quotes a question or something).

Culture clues are a concept I came across in this cool 2014 Bachelor’s thesis by Jocelyn Adams: “A Pragmatic Analysis of Crossword Puzzle Difficulty”. The idea here was that crossword clues can be divided into straightforward hints (eg. where the clue is a synonym of the solution) and “culture clues”. These might broadly be considered “trivia clues”, containing references to history, contemporary culture, sports, art, etc. While it is difficult to automatically detect whether a clue contains trivia or not, one heuristic Jocelyn used was to check whether all of the words in the clue were in the Scrabble dictionary. If any clue word was not a Scrabble word, then the clue was considered a “culture clue”. I couldn’t find a fast way to check clue words against the Scrabble dictionary in Python, so I instead checked whether words were proper nouns or not, which was straightforward in the NLP package I used (spaCy). This is probably not exactly the same as the Scrabble dictionary method, but given that neither is perfect I figured the difference would get lost in the overall error.
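Both features are cheap to compute. Roughly, in Python (assuming spaCy’s small English model; the function names are just for illustration):

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline with a POS tagger

def is_pun_clue(clue: str) -> bool:
    """Pun clue: the clue text ends in a question mark."""
    return clue.rstrip().endswith("?")

def is_culture_clue(clue: str) -> bool:
    """Culture clue: the clue contains at least one proper noun."""
    return any(tok.pos_ == "PROPN" for tok in nlp(clue))

print(is_pun_clue("Going MY way?"))        # True
print(is_culture_clue("El Greco's city"))  # True, assuming "El Greco" is tagged PROPN
```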

Low-Hanging Fruit

Okay, now that we’ve gotten a nice tidy dataset, let’s dive in and see what’s inside. First let’s take a look at our lovely new features: pun clues and culture clues. We can get a sense of how crossword constructors use these by plotting the percentage of clues which are pun/culture clues either by day of the week, or by year:
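The computation behind these plots is basically one groupby. Here’s a rough sketch, assuming the clue table from earlier with boolean is_pun / is_culture columns and a parsed date column (the file and column names are mine):

```python
import pandas as pd

df = pd.read_csv("clues_with_features.csv", parse_dates=["date"])

df["weekday"] = df["date"].dt.day_name()
df["year"] = df["date"].dt.year

# Mean of a boolean column = fraction of True values; scale to percent.
pct_by_day = df.groupby("weekday")[["is_pun", "is_culture"]].mean() * 100
pct_by_year = df.groupby("year")[["is_pun", "is_culture"]].mean() * 100

pct_by_day.plot(kind="bar", ylabel="% of clues")
pct_by_year.plot(ylabel="% of clues")
```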

Woah! Those are some neat trends. I’ll be honest with you, reader: I was not expecting this to work. We see that there is some real variation over time in how often constructors rely on different types of clues. Up front, let’s just quickly observe that the total change in any of these subplots is on the order of \(3-5\%\), so while these trends seem real (ie. statistically significant, in the sense that the mean CIs show a lot of separation, not that this really means much for non-random data), the effect size is fairly small. Nevertheless, I think these plots say some pretty interesting things about how these kinds of clues are used.

At the day-level, we see that at the beginning of the week pun clues are at their lowest points, while culture clues are quite frequent. As the week goes on we see that pun clues increase, which is somewhat compensated for by a reduction in culture clues. To me this suggests that pun clues are intended to be “harder”, whereas culture clues can make a puzzle easier to solve. This tracks with my experience: when you solve a culture clue you can typically be very confident in that solution, so they convey a lot of information. Interestingly, culture clues come back in popularity for the Sunday puzzle, alongside pun clues. I’m guessing this is because Sunday puzzles are so large that culture clues provide a lot of “anchors” which can help a solver lock down the different quadrants of a puzzle. It could also be that culture clues allow for a wider variety of fill (crossword jargon for the words used as solutions), which is probably helpful when constructing a larger puzzle.

On the year scale, we also see some really interesting trends. For one, pun clues seem to be having a popular decade. We see that early in Will Shortz’s career pun clues were relatively rare, increasing throughout the 2000’s to their peak around 2014. Culture clues, on the other hand, are currently at their lowest popularity in 30 years. Despite a pretty serious peak around 2009, culture clues have fallen steadily out of favor since that high-water mark. I, for one, appreciate this, as I often feel like culture clues are kind of boring (either you know the reference or you don’t; there’s not much to puzzle over).

We can also break these down by clue direction (Across or Down):

For culture clues nothing much changes, but pun clues do seem slightly more frequent in the “Across” direction, and on Sundays the increase in pun clues is due to an increase in “Across” puns (“Down” puns actually decrease). Honestly I’m kind of stumped by this one. If I had to hazard a guess, I’d say that it seems easier to solve a pun when the letters are arranged in “reading order”.

Somewhat Higher-Hanging Fruit

Clue Clustering Preliminaries

Well the acid house mix I’ve got playing in the background has like 45 more minutes on it, and I’m still trying to procrastinate my thesis, so let’s do some real feature engineering. My original plan was to try running Latent Dirichlet Allocation (LDA) on the crossword clue text and see if there were any natural clue “topics”. Unfortunately it looks like the clue text is too short for LDA, so I had to try something else.

Instead I decided to run a clustering algorithm on clue embeddings, which I found using Weighted Matrix Factorization. I’m not going to go into the gory details of WMF here (for a quick summary check the Appendix, or for a longer one click the link), but the upshot of this process was that I converted each clue into a point in 10-dimensional space. The procedure is set up so that similar clues are converted into points which are “close”, so by searching for clusters of vectors in the 10-dimensional feature space we can assign each clue to a group of similar clues. The most popular way to do this kind of group assignment in pretty much any application is k-means clustering.

I decided to run k-means with \(k=11\) clusters; that is, each clue was assigned a numeric label of 0-10 (since I zero-index like any rational human). I chose this value of \(k\) by minimizing the Akaike Information Criterion (AIC, see appendix).
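In code this step is short. A minimal sketch with scikit-learn, assuming the 10-dimensional clue embeddings have already been computed and saved (the file name is a placeholder):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder: the (n_clues x 10) matrix of clue embeddings from the WMF step.
embeddings = np.load("clue_embeddings.npy")

kmeans = KMeans(n_clusters=11, random_state=0, n_init=10)
labels = kmeans.fit_predict(embeddings)   # one integer label in 0-10 per clue
```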

Results

Okay with all of the house-keeping out of the way, let’s look at some results! To start let’s take a look at some (randomly selected) examples of clues in each of the clusters:

Cluster 0: Break the seal on | Brief apology | The Thrilla in Manila for one | Character actor in the Cowboy Hall of Fame | Regulation
Cluster 1: Like slicker floors | Like 1-Across by descent | Like the Sahara | Like a New York/Los Angeles romance | Like some private-home apartments
Cluster 2: Seven-time Rose Bowl winner for short | Jerry Garcia’s band for short | Letter ender for short | Withdrawal site for short | Free TV ad for short
Cluster 3: 66 e.g.: Abbr. | Genealogical datum: Abbr. | Norm: Abbr. | Letter from St. Paul: Abbr. | Part of A.T.&T.: Abbr.
Cluster 4: Way: Abbr. | Catch in a way | By way of | Shocks in a way | Trouble in a way
Cluster 5: Kind of school | Kind of fairy | Kind of state | Kind of titmouse | Kind of word
Cluster 6: Say “Oh that was nothing”, say | Safaris without guns, say | Crosswords, say | Head out to sea, say | Tighten one’s laces, say
Cluster 7: El Greco’s city | City near Brigham City | German city rebuilt after W.W. II | City near Düsseldorf | Gomorrah’s sister city
Cluster 8: Big name in computers | Vardalos of My Big Fat Greek Wedding | Birds that lay big green eggs | Big name in A.T.M.’s | Pitt of The Big Short
Cluster 9: Amount left in Old Mother Hubbard’s cupboard | Letters for Old MacDonald | Home for an old woman in a nursery rhyme | Charles Nelson ___ old game show staple | Old White House moniker
Cluster 10: Tabloid cover topic maybe | Patella’s place | Rakes maybe | Place to get drunk before getting high? | Place to see a flick?

Here are the sizes (number of clues) of each cluster:

Cluster | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Size | 770,179 | 11,930 | 5,786 | 11,726 | 4,818 | 4,574 | 4,946 | 4,087 | 4,419 | 4,561 | 7,953

It’s not hard to see some clear trends within clusters: cluster 1 is composed of clues using the word “Like”, cluster 2 has clues containing the phrase “for short”, etc. Some of these clue types seem somewhat aligned with what I’d consider pun clues or culture clues, so let’s see if there are any relationships between our clusters and the previously defined clue types:

There doesn’t appear to be too much variation in the presence of pun clues across clusters; culture clues, however, vary substantially. Clusters 3, 7, and to a lesser extent 9 all appear to have higher than average percentages of culture clues, whereas Clusters 4, 5, and 6 have lower than average percentages. Looking at the examples of each cluster, this seems very reasonable. Cluster 3 corresponds to clues whose solutions are abbreviated, and “Abbr.” is probably getting flagged as a proper noun (or something similar), so these clues are getting classified as culture clues. Cluster 7 seems to be defined by clues with the word “City”, which means that these clues are largely straight trivia questions. Cluster 9 seems defined by the word “old”, which often appears to correspond to some kind of trivia question as well. On the other hand, Clusters 4 and 6 appear to be alternate methods of suggesting that the solution is some kind of pun or play on words (“in a way” and “say” both typically indicate that the clue isn’t straightforward). Cluster 5 (defined by “Kind of”) seems like a straight trivia cluster, but not cultural trivia per se.

To finish off, let’s look at the usage of each cluster over time, and see if there are any interesting trends. Because there are a lot of clusters, straight-up line plots would be a little crowded and hard to read. To make things more legible I’ll define a few ad-hoc groups of clusters which I’ll plot instead. These groups won’t be very rigorously defined, relying on observations about what type of clues belong in each cluster as well as my personal experience as a cruciverbalist. I’m also going to exclude cluster 0, which ended up being the “miscellaneous” cluster, containing clues that didn’t belong elsewhere.

First I’ll define a “trivia” group, which contains clusters 5, 7, 8, and 9. The last two clusters, 8 and 9, appear to be defined by clues which contain the word “big” or “old” respectively. “Old” clues in particular seem to be heavily culture based (as in "Old White House moniker") so obviously belong in the trivia group. “Big” clues however don't seem to have quite so close a relation to “culture clues”. The reason I’m choosing to include them, then, is because of some prior knowledge I’m bringing to the table: a common clue template is “Big name in ____”, where the blank is often filled by a non-proper noun. Even though the clue text itself is less likely to contain a "culture word" (the name of a large company, for example) I would still consider this a “culture clue”, because the solution is typically a well-known company or individual.

Second, I’ll look at the “wordplay” group, containing clusters 4, 6, and 10. These appear to be defined by the words “way” (as in “in a way”), “say”, and “maybe”, respectively. From my personal experience solving crosswords, these phrases can be seen as an alternative to marking clues with a “?”. While these clues may not be puns exactly, they are typically not straightforward, relying on an unusual interpretation of a word or phrase.

The remaining three clusters (1, 2, and 3) seem to be defined (respectively) by the words “like”, “for short”, and “abbr.”, and don’t seem to relate to each other meaningfully. I’ll therefore leave these clusters separate, but refer to them as the “simile”, “nickname”, and “abbreviation” clusters (there may be some overlap between the abbreviation and nickname clusters, as “for short” is occasionally used to indicate an abbreviation).
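To make the grouping concrete, here’s how it might be encoded against the cluster labels (assuming the clue DataFrame from earlier has picked up a cluster column; the dictionary just transcribes the groups described above):

```python
# Ad-hoc cluster groups; cluster 0 (the "miscellaneous" bucket) is left out.
CLUSTER_GROUPS = {
    5: "trivia", 7: "trivia", 8: "trivia", 9: "trivia",
    4: "wordplay", 6: "wordplay", 10: "wordplay",
    1: "simile", 2: "nickname", 3: "abbreviation",
}

df["group"] = df["cluster"].map(CLUSTER_GROUPS)   # NaN for cluster 0
```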

Now let’s see how usage of these groups has changed over time:

I think what jumps off the page most to me is the difference between the trend in “Trivia” clues (shown here) and the trend in “Culture” clues (shown in “Low-Hanging Fruit” above). Obviously these clue types are not precisely the same, but I thought they would have somewhat similar behavior. Instead we see that Trivia clues exhibit almost no change in usage over the last 20 years, and become slightly more popular as the week goes on (almost the opposite of the usage of Culture clues). My guess is that trivia clues from the “Kind of” cluster are really carrying the difference here, as these were the clues which were least related to culture clues.

On the other hand, “wordplay” clues seem to closely mimic the behavior of pun clues. This is consistent with our earlier finding that pun clues tend to increase puzzle difficulty, as measured by day of the week. It definitely seems to me that “maybe” and “say” are used as close substitutes for “?”.

Wrapping Up

Alright, this feels like a good place to leave this blog post for now. We’ve seen that two types of clues, “culture” and “pun”, see a lot of variation over the days of the week (not to mention throughout the years). By adjusting the mix of these clue types, constructors can control the difficulty of a crossword puzzle. Additionally, we’ve seen that there are a few other ways to define clue types. Using clustering we discovered that “maybe” and “say” are alternative indicators of a pun clue. We’ve also seen that culture clue trends are not fully reflective of trivia clues more generally. Non-culture trivia clues exhibit a lot more stability over the days of the week, and the years, than culture-based trivia clues.

I’d be remiss if I didn’t include one last bit of eye-candy before I closed up. I’ve talked a lot about cluing here, but I totally ignored the other major aspect of crossword construction: the grids! That’s going to be the subject of another blog post I think, but for now I’d like to leave you with the “average” crossword grid for each day of the week (excluding Sunday). Lighter colors indicate a square that’s more likely to be blacked out:
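For the curious, here’s roughly how such an average grid can be computed from the .puz files (a sketch assuming puzpy’s convention that “.” marks a black square, and that the files have already been sorted into per-day lists of paths):

```python
import matplotlib.pyplot as plt
import numpy as np
import puz

def black_square_mask(p):
    """1 where a square is blacked out, 0 elsewhere ('.' marks black squares)."""
    cells = np.array([1 if ch == "." else 0 for ch in p.solution])
    return cells.reshape(p.height, p.width)

def plot_average_grid(paths, title):
    """Average the black-square masks of several same-sized puzzles."""
    grids = np.stack([black_square_mask(puz.read(path)) for path in paths])
    plt.imshow(grids.mean(axis=0), cmap="viridis")  # lighter = more often black
    plt.title(title)
    plt.axis("off")
    plt.show()

# eg. plot_average_grid(monday_paths, "Monday"), where monday_paths is a list
# of Monday .puz file paths.
```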

Zoinks!

Appendix

Weighted Matrix Factorization

We start by compiling what I’ll call the “count matrix”: the matrix \(M\) whose element \(M_{ij}\) is the number of times word \(j\) appears in clue \(i\). Now, because clue texts are short and words are rarely repeated within a clue, this is basically a matrix of 0s and 1s, where an element is 1 when a clue contains a word and 0 otherwise. This matrix is usually “sparse”, ie. most elements are 0. This makes a lot of matrix algebra cheap to perform.
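With scikit-learn this matrix takes a couple of lines to build (a toy illustration; the real input would be the full list of clue strings):

```python
from sklearn.feature_extraction.text import CountVectorizer

clues = ["Break the seal on", "Like slicker floors", "El Greco's city"]  # toy input

vectorizer = CountVectorizer()
M = vectorizer.fit_transform(clues)         # sparse matrix, n_clues x n_unique_words
words = vectorizer.get_feature_names_out()  # column j of M corresponds to words[j]
print(M.toarray())                          # mostly 0s and 1s, as described above
```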

The basic idea here is to find two matrices \(C\) (dimension \(r \times n\)) and \(W\) (dimension \(r \times m\)), designated as the “clue” and “word” matrices, such that \(C^T W \approx M\). Here \(n\) is the number of clues, \(m\) is the number of unique words, and \(r\) is a parameter which we choose: the dimension of the embedding space (I selected \(r=10\) in this post). We then take column \(c_i\) of the clue matrix as the vector embedding of clue \(i\), and column \(w_j\) of the word matrix as the embedding of word \(j\). We can understand the logic behind this choice of factorization by observing that the element \(M_{ij} = c_i^T w_j\), ie. it’s the inner product between \(c_i\) and \(w_j\). This means that when two clues share a word, their vectors will both be “similar” to \(w_j\) (since \(c_{i_1}^T w_j = c_{i_2}^T w_j = 1\)). Now, the quotes around “similar” are doing a lot of lifting here, and it may be the case that \(c_{i_1}\) and \(c_{i_2}\) are quite far from each other (in the Euclidean sense). However, the more words two clues share, the closer their vectors will become.

One last bit of house-cleaning: the problem of finding \(C\) and \(W\) such that \(C^T W \approx M\) is not well-defined (it lacks a unique solution). Observe that if \(C\) and \(W\) are such that \(C^T W = \hat{M}\) is a good approximation for \(M\), then for any value \(\alpha \neq 0\) we get an equally good approximation with \(\alpha C\) and \(\frac{1}{\alpha}W\). We therefore need to “regularize” this problem. I won’t go into too much detail about what that means here, but the upshot is that we look for matrices \(C\) and \(W\) that not only approximate \(M\), but that also have the smallest possible columns without picking up too much error between \(\hat{M}\) and \(M\). While there may not be unique matrices \(C\) and \(W\) that approximate \(M\), there are unique matrices \(C\) and \(W\) that both approximate \(M\) and have small columns, so the regularized problem is well-defined.
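To give a feel for how such a factorization is computed, here is a toy alternating-least-squares sketch with an L2 penalty on the columns. It ignores the “weighted” part of WMF (and the sparsity of \(M\)) for brevity, so treat it as an illustration of the idea rather than the exact procedure I ran:

```python
import numpy as np

def regularized_als(M, r=10, lam=0.1, n_iters=50, seed=0):
    """Alternating least squares for M ~ C^T W with an L2 (ridge) penalty.

    M is a dense (n x m) count matrix; C (r x n) holds the clue embeddings
    and W (r x m) the word embeddings, one per column.
    """
    rng = np.random.default_rng(seed)
    n, m = M.shape
    C = rng.normal(size=(r, n))
    W = rng.normal(size=(r, m))
    reg = lam * np.eye(r)
    for _ in range(n_iters):
        # Fix W and solve the ridge problem for every clue column at once.
        C = np.linalg.solve(W @ W.T + reg, W @ M.T)
        # Fix C and solve for every word column.
        W = np.linalg.solve(C @ C.T + reg, C @ M)
    return C, W

# eg. C, W = regularized_als(M.toarray(), r=10); clue i's embedding is C[:, i].
```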

Akaike Information Criterion

The AIC is a method of measuring the statistical “fit” of a model that penalizes the model for being too complex (since more complex models can more easily fit data; a clustering algorithm which assigns each data point to its own cluster would have perfect fit, but would be a bad model). AIC scores are “better” when they’re smaller, so we want to choose the value of \(k\) which produces the lowest AIC score:

As we see, the initial increases in \(k\) (ie. increases in the model complexity) are “worth it”; the improvement in model fit outpaces the penalty on model complexity and so AIC decreases. Eventually, however, the gains in model fit diminish and the complexity penalty takes over, causing AIC to start increasing. We see that the best performing model has \(k=11\) clusters.
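For concreteness, here’s a minimal sketch of what that sweep over \(k\) might look like, scoring each k-means fit with an AIC built from a spherical-Gaussian log-likelihood (one common way to do it; the exact bookkeeping of the parameter count is a judgment call):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_aic(X, labels, centers):
    """AIC = 2 * (number of parameters) - 2 * log-likelihood, under a spherical
    Gaussian model with one mean per cluster and a shared variance."""
    n, d = X.shape
    k = centers.shape[0]
    rss = ((X - centers[labels]) ** 2).sum()     # within-cluster sum of squares
    sigma2 = rss / (n * d)                       # pooled variance estimate
    log_lik = -0.5 * n * d * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * k * d - 2 * log_lik               # k cluster centers, d dims each

# Placeholder: the (n_clues x 10) embedding matrix from the WMF step.
X = np.load("clue_embeddings.npy")

scores = {}
for k in range(2, 21):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    scores[k] = kmeans_aic(X, km.labels_, km.cluster_centers_)

best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```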