Something I Am Embarrassed to Say I Just Learned

I knew that English Bible translators have access to a computerized linguistic corpus—an unbelievably massive collection of English texts—to help them do their work.

What I didn’t know, what I just learned, is that I do, too.

What you’re about to learn, if you didn’t already know it, is so do you.

So I chose to get involved in some online discussion about the KJV, and I’m glad I did. I was talking to some intelligent guys who kept me on my toes. I pointed out to them, in a broader argument about the readability of the KJV, that “dropsy” (Luke 14:2) is an archaic word liable to cause today’s readers to draw a blank. The word is very old, first citation 1290—though, of course, that doesn’t necessarily mean it’s archaic (sack is also very old, but not archaic). But my sense was that “dropsy” just doesn’t get used today.

One of my interlocutors pointed out, and touché for him, that my beloved ESV uses the word, too! (He could have added that the NASB uses it as well.) I had not realized this, and I was initially surprised.

However, being a denizen of the Internet and therefore rarely being one to admit fault, I determined to do some poking around. Standard contemporary dictionaries weren’t enough help. Merriam-Webster told me only that “dropsy” means “edema.” American Heritage said the word is “no longer in scientific use,” but didn’t elaborate. Is it archaic? Should the ESV and NASB have used it? I didn’t know yet. Even if the word has dropsied right out of science, maybe it has landed in the speech of the common man.

So I checked Google’s NGram Viewer, and this is what I found:

Right after 1900, “edema” clearly changes places with “dropsy.” I’m not sure why there are massive spikes, and a big drop in the “edema” line starting sometime before the year 2000. I’m also not sure how much to trust Google NGram Viewer—I simply don’t know whether the corpus it’s searching (Google Books) is a truly representative sample. I’m not confident that I’m interpreting the graphs correctly. Perhaps the relative difference is huge, but the actual difference is not. Stats are tricky.
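To make that worry about relative versus actual difference concrete, here is a minimal sketch. The counts and the corpus size are invented for illustration only; they are not figures from Google Books or any real corpus, and `per_million` is a helper name I made up. It shows how one word can be ten times more common than another while both remain rare in absolute terms.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Occurrences per million words of corpus text."""
    return count * 1_000_000 / corpus_size


# Hypothetical numbers for illustration only; not real corpus data.
corpus_size = 500_000_000   # total words in an imaginary corpus
count_a = 2_000             # occurrences of the more common word
count_b = 200               # occurrences of the rarer word

freq_a = per_million(count_a, corpus_size)
freq_b = per_million(count_b, corpus_size)

relative = freq_a / freq_b   # how many times more common A is than B
absolute = freq_a - freq_b   # gap in occurrences per million words

print(f"A: {freq_a:.1f} per million, B: {freq_b:.1f} per million")
print(f"relative difference: {relative:.0f}x")
print(f"absolute difference: {absolute:.1f} per million")
```

With these made-up inputs the ratio looks dramatic, yet the gap is only a few occurrences per million words, which is why it helps to look at both numbers before drawing conclusions from a graph.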

Then it hit me: I wonder if there’s an online English corpus available freely, designed for precisely my question, and focused on contemporary English—the kind of corpus I’ve heard Doug Moo talk about, which he used for the NIV. I searched for “english corpus,” and as they say in Telugu, voilà. I discovered BYU’s Corpus of Contemporary American English (COCA). It provides a massive, curated database, balanced across different types of American speech and writing. It’s composed of roughly equal parts spoken, fiction, magazine, newspaper, and academic English. Wow.

There are actually multiple English corpora at the site, and they “allow research on variation—historical, between dialects, and between genres—in ways that are not possible with other corpora.”

So, COCA, what’s a more common word in contemporary English, “dropsy” or “edema”? There’s a very clear winner. But if I give you a fish you’ll only eat for today. Go see if you can figure it out yourself.

Author: Mark Ward

PhD in NT; theological writer for Faithlife; former high school Bible textbook author for BJU Press; husband; father; ultimate frisbee player; member of the body of Christ.

7 thoughts on “Something I Am Embarrassed to Say I Just Learned”

  1. Mark, I’ve frequently played with Google nGrams. I think there is a reason why the nGram viewer defaults to 1800-2000 for the timeframe. Every time I increase the range to say 1500- or 1600-, I observe the same kind of wild jumping around in the lines prior to 1800 or so.

    I think this may be because Google Books just doesn’t have a lot of data for books published prior to 1800. The gaps in the dataset probably produce the low scores that show in the chart as the troughs in the line. So I don’t trust it prior to 1800 either, although typical “KJV” words are still interesting to chart from 1500 onwards, as they typically trend steadily downwards after the 1830s or so.

    That’s just my hunch anyway. I’ll have to play with COCA now and see if I can learn more about things.

  2. Oh, and one more thing… this page has some helpful background on the datasets behind nGram Viewer.

    https://books.google.com/ngrams/info

    I think this part may confirm (at least in part), my theory about dataset gaps prior to 1800.

    Why do I see more spikes and plateaus in early years?

    Publishing was a relatively rare event in the 16th and 17th centuries. (There are only about 500,000 books published in English before the 19th century.) So if a phrase occurs in one book in one year but not in the preceding or following years, that creates a taller spike than it would in later years.

    Plateaus are usually simply smoothed spikes. Change the smoothing to 0.

  3. You probably need to combine “oedema” and “edema” and compare that with “dropsy”, though that will only strengthen your case.

  4. Fun graphs. A quick and dirty (and far less accurate) way of measuring word popularity is simply to compare the number of ‘hits’ you get for the terms. “Hits” is in scare quotes because these numbers are estimates and are known to be wildly inaccurate, e.g. orders of magnitude off. (If you don’t believe me and have time to kill, search for something rare and page to the end; you quickly find that Google and Bing always overestimate by a lot.) That said, in practice, despite the many other caveats that could be mentioned, it is often right enough. Of course, the more corpora you can run the comparison on with the same result, the greater your confidence about which term is really more popular.

    Wikipedia has an interesting criticism section for google nGrams and the ways in which it is known to not be representative: https://en.wikipedia.org/wiki/Google_Ngram_Viewer#Criticism
    To expand on the library critique: the corpus doesn’t take a book’s popularity into account. A book no one read that uses a word is not the same as a best-selling book that uses the same word. Additionally, the corpus is biased towards works whose copyright has expired and against works that still have some form of restrictive licence that disallows their inclusion in Google’s corpus.
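The point quoted in the comments above, that plateaus are usually smoothed spikes, can be sketched in a few lines. This is only an illustration under an assumption: it treats smoothing as a simple centered moving average over neighboring years, which may not match the viewer's actual implementation. The function name `smooth` and the yearly values are invented for the example.

```python
def smooth(values: list[float], smoothing: int) -> list[float]:
    """Centered moving average: each point becomes the mean of itself
    and up to `smoothing` neighbors on each side (fewer at the edges)."""
    result = []
    for i in range(len(values)):
        lo = max(0, i - smoothing)
        hi = min(len(values), i + smoothing + 1)
        window = values[lo:hi]
        result.append(sum(window) / len(window))
    return result


# Invented yearly frequencies: the phrase appears in a single year only.
raw = [0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0]

print(smooth(raw, 0))  # no smoothing: the single spike is unchanged
print(smooth(raw, 1))  # the spike is spread over three years: a plateau
```

With a smoothing of 0 the lone spike stays a spike; with a smoothing of 1 it becomes a flat, lower run across the neighboring years, which is the plateau shape that shows up in the early, sparse part of the chart.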
