Something I Am Embarrassed to Say I Just Learned

Sep 22, 2016 | KJV, Linguistics | 6 comments

I knew that English Bible translators have access to a computerized linguistic corpus—an unbelievably massive collection of English texts—to help them do their work.

What I didn’t know, what I just learned, is that I do, too.

What you’re about to learn, if you didn’t already know it, is that so do you.

So I chose to get involved in some online discussion about the KJV, and I’m glad I did. I was talking to some intelligent guys who kept me on my toes. I pointed out to them, in a broader argument about the readability of the KJV, that “dropsy” (Luke 14:2) is an archaic word liable to cause today’s readers to draw a blank. The word is very old, first citation 1290—though, of course, that doesn’t necessarily mean it’s archaic (sack is also very old, but not archaic). But my sense was that “dropsy” just doesn’t get used today.

One of my interlocutors pointed out, however (and touché for him), that my beloved ESV uses the word, too! (He could have added that the NASB uses it as well.) I had not realized this, and I was initially surprised.

However, being a denizen of the Internet and therefore rarely being one to admit fault, I determined to do some poking around. Standard contemporary dictionaries weren’t enough help. Merriam-Webster told me only that “dropsy” means “edema.” American Heritage said the word is “no longer in scientific use,” but didn’t elaborate. Is it archaic? Should the ESV and NASB have used it? I didn’t know yet. Even if the word has dropsied right out of science, maybe it has landed in the speech of the common man.

So I checked Google’s NGram Viewer, and this is what I found:

Right after 1900, “edema” clearly changes places with “dropsy.” I’m not sure why there are massive spikes and a big drop in the “edema” line starting sometime before the year 2000. I’m also not sure how much to trust Google NGram Viewer—I simply don’t know whether the corpus it’s searching (Google Books) is a truly representative sample. Nor am I confident that I’m interpreting the graphs correctly. Perhaps the relative difference between the two words is huge, but the absolute difference is not. Stats are tricky.
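To make that relative-versus-absolute worry concrete, here is a minimal Python sketch. The word labels, counts, and corpus size are all invented for illustration; they are not real figures for “dropsy” or “edema,” and the helper name `per_million` is my own.

```python
def per_million(count, corpus_size):
    """Occurrences of a word per million words of corpus."""
    return count * 1_000_000 / corpus_size


# Hypothetical numbers, made up purely for illustration.
corpus_size = 500_000_000
counts = {"word_a": 50, "word_b": 1_000}

freq = {word: per_million(n, corpus_size) for word, n in counts.items()}

# Relative difference: how many times more frequent word_b is than word_a.
ratio = freq["word_b"] / freq["word_a"]

# Absolute difference: the gap in occurrences per million words.
gap = freq["word_b"] - freq["word_a"]

print(f"word_b is {ratio:.0f}x as frequent as word_a")
print(f"but the gap is only {gap:.1f} occurrences per million words")
```

With numbers like these, one word can be many times more common than the other while both remain rare. Two lines on a chart can look far apart for that reason alone.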

Then it hit me: I wonder if there’s an online English corpus available freely, designed for precisely my question, and focused on contemporary English—the kind of corpus I’ve heard Doug Moo talk about, which he used for the NIV. I searched for “english corpus,” and as they say in Telugu, voilà. I discovered BYU’s Corpus of Contemporary American English (COCA). It provides a massive, curated database balanced across different types of American speech and writing. It’s composed of roughly equal parts spoken, fiction, magazine, newspaper, and academic English. Wow.

There are actually multiple English corpora at the site, and they “allow research on variation—historical, between dialects, and between genres—in ways that are not possible with other corpora.”

So, COCA, what’s a more common word in contemporary English, “dropsy” or “edema”? There’s a very clear winner. But if I give you a fish you’ll only eat for today. Go see if you can figure it out yourself.

6 Comments
  1. Duncan Johnson

    Mark, I’ve frequently played with Google nGrams. I think there is a reason why the nGram viewer defaults to 1800-2000 for the timeframe. Every time I increase the range to say 1500- or 1600-, I observe the same kind of wild jumping around in the lines prior to 1800 or so.

    I think this may be because Google Books just doesn’t have a lot of data for books published prior to 1800. The gaps in the dataset probably produce the low scores that show in the chart as the troughs in the line. So I don’t trust it prior to 1800 either, although typical “KJV” words are still interesting to chart from 1500 onwards, as they typically trend steadily downwards after the 1830s or so.

    That’s just my hunch anyway. I’ll have to play with COCA now and see if I can learn more about things.

  2. Duncan Johnson

    Oh, and one more thing… this page has some helpful background on the datasets behind nGram Viewer.

    https://books.google.com/ngrams/info

    I think this part may confirm (at least in part) my theory about dataset gaps prior to 1800.

    Why do I see more spikes and plateaus in early years?

    Publishing was a relatively rare event in the 16th and 17th centuries. (There are only about 500,000 books published in English before the 19th century.) So if a phrase occurs in one book in one year but not in the preceding or following years, that creates a taller spike than it would in later years.

    Plateaus are usually simply smoothed spikes. Change the smoothing to 0.
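The spike-and-plateau explanation quoted above can be sketched in a few lines of Python. All of the data here is invented, and the `smooth` function is only my assumption about how a smoothing setting behaves (a centered moving average over neighboring years). It is not Google's actual implementation.

```python
def relative_frequency(hits, total_words):
    """Share of each year's words accounted for by the phrase."""
    return [h / t for h, t in zip(hits, total_words)]


def smooth(values, k):
    """Centered moving average over up to k neighbors on each side."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - k): i + k + 1]
        out.append(sum(window) / len(window))
    return out


# Invented data: the phrase shows up 10 times in a single year.
hits = [0, 0, 0, 10, 0, 0, 0]
small_years = [100_000] * 7        # an era with very little in print
large_years = [100_000_000] * 7    # an era with a great deal in print

early = relative_frequency(hits, small_years)
late = relative_frequency(hits, large_years)

# The same 10 occurrences make a far taller spike in the small corpus.
print(early[3] / late[3])

# Smoothing spreads the single-year spike into a flat three-year plateau.
print(smooth(early, 1))

# A smoothing of 0 leaves the raw spike untouched.
print(smooth(early, 0) == early)
```

The same ten occurrences barely register when a year has a hundred million words in print, but they dominate a year with only a hundred thousand.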

  3. bethyada

    You probably need to combine “oedema” and “edema” and compare that with “dropsy”. Though that will make your case even stronger.

    • Mark Ward

      Excellent!

      That would be this:

      Though this is also helpful:

  4. tearfang

    Fun graphs. A quick and dirty (and far less accurate) way of measuring word popularity is simply to compare the number of ‘hits’ you get for the terms. Hits is in scare quotes bc these numbers are estimates and known to be wildly inaccurate, e.g. orders of magnitude off (if you don’t believe me and have time to kill, search for something rare and page to the end; you quickly find that Google and Bing always overestimate by a lot). That said, in practice, despite the many other caveats that could be added, it is often right enough. Of course, the more corpora you can run the comparison on with the same result, the greater the confidence about which term is really more popular.

    Wikipedia has an interesting criticism section for google nGrams and the ways in which it is known to not be representative: https://en.wikipedia.org/wiki/Google_Ngram_Viewer#Criticism
    To expand on the library critique: the corpus doesn’t take a book’s popularity into account. A book no one read that uses a word is not the same as a best-selling book that uses the same word. Additionally, the corpus is biased towards works whose copyright has expired and against works that still have some form of restrictive licence that disallows their inclusion in Google’s corpus.

    • Mark Ward

      Great comment. Great point: the more independent corpora the better.
