A generation ago, students would say they “graduated from college,” but now they “graduate college.” These tiny fluctuations in the way we use language are ubiquitous because “children don’t learn the language their parents actually speak,” according to David Smith, an assistant professor in the College of Computer and Information Science.

The discrepancies don’t significantly impede our ability to understand our children and grandchildren, he said, “but accumulation of small changes over long periods of time is enough to make our English sound a lot different from Shakespeare, Chaucer, or Beowulf.”

Backed by a Google Faculty Research Award, Smith is currently studying how languages have changed over the last several hundred years. But he’s doing it in a way only recently made possible through technological developments in the digital humanities and natural language processing. In the last few decades, libraries have been working to digitize literature. Now that millions of books are available as searchable files, researchers are able to ask questions that couldn’t be asked before.

Smith and his team will use corpora like the Penn Treebank, which includes the syntactic analyses of 30,000 sentences from The Wall Street Journal, to build statistical models that automatically detect the syntax of a sentence in a digitized book.

The main challenge will be building models that work across a diverse range of texts over the last several hundred years, including newspapers, blogs, and telephone conversations. “The statistical models predict which words are connected to other words in a sentence,” Smith explained. “The problem is that over 500 years, precisely because of the very phenomenon we’re trying to model, words’ patterns of attachment change.”
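To make the idea of shifting attachment patterns concrete, here is a minimal sketch in Python. It is not Smith's model: it is a simple relative-frequency estimate, and the function names and the tiny hand-written "treebank" are invented purely for illustration. Each sentence is reduced to pairs linking a word to the word it attaches to, and the model estimates how likely a given attachment is based on what it has seen.

```python
from collections import Counter, defaultdict


def train_attachment_model(treebank):
    """Count how often each dependent word attaches to each head word.

    `treebank` is a list of sentences; each sentence is a list of
    (dependent, head) pairs, as a hand-annotated parse might provide.
    """
    counts = defaultdict(Counter)
    for sentence in treebank:
        for dependent, head in sentence:
            counts[dependent][head] += 1
    return counts


def attachment_probability(counts, dependent, head):
    """Relative-frequency estimate of P(head | dependent); 0.0 if unseen."""
    head_counts = counts.get(dependent, Counter())
    total = sum(head_counts.values())
    return head_counts[head] / total if total else 0.0


# Invented toy data (an assumption, not drawn from any real corpus):
# two older-style sentences of the form "graduate from college",
# where "college" attaches to "from".
older_usage = [
    [("from", "graduate"), ("college", "from")],
    [("from", "graduate"), ("college", "from")],
]

model = train_attachment_model(older_usage)

# The older pattern is well supported by the training data...
p_old = attachment_probability(model, "college", "from")
# ...but the newer "graduate college" attachment was never observed.
p_new = attachment_probability(model, "college", "graduate")
print(p_old, p_new)
```

A model trained only on one era's usage assigns no probability to the attachment that appears in another era, which is a small-scale version of the difficulty Smith describes. Practical parsers use much richer features and smoothing than this sketch does.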

Once the researchers have a computational program in place that doesn’t require human supervision, they will be able to visualize the evolution of language. The work will also have a far-reaching impact on cultural and historical analyses, Smith said. “If we have a better model for language changes, we can reconstruct languages that don’t exist anymore,” he said. Further, if we understand how languages influence each other through history, we might get a better understanding of how cultures connect.

Smith’s research is primarily focused on computational linguistics, “but texts can be evidence for lots of things in the humanities,” he explained. “Not just language itself, but what people talk about with language.” His work, he said, can reveal what aspects of a culture people find interesting or how texts are evidence for communication, transportation, and social networks that are otherwise not observable.