The project is expected to launch by the end of the month. When it does, researchers and the public will be able to comb through widely reprinted texts identified by mining 41,829 issues of 132 newspapers from the Library of Congress. While this first stage focuses on texts from before the Civil War, the project will eventually cover the later 19th century and expand to include magazines and other publications, says Ryan Cordell, an assistant professor of English at Northeastern University and a leader of the project.

Fast forward a century and a half, and many of these newspapers have been scanned and digitized. Northeastern computer scientist David Smith developed an algorithm that mines this vast trove of text for reprinted items by hunting for clusters of five words that appear in the same sequence in multiple publications (Google uses a similar concept for its Ngram Viewer).
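
To illustrate the general idea, here is a minimal sketch in Python, not Smith's actual implementation: it indexes every five-word sequence in a set of texts and reports the sequences that turn up in more than one publication. The newspaper names and sample sentences are invented for the example.

```python
from collections import defaultdict

def five_grams(text):
    """Yield every run of five consecutive words in a text."""
    words = text.lower().split()
    for i in range(len(words) - 4):
        yield tuple(words[i:i + 5])

def shared_passages(papers):
    """Map each 5-gram to the set of papers containing it,
    then keep the 5-grams found in more than one publication."""
    index = defaultdict(set)
    for name, text in papers.items():
        for gram in five_grams(text):
            index[gram].add(name)
    return {gram: names for gram, names in index.items() if len(names) > 1}

# Toy example: the same sentence reprinted, slightly altered, in two papers.
papers = {
    "Boston Courier": "a remarkable cure for cholera was announced yesterday",
    "Albany Argus":   "a remarkable cure for cholera was announced this week",
}
for gram, names in shared_passages(papers).items():
    print(" ".join(gram), "->", sorted(names))
```

In a real system, overlapping matched five-word clusters would be merged into longer candidate passages, and the matching would have to tolerate noisy OCR; this sketch shows only the core lookup step.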

The project is sponsored by the NULab for Texts, Maps, and Networks at Northeastern and the Office of Digital Humanities at the National Endowment for the Humanities. Cordell says the main goal is to build a resource for other scholars, but he’s already capitalizing on it for his own research, using modern mapping and network analysis tools to explore how things went viral back then.

Read the article at Wired →