Applying a grammar of visualization to millions of texts: the Bookworm project
Benjamin Schmidt, Assistant Professor at Northeastern University and core faculty at the NuLab for Texts, Maps, and Networks
Large textual collections in the humanities pose unique problems for visualization. Decades of digitization programs have left humanists with dozens of major textual collections ranging in size from thousands to millions of documents. Each represents a substantial archive in and of itself, deserving of extensive analysis. As a rule, most humanists can access these texts only through search engines, which leaves their broad outlines and biases relatively invisible and makes them intractable to emergent practices of “distant reading,” which currently require large teams to re-implement relatively simple tasks.
This talk will outline a comprehensive strategy for data modelling of large full-text collections through their metadata, as expressed in the Bookworm platform, a project led jointly by the author and Erez Aiden of Rice University and currently deployed by a number of major text repositories, including the Medical Heritage Library, the Yale University libraries, and the HathiTrust. Bookworm integrates full text with metadata for any digital library that has metadata, providing an expressive grammar for textual analysis and exposing the data for statistical analysis and quantitative research. Treating words and metadata as equivalent entities allows extremely fast access to descriptive statistics of large collections, and easy integration with a wide variety of outside tools, such as MALLET for incorporating detailed analysis of topic models and the Stanford natural language processing tools for named entity recognition. The data modelling strategy supports a wide variety of text visualizations, from temporal charts to multivariable models to networks. This talk will illustrate the larger platform with examples from digital libraries and from some of the other large collections available, including tens of thousands of movies and TV shows and millions of student evaluations of their professors.
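The idea of treating words and metadata as equivalent entities can be illustrated with a minimal sketch. This is not Bookworm's actual implementation or API: the toy corpus, the field names, and the functions `build_rows` and `word_counts` are all hypothetical, chosen only to show how a word can be handled as one more field alongside year or library, so that any field can serve as either a grouping or a filter.

```python
from collections import Counter

# Hypothetical toy corpus: each document is metadata plus full text.
DOCS = [
    {"year": 1900, "library": "A", "text": "the whale and the sea"},
    {"year": 1900, "library": "B", "text": "the sea"},
    {"year": 1950, "library": "A", "text": "the city and the car"},
]


def build_rows(docs):
    """Flatten documents into rows in which 'word' is just another field."""
    rows = []
    for doc in docs:
        meta = {k: v for k, v in doc.items() if k != "text"}
        for word, n in Counter(doc["text"].split()).items():
            rows.append({**meta, "word": word, "count": n})
    return rows


def word_counts(rows, groups, limits=None):
    """Sum word counts grouped by any fields, after filtering on any fields.

    groups: list of field names to group by (metadata fields or 'word').
    limits: dict mapping a field name to the collection of allowed values.
    """
    limits = limits or {}
    totals = Counter()
    for row in rows:
        if all(row[field] in allowed for field, allowed in limits.items()):
            totals[tuple(row[g] for g in groups)] += row["count"]
    return dict(totals)


ROWS = build_rows(DOCS)

# Uses of "sea" per year: the word is a filter, the year is a grouping.
by_year_sea = word_counts(ROWS, ["year"], {"word": ["sea"]})

# Every word used in 1900: the roles are reversed with the same function.
by_word_1900 = word_counts(ROWS, ["word"], {"year": [1900]})

# Total words per library and year: no word constraint at all.
by_library_year = word_counts(ROWS, ["library", "year"])
```

Because the same function answers all three questions, a temporal chart, a word list, and a collection overview differ only in which fields are grouped and which are constrained. A deployment at the scale described here would presumably rely on a database with precomputed counts rather than in-memory rows.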
Benjamin Schmidt is an assistant professor of history at Northeastern University and core faculty at the NuLab for Texts, Maps, and Networks. His research interests are in the digital humanities and the intellectual and cultural history of the United States in the 19th and 20th centuries. His digital humanities research focuses on adapting techniques from data visualization and machine learning to enable the critical analysis of historical data. His dissertation, “Paying Attention,” described how shifting ways of measuring and defining attention in pedagogy, advertising and psychology transformed American understandings of the subject from 1890 to 1960. He also uses data from time to time to write less formally about the relatively unrelated topics of higher education in the United States and accuracy in historical fiction.