Enumerations: Data and Literary Study
by Andrew Piper
University of Chicago Press, 2018
Cloth: 978-0-226-56861-4 | Paper: 978-0-226-56875-1 | Electronic: 978-0-226-56889-8
DOI: 10.7208/chicago/9780226568898.001.0001


For well over a century, academic disciplines have studied human behavior using quantitative information. Until recently, however, the humanities have remained largely immune to the use of data, or have vigorously resisted it. Thanks to new developments in computer science and natural language processing, literary scholars have embraced the quantitative study of literary works and have helped make Digital Humanities a rapidly growing field. But these developments raise a fundamental, and as yet unanswered, question: what is the meaning of literary quantity?
In Enumerations, Andrew Piper answers that question across a variety of domains fundamental to the study of literature. He focuses on the elementary particles of literature, from the role of punctuation in poetry, the matter of plot in novels, the study of topoi, and the behavior of characters, to the nature of fictional language and the shape of a poet’s career. How does quantity affect our understanding of these categories? What happens when we look at 3,388,230 punctuation marks, 1.4 billion words, or 650,000 fictional characters? Does this change how we think about poetry, the novel, fictionality, character, the commonplace, or the writer’s career? In the course of answering such questions, Piper introduces readers to the analytical building blocks of computational text analysis and brings them to bear on fundamental concerns of literary scholarship. This book will be essential reading for anyone interested in Digital Humanities and the future of literary study.


Andrew Piper is professor in the department of languages, literatures, and cultures at McGill University. He is the author of Dreaming in Books: The Making of Bibliographic Imagination in the Romantic Age and Book Was There: Reading in Electronic Times, both published by the University of Chicago Press. He is also a founding member of the Multigraph Collective, a group of twenty-two scholars that recently published Interacting with Print: Elements of Reading in the Era of Print Saturation, also with the University of Chicago Press.


"Andrew Piper’s ambitious and timely book introduces a range of new concepts and approaches into the ongoing debate about the role and implications of computational text analysis for literary studies. In place of the hyperbolic emphasis on size and scale in 'distant reading' he explores the possibilities of repetitive, implicated, distributed, and diagrammatic reading. Uniting these diverse approaches by deploying notions of translation and entanglement, Piper explores the multiple ways in which computational readings always and inevitably participate in the construction of meaning. As well as clearly and incisively describing methods for computational text analysis, Piper uses them to address longstanding questions in literary studies, ranging from the nature of poetic contrasts and plot development to the relationship of life, death and aesthetics, and the shape of literary oeuvres. In doing so he foregrounds and advances the traditions and concerns that computational and non-computational literary scholars share in common in a way that bridges and exceeds the differences in method that sometimes threaten to divide them."

— Katherine Bode, Australian National University

"Enumerations is a landmark. No book has succeeded half so well in bridging the stubborn divide between numbers and letters in literary study. To his Swiss-army-knife proficiency in data science, book history, and close reading, Piper adds a rare measure of generosity and tact. The book extends itself as an open invitation and encouragement to anyone interested in problems of literary method."
— James F. English, University of Pennsylvania

"Mathematical reasoning and literary insight dovetail in Enumerations. This beautifully written book uses numbers to make the texture of poetry and fiction surprising, so we can reflect more deeply on habitual pleasures. Piper's readers will come away knowing how to do new things with data, but also with a better understanding of characterization, plot, and the nature of fiction itself. Enumerations is not just an important book but a foundational one for this emerging field."
— Ted Underwood, University of Illinois, Urbana-Champaign

"This is the book that literary studies needs. To read Enumerations is to follow the movement of a mind lit up by the new interpretive pathways that computational reading opens, and thus to see demonstrated before one’s eyes its enormous significance. From massive questions like 'What is fiction?' to miniscule matters like 'What details allow us to see two poems as similar?,' Piper uses computational modeling to defamiliarize the act of literary interpretation. One puts down Enumerations with a sense of how rich a field literary studies is, of how many questions we still have to ask, of how much work we have to do. In language that is theoretically sophisticated and technically straightforward—that leaps gracefully from Barthes’s 'non-sentence' to the concept of distributional semantics—Piper finally answers literary scholars’ persistent 'So what?'"
— Adam Hammond, University of Toronto


DOI: 10.7208/chicago/9780226568898.003.0001
[digital humanities;cultural analytics;natural language processing;machine learning;literary theory]
This book studies the meaning of repetitions of reading instead of its more singular moments--the quantities of texts and the quantities of things within texts--that give meaning to a reader's experience. The introduction establishes a new theoretical framework for thinking about literary quantity through three key terms: implication, distribution, and diagram. First, implication puts the idea of modeling and its contingency into the foreground of research. Focusing on the implicatedness of computational modeling allows a rethinking of our investments in either purely empirical or subjective reading experiences. Implicatedness acknowledges both the constructedness of our knowledge about the past and an underlying affirmatory belief contained in any construction of the world. Distributed reading, by contrast, suggests a fundamentally relational/reflexive way of thinking about literary meaning, the way the sense of a text is always mutually produced through the construction of context. Distributional models allow for a more spatial, contingent modeling of texts and contexts. Finally, diagrammatic reading indicates how the practice of computation produces meaning by "drawing together" different sign systems (letter, number, image). Replacing the haptic totality of the book--its graspability--computational reading relies on the diagram's perspectival totality, the continual mediation between letters and numbers.

DOI: 10.7208/chicago/9780226568898.003.0002
[Georges Bataille;digital humanities;cultural analytics;natural language processing;punctuation;poetry;economy of punctuation]
This chapter is a history of what Bataille might call the general economy of punctuation: its distributions, luxuriant overaccumulation, and rhythmic rise and fall (Amiri Baraka’s “delay of language”). Economy of punctuation shows how spacing/pacing create meaning on the page, and how tactics of interruption, delay, rhythm, periodicity, and stoppage are all essential means of communicating within literature’s long history. Economy of punctuation reveals the social norms surrounding how we feel about the discontinuities of what we want to say. Viewing the relationship between punctuation’s excess and its manifestation in twentieth-century poetry through a collection of 75,000 English poems by 452 poets who were active during the twentieth century, the chapter explores methods that move from the elementary function ("grep") to more sophisticated uses of word embeddings; it also explores poems that deploy periods well in excess of the norms of their age. Few narratives are more strongly ingrained in the field of poetics than this era's growing antipathy to punctuation. Yet we observe how the period became increasingly deployed by these poets. The period’s abundance creates a language space marked not only by a sense of the elementary (deictic/rudimentary) but also of opposition/conjunction, a sense of the irreconcilable.

DOI: 10.7208/chicago/9780226568898.003.0003
[digital humanities;cultural analytics;natural language processing;literary plot;novels;narratology]
This chapter is about words, not in the individual sense, but in the distributional sense (a larger set of patterns/behaviors, as a form of usage). Contra a single luminous word, distributional semantics shows relationships existing between words; meaning is shaped through probabilistic distributions. Understanding texts as word distributions, a way of thinking about plot (the way actions/beliefs are encoded in narrative form), tracks the shift/drift of language in a text as it signals to readers a change in the text’s concerns. Using a trilingual collection (450 mostly canonical novels from the long nineteenth century), the chapter shows these novels as distinctive in their lexical contraction. Though for much of their history novels have been imagined as an abundant, often exceedingly long form, multiplying dramatically over time, vector space model techniques show these novels pushing against their perceived history of imagined excess. These novels are unique in how the linguistic “space” within them contracts as they explore social constraint experienced through language. Here the art of lack offers insights into what it means to contract inward and, in so doing, potentially say more. The art of lack is the dream of insight where there is increasingly less and less to say.

DOI: 10.7208/chicago/9780226568898.003.0004
[digital humanities;cultural analytics;natural language processing;probabilistic topic modeling;information retrieval;information surplus;figure of speech]
This chapter shows how probabilistic topic modeling drives interest in the study of topics, and how topic modeling has proven a successful tool for identifying coherent linguistic categories within collections of texts. Yet despite interest in topic models, no one has yet asked the question “What is a topic?” (either in classical rhetoric or computational study). If we derive large-scale semantic significance from texts, how does this relate to the longer philosophical/philological tradition? Beginning with an overview of the long (pre-computational) history of topics (from Aristotle to Renaissance commonplace books to nineteenth-century encyclopedism), then moving to a quantitative approach to topic modeling’s link to the past, the chapter uses a single topic (a single model run on a collection of German novels written over the course of the long nineteenth century) to explore larger metaphorical constellations associated with this topic (through the close reading of individual passages), and to apply a more quantitative approach (chapter 2’s method of multidimensional scaling, where word distributions are transformed into spatial representations). The topic’s semantic coherence (when it’s more or less present) at different states of likelihood within a text is compared to the spatial relationships and interconnectedness of topics (the way they coalesce/disperse).

DOI: 10.7208/chicago/9780226568898.003.0005
[digital humanities;cultural analytics;machine learning;fictional speech;speech act theory;novels;realism;narratology]
How do we know when a text is signaling that it is “true” or “not true”? This chapter looks at these differences at a larger scale, using a collection of 28,000 documents (from the nineteenth to the twenty-first century) to understand what distinguishes fictional writing from its nonfictional counterpart, thereby making three primary counter-claims to dominant scholarly narratives. First, from the perspective of machine learning, fictionality is a highly legible category at the level of linguistic content; second, not only is such legibility coherent across different kinds of narrative types and languages, it also appears to have been surprisingly stable for at least two hundred years; and third, the way such stability has been achieved departs from a variety of scholarly positions about the novel. Fiction’s stability (esp. the novel) is based on a “phenomenological investment.” Seen quantitatively, fictional discourse’s particularity since the nineteenth century has been its profound investment not simply with the world around us, but also with our perceptual encounter with that world (how we “make sense,” explicitly related to our physical senses). These three aspects comprise the “coherence hypothesis,” the “immutability hypothesis,” and the “phenomenological hypothesis” about the rise of the novel over the past two centuries.

DOI: 10.7208/chicago/9780226568898.003.0006
[digital humanities;cultural analytics;natural language processing;named entity recognition;literary character;gender;female subjectivity;classical realism]
Quantity has a role to play in understanding the nature of characters and the process of characterization (the writerly act of generating animate entities through language). With Alex Woloch’s question of “the many” in mind, the chapter begins with a survey of an estimated 85 characters per novel in the nineteenth century, a conservative estimate of 20,000 novels published during this period in English, producing ca. 1.7 million unique characters appearing in one century in one language. Simultaneously, the process of characterization poses challenges of scale: the great number of characters, plus the vast amount of information surrounding even a single, main character. Characters (like other textual features) are abundant across the pages of novels. Through an examination of over 650,000 characters using new techniques in natural language processing and entity recognition, this chapter explores the "character-text" of novels (how characters are activated, described, objectified). The (surprising) evidence here suggests that the process of characterization is best described as one of stylistic constraint, aligning the practice of characterization more closely with a character’s etymological origins (as representative, general, or “characteristic,” not individualistic). The chapter then explores the rise of “interiorly oriented” characters and Nancy Armstrong’s notion of strongly gendered “deep character.”

DOI: 10.7208/chicago/9780226568898.003.0007
[digital humanities;cultural analytics;natural language processing;late style;poetry;literary careers;corpus linguistics;Walt Whitman;Wanda Coleman;Goethe]
The “corpus” (or body of work, since Cicero) of an author is meant to be organic, integral--well connected--but also distinct and whole; it marks limits; it is the material complement to the author’s life. What does it mean to imagine writing as a body, something with a distinct shape/form, but also subject to vulnerability? How do we understand those moments when a writer opens herself or her corpus up to change, how radical or gradual are these movements, how permanent, fleeting, or even recurrent? Is there something called “late style,” a distinctive signature that characterizes the end of a career as contours of an aging body mapped onto the weave of writing? When and how do we intellectually/creatively exfoliate? While we have very successful ways of detecting “authors” or “style,” we have considerably fewer techniques for talking about change, the nature of the variability within an author’s corporal outline, the variety of measures to study the shape of a writer’s career. Working with a trilingual collection of roughly 30,000 poems in French, German and English, this chapter explores questions of local/global vulnerability and late style, concluding with a computationally informed reading of the work of Wanda Coleman.

DOI: 10.7208/chicago/9780226568898.003.0008
[digital humanities;cultural analytics;natural language processing;late style;poetry;literary careers;corpus linguistics;Walt Whitman;Wanda Coleman;Goethe]
To be implicated can also mean--more literally--to be folded into something. This book discusses how we can become more implicated in our observations about literature past and present. Modeling in particular allows us to fold ourselves into the techniques/technologies through which we arrive at new knowledge of our objects of study. Models can account, more explicitly, for the mediations that govern our insights. This conclusion gestures toward ways to implicate ourselves more fully in our scholarship, moving from acts of self-measurement, observing the ways our own books are constructed and related to the field; to institutional accounting, the ways in which the institutions we work for, or those that we create (e.g., authors or concepts), are distributed across the field of academic publishing. These practices are part of a larger movement to imagine new forms of algorithmic openness, where computation is used not as an afterthought (as a means of searching for things that have already been preselected and sorted), but as a form of forethought (a means of generating more diverse ecosystems of knowledge). How can we use the tools of data science to capture and more adequately represent those values we care about in systems of scholarly communication?