This database provides search and download access to the “speculative bibliographies” of reprinted nineteenth-century newspaper texts discovered through the Viral Texts project at Northeastern University and the University of Illinois Urbana-Champaign. These clusters of texts have been matched using Passim, a tool which “implements algorithms for detecting and aligning similar passages in text.” For more about how we use text-reuse detection to study nineteenth-century newspaper reprinting, see our book, Going the Rounds. The “Textual Criticism as Language Modeling” chapter describes our computational methods in detail.
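
As a very rough illustration of the underlying idea (and emphatically not the actual Passim code, which detects and aligns passages at a much larger scale), the toy Python below counts the word n-grams, or “shingles,” that two texts share; a high count suggests that one text reuses the other. The function names and sample sentences are hypothetical.

    # Illustrative sketch only: counts shared word n-grams ("shingles") between
    # two texts, the basic signal behind reuse detection. Function names and the
    # sample sentences are made up for this example.
    import re

    def word_ngrams(text, n=5):
        """Return the set of word n-grams in a text, ignoring case and punctuation."""
        words = re.findall(r"[a-z0-9]+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def shared_ngrams(a, b, n=5):
        """Count the n-grams two texts have in common."""
        return len(word_ngrams(a, n) & word_ngrams(b, n))

    # Two made-up witnesses of the same squib, one with an OCR error ("Tlmes"):
    print(shared_ngrams(
        "A Remedy for Hard Times.--Work ten hours and sleep eight.",
        "A remedy for hard Tlmes -- work ten hours and sleep eight",
        n=4,
    ))  # prints 4: half of the shingles survive the OCR error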


Using this database

A few quick tips for using this tool to search newspaper reprints:

  1. When you search for a particular word or phrase, all the clusters in which it appears will be listed. You might find one text listed in multiple clusters if, for instance, parts of it circulated in different forms. For reasons like this, our algorithm sometimes aligns different parts of texts and produces separate clusters where a scholar might group them together. If you notice particular trends in how this separation is happening, let us know—that information might help us improve our clustering.
  2. Many historical newspaper archives do not divide page data into articles, and even those that do may not transcribe metadata for all individual texts on the page. Historically, many nineteenth-century newspaper texts did not include a headline/title, and the vast majority did not include authorial attribution. The names we assign to clusters, then, simply describe their structure (e.g. “38 reprints from 1870-09-24 to 1871-02-03”) rather than trying to identify the text within the cluster.
  3. In addition to the number of reprints in the title of each cluster, search results in this tool will also list how many individual texts within the cluster match the search phrase: e.g. “8 texts match search.” Often texts within the same cluster—i.e. matched texts—do not include all the same words, due both to historical editorial changes and to OCR errors in individual reprints. Our reprint-detection methods are “fuzzy” to enable us to match texts with significant internal differences, so we can connect texts that were edited while they circulated historically, as well as connect texts across digital archives made up of “dirty OCR.” A schematic sketch of these naming and counting conventions appears after this list.
  4. Our newspaper data is drawn from a number of digitized newspaper archives, most substantially Chronicling America, as well as open resources such as Making of America, Trove, Europeana, and the Internet Archive. In addition, Viral Texts has secured permissions to data-mine several corporate historical newspaper archives, including the ProQuest American Periodicals Series, the Gale British Library Newspapers: Part I: 1800-1900, the Gale Nineteenth Century U.S. Newspapers, and the Accessible Archives African-American Newspapers collections. Whenever possible, we link directly to the digitized page where each witness appears. However, we cannot reproduce data discovered in corporate archives. When a witness was discovered in proprietary data, the title and date of the reprint will be listed with a notice: “This text comes from a proprietary database and cannot be displayed due to copyright restrictions.” If you have access to the relevant databases, you can track down these witnesses independently.
  5. Following from #4, these bibliographies include only texts from the databases we are able to study computationally. Many large-scale digital archives of historical newspapers do not provide data-level access to their collections, and thus we cannot study them using our methods. We suspect you would find still more witnesses of these texts were you to search for them in additional historical newspaper archives.
  6. Finally, we call these “speculative bibliographies” for a reason: we use probabilistic methods to study newspaper reprinting at a large scale. This means, however, that you will find mistakes in the clustering, such as false positives, in which two different texts share enough of the same phrases that they are clustered together. Our favorite example of this is political speeches delivered by different politicians years apart but built on so many of the same stock phrases that they end up clustered together. If you notice trends in these kinds of false positives—particular kinds of texts that seem to be confusing our reprint detection—please let us know. Likewise, if you know that a particular text should be in our data but is not—e.g. a text you’ve studied and know was widely reprinted, but which you cannot find at all in our data—please let us know that as well. This kind of feedback from other scholars helps us improve our methods moving forward.
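
To make tips 2 and 3 above concrete, here is a small, purely hypothetical sketch of how a cluster might be labeled by its size and date range, and how matching texts might be counted for a search. The Witness and Cluster classes are illustrative only and do not reflect our actual data model.

    # Hypothetical sketch of the naming and counting conventions described in
    # tips 2 and 3; these classes and field names are illustrative, not the
    # project's actual schema.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Witness:
        newspaper: str   # paper in which this reprint appeared
        date: str        # ISO date of the issue, e.g. "1870-09-24"
        text: str        # OCR transcription of this reprint

    @dataclass
    class Cluster:
        witnesses: List[Witness]

        def label(self):
            """Describe the cluster's structure, e.g. '38 reprints from 1870-09-24 to 1871-02-03'."""
            dates = sorted(w.date for w in self.witnesses)
            return f"{len(self.witnesses)} reprints from {dates[0]} to {dates[-1]}"

        def texts_matching(self, phrase):
            """Count witnesses that contain the search phrase (reported as, e.g., '8 texts match search')."""
            phrase = phrase.lower()
            return sum(1 for w in self.witnesses if phrase in w.text.lower())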

Suggested Citation

One reason we share this data is, frankly, that there is just so much of it. There are millions of clusters in our current dataset—far more than any one scholar could study in a lifetime. We hope other scholars across fields will find useful information here about texts, authors, or genres they care about. Please cite the Viral Texts project in any work you produce or publish using this reprinting data:

Ryan Cordell and David Smith, Viral Texts: Mapping Networks of Reprinting in 19th-Century Newspapers and Magazines (2024), https://viraltexts.org.