Experiments in Computational Criticism #5: "Cricket bats" and "cricket balls" in Victorian Scientific Periodicals
The following (originally completed in May 2018) was part of a project to investigate what the sport of “cricket” meant to the Victorians. I was particularly interested to see whether cricket's associations with nationality and empire would be visible in distant readings of *Nature*, *Notices of the Proceedings at the Meetings of the Members of the Royal Institution*, *Philosophical Magazine*, *Proceedings of the Royal Society of Edinburgh*, *Proceedings of the Royal Society of London*, and the *Reports of the BAAS*. I post this to demonstrate my typical workflow in conducting these experiments, even when they yield less than impressive results.
There were some challenges in this project at the outset. As I quickly discovered, you cannot simply search for “cricket”: in a scientific corpus the word refers at least as often to the insect as to the sport, creating too many false data points. That is why my search focuses on the sport's equipment (e.g. “cricket bat”, “cricket ball”). Similarly, I found that my initial plan to add the sport of horse racing as a topic of interest was also unworkable, as “racing” led to too many false matches with discussions of “race” as a grouping of human beings.
Methodology: Recreational Reckoning
Experimental Question
I complete my distant readings of texts using packages others have developed in R. R can be a powerful tool for better understanding texts. It isn't always necessary to have a fully testable hypothesis in mind; visualization is also a tool for discovery, especially when you are willing to have fun exploring the many ways you can customize your analysis. On the other hand, because the data can be so easily manipulated, it is easy to fall into the trap of thinking you observe a feature in the text and then manipulating the text to draw out that feature. Fishing for information that supports a theory one already holds is a real problem in the field labelled by scholars such as those in the Stanford Literary Lab as “computational criticism.”
There are several principles that can be used to approach objective experimentation in automated text analysis, as discussed in Justin Grimmer and Brandon M. Stewart's “Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts” (Political Analysis, 2013). Unlike the social sciences, however, the humanities generally proceed not through testable and reproducible experiments but through the development of ideas. Recreational computational criticism, what I call “Recreational Reckoning,” therefore asks only that you choose one question that your analysis will answer. Questions such as: “Does Dickens's Bleak House include more masculine or feminine pronouns?”; “What topics are central to the Sherlock Holmes canon?”; “Do novel titles become longer or shorter over the course of the nineteenth century?” New features may become observable while pursuing this analysis, and it is up to the critic to theorize about what a newly visualized feature means. For this project, my question was whether I would find references to Britain or the British Empire closely associated with references to cricket.
Why R?
R isn't the only tool one can use for visualizing texts. However, I have found that computational methods in R shine when your texts are either too long, or too numerous, to read quickly. They are also useful when you have a specific methodology in mind or prioritize customizability in the data mining or the visualization. For quick visualizations of things like word clouds, Voyant (https://voyant-tools.org) is probably a better choice.
Downloading R
The first step in using this methodology is obviously to download R. This can be done here (https://www.r-project.org). Users should also download RStudio, an environment which will make running the code easier. (If you are reading this in R/RStudio, then congratulations on already having started!)
Setting Directory
The first step in analyzing your data is choosing a workspace. I recommend creating a new folder for each project; this folder will be your working directory. The working directory in R is generally set via the “setwd()” command. Here, however, we are working within R Markdown files (.Rmd). R Markdown relies on a package called knitr, which generally expects the .Rmd file to be stored in your working directory. So I recommend creating a new folder and then downloading these R Markdown files to the folder where you want to work. For example, you might create a folder called “data” on your computer desktop, in which case your working directory would be something like “C:/Users/Nick/Desktop/data”. You can check that your working directory is indeed in the right place by using the “getwd()” function below.
getwd()
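If you are running these commands outside an R Markdown file, you can point R at your project folder yourself. The path below is just the hypothetical desktop example from above; substitute your own:
setwd("C:/Users/Nick/Desktop/data") #replace with the path to your own project folder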
Downloading Packages
The next step is to load in the packages that will be required. My methodology makes use of several packages, depending on what is required for the task. Rather than loading the libraries for each script, I generally find it more useful to install and initialize all the packages I will be using at once, even if I won’t be using all of these packages for a particular experiment.
Packages are initially installed with the “install.packages()” function. HOWEVER, THIS STEP ONLY HAS TO BE COMPLETED ONCE.
“ggmap” is a package for visualizing location data.
“ggplot2” is a package for data visualizations. More information can be found here (https://cran.r-project.org/web/packages/ggplot2/index.html).
“pdftools” is a package for reading pdfs. In the past, you had to download a separate pdf reader, and it was a real pain. You, reader, are living in a golden age. Information on the package can be found here (https://cran.r-project.org/web/packages/pdftools/pdftools.pdf).
“plotly” is a package for creating interactive plots.
“quanteda” is a package by Ken Benoit for the quantitative analysis of texts. More information can be found here (https://cran.r-project.org/web/packages/quanteda/quanteda.pdf). quanteda's online documentation also includes a great getting-started vignette and a set of exercises.
“readr” is a package for reading in certain types of data. More information can be found here (https://cran.r-project.org/web/packages/readr/readr.pdf).
“SnowballC” is a package for stemming words (cutting the ends off words as a way of lowering the dimensionality of the data; for instance, “working”, “worked”, and “works” all become “work”).
“stm” is a package for estimating structural topic models.
“tm” is a simple package for text mining. An introduction to the package can be found here (https://cran.r-project.org/web/packages/tm/vignettes/tm.pdf).
“tokenizers” is a package which splits texts into tokens (words, sentences, and so on), returned as character vectors. An introduction to the package can be found here (https://cran.r-project.org/web/packages/tokenizers/vignettes/introduction-to-tokenizers.html).
install.packages("ggmap")
install.packages("ggplot2")
install.packages("pdftools")
install.packages("plotly")
install.packages("quanteda")
install.packages("readr")
install.packages("SnowballC")
install.packages("stm")
install.packages("tm")
install.packages("tokenizers")
Loading Libraries
The next step is to load the libraries for these packages into your environment, which is accomplished with the “library()” function.
library(ggmap)
library(ggplot2)
library(quanteda)
library(pdftools)
library(plotly)
library(readr)
library(SnowballC)
library(stm)
library(tm)
library(tokenizers)
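With the libraries loaded, you can sanity-check the two operations this methodology depends on, tokenizing and stemming, on an invented sentence:
testtext <- "He was intending to make cricket bats" #invented example sentence
testtokens <- unlist(tokenize_words(testtext, lowercase = TRUE)) #splits the sentence into lowercase word tokens
wordStem(testtokens) #stems each token; "intending" becomes "intend" and "bats" becomes "bat"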
A Note About Citation
Most of these software packages are written by academics, and reliable, easy-to-use software is difficult to make. If you use these packages in your published work, please cite them. In R you can even see how the authors would like to be cited (and get a BibTeX entry):
citation("ggplot2")
citation("quanteda")
citation("pdftools")
citation("plotly")
citation("readr")
citation("SnowballC")
citation("stm")
citation("tm")
citation("tokenizers")
Uploading Data and Setting Variables
I had already acquired .txt volumes of these texts, so I simply needed to upload the data. There are also various parameters, useful later on, that need to be defined. The basic methodology is that the script will go through each word in the .txt files and try to match it against a list of search terms. I chose to look for references to cricket bats and balls, golf balls and clubs, and tennis balls and rackets. However, it is often helpful to know the words that occur around a matched term, to provide context. The “conlength” variables provide three different sizes of “windows” for this purpose. For instance, “ProfSportsshortconlength” is set to three, meaning the final data set will have a column showing the three words to either side of the matched term.
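To make the “window” idea concrete before defining the real variables, here is a toy sketch with an invented token vector; the three-word window around the match reproduces the kind of Short_KWIC column shown in the results below:
toytokens <- c("to","place","a","cricket","bat","in","stones") #invented tokens
j <- 4 #position of the matched word "cricket"
toytokens[max(1, j - 3):min(length(toytokens), j + 3)] #the three words to either side of the match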
templocation <- paste0(getwd(),"/Documents")
ProfSportslocations <- c(paste0(templocation,"/Nature/Volumes"),paste0(templocation,"/Philosophical-Magazine/Volumes"),paste0(templocation,"/Reports-of-the-BAAS/Reports"),paste0(templocation,"/Royal-Institution/Proceedings"),paste0(templocation,"/Royal-Society-of-Edinburgh/Proceedings"), paste0(templocation,"/RSL/Proceedings"))
ProfSportsIndex <- c("Nature","Philosophical-Magazine","BAAS","Royal-Institution","RSE","RSL")
ProfSportslongconlength <- 250
ProfSportsshortconlength <- 3
ProfSportsPOSconlength <- 10
ProfSportssearchedtermlist <- c("cricket ball","cricket bat", "golf ball","golf club", "tennis ball","tennis racket")
ProfSportsoutputlocation <- paste0(getwd(),"/WordFlagDataFrames")
ProfSportsWordFlagdfPath <- paste0(ProfSportsoutputlocation,"/","ProfSportsWordFlagdf.txt")
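Because the search below is slow, a quick optional check that the corpus folders actually exist at the expected paths can save a wasted run:
dir.exists(ProfSportslocations) #should return TRUE for each of the six corpus folders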
To create the data frame compiling every reference to a term, run the following script. Be aware that this takes quite a while. So if you already have a data set that you just need to upload, see below instead.
Running the Script, or Uploading Previous Data
if (file.exists(ProfSportsWordFlagdfPath) == FALSE) { #only runs the long search if saved results do not already exist
dir.create(ProfSportsoutputlocation, showWarnings = FALSE) #creates the output folder if necessary, so the write.table below has somewhere to save
ProfSportsstemsearchedtermlist <- unique(sapply(strsplit(ProfSportssearchedtermlist, " "), function(x) paste(wordStem(x), collapse = " "))) #stems each word of each search phrase, so the phrases match the word-by-word stems produced below.
ProfSportsWordFlagmat <- matrix(,ncol=13,nrow=1) #empty matrix that will collect one row per matched term.
for (g in 1:length(ProfSportslocations)) {
tempdocloc <- ProfSportslocations[g]
files <- list.files(path = tempdocloc, pattern = "txt", full.names = TRUE) #creates vector of txt file names.
for (i in 1:length(files)) {
fileName <- read_file(files[i])
Encoding(fileName) <- "UTF-8" #the tokenize_words function requires UTF-8 input.
fileName <- iconv(fileName, "UTF-8", "UTF-8",sub='') #strips any bytes that are not valid UTF-8.
ltoken <- tokenize_words(fileName, lowercase = TRUE, stopwords = NULL, simplify = FALSE)
ltoken <- unlist(ltoken)
stemltoken <- wordStem(ltoken) #this uses the SnowballC package to stem the entire text.
textID <- i
for (p in 1:length(ProfSportsstemsearchedtermlist)) {
ProfSportsstemsearchedterm <- ProfSportsstemsearchedtermlist[p]
for (j in 1:(length(stemltoken) - 1)) { #stops one word short of the end, because the match below looks at a two-word window.
if (ProfSportsstemsearchedterm == paste0(stemltoken[j]," ",stemltoken[j+1])) { #checks whether this word and the next together form the searched phrase.
longtempvec <- ltoken[max(1, j - ProfSportslongconlength):min(length(ltoken), j + ProfSportslongconlength)] #250 words of context either side, clipped at the edges of the text.
shorttempvec <- ltoken[max(1, j - ProfSportsshortconlength):min(length(ltoken), j + ProfSportsshortconlength)] #three-word window.
POStempvec <- ltoken[max(1, j - ProfSportsPOSconlength):min(length(ltoken), j + ProfSportsPOSconlength)] #ten-word window.
TempTextName <- gsub(paste0(ProfSportslocations[g],"/"),"",files[i]) #this grabs just the end of the file path.
TempTextName <- gsub(".txt","",TempTextName, fixed = TRUE) #this removes the .txt from the end of the name.
temprow <- matrix(,ncol=13,nrow=1)
colnames(temprow) <- c("Text", "Text_ID", "ProfSportsstemsearchedterm","Lemma","Lemma_Perc","KWIC","Total_Lemma","Date","Category","Short_KWIC","POS_KWIC","Current_Date","Corpus")
temprow[1,1] <- TempTextName
temprow[1,2] <- textID
temprow[1,3] <- ProfSportsstemsearchedterm
temprow[1,4] <- j
temprow[1,5] <- (j/length(stemltoken))*100
temprow[1,6] <- as.character(paste(longtempvec,sep= " ",collapse=" "))
temprow[1,7] <- length(stemltoken)
temprow[1,8] <- strsplit(TempTextName,"_")[[1]][1]
temprow[1,10] <- as.character(paste(shorttempvec,sep= " ",collapse=" "))
temprow[1,11] <- as.character(paste(POStempvec,sep= " ",collapse=" "))
temprow[1,12] <- format(Sys.time(), "%Y-%m-%d")
temprow[1,13] <- ProfSportsIndex[g]
ProfSportsWordFlagmat <- rbind(ProfSportsWordFlagmat,temprow)
}
}
}
print(paste0(i," out of ",length(files)," in corpus ",g," out of ",length(ProfSportslocations))) #lets the user watch progress during long searches.
}
}
ProfSportsWordFlagmat <- ProfSportsWordFlagmat[-1,] #removes the empty first row used to initialize the matrix.
ProfSportsWordFlagdf <- as.data.frame(ProfSportsWordFlagmat)
write.table(ProfSportsWordFlagdf, ProfSportsWordFlagdfPath)
ProfSportsWordFlagdf
} #closes the file.exists check above.
If you have a previously constructed data set, you can simply load it with a script like this.
ProfSportsWordFlagdf <- read.table(ProfSportsWordFlagdfPath)
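Either way you arrive at the data, a quick check confirms its shape:
nrow(ProfSportsWordFlagdf) #the number of matched references
str(ProfSportsWordFlagdf) #an overview of the thirteen columns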
Results
An abbreviated version of the results looks like this:
ProfSportsWordFlagdf[,c("Text","ProfSportsstemsearchedterm","Short_KWIC")]
## Text
## 1 187311-187404_Nature_Vol.09_v00
## 2 188305-188310_Nature_Vol.28_v00
## 3 188605-188610_Nature_Vol.34_v00
## 4 189211-189304_Nature_Vol.47_v00
## 5 189511-189604_Nature_Vol.53_v00
## 6 185507-185512_Philosophical-Magazine_Ser.4_Vol.10_v00
## 7 185507-185512_Philosophical-Magazine_Ser.4_Vol.10_v00
## 8 189201-189206_Philosophical-Magazine_Ser.5_Vol.33_v00
## 9 189911-190107_Proceedings-of-the-Royal-Society-of-Edinburgh_Vol.23_v00
## 10 18540223-18551220_Proceedings-of-the-Royal-Society-of-London_Vol.07_v00
## 11 18540223-18551220_Proceedings-of-the-Royal-Society-of-London_Vol.07_v00
## 12 18991130-19000614_Proceedings-of-the-Royal-Society-of-London_Vol.66_v00
## ProfSportsstemsearchedterm
## 1 cricket bat
## 2 golf club
## 3 golf club
## 4 golf club
## 5 cricket bat
## 6 cricket bat
## 7 cricket bat
## 8 cricket bat
## 9 golf club
## 10 cricket bat
## 11 cricket bat
## 12 golf club
## Short_KWIC
## 1 were a molecular cricket bat and suppose
## 2 their foothill club golf club gymnastic club
## 3 when working at golf club felixstowe september
## 4 room in the golf club house at
## 5 to place a cricket bat in stones
## 6 intending to make cricket bats out each
## 7 the pods his cricket bats but not
## 8 were a molecular cricket bat and suppose
## 9 resembles a miniature golf club the head
## 10 intending to make cricket bats out of
## 11 the pods his cricket bats but not
## 12 the mid surrey golf club arrange ments
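A quick cross-tabulation of the data frame built above summarizes the handful of matches by periodical and search term:
table(ProfSportsWordFlagdf$Corpus, ProfSportsWordFlagdf$ProfSportsstemsearchedterm) #counts of matches per corpus per term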
The end result was somewhat disappointing. There are only twelve references to these phrases in the entire professional-science corpus I assembled, which I judged to be too little data to support any meaningful conclusions. But that's how things often turn out in Recreational Reckoning experiments.