Exploratory Data Analysis Using TF-IDF

Computational text analysis can be a powerful tool for exploring qualitative data. In this blog post, I'll walk you through the steps involved in reading a document into R to find and plot the most relevant words on each page.

While there are many possible applications of this approach, one way to use it is as an exploratory tool for getting to know a document (especially a lengthy one) before or after you begin reading it.

To calculate the "most relevant words," we'll be using a statistical metric called term frequency-inverse document frequency, or TF-IDF, which is widely used for document search and information retrieval in the information sciences.

TF-IDF works by counting how frequently each term appears in a document, and then weighting those counts by how "unique" each term is to a given document, page, paragraph, etc. In this case, we'll be calculating TF-IDF scores across each of the pages in a single document. This will tell us which words were frequent yet also unique to each page.
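To make the weighting concrete, here's a toy illustration using tidytext's bind_tf_idf (which we'll meet again below). The words and counts are invented purely for demonstration: two hypothetical "pages," one shared word, and one word unique to each page.

```r
library(tidyverse) # for tribble and the pipe
library(tidytext)  # for bind_tf_idf

# made-up word counts for two hypothetical pages
toy_counts <- tribble(
  ~page, ~word,     ~n,
  1,     "malware", 5,
  1,     "police",  3,
  2,     "police",  4,
  2,     "fraud",   6
)

toy_counts %>%
  bind_tf_idf(word, page, n)
# "police" appears on both pages, so its idf = ln(2/2) = 0 and its tf-idf is 0;
# "malware" and "fraud" each appear on only one page, so their idf = ln(2/1) ≈ 0.69,
# giving them the highest tf-idf scores: they are frequent AND page-specific.
```

In other words, a word that appears on every page scores zero no matter how frequent it is, which is exactly why TF-IDF surfaces page-distinctive vocabulary rather than generic filler.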

To illustrate this approach, I'm going to be using a short, 16-page RCMP report on cybercrime that you can download from my GitHub. We'll be reading in the document using a library called pdftools, which I've covered in an earlier blog post; cleaning and tokenizing the data using tidytext, dplyr, and textclean; and visualizing the results using ggplot2.

Exploring documents using TF-IDF

First, let's read in the libraries that we'll be using.

library(pdftools) # to read in pdfs
library(tidytext) # to tokenize text, remove stop words, and calculate tfidf
library(tidyverse) # to wrangle data, count words, and plot data
library(textclean) # to clean up text a bit, removing non-ascii chars etc.

Next, using pdftools, let's read our report into R, where we can clean, tokenize, and analyze it.

cybercrime_report <- pdf_text("~/Desktop/personal-website/content/post/rcmp_cybercrime_report.pdf") #note: you'll need to swap out this file path if attempting to run this code on your own machine

A bit of standard text cleaning.

cybercrime_report_clean <- cybercrime_report %>%
  as_tibble() %>% #convert list to tibble
  rename(text = value) %>% #rename text column
  mutate(page = 1:16) %>% #add page number metadata as separate column
  mutate(text = str_trim(text), #trim leading and trailing white space
         text = str_squish(text), #remove extra white space from text (e.g., line breaks)
         text = replace_url(text), #remove URLs from text
         text = replace_non_ascii(text), #remove non-ascii characters
         text = replace_symbol(text), #replace $ and other characters with word replacements
         text = str_remove_all(text, "[0-9]+"), #remove numbers
         text = str_remove_all(text, "[[:punct:]]+")) #remove punctuation

Let's tokenize our data, removing custom and English stop words. Let's also filter out any pages (e.g., the title page) that contain very few words (here I'm removing any pages with 100 or fewer words).

custom_stop_words <- tibble(word = c("canada", "canadas", "report", "cent", "gouvqcca", "crimi", "nal")) #words that appear frequently on certain pages but that we don't want to keep ("crimi" and "nal" are fragments of "criminal" hyphenated across lines in the PDF)

cybercrime_tokens <- cybercrime_report_clean %>%
  unnest_tokens(word, text, token = "words", to_lower = TRUE) %>%
  anti_join(stop_words, by = "word") %>% #remove English stop words (e.g., I, a, the)
  anti_join(custom_stop_words, by = "word") %>% #remove our custom stop words
  add_count(page) %>% #count the number of words per page
  filter(n > 100) %>% #keep only pages with more than 100 words
  select(-n) %>% #to keep things clean, let's remove the total word count per page, as we don't need it anymore
  count(page, word) #count the number of times each word appears on each page. We'll need this to calculate tf-idf in the next step.

Finally, let's calculate the TF-IDF scores for each page, keeping only the top 5 terms for each (note: since some pages contain words with equivalent TF-IDF scores, some plots in our small multiple will have more than 5 words). To calculate TF-IDF, we'll be relying on the very handy function bind_tf_idf, available in the tidytext library.

cybercrime_tokens %>%
  bind_tf_idf(word, page, n) %>% #add tf-idf calculations to our df
  arrange(desc(tf_idf)) %>% #arrange by tf-idf score
  group_by(page) %>% #group by page
  slice_max(tf_idf, n = 5) %>% #keep only words with highest 5 tf-idf scores for each page
  ungroup() %>% #ungroup before plotting
  mutate(page_label = paste("Page ", page, sep = "")) %>%
  mutate(page_label = fct_reorder(page_label, page)) %>%
  mutate(word = fct_reorder(word, tf_idf)) %>%
  ggplot(aes(y = word, x = tf_idf, fill = page_label)) +
  scale_fill_viridis_d() +
  theme_minimal() +
  geom_col() +
  facet_wrap(~page_label, scales = "free", ncol = 2) +
  theme(legend.position = "none") +
  labs(title = "Most important words in RCMP cybercrime report by page",
       subtitle = "Pages not shown contained 100 words or fewer",
       y = "",
       x = "tf-idf")

And that's it! TF-IDF is a great technique for exploring the content of lengthy reports, interview transcripts, newspaper articles, or whatever other qualitative text data you're working with.

Alex Luscombe
PhD Candidate in Criminology