Exploratory Data Analysis Using TF-IDF
Computational text analysis can be a powerful tool for exploring qualitative data. In this blog post, I'll walk you through the steps involved in reading a document into R in order to find and plot the most relevant words on each page.
While there are many possible applications of this approach, one way it can be used is as an exploratory tool for getting to know a document (especially a lengthy one) before or after you begin to read it.
To calculate the "most relevant words", we'll be using a statistical metric called term frequency-inverse document frequency, or TF-IDF, a massively popular tool for document search and information retrieval in the information sciences.
TF-IDF works by calculating the most frequent terms in a document, and then weighting these terms by how "unique" they are to a given document, page, paragraph, etc. In this case, we'll be calculating TF-IDF scores across each of the pages in a single document. This will tell us which words were frequent yet also unique to each page.
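To make that concrete, here's a toy calculation (the numbers are invented for illustration and aren't from the report below). In the standard formulation, which is also what tidytext's bind_tf_idf uses, a word's term frequency is how often it appears on a page divided by the total number of words on that page, and its inverse document frequency is the natural log of the number of pages divided by the number of pages containing the word.

# Toy example: the word "malware" appears 4 times on a 200-word page,
# and shows up on 2 of a document's 16 pages
tf  <- 4 / 200      # term frequency: share of this page's words
idf <- log(16 / 2)  # inverse document frequency: rarer across pages = higher
tf * idf            # tf-idf, roughly 0.042: high if frequent here but rare elsewhere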
To illustrate this approach, I'm going to be using a short, 16-page RCMP report on cybercrime that you can download from my GitHub. We'll be reading in the document using a library called pdftools, which I've covered in an earlier blog post; cleaning and tokenizing the data using textclean and tidytext; and visualizing the results using ggplot2.
Exploring documents using TF-IDF
First, let's read in the libraries that we'll be using.
library(pdftools)  # to read in pdfs
library(tidytext)  # to tokenize text, remove stop words, and calculate tfidf
library(tidyverse) # to wrangle data, count words, and plot data
library(textclean) # to clean up text a bit, removing non-ascii chars etc.
Using pdftools, let's read our report into RStudio, where we can clean, tokenize, and analyze it.
cybercrime_report <- pdf_text("~/Desktop/personal-website/content/post/rcmp_cybercrime_report.pdf")
# note: you'll need to swap out this file path if attempting to run this code on your own machine
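pdf_text() returns a character vector with one element per page, so a quick (optional) sanity check is to confirm the length matches the report's page count and peek at the start of the first page:

length(cybercrime_report)             # should be 16: one string per page
substr(cybercrime_report[1], 1, 200)  # first 200 characters of page 1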
A bit of standard text cleaning.
cybercrime_report_clean <- cybercrime_report %>%
  as_tibble() %>%          # convert list to tibble
  rename(text = value) %>% # rename text column
  mutate(page = 1:16) %>%  # add page number metadata as separate column
  mutate(text = str_trim(text),                        # trim leading and trailing white space
         text = str_squish(text),                      # remove extra white space from text (e.g., line breaks)
         text = replace_url(text),                     # remove URLs from text
         text = replace_non_ascii(text),               # remove non-ascii characters
         text = replace_symbol(text),                  # replace $ and other characters with word replacements
         text = str_remove_all(text, "[0-9]+"),        # remove numbers
         text = str_remove_all(text, "[[:punct:]]+"))  # remove punctuation
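If you'd like to see what these cleaning functions do before running the full pipeline, here's a small standalone illustration on a made-up string (the string below is invented for demonstration and isn't from the report):

messy <- "  Cyber-crime  rose 23%\n in Québec: https://example.com  "
messy %>%
  str_trim() %>%                  # drop leading and trailing white space
  str_squish() %>%                # collapse the line break and extra spaces
  replace_url() %>%               # strip the URL
  replace_non_ascii() %>%         # "Québec" becomes "Quebec"
  replace_symbol() %>%            # "%" becomes "percent"
  str_remove_all("[0-9]+") %>%    # drop the numbers
  str_remove_all("[[:punct:]]+")  # drop remaining punctuation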
Let's tokenize our data, removing custom and English stop words. Let's also filter out any pages (e.g., title page) that contain very few words (here I'm removing any pages left with 100 words or fewer).
custom_stop_words <- tibble(word = c("canada", "canadas", "report", "cent", "gouvqcca", "crimi", "nal")) # words that may appear frequently on certain pages, but that we don't want to keep

cybercrime_tokens <- cybercrime_report_clean %>%
  unnest_tokens(word, text, token = "words", to_lower = TRUE) %>%
  anti_join(stop_words) %>%        # remove English stop words (e.g., I, a, the)
  anti_join(custom_stop_words) %>% # remove our custom stop words
  add_count(page) %>%              # count the number of words per page
  filter(n > 100) %>%              # keep only pages with more than 100 words
  select(-n) %>%                   # to keep things clean, remove the total word count per page, as we don't need it anymore
  count(page, word)                # count the number of times each word appears on each page; we'll need this to calculate tf-idf in the next step
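If unnest_tokens() and anti_join() are new to you, here's a tiny self-contained example on a made-up two-page tibble (invented purely for illustration) showing how a text column is split into one word per row, stop words are dropped, and words are counted per page:

toy <- tibble(page = 1:2,
              text = c("The criminals deployed ransomware.",
                       "Ransomware attacks increased."))

toy %>%
  unnest_tokens(word, text) %>%  # one lowercased word per row, punctuation stripped
  anti_join(stop_words) %>%      # drops "the"
  count(page, word)              # e.g., page 1: criminals, deployed, ransomware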
Finally, let's calculate the TF-IDF scores for each page, keeping only the top 5 terms for each (note: since some pages contain words with equivalent TF-IDF scores, some plots in our small multiple will have more than 5 words). To calculate TF-IDF, we'll be relying on the very handy bind_tf_idf function, available in the tidytext package.
cybercrime_tokens %>%
  bind_tf_idf(word, page, n) %>% # add tf-idf calculations to our df
  arrange(desc(tf_idf)) %>%      # arrange by tf-idf score
  group_by(page) %>%             # group by page
  slice_max(tf_idf, n = 5) %>%   # keep only the words with the 5 highest tf-idf scores for each page
  ungroup() %>%                  # ungroup before plotting
  mutate(page_label = paste("Page ", page, sep = "")) %>%
  mutate(page_label = fct_reorder(page_label, page)) %>%
  mutate(word = fct_reorder(word, tf_idf)) %>%
  ggplot(aes(y = word, x = tf_idf, fill = page_label)) +
  scale_fill_viridis_d() +
  theme_minimal() +
  geom_col() +
  facet_wrap(~page_label, scales = "free", ncol = 2) +
  theme(legend.position = "none") + # hide the legend; the facet labels already identify each page
  labs(title = "Most important words in RCMP cybercrime report by page",
       subtitle = "Pages not shown contained <100 words",
       y = "",
       x = "tf-idf")
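If you'd like to keep the figure, ggsave() will write the most recent plot to disk (the file name and dimensions below are just placeholders to adjust as needed):

ggsave("rcmp_tfidf_by_page.png", width = 8, height = 10, dpi = 300) # saves the last plot drawn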
And that's it!
A great technique for exploring the content of lengthy reports, interview transcripts, newspaper articles, or whatever other qualitative text data you're working with.