Exploratory Data Analysis Using TF-IDF
Computational text analysis can be a powerful tool for exploring qualitative data. In this blog post, I'll walk you through the steps involved in reading a document into R in order to find and plot the most relevant words on each page.
While there are many possible applications of this approach, one way it can be used is as an exploratory tool for getting to know a document (especially a lengthy one) before or after you begin to read it.
To identify the "most relevant words," we'll be using a statistical metric called term frequency-inverse document frequency, or TF-IDF, a massively popular tool for document search and information retrieval in the information sciences.
TF-IDF works by calculating how frequently each term appears in a document and then weighting that frequency by how "unique" the term is to that particular document, page, paragraph, etc., relative to the rest of the collection. In this case, we'll be calculating TF-IDF scores across the pages of a single document. This will tell us which words are frequent on a given page yet rare on the other pages.
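To make the arithmetic concrete, here's a minimal sketch on a toy two-page example (the words and counts are invented purely for illustration). A word's term frequency is its share of all the words on its page, its inverse document frequency is the log of the total number of pages divided by the number of pages containing the word, and TF-IDF is the product of the two; this is essentially the calculation that tidytext's bind_tf_idf() will handle for us later in the post.

library(tidyverse) #loaded again below; included here so the snippet runs on its own

toy_counts <- tribble( #toy page-word counts, purely for illustration
  ~page, ~word,        ~n,
  1,     "cybercrime",  5,
  1,     "police",      3,
  2,     "cybercrime",  4,
  2,     "fraud",       6
)

total_pages <- n_distinct(toy_counts$page)

toy_counts %>%
  group_by(page) %>%
  mutate(tf = n / sum(n)) %>% #term frequency: a word's share of all words on its page
  group_by(word) %>%
  mutate(idf = log(total_pages / n_distinct(page))) %>% #inverse document frequency: rarer across pages = higher
  ungroup() %>%
  mutate(tf_idf = tf * idf) #"cybercrime" appears on both toy pages, so its idf (and tf-idf) is zero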
To illustrate this approach, I'm going to be using a short, 16-page RCMP report on cybercrime that you can download from my GitHub. We'll be reading in the document using a library called pdftools, which I've covered in an earlier blog post; cleaning and tokenizing the data using tidytext, dplyr, and textclean; and visualizing the results using ggplot2.
Exploring documents using TF-IDF
First, let's load the libraries that we'll be using.
library(pdftools) # to read in pdfs
library(tidytext) # to tokenize text, remove stop words, and calculate tfidf
library(tidyverse) # to wrangle data, count words, and plot data
library(textclean) # to clean up text a bit, removing non-ascii chars etc.
Next, using pdftools, let's read our report into RStudio, where we can clean, tokenize, and analyze it.
cybercrime_report <- pdf_text("~/Desktop/personal-website/content/post/rcmp_cybercrime_report.pdf") #note: you'll need to swap out this file path if attempting to run this code on your own machine
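Before cleaning anything, a quick sanity check can be helpful: pdf_text() returns a character vector with one string per page, so for this report we should get 16 elements.

length(cybercrime_report) #should be 16: one string per page
substr(cybercrime_report[1], 1, 200) #peek at the first 200 characters of page 1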
A bit of standard text cleaning.
cybercrime_report_clean <- cybercrime_report %>%
as_tibble() %>% #convert the character vector returned by pdf_text() (one string per page) into a tibble
rename(text = value) %>% #rename text column
mutate(page = 1:16) %>% #add page number metadata as separate column
mutate(text = str_trim(text), #trim leading and trailing white space
text = str_squish(text), #remove extra white space from text (e.g., line breaks)
text = replace_url(text), #remove URLs from text
text = replace_non_ascii(text), #remove non-ascii characters
text = replace_symbol(text), #replace $ and other characters with word replacements
text = str_remove_all(text, "[0-9]+"), #remove numbers
text = str_remove_all(text, "[[:punct:]]+")) #remove punctuation
Let's tokenize our data, removing custom and English stop words. Let's also filter out any pages (e.g., the title page) that contain very few words (here I'm removing any pages left with 100 or fewer words after stop word removal).
custom_stop_words <- tibble(word = c("canada", "canadas", "report", "cent", "gouvqcca", "crimi", "nal")) #custom stop words and word fragments that may appear frequently on certain pages, but that we don't want to keep
cybercrime_tokens <- cybercrime_report_clean %>%
unnest_tokens(word, text, token = "words", to_lower = TRUE) %>%
anti_join(stop_words) %>% #remove English stop words (e.g., I, a, the)
anti_join(custom_stop_words) %>% #remove our custom stop words
add_count(page) %>% #count the number of words per page
filter(n > 100) %>% #keep only pages with more than 100 words
select(-n) %>% #to keep things clean, let's remove the total word count per page, as we don't need it anymore
count(page, word) #count the number of times each word appears on each page. We'll need this to calculate tf-idf in the next step.
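As a quick check that the stop word removal behaved as expected, we can sum the per-page counts to see the most frequent words left in the document overall.

cybercrime_tokens %>%
  count(word, wt = n, sort = TRUE) %>% #sum the per-page counts for each word
  head(10) #the ten most frequent words remaining after stop word removal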
Finally, let's calculate the TF-IDF scores for each page, keeping only the top 5 terms for each (note: since some pages contain words with equivalent TF-IDF scores, some plots in our small multiple will have more than 5 words). To calculate TF-IDF, we'll be relying on the very handy bind_tf_idf() function, available in the tidytext library.
cybercrime_tokens %>%
bind_tf_idf(word, page, n) %>% #add tf-idf calculations to our df
arrange(desc(tf_idf)) %>% #arrange by tf-idf score
group_by(page) %>% #group by page
slice_max(tf_idf, n = 5) %>% #keep only words with highest 5 tf-idf scores for each page
ungroup() %>% #ungroup before plotting
mutate(page_label = paste("Page ", page, sep = "")) %>%
mutate(page_label = fct_reorder(page_label, page)) %>%
mutate(word = fct_reorder(word, tf_idf)) %>%
ggplot(aes(y = word, x = tf_idf, fill = page_label)) +
scale_fill_viridis_d() +
theme_minimal() +
geom_col() +
facet_wrap(~page_label, scales = "free", ncol = 2) +
theme(legend.position = "none") + #hide the legend; the facet labels already identify each page
labs(title = "Most important words in RCMP cybercrime report by page",
subtitle = "Pages not shown contained <100 words",
y = "",
x = "tf-idf")
And that's it! This is a great technique for exploring the content of lengthy reports, interview transcripts, newspaper articles, or whatever other qualitative text data you're working with.