Sampling the Canadian Hansard Dataset

Recently I learned about an incredible initiative launched by a team of political scientists, computer scientists, and historians at my university: The Canadian Hansard Dataset. The dataset is a massive digital collection of English-language debates in the House of Commons from 1901 to today (all French speeches have been translated into English). The authors of this ambitious, continually expanding project have a great paper about it available here.

One of the really nice features of this initiative is that you can download the full dataset in .csv format, which includes both the text of each speech and various other fields (the date of the speech, the name of the speaker, the speaker's party, and more).

In total, the dump includes 13,354 .csv files, one per day of debates, which adds up to about 4.7 million rows and roughly 5 GB if you merge it all into one file (that's a lot of data!).
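
If you did want to work with the whole corpus at once (and have the memory for it), merging the dump could look something like this rough sketch of mine. This is not part of the project's tooling, and "path-to-dump" is a placeholder for wherever you extracted the files:

#a rough sketch of merging the full dump into one data frame (needs several gb of memory)
#"path-to-dump" is a placeholder for the folder you extracted the .csv files into
library(tidyverse)

all_files <- list.files("path-to-dump", recursive = TRUE, pattern = "csv$", full.names = TRUE)

#read every column as character so differing type guesses across files don't break the bind
all_debates <- map_dfr(all_files, read_csv, col_types = cols(.default = col_character()))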

Now, if you're like me, you probably don't want to analyze every speech in the dataset (though a lot of really cool analyses could certainly be done at the whole corpus level). Instead, you might want to create a subset that includes only the speeches (and associated metadata) that mention a set of keywords you're interested in.

Below is an R script I wrote to do just that, using some tidyverse packages, a package called multigrep, and a package called glue (not strictly necessary, but I find it helpful for indicating which file the parser is on while the script runs, which takes a bit of time). To run the script successfully, you'll need to download the data in .csv format and insert the path to the highest-level folder into the data_files vector. Remember to also set your working directory using setwd() and add the keywords you're interested in to the keywords vector.
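
As an aside, I read the multigrep call in the script as returning a logical vector flagging rows whose text matches any of the keywords. If you'd rather not install a package from GitHub, a base-R stand-in built on that assumption (using the same keywords vector and data_lowered table as the script below) would be:

#assumed base-R equivalent of the multigrep call used in the script below
#collapse the keywords into one alternation pattern and grepl against the lower-cased text
keyword_pattern <- paste(keywords, collapse = "|")
data_filtered <- data_lowered %>%
  filter(grepl(keyword_pattern, speechtextlower))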

Once complete, you should have a file in your working directory called hansard-debates-keyword-subset.csv.

#you'll need to uncomment and install these packages if you don't already have them
#install.packages("tidyverse")
#install.packages("remotes")
#remotes::install_github("elliefewings/multigrep")
#install.packages("glue")

library(tidyverse)
library(multigrep)
library(glue)

#remember to uncomment and fill this in before running
#setwd("")

#remember to fill this in before running: the path to the highest-level folder of .csv files you downloaded above
data_files <- ""

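#recursively collect the path to every daily .csv file in the dump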
file_paths <- list.files(data_files, recursive = TRUE, pattern = "csv$", full.names = TRUE)

file_name <- "hansard-debates-keyword-subset.csv"

#insert the keywords of interest, all lower case
keywords <- c("keyword1",
              "keyword2",
              "otherkeywords")

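#helper that builds a tibble with the dataset's column layout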
create_data <- function(
  hid = NA,
  speechdate = NA,
  pid = NA,
  opid = NA,
  speakeroldname = NA,
  speakerposition = NA,
  maintopic = NA,
  subtopic = NA,
  subsubtopic = NA,
  speechtext = NA,
  speakerparty = NA,
  speakerriding = NA,
  speakername = NA,
  speakerurl = NA
) {
  tibble(
    hid = hid,
    speechdate = speechdate,
    pid = pid,
    opid = opid,
    speakeroldname = speakeroldname,
    speakerposition = speakerposition,
    maintopic = maintopic,
    subtopic = subtopic,
    subsubtopic = subsubtopic,
    speechtext = speechtext,
    speakerparty = speakerparty,
    speakerriding = speakerriding,
    speakername = speakername,
    speakerurl = speakerurl
    )
}

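#write an empty, typed tibble first so the output file starts with just a header row
#(create_data() returns a single all-NA row, and drop_na() removes it)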
write_csv(create_data() %>% drop_na(), file_name, append = TRUE, col_names = TRUE)

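#loop over each day's file: read it, normalize column types, filter for keywords, and append any matches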
get_debates <- lapply(file_paths, function(i) {
  
  data <- read_csv(i)
  
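  #read_csv guesses column types per file, so force a consistent schema before filtering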
  data_corrected <- data %>%
    mutate(basepk = as.numeric(basepk),
           hid = as.character(hid),
           speechdate = as.Date(speechdate, "%Y-%m-%d"),
           pid = as.character(pid),
           opid = as.character(opid),
           speakeroldname = as.character(speakeroldname),
           speakerposition = as.character(speakerposition),
           maintopic = as.character(maintopic),
           subtopic = as.character(subtopic),
           subsubtopic = as.character(subsubtopic), #often empty, so read_csv may guess logical; keep it character like the other text fields
           speechtext = as.character(speechtext),
           speakerparty = as.character(speakerparty),
           speakerriding = as.character(speakerriding),
           speakername = as.character(speakername),
           speakerurl = as.character(speakerurl))
  
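  #add a lower-cased copy of the speech text so keyword matching is case-insensitive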
  data_lowered <- data_corrected %>%
    mutate(speechtextlower = tolower(speechtext))
  
  print(glue("Filtering {i}..."))
  
  if(nrow(data_lowered) > 0){ #skip files with no rows
    data_filtered <- data_lowered %>%
      filter(multigrep(keywords, speechtextlower))
      
    currdata <- create_data(
      hid = data_filtered$hid,
      speechdate = data_filtered$speechdate,
      pid = data_filtered$pid,
      opid = data_filtered$opid,
      speakeroldname = data_filtered$speakeroldname,
      speakerposition = data_filtered$speakerposition,
      maintopic = data_filtered$maintopic,
      subtopic = data_filtered$subtopic,
      subsubtopic = data_filtered$subsubtopic,
      speechtext = data_filtered$speechtext,
      speakerparty = data_filtered$speakerparty,
      speakerriding = data_filtered$speakerriding,
      speakername = data_filtered$speakername,
      speakerurl = data_filtered$speakerurl
    )
    
    write_csv(currdata, file_name, append = TRUE)
  }
})
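
Once the loop finishes, you can read the subset back in for analysis. As a quick sanity check (a sketch of my own, assuming the script ran to completion), something like this counts the matching speeches per year:

#read the keyword subset back in and count matching speeches by year
subset_data <- read_csv("hansard-debates-keyword-subset.csv")

subset_data %>%
  mutate(year = format(speechdate, "%Y")) %>%
  count(year)
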
Alex Luscombe
PhD Candidate in Criminology