For the past few weeks, I’d been thinking about writing a short blog post explaining how to scrape information from the Internet using a website’s client-side API (when available, and when deemed legal/ethical to do so, of course).
As part of its broader Race and Identity-Based Data Collection (RBDC) Strategy, the Toronto Police Service has published eight open data sets that it plans to update periodically. To make accessing these eight data sets as convenient as possible, I put together a small R package (a single, very simple function built on base R plus rio, tibble, and janitor) that grabs the data directly from the TPS's client-side API, cleans up the column names, and imports it into R in tidy (tibble) format.
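A minimal sketch of what such a one-function package might look like. The function name and the idea of passing the endpoint URL directly are illustrative assumptions here, not the package's actual interface; the general pattern is rio for import, tibble for the tidy container, and janitor for the column names.

```r
# Hypothetical sketch of the package's single function.
# `url` would point at one of the TPS client-side API endpoints.
get_tps_data <- function(url) {
  raw <- rio::import(url)            # rio infers the file format from the URL
  df  <- tibble::as_tibble(raw)      # convert to a tidy tibble
  janitor::clean_names(df)           # standardize column names to snake_case
}
```

The appeal of this pattern is that the caller never touches download paths or raw column headers; one call returns an analysis-ready tibble.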
Some collaborators and I recently started a project analyzing a large number of tweets we obtained via the Twitter API. To analyze these data, we are planning to train a machine learning model, which means we need training data, which means we need annotations ("ground truth," as it's commonly called in computer science).
In a previous post, I explained how we can use regular expressions, or "regex," in R to parse our text data. It turns out there is a very useful R library for crafting regular expressions, especially in the early stages of learning the notation.
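The post doesn't name the library in this teaser, so here is a small base-R illustration of the kind of parsing regex makes easy: pulling every four-digit year out of a vector of sentences.

```r
# Base-R regex example: extract four-digit years from text.
text <- c("Elected in 1993, re-elected in 1997.", "No dates here.")

# gregexpr() finds every match per string; regmatches() extracts them.
years <- regmatches(text, gregexpr("\\b\\d{4}\\b", text, perl = TRUE))

years[[1]]  # matches from the first sentence
years[[2]]  # empty: the second sentence contains no years
```

Helper libraries for regex mostly generate patterns like `"\\b\\d{4}\\b"` for you from readable building blocks, which is what makes them handy while you're still learning the notation itself.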
Computational text analysis can be a powerful tool for exploring qualitative data. In this blog post, I'll walk you through the steps involved in reading a document into R in order to find and plot the most relevant words on each page.
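A hedged sketch of those steps, assuming the pdftools package for reading the document; `"report.pdf"` is a placeholder filename, and the fallback strings below exist only so the example runs without a real file. Ranking words by raw frequency is one simple notion of "most relevant"; the full post may use a different measure.

```r
# Tally the most frequent longer words on a single page of text.
top_words <- function(page_text, n = 5) {
  words <- tolower(unlist(strsplit(page_text, "[^[:alpha:]]+")))
  words <- words[nchar(words) > 3]   # crude filter for short function words
  head(sort(table(words), decreasing = TRUE), n)
}

# pdftools::pdf_text() returns one character string per page.
# Placeholder file; fall back to toy "pages" so the sketch is runnable.
pages <- if (file.exists("report.pdf")) {
  pdftools::pdf_text("report.pdf")
} else {
  c("The model performed well on the held out test set.",
    "Results for the second model were mixed overall.")
}

lapply(pages, top_words)   # per-page word counts, ready for plotting
```

From here, the per-page counts can be reshaped into a data frame and passed to ggplot2 for the plotting step the post describes.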
While analyzing text data can be a lot of fun, preprocessing text data is generally not. It can also be extremely difficult, especially when you're just getting into computational text analysis or the R programming language.
Recently I learned about an incredible initiative launched by a team of political scientists, computer scientists, and historians at my university called The Canadian Hansard Dataset. The data set is a massive, digital collection of English-language debates in the House of Commons from 1901 to today (all French speeches have been translated to English).
Doing quantitative text analysis often means working with documents in .pdf format, and these documents may or may not be in a machine-readable format. Assuming we are using RStudio, how do we read these files into our environment so that we can clean, process, and analyze them?
Code and tutorial prepared for the Toronto Data Workshop session on July 30, 2020. You can download the corresponding slide deck for this workshop here.
Since launching the Policing the Pandemic Mapping Project with Alexander McClelland, a lot of people have asked us how we built the interactive map and database.