Pruitt research fraud

After a lengthy 3.5-year investigation, McMaster University’s internal investigation committee has found renowned spider researcher (behavioural ecologist) Jonathan Pruitt guilty of falsifying data in multiple major research papers. Pruitt was found to have “generally failed to meet the requirements expected of a tenured professor”. With 15 papers to his name retracted in the last 15 years, I’d have to agree. Research fraud is very bad, and I applaud the university for taking the matter seriously. There have been a lot of major research fraud scandals in the news lately, and McMaster seems to have handled this one particularly well.


Detecting fraud using Benford's

Benford’s Law is a statistical phenomenon that has been found to apply to a wide range of data sets, from stock prices to geographic populations. The law states that in many naturally occurring sets of numerical data, the first digit is more likely to be small (e.g., 1 or 2) than large (e.g., 8 or 9). Crazy, right?
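To make that concrete, here is a minimal Python sketch. It assumes the usual statement of the law, in which the expected share of leading digit d is log10(1 + 1/d), and it uses the powers of 2 as a stand-in data set because they are known to follow the pattern closely. The function names are my own, invented for illustration, and this is not a full fraud test.

```python
import math
from collections import Counter

def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Leading non-zero digit of a non-zero number, e.g. 0.032 -> 3."""
    # Scientific notation puts the leading digit first: '3.200000e-02'
    return int(f"{abs(x):e}"[0])

def first_digit_shares(values):
    """Observed share of each leading digit 1-9, ignoring zeros."""
    digits = [first_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# Illustrative data only: the first 100 powers of 2
sample = [2 ** k for k in range(1, 101)]
expected = benford_expected()
observed = first_digit_shares(sample)

for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

Under these assumptions, a 1 should lead about 30% of the time and a 9 less than 5% of the time. A real data set whose leading digits stray far from those shares is not proof of fraud, but it can be a reason to look closer.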


Eight new TPS datasets

As part of its broader Race and Identity-Based Data Collection (RBDC) Strategy, the Toronto Police Service (TPS) has published eight open data sets that it plans to update periodically. To make accessing these eight data sets as convenient as possible, I put together a little R package that grabs the data directly from the TPS’s client-side API, cleans up the column names, and imports them into R in tidy (tibble) format. You can install the package (tps.rbdc) from my GitHub here, where I’ve also provided some details on how to use it.


Annotating training data in R

Some collaborators and I recently started a project analyzing a large number of tweets we obtained via the Twitter API. To analyze these data, we are planning to train a machine learning model, which means we need training data, which means we need annotations (‘ground truth’, as it’s commonly referred to in computer science).


Parsing your .pdfs in R

While it’s fairly straightforward to read a .pdf file into R, we may not want all of the text from our .pdf files to be read in at once, or into the same row or column of our dataframe. There may be parts of our .pdfs that we want to leave out of our analysis, or that we want to keep as metadata, separate from the main text component of our data set.
