RVerbalExpressions: A Helpful Tool for Learning Regex in R

In a previous post, I explained how we can use regular expressions or "regex" in R to parse our text data. Turns out there is a very useful R library for crafting regular expressions, especially in the early stages of learning the notation. This tool is called RVerbalExpressions, an R library created by Tyler Littlefield and Dmytro Perepolkin. You can learn more about the library on GitHub.

Let's try it out!

Step 1: Getting data

First, let's get a document to play with. For this tutorial, I'm going to use a recent Canadian Securities Administrators consultation paper on activist short selling in Canada, published on the Ontario Securities Commission website (worth a read if you're interested in economic crime!). To do this, I'm going to use rvest, a library created by Hadley Wickham for simple web scraping in R.

#install.packages("rvest")
library(rvest)

url <- "https://www.osc.ca/en/securities-law/instruments-rules-policies/2/25-403/csa-consultation-paper-25-403-activist-short-selling"

my_report <- read_html(url) %>%
  html_node(".two-column-layout__right") %>%
  html_text(trim = TRUE)

Step 2: Crafting our regular expression

Writing a regular expression can be quite difficult, especially when we are just getting familiar with the format and notation. This is where RVerbalExpressions can be a huge help in our learning journey. Rather than writing a regex pattern from scratch, relying on examples online and debugging with something like regexr.com, RVerbalExpressions lets us construct our pattern from intuitive functions that we stack on top of one another using the pipe operator %>%.

Let's write a regular expression that is going to match paragraphs containing the word "Fraud" or "fraud" anywhere in the report.

#install.packages("RVerbalExpressions")
library(RVerbalExpressions)

my_pattern <- rx_line_break() %>%
  rx_anything() %>%
  rx_either_of("Fraud", "fraud") %>%
  rx_anything() %>%
  rx_line_break()

Let's take a closer look at our regex pattern.

my_pattern
## [1] "(\\r\\n|\\r|\\n)(.*)(Fraud|fraud)(.*)(\\r\\n|\\r|\\n)"

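Reading left to right, the pattern says: start at a line break "(\\r\\n|\\r|\\n)", capture any text "(.*)" before the word fraud "(Fraud|fraud)", capture any text after it "(.*)", and stop at the next line break "(\\r\\n|\\r|\\n)". In other words, it matches a whole paragraph sandwiched between two line breaks, but only if that paragraph contains the word fraud. The backslashes appear doubled because R prints the string in its escaped form.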
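To see this behaviour on something smaller than the full report, here is a minimal sketch that applies the same pattern to a made-up string. The toy_text value is invented for illustration, and the sketch uses base R's gregexpr() and regmatches() rather than stringr so that it runs without any extra packages.

```r
# The pattern exactly as printed above
my_pattern <- "(\\r\\n|\\r|\\n)(.*)(Fraud|fraud)(.*)(\\r\\n|\\r|\\n)"

# Hypothetical text: five short "paragraphs" separated by line breaks
toy_text <- paste(
  "Intro paragraph.",
  "This one mentions fraud once.",
  "Nothing to see here.",
  "Fraud opens this one.",
  "Closing paragraph.",
  sep = "\n"
)

# Find every match and pull out the matched text
toy_match <- regmatches(toy_text, gregexpr(my_pattern, toy_text, perl = TRUE))[[1]]
toy_match
```

Only the second and fourth paragraphs should come back, each with its surrounding line breaks attached. One thing to keep in mind: each match consumes the line breaks on both sides, so if two consecutive lines both contained the word and were separated by a single line break, the second would be skipped.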

Step 3: Applying our regular expression

Finally, let's apply the regular expression we just created. To apply our pattern to our document, we are going to use a string manipulation library from the tidyverse called stringr. Using stringr's str_extract_all() function, we will apply our regex pattern to our report, and extract all of the paragraphs that we are interested in.

#install.packages("stringr")
library(stringr)

my_match <- str_extract_all(my_report, my_pattern)

my_match
## [[1]]
## [1] "\nWhile traditional long shareholder activism is a well-accepted practice in our markets and viewed by most as an effort to improve shareholder value in public companies, activism by short sellers is often viewed differently. Activist short sellers state that they create real value for public markets by contributing to market efficiency and price discovery. Some take it even further describing their work as a \"first line of defence against fraud and subsequent losses.\"{7} The approach of activist short sellers is not without controversy. If an activist short seller's objective is met, it will mean they have convinced the market of their thesis and caused a decline in a target issuer's share price, leading to a loss of value for its shareholders.{8}\n"                                                                                   
## [2] "\nIn most CSA jurisdictions, activist short sellers are not currently subject to specific regulatory requirements,{25} nor are they defined or easily identifiable. However, as with other market activity conducted by non-regulated or unregistered entities or individuals, short selling activism is subject to the existing prohibitions under securities law, for instance prohibitions against market manipulation, making misleading statements or fraud.{26}\n"                                                                                                                                                                                                                                                                                                                                                                                                      
## [3] "\nAcross all 116 Canadian Campaigns, 40% involved allegations of some type of fraud at the issuer. The most common type of fraud allegation was that of there being a stock promotion scheme (or an alleged \"pump and dump\" scheme), where the company was being promoted by a connected third party (e.g., an outside firm) (see Figure 4).{49} In peak Campaign years (2015, 2016 and 2018) fraud-related allegations accounted for under one-third of the Campaigns. Allegations related to business or industry issues (e.g., drop in commodity prices) and more general market overvaluation concerns have been more common in recent years.\n"                                                                                                                                                                                                                        
## [4] "\nCanadian securities legislation also contains fraud and market manipulation prohibitions{99} that could, in appropriate circumstances, be used to address misconduct by activist short sellers. In general, these provisions prohibit persons from directly or indirectly engaging in acts relating to securities, and in some cases derivatives,{100} that:\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [5] "\n• perpetrates a fraud on any person or company.{101}\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
## [6] "\nIn Québec, it is also an offence to influence or attempt to influence the market price or the value of securities by means of unfair, improper or fraudulent practices.{102}\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [7] "\nThere is a concern that limited regulatory proceedings in Canada arising from the conduct of activist short sellers may contribute to a perception that current enforcement tools are ineffective in addressing or deterring problematic conduct.{107} From an enforcement perspective, securities regulators have tools to address activist short seller behaviour that constitutes fraud, market manipulation or making a misleading statement to the market. However, for many of the misleading statement offences under Canadian securities legislation, evidence of a threshold of unlawful conduct and materialityand market impact related to a statement must be proven.{108} The use of social media to convey information has also introduced new complexities, including in terms of understanding and demonstrating market impact of a particular statement.\n"
## [8] "\n{93} Ibid. The authors have also asked the SEC to confirm that rapidly closing a position after publishing a report, without specifically disclosing an intent to do so can constitute fraud in violation of Rule 10b-5, and propose a safe harbour provision for closing at a price that is the equal to or lower the valuation stated or implied in the report.\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
## [9] "\n{101} In some jurisdictions, including British Columbia, Alberta, Québec and Ontario, it is also an offence to attempt to engage in a fraud or market manipulation.\n"

What we do next depends entirely on our goal. We could save the result as a .txt file to read more closely, or we could use these passages for some kind of quantitative text analysis.
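As a rough sketch of both options, the snippet below writes the paragraphs to a text file and counts how many times "fraud" appears in each one. The my_match object here is a small stand-in with the same shape as the str_extract_all() result (a list holding one character vector), and the file name is a placeholder; swap in your own.

```r
# Stand-in for the real result: a list containing one character vector
my_match <- list(c(
  "\nFirst paragraph about fraud.\n",
  "\nSecond paragraph: fraud, then more Fraud.\n"
))

# Drop the leading and trailing line breaks from each paragraph
paragraphs <- trimws(my_match[[1]])

# Option 1: save one paragraph per line (placeholder file name, temporary folder)
out_file <- file.path(tempdir(), "fraud_paragraphs.txt")
writeLines(paragraphs, out_file)

# Option 2: a very simple count of "fraud" mentions per paragraph, ignoring case
hits <- gregexpr("fraud", paragraphs, ignore.case = TRUE)
fraud_counts <- lengths(regmatches(paragraphs, hits))
fraud_counts
```

This is only a starting point: a simple count like this also picks up words such as "fraudulent", which may or may not be what we want.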

Alex Luscombe
PhD Candidate in Criminology
