Detecting fraud using Benford's Law

Posted on May 11, 2023 by Alex Luscombe

Benford’s Law is a statistical phenomenon that has been found to apply to a wide range of data sets, from stock prices to geographic populations. The law states that in many naturally occurring sets of numerical data, the first digit is more likely to be small (e.g., 1 or 2) than large (e.g., 8 or 9). Crazy, right?
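To put numbers on that claim: Benford’s Law gives the probability of a leading digit d as log10(1 + 1/d). Here is a minimal sketch in R; the variable name benford_probs is just for illustration.

```r
# Benford's Law: P(first digit = d) = log10(1 + 1/d) for d = 1, ..., 9
digits <- 1:9
benford_probs <- log10(1 + 1 / digits)
names(benford_probs) <- digits

# Probabilities rounded to three decimals
print(round(benford_probs, 3))
#     1     2     3     4     5     6     7     8     9
# 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046
```

So a leading 1 should show up about 30% of the time, and a leading 9 less than 5% of the time.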

This seemingly counterintuitive pattern is actually quite useful for detecting potential fraud in financial or research data. If a data set that would ordinarily follow Benford’s Law does not, that may indicate the data has been manipulated or fabricated.

To demonstrate how Benford’s Law can be used to detect fraud, we can write some simple R code that generates a random data set and then checks whether the first digits of the numbers in the set follow the expected distribution according to Benford’s Law.

# Generate a random data set
data <- rnorm(10000, mean = 100, sd = 20)

# Calculate the first significant digit of each number
# (arithmetic instead of substr(), which returns "0" for values below 1)
first_digits <- floor(abs(data) / 10^floor(log10(abs(data))))

# Calculate the expected frequencies of each first digit
expected_freqs <- log10(1 + 1 / (1:9))

# Calculate the observed frequencies of each first digit
# (factor levels 1:9 keep a zero entry for any digit that never occurs)
observed_freqs <- table(factor(first_digits, levels = 1:9)) / length(first_digits)

# Plot the expected and observed frequencies side by side
barplot(rbind(expected_freqs, observed_freqs),
        names.arg = 1:9, beside = TRUE,
        col = c("#0000FF", "#00FF00"),
        legend.text = c("Expected", "Observed"),
        main = "Benford's Law Test",
        xlab = "First Digit", ylab = "Frequency")

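First we generate a random data set of 10,000 numbers with a mean of 100 and a standard deviation of 20. We then calculate the first digit of each number and compare the observed frequency of each digit to the expected frequency according to Benford’s Law. Finally, we plot both as a bar chart. One caveat: normally distributed values like these cover a narrow range (nearly all fall between 40 and 160), so this sample is not expected to follow Benford’s Law, and the plot shows what a clear mismatch looks like.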

Plot generated by the above R code
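For contrast, here is a sketch of a sample that should track Benford’s Law closely. It assumes a log-uniform sample spanning five orders of magnitude, which is my choice for illustration rather than anything from the example above, and the names benford_like, lead_digits, observed and expected are hypothetical.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Assumption: values spread evenly on a log scale from 1 to 100,000
benford_like <- 10^runif(10000, min = 0, max = 5)

# First significant digit of each value
lead_digits <- floor(benford_like / 10^floor(log10(benford_like)))

# Observed share of each digit, keeping all nine digits in the table
observed <- table(factor(lead_digits, levels = 1:9)) / length(lead_digits)
expected <- log10(1 + 1 / (1:9))

# Side-by-side comparison of expected and observed proportions
print(round(rbind(expected, observed), 3))
```

Because the values cover several full powers of ten evenly on a log scale, the observed proportions should land close to the expected ones, unlike the narrow normal sample.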

If the observed frequencies closely match the expected frequencies, the data set is likely to be consistent with Benford’s Law and is less likely to contain fraudulent data. If the observed frequencies differ significantly from the expected frequencies, well, there may be something fishy going on.
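One way to make "differ significantly" concrete is a chi-squared goodness-of-fit test, sketched below with base R's chisq.test(). This is an added formalization, not something the example above does, and the names x, digit_counts and fit are hypothetical. The sketch is self-contained and draws the same kind of normal sample.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Same kind of sample as in the example above
x <- abs(rnorm(10000, mean = 100, sd = 20))
lead_digits <- floor(x / 10^floor(log10(x)))

# Counts (not proportions) for digits 1 to 9, including any digit never seen
digit_counts <- table(factor(lead_digits, levels = 1:9))

# Goodness-of-fit test against the Benford probabilities
fit <- chisq.test(x = digit_counts, p = log10(1 + 1 / (1:9)))
print(fit$statistic)
print(fit$p.value)
```

A very small p-value says the digit distribution departs from Benford’s Law, which is what we should see for this normal sample. Treat that as a flag for a closer look rather than proof of fraud: with large samples, even small and harmless deviations can come out statistically significant.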