Scraping a PDF


Introduction

There might be times when needed data is in a pdf file, we’ll go through a few ways to extract and transform it into usable data. There will be times that the data in a pdf is not in a table format but separated by tabs and spaces. That will be much harder and could cause unplanned steps and workarounds.

We’ll start with the uneven data represented in the first pdf file. This turns out to be a complicated case which leads us down a path where we need to consider that the data is not extratable. But we can still learn a lot from the process. So, let’s start with the hard one first.

Note: I’ll use function names as section headings to make it easier to find examples of use for functions, rather then a descriptive sub-title which doesn’t add to the fact, because the notes accompanying code blocks explain the idea and steps taken by the code itself.

Packages

Here is a list of useful packages we’ll use, and some we won’t have a chance to use in this page but still apply:

Uneven Ledger Example


Data

The data can be found at

if (!require("pdftools")) {install.packages("pdftools")}
library(pdftools)
library(data.table)
file <- "D:/~/R/Data/pdf_data/ledger.pdf"

pdf_text()

Let’s use pdf_text() to read the pdf into R. The output is a character[2], the 2 signifies that it’s 2 pages and are stored in one object each.

old_ledger <- pdf_text(file)

Method 1


sub()

As I quickly scan through old_ledger I notice several issues that could cause issues down the line. The header row for the table is located a few lines down the page, and the names need some tidying up. So, let’s use sub() to clean the pages

editedledger <- sub("Document #", "Document", old_ledger)
editedledger <- sub("Transaction Code","TansactionCode", editedledger)
editedledger <- sub("Bldg/Unit","BldgUnit", editedledger)

read.table()

Now that we’ve cleaned some glaring potential issues, let’s:

  • read the text file into a table using "\n" as the separator (\n is line break)

  • search for the start of the table by finding the header row at line/row 16

  • we’ll subset the data set by dropping the first 15 rows and since we know the nrow=123 we can target the desired data and assign it to editedpdf readpdf[16:123,1]

  • clean other rows from the df

  • if you check class(editedpdf) you’ll realize that it’s “character”

  • read the editedpdf into tablepdf with read.table() but this time use (tab) “\t” as the separator

readpdf <-  read.table( text= editedledger, sep="\n")
class(readpdf)  #data.frame
readpdf[1:30,1]  #let's look for the ledger heading
readpdf[16,1]  # table header is line 16
nrow(readpdf)  #123
editedpdf <- (readpdf[16:123,1])  #extract the first 15 rows up to table header
editedpdf <-  editedpdf[-108]     #take out last line
editedpdf <- editedpdf[-(52:55)]    #take out text atop of page 2
editedpdf <- editedpdf[-2]              #take out line 2
tablepdf <- read.table(text= editedpdf, sep="\t")

separate()

tablepdf is a df with one column and 102 rows, so we need to separate/split/divide/map the objects within each row to a column. If you review tablepdf you’ll realize that the objects are multiple spaces apart, so we need to use sep="\\s+" which is multiple spaces.

santa <- separate(tablepdf,V1,sep="\\s+",
                  into=c("Date","Period","Bldg/Unit","Transaction Code",
                         "Description","Journal","Document #","Charge",
                         "Credit","Balance"))

Method 2


Data science is more about the person and the data, there are no set ways of doing a job. It’s more of an art than a set of steps to follow. We’ll go over a second method of achieving the same as Method 1. The reason is to use other functions so we can gain a better understanding of when to use a function.

So let’s go back to pdf_text() section where we generated old_ledger and go from there.

strsplit()

In the previous method we cleaned old_ledger and then we read it back in using sep="\n". Well in this method we’re going to split the text old_ledger using sep="\n" and see what we have:

  • nolinebreak is a list with 2 elements, one for each page, and each line from the pdf is now a row

  • page 1 is 75 rows and page 2 is 59 rows

nolinebreak <- old_ledger |>
                strsplit(split = "\n")
class(nolinebreak)              #list
length(nolinebreak[[1]])        # 75
length(nolinebreak[[2]])        # 59

This section is not necessary for where we are going, but it helps clarify some points.

str_squish()

Let’s look for the table header row, and it appears that it’s row [24], so

  • extract row[24] into tableheader

  • words are separated by tab so we’ll strsplit() by "\t"

  • str_squish() is from stringr package and does what it sounds like, squishes the string into a character

  • if you need a list then use as.list() to set it back, and now you notice the spaces between the objects are all equal

  • we can go some cleaning here as we notice there is a space in one of the objects, so we’ll use sub() to substitute into the space, and some other cleaning as necessary

tableheader <- nolinebreak[[1]][24]   #character
"Date            Period       Bldg/Unit     Transaction Code        Description
Journal       Document #       Charge        Credit     Balance"

name.vector <- as.character(strsplit(tableheader,"\t"))
class(name.vector)  #character

library(stringr)
tableheader <- str_squish(tableheader)   #character

tableheader <- as.list(strsplit(tableheader,"\t"))   #now as list
"Date Period Bldg/Unit Transaction Code Description Journal 
Document # Charge Credit Balance"

tableheader <- gsub( "\\s+", "/",tableheader)
 "Date/Period/Bldg/Unit/Transaction/Code/Description/Journal/
Document/#/Charge/Credit/Balance"

tableheader <- gsub("on/Co", "on Co", tableheader)
"Date/Period/Bldg/Unit/Transaction Code/Description/Journal/
Document/#/Charge/Credit/Balance"

tableheader <- gsub("ent/#","ent #",tableheader) #character
"Date/Period/Bldg/Unit/Transaction Code/Description/Journal/
Document #/Charge/Credit/Balance"

Now let’s go back to where we left off before we played with the tableheader


Merge pages

Prior to cleaning the tableheader, we had just split old_ledger using split="\n" into nolinebreak, so

  • now that we have two pages let’s take each page at a time

  • from page 1, we only want rows 27:75

  • from page 2 we extract lines 7:59

  • combine the two pages together using c(page1, page2) into mergedpages

#let's extract the rows we need after we took out the header
page1 <- unlist(nolinebreak[[1]][27:75])
page2 <- unlist(nolinebreak[[2]][7:59])

mergedpages <- c(page1, page2)
length(mergedpages)  # 102

colnames()

  • feed the mergedpages as text into read.table() using sep="" (a space), and end up with ledger_tiral

  • remove column V5 as it appears to contain a couple of “-” which are useless

  • now that we have a data frame with 10 columns labeled V1-V10 we can

  • assign colnames() since we already know what the table header row is

  • what would happen if we feed the text into read.table() and use sep=“\t” the tab instead?

ledger_tiral <-  read.table( text= mergedpages, sep="",fill=TRUE )  

ledger_tiral <- ledger_tiral |> select(-V5)   #it appears that col V5 is empty remove it

colnames(ledger_tiral) <- c("Date","Period","Bldg/Unit","Transaction Code",
                            "Description","Journal","Document #","Charge",
                            "Credit","Balance")

separate()

This will be the third option at arriving at the same result, so let’s answer the question from above, what if we separate it using “\t” instead right out of the gate?

In the first method we used sep="\n“, in the second we used sep="", well now

  • use sep="\t" to see what we get, so let’s assign the output of read.table to ledger2

  • let’s separate it now using sep=“\\s+” the multiple spaces separator, and split the data into the header

ledger2 <- read.table(text=mergedpages, sep="\t")

ding <- separate(ledger2,V1,sep="\\s+", 
                 into=c("Date","Period","Bldg/Unit","Transaction Code",
                        "Description","Journal","Document #","Charge",
                        "Credit","Balance"))
# or this way

ding <- ledger2 |> 
        separate(col=V1, 
                 into =c("Date","Period","Bldg/Unit",
                         "Transaction Code","Description",
                         "Journal","Document #","Charge",
                         "Credit","Balance") )

As you would’ve guessed by now, we ended up with three data frames (santa, ledger_tiral, ding) with very very slight differences.