if (!require("pdftools")) {install.packages("pdftools")}
library(pdftools)
library(data.table)
<- "D:/~/R/Data/pdf_data/ledger.pdf" file
Scraping a PDF
Introduction
There might be times when needed data is in a pdf file, we’ll go through a few ways to extract and transform it into usable data. There will be times that the data in a pdf is not in a table format but separated by tabs and spaces. That will be much harder and could cause unplanned steps and workarounds.
We’ll start with the uneven data represented in the first pdf file. This turns out to be a complicated case which leads us down a path where we need to consider that the data is not extratable. But we can still learn a lot from the process. So, let’s start with the hard one first.
Note: I’ll use function names as section headings to make it easier to find examples of use for functions, rather then a descriptive sub-title which doesn’t add to the fact, because the notes accompanying code blocks explain the idea and steps taken by the code itself.
Packages
Here is a list of useful packages we’ll use, and some we won’t have a chance to use in this page but still apply:
library(data.table)
library(tidydr)
library(stringr)
Uneven Ledger Example
Data
The data can be found at
pdf_text()
Let’s use pdf_text() to read the pdf into R. The output is a character[2], the 2 signifies that it’s 2 pages and are stored in one object each.
<- pdf_text(file) old_ledger
Method 1
sub()
As I quickly scan through old_ledger I notice several issues that could cause issues down the line. The header row for the table is located a few lines down the page, and the names need some tidying up. So, let’s use sub() to clean the pages
<- sub("Document #", "Document", old_ledger)
editedledger <- sub("Transaction Code","TansactionCode", editedledger)
editedledger <- sub("Bldg/Unit","BldgUnit", editedledger) editedledger
read.table()
Now that we’ve cleaned some glaring potential issues, let’s:
read the text file into a table using
"\n"
as the separator (\n is line break)search for the start of the table by finding the header row at line/row 16
we’ll subset the data set by dropping the first 15 rows and since we know the nrow=123 we can target the desired data and assign it to editedpdf
readpdf[16:123,1]
clean other rows from the df
if you check
class(editedpdf)
you’ll realize that it’s “character”read the editedpdf into tablepdf with
read.table()
but this time use (tab) “\t” as the separator
<- read.table( text= editedledger, sep="\n")
readpdf class(readpdf) #data.frame
1:30,1] #let's look for the ledger heading
readpdf[16,1] # table header is line 16
readpdf[nrow(readpdf) #123
<- (readpdf[16:123,1]) #extract the first 15 rows up to table header
editedpdf <- editedpdf[-108] #take out last line
editedpdf <- editedpdf[-(52:55)] #take out text atop of page 2
editedpdf <- editedpdf[-2] #take out line 2
editedpdf <- read.table(text= editedpdf, sep="\t") tablepdf
separate()
tablepdf is a df with one column and 102 rows, so we need to separate/split/divide/map the objects within each row to a column. If you review tablepdf you’ll realize that the objects are multiple spaces apart, so we need to use sep="\\s+"
which is multiple spaces.
<- separate(tablepdf,V1,sep="\\s+",
santa into=c("Date","Period","Bldg/Unit","Transaction Code",
"Description","Journal","Document #","Charge",
"Credit","Balance"))
Method 2
Data science is more about the person and the data, there are no set ways of doing a job. It’s more of an art than a set of steps to follow. We’ll go over a second method of achieving the same as Method 1. The reason is to use other functions so we can gain a better understanding of when to use a function.
So let’s go back to pdf_text() section where we generated old_ledger and go from there.
strsplit()
In the previous method we cleaned old_ledger and then we read it back in using sep="\n"
. Well in this method we’re going to split the text old_ledger using sep="\n"
and see what we have:
nolinebreak is a list with 2 elements, one for each page, and each line from the pdf is now a row
page 1 is 75 rows and page 2 is 59 rows
<- old_ledger |>
nolinebreak strsplit(split = "\n")
class(nolinebreak) #list
length(nolinebreak[[1]]) # 75
length(nolinebreak[[2]]) # 59
This section is not necessary for where we are going, but it helps clarify some points.
str_squish()
Let’s look for the table header row, and it appears that it’s row [24], so
extract row[24] into tableheader
words are separated by tab so we’ll
strsplit()
by"\t"
str_squish()
is from stringr package and does what it sounds like, squishes the string into a characterif you need a list then use as.list() to set it back, and now you notice the spaces between the objects are all equal
we can go some cleaning here as we notice there is a space in one of the objects, so we’ll use
sub()
to substitute into the space, and some other cleaning as necessary
<- nolinebreak[[1]][24] #character
tableheader "Date Period Bldg/Unit Transaction Code Description
Journal Document # Charge Credit Balance"
<- as.character(strsplit(tableheader,"\t"))
name.vector class(name.vector) #character
library(stringr)
<- str_squish(tableheader) #character
tableheader
<- as.list(strsplit(tableheader,"\t")) #now as list
tableheader "Date Period Bldg/Unit Transaction Code Description Journal
Document # Charge Credit Balance"
<- gsub( "\\s+", "/",tableheader)
tableheader "Date/Period/Bldg/Unit/Transaction/Code/Description/Journal/
Document/#/Charge/Credit/Balance"
<- gsub("on/Co", "on Co", tableheader)
tableheader "Date/Period/Bldg/Unit/Transaction Code/Description/Journal/
Document/#/Charge/Credit/Balance"
<- gsub("ent/#","ent #",tableheader) #character
tableheader "Date/Period/Bldg/Unit/Transaction Code/Description/Journal/
Document #/Charge/Credit/Balance"
Now let’s go back to where we left off before we played with the tableheader
Merge pages
Prior to cleaning the tableheader, we had just split old_ledger using split="\n"
into nolinebreak, so
now that we have two pages let’s take each page at a time
from page 1, we only want rows 27:75
from page 2 we extract lines 7:59
combine the two pages together using
c(page1, page2)
into mergedpages
#let's extract the rows we need after we took out the header
<- unlist(nolinebreak[[1]][27:75])
page1 <- unlist(nolinebreak[[2]][7:59])
page2
<- c(page1, page2)
mergedpages length(mergedpages) # 102
colnames()
feed the mergedpages as text into
read.table()
usingsep=""
(a space), and end up with ledger_tiralremove column V5 as it appears to contain a couple of “-” which are useless
now that we have a data frame with 10 columns labeled V1-V10 we can
assign
colnames()
since we already know what the table header row iswhat would happen if we feed the text into read.table() and use sep=“\t” the tab instead?
<- read.table( text= mergedpages, sep="",fill=TRUE )
ledger_tiral
<- ledger_tiral |> select(-V5) #it appears that col V5 is empty remove it
ledger_tiral
colnames(ledger_tiral) <- c("Date","Period","Bldg/Unit","Transaction Code",
"Description","Journal","Document #","Charge",
"Credit","Balance")
separate()
This will be the third option at arriving at the same result, so let’s answer the question from above, what if we separate it using “\t” instead right out of the gate?
In the first method we used sep="\n
“, in the second we used sep=""
, well now
use
sep="\t"
to see what we get, so let’s assign the output of read.table to ledger2let’s separate it now using sep=“\\s+” the multiple spaces separator, and split the data into the header
<- read.table(text=mergedpages, sep="\t")
ledger2
<- separate(ledger2,V1,sep="\\s+",
ding into=c("Date","Period","Bldg/Unit","Transaction Code",
"Description","Journal","Document #","Charge",
"Credit","Balance"))
# or this way
<- ledger2 |>
ding separate(col=V1,
into =c("Date","Period","Bldg/Unit",
"Transaction Code","Description",
"Journal","Document #","Charge",
"Credit","Balance") )
As you would’ve guessed by now, we ended up with three data frames (santa, ledger_tiral, ding) with very very slight differences.