Basics - In & Out


The first goal is to have Tidy Data, but before we get there we have to have the Data. So we have to get the data from somewhere. This document will cover a few ways to get the data IN to R , as well as some ways to get the processed data out of R.

First we’ll start with a quick reminder of what Tidy Data is

Tidy Data


What is tidy data: tidy data is the target when you take raw data and turn it into a tidy data set, which you can then use to do downstream analysis.

The four things that you should have: 1. The raw data 2. A tidy data set 3. A code book (meta data) that describes each variable 4. An explicit and exact report of the steps that you used to go from step one to step 2 and 3. Your report should be explicit so when you had it off to someone, they can follow it to the T and arrive at the same result.

So where can we get our data from?

Directory


getwd/setwd

You can get your files from a local depository, and what’s important here is to know both

getwd() 
&
setwd()

You can set your directory in a

Relative - setwd(“./data”) or setwd(“../”) where the .. move you up one directory

Absolute - setwd(“/Users/jtleek/data/”) and that takes you there directly

Let’s go over how to create, copy, and delete files in R. For more information on working with files in R, check out R documentation: files. R documentation is a tool that helps you easily find and browse the documentation of almost all R packages on CRAN. It’s a useful reference guide for functions in R code. Let’s go through a few of the most useful functions for working with files. Most of the commands are built in the RStudio menu and you can access them there, but here are some that you can use at the prompt

I’m using RStudio as my editor, so I’ll be referring to methods used in RStudio as well.

dir.create

Use the dir.create function to create a new folder, or directory, to hold your files. Place the name of the folder in the parentheses of the function. It is always a good idea to use the conditional block of code as you see next, which I’ll omit in future examples for the lack of repitition

if(!file.exists("destination_directory"){
        dir.create ("destination_directory")}

file.create

Use the file.create() function to create a blank file. Place the name and the type of the file in the parentheses of the function. Your file types will usually be something like .txt, .docx, or .csv.

file.create (“new_text_file.txt”)
file.create (“new_word_file.docx”)
file.create (“new_csv_file.csv”)
#If the file is successfully created when you run the function, R will return a  value of TRUE
#(if not, R will return FALSE).
file.create (“new_csv_file.csv”)
[1] TRUE 

file.copy

Copying a file can be done using the file.copy() function. In the parentheses, add the name of the file to be copied. Then, type a comma, and add the name of the destination folder that you want to copy the file to.

file.copy (“new_text_file.txt” , “destination_folder”)

If you check the Files pane in RStudio, a copy of the file appears in the relevant folder:

file.path

Construct the path to a file from components in a platform-independent way.

Usage

file.path(..., fsep = .Platform$file.sep) 

Arguments

... character vectors. Long vectors are not supported.
fsep the path separator to use (assumed to be ASCII).
pollutantmean <- function(directory, pollutant, id = 1:332){ 
                        fileDir <- setwd("D:/Education/R/Data/specdata/")
                        fileList = list.files(directory)  
                        wantedFiles = file.path(fileDir,directory,fileList[id],fsep="/" ) 
                        dataIn <-  lapply(wantedFiles, read.csv)}

Download


Datasets

Remember to note the process you used to get the file, name, where, and method you used. Let’s say you are downloading from a page on a website, which is very very common:

  • Go to a page (note the page url for it) and document it. (Many times, those pages change names over the years and become useless)
  • Find the data/download section
  • Find the format you want to download (usually csv files)
  • Most times you right click on the file name and
  • Copy URL address
  • Use code below to download file
  • Use the method “curl” as a standard method unless you know it is different
  • Always verify the file was downloaded by listing the content of the directory
  • Always set a time stamp and name it a standard object like “dateDownloaded” because files online are always changed, updated or relocated

download.file

Will download ANY file, regardless if it’s csv, xls or ….

if (!file.exists("data")) {dir.create("data")}

fileUrl <- "https://~"          #assign the Url to an object 
download.file(fileUrl, destfile ="./data/camera.csv", method = "curl")

list.file("./data")            #this will list all the files in the directory (data) to verify the file was downloaded
[1] "camera.csv"

dateDownloaded <- date()
[1] "Sun Jan 13 3:23:45 2016"

Now that the file is local on my computer/system, we can load it in several ways:

Pre-built Data


data()

The default installation of R comes with a number of preloaded datasets that you can practice with. Plus, many online resources and tutorials use these sample datasets to teach coding concepts in R.

You can use the data() function to load these datasets in R. If you run the data function without an argument, R will display a list of the available datasets.

If you want to load a specific dataset, just enter its name in the parentheses of the data() function. For example, let’s load the mtcars dataset, which has information about cars that have been featured in past issues of Motor Trend magazine.

data(mtcars)

When you run the function, R will load the dataset. The dataset will also appear in the Environment pane of your RStudio. The Environment pane displays the names of the data objects, such as data frames and variables, that you have in your current workspace. In this image, mtcars appears in the fifth row of the pane. R tells us that it contains 32 observations and 11 variables.

Now that the dataset is loaded, you can get a preview of it in the R console pane. Just type its name…

mtcars

…and then press ctrl (or cmnd) and enter.

You can also display the dataset by clicking directly on the name of the dataset in the Environment pane. So, if you click on mtcars in the Environment pane, R automatically runs the View() function and displays the dataset in the RStudio data viewer.

Base Input Functions


There are a few principal functions built right into R for reading data.

  • read.table,read.csv, for reading tabular data
  • readLines, for reading lines of a text file
  • source, for reading in R code files (inverse of dump)
  • dget, for reading in R code files (inverse of dput)
  • load, for reading in saved workspaces
  • unserialize, for reading single R objects in binary form

read.table

This is the main function for reading data into R.

  • It is flexible and robust but requires more parameters
  • Reads the data in RAM so big data can cause problems
  • The important parameters are: file, header, sep, row.names, nrows
  • Can be slow
  • Related functions are: read.csv() and read.csv2()
  • quote, you can tell R if there are quoted values with quote=“” means no quotes
  • na.strings, which sets the character that represents a missing value
  • nrows , tells it how many rows we want to read nrows=10 reads 10 lines
  • skip, number of line to skip before starting to read
cameraData <- read.table("./data/cameras/csv")
Error: line 1 did not have 13 elements

head(cameraData) # and you'll see that it is not relating the "," as seperators so we can edit the read stmt
cameraData <- read.table("./data/cameras.csv", sep = ",", header = TRUE)
# set the seperator, and tell it that the variables are included in the header

read.csv

This function can be used instead, which automatically sets the sep=“,” and the header is TRUE

cameraData <- read.csv("./data/cameras/csv")

I hardly use any of them anymore and rely on packages I install such as ReadR which I’ll get to. But it’s important to know that these are still available.

Readr


In addition to using R’s built-in datasets, it is also helpful to import data from other sources to use for practice or analysis. The readr package in R is a great tool for reading rectangular data.

Rectangular data is data that fits nicely inside a rectangle of rows and columns, with each column referring to a single variable and each row referring to a single observation.

Pre-Built Data

The readr package comes with some sample files from built-in datasets that you can use for example code. To list the sample files, you can run the readr_example() function with no arguments.

readr_example()
[1] "challenge.csv"     "epa78.txt"         "example.log"      
[4] "fwf-sample.txt"    "massey-rating.txt" "mtcars.csv"       
[7] "mtcars.csv.bz2"    "mtcars.csv.zip"

Functions

The readr package is recently developed by Hadley Wickham to deal with reading in large flat files quickly. The package provides replacements for functions like read.table() and read.csv(). The analogous functions in readr are read_table() and read_csv(). These functions are often much faster than their base R analogues and provide a few other nice features such as progress meters.

Here are some examples of file types that store rectangular data:

  • .csv (comma separated values): a .csv file is a plain text file that contains a list of data. They mostly use commas to separate (or delimit) data, but sometimes they use other characters, like semicolons.
  • .tsv (tab separated values): a .tsv file stores a data table in which the columns of data are separated by tabs. For example, a database table or spreadsheet data.
  • .fwf (fixed width files): a .fwf file has a specific format that allows for the saving of textual data in an organized fashion.
  • .log: a .log file is a computer-generated file that records events from operating systems and other software programs.

The goal of readr is to provide a fast and friendly way to read rectangular data. readr supports several read_ functions. Each function refers to a specific file format.

  • read_csv(): comma-separated values (.csv) files
  • read_tsv(): tab-separated values files
  • read_delim(): general delimited files
  • read_fwf(): fixed-width files
  • read_table(): tabular files where columns are separated by white-space
  • read_log(): web log files

Read csv

read_csv

If you use the argument   colClasses, the function requires that the vector must have length equal to the number of imported columns. Supposing the rest of your dataset columns are 5 then your read stmt should be:
read.csv(“~.csv”, colClasses=c(“character”,rep(“numeric”,5))

The “mtcars.csv” file refers to the mtcars dataset that was mentioned earlier. Let’s use the read_csv() function to read the “mtcars.csv” file, as an example. In the parentheses, you need to supply the path to the file. In this case, it’s “readr_example(“mtcars.csv”).

read_csv(readr_example("mtcars.csv"))

When you run the function, R prints out a column specification that gives the name and type of each column. It also prints out a table

Example

When we have files in the  project already we can use the `read_csv()` function to import data from a .csv in the project folder called “hotel_bookings.csv” and save it as a data frame called `bookings_df`with this  code:

bookings_df <- read_csv("hotel_bookings.csv")

.csv files are the most popular for sharing, as all computer users can read them as opposed to xls or other format files

ReadXL

To import spreadsheet data into R, you can use the readxl package. The readxl package makes it easy to transfer data from Excel into R. Readxl supports both the legacy .xls file format and the modern xml-based .xlsx file format.

The readxl package is part of the tidyverse but is not a core tidyverse package, so you need to load readxl in R by using the library() function.

install.packages("readxl") 
library(readxl)

read_excel()

Like the readr package, readxl comes with some sample files from built-in datasets that you can use for practice. You can run the code readxl_example() to see the list.

You can use the read_excel() function to read a spreadsheet file just like you used read_csv() function to read a  .csv file. The code for reading the example file “type-me.xlsx” includes the path to the file in the parentheses of the function.

read_excel(readxl_example("type-me.xlsx"))

excel_sheets

You can use the excel_sheets() function to list the names of the individual sheets.

excel_sheets(readxl_example("type-me.xlsx"))
[1] "logical_coercion" "numeric_coercion" "date_coercion" "text_coercion"

You can also specify a sheet by name or number.  Just type “sheet =” followed by the name or number of the sheet. For example, you can use the sheet named “numeric_coercion” from the list above.

read_excel(..,sheet=“..”)

read_excel(readxl_example("type-me.xlsx"), sheet = "numeric_coercion")

Read XLSX

Similar to read_excel() you can load this library and use read.xlsx instead, where you can assign it to read certain rows and columns by passing it the colIndex=xx and rowIndex=blah to import a subset of the sheet

install.packages("openxlsx")
library(openxlsx)

cameraData <- read.xlsx("./data/cameras.xlsx", sheetIndex=1, header=TRUE) 

specific rows

colIndex <- 2:3
rowIndex <- 1:4
cameraData <- read.xlsx("./data/cameras.xlsx", sheetIndex=1, colIndex=colIndex, rowIndex=rowIndex, header=TRUE)

Read XML


Extendible Markup Language.

  • Frequently used to store structured data
  • Particularly used in internet applications
  • Extracting XML is the basis of most web scraping
  • Basic structure are:
    • Markup - labels that give the text structure

    • Content - the actual text of the document

Tags

Correspond to general labels:

  • Start tag <section>
  • End tags </section>
  • Empty tags <line-break />

Elements are specific examples of tags

  • <Greeting> Hello, World </Greeting>

Attributes are components of the label

  • <img src=“jeff.jpg” alt=“instructor”/>
  • <step number=“3”> Connect A to B. </step>

xmlTreeParse

Parse out the XML file: It loads the document into memory in a way that you get access to different parts of it. Since it’s in R it is still a sturctured object and we have to use different functions to gain access to its parts.

xmlRoot

This gives us access to the root, it’s the wrapper of the entire document. So this command will give us access to that particular element of that xml file. First thing we might want to do is get the name of it by using

xmlName

use xmlName to that node and get the name of it, in this case “breakfast menu”. So here below we see that the root node wraps the entire document and the whole document is named “breakfast menu”.

names

We can also look at the names of the rootNode. This tells us what all the nested elements with that root node are. Remember the root node wraps the whole document and the whole document is named “breakfast menu”. There are 3 different breakfast items on this menu, and each one is wrapped within a food element. So if you look at the names of the rootNode.

library(XML)

fileUrl <- "http://www.w3schools.com/xml/simple.xml"  
doc <- xmlTreeParse(fileUrl, useInternal = TRUE)
rootNode <- xmlRoot(doc)

xmlName(rootNode)
[1] "breakfast_menu"

names(rootNode)
food     food     food 
"food"  "food"   "food"

direct access

So, the next thing we can do is directly access parts of the XML document, just like any R object

rootNode
[1] <food> 
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
        <calories>650</calories>
        </food>
        
rootNode[[1]][1]
<name>Belgian Waffles</name>

xmlValue

sapply

You can programmatically extract parts of the file with the Sapply(). It will loop through all the rootnodes and extract the xmlValue

xmlSApply(rootNode, xmlValue)
"Belgian Waffles$5.95Two of our famous Belgian.....     
"Strawberry Belgian Waffles$7.95....    
"Berry-Berry.... 

xpath

If you want to extract specific components of the document, you can use XPath language that I will not learn

/node

Top level node

//node

node at any level

node[@attr]

node[@attr-name] node specific to that attribute name. so if we want to extract all the elements with title “name

node[@attr-name='bob']

xpathSApply(rootNode, "//name",xmlValue)
[1] "Belgian Waffles"         "Strawberry Belgian...."   "Berry-Berry.."
[4] "French Toast"  ...

xpathSApply(rootNode, "//price",xmlValue)
[1] "5.95"  "7.95" "....

Scraping


readLines

One of the ways to read lines from URL but takes longer so be patient

con <- url("https://www.yourdataiq.com", "r")
x <- readLines(con)
x            # or just the head()
head(x) 
[1] "<!DOCTYPE HTML PUBLIC......."
[2] ""
[3] "<html>"
... #_______ALWAYS CLOSE THE CONNECTION__________________
close(con)

HTML Parse

htmlTreeParse

library(XML)
fileUrl <- "http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens" 
doc <- htmlTreeParse(file,useInternal=TRUE)
scores <- xpathSApply(doc,"//li[@class='score']",xmlValue)
teams <- xpathSApply(doc,"//li[@class='team-name']", xmlValue)
scores
[1] "49-27"   "14-6"  "30-9"... 
[9]  "blah-boo".... 

teams 
[1] "Denver"   "Cleveland" ".....  

url <- "http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
html <- htmlTreeParse(url, userInter=T)
xpathSApply(html, "//title", xmlValue) 
xpathSApply(html, "//td@id='col-ditedby']", xmlValue)

HTTR

Here is a gihub help page for most of the OAuth code.

Get from httr

library(httr)

html2 = GET(url)
content2 =  content(html2, as="text")
parsedHtml = htmlParse(content2, asText=TRUE) 
xpathSApply(parsedHtml, "//title", xmlValue)  

Sites with Passwords

error pwd

pgl = GET("http://httpbin.org/basic-auth/user/passwd")
pgl

Response [http://http~
         Status: 401
         Content-type:

with pwd

pg2 = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
pg2

Response [http://http~ 
        Status: 200
        Content-type: application/json
        {
                "authenticated": true,
                "user" : "user" 
        }

names

names(pg2)
[1] "url"  "handle"  "status_code"  "headers"  "cookies"  "content" 
[7] "times"  "config"

using handles

blah = handle("http://google.com")
#_____ set the path for the first page________
pg1 = GET(handle=blah, path="/")
#______set path for another page______________
pg2 = GET(handle=blah, path="search")

Read API


Most APIs have similar process, as httr allows GET, POST, PUT, DELETE if you are granted any or all of those permissions. Most APIs use something similar to oauth.

  • You first create an account for your application
  • Create your application
  • Retrieve your keys

twitter

myapp = oauth_app("twitter", key="yourConsumerKeyHere", secret="yourConsumerSecretHere")
sig = sign_oauth1.0( myapp,
        token = "yourTokenHere",
        token_secret = "yourTokenSecretHere")
homeTL = GET("https"//api.twitter.com/1.1/statuses/home_timeline.json", sig)

json object

#______ convert the json object ___________ 
json1 = content(homeTL)
json2 = jsonlite::fromJSON(toJSON(jsonl))

#_____  retrieve first row and first 4 columns ___________
json2[1,1:4] 
        created_at               id            id_str
1 Mon Jan 14 05:34:45  +0000 2014  4.5999.ddds98    425893049900
1 Now that P. Norvig's blah blah    blkj 

Read JSON


JavaScript Object Notation:

  • Lightweight data storage
  • Commonly used in APIs (Application Programming Interfaces)
  • Similar to XML
  • Data stored as: Numbers(double), Strings(double quoted), Boolean, Array(ordered, comma separated in square brackets),
  • Object (unordered, comma seperated collection of key:value pairs in curley brackets

jsonlite

The package for reading this data in is jsonlite. You just pass the url to fromJson and you get a df and then you can read it like any other df. Actually it can store a df inside a df. So as you see below we can access owner$login which is inside owner

fromJSON

library(jsonlite) 
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")

names(jsonData) 
[1] "id"   "name"   "full_name"   "owner" 
[5]  "private"......

names(jsonData$owner)     # you get an array of values 
[1] "login"   "id"  "avatar-url"

jsonData$owner$login       
[1] "jtleek"  "jtleek"  ".....

toJSON

You can take a dataset and turn it into a JSON with this command. Let’s use the built-in iris dataset. You can use this if you are planning on exporting data to be used by an API

myjson <- toJSON(iris, pretty=TRUE)
cat(myjson)
[   
        { 
                "Sepal.Length" : 5.1,
                "Sepal.Width" : 3.5, 
                ....

You can then send it back to a df using the fromJSON command, instead of using a Url like we did before, we use a local JSON file

iris2 <- fromJSON(myjson)

head(iris2) 
Sepal.Length  Sepal.Width  Petal.Length  Petal.Length
1       5.1          3.5.....

Data.Table


  • It is a more efficient and faster way to handle dataframes.
  • It is faster than dataframes.
  • Let’s create a table, you’ll see it’s similar to creating a df.
library(data.table)
DF = data.frame(x=norm(9), y=rep(c("a","b","c"), each=3), z=rnorm(9))
head(DF,3)

DT = data.table(x=norm(9), y=rep(c("a","b","c"), each=3), z=rnorm(9))
head(DT,3)

tables

To see all the tables in memory use tables() with an (s) at the end

tables()

NAME   NROW  MB   COLS   KEY
[1,] DT           9          1       x,y,z
Total: 1MB

subset

Just like a df, we can subset with the []

DT[2,]
x   y      z 
1:  1.002  a   1.509  

filter

To subset with condition, for example if we want all the rows where column y has a value of “a”

DT[DT$y =="a",]

subset 1 index

If you use just one index to subset it will use that index for the rows. For example for the code below it will subset rows 2 and 3. Which is unlike a df. If you try to subset using columns like DT[,c(2,3)] you end up with a completely different result than df

DT[c(2,3)] 
x       y     z 
1:  1.00158 a  1.5093
2.  -0.0338 a  0.4844

list

mean

You can calculate values for variables with list

DT[,list(mean(x), sum(z))]
V1      V2 
1:  0.05637    0.5815

DT[,table(y)] 
y 
a   b   c
3   3   3

add new column

It adds a new column quickly, if you want to add a new column that’s equal to the square of z, you can do

DT[,w:=z^2]
DT2 <- DT
DT[ , y:= 2]

#This will add a new column to DT2 NOT DT which is equal to 2

multiple operations

Starts with { and ends with }. 1st stmt assigns a value to the tmp var, and the second stmt takes the log2 of that var +5. The last thing that was executed gets returned, so what ends up being returned and assigned to m is the log base two of x+z+5

DT[ , m := {tmp <- (x + z); log2(tmp+5)}]

plyr

We can add to this data table a variable like a which is equal to true when X is greater than zero and false when X is less than zero. So now we have a column of binary value a that we can work with. So suppose we want to summarize another variable by the cases when X is greater than 0 versus the cases where X is less than 0.

DT[ , a:= x>0]

So for example we can take the mean of X + W and we can do it grouped by a variable a. This will take the mean of X plus W when a is equal to true, and it’s gonna place that mean in all the rows where A is equal to true.

Then it’s gonna take the mean of X plus W where a is equal to false and place that mean in all the rows where A is equal to false. So it creates a new variable that’s equal to the aggregated mean, aggregated over the variable.

DT[ ,b := mean(x+w), by=a]

count 

.N

There’s some special variables that allow you to do something’s really fast. So one is a .N is an integer, length 1, containig the number of times that a particular group appears.

For example,

  • If I created a data table that has a large number of As, Bs, and Cs in it, so about 100,000 A B and Cs.
  • If I want to count the number of times each of those letters appear I can use data.table .N, and then grouped by the X variable.
  • It will count the number of times grouped by the x variable
  • Counts the occurrence of each a, b, c
set.seed(123)
DT <- data.table(x = sample(letters[1:3], 1E5, TRUE))
DT [ , .N, by=x]
     x      N 
1:   a     33387 
2:   c     33201 
3:   b     33412

keys

Data.tables have keys, so if you set the key, it’s possible to subset and sort a data table much more rapidly than you would be able to do with a data frame. 

Here, I’m going to create a data table, and it’s going to have

  • a variable X.
  • a variable Y
  • set the key for the data table to be the variable X.

Then if I want a subset on the basis of x or if I put in ‘a’, it knows to go and look in the key, and the key is x, and it very quickly subsets the data to only the values of x that are equal to a.

DT <- data.table(x=rep(c("a","b","c"), each=100), y=rnorm(300))
setkey(DT, x)
DT['a']  
   x      y 
1: a     0.259 
2: a     0.917 
3: a     ...

joins

You can also use keys to facilitate joints between data tables so for example here I’ve created two data tables

  • where they have a variable X and a variable Y
  • and in this case the second data table has a variable Z
  • I can set the key in both cases to be equal to x
  • the same key for both data tables
  • I can merge them together
  • This is actually quite a bit faster than merging with the data frame as long as you have the same key for both data tables. It can be very fast.
DT1 <- data.table(x=c('a','a','b','dt1'), y=1:4)
DT2 <- data.table(x=c('a','b','dt2'), z=5:7)
setKey(DT1, x)
setKey(DT2, x)
merge(DT1, DT2)
   x  y  z 
1: a  1  5 
2: a  2  5 
3: b  3  6

write.table

It can also be advantageous to use write.table if you want to be able to read things fast from the disk. 

  • I’ve created a big data frame with two very large variables in it.
  • I set up a temporary file: file
  • Then I write that big df that’s now in a temp file=file
  • Then I’ll time it using the fread()

The fread command could be applied to reading data tables, just like basically a drop-in substitute for read.table tab separated files. 

And so you can see it takes about .32 seconds.

If I tried to do that same operation, if I just tried to read.table that file, it would  take over 5 minutes.

big_df <- data.frame(x=rnorm(1E6), y=rnorm(1E6))
file <- tempfile()

write.table(big_df, file=file, row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
system.time(fread(file)) 
user   system   elapsed
0.312     0.015    0.326

system.time(read.table(file, header = TRUE, sep="\t")) 
user   system   elapsed
5.702    0.048    5.755

MYSQL


As we know MySQL is a free and widely used open source database software. More information.

  • Widely used in internet applications
  • Data is structured in Databases, Tables, Fields
  • Each row is called a record
  • Start by installing MySQL :: Download MySQL Installer

You can use this site for examples: UCSC Genome Browser Downloads

Install Package

install.packages("RMySQL")

Connecting

connect

uscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu")

show

Show all databases

result <- dbGetQuery(uscDb, "show databases;")

disconnect

Always disconnect from database

dbDisconnect(uscDb)

result

result

Process db

connect

#To connect to a specific database
hg19 <- dbConnect(MySQL(), user="genome", db="hg19", host="genome-mysql.cse.ucsc.edu" )

list tables

#list all tables in the db
allTables <- dbListTables(hg19)

length

length(allTables) #To list the first 5 tables allTables[1:5]

Process Table

dim of certain table

#get dimensions of a specific table
dbListFields(hg19, "affyU133Plus2")

count rows

# to get the length = nrows of that table - Inside the () is SQL command
dbGetQuery(hg19, "select count(*) from affyU133Plus2")
# to display the count
count(*)

read table

# To read from a table in the db 
affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)

subset

# Query a subset from a table by assigning a row value out of column misMathes -
#Once again inside the () is SQL command
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")

fetch

# FETCH the query
affyMis <- fetch(query)

quantile

# Break it down by quarters
quantile(affyMis$misMatches)

sub-subset

# Fetch certain small section of the query
affyMisSmall <- fetch(query, n=10)

clear

#clear the query
dbClearResult(query)
# [1] TRUE

disconnect

#dimension of section
dim(affyMisSmall)
# [1] 10 22

# ALWAYS DISCONNECT
dbDisconnect(hg19)

Read HDF5


HDF5 are used

  • To store large datasets
  • Stores a range of data types
  • Hierarchical format
  • Broken down into groups containing datasets and metadata
  • Have a group header with group name and list of attributes
  • Have a group symbol table with a list of object groups
  • Datasets multidimsional array of data elements with metadata

load package

source("http://bicoconductor.org/biocLite.R")
biocLite("rhdf5")
library(rhd5)

create file

#Create an HDF5 file 
created = h5createdFile("example.h5")
created
[1] TRUE

create groups

created = h5createGroup("example.h5", "foo")
created = h5createGroup("example.h5", "baa")
#create subgroup
created = h5createGroup("example.h5", "foo/boobaa")

display file

h5ls("example.h5")

write to groups

A = matrix(1:10, nr=5,nc=2)
h5write(A, "example.h5", "foo/A")
B = array(seq(0.1,2.0,by=0.1), dim=c(5,2,2))

#assign attribute to a column
attr(B, "scale") <- "liter"

write a dataset

h5write(B, "example5.h5", "foo/foobaa/B")
h5ls("example.h5") 

df = data.frame((1L:5L, seq(0, 1, length.out=5),
                 c("ab", "cde", "fghi", "a", "s"), stringsAsFactors=FALSE)
h5write(df, "example.h5", "df") 
h5ls("example.h5")

read data

readA = h5read("example.h5", "foo/A")
readB = h5read("example.h5", "foo/foobaa/B")
readf = h5read(example.h5", "df") 
readA

write chunks

h5write(c(12,13,14),"example.h5", "foo/A", index=list(1:3,1))

read chunks

h5read("example.h5", "foo/A")

Fixed Width


read.fwf

  • Fixed Width File Format  or .for can be read using read.fwf
  • Read a table of fixed width formatted data into a data.frame.

Usage

read.fwf(file, widths, header = FALSE, sep = "\t", 
         skip = 0, row.names, col.names, n = -1, 
         buffersize = 2000, fileEncoding = "", ...) 

Arguments

file

the name of the file which the data are to be read from.

Alternatively, file can be a connection, which will be opened if necessary, and if so closed at the end of the function call.

widths integer vector, giving the widths of the fixed-width fields (of one line), or list of integer vectors giving widths for multiline records.
header a logical value indicating whether the file contains the names of the variables as its first line. If present, the names must be delimited by sep.
sep character; the separator used internally; should be a character that does not occur in the file (except in the header).
skip number of initial lines to skip; see read.table.
row.names see read.table.
col.names see read.table.
n the maximum number of records (lines) to be read, defaulting to no limit.
buffersize Maximum number of lines to read at one time
fileEncoding character string: if non-empty declares the encoding used on a file (not a connection) so the character data can be re-encoded. See the ‘Encoding’ section of the help for file, the ‘R Data Import/Export’ manual and ‘Note’.
... further arguments to be passed to read.table. Useful such arguments include as.isna.stringscolClasses and strip.white.

Other Sources


There is an R package for that as well. You can always google for “data storage R package”

With Files Directly

file

open connection to a text file

url

open connection to url

gzfile

open connection to a .gz file

bzfile

open connection to a .bz2 file

Foreign Package

Allows you to load data from Minitab, S, SAS, SPSS, Stata, Systat with basic functions like: - read.arff (weka) - read.dta (Stata) - read.mtp (Minitab) - read.octave (Octave) - read.spss (SPSS) - read.xport (SAS)

Save


Save Dataset

dir.create

Use the dir.create function to create a new folder, or directory, to hold your files. Place the name of the folder in the parentheses of the function. It is always a good idea to use the conditional block of code as you see next, which I’ll omit in future examples for the lack of repitition

if(!file.exists("destination_directory"){ 
        dir.create ("destination_directory")}

file.create()

file.create creates files with the given names if they do not already exist and truncates them if they do. They are created with the maximal read/write permissions allowed by the ‘umask’ setting (where relevant). By default a warning is given (with the reason) if the operation fails. In RStudio you can create the file as you would any menu operation. Same with creating a directory.

file.create("bellabeat_cleaned_filtered_3_5.csv")

write_csv()

The ⁠write_*()⁠ family of functions are an improvement to analogous function such as write.csv() because they are approximately twice as fast. Unlike write.csv(), these functions do not include row names as a column in the written file. A generic function, output_column(), is applied to each variable to coerce columns to suitable output.

write_csv(data_3_5,"bellabeat_cleaned_filtered_3_5.csv")

RStudio


Import

Importing data to RStudio: The data you have been asked to clean is currently an external .csv file. In order to view and clean it in `R`, you will need to import it. The `tidyverse` library `readr` package has a number of functions for “reading in” or importing data, including .csv files.

Now you can find your own .csv to import! Using the RStudio Cloud interface, import and save the file in the same folder as this R Markdown document. To do this,

  1. go to the Files tab in the lower-right console
  2. click the Upload button next to the + New Folder button.
  3. Click on Choose
  4. This will open a popup to let you browse your computer for a file. Select any .csv file, then click Open.
  5. After you upload file, read the data into Rstudio with this:
 #Read and assign to df kc_pbp_23_df will use _df next time 
kc_pbp_23 <- read_csv("KC_play_by_play_2023.csv") 
str(kc_pbp_23)
summary(kc_pbp_23)
glimpse(kc_pbp_23)
head(kc_pbp_23)
spec(kc_pbp_23)

Delimiter: "," 
chr  (143): game_id, home_team, away_team, season_type, posteam, posteam_ty... 
dbl  (191): play_id, old_game_id, week, yardline_100, quarter_seconds_remai... 
lgl   (34): lateral_sack_player_id, lateral_sack_player_name, lateral_inter...
dttm   (2): end_clock_time, end_yard_line 
date   (1): game_date 
time   (2): time, drive_game_clock_end
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Download File

  1. Switch to directory you want to download files from within the Files pane (right lower pane in RStudio)
  2. Select the file(s) and/or folder(s) you want to download.
  3. Click More -> Export on the toolbar.
  4. You’ll then be prompted with a default file name for the download. Either accept the default or specify a custom name then press OK.

GitHub API


  • Go to GitHub Apps
  • Create app
  • Name the app
  • Copy the key = Client ID
  • Secret Key = Client Secret key
  • Follow the code below
install.packages("jsonlite") 
install.packages("httpuv")
install.packages("httr") 
library(jsonlite) 
library(httpuv)
library(httr) 
# 1. Find OAuth settings for github:
#    http://developer.github.com/v3/oauth/ 
oauth_endpoints("github") 

# 2. To make your own application, register at 
#    https://github.com/settings/developers. Use any URL for the homepage URL 
#    (http://github.com is fine) and  http://localhost:1410 as the callback url 
#    Replace your key and secret below.
myapp <- oauth_app(appname = "yourdataiq",
                   key = "Iv23lid14OnALIH6zocW", 
                   secret = "6e8eff47805a59e80e1fee233aaf97fe15ec1a3b") 

# 3. Get OAuth credentials
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
# Here you'll be prompted by Github to authenticate the app  

# 4. Use API gtoken <- config(token = github_token)
req <- GET("https://api.github.com/users/jtleek/repos", gtoken) 

# Take action on http error 
stop_for_status(req)  

# Extract the content that was requested 
json1 = content(req)  

# Convert the extracted json1 to a data.frame
json2DF = jsonlite::fromJSON(jsonlite::toJSON(json1)) 
head(json2DF) 

# Use this data to find the time that the datasharing repo was created. What time was it created?  
#let's pull the row where of the repo 
one <- subset(json2DF,json2DF$full_name == "jtleek/datasharing") 

#let's see when it was created
one$created_at
[[1]] [1] "2013-11-07T13:25:07Z"  

# Do it all in one statement: find the date and time the repo was created
json2DF[json2DF$full_name == "jtleek/datasharing", "created_at"]
[[1]] 
[1] "2013-11-07T13:25:07Z"