Basics - In & Out

The first goal is to have Tidy Data, but before we get there we have to have the Data. So we have to get the data from somewhere. This document will cover a few ways to get the data IN to R , as well as some ways to get the processed data out of R.

First we’ll start with a quick reminder of what Tidy Data is

Tidy Data

What is tidy data: tidy data is the target when you take raw data and turn it into a tidy data set, which you can then use to do downstream analysis.

The four things that you should have: 1. The raw data 2. A tidy data set 3. A code book (meta data) that describes each variable 4. An explicit and exact report of the steps that you used to go from step one to step 2 and 3. Your report should be explicit so when you had it off to someone, they can follow it to the T and arrive at the same result.

So where can we get our data from?

Download

Datasets

Remember to note the process you used to get the file, name, where, and method you used. Let’s say you are downloading from a page on a website, which is very very common:

Go to a page (note the page url for it) and document it. (Many times, those pages change names over the years and become useless)
Find the data/download section
Find the format you want to download (usually csv files)
Most times you right click on the file name and
Copy URL address
Use code below to download file
Use the method “curl” as a standard method unless you know it is different
Always verify the file was downloaded by listing the content of the directory
Always set a time stamp and name it a standard object like “dateDownloaded” because files online are always changed, updated or relocated

download.file

Will download ANY file, regardless if it’s csv, xls or ….

if (!file.exists("data")) {dir.create("data")}

fileUrl <- "https://~"          #assign the Url to an object 
download.file(fileUrl, destfile ="./data/camera.csv", method = "curl")

list.file("./data")            #this will list all the files in the directory (data) to verify the file was downloaded
[1] "camera.csv"

dateDownloaded <- date()
[1] "Sun Jan 13 3:23:45 2016"

Now that the file is local on my computer/system, we can load it in several ways:

Pre-built Data

data()

The default installation of R comes with a number of preloaded datasets that you can practice with. Plus, many online resources and tutorials use these sample datasets to teach coding concepts in R.

You can use the data() function to load these datasets in R. If you run the data function without an argument, R will display a list of the available datasets.

If you want to load a specific dataset, just enter its name in the parentheses of the data() function. For example, let’s load the mtcars dataset, which has information about cars that have been featured in past issues of Motor Trend magazine.

data(mtcars)

When you run the function, R will load the dataset. The dataset will also appear in the Environment pane of your RStudio. The Environment pane displays the names of the data objects, such as data frames and variables, that you have in your current workspace. In this image, mtcars appears in the fifth row of the pane. R tells us that it contains 32 observations and 11 variables.

Now that the dataset is loaded, you can get a preview of it in the R console pane. Just type its name…

mtcars

…and then press ctrl (or cmnd) and enter.

You can also display the dataset by clicking directly on the name of the dataset in the Environment pane. So, if you click on mtcars in the Environment pane, R automatically runs the View() function and displays the dataset in the RStudio data viewer.

Base Input Functions

There are a few principal functions built right into R for reading data.

read.table,read.csv, for reading tabular data
readLines, for reading lines of a text file
source, for reading in R code files (inverse of dump)
dget, for reading in R code files (inverse of dput)
load, for reading in saved workspaces
unserialize, for reading single R objects in binary form

read.table

This is the main function for reading data into R.

It is flexible and robust but requires more parameters
Reads the data in RAM so big data can cause problems
The important parameters are: file, header, sep, row.names, nrows
Can be slow
Related functions are: read.csv() and read.csv2()
quote, you can tell R if there are quoted values with quote=“” means no quotes
na.strings, which sets the character that represents a missing value
nrows , tells it how many rows we want to read nrows=10 reads 10 lines
skip, number of line to skip before starting to read

cameraData <- read.table("./data/cameras/csv")
Error: line 1 did not have 13 elements

head(cameraData) # and you'll see that it is not relating the "," as seperators so we can edit the read stmt
cameraData <- read.table("./data/cameras.csv", sep = ",", header = TRUE)
# set the seperator, and tell it that the variables are included in the header

read.csv

This function can be used instead, which automatically sets the sep=“,” and the header is TRUE

cameraData <- read.csv("./data/cameras/csv")

I hardly use any of them anymore and rely on packages I install such as ReadR which I’ll get to. But it’s important to know that these are still available.

Readr

In addition to using R’s built-in datasets, it is also helpful to import data from other sources to use for practice or analysis. The readr package in R is a great tool for reading rectangular data.

Rectangular data is data that fits nicely inside a rectangle of rows and columns, with each column referring to a single variable and each row referring to a single observation.

Pre-Built Data

The readr package comes with some sample files from built-in datasets that you can use for example code. To list the sample files, you can run the readr_example() function with no arguments.

readr_example()
[1] "challenge.csv"     "epa78.txt"         "example.log"      
[4] "fwf-sample.txt"    "massey-rating.txt" "mtcars.csv"       
[7] "mtcars.csv.bz2"    "mtcars.csv.zip"

Functions

The readr package is recently developed by Hadley Wickham to deal with reading in large flat files quickly. The package provides replacements for functions like read.table() and read.csv(). The analogous functions in readr are read_table() and read_csv(). These functions are often much faster than their base R analogues and provide a few other nice features such as progress meters.

Here are some examples of file types that store rectangular data:

.csv (comma separated values): a .csv file is a plain text file that contains a list of data. They mostly use commas to separate (or delimit) data, but sometimes they use other characters, like semicolons.
.tsv (tab separated values): a .tsv file stores a data table in which the columns of data are separated by tabs. For example, a database table or spreadsheet data.
.fwf (fixed width files): a .fwf file has a specific format that allows for the saving of textual data in an organized fashion.
.log: a .log file is a computer-generated file that records events from operating systems and other software programs.

The goal of readr is to provide a fast and friendly way to read rectangular data. readr supports several read_ functions. Each function refers to a specific file format.

read_csv(): comma-separated values (.csv) files
read_tsv(): tab-separated values files
read_delim(): general delimited files
read_fwf(): fixed-width files
read_table(): tabular files where columns are separated by white-space
read_log(): web log files

Read csv

read_csv

If you use the argument colClasses, the function requires that the vector must have length equal to the number of imported columns. Supposing the rest of your dataset columns are 5 then your read stmt should be:
read.csv(“~.csv”, colClasses=c(“character”,rep(“numeric”,5))

The “mtcars.csv” file refers to the mtcars dataset that was mentioned earlier. Let’s use the read_csv() function to read the “mtcars.csv” file, as an example. In the parentheses, you need to supply the path to the file. In this case, it’s “readr_example(“mtcars.csv”).

read_csv(readr_example("mtcars.csv"))

When you run the function, R prints out a column specification that gives the name and type of each column. It also prints out a table

Example

When we have files in the project already we can use the `read_csv()` function to import data from a .csv in the project folder called “hotel_bookings.csv” and save it as a data frame called `bookings_df`with this code:

bookings_df <- read_csv("hotel_bookings.csv")

.csv files are the most popular for sharing, as all computer users can read them as opposed to xls or other format files

ReadXL

To import spreadsheet data into R, you can use the readxl package. The readxl package makes it easy to transfer data from Excel into R. Readxl supports both the legacy .xls file format and the modern xml-based .xlsx file format.

The readxl package is part of the tidyverse but is not a core tidyverse package, so you need to load readxl in R by using the library() function.

install.packages("readxl") 
library(readxl)

read_excel()

Like the readr package, readxl comes with some sample files from built-in datasets that you can use for practice. You can run the code readxl_example() to see the list.

You can use the read_excel() function to read a spreadsheet file just like you used read_csv() function to read a .csv file. The code for reading the example file “type-me.xlsx” includes the path to the file in the parentheses of the function.

read_excel(readxl_example("type-me.xlsx"))

excel_sheets

You can use the excel_sheets() function to list the names of the individual sheets.

excel_sheets(readxl_example("type-me.xlsx"))
[1] "logical_coercion" "numeric_coercion" "date_coercion" "text_coercion"

You can also specify a sheet by name or number. Just type “sheet =” followed by the name or number of the sheet. For example, you can use the sheet named “numeric_coercion” from the list above.

read_excel(..,sheet=“..”)

read_excel(readxl_example("type-me.xlsx"), sheet = "numeric_coercion")

Read XLSX

Similar to read_excel() you can load this library and use read.xlsx instead, where you can assign it to read certain rows and columns by passing it the colIndex=xx and rowIndex=blah to import a subset of the sheet

install.packages("openxlsx")
library(openxlsx)

cameraData <- read.xlsx("./data/cameras.xlsx", sheetIndex=1, header=TRUE)

specific rows

colIndex <- 2:3
rowIndex <- 1:4
cameraData <- read.xlsx("./data/cameras.xlsx", sheetIndex=1, colIndex=colIndex, rowIndex=rowIndex, header=TRUE)

Read XML

Extendible Markup Language.

Frequently used to store structured data
Particularly used in internet applications
Extracting XML is the basis of most web scraping
Basic structure are:
- Markup - labels that give the text structure
- Content - the actual text of the document

Tags

Correspond to general labels:

Start tag <section>
End tags </section>
Empty tags <line-break />

Elements are specific examples of tags

<Greeting> Hello, World </Greeting>

Attributes are components of the label

<img src=“jeff.jpg” alt=“instructor”/>
<step number=“3”> Connect A to B. </step>

xmlTreeParse

Parse out the XML file: It loads the document into memory in a way that you get access to different parts of it. Since it’s in R it is still a sturctured object and we have to use different functions to gain access to its parts.

xmlRoot

This gives us access to the root, it’s the wrapper of the entire document. So this command will give us access to that particular element of that xml file. First thing we might want to do is get the name of it by using

xmlName

use xmlName to that node and get the name of it, in this case “breakfast menu”. So here below we see that the root node wraps the entire document and the whole document is named “breakfast menu”.

names

We can also look at the names of the rootNode. This tells us what all the nested elements with that root node are. Remember the root node wraps the whole document and the whole document is named “breakfast menu”. There are 3 different breakfast items on this menu, and each one is wrapped within a food element. So if you look at the names of the rootNode.

library(XML)

fileUrl <- "http://www.w3schools.com/xml/simple.xml"  
doc <- xmlTreeParse(fileUrl, useInternal = TRUE)
rootNode <- xmlRoot(doc)

xmlName(rootNode)
[1] "breakfast_menu"

names(rootNode)
food     food     food 
"food"  "food"   "food"

direct access

So, the next thing we can do is directly access parts of the XML document, just like any R object

rootNode
[1] <food> 
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
        <calories>650</calories>
        </food>
        
rootNode[[1]][1]
<name>Belgian Waffles</name>

xmlValue

sapply

You can programmatically extract parts of the file with the Sapply(). It will loop through all the rootnodes and extract the xmlValue

xmlSApply(rootNode, xmlValue)
"Belgian Waffles$5.95Two of our famous Belgian.....     
"Strawberry Belgian Waffles$7.95....    
"Berry-Berry....

xpath

If you want to extract specific components of the document, you can use XPath language that I will not learn

/node

Top level node

//node

node at any level

node[@attr]

node[@attr-name] node specific to that attribute name. so if we want to extract all the elements with title “name

node[@attr-name='bob']

xpathSApply(rootNode, "//name",xmlValue)
[1] "Belgian Waffles"         "Strawberry Belgian...."   "Berry-Berry.."
[4] "French Toast"  ...

xpathSApply(rootNode, "//price",xmlValue)
[1] "5.95"  "7.95" "....

Scraping

readLines

One of the ways to read lines from URL but takes longer so be patient

con <- url("https://www.yourdataiq.com", "r")
x <- readLines(con)
x            # or just the head()
head(x) 
[1] "<!DOCTYPE HTML PUBLIC......."
[2] ""
[3] "<html>"
... #_______ALWAYS CLOSE THE CONNECTION__________________
close(con)

HTML Parse

htmlTreeParse

library(XML)
fileUrl <- "http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens" 
doc <- htmlTreeParse(file,useInternal=TRUE)
scores <- xpathSApply(doc,"//li[@class='score']",xmlValue)
teams <- xpathSApply(doc,"//li[@class='team-name']", xmlValue)
scores
[1] "49-27"   "14-6"  "30-9"... 
[9]  "blah-boo".... 

teams 
[1] "Denver"   "Cleveland" ".....  

url <- "http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
html <- htmlTreeParse(url, userInter=T)
xpathSApply(html, "//title", xmlValue) 
xpathSApply(html, "//td@id='col-ditedby']", xmlValue)

HTTR

Here is a gihub help page for most of the OAuth code.

Get from httr

library(httr)

html2 = GET(url)
content2 =  content(html2, as="text")
parsedHtml = htmlParse(content2, asText=TRUE) 
xpathSApply(parsedHtml, "//title", xmlValue)

Sites with Passwords

error pwd

pgl = GET("http://httpbin.org/basic-auth/user/passwd")
pgl

Response [http://http~
         Status: 401
         Content-type:

with pwd

pg2 = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
pg2

Response [http://http~ 
        Status: 200
        Content-type: application/json
        {
                "authenticated": true,
                "user" : "user" 
        }

names

names(pg2)
[1] "url"  "handle"  "status_code"  "headers"  "cookies"  "content" 
[7] "times"  "config"

using handles

blah = handle("http://google.com")
#_____ set the path for the first page________
pg1 = GET(handle=blah, path="/")
#______set path for another page______________
pg2 = GET(handle=blah, path="search")

Read API

Most APIs have similar process, as httr allows GET, POST, PUT, DELETE if you are granted any or all of those permissions. Most APIs use something similar to oauth.

You first create an account for your application
Create your application
Retrieve your keys

twitter

myapp = oauth_app("twitter", key="yourConsumerKeyHere", secret="yourConsumerSecretHere")
sig = sign_oauth1.0( myapp,
        token = "yourTokenHere",
        token_secret = "yourTokenSecretHere")
homeTL = GET("https"//api.twitter.com/1.1/statuses/home_timeline.json", sig)

json object

#______ convert the json object ___________ 
json1 = content(homeTL)
json2 = jsonlite::fromJSON(toJSON(jsonl))

#_____  retrieve first row and first 4 columns ___________
json2[1,1:4] 
        created_at               id            id_str
1 Mon Jan 14 05:34:45  +0000 2014  4.5999.ddds98    425893049900
1 Now that P. Norvig's blah blah    blkj

Read JSON

JavaScript Object Notation:

Lightweight data storage
Commonly used in APIs (Application Programming Interfaces)
Similar to XML
Data stored as: Numbers(double), Strings(double quoted), Boolean, Array(ordered, comma separated in square brackets),
Object (unordered, comma seperated collection of key:value pairs in curley brackets

jsonlite

The package for reading this data in is jsonlite. You just pass the url to fromJson and you get a df and then you can read it like any other df. Actually it can store a df inside a df. So as you see below we can access owner$login which is inside owner

fromJSON

library(jsonlite) 
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")

names(jsonData) 
[1] "id"   "name"   "full_name"   "owner" 
[5]  "private"......

names(jsonData$owner)     # you get an array of values 
[1] "login"   "id"  "avatar-url"

jsonData$owner$login       
[1] "jtleek"  "jtleek"  ".....

toJSON

You can take a dataset and turn it into a JSON with this command. Let’s use the built-in iris dataset. You can use this if you are planning on exporting data to be used by an API

myjson <- toJSON(iris, pretty=TRUE)
cat(myjson)
[   
        { 
                "Sepal.Length" : 5.1,
                "Sepal.Width" : 3.5, 
                ....

You can then send it back to a df using the fromJSON command, instead of using a Url like we did before, we use a local JSON file

iris2 <- fromJSON(myjson)

head(iris2) 
Sepal.Length  Sepal.Width  Petal.Length  Petal.Length
1       5.1          3.5.....

Data.Table

It is a more efficient and faster way to handle dataframes.
It is faster than dataframes.
Let’s create a table, you’ll see it’s similar to creating a df.

library(data.table)
DF = data.frame(x=norm(9), y=rep(c("a","b","c"), each=3), z=rnorm(9))
head(DF,3)

DT = data.table(x=norm(9), y=rep(c("a","b","c"), each=3), z=rnorm(9))
head(DT,3)

tables

To see all the tables in memory use tables() with an (s) at the end

tables()

NAME   NROW  MB   COLS   KEY
[1,] DT           9          1       x,y,z
Total: 1MB

subset

Just like a df, we can subset with the []

DT[2,]
x   y      z 
1:  1.002  a   1.509

filter

To subset with condition, for example if we want all the rows where column y has a value of “a”

DT[DT$y =="a",]

subset 1 index

If you use just one index to subset it will use that index for the rows. For example for the code below it will subset rows 2 and 3. Which is unlike a df. If you try to subset using columns like DT[,c(2,3)] you end up with a completely different result than df

DT[c(2,3)] 
x       y     z 
1:  1.00158 a  1.5093
2.  -0.0338 a  0.4844

list

mean

You can calculate values for variables with list

DT[,list(mean(x), sum(z))]
V1      V2 
1:  0.05637    0.5815

DT[,table(y)] 
y 
a   b   c
3   3   3

add new column

It adds a new column quickly, if you want to add a new column that’s equal to the square of z, you can do

DT[,w:=z^2]
DT2 <- DT
DT[ , y:= 2]

#This will add a new column to DT2 NOT DT which is equal to 2

multiple operations

Starts with { and ends with }. 1st stmt assigns a value to the tmp var, and the second stmt takes the log2 of that var +5. The last thing that was executed gets returned, so what ends up being returned and assigned to m is the log base two of x+z+5

DT[ , m := {tmp <- (x + z); log2(tmp+5)}]

plyr

We can add to this data table a variable like a which is equal to true when X is greater than zero and false when X is less than zero. So now we have a column of binary value a that we can work with. So suppose we want to summarize another variable by the cases when X is greater than 0 versus the cases where X is less than 0.

DT[ , a:= x>0]

So for example we can take the mean of X + W and we can do it grouped by a variable a. This will take the mean of X plus W when a is equal to true, and it’s gonna place that mean in all the rows where A is equal to true.

Then it’s gonna take the mean of X plus W where a is equal to false and place that mean in all the rows where A is equal to false. So it creates a new variable that’s equal to the aggregated mean, aggregated over the variable.

DT[ ,b := mean(x+w), by=a]

count

.N

There’s some special variables that allow you to do something’s really fast. So one is a .N is an integer, length 1, containig the number of times that a particular group appears.

For example,

If I created a data table that has a large number of As, Bs, and Cs in it, so about 100,000 A B and Cs.
If I want to count the number of times each of those letters appear I can use data.table .N, and then grouped by the X variable.
It will count the number of times grouped by the x variable
Counts the occurrence of each a, b, c

set.seed(123)
DT <- data.table(x = sample(letters[1:3], 1E5, TRUE))
DT [ , .N, by=x]
     x      N 
1:   a     33387 
2:   c     33201 
3:   b     33412

keys

Data.tables have keys, so if you set the key, it’s possible to subset and sort a data table much more rapidly than you would be able to do with a data frame.

Here, I’m going to create a data table, and it’s going to have

a variable X.
a variable Y
set the key for the data table to be the variable X.

Then if I want a subset on the basis of x or if I put in ‘a’, it knows to go and look in the key, and the key is x, and it very quickly subsets the data to only the values of x that are equal to a.

DT <- data.table(x=rep(c("a","b","c"), each=100), y=rnorm(300))
setkey(DT, x)
DT['a']  
   x      y 
1: a     0.259 
2: a     0.917 
3: a     ...

joins

You can also use keys to facilitate joints between data tables so for example here I’ve created two data tables

where they have a variable X and a variable Y
and in this case the second data table has a variable Z
I can set the key in both cases to be equal to x
the same key for both data tables
I can merge them together
This is actually quite a bit faster than merging with the data frame as long as you have the same key for both data tables. It can be very fast.

DT1 <- data.table(x=c('a','a','b','dt1'), y=1:4)
DT2 <- data.table(x=c('a','b','dt2'), z=5:7)
setKey(DT1, x)
setKey(DT2, x)
merge(DT1, DT2)
   x  y  z 
1: a  1  5 
2: a  2  5 
3: b  3  6

write.table

It can also be advantageous to use write.table if you want to be able to read things fast from the disk.

I’ve created a big data frame with two very large variables in it.
I set up a temporary file: file
Then I write that big df that’s now in a temp file=file
Then I’ll time it using the fread()

The fread command could be applied to reading data tables, just like basically a drop-in substitute for read.table tab separated files.

And so you can see it takes about .32 seconds.

If I tried to do that same operation, if I just tried to read.table that file, it would take over 5 minutes.

big_df <- data.frame(x=rnorm(1E6), y=rnorm(1E6))
file <- tempfile()

write.table(big_df, file=file, row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
system.time(fread(file)) 
user   system   elapsed
0.312     0.015    0.326

system.time(read.table(file, header = TRUE, sep="\t")) 
user   system   elapsed
5.702    0.048    5.755

MYSQL

As we know MySQL is a free and widely used open source database software. More information.

Widely used in internet applications
Data is structured in Databases, Tables, Fields
Each row is called a record
Start by installing MySQL :: Download MySQL Installer

You can use this site for examples: UCSC Genome Browser Downloads

Install Package

install.packages("RMySQL")

Connecting

connect

uscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu")

show

Show all databases

result <- dbGetQuery(uscDb, "show databases;")

disconnect

Always disconnect from database

dbDisconnect(uscDb)

result

result

Process db

connect

#To connect to a specific database
hg19 <- dbConnect(MySQL(), user="genome", db="hg19", host="genome-mysql.cse.ucsc.edu" )

list tables

#list all tables in the db
allTables <- dbListTables(hg19)

length

length(allTables) #To list the first 5 tables allTables[1:5]

Process Table

dim of certain table

#get dimensions of a specific table
dbListFields(hg19, "affyU133Plus2")

count rows

# to get the length = nrows of that table - Inside the () is SQL command
dbGetQuery(hg19, "select count(*) from affyU133Plus2")
# to display the count
count(*)

read table

# To read from a table in the db 
affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)

subset

# Query a subset from a table by assigning a row value out of column misMathes -
#Once again inside the () is SQL command
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")

fetch

# FETCH the query
affyMis <- fetch(query)

quantile

# Break it down by quarters
quantile(affyMis$misMatches)

sub-subset

# Fetch certain small section of the query
affyMisSmall <- fetch(query, n=10)

clear

#clear the query
dbClearResult(query)
# [1] TRUE

disconnect

#dimension of section
dim(affyMisSmall)
# [1] 10 22

# ALWAYS DISCONNECT
dbDisconnect(hg19)

Read HDF5

HDF5 are used

To store large datasets
Stores a range of data types
Hierarchical format
Broken down into groups containing datasets and metadata
Have a group header with group name and list of attributes
Have a group symbol table with a list of object groups
Datasets multidimsional array of data elements with metadata

load package

source("http://bicoconductor.org/biocLite.R")
biocLite("rhdf5")
library(rhd5)

create file

#Create an HDF5 file 
created = h5createdFile("example.h5")
created
[1] TRUE

create groups

created = h5createGroup("example.h5", "foo")
created = h5createGroup("example.h5", "baa")
#create subgroup
created = h5createGroup("example.h5", "foo/boobaa")

display file

h5ls("example.h5")

write to groups

A = matrix(1:10, nr=5,nc=2)
h5write(A, "example.h5", "foo/A")
B = array(seq(0.1,2.0,by=0.1), dim=c(5,2,2))

#assign attribute to a column
attr(B, "scale") <- "liter"

write a dataset

h5write(B, "example5.h5", "foo/foobaa/B")
h5ls("example.h5") 

df = data.frame((1L:5L, seq(0, 1, length.out=5),
                 c("ab", "cde", "fghi", "a", "s"), stringsAsFactors=FALSE)
h5write(df, "example.h5", "df") 
h5ls("example.h5")

read data

readA = h5read("example.h5", "foo/A")
readB = h5read("example.h5", "foo/foobaa/B")
readf = h5read(example.h5", "df") 
readA

write chunks

h5write(c(12,13,14),"example.h5", "foo/A", index=list(1:3,1))

read chunks

h5read("example.h5", "foo/A")

Fixed Width

read.fwf

Fixed Width File Format or .for can be read using read.fwf
Read a table of fixed width formatted data into a data.frame.

Usage

read.fwf(file, widths, header = FALSE, sep = "\t", 
         skip = 0, row.names, col.names, n = -1, 
         buffersize = 2000, fileEncoding = "", ...)

Arguments

`file`	the name of the file which the data are to be read from. Alternatively, `file` can be a connection, which will be opened if necessary, and if so closed at the end of the function call.
`widths`	integer vector, giving the widths of the fixed-width fields (of one line), or list of integer vectors giving widths for multiline records.
`header`	a logical value indicating whether the file contains the names of the variables as its first line. If present, the names must be delimited by `sep`.
`sep`	character; the separator used internally; should be a character that does not occur in the file (except in the header).
`skip`	number of initial lines to skip; see `read.table`.
`row.names`	see `read.table`.
`col.names`	see `read.table`.
`n`	the maximum number of records (lines) to be read, defaulting to no limit.
`buffersize`	Maximum number of lines to read at one time
`fileEncoding`	character string: if non-empty declares the encoding used on a file (not a connection) so the character data can be re-encoded. See the ‘Encoding’ section of the help for `file`, the ‘R Data Import/Export’ manual and ‘Note’.
`...`	further arguments to be passed to `read.table`. Useful such arguments include `as.is`, `na.strings`, `colClasses` and `strip.white`.

Other Sources

There is an R package for that as well. You can always google for “data storage R package”

With Files Directly

file

open connection to a text file

url

open connection to url

gzfile

open connection to a .gz file

bzfile

open connection to a .bz2 file

Foreign Package

Allows you to load data from Minitab, S, SAS, SPSS, Stata, Systat with basic functions like: - read.arff (weka) - read.dta (Stata) - read.mtp (Minitab) - read.octave (Octave) - read.spss (SPSS) - read.xport (SAS)

Save

Save Dataset

dir.create

if(!file.exists("destination_directory"){ 
        dir.create ("destination_directory")}

file.create()

file.create creates files with the given names if they do not already exist and truncates them if they do. They are created with the maximal read/write permissions allowed by the ‘umask’ setting (where relevant). By default a warning is given (with the reason) if the operation fails. In RStudio you can create the file as you would any menu operation. Same with creating a directory.

file.create("bellabeat_cleaned_filtered_3_5.csv")

write_csv()

The ⁠write_*()⁠ family of functions are an improvement to analogous function such as write.csv() because they are approximately twice as fast. Unlike write.csv(), these functions do not include row names as a column in the written file. A generic function, output_column(), is applied to each variable to coerce columns to suitable output.

write_csv(data_3_5,"bellabeat_cleaned_filtered_3_5.csv")

RStudio

Import

Importing data to RStudio: The data you have been asked to clean is currently an external .csv file. In order to view and clean it in `R`, you will need to import it. The `tidyverse` library `readr` package has a number of functions for “reading in” or importing data, including .csv files.

Now you can find your own .csv to import! Using the RStudio Cloud interface, import and save the file in the same folder as this R Markdown document. To do this,

go to the Files tab in the lower-right console
click the Upload button next to the + New Folder button.
Click on Choose
This will open a popup to let you browse your computer for a file. Select any .csv file, then click Open.
After you upload file, read the data into Rstudio with this:

 #Read and assign to df kc_pbp_23_df will use _df next time 
kc_pbp_23 <- read_csv("KC_play_by_play_2023.csv") 
str(kc_pbp_23)
summary(kc_pbp_23)
glimpse(kc_pbp_23)
head(kc_pbp_23)
spec(kc_pbp_23)

Delimiter: "," 
chr  (143): game_id, home_team, away_team, season_type, posteam, posteam_ty... 
dbl  (191): play_id, old_game_id, week, yardline_100, quarter_seconds_remai... 
lgl   (34): lateral_sack_player_id, lateral_sack_player_name, lateral_inter...
dttm   (2): end_clock_time, end_yard_line 
date   (1): game_date 
time   (2): time, drive_game_clock_end
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Download File

Switch to directory you want to download files from within the Files pane (right lower pane in RStudio)
Select the file(s) and/or folder(s) you want to download.
Click More -> Export on the toolbar.
You’ll then be prompted with a default file name for the download. Either accept the default or specify a custom name then press OK.

GitHub API

Go to GitHub Apps
Create app
Name the app
Copy the key = Client ID
Secret Key = Client Secret key
Follow the code below

install.packages("jsonlite") 
install.packages("httpuv")
install.packages("httr") 
library(jsonlite) 
library(httpuv)
library(httr) 
# 1. Find OAuth settings for github:
#    http://developer.github.com/v3/oauth/ 
oauth_endpoints("github") 

# 2. To make your own application, register at 
#    https://github.com/settings/developers. Use any URL for the homepage URL 
#    (http://github.com is fine) and  http://localhost:1410 as the callback url 
#    Replace your key and secret below.
myapp <- oauth_app(appname = "yourdataiq",
                   key = "Iv23lid14OnALIH6zocW", 
                   secret = "6e8eff47805a59e80e1fee233aaf97fe15ec1a3b") 

# 3. Get OAuth credentials
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
# Here you'll be prompted by Github to authenticate the app  

# 4. Use API gtoken <- config(token = github_token)
req <- GET("https://api.github.com/users/jtleek/repos", gtoken) 

# Take action on http error 
stop_for_status(req)  

# Extract the content that was requested 
json1 = content(req)  

# Convert the extracted json1 to a data.frame
json2DF = jsonlite::fromJSON(jsonlite::toJSON(json1)) 
head(json2DF) 

# Use this data to find the time that the datasharing repo was created. What time was it created?  
#let's pull the row where of the repo 
one <- subset(json2DF,json2DF$full_name == "jtleek/datasharing") 

#let's see when it was created
one$created_at
[[1]] [1] "2013-11-07T13:25:07Z"  

# Do it all in one statement: find the date and time the repo was created
json2DF[json2DF$full_name == "jtleek/datasharing", "created_at"]
[[1]] 
[1] "2013-11-07T13:25:07Z"

`...`	character vectors. Long vectors are not supported.
`fsep`	the path separator to use (assumed to be ASCII).

Basics - In & Out

Tidy Data

Directory

getwd/setwd

dir.create

file.create

file.copy

unlink

file.path

Download

Datasets

download.file

Pre-built Data

data()

Base Input Functions

read.table

read.csv

Readr

Pre-Built Data

Functions

Read csv

read_csv

Example

ReadXL

read_excel()

excel_sheets

read_excel(..,sheet=“..”)

Read XLSX

specific rows

Read XML

xmlTreeParse

xmlRoot

xmlName

names

direct access

xmlValue

sapply

xpath

Scraping

readLines

HTML Parse

htmlTreeParse

HTTR

Get from httr

Sites with Passwords

error pwd

with pwd

names

using handles

Read API

twitter

json object

Read JSON

jsonlite

fromJSON

toJSON

Data.Table

tables

subset

filter

subset 1 index

list

mean

add new column

multiple operations

plyr

count

.N

keys

joins

write.table

MYSQL

Install Package

Connecting

connect

show

disconnect

result

Process db

connect