Basics - In & Out
The first goal is to have Tidy Data, but before we get there we have to have the data, so we have to get it from somewhere. This document will cover a few ways to get data IN to R, as well as some ways to get the processed data OUT of R.
First we’ll start with a quick reminder of what Tidy Data is
Tidy Data
What is tidy data? Tidy data is the target when you take raw data and turn it into a tidy data set, which you can then use for downstream analysis.
The four things you should have:
1. The raw data
2. A tidy data set
3. A code book (metadata) that describes each variable
4. An explicit and exact recipe of the steps you used to go from 1 to 2 and 3
Your recipe should be explicit enough that when you hand it off to someone, they can follow it to a T and arrive at the same result.
So where can we get our data from?
Directory
getwd/setwd
You can get your files from a local repository, and what's important here is to know both ways of setting your working directory:
Relative - setwd("./data") or setwd("../"), where .. moves you up one directory
Absolute - setwd("/Users/jtleek/data/"), which takes you there directly
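A quick sketch of both styles at the prompt (the paths here are hypothetical):
getwd()                        # print the current working directory
setwd("./data")                # relative: move into the data subfolder
setwd("../")                   # relative: move back up one level
setwd("/Users/jtleek/data/")   # absolute: go there directly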
Let's go over how to create, copy, and delete files in R. For more information on working with files, check out the R documentation for files. R documentation is a tool that helps you easily find and browse the documentation of almost all R packages on CRAN, and it's a useful reference guide for functions in R code. Let's go through a few of the most useful functions for working with files. Most of these commands are also built into the RStudio menus, but here are some that you can use at the prompt.
I’m using RStudio as my editor, so I’ll be referring to methods used in RStudio as well.
dir.create
Use the dir.create function to create a new folder, or directory, to hold your files. Place the name of the folder in the parentheses of the function. It is always a good idea to use the conditional block of code you see next, which I'll omit in future examples to avoid repetition.
if (!file.exists("destination_directory")) {
  dir.create("destination_directory")
}
file.create
Use the file.create() function to create a blank file. Place the name and the type of the file in the parentheses of the function. Your file types will usually be something like .txt, .docx, or .csv.
file.create("new_text_file.txt")
file.create("new_word_file.docx")
file.create("new_csv_file.csv")

# If the file is successfully created when you run the function,
# R will return a value of TRUE (if not, R will return FALSE).
file.create("new_csv_file.csv")
[1] TRUE
file.copy
Copying a file can be done using the file.copy() function. In the parentheses, add the name of the file to be copied. Then, type a comma, and add the name of the destination folder that you want to copy the file to.
file.copy("new_text_file.txt", "destination_folder")
If you check the Files pane in RStudio, a copy of the file appears in the relevant folder.
unlink
You can delete R files using the unlink() function. Enter the file’s name in the parentheses of the function.
unlink("some_file.csv")
file.path
Construct the path to a file from components in a platform-independent way.
Usage
file.path(..., fsep = .Platform$file.sep)
Arguments
- ... : character vectors. Long vectors are not supported.
- fsep : the path separator to use (assumed to be ASCII).
For example, building the paths to a set of files inside a function:
pollutantmean <- function(directory, pollutant, id = 1:332){
  fileDir <- setwd("D:/Education/R/Data/specdata/")
  fileList <- list.files(directory)
  wantedFiles <- file.path(fileDir, directory, fileList[id], fsep = "/")
  dataIn <- lapply(wantedFiles, read.csv)
}
Download
Datasets
Remember to note the process you used to get the file: its name, where it came from, and the method you used. Let's say you are downloading from a page on a website, which is very common:
- Go to the page, note its URL, and document it. (Those pages often change names over the years and become useless)
- Find the data/download section
- Find the format you want to download (usually csv files)
- Most times you right click on the file name and
- Copy URL address
- Use code below to download file
- Use method = "curl" as a standard default unless you know it should be different
- Always verify the file was downloaded by listing the content of the directory
- Always set a time stamp and name it a standard object like “dateDownloaded” because files online are always changed, updated or relocated
download.file
Will download ANY file, regardless of whether it's csv, xls, or ....
if (!file.exists("data")) {dir.create("data")}
<- "https://~" #assign the Url to an object
fileUrl download.file(fileUrl, destfile ="./data/camera.csv", method = "curl")
list.file("./data") #this will list all the files in the directory (data) to verify the file was downloaded
1] "camera.csv"
[
<- date()
dateDownloaded 1] "Sun Jan 13 3:23:45 2016" [
Now that the file is local on my computer/system, we can load it in several ways:
Pre-built Data
data()
The default installation of R comes with a number of preloaded datasets that you can practice with. Plus, many online resources and tutorials use these sample datasets to teach coding concepts in R.
You can use the data() function to load these datasets in R. If you run the data function without an argument, R will display a list of the available datasets.
If you want to load a specific dataset, just enter its name in the parentheses of the data() function. For example, let’s load the mtcars dataset, which has information about cars that have been featured in past issues of Motor Trend magazine.
data(mtcars)
When you run the function, R will load the dataset. The dataset will also appear in the Environment pane of your RStudio. The Environment pane displays the names of the data objects, such as data frames and variables, that you have in your current workspace. mtcars appears in the pane, and R tells us that it contains 32 observations and 11 variables.
Now that the dataset is loaded, you can get a preview of it in the R console pane. Just type its name…
mtcars
…and then press Ctrl (or Cmd) and Enter.
You can also display the dataset by clicking directly on the name of the dataset in the Environment pane. So, if you click on mtcars in the Environment pane, R automatically runs the View() function and displays the dataset in the RStudio data viewer.
Base Input Functions
There are a few principal functions built right into R for reading data:
- read.table, read.csv, for reading tabular data
- readLines, for reading lines of a text file
- source, for reading in R code files (inverse of dump)
- dget, for reading in R code files (inverse of dput)
- load, for reading in saved workspaces
- unserialize, for reading single R objects in binary form
read.table
This is the main function for reading data into R.
- It is flexible and robust but requires more parameters
- Reads the data into RAM, so big data can cause problems
- The important parameters are: file, header, sep, row.names, nrows
- Can be slow
- Related functions are: read.csv() and read.csv2()
- quote, lets you tell R whether there are quoted values; quote="" means no quotes
- na.strings, sets the character that represents a missing value
- nrows, tells it how many rows to read; nrows=10 reads 10 lines
- skip, the number of lines to skip before starting to read
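For instance, a sketch combining these parameters (the file name and values are just illustrative):
smallSample <- read.table("./data/cameras.csv", sep = ",", header = TRUE,
                          quote = "", na.strings = "NA", nrows = 10)
# quote = "" means no quoted values, na.strings = "NA" marks missing values,
# and nrows = 10 reads only the first 10 rows of data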
cameraData <- read.table("./data/cameras.csv")
Error: line 1 did not have 13 elements

head(cameraData) # you'll see that it is not treating "," as the separator, so we edit the read statement

cameraData <- read.table("./data/cameras.csv", sep = ",", header = TRUE)
# set the separator, and tell it that the variable names are included in the header
read.csv
This function can be used instead; it automatically sets sep="," and header=TRUE.
cameraData <- read.csv("./data/cameras.csv")
I hardly use any of them anymore and rely on packages I install, such as readr, which I'll get to. But it's important to know that these are still available.
Readr
In addition to using R’s built-in datasets, it is also helpful to import data from other sources to use for practice or analysis. The readr package in R is a great tool for reading rectangular data.
Rectangular data is data that fits nicely inside a rectangle of rows and columns, with each column referring to a single variable and each row referring to a single observation.
Pre-Built Data
The readr package comes with some sample files from built-in datasets that you can use for example code. To list the sample files, you can run the readr_example() function with no arguments.
readr_example()
1] "challenge.csv" "epa78.txt" "example.log"
[4] "fwf-sample.txt" "massey-rating.txt" "mtcars.csv"
[7] "mtcars.csv.bz2" "mtcars.csv.zip" [
Functions
The readr package was developed by Hadley Wickham to deal with reading in large flat files quickly. The package provides replacements for functions like read.table() and read.csv(). The analogous functions in readr are read_table() and read_csv(). These functions are often much faster than their base R analogues and provide a few other nice features such as progress meters.
Here are some examples of file types that store rectangular data:
- .csv (comma separated values): a .csv file is a plain text file that contains a list of data. They mostly use commas to separate (or delimit) data, but sometimes they use other characters, like semicolons.
- .tsv (tab separated values): a .tsv file stores a data table in which the columns of data are separated by tabs. For example, a database table or spreadsheet data.
- .fwf (fixed width files): a .fwf file has a specific format that allows for the saving of textual data in an organized fashion.
- .log: a .log file is a computer-generated file that records events from operating systems and other software programs.
The goal of readr is to provide a fast and friendly way to read rectangular data. readr supports several read_ functions. Each function refers to a specific file format.
- read_csv(): comma-separated values (.csv) files
- read_tsv(): tab-separated values files
- read_delim(): general delimited files
- read_fwf(): fixed-width files
- read_table(): tabular files where columns are separated by white-space
- read_log(): web log files
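The delimited variants work the same way as read_csv(); here is a quick sketch with hypothetical file names:
library(readr)
scores  <- read_tsv("scores.tsv")                  # tab-separated file
reviews <- read_delim("reviews.txt", delim = ";")  # semicolon-delimited file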
Read csv
read_csv
If you use the argument colClasses, the function requires that the vector have length equal to the number of imported columns. Supposing the first column is character and the remaining 5 columns are numeric, your read statement should be:
read.csv("~.csv", colClasses = c("character", rep("numeric", 5)))
The "mtcars.csv" file refers to the mtcars dataset that was mentioned earlier. Let's use the read_csv() function to read the "mtcars.csv" file as an example. In the parentheses, you need to supply the path to the file. In this case, it's readr_example("mtcars.csv").
read_csv(readr_example("mtcars.csv"))
When you run the function, R prints out a column specification that gives the name and type of each column. It also prints out a preview of the data as a table.
Example
When we have files in the project already, we can use the `read_csv()` function to import data from a .csv in the project folder called "hotel_bookings.csv" and save it as a data frame called `bookings_df` with this code:
bookings_df <- read_csv("hotel_bookings.csv")
.csv files are the most popular format for sharing, as practically all computer users can read them, as opposed to xls or other proprietary formats.
ReadXL
To import spreadsheet data into R, you can use the readxl package. The readxl package makes it easy to transfer data from Excel into R. Readxl supports both the legacy .xls file format and the modern xml-based .xlsx file format.
The readxl package is part of the tidyverse but is not a core tidyverse package, so you need to load readxl in R by using the library() function.
install.packages("readxl")
library(readxl)
read_excel()
Like the readr package, readxl comes with some sample files from built-in datasets that you can use for practice. You can run the code readxl_example() to see the list.
You can use the read_excel() function to read a spreadsheet file just like you used read_csv() function to read a .csv file. The code for reading the example file “type-me.xlsx” includes the path to the file in the parentheses of the function.
read_excel(readxl_example("type-me.xlsx"))
excel_sheets
You can use the excel_sheets() function to list the names of the individual sheets.
excel_sheets(readxl_example("type-me.xlsx"))
1] "logical_coercion" "numeric_coercion" "date_coercion" "text_coercion" [
You can also specify a sheet by name or number. Just type “sheet =” followed by the name or number of the sheet. For example, you can use the sheet named “numeric_coercion” from the list above.
read_excel(.., sheet = "..")
read_excel(readxl_example("type-me.xlsx"), sheet = "numeric_coercion")
Read XLSX
Similar to read_excel(), you can load this library and use read.xlsx() instead, where you can have it read certain rows and columns by passing colIndex and rowIndex to import a subset of the sheet. (The sheetIndex/colIndex/rowIndex interface shown here comes from the xlsx package.)
install.packages("openxlsx")
library(openxlsx)
cameraData <- read.xlsx("./data/cameras.xlsx", sheetIndex = 1, header = TRUE)
specific rows
colIndex <- 2:3
rowIndex <- 1:4
cameraData <- read.xlsx("./data/cameras.xlsx", sheetIndex = 1, colIndex = colIndex, rowIndex = rowIndex, header = TRUE)
Read XML
Extensible Markup Language.
- Frequently used to store structured data
- Particularly used in internet applications
- Extracting XML is the basis of most web scraping
- The basic components are:
Markup - labels that give the text structure
Content - the actual text of the document
Tags
Correspond to general labels:
- Start tag <section>
- End tags </section>
- Empty tags <line-break />
Elements are specific examples of tags
- <Greeting> Hello, World </Greeting>
Attributes are components of the label
- <img src=“jeff.jpg” alt=“instructor”/>
- <step number=“3”> Connect A to B. </step>
xmlTreeParse
Parse out the XML file: it loads the document into memory in a way that gives you access to its different parts. Since it's in R, it is still a structured object, and we have to use different functions to gain access to its parts.
xmlRoot
This gives us access to the root, the wrapper of the entire document, so this command gives us access to that particular element of the XML file. The first thing we might want to do is get its name using
xmlName
Apply xmlName to that node to get its name, in this case "breakfast_menu". So below we see that the root node wraps the entire document, and the whole document is named "breakfast_menu".
names
We can also look at the names of the rootNode. This tells us what all the nested elements within the root node are. There are 3 different breakfast items on this menu, and each one is wrapped in a food element, which is what you see when you look at the names of the rootNode.
library(XML)
<- "http://www.w3schools.com/xml/simple.xml"
fileUrl <- xmlTreeParse(fileUrl, useInternal = TRUE)
doc <- xmlRoot(doc)
rootNode
xmlName(rootNode)
1] "breakfast_menu"
[
names(rootNode)
food food food "food" "food" "food"
direct access
So, the next thing we can do is directly access parts of the XML document, just like any R object
rootNode[[1]]
<food>
  <name>Belgian Waffles</name>
  <price>$5.95</price>
  <description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
  <calories>650</calories>
</food>

rootNode[[1]][[1]]
<name>Belgian Waffles</name>
xmlValue
sapply
You can programmatically extract parts of the file with xmlSApply(). It will loop through all the children of the root node and extract the xmlValue of each.
xmlSApply(rootNode, xmlValue)
"Belgian Waffles$5.95Two of our famous Belgian.....
"Strawberry Belgian Waffles$7.95....
"Berry-Berry....
xpath
If you want to extract specific components of the document, you can use the XPath language, which I won't go into in depth here.
/node - top level node
//node - node at any level
node[@attr-name] - node with an attribute name; e.g., if we want to extract all the elements with attribute "name"
node[@attr-name='bob'] - node with attribute attr-name equal to 'bob'
xpathSApply(rootNode, "//name", xmlValue)
[1] "Belgian Waffles" "Strawberry Belgian...." "Berry-Berry.."
[4] "French Toast" ...

xpathSApply(rootNode, "//price", xmlValue)
[1] "5.95" "7.95" "....
Scraping
readLines
One of the ways to read lines from a URL; it can take a while, so be patient.
con <- url("https://www.yourdataiq.com", "r")
x <- readLines(con)
head(x) # or just the head()
[1] "<!DOCTYPE HTML PUBLIC......."
[2] ""
[3] "<html>"
...
# _______ ALWAYS CLOSE THE CONNECTION __________________
close(con)
HTML Parse
htmlTreeParse
library(XML)
<- "http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens"
fileUrl <- htmlTreeParse(file,useInternal=TRUE)
doc <- xpathSApply(doc,"//li[@class='score']",xmlValue)
scores <- xpathSApply(doc,"//li[@class='team-name']", xmlValue)
teams
scores1] "49-27" "14-6" "30-9"...
[9] "blah-boo"....
[
teams 1] "Denver" "Cleveland" ".....
[
url <- "http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
html <- htmlTreeParse(url, userInter=T)
xpathSApply(html, "//title", xmlValue)
xpathSApply(html, "//td@id='col-ditedby']", xmlValue)
HTTR
There is a GitHub help page for most of the OAuth code.
Get from httr
library(httr)
html2 = GET(url)
content2 = content(html2, as = "text")
parsedHtml = htmlParse(content2, asText = TRUE)
xpathSApply(parsedHtml, "//title", xmlValue)
Sites with Passwords
error pwd
= GET("http://httpbin.org/basic-auth/user/passwd")
pgl
pgl
://http~
Response [http: 401
Status-type: Content
with pwd
= GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
pg2
pg2
://http~
Response [http: 200
Status-type: application/json
Content
{"authenticated": true,
"user" : "user"
}
names
names(pg2)
1] "url" "handle" "status_code" "headers" "cookies" "content"
[7] "times" "config" [
using handles
blah = handle("http://google.com")
# _____ set the path for the first page ________
pg1 = GET(handle = blah, path = "/")
# ______ set path for another page ______________
pg2 = GET(handle = blah, path = "search")
Read API
Most APIs follow a similar process; httr allows GET, POST, PUT, and DELETE requests if you are granted any or all of those permissions. Most APIs use something similar to OAuth:
- You first create an account for your application
- Create your application
- Retrieve your keys
myapp = oauth_app("twitter", key = "yourConsumerKeyHere", secret = "yourConsumerSecretHere")
sig = sign_oauth1.0(myapp,
                    token = "yourTokenHere",
                    token_secret = "yourTokenSecretHere")
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)
json object
# ______ convert the json object ___________
json1 = content(homeTL)
json2 = jsonlite::fromJSON(toJSON(json1))

# _____ retrieve first row and first 4 columns ___________
json2[1, 1:4]
                      created_at            id       id_str
1 Mon Jan 14 05:34:45 +0000 2014 4.5999.ddds98 425893049900
1 Now that P. Norvig's blah blah blkj
Read JSON
JavaScript Object Notation:
- Lightweight data storage
- Commonly used in APIs (Application Programming Interfaces)
- Similar to XML
- Data stored as: Numbers (double), Strings (double quoted), Boolean, Array (ordered, comma separated, in square brackets),
- Object (unordered, comma separated collection of key:value pairs, in curly brackets)
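To make those storage types concrete, here is a minimal sketch (the JSON string below is a made-up example):
library(jsonlite)
j <- '{"name": "waffles", "price": 5.95, "inStock": true, "tags": ["breakfast", "sweet"]}'
fromJSON(j)  # a named list: String, Number, Boolean, and an Array of strings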
jsonlite
The package for reading this data in is jsonlite. You just pass the URL to fromJSON() and you get a data frame, which you can then read like any other data frame. It can actually store a data frame inside a data frame; as you see below, we can access owner$login, which is inside owner.
fromJSON
library(jsonlite)
<- fromJSON("https://api.github.com/users/jtleek/repos")
jsonData
names(jsonData)
1] "id" "name" "full_name" "owner"
[5] "private"......
[
names(jsonData$owner) # you get an array of values
1] "login" "id" "avatar-url"
[
$owner$login
jsonData1] "jtleek" "jtleek" "..... [
toJSON
You can take a dataset and turn it into JSON with this command. Let's use the built-in iris dataset. You can use this if you are planning on exporting data to be consumed by an API.
myjson <- toJSON(iris, pretty = TRUE)
cat(myjson)
[
  {
    "Sepal.Length": 5.1,
    "Sepal.Width": 3.5,
....
You can then turn it back into a data frame using the fromJSON() command; instead of passing a URL like we did before, we pass the local JSON object.
iris2 <- fromJSON(myjson)
head(iris2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5 .....
Data.Table
- It is a more efficient and faster way to handle data frames.
- Let's create a table; you'll see it's similar to creating a data frame.
library(data.table)
DF = data.frame(x = rnorm(9), y = rep(c("a","b","c"), each = 3), z = rnorm(9))
head(DF, 3)

DT = data.table(x = rnorm(9), y = rep(c("a","b","c"), each = 3), z = rnorm(9))
head(DT, 3)
tables
To see all the tables in memory use tables() with an (s) at the end
tables()
     NAME NROW MB COLS  KEY
[1,] DT      9  1 x,y,z
Total: 1MB
subset
Just like a df, we can subset with the []
DT[2,]
       x y     z
1: 1.002 a 1.509
filter
To subset with condition, for example if we want all the rows where column y has a value of “a”
DT[DT$y == "a",]
subset 1 index
If you use just one index to subset, it will use that index for the rows. For example, the code below subsets rows 2 and 3, which is unlike a data frame. If you try to subset using columns like DT[,c(2,3)], you end up with a completely different result than with a data frame.
DT[c(2,3)]
          x y      z
1:  1.00158 a 1.5093
2: -0.0338  a 0.4844
list
mean
You can calculate values for variables by passing a list of expressions:
DT[, list(mean(x), sum(z))]
        V1     V2
1: 0.05637 0.5815

DT[, table(y)]
y
a b c
3 3 3
add new column
It adds new columns quickly. If you want to add a new column w that's equal to the square of z, you can do
DT[, w := z^2]

DT2 <- DT
DT[, y := 2]
# Careful: DT2 <- DT does not copy the table; := modifies it by reference,
# so this changes the y column in both DT and DT2
multiple operations
Multiple-step expressions start with { and end with }. The first statement assigns a value to the tmp variable, and the second statement takes log2 of that variable plus 5. The last expression evaluated gets returned, so what ends up being returned and assigned to m is the log base two of x + z + 5.
DT[, m := {tmp <- (x + z); log2(tmp + 5)}]
plyr
We can add to this data table a variable like a, which is TRUE when x is greater than zero and FALSE otherwise. So now we have a column of binary values a that we can work with. Suppose we want to summarize another variable by the cases where x is greater than 0 versus the cases where it is not.
DT[, a := x > 0]
So for example we can take the mean of x + w grouped by the variable a. This will take the mean of x + w where a is TRUE, and place that mean in all the rows where a is TRUE.
Then it will take the mean of x + w where a is FALSE, and place that mean in all the rows where a is FALSE. So it creates a new variable that's equal to the aggregated mean, aggregated over the variable a.
DT[, b := mean(x + w), by = a]
count
.N
There are some special variables that allow you to do some things really fast. One is .N, an integer of length 1 containing the number of times a particular group appears.
For example,
- If I created a data table that has a large number of As, Bs, and Cs in it, so about 100,000 A B and Cs.
- If I want to count the number of times each of those letters appear I can use data.table .N, and then grouped by the X variable.
- It will count the number of times grouped by the x variable
- Counts the occurrence of each a, b, c
set.seed(123)
DT <- data.table(x = sample(letters[1:3], 1E5, TRUE))
DT[, .N, by = x]
   x     N
1: a 33387
2: c 33201
3: b 33412
keys
Data tables have keys, so if you set the key, it's possible to subset and sort a data table much more rapidly than you would be able to do with a data frame.
Here, I’m going to create a data table, and it’s going to have
- a variable X.
- a variable Y
- set the key for the data table to be the variable X.
Then if I want to subset on the basis of x, I can just put in 'a'; it knows to look at the key, which is x, and it very quickly subsets the data to only the rows where x equals 'a'.
DT <- data.table(x = rep(c("a","b","c"), each = 100), y = rnorm(300))
setkey(DT, x)
DT['a']
   x     y
1: a 0.259
2: a 0.917
3: a ...
joins
You can also use keys to facilitate joins between data tables. For example, here I've created two data tables
- where they have a variable X and a variable Y
- and in this case the second data table has a variable Z
- I can set the key in both cases to be equal to x
- the same key for both data tables
- I can merge them together
- This is actually quite a bit faster than merging with the data frame as long as you have the same key for both data tables. It can be very fast.
DT1 <- data.table(x = c('a','a','b','dt1'), y = 1:4)
DT2 <- data.table(x = c('a','b','dt2'), z = 5:7)
setkey(DT1, x)
setkey(DT2, x)
merge(DT1, DT2)
   x y z
1: a 1 5
2: a 2 5
3: b 3 6
write.table
It can also be advantageous to use write.table to save data you want to be able to read back quickly from disk with fread().
- I’ve created a big data frame with two very large variables in it.
- I set up a temporary file: file
- Then I write that big df that’s now in a temp file=file
- Then I’ll time it using the fread()
The fread command could be applied to reading data tables, just like basically a drop-in substitute for read.table tab separated files.
And so you can see it takes about .32 seconds.
If I tried to do that same operation, if I just tried to read.table that file, it would take over 5 minutes.
big_df <- data.frame(x = rnorm(1E6), y = rnorm(1E6))
file <- tempfile()

write.table(big_df, file = file, row.names = FALSE, col.names = TRUE, sep = "\t", quote = FALSE)
system.time(fread(file))
   user  system elapsed
  0.312   0.015   0.326

system.time(read.table(file, header = TRUE, sep = "\t"))
   user  system elapsed
  5.702   0.048   5.755
MYSQL
As we know, MySQL is free and widely used open source database software; more information is available on the MySQL site.
- Widely used in internet applications
- Data is structured in Databases, Tables, Fields
- Each row is called a record
- Start by installing MySQL :: Download MySQL Installer
You can use this site for examples: UCSC Genome Browser Downloads
Install Package
install.packages("RMySQL")
Connecting
connect
<- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu") uscDb
show
Show all databases
result <- dbGetQuery(uscDb, "show databases;")
disconnect
Always disconnect from database
dbDisconnect(uscDb)
result
Process db
connect
#To connect to a specific database
<- dbConnect(MySQL(), user="genome", db="hg19", host="genome-mysql.cse.ucsc.edu" ) hg19
list tables
#list all tables in the db
allTables <- dbListTables(hg19)
length
length(allTables)
allTables[1:5] # list the first 5 tables
Process Table
fields of a certain table
# list the fields (columns) of a specific table
dbListFields(hg19, "affyU133Plus2")
count rows
# get the number of rows of that table - inside the () is a SQL command
dbGetQuery(hg19, "select count(*) from affyU133Plus2")
  count(*)
read table
# To read from a table in the db
affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)
subset
# Query a subset from a table with a condition on the misMatches column -
# once again, inside the () is a SQL command
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
fetch
# FETCH the query
affyMis <- fetch(query)
quantile
# Break it down by quarters
quantile(affyMis$misMatches)
sub-subset
# Fetch certain small section of the query
affyMisSmall <- fetch(query, n = 10)
clear
#clear the query
dbClearResult(query)
# [1] TRUE
disconnect
#dimension of section
dim(affyMisSmall)
# [1] 10 22
# ALWAYS DISCONNECT
dbDisconnect(hg19)
Read HDF5
HDF5 files are used
- To store large datasets
- Stores a range of data types
- Hierarchical format
- Broken down into groups containing datasets and metadata
- Have a group header with group name and list of attributes
- Have a group symbol table with a list of object groups
- Datasets are multidimensional arrays of data elements with metadata
load package
source("http://bicoconductor.org/biocLite.R")
biocLite("rhdf5")
library(rhd5)
create file
#Create an HDF5 file
created = h5createFile("example.h5")
created
[1] TRUE
create groups
created = h5createGroup("example.h5", "foo")
created = h5createGroup("example.h5", "baa")
# create subgroup
created = h5createGroup("example.h5", "foo/foobaa")
display file
h5ls("example.h5")
write to groups
A = matrix(1:10, nr = 5, nc = 2)
h5write(A, "example.h5", "foo/A")

B = array(seq(0.1, 2.0, by = 0.1), dim = c(5, 2, 2))
#assign attribute to a column
attr(B, "scale") <- "liter"
write a dataset
h5write(B, "example5.h5", "foo/foobaa/B")
h5ls("example.h5")
df = data.frame(1L:5L, seq(0, 1, length.out = 5),
                c("ab", "cde", "fghi", "a", "s"), stringsAsFactors = FALSE)
h5write(df, "example.h5", "df")
h5ls("example.h5")
read data
readA = h5read("example.h5", "foo/A")
readB = h5read("example.h5", "foo/foobaa/B")
readf = h5read("example.h5", "df")
readA
write chunks
h5write(c(12,13,14),"example.h5", "foo/A", index=list(1:3,1))
read chunks
h5read("example.h5", "foo/A")
Fixed Width
read.fwf
- Fixed width file format (.fwf or .for) can be read using read.fwf
- Reads a table of fixed width formatted data into a data.frame.
Usage
read.fwf(file, widths, header = FALSE, sep = "\t",
skip = 0, row.names, col.names, n = -1,
buffersize = 2000, fileEncoding = "", ...)
Arguments
- file : the name of the file which the data are to be read from. Alternatively, file can be a connection.
- widths : integer vector, giving the widths of the fixed-width fields (of one line), or list of integer vectors giving widths for multiline records.
- header : a logical value indicating whether the file contains the names of the variables as its first line. If present, the names must be delimited by sep.
- sep : character; the separator used internally; should be a character that does not occur in the file (except in the header).
- skip : number of initial lines to skip; see read.table.
- row.names : see read.table.
- col.names : see read.table.
- n : the maximum number of records (lines) to be read, defaulting to no limit.
- buffersize : maximum number of lines to read at one time.
- fileEncoding : character string: if non-empty, declares the encoding used on a file (not a connection) so the character data can be re-encoded. See the 'Encoding' section of the help for file, the 'R Data Import/Export' manual, and 'Note'.
- ... : further arguments to be passed to read.table. Useful such arguments include as.is, na.strings, colClasses and strip.white.
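As a minimal sketch of a call (the file name, widths, and column names here are hypothetical):
# read two fixed-width fields of 10 and 4 characters per line,
# skipping the first 4 lines of the file
x <- read.fwf("./data/weather.for", widths = c(10, 4), skip = 4,
              col.names = c("date", "value"))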
Other Sources
There is an R package for just about every other data source as well; you can always google "data storage R package".
With Files Directly
file
open connection to a text file
url
open connection to a URL
gzfile
open connection to a .gz file
bzfile
open connection to a .bz2 file
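A quick sketch of reading through one of these connections (the file name is hypothetical):
con <- gzfile("./data/words.gz")  # open a connection to a gzipped text file
x <- readLines(con, 10)           # read the first 10 lines
close(con)                        # always close the connection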
Foreign Package
Allows you to load data from Minitab, S, SAS, SPSS, Stata, and Systat with basic functions like:
- read.arff (Weka)
- read.dta (Stata)
- read.mtp (Minitab)
- read.octave (Octave)
- read.spss (SPSS)
- read.xport (SAS)
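For instance, a minimal sketch of reading a Stata file (the file name is hypothetical):
library(foreign)
surveyData <- read.dta("./data/survey.dta")  # import a Stata .dta file as a data frame
head(surveyData)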
Save
Save Dataset
dir.create
As covered earlier, use the dir.create function to create a destination folder for your files, wrapped in a conditional block so it is only created if it doesn't already exist:
if (!file.exists("destination_directory")) {
  dir.create("destination_directory")
}
file.create()
file.create() creates files with the given names if they do not already exist and truncates them if they do. They are created with the maximal read/write permissions allowed by the 'umask' setting (where relevant). By default a warning is given (with the reason) if the operation fails. In RStudio you can create the file as you would with any menu operation; the same goes for creating a directory.
file.create("bellabeat_cleaned_filtered_3_5.csv")
write_csv()
The write_*() family of functions is an improvement over analogous functions such as write.csv() because they are approximately twice as fast. Unlike write.csv(), these functions do not include row names as a column in the written file. A generic function, output_column(), is applied to each variable to coerce columns to suitable output.
write_csv(data_3_5,"bellabeat_cleaned_filtered_3_5.csv")
RStudio
Import
Importing data to RStudio: The data you have been asked to clean is currently an external .csv file. In order to view and clean it in `R`, you will need to import it. The `tidyverse` library `readr` package has a number of functions for “reading in” or importing data, including .csv files.
Now you can find your own .csv to import! Using the RStudio Cloud interface, import and save the file in the same folder as this R Markdown document. To do this,
- go to the Files tab in the lower-right console
- click the Upload button next to the + New Folder button.
- Click on Choose
- This will open a popup to let you browse your computer for a file. Select any .csv file, then click Open.
- After you upload the file, read the data into RStudio with this:
# Read the csv and assign it to kc_pbp_23; will use the _df suffix next time
kc_pbp_23 <- read_csv("KC_play_by_play_2023.csv")
str(kc_pbp_23)
summary(kc_pbp_23)
glimpse(kc_pbp_23)
head(kc_pbp_23)
spec(kc_pbp_23)
: ","
Delimiterchr (143): game_id, home_team, away_team, season_type, posteam, posteam_ty...
dbl (191): play_id, old_game_id, week, yardline_100, quarter_seconds_remai...
lgl (34): lateral_sack_player_id, lateral_sack_player_name, lateral_inter...
dttm (2): end_clock_time, end_yard_line
date (1): game_date
time (2): time, drive_game_clock_end
`spec()` to retrieve the full column specification for this data.
ℹ Use `show_col_types = FALSE` to quiet this message. ℹ Specify the column types or set
Download File
- Switch to directory you want to download files from within the Files pane (right lower pane in RStudio)
- Select the file(s) and/or folder(s) you want to download.
- Click More -> Export on the toolbar.
- You’ll then be prompted with a default file name for the download. Either accept the default or specify a custom name then press OK.
GitHub API
- Go to GitHub Apps
- Create app
- Name the app
- Copy the key (the Client ID)
- Copy the secret key (the Client Secret)
- Follow the code below
install.packages("jsonlite")
install.packages("httpuv")
install.packages("httr")
library(jsonlite)
library(httpuv)
library(httr)
# 1. Find OAuth settings for github:
# http://developer.github.com/v3/oauth/
oauth_endpoints("github")
# 2. To make your own application, register at
# https://github.com/settings/developers. Use any URL for the homepage URL
# (http://github.com is fine) and http://localhost:1410 as the callback url
# Replace your key and secret below.
myapp <- oauth_app(appname = "yourdataiq",
                   key = "Iv23lid14OnALIH6zocW",
                   secret = "6e8eff47805a59e80e1fee233aaf97fe15ec1a3b")
# 3. Get OAuth credentials
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
# Here you'll be prompted by GitHub to authenticate the app

# 4. Use API
gtoken <- config(token = github_token)
<- GET("https://api.github.com/users/jtleek/repos", gtoken)
req
# Take action on http error
stop_for_status(req)
# Extract the content that was requested
json1 = content(req)

# Convert the extracted json1 to a data.frame
json2DF = jsonlite::fromJSON(jsonlite::toJSON(json1))
head(json2DF)
# Use this data to find the time that the datasharing repo was created. What time was it created?
# let's pull the row of the datasharing repo
one <- subset(json2DF, json2DF$full_name == "jtleek/datasharing")

# let's see when it was created
one$created_at
[[1]]
[1] "2013-11-07T13:25:07Z"

# Do it all in one statement: find the date and time the repo was created
json2DF[json2DF$full_name == "jtleek/datasharing", "created_at"]
[[1]]
[1] "2013-11-07T13:25:07Z"