Clean & Manipulate
This is an extensive section because it covers what you’ll spend most of your time on: cleaning and processing data. We’ll cover:
Clean - Edit
- Clean_names
- round, abbreviations
- Replace:
- change values, rename, sub, gsub, grep, grepl, subset, substr
- Select
- Separate
- Unite
- Mutate
- Expressions
Filter - Subset
- Filter:
- filter, exclude, extract, top N, slice, select
- Subset:
- subset, complete.cases, drop col, extract, in, across, if_any, if_all, cut, table, cut2, factor, relevel, any, all, is.na, which, identify ALL
- Identical:
- unique, sapply, section, exclude, names, [], matrix, [[]], $, which, dplyr, skimr
- match_df
Conditionals - Loops
- Conditionals:
- if, else, else if
- For Loop:
- seq_along, seq_len
- While Loop:
- paste, while if, binomial dist
- Repeat:
- break, next, return
- Looping Functions:
- anti_join
- semi_join
- case_match
- case_when
- lapply, mean, matrix, extract, mean of subgroup
- sapply
- apply
- tapply
- factor: count, sub
- mapply
- vapply
- split: split df, count
- col/row: rowSums, rowMeans, colSums, colMeans
Functions
- Functions:
- function inside function, subset, mean, manual input, ncol mean, removeNA, unpacking
Join - Merge
- Merge:
- left outer join, inner, right outer, outer, by.x, by.y, cross, multiple cols
- X_join, left_join
- Intersect
- Match_df
- Concatenate:
- vertically: rbind, bind_rows
- horizontally: cbind, bind_cols
Summarize - Arrange
- Summarise:
- split, count, quantile, n_distinct, nrow
- Group_by
- Table:
- useNA, ifany
- Summary
- Aggregate
- Arrange:
- sort, order, arrange, across
- Reshape:
- wide to long, melt, dcast
- Other:
- sum, mean, mean avg, max, min
Missing Values - NAs
- Look for NA:
- is.na, count
- Extract:
- !is.na
- Remove:
- na.omit, complete.cases, na.fail, na.pass
Dates - Times
- Lubridate:
- ymd, mdy, dmy
- Dates:
- today, now
- Convert:
- string to date, date to string, date to datetime, datetime to date - as.Date, unclass, format(as.Date), toString, ymd_hms
- Split & Extract:
- day, month, year, timeframes
- Time:
- POSIXlt, POSIXct, coerce, str, extract, weekday, months, strptime, difftime, as.numeric
Sample Data
- R data
- Sequences:
- vector, series, sequence: incremented, by length, along.with, seq_along
- Replicate:
- pattern, design pattern
- Sample:
- dice roll, seed, shuffle, insert NAs, subset, probability, letters, runif
- Factors:
- gl, vector & factor, tapply for mean, tapply for range
- Distributions:
- normal, poisson, binomial