Clean & Manipulate


This is an extensive section because it covers what you’ll spend most of your time on: cleaning and processing data. We’ll cover:

Clean - Edit


  • Clean_names
    • round, abbreviations
  • Replace:
    • change values, rename, sub, gsub, grep, grepl, subset, substr
  • Select
  • Separate
  • Unite
  • Mutate
  • Expressions

Filter - Subset


  • Filter:
    • filter, exclude, extract, top N, slice, select
  • Subset:
    • subset, complete.cases, drop col, extract, in, across, if_any, if_all, cut, table, cut2, factor, relevel, any, all, is.na, which, identify ALL
  • Identical:
    • unique, sapply, section, exclude, names, [], matrix, [[]], $, which, dplyr, skimr
  • match_df

Conditionals - Loops


  • Conditionals:
    • if, else, else if
  • For Loop:
    • seq_along, seq_len
  • While Loop:
    • paste, while if, binomial dist
  • Repeat:
    • break, next, return
  • Looping Functions:
    • anti_join
    • semi_join
    • case_match
    • case_when
    • lapply, mean, matrix, extract, mean of subgroup
    • sapply
    • apply
    • tapply
    • factor: count, sub
    • mapply
    • vapply
    • split: split df, count
    • col/row: rowSums, rowMeans, colSums, colMeans

Functions


  • Functions:
    • function inside function, subset, mean, manual input, ncol mean, removeNA, unpacking

Join - Merge


  • Merge:
    • left outer join, inner, right outer, outer, by.x, by.y, cross, multiple cols
    • X_join, left_join
    • Intersect
    • Match_df
    • Concatenate, vertically: rbind, bind_rows - horizontally: cbind, bind_cols

Summarize - Arrange


  • Summarise:
    • split, count, quantile, n_distinct, nrow
  • Group_by
  • Table:
    • useNA, ifany
  • Summary
  • Aggregate
  • Arrange:
    • sort, order, arrange, across
  • Reshape:
    • wide to long, melt, dcast
  • Other:
    • sum, mean, mean avg, max, min

Missing Values - NAs


  • Look for NA:
    • is.na, count
  • Extract:
    • !is.na
  • Remove:
    • na.omit, complete.cases, na.fail, na.pass

Dates - Times


  • Lubridate:
    • ymd, mdy, dmy
  • Dates:
    • today, now
  • Convert:
    • string to date, date to string, date to datetime, datetime to date - as.Date, unclass, format(as.Date), toString, ymd_hms
  • Split & Extract:
    • day, month, year, timeframes
  • Time:
    • POSIXlt, POSIXct, coerce, str, extract, weekday, months, strptime, difftime, as.numeric

Sample Data


  • R data
  • Sequences:
    • vector, series, sequence: incremented, by length, along.with, seq_along
  • Replicate:
    • pattern, design pattern
  • Sample:
    • dice roll, seed, shuffle, insert NAs, subset, probability, letters, runif
  • Factors:
    • gl, vector & factor, tapply for mean, tapply for range
  • Distributions:
    • normal, poisson, binomial