ggplot - qplot


As we’ve already learned about ggplot2, qplot is part of the package, it is used for Quick Plots (qplot), I will cover it briefly since it was deprecated.

Scenario 1


Data

We’ll be working with mpg a built-in dataset.

Packages

 library(tidyverse)

Scatterplot


Let’s start with the default dataset within ggplot2, the mpg dataframe.

str(mpg)
tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
 $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
 $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
 $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
 $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
 $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
 $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
 $ drv         : chr [1:234] "f" "f" "f" "f" ...
 $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
 $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
 $ fl          : chr [1:234] "p" "p" "p" "p" ...
 $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

We see that there are 234 points in the dataset concerning 11 different characteristics of the cars. Suppose we want to see

  • If there’s a correlation between engine displacement (displ) and

  • highway miles per gallon (hwy).

As we did with the plot function of the base system we could simply call qplot with 3 arguments, the first two are the variables we want to examine and the third argument data is set equal to the name of the dataset which contains them (in this case, mpg).

  • You’ll see all the labels are provided

  • First argument is shown on the x-axis, second on the y

qplot(displ, hwy, data=mpg)
Warning: `qplot()` was deprecated in ggplot2 3.4.0.

Color

Let’s say we want to know which points belong to which (drv) type. So basically we want to color by group, here we’ll use the aesthetic for color. We could’ve used a different aes instead of color, we could use shape=

qplot(displ, hwy, data= mpg, color= drv)

Smoothing trend line

Let’s add geom to the plot.

  • First let’s add points using the geom

  • Also add a smoothing line as well,

  • We can concatenate the two together in the 4th argument geom=

The results will show a wide smoothing area (a 95% confidence interval) for each color/drv line.

It also tells us which method it used as the default for smoothing the data y~x (y in relation to x)

qplot(displ, hwy, data= mpg, color= drv, geom = c("point", "smooth"))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

geom_smooth

This is a ggplot function that can be used with qplot() as well. If we want to show a smooth linear line we can do the following to the diamonds dataset. The full example is shown in ggplot - Sample 2

qplot(carat, price, data= diamonds, color=cut) +
        geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

No x value

Let’s say we want to know

  • How were the data points of hwy mpg spread out

  • For each drv type, in other words

  • We plot the hwy values against nothing in the x axis and qplot will automatically use the range of the dataframe (234 total rows) to plot the hwy (y=hwy) in the colors indicated by the color= argument (per drv)

  • So, specifying the y parameter only, without an x argument, plots the values of the y argument in the order in which they occur in the data.

qplot(y=hwy, data=mpg, color=drv)

Boxplot


geom

We want to look at the range of hwy values per drv using geom= boxplot. So we end up with

  • 3 regions/boxes one for each drv (it is called a factor/group)
qplot(drv, hwy, data=mpg, geom="boxplot")

Let’s color by manufacturer

qplot(drv, hwy, data=mpg, geom="boxplot", color=manufacturer)

Histogram


Histograms display frequency counts for a single variable.

Let’s plot bins/buckets/bars for the occurrence of each hwy value and let’s color/fill them by drv type.

One thing is obvious the 4 wheel drive doesn’t have a single bin exceeding 30 mpg.

qplot(hwy, data=mpg, fill=drv)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Facets


For some of us it’s hard to read histograms, so let’s split this plot into panels. So since we have 3 groups of drives, we’ll split the plot into 3 smaller individual plots for each subset.

Scatterplot

Let’s start with scatterplot panels.

.~drv means: what’s left of the ~ are the number of rows, what’s to the right are the columns

  • So since . is to the left of the ~ that means we only have the base 1

  • drv is on the right of the ~ which means we have as many columns as there are factors/groups in drv

  • The panels show a more detailed relationship than the histogram. The relationship of each drv type is broken down by panel

qplot(displ, hwy, data=mpg, facets = .~drv)

Histogram

Binwidth

  • Let’s lay the panels into rows this time, so all we do is move drv to the left of the ~

  • Set the binwidth of the histogram to 2 to make it wider

qplot(hwy, data=mpg, facets = drv ~ ., binwidth=2)

Scenario 2

Data

We’ll be working with the diamonds default data with ggplot2.

str(diamonds)
tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Histogram

Let’s plot the price

Not only you get a histogram but you get a message that bins=30 and asking to pick a better value. The default is calculated from range/30.

qplot(price, data=diamonds)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

If you run range() you’ll see how they came up with bins=30. From the range calculate the difference and use it to size the bins

range(diamonds$price)
[1]   326 18823

bins

Size the bins from the diff in the range, so now the bins are 616.5667 wide. So each bin is a price increment of $616.5667

qplot(price, data=diamonds, binwidth = 18497/30)

  • brk: If you create a vector of length (range) and of sequence with each sequence 617 from the previous one you’ll know the tick marks of the x-axis. Let’s call that vector: brk with first count is for prices from $0 to $617, second from $617 and $1234…

  • counts: is the number of diamonds in each bin, so the first object is 4611 which is the height of the first bar

  • Of course you can view these vectors by typing their names in the command line

fill

Let’s fill the bar with the colors corresponding to the cut size, so each bar shows the count for each cut in that price range

qplot(price, data=diamonds, binwidth = 18497/30, fill=cut)

geom

Let’s replot price again but as a density function.

This will show the proportion of diamonds in each bin as opposed to the count in each bin. The shape will be similar but the scale is different, and the area under the curve is equal to one.

So let’s add a third geom = “density”. The highest peak is around 0 and goes down except around $4000 where there is a slight increase.

qplot(price, data=diamonds, geom = "density")

Let’s see what’s responsible for this increase? Let’s add color =cut and see if that shows us something.

qplot(price, data=diamonds, geom = "density", color=cut)

Well the lines and color is pretty sad, I can’t see a darn thing, except I see an increase in the good, very good, and premium lines. So we do see a reason for the peak around $4000 which is caused by the type of cut and how they are priced

Scatterplot


Let’s plot the price per carat size, as the size of diamonds goes up so does the price DUH!

qplot(carat, price, data= diamonds)

shape

We can set the geom shape to be based on a variable. In this case let’s set it to the type of cut. Now each cut is shown as a different geom/symbol.

qplot(carat, price, data= diamonds, shape=cut)
Warning: Using shapes for an ordinal variable is not advised

color

It’s hard to see so let’s try color instead

qplot(carat, price, data= diamonds, shape=cut)
Warning: Using shapes for an ordinal variable is not advised

geom_smooth

method = lm

Even though this is a ggplot function, it can be used here with qplot to display a smooth linear line with the shadow around each line showing the 95% confidence interval. We can see the better the cut (yellow) the steeper the price point line is.

qplot(carat, price, data= diamonds, color=cut) + geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'

Facets


Of course we knew it was coming to this, since it’s hard to tell from the scatter plot, let’s break them down into subplots/panels/facets and see if we have something. So let’s divide the plot by cut, so we have one row and 5 columns for the cuts.

qplot(carat, price, data= diamonds, color=cut, facets = .~cut) +
        geom_smooth(method = "lm")
`geom_smooth()` using formula = 'y ~ x'