MTH-391A | Spring 2025 | University of Portland
January 29, 2025
R Packages: tidyverse is a collection of packages suited for data processing and visualization, which includes the tibble and dplyr packages.
Tibbles: Tibbles are a special kind of data frame, created using the tibble package in tidyverse.
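As a minimal sketch of what "special kind of data frame" means, the block below builds a small tibble and checks its classes. The `scores` data is hypothetical and used only for illustration; it does not come from the course materials.

```r
library(tibble)

# Hypothetical toy data, used only for illustration
scores <- tibble(
  student = c("A", "B", "C"),
  score   = c(90, 85, 77)
)

# A tibble adds the "tbl_df" and "tbl" classes on top of "data.frame"
class(scores)

# Both checks return TRUE: a tibble is still a data frame
is_tibble(scores)
inherits(scores, "data.frame")
```

Because a tibble is still a data frame, functions that accept data frames should generally accept tibbles as well.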
Chaining dplyr Verbs Using |>
Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.
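One possible way to write this chain is sketched below. It assumes R 4.1 or later (for the native `|>` operator) and uses `filter()` and `mutate()`; the column name `petal_sepal_ratio` and the object name `setosa_ratio` are illustrative choices, not names taken from the slides.

```r
library(dplyr)

setosa_ratio <- iris |>
  # Keep only the rows for the setosa species
  filter(Species == "setosa") |>
  # Add the ratio as a new column (illustrative column name)
  mutate(petal_sepal_ratio = Petal.Length / Sepal.Length)

head(setosa_ratio)
```

Each verb takes the result of the previous line as its first argument, so the steps read top to bottom in the order they are applied.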
dplyr Verbs for Tidying Data
dplyr functions that operate on rows and columns.
Verb | Purpose & Example |
---|---|
group_by() | Groups rows by one or more columns, allowing operations to be performed within groups. Example: group_by(data, category) |
summarise() | Reduces multiple rows into a single summary row per group. Example: summarise(data, new_var = function(var)) |
\(\star\) The group_by() and summarise() verbs usually go together when you need to compute descriptive statistics for each category of a categorical variable.
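To see why the two verbs are paired, the sketch below compares summarise() on ungrouped and grouped data using the built-in iris data frame. The object names `overall`, `grouped`, and `by_species` are illustrative.

```r
library(dplyr)

# Without group_by(): one summary row for the whole data frame
overall <- summarise(iris, mean_sepal_length = mean(Sepal.Length))

# With group_by(): one summary row per category of Species
grouped <- group_by(iris, Species)
by_species <- summarise(grouped, mean_sepal_length = mean(Sepal.Length))

nrow(overall)     # 1
nrow(by_species)  # 3
```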
Define Data Frame as a Tibble
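A minimal sketch of this step, assuming the built-in iris data frame and the name `iris_tibble` that appears in the piping example:

```r
library(dplyr)  # dplyr re-exports as_tibble() from the tibble package

# Convert the built-in iris data frame to a tibble
iris_tibble <- as_tibble(iris)

# Printing a tibble shows its dimensions, column types, and the first rows
iris_tibble
```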
Summarising by Nesting Verbs
\(\star\) Here, you are nesting the functions group_by() and summarise() to compute the mean of the Sepal.Length column in each category of the Species column.
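The nested form described here might look like the sketch below. It assumes `iris_tibble` is the tibble version of iris and reuses the column name `mean_sepal_length` from the piping example; `nested_result` is an illustrative name.

```r
library(dplyr)

iris_tibble <- as_tibble(iris)

# group_by() is evaluated first, and its result is passed to summarise()
nested_result <- summarise(
  group_by(iris_tibble, Species),
  mean_sepal_length = mean(Sepal.Length)
)

nested_result
```

The code is read from the inside out: the innermost call runs first, which is the opposite of the order in which the verbs appear on the page.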
Summarising by Piping Verbs
iris_tibble |>
  # Step 1: Group by species
  group_by(Species) |>
  # Step 2: Calculate the mean of the Sepal.Length column
  #  - mean_sepal_length is the new column for the calculated mean
  summarise(mean_sepal_length = mean(Sepal.Length))
\(\star\) Here, you are using the piping operator |>, so you don’t need to nest the verbs; they are written line by line in a logical sequence.
What is a good strategy and best practice for summarising a data frame?
Case example: Using the iris dataset, we want to compute the mean of each numerical variable for each species and count the number of rows per species.
Post-summarisation results
## # A tibble: 3 × 6
##   Species        N mean_sepal_length mean_sepal_width mean_petal_length
##   <fct>      <int>             <dbl>            <dbl>             <dbl>
## 1 setosa        50              5.01             3.43              1.46
## 2 versicolor    50              5.94             2.77              4.26
## 3 virginica     50              6.59             2.97              5.55
## # ℹ 1 more variable: mean_petal_width <dbl>
\(\dagger\) The goal of the demonstration is to replicate the shown result using the group_by() and summarise() verbs.
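One way to reproduce the table is sketched below. The column names are taken from the printed result; the use of `n()` for the row count and the object name `iris_summary` are assumptions about how the demonstration might be written.

```r
library(dplyr)

iris_summary <- as_tibble(iris) |>
  # Step 1: Group by species
  group_by(Species) |>
  # Step 2: Count rows and compute the mean of each numerical column
  summarise(
    N = n(),
    mean_sepal_length = mean(Sepal.Length),
    mean_sepal_width  = mean(Sepal.Width),
    mean_petal_length = mean(Petal.Length),
    mean_petal_width  = mean(Petal.Width)
  )

iris_summary
```

Writing one named argument per line inside summarise() keeps each summary statistic easy to check against the expected columns.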
The purpose of this activity is for you to start developing proficiency in summarising a data frame using the group_by() and summarise() verbs.
.Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.