Summarising One Table

Fundamentals of Data Science

MTH-391A | Spring 2025 | University of Portland

January 29, 2025

Objectives

Previously… (1/2)

R Packages: The tidyverse is a collection of packages for data processing and visualization. It includes the tibble and dplyr packages.

library(tidyverse)

Tibbles: A tibble is a special kind of data frame provided by the tibble package in the tidyverse.
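
As a quick check of what "special kind of data frame" means, here is a minimal sketch that converts the built-in iris data frame to a tibble and inspects it. The name iris_tbl is only illustrative, and as_tibble() is used here as one common way to do the conversion.

```r
library(tidyverse)

# iris ships with base R as an ordinary data frame
class(iris)

# as_tibble() converts an existing data frame to a tibble
iris_tbl <- as_tibble(iris)
class(iris_tbl)

# A tibble is still a data frame, so data-frame functions keep working
is.data.frame(iris_tbl)
nrow(iris_tbl)

# Printing a tibble shows only the first rows, plus the type of each column
iris_tbl
```

The printed tibble lists each column's type (for example <dbl> or <fct>) under its name, which a plain data frame does not do.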

Previously… (2/2)

Chaining dplyr Verbs Using |>

Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.

tibble(iris) |> 
  # rule 1: choose only the "setosa" species
  filter(Species == "setosa") |> 
  # rule 2: pick the columns Sepal.Length and Petal.Length
  select(Sepal.Length,Petal.Length) |> 
  # rule 3: create a new column called length_ratio
  mutate(length_ratio = Petal.Length/Sepal.Length)

dplyr Verbs for Tidying Data

dplyr functions that operate on rows and columns.

Verb: Purpose & Example

group_by(): Groups rows by one or more columns, allowing operations to be performed within groups.

group_by(data, category)

summarise(): Reduces multiple rows into a single summary row per group.

summarise(data, new_var = function(var))

Here, function is a placeholder for a summary function such as mean().

\(\star\) group_by() and summarise() usually go together when you need to compute descriptive statistics for each category of a categorical variable.
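
To see why the two verbs are usually paired, the sketch below compares summarise() with and without a preceding group_by() on the iris data. The object names (overall, grouped, by_species) are illustrative only.

```r
library(tidyverse)

iris_tbl <- as_tibble(iris)

# Without grouping, summarise() collapses the whole table into one row
overall <- iris_tbl |>
  summarise(mean_sepal_length = mean(Sepal.Length))

# group_by() on its own changes no values; it only records the grouping
grouped <- iris_tbl |>
  group_by(Species)

# With grouping, summarise() returns one row per species
by_species <- grouped |>
  summarise(mean_sepal_length = mean(Sepal.Length))

nrow(overall)     # 1 row: a single overall mean
nrow(grouped)     # 150 rows: the data are unchanged, only grouped
nrow(by_species)  # 3 rows: one per species
```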

Example: Summarising One Table by Group

Define Data Frame as a Tibble

iris_tibble <- tibble(iris)

Summarising by Nesting Verbs

summarise(group_by(iris_tibble,Species),mean_sepal_length = mean(Sepal.Length))

\(\star\) Here, group_by() is nested inside summarise() to compute the mean of the Sepal.Length column for each category of the Species column.

Summarising by Piping Verbs

iris_tibble |> 
  # Step 1: Group by species
  group_by(Species) |> 
  # Step 2: Calculate the mean of the Sepal.Length column
  #  - mean_sepal_length is the new column for the calculated mean
  summarise(mean_sepal_length = mean(Sepal.Length))

\(\star\) Here, the pipe operator |> removes the need to nest the verbs. The verbs are written line by line, in the order they are applied.
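
The nested and piped forms are two spellings of the same computation. A minimal sketch to confirm this, where the names nested and piped are illustrative:

```r
library(tidyverse)

iris_tibble <- tibble(iris)

# Nested form: read from the inside out
nested <- summarise(
  group_by(iris_tibble, Species),
  mean_sepal_length = mean(Sepal.Length)
)

# Piped form: read from top to bottom
piped <- iris_tibble |>
  group_by(Species) |>
  summarise(mean_sepal_length = mean(Sepal.Length))

# Both forms should produce the same tibble
all.equal(nested, piped)
```

Since x |> f(y) is just another way of writing f(x, y), the choice between the two forms is about readability, not results.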

Data Summarization Strategy and Best Practice

What is a good strategy for summarising a data frame?

  1. Understand your data frame: its variables, their types, and the number of rows.
  2. Know the end goal of the summary, such as which computations to run on the numerical (or categorical) variables.
  3. Figure out which categorical variables, if any, you need to group by for the summaries.
  4. Decide the order of the categorical variables to group by before summarising.
  5. Implement the verbs in code line by line, check the output, then make adjustments.
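
The steps above can be sketched on a data set with more than one categorical variable. This example assumes the mpg data set that ships with ggplot2 (loaded as part of the tidyverse). The choice of drv and class as grouping variables and hwy as the summarised variable is purely illustrative.

```r
library(tidyverse)

# Step 1: understand the data frame (variables, types, number of rows)
glimpse(mpg)

# Steps 2-3: the goal is the row count and the mean highway mileage (hwy)
#            for each combination of the categorical variables drv and class
# Step 4: group by drv first, then class, so the output is ordered
#         by drv and then by class within each drv
mpg_summary <- mpg |>
  group_by(drv, class) |>
  summarise(
    n = n(),
    mean_hwy = mean(hwy),
    .groups = "drop"  # return an ungrouped tibble
  )

# Step 5: check the output, then adjust the code if needed
mpg_summary

# Sanity check: every row of mpg should be counted in exactly one group
sum(mpg_summary$n) == nrow(mpg)
```

Swapping the order to group_by(class, drv) gives the same groups but lists them by class first, which is the kind of adjustment step 5 refers to.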

In-Class Demonstrations

Case example: Using the iris dataset, we want to compute the mean of each numerical variable for each species and count the number of rows per species.

Post-summarisation results

## # A tibble: 3 × 6
##   Species        N mean_sepal_length mean_sepal_width mean_petal_length
##   <fct>      <int>             <dbl>            <dbl>             <dbl>
## 1 setosa        50              5.01             3.43              1.46
## 2 versicolor    50              5.94             2.77              4.26
## 3 virginica     50              6.59             2.97              5.55
## # ℹ 1 more variable: mean_petal_width <dbl>

\(\dagger\) The goal of the demonstration is to replicate the result shown above using the group_by() and summarise() verbs.
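
One possible way to reach this result is sketched below. It is not necessarily the version written in class. The column names are taken from the printed output, and iris_summary is an illustrative name.

```r
library(tidyverse)

iris_summary <- tibble(iris) |>
  # One group per species
  group_by(Species) |>
  summarise(
    # n() counts the rows in each group
    N = n(),
    # Mean of each numerical variable within the group
    mean_sepal_length = mean(Sepal.Length),
    mean_sepal_width  = mean(Sepal.Width),
    mean_petal_length = mean(Petal.Length),
    mean_petal_width  = mean(Petal.Width)
  )

iris_summary
```

When many columns need the same summary, dplyr's across() helper can shorten the summarise() call. The explicit version is used here because it maps directly onto the column names in the printed result.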

Activity: Summarise Data by Category

The purpose of this activity is for you to start developing proficiency in summarising a data frame using the group_by() and summarise() verbs.

  1. Log in to Posit Cloud and open the RStudio assignment MA4: Summarise Data by Category.
  2. Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
  3. Change the author in the YAML header.
  4. Read the provided instructions.
  5. Answer all exercise problems in the designated sections.