Basics of Data Wrangling

Fundamentals of Data Science

MTH-391A | Spring 2025 | University of Portland

January 27, 2025

Objectives

Previously… (1/2)

Running R Commands in Different Ways

Previously… (2/2)

R Packages

What are R packages?

Base R versus R Packages

| Aspect | Base R | R Packages |
|---|---|---|
| Availability | Comes pre-installed with R | Must be installed and loaded |
| Functionality | Offers basic statistical and programming tools | Provide advanced or specialized tools not included in base R |
| Customization | Limited to what is already available | Highly customizable; users can install or even create their own packages |
| Performance | Can sometimes be slower or more verbose | Often include optimized code or simpler syntax for complex tasks |
| Specialty | Mostly general-purpose tools for basic statistics | Often built for a specific purpose or field of knowledge |
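
As a rough illustration of the "simpler syntax" row, the sketch below sorts the built-in iris data set twice, once with base R indexing and once with a dplyr verb. It assumes dplyr is already installed; the variable names are only illustrative.

```r
library(dplyr)

# Base R: sort rows by Sepal.Length using order() inside bracket indexing
base_sorted <- iris[order(iris$Sepal.Length), ]

# dplyr: the same task expressed with a single verb
dplyr_sorted <- arrange(iris, Sepal.Length)

# Compare the first few rows of each result
head(base_sorted)
head(dplyr_sorted)
```

Both versions give the same ordering of Sepal.Length; the difference is mainly in how the request is written.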

The tidyverse Package

tidyverse is a collection of packages suited for data processing and visualization.

Core packages specifically for data processing:

  • dplyr provides a grammar for data transformation.
  • tidyr provides a set of functions that help you get data into a consistent form.
  • tibble provides a data frame that prioritizes simplicity, enforcing stricter checks to promote cleaner, more expressive code.

Installing Packages

How to install R Packages?

The following command installs the tidyverse package.

install.packages("tidyverse")

\(\dagger\) Try the above code sequence in your console. Then, install a different package called plotly.

\(\star\) Knitting an R Markdown document that calls this function in a code chunk will probably give you an error or warning message. Run it directly in the console instead.
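
A package only needs to be installed once, so one common pattern is to check whether it is already available before installing. The sketch below is one way to do that; install_if_missing() is a hypothetical helper written for this example, not a function that ships with R.

```r
# Hypothetical helper (not part of base R): install a package only if it is missing
install_if_missing <- function(pkg) {
  # requireNamespace() reports whether the package can be found, without attaching it
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Return (invisibly) whether the package is available afterwards
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Example call, typed in the console:
# install_if_missing("tidyverse")
```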

Loading Packages

How to load R Packages?

The function library() loads any installed package. For example, library("tidyverse") loads the tidyverse package.

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

\(\dagger\) Try the above code sequence in your console. Then, load the package plotly.
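
The Conflicts section of the startup message says that dplyr::filter() masks stats::filter(). A minimal sketch of what that means in practice, assuming dplyr is installed (the variable names are illustrative):

```r
library(dplyr)

# After loading dplyr, a bare filter() call refers to dplyr's version
setosa <- filter(iris, Species == "setosa")

# The package::function form names the package explicitly,
# so it does not depend on which version is masked
setosa_explicit <- dplyr::filter(iris, Species == "setosa")

# The masked version is still reachable by its full name;
# stats::filter() applies a linear filter (here a 3-point moving average)
smoothed <- stats::filter(1:10, rep(1 / 3, 3))
```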

Variables

How are variables created and stored in R?

\(\dagger\) Create a variable named ria and initially define it as a number. Then, redefine ria as a vector. What happens to the original value of ria (the number) after it is redefined as a vector?

\(\star\) Knowing and keeping track of what variables are defined is key to understanding why some errors occur.
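
A few base R functions help with keeping track of what is defined. The sketch below uses an illustrative variable name, exam_score, which is an assumption made for this example only.

```r
# Create a variable with the assignment operator
exam_score <- 42

# Check whether a variable is currently defined
exists("exam_score")

# List every variable defined in the current environment
ls()

# Remove the variable; using it afterwards raises an "object not found" error
rm(exam_score)
exists("exam_score")
```

The Environment pane in RStudio shows the same information as ls().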

Data Frame, An Introduction

What is a data frame?

Data Frame Key Characteristics

Base R: These data frames are data structures that come with base R.

Tibbles: Tibbles are a special kind of data frame provided by the tibble package.

\(\star\) A base R data frame can be converted to a tibble using the tibble() function. The iris data set is a built-in data set in the datasets package, which comes with base R.
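
A minimal sketch of the conversion, assuming the tibble package is installed. It uses as_tibble(), the tibble package's dedicated conversion function, rather than tibble(); the name iris_tbl is illustrative.

```r
library(tibble)

# iris ships with base R as an ordinary data frame
class(iris)

# as_tibble() converts an existing data frame to a tibble
iris_tbl <- as_tibble(iris)
class(iris_tbl)

# A tibble is still a data frame, so data frame tools keep working
is.data.frame(iris_tbl)
```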

Differences of R Data Frames and Tibbles

| Feature | R Data Frames | Tibbles |
|---|---|---|
| Printing | Full display | Abbreviated, neat display |
| Subsetting | Selecting a single column with [ returns a vector | [ always returns a tibble |
| String Handling | Strings can become factors (the default before R 4.0) | Strings remain characters |
| Column Names | Allows partial matching with $ | No partial matching, clearer errors |
| Error Messages | Less informative | More user-friendly |
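
Two of these differences can be seen on a tiny example. The sketch below assumes the tibble package is installed; the data and the column name sepal_length are made up for illustration.

```r
library(tibble)

# A tiny data frame and its tibble counterpart
df <- data.frame(sepal_length = c(5.1, 4.9))
tb <- as_tibble(df)

# Partial matching: a base data frame silently completes the column name
df$sepal

# A tibble does not guess: it returns NULL and warns about an unknown column
tb$sepal

# Selecting one column with [ : a vector versus a one-column tibble
class(df[, "sepal_length"])
class(tb[, "sepal_length"])
```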

R Data Frames vs Tibbles, An Example

Sub-setting Columns

glimpse(iris[,"Sepal.Width"])
##  num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
glimpse(tibble(iris)[,"Sepal.Width"])
## Rows: 150
## Columns: 1
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4…

\(\dagger\) Try the above code examples using a different column on your console!

\(\star\) Here, the glimpse() function is used to simplify the output of the subsets.

Data Frame Processing Using dplyr

What is dplyr?

Why use dplyr?

Core Verbs for Rows of dplyr

dplyr functions that operate on rows.

| Verb | Purpose | Example |
|---|---|---|
| filter() | Chooses rows based on conditions | filter(data, col1 > 10) |
| arrange() | Reorders rows | arrange(data, col1) |
| distinct() | Finds the unique rows (here, the unique values of col1) | distinct(data, col1) |
| count() | Finds the unique values, then counts the number of occurrences of each | count(data, col1) |

\(\star\) Notice that the data frame, data, is always the first argument of each verb. The filter() verb uses logical operators, which we will discuss in more detail.
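
The row verbs can be tried on the built-in iris data set. This is a small sketch assuming dplyr is installed; the result names are illustrative.

```r
library(dplyr)

# Rows where Sepal.Length exceeds 7
long_sepals <- filter(iris, Sepal.Length > 7)

# Rows sorted from shortest to longest petal
by_petal <- arrange(iris, Petal.Length)

# One row per unique value of Species
species_list <- distinct(iris, Species)

# Number of rows for each value of Species
species_counts <- count(iris, Species)
species_counts
```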

Core Verbs for Columns of dplyr

dplyr functions that operate on columns.

| Verb | Purpose | Example |
|---|---|---|
| mutate() | Adds or modifies columns | mutate(data, new_col = col1 - col2) |
| select() | Chooses specific columns | select(data, col1, col2) |
| rename() | Renames specific columns (new name = old name) | rename(data, loc1 = col1) |
| relocate() | Moves columns (to the front by default) | relocate(data, col1) |

\(\star\) The = signs in the column verbs name or assign columns; they are not the comparison operator ==. Of the verbs shown here, only filter() uses logical operators.
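
The column verbs can also be tried on iris. The sketch below assumes dplyr is installed; the new column and result names are illustrative choices.

```r
library(dplyr)

# Add a column; the single = gives the new column its name
with_ratio <- mutate(iris, length_ratio = Petal.Length / Sepal.Length)

# Keep only two columns
lengths_only <- select(iris, Sepal.Length, Petal.Length)

# Rename a column using new_name = old_name
renamed <- rename(iris, species = Species)

# Move Species to the front
species_first <- relocate(iris, Species)

# By contrast, filter() takes a comparison written with ==
setosa <- filter(iris, Species == "setosa")
```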

Chaining dplyr Verbs Using |>

What is |>?

Example: Filter One Table Using |>

Define Data Frame as a Tibble

iris_tibble <- tibble(iris)

Simple Example: The following code sequence filters the iris data frame (in tibble form) to include only the “setosa” species.

iris_tibble |> 
  filter(Species == "setosa")

\(\star\) Notice that the first line is the data frame itself and the next line is the verb; the data frame is not typed into the first argument of filter(). This is a common way to organize verbs in a pipeline.
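
To see what the pipe does, the same filter can be written with and without it. This sketch assumes dplyr and tibble are installed and that R is version 4.1.0 or later, which is when the native pipe |> was introduced; the names direct and piped are illustrative.

```r
library(dplyr)
library(tibble)

iris_tibble <- tibble(iris)

# Without the pipe: the data frame is written as the first argument
direct <- filter(iris_tibble, Species == "setosa")

# With the pipe: the left-hand side becomes the first argument of filter()
piped <- iris_tibble |>
  filter(Species == "setosa")

# Check that both forms give the same result
identical(direct, piped)
```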

Example: Transform One Table Using |>

Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.

iris_tibble |> 
  # rule 1: choose only the "setosa" species
  filter(Species == "setosa") |> 
  # rule 2: pick the columns Sepal.Length and Petal.Length
  select(Sepal.Length,Petal.Length) |> 
  # rule 3: create a new column called length_ratio
  mutate(length_ratio = Petal.Length/Sepal.Length)

\(\dagger\) Try the above code sequence in your console with the virginica species, and compute the ratio of Petal.Width to Sepal.Width.

\(\star\) The verbs do not explicitly take the resulting data frames as the first argument because the pipe operator automatically passes the output of the previous step as the input to the next verb in the sequence.

Data Transformation Strategy and Best Practice

What is a good strategy for transforming a data frame?

  1. Understand your data frame, its variables, the variable types, and the number of rows.
  2. Know the end goal of the transformation by envisioning what the result should look like.
  3. Figure out which verbs are appropriate to use.
  4. Envision a sequence of verbs that would lead you to the end goal.
  5. Implement the verbs in code line by line, check the output, then make adjustments.
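
The steps above can be sketched on iris. The goal chosen here (versicolor rows sorted by a made-up sepal_area column) is only an assumption for illustration, as are the step1, step2, and step3 names.

```r
library(dplyr)

# Step 1: understand the data (variables, types, number of rows)
glimpse(iris)

# Steps 2-4: the goal is the versicolor rows, with a new sepal_area column,
# sorted by that column; this suggests filter(), mutate(), then arrange()

# Step 5: add one verb at a time and check each intermediate result
step1 <- iris |>
  filter(Species == "versicolor")
nrow(step1)

step2 <- step1 |>
  mutate(sepal_area = Sepal.Length * Sepal.Width)
head(step2)

step3 <- step2 |>
  arrange(sepal_area)
head(step3)
```

Once each step looks right, the intermediate variables can be collapsed into a single pipeline.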

In-Class Demonstration

Case example: Using the iris dataset, we want to determine the number of rows (or observations) for each species where the length ratio exceeds \(0.80\) and the width ratio exceeds \(0.50\). The length ratio is defined as Petal.Length divided by Sepal.Length, and the width ratio is defined as Petal.Width divided by Sepal.Width.

The original number of rows pre-transformation

## # A tibble: 3 × 2
##   Species        n
##   <fct>      <int>
## 1 setosa        50
## 2 versicolor    50
## 3 virginica     50

Post-transformation results

## # A tibble: 2 × 2
##   Species        n
##   <fct>      <int>
## 1 versicolor     2
## 2 virginica     40

\(\dagger\) The goal of the demonstration is to replicate the shown data frames using the dplyr verbs.
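
One possible pipeline for this case is sketched below. It is an attempt under the definitions stated above, not necessarily the sequence used in class; the column names length_ratio and width_ratio and the result name ratio_counts are illustrative. It assumes dplyr and tibble are installed.

```r
library(dplyr)
library(tibble)

iris_tibble <- tibble(iris)

# Pre-transformation: number of rows for each species
iris_tibble |>
  count(Species)

# Post-transformation: compute both ratios, keep rows meeting both conditions,
# then count the remaining rows for each species
ratio_counts <- iris_tibble |>
  mutate(
    length_ratio = Petal.Length / Sepal.Length,
    width_ratio = Petal.Width / Sepal.Width
  ) |>
  filter(length_ratio > 0.80, width_ratio > 0.50) |>
  count(Species)

ratio_counts
```

Separating conditions with a comma inside filter() keeps only the rows where every condition holds.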

Activity: Transform a Data Frame

The purpose of this activity is for you to start developing proficiency in transforming a data frame using dplyr verbs.

  1. Log in to Posit Cloud and open the RStudio assignment MA3: Transform a Data Frame.
  2. Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
  3. Change the author in the YAML header.
  4. Read the provided instructions.
  5. Answer all exercise problems in the designated sections.