Base R versus R Packages

Aspect	Base R	R Packages
Availability	Comes pre-installed with R	Must be installed and loaded
Functionality	Offers basic statistical and programming tools	Provides advanced or specialized tools not included in base R
Customization	Limited to what’s already available	Highly customizable; users can install or even create their own packages
Performance	Can sometimes be slower or more verbose	Packages often provide optimized functions or simpler syntax for complex tasks
Specialty	General-purpose statistics and programming	Often built for a specific purpose or knowledge domain

The `tidyverse` Package

tidyverse is a collection of packages suited for data processing and visualization.

Core packages specifically for data processing:

dplyr provides a grammar for data transformation.
tidyr provides a set of functions that help you get data in consistent form.
tibble provides a data frame that prioritizes simplicity, enforcing stricter checks to promote cleaner, more expressive code.

Installing Packages

How to install R Packages?

The following two methods install the tidyverse package.

Using the “Tools” menu

Using the R Console Directly

install.packages("tidyverse")

\(\dagger\) Try the above code sequence in your console. Then, install a different package called plotly.

\(\star\) Knitting an R Markdown document with install.packages() in a code chunk will probably produce an error or warning message. Run it directly in the console instead.
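
One way to sidestep the knitting issue is to install a package only when it is missing. The following is a minimal sketch, not part of the course material; the helper name install_if_missing is hypothetical.

```r
# Hypothetical helper: install a package only if it is not already installed.
install_if_missing <- function(pkg) {
  # requireNamespace() returns TRUE when the package can be loaded
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))  # nothing was installed
  }
  install.packages(pkg)
  invisible(TRUE)             # an installation was attempted
}

# "stats" ships with R, so this call installs nothing
install_if_missing("stats")
```

Calling install_if_missing("plotly") would follow the same pattern for a package that may not be installed yet.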

Loading Packages

How to load R Packages?

The function library() loads any installed package. In this case library("tidyverse") loads the tidyverse package specifically.

library("tidyverse")

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

\(\dagger\) Try the above code sequence in your console. Then, load the package plotly.
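
The conflict messages above mean that two packages define a function with the same name. As a hedged illustration, the pkg::function() syntax picks one explicitly; the variable names setosa and smoothed are made up for this sketch.

```r
library(dplyr)  # part of the tidyverse

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
setosa <- dplyr::filter(iris, Species == "setosa")
nrow(setosa)

# stats::filter() is unrelated: it applies a linear filter to a series,
# here a 3-point moving average
smoothed <- stats::filter(1:5, rep(1/3, 3))
as.numeric(smoothed)
```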

Variables

How are variables created and stored in R?

\(\dagger\) Create a variable named ria and initially define it as a number. Then, redefine ria as a vector. What happens to the original value of ria (the number) after it is redefined as a vector?

\(\star\) Knowing and keeping track of what variables are defined is key to understanding why some errors occur.
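
A small sketch of creating and redefining a variable, using a placeholder name x (an assumption for illustration, not a name from the exercise):

```r
# Create a variable with the assignment operator <-
x <- 5
class(x)      # "numeric"
length(x)     # 1

# Assigning again replaces the old value; R does not keep the previous one
x <- c("a", "b", "c")
class(x)      # "character"
length(x)     # 3

# ls() lists the variables currently defined in the environment
"x" %in% ls()
```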

Data Frame, An Introduction

What is a data frame?

A data frame is a key data structure in R used for storing data sets.
It organizes data into rows and columns, similar to a spreadsheet.
Each column can contain data of a different variable type (e.g., numeric, character, logical).

Data Frame Key Characteristics

Base R: These data frames are data structures that come with base R.

Tibbles: Tibbles are special kinds of data frames using the tibble package.

\(\star\) A base R data frame can be converted to a tibble using the tibble() or as_tibble() function. The iris data set is a built-in data set in the datasets package, which comes with base R.
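
A minimal sketch of the conversion, using as_tibble() from the tibble package; the name iris_tbl is chosen here only for illustration.

```r
library(tibble)

# as_tibble() converts an existing data frame to a tibble
iris_tbl <- as_tibble(iris)

class(iris)          # "data.frame"
is_tibble(iris_tbl)  # checks whether the object is a tibble

# The data are unchanged; only the class and printing behavior differ
dim(iris_tbl)
```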

Differences of R Data Frames and Tibbles

Feature	R Data Frames	Tibbles
Printing	Full display	Abbreviated, neat display
Subsetting	Selecting a single column with `[` returns a vector	`[` always returns a tibble
String Handling	Strings may be converted to factors (the default before R 4.0)	Strings remain characters
Column Names	`$` allows partial matching	No partial matching, clearer errors
Error Messages	Less informative	More user-friendly
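
The column-name row can be illustrated with a short sketch. The abbreviated name Sepal.W is deliberately incomplete, and iris_tbl is a name assumed for this example.

```r
library(tibble)

iris_tbl <- as_tibble(iris)

# Base R data frame: $ partially matches "Sepal.W" to Sepal.Width
head(iris$Sepal.W, 3)

# Tibble: no partial matching; the result is NULL with a warning
partial <- suppressWarnings(iris_tbl$Sepal.W)
is.null(partial)
```

Partial matching can hide typos, which is one reason tibbles refuse to guess.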

R Data Frames vs Tibbles, An Example

Sub-setting Columns

Base R data frames return a vector.

glimpse(iris[,"Sepal.Width"])

##  num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

Tibbles return tibbles.

glimpse(tibble(iris)[,"Sepal.Width"])

## Rows: 150
## Columns: 1
## $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4…

\(\dagger\) Try the above code examples in your console using a different column!

\(\star\) Here, the glimpse() function is used to simplify the output of the subsets.
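
The same difference can be checked without printing the data. This is a sketch under the assumption that iris_tbl holds the tibble version of iris; the `[[ ]]` line shows one way to get a plain vector from either structure.

```r
library(tibble)

iris_tbl <- as_tibble(iris)

# [ , "col"] on a base R data frame drops to a plain vector
is.numeric(iris[, "Sepal.Width"])

# The same call on a tibble keeps a one-column tibble
is_tibble(iris_tbl[, "Sepal.Width"])

# [[ ]] extracts the column as a vector from either structure
identical(iris_tbl[["Sepal.Width"]], iris[["Sepal.Width"]])
```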

Data Frame Processing Using `dplyr`

What is dplyr?

Overview:
- dplyr is a powerful R package designed for data processing.
- It is part of the tidyverse ecosystem.
Key Features:
- Simplifies common data wrangling tasks.
- Intuitive syntax with chaining using the pipe operator |>.

Why use dplyr?

Ease of Use: Clear, human-readable code.
Efficiency: Built-in functions optimized for performance.
Consistency: Works seamlessly with other tidyverse packages such as ggplot2 for visualizations.
Data Frames and Beyond: Works with data frames, tibbles, and databases.

Core Verbs for Rows of `dplyr`

dplyr functions that operate on rows.

Verb	Purpose & Example
`filter()`	Chooses rows based on conditions `filter(data, col1 > 10)`
`arrange()`	Reorders rows `arrange(data, col1)`
`distinct()`	Finds all the unique rows `distinct(data, col1)`
`count()`	Finds all unique rows, then counts the number of occurrences `count(data, col1)`

\(\star\) Notice that the data frame data is always the first argument of each verb in these examples. The filter() verb uses logical operators, which we will discuss in more detail.
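
The row verbs in the table can be tried on the built-in iris data. This is an illustrative sketch; the thresholds and result names are arbitrary choices, not part of the original examples.

```r
library(dplyr)

# filter(): rows where Sepal.Length is greater than 7
long_sepals <- filter(iris, Sepal.Length > 7)

# arrange(): reorder rows from smallest to largest Petal.Length
by_petal <- arrange(iris, Petal.Length)

# distinct(): the unique values of Species
species <- distinct(iris, Species)

# count(): number of rows for each Species
per_species <- count(iris, Species)
per_species
```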

Core Verbs for Columns of `dplyr`

dplyr functions that operate on columns.

Verb	Purpose & Example
`mutate()`	Adds or modifies columns `mutate(data, new_col = col1-col2)`
`select()`	Chooses specific columns `select(data, col1, col2)`
`rename()`	Renames specific columns `rename(data, new_name = col1)`
`relocate()`	Moves columns to the front `relocate(data, col1)`

\(\star\) The = signs in column verbs assign names; they are not the comparison operator ==. Of the verbs shown here, only filter() takes logical conditions.
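
The column verbs in the table can be tried on iris in the same way. The new column name length_diff and the result names are assumptions made for this sketch.

```r
library(dplyr)

# mutate(): add a column computed from existing columns
with_diff <- mutate(iris, length_diff = Sepal.Length - Petal.Length)

# select(): keep only the named columns
two_cols <- select(iris, Species, Sepal.Length)

# rename(): new_name = old_name
renamed <- rename(iris, species = Species)

# relocate(): move Species to the front
species_first <- relocate(iris, Species)
names(species_first)
```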

Chaining `dplyr` Verbs Using `|>`

What is |>?

The pipe operator.
It is used to chain multiple verbs in a logical sequence.
It starts with a data frame and ends with a transformed data frame.
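
As a small illustration outside of data frames, x |> f() is another way of writing f(x). The native pipe |> requires R 4.1.0 or later; the names values, nested, and piped are chosen only for this sketch.

```r
# x |> f() is the same as f(x)
values <- c(1, 4, 9)

nested <- sum(sqrt(values))            # read inside-out
piped  <- values |> sqrt() |> sum()    # read left to right

identical(nested, piped)
```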

Example: Filter One Table Using `|>`

Define Data Frame as a Tibble

iris_tibble <- tibble(iris)

Simple Example. The following code sequence filters the iris data frame (in tibble form) to include only the “setosa” species.

iris_tibble |> 
  filter(Species == "setosa")

\(\star\) Notice that the first line is the data frame itself and the next line is the verb; the data frame is not written as the first argument of filter(). Putting each verb on its own line is a common way to organize a pipeline.

Example: Transform One Table Using `|>`

Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.

iris_tibble |> 
  # rule 1: choose only the "setosa" species
  filter(Species == "setosa") |> 
  # rule 2: pick the columns Sepal.Length and Petal.Length
  select(Sepal.Length, Petal.Length) |> 
  # rule 3: create a new column called length_ratio
  mutate(length_ratio = Petal.Length / Sepal.Length)

\(\dagger\) Try the above code sequence in your console with the virginica species, and compute the ratio of Petal.Width to Sepal.Width.

\(\star\) The verbs do not explicitly take the resulting data frames as the first argument because the pipe operator automatically passes the output of the previous step as the input to the next verb in the sequence.
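
To see what the pipe is doing, the pipeline from this example can be compared with the equivalent nested call, where each result is written explicitly as the first argument of the next verb. The names piped and nested are assumptions for this sketch.

```r
library(dplyr)
library(tibble)

iris_tibble <- tibble(iris)

# Pipeline form
piped <- iris_tibble |>
  filter(Species == "setosa") |>
  select(Sepal.Length, Petal.Length) |>
  mutate(length_ratio = Petal.Length / Sepal.Length)

# Equivalent nested form: each result is the first argument of the next verb
nested <- mutate(
  select(filter(iris_tibble, Species == "setosa"), Sepal.Length, Petal.Length),
  length_ratio = Petal.Length / Sepal.Length
)

all.equal(piped, nested)
```

Both forms compute the same table; the pipeline simply reads in the order the steps happen.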

Data Transformation Strategy and Best Practice

What is a good strategy and best practice for transforming a data frame?

Understand your data frame, its variables, the variable types, and how many rows.
Know the end goal of the transformation by envisioning what the result should look like.
Figure out which verbs are appropriate to use.
Envision a sequence of verbs that would lead you to the end goal.
Implement the verbs in code line-by-line, check the output, then make adjustments.
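
The steps above can be sketched on iris with a made-up goal: keep only the versicolor rows, two columns, with the widest sepals first. The goal, the step names, and the use of the dplyr helper desc() for descending order are assumptions for illustration.

```r
library(dplyr)
library(tibble)

iris_tibble <- tibble(iris)

# Understand the data: variables, types, and number of rows
glimpse(iris_tibble)

# Add one verb at a time and check the result after each step
step1 <- iris_tibble |> filter(Species == "versicolor")
nrow(step1)

step2 <- step1 |> select(Sepal.Length, Sepal.Width)
names(step2)

step3 <- step2 |> arrange(desc(Sepal.Width))
head(step3)
```

Once each step looks right, the intermediate variables can be collapsed into a single pipeline.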

In-Class Demonstration

Case example: Using the iris dataset, we want to determine the number of rows (or observations) for each species where the length ratio exceeds \(0.80\) and the width ratio exceeds \(0.50\). The length ratio is defined as Petal.Length divided by Sepal.Length, and the width ratio is defined as Petal.Width divided by Sepal.Width.

The original number of rows pre-transformation

## # A tibble: 3 × 2
##   Species        n
##   <fct>      <int>
## 1 setosa        50
## 2 versicolor    50
## 3 virginica     50

Post-transformation results

## # A tibble: 2 × 2
##   Species        n
##   <fct>      <int>
## 1 versicolor     2
## 2 virginica     40

\(\dagger\) The goal of the demonstration is to replicate the shown data frames using the dplyr verbs.
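
One possible way to approach the case example is sketched below. It is not necessarily the sequence used in class; the column names length_ratio and width_ratio and the variable result are assumptions. Conditions separated by a comma inside filter() must all be true.

```r
library(dplyr)
library(tibble)

iris_tibble <- tibble(iris)

# Pre-transformation: number of rows per species
iris_tibble |>
  count(Species)

# Post-transformation: rows per species that pass both ratio thresholds
result <- iris_tibble |>
  mutate(
    length_ratio = Petal.Length / Sepal.Length,
    width_ratio  = Petal.Width / Sepal.Width
  ) |>
  filter(length_ratio > 0.80, width_ratio > 0.50) |>
  count(Species)

result
```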

Activity: Transform a Data Frame

The purpose of this activity is for you to start developing proficiency in transforming a data frame using dplyr verbs.

Log in to Posit Cloud and open the RStudio assignment MA3: Transform a Data Frame.
Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
Change the author in the YAML header.
Read the provided instructions.
Answer all exercise problems in the designated sections.

Basics of Data Wrangling

Fundamentals of Data Science

Objectives

Previously… (1/2)

Previously… (2/2)

R Packages

Base R versus R Packages

The `tidyverse` Package

Installing Packages

Loading Packages

Variables

Data Frame, An Introduction

Data Frame Key Characteristics

Differences of R Data Frames and Tibbles

R Data Frames vs Tibbles, An Example

Data Frame Processing Using `dplyr`

Core Verbs for Rows of `dplyr`

Core Verbs for Columns of `dplyr`

Chaining `dplyr` Verbs Using `|>`

Example: Filter One Table Using `|>`

Example: Transform One Table Using `|>`

Data Transformation Strategy and Best Practice

In-Class Demonstration

Activity: Transform a Data Frame
