MTH-361A | Spring 2025 | University of Portland
January 29, 2025
dplyr package
Types of Variables
R Packages: tidyverse is a collection of packages suited for data processing and visualization, which includes the tibble and dplyr packages.
Tibbles: Tibbles are special kinds of data frames, created using the tibble package in tidyverse.
Descriptive statistics involves organizing, summarizing, and presenting data in an informative way. It focuses on describing and understanding the main features of a dataset.
For Numerical Variables
For Categorical Variables
The sample mean, denoted as \(\overline{x}\), can be calculated as \[\overline{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\] where \(x_1, x_2, \cdots, x_n\) represent the \(n\) observed values.
In other words, the mean (or average) is the sum of all data points divided by the number of points: \[\text{mean} = \frac{\text{sum of all data points}}{\text{number of data points}}.\]
Example: What is the mean of the data set \(7,1,2,4,6,3,2,7\)?
\[ \begin{aligned} \overline{x} & = \frac{7+1+2+4+6+3+2+7}{8} \\ & = 4 \end{aligned} \]
So, the mean is \(4\).
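The hand calculation can be checked in R. This is a minimal sketch; the variable names (x, mean_by_formula, mean_builtin) are illustrative choices, not taken from the slides.

```r
# Data set from the example
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

# Mean by the formula: sum of all data points divided by the number of points
mean_by_formula <- sum(x) / length(x)

# Mean using the built-in function
mean_builtin <- mean(x)

print(mean_by_formula)
print(mean_builtin)
```

Both approaches should agree with the value computed by hand.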
The median is the middle value when the data is sorted.
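The median is computed using the following cases: if the number of data points is odd, the median is the middle value of the sorted data; if it is even, the median is the average of the two middle values.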
Example: What is the median of the data set \(7,1,2,4,6,3,2,7\)?
The number of data points is \(8\), an even number. \[\text{sorted data} \longrightarrow 1,2,2,\color{blue}{\mathbf{3}},\color{blue}{\mathbf{4}},6,7,7\]
\[ \begin{aligned} \text{median} & = \frac{\text{sum of two middle values}}{2} \\ & = \frac{\color{blue}{\mathbf{3}}+\color{blue}{\mathbf{4}}}{2} \\ & = 3.5 \end{aligned} \]
## [1] 1 2 2 3 4 6 7 7
## [1] 3.5
So, the median is \(3.5\).
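The two R output lines above are consistent with sorting the data and then taking the median. The code that produced them is not shown, so the following is only a sketch of one way to obtain them; the variable names are assumptions.

```r
# Data set from the example
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

# Sort the data in ascending order
sorted_x <- sort(x)
print(sorted_x)

# Even number of points: average the two middle values
n <- length(sorted_x)
median_by_hand <- (sorted_x[n / 2] + sorted_x[n / 2 + 1]) / 2
print(median_by_hand)

# Built-in function (handles both the odd and even cases)
print(median(x))
```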
A percentile is a measure used to indicate the value below which a given percentage of observations fall.
The formula for computing the percentile rank and the percentile is given by \[\text{percentile of } x = \frac{\text{number of values below } x}{\text{total number of values}} \times 100\] where \(x\) is a value in the data.
\(\star\) Key Idea: The percentile is the value below which a certain percentage of the data lies.
Example: What is the percentile of \(6\) in the data set \(7,1,2,4,6,3,2,7\)?
\[\text{sorted data} \longrightarrow \color{red}{\mathbf{1}},\color{red}{\mathbf{2}},\color{red}{\mathbf{2}},\color{red}{\mathbf{3}},\color{red}{\mathbf{4}},\color{blue}{\mathbf{6}},7,7\]
\[ \begin{aligned} \text{percentile of } \color{blue}{\mathbf{6}} & = \frac{5}{8} \times 100 \\ & = 62.5 \end{aligned} \]
So, the data value \(6\) is at the \(62.5\)th percentile; that is, 62.5% of the data is below \(6\).
Example: What is the 30th percentile of the data set \(7,1,2,4,6,3,2,7\)? (What is the value in the data below which 30% of the data lies?)
\[ \begin{aligned} 30 & = \frac{\text{number of values below } x}{8} \times 100 \\ 0.30 \times 8 & = \text{number of values below } x \\ 2.40 & = \text{number of values below } x \\ 2 & \longleftarrow \text{rounded to nearest integer} \end{aligned} \]
\[\text{sorted data} \longrightarrow \color{red}{\mathbf{1}},\color{red}{\mathbf{2}},\color{blue}{\mathbf{2}},3,4,6,7,7\]
\[ \begin{aligned} 30\text{th percentile} & \approx 2 \end{aligned} \]
So, the \(30\)th percentile is approximately \(2\) when computed by hand. R's quantile() function, which by default interpolates linearly between neighboring sorted values, gives \(2.1\). The two answers differ because the dataset is small and the hand calculation rounds the position to the nearest integer.
\(\star\) Note: Whether the \(30\)th percentile is reported as \(2\) or \(2.1\) depends on the method used to compute percentiles. With R's default interpolation, \(2\) is the \(25\)th percentile and \(2.1\) is the \(30\)th percentile, so the hand-computed value is a close approximation.
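Both percentile calculations can be sketched in R. The variable names are illustrative assumptions. The type argument of quantile() selects the computation method: the default (type 7) interpolates linearly, while type 1 always returns an actual data value.

```r
# Data set from the examples
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

# Percentile rank of 6: share of values strictly below 6, times 100
rank_of_6 <- mean(x < 6) * 100
print(rank_of_6)

# 30th percentile with R's default method (type 7, linear interpolation)
p30_default <- quantile(x, probs = 0.30)
print(p30_default)

# 30th percentile with type 1, which returns one of the observed values
p30_type1 <- quantile(x, probs = 0.30, type = 1)
print(p30_type1)
```

Comparing the two quantile() results shows how the choice of method affects the answer on a small dataset.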
\(\mathbf{Q}_1\) (the 1st quartile), \(\mathbf{Q}_2\) (the 2nd quartile), \(\mathbf{Q}_3\) (the 3rd quartile), and the IQR (interquartile range) are statistical measures used to describe the spread and distribution of a dataset:
\(\star\) Key Idea: The quartiles split the sorted numerical data into four equal parts, each containing 25% of the observations: 25% of the observations lie below \(Q_1\), 50% lie below \(Q_2\), and 75% lie below \(Q_3\).
Quartiles are a special case of quantiles, which are values that split sorted data into equal parts. Quartiles are just quantiles where we split the data into four parts.
Example: What are the quartiles of the data set \(7,1,2,4,6,3,2,7\)?
\[\text{sorted data} \longrightarrow 1,2,2,3,4,6,7,7\]
Note that the number of data points is \(8\), an even number.
\[ \begin{aligned} 25\text{th percentile} & \approx 2 \\ 50\text{th percentile (median)} & = \frac{3+4}{2} = 3.50 \\ 75\text{th percentile} & \approx 6 \\ \end{aligned} \]
Note that these are approximations due to the small dataset size, but the concept of percentiles still holds.
## 0% 25% 50% 75% 100%
## 1.00 2.00 3.50 6.25 7.00
So, the quartiles are \(Q_1 = 2\), \(Q_2 = 3.50\), and \(Q_3 = 6.25\).
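The R output above matches what quantile() returns with its default settings. The original call is not shown, so this is a sketch with assumed variable names.

```r
# Data set from the example
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

# With no probs argument, quantile() returns the 0%, 25%, 50%, 75%, and 100% points
quartiles <- quantile(x)
print(quartiles)
```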
Example: What is the IQR of the data set \(7,1,2,4,6,3,2,7\)?
\[ \begin{aligned} & Q_1 \approx 2 \\ & Q_2 \text{ (median)} = \frac{3+4}{2} = 3.50 \\ & Q_3 \approx 6.25 \end{aligned} \]
\[ \begin{aligned} \text{IQR} & = Q_3 - Q_1 \\ & = 6.25 - 2 \\ & = 4.25 \end{aligned} \]
So, the IQR is \(4.25\).
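As a check, the IQR can be computed in R either from the quartiles or with the built-in IQR() function. The variable names below are illustrative assumptions.

```r
# Data set from the example
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

# First and third quartiles (R's default quantile method)
q <- quantile(x, probs = c(0.25, 0.75))

# IQR = Q3 - Q1
iqr_by_hand <- unname(q[2] - q[1])
print(iqr_by_hand)

# Built-in function
print(IQR(x))
```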
The range is the difference between the maximum and minimum of the numerical data.
The formula for the range is given by \[\text{range} = x_{max} - x_{min}\] where \(x_{max}\) is the maximum value and \(x_{min}\) is the minimum value.
Example: What is the range of the data set \(7,1,2,4,6,3,2,7\)?
So, the range is \(6\).
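The maximum of the data is \(7\) and the minimum is \(1\), so the range is \(7 - 1 = 6\). A short R sketch of the same calculation follows (variable names are assumptions). Note that R's range() function returns the minimum and maximum rather than their difference.

```r
# Data set from the example
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

# Range as defined above: maximum minus minimum
range_value <- max(x) - min(x)
print(range_value)

# range() returns c(min, max); diff() turns that pair into the difference
print(diff(range(x)))
```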
The variance is roughly the average squared deviation from the mean.
The formula for the variance is given by \[s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}\] where \(x_1, x_2, \cdots, x_n\) are the data points, \(\bar{x}\) is the sample mean, and \(n\) is the sample size.
What is the meaning of the variance?
Example: What is the variance of the data set \(7,1,2,4,6,3,2,7\)?
\[ \begin{aligned} \text{variance} \longrightarrow s^2 & = \frac{\begin{matrix} (7-4)^2 + (1-4)^2 + \\ (2-4)^2 + (4-4)^2 + \\ (6-4)^2 + (3-4)^2 + \\ (2-4)^2 + (7-4)^2 \end{matrix}}{8-1} \\ & = 5.714 \end{aligned} \]
So, the variance is \(5.714\).
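The variance calculation can be reproduced step by step in R. This is a minimal sketch with assumed variable names; R's var() uses the same \(n-1\) denominator as the formula above.

```r
# Data set from the example
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

# Deviations of each data point from the sample mean
deviations <- x - mean(x)

# Sum of squared deviations divided by n - 1
variance_by_hand <- sum(deviations^2) / (length(x) - 1)
print(round(variance_by_hand, 3))

# Built-in sample variance
print(var(x))
```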
Why do we use the squared deviation in the calculation of variance?
\(\star\) Variance is the average of the squared differences between each data point and the mean.
The standard deviation (SD) is the square root of the variance, and has the same units as the data.
The formula for the standard deviation is given by \[s = \sqrt{s^2}\] where \(s^2\) is the variance.
What is the meaning of the standard deviation?
Example: What is the standard deviation of the data set \(7,1,2,4,6,3,2,7\)?
\[ \begin{aligned} \text{mean} \longrightarrow \bar{x} & = \frac{7+1+2+4+6+3+2+7}{8} \\ & = 4 \end{aligned} \]
\[ \begin{aligned} \text{variance} \longrightarrow s^2 & = \frac{\begin{matrix} (7-4)^2 + (1-4)^2 + \\ (2-4)^2 + (4-4)^2 + \\ (6-4)^2 + (3-4)^2 + \\ (2-4)^2 + (7-4)^2 \end{matrix}}{8-1} \\ & = 5.714 \end{aligned} \]
\[ \text{standard deviation} \longrightarrow s = \sqrt{5.714} = 2.390 \]
So, the standard deviation is \(2.390\).
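In R, the standard deviation can be obtained either as the square root of var() or directly with sd(). The variable names in this sketch are assumptions.

```r
# Data set from the example
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

# Square root of the sample variance
sd_by_hand <- sqrt(var(x))
print(round(sd_by_hand, 3))

# Built-in sample standard deviation
sd_builtin <- sd(x)
print(sd_builtin)
```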
The frequency is the number of observations in each category.
The method of computing the frequencies of a categorical variable is as follows:
Example: How many b and g are there in the data listed below?
\[b, g, g, b, b, g, b, b\]
The unique categories are b and g.
Number of occurrences: \[ \begin{aligned} b,b,b,b,b & \longrightarrow 5 \\ g,g,g & \longrightarrow 3 \end{aligned} \]
So, there are \(5\) b and \(3\) g.
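The counting can be done in R with table(), which lists each unique category with its number of occurrences. This is a sketch; the variable name cat_data is borrowed from the R output shown later in these notes.

```r
# Categorical data from the example
cat_data <- c("b", "g", "g", "b", "b", "g", "b", "b")

# Unique categories
print(unique(cat_data))

# Frequency of each category
freq <- table(cat_data)
print(freq)
```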
The relative frequency is the proportion of observations in each category.
The proportion is computed using the formula \[\text{proportion of a category} = \frac{\text{number of cases of a category}}{\text{total number of cases}}.\]
The percentage is the relative frequency multiplied by 100: \[\text{percentage of a category} = \text{proportion of a category } \times 100.\]
Example: What are the proportions of b and g in the data listed below?
\[b, g, g, b, b, g, b, b\]
The numbers of occurrences are the same as in the previous example.
Number of cases: \(8\)
Proportions: \[ \begin{aligned} b & \longrightarrow \frac{5}{8} = 0.625 \\ g & \longrightarrow \frac{3}{8} = 0.375 \end{aligned} \]
## cat_data
## b g
## 0.625 0.375
So, the proportions are \(0.625\) for b and \(0.375\) for g, or \(62.5\)% b and \(37.5\)% g.
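The R output shown in this example is consistent with applying prop.table() to a frequency table. The original call is not shown, so the following is only a sketch of one way to get the same proportions.

```r
# Categorical data from the example
cat_data <- c("b", "g", "g", "b", "b", "g", "b", "b")

# Proportions: frequency of each category divided by the total number of cases
props <- prop.table(table(cat_data))
print(props)

# Percentages: proportions multiplied by 100
percentages <- props * 100
print(percentages)
```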
dplyr
What is dplyr?
dplyr is a powerful R package designed for data processing. It is part of the tidyverse ecosystem, and its verbs can be chained with the pipe operator %>%.
Why use dplyr?
It works alongside other tidyverse packages, such as ggplot2 for visualizations.
dplyr functions that operate on rows.
Verb | Purpose | Example |
---|---|---|
filter() | Chooses rows based on conditions | filter(data, col1 > 10) |
arrange() | Reorders rows | arrange(data, col1) |
distinct() | Finds all the unique rows | distinct(data, col1) |
count() | Finds all unique rows, then counts the number of occurrences | count(data, col1) |
\(\star\) Notice that the data frame data in the examples is always the first argument of the verbs. The filter() verb uses logical operators, which we will discuss in more detail.
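To see the row verbs in action, here is a small sketch on a made-up tibble. The tibble data and its columns col1 and col2 are hypothetical, chosen only to mirror the placeholder names in the table.

```r
library(dplyr)

# Hypothetical data for illustration only
data <- tibble::tibble(
  col1 = c(12, 5, 12, 20),
  col2 = c("a", "b", "a", "c")
)

# Keep only the rows where col1 is greater than 10
filtered <- filter(data, col1 > 10)

# Reorder rows by col1 in ascending order
arranged <- arrange(data, col1)

# Unique values of col1
unique_rows <- distinct(data, col1)

# Unique values of col1 with the number of occurrences (column n)
counts <- count(data, col1)

print(filtered)
print(arranged)
print(unique_rows)
print(counts)
```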
Define Data Frame as a Tibble
Filtering by subsetting
\(\star\) Here, you have to call the tibble twice to filter it, and the result is a vector.
Filtering by the filter() function
\(\star\) Here, you just have to call the tibble once to filter it, and it returns a tibble.
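The code for these two approaches is not shown here. The sketch below is one plausible version, assuming the tibble is built from the built-in iris data frame and named iris_tibble (the name used in the pipeline examples later in these notes); the exact column filtered in the original is unknown.

```r
library(dplyr)

# Assumed definition of the tibble
iris_tibble <- tibble::as_tibble(iris)

# Filtering by subsetting: iris_tibble appears twice and the result is a plain vector
setosa_sepal_vector <- iris_tibble$Sepal.Length[iris_tibble$Species == "setosa"]

# Filtering with filter(): iris_tibble appears once and the result is a tibble
setosa_tibble <- filter(iris_tibble, Species == "setosa")

print(head(setosa_sepal_vector))
print(setosa_tibble)
```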
dplyr functions that operate on columns.
Verb | Purpose | Example |
---|---|---|
mutate() | Adds or modifies columns | mutate(data, new_col = col1 - col2) |
select() | Chooses specific columns | select(data, col1, col2) |
rename() | Renames specific columns (new name on the left) | rename(data, new_name = col1) |
relocate() | Moves columns to the front | relocate(data, col1) |
\(\star\) The = signs in the column verbs name columns; they are not logical operators such as ==. Only the filter() verb uses logical operators.
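The column verbs can be tried on a small made-up tibble. Everything in this sketch (the tibble data, its columns, and the name new_name) is hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical data for illustration only
data <- tibble::tibble(
  col1 = c(10, 20, 30),
  col2 = c(1, 2, 3),
  col3 = c("x", "y", "z")
)

# Add a column computed from existing columns
with_new_col <- mutate(data, new_col = col1 - col2)

# Keep only col1 and col2
selected <- select(data, col1, col2)

# Rename col1 to new_name (new name goes on the left of =)
renamed <- rename(data, new_name = col1)

# Move col3 to the front
moved <- relocate(data, col3)

print(with_new_col)
print(selected)
print(renamed)
print(moved)
```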
Define Data Frame as a Tibble
Adding Columns by subsetting
\(\star\) Here, you have to call the tibble three times to add a new column named length_ratio, which is computed as Petal.Length over Sepal.Length.
Adding Columns by the mutate() function
\(\star\) Here, you just need to call the tibble once to add a column, and then update the original tibble.
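The code for the two approaches is not shown here. The following sketch is consistent with the descriptions, assuming the tibble is built from the built-in iris data and named iris_tibble; the names iris_base and iris_mutated are introduced only so the two results can be compared.

```r
library(dplyr)

# Assumed definition of the tibble
iris_tibble <- tibble::as_tibble(iris)

# Adding a column by subsetting: the tibble is referenced three times
iris_base <- iris_tibble
iris_base$length_ratio <- iris_base$Petal.Length / iris_base$Sepal.Length

# Adding a column with mutate(): the tibble is passed once
iris_mutated <- mutate(iris_tibble, length_ratio = Petal.Length / Sepal.Length)

print(iris_mutated)
```

To update the original tibble, the result of mutate() would be assigned back to iris_tibble.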
dplyr Verbs Using %>%
What is %>%?
%>% is the pipe operator, which passes the output of the previous step as the input to the next verb.
Define Data Frame as a Tibble
Simple Example. The following code sequence filters the iris data frame (in tibble form) to include only the “setosa” species.
\(\star\) Notice that the first line is the data frame itself, and the next line is the verb, so the data frame is not put directly into the first argument of the filter() verb. This is a common practice of organizing verbs in a pipeline.
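The code sequence described in this simple example is not shown here. A minimal sketch that matches the description, assuming the tibble is defined from the built-in iris data and named iris_tibble, could look like this (setosa_only is an illustrative name):

```r
library(dplyr)

# Assumed definition of the tibble
iris_tibble <- tibble::as_tibble(iris)

# First line: the data frame itself; next line: the verb
setosa_only <- iris_tibble %>%
  filter(Species == "setosa")

print(setosa_only)
```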
Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.
    # assumed setup: load dplyr and define the tibble from the built-in iris data
    library(dplyr)
    iris_tibble <- tibble::as_tibble(iris)

    iris_tibble %>%
      # rule 1: choose only the "setosa" species
      filter(Species == "setosa") %>%
      # rule 2: pick the columns Sepal.Length and Petal.Length
      select(Sepal.Length, Petal.Length) %>%
      # rule 3: create a new column called length_ratio
      mutate(length_ratio = Petal.Length / Sepal.Length)
\(\dagger\) Try the above code sequence in your console with the virginica species, and compute the ratio of Petal.Width to Sepal.Width.
\(\star\) The verbs do not explicitly take the resulting data frames as the first argument because the pipe operator automatically passes the output of the previous step as the input to the next verb in the sequence.
dplyr Verbs for Tidying Data
dplyr functions that operate on rows and columns.
Verb | Purpose | Example |
---|---|---|
group_by() | Groups rows by one or more columns, allowing operations to be performed within groups | group_by(data, category) |
summarise() | Reduces multiple rows into a single summary row per group | summarise(data, new_var = function(var)) |
\(\star\) The group_by() and summarise() verbs usually go together when you need to compute descriptive statistics for each category of a categorical variable.
Define Data Frame as a Tibble
Summarising by Nesting Verbs
\(\star\) Here, you are nesting the functions group_by() and summarise() to compute the mean of the Sepal.Length column in each category of the Species column.
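The nested call itself is not shown here. One way to write it, assuming the tibble is named iris_tibble and the summary column is called mean_sepal_length, is sketched below.

```r
library(dplyr)

# Assumed definition of the tibble
iris_tibble <- tibble::as_tibble(iris)

# group_by() is evaluated first (inner call), then summarise() (outer call)
species_means <- summarise(
  group_by(iris_tibble, Species),
  mean_sepal_length = mean(Sepal.Length)
)

print(species_means)
```

Nested calls are read from the inside out, which becomes harder to follow as more verbs are added.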
Summarising by Piping Verbs
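    # assumed setup: load dplyr and define the tibble from the built-in iris data
    library(dplyr)
    iris_tibble <- tibble::as_tibble(iris)

    iris_tibble %>%
      # Step 1: group by species
      group_by(Species) %>%
      # Step 2: calculate the mean and variance of Sepal.Length
      summarise(mean_sepal_length = mean(Sepal.Length),
                var_sepal_length = var(Sepal.Length))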
\(\star\) Here, you are using the piping operator %>%, where you don’t need to nest the verbs, and the verbs are written in a logical sequence line-by-line.
Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.