MTH-361A | Spring 2025 | University of Portland
January 27, 2025
The guiding principle of statistics is statistical thinking.
Statistical Thinking in the Data Science Life Cycle
Types of Variables
Numerical variables are quantitative variables that represent measurable amounts or quantities.
Discrete
Continuous
Categorical variables are qualitative variables that represent labels or categories. They describe characteristics and are not inherently numerical.
Nominal
Ordinal
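As a rough illustration of how these four variable types might be stored in R, the sketch below uses made-up survey responses. The variable names and values are hypothetical and are not taken from the course survey.

```r
# Hypothetical survey responses, one vector per variable type
n_siblings <- c(0L, 2L, 1L, 3L)              # numerical, discrete (integer counts)
height_cm  <- c(162.5, 175.0, 168.2, 180.1)  # numerical, continuous
major      <- factor(c("Math", "Biology", "Math", "Nursing"))  # categorical, nominal
class_year <- factor(
  c("Sophomore", "Senior", "Freshman", "Junior"),
  levels  = c("Freshman", "Sophomore", "Junior", "Senior"),
  ordered = TRUE
)                                            # categorical, ordinal

class(n_siblings)   # "integer"
class(height_cm)    # "numeric"
class(major)        # "factor"
class(class_year)   # "ordered" "factor"

# Ordered factors support comparisons that follow the level order
class_year < "Junior"
```

Storing ordinal data as an ordered factor keeps the ranking of the categories, which a plain character vector would lose.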
A survey was conducted on students in an introductory statistics course. Below are a few of the questions on the survey, and the corresponding variables the data from the responses were stored in:
Data collected on students in a statistics class on a variety of variables:
Example Data Matrix
How are variables created and stored in R?
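A minimal sketch of creating variables with the assignment operator `<-`. The names `x` and `scores` are placeholders chosen for illustration.

```r
# Assign a single number to a variable with the assignment operator <-
x <- 3.5

# Assign a vector (an ordered collection of values of the same type) with c()
scores <- c(88, 92, 79)

# Typing a variable's name prints its current value
x
scores

# Inspect how R stores each object
class(x)        # "numeric"
length(scores)  # 3
```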
\(\dagger\) Create a variable named `ria` and initially define it as a number. Then, redefine `ria` as a vector. What happens to the original value of `ria` (the number) after it is redefined as a vector?
What is a data frame?
Base R: These data frames are data structures that come with base R.
Tibbles: Tibbles are a special kind of data frame provided by the `tibble` package.
\(\star\) A base R data frame can be converted to a tibble using the `tibble()` function. The `iris` data set is a built-in data set in the `datasets` package, which comes with base R.
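To make the two kinds of data frame concrete, here is a small sketch that builds the same table both ways. The column names and values are invented for illustration, and the second half assumes the `tibble` package is installed.

```r
# Hypothetical columns; each column is a vector of equal length
students_df <- data.frame(
  name  = c("Ana", "Ben", "Cy"),
  score = c(88, 92, 79)
)

class(students_df)  # "data.frame"
dim(students_df)    # 3 rows, 2 columns

# The same data as a tibble (assumes the tibble package is installed)
students_tbl <- tibble::tibble(
  name  = c("Ana", "Ben", "Cy"),
  score = c(88, 92, 79)
)

class(students_tbl)  # "tbl_df" "tbl" "data.frame"
```

Because a tibble still carries the class `data.frame`, functions written for base R data frames generally accept it as well.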
What are R packages?
Aspect | Base R | R Packages |
---|---|---|
Availability | Comes pre-installed with R | Must be installed and loaded |
Functionality | Offers basic statistical and programming tools | Provides advanced or specialized tools not included in base R |
Customization | Limited to what’s already available | Highly customizable; users can install or even create their own packages |
Performance | Base R can sometimes be slower or more verbose | Packages often include optimized or simpler syntax for complex tasks |
Specialty | Geared toward general-purpose, basic statistics | Often built for a specific purpose or field of knowledge |
`tidyverse` Package
`tidyverse` is a collection of packages suited for data processing and visualization.
Core packages specifically for data processing:
`dplyr` provides a grammar for data transformation.
`tidyr` provides a set of functions that help you get data in consistent form.
`tibble` provides a data frame that prioritizes simplicity, enforcing stricter checks to promote cleaner, more expressive code.
How to install R Packages?
The following two methods install the `tidyverse` package.
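The original installation steps are not reproduced in these notes, so the following is only a sketch of one common approach: installing from CRAN at the console. The `requireNamespace()` guard is an addition here so that the package is not reinstalled when it is already present.

```r
# Console method: install tidyverse from CRAN only if it is missing
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
```

In RStudio, the same installation can usually be done through the menu (Tools > Install Packages...), which calls `install.packages()` behind the scenes. Installation is needed only once per machine, whereas loading is needed in every new R session.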
How to load R Packages?
The function `library()` loads any installed package. In this case, `library("tidyverse")` loads the `tidyverse` package specifically.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
\(\dagger\) Try the above code sequence in your console. Then, load the package `plotly`.
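The conflict lines in the startup message mean that two attached packages define a function with the same name. As a hedged sketch, the `package::function()` form can be used to say explicitly which version is wanted. The object names `setosa_rows` and `moving_avg` are illustrative only, and the example assumes `dplyr` is installed.

```r
library(dplyr)

# After attaching dplyr, a bare filter() refers to dplyr::filter(),
# which keeps the rows of a data frame that satisfy a condition
setosa_rows <- dplyr::filter(iris, Species == "setosa")

# The masked version in the stats package stays reachable through its
# namespace; here it computes a 3-point moving average of 1, 2, ..., 10
moving_avg <- stats::filter(1:10, rep(1 / 3, 3))
```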
How to Subset a Data Frame?
Data frames are organized into rows and columns and can be accessed in the following ways:
$
[]
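A short sketch of both access methods on the built-in `iris` data set. The object names on the left-hand side are placeholders.

```r
# $ extracts one column as a vector
sepal_lengths <- iris$Sepal.Length
head(sepal_lengths)

# [rows, columns] selects by position, name, or logical condition;
# leaving one side of the comma empty keeps everything on that side
first_rows <- iris[1:3, ]                            # first three rows, all columns
two_cols   <- iris[, c("Sepal.Length", "Species")]   # all rows, two named columns
setosa     <- iris[iris$Species == "setosa", ]       # rows where Species is "setosa"

nrow(setosa)  # 50
```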
\(\dagger\) Try to subset the `iris` data set, accessing “virginica” in the `Species` column.
\(\star\) The above examples use the `iris` data set, which comes with base R.
Define Data Frame as a Tibble
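A minimal sketch of storing `iris` as a tibble under the name `iris_tibble`. It uses `as_tibble()`, the `tibble` package's coercion function, which is a choice made for this sketch; the note earlier in these slides names `tibble()` for the conversion.

```r
library(tibble)  # also attached by library(tidyverse)

# Convert the built-in iris data frame to a tibble
iris_tibble <- as_tibble(iris)

class(iris_tibble)  # "tbl_df" "tbl" "data.frame"

# glimpse() prints one line per column with its type and first values
glimpse(iris_tibble)
```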
Example: The following code sequence subsets the `iris` data frame (in tibble form) to include only the rows of the “versicolor” species with `Sepal.Length` at least \(6\).
# assumes library(tidyverse) is loaded and iris_tibble holds iris in tibble form
# subset by species to match "versicolor"
iris_versicolor <- iris_tibble[iris_tibble$Species == "versicolor", ]
# subset rows with Sepal.Length at least 6
iris_versicolor2 <- iris_versicolor[iris_versicolor$Sepal.Length >= 6, ]
# display result
glimpse(iris_versicolor2)
\(\star\) The code sequence above uses different variable definitions to track subsetting conditions sequentially.
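For comparison, the two conditions can also be combined into one step with the element-wise AND operator `&`. This is a sketch of an alternative, not the approach used in the example above; the names `keep` and `iris_versicolor_long` are hypothetical.

```r
library(tibble)

iris_tibble <- as_tibble(iris)

# Build one logical vector that is TRUE only where both conditions hold
keep <- iris_tibble$Species == "versicolor" & iris_tibble$Sepal.Length >= 6

# Use it once to select the matching rows and all columns
iris_versicolor_long <- iris_tibble[keep, ]

nrow(iris_versicolor_long)
```

The sequential version makes each condition easy to inspect on its own, while the combined version avoids creating intermediate objects.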
`.Rmd` file by replacing `[name]` with your name using the format `[First name][Last initial]`. Then, open the `.Rmd` file.