Data Principles

Applied Statistics

MTH-361A | Spring 2025 | University of Portland

January 27, 2025


The guiding principle of statistics is statistical thinking.

Statistical Thinking in the Data Science Life Cycle

Types of Variables

Numerical: Discrete vs Continuous

Numerical variables are quantitative variables that represent measurable amounts or quantities.
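As a rough illustration of the distinction in the heading above, discrete variables take countable values (such as counts), while continuous variables can take any value in a range (such as measurements). The sketch below uses made-up values and hypothetical variable names; note that R's storage type is only a hint, since a discrete variable can also be stored as a double.

```r
# Hypothetical values, chosen only for illustration.
num_siblings <- c(0L, 2L, 1L, 3L)            # discrete: counts, stored as integers
height_cm <- c(162.5, 170.2, 158.0, 181.7)   # continuous: measurements, stored as doubles

typeof(num_siblings)   # "integer"
typeof(height_cm)      # "double"

# Both are numerical variables as far as R is concerned.
is.numeric(num_siblings) && is.numeric(height_cm)
```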



Categorical: Nominal vs Ordinal

Categorical variables are qualitative variables that represent labels or categories. They describe characteristics and are not inherently numerical.
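One common way to represent the two kinds of categorical variables in R is with factors: an unordered factor for nominal data (labels with no natural order) and an ordered factor for ordinal data (labels with a natural order). The following is a minimal sketch with made-up responses; the variable names and levels are assumptions for illustration.

```r
# Hypothetical responses, chosen only for illustration.
eye_color <- factor(c("brown", "blue", "green", "brown"))   # nominal: no natural order

class_year <- factor(
  c("sophomore", "first-year", "senior", "junior"),
  levels = c("first-year", "sophomore", "junior", "senior"),
  ordered = TRUE
)                                                           # ordinal: levels are ranked

is.ordered(eye_color)    # FALSE
is.ordered(class_year)   # TRUE

# Order comparisons are only meaningful for ordered factors.
class_year < "junior"
```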



Case Study 1

A survey was conducted on students in an introductory statistics course. Below are a few of the survey questions and the variables in which the responses were stored:

Case Study 1: Data Matrix

Data collected on students in a statistics class on a variety of variables:

Example Data Matrix

Case Study 1: Identify Types of Variables

R Variables

How are variables created and stored in R?
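A minimal sketch of variable creation, using the hypothetical name `score`: the assignment operator `<-` binds a name to a value, and assigning to the same name again replaces whatever was stored before.

```r
score <- 5             # score holds a single number (a numeric vector of length 1)
length(score)          # 1

score <- c(1, 2, 3)    # assigning again replaces the previous value
length(score)          # 3
class(score)           # "numeric"
```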

\(\dagger\) Create a variable named ria and initially define it as a number. Then, redefine ria as a vector. What happens to the original value of ria (the number) after it is redefined as a vector?

Introduction to Data Frames

What is a data frame?

Data Frame Key Characteristics

Base R: These data frames are data structures that come with base R.

Tibbles: Tibbles are a special kind of data frame provided by the tibble package.

\(\star\) A base R data frame can be converted to a tibble using the as_tibble() or tibble() function. The iris data set is a built-in data set in the datasets package, which comes with base R.
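The sketch below builds a small base R data frame from made-up values (the `students` table and its columns are hypothetical) and then converts the built-in iris data to a tibble. The conversion step assumes the tibble package is installed, so it is guarded with a check.

```r
# Hypothetical survey-style data, for illustration only.
students <- data.frame(
  name = c("Ana", "Ben", "Cy"),
  age = c(19L, 21L, 20L),
  commuter = c(TRUE, FALSE, TRUE)
)

dim(students)             # number of rows, then number of columns
sapply(students, class)   # each column has its own type

# Converting to a tibble requires the tibble package to be installed.
if (requireNamespace("tibble", quietly = TRUE)) {
  iris_tbl <- tibble::as_tibble(iris)
  class(iris_tbl)         # includes "tbl_df" in addition to "data.frame"
}
```

Each column of a data frame is a vector of a single type, and all columns share the same length, which is what makes the rows-and-columns layout possible.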

R Packages

What are R packages?

Base R versus R Packages

| Aspect | Base R | R Packages |
|---|---|---|
| Availability | Comes pre-installed with R | Must be installed and loaded |
| Functionality | Offers basic statistical and programming tools | Provides advanced or specialized tools not included in base R |
| Customization | Limited to what’s already available | Highly customizable; users can install or even create their own packages |
| Performance | Can sometimes be slower or more verbose | Often include optimized or simpler syntax for complex tasks |
| Specialty | Limited to basic statistics | Often built for a specific purpose or field of knowledge |

The tidyverse Package

tidyverse is a collection of packages suited for data processing and visualization.

Core packages specifically for data processing:

  • dplyr provides a grammar for data transformation.
  • tidyr provides a set of functions that help you get data in consistent form.
  • tibble provides a data frame that prioritizes simplicity, enforcing stricter checks to promote cleaner, more expressive code.

Installing Packages

How to install R Packages?

The following two methods install the tidyverse package.


\(\dagger\) Try the above code sequence in your console. Then, install a different package called plotly.

\(\star\) Knitting an R Markdown document with this function in a code chunk will produce an error. Run it directly in the console instead.
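A minimal sketch of console installation, assuming the installation function meant here is base R's install.packages(). This is not necessarily either of the two methods from the original slide. The requireNamespace() check is an added convenience that skips the download when the package is already present.

```r
# Run in the console. Installs tidyverse only if it is not already installed.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
```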

Loading Packages

How to load R Packages?

The function library() loads any installed package. In this case library("tidyverse") loads the tidyverse package specifically.

library("tidyverse")

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package to force all conflicts to become errors

\(\dagger\) Try the above code sequence in your console. Then, load the package plotly.
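The Conflicts section of the startup message reports that dplyr::filter() masks stats::filter(), meaning a bare call to filter() now refers to the dplyr version. A masked function is still reachable with the `package::function` syntax. The sketch below uses only the stats package, which ships with R, and a made-up vector `x`.

```r
# Call the stats version explicitly, regardless of what is loaded.
x <- c(1, 2, 3, 4, 5)
smoothed <- stats::filter(x, rep(1 / 3, 3))   # centered moving average of width 3

as.numeric(smoothed)   # NA 2 3 4 NA
```

The first and last entries are NA because a centered window of width 3 has no complete neighborhood at the ends of the vector.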

Data Frame Subsetting

How to Subset a Data Frame?

Data frames are organized into rows and columns and can be accessed in the following ways:

iris[iris$Species == "setosa",]
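A few other common access patterns in base R, sketched here on the same iris data set. The bracket form is `data[rows, columns]`, and leaving one side empty means "all". The result names on the left are hypothetical.

```r
first_rows <- iris[1:3, ]                            # rows 1 to 3, all columns
two_cols <- iris[, c("Sepal.Length", "Species")]     # all rows, two columns by name
petal_width <- iris$Petal.Width                      # one column as a vector
species <- iris[["Species"]]                         # same idea with double brackets

head(two_cols)   # preview the first six rows
```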

\(\dagger\) Try to subset the iris data set, selecting the rows where the Species column is “virginica”.

\(\star\) The above example uses the iris data set, which comes with base R.

Subset a Data Frame based on Conditions

Define Data Frame as a Tibble

iris_tibble <- tibble(iris)

Example: The following code sequence subsets the iris data frame (in tibble form) to keep only the rows of the “versicolor” species that have a Sepal.Length of at least \(6\).

# subset by species to match "versicolor"
iris_versicolor <- iris_tibble[iris_tibble$Species == "versicolor",]
# subset rows with Sepal.Length at least 6
iris_versicolor2 <- iris_versicolor[iris_versicolor$Sepal.Length >= 6,]
# display result
iris_versicolor2

\(\star\) The code sequence above uses different variable definitions to track subsetting conditions sequentially.
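As an alternative sketch, both conditions can be combined in a single step with the element-wise AND operator `&`. The example uses the base R iris data frame directly (rather than the tibble) so that it runs without extra packages. The names `one_step`, `step1`, and `step2` are hypothetical.

```r
# One step: both conditions joined with &
one_step <- iris[iris$Species == "versicolor" & iris$Sepal.Length >= 6, ]

# Two steps, mirroring the sequential approach
step1 <- iris[iris$Species == "versicolor", ]
step2 <- step1[step1$Sepal.Length >= 6, ]

# Both approaches select the same number of rows
nrow(one_step) == nrow(step2)
```

The sequential version is easier to inspect at each stage, while the combined version avoids creating intermediate variables.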

Activity: Data Sub-Setting and Identifying Variables

  1. Log in to Posit Cloud and open the RStudio assignment M 1/27 - Data Sub-Setting and Identifying Variables.
  2. Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
  3. Change the author in the YAML header.
  4. Read the provided instructions.
  5. Answer all exercise problems in the designated sections.


