Data Principles

Applied Statistics

MTH-361A | Spring 2025 | University of Portland

January 27, 2025

Objectives

Previously… (1/2)

Previously… (2/2)

The guiding principle of statistics is statistical thinking.

Statistical Thinking in the Data Science Life Cycle

Types of Variables

Numerical: Discrete vs Continuous

Numerical variables are quantitative variables that represent measurable amounts or quantities.

Discrete

Continuous
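As a minimal sketch of how the discrete/continuous distinction can show up in R: counts are naturally stored as integers, while measurements are stored as doubles. The variable names `num_siblings` and `height_cm` and their values are made up for illustration.

```r
# Hypothetical values, for illustration only
num_siblings <- c(0L, 2L, 1L, 3L)              # discrete: counts, stored as integers
height_cm    <- c(162.5, 170.2, 158.0, 181.3)  # continuous: measurements, stored as doubles

typeof(num_siblings)      # "integer"
typeof(height_cm)         # "double"

# Both are numerical variables as far as R is concerned
is.numeric(num_siblings)  # TRUE
is.numeric(height_cm)     # TRUE
```

Storage type is only a hint: `c(0, 2, 1)` without the `L` suffix is stored as double even though the values are counts, so the variable type still has to be judged from what the data represent.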

Categorical: Nominal vs Ordinal

Categorical variables are qualitative variables that represent labels or categories. They describe characteristics and are not inherently numerical.

Nominal

Ordinal
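One common way to represent both kinds of categorical variable in R is the factor, with `ordered = TRUE` marking an ordinal variable. The sketch below is illustrative only; `eye_color` and `class_year` are hypothetical variables with made-up values.

```r
# Hypothetical survey responses, for illustration only
eye_color <- factor(c("brown", "blue", "green", "brown"))  # nominal: no natural order

class_year <- factor(
  c("sophomore", "first-year", "senior", "junior"),
  levels  = c("first-year", "sophomore", "junior", "senior"),
  ordered = TRUE                                           # ordinal: levels are ranked
)

is.ordered(eye_color)    # FALSE
is.ordered(class_year)   # TRUE

# Order comparisons are only meaningful for ordinal variables
class_year < "junior"    # TRUE TRUE FALSE FALSE
```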

Case Study 1

A survey was conducted on students in an introductory statistics course. Below are a few of the questions on the survey, and the corresponding variables the data from the responses were stored in:

Case Study 1: Data Matrix

Data collected on students in a statistics class on a variety of variables:

Example Data Matrix

Case Study 1: Identify Types of Variables

R Variables

How are variables created and stored in R?
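A minimal sketch of variable creation, assuming the usual `<-` assignment operator. The names `course_size`, `quiz_scores`, `course_code`, and `is_spring` are hypothetical and chosen only for illustration.

```r
# Assign a single number to a name with the assignment operator <-
course_size <- 30
class(course_size)    # "numeric"
length(course_size)   # 1 (a single number is a vector of length one)

# c() combines several values into one vector
quiz_scores <- c(8, 9.5, 7)
length(quiz_scores)   # 3

# Character and logical values are assigned the same way
course_code <- "MTH-361A"
class(course_code)    # "character"
is_spring <- TRUE
class(is_spring)      # "logical"
```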

\(\dagger\) Create a variable named ria and initially define it as a number. Then, redefine ria as a vector. What happens to the original value of ria (the number) after it is redefined as a vector?

Introduction to Data Frames

What is a data frame?

Data Frame Key Characteristics

Base R: These data frames are data structures that come with base R.
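As a rough illustration, a base R data frame can be built with `data.frame()`, where each argument becomes a column (a variable) and each position across the columns forms a row (an observation). The `students` data below are made up for this sketch.

```r
# Hypothetical records, for illustration only
students <- data.frame(
  name     = c("Ana", "Ben", "Cy"),
  credits  = c(15L, 12L, 16L),
  commuter = c(TRUE, FALSE, TRUE)
)

dim(students)            # 3 3 (rows, then columns)
names(students)          # "name" "credits" "commuter"
sapply(students, class)  # the type stored in each column
```

Columns may hold different types, but every column must have the same length.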

Tibbles: Tibbles are a special kind of data frame provided by the tibble package.

\(\star\) A base R data frame can be converted to a tibble using the tibble() function or the dedicated as_tibble() function, both from the tibble package. The iris data set is a built-in data set in the datasets package, which comes with base R.
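A small sketch of the conversion, using `as_tibble()` from the tibble package. The `requireNamespace()` guard is an addition for this sketch so that it still runs where tibble is not installed, and `iris_tbl` is a hypothetical name.

```r
# iris ships with base R (datasets package)
class(iris)   # "data.frame"

# Guard so the sketch still runs where the tibble package is not installed
if (requireNamespace("tibble", quietly = TRUE)) {
  iris_tbl <- tibble::as_tibble(iris)
  print(class(iris_tbl))   # "tbl_df" "tbl" "data.frame"
  print(iris_tbl)          # shows only the first 10 rows, plus column types
}
```

The class vector still ends in "data.frame", which is why functions written for base R data frames generally accept tibbles as well.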

R Packages

What are R packages?

Base R versus R Packages

| Aspect | Base R | R Packages |
|---|---|---|
| Availability | Comes pre-installed with R | Must be installed and loaded |
| Functionality | Offers basic statistical and programming tools | Provides advanced or specialized tools not included in base R |
| Customization | Limited to what’s already available | Highly customizable; users can install or even create their own packages |
| Performance | Can sometimes be slower or more verbose | Often include optimized or simpler syntax for complex tasks |
| Speciality | Limited to basic statistics | Often built for a specific purpose or field of knowledge |

The tidyverse Package

tidyverse is a collection of packages suited for data processing and visualization.

Core packages specifically for data processing:

  • dplyr provides a grammar for data transformation.
  • tidyr provides a set of functions that help you get data in consistent form.
  • tibble provides a data frame that prioritizes simplicity, enforcing stricter checks to promote cleaner, more expressive code.
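To give a flavor of the dplyr grammar, here is a hedged sketch that keeps the rows of one species and then two of the columns from the built-in iris data. The `requireNamespace()` guard and the name `setosa_small` are assumptions of this sketch, not part of the course material.

```r
# Guard so the sketch still runs where dplyr is not installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  # filter() keeps rows that satisfy a condition; select() keeps named columns
  setosa_small <- dplyr::select(
    dplyr::filter(iris, Species == "setosa"),
    Species, Sepal.Length
  )
  print(dim(setosa_small))   # 50 2
}
```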

Installing Packages

How to install R Packages?

The following command installs the tidyverse package.

install.packages("tidyverse")

\(\dagger\) Try the above code sequence in your console. Then, install a different package called plotly.

\(\star\) Knitting an R Markdown document with this function in a code chunk may give you an error. Run it directly in the console instead.
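One possible pattern for avoiding repeated installs is to check whether a package is already available before calling install.packages(). The helper `install_if_missing()` below is hypothetical (it is not part of base R or the tidyverse) and is only a sketch.

```r
# install_if_missing() is a hypothetical helper, not an existing R function
install_if_missing <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))   # already installed, nothing to do
  }
  install.packages(pkg)        # only reached when the package is missing
  invisible(TRUE)
}

# Usage (commented out so that running this sketch installs nothing):
# install_if_missing("tidyverse")
```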

Loading Packages

How to load R Packages?

The function library() loads any installed package. In this case, library("tidyverse") loads the tidyverse package.

library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
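The Conflicts section of this output means that, after loading, a plain filter() call refers to dplyr::filter(). The masked function is still reachable through the `package::function` form. The sketch below uses only base R, so it runs with or without the tidyverse; `x` and `moving_avg` are illustrative names.

```r
# Call the masked stats function explicitly: a 3-point moving average
x <- c(1, 2, 3, 4, 5)
moving_avg <- stats::filter(x, rep(1 / 3, 3))
as.numeric(moving_avg)          # NA 2 3 4 NA

# Check whether a package is currently attached
"package:stats" %in% search()   # TRUE in a default R session
```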

\(\dagger\) Try the above code sequence in your console. Then, load the package plotly.

Data Frame Subsetting

How to Subset a Data Frame? Data frames are organized into rows and columns and can be accessed in the following ways:

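iris$Species                          # one column, as a vector
iris[,"Species"]                      # the same column, selected by name
iris[,c("Species","Sepal.Length")]    # two columns, as a data frame
iris[42,]                             # row 42, all columns
iris[42,c("Species","Sepal.Length")]  # row 42, two columns
iris[iris$Species == "setosa",]       # all rows where Species is "setosa"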

\(\dagger\) Try to subset the iris data set to the rows where the Species column is “virginica”.

\(\star\) The above examples use the iris data set, which comes with base R.
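A detail worth checking when subsetting a base R data frame is what kind of object comes back. The following sketch, with illustrative variable names, compares a few cases on iris.

```r
species_vec <- iris[, "Species"]                # one column: simplified to a vector
species_df  <- iris[, "Species", drop = FALSE]  # one column, kept as a data frame
row_42      <- iris[42, ]                       # one row: still a data frame

class(species_vec)   # "factor"
class(species_df)    # "data.frame"
dim(row_42)          # 1 5

# A logical condition keeps the rows where it is TRUE
long_sepal <- iris[iris$Sepal.Length > 7.5, ]
all(long_sepal$Sepal.Length > 7.5)   # TRUE
```

Tibbles behave differently here: selecting a single column with `[` returns a tibble rather than a vector.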

Subset a Data Frame based on Conditions

Define Data Frame as a Tibble

iris_tibble <- tibble(iris)

Example: The following code sequence subsets the iris data frame (in tibble form) to include only the rows of the “versicolor” species with Sepal.Length at least \(6\).

# subset by species to match "versicolor"
iris_versicolor <- iris_tibble[iris_tibble$Species == "versicolor",]
# subset rows with Sepal.Length at least 6
iris_versicolor2 <- iris_versicolor[iris_versicolor$Sepal.Length >= 6,]
# display result
glimpse(iris_versicolor2)

\(\star\) The code sequence above stores each intermediate result in its own variable, so the subsetting conditions can be tracked step by step.
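The two conditions can also be combined into a single step with the `&` operator. This sketch uses the base R iris data frame (rather than the tibble) so that it runs without extra packages, and checks that both approaches select the same rows; `step1`, `step2`, and `combined` are illustrative names.

```r
# Sequential subsetting, one condition at a time
step1 <- iris[iris$Species == "versicolor", ]
step2 <- step1[step1$Sepal.Length >= 6, ]

# Same rows in one step: & is TRUE only where both conditions hold
combined <- iris[iris$Species == "versicolor" & iris$Sepal.Length >= 6, ]

identical(rownames(step2), rownames(combined))   # TRUE
```

The sequential form is easier to inspect while learning; the combined form avoids intermediate variables.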

Activity: Data Sub-Setting and Identifying Variables

  1. Log in to Posit Cloud and open the RStudio assignment M 1/27 - Data Sub-Setting and Identifying Variables.
  2. Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
  3. Change the author in the YAML header.
  4. Read the provided instructions.
  5. Answer all exercise problems in the designated sections.
