Data Principles

Applied Statistics

MTH-361A | Spring 2025 | University of Portland

January 27, 2025

Objectives

Develop an understanding of how to identify different types of variables
Know how data is organized into data tables
Introduce R packages, variables, and data frames
Activity: Data Sub-Setting and Identifying Variables

Previously…

The guiding principle of statistics is statistical thinking.

Statistical Thinking in the Data Science Life Cycle

Types of Variables

Numerical: Discrete vs Continuous

Numerical variables are quantitative variables that represent measurable amounts or quantities.

Discrete

Take distinct, separate values (often whole numbers).
Typically countable.
Examples:
- Number of students in a class (e.g., 25 students).
- Books sold per day (e.g., 10 books/day).

Continuous

Can take any value within a range, including fractions and decimals.
Measured rather than counted.
Examples:
- Height of a person (e.g., 5.7 ft).
- Time taken to complete a task (e.g., 3.25 hours).

Categorical: Nominal vs Ordinal

Categorical variables are qualitative variables that represent labels or categories. They describe characteristics and are not inherently numerical.

Nominal

Categories have no natural order or ranking.
Used for labeling without quantitative significance.
Examples:
- Colors of cars (e.g., red, blue, green).
- Types of animals (e.g., cat, dog, bird).

Ordinal

Categories have a meaningful order or ranking, but the intervals between ranks are not necessarily equal.
Examples:
- Education level (e.g., high school, bachelor’s, master’s).
- Customer satisfaction (e.g., satisfied, neutral, dissatisfied).

Case Study 1

A survey was conducted on students in an introductory statistics course. Below are a few of the survey questions, each preceded by the name of the variable in which the responses were stored:

gender: What is your gender?
intro_extra: Are you an introvert or an extrovert?
sleep: How many hours do you sleep at night, on average?
bedtime: What time do you usually go to bed?
countries: How many countries have you visited?
dread: On a scale of 1-5, how much do you dread being here?

Case Study 1: Data Matrix

Data collected on students in a statistics class on a variety of variables:

Example Data Matrix

Case Study 1: Identify Types of Variables

gender: nominal categorical
sleep: continuous numerical
bedtime: ordinal categorical
countries: discrete numerical
dread: ordinal categorical (could also be treated as numerical)
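
One way these variable types might be represented in R is sketched below. The three responses are invented for illustration and are not taken from the survey; only the variable names come from the case study.

```r
# Hypothetical responses from three students (values invented for illustration).
gender    <- factor(c("female", "male", "female"))             # nominal categorical
sleep     <- c(7.5, 6, 8.25)                                   # continuous numerical
countries <- c(3L, 0L, 12L)                                    # discrete numerical
dread     <- factor(c(2, 5, 3), levels = 1:5, ordered = TRUE)  # ordinal categorical

class(gender)      # "factor"
typeof(sleep)      # "double"
typeof(countries)  # "integer"
is.ordered(dread)  # TRUE

# Ordered factors support comparisons; unordered factors do not.
dread[1] < dread[2]  # TRUE, because level 2 comes before level 5
```

Storing dread as an ordered factor keeps its ranking without implying that the gap between 1 and 2 equals the gap between 4 and 5.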

R Variables

How are variables created and stored in R?
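
A minimal sketch of creating variables with the assignment operator `<-`. The names `n_students` and `heights` are hypothetical and chosen only for this example.

```r
# Store a single number in a variable.
n_students <- 25

# Store several values in one variable as a vector, built with c().
heights <- c(5.7, 6.1, 5.4)

n_students       # typing a name prints the stored value
class(heights)   # "numeric"
length(heights)  # 3
heights[2]       # 6.1 (R indexes from 1)
```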

$\dagger$ Create a variable named ria and initially define it as a number. Then, redefine ria as a vector. What happens to the original value of ria (the number) after it is redefined as a vector?

Introduction to Data Frames

What is a data frame?

A data frame is a key data structure in R used for storing data sets.
It organizes data into rows and columns, similar to a spreadsheet.
Each column is a vector, and different columns can hold different types of data (e.g., numeric, character, factor).
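
As a rough illustration, a small data frame can be built from vectors of equal length with base R's data.frame(). The `students` data below is hypothetical.

```r
# Hypothetical data: each column is a vector, and all columns have equal length.
students <- data.frame(
  name      = c("Ana", "Ben", "Cy"),  # character
  sleep     = c(7.5, 6, 8.25),        # numeric
  countries = c(3L, 0L, 12L),         # integer
  stringsAsFactors = FALSE            # keep text columns as character
)

dim(students)            # 3 rows and 3 columns
sapply(students, class)  # the type stored in each column
str(students)            # compact summary of the structure
```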

Data Frame Key Characteristics

Base R: These data frames are the data structure that comes with base R.

Tibbles: Tibbles are a special kind of data frame provided by the tibble package.

$\star$ A base R data frame can be converted to a tibble using the as_tibble() function (tibble() also works when it is given a single unnamed data frame). The iris data set is a built-in data set in the datasets package, which comes with base R.
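
A short sketch of the conversion, assuming the tibble package is already installed. It uses as_tibble(), the tibble package's dedicated conversion function; the name `iris_tbl` is hypothetical.

```r
library(tibble)

# iris is available without loading anything extra.
class(iris)      # "data.frame"

# Convert the base R data frame to a tibble.
iris_tbl <- as_tibble(iris)
class(iris_tbl)  # "tbl_df" "tbl" "data.frame"

# Printing a tibble shows only the first rows plus each column's type.
iris_tbl
```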

R Packages

What are R packages?

Packages in R are collections of functions, data, and documentation that extend the capabilities of base R.
Each package includes manuals and examples to help users understand how to use it.

Base R versus R Packages

| Aspect | Base R | R Packages |
| --- | --- | --- |
| Availability | Comes pre-installed with R | Must be installed and loaded |
| Functionality | Offers basic statistical and programming tools | Provide advanced or specialized tools not included in base R |
| Customization | Limited to what is already available | Highly customizable; users can install or even create their own packages |
| Performance | Can sometimes be slower or more verbose | Often include optimized code or simpler syntax for complex tasks |
| Specialty | General-purpose tools for basic statistics | Often built for a specific purpose or field |

The `tidyverse` Package

tidyverse is a collection of packages suited for data processing and visualization.

Core packages specifically for data processing:

dplyr provides a grammar for data transformation.
tidyr provides a set of functions that help you get data into a consistent form.
tibble provides a data frame variant that prioritizes simplicity, enforcing stricter checks to promote cleaner, more expressive code.

Installing Packages

How to install R Packages?

The following two methods install the tidyverse package.

Using the “Tools” menu

Using the R Console Directly

install.packages("tidyverse")

$\dagger$ Try the above code sequence in your console. Then, install a different package called plotly.

$\star$ Knitting an R Markdown document with this function in a code chunk will give you an error. Run it directly in the console instead.
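
Since a package only needs to be installed once, one possible pattern is to check first whether it is already there. The helper `missing_packages()` below is hypothetical and not part of R; it relies on base R's requireNamespace(), which returns TRUE when a package can be loaded.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "plotly"))
to_install  # empty if both packages are already installed

# Run in the console only if something is missing:
# install.packages(to_install)
```

The install line is left commented out so that the sketch can sit in a code chunk without triggering an installation while knitting.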

Loading Packages

How to load R Packages?

The function library() loads any installed package. In this case library("tidyverse") loads the tidyverse package specifically.

library("tidyverse")

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

$\dagger$ Try the above code sequence in your console. Then, load the package plotly.
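
The Conflicts lines in the output above report that, once tidyverse is loaded, a bare call to filter() or lag() refers to the dplyr version. The masked function stays reachable through the `package::function` form, as this small base R sketch suggests. The vector `x` is made up for illustration.

```r
x <- c(1, 2, 3, 4, 5)

# Explicitly call the stats version of filter(), regardless of what is loaded.
# With these weights it computes a centered 3-point moving average.
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# The first and last entries are NA; the middle entries are 2, 3, 4.
moving_avg
```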

Data Frame Subsetting

How to Subset a Data Frame?

Data frames are organized into rows and columns and can be accessed in the following ways:

Accessing a column by name using $

iris$Species

Accessing a column by name using []

iris[,"Species"]

Accessing multiple columns using a vector of names

iris[,c("Species","Sepal.Length")]

Accessing rows by an index

iris[42,]

Accessing specific rows and columns by index or name

iris[42,c("Species","Sepal.Length")]

Accessing specific rows according to some condition

iris[iris$Species == "setosa",]

$\dagger$ Try to subset the iris data set, keeping only the rows where the Species column is “virginica”.

$\star$ The above examples use the iris data set, which comes with base R.
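
A short sketch of what these subsetting forms return, using only base R and the iris data. One detail worth noticing: selecting a single column with `[ , ]` gives back a plain vector unless `drop = FALSE` is supplied. The names `species`, `species_df`, and `setosa` are hypothetical.

```r
# A single column selected with $ or [ , ] is a vector (here, a factor).
species <- iris[, "Species"]
is.factor(species)                    # TRUE

# drop = FALSE keeps the result as a one-column data frame.
species_df <- iris[, "Species", drop = FALSE]
is.data.frame(species_df)             # TRUE

# A logical condition keeps only the rows where it is TRUE.
setosa <- iris[iris$Species == "setosa", ]
nrow(setosa)                          # 50
unique(as.character(setosa$Species))  # "setosa"
```

Tibbles behave differently in the first case: `[` applied to a tibble always returns a tibble, even for a single column.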

Subset a Data Frame based on Conditions

Define Data Frame as a Tibble

iris_tibble <- tibble(iris)

Example: The following code sequence subsets the iris data frame (in tibble form) in two steps: it first keeps only the rows of the “versicolor” species, then keeps only those rows with Sepal.Length at least $6$.

# subset by species to match "versicolor"
iris_versicolor <- iris_tibble[iris_tibble$Species == "versicolor",]
# subset rows with Sepal.Length at least 6
iris_versicolor2 <- iris_versicolor[iris_versicolor$Sepal.Length >= 6,]
# display result
glimpse(iris_versicolor2)

$\star$ The code sequence above stores each intermediate result in its own variable, so the subsetting conditions are applied one step at a time.
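
The two steps can also be combined into a single condition with the element-wise AND operator `&`. The sketch below uses the base R iris data frame so that it runs without extra packages, and checks that both routes select the same rows. The names `step1`, `two_step`, and `one_step` are hypothetical.

```r
# Two-step version: one condition at a time.
step1    <- iris[iris$Species == "versicolor", ]
two_step <- step1[step1$Sepal.Length >= 6, ]

# One-step version: both conditions joined with &.
one_step <- iris[iris$Species == "versicolor" & iris$Sepal.Length >= 6, ]

# Both approaches select the same rows.
all.equal(one_step, two_step)  # TRUE
```

The one-step form is shorter, while the two-step form makes it easier to inspect the intermediate result.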

Activity: Data Sub-Setting and Identifying Variables

Log in to Posit Cloud and open the RStudio assignment M 1/27 - Data Sub-Setting and Identifying Variables.
Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
Change the author in the YAML header.
Read the provided instructions.
Answer all exercise problems in the designated sections.

References

Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2012). OpenIntro statistics (4th ed.). OpenIntro. https://www.openintro.org/book/os/

Speegle, D., & Clair, B. (2021). Probability, statistics, and data: A fresh approach using R. Chapman & Hall/CRC. https://probstatsdata.com/