MTH-391A | Spring 2025 | University of Portland
March 10, 2025
High-dimensional data refers to datasets with a large number of variables (features/dimensions) relative to the number of observations.
Challenges:
Methods:
Identify Patterns & Relationships
Reduce Noise & Redundancy
United States Data Sets
usdata package: Data on the States and Counties of the United States
usa package: Updated US facts
Load Packages
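A minimal sketch of the package-loading step (the exact libraries loaded in the original slides are not shown; the tidyverse is assumed here for the dplyr verbs and ggplot2 plots that follow):
# load packages
library(tidyverse)   # dplyr verbs and ggplot2 (assumed)
library(usdata)      # provides the state_stats data set
library(usa)         # provides the facts data set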
facts and state_stats Data Set (1/2)
# select variables from facts and add the census region
facts_sub <- facts %>%
  select(name, life_exp, college) %>%
  mutate(region = state.region)
# select variables from state_stats
state_stats_sub <- state_stats %>%
  select(abbr, state, income, med_income, poverty, unempl, land_area, pop2010)
# join the two data sets, drop non-state rows, and compute population density
df <- facts_sub %>%
  left_join(state_stats_sub, by = c("name" = "state")) %>%
  filter(!name %in% c("District of Columbia", "Puerto Rico")) %>%
  select(-name) %>%
  mutate(pop_density = pop2010/land_area) %>%
  select(-land_area, -pop2010)
glimpse(df)
## Rows: 50
## Columns: 9
## $ life_exp <dbl> 75.3, 78.3, 79.7, 75.9, 81.5, 80.3, 81.0, 78.6, 80.0, 77.7…
## $ college <dbl> 0.2339, 0.2713, 0.2708, 0.2137, 0.3141, 0.3841, 0.3676, 0.…
## $ region <fct> South, West, West, South, West, West, Northeast, South, So…
## $ abbr <fct> AL, AK, AZ, AR, CA, CO, CT, DE, FL, GA, HI, ID, IL, IN, IA…
## $ income <dbl> 22984, 30726, 25680, 21274, 29188, 30151, 36775, 29007, 26…
## $ med_income <dbl> 42081, 66521, 50448, 39267, 60883, 56456, 67740, 57599, 47…
## $ poverty <dbl> 17.1, 9.5, 15.3, 18.0, 13.7, 12.2, 9.2, 11.0, 13.8, 15.7, …
## $ unempl <dbl> 7.6, 7.1, 8.7, 7.6, 10.9, 7.8, 7.8, 7.0, 9.4, 9.1, 6.4, 8.…
## $ pop_density <dbl> 94.376638, 1.244620, 56.270688, 56.037112, 239.145863, 48.…
life_exp | Life expectancy in years
college | Percent of the adult population with at least a bachelor’s degree
income | Average income per capita
med_income | Median household income
poverty | Poverty rate
unempl | Unemployment rate
pop_density | Population density, which is pop2010 / land_area
abbr | State abbreviation
region | Census region to which each state belongs
facts and state_stats Data Set (2/2)
\(\dagger\) Can we visualize or represent all features in a two-dimensional space?
Computing the Principal Components
# select the numerical variables
sub_num <- df %>%
  select(-abbr, -region)
# apply principal component analysis with centering and scaling
pca_res <- prcomp(sub_num, center = TRUE, scale. = TRUE)
# convert the principal components into a data frame and merge the categorical variables
pc <- tibble(data.frame(pca_res$x,
                        abbr = df$abbr,
                        region = df$region))
# view principal components
glimpse(pc)
## Rows: 50
## Columns: 9
## $ PC1 <dbl> -2.81823565, 1.34671603, -0.59066599, -3.38204922, 1.52869410, …
## $ PC2 <dbl> -0.3753737, 0.7158912, -0.3396640, -0.2871326, -1.4757631, 0.29…
## $ PC3 <dbl> 0.762569633, -0.330127302, -1.013110178, 0.465246298, -1.763041…
## $ PC4 <dbl> -0.295124993, -1.985938857, 0.251181457, -0.016770158, 0.156299…
## $ PC5 <dbl> 0.36861978, 0.40949364, -0.15767992, -0.07776277, -0.08938850, …
## $ PC6 <dbl> -0.039950619, -0.447131525, -0.397856523, -0.157909221, -0.4952…
## $ PC7 <dbl> -0.0194314032, 0.1464910527, -0.0290094878, -0.0090896835, 0.16…
## $ abbr <fct> AL, AK, AZ, AR, CA, CO, CT, DE, FL, GA, HI, ID, IL, IN, IA, KS,…
## $ region <fct> South, West, West, South, West, West, Northeast, South, South, …
\(\star\) Key Idea: The results of PCA are called principal components: new variables, each formed as a linear combination of the original variables.
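The weights of these linear combinations can be inspected from the prcomp() fit above; a brief sketch (not part of the original slides):
# each column of the rotation matrix holds the weights that combine
# the scaled original variables into one principal component
round(pca_res$rotation, 2)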
Variance Explained
\(\star\) Key Idea: Percent variance explained shows how much variability each principal component captures, while cumulative variance shows the total variability retained across components.
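A minimal sketch of how both quantities can be computed from the prcomp() fit above (the original slides may present this as a table or scree plot instead):
# proportion of variance explained by each principal component
var_explained <- pca_res$sdev^2 / sum(pca_res$sdev^2)
# percent and cumulative percent variance explained
tibble(PC = paste0("PC", seq_along(var_explained)),
       pct_var = 100 * var_explained,
       cum_pct = cumsum(100 * var_explained))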
Visualize the Dominant Principal Components
\(\star\) Key Idea: A PCA plot visualizes high-dimensional data by projecting it onto the principal components that capture the most variance, revealing patterns, groupings, and relationships among data points.
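A minimal sketch of such a plot using the pc data frame created above (the styling in the original slides may differ); each state abbreviation is placed at its PC1 and PC2 scores and colored by census region:
# plot the first two principal components, labeling states and coloring by region
ggplot(pc, aes(x = PC1, y = PC2, color = region, label = abbr)) +
  geom_text(size = 3) +
  labs(x = "PC1", y = "PC2", color = "Region")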
Principal Component Analysis (PCA) is a technique used to reduce the number of variables in a dataset while preserving as much important information as possible.
Why use PCA?
How does it work?
When to use PCA?
When not to use PCA?
digits data set (1/4)
Loading the digits data set (using Python)
digits data set loaded from the datasets package in Python
digits data set consists of 1,797 grayscale images of handwritten digits
digits.data contains the pixel information as variables
digits.feature_names contains the names of the variables
digits.target contains the digit labels
digits data set (2/4)
Example digits in the digits data set
\(\star\) Key Idea: The images are 8-by-8 pixels in grayscale, which means there are 64 variables representing each handwritten digit image.
digits data set (3/4)
Principal Component Analysis (using Python)
# load packages
from sklearn.decomposition import PCA
import pandas as pd
# apply PCA with defined number of components
pca = PCA(n_components=2)
# extract principal components
pca_result = pca.fit_transform(df_digits)
# convert results into a dataframe and merge the digit labels
digits_pca = pd.DataFrame(pca_result)
digits_pca = digits_pca.rename(columns={0: "PC1", 1: "PC2"})
digits_pca['digit'] = digits.target
# save results
digits_pca.to_csv("digits_pca.csv", header=True, index=False)
PCA from the sklearn.decomposition package establishes the method
fit_transform computes the principal components and then applies the transformation to project the data onto the new principal components
\(\star\) Key Idea: Use Python for data sub-setting and complex computations (such as PCA) on large, complex data sets.
digits data set (4/4)
Visualize the Principal Components of the digits data set (using R)
Rename the .Rmd file by replacing [name] with your name, using the format [First name][Last initial]. Then, open the .Rmd file.
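The visualization code itself is not included in these notes; a minimal sketch in R, assuming the tidyverse is loaded and the digits_pca.csv file exported by the Python code above is in the working directory:
# read the principal components exported from Python
digits_pca <- read_csv("digits_pca.csv") %>%
  mutate(digit = factor(digit))
# plot the first two principal components, colored by digit label
ggplot(digits_pca, aes(x = PC1, y = PC2, color = digit)) +
  geom_point(alpha = 0.7) +
  labs(x = "PC1", y = "PC2", color = "Digit")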