Visualizing High-Dimensional Data & Advanced Visualizations

Fundamentals of Data Science

MTH-391A | Spring 2025 | University of Portland

March 10, 2025

Objectives

What is High-Dimensional Data?

High-dimensional data refers to datasets with a large number of variables (features/dimensions) relative to the number of observations.

Challenges:

Methods:

Why Dimensionality Reduction?

Identify Patterns & Relationships

Reduce Noise & Redundancy

Case Study I

United States Data Sets

Load Packages

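library(tidyverse)  # provides %>%, select(), mutate(), glimpse(), tibble()
library(usdata)     # provides the state_stats data set
library(usa)        # provides the facts data set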

Simple Example: The facts and state_stats Data Set (1/2)

facts_sub <- facts %>% 
  select(name,life_exp,college) %>% 
  mutate(region=state.region)
state_stats_sub <- state_stats %>% 
  select(abbr,state,income,med_income,poverty,unempl,land_area,pop2010)

df <- facts_sub %>% 
  left_join(state_stats_sub,by=c("name" = "state")) %>% 
  filter(!name %in% c("District of Columbia", "Puerto Rico")) %>%
  select(-name) %>% 
  mutate(pop_density = pop2010/land_area) %>% 
  select(-land_area,-pop2010)
glimpse(df)
## Rows: 50
## Columns: 9
## $ life_exp    <dbl> 75.3, 78.3, 79.7, 75.9, 81.5, 80.3, 81.0, 78.6, 80.0, 77.7…
## $ college     <dbl> 0.2339, 0.2713, 0.2708, 0.2137, 0.3141, 0.3841, 0.3676, 0.…
## $ region      <fct> South, West, West, South, West, West, Northeast, South, So…
## $ abbr        <fct> AL, AK, AZ, AR, CA, CO, CT, DE, FL, GA, HI, ID, IL, IN, IA…
## $ income      <dbl> 22984, 30726, 25680, 21274, 29188, 30151, 36775, 29007, 26…
## $ med_income  <dbl> 42081, 66521, 50448, 39267, 60883, 56456, 67740, 57599, 47…
## $ poverty     <dbl> 17.1, 9.5, 15.3, 18.0, 13.7, 12.2, 9.2, 11.0, 13.8, 15.7, …
## $ unempl      <dbl> 7.6, 7.1, 8.7, 7.6, 10.9, 7.8, 7.8, 7.0, 9.4, 9.1, 6.4, 8.…
## $ pop_density <dbl> 94.376638, 1.244620, 56.270688, 56.037112, 239.145863, 48.…

Simple Example: The facts and state_stats Data Set (2/2)

\(\dagger\) Can we visualize or represent all features in a two-dimensional space?
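One way to see why this question matters: plotting the variables two at a time needs one scatterplot for every pair of variables. The sketch below counts those pairs. The helper name `n_pairwise_plots` is hypothetical and used only for illustration.

```python
from math import comb


def n_pairwise_plots(n_vars: int) -> int:
    """Number of distinct two-variable scatterplots for n_vars variables."""
    return comb(n_vars, 2)


# the state data above has 7 numerical variables
pairs_states = n_pairwise_plots(7)

# a hypothetical data set with 100 numerical variables
pairs_large = n_pairwise_plots(100)

print(pairs_states, pairs_large)
```

The count grows quadratically with the number of variables, which is one motivation for summarizing many variables in a few new dimensions.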

Principal Component Analysis: R Example (1/3)

Computing the Principal Components

# select the numerical variables
sub_num <- df %>% 
  select(-abbr,-region)

# apply principal component analysis
pca_res <- prcomp(sub_num, center = TRUE, scale. = TRUE)

# convert principal components into a data frame and merge the categorical variables
pc <- tibble(data.frame(pca_res$x,
                        abbr=df$abbr,
                        region=df$region))

# view principal components
glimpse(pc)
## Rows: 50
## Columns: 9
## $ PC1    <dbl> -2.81823565, 1.34671603, -0.59066599, -3.38204922, 1.52869410, …
## $ PC2    <dbl> -0.3753737, 0.7158912, -0.3396640, -0.2871326, -1.4757631, 0.29…
## $ PC3    <dbl> 0.762569633, -0.330127302, -1.013110178, 0.465246298, -1.763041…
## $ PC4    <dbl> -0.295124993, -1.985938857, 0.251181457, -0.016770158, 0.156299…
## $ PC5    <dbl> 0.36861978, 0.40949364, -0.15767992, -0.07776277, -0.08938850, …
## $ PC6    <dbl> -0.039950619, -0.447131525, -0.397856523, -0.157909221, -0.4952…
## $ PC7    <dbl> -0.0194314032, 0.1464910527, -0.0290094878, -0.0090896835, 0.16…
## $ abbr   <fct> AL, AK, AZ, AR, CA, CO, CT, DE, FL, GA, HI, ID, IL, IN, IA, KS,…
## $ region <fct> South, West, West, South, West, West, Northeast, South, South, …

\(\star\) Key Idea: PCA produces new variables called principal components. Each principal component is a linear combination of the original variables.
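To make the "linear combination" idea concrete, here is a minimal Python sketch using NumPy on made-up data (the data and variable names are assumptions, not the state data set). It standardizes the variables, obtains the loading vectors from a singular value decomposition, and checks that a PC1 score is just a weighted sum of one observation's standardized values.

```python
import numpy as np

# hypothetical data: 50 observations, 4 numerical variables
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] += 0.8 * X[:, 0]  # make two variables correlated

# center and scale each variable (sample standard deviation)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# singular value decomposition: rows of Vt are the loading vectors
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
loadings = Vt.T          # column j holds the weights for PC(j+1)
scores = Z @ loadings    # principal component scores

# PC1 score of the first observation, written as an explicit weighted sum
pc1_first = sum(loadings[j, 0] * Z[0, j] for j in range(Z.shape[1]))
print(np.isclose(pc1_first, scores[0, 0]))
```

The weights in `loadings` play the same role as the rotation matrix that a PCA routine reports alongside the scores.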

Principal Component Analysis: R Example (2/3)

Variance Explained

# get standard deviations of the principal components
sd <- pca_res$sdev

# create tibble of the proportion of variance explained
# (variance = squared standard deviation)
ve <- tibble(pc = seq(1,length(sd)),
             ve = sd^2/sum(sd^2),
             cumve = cumsum(ve))

\(\star\) Key Idea: Percent variance explained shows how much variability each principal component captures, while cumulative variance shows the total variability retained across components.
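As a small worked example, variance explained is based on the variance of each component, that is, its squared standard deviation. The Python sketch below uses made-up standard deviations (an assumption chosen so the arithmetic is easy to follow) rather than values from the state data.

```python
import numpy as np

# hypothetical standard deviations of four principal components
sdev = np.array([2.0, 1.0, 1.0, 0.5])

# variance of each component is the squared standard deviation
variances = sdev ** 2

# proportion of total variance captured by each component
ve = variances / variances.sum()

# running total of variance retained by the first k components
cumve = np.cumsum(ve)

print(ve)     # proportions: 0.64, 0.16, 0.16, 0.04
print(cumve)  # cumulative: 0.64, 0.80, 0.96, 1.00
```

In this toy case the first two components together retain 80% of the total variance.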

Principal Component Analysis: R Example (3/3)

Visualize the Dominant Principal Components

\(\star\) Key Idea: A PCA plot visualizes high-dimensional data by projecting it onto the principal components that capture the most variance, revealing patterns, groupings, and relationships among data points.

Principal Component Analysis (1/2)

Principal Component Analysis (PCA) is a technique used to reduce the number of variables in a dataset while preserving as much important information as possible.
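The following is a minimal sketch of that idea in Python, assuming made-up data in which three variables are driven by one underlying factor. It uses an eigendecomposition of the covariance matrix, which is one common way to compute PCA, and keeps a single component; all names and numbers are illustrative.

```python
import numpy as np

# hypothetical data: three variables that mostly follow one hidden factor t
rng = np.random.default_rng(1)
n = 200
t = rng.normal(size=n)
X = np.column_stack([
    t + 0.1 * rng.normal(size=n),
    2 * t + 0.1 * rng.normal(size=n),
    -t + 0.1 * rng.normal(size=n),
])

# center the data and compute the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# eigenvalues = component variances, eigenvectors = component directions
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # sort from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# keep only the first k components
k = 1
scores = Xc @ eigvecs[:, :k]               # reduced data, shape (200, 1)
X_hat = scores @ eigvecs[:, :k].T          # approximation rebuilt from k components

# share of total variance retained by the k components
retained = eigvals[:k].sum() / eigvals.sum()
print(scores.shape, round(retained, 3))
```

Here three variables are replaced by one new variable while nearly all of the variability is kept, because the original variables were highly redundant.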

Principal Component Analysis (2/2)

Advanced Example: The digits data set (1/4)

Loading the digits data set (using Python)

# load packages
from sklearn import datasets
import pandas as pd

# load the `digits` data set
digits = datasets.load_digits()
df_digits = pd.DataFrame(digits.data, 
                         columns=digits.feature_names)

Advanced Example: The digits data set (2/4)

Example digits in the digits data set

\(\star\) Key Idea: Each image is 8-by-8 pixels in grayscale, so each handwritten digit is represented by 64 variables (one per pixel).
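A short sketch of how an 8-by-8 image becomes a row of 64 variables, using random stand-in images rather than the actual digits data (the array contents and the 0 to 16 intensity range are assumptions for illustration).

```python
import numpy as np

# hypothetical stand-in: 5 grayscale images of 8-by-8 pixels
rng = np.random.default_rng(0)
images = rng.integers(0, 17, size=(5, 8, 8))

# flatten each image row by row into a single vector of 64 pixel values
features = images.reshape(len(images), -1)

# the pixel in row r, column c ends up in feature column r * 8 + c
r, c = 3, 5
same_pixel = features[0, r * 8 + c] == images[0, r, c]

# reshaping a feature row back to 8-by-8 recovers the original image
restored = features[0].reshape(8, 8)

print(features.shape, same_pixel)
```

Each row of `features` is one observation and each column is one pixel position, which is the tabular form that PCA expects.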

Advanced Example: The digits data set (3/4)

Principal Component Analysis (using Python)

# load packages
from sklearn.decomposition import PCA

# apply PCA with defined number of components
pca = PCA(n_components=2)

# extract principal components
pca_result = pca.fit_transform(df_digits)

# convert results into a dataframe and merge the digit labels
digits_pca = pd.DataFrame(pca_result)
digits_pca = digits_pca.rename(columns={0: "PC1", 1: "PC2"})
digits_pca['digit'] = (digits.target)

# save results
digits_pca.to_csv("digits_pca.csv",header=True,index=False)

\(\star\) Key Idea: Use Python for data subsetting and complex computations (such as PCA) on large, complex data sets.

Advanced Example: The digits data set (4/4)

Visualize the Principal Components of the digits data set (using R)

Activity: Implement PCA Visualizations

  1. Log in to Posit Cloud and open the RStudio assignment MA13: Implement PCA Visualizations.
  2. Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
  3. Change the author in the YAML header.
  4. Read the provided instructions.
  5. Answer all exercise problems in the designated sections.