MTH-391A | Spring 2025 | University of Portland
March 10, 2025
High-dimensional data refers to datasets with a large number of variables (features/dimensions) relative to the number of observations.
Challenges:
Methods:
Identify Patterns & Relationships
Reduce Noise & Redundancy
United States Data Sets
usdata package: Data on the States and Counties of the United States
usa package: Updated US facts
Load Packages
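A minimal sketch of the package-loading step (the exact libraries loaded in the original slides are not shown; the tidyverse is assumed here for the dplyr verbs and ggplot2 plots that follow):
# load packages
library(tidyverse)   # dplyr verbs and ggplot2 (assumed)
library(usdata)      # provides the state_stats data set
library(usa)         # provides the facts data set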
facts and state_stats Data Set (1/2)
# select variables from facts and add the census region
facts_sub <- facts %>%
  select(name, life_exp, college) %>%
  mutate(region = state.region)
# select variables from state_stats
state_stats_sub <- state_stats %>%
  select(abbr, state, income, med_income, poverty, unempl, land_area, pop2010)
# join the two data sets, drop non-state rows, and compute population density
df <- facts_sub %>%
  left_join(state_stats_sub, by = c("name" = "state")) %>%
  filter(!name %in% c("District of Columbia", "Puerto Rico")) %>%
  select(-name) %>%
  mutate(pop_density = pop2010/land_area) %>%
  select(-land_area, -pop2010)
glimpse(df)
## Rows: 50
## Columns: 9
## $ life_exp <dbl> 75.3, 78.3, 79.7, 75.9, 81.5, 80.3, 81.0, 78.6, 80.0, 77.7…
## $ college <dbl> 0.2339, 0.2713, 0.2708, 0.2137, 0.3141, 0.3841, 0.3676, 0.…
## $ region <fct> South, West, West, South, West, West, Northeast, South, So…
## $ abbr <fct> AL, AK, AZ, AR, CA, CO, CT, DE, FL, GA, HI, ID, IL, IN, IA…
## $ income <dbl> 22984, 30726, 25680, 21274, 29188, 30151, 36775, 29007, 26…
## $ med_income <dbl> 42081, 66521, 50448, 39267, 60883, 56456, 67740, 57599, 47…
## $ poverty <dbl> 17.1, 9.5, 15.3, 18.0, 13.7, 12.2, 9.2, 11.0, 13.8, 15.7, …
## $ unempl <dbl> 7.6, 7.1, 8.7, 7.6, 10.9, 7.8, 7.8, 7.0, 9.4, 9.1, 6.4, 8.…
## $ pop_density <dbl> 94.376638, 1.244620, 56.270688, 56.037112, 239.145863, 48.…
life_exp | Life expectancy in years
college | Percent of the adult population with at least a bachelor’s degree
income | Average income per capita
med_income | Median household income
poverty | Poverty rate
unempl | Unemployment rate
pop_density | Population density, which is pop2010 / land_area
abbr | State abbreviation
region | Census region to which each state belongs
facts and state_stats Data Set (2/2)
\(\dagger\) Can we visualize or represent all features in a two-dimensional space?
Computing the Principal Components
# select the numerical variables
sub_num <- df %>%
  select(-abbr, -region)
# apply principal component analysis with centering and scaling
pca_res <- prcomp(sub_num, center = TRUE, scale. = TRUE)
# convert the principal components into a data frame and merge the categorical variables
pc <- tibble(data.frame(pca_res$x,
                        abbr = df$abbr,
                        region = df$region))
# view principal components
glimpse(pc)
## Rows: 50
## Columns: 9
## $ PC1 <dbl> -2.81823565, 1.34671603, -0.59066599, -3.38204922, 1.52869410, …
## $ PC2 <dbl> -0.3753737, 0.7158912, -0.3396640, -0.2871326, -1.4757631, 0.29…
## $ PC3 <dbl> 0.762569633, -0.330127302, -1.013110178, 0.465246298, -1.763041…
## $ PC4 <dbl> -0.295124993, -1.985938857, 0.251181457, -0.016770158, 0.156299…
## $ PC5 <dbl> 0.36861978, 0.40949364, -0.15767992, -0.07776277, -0.08938850, …
## $ PC6 <dbl> -0.039950619, -0.447131525, -0.397856523, -0.157909221, -0.4952…
## $ PC7 <dbl> -0.0194314032, 0.1464910527, -0.0290094878, -0.0090896835, 0.16…
## $ abbr <fct> AL, AK, AZ, AR, CA, CO, CT, DE, FL, GA, HI, ID, IL, IN, IA, KS,…
## $ region <fct> South, West, West, South, West, West, Northeast, South, South, …
\(\star\) Key Idea: The results of PCA are called principal components: new variables, each formed as a linear combination of the original variables.
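The weights of these linear combinations can be inspected from the prcomp() fit above; a brief sketch (not part of the original slides):
# each column of the rotation matrix holds the weights that combine
# the scaled original variables into one principal component
round(pca_res$rotation, 2)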
Variance Explained
\(\star\) Key Idea: Percent variance explained shows how much variability each principal component captures, while cumulative variance shows the total variability retained across components.
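A minimal sketch of how both quantities can be computed from the prcomp() fit above (the original slides may present this as a table or scree plot instead):
# proportion of variance explained by each principal component
var_explained <- pca_res$sdev^2 / sum(pca_res$sdev^2)
# percent and cumulative percent variance explained
tibble(PC = paste0("PC", seq_along(var_explained)),
       pct_var = 100 * var_explained,
       cum_pct = cumsum(100 * var_explained))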
Visualize the Dominant Principal Components
\(\star\) Key Idea: A PCA plot visualizes high-dimensional data by projecting it onto the principal components that capture the most variance, revealing patterns, groupings, and relationships among data points.
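A minimal sketch of such a plot using the pc data frame created above (the styling in the original slides may differ); each state abbreviation is placed at its PC1 and PC2 scores and colored by census region:
# plot the first two principal components, labeling states and coloring by region
ggplot(pc, aes(x = PC1, y = PC2, color = region, label = abbr)) +
  geom_text(size = 3) +
  labs(x = "PC1", y = "PC2", color = "Region")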
Principal Component Analysis (PCA) is a technique used to reduce the number of variables in a dataset while preserving as much important information as possible.
Why use PCA?
How does it work?
When to use PCA?
When not to use PCA?
digits data set (1/4)
Loading the digits data set (using Python)
digits data set loaded from the datasets package in Python
digits data set consists of 1,797 grayscale images of handwritten digits
digits.data contains the pixel information as variables
digits.feature_names contains the names of the variables
digits.target contains the digit labels
digits data set (2/4)
Example digits in the digits data set
\(\star\) Key Idea: The images are 8-by-8 pixels in grayscale, which means there are 64 variables representing each handwritten digit image.
digits data set (3/4)
Principal Component Analysis (using Python)
# load packages
from sklearn.decomposition import PCA
import pandas as pd
# apply PCA with defined number of components
pca = PCA(n_components=2)
# extract principal components
pca_result = pca.fit_transform(df_digits)
# convert results into a dataframe and merge the digit labels
digits_pca = pd.DataFrame(pca_result)
digits_pca = digits_pca.rename(columns={0: "PC1", 1: "PC2"})
digits_pca['digit'] = digits.target
# save results
digits_pca.to_csv("digits_pca.csv", header=True, index=False)
PCA from the sklearn.decomposition package establishes the method
fit_transform computes the principal components and then applies the transformation to project the data onto the new principal components
\(\star\) Key Idea: Use Python for data sub-setting and complex computations (such as PCA) on large, complex data sets.
digits data set (4/4)
Visualize the Principal Components of the digits data set (using R)
Rename the .Rmd file by replacing [name] with your name, using the format [First name][Last initial]. Then, open the .Rmd file.
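The visualization code itself is not included in these notes; a minimal sketch in R, assuming the tidyverse is loaded and the digits_pca.csv file exported by the Python code above is in the working directory:
# read the principal components exported from Python
digits_pca <- read_csv("digits_pca.csv") %>%
  mutate(digit = factor(digit))
# plot the first two principal components, colored by digit label
ggplot(digits_pca, aes(x = PC1, y = PC2, color = digit)) +
  geom_point(alpha = 0.7) +
  labs(x = "PC1", y = "PC2", color = "Digit")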