Data Wrangling on Multiple Tables

Fundamentals of Data Science

MTH-391A | Spring 2025 | University of Portland

February 7, 2025

Objectives

Loading an External Data Set

Load packages

library(tidyverse)
library(readxl)

Importing a (CSV) .csv file

birthdays <- read_csv("birthdays.csv",col_names=TRUE)

\(\star\) Note that you need to have the birthdays.csv file in your working directory.

Importing a (Excel) .xlxs file

us_capitals <- read_excel("usa-state-capitals.xlsx", sheet = "owssvr")

\(\star\) Note that you need to have the usa-state-capitals.xlsx file in your working directory.

Other File Types

Verb Function
read_tsv() Reads tab-delimited files
read_delim() Reads in files with any delimiter, attempting to automatically guess the delimiter if you don’t specify it

\(\star\) Note that data sets are commonly saved as CSV and Excel files but there are other filetypes (such as shapefiles) that require special functions, which we will discuss soon.

View the Imported Data Sets

birthdays
## # A tibble: 372,555 × 7
##    state  year month   day date       wday  births
##    <chr> <dbl> <dbl> <dbl> <date>     <chr>  <dbl>
##  1 AK     1969     1     1 1969-01-01 Wed       14
##  2 AL     1969     1     1 1969-01-01 Wed      174
##  3 AR     1969     1     1 1969-01-01 Wed       78
##  4 AZ     1969     1     1 1969-01-01 Wed       84
##  5 CA     1969     1     1 1969-01-01 Wed      824
##  6 CO     1969     1     1 1969-01-01 Wed      100
##  7 CT     1969     1     1 1969-01-01 Wed       90
##  8 DC     1969     1     1 1969-01-01 Wed       88
##  9 DE     1969     1     1 1969-01-01 Wed       32
## 10 FL     1969     1     1 1969-01-01 Wed      288
## # ℹ 372,545 more rows
us_capitals
## # A tibble: 50 × 5
##    State       ST    Capital     Longitude Latitude
##    <chr>       <chr> <chr>           <dbl>    <dbl>
##  1 Alabama     AL    Montgomery      -86.3     32.4
##  2 Alaska      AK    Juneau         -134.      58.3
##  3 Arizona     AZ    Phoenix        -112.      33.4
##  4 Arkansas    AR    Little Rock     -92.3     34.7
##  5 California  CA    Sacramento     -121.      38.6
##  6 Colorado    CO    Denver         -105.      39.7
##  7 Connecticut CT    Hartford        -72.7     41.8
##  8 Delaware    DE    Dover           -75.5     39.2
##  9 Florida     FL    Tallahassee     -84.3     30.4
## 10 Georgia     GA    Atlanta         -84.4     33.7
## # ℹ 40 more rows

\(\star\) It looks like the state variable in the birthdays data frame have categories matching the ST variable in the us_capital data frame.

Using inner_join()

Joining birthdays and us_capital by matching the state abbreviations

states_joined <- birthdays |>
  inner_join(us_capitals, by = c("state" = "ST"))
states_joined
## # A tibble: 365,250 × 11
##    state  year month   day date       wday  births State       Capital Longitude
##    <chr> <dbl> <dbl> <dbl> <date>     <chr>  <dbl> <chr>       <chr>       <dbl>
##  1 AK     1969     1     1 1969-01-01 Wed       14 Alaska      Juneau     -134. 
##  2 AL     1969     1     1 1969-01-01 Wed      174 Alabama     Montgo…     -86.3
##  3 AR     1969     1     1 1969-01-01 Wed       78 Arkansas    Little…     -92.3
##  4 AZ     1969     1     1 1969-01-01 Wed       84 Arizona     Phoenix    -112. 
##  5 CA     1969     1     1 1969-01-01 Wed      824 California  Sacram…    -121. 
##  6 CO     1969     1     1 1969-01-01 Wed      100 Colorado    Denver     -105. 
##  7 CT     1969     1     1 1969-01-01 Wed       90 Connecticut Hartfo…     -72.7
##  8 DE     1969     1     1 1969-01-01 Wed       32 Delaware    Dover       -75.5
##  9 FL     1969     1     1 1969-01-01 Wed      288 Florida     Tallah…     -84.3
## 10 GA     1969     1     1 1969-01-01 Wed      250 Georgia     Atlanta     -84.4
## # ℹ 365,240 more rows
## # ℹ 1 more variable: Latitude <dbl>

\(\star\) The result contains only those rows that have matches in both tables.

Using left_join()

Joining us_capitals into birthdays by matching the state abbreviations

states_leftjoined <- birthdays |>
  left_join(us_capitals, by = c("state" = "ST"))
states_leftjoined
## # A tibble: 372,555 × 11
##    state  year month   day date       wday  births State       Capital Longitude
##    <chr> <dbl> <dbl> <dbl> <date>     <chr>  <dbl> <chr>       <chr>       <dbl>
##  1 AK     1969     1     1 1969-01-01 Wed       14 Alaska      Juneau     -134. 
##  2 AL     1969     1     1 1969-01-01 Wed      174 Alabama     Montgo…     -86.3
##  3 AR     1969     1     1 1969-01-01 Wed       78 Arkansas    Little…     -92.3
##  4 AZ     1969     1     1 1969-01-01 Wed       84 Arizona     Phoenix    -112. 
##  5 CA     1969     1     1 1969-01-01 Wed      824 California  Sacram…    -121. 
##  6 CO     1969     1     1 1969-01-01 Wed      100 Colorado    Denver     -105. 
##  7 CT     1969     1     1 1969-01-01 Wed       90 Connecticut Hartfo…     -72.7
##  8 DC     1969     1     1 1969-01-01 Wed       88 <NA>        <NA>         NA  
##  9 DE     1969     1     1 1969-01-01 Wed       32 Delaware    Dover       -75.5
## 10 FL     1969     1     1 1969-01-01 Wed      288 Florida     Tallah…     -84.3
## # ℹ 372,545 more rows
## # ℹ 1 more variable: Latitude <dbl>

\(\star\) The rows of the first table are always returned, regardless of whether there is a match in the second table.

Using right_join()

Joining birthdays into us_capitals by matching the state abbreviations

states_rightjoined <- birthdays |>
  right_join(us_capitals, by = c("state" = "ST"))
states_rightjoined
## # A tibble: 365,250 × 11
##    state  year month   day date       wday  births State       Capital Longitude
##    <chr> <dbl> <dbl> <dbl> <date>     <chr>  <dbl> <chr>       <chr>       <dbl>
##  1 AK     1969     1     1 1969-01-01 Wed       14 Alaska      Juneau     -134. 
##  2 AL     1969     1     1 1969-01-01 Wed      174 Alabama     Montgo…     -86.3
##  3 AR     1969     1     1 1969-01-01 Wed       78 Arkansas    Little…     -92.3
##  4 AZ     1969     1     1 1969-01-01 Wed       84 Arizona     Phoenix    -112. 
##  5 CA     1969     1     1 1969-01-01 Wed      824 California  Sacram…    -121. 
##  6 CO     1969     1     1 1969-01-01 Wed      100 Colorado    Denver     -105. 
##  7 CT     1969     1     1 1969-01-01 Wed       90 Connecticut Hartfo…     -72.7
##  8 DE     1969     1     1 1969-01-01 Wed       32 Delaware    Dover       -75.5
##  9 FL     1969     1     1 1969-01-01 Wed      288 Florida     Tallah…     -84.3
## 10 GA     1969     1     1 1969-01-01 Wed      250 Georgia     Atlanta     -84.4
## # ℹ 365,240 more rows
## # ℹ 1 more variable: Latitude <dbl>

\(\star\) Notice here the there result is the same the using the inner_join() because the us_capital only includes all 50 states while the birthdays include all 50 states plus DC.

Activity: Combine Datasets

  1. Log-in to Posit Cloud and open the R Studio assignment MA6: Combine Datasets.
  2. Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
  3. Change the author in the YAML header.
  4. Read the provided instructions.
  5. Answer all exercise problems on the designated sections.