Alex Brasch

Reference & Startup

The documentation and examples within this tutorial were gleaned from the following resources:

tidycensus
tigris
tidyverse
Census Developers
Census Geography Program
Leaflet for R

tidycensus is an R package that allows users to interface with the U.S. Census Bureau’s decennial census and American Community Survey (ACS) APIs, in order to retrieve demographic and economic data for specified geographies. As noted by its author, Kyle Walker, tidycensus “returns tidyverse-ready data frames of selected variables, and the option to include simple feature (sf) geometries”.

The main function of the decennial census is to provide counts of people for the purpose of congressional apportionment and federal funding allocation, while the primary purpose of the ACS is to measure the changing social and economic characteristics of the U.S. population, including education, housing, jobs, and more. Due to their differing purposes, variables that exist in the decennial census may not be included in the ACS and vice versa. The same is true between ACS years, due to the evolving nature of the surveys (i.e., questions may be added, removed, or revised).

The decennial census is an enumeration, meaning it aims to count the entire population of the country (at the location where each person usually lives). The decennial census asks a relatively small set of questions of people in homes and group living situations, including how many people live or stay in each home, as well as the sex, age, and race of each person.

ACS data differ from decennial census data in that ACS data are based on an annual sample of households, rather than a complete enumeration. In turn, ACS data points are estimates characterized by a margin of error (MOE). tidycensus will always return the estimate and MOE for any requested variables. When requesting ACS data with tidycensus, it is not necessary to specify the “E” or “M” suffix for a variable name. Available survey types include ACS 5-year estimates (acs5) or ACS 1-year estimates (acs1). Note that the latter is only available for geographies with populations of 65,000 and greater.

Install and load packages

To get started, load the tidycensus and tidyverse packages. Additional packages used within this RMarkdown file include tigris, readxl, writexl, arcgisbinding, leaflet, kableExtra, janitor, and extrafont.
A Census API key is also required, which can be obtained from http://api.census.gov/data/key_signup.html. The key only needs to be entered once if the census_api_key function is called with install = TRUE, which stores the key in the user’s .Renviron file so that it is available in future R sessions, rather than being tied to a single R script or Markdown file.
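A minimal setup sketch is shown here. The key string is a placeholder, and only the two core packages are loaded; the other packages listed above would be loaded the same way.

```r
# Install once if needed: install.packages(c("tidycensus", "tidyverse"))
library(tidycensus)
library(tidyverse)

# Placeholder key: replace with the key issued by the Census Bureau.
# install = TRUE writes the key to .Renviron so later sessions can reuse it.
census_api_key("YOUR_API_KEY_HERE", install = TRUE)

# Reload .Renviron so the key is available in the current session
readRenviron("~/.Renviron")
```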

Usage

The following section contains examples of using the tidycensus package and Census API to retrieve, prepare, reshape, and blend demographic data for analysis and visualization. To save on processing time and avoid local memory limitations, data-intensive code chunks have been commented out (e.g., retrieving large numbers of census blocks using the tidycensus or tigris packages). Readers can still view the underlying code, while the input data are read in from the R Project’s local data.

Review Variables

name     label                                 concept
H001001  Total                                 HOUSING UNITS
H002001  Total                                 URBAN AND RURAL
H002002  Total!!Urban                          URBAN AND RURAL
H002003  Total!!Urban!!Inside urbanized areas  URBAN AND RURAL
H002004  Total!!Urban!!Inside urban clusters   URBAN AND RURAL
H002005  Total!!Rural                          URBAN AND RURAL

name        label                                 concept
B00001_001  Estimate!!Total                       UNWEIGHTED SAMPLE COUNT OF THE POPULATION
B00002_001  Estimate!!Total                       UNWEIGHTED SAMPLE HOUSING UNITS
B01001_001  Estimate!!Total                       SEX BY AGE
B01001_002  Estimate!!Total!!Male                 SEX BY AGE
B01001_003  Estimate!!Total!!Male!!Under 5 years  SEX BY AGE
B01001_004  Estimate!!Total!!Male!!5 to 9 years   SEX BY AGE
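
Variable listings of this kind can be produced with the load_variables function. The sketch below assumes the 2010 decennial Summary File 1 and the 2018 ACS 5-year data set; the original call is not shown, so the years and data set names are assumptions.

```r
library(tidycensus)
library(dplyr)

# Assumed data set: 2010 decennial census, Summary File 1
vars_dec <- load_variables(2010, "sf1", cache = TRUE)

# Assumed data set: 2018 ACS 5-year estimates
vars_acs <- load_variables(2018, "acs5", cache = TRUE)

# Each result has the columns name, label, and concept
head(vars_dec)
head(vars_acs)

# Search the concept column to locate variables of interest
vars_acs %>%
  filter(grepl("MEDIAN HOUSEHOLD INCOME", concept))
```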

Example 3 - ACS

Retrieve 2018 ACS 5-year variables and geometries for a select set of a specified geography.

  • Define a geography and specify a subset
  • Define and name multiple variables
  • Retrieve geometries, specifically the TIGER/Line shapefiles

Concerning geometries, tidycensus uses the geographic coordinate system NAD 1983 (EPSG: 4269), which is the default for Census spatial data files. By default, tidycensus uses the Census cartographic boundary shapefiles for faster processing; if you prefer the TIGER/Line shapefiles (i.e., Topologically Integrated Geographic Encoding and Referencing), set cb = FALSE in the function call.

Per Census documentation, the cartographic boundary files are simplified representations of selected geographic areas from the Census Bureau’s Master Address File (MAF)/TIGER geographic database. These boundary files are specifically designed for small-scale thematic mapping. When possible, generalization is performed with intent to maintain the hierarchical relationships among geographies and to maintain the alignment of geographies within a file set for a given year. To improve the appearance of shapes, areas are represented with fewer vertices than detailed TIGER/Line equivalents. Some small holes or discontiguous parts of areas are not included in generalized files. Generalized boundary files are clipped to a simplified version of the U.S. outline. As a result, some off-shore areas may be excluded from the generalized files. Consult this TIGER Data Products Guide to determine which file type is best for your purposes.
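
A sketch of the kind of call this example describes. The three counties match those in the example output; the ACS variable codes (B19013_001, B22010_001, B22010_002) are assumptions chosen to fit the column names and should be confirmed with load_variables before use.

```r
library(tidycensus)

# Named vector: the names become column prefixes in the wide output.
# The variable codes are assumptions, not confirmed by the original tutorial.
acs_vars <- c(
  hh_medinc_total = "B19013_001",  # median household income
  hh_foodst_total = "B22010_001",  # households (food stamp/SNAP universe)
  hh_foodst_rec   = "B22010_002"   # households that received food stamps/SNAP
)

tri_county_wide <- get_acs(
  geography = "county",
  state     = "OR",
  county    = c("Clackamas", "Multnomah", "Washington"),
  variables = acs_vars,
  year      = 2018,
  survey    = "acs5",
  output    = "wide",   # one row per county, E and M columns per variable
  geometry  = TRUE,
  cb        = FALSE     # TIGER/Line instead of cartographic boundary files
)
```

No "E" or "M" suffix is given in the variable codes; tidycensus appends them to the column names in wide output.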

GEOID  NAME                       hh_medinc_totalE  hh_medinc_totalM  hh_foodst_totalE  hh_foodst_totalM  hh_foodst_recE  hh_foodst_recM
41005  Clackamas County, Oregon   76597             1050              155456            776               16694           859
41051  Multnomah County, Oregon   64337             779               321968            1216              55018           1390
41067  Washington County, Oregon  78010             1086              216507            1013              22213           1020

Concerning the structure of the data frame, a wide format contains a single row for each observation with many columns representing all variables (human-readable), while a long/tidy format contains many rows per observation (assuming more than one variable) with name-value pairs for each variable and associated value (machine-readable). For more details, see Hadley Wickham’s seminal paper Tidy Data.
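
To make the wide/long distinction concrete, the sketch below rebuilds the wide county table by hand (geometry omitted) and reshapes it with tidyr. This is an illustration of the two layouts, not the tutorial's own code; tidycensus returns either layout directly through its output argument.

```r
library(dplyr)
library(tidyr)

# Wide format: one row per county, an E (estimate) and M (MOE) column per variable
wide <- tibble::tribble(
  ~GEOID,  ~NAME,                       ~hh_medinc_totalE, ~hh_medinc_totalM,
  ~hh_foodst_totalE, ~hh_foodst_totalM, ~hh_foodst_recE,   ~hh_foodst_recM,
  "41005", "Clackamas County, Oregon",  76597, 1050, 155456,  776, 16694,  859,
  "41051", "Multnomah County, Oregon",  64337,  779, 321968, 1216, 55018, 1390,
  "41067", "Washington County, Oregon", 78010, 1086, 216507, 1013, 22213, 1020
)

# Long/tidy format: one row per county-variable pair
long <- wide %>%
  pivot_longer(
    cols          = -c(GEOID, NAME),
    names_to      = c("variable", ".value"),
    names_pattern = "^(.*)(E|M)$"   # split the trailing E/M off the variable name
  ) %>%
  rename(estimate = E, moe = M)

long
```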

Example 4 - ACS

Retrieve 2015 ACS 1-year variables and geometries for a select set of a specified geography.

  • Define a geography and specify a subset
  • Define and name multiple variables
  • Retrieve geometries, specifically the cartographic boundaries
  • Output in long/tidy format
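
A sketch of such a request, under the same assumptions about counties and variable codes as in Example 3 (the codes are not confirmed by the original and should be checked with load_variables for the 2015 1-year data set):

```r
library(tidycensus)

tri_county_long <- get_acs(
  geography = "county",
  state     = "OR",
  county    = c("Clackamas", "Multnomah", "Washington"),
  variables = c(
    hh_medinc_total = "B19013_001",  # assumed code
    hh_foodst_total = "B22010_001",  # assumed code
    hh_foodst_rec   = "B22010_002"   # assumed code
  ),
  year      = 2015,
  survey    = "acs1",
  output    = "tidy",  # long format: variable, estimate, moe columns
  geometry  = TRUE     # cartographic boundary files are used by default
)
```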

Compare the structure of the data sets.

GEOID  NAME                       hh_medinc_totalE  hh_medinc_totalM  hh_foodst_totalE  hh_foodst_totalM  hh_foodst_recE  hh_foodst_recM
41005  Clackamas County, Oregon   76597             1050              155456            776               16694           859
41051  Multnomah County, Oregon   64337             779               321968            1216              55018           1390
41067  Washington County, Oregon  78010             1086              216507            1013              22213           1020

GEOID  NAME                       variable         estimate  moe
41005  Clackamas County, Oregon   hh_medinc_total  69629     2898
41005  Clackamas County, Oregon   hh_foodst_total  152414    1886
41005  Clackamas County, Oregon   hh_foodst_rec    17527     1737
41051  Multnomah County, Oregon   hh_medinc_total  59231     2100
41051  Multnomah County, Oregon   hh_foodst_total  311797    3172
41051  Multnomah County, Oregon   hh_foodst_rec    61844     2753
41067  Washington County, Oregon  hh_medinc_total  70447     1703
41067  Washington County, Oregon  hh_foodst_total  211139    2446
41067  Washington County, Oregon  hh_foodst_rec    24253     2324

Example 5 - ACS

Retrieve 2018 ACS 5-year variables and geometries for a geography within a larger geography (e.g., all counties within a state).

  • Create a vector of all members of a specified geography (e.g., all counties in Oregon)
  • Define and name multiple variables
  • Retrieve geometries, specifically the cartographic boundaries
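
One way to build the county vector is from the fips_codes lookup table that ships with tidycensus. This is a sketch; the original may have built the vector differently, and the variable codes are assumptions.

```r
library(tidycensus)
library(dplyr)

# All county FIPS codes in Oregon, taken from the built-in lookup table
or_counties <- fips_codes %>%
  filter(state == "OR") %>%
  pull(county_code)

or_county_acs <- get_acs(
  geography = "county",
  state     = "OR",
  county    = or_counties,
  variables = c(
    hh_medinc_total = "B19013_001",  # assumed code
    hh_foodst_total = "B22010_001",  # assumed code
    hh_foodst_rec   = "B22010_002"   # assumed code
  ),
  year      = 2018,
  survey    = "acs5",
  geometry  = TRUE
)
```

Omitting the county argument also returns every county in the state; the explicit vector is useful when only a subset is needed.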

GEOID  NAME                      variable         estimate  moe
41005  Clackamas County, Oregon  hh_medinc_total  76597     1050
41005  Clackamas County, Oregon  hh_foodst_total  155456    776
41005  Clackamas County, Oregon  hh_foodst_rec    16694     859
41021  Gilliam County, Oregon    hh_medinc_total  42976     4505
41021  Gilliam County, Oregon    hh_foodst_total  848       60
41021  Gilliam County, Oregon    hh_foodst_rec    141       38

Example 7 - Decennial Census

At one point in 2019, retrieving all decennial census block group data for a specified state or county generated an error. This has since been resolved, but it provides a good example of how smaller nested geographies can be aggregated to larger geographies (e.g., blocks to block groups).

This works now…

Retrieve block group data for an entire county.
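
A sketch of the block group request, assuming King County, Washington and total population (P001001) from the 2010 decennial census, as suggested by the comparison table in this example:

```r
library(tidycensus)

king_bg <- get_decennial(
  geography = "block group",
  state     = "WA",
  county    = "King",
  variables = "P001001",  # total population
  year      = 2010,
  sumfile   = "sf1"
)
```

The same call with geography = "block" would return block-level counts for the county.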

But if it didn’t…

Retrieve block data for an entire county.

Aggregate data to block groups via creation of the block group GEOID by removing the last 3 characters in the block GEOID and grouping by/summarizing to block groups.
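
The aggregation step can be sketched offline with a few made-up blocks. The GEOIDs below follow the 15-character block format (state, county, tract, block), but the population values are hypothetical; real input would come from a block-level get_decennial call.

```r
library(dplyr)
library(stringr)

# Hypothetical block-level counts (values are illustrative, not real data)
blocks <- tibble::tribble(
  ~GEOID,            ~P001001,
  "530330001001000", 120,
  "530330001001001",  80,
  "530330001002000",  50,
  "530330001002001",  70
)

block_groups <- blocks %>%
  # Dropping the last 3 characters leaves the 12-character block group GEOID
  mutate(GEOID = str_sub(GEOID, end = -4)) %>%
  group_by(GEOID) %>%
  summarise(P001001 = sum(P001001), .groups = "drop")

block_groups
```

Joining the aggregated result to the directly retrieved block group data by GEOID gives side-by-side columns that can be compared for equality.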

Compare the data sets.

GEOID         NAME                                                    P001001.x  P001001.y
530330001001  Block Group 1, Census Tract 1, King County, Washington  1250       1250
530330001002  Block Group 2, Census Tract 1, King County, Washington  1234       1234
530330001003  Block Group 3, Census Tract 1, King County, Washington  1337       1337
530330001004  Block Group 4, Census Tract 1, King County, Washington  1492       1492
530330001005  Block Group 5, Census Tract 1, King County, Washington  942        942
530330002001  Block Group 1, Census Tract 2, King County, Washington  1086       1086

Example 8 - ACS and tigris

In some cases, you may want to retrieve the tabular and spatial data separately (to avoid very large data sets during analysis) and join the data sets after analysis. In those cases, tidycensus can be used in combination with tigris, which is an R package that allows users to directly download and use TIGER/Line shapefiles from the US Census Bureau.

Retrieve variables and geometries for a geography within a larger geography (e.g., all tracts within a county).

  • Create a vector of variables
  • Create a vector of geographies
  • Use tidycensus to retrieve tabular data
  • Use tigris to retrieve spatial geometries
  • Join the tabular and spatial data

Retrieve variables for a specified geography via the tidycensus package.
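
A sketch of the tabular-only request. The county (Multnomah County, Oregon) and the variable code are assumptions for illustration; geometry = FALSE keeps the result a plain data frame.

```r
library(tidycensus)

# Assumed variable code; confirm with load_variables()
acs_vars <- c(hh_medinc_total = "B19013_001")

tract_attrs <- get_acs(
  geography = "tract",
  state     = "OR",
  county    = "Multnomah",  # assumed county for illustration
  variables = acs_vars,
  year      = 2018,
  survey    = "acs5",
  geometry  = FALSE         # tabular data only
)
```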

Retrieve geometries for a specified geography via the tigris package.

By default, tigris retrieves the most recent vintage of a data set, so specify a different year if needed. The default coordinate reference system (CRS) is NAD83 (EPSG: 4269; https://spatialreference.org/ref/epsg/nad83/). The default geometry type is the TIGER/Line file. If cb is set to TRUE, tigris will download a generalized (1:500k) set of geometries.
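
A sketch of the matching tigris request, again assuming Multnomah County, Oregon. Recent tigris versions return sf objects by default; older versions may need the class to be set to sf explicitly.

```r
library(tigris)

# Cache downloaded shapefiles to avoid repeated downloads
options(tigris_use_cache = TRUE)

tract_geoms <- tracts(
  state  = "OR",
  county = "Multnomah",  # assumed county for illustration
  cb     = FALSE,        # TIGER/Line; set TRUE for generalized 1:500k boundaries
  year   = 2018          # match the vintage of the tabular data
)
```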

Join the variables to the geometries.

Note that the class of the resulting object depends on the join order, because the left side’s class takes priority. In the above, the attributes (right side) are joined to the geometries (left side), so the resulting object class is sf (simple feature). If the order is flipped (below) and the geometries (right side) are joined to the attributes (left side), the object class is not sf. To make it so, add %>% st_as_sf() to the end of the pipeline.
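
The effect of join order can be checked offline with a toy example. The point geometries and coordinates below are hypothetical stand-ins for tract or county polygons; only the class behavior is of interest.

```r
library(dplyr)
library(sf)

# Toy geometries (hypothetical points standing in for polygons), NAD83
geoms <- st_sf(
  GEOID    = c("41005", "41051"),
  geometry = st_sfc(
    st_point(c(-122.2, 45.2)),
    st_point(c(-122.4, 45.5)),
    crs = 4269
  )
)

# Toy attribute table
attrs <- tibble::tibble(
  GEOID    = c("41005", "41051"),
  estimate = c(76597, 64337)
)

# Geometries on the left: the result keeps the sf class
geo_first <- left_join(geoms, attrs, by = "GEOID")

# Attributes on the left: the result is a plain tibble with a geometry column
attr_first <- left_join(attrs, geoms, by = "GEOID")

# Convert back to sf when the attribute table must be on the left
fixed <- attr_first %>% st_as_sf()

class(geo_first)
class(attr_first)
class(fixed)
```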

Visualization

The following visualizations are created using the ggplot2 package (which is part of the tidyverse), mapview package, and leaflet for R package.

Faceted Choropleth Map

Create faceted choropleth maps of multiple variables across geographies within a single county.

As mentioned by Kyle Walker in his tidycensus tutorial, “one of the most powerful features of ggplot2 is its support for small multiples, which works very well with the tidy data format returned by tidycensus. Many Census and ACS variables return counts, which are generally inappropriate for choropleth mapping. In turn, get_decennial and get_acs have an optional argument, summary_var, that can work as a multi-group denominator when appropriate.” For example, view the racial/ethnic population distribution within a given county.
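
A sketch of this pattern, adapted from the approach in the tidycensus documentation. The county (Multnomah County, Oregon) is an assumption, and the 2010 decennial variable codes should be confirmed with load_variables before use.

```r
library(tidycensus)
library(tidyverse)

# Assumed 2010 decennial (SF1) variable codes for race/ethnicity groups
race_vars <- c(
  White    = "P005003",
  Black    = "P005004",
  Asian    = "P005006",
  Hispanic = "P004003"
)

county_race <- get_decennial(
  geography   = "tract",
  state       = "OR",
  county      = "Multnomah",  # assumed county for illustration
  variables   = race_vars,
  summary_var = "P001001",    # total population as the common denominator
  year        = 2010,
  geometry    = TRUE
)

county_race %>%
  mutate(pct = 100 * value / summary_value) %>%  # counts to percentages
  ggplot(aes(fill = pct)) +
  geom_sf(color = NA) +
  facet_wrap(~variable) +                        # one panel per group
  scale_fill_viridis_c(name = "% of population") +
  theme_void()
```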

mapview

Create an unformatted, interactive map using the mapview package.
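
A minimal sketch, assuming tract-level median household income for Multnomah County, Oregon (the county and variable code are assumptions for illustration):

```r
library(tidycensus)
library(mapview)

medinc <- get_acs(
  geography = "tract",
  state     = "OR",
  county    = "Multnomah",                      # assumed county
  variables = c(hh_medinc_total = "B19013_001"), # assumed variable code
  year      = 2018,
  survey    = "acs5",
  geometry  = TRUE
)

# zcol selects the column used to color the polygons
mapview(medinc, zcol = "estimate")
```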

leaflet

Create an interactive map using leaflet.
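
A sketch of a styled leaflet map using the same assumed data as in the mapview example (Multnomah County tracts, median household income). The palette, basemap, and labels are illustrative choices, not the original author's.

```r
library(tidycensus)
library(sf)
library(leaflet)

medinc <- get_acs(
  geography = "tract",
  state     = "OR",
  county    = "Multnomah",                      # assumed county
  variables = c(hh_medinc_total = "B19013_001"), # assumed variable code
  year      = 2018,
  survey    = "acs5",
  geometry  = TRUE
)

# leaflet expects WGS84 coordinates; tidycensus returns NAD83
medinc <- st_transform(medinc, 4326)

# Continuous color palette over the estimate range
pal <- colorNumeric(palette = "YlGnBu", domain = medinc$estimate)

leaflet(medinc) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(
    fillColor   = ~pal(estimate),
    fillOpacity = 0.7,
    color       = "white",
    weight      = 0.5,
    popup       = ~paste0(NAME, ": ", estimate)
  ) %>%
  addLegend(
    position = "bottomright",
    pal      = pal,
    values   = ~estimate,
    title    = "Median household income"
  )
```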

