The following is part of the code that I utilized to pull, organize, and clean Census data for my doctoral dissertation.
TL;DR: I ran a variety of advanced statistical models (OLS, LAGSAR, SSGLM) to analyze the demographics of incoming Cuban, Venezuelan, and Puerto Rican migrants in Houston, TX and the impact of their arrival on local neighborhood composition. Results show a negative correlation between changes in the proportion of Cubans and changes in the proportion of non-Hispanic Blacks. Analysis was run in R and STATA, and visualized in R and Excel.
In this sample, I analyzed and compared the residential distribution of Venezuelan, Cuban, and Puerto Rican migrants in Houston, TX. Houston is not a traditional migrant destination for any of these three groups: U.S.-bound Puerto Rican migration has traditionally been associated with the Northeast and Chicago, while Cuban and Venezuelan migration has generally gravitated towards South Florida. My goal was to gauge the social mobility of migrants in a new destination and understand the impact that migrants have on local communities. To that end, I used spatial regressions to compare socioeconomic changes in the neighborhoods with the highest percent change in the selected migrant groups.
Data for this research come from two datasets: U.S. Census Bureau American Community Survey (ACS) individual-level data and ACS tract-level data.
- The first ACS dataset includes data gathered by the U.S. Census from 2005 to 2019 (repeated cross-section).
- The second dataset comprises ACS census tract data collected by the U.S. Census. It includes the 2010 and 2015 5-year estimates, which are based on data collected over the 5-year periods (2010-2014 and 2015-2019, respectively) and describe the average characteristics for those time periods.
I use a variety of platforms for the analysis, like Microsoft Office (Excel), STATA, and R (tidyverse, tidycensus, spatialreg, purrr, and others).
The analyses presented here use four different statistical approaches: descriptive statistics, OLS regression, a spatial autoregressive lag model, and a sparse spatial model.
- I utilized an OLS regression to test for the effects of birthplace on logged income and socioeconomic index (HWSEI). The OLS had four different blocks: 1) predicting income or HWSEI with dummy variables for birthplace, 2) and 3) introducing control variables (employment, age, family size, marital status, employment history, and educational attainment), and 4) testing the race and ethnic interaction variables.
- I performed **descriptive statistical analysis** to analyze and compare neighborhood characteristics. The goal of this was to better understand the residential and sociographic characteristics of the spaces that our target populations chose as their home.
- I employed a **spatial autoregressive lag model** (SAR or Spatial Lag) to measure the global effects of population changes across census tracts. The SAR model can perform this analysis by calculating how the outcome variable in one area (census tract) is affected by the outcomes in nearby areas, covariates from nearby areas, and errors from nearby areas.
- I built a Sparse Spatial Generalized Linear Model (SSGLM) to test for spatial autocorrelation between the outcome variables and the key independent variables. SSGLMs can reduce the interference of spatial confounding and improve regression inference (Hughes & Haran, 2013). Like the GLM, the SSGLM utilizes eigenvectors that exhibit spatial dependence, improving the speed of the calculation. I opted for a Gaussian linear model given the distribution of the Delta Non-Hispanic Black score.
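To make the spatial lag idea concrete: a SAR model is usually written as y = ρWy + Xβ + ε, where W is a spatial weights matrix and Wy is the average outcome among each tract's neighbors. The sketch below computes only that Wy term in plain Python. It is an illustration, not the R code used for the analysis, and the four-tract layout, the tract names, and the values are all hypothetical.

```python
def row_standardize(neighbors):
    """Turn an adjacency dict into row-standardized weights (each row sums to 1)."""
    weights = {}
    for tract, nbrs in neighbors.items():
        weights[tract] = {n: 1.0 / len(nbrs) for n in nbrs} if nbrs else {}
    return weights


def spatial_lag(values, weights):
    """Weighted average of each tract's neighbors' values (the W*y term)."""
    return {
        tract: sum(w * values[n] for n, w in nbr_weights.items())
        for tract, nbr_weights in weights.items()
    }


# Hypothetical toy layout: four tracts in a row, A - B - C - D.
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

# Hypothetical change scores (not values from the dissertation).
delta_nh_black = {"A": -2.0, "B": -1.0, "C": 0.5, "D": 1.5}

W = row_standardize(neighbors)
lagged = spatial_lag(delta_nh_black, W)

for tract in sorted(lagged):
    print(tract, delta_nh_black[tract], lagged[tract])
```

In a fitted spatial lag model, the coefficient ρ on this lagged term captures how strongly a tract's outcome moves with the outcomes of its neighbors.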
First, we can easily notice the steady increase of all three populations in Harris County between 2005 and 2019:
When we compare populations in Houston, we see that Cubans have higher percentages of men and Black Hispanics. Almost 60% of Venezuelans have a college degree, roughly double the proportion of Cubans (28%) and well above that of Puerto Ricans (37%). Likewise, Venezuelans have the highest personal income, while Cubans have the lowest ($33k). Overall, Venezuelans seem to have higher socioeconomic status, while Cubans have the highest rates of poverty and the lowest rates of homeownership.
But how statistically sound are these assumptions? Using an OLS regression, I find that Puerto Ricans have a higher income than Venezuelans and Cubans after controlling for all relevant social and economic variables.
When analyzing socioeconomic prestige, Venezuelans and Puerto Ricans have a much higher socioeconomic index score than Mexicans. Meanwhile, Cubans are almost on par with Mexicans in terms of SEI. These two models suggest that all three groups are faring better than Mexican-born migrants in Harris County and that Venezuelans and Puerto Ricans enjoy better socioeconomic outcomes.
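The first block of these OLS models, regressing an outcome on birthplace dummy variables, can be sketched as follows. This is a minimal standard-library illustration rather than the STATA or R code behind the results: the records are made up, Mexico is assumed to be the omitted reference category, and no control variables are included. With dummies alone, the intercept equals the reference group's mean logged income and each coefficient is that group's difference from it.

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(X, y):
    """Coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical records: (birthplace, annual income in dollars).
records = [
    ("Mexico", 20000), ("Mexico", 30000),
    ("Cuba", 25000), ("Cuba", 35000),
    ("Venezuela", 40000), ("Venezuela", 50000),
    ("Puerto Rico", 45000), ("Puerto Rico", 55000),
]

# Assumption: Mexico is the omitted reference category.
groups = ["Cuba", "Venezuela", "Puerto Rico"]

# Design matrix: intercept column plus one dummy per non-reference birthplace.
X = [[1.0] + [1.0 if place == g else 0.0 for g in groups] for place, _ in records]
y = [math.log(income) for _, income in records]

coefs = ols(X, y)
for name, value in zip(["Intercept"] + groups, coefs):
    print(name, round(value, 4))
```

Later blocks would append the control variables and interaction terms as extra columns of `X`.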
When looking at the distribution of Hispanics in Houston, we observe that migrant enclaves are concentrated in the outer rings of Houston, mostly southwest, west, and northeast. For the remainder of this analysis, I focus on six census tracts that observed the highest increases of all three populations (Cubans, Venezuelans, and Puerto Ricans).
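As a rough sketch of how tracts with the highest increases could be identified, the snippet below computes the percent change between two period estimates and ranks tracts by it. The tract ids and counts are hypothetical, only one group is shown, and the rule for combining the three groups is not specified here, so this illustrates the general idea rather than the exact selection procedure.

```python
def percent_change(before, after):
    """Percent change between two period estimates; None when the base is zero."""
    if before == 0:
        return None
    return (after - before) / before * 100.0


# Hypothetical tract-level counts for one migrant group:
# tract id -> (2010-2014 estimate, 2015-2019 estimate)
counts = {
    "tract_a": (120, 180),
    "tract_b": (200, 210),
    "tract_c": (50, 125),
    "tract_d": (0, 40),  # no baseline population, so percent change is undefined
    "tract_e": (300, 240),
}

changes = {}
for tract, (before, after) in counts.items():
    change = percent_change(before, after)
    if change is not None:
        changes[tract] = change

# Keep the tracts with the largest percent increases (top 2 in this toy example).
top_tracts = sorted(changes, key=changes.get, reverse=True)[:2]
print(top_tracts)
```

Tracts with a zero baseline are dropped here because their percent change is undefined. That is one possible handling, not necessarily the one used in the analysis.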
For my Spatial Lag and Sparse Spatial GL models, I utilized four neighborhood-based change score variables as my outcomes: average income, poverty rate, proportion of non-Hispanic Whites, and proportion of non-Hispanic Blacks. Following the methodology tested by other research, I first determined if there was any spatial patterning by running a GLM and mapping its residuals (Chen et al., 2020; Corkeron et al., 2011). The Gaussian distribution of the Delta Non-Hispanic Black values called for a Gaussian GLM. After analyzing the residuals, I constructed a network of spatial eigenvectors and utilized the ‘ME’ command (part of the spdep package) to find the smallest subset of eigenvectors that removes the residual spatial autocorrelation. I reran the GLM with the eigenvectors as covariates, referred to as E-GLM. After applying a variety of relevant socioeconomic variables as controls and running diagnostic tests (like Monte Carlo and Moran tests), I landed on one statistically significant outcome: there was a meaningful interaction between a change in the Cuban population and the non-Hispanic Black population. Below, you will see the results of the Spatial Lag model, which found that census tracts that had a percent increase in Cuban population had a decrease (-2.68) in their non-Hispanic Black population.
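The residual check described above relies on Moran's I, which measures whether similar values cluster among neighboring tracts, together with a Monte Carlo permutation test. The sketch below implements both in plain Python with binary contiguity weights. It is an illustration of the statistic, not the spdep implementation, and the six-tract layout and residual values are hypothetical.

```python
import random


def morans_i(values, neighbors):
    """Moran's I with binary contiguity weights (1 for neighbors, 0 otherwise)."""
    n = len(values)
    mean = sum(values.values()) / n
    dev = {k: v - mean for k, v in values.items()}
    total_weight = sum(len(nbrs) for nbrs in neighbors.values())
    cross = sum(dev[i] * dev[j] for i, nbrs in neighbors.items() for j in nbrs)
    denom = sum(d * d for d in dev.values())
    return (n / total_weight) * (cross / denom)


def permutation_p_value(values, neighbors, n_sim=999, seed=0):
    """One-sided Monte Carlo test: shuffle values across tracts and recompute I."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    keys = list(values)
    shuffled = list(values.values())
    at_least_as_large = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        simulated = morans_i(dict(zip(keys, shuffled)), neighbors)
        if simulated >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_sim + 1)


# Hypothetical layout: six tracts in a row, each touching its immediate neighbors.
neighbors = {
    "t1": ["t2"], "t2": ["t1", "t3"], "t3": ["t2", "t4"],
    "t4": ["t3", "t5"], "t5": ["t4", "t6"], "t6": ["t5"],
}

# Hypothetical residuals with clear clustering (negatives on one side, positives on the other).
residuals = {"t1": -3.0, "t2": -2.0, "t3": -1.0, "t4": 1.0, "t5": 2.0, "t6": 3.0}

observed_i, p_value = permutation_p_value(residuals, neighbors)
print(observed_i, p_value)
```

A clearly positive Moran's I with a small p-value indicates leftover spatial autocorrelation in the residuals, which is the signal that motivates adding spatial eigenvectors as covariates.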
The SSGLM also rendered similar results: the Cuban-born population and the non-Hispanic Black population share a negative correlation.
I was able to map the residuals of the SSGLM in R, which allowed me to see which neighborhoods had the highest increases in Cuban-born population (shown in yellow):
Conversely, here are the sharpest drops in Cuban-born population (shown in yellow):
As I built these models, going from simple regressions to spatial autoregressions, I ran tests and plotted graphs to make sure I was headed the right way. Here are some examples. The GLM residual plot shows that the residuals are somewhat clustered around positive predicted values for Delta Non-Hispanic Black, suggesting that there might be another socioeconomic characteristic missing from my model that could better explain this correlation.
Here are the two main takeaways from this sample analysis:
- I found that Puerto Rican and Venezuelan migrants were faring better than Cuban and Mexican migrants in Houston, which contrasts with past research where Puerto Ricans trailed behind most migrant groups.
- The entrance of Cubans into inner-city neighborhoods of Houston appears to have a displacing effect on the local non-Hispanic Black population. While this research could not explain why, it demonstrated a nuanced method for measuring the direct and indirect effects of migration on neighborhood demographic composition.









