Transcript for 2025 Summer Training Series, Session 4, Exploratory Analysis Presenters: Alexander F. Roehrkasse, Ph.D., Butler University, Noah Won, M.S., NDACAN National Data Archive on Child Abuse and Neglect (NDACAN) [MUSIC] [VOICEOVER] National Data Archive on Child Abuse and Neglect. [ONSCREEN CONTENT SLIDE 1] NDACAN Summer Training series schedule July 2nd, 2025 Developing a research question & exploring the data July 9th, 2025 Data management July 16th, 2025 Linking data July 23rd, 2025 Exploratory Analysis July 30th, 2025 Visualization and finalizing the analysis [Andres Arroyo] Hello everyone, my name is Andres Arroyo and I am the Archiving Assistant for the National Data Archive on Child Abuse and Neglect or NDACAN. Welcome to the fourth session of our 2025 Summer Training Series. This session is called Exploratory Analysis. Each session this month builds on the previous ones, and all sessions are being recorded and will be posted on the NDACAN website. [ONSCREEN CONTENT SLIDE 2] Welcome to the 2025 NDACAN Summer training series! National Data Archive on Child Abuse and Neglect Duke University, Cornell University, UC San Francisco, & Mathematica [Andres Arroyo] The National Data Archive on Child Abuse and Neglect is a project of Duke University, Cornell University, the University of California San Francisco, and Mathematica. We are funded by the Children's Bureau, which is an office of the Administration for Children and Families. [ONSCREEN CONTENT SLIDE 4] Life Cycle of an NDACAN research project This session is being recorded. Please submit questions to the Q&A box. See ZOOM Help Center for connection issues: https://support.zoom.us/hc/en-us If issues persist and solutions cannot be found through Zoom, please contact Andres Arroyo at aa17@cornell.edu. [Andres Arroyo] We ask that if you have a question, please enter it into the Q&A box of the Zoom interface, not the chat box. Questions will be answered at the end of the presentation. And if you need any assistance with Zoom issues, you can contact me at aa17@cornell.edu and I will try to work with you as best I can. There's also information about Zoom at the Zoom support help site. That concludes the introductory comments. Thank you very much. [ONSCREEN CONTENT SLIDE 5] SESSION AGENDA STS recap Exploratory analysis Demonstration in R [Alexander F. Roehrkasse] Hi everyone. My name is Alex Roehrkasse. I'm a professor of sociology and criminology at Butler University. And I'm also a Research Associate at NDACAN. You may recognize my voice from prior sessions this summer. It's been an exciting Summer Training Series this year, leading a series of trainings that build on one another to illustrate the life cycle of a research project using archive data. Today, though, we're going to initiate a hand-off to Noah Won, who's a statistician at the archive. So differently than in previous sessions, I'll be working through the slides today. But then when it comes to our demonstration in R, I'll be handing things off to Noah, who will be leading the demonstration. Noah will also be leading next week's final STS presentation on finalizing your analysis and visualizing your results. So just to outline today's agenda, as I've been doing in previous meetings, we'll briefly recap last session's learnings. And then I will work through a slide deck illustrating some basic principles and strategies for exploratory analysis. 
This will mostly hinge on linear regression analysis and basic strategies for estimating the things you actually want to estimate using regressions. Then as I said, I'll hand things over to Noah who will demonstrate some of those principles and strategies in R. As a reminder, we're very excited for questions. Please drop those in the Q and A, and we will address them at the end of today's session. Okay, let's get going. [ONSCREEN CONTENT SLIDE 6] STS Recap [Alexander F. Roehrkasse] What did we talk about last week? [ONSCREEN CONTENT SLIDE 7] LINKING DATA Record-level linkage possible internally with NDACAN administrative data Aggregate linkage possible with other NDACAN data, external data Linkage requires clean, well-formatted data files with shared variables Linkage is a useful tool for building large datasets, dealing with data limitations, and enabling powerful research designs Linkage can create and/or amplify data problems if data limitations are not understood and addressed [Alexander F. Roehrkasse] Mostly, we talked about linking data. We talked about how two different kinds of linkages are possible using the administrative data that's archived at NDACAN. Among those data sets, that is to say, within our sort of data ecosystem, it's possible to link records at the individual child level. That's to say, if you find a child in AFCARS, that very same child might also appear in NCANDS or NYTD, and across those different data sets, that individual child is very likely to have an individual identifying variable that can be used to follow that individual child across multiple data sets. If you're interested in linking outside of the NDACAN administrative data architecture or environment, that's also possible, but not at the record level, not at the individual child level. Linkage to external data can be done at the aggregate level though, so you can tabulate archive data and link it to external data at the state level, at the county level, at the year level, at the month level. You can, I suppose, link data at the aggregate level to individual-level records in the NDACAN ecosystem, but you'll never be able to link individual children in archive data to individual children outside of archive data. We talked about how linkage requires very clean and well-formatted data files that share at least one identifying variable common to each data set that you link. We talked about how linkage is a useful tool for a number of different goals: expanding our data sets, either by expanding the sample size or by expanding the number of variables at our disposal. We talked about how linkage can be useful in dealing with data limitations. Sometimes we can carry over values of a variable that are not missing in one source to values of that same variable that are missing in a different source, so it can be a helpful missing data strategy. Data linkage can also enable powerful research designs, particularly by allowing us to follow children over time, creating longitudinal data sets that can be particularly helpful in doing life course analysis or causal inference. Finally, we talked about how linkage can create or amplify data problems if you don't understand the limitations in your data or do not address them appropriately. So you want to think of linkage as being both potentially very useful, but also potentially very fraught. 
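To make that recap concrete, here is a minimal R sketch of the kind of record-level join described above, using fabricated toy data rather than archive files. The object names (afcars_clean, nytd_clean) are illustrative assumptions, while StFCID, TOTALREM, and CurrFTE_3 echo variables used later in this session.

library(tidyverse)

# Toy stand-ins for cleaned AFCARS and NYTD extracts. The identifiers and
# values are fabricated; only the structure of the join matters here.
afcars_clean <- tibble(
  StFCID   = c("AL000000000001", "AL000000000002", "AL000000000003"),
  TOTALREM = c(2, 1, 4)
)
nytd_clean <- tibble(
  StFCID    = c("AL000000000001", "AL000000000003"),
  CurrFTE_3 = c(1, 0)
)

# Record-level linkage on the shared child-level identifier
linked <- left_join(afcars_clean, nytd_clean, by = "StFCID")
linked

# Share of AFCARS records that found a match in NYTD
mean(!is.na(linked$CurrFTE_3))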
[ONSCREEN CONTENT SLIDE 8] Research question What is the relationship between lifetime incidence of removal and full-time employment among youth three years after aging out of foster care? [Alexander F. Roehrkasse] Okay, recall also from previous sessions that our working research question that guides the research project we've been developing is this. "What is the relationship between the lifetime incidence of removal and full-time employment among youth three years after aging out of foster care?" Recall that the lifetime incidence of removal means how many times a child, over the course of their life, has been removed from the care of their parent or caregiver and placed into some foster care setting. This variable is only measured in AFCARS. This is why we're bringing in our AFCARS data. This is also the main predictor of interest, sometimes called our independent variable. We're most interested in the relationship between this predictor variable and some outcome. And of course the outcome variable we're interested in is full-time employment among youth three years after aging out of foster care. This information is only available in NYTD, which is an administrative survey that follows children aging out of foster care. So in order to answer this question, we need linked data that identifies children in both AFCARS and NYTD and links those data so that we can study the relationship between this predictor and this outcome. [ONSCREEN CONTENT SLIDE 9] Exploratory analysis [Alexander F. Roehrkasse] Okay, let's talk now about some initial analyses that we might do once we've cleaned and linked our data. How do we start doing basic analyses to try to start answering this question? I want to emphasize that we're just going to be scratching the surface here. There are also some ways in which the strategies and principles described here don't fit our data perfectly. That's okay. This is often the case. We're just starting at the ground floor, and in any research project you want to start there and then build up toward greater levels of complexity as you learn more about your data and the models and strategies appropriate to your question. [ONSCREEN CONTENT SLIDE 10] Introduction to linear regression Regression analysis is a statistical method for estimating the relationship between two (or more) random variables: An outcome (or dependent variable) One or more predictors (or independent variables) Linear regression is a powerful, flexible class of regression models that assume a linear relationship between the outcome and predictors [Alexander F. Roehrkasse] We're going to start with a discussion of linear regression. Regression analysis, more broadly, is a statistical method for estimating the relationship between two or more random variables. And most often, we're interested in some outcome of interest or dependent variable and one or more predictors that we sometimes call independent variables or covariates. Now, there are all different kinds of regression analyses, regression models. Linear regression is probably the most common, and it's a powerful and flexible class of regression models, but one that includes a number of assumptions which may or may not always hold. We'll be talking about whether those assumptions hold and what the consequences of those assumptions are in today's presentation and next week's presentation. 
[ONSCREEN CONTENT SLIDE 11] Estimating linear regression Models Linear regression models find the line (or hyperplane) of best fit representing the relationship between two (or more) random variables The most common method for estimating regression models is ordinary least squares (OLS) OLS minimizes the sum of the squares of the differences (residuals) between predicted values (blue line) and observed values (red points) Image of a graph with horizontal and vertical axes. Data points (red dots) are arrayed noisily on the graph, trending upward and to the right. A blue line representing predictions from a linear regression model also trends upward and to the right. The red dots fall roughly equally below and above the blue line. [Alexander F. Roehrkasse] How does linear regression work? At risk of oversimplification, linear regression models essentially find the line of best fit representing the relationship between two random variables. Now, you'll see I've also said hyperplane here. This is just in the case where we're analyzing the relationship between more than two random variables. For the purpose of illustration, we'll just focus on two random variables here. The most common method for estimating this line of best fit for finding this line is a method called ordinary least squares, which is in most scenarios what we call the best linear, unbiased estimator. OLS essentially works by minimizing the sum of the squares of the differences, which we sometimes call the residuals between some predicted value, in other words, the line of best fit and the actual observed data points. Now that's kind of a word salad. Let's see if we can't illustrate this using the figure to the right. So you'll notice here that we're looking at a Cartesian coordinate plane. Usually in illustrating regression models like this, we'll put the predictor variable on the horizontal axis and the outcome variable on the vertical axis. For the purposes of illustration, it really doesn't matter what these variables are or what their units are. Each red dot represents an actual data point. Some combination of two values of our predictor variable and our outcome variable. What linear regression does is essentially try to draw a line through these data points that feels like the best fit. You can look at this figure and say, "Hmm, that feels like a good way to draw a line through those points on average." If you were to just, if I were to ask you draw a line through those points, that's about where I think most of us would draw the line. That's a little vague though, so how actually does a linear model draw that line? Well, it essentially tries out a bunch of different lines and chooses the one that minimizes the square of the distances between the red points and the blue line. There is no other line that we could draw that further reduces the sum of the squares of the distances between the red points and the blue line. This tells us that it's the best linear, unbiased estimate of this relationship. 
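As a quick illustration of the estimation idea just described, the following R sketch simulates data with a known linear relationship and fits it with lm(); the fitted line minimizes the sum of squared residuals, and for a bivariate model the OLS slope equals cov(x, y) / var(x). This uses simulated data for illustration only, not archive data.

set.seed(1013)

# Simulate noisy data with a known linear relationship
x <- runif(200, 0, 10)
y <- 2 + 0.5 * x + rnorm(200, sd = 1)

# Ordinary least squares: the line of best fit
fit <- lm(y ~ x)
coef(fit)

# The fitted line minimizes the sum of squared residuals
sum(residuals(fit)^2)

# For a bivariate model, the OLS slope equals cov(x, y) / var(x)
cov(x, y) / var(x)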
[ONSCREEN CONTENT SLIDE 12] Fundamental components of linear regression models Consider the following bivariate regression: y = β_0 + xβ_1 + ε y is an N×1 vector of outcomes, where N is the number of observations in our data x is an N×1 vector of predictors β_0 is the main intercept (the predicted value of y when x = 0) β_1 is the coefficient (or parameter) of interest β_1 represents the slope of the line of best fit It is the main goal of regression analysis to estimate coefficients of interest validly (without bias) and efficiently (with precision) ε is the error term, an N×1 vector of residuals (distances between red dots and blue line) [Alexander F. Roehrkasse] Okay, what are the fundamental components of a linear regression model? Linear regressions can sometimes get very complicated. What is the simplest version of a linear regression model? Consider the following bivariate regression. And when I say bivariate, I mean a regression that analyzes the relationship between two variables, a predictor and an outcome. Here, I've drawn the model as an algebraic equation. In particular, I've used matrix notation to illustrate the structure of these data and this model. On the next slide, though, I'll back away from this matrix notation for those of you who are unfamiliar with this way of notating data structures. Usually on the left hand side of the equation is our outcome. And here y is a vector of outcomes that only has one variable. There's only one outcome we'll analyze. But of course, each observation or each unit has a value of that variable. And so we have a vector of outcomes that's only one across but is as long as the number of units or observations in our sample. x is a similar structure, a vector of variable values, but values for our predictor variable. These are our data. These are the underlying data that we're trying to model. The rest of the regression equation are the features of our model, the parameters of our model that actually get estimated. Beta naught or beta 0 is what we call the main intercept. Essentially this is the predicted value of our outcome when the predictor variable has a value of 0. The real thing we're interested in is estimating beta 1. And this is sometimes called the coefficient or parameter of interest. Beta 1 represents the slope of the line of best fit. So from our previous slide, beta 1 would represent the slope of that blue line. How much does it go up as it goes over? It's the main goal of a regression analysis to estimate coefficients of interest both validly, that is to say without bias, and efficiently, that is to say with precision. Epsilon is what we call the error term. And it's a vector of residuals. In other words, the distances between those red points and that blue line. It tells us essentially how much noise there is around our estimate. Okay, these are the fundamental components of a linear regression model. Let's try to notate them differently and illustrate such a model using our actual research project. [ONSCREEN CONTENT SLIDE 13] BASELINE MODEL Instead of using matrix notation, we can represent the model using indexing: y_i = β_0 + x_i β_1 + ε_i In the case of our research design, the regression model takes the form: CurrFTE_3_i = β_0 + TOTREM_i β_1 + ε_i Because our outcome is binary, this model is known as a linear probability model. In Presentation 5 we'll explore other models for binary and other categorical outcomes. [Alexander F. 
Roehrkasse] Instead of using matrix notation, which can sometimes be a little confusing, we can represent the very same model using indexing. So instead of talking about vectors and matrices, we can just say that y, our outcome of interest, has a separate value for every observation indexed by the subscript i. Similarly, our predictor variable has a different value for each individual or observation indexed by the subscript i. So you see that y and x will have different values for each unit, for each observation. This is not true for our model parameters: it is true for our residuals, but not true for our coefficients. The model will spit out, the model will estimate, one intercept and one coefficient of interest, so it will only have one value of beta naught and one value of beta 1. Okay, recall our research question, which asks about the relationship between total removals and full-time employment three years after aging out of foster care. We can substitute these variables into our model and it would look something like this, where our outcome is current FTE, which has a different value for each child. Our predictor variable is the number of total removals, which also has a different value for each child, but our model will estimate the overall relationship between these two variables and yield a coefficient of interest beta 1 that summarizes the overall relationship across all cases. Now, because our outcome is binary, it can only take one of two values. That is to say, they are full-time employed or are not full-time employed. If we were to plot our data, it wouldn't look exactly like that Cartesian coordinate plane I showed you earlier, because the outcome variable can only have two values. There is a whole set of models that are designed specifically to analyze cases like this, where our outcome of interest has only two values. We're not going to talk about those today, though they will be introduced in next week's presentation, so that's a little teaser for next week's presentation. If you're interested in learning more about models specific to categorical outcomes, including binary outcomes (outcomes that have only two possible values), then tune in next week for a more extensive discussion of those models. Suffice it to say, though, we can use a linear model to analyze this outcome. This model is called a linear probability model. It's used commonly in a number of different disciplines. Just know that some of the illustrations I've given won't line up exactly with our example, because our example outcome variable only takes two possible values: yes or no, are you currently full-time employed? [ONSCREEN CONTENT SLIDE 14] Correlation and causality Recall our research question: What is the relationship between lifetime incidence of removal and full-time employment among youth three years after aging out of foster care? What if we want to strengthen it to something like: What is the effect of lifetime incidence of removal on full-time employment among youth three years after aging out of foster care? [Alexander F. Roehrkasse] Okay, I want to offer some cautionary advice about how to interpret the results of linear regression models. You may have heard the phrase correlation does not equal causality. That may be well and good, but it's important to understand why. Recall that our research question is this. What is the relationship between lifetime incidence of removal and full-time employment? This is arguably a little bit vague. What is the relationship? 
What kind of relationship are we talking about here? It's a little bit open-ended. That's partly by design, but arguably we want to be a bit more specific. What if we want to strengthen our question to something like, what is the effect of lifetime incidence of removal on full-time employment among youth three years after aging out of foster care? These are fundamentally different questions. They both ask about the relationship between these two variables, but the first is a bit agnostic as to what that relationship, what that association, might be. The second is much more specific, hypothesizing a causal effect of our predictor variable on our outcome variable. [ONSCREEN CONTENT SLIDE 15] Omitted-variable bias The Predictor can affect the Outcome The Confounder can affect both the Predictor and the Outcome An omitted variable can confound the relationship between the Predictor and the Outcome [Alexander F. Roehrkasse] There are a variety of reasons why results that indicate a relationship or an association between our predictor and our outcome variable do not provide evidence of a causal relationship between our predictor and our outcome variable. We won't go through all the different reasons why this might be the case. I'll just talk about the overwhelmingly most common and most problematic reason why correlation does not always equal causality. And that is the presence of some third variable that is fundamentally related to the two variables we're interested in. So say we're interested in the causal effect of our predictor on our outcome. But say there's some third variable that causes both our predictor variable and our outcome variable. Without accounting for that confounder, that third variable, our observed relationship between the predictor and outcome variable will be invalid or will be biased. Precisely in which direction and how much depends on the context. [ONSCREEN CONTENT SLIDE 16] Omitted-variable bias: Example Total removals can affect Full-time employment Race/ethnicity can affect both Total removals and Full-time employment An omitted variable can confound the relationship between Total removals and Full-time employment [Alexander F. Roehrkasse] What does this look like potentially in our example research question? Our example research project. Of course we're interested in the relationship between total removals and full-time employment. Let's say we wanted to know the causal relationship between these. The causal effect of removals on full-time employment. Well, analyzing only these two variables in a bivariate regression model would probably lead to some spurious inferences about that relationship. Why? Well, consider a factor like race and ethnicity. We know that children of different ethno-racial backgrounds have very disparate likelihoods of being placed into foster care. We also know that race and ethnicity has significant impacts on people's employment prospects. There's a large literature documenting racial discrimination in labor markets. If we were to look only at removals and employment, our model would essentially have baked into it, without our knowledge, the influence of race and ethnicity on both removals and full-time employment. How do we deal with this problem of omitted-variable bias? [ONSCREEN CONTENT SLIDE 17] CONTROLLING FOR OBSERVABLE CONFOUNDERS We can deal with observed confounders by incorporating them into our model as additional predictors (or covariates). 
Note that adding a predictor with C categories introduces C-1 parameters, which measure the difference in outcome for each category relative to a reference category (here, White_i) CurrFTE_3_i = β_0 + TOTREM_i β_1 + Black_i β_2 + AIAN_i β_3 + Asian_i β_4 + NHPI_i β_5 + Multi_i β_6 + Hisp_i β_7 + ε_i [Alexander F. Roehrkasse] The simplest way is to incorporate information about them into our model. If we can add them to our model and model them in an appropriate way, we can, as is sometimes said, control for them, account for them, incorporate into our model the very relationship we described as a problematic confounding relationship. Now race and ethnicity is a categorical variable, and so you should know that whenever we add a predictor variable with C categories, some number of categories C, this actually introduces C minus 1 parameters to our model, each of which measures the difference in outcomes for each category relative to some reference category. I always advise data users to choose their reference category. If you don't choose your reference category, your statistical programming software will choose it for you, and so it's better to do it explicitly. Here our model, though, if we were to account for race and ethnicity, would include a number of other new parameters, where we would have a separate parameter for black Americans, a separate coefficient that measures the difference between black Americans and white Americans. Another coefficient measuring American Indian and Alaska Native individuals, Asian individuals, Native Hawaiian and Pacific Islander individuals, and multiracial individuals. Each of these parameters of interest would measure the overall difference in outcomes relative to our reference group, which we might define as white Americans. [ONSCREEN CONTENT SLIDE 18] CONTROLLING FOR UNOBSERVABLE CONFOUNDERS There are many, many potential confounders that are not observed or even observable For example, the relationship between foster placement and employment may be confounded by (intangible) features of child welfare policy and practice One simple strategy for addressing such unobserved confounders: introduce group-specific intercepts, or fixed effects For example, if CPS systems vary across states but are stable within them, including state intercepts will control for them [Alexander F. Roehrkasse] There are many, many potential confounders. Some of these we can observe, collect data about, and incorporate into our model. Many of them, though, we will not have information about, and it may even be impossible to collect information about them. Take for example the relationship between foster placement and employment, which may be confounded by certain intangible features of child welfare policy and practice. Child welfare works differently in different places, at different times. We have some information about this, but things like culture, personnel, institutional memory, these are not things that we can collect data about, but which influence foster placement and potentially even employment. One simple strategy for addressing unobserved confounders, or even unobservable confounders, is to introduce group-specific intercepts, or what we sometimes call fixed effects. 
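As a rough R illustration of the two points just made, choosing the reference category for a categorical predictor and adding group-specific intercepts, the sketch below uses fabricated data whose variable names (CurrFTE_3, TOTALREM, RaceEthn, St) echo the working example. lm() expands a factor into C minus 1 dummy variables automatically, relevel() sets the reference category explicitly, and factor(St) adds the state intercepts developed on the next slide.

library(tidyverse)
set.seed(1013)

# Fabricated illustration data; values are made up and carry no meaning.
dat <- tibble(
  CurrFTE_3 = rbinom(500, 1, 0.4),
  TOTALREM  = sample(1:5, 500, replace = TRUE),
  RaceEthn  = sample(c("White", "Black", "AIAN", "Asian", "NHPI", "Multiracial", "Hispanic"),
                     500, replace = TRUE),
  St        = sample(c("AL", "CA", "NY"), 500, replace = TRUE)
)

# Choose the reference category explicitly rather than letting R pick it
dat$RaceEthn <- relevel(factor(dat$RaceEthn), ref = "White")

# Linear probability model with C - 1 race/ethnicity dummies and state intercepts
mod <- lm(CurrFTE_3 ~ TOTALREM + RaceEthn + factor(St), data = dat)
summary(mod)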
To continue this example, if CPS systems vary across states but are roughly stable within them over time, we can include a separate intercept for each state that will absorb all of this variation, all of this unobserved variation, and control for it. What does this look like? [ONSCREEN CONTENT SLIDE 19] Example: State Fixed effects y = β_0 + xβ_1 + Sγ + ε S is an N×(G-1) matrix of indicator variables, where G is the number of US states γ is a (G-1)×1 vector of coefficients, or state fixed effects CurrFTE_3 = β_0 + TOTREM β_1 + RaceEthn δ + State γ + ε [Alexander F. Roehrkasse] Well, to return to our matrix notation, our bivariate regression model looks much the same, except that we've added this parameter here. What is this parameter? S is a matrix of indicator variables, where G is the number of US states, so each observation gets an indicator variable for each state, and then, oops, apologies. And then gamma is a vector of coefficients, or what we call state fixed effects. Now, differently than our coefficient of interest beta 1, we're not actually terribly interested in these coefficients, or these parameters. We want to think about fixed effects mostly as a control that helps us estimate our coefficient of interest more validly. Under certain assumptions, for example, that the unobserved confounders at the state level do not change over time, we can include these state fixed effects, which will control for unobserved variation across states vis-a-vis their child welfare systems, but also really anything else that varies across states but is stable over time. This is a powerful, flexible, simple way of controlling for unobserved confounders under a certain set of simple assumptions. Combining these different approaches, we might arrive at a sort of working model here that would be well-suited to exploring our basic research question. Here we have current full-time employment three years after aging out of foster care. We have our model intercept, we have total removals, we have a parameter for race and ethnicity, and we have our state fixed effects. [ONSCREEN CONTENT SLIDE 20] Stratification Perhaps there's reason to think the answer to your research question will be different for different populations. For example, the relationship between removal incidence and full-time employment may be different for people who are and are not currently enrolled in school Stratification allows our model estimates to vary across the values of a stratum variable For example, we could estimate our model separately on currently enrolled and not currently enrolled populations Or we could interact the enrollment variable (CurrEnroll) with all other model parameters [Alexander F. Roehrkasse] Sometimes there's a reason, though, to think that your answer might be different for different populations. For example, the relationship between removal incidence and full-time employment is very likely to be different for people who are and are not currently enrolled in school. If you're enrolled full-time in school, you're much less likely to be full-time employed. So we might want to know that. Stratification is a strategy that allows our model estimates to vary across values of a stratum variable. There are essentially two ways to do this. We can estimate our model separately for these different groups. 
So, for example, people who are and are not in school, or we could interact our enrollment variable with all of our other model parameters. These are functionally the same. [ONSCREEN CONTENT SLIDE 21] Relaxing parametric assumptions By default, linear models assume linear relationships between predictors and outcomes We can relax this constraint in at least two ways: Adding quasi-linear parameters like quadratic terms or splines Including separate parameters for each level of a variable The next presentation will explore models for non-continuous outcomes [Alexander F. Roehrkasse] Lastly, we might think about relaxing some of the assumptions that are built into the linear regression model. As I said earlier on, by default, linear models assume linear relationships between predictors and outcomes. While retaining the basic architecture of a linear regression model, we can relax this constraint by reformatting our variables in different ways, or interacting our parameters in different ways. If you'd like more detail about these strategies, reach out, and we can always talk about how to calibrate your model to best answer your research question. [ONSCREEN CONTENT SLIDE 22] EXTENSION: Dealing with MISSING DATA Statistical software (including most R packages) will almost always listwise-delete records with missing values of modeled variables Listwise deletion is rarely advisable, particularly if large amounts of data are missing Always: Examine the degree of missingness in your data Consider the mechanisms that generated the missing data Implement a defensible approach to dealing with missing data [Alexander F. Roehrkasse] Lastly, a reminder: this Summer Training Series is not focused on missing data. We have other available trainings, and we are always happy to provide consultation on how to deal with missing data. Missing data is a common challenge in archive data, including administrative data. A reminder that your statistical software will almost always listwise-delete records that have missing values on variables included in your regression model. This is rarely an advisable approach to missing data, and so you always want to examine the missingness in your data, consider the mechanisms that lead data to be missing, and implement a defensible approach to dealing with missing data. If you notice in your model there are many fewer observations than in your data set, this is what's happening, and you shouldn't ignore it. [ONSCREEN CONTENT SLIDE 23] Demonstration in R [Alexander F. Roehrkasse] Okay, that's it for my discussion of principles and strategies, and so we'll now move over to a demonstration in R. If you'll bear with me, this is going to require us to switch shared screens. [ONSCREEN CONTENT SLIDE 24] Questions? Alex Roehrkasse aroehrkasse@butler.edu Noah Won noah.won@duke.edu Paige Logan Prater paige.loganprater@ucsf.edu [Alexander F. Roehrkasse] Before I do, though, I just want to remind you that we're always available to answer questions. You can find my email, Noah's email, and Paige's email here on this slide. With that, let's transition over to our demonstration in R. [Noah Won] Thank you, Alex. One second, let me share my screen. Okay, can you all see my screen well? [Alexander F. Roehrkasse] Yes. Thank you, Noah. [Noah Won] Okay, perfect. All right, as Alex said, I'll be handling the R coding portion for session four of the Summer Training Series. 
My name's Noah Won, I'm a Statistician/data analyst at NDACAN, and we'll just be covering some of the concepts that Alex had covered in the previous presentation. [ONSCREEN] # NOTES # This program file demonstrates strategies discussed in # session 4 of the 2025 NDACAN Summer Training Series # "Data Management." # For questions, contact the presenter # Noah Won (noah.won@duke.edu). # Note that because of the process used to anonymize data, # all unique observations include partially fabricated data # that prevent the identification of respondents. # As a result, all descriptive and model-based results are fabricated. # Results from this and all NDACAN presentations are for training purposes only # and should never be understood or cited as analysis of NDACAN data. [Noah Won] So, as always, my contact information is that the top in the header. If there are any questions that may arise after this coding portion. And of course, all data is anonymized, so any findings that we may find here are not necessarily indicative of real-world data. [ONSCREEN] R version 4.2.3 (2023-03-15 ucrt) -- "Shortstop Beagle" Copyright (C) 2023 The R Foundation for Statistical Computing Platform: x86_64-w64-mingw32/x64 (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. [Noah Won] Okay, well, let's jump right into it. So, as we have before in previous iterations of the coding session, we're going to clear our environment. We're going to run some code. Just to make sure we have our environment is clear, you know, of previous potential variables that were defined, we're going to load our packages, namely the Tidyverse. We'll be the most important package that we'll be using in this example for data cleaning, some graphic presentations, such as GGplot2, that we can go ahead, run it, make sure it loads in. [ONSCREEN] # TABLE OF CONTENTS # 0. SETUP # 1. Simple Linear Regression # 2. Multiple Regression # 3. Stratified Multiple Regression # 0. SETUP # Clear environment rm(list=ls()) # Installs packages if necessary, loads packages if (!requireNamespace("pacman", quietly = TRUE)){ install.packages("pacman") } pacman::p_load(data.table, tidyverse, mice) [Noah Won] And as always, we're going to define our working directories. So, me personally, my data is located in this pathway, but, you know, when you're working with your own data, of course, create the pathways that your data is in, and we'll set our working directories and our seed for any random numbers that we may generate. [ONSCREEN] # Defines filepaths working directory project <- "C:/Users/nhwn1/Downloads/STS5/data" data <- "C:/Users/nhwn1/Downloads/STS5/data" # Set working directory setwd(project) # Set seed set.seed(1013) [Noah Won] All right. So, as Alex stated before, you know, simple linear regression's a very powerful tool to find correlations in our data, linear correlations, namely, and it's just a very flexible and useful tool for researchers, advanced, and just starting. 
So, with that, we're going to read in our AFCARS data, and this again is anonymized data, and it's not indicative of real-world data, and we'll also read in just the first 20 observations to, you know, make sure our data is read in well and see what variables we have. [ONSCREEN] > afcars <- fread(paste0(data,'/afcars_clean_anonymized_linear.csv')) > head(afcars, 20) StFCID STATE St RecNumbr DOB SEX RaceEthn CLINDIS 1: AL000001456616 1 AL 000001456616 2003-02-15 Female White No 2: AL000001524474 1 AL 000001524474 2003-01-15 Male White No 3: AL000001528009 1 AL 000001528009 2003-01-15 Male White Yes 4: AL000001597400 1 AL 000001597400 2003-01-15 Male White No 5: AL000001612758 1 AL 000001612758 2003-01-15 Female White No 6: AL000001634843 1 AL 000001634843 2002-09-15 Female Hispanic No 7: AL000001699782 1 AL 000001699782 2003-02-15 Male White No 8: AL000001699789 1 AL 000001699789 2003-02-15 Female White No 9: AL000001714510 1 AL 000001714510 2002-09-15 Male White No 10: AL000001718423 1 AL 000001718423 2002-12-15 Female White No 11: AL000001718927 1 AL 000001718927 2003-03-15 Male White No 12: AL000001917909 1 AL 000001917909 2002-11-15 Male White No 13: AL000002036685 1 AL 000002036685 2003-08-15 Female Black Yes 14: AL000002041179 1 AL 000002041179 2003-04-15 Male White No 15: AL000002061766 1 AL 000002061766 2003-03-15 Female AIAN No 16: AL000002239265 1 AL 000002239265 2003-02-15 Male White No 17: AL000002270506 1 AL 000002270506 2003-01-15 Male Hispanic No 18: AL000002276813 1 AL 000002276813 2003-07-15 Female White No 19: AL000002501625 1 AL 000002501625 2003-07-15 Male Black No 20: AL000002788485 1 AL 000002788485 2003-04-15 Male Black Yes TOTALREM FCMntPay AgeAtStart 1: 2 5425 16 2: 2 4371 16 3: 3 732 16 4: 1 732 16 5: 4 3102 16 6: 5 732 16 7: 2 732 16 8: 2 732 16 9: 2 8525 16 10: 1 732 16 11: 2 732 16 12: 2 732 16 13: 2 0 16 14: 2 0 16 15: 2 0 16 16: 2 2961 16 17: 3 0 16 18: 1 0 16 19: 2 0 16 20: NA 938 16 [Noah Won] Okay, so it seems like the data read well. We have our STFCIDs, we have state on a numeric state variable, character state variable, our Rec numbers, date of births, sex, race and ethnicity, we've got clindus, we have total removal, we have FCMnTPay and AgeAtStart. So these are all, you know, good variables to start with, but we'll definitely need to do a little bit of cleaning before we can throw these variables in our model. So let's look at some predictors of interest, right? So I'd say sex is a very useful variable, maybe useful for to be uses predictor of race ethnicity and our outcome variable for this example will be FCMnTPay, which is the monthly payment made on behalf of the child. So let's go ahead and take a look at this, this sex variable. [ONSCREEN] # Running frequency tables of predictors of interest table(afcars$SEX) Female Male 88 323567 307598 [Noah Won] So we could see that the sex variable is a character, right? We have female and male, and we have a certain amount in each, which is, you know, it's great for, you know, reading a data in, but it's not great for a, for putting in our model. So we're going to have to change these to ones and zeroes, as Alex said, we'll need like a reference group, which is relatively arbitrary, we can choose female or male to be our reference group, and then, and then our one variable. So reference will be zero, and then the other variable you want. So let's see our race ethnicity variable as well. 
[ONSCREEN] # Running frequency tables of predictors of interest > table(afcars$RaceEthn) AIAN Asian Black Hispanic Multiracial 10255 14846 3393 138569 134412 48829 NHPI White 1698 279251 [Noah Won] Okay, so likewise, we have characters, right, of American, Indian, Asian, black, Hispanic, multiracial, white, and again, this is great for, you know, reading a data set in, but when we put it into our model, we really want to create dummy variables, essentially binary variables, that are ones or zeroes, that refer to a reference group. So we'll go into more detail when we create these variables, but it seems like we're going to have to make some changes for this. [ONSCREEN] # Running frequency tables of predictors of interest > table(afcars$FCMntPay) 0 1 2 3 5 6 7 8 9 10 11 265467 858 3 7 4 5 23 10 8 15 8 12 13 14 15 16 17 18 19 20 21 22 3 16 32 28 35 15 32 89 67 40 20 23 24 25 26 27 28 29 30 31 32 33 40 27 67 84 109 64 34 69 31 20 83 [ reached getOption("max.print") -- omitted 8076 entries ] [Noah Won] And our FCMntPay, which, as expected, is numeric, and has a, you know, large, as many different, values. So this is our going to be our continuous outcome for our linear regression. Okay. So as I said before, we're going to have to do some data manipulation. [ONSCREEN] # Creating Dummy Variables and Age Variables for Predictors # Also filtering out those older than 30 > afcars2 <- afcars %>% + mutate(SEX_d = case_when( + SEX == "Male" ~ 1, + SEX == "Female" ~ 0), + Hispanic = case_when( + RaceEthn == "Hispanic" ~ 1, + TRUE ~ 0), + age = as.numeric(difftime(Sys.Date(), DOB, units = "days")) / 365.25 + ) %>% + filter(age <= 30) [Noah Won] So for this sex variable, I want to make our females, our female group, the, reference variable, and our male, the, the, the other indicator variable. So essentially, in this code, I am, creating a new variable called SEX_d, sex dummy variable, and assigning, the value one when the sex valuable uh variable's male, and assigning the value zero when it's female. Likewise, I'm creating a Hispanic variable, kind of following this same flow, where every ethnicity equals Hispanic, we're going to assign this value one, and if it's not, anything else it's going to be zero. So in both the, in this second case, the reference group is non-Hispanics, right? So we're going to be comparing Hispanics, the non-Hispanics, males, the females, with this dummy variable. And, like I said, it's arbitrary, you know, you could swap the order could be males, could be the reference group, females could be the one value, non-Hispanics, be the one value Hispanics, be the reference group, it's just our interpretation of our beta values will change with that, and I'll go into more detail as, as we, progress. And of course, I also wanted to create an age variable, from our birthday. So essentially this line of code is, just reading in the date of birth variable and just dividing it by 365.25 for, you know, leap years on average. And I also wanted, to filter this data for people under 30. So, just in particular in this data set, I wanted to kind of capture like the, most recent, iterations most recent, it's, I suppose, like children that, you know, in this data set, rather than, extrapolating beyond 30 years of age. Okay, so let's run this code and see what happens. Okay, I'm just going to check some of our derived variables here. [ONSCREEN] # Checking new derived variables > table(afcars2$SEX_d) 0 1 323559 307580 [Noah Won] Okay, seems like it's worked. 
We have our females as zero, and we have our males as one. Let's check our Hispanic variable. [ONSCREEN] > table(afcars2$Hispanic) 0 1 496820 134407 [Noah Won] Okay. Likewise, we have ones are Hispanic, zeroes are non-Hispanic. And we're just going to take a look at our age, make sure it got derived correctly. [ONSCREEN] > table(afcars2$age) 4.27104722792608 4.51745379876797 4.60232717316906 4.68446269678302 1 1 1 1 4.76933607118412 4.93634496919918 5.02121834360027 5.10335386721424 1287 2465 2239 2479 5.18822724161533 5.27036276522929 5.35523613963039 5.43463381245722 2349 2705 2529 3326 5.51950718685832 5.60438056125941 5.68651608487337 5.77138945927447 2853 3395 3440 3884 5.85352498288843 5.93839835728953 6.02327173169062 6.10540725530459 3773 4179 4258 3826 6.19028062970568 6.27241615331964 6.35728952772074 6.43394934976044 3706 3935 3687 4479 6.51882272416153 6.60369609856263 6.68583162217659 6.77070499657769 3816 4306 4440 4522 6.85284052019165 6.93771389459274 7.02258726899384 7.1047227926078 4416 4503 4755 4171 7.1895961670089 7.27173169062286 7.35660506502396 7.43326488706365 4214 4004 3866 4328 7.51813826146475 7.60301163586585 7.68514715947981 7.7700205338809 3751 4113 4209 4152 7.85215605749487 7.93702943189596 8.02190280629706 8.10403832991102 4165 4149 4337 3544 8.18891170431212 8.27104722792608 8.35592060232717 8.43258042436687 3824 3549 3244 3607 8.51745379876797 8.60232717316906 8.68446269678303 8.76933607118412 3284 3450 3730 3643 8.85147159479808 8.93634496919918 9.02121834360027 9.10335386721424 3667 3655 3739 3088 9.18822724161533 9.2703627652293 9.35523613963039 9.43463381245722 3292 3116 2947 3235 9.51950718685832 9.60438056125941 9.68651608487338 9.77138945927447 3016 3184 3350 3219 9.85352498288843 9.93839835728953 10.0232717316906 10.1054072553046 3304 3257 3316 2831 10.1902806297057 10.2724161533196 10.3572895277207 10.4339493497604 2975 2935 2766 2966 10.5188227241615 10.6036960985626 10.6858316221766 10.7707049965777 2749 2891 3030 2867 10.8528405201917 10.9377138945927 11.0225872689938 11.1047227926078 3020 3083 3111 2657 11.1895961670089 11.2717316906229 11.356605065024 11.4332648870637 2627 2524 2539 2716 11.5181382614648 11.6030116358658 11.6851471594798 11.7700205338809 2492 2578 2930 2745 11.8521560574949 11.937029431896 12.0219028062971 12.104038329911 2762 2692 2885 2533 12.1889117043121 12.2710472279261 12.3559206023272 12.4325804243669 2684 2475 2308 2578 12.517453798768 12.6023271731691 12.684462696783 12.7693360711841 2170 2503 2613 2640 12.8514715947981 12.9363449691992 13.0212183436003 13.1033538672142 2462 2602 2648 2368 13.1882272416153 13.2703627652293 13.3552361396304 13.4346338124572 2406 2338 2149 2349 13.5195071868583 13.6043805612594 13.6865160848734 13.7713894592745 2170 2347 2399 2447 13.8535249828884 13.9383983572895 14.0232717316906 14.1054072553046 2278 2463 2437 2188 14.1902806297057 14.2724161533196 14.3572895277207 14.4339493497604 2376 2141 1997 2301 14.5188227241615 14.6036960985626 14.6858316221766 14.7707049965777 1950 2207 2335 2256 14.8528405201917 14.9377138945927 15.0225872689938 15.1047227926078 2181 2241 2296 2103 15.1895961670089 15.2717316906229 15.356605065024 15.4332648870637 2149 2118 2075 2208 15.5181382614648 15.6030116358658 15.6851471594798 15.7700205338809 1933 2136 2266 2312 15.8521560574949 15.937029431896 16.0219028062971 16.104038329911 2208 2395 2399 2141 16.1889117043121 16.2710472279261 16.3559206023272 16.4325804243669 2070 2134 2022 2165 16.517453798768 16.6023271731691 16.684462696783 16.7693360711841 
1971 1951 2203 2264 16.8514715947981 16.9363449691992 17.0212183436003 17.1033538672142 2171 2404 2363 2095 17.1882272416153 17.2703627652293 17.3552361396304 17.4346338124572 2052 2141 2021 2221 17.5195071868583 17.6043805612594 17.6865160848734 17.7713894592745 2061 2137 2157 2219 17.8535249828884 17.9383983572895 18.0232717316906 18.1054072553046 2239 2315 2276 2182 18.1902806297057 18.2724161533196 18.3572895277207 18.4339493497604 2131 2171 1907 2287 18.5188227241615 18.6036960985626 18.6858316221766 18.7707049965777 1988 2217 2229 2319 18.8528405201916 18.9377138945927 19.0225872689938 19.1047227926078 2383 2316 2478 2167 19.1895961670089 19.2717316906229 19.356605065024 19.4332648870637 2204 2263 2041 2297 19.5181382614647 19.6030116358658 19.6851471594798 19.7700205338809 2118 2204 2320 2319 19.8521560574949 19.937029431896 20.0219028062971 20.104038329911 2276 2383 2437 2232 20.1889117043121 20.2710472279261 20.3559206023272 20.4325804243669 2330 2344 2258 2461 20.517453798768 20.6023271731691 20.684462696783 20.7693360711841 2172 2408 2534 2494 20.8514715947981 20.9363449691992 21.0212183436003 21.1033538672142 2411 2586 2615 2415 21.1882272416153 21.2703627652293 21.3552361396304 21.4346338124572 2487 2455 2304 2618 21.5195071868583 21.6043805612594 21.6865160848734 21.7713894592745 2350 2436 2575 2692 21.8535249828884 21.9383983572895 22.0232717316906 22.1054072553046 2632 2772 2737 2463 22.1902806297057 22.2724161533196 22.3572895277207 22.4339493497604 2515 2502 2439 2632 22.5188227241615 22.6036960985626 22.6858316221766 22.7707049965777 2426 2478 2636 2408 22.8528405201916 22.9377138945927 23.0225872689938 23.1047227926078 2558 2477 2591 2152 23.1895961670089 23.2717316906229 23.356605065024 23.4332648870637 2130 2028 2022 2055 23.5181382614647 23.6030116358658 23.6851471594798 23.7700205338809 1901 1894 2023 639 23.8521560574949 23.937029431896 24.0219028062971 24.104038329911 1852 598 661 526 24.1889117043121 24.2710472279261 24.3559206023272 24.4325804243669 527 479 446 494 24.517453798768 24.6023271731691 24.684462696783 24.7693360711841 428 441 490 378 24.8514715947981 24.9363449691992 25.0212183436003 25.1033538672142 428 366 385 357 25.1882272416153 25.2703627652293 25.3552361396304 25.4346338124572 403 369 339 368 25.5195071868583 25.6043805612594 25.6865160848734 25.7713894592745 365 318 302 316 25.8535249828884 25.9383983572895 26.0232717316906 26.1054072553046 319 291 325 263 26.1902806297057 26.2724161533196 26.3572895277207 26.4339493497604 261 248 202 270 26.5188227241615 26.6036960985626 26.6858316221766 26.7707049965777 258 219 229 2 26.8528405201916 27.0225872689938 27.2717316906229 27.356605065024 184 1 2 3 27.4332648870637 27.5181382614647 27.6030116358658 27.6851471594798 2 1 1 1 28.0219028062971 28.104038329911 28.6023271731691 29.0212183436003 1 2 1 1 29.1033538672142 29.4346338124572 29.9383983572895 1 1 1 [Noah Won] Okay, looks pretty good, you know, within the age range that we expect. So we're just going to keep moving forward. [ONSCREEN] # Let's run a linear regression using age as a predictor and fcmntpay as an outcome model <- lm(FCMntPay ~ age, data = afcars2) summary(model) [Noah Won] Okay, and the function I'm going to be using is the LM function. It's a linear regression model function. And we're going to assign it to this model variable. So the syntax is very simple. Essentially, our outcome variable will be FCMntPay, and then we're going to follow it by a tilde, and then a list of our predictors that we want to use. 
So this is very first one. Let's try and predict the monthly payment for the children and use age as a predictor. And of course, we're going to be using our afcars2 variable. And then we're going to summarize it using our summary function of this model that we assigned this linear regression model to. So let's run this. [ONSCREEN] Call: lm(formula = FCMntPay ~ age, data = afcars2) Residuals: Min 1Q Median 3Q Max -2421 -960 -446 248 68875 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -139.8838 6.5624 -21.32 <2e-16 *** age 85.5494 0.4437 192.80 <2e-16 *** --- Signif. codes: 0 โ€˜***โ€™ 0.001 โ€˜**โ€™ 0.01 โ€˜*โ€™ 0.05 โ€˜.โ€™ 0.1 โ€˜ โ€™ 1 Residual standard error: 2039 on 629859 degrees of freedom (1366 observations deleted due to missingness) Multiple R-squared: 0.05573, Adjusted R-squared: 0.05572 F-statistic: 3.717e+04 on 1 and 629859 DF, p-value: < 2.2e-16 [Noah Won] Okay, so we have some output here. Just kind of a summary of our residuals and a table of our coefficients. So if you could recall Alex's presentation, we had a general equation. It's our y equals our intercept and plus beta times our beta age values. So here I actually wrote it right here. [ONSCREEN] # A written form of our model is as follows: FCMntPay = 85.594 * age - 139.5495 [Noah Won] This is our equation to predict FCMntPay. So using these numbers that we have that the program has calculated using the least squares regression model it shows that these estimated values are our best estimates for our betas. So this equation right here can help us predict FCMntPay based off of any given age that we have in our data set. So and I can do a quick visualization using GGplot2. [ONSCREEN] #Let's visualize this model ggplot(afcars2, aes(x = age, y = FCMntPay)) + geom_point() + geom_smooth(method = "lm", col = "red") + labs(title = "Linear Regression: FCMntPay ~ age", x = "age", y = "FCMntpay") [Noah Won] Just kind of plotting our age variables against FCMntPay and then creating like that linear regression line that we saw in that presentation before. Just one second. Takes a little bit of time. There's a lot of data points in here. So it's going to take a little bit. Just bear with me. I think it's a little for it to load. But in the meantime, while that's loading, oh here we go. [ONSCREEN IMAGE] Plot titled "Linear Regression: FCMntPay ~ age" x axis is Age and goes from 0 to 30. y axis is FCMntPay and goes from 0 to 65000. Dots are approximately 80% clustered along ages ~2.5 to ~26 with values of FCMntPay less than 20000, but rising with age. At the bottom is a line overlayed over the plot, and it rises from left to right with a slope of about 3 degrees. [Noah Won] So as you can see, we have very high density of data points here. And it's kind of hard to see because it just looks like a sea of black. But essentially, there's like a higher density of points on here. And we have a slightly positive correlation between age and FCMntPay. And when we find that our estimates kind of support this, we see our P values are much greater, are much smaller than the arbitrary value of .05. So we find statistical significance and that our beta values are not equal to zero. And since our beta value for age is positive, there's a positive association between age and FCMntPay, which we can also see with this graph. And which gives us reason to believe that with increase of age, there'll be a larger FCM monthly payments for these foster care children that have aged out. 
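Rather than plugging the estimated coefficients into the written equation by hand, the same predictions can be read off the fitted object with base R's predict(); a minimal sketch, assuming the model object fitted in the demonstration above:

# Predicted monthly payment at a few illustrative ages, using the model
# object fitted above with lm(FCMntPay ~ age, data = afcars2)
new_ages <- data.frame(age = c(16, 18, 21))
predict(model, newdata = new_ages)

# interval = "confidence" adds a confidence interval around each prediction
predict(model, newdata = new_ages, interval = "confidence")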
And yeah, we get, you know, this shows a positive correlation. So in a nutshell, linear regression kind of allows us to find these correlations between, you know, age or these continuous outcomes and our predictors, but as Alex says, these are not necessarily indicative of any causative nature between variables. It just shows us that there's a correlation. And as he said, we're going to go over potential confounders that can kind of change, you know, that can present complexities between correlation and the causative nature of these two variables. And one way to address this is actually to use multiple linear regression. [ONSCREEN] # 3. Multiple Linear Regression # # Our model seems to describe a positive relationship between age and FCMntPay but what about # Hispanic status as a confounder? model2 <- lm(FCMntPay ~ age + Hispanic, data = afcars2) summary(model2) [Noah Won] So as Alex has said, we're going to include Hispanic status in this next model we're running in addition to age. It's a kind of control for any potential, you know, confounding effects that Hispanic status might have on, you know, on child monthly payment and age. So we're going to go ahead and run this second model, we're assigning it to model2, and we're running the very same thing, except we're also adding the Hispanic variable in our predictor list. Of course, we're going to use our data as AFCARS2, and we're going to run another summary model. [ONSCREEN] > ggplot(afcars2, aes(x = age, y = FCMntPay)) + + geom_point() + + geom_smooth(method = "lm", col = "red") + + labs(title = "Linear Regression: FCMntPay ~ age", + x = "age", + y = "FCMntpay") `geom_smooth()` using formula = 'y ~ x' Warning messages: 1: Removed 1366 rows containing non-finite values (`stat_smooth()`). 2: Removed 1366 rows containing missing values (`geom_point()`). > model2 <- lm(FCMntPay ~ age + Hispanic, data = afcars2) > summary(model2) Call: lm(formula = FCMntPay ~ age + Hispanic, data = afcars2) Residuals: Min 1Q Median 3Q Max -2446 -961 -443 246 68957 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -119.2461 6.6753 -17.86 <2e-16 *** age 85.6808 0.4437 193.11 <2e-16 *** Hispanic -105.2086 6.2719 -16.77 <2e-16 *** --- Signif. codes: 0 โ€˜***โ€™ 0.001 โ€˜**โ€™ 0.01 โ€˜*โ€™ 0.05 โ€˜.โ€™ 0.1 โ€˜ โ€™ 1 Residual standard error: 2038 on 629858 degrees of freedom (1366 observations deleted due to missingness) Multiple R-squared: 0.05615, Adjusted R-squared: 0.05614 F-statistic: 1.873e+04 on 2 and 629858 DF, p-value: < 2.2e-16 [Noah Won] Okay, and again, we have some information about the spread of our residuals, and we have our beta estimate tables. So it seems, you know, we have a very similar effect here in for age, and if you pay close attention, this variable is actually changed. So it's 85.594 to, well actually sorry, 85.6808, and it's changed slightly from 0.594. So, so when you add other variables in, they're not independent necessarily. It finds that best line of fit, right? That Alex was talking about, and that may change slightly change our estimates, you know, for age, since we added Hispanic status. And now we can see actually that Hispanic status has a negative effect on this model that we're running and we're going to, again, see, you know, these p values are much less than our arbitrary value of 0.05. So we find statistical significance for all these not equaling to 0. Essentially saying that they have a non-zero impact on the monthly payment of their, the monthly FCMntPay. 
[ONSCREEN] # It seems that Hispanic status has a negative effect on FCMntPay # Keep in mind that the beta values for age and the intercept have changed # A written form of our model is as follows: FCMntPay = 85.6808 * age + -105.2086 * Hispanic - 119.2461 [Noah Won] And I've written out this model in the same way as the model I wrote out before. We have a positive correlation between age and FCMntPay, and we have a negative correlation with Hispanic status. Remember, the reference group is non-Hispanic, and the value of one is Hispanic. So when this value is one, i.e. the participant is Hispanic, we'll have a decrease in FCMntPay. Versus if this is a zero, they're non-Hispanic, then this whole term will be zero and the effect will be null. And of course, our intercept just helps us orient our line of best fit. So yeah, as Alex said, there are multiple ways to control for confounding, and another way to control for a potential confounding variable is a stratified multiple linear regression, or stratified linear regression. [ONSCREEN] # 4. Stratified Multiple Linear Regression # Stratified regression models fit different models based on the stratifications of a provided variable # Adding a dummy variable and using a stratified regression model can be used to address confounding variables # Stratified models are helpful when a variable violates linearity or homoscedasticity assumptions and cannot # be used in a linear model [Noah Won] And this is especially useful if we have reason to believe that our confounding variable violates some of our assumptions, such as linearity or homoscedasticity, which we'll be going into in more detail in our next lecture. It's an alternative way to control for confounding variables if we can't necessarily put them into our model. So, as Alex said, we're going to be stratifying by sex, using the SEX_d variable that we created earlier. There are many ways to deal with missing data, but just for the simplicity of this example, I'll be filtering out any missing values of our sex variable. [ONSCREEN] > afcars3 <- afcars2 %>% + filter(!is.na(SEX_d)) > model3 <- afcars3 %>% + group_by(SEX_d) %>% + do(model4 = lm(FCMntPay ~ Hispanic + age, data = .)) > model3 %>% + do({ + model_summary <- summary(.$model4) + data.frame( + SEX_d = unique(.$SEX_d), + Intercept = coef(model_summary)[1, 1], + Hispanic_coef = coef(model_summary)[2, 1], + Age_coef = coef(model_summary)[3, 1] + ) + }) [Noah Won] So that's what this code does. It creates an afcars3 data set from our afcars2 data; we're just taking out the missing sex data. Then we're grouping by our sex variable and running the same model as we had before. Grouping by this variable runs two different models, one for males and one for females, each with FCMntPay as the outcome, Hispanic status and age as the predictors, and afcars3 as the data. Let's run this model. Okay, nice. And we'll be presenting some of these summary statistics in a data frame, since it's a slightly different way of approaching this.
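Because dplyr's do() interface has since been superseded, a functionally equivalent sketch using split() and lapply() is shown here; it is an assumed alternative for readers following along, not the presenters' code.

# An alternative way (assumed, not from the demonstration) to fit the same
# stratified models: one regression per level of SEX_d
library(dplyr)

afcars3 <- afcars2 %>% filter(!is.na(SEX_d))

models_by_sex <- lapply(split(afcars3, afcars3$SEX_d),
                        function(d) lm(FCMntPay ~ Hispanic + age, data = d))

# One column of coefficients per stratum (SEX_d = 0 and SEX_d = 1)
sapply(models_by_sex, coef)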
[ONSCREEN] # A tibble: 2 × 4 # Rowwise: SEX_d Intercept Hispanic_coef Age_coef 1 0 -206. -118. 98.5 2 1 -40.3 -85.7 73.2 [Noah Won] Okay, nice. So now we have two different models, right? One for females, zero, and one for males, one. And as I said, it's kind of arbitrary how we assign those; it's just how you want to interpret it. And I've added written forms of these models right here. [ONSCREEN] # A written form of our model is as follows: # Women - FCMntPay = 98.5 * age + -118 * Hispanic - 206 # Men - FCMntPay = 73.2 * age + -85.7 * Hispanic - 40.3 # It seems that women have a larger increase in FCMntPay compared to men as they age, but Hispanic # women have less FCMntPay than Hispanic men [Noah Won] And we can see that both models have positive associations for age and negative associations for Hispanic status, as we had seen before. But interestingly, it seems that women have a larger increase in the monthly payment compared to men as they age. However, Hispanic women have lower monthly payments than Hispanic men: among women, Hispanic status is associated with a larger decrease in the monthly payment than it is among men. So these kinds of differences highlight the intersection of Hispanic status and sex when it comes to the monthly payment. And if sex did violate some of our assumptions and we couldn't put it into our model, this is one way of addressing and controlling for sex while still being able to draw conclusions from two separate models. Okay. With that, I want to leave some time for questions and Q&A. Of course, if time runs over, I have my email, it is Noah.Won@duke.edu, but I'll just turn it over to Andres for any Q&A. So, thank you very much. [Andres Arroyo] Thank you Noah. I will read aloud a question that arrived in the Q&A, and here it is. "So many child welfare-related things rely on an exact age, particularly turning 18. I'm curious if folks have thought through how to write the age code to be more exact for other analyses that rely on more specific ages." [Noah Won] So, the question is how to compute the most exact age. Given date of birth information, I believe the standard is to take the number of days since the date of birth and divide it by 365.25. In cases where we might not have an exact date of birth, we try to use as much auxiliary variable information as we can, but I believe the standard is the number of days since the date of birth divided by 365.25. If you have any more questions, in particular about how ages could be derived, I'm happy to answer them, of course, via email after this presentation, but I believe that's the standard. [Alexander F. Roehrkasse] I'll just briefly build on that answer, which is a very good one. Of course, some data sets like AFCARS have information about children's date of birth. Other data sets, like NCANDS, do not. In NCANDS, we observe, for each maltreatment report, the child's age in years at the time of the report.
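The calculation Noah describes can be sketched in R as follows; the dates and variable names here are hypothetical, not AFCARS or NCANDS field names.

# Age in years as days since date of birth divided by 365.25 (hypothetical dates)
dob      <- as.Date("2007-03-15")
ref_date <- as.Date("2024-09-01")

age_years <- as.numeric(difftime(ref_date, dob, units = "days")) / 365.25
age_years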
But as the question asker, I think, rightly intuits, in research scenarios where it is important to know, say, the child's age in days, or to know precisely when they turn 18, age in years at the point of the maltreatment report sometimes doesn't offer as much information as we would like to have. There's no single best strategy here. I've done different things depending on the research question, the research context, and the research audience. But I would just invite folks to explore two different distributions. One thing that can be helpful is to look at the calendar frequency of maltreatment reports: how likely are reports to occur at different points in the calendar year? And then, second, to look at the calendar distribution of live births: for the children in the cohort you are analyzing, what proportion of children were born at specific points in the calendar year? Combining information from these two distributions can help you to make inferences about the child's exact age, although of course you won't be able to observe that directly. So I hope that's a little bit helpful, if only a very partial answer. Lastly, I'll say, though, that it can be yet another reason to link data. One thing we talked about is how certain variables are available about a given child in some data sets but not others; we might have different variables in different data sets. So if we're able to link children, say, from NCANDS to AFCARS, we might then start to get information about their date of birth. Obviously, that limits your sample. But if you're looking for specific age data, date of birth information, linking to AFCARS might be one way to get more specific age data. [Andres Arroyo] Thank you Noah and Alex. Okay, it looks like there are no more questions. So that concludes session four of the Summer Training Series 2025. [ONSCREEN CONTENT SLIDE 25] Next week… Date: July 30th, 2025 Topic: Visualization and Finalizing the Analysis Instructor: Noah Won [Andres Arroyo] Thank you very much to Alex Roehrkasse and Noah Won for presenting. And we will have our final session next week at this time. Thank you very much. [VOICEOVER] The National Data Archive on Child Abuse and Neglect is a joint project of Duke University, Cornell University, University of California San Francisco, and Mathematica. Funding for NDACAN is provided by the Children's Bureau, an Office of the Administration for Children and Families. [MUSIC]