[musical cue] [Voiceover] National Data Archive on Child Abuse and Neglect. [Clayton Covington] Welcome to the Summer Training Series hosted by NDACAN, which is housed jointly at Cornell University and Duke University. Next slide, please. We are funded by the Children's Bureau, an arm of the U.S. Department of Health and Human Services under the Administration for Children and Families. Next slide. To give you all a little overview of where we've been and where we're going: this is our fourth session of the Summer Training Series. We previously did an introduction to the archive at large, then talked about one of our newest data acquisitions, the CCOULD data set. Last week we had a presentation by Garrett Baker on causal inference using administrative data, and today we're doing a session on evaluating and dealing with missing data in R. Looking forward, we have a session on time series analysis in Stata, a different software package, and then we'll conclude the series on August 9th with data visualization in R. With that in mind, I'll pass it off to our presenter, Dr. Frank Edwards, who is an Assistant Professor at Rutgers University. [Frank Edwards] Thanks, Clayton, and thank you to everyone who is here today. I'm going to give a brief overview and introduction to how to both theorize and address missing data, and we're going to use the AFCARS foster care file as our playground for that. I'll start with a broad statistical and theoretical overview of how to think about missing data, and then we'll do a bit of live coding to show you how to process and think about missing data, and also how to address it using multiple imputation with R and RStudio. For those of you who are interested in following along once we get there, I want to drop a quick link in the chat [onscreen https://github.com/f-edwards/NDACAN_workshops/tree/main]. The code and data that I'm using today (these are de-identified data from AFCARS) are available at this link. It's a GitHub repository that includes the code and data we'll be using today, and if you want to load this file into RStudio to follow along, you can do that. The data files we'll be using today are in afcars_demo.csv in the data folder [link to afcars_demo.csv https://raw.githubusercontent.com/f-edwards/NDACAN_workshops/main/data/afcars_demo.csv]. I'll show you how to load that in when we get there, but you can click through to afcars_demo.csv, click Raw, and it will load the CSV file, which you can then save somewhere on your computer and load into R if you know how to do that. Otherwise, I'll walk through it a little more slowly when we get to the interactive part of the presentation. So let's get started with the theoretical overview. Most statistical software by default conducts what's called a complete case analysis. When we're running a regression or some other kind of analysis, even just crosstabs, most software will by default drop any case that is missing information on any of the variables we're looking at. That's called a listwise deletion approach: we systematically delete the rows that do not have complete information. Depending on how much data is missing in the variables you've chosen, this could be throwing away a ton of good information.
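For reference, here is a minimal sketch (mine, not from the presentation) of that default behavior: R's lm() silently performs a complete case analysis because its default is na.action = na.omit. The toy data here are hypothetical.

```r
# Toy data: y observed for all ten rows, x missing for two of them
toy <- data.frame(y = rnorm(10), x = c(rnorm(8), NA, NA))

fit <- lm(y ~ x, data = toy)  # complete case analysis by default (na.action = na.omit)
nobs(fit)                     # 8 -- the two rows with a missing x were listwise deleted
```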
Let's say we have 20 variables that we're using for a regression; we're not being especially parsimonious in our modeling with those 20 variables. If 20 percent of the observations are missing one of those variables, but the data are complete for every other variable, we're throwing away a fifth of our data by using a complete case method, right? Missing data on one variable is throwing away a lot of perfectly good information that we do have on those units across the variables where information is complete. So just from an information standpoint, it's not a great practice to delete an observation when we actually do know something about that unit. You're throwing away perfectly good information, and at minimum what you're doing is reducing your N. We know from how we compute standard errors (the standard deviation of the quantity of interest divided by the square root of N) that when we throw away observations we shrink N, which inflates our standard errors. So at minimum we are mis-stating uncertainty, and we could also bias our coefficient estimates if missingness is correlated with some other feature of our data. That is, we could mis-state uncertainty in our model, with standard errors that are too large or too small, and we could get biased estimates: our estimate of a parameter of interest, maybe a population average, could be too low or too high depending on why the data are missing. However, with a relatively small set of assumptions, we can correct these problems. We can't necessarily ever replace missing data; that's not our goal. Our goal is never to replace missing data. Our goal is a valid estimate of the quantity we're after, the parameter we're after, and an accurate quantification of how uncertain we are about that estimate. Missing data is part of the uncertainty structure, right? Uncertainty in statistics has a lot of different sources. One of them is the kind of sampling uncertainty we're used to thinking about: in a frequentist regression context, we're always thinking in terms of the law of large numbers or the central limit theorem. That's what our confidence intervals and standard errors typically express: if we repeated this study many times, 95% of the intervals we constructed this way would contain the quantity we're estimating. So that's one source of uncertainty, sampling uncertainty, but we have other sources of uncertainty in our model. The data generating process itself, data collection itself, is a source of uncertainty, and missing data analysis is a way to be more honest about that in our presentation. And in administrative data, missingness is a routine problem, so it's really important when we're working with things like AFCARS or NCANDS to take missing data very seriously, because it is a big problem.
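A quick simulation of that N-and-standard-errors point (my own illustration, not from the slides), using the standard error of the mean, s divided by the square root of N:

```r
set.seed(42)
x <- rnorm(1000, mean = 10, sd = 2)

se <- function(v) sd(v) / sqrt(length(v))
se(x)                          # SE with the full N = 1000

drop_idx <- sample(1000, 200)  # delete 20% of observations completely at random
se(x[-drop_idx])               # larger SE: same spread, smaller N
```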
Okay, so why are data missing? There are three canonical approaches to theorizing the statistical mechanisms that might generate missing data relative to a random process. First, we could have data that are missing completely at random: MCAR. I don't love this nomenclature because it's a little misleading, but we'll stick with it and I'll try to explain why. "Missing completely at random" means, effectively, that there's some unconditional probability distribution that determines whether a value is missing. That is, within a variable, every single observation has equal probability of being missing. So, sticking with AFCARS, say we're looking at the age-at-end variable, and say every single individual has a 0.01 chance of being missing. The simulation algorithm for that would just be to take a one-in-a-hundred sample from AFCARS and delete the age values for that sample. That reflects a missing completely at random process: it doesn't depend on anything; it's just a pure random function for one variable. "Missing at random," MAR, is the one I don't love, because what it tells us is that the probability of a value being missing is random but conditional on other observed variables. We could think of this as missing conditionally at random, but then we'd have two MCARs, and that doesn't work. I didn't invent this framework; I believe it's Rubin's. Anyway, missing at random means the likelihood of a value being missing is a function of some other set of observed variables in the data. Perhaps some states weren't as good at recording the race/ethnicity of a child in AFCARS as other states; in that case, the likelihood of being missing is a function of the state in which the data was recorded. And if that's really the only thing going on, then we can effectively model it; we can treat it as a statistical problem that can be answered with an uncertainty statement. The challenge is when we have non-random missing data: the probability of a value being missing depends either on something we don't or can't observe (some exogenous process we know nothing about that structures the data) or on the value itself. That is, we have a censorship process going on. A classic example: people who are extremely poor or extremely wealthy often won't report their income on surveys, two directionally opposite social desirability processes. That kind of censorship is a real challenge if we have no way to triangulate information about that variable from somewhere else, because the only thing missingness depends on is the value itself. If I don't report my income because of my income, and there's no other way for the researcher to infer anything about my income, then they have no information on income: the value itself determined its missingness. That's a really difficult, and sometimes intractable, problem in missing data analysis.
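To make the taxonomy concrete, here is a small simulation sketch (my own, with made-up variables): MCAR deletes ages with one unconditional probability, while MAR makes the deletion probability depend on an observed variable, here state.

```r
set.seed(1)
n     <- 10000
state <- sample(c("A", "B"), n, replace = TRUE)
age   <- sample(0:17, n, replace = TRUE)

# MCAR: every observation has the same 0.01 chance of being deleted
age_mcar <- ifelse(runif(n) < 0.01, NA, age)

# MAR: missingness depends on an observed variable -- state B records age poorly
p_miss  <- ifelse(state == "B", 0.20, 0.02)
age_mar <- ifelse(runif(n) < p_miss, NA, age)

tapply(is.na(age_mar), state, mean)  # missingness rate differs sharply by state
```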
So it's this third category, non-random missing data, that's the huge problem in missing data analysis, and we're just not going to be able to deal with it using the methods we describe today. We're really looking at the first two categories: missing completely at random, and missing at random (conditionally on other variables in the data). Okay, so let's look at common approaches to missing data. I see questions coming in; I'll come back to those at the end. The one about things not applying is a good one; I'll get to it in a little bit. Here are our basic approaches to missing data, and sometimes we don't even need a statistical approach; it depends on the nature of the challenge. I mentioned listwise deletion, sometimes called a complete case analysis. This is fine if there are very few missing observations, or when missingness is completely at random and rare. That is, missingness doesn't depend on anything in the data and it's, say, one out of a million cases, where it's not going to affect our estimates of uncertainty. Then it's effectively ignorable in that context. But that's typically not the case we're going to face with admin data. So listwise deletion is something I strongly recommend not doing; I think there are very few cases where it's appropriate for an analysis you're going to report. At the preliminary stage, when you're doing exploratory work, I think it's fine because it's computationally cheap. I'm just seeing a question: what percentage of missingness would typically be considered small? Some people have said 10 percent. I'm not going to give you a rule of thumb; I think it's case by case, especially in an admin data context, because missing data in something like AFCARS or NCANDS is going to vary across states. We may have two percent missing data on a variable across the whole data set, but 20 percent in one state, right? So which is ignorable? It really depends. Any time we see missing data, and I'll just go ahead and say anything more than 0.01 percent of the data missing, I want you to at least check it out. You need to understand where the data are missing, what is missing, and why it is missing. There's a lot of exploratory and detective work that goes into this; we really need to understand our data sets incredibly well before we start to apply statistical methods. (See the sketch below for one way to run that check.) Okay, we can also use alternative information. In AFCARS, for children who are in foster care for long periods of time, or who have multiple exposures to foster care over their lives, we have unique IDs available to you. So, for example, we might want to borrow an observation. Say a child is missing information on their sex in their age-seven year in the AFCARS data, and they were also in the AFCARS data at age four, where the sex variable was present. We could consider borrowing that observation from age four to inform the missing value at age seven, or the same for race/ethnicity; or, if age were missing, we could consider using time itself to add the number of years that would be appropriate for that child.
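As an illustration of the state-by-state check described above, here's a sketch using the column names in the workshop's afcars_demo.csv, where ages of 99 or above flag missingness:

```r
library(tidyverse)

dat <- read_csv("./data/afcars_demo.csv")

# Share of AgeAtStart values carrying a missing-data code (99+), by state
dat %>%
  group_by(St) %>%
  summarise(pct_missing = mean(AgeAtStart >= 99, na.rm = TRUE) * 100) %>%
  arrange(desc(pct_missing))
```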
So we can always ask: where are the cases in which we have natural sources of information from other places in the data that let us simply replace that missing value, treating it as not truly missing? Now, sex is obviously a little complicated, because we could have non-binary or trans youth in the data who have a true shift in that variable over time, so we want to be cautious when we do something like this and think about the context of what we're measuring. But it's a good thing to think about, because in effect we may not actually have uncertainty about what that missing value is; we may know what it is if we borrow prior information. (A code sketch of this borrowing idea follows below.) Non-response weighting is another approach people use: if we can characterize the distribution of who's missing as a function of other variables, we can use a weighting procedure to overweight observations that are similar to those that are missing. That can get really tricky when many variables are missing and when the subpopulations of interest differ within the data. I tend to think of it as a procedure about as complicated as multiple imputation, but with more downsides, so I would typically avoid non-response weighting. I can see the logic for it; it's just that there are approaches that, to me, capture the uncertainty more cleanly and aren't as dependent on the observed data itself as weighting would be. Okay, so the approach we're going to talk about today is called multiple imputation. What multiple imputation does is construct a regression model to predict values for the missing observations. We take the variable of interest where we have missing data, treat it as the outcome of a regression model, estimate that model, and then predict values for those missing cells from the regression model. But we're not just filling in the expected value of the regression; that would be something like a linear interpolation, and we're not doing that. Instead, we're relying on the uncertainty that's inherent to a regression model: we get a probability distribution for a predicted value, and we sample from that distribution to get multiple possible realizations of what the variable could have been had we observed it, according to our regression model. So effectively we iteratively model all of the missing outcomes and predictors, and then draw random samples of what we could have observed if this regression model were truly capturing the data generating process. For those of you who are crudely familiar with the idea of a Bayesian approach, this follows a similar logic: we use the probability distribution itself to populate our uncertainty about what the value could have been. We're not recovering a true value. Instead, we're drawing lots of possible values that could have been observed given what we know from the data and the assumptions of our model.
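A sketch of the borrowing idea, assuming a long-format child-level file with a hypothetical child ID column (the demo extract used later doesn't include one):

```r
library(tidyverse)

# Hypothetical long file: one row per child per year, some sex values missing
kids <- tibble(
  child_id = c(1, 1, 2, 2),
  age      = c(4, 7, 3, 6),
  sex      = c("F", NA, NA, "M")
)

# Carry each child's observed value forward (and backward) within child
kids %>%
  group_by(child_id) %>%
  arrange(age, .by_group = TRUE) %>%
  fill(sex, .direction = "downup") %>%
  ungroup()
```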
Under the missing at random assumption, this generates unbiased parameter and variance estimates, provided the variables that missingness is conditional on are in the model and our specification of the model is correct. So our typical regression assumptions hold here: we need the correct model specification and the appropriate variables in the model. Okay, so here's my preferred approach; you could think of this as my algorithm for how to solve missing data problems. First, understand your data; know it inside and out. Read the documentation. Understand what these things are measuring and what they're supposed to be measuring. Understand whether you have structural missingness. We have a question here about things not applying: a variable recording foster care entry dates that is missing for children who never entered foster care. That's where the documentation is your friend. That variable is not missing; it just doesn't apply. It's a different kind of NA. In R it's going to show up as an NA either way, but we don't want to treat it as non-response missingness. There are many different ways data can become missing, and it's really critical that you understand why the data are missing. Are they missing because the variable isn't applicable for that unit, or because there was non-response or some other data collection problem? That's what I mean by the mechanisms of missing data, our fourth bullet point here: we want to develop an understanding of the mechanisms of missing data in each data set, and for each variable you use. You're not always going to be able to cleanly identify the mechanism, but you can think really deeply about it, reflect on what might be driving the missingness, and write that up. That's going to be part of your analysis; this often turns into appendix one for me on a paper: how did I think about missing data, and what did I do to address it? Why do I think this variable was missing? Here are the ten reasons I could think of, and here's how we can think about whether the approach we've used addresses them appropriately. You want to be transparent with this in the interest of reproducible science, so that your reviewers can fully understand exactly what you've done and why you've done it. And then you can test your ideas about the mechanisms when feasible. In an admin data context, I already mentioned that state processes can vary a ton, so for me the first thing to check is whether there's heterogeneity in how jurisdictions collect different kinds of variables. One that people ask about a lot, at the Summer Research Institute and in other contexts at NDACAN, is the services variables in AFCARS: what kinds of services did a child receive? NCANDS has variables like this too, for things like risk factors. When you look at the prevalence of missingness across states for things like services in AFCARS, you see a great deal of heterogeneity across states. And if you look over time within states, you see lots of heterogeneity over time within states too. Practices change at the state level, and they differ across states. So a clear mechanism for missing data might be internal agency data recording processes.
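One way to eyeball that heterogeneity, sketched with hypothetical names (a multi-year extract with a services variable; the workshop's demo file covers only 2019):

```r
library(tidyverse)

# Hypothetical multi-year AFCARS extract: FY, St, and a services variable
afcars_multi %>%
  group_by(St, FY) %>%
  summarise(pct_missing = mean(is.na(services)) * 100, .groups = "drop") %>%
  ggplot(aes(x = FY, y = pct_missing)) +
  geom_line() +
  facet_wrap(~St)
```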
And the way to sniff that out is to look at time series of missingness volumes across states. You want to test your ideas about why things might be missing with exploratory data analysis whenever possible. Okay. If we've done all that, and if missing at random (that is, conditional on other things we observe) is a reasonable assumption, which it often is, we can conduct multiple imputation. Because missing at random is conditional on observables, including a lot of variables in your imputation models is often a good idea, as long as they're not collinear. It's a prediction-based model: we're not trying to get theoretical precision here; we're trying to get good predictions, so in that case more information is the way to go. A prediction-based model we want to load up with as much information as we can. Including lots of variables is a good idea as long as they're not perfectly collinear or linear combinations of each other; that will run into problems. Then we estimate our analysis model on each imputed data set we've constructed, and Rubin has a set of rules for combination that packages like mice, which I'm going to show you, implement automatically. Effectively, we take the arithmetic average of our coefficient estimate across the imputed data sets, and a weighted average of the standard error that adjusts for variance across the samples. So the point estimate of our parameter is just a simple mean across each of these models; the standard error we obtain is weighted. You can look up Rubin's rules for combination and read all about that fun activity, or you can use mice's built-in features to apply them in the context of simple regression models; it's no problem. And we report those estimates. Okay, so we're going to apply these methods now. What time are we at? Sorry, yeah, we're good. We're going to apply these methods to AFCARS. More work is absolutely going to be required to get it right for your own analysis. Here I'm using R and the mice package, but all major statistical packages (Stata, SAS, SPSS) offer multiple imputation through chained equations, so you can check your software's documentation to understand how to implement it there. The basic approach is going to be really similar. The code is available at the URL that was posted earlier, a GitHub repo, which is where I'll be working from, and I'll come back to the imputation itself after we demo it. So let's just dive in. I have on the slides an example of using AFCARS; there's a lot of data cleaning code in the slides that I'll walk through. Thank you, Alex, that's right. But I think it's going to be easier for us to just work through it together. [voiceover] The program, written in R, is included in the downloadable files for the slides and the transcript. [Frank Edwards] So I'm going to tab over to my RStudio window. For those of you who have used GitHub before, this shouldn't be too complex: you can just clone this onto your machine. If you haven't used GitHub before, no problem; this is a repository that includes the code and data we're going to be using. The file I'll be working from today is workshop7_26_23.Rmd.
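Here's what that estimate-and-pool step looks like in mice, sketched on the package's built-in nhanes data (the same data set demoed later in this session):

```r
library(mice)

imps <- mice(nhanes, m = 5, seed = 123)        # five imputed data sets
fit  <- with(imps, lm(chl ~ bmi + age + hyp))  # fit the model in each one
pool(fit)                                      # combine with Rubin's rules
summary(pool(fit))                             # pooled estimates and SEs
```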
If you want to get under the hood and see what I did, you can look at make_pop_data.r, make_sample_data.r, and read_ndacan_data.r to see how I formatted the data files we'll be using today. But we're going to use the R Markdown file, workshop7_26_23.Rmd. I have that open on my hard drive as well, and I'll be syncing it up to my GitHub repository as we go. The data we're going to use are in afcars_demo.csv. If you want to follow along, you can click the Raw button after you click through, or I'm going to paste this into chat: if you click this [link to afcars_demo.csv https://raw.githubusercontent.com/f-edwards/NDACAN_workshops/main/data/afcars_demo.csv], then when it loads you will have the option to save it as a CSV file on your hard drive, in whatever directory you'll be doing your work in. So that's what we need to start. The other file we'll be using is pop_demo.csv [link to pop_demo.csv https://raw.githubusercontent.com/f-edwards/NDACAN_workshops/main/data/pop_demo.csv], which is derived from the National Cancer Institute's SEER population data. They give us county-specific age-by-race estimates based on the Census small area data, so that's the population data we'll be using. Okay, so let's do some coding. This is an R Notebook, an R Markdown file. If you haven't seen these before, they're nice because we can generate reports off of them and write plain text and code at the same time. First thing: load the packages we need today. We have two packages, tidyverse and mice. Let me just restart my session. Oops, I meant to just restart R; oh well, okay, there we go. And you can click that broom icon to remove all the prior output so it's not distracting. Okay, loading tidyverse and mice; these are the messages I get when I do that. If you don't have them installed, you can just run install.packages("tidyverse") (I already have it, so I'm not going to run it) and install.packages("mice") to install those packages. Again, I'm not going to actually run that code, so I'll just cancel it. Okay, we have those packages loaded in; now let's load in our AFCARS data. And this is what that data looks like. I've formatted it to include four variables: the year of the data, which is 2019; the state we're reporting from, recorded both as a numeric FIPS code and a two-letter state abbreviation; and then AgeAtStart. AgeAtStart is the variable of interest we're going to be looking at today: how old the child in foster care was at the start of the reporting period. A minus one... actually, let me take a chat question first: no, the code can't be included in the chat, but it is available to you online if you go to the URL that's in the chat. If someone could post that URL again... yeah, it's just right here; I'll post it in the chat. There, that's the code. Okay, back to R. This is the AFCARS data we're going to be using, and this is the population data we're going to be using. I'm going to format the variable names on our population data so they match the names on our AFCARS data; it makes it easier to join.
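If you're following along without cloning the repo, you can read both demo files straight from the raw GitHub URLs (a convenience sketch; the workshop's Rmd reads them from a local ./data folder instead):

```r
library(tidyverse)

dat <- read_csv("https://raw.githubusercontent.com/f-edwards/NDACAN_workshops/main/data/afcars_demo.csv")
pop <- read_csv("https://raw.githubusercontent.com/f-edwards/NDACAN_workshops/main/data/pop_demo.csv") %>%
  rename(St = state, FY = year)  # harmonize names to AFCARS for joining

head(dat)
```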
Someone asked about SES: you could get that from lots of places, like the American Community Survey, but it's not available in AFCARS, so you would need to join it from an external data set. For lots of reasons, poverty and other kinds of household characteristics are not asked about in the AFCARS or NCANDS data, so you would need to join that at the place level, counties or states. Okay, so we have these data loaded in. The next thing we want to do is take a look at what might be going on in the data, so let's explore the AgeAtStart measure. I'm going to start a new code chunk here. The way I want to do this is to look at the unique values for AgeAtStart, just to get a sense of the range of the data; or rather, maybe not unique() but table(). Okay, so here are the possible values and their frequencies in the data. Minus one, if you read the codebook, tells us we're talking about a child who had not been born yet at the start of the reporting period. So this would be an infant who was in foster care at some point during the year, anywhere between 0 and 11 months old during the reporting period, but who hadn't been born yet at the start of the reporting period, which I think is October 1 each year, but someone can correct me on that; read the documentation, because I'm pulling that out of the back of my brain and I could be wrong. But let's look at this. We have 35,000 of those not-yet-born infants; 52,000 infants (zero counts as anyone below one year old); 54,000 one-year-olds at the start of the reporting period; 48,000 two-year-olds; 43,000 three-year-olds; et cetera. Now, where it gets interesting is down here, over age 18. We know some states have extended care for children who are aging out, so the 18-, 19-, 20-, 21-, 22-, 23-, 24-, 25-, and 27-year-olds are not terribly surprising to see. Those are going to be kids in an aging-out process who remain in care past the age of emancipation, and we only see one 27-year-old. Where we do have true missing data is in the 99s and the 101s. 99 is the code for missingness on this variable; you'll find that for a lot of variables in AFCARS, 99 is your flag for something being missing. And I'm just going to assume that 101 is a data entry error, but we're going to treat it as missing as well. Once we know that, I want to go ahead and format those as missing in R. I'm using tidyverse syntax to do that, so there's our pipe operator, and I want to mutate my AgeAtStart variable to explicitly recode missings. So I'm going to change AgeAtStart where, if AgeAtStart is greater than or equal to 99... oh wait, I need to do this as an ifelse, sorry: ifelse(AgeAtStart >= 99, NA, AgeAtStart), otherwise leave it alone. You could also do this with case_when or other procedures; it would work just as well. Okay... what? Okay, this is why it's a good idea to prepare some of these things rather than doing them live, because you make mistakes, and that's okay. Well, let's just try recode(). No. What am I doing wrong? Here, we'll do it old school with dat$AgeAtStart, because I don't want to spend too much time debugging. So we'll say AgeAtStart, brackets; here we go, see how well I remember my base R. This should do it.
What is going on? I suck at programming, apparently. Sorry, y'all; these are the joys of working with me live. Let's try it with case_when. Okay, so now I'll do AgeAtStart is greater than... well, actually, I think it's numeric, but let's make sure. Yeah, it's numeric; I was thinking it might be character, which might be giving me my problem. [inaudible] I don't understand this error message. Apologies, everyone; I'm going to do something a little weird. I don't know if this is a function of me being in an R Notebook rather than a plain R script environment, so I'm going to switch to an R script environment and see if I still get the same problem. Yeah, again, joys. I should have... thank you, Alex. Yeah, I don't know what's going on; the code looks fine, but weird things happen. Oh my god. Okay. Let me try it in a new variable. I swear I know how to code. What's going on? Thank you, Abby, now let's try that. I thought I did that, but yeah. Something funky is going on in my R environment. We need to capitalize there, yeah. I should just tolower() everything, shouldn't I? I probably just have a misspelled variable name. Something funky is going on in my R; I'm really sorry, y'all. So let's do something slightly different. You know what we're going to do: we're going to pivot over to the pre-prepared code I have over here, and we're just going to work through what I know works. It's going to look similar to what I showed you, but hey, what are you going to do? So, sorry to pivot off of the live demo; I know it would have been good to work through, but we also have in the slides a prepared batch of code that doesn't depend on whatever is going on with my installation of R, which I don't have time to fix right now. Okay. So again we're going to work with AFCARS, and this time we're going to look at race/ethnicity. We have the entered variable, so what we have here is counts: instead of the individual-level AFCARS, this is a data set I produced that is a state-by-year count of the number of children who entered foster care, by race/ethnicity. And you can see that we have it for 115 counties here. So this is a more realistic example, and one that I've actually run a number of times. It is computationally intensive to run something like this. The example I was going to show you only had 10 missing values, so it wouldn't have been computationally intensive, but when we're talking about the race variable in AFCARS, which is going to be of interest to a lot of people, we're talking about thousands of missing data points in any given year, potentially tens of thousands; and if we're using these data in a time series, we might be working with potentially hundreds of thousands of missing values. So it is computationally intensive. Start with a single year of the data. If available, remote servers are really great for this kind of work, because the job might take hours to run, and that way it isn't occupying your laptop. So we're going to use population composition here, again from the same SEER population data I showed you a moment ago, and here is code you can use; you can copy and paste it out of the slides once they're posted.
This is using the file you can download from this URL. It looks like a lot, but formatting this data is kind of a pain, so you can just borrow my code to format it. We're going to recode race/ethnicity to match what I've recoded in NCANDS. In NCANDS and AFCARS we have a number of binary variables for race/ethnicity, and I find it useful to collapse those for the purposes of time series panel analysis. When we're looking at changes in states over time, we effectively need to classify children into one category rather than allowing them to be in multiple categories. So I use a method that considers the relative salience of racial and ethnic categories for CPS, and codes each child into one of these five groups. And this is what the population data is going to look like: for 1990, for FIPS code 0101, 24 American Indian/Alaska Native children, 32 Asian/Pacific Islander children, and so on; we're looking only at children under 18. Okay, this part is less important. We're going to use those to compute population composition variables, that is, the proportion of the population in each racial and ethnic category. Now, we do have a collinearity problem here: if we include the white population proportion too, it will be perfectly predicted by the values of the others. We have to leave one of the population categories out, because the proportions are additive to one. (See the sketch below.) We're going to join AFCARS onto this population data, and we're going to look at the number of missing values we have in race/ethnicity for this time series of the AFCARS data. We have 6,800 missing values, so we need to think about how we're going to address that. Okay: we're going to build a multinomial regression for race/ethnicity, using the foster care entry fiscal year and county population composition as predictors, and it's really simple; the way we do this is we just use the command mice(). You know what, we really do need to demo this, so I'm going to go back over to RStudio for a second, and we're going to use the demo data set that mice includes with the package to make this work. Yeah, nhanes is what they use, okay. Let's see if we've got it all right. Cool, okay, so we're going to pivot, and this code will go up: pivoting to the mice demo with built-in data. Again, really sorry for this, but let me show you how mice works. [onscreen head(nhanes)] Okay, so we have this data set called "nhanes", a health data set that includes age, BMI, hypertension, and cholesterol. We could say whatever we want about how bad BMI is as a statistical measure, which it is, but there we are. So we have four variables here, and let's quantify our missingness. There you go, the mice package; again, you can run install.packages [onscreen install.packages("mice")] and then library [onscreen library(mice)] to load mice in once you've installed it. Okay, so first let's get our heads around the missingness. We can run a summary on the data frame [onscreen summary(nhanes)] and it will tell us a lot about missingness in the data set. We see that BMI has nine missings, hypertension has eight missings, cholesterol has ten missings, and how many rows do we have in the data? [onscreen nrow(nhanes)] 25. So we're at a pretty high proportion of missingness here for BMI, and for cholesterol a pretty high proportion too, right?
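A sketch of that composition step, with hypothetical column names, following the five-category coding described above; one category (here, white) is left out of the predictor set because the proportions sum to one:

```r
library(tidyverse)

# Hypothetical county-year counts by race/ethnicity category
pop_comp <- pop_counts %>%
  group_by(year, fips) %>%
  mutate(prop = pop / sum(pop)) %>%  # proportion of county child population
  ungroup() %>%
  select(-pop) %>%
  pivot_wider(names_from = race_ethn, values_from = prop,
              names_prefix = "prop_")

# In the imputation model, use all but one category (e.g. drop prop_white):
# including every proportion would create a perfect linear combination.
```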
The next thing we want to do is... okay, five minutes. Okay. The easiest way to do this is: let's say we want to do a full imputation of all of the missing data in nhanes [onscreen mice(nhanes)]. We can just run the mice() command on it, and this is what we get out. The mice() command iteratively identifies each variable that has missing data, uses some algorithms to select an appropriate model for each of those variables, and then computes imputations for them. I didn't store that anywhere, so let's store it as imps [onscreen imps<-mice(nhanes)] and rerun it. Okay, so now we have this object called imps that has a lot going on in it, but the thing I want us to look at is the imputations: imps$imp [onscreen imps$imp]. Here we go. Okay, so what we have here is, for each of the rows that was missing, five hypothetical alternative data sets. For BMI, row 1 was missing, row 3 was missing, row 4 was missing, row 6 was missing, row 10 was missing, row 11 was missing. You can see that we haven't actually said row 1 is 35.3; we've said these are five possible values that row 1 could have been. We're saying row 1 could have fallen in the range between 22.7 and 35.3. We can always increase the number of imputations we run if we want more granularity, a broader representation of this distribution. You'll see that this is using an algorithm called predictive mean matching, which only fills in values observed elsewhere in the data, but we can get really specific if we want to use alternative methods. Same thing for hypertension here; I'm guessing this is clearly a binary true-or-false. For row 6 we have a couple of draws saying yes, this person might have hypertension; and cholesterol is the same. Now, the thing to notice here is that those missing-value predictions depend on the other things in the data. We know, for example, if we build a quick linear model of cholesterol on BMI plus age plus hypertension [onscreen lm(chl ~ bmi + age+ hyp, data = nhanes)], that there are relationships in there: cholesterol is a linear function of BMI, age, and hypertension, at least as we observe it here under this specification. So what the model is doing is using information on those variables to inform its predictions, and building out a more comprehensive picture of where that variable may have been had we observed it. So, yeah, we'll pivot over to Q&A now. Again, thank you so much for bearing with me through this fun, technically difficult presentation; I'll do a better job of checking my R installation before we go next time. I swear it was working on my side before we got going. Okay. [Clayton Covington] Well, thank you, Frank, for not only leading us through this exercise but also being flexible, right? Troubleshooting data is actually quite topical; this is kind of what happens when you get into the thick of things, so thanks for being willing to pivot. I'm going to read the questions we have so far and let you respond. And to our attendees: if you have any questions, please feel free to put them in the chat and I'll read them aloud.
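Two conveniences worth knowing here (a sketch; both are standard mice functions): you can raise the number of imputations with the m argument, and extract completed data sets with complete().

```r
library(mice)

imps20 <- mice(nhanes, m = 20, seed = 1)  # more imputations, finer-grained picture
complete(imps20, 1)                       # the first completed data set
complete(imps20, "long")                  # all 20 stacked, with .imp and .id columns
```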
So the first question asks: where does missingness due to things not applying (for example, a variable recording foster care entry dates that is missing for children who never entered foster care) fall? Some colleagues and I have a disagreement about this. [Frank Edwards] So the question is: what about something like a foster care entry date that is missing for a child who never entered foster care, where we know the child never entered foster care? Well, they wouldn't be in AFCARS if they never entered foster care, but we could imagine this in NCANDS. We need to think about the population of interest at that point. If the population of interest is children who were in foster care, and we have information that a child is not in that population, then yes, they should not be in the data set, and you should remove them. You should filter them out of the data. That's not missing data; that's accurately recorded data telling you the child is not in the subset you're interested in. So I would not treat that as a stochastic or random missingness process; that's a structural missingness process. It just doesn't fit, and it's recorded accurately as not applicable in that case. That goes back to the point that we have to think really carefully about why data are missing. (A one-line sketch of this follows below.) [Clayton Covington] All right, thank you, Frank. The next question says: imputation will add information based on the already existing information; does that increase the chances of encountering collinearity? [Frank Edwards] So, collinearity. Take a linear interpolation framework: if we use a model where we take the already existing information and use a linear regression to basically fill in the dots between where the prior data was and where the current data is (I'm thinking time series here), and we use that linear interpolation prediction to fill in those dots, we definitely are increasing bias when we do that. We're over-trusting the observed data, and we are introducing bias through that kind of approach. On the other hand, if we've built a good prediction model, that is, if we have good information based on what we've observed and what we can observe from third-party sources like population data, and we pull that in, then no: what we're doing is actually just folding appropriate uncertainty into the model. We're not deterministically stating where that value is; we're providing a plausible range for that value based on the kind of analytic model we're going to use for analysis anyway. So in that case, I really don't think we're introducing collinearity or bias into the model. What we're doing is actually adding noise to the model. [Clayton Covington] The next question asks: could you quickly review that last bit where you did a linear model, and what that told you about the missingness pattern? [Frank Edwards] Yeah, okay. There I was just wanting to demonstrate that we have some structural relationships between variables that we've observed [onscreen m1 <- lm(chl ~ bmi + age+ hyp, data = nhanes)] [onscreen summary(m1)], right? So in this case, this is a linear model where we've estimated cholesterol as a function of BMI, age, and hypertension. Again, I'm not really interested in getting this theoretically right at this point; I'm interested in predictive accuracy, and any time we add information, the worst thing we're doing is adding in no information, right?
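In code, the structural case described here is a filter applied before any missing-data work, not an imputation target (a sketch with a hypothetical entry-date column in an NCANDS-style file):

```r
library(tidyverse)

# Hypothetical: restrict to the population of interest up front.
# A child who never entered foster care is non-applicable, not missing.
fc_children <- ncands %>%
  filter(!is.na(fc_entry_date))
```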
I mean, we're not trying to estimate a causal value here; we're just trying to improve prediction accuracy. Now, yes, collinearity and other kinds of problems can be an issue. I'm not going to over-interpret this, but we can see that these estimates are non-zero; the estimated relationships between these measures are non-zero. And if we want to think about this a bit more formally, we could look at model fit, comparing m0, an age-only model, against m1, where we add in some more predictors. [onscreen m0<-lm(chl ~ age, data = nhanes)] [onscreen m1<-lm(chl ~ age + bmi + hyp, data = nhanes)] Our goal here is to improve fit, and specifically predictive fit. There are a lot of different ways we could think about evaluating that, but we can see that m1 performs better according to BIC, even though we have different numbers of observations here; we could subset out those missings and re-estimate if we wanted a cleaner comparison. Our goal here is just to show what mice is doing under the hood: when we ask mice to just run on the nhanes data like that, what we're implicitly doing, because of the package defaults, is telling it to use all the variables in the data to predict each of the other variables. We can actually pull this out and look at it as imps$predictorMatrix [onscreen imps$predictorMatrix], and this shows us which variables are turned on as predictors for which other variables. It's telling me that if age were missing, it will use BMI, hypertension, and cholesterol to predict it; if BMI were missing, it will use age, hypertension, and cholesterol; if hypertension were missing, it will use age, BMI, and cholesterol; et cetera. It never uses a variable to predict itself, because we can't use a variable to predict itself. So that's what we're doing there, and we can modify this predictor matrix, treat it as something to edit, turning particular predictors on and off if we want. The other thing we can do is change the methods, if we have thoughts about which kind of model we want to use for each variable. I mentioned predictive mean matching briefly, but we have lots of options for method. Here are the method arguments in the help file: we can use linear-regression-style methods, or a logistic regression model if it's a binary variable and we don't want to use predictive mean matching. We can use lots of different methods. So I could, for example, go into mice here and set the method argument to ask for linear regression for everything, though I'd need to make sure it's specified exactly the way mice wants; I forget the exact name it wants for a linear model, but we have lots of options under method. We can Google our way to the mice help pages and go through all of the available methods. pmm, predictive mean matching, is usually the default it will pick. Does that answer the question? The idea is basically that from this predictor matrix we can specify exactly the functional relationships we want between the variables. So if there are relationships between the variables, which there usually are, we'll want to use them. But it will do that by default.
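A sketch of those two customization points, using helper functions that exist in current mice releases (make.predictorMatrix() and make.method()); "norm", Bayesian linear regression, is the linear-model-style method being reached for above:

```r
library(mice)

pred <- make.predictorMatrix(nhanes)
pred["chl", "hyp"] <- 0     # don't use hyp when imputing chl

meth <- make.method(nhanes) # defaults, typically "pmm" for numeric variables
meth["bmi"] <- "norm"       # Bayesian linear regression instead of pmm for bmi

imps2 <- mice(nhanes, predictorMatrix = pred, method = meth, seed = 1)
```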
[Clayton Covington] Okay, I see no more questions. Frank, can you please go to the last slide? All right. So again, thank you all so much for joining us for today's session of the Summer Training Series here at the National Data Archive on Child Abuse and Neglect. Our next session will be at the same time next week, from 12 p.m. to 1 p.m. Eastern Time, when Dr. Alexander F. Roehrkasse of Butler University will give a session on time series analysis in Stata. Until then, we'll see you soon. Thanks, everyone. [voiceover] The National Data Archive on Child Abuse and Neglect is a collaboration between Cornell University and Duke University. Funding for NDACAN is provided by the Children's Bureau, an office of the Administration for Children and Families. [musical cue]

-----

R code for the program called make_pop_data.r

library(tidyverse)

pop<-read_fwf("./data/us.1990_2020.singleages.adjusted.txt",
              fwf_widths(c(4, 2, 2, 3, 2, 1, 1, 1, 2, 8),
                         c("year", "state", "st_fips", "cnty_fips", "reg",
                           "race", "hisp", "sex", "age", "pop")))

pop_demo<-pop %>%
  filter(year==2019) %>%
  select(year, state, sex, age, pop) %>%
  mutate(age = as.numeric(age),
         pop = as.numeric(pop))

write_csv(pop_demo, "./data/pop_demo.csv")

-----

R code for the program called make_sample_data.r

####### make sample data for 2021 SRI workshop
###### read in and deidentify admin data for geo / time join

library(data.table)
library(tidyverse)

ncands<-fread("~/Projects/ndacan_data/ncands/CF2019v1.tab")
afcars<-fread("~/Projects/ndacan_data/afcars/FC2019v1.tab")

##### select variables for join
ncands_demo<-ncands %>%
  select(subyr, StaTerr, ChAge)

afcars_demo<-afcars %>%
  select(FY, STATE, St, AgeAtStart)

write_csv(ncands_demo, "./data/ncands_demo.csv")
write_csv(afcars_demo, "./data/afcars_demo.csv")

-----

R code for the program called read_ndacan_data.r

### this script joins ndacan tables to SEER pop data

### load libraries
library(tidyverse)

### read in the demo files
ncands<-read_csv("./data/ncands_demo.csv")
afcars<-read_csv("./data/afcars_demo.csv")
pop<-read_csv("./data/pop_demo.csv")

### harmonize the names in ncands and pop
ncands<-ncands %>%
  rename(year = subyr,
         state = StaTerr,
         age = ChAge)

unique(ncands$age)

### note that 77 and 99 have special meaning
### recode 77 -> 0; 99 -> NA
ncands<-ncands %>%
  mutate(age = ifelse(age==77, 0,
                      ifelse(age==99, NA, age)))

### collapse NCANDS to state - year, collapse pop to state - year
ncands_st<-ncands %>%
  group_by(year, state, age) %>%
  summarize(child_investigation = n())

pop_st<-pop %>%
  filter(age<18) %>%
  group_by(year, state, age) %>%
  summarize(pop = sum(pop))

#### join them together
ncands_pop<-ncands_st %>%
  left_join(pop_st)

### super cool!
### now let's do afcars
afcars<-afcars %>%
  rename(year = FY,
         state = St,
         age = AgeAtStart) %>%
  mutate(age = ifelse(age<0, 0, age),
         age = ifelse(age==99, NA, age)) %>%
  select(-STATE)

### collapse to state level
afcars_st<-afcars %>%
  group_by(year, state, age) %>%
  summarize(fc = n())

### now join to ncands_pop
ncands_afcars_pop<-ncands_pop %>%
  left_join(afcars_st)

### compute per capita rates
ncands_afcars_pop<-ncands_afcars_pop %>%
  mutate(investigation_rate = child_investigation / pop * 1000,
         fc_rate = fc / pop * 1000)

### quick visuals
ggplot(ncands_afcars_pop,
       aes(x = age, y = investigation_rate)) +
  geom_line() +
  facet_wrap(~state)

ggplot(ncands_afcars_pop,
       aes(x = age, y = fc_rate)) +
  geom_line() +
  facet_wrap(~state)

library(geofacet)
ggplot(ncands_afcars_pop,
       aes(x = age, y = fc_rate)) +
  geom_line() +
  facet_geo(~state)

-----

code in the R markdown file called workshop7_26_23.Rmd

---
title: "Handling missing data in AFCARS"
output: html_notebook
editor_options:
  chunk_output_type: inline
---

Load in the needed packages

```{r}
library(tidyverse)
library(mice)
```

First let's load in the de-identified AFCARS data and state population data

```{r}
dat<-read_csv("./data/afcars_demo.csv")

### read pop data and harmonize variable names to afcars names
pop<-read_csv("./data/pop_demo.csv") %>%
  rename(St = state,
         FY = year)
```

Let's explore the AgeAtStart measure

```{r}
table(dat$AgeAtStart)

### explicitly recode missings
dat<-dat %>%
  mutate(AgeAtStart = case_when(
    AgeAtStart >= 99 ~ NA,
    T ~ AgeAtStart
  ))
```

PIVOTING TO THE MICE DEMO WITH BUILT IN DATA

```{r}
head(nhanes)
summary(nhanes)
imps<-mice(nhanes)
```

```{r}
m0<-lm(chl ~ age, data = nhanes)
m1<-lm(chl ~ age + bmi + hyp, data = nhanes)
```
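A note on the recode chunk in that Rmd: on dplyr versions before 1.1.0, case_when() requires the NA to be typed to match the column, and an untyped NA there throws an error, quite possibly the gremlin behind the live-demo trouble. A version that runs on older releases:

```r
library(tidyverse)

dat <- dat %>%
  mutate(AgeAtStart = case_when(
    AgeAtStart >= 99 ~ NA_real_,  # typed NA matches the numeric column
    TRUE ~ AgeAtStart
  ))

# base R equivalent:
# dat$AgeAtStart[dat$AgeAtStart >= 99] <- NA
```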