National Data Archive on Child Abuse and Neglect (NDACAN) [Erin McCauley] Okay, well, welcome everyone. This is our third session of the Summer NYTD Training Series, and we're really excited to have everyone here. Our presentation today is going to be very informative and more in the nitty-gritty than our previous presentations, and we're very excited about it. As you can see, this is our third presentation; it's the first of our expert presentations, and we designed these presentations around the common questions we get from our data users. This first session is mostly about missing data and how to deal with it, and our next session will be about weighting. Our presenters are going to be Michael Dineen and Frank Edwards, and we're really lucky to have both of them. They're going to be co-presenting, and at the end we'll have questions, which you can ask either of them or me. I'm going to have them take it from here. Thanks for coming, everyone. [Frank Edwards] This is Frank. Michael is going to start us off in a moment, but first I'll give you an overview of what we're doing today. I'm a postdoc here at the data archive; I have a PhD in sociology from the University of Washington and have worked extensively with the administrative data from the Children's Bureau, including AFCARS, NCANDS, and NYTD. So, a brief overview of what we're doing today. First we're going to develop a clear understanding of the data: Michael is going to give us a really detailed overview of the design of the NYTD and the structure of the cohorts and the samples included in the NYTD data. We're going to discuss differences in the composition of the state samples and the methods states use to collect data, and we're also going to discuss some of the sources of missing data and nonresponse specific to the NYTD. Then we're going to pivot over and give a crash course in the theories and methods we can use to address missing data in surveys like the NYTD, with a focus on multiple imputation. I'll do a quick tutorial on how to conduct multiple imputation using the open-source R statistical software, and we'll give a practical overview of one approach you can use to address wave nonresponse in the NYTD using free, open-source tools. With that, I'm going to pass it over to Michael. [Michael Dineen] Hi, this is Michael Dineen. I'm the NDACAN specialist for NYTD, AFCARS, and the NCANDS Child File. I'll first be talking about the design of NYTD, which will be review for some of you and new for others. NYTD stands for the National Youth in Transition Database. A lot of the information I'll be giving is in the user's guide and codebook, so you can always refer to those for more detail. The NYTD outcomes survey is ongoing, which means that for the foreseeable future there will be an outcomes survey being done on some group of kids every year. The first cohort was kids who turned 17 in 2011, and the ordinary NYTD pattern is to resurvey those kids when they turn 19 and when they turn 21. So the 2011 cohort was resurveyed in 2013 and 2015; that cohort is complete, and all three waves are available.
Each cohort is three years apart, so the next cohort after 2011 was 2014: kids who turned 17 in foster care in fiscal year 2014. Those were the baseline kids. For the baseline survey, every one of those kids is contacted; for 2011 it was, I think, about 29,000 kids who turned 17 in foster care, and then a proportion of them respond. That's the cohort. So the cohort is only the kids who respond at wave one. Currently we have two waves of cohort two; the last wave is being conducted this fiscal year, 2018. In 2017 the first wave of cohort three was collected; we don't have the 2017 cohort yet, but the first wave of the 2017 cohort will be the next data we get. So who is in a cohort? You have to be in foster care on the day that they call you to do the survey, and you have to answer at least one of the survey questions to be in the cohort. And the survey has to be administered by the state within 45 days of your 17th birthday. There are kids in the data who responded after the 45-day period, so they're not officially in the cohort, but their data are in there, and some people might want to use respondents even if their surveys weren't completed within that 45-day window, so there's additional data available. The follow-up surveys are conducted within a six-month AFCARS reporting period. The AFCARS foster care file is reported by the states to the federal government twice per fiscal year. The fiscal year runs from October 1 through September 30, so one six-month period is October 1 through March 31 and the second is April 1 through September 30; the follow-up surveys are conducted in whichever of those six-month periods includes the youth's 19th and 21st birthdays. States are permitted to sample the cohort. They're not permitted to sample the baseline; they have to contact every child at the baseline, wave one. But for waves two and three they can choose a sample of the kids who responded to the wave one survey, and they have to use simple random sampling. That sampling is done once; it's not done for each of the two follow-up surveys. The same sample is used for the age 19 and age 21 surveys. One thing I want to say about this: the original set of respondents at wave one, the baseline survey, is not a random sample. There may be something different about kids who don't respond versus kids who do respond, so the fact that there's random sampling of the cohort does not correct for the nonrepresentativeness that might be in the original wave one data. The state sampling is done mainly for the convenience of the states; it's not done for any research reason. It actually hinders things, it reduces the value of the survey, but it does make it cheaper for the states to do the survey. So, sources of missing data in the NYTD. The response to wave one is voluntary. Kids who don't respond are not followed up at the subsequent waves, so all the survey data for those cases are missing. Whoever doesn't respond in wave one will not be included in either of the other waves. Those cases are always missing, but you will always have the demographic data for the kids who did not respond to the survey, because we get the demographic data from AFCARS, since the kids are in foster care. So this is an unusual survey in that you have information about the nonrespondents, and that's usually a really good thing.
Okay, so the cohort is not a random or representative sample. If choosing to respond is associated with anything, for example if kids were incarcerated and that made them different, then you won't capture that particular population; and for any variable in the survey, if that variable is related to why kids didn't respond, then there's going to be bias in the survey. So not being in the cohort is one big source of missing data, and it will always be missing; there's not much we can do about it. Wave nonresponse means that a child responded at wave one but didn't respond at wave two or wave three or both, so all the survey data will be missing for that row of data. You'll have data for wave one but no data for wave two or wave three, but we'll still have their demographic data and their wave one data. The reasons kids might not respond to the survey at any wave could be that the youth declined, meaning the state agency located the youth and invited them to participate but the youth declined to participate. In some cases the parent can decline, but only if the child is not able to respond themselves. Another reason is that the youth is incapacitated: the youth has a permanent or temporary mental or physical condition that prevents him or her from responding. They could also be in jail; the conditions in a jail are usually not private, so they probably wouldn't respond at all. If the child has run away or is missing, they wouldn't be available to be found for a survey. Or the agency may simply be unable to locate them, especially in the subsequent waves, because when kids are in foster care at wave one the agency knows where they live, but if they age out after wave one, which they often do, the agency may not know their address or where they went. They may move to another state or not have left a forwarding address, so the agency is not able to locate them. That's one of the reasons there's a big drop-off in response from wave one to wave two. And of course the youth may have died, in which case they couldn't respond to the survey. Then there's nonresponse to a particular question, and that's what Frank is mostly going to deal with. [Frank Edwards] Apologies, we got a good question from Nadje about whether the nonresponse reason is captured even if a youth doesn't participate or is dropped completely from the survey. Please hold your questions until the end; Nadje, we'll come back around to that question once we get there, but it's a really great question and I think it can also inform how we think about addressing missing data in the NYTD. So I'm going to give you a quick theoretical overview of how we might approach missing data and then pivot specifically to how we can use these tools in the NYTD. As Michael mentioned, there are three kinds of missing data we can think about in the NYTD: question nonresponse, wave nonresponse, and not in cohort. Today we're going to address question nonresponse and wave nonresponse. We're not going to address not-in-cohort nonresponse, but I will give you some ideas about how we might use these techniques to address not-in-cohort missing data, and I also encourage you to tune in next week when Michael talks about weighting as a strategy we can use to address the not-in-cohort problem. Okay, so why should we care about missing data?
When you use most statistical software to conduct analyses, the default is generally what's called complete case analysis. That is, it only uses those observations where the outcomes and predictors in your regression model are all non-missing, so it systematically drops every case that is missing a value on any of the variables included in your model. Depending on how much data is missing in the data set you're using, this can result in throwing away a lot of perfectly good information and can dramatically shrink the size of the data set used in your analysis. At minimum, this affects your standard errors, and it can affect them in two directions that I'll talk about; but if missingness is correlated with anything we care about, anything that might be endogenous to the process we're trying to estimate, or correlated with any measured variable in our data, then it can actually bias your parameter estimates in ways that some statistical techniques can help correct. So with a few assumptions we can address these problems in a pretty robust way. First we need to think about why our data are missing, and there are three basic frameworks we can use. The first is that data are missing completely at random. In this case the probability of a value being missing is entirely random; it's not conditional on any observed or unobserved characteristic of the case. You could think of a coin flip determining whether or not a value is going to be missing. This is a relatively rare form of missing data; missing completely at random is often something you would need to design for, so we're not generally dealing with the missing completely at random case. The two scenarios we are often looking at in real-world data analysis are missing at random and nonrandom missing data. For a value to be missing at random within a data set, the probability of that value being missing, or not recorded, is not completely random: it depends on the observed information. That is, once we condition on values that we do observe, we can build a model predicting the likelihood that a value will be missing, and we can use that information to help us think about why that value is missing and where the actual value may lie. So under the missing at random assumption, we assume that the probability of a value being missing is determined by other variables in the data, and we want to make sure we're using those variables to help us understand the patterns of missingness. The last alternative, the third mechanism that could drive missingness in our data, is nonrandom missing data. In this case the probability of a value being missing depends on some unobserved value or on the value itself, and unfortunately in this context there's not much we can do.
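For illustration, here is a minimal simulation sketch, not from the presentation and with made-up variable names, showing how a complete-case mean behaves under MCAR versus MAR missingness:

```r
# Minimal sketch (not from the presentation): simulate MCAR vs. MAR missingness.
# All variable names here are made up for illustration.
set.seed(42)
n <- 10000
female   <- rbinom(n, 1, 0.5)                          # observed covariate
employed <- rbinom(n, 1, plogis(-0.5 + 0.8 * female))  # outcome of interest

# MCAR: a coin flip determines whether 'employed' is observed
emp_mcar <- ifelse(rbinom(n, 1, 0.3) == 1, NA, employed)

# MAR: the probability of missingness depends on the observed covariate 'female'
p_miss  <- plogis(-1.5 + 1.5 * female)
emp_mar <- ifelse(rbinom(n, 1, p_miss) == 1, NA, employed)

# Complete-case means: roughly unbiased under MCAR, biased under MAR
mean(employed)                # truth
mean(emp_mcar, na.rm = TRUE)  # close to the truth
mean(emp_mar,  na.rm = TRUE)  # too low, because employed women are dropped more often
```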
And it's also difficult to establish whether data are missing at random or not missing or are missing according to some nonrandom mechanism that is associated with unobserved variables or the value itself that would be censorship right because we don't have the actual information on the data generating process right we don't have the true data generating process we don't observe that we're trying to make inferences based on that analysis so we're often working on assumptions about what we think the mechanisms of missing data are. We can do a lot of exploratory data analysis to help us better understand to help us clarify what we think those mechanisms are and I'll be walking through some of how I do that in my research next. But in most contexts a missing at random assumption is often safe to make but you want to make sure you're doing plenty of exploratory work to think about why data is missing in your case and I'll kind of come back around to that. So there's a few basic approaches we can use to missing data. The most commonly used approach is listwise deletion again this is that complete case analysis. So that is we just discard cases where we're missing data. This can be appropriate and can have you know smaller null effects on the conclusions you make from your research when you don't have very many missing observations are when missingness is completely at random. That is, it's not associated with any of the predictors or outcomes that you're looking at. Now in the case of listwise deletion you are always penalizing your standard errors upward because you're removing that is your standard errors for your parameters you estimate are going to be higher than they would had you not deleted those cases simply by the fact that the N you're including in your analysis is smaller. Another alternative we can use involves using alternative information. So Michael discussed that we include demographic information for all children even if they don't respond to the initial survey or if they don't respond to a follow-up survey by borrowing information about that child from prior from other data rights in this case the AFCARS because we have good information on those kids on those young people from other data sets we can use that information to populate data that we know about. Characteristics that are time stable right so for example now we could get into theoretical conversations about whether race is actually time stable and you know whether we're talking about self-identified or you know state-identified those things might actually be more slippery than we want to assume but for the purpose of this study and to keep things simple we can assume that certain characteristics of kids don't change. Date of birth doesn't change right? And so we can use that information from other data to update when our data are missing and that's obviously a really strong approach where we have certainty about where that value lies. We can also use nonresponse weighting which is something Michael will talk about in more detail next week but that can become really difficult when many variables are missing and when subpopulations of interest differ are when we think there's a lot of variance within individual subpopulations on things we might be interested in thinking about. 
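As a concrete illustration of that "borrow from other data" idea, here is a hypothetical sketch of filling in time-stable demographics from AFCARS. The file names, the ID variable, and the column names are illustrative stand-ins, not the actual names in the distributed files:

```r
# Hypothetical sketch: fill in time-stable demographics (e.g., date of birth, sex)
# from AFCARS when they are missing in the NYTD outcomes file.
# File names, the ID variable (stfcid), and column names are illustrative only.
library(dplyr)

nytd   <- read.csv("nytd_outcomes.csv",      stringsAsFactors = FALSE)
afcars <- read.csv("afcars_foster_care.csv", stringsAsFactors = FALSE)

nytd_filled <- nytd %>%
  left_join(afcars %>%
              select(stfcid, dob_afcars = dob, sex_afcars = sex),
            by = "stfcid") %>%
  mutate(dob = coalesce(dob, dob_afcars),   # only fills values that are NA in NYTD
         sex = coalesce(sex, sex_afcars)) %>%
  select(-dob_afcars, -sex_afcars)
```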
There are also a couple of deterministic imputation methods you may be familiar with; that is, a kid gets a set value based on some fixed rule, and we're not introducing uncertainty into our model. Examples are linear interpolation or last observation carried forward. Given that this is a multi-wave survey, one approach we could use to impute missing data is to take a youth's response at wave one and wave three and use it to infer their response at wave two in a deterministic way, either by directly replacing the data or, if it were a continuous variable, by plotting a line between those two points and placing the missing observation at that point on the line, effectively creating a sort of within-person regression. This is generally not a good idea: it's going to bias your standard errors downward in a way that's unrealistic, because effectively we're claiming more certainty than we should have about where that observation lies. We don't know where it lies, but we're using prior information to deterministically set it there. So, enter multiple imputation. Multiple imputation does something similar in that it constructs regression models to tell us where missing values ought to lie, but we do it multiple times. In this way it's a sort of pseudo-Bayesian approach to modeling missing data: we're creating multiple imputed data sets that let us quantify uncertainty around where the true values lie. So instead of saying we know for a fact that when this young person didn't respond at wave two their value is exactly the wave one response, what we do is use information across all of the respondents in the cohort to create M imputed data sets, where each missing value is filled in according to a regression prediction that has a random component. What we're doing is creating uncertainty around where the true value lies. We're not recovering the true value, but we're creating intervals around where the true value likely is. This allows us to average over the uncertainty generated by missing data, and under the missing at random assumption it generates unbiased parameter and variance estimates. So if we believe the missing at random assumption, that is, that the likelihood of any value being missing is conditional on variables we have in the data and include in our imputation model, then we get unbiased parameter and variance estimates. It's a really powerful and flexible approach, but also one we need to be cautious with and thoughtful about. So what does it do? It has two effects on model uncertainty. It increases your N, because we aren't deleting any data, and that pushes your standard errors down; but it also adds in appropriate noise generated by the uncertainty about where those missing values are, which pushes standard errors upward. Generally multiple imputation is going to pull your standard errors down, but there can be circumstances, where we have a lot of uncertainty about where particular values lie, in which it actually increases the variance on particular measures. And if missingness is associated with observables, and we're going to look at a case today in which missingness is in fact associated with observables in the data, then multiple imputation can correct bias in your parameter estimates.
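Looping back to the deterministic approach described at the start of this passage, here is a minimal sketch of what last observation carried forward and within-person linear interpolation look like, included only to make concrete what the presenters advise against; the variable names are illustrative:

```r
# Minimal sketch of the deterministic methods discussed above (NOT recommended).
# Variable names are illustrative.
library(dplyr)
library(zoo)   # for na.locf() and na.approx()

toy <- data.frame(id    = rep(1:2, each = 3),
                  wave  = rep(1:3, times = 2),
                  score = c(10, NA, 14,     # continuous measure, wave 2 missing
                            1,  NA, 1))

toy %>%
  group_by(id) %>%
  mutate(score_locf   = zoo::na.locf(score, na.rm = FALSE),          # carry wave 1 forward
         score_interp = zoo::na.approx(score, x = wave, na.rm = FALSE)) %>%  # line between waves 1 and 3
  ungroup()
# Both versions fill wave 2 with a single fixed value, so downstream standard
# errors are too small: they pretend we observed something we did not.
```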
For example, in the case we're going to look at today, nonresponse to the employment outcome questions is associated with choosing to respond to the survey in a way that's associated with gender, and multiple imputation can help us recover parameter estimates that are likely closer to the truth than we would obtain if we just deleted those values. If we just delete those values, we are actually baking bias into our estimates whenever nonresponse is associated with things we care about and include in our model, or even things we don't care about but have information on. If we just do listwise deletion or some other deterministic imputation approach, we can in some cases draw wrong conclusions. So we want to be really careful in thinking about missing data and how we approach it. Okay, so if we want to think about this as an algorithm, here is my preferred approach. First, understand your data. Read the documentation very carefully. Do lots of exploratory data analysis to think about why data are missing and what kinds of variables in the data missingness might be associated with. I want you to develop an understanding of why you think data are missing. Here we discussed three kinds of missing data: question nonresponse, wave nonresponse, and not in cohort, and each of those may have different mechanisms that drive why a particular young person is or is not in the data for a question, for a wave, or for the cohort. And when you can, we think it's a great idea to test your ideas about the mechanisms of missing data, either through simulations or through lots of exploratory data analysis. Second, use available information. As we previously discussed, we have the AFCARS, and we can use it to gain information on these kids even when they don't respond to the survey. Some variables are time stable, but always be cautious when doing this: if a variable is not time stable, it's generally a bad idea to directly borrow values from prior observations, because what you're doing is effectively saying with certainty that you know where this observation is when you don't, and we want to be honest about how much uncertainty we have. Third, if missing at random is a reasonable assumption, then multiple imputation is a great strategy. Because the missing at random assumption is conditional on observables, that is, things we have observations for, you often want to include as many variables as is computationally feasible in your imputation models. This requires a lot of trial and error and a lot of computational work to get right, but the more variables you include in your imputation model, assuming they have any correlation whatsoever with the likelihood that a value is missing, the better your predictions and the more reasonable your MAR assumption. Once we've done this, and we've created, say, five imputed data sets, we run our preferred analysis over each imputed data set, then combine the results and report those pooled estimates. That's a lot, so I'm going to show you how to do it now.
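That impute, analyze, pool workflow boils down to three lines with the mice package. A compact, hedged preview, with a placeholder data frame and placeholder variable names (the detailed demo follows below):

```r
# Compact preview of the impute -> analyze -> pool workflow with mice.
# 'dat' and the variable names are placeholders; the detailed demo appears below.
library(mice)

imp  <- mice(dat, m = 5, seed = 1234)                            # 1. create 5 imputed data sets
fits <- with(imp, glm(outcome ~ sex + wave, family = binomial))  # 2. fit the model on each
summary(pool(fits))                                              # 3. pool with Rubin's rules
```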
So this is a very brief introduction to missing data methods, and first, a caveat: the code for what I'm showing you today is up on the web. You can go to my GitHub page and download the slides and an R Markdown file that contains all of the code I'm showing today, and if you have the NYTD Outcomes file for the 2011 cohort, the tabular data file, you should be able to pull this code, put it in your directory, and run it, so this should be a fully working demo. I'm happy to take questions if it's not; again, check out the GitHub page for the code if you want to use it. I'm using the R statistical programming language and the mice package. These are free and open source; it's what I do most of my work in, and the R user base tends to be a little faster in incorporating the latest statistical developments into the software. That said, you can do imputation in Stata, SAS, and SPSS; you can use these techniques in other software packages, but I'll be showing it in R today, and the theory behind what we're doing transfers to those other packages. Okay, so everything you need to do this is up on the web except for the data, so request the data if you don't already have it. First we need to load our packages and data, and here I'm showing you the code we use to do this. We're going to create subsets based on the population size and then filter to the cohort: I'm creating counts, that is, the size of the population, and then I'm creating a cohort flag based on the in-cohort indicator that's included in the distributed NYTD file, and I'm also filtering to make sure we only include kids who, if they are in a sampling state, were actually sampled in the study. Here, just to give you a sense of the response rates and the number of kids we're looking at, we have the baseline population, the number of valid responses, and the response rate for waves one, two, and three. We have 29,000 eligible youth who could have taken the survey at wave one. We have over 15,000 who did, giving us a wave one response rate of over 50%. At wave two we have about 8,000 valid responses, so we're down to around a 25 to 27% effective response rate, and at wave three we have about 7,500 responses, so about a 25% response rate. These numbers don't match the documentation perfectly because we're using updated files, but they are in the same ballpark as what you'll see in the documentation. Okay, so here are the cohort response rates. If we only look at those kids who took the wave one survey and met the cohort inclusion criteria, then obviously at wave one we're at a 100% response rate; at wave two, among those in the cohort, we're at about a 65% response rate; and at wave three we're at about a 60% response rate among kids who were in the cohort and eligible for wave two and three inclusion. Okay, so one kind of missing data we think about is question nonresponse: of those kids who took the survey, how many chose not to answer particular questions? This is a relatively easy form of missing data to deal with; we have a lot of information about the other things a youth answered on that wave, and we can use that to impute those particular missing values. So here we split it out into four categories, and we're looking at the currently part-time employed variable.
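A hypothetical sketch of what that setup step might look like; the file name, the in-cohort flag, the sampling-state flags, and the response indicator are illustrative stand-ins for the variables documented in the NYTD codebook:

```r
# Hypothetical sketch: load the outcomes file, flag the cohort, and compute
# wave-by-wave response rates. File and variable names are illustrative
# stand-ins for the ones documented in the NYTD codebook.
library(dplyr)

outcomes <- read.csv("nytd_outcomes_2011.csv", stringsAsFactors = FALSE)

analysis <- outcomes %>%
  filter(in_cohort == 1 | wave == 1) %>%        # keep the baseline plus cohort members
  filter(!(sample_state == 1 & sampled == 0))   # in sampling states, keep only sampled youth

analysis %>%
  group_by(wave) %>%
  summarise(eligible = n(),
            n_resp   = sum(responded == 1, na.rm = TRUE),
            rate     = n_resp / eligible)
```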
We have the blank category on the left, and this is faceted by wave, so each box is wave one, wave two, wave three, with the count of responses on the y-axis. On the x-axis, the blank category is youth who didn't decline but just didn't answer that question, and you can see we have very few in that blank category. We're going to treat "declined" as a valid response, not as missing, so the only values we're going to treat as missing are the blanks, and we don't have a lot of those in the survey. So now we're going to come back to the wave nonresponse problem, and that's where we're going to focus our efforts today: looking really closely at why we have nonresponse across waves. We're going to use multiple imputation to try to fill in those gaps for the 30-some percent of kids who didn't respond to the survey in waves two and three. We're going to create five imputed data sets where each missing observation may take a different value in each set, which gives us some uncertainty about where that response would have been had the youth actually responded to the survey in that wave. Okay, so what drives nonresponse? I'm going to look at two things we might think of as driving nonresponse: gender and race, two characteristics of kids that might be differentially associated with responding or not responding to subsequent waves of the survey. And gender does appear to matter quite a lot. Here we can see that young women were more likely than young men to respond to the survey in subsequent waves. The response rate for young women at wave two is close to 75%, while the response rate for young men is around 62%, so we can see a pretty clear gradient by gender in terms of response, and that's something we may want to think about as we start to model this. Here's nonresponse by race, and here we're just looking at white/nonwhite. We don't see as clear a relationship between race and nonresponse as we saw for gender, but nonetheless we might think it's interesting. Okay, so for the purposes of the demonstration we're going to assume that part-time employment, the single variable we're going to focus on, is missing at random conditional on sex, race/ethnicity, and age. Again, this is a demo; this is not what I would use for peer-reviewed research. I would include many more variables in the model, as many predictors as is technically possible, to maximize the predictive performance of my imputation models. It gets computationally intensive, but this will work, and it does have an effect on our results, as I'll show. So first we set up our imputation data set, and if the code is a little cumbersome to look at, you can look at the bottom, which shows the table we're going to be working from. You can see in row 1, at wave one, we have a young woman who is not part-time employed, who has a race/ethnicity of one, which in this data is white, and so on. We're going to be working with those four variables in our model, and I've converted them all to factors; that's what we need to do to make them work appropriately with the software we're going to use. I'm using a package called mice to do the imputations, and this output just shows us how much missing data we're looking at.
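A hedged sketch of that setup step, continuing the illustrative data frame from earlier; the column names (sex, raceethn, currPTE) are stand-ins rather than the exact names in the distributed file:

```r
# Sketch of preparing the imputation data set: keep the four working variables
# and convert the categorical ones to factors. Column names are illustrative.
library(mice)
library(dplyr)

imp_dat <- analysis %>%
  select(wave, sex, raceethn, currPTE) %>%
  mutate(across(c(sex, raceethn, currPTE), as.factor))

# How much is missing, and in what combinations?
md.pattern(imp_dat)
summary(imp_dat)
```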
We have 75 people missing data on sex, we have 462 missing data on race, and the big thing we're interested in tackling here, the part-time employment variable, has about 8,500 missing values, which is a combination of question nonresponse and wave nonresponse. Multiple imputation is going to model these three variables simultaneously, in an iterative fashion: we run through each of those variables and then use information from the imputations of those variables to impute the other variables in the data. Okay, and just a plug for this package: mice is the package I'll be using today, and if you check out the URL in the comments, there's great documentation available for it, with a number of really detailed tutorials that walk you through how to do this in more detail than we will today. First we designate the variables we're going to use in the imputation; I want to use everything to predict everything else. The predictor matrix I'm showing you tells me, for each variable in the rows, whether and how to impute that measure; the row of zeros for wave means no, don't impute wave, because we have that information for everyone. For sex, we're going to use information on wave (which is a proxy for age), current part-time employment, and race/ethnicity to impute the value of sex; likewise for current part-time employment, and likewise for race/ethnicity. And below I show the method we're going to use to impute each of these: for sex, because it's a binary variable, we use a logistic regression model, and for current part-time employment and race/ethnicity we use multinomial logistic regression, a multinomial model that effectively estimates a separate logistic model for each value, because these are categorical variables with multiple values. Okay, here is the code to run our imputation. One thing I want you to note about multiple imputation: you may see a seed value when you run these. Multiple imputation has a stochastic, or random, component to it, so if you run this without setting a seed you might get slightly different values each time; if you want to make sure you get exactly the same results every time, set a seed. And you can see this is what the output will give you. Now, here is a really quick diagnostic we can use to check whether our models are converging or whether they have any problems; these are trace plots. For any of you who have done Bayesian analysis, this will look somewhat familiar. The idea is that we don't want to see clear trends, and in these results we do not see clear trends, which is good. We don't want our samplers to be hanging out in particular zones; we want them to traverse the full space of where the values might be coming from, rather than drawing the same values over and over again. Okay, so here are the effects of imputation on our current part-time employment measure. The zero facet is the observed data, and you can see in the observed data, in that fourth column, we have a lot of missing values.
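A hedged sketch of how that configuration might look in mice, continuing the illustrative imp_dat from above; the exact predictor matrix and methods on the presenter's slides may differ:

```r
# Sketch of configuring and running the imputation; continues the illustrative
# imp_dat from above. The exact settings on the presenter's slides may differ.
init <- mice(imp_dat, maxit = 0)          # dry run to pull the default settings
pred <- init$predictorMatrix
pred["wave", ] <- 0                       # wave has no missing values, so it needs no predictors

meth <- init$method
meth["sex"]      <- "logreg"              # binary -> logistic regression
meth["currPTE"]  <- "polyreg"             # categorical -> multinomial (polytomous) regression
meth["raceethn"] <- "polyreg"

imp_out <- mice(imp_dat, m = 5,
                predictorMatrix = pred, method = meth,
                seed = 20180718)          # set a seed so results are reproducible

plot(imp_out)   # trace plots: we want no clear trends across iterations
```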
Value zero is not employed, value one is employed, and value two is declined. What we have here are imputed data sets 1 through 5, and you can see that we've replaced those missing values and assigned those people to either not employed, employed, or declined. The individual assignments may vary across each of those data sets, and, as you can see, the mean we get for each of these values may differ across each of our imputations. Okay, so here's a direct comparison to the original data; again, zero is the observed data. And Hughes, I've got your question; we'll come back around to it, but we're going to use all of them. You can see that in the observed data set we estimated that 15% of the cohort reported part-time employment and 59.5% reported that they were not employed, and in each of our imputed data sets we have a slightly different proportion for each. In imputation one we're saying 21% were part-time employed and 77% were not employed, so we're taking the people who had missing values and assigning them to one of the three valid categories: employed, not employed, declined. So we can see this has a pretty significant impact on the proportion of cases we say were employed or not employed, and again this is primarily wave nonresponse. Now, comparing to proportions that include the missing values is a little misleading, so let's look only at those cases where we had non-missing values. This drops the missings from the original data, so among those with observations we said 20% were employed and 78% were not employed in the observed data, that imputation-zero column. With imputation we're saying 21% were employed and 77% were not employed, so that's a less dramatic change: we're looking at about a 2% change in the employed cell and about a 2% change in the not employed cell. Effectively, our imputation model is not putting a lot of people in the declined category; it's generally assigning them to either employed or not employed. Okay, then, instead of choosing one of those data sets, we're going to pool across data sets. What I've got here is the regression model, the GLM: a logistic regression asking whether the youth is currently employed, with sex, wave, and race/ethnicity as predictors. By using the with command together with our imp_out object, which is our imputation object, we run that model on each of the five imputed data sets, so we get five models as a result. Now, a statistician named Rubin did a lot of research on multiple imputation methods and came up with an algorithm we can use to combine results across imputations into revised parameter and uncertainty estimates; effectively, you get an average of the parameter estimates, but you get a weighted combination for the standard errors, so the standard errors take into account both between-imputation and within-imputation variance. It's a complicated formula, but fortunately the mice package includes a pool command: you give it an object that contains the set of models, and it pools the results. And just for comparison I'm also fitting the model only on the observed data so we can look at it. I want to show you what this does, and this is obviously a simplified presentation: the top is the observed data, the bottom is the imputed data.
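A hedged sketch of that analysis step, continuing the illustrative objects and names from above (the outcome coding, "1" = employed, follows the values described on the slide):

```r
# Sketch of fitting the analysis model on each imputed data set and pooling
# with Rubin's rules; objects and names continue the illustrative setup above.
fit_mi <- with(imp_out,
               glm(currPTE == "1" ~ sex + wave + raceethn, family = binomial))
pooled <- pool(fit_mi)
summary(pooled)            # pooled estimates and standard errors

# For comparison: the complete-case (listwise deletion) model on observed data
fit_cc <- glm(currPTE == "1" ~ sex + wave + raceethn,
              family = binomial, data = imp_dat)
summary(fit_cc)
```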
And one thing that really jumps out to me here is that our parameter estimate for sex in the observed data is higher than the parameter estimate we get with the imputed data, suggesting that, because we know sex is associated with nonresponse, we might actually be inducing an upward bias in our estimate of the relationship between sex and part-time employment if we just do listwise deletion. That is, under multiple imputation we see a small negative adjustment to the parameter estimate for sex relative to when we don't impute. So in this very limited scenario we're seeing some effect of imputation on our substantive results. Also note the standard errors: as I discussed, our N is about 8,000 higher under the multiple imputation model than when we just drop those observations, so our standard errors are lower under the imputation model than under the listwise deletion model. I know I'm taking up a lot of time here and we have a lot of questions; we'll get to questions in a moment. Note that we haven't dealt with selection into the cohort, and I want to keep that in mind: what we've done is recover the full cohort; we have not addressed not-in-cohort nonresponse here. And I want to encourage you to go deeper in your own analysis when you do this. Theoretically we could use this method to try to estimate the uncertainty driven by selection into the cohort, but that would be pretty computationally intensive and would require us to think more carefully about the missing at random assumptions. Okay, now we're going to open it up to questions and answers. Here's some further reading if you're interested in doing more, and again, there's the code. I'm going to pop up the group chat box. Let's start with Nadje, who asked about the nonresponse reason; I think that's a great question to start with. A more sophisticated model could use nonresponse reason as a predictor of nonresponse in a way that might improve our imputations. Michael, do you want to address whether the nonresponse reason is captured even if a youth doesn't participate in the survey? [Michael Dineen] Yeah, that's actually a variable the state records: the reason why the youth did not respond. In fact, the reasons for nonresponse I gave you earlier are the values of that variable. [Frank Edwards] Right, so do you have a value for nonresponse reason if the youth does not respond to the wave? [Michael Dineen] Well, they have to contact everybody at wave one, so everybody gets contacted, and there has to be a reason recorded: when the interviewer finds out they are not going to be able to get a survey, they have to answer that question. It's the interviewer who answers it, not the respondent. So it's the interviewer's report of why they didn't get an interview. [Frank Edwards] Okay, so at wave one we have it. What about wave two? Would that value be missing for anybody at wave two? [Michael Dineen] Well, I'd have to look, but I doubt it. [Frank Edwards] That's something we can look at, but I think it's a really important question and something I would certainly want to use in building out imputation models designed for a more serious data analysis. So let's pivot over to the question from Hughes. Hughes asks: of the five imputations, which imputation do you use to report results? As I showed, we want to use them all; we don't want to choose one.
If we choose one, we're not doing much better than a deterministic imputation, right? So we want to make sure we're pooling across our imputations. There are also visual techniques we can use to represent the uncertainty intervals generated by imputation that can be really helpful, but you'd never want to use only one of your imputations, except when you're doing the kind of exploratory analysis we were doing to get our heads around what the imputation models are doing. Okay, so, uncertainty intervals: this is a question from Jamieson, who asked me to describe uncertainty intervals and how you use them. Jamieson, if you're interested, you can email me and I'll show you some examples from my own research of how I've done this. What you can do is effectively go over those five data sets and ask what the maximum and minimum values are across those data sets for some quantity of interest, say the proportion of youth who have part-time employment. I could report the observed value, then, across those five imputed data sets, the mean, the maximum, and the minimum, and use those to report the uncertainty that's generated by my imputation. So hopefully that addresses the question, Jamieson; feel free to follow up with me if it doesn't. Kyle asks if I've tried other algorithms for imputation, such as KNN, K-nearest neighbors. I'm theoretically familiar with them, but it's not a method I tend to use in my work. It's certainly possible; there are a lot of different imputation methods you could use here. You could use a fully Bayesian imputation method, you could use nearest neighbors, you could use a lot of different approaches, and I encourage you to dive into the statistical literature to think about what might be appropriate for your use case. I think multiple imputation is a good general tool, but it's only one tool of many; this is a pretty deep literature. What simpler metrics do you have in mind, Jamieson? Medians. So yes, you can extract a median from your imputed data sets, but whatever missing data method you're using, you want to make sure to apply it to the full data and not to a summary statistic. Anytime we use a summary statistic we're discarding information, so apply any missing data method to the complete data whenever possible. But you can certainly extract a median from a set of imputed data sets to summarize where you think a value is. Whenever you're using imputed data, or any of these missing data techniques, I strongly encourage you to also report the uncertainty you have around it; you could report that as a standard deviation across the imputations or as an uncertainty interval calculated from the imputations.
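Here is a hedged sketch of the kind of summary described above: compute a quantity of interest in each imputed data set and report its spread across imputations. It continues the illustrative imp_out object and column names from the demo:

```r
# Sketch of an imputation-based uncertainty summary: compute the proportion
# part-time employed in each imputed data set, then report the spread across
# imputations. Continues the illustrative imp_out object from the demo.
library(dplyr)
library(mice)

complete(imp_out, action = "long") %>%       # stack all m completed data sets
  group_by(.imp) %>%
  summarise(prop_employed = mean(currPTE == "1")) %>%
  summarise(mean_across_imp = mean(prop_employed),
            lower           = min(prop_employed),
            upper           = max(prop_employed),
            sd_across_imp   = sd(prop_employed))
```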
[Erin McCauley] All right, well, if we don't have any other questions, I'll just take a moment to highlight next week's presentation. This was the first expert presentation, and next week we're going to have the second expert presentation, which Michael will mainly be leading. He's going to be talking about developing and using sample weights, and then, if there's extra time, other common questions we get from data users. So that will be quite exciting, and I'll send a reminder out in the morning. Thanks, everyone, for coming out today; this has been a really great presentation with some really fantastic questions. If you start playing around in the data and questions come up between this week and next week, just write them down, and we should have lots of time to go over questions next time. And a big thank you to Michael and to Frank, because this has been an absolutely fantastic presentation. [Frank Edwards] Thanks, all. [Michael Dineen] Thanks, everybody. The National Data Archive on Child Abuse and Neglect is a project of the Bronfenbrenner Center for Translational Research at Cornell University. Funding for NDACAN is provided by the Children's Bureau.