[musical cue, then voiceover] National Data Archive on Child Abuse and Neglect. [Erin McCauley] All right, folks, good morning, we are getting started. Welcome to the NDACAN Summer Training Series. This series is hosted by the National Data Archive on Child Abuse and Neglect, which is co-hosted at Cornell University and Duke University. If this is your first summer with us, I want to give you a really warm welcome. If you are a repeat Summer Training Series participant, then thank you so much for coming back; we love seeing y'all year after year. Next. So this summer's theme is the power of linking administrative data, and we have six workshops throughout the summer, all on Wednesdays at this time, focusing on linking data. If you've used our data before, you know that linking is one of the unique strengths of these data, so these workshops will hopefully give you a bit of a confidence boost and an opportunity to ask questions about linking our data, both to itself and to outside sources, and in the context of some data analyses. Next. So here's our overview of the summer. Today we are starting out with a little introduction to NDACAN, in case you're new to us or want to be reminded of our services and supports, and then we have Alex with us, who's going to be leading a workshop on data management in R. Next week we're going to be talking about the administrative data cluster and linking within that cluster. The following week we're going to be linking some NDACAN data to external data; then we'll have a workshop on structural equation modeling, which will leverage some linked data; the week after that we'll cover matching methods, with an overview of propensity score matching; and then we'll round out the summer by talking about using NDACAN data, especially linked data, to study racial disparities. Next. So I'm going to pass it now to Tammy White for a quick introduction. [Tammy White] Hi everyone, thanks Erin. I just wanted to give a quick introduction, a thank you, and a welcome to everyone here, and as Erin said, a welcome back to those of you who have attended these trainings before. This looks like a real great series this summer. I'm really excited for the Archive to take this one on; I think you'll all enjoy it, and I hope you get to attend as many sessions as you can. As Erin said, it's really important for the Archive and for the Children's Bureau to put our data out there and to get some really good analytic results out of linking data and the power of linking administrative data. So I just want to say thank you, I hope you all enjoy the series, welcome, and we're really excited to see the summer go forward. Thanks a lot, enjoy. [Erin McCauley] Thank you, Tammy. So for our session agenda, this is just a quick overview of what we're going to talk about. I also want to mention, before we get in too far, that the series can be viewed as individual workshops, but they do build on each other over the summer, and at the end of the summer we release the series as a video series. It'll be available on our website, and I'll talk about that a little more when I talk about NDACAN and all the supports available on our website.
So at the end of the summer you'll be able to see this summer's series, and right now we have series up from the last few summers that were all on different topics, so if this strikes your interest I'd recommend hopping over there to see our prior ones; or if you miss a session, just know you'll be able to see the video at the end of the summer. So first we're going to talk about NDACAN: who are we and what do we do. Then we're going to go over some data management challenges, and Alex is going to lead us through a demonstration in R of how to address those challenges. So, introduction to NDACAN. We are the National Data Archive on Child Abuse and Neglect, and our main goal is to promote secondary analysis of child abuse and neglect data. We do this by making sure that we have high-quality data sets available to researchers and the public, by creating documentation that helps folks use the data, by providing technical support, and by encouraging collaboration within the scientific community. We are located at two institutions: Cornell University, in the Bronfenbrenner Center for Translational Research, which is where the archive was founded in 1988, and Duke University, where we recently moved to be co-hosted. We're supported by a contract from the Children's Bureau at ACF. So here's a quick overview of our staff; you'll see a few of us here today. Our directors are Christopher Wildeman and John Eckenrode, at Duke and Cornell respectively. Our archiving assistant Andres is here; he is instrumental in organizing and putting together this series, so a big thank you to Andres. You'll be hearing from one of our data analysts, Sarah Sernaker, later this summer. You'll be seeing me each week to lead introductions, and then both Frank and Alex will lead sessions. And our research assistant Clayton will be with us through the summer as well. If you've been to Office Hours you know his friendly face; he helps moderate the Q&A, and he also does a lot of the background work in helping me plan the series, so a big thank you to Clayton as well. As a little overview of the different activities that NDACAN does, we think of them as falling into these buckets: we acquire data, we protect the confidentiality of the people the data are about, we transform the data to make it usable to researchers and others, we disseminate the data, and then we support folks as they use the data. All of this is in an effort to expand the scope of research on child welfare. When we acquire data, we acquire both administrative data, which are data records that were created not necessarily for research purposes, and survey-based data, and then we make them available to the public. We serve recipients of Children's Bureau grants, who are required to archive their data, but we also archive data that anyone has collected, as long as it's about child welfare. Data archiving is really great because it increases citations, and a lot more research comes out of each individual data collection if it's archived, so if you're ever interested in archiving data, reach out to us, because that's what we do. And a core tenet of ours is to protect the confidentiality of the people the data are about. This looks like things like suppressing data from really small counties, where folks can be more easily identified, or slightly changing dates.
For example, in AFCARS, which is one of our most used data sets, we suppress counties with fewer than a thousand cases. We also change dates: children's dates of birth are recoded to the 15th of the month, and the other dates are shifted in a way that preserves the time spans between them. And then, lastly, data users sign agreements, so you have to promise that you won't try to identify people in the data. When we transform the data, a big part of this is creating our user guides and codebooks. We create them all in-house (that's mostly Sarah and Michael), and if you're new to the data, or you're coming back to the data after a while away, they're really the best place to start. Also know that we add variables for ease of use: variables that folks ask us about often, we've just started including in the data, things like two-character postal codes, children's age at certain time points such as age at the start of the year, a rural-urban continuum code, and a length of stay in foster care, which we call LifeLOS. We also have other resources aimed at helping researchers with the data: if you need to link data, there are guidebooks about how to do that, and there's also analysis-specific support for importing data. And then we distribute the data in a lot of ways, as I'll describe soon. When we disseminate the data, as I said, we distribute it in multiple data formats, so if you use Stata and want a .dta file, or if you want to import a CSV file, there are a lot of options for how to get the data. We license it to eligible researchers, and we will also do some ad hoc special analyses for non-researchers, but mostly what we do is support. We like to think of ourselves not as a kind of passive repository of data but really as an active data archive, where we leverage our experience using the data, and our experience as academics and researchers, to support folks in using the data. There are a lot of ways we do this. If you're not already on the electronic mailing list, I highly recommend checking it out; most of this, or actually all of it, is available on our website, so Clayton, could you put the link to our website in the chat so that attendees can follow it if they want to sign up? This is where we distribute information about our events and supports, but it's also a place where child maltreatment researchers can connect with each other about things like job opportunities or collaborations. We have a mailing list and a newsletter that comes out quarterly. We have a digital library where you can see how other folks have used the data; this is a great launching-off point for checking ideas or seeing what's already been done. We have a really intensive Summer Research Institute, where folks are accepted on an application basis. It's competitive, but if you're accepted, you come in with an idea for a research study you want to do and familiarity with the data, and then over the course of a few days, with one-on-one or small-group support from statisticians, data analysts, and staff, we help you take that idea through to the analysis stage. We also have the Summer Training Webinar Series, which you are at least a little familiar with if you're here, and as I said earlier, all of the videos from prior summers are available.
Last summer was the survey data cluster. And in the last year and a half, since about the start of the pandemic, we started a new series called the NDACAN Office Hours: during the academic year, once a month, we have an hour of office hours where the first 30 minutes is open support time in small groups, and then we have some type of informal presentation or workshop. And then we want to expand: we foster interdisciplinary research teams where we can and draw in researchers from new disciplines, folks who haven't considered studying child welfare before. You'll see us at different conferences; we were recently at the Society for Research on Adolescence, and next month we will be at the American Sociological Association. We do this through the Summer Training Webinar Series, so if you have friends or colleagues who you think would benefit from being here, please let them know, and also through the Office Hours, where we can provide more of that one-on-one support for folks who are new to the data or have questions about things like publishing in child welfare journals. And so we've got two main data clusters. We're going to be talking about the administrative data throughout this series, so I won't spend time on it here, but I wanted to give you a brief overview of some of our national and cross-site surveys, which are boxed here: the NSCAW, LONGSCAN, and the NIS. This will be the briefest of overviews, but if you see or hear something you like, I recommend checking out the webinar series from last year. Thank you. NSCAW: this is the National Survey of Child and Adolescent Well-Being. It's a longitudinal study looking at well-being that has a lot of information about children's families. Oh, I see a question coming in the chat: yes, we will be sharing the slides; there's a delay of a few weeks because we turn them into an accessible video series, but it'll have slides, a video, and a transcript. Good question. There's also information about interventions and services and key characteristics of child development. It's sponsored by ACF and the U.S. Department of Health and Human Services, and we put a little research example on the slide, because that's how I've found it's best to jump off; shout out to Clayton, because he made these slides. Next. We have LONGSCAN, the Longitudinal Studies of Child Abuse and Neglect. These data were archived by the LONGSCAN investigator group; if you're familiar with the literature, you'll recognize a lot of these names, so we're grateful to them both for their work in this field and for archiving these data, which really get a lot of use. The data follow 1,300 children and their families as the children age into adulthood. One benefit is that some of the data are collected from multiple sources, and there are yearly telephone interviews. And here's another research example at the bottom, something published in Child Abuse & Neglect. And then we have the National Incidence Studies of Child Abuse and Neglect. This happens approximately once a decade, starting in the 1970s, and it broadly estimates the incidence of child maltreatment in the U.S. It has a common definitional framework, which is one of the unique benefits of these data, and there's a publication here from Current Problems in Pediatric and Adolescent Health Care. And then I also want to highlight two of our newer data acquisitions.
If you've been attending the Office Hours, you've heard us hype these before. One of our big ones is the historical data, the VCIS data; this is really where Alex is the superstar. And then we also have policy data, which reflects the different policies and definitions across states. A quick overview of the VCIS: it's annual state-level counts of children entering and exiting substitute care, and basically what this allows us to do is link it with AFCARS to create a longer time span. And then the SCAN data, as I said, captures different state definitions and policies related to child abuse and neglect and the child welfare system, so you can link these data to other sources in order to address important questions about variation in state definitions and policies. For example, it can be linked with many of our data sets, but also externally with things like census data. So now I'm going to pass it over to Alex, who is really the superstar of today's presentation. [Alex Roehrkasse] Thanks, Erin, and thanks everyone else for being here today; I'm really excited to share this presentation. So the goal today is to talk through five different families of strategies for data management. These are pretty helpful strategies for all social science data management, but they're particularly helpful for using NDACAN administrative data. The way this presentation is going to work is that I'm going to talk through some conceptual issues and some examples here in a PowerPoint slide, and then we'll transition over to RStudio and actually work through some code, so you can see practical solutions to some of these challenges. I'm going to go ahead right now and share a link to the R script I'll be using a little bit later, so if you want to download that, and you're an R user already, you can open it and take notes in the script; either way, we'll make sure that people have access both to these slides and to this R script going forward. Okay, the first strategy... well, I should pause to say quickly that we're not going to talk about data linking today; the other presentations will focus more explicitly on linking, but you can think of these as all really helpful strategies that very often come before you link your data with other sources. So the very first thing that I think is really important for users to consider is keeping track of the data management process itself. When we're working with these very complex data sets over the whole life course of a complex research project, we're very often making lots of complex decisions, and it can be difficult to remember all the decisions we've made. So it's always important to work from a script rather than entering commands into a command line; that allows us to always go back and see what we've done, and it creates an end-to-end reproducible research flow. In R we're going to be using a .R file, which is a popular way to write a script, although there are other options, including markdown files. If you're a Stata user, what we're talking about here is a do-file. These files will allow us to keep track of all the things we've done over the course of our project.
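A minimal sketch of what such a script header might look like; the file name and path here are hypothetical, and the packages are the ones used later in this demonstration:

# s1.R: one end-to-end script, so every data-management decision is on record
rm(list = ls())        # start from a clean environment
library(data.table)    # fast reading and writing of large files
library(tidyverse)     # data management and visualization commands
set.seed(20220615)     # fix the random-number generator for reproducibility
# hypothetical path to an anonymized AFCARS-style extract
d <- fread("data/afcars_extract.csv")
# ...all subsequent cleaning, summarizing, and modeling steps live below...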
The next most basic thing that users often want to do is generate counts or summary statistics from data. NDACAN data are mostly organized in terms of individuals (children) or incidents, so the unit of analysis in the AFCARS, for example, is children in foster care, and the unit of analysis in NCANDS, for example, is the child report. But very often we want to count up the number of incidents occurring in a given year, or count up the number of children, or measure, say, the mean age of children experiencing some event. So for example, here we see how many children of different ages entered foster care in each year; this is a very basic question that we might want to ask. In the R context, we're going to use the group_by command to tell R to group children having certain attributes, and then we'll use the summarize command to tell R to calculate some summary statistic; in Stata, the equivalent here would be something like collapse. NDACAN data are mostly organized in what we call a long format, which is to say that each child or child report represents a row. Sometimes, though, we want to calculate certain quantities that are easier to calculate when we've reshaped our data into a wide format. It's a little bit hard to explain what I mean by that until we actually get into the data, but as an example, we might want to calculate the sex ratio of children in, or entering, foster care in any given year; to do that, we would want to summarize our data and then pivot our data wide to make it easier to calculate a sex ratio. Again, I'll show how to do this in just a minute. NDACAN data are pretty large; we're very often talking about hundreds of thousands, sometimes even millions, of records, very often with many, many variables. The sheer size of these data often makes even basic calculations pretty computationally intensive, so unless you're working on a server, it can be helpful to reduce the size of your data to develop your code and test your statistical models, and then come back and apply them to the full set of data only when you're interested in generating some final results. In R we're going to use the slice command; in Stata we would use the sample command. I'll show you how to create a 10 percent sample of your data so that we can do some more complex operations a little faster. Okay, and the last, probably the most complicated but arguably the most important, thing we'll talk about today, really only at a surface level, is how to handle missing data. Most NDACAN data sets, particularly the administrative data sets, have considerable proportions of data missing, and without an intentional strategy for handling missing data, the inferences we draw from NDACAN administrative data can be pretty significantly biased. So right now I'm going to give a brief overview of the different types of missing data that there are, how to identify them, and the different candidate solutions that are available to us. I'll talk very briefly about multiple imputation as a possible approach to missing data, and then I'll demonstrate how to do multiple imputation in R in a little bit.
Broadly speaking, there are three types of missing data, and which type we're dealing with will dictate the appropriate solution. Very often, though, there's no empirical criterion for identifying which type of missingness we're dealing with, so we have to reason through it conceptually, based on our contextual knowledge about the data generating process. The first type is data that are missing completely at random, which means that the missing data mechanism is unrelated to any aspect of the data itself. Consider an example where we're using AFCARS data, say, and we want to know whether children entering foster care have experienced physical abuse. In AFCARS there's an indicator variable for physical abuse: it has the value one if the child has experienced physical abuse, the value zero if they've not, but for some children the value of the physical abuse variable is missing. Now, if the physical abuse data were missing completely at random, the state that the child comes from would have no predictive power in explaining whether that value was missing. Moreover, values of that variable would be no more or less likely to be missing depending on whether the child had or had not actually experienced physical abuse; that is to say, children who experienced physical abuse would be no more or less likely to have a missing value for that variable than children who had not. If our data are missing completely at random, we have a wide variety of available options. This is rarely the case, though. Much more commonly, data are missing at random. What missing at random means is that the missing data mechanism is in some way related to our data, but only to observable parts of our data. To return to this example, let's say we were looking at children who had missing values for the physical abuse variable, and we saw that children from certain states were much more likely to have missing values than children from other states. This is a scenario where missingness is not random, but it's non-random only with respect to a variable that we observe, namely the state that children come from. Fortunately, with careful planning, we can use a variety of statistical methods to correct for this non-randomness and generate valid estimates. A much more serious challenge is the scenario where our data are missing not at random, which means that the missing data mechanism is in some way related to unobservable parts of our data. The most common example here would be a scenario where, say, children who did not experience physical abuse were much more likely to have a missing value of the physical abuse variable. And this is a plausible scenario: we can imagine a social worker completing a case record who simply skipped over that field because the child hadn't experienced physical abuse. This is a much more challenging scenario, because basically no statistical solution will solve this problem. Usually, when our data are missing not at random, we have to go back and collect more data or reconsider our research design.
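A tiny simulation, separate from the AFCARS-style extract used later, may help fix the distinction between the three types; all names and probabilities here are invented purely for illustration:

# Simulate a true physical abuse indicator, then delete values three ways
set.seed(1)
n        <- 10000
state    <- sample(c("NY", "CA", "TX"), n, replace = TRUE)
phyabuse <- rbinom(n, 1, 0.13)  # true values, before any missingness

# MCAR: every record has the same 5 percent chance of being missing
mcar <- ifelse(runif(n) < 0.05, NA, phyabuse)

# MAR: missingness depends only on an observed variable (state)
mar <- ifelse(runif(n) < ifelse(state == "NY", 0.13, 0.01), NA, phyabuse)

# MNAR: missingness depends on the unobserved true value itself
mnar <- ifelse(runif(n) < ifelse(phyabuse == 0, 0.10, 0.01), NA, phyabuse)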
To give you a quick overview of some of the most common approaches to missing data: listwise deletion is the approach where, for any observation with data missing on relevant variables, we simply drop that observation. You should know that without an intentional approach to missing data, most statistical software will by default adopt listwise deletion, but recall that listwise deletion is really only appropriate if our data are missing completely at random. Hot-deck imputation is a pretty popular approach that basically assigns values based on the observed distribution of the variables. Multiple imputation is arguably the most popular and broadly applicable set of approaches to missing data, and it's what we'll talk about in a little bit; it basically relies on using multiple regressions to estimate the missing values. There are other approaches that are sometimes a little more statistically sophisticated but are easily implemented in a lot of software programs, so I just want to let you know that maximum likelihood and Bayesian estimation methods offer some pretty compelling solutions as well. Again, if our data are missing completely at random, any of these solutions is likely to be appropriate. If our data are missing at random, then we cannot listwise delete our data; we have to take a more considered approach. If our data are missing not at random, we're in a little more trouble, and we should go back to the drawing board. Okay, to give you a very quick sense of what multiple imputation really is: we're going to use a regression model to estimate the values of our missing data, using observed data as predictors. We'll then do this multiple times, which will yield multiple imputations that capture our uncertainty about the true values, and then whenever we calculate summary statistics or estimate a statistical model, we'll use our many partially imputed copies of our data, instead of our single copy of our raw data, to arrive at valid estimates. Okay, so now I'd like to demonstrate some of these concepts in R. Just in case you're not already an R user but you're interested in getting started, here is where you can download R. R is a language, and you can work directly in R, but everyone I know uses a development environment called RStudio, which makes it much, much easier to use R. Here are also a couple of links about how to learn to use R and RStudio that have been very helpful to me and others I know. Okay, so now we're going to move over to RStudio to demonstrate some of these concepts. Again, I've dropped a link to the script in the chat; give me just one second to transition my screen share so that we can move through some of these examples. Okay, if one of the other presenters could confirm that you can see my screen, that would be helpful. [Clayton Covington] We can see your screen. [Alex Roehrkasse] Great, thank you. Okey dokey. Okay, so this is basically what RStudio looks like, and we're going to work through a script here that I've called s1. Our script will be here; the console here will output the results of the commands we tell R to run; we'll see a variety of visualizations here; and R is an object-based language, so we have an environment where the various objects we create will pop up here. The first thing we're going to do is clear our environment of anything that might already be in there, and then we're going to load some packages. In R, packages are just families of commands that we'll use.
I'm going to use the data.table package, which is very helpful for reading and writing large files, and then we'll use the tidyverse package, a very helpful and increasingly popular set of packages for data management. We're going to be using the mice package for multiple imputation, but I'm not actually going to load this package now, because it conflicts with certain aspects of the tidyverse; I'll call its commands directly instead. If someone could repost the link to the script in the chat, that would be helpful. [Clayton Covington] I've got it. [Alex Roehrkasse] I'm going to set my file paths here and tell R what my working directory is. I'm also going to set a seed. This is very important whenever you're dealing with random numbers: when we get to our multiple imputation, we're going to be dealing with random number generators, so to generate reproducible results it's always important to set a seed. You can choose whatever seed you want, it doesn't matter, but it's important to use the same seed every time. Now we're going to go ahead and read in some data. This is a modified version of the AFCARS, so none of these records are actual records of children in the AFCARS; I've changed the data to anonymize them, but the structure of the data should be very similar to the AFCARS, so for any AFCARS extract you're pulling, much of this code should be readily transferable. So let's go ahead and just look at what these data look like. We have just a few variables (the AFCARS has many scores of variables, but we just have a few here): the year children entered foster care, the state from which they come, their age at last removal, their sex, their race and ethnicity, and an indicator variable for whether they experienced physical abuse. We're going to go ahead and clean the data: we'll recode sex as an indicator having value 0 or 1 and tell R to treat it as a factor variable; we'll tell R to treat some other variables as factor variables too; and we'll recode our missing value for race and ethnicity in a manner that R recognizes as a missing value. Okay, now let's talk through each of these five strategies of data management. The first is just keeping track of data management, and that's exactly what we're doing here: we've created a script, and anytime I want to go back and see what I've done, or anytime you want to see what I've done, you can open this script and run it to reproduce the analysis. Okay, now let's go ahead and summarize some data. The tidyverse approach to summarizing data usually involves some combination of the group_by command and then summarize. So here we're going to take our data object d, which is over here in our environment; you'll note that we have about 5 million observations, so about 5 million children entering care between 2000 and 2018. We're going to group those children by their age and then summarize our data, generating a value N, which is equivalent to the number of observations having each value of age. So let's go ahead and do that, and what we get is a count of children having each value of age across the years 2000 to 2018. Let's say, though, we wanted to know how many children entered foster care in each year. Here, instead, we would group by the year they entered foster care and summarize in the same way, counting up the number of children entering in each year. And here again we have counts of children entering foster care in each year.
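A minimal sketch of these counting commands, and of the mean and standard deviation summary that comes next; the object and variable names (d, age, year) are hypothetical stand-ins for the extract's actual fields:

library(dplyr)

# Count children entering foster care at each age, pooled across years
d %>% group_by(age) %>% summarize(N = n())

# Count children entering foster care in each year
d %>% group_by(year) %>% summarize(N = n())

# Mean and standard deviation of age at entry, by year
d %>%
  group_by(year) %>%
  summarize(mean_age = mean(age, na.rm = TRUE),
            sd_age   = sd(age,  na.rm = TRUE))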
Let's say we want to know how many children of different ages entered foster care in each year; let's create a little more room to see our results. Here we'll group by both the year that children entered foster care and their age when they entered. If we go ahead and run this command, we'll see that in our console we get just a few rows, and then R tells us there are 332 more rows to this object that we can't see in the console. What we can do instead is use the View command so that we can view the object right here in this window, which allows us to see much larger objects. Indeed, we could just click on this object and it would open in this environment, but it might take a while. Let's say we don't just want to count children; we want to calculate some summary statistic about them. What if we wanted to know the mean and standard deviation of the age of children entering foster care in each year? Here we would group children by the year they entered foster care, and then, instead of counting, we'll summarize our data, generating a mean value using the mean function (we'll tell R to remove any missing values in calculating the mean), and we'll do the same for the standard deviation of age. If we go ahead and run this command, and also tell R to let us view it in a wider window, all of a sudden we have, for each year that children entered foster care, their mean age and the standard deviation of their age. So that's a brief crash course in summarizing the sort of data you'll find in NDACAN's administrative data sets. I want to take a brief excursus to make a case for the importance of visualizing data, and for people who are considering R, R is a particularly useful language for data visualization. So I'm going to very quickly use a data set called Anscombe's Quartet. Anscombe's Quartet is a famous set of data that illustrates some of the perils of relying only on summary statistics, and some of the important reasons to use data visualization. If we look at Anscombe's Quartet, we'll see that we basically have four different groups; within each group we have an id, and for each observation we have a value of a variable x and a variable y. If we were to summarize these data, calculating the mean and variance of x and y, we would see that for each group the mean value of the x variable and of the y variable is identical, and the variance of x and the variance of y across groups are likewise almost identical. So, just looking at the mean and variance of these variables, you might infer that the data from groups 1, 2, 3, and 4 are the same, but if you actually visualize the data, you see that in fact these are very different groups of data, right? We would make very different inferences about the relationship between x and y looking across the four different groups. This is to illustrate that when you're doing data summarization and exploratory data analysis, it's always important to visualize your data so that you can understand them better. To return to our concrete example about children entering foster care: if we were to look at the mean age of children entering foster care in any given year, we would see that across time that mean age is decreasing. If we were to actually visualize our data, though, assigning children to different age groups, we would find a bit more nuance.
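Incidentally, R ships Anscombe's Quartet as the built-in anscombe data frame, so the quartet point above is easy to verify firsthand; a minimal sketch:

library(tidyverse)

# Reshape the built-in quartet so each row has a group, an x, and a y
tidy_anscombe <- anscombe %>%
  pivot_longer(everything(),
               names_to = c(".value", "group"),
               names_pattern = "(.)(.)")

# Nearly identical means and variances across the four groups...
tidy_anscombe %>%
  group_by(group) %>%
  summarize(mean_x = mean(x), var_x = var(x),
            mean_y = mean(y), var_y = var(y))

# ...but four scatterplots that look nothing alike
ggplot(tidy_anscombe, aes(x, y)) +
  geom_point() +
  facet_wrap(~ group)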
So, for example, in our foster care data we can see that the decrease over time in the age of children entering foster care was driven in the early 2000s by children aged 10 to 14, but that rate leveled off in the 2010s. Meanwhile, in the 2000s, children 15 and older were entering foster care at a fairly stable rate, but then it's that group of children among whom entrances into foster care decreased in the 2010s. So that's a simple demonstration of why visualizing your data can be very helpful in exploratory data analysis and in summarizing your data. Okay, let's now explore how we would reshape our data. Again, looking at the structure of our data, you see that each row represents a child entering foster care in a given year. Let's say we want to know the sex ratio of children entering foster care in any given state or any given year; let's focus on 2018 just as an example. We'll take our data set and create a new object where we pull out only those children entering foster care in 2018; we'll then group those children by the state from which they come and their sex; we'll count the number of children belonging to each state and sex; and then we'll relabel our sex variable so that instead of zero and one it indicates male and female. So let's go ahead and create that object. We see the object appear in our global environment, and we can click on it to view it. What we see here is that this N variable represents the number of children having each value of state and sex: so there are 676 male children entering foster care in Arkansas in 2018, and an identical number of female children entering foster care in Arkansas in the same year. We can calculate a sex ratio in R with our data organized in this way, which we would call long, but it would be much easier if we had one column that counted male children entering foster care and a separate column that counted female children. [Erin McCauley] Hey Alex. [Alex Roehrkasse] Yes? [Erin McCauley] Really quickly, would it be possible to make your text a little bit larger? [Alex Roehrkasse] Oh, I'm not sure I know how to do that. Is that getting larger? [Erin McCauley] Yes. [Alex Roehrkasse] It should be, yeah. Is that better? [Erin McCauley] I believe so, thank you, sorry. [Alex Roehrkasse] Yeah. So let's see if we can do that. In R, we're going to take this object, create an object of the same name, and pivot our data wider. The id columns argument tells R which variable we want to continue to identify our rows; we'll pull the names for our new columns from the sex variable and the values for our new columns from the N variable. So if we run that code and then look at this object again, we see that we now have separate columns counting children entering foster care with each value of sex, plus a column for children for whom the sex variable is not observed. Now it's much easier to calculate a sex ratio: we can simply divide the male count by the female count. We can do that by using the mutate command to generate a new variable, sex ratio, that's just the ratio of males to females entering care, and we see we have a sex ratio variable here, which we can use in any sort of summary or statistical analysis we're interested in. Having the male and female counts side by side is also helpful for visualization, so we can create a plot here that shows the relationship between female entrances into foster care and male entrances into foster care.
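A minimal sketch of the reshape just described, again with hypothetical variable names (st for state, sex coded 0/1; which code means male and which female is assumed here purely for illustration):

library(dplyr)
library(tidyr)

sexes <- d %>%
  filter(year == 2018) %>%                  # children entering care in 2018
  group_by(st, sex) %>%
  summarize(N = n(), .groups = "drop") %>%  # count children per state and sex
  mutate(sex = recode(sex, `0` = "male", `1` = "female"))  # coding assumed

# Pivot long to wide: one row per state, one count column per sex
sexes <- sexes %>%
  pivot_wider(id_cols = st, names_from = sex, values_from = N)

# With counts side by side, the sex ratio is a one-line mutate
sexes <- sexes %>% mutate(sex_ratio = male / female)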
On that plot we can draw a 45-degree line that indicates parity between the sexes, and we can see the states that have larger numbers of males relative to females entering foster care, or, on the other hand, Montana, where a larger proportion of females enter foster care. Okay, now let's talk about sampling data. We've been working with a pretty large data set here; there are about five million observations. Let's say we want to start doing some slightly more complicated things, where this is going to start to strain whatever sort of computer we're working on. I'm working on a Mac laptop right now; it's a few years old, it's not so powerful anymore. So, just for the sake of demonstration, let's confine our analysis to children entering foster care in 2018. We're going to filter all children entering in 2018 and create a new object called d2018, and we'll see that our d2018 object has only 255,000 observations, corresponding to the 255,000 children entering foster care in 2018. Let's say that's still a little big for some of the more complicated things we want to do. We can take a random sample of that 2018 data simply by using the slice_sample command and telling R that the proportion of records we'd like to keep is 10 percent. R will then randomly select 10 percent of the observations in d2018 and create a new object, d2018_10, and you can see that this object has precisely ten percent of the observations that the first object has. Let's see how helpful that can be. Let's go ahead and estimate a logit model where we're going to predict whether a child entering foster care has experienced physical abuse. Let's say we want a separate intercept for each age that children might have and a separate intercept for the state from which they come. This is a fairly complex model: there are going to be somewhere on the order of 70 different predictor variables. And let's go ahead and see how long it takes R on my little Mac laptop to estimate this model. Just to talk through this: I'm creating an object fit using the glm command, which stands for generalized linear models; I'm telling R to use a binomial model, which tells it to estimate a logistic regression; and the data I'm using here is the d2018 data. R tells us that that took my machine about 24 seconds. Now let's do the same but with our 10 percent sample instead. Okay, so it took my computer only 1.8 seconds to do that operation, which is less than 10 percent of the time it took to estimate the model using the full sample. So what you can see is that, particularly for very large data sets or more complex operations, to the extent that you reduce the size of your data, you'll get outsized increases in the speed with which you can calculate different quantities. If we compare the coefficients for these two models, estimated on our full data and on our 10 percent sample, you'll see that the parameters are pretty close but not identical. So whenever you're developing or testing your code, or developing your models, using sampled data, you should be aware that your final results using the full data set will change, and that estimates using the sampled data are not themselves valid. Okay, let's talk about this last challenge, handling missing data. We're going to approach missing data in this context using multiple imputation, and specifically multiple imputation by chained equations.
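Before the missing data discussion continues, a minimal sketch of the sampling and timing comparison just shown; the names (d, year, phyabuse, age, st) are hypothetical stand-ins for the extract's fields:

library(dplyr)

d2018 <- d %>% filter(year == 2018)   # children entering care in 2018 only

# Keep a random 10 percent of rows for faster code development
d2018_10 <- d2018 %>% slice_sample(prop = 0.1)

# Logit model: physical abuse predicted by age and state intercepts
system.time(
  fit <- glm(phyabuse ~ factor(age) + factor(st),
             family = binomial, data = d2018)
)

# The same model on the 10 percent sample runs much faster; remember that
# sampled-data estimates are for development only, not for final results
system.time(
  fit10 <- glm(phyabuse ~ factor(age) + factor(st),
               family = binomial, data = d2018_10)
)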
Chained equations are a computational strategy for when we have missing values on multiple variables. In R, the best package for multiple imputation by chained equations is called mice, and the developers of mice have published a set of vignettes that are extremely helpful, not only for understanding mice but for understanding basic best practices of missing data exploration, the development of missing data imputation models, and the evaluation of how well your models are performing, so I'd strongly encourage people to explore the mice vignettes. Okay, let's return to this example where we're interested in physical abuse among children entering foster care in 2018, and, just to keep things speedy, we're going to use our 10 percent sample. Let's say we want to know what proportion of records in 2018 are missing information about physical abuse. We can go ahead and calculate that quantity, and we see that about 86 percent of children entering foster care in 2018 did not experience physical abuse, or rather, have a record indicating that they didn't. About 12, or closer to 13, percent of children have records indicating that they did experience physical abuse. But there's another half a percent of children who don't have observed values of the physical abuse variable. Someone asked if I can put the mice vignettes in the chat, and I'll do that real quick. So this isn't a very large proportion of missing data; it's not likely to bias our estimates very significantly, but it's still worth dealing with, and I'll be honest with you that very often the proportion of missing values in NDACAN data is much larger than this. Okay, now let's try to figure out whether our data are missing completely at random or are more likely to be missing at random. Let's see whether data are more or less likely to be missing depending on the state from which children hail. This table tells us the different states where children have missing values of physical abuse. There are only six states from which children have missing values, but we can see that the proportion of children having missing values is very different across states. Fully 13 percent of children in New York have missing values of physical abuse, whereas in roughly 45 other states no children have missing values, so clearly our data are not missing completely at random: the missingness depends at least on the state that children come from. So we can't simply listwise delete our data; we would get invalid estimates. For example, if we were interested in the difference in physical abuse between New York and other states, and we were to use our raw data, we would probably underestimate the prevalence of physical abuse in New York, because 13 percent of children there don't even have records. Okay, let's go ahead and investigate the patterns of missing data in our little fake sample of AFCARS data. Here we're going to use the mice command md.pattern, and I've used this double colon to tell R that this command comes from the mice package; I'm doing this uniquely for the mice package because, again, up here I didn't load the mice library, so I'm telling R to do this manually. The md.pattern command tells us the pattern of missingness in our data, which can be very helpful: we have 24,760 observations that have no missing values for any of our variables.
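A minimal sketch of this missingness exploration, with the same hypothetical names as above:

# Proportion of records in each physical abuse category, NA included
d2018_10 %>%
  count(phyabuse) %>%
  mutate(prop = n / sum(n))

# Which states account for the missing values, and at what rates?
d2018_10 %>%
  group_by(st) %>%
  summarize(prop_missing = mean(is.na(phyabuse))) %>%
  filter(prop_missing > 0)

# Cross-variable pattern of missingness (mice not loaded, hence the ::)
mice::md.pattern(d2018_10)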
Continuing down that table: 613 observations have missing values of race and ethnicity; another 144 observations have missing values of physical abuse; we have only one observation with missing values for both physical abuse and race and ethnicity; and then there are four observations with missing values of sex. Okay, let's go ahead and estimate a multiple imputation model. To do that, we're going to use the mice command that's part of the mice package, and we're going to run it on our 2018 10 percent sample. We're going to tell R that we want to estimate three imputations. The default is five, and when you're actually doing science, I would recommend that you never use fewer than five, and very often you'd want more than that; just to keep things moving, here we're going to do three. And we're going to tell mice to use the default method of estimation for each variable, which I'll explain in just a second. This might take my computer a few seconds, but what we're doing here, basically, is using the observed values in our data, and their relationship to values that are in some cases missing, to make best guesses about what those missing values might be. We're going to do that multiple times, namely three times, so we'll actually end up with three copies of our data that have different best guesses at what the true values of the missing observations might be. Okay, so we've now created an object, imp, that we see pop up in our environment, and, unlike a simple matrix of data, if we click on this object we'll see that it's actually a pretty complex object that includes our original data, a variety of imputed data, and also all kinds of information about the imputation process itself. So let's say we want to see how this imputation actually went. Let's first of all confirm that the mice program used the correct type of estimation to make guesses about our missing values. It didn't have to guess about any values of state or age, because we don't have missing data for those variables, but this tells us the type of model that mice used to estimate the missing values for the other variables. Sex is a binary variable, and so mice correctly understood that we wanted to use a logistic regression to estimate that outcome. Race and ethnicity is a multinomial variable, and so R correctly guessed that it should use a polytomous regression, or multinomial logit regression, to estimate those values. Let's say we want to see how our multiple imputation models performed. One of the most important ways we can evaluate this is by looking at the convergence of our models, because we're using, say, race and ethnicity to predict physical abuse and physical abuse to predict race and ethnicity, but we have missing values for both variables. The way chained equations work is to iterate over this estimation process multiple times, and you can see here the multiple iterations that mice performed; the default is to perform five iterations. Ideally, our different models, represented here by different colored lines, should converge over multiple iterations. We can see that for physical abuse we arguably have decent convergence, but for other variables, like sex, our models really aren't converging; that's some indication that our imputation model needs more work. Now we can go ahead and compare the differences between our raw and imputed data.
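A minimal sketch of the imputation workflow, including the pooled model estimation that the presentation alludes to next; m = 3 mirrors the demonstration, though five or more imputations are recommended in real work:

# Estimate three imputations on the 10 percent sample (use m >= 5 in practice)
imp <- mice::mice(d2018_10, m = 3, seed = 20220615)

imp$method   # which model mice chose for each variable (logreg, polyreg, ...)
plot(imp)    # trace plots: chains should mix and converge across iterations

# Fit the analysis model on each completed data set, then pool the results
fits   <- with(imp, glm(phyabuse ~ factor(age) + factor(st),
                        family = binomial))
pooled <- mice::pool(fits)
summary(pooled)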
So let's go ahead and focus in on New York, and let's see, in the raw data, what proportion of children had positive, negative, or missing values of physical abuse; this is just to revisit what we already saw before. Again, in New York, 80 percent of children did not experience physical abuse, we think; about seven percent did; but then there's this 12 percent where we really don't know. If we look at our imputed data, we can see that we no longer have any missing values; instead we see that, according to our model estimates, about 90 percent of children did not experience physical abuse, whereas about nine percent did. So if we were to only use our raw data and listwise delete these observations, we would incorrectly guess that about seven percent of children entering foster care in New York had experienced physical abuse; when we impute our data, we see that the true value is probably closer to nine percent. We can do the same for race and ethnicity, where in our raw data we see the different proportions of children having different values of race and ethnicity, but there's another two percent of children who have missing ethnoracial data. In our imputed data, though, we've reassigned those children with missing values to the other categories, based on our best guesses of what those true values might be. I'll skip over... well, I'll just briefly say that we can then use our imputed data to estimate statistical models, and we can compare the results of statistical models based on our imputed data to models based on our raw data. Oops, we seem to have run into an error that's probably not worth getting to the bottom of here at the end of our presentation, but suffice it to say that imputation is important not only for generating valid descriptive statistics about our data but also for estimating valid statistical models based on our data. Again, this code will be available to anyone who wants to use it; my contact information is at the top of the code if you ever want to talk through it or have any other questions about data management. Thank you so much for your time and attention; I think we'll open things up to Q&A. [Erin McCauley] Yes, thank you, Alex, so much. I will also say that during the academic year, when we have Office Hours, Alex is one of our breakout group superstars; he is in the stats room, so that's also a great place to get support during the year. Before we move into the Q&A, I just want to highlight that next week we'll be having a session on linking administrative data internally, so we hope that you can join us. And Alex, the first question is in the Q&A; I'm going to ask Clayton to read it aloud, because that way it is accessible when we create our series afterward. [Clayton Covington] Yes, so the first question asks: is there a threshold for when missing values need to be addressed for validity, to avoid bias? Or is there a threshold or model for determining the impacts on validity of a variable based on the proportion of values that are missing? Oh, Alex, if you're responding, I think you might be muted. [Erin McCauley] It has to happen once a webinar, guys. [Alex Roehrkasse] Okay, so that was a two-part question, right? Let me answer the first part first. If I recall, the question was: is there a threshold beyond which it becomes important to address the missing data? So let's say only 0.1 percent of our observations are missing; do we really need to deal with those missing values? What if one percent of our data are missing?
What if 10 percent are missing? To my knowledge, there's no clear convention or statistically justified general rule about when you do and don't need to worry about missing data. I would encourage people, frankly, to undertake some of the missing data approaches I've just demonstrated even if it's a very small proportion. If you find that the difference in results is trivial, it may be worth simply listwise deleting your data and noting, maybe in an appendix, that you did some more sophisticated missing data analysis; but you never really know how much missing data may be biasing your results until you go and find out. So I would encourage users, even with very small proportions of missing data, to take a considered approach to how they might handle it. That's how I understood the first part of the question, but maybe there was a second part that was slightly different. Clayton, can you read it back to me? [Clayton Covington] Yes, the second part of the question says: is there a threshold or model for determining the impacts on validity of a variable based on the proportion of values that are missing? [Alex Roehrkasse] I'm not quite sure I understand the question, so I guess I would encourage the questioner to email me and maybe we can work through it over email. [Clayton Covington] Okay, that sounds like a good plan. We have another question that I think Erin's answering in the chat, so I won't read it aloud, but if people have additional questions they want to ask, we do still have a few minutes. [Erin McCauley] Yes, I did type a response, but the question was just about the Office Hours, I believe. During the academic year we host those Office Hours once a month. The information is on the website; our website is really comprehensive, and so that's definitely where I would recommend starting. Clayton, could you pop the website into the chat again in case folks missed it? We are also on Twitter, and we advertise all of our events there as well, so can you also put our Twitter handle in the chat? And then we have a new question: can you use imputation on a dependent variable? [Alex Roehrkasse] Yes, yes, absolutely, there's no reason not to. Yep. So, just to clarify: in the multiple imputation model, the outcome of the imputation model is any variable for which there is missing data. When you get to your data model, where you're actually doing your analysis, you may have had missing values in your dependent variable or in your predictor variables; regardless, it's important to impute or find some other missing data strategy, whether we're talking about dependent or independent variables, and the approach I've demonstrated here applies equally to both. So, for example, you can see here that physical abuse was the outcome in our data model, but it also had missing values. Likewise, I guess we didn't impute any data for our predictor variables here, but we could have included, say, sex in this data model, and then we would have been imputing data on both the right and left sides of our data model. [Erin McCauley] Thank you, Alex. And Holly, I would also recommend checking out the session the week after next.
Frank is going to be leading a workshop on linking NDACAN data with external data products, and I believe that will include the census data. And in Holly's follow-up, she said: for example, if you're trying to use census data to predict foster care entry, could you impute the foster care entry variable for the census table? [Alex Roehrkasse] Oh, sorry, could you repeat the question? I apologize. [Erin McCauley] Yeah, it was just a follow-up on the original question: for example, if you were trying to use census data to predict foster care entry, could you impute the dependent variable? [Alex Roehrkasse] Yes, absolutely. So if you were linking data, the appropriate sequence of steps would be to bring in all the data that you're actually going to be using in your data model. Let's say you're bringing in census data to predict some child maltreatment outcome: you would want to link all of that data, get it all cleaned up, and then estimate your imputation model using any variables that you might be using in your data model. To clarify one thing, it's perfectly valid, and often quite useful, to use variables in your imputation model that you might not actually use in your data model. So here, for example, we used sex in our imputation model to help us make good guesses about the missing values for physical abuse and race and ethnicity, even though we didn't end up using sex in our data model. So if you're using census data in your data model, it's very important to use it in your imputation model, but you may even bring in data from other sources that aren't part of your data model at all, if you think they're going to help you predict missing attributes in the child maltreatment data. [Erin McCauley] Perfect, thank you so much, Alex. So we are at time. Another big round of applause and thank you to Alex. Next week we'll be talking about linking NDACAN data internally, in the administrative data cluster, and we hope to see you then. Bye, everyone! [voiceover] The National Data Archive on Child Abuse and Neglect is a collaboration between Cornell University and Duke University. Funding for NDACAN is provided by the Children's Bureau, an office of the Administration for Children and Families.