[Musical cue, then voiceover] National Data Archive on Child Abuse and Neglect. [Erin McCauley] Okay, I see folks are still streaming in, but it is noon (or nine, depending on where you are), so let's hit the ground running. Welcome to the 2022 NDACAN Summer Training Series. If a question comes up during the presentation, please submit it in the Q&A box. There is also a chat feature, but we try to keep all the questions in the Q&A so that we have a transcript of them when we turn this into our webinar series at the end of the summer. As a reminder, this session is being recorded. If you are new to us: the Summer Training Series is run live, but it is turned into a webinar series that gets posted on the website at the end of the summer. That is why we record, and if you have questions about our data that aren't covered in this summer's series, I recommend looking at our website, because the prior summers are posted there. Everyone seems to be doing okay on Zoom, but if you're having trouble, there is a lot of Zoom support available, and Andres, who organizes these lovely events for us, can also help. Sarah, next. Great. As I said, this is the NDACAN Summer Training Series, a series of workshops we host every summer. At the end of the summer you will get an evaluation survey from me; we really like getting feedback, both to help us improve the series and because it is how we pick the topics for the following year. All the topics covered in this Summer Training Series are recommendations from folks who attended last year. The series is hosted by the National Data Archive on Child Abuse and Neglect. We are a data archive co-hosted at Duke and Cornell; we make data on child abuse and neglect accessible to the research community and provide additional training, like the events you see here today. This summer's theme is "the power of linking administrative data". Here is the overview of the summer. As you can see, today is our second session, and we're going to be talking about linking our administrative data. I'll start with a brief overview of the administrative data and then pass it to Sarah, who will talk more specifically about linking one of our administrative data sets to another, and then do a live walkthrough. To give the most time to the workshop, the overview of the administrative data will be quick, but we did more in-depth sessions in the 2019 Summer Training Series, and Alexandra is putting a link to that series in the chat, so if you hear something about the administrative data and have more questions, I recommend checking out those longer sessions. Thank you, Alex. Great. So the agenda: I'll start by talking about NDACAN's administrative data, then Sarah will go through the linking steps and the process, or theory, of linking, and then the live walkthrough will be in R. So, our administrative data holdings. Administrative data are data collected for non-research purposes. In the context of child abuse and neglect, they are collected by the government agencies and large organizations that manage the administration of the child welfare system in the United States.
These records are usually developed for record-keeping rather than for statistical analyses, but since all these data already exist, we can harness the power of these large data sets to study child abuse and neglect. We have three administrative data sets. The first is NCANDS, which covers the child protective history of youth. We also have AFCARS, which covers their time in foster care. And then, for those who start to age out, they are identified at age 17: if these youth do not find a permanent placement and are likely to age out without one, we capture them in another data set covering the transition out of care, the NYTD data set. These youth are started at 17 if they are identified as likely to age out; there are two follow-up waves, and there is also information about the services they receive while aging out. One really interesting feature of these three data sets is that they can be linked together. For any child that appears in both NCANDS and AFCARS, meaning they have a child protective history and a foster care experience, we can match that youth. And for the youth who eventually turn 17 and are identified as likely to age out without finding a permanent placement, we can trace them all the way back through their child protective and foster care histories. So these three data sets cover the three phases of the child welfare system. NCANDS is the National Child Abuse and Neglect Data System. It was created as a voluntary system, and these are case-level data collected for all children who received a response from a child protective services agency in the form of an investigation or an alternative response. There is a file containing child-specific records for each report of alleged child abuse and neglect that received a CPS response: complete reports that resulted in a disposition or finding during each reporting year. For each report-child pair there is an NCANDS record with a report ID and a child ID, which together uniquely identify a single record within a fiscal year. Here are the types of variables among the report variables and the child variables. Again, if this seems like a data set you might want to use and know more about, I recommend checking out the series we did that goes more in-depth on these data. There are report data; information about the child; maltreatment type, with a few examples there; child risk factors; and then, among the child variables, the caregiver risk factors, services, perpetrator data, and some additional fields that we find people often use in research and wanted to highlight. There is also an agency file: these are CAPTA-required items and state-level summary data, like prevention services, referrals and reports, additional information on child victims, child fatalities, and Part C of the Individuals with Disabilities Education Act reporting. AFCARS is the next data set we're going to talk about. Again, both of these received more in-depth trainings in 2019, so I highly recommend checking out that link from Alexandra if you have more questions about what's in these data. I will also say that on our website we have codebooks for these data sets, so if you want to know whether a specific field or variable or type of variable is available, I recommend starting with either the 2019 series or the codebook itself.
So AFCARS is data for youth while they are in foster care. This is case-level information on children who are under the placement and care responsibility of child welfare agencies. Basically, states compile information from their electronic records systems and send it to the Children's Bureau, and the Children's Bureau works with states to correct errors in the data. The variables include things like demographic information, removals, placements, and other case-related information. Some common variables that folks use: date of birth of the child, the child's caretaker, and the foster and adoptive parents. There is self-identified race information on the child and foster parents. There is the date of first and most recent removal, total number of removals and discharge date, date of placement, number of placements, placement location, and then things like the case plan goals, termination of parental rights, and sources of federal financial support. If you are working with multiple years of AFCARS, you can stack the different years, or link these data together. When more than one year of a foster care file is used, there will be duplicate AFCARS IDs, the state foster care ID (StFCID): the child has a record for each year that they are in foster care. If some of you have been with us for a while, Sarah Sernaker, who is leading the second half of today's session, also led an office hours about how to stack the AFCARS files, and that will likely come up in a future office hours as well. And we have NYTD, which comes from the John H. Chafee Foster Care Program for Successful Transition to Adulthood and aims to improve outcomes for youth in foster care who are likely to age out without finding a permanent placement. The law requires that states track services provided to youth, and their outcomes, in order to look at how these different services are potentially helping these youth. Beginning at age 17, youth in foster care are surveyed voluntarily every other year until age 21. We have cohorts from multiple fiscal years, and I think we now have two full cohorts with all three waves. Each child (young adult, in this case) has three waves of data, and the information about the services they used is in a different file. The three waves are more outcome data, like whether the youth has experienced homelessness in that time, whereas the service variables are more about whether they received, say, academic supports. So now I'm going to pass it over to Sarah. [Sarah Sernaker] Thank you, Erin. Hello everyone, I'm Sarah Sernaker, a statistician with NDACAN, so I'll be doing the second half. Erin has given a really good, pretty brief introduction to our admin data, but as she mentioned, we have had presentations in the past about linking data, specifically our admin data, so there is a link to previous years to add more context and information about the data sets themselves, and probably different perspectives on linking. So now, the linking half. Before we get into linking the admin data specifically, I'm going to go over a brief overview of what linking entails and why you might want to link. When we talk about linking data, all we're really saying is that, at the very least, you have two data sets, and those two data sets have some similar attributes. So let's say you have two data sets with measurements at the state level.
Okay, so maybe one is health information and the other is population information, and you want to link them so that they're in the same place and you can utilize both pieces of information at the same time. In that simple example, the linking variable would be state, because that's the common denominator between the data sets. That is the general purpose of linking, and it is very handy in general, not just with our data. So when you're linking data, as I said, you have at least two data sets, usually in different data files, and you want to put all the data together in the same data set to do some analysis. These are general steps, not specific to our data. If you're linking, you should first consult codebooks and user guides, and that goes for any data, any place, any time: you should always have your codebook and user guide open, just to get an understanding of the data itself, how it was measured, how categorical variables are coded, things like that. Once you have an understanding of the data, and you have in mind what you want your linked data to look like, you need to clean and prepare your data to get to the linking step. This is important because, especially with large data sets (which ours can be if you're working with a lot of years), your computer might have a hard time dealing with the raw, uncut data. You also need to clean your data to make the two data sets compatible: if you have overlapping information from both, you want to make sure it is consistent and makes sense. A few steps I've included here. First, narrow your data to include only variables of interest, throwing away variables that are not going to add to your research; they are just going to get in the way, and they might have unintended consequences for linking, which I'll get into a little more. So that's the first point: narrowing your focus to variables of interest. Similarly, filter your data to your scope of interest, for instance if you're only interested in certain years, certain geographies, or certain groups of people. If you're only interested in analysis of New York, there's no sense in keeping all the other states; again, it's computationally easier, and things might get in your way in ways you haven't accounted for. So narrow your data not only to the variables but to the scope of interest. And, what I was getting at before: standardize the variables that will be linked. Maybe in one data set your states are listed as abbreviations, and in the other data set they are listed as state FIPS codes. To you as a human, you know those are the same thing, they're states, but your programming language and your computer will not know they are the same unless you standardize them. So either both data sets need to have abbreviations, or they should both have FIPS codes. Standardize variables to enable linkage, but also standardize the data sets themselves, making sure that categories are consistent with what you know and understand and with any other data sets you might be using.
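To make that standardization step concrete, here is a minimal sketch in R with made-up data (the table names, variable names, and values are invented for illustration; this is not NDACAN data). One table identifies states by abbreviation, the other by FIPS code, and a small crosswalk puts them on the same footing:

  library(tidyverse)

  # Two made-up tables: one keyed by state abbreviation, one by FIPS code.
  health <- tibble(st = c("AL", "AK", "AZ"), uninsured_pct = c(10, 12, 11))
  pop    <- tibble(stfips = c(1, 2, 4), population = c(5000000, 730000, 7300000))

  # A small crosswalk maps abbreviations to FIPS codes.
  xwalk <- tibble(st = c("AL", "AK", "AZ"), stfips = c(1, 2, 4))

  # After this join, health carries stfips too, so it can be linked with pop.
  health_std <- health %>% left_join(xwalk, by = "st")
  health_std %>% left_join(pop, by = "stfips")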
And then missing value codes are always an issue. A lot of our data include a code like "99" instead of "missing," and those are things you should be aware of, so that when you run a model your programming language doesn't see "99" and think it is a valid value when it should be missing. So that was a long spiel for cleaning, but I will say: in any project, cleaning and preparing your data is more than half the work, almost every single time. It is not a step to be taken lightly or skipped over; a lot of thoughtful research processing and preparation goes into it. So once you have your data cleaned (you have at least two data sets you want to link, you've cleaned them both, and you know what you want them to look like), then you need to collapse, or resolve, your data. What I mean by this is that when you're linking data, you should be linking on a linking variable; in the example I gave, you'd be linking on states. But sometimes data do not come as one row per linking variable. For instance, in the states example I keep going back to, maybe you have multiple observations per state, multiple rows per state over multiple years, and you're linking to data that has a single year. You would need to collapse the data that has multiple rows per state down to one row per state, and that is going to facilitate your linking. This especially helps in the context of our data, because our data are usually given at the child level, and people are often interested in state-level or county-level analyses, so in that context you'd be collapsing data from the individual level to state-level information. We'll get into that a little more with the example, too. Next, rename the linking variable to be consistent. This is related to what I meant before: let's say in one data set your variable is named state and in the other it is st. Your programming language is going to think those are two different variables, so simply make sure the variable to be linked on is named the same in both data sets. And then, once you've gone through these steps (you've cleaned your data, you've collapsed it, you've made sure all the variable names are as they should be), I find it helpful to save a clean data set as a new data set. So you have your old raw one, your big one, and then you have your data set ready for linkage. Again, that is for computational reasons: you have a smaller subset of data that is ready for linking, and it gets all the other junk out of the way so you can just use your linking-ready data set. There is a small sketch of the collapse and rename steps just below. So: you're ready to link, and you know you want the data linked, but how does it work? I keep saying linking, but people also use the term joining. I bring it up because joining is the term more often found in programming language functions (in R, for instance, they call it a join), so the terms are completely interchangeable; when I say linking or joining, it means the same thing. And there are different ways to join data, which I have a nice visualization of on the next slide, but the point is: you have one data set and you want to link it with another. One data set might have more years than the other; one might not be able to match on some of the values, so you might get missing values in some variables where rows can't be linked.
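Before moving to the join types, here is the promised minimal sketch of the collapse and rename steps, again with made-up data and invented names:

  library(tidyverse)

  # Made-up example: multiple rows per state, to be collapsed to one row per state.
  visits <- tibble(state = c("NY", "NY", "PA"),
                   year = c(2019, 2020, 2020),
                   n_visits = c(100, 120, 80))

  visits_collapsed <- visits %>%
    group_by(state) %>%
    summarise(n_visits = sum(n_visits)) %>%  # collapse: one row per state
    rename(st = state)                       # rename to match the other table's linking variable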
Hopefully this will all become clearer with the example and with this nice figure. What I mean by different types of linking: think of table 1 as one data set and table 2 as another data set. What's called an inner join is when you take these two data sets and join them, but you keep only what was found in both. Let's say one data set is missing DC and Puerto Rico, and the other data set has all 50 states plus Puerto Rico. If you did an inner join, your final data would have the 50 states; it would not include DC or Puerto Rico, because they were not found in both tables. So an inner join keeps only what can be linked from both tables. Then we have the left join and the right join, and these are basically the same thing, holding table 1 and joining table 2, and vice versa; two sides of the same coin. Let's focus on the left join. With a left join you're saying: I have one data set and I want to bring in this extra data, but maybe it is auxiliary information, not as important; I want to keep all of my data from the first table and just add what I can from the second table. It may not take everything from the second table; it only keeps what was linked. So observations from table 2 that could not be linked to table 1 would just be dropped. And similarly with a right join. Then a full outer join, which is what I like to do, says: I want to link the data and keep all of the observations; it doesn't matter whether they linked between the two or not, just keep everything. Why I like this is that the inner or left joins can have unintended consequences: you could lose information you didn't mean to, by accident. With the full outer join you keep all observations from both tables, and then you can see where the missing values are: if there is a variable from table 2 that was not linked with table 1, those rows will just have missing values, and you can observe, okay, I have these values for my table 2 rows, but for table 1 it's missing. That is informative on its own, you don't lose data, and in the end, once you've done the full outer join, you can filter down to what the other joins would have yielded (there is a small sketch of all four joins just below). So those are all the different types of linking, and I introduce the terms also because, if you're having trouble with joining practically, in programming, these are terms you can Google. When you're programming, Google is your friend, and these are terms commonly used across programming languages that you can Google if you're running into problems. So: you're ready to make the link, and you know which type of link or join you want to do. How do you actually do it? Luckily, most common programming languages do it really simply. In Stata it usually comes down to one line of code: you use the merge command; there are a few specifications, but it is pretty straightforward. In R there is also a merge function. I have two bullets here; that is the base R function, so if you've just downloaded R and know nothing else about it, that is probably what you would use.
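Here is that sketch: two made-up tables, using the Tidyverse join functions that are introduced next, so you can see exactly which rows survive each join:

  library(tidyverse)

  t1 <- tibble(st = c("NY", "PA", "DC"), a = 1:3)
  t2 <- tibble(st = c("NY", "PA", "PR"), b = 4:6)

  inner_join(t1, t2, by = "st")  # NY, PA: only rows found in both tables
  left_join(t1, t2, by = "st")   # NY, PA, DC: b is NA for DC
  right_join(t1, t2, by = "st")  # NY, PA, PR: a is NA for PR
  full_join(t1, t2, by = "st")   # NY, PA, DC, PR: NAs show what each side lacked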
If you're a little more familiar with R, there is the Tidyverse, a set of really nice packages and functions within R, and its join functions are named exactly as I introduced them: left_join, right_join, full_join, inner_join. I prefer the Tidyverse, so that is what our examples are going to be in, and we'll dive into that shortly. In SPSS there is the MATCH FILES command; I did a linking talk two years ago that was in SPSS, so if you're trying to link in SPSS, that presentation from two years ago is on our website. And in SAS there is the MERGE statement within a DATA step. All pretty similar. I can't think of another language you might be using, so hopefully this covers most of what everyone here uses. And if you ever have problems, start with the help functions: if you're in one of these programming languages and aren't sure how to use a function, the documentation should be your first place to start. Okay, so I've gone through the overview, and now I have an actual example in the context of our data. So this is all within the context of our admin data: let's say you want to link AFCARS Foster Care and the NCANDS Child File, specifically AFCARS foster care data from 2020 to the NCANDS Child File from 2015 to 2020. Say you have a research interest in children who were in foster care in 2020, and (I have a "why" here) you want to understand their history of maltreatment and risk factors over the five years leading up to 2020. That might be a valid research question: you're trying to understand maltreatment history from NCANDS and how it might relate to a child being in foster care, which comes from our AFCARS data. NCANDS, as Erin briefly described, comes as one data set per year, and the data are organized by child and report: within a year, you have one unique row per child-report pair. A child could show up multiple times within the year if they are on different reports, and you could have multiple children on the same report. NCANDS takes a minute of sitting down and thinking about the structure; it is a little tricky. So that's NCANDS. Step one would be to aggregate the years of NCANDS into one data set. We want to collapse the five-year span of information into, in essence, one summary row per child as of 2020: aggregating all of the NCANDS data into one row per child, summarizing their maltreatment and risk factors over the past five years. That is the first step for NCANDS. Next we would need to clean NCANDS and AFCARS. Since we're only working with one year of AFCARS, and a child only shows up once in the AFCARS file in a year, you don't need to worry so much about the aggregating step for AFCARS. It looks like my step two was what I was discussing in step one; I think my step one just meant putting all the years together. Sorry, this is a step aside: with NCANDS, if you were to order it from us, you would get individual files per year, and if you wanted them all in one data set, you could simply stack them. It is not really linking at that point, because you're just stacking one data set on top of the other.
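A minimal sketch of that stacking step (the file names here are hypothetical; substitute whatever your per-year files are actually called):

  library(tidyverse)
  library(data.table)

  # Hypothetical per-year file names for 2015 through 2020.
  files <- sprintf("ncands_%d.csv", 2015:2020)

  # Read each year, then stack: rows are appended, nothing is matched.
  ncands_all <- files %>%
    lapply(fread, na.strings = "NULL") %>%
    bind_rows()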
So that would be aggregating; then you would clean it; and then, as I was discussing before, you would want to collapse the NCANDS to one row per child, and then it would be ready to link with AFCARS. I have actually already gone through steps one and two, for a few reasons: it is not trivial, and it took more time than I could fit into this presentation; and second, I simplified the data, because our data are sensitive and we really do not want any disclosure risk. So I went in and aggregated, cleaned, and collapsed (no, I didn't collapse); I've altered the data from what you would receive from us. Sorry, I'm just skipping over to the code here. That is all to say I've done some processing, but I want to jump in and show you all what I mean. So let me take a beat: we're in R right now, and I've jumped into R pretty abruptly, sorry about that. I'm not going to give much background on R, unfortunately; we just don't have the time to go over basic programming steps, so this assumes a little familiarity with programming. You don't necessarily need to know R, but a little programming will help you understand what's happening here. First, let's run some of this. I'm setting my working directory to where my files live. [ONSCREEN code displayed setwd("C:/Users/ss1216/Box/NDACAN/Presentations/Summer Series 2022/S2 linking/") ] I'm going to run that, and then I'm loading the libraries I'm going to use. [ONSCREEN code displayed library(tidyverse)] As I mentioned briefly, I'm going to be using functions from what's called the Tidyverse library. This is a little contentious: some people really love the Tidyverse (I love the Tidyverse) and some people really don't. There is definitely a learning curve, but I prefer it because I think it makes code a little easier to digest, and once you get used to the syntax, I think it actually makes things easier. [ONSCREEN code displayed library(data.table)] So I'm going to load that, and then data.table is a nice package for loading data: it includes the function fread, which simply reads data faster than the normal built-in functions, and that can help if you're dealing with large data sets; they'll be read into your program quicker and easier, without snags. If you did not already have these libraries, you would need to install them. Just to take a step back: if you needed to install, for instance, the Tidyverse, this is the syntax [ONSCREEN code displayed install.packages("tidyverse")]; it looks like install.packages, and you might get a pop-up. I'm going to exit out because it is already installed. Any pop-ups you get from R about installing packages, I just click OK; it is all kosher and safe. I've never had any issues downloading packages, or anything nefarious that way. Okay, so we're in the working directory and we have our libraries set; now we're just going to load the data. Right, we're ready for linking; we have our NCANDS and AFCARS. What I'm doing here is reading in the data [ONSCREEN code displayed ncands = fread("ncands_2015_2020_ex.csv",na.strings = "NULL") %>% rename_all(tolower)] and telling R that if there is any value that says NULL (literally, if you see the value NULL in the data), it should just be missing.
We as humans know NULL means no data, nothing's there, but we need to tell R to fill in missing for those, because otherwise R will just take them as character values. And then what I have here is called the pipe [ONSCREEN code highlighted %>%]. This is where the Tidyverse pops up right away. The quirk of the Tidyverse is that you have these piping operators, and what they do is let you pile functions onto data. Here I'm loading in the data and then telling R: okay, take my loaded-in data and apply this function to it. And what is this function? Rename all my variables to lowercase. It just makes things easier: you don't have to worry about case sensitivity or switching between uppercase and lowercase, so that's a preference I have. So I'm loading in the data, telling R I'm not done with it yet, run it through my next function (that's what the pipe is doing), and renaming all my variables to lowercase, just because Sarah likes it like that. That is truly the only reason I have it here. So I'm going to run this. Okay. Notice this is a pretty big data set; this is over five years, and it is a subsample, since, as I said, I've done some processing. It is only 22 variables, but if you were to order the data directly from us, I think it is something like 140 variables. These are just things to be aware of: if you were to order the data from us now, it is not going to look exactly like this; it is going to look a little messier, but much the same. Okay, so I've loaded in the data, and with R it is nice: you can see in your environment I have my AFCARS, I have my NCANDS, and you get a little summary, with how many observations and how many variables. But what I recommend doing is a quick check. Down here in the console I have highlighted [ONSCREEN code highlighted head(ncands)] the head function on NCANDS, which gives you the first six rows by default, and this is great: once you load data in, you should always look at it. [ONSCREEN output table with 6 rows of data under these 22 variables staterr, stfcid, rptsrc, rptdisp, chsex, chlvng, chmal1, mal1lev, chmal2, mal2lev, chmal3, mal3lev, chmal4, mal4lev, fcalc, fcdrug, per1rel, per2rel, per3rel, rptvictim, raceethn, year] It doesn't matter if you know exactly what it looks like; you should look at it as soon as you load it in, because you want to make sure nothing has gone wrong. Sometimes things just unexpectedly go wrong with the computer. This is just to make sure: okay, I've got my variables. I have report source here (these are variables I've chosen behind the scenes); report disposition, whether the maltreatment was confirmed or not; the child's sex; the child's living situation, which I think is where they are currently living, for instance a foster care family or congregate care. Then chmal1 and mal1lev and so on: these are the different maltreatments a child could have experienced, and they could have experienced up to four maltreatments on one report. And then we have some risk factors (I have alcohol and drug risk factors in here), plus race and year. As I said, this is 2015 to 2020, so notice we have a bunch of different years and children from different years.
We are observing the same child sometimes over different years or, in this case, within the same year: notice rows one and two are the same child, as indicated by the state foster care ID [ONSCREEN a seven-digit number], within the same year. That means this child showed up on two reports for maltreatment in the same year, and that is why NCANDS needs a bit of extra processing time: it has this tricky structure. And just to reiterate: this has been processed, and these are falsified. These are not true state foster care IDs; I have altered them. These are not even the correct states or data; they have been adjusted to mask identity even further. But this is the type of data you would see: a simplified, clean subset of what you would get if you ordered it from us. Okay, so that's NCANDS; let's take a peek at AFCARS. [ONSCREEN output table with 6 rows of data under these 41 variables staterr, stfcid, fy, sex, amiakn, asian, blkafram, hawaiipi, white, untodetm, hisorgin, clindis, mr, vishear, phydis, emotdist, othermed, totalrem, numplep, phyabuse, sexabuse, neglect, aaparent, daparent, aachild, dachild, childis, chbehprb, nocope, abandmnt, relinqsh, housing, curplset, placeout, casegoal, ctkfamst, fosfamst, disreasn, ageatlatrem, ageatend, inatend] For AFCARS, I think I removed the date variables, but otherwise I left everything. Again, the state foster care IDs have been altered; these are not true values. I'll just add that caveat. So this is AFCARS, and we're just checking our variables. Notice we have some NAs here and there, and notice we have 99s: as I noted before, in some variables that is the placeholder for missing values. This 99 is not a true 99; it is a placeholder for a missing value, there is no information there, and it should not be used in any sort of analysis. Okay, so all we've done is load data, and we're making sure our data look good, we understand our data, and this is what we expected when we loaded it in. That's all we're doing here. The next step was to clean the data. I did a little bit of cleaning already, but I left some for fun and for demonstration purposes. And it looks like I had already written head (past Sarah was looking out for future Sarah), but I did head. Another function I like to run is summary [ONSCREEN code displayed summary(afcars)]. Again, this is about understanding your data: summary gives you summary statistics for every variable [ONSCREEN multiple small tables for each variable containing data under these headers min., 1st Qu., Median, Mean, 3rd Qu., Max], and this is nice because it tells you the min and max, so you can identify which variables have that 99 placeholder, and which variables have missing values and how many. It is just a good check: if you expect a variable to be zero/one, the min should be zero and the max should be one; that's a sanity check. For instance, min is zero and max is one: that's a zero/one variable. This is all to make sure, again, that your data look good and are as they should be. Just because your code runs does not mean it is right, and these are the checks you should do to make sure things are right.
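If you want checks that go beyond eyeballing, here is a small sketch of the kind of assertions you can run after loading; the variable names follow this example's data, so adjust them to your own extract:

  # The file should only cover fiscal years 2015 through 2020.
  stopifnot(all(ncands$year %in% 2015:2020))

  # Tabulate codes, including NA, to spot placeholders like 9 or 99 before modeling.
  table(ncands$chsex, useNA = "ifany")

  # Min and max flag 99-style placeholders at a glance.
  summary(afcars$totalrem)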
And then the last little summary function I'm going to introduce is str [ONSCREEN code displayed str(ncands)], which gives you a sense of what the data types are. This is important because, for instance, suppose you're trying to link [ONSCREEN a list of the ncands variables and next to each is one of the following two data format codes chr, int] on, let's say, a state FIPS code. That should be numeric: a state FIPS code is just a numeric code for each state, a standardized value that all states have, one through, I think, 56. If your state FIPS code in one data set is a character and your state FIPS code in the other data set is a numeric value, you won't be able to link on them, because your programming language is going to say they are incompatible. So these, again, are checks you're doing to make sure your data are ready: by the time you link, the data should be ready to go and should link the way you expect. And like I said, this cleaning process takes time: I've been in R for ten minutes and we haven't even gotten to any sort of linking. These are important steps; they are not the fun meat on the bone, but they need to be done. Okay, so those are the three: head, summary, str. Now that we understand the data, we need to clean it, and we've identified missing values and missing placeholders. There is not really much else to clean here, so the only thing I'm going to clean in our data is to set missing codes to NA. I want to take any variable that has a 99, and this is with consultation from the codebook; sometimes 99 is valid, but in our data 99 almost always stands for missing. You'll also learn from the codebook that some variables use nine as the missing code: for instance, for child sex or these risk factors, a 9 is not a valid value, it is just another placeholder. Again, this is all information you would pull from your codebook. [ONSCREEN code displayed ncands2 = ncands %>% mutate_at(vars(rptsrc,rptdisp, chlvng, mal1lev,raceethn), ~ifelse(. == 99, NA, .)) %>% mutate_at(vars(chsex,fcalc,fcdrug), ~ifelse(. == 9, NA, .))] Okay, let's take this step by step. I'm taking NCANDS, and what I want to do is set any placeholder for missing to just missing, because it will interfere with any analysis you want to do. I'm telling R: okay, take my data set NCANDS (this is my pipe), and we're going to mutate at these specific variables. I've chosen these specific variables because, if I scroll up a little bit to the summary, these were the variables I identified with summary as having 99 values. [ONSCREEN code highlighted rptsrc, rptdisp, chlvng] So report source, report disposition, child living situation: I've systematically gone through, in R and in the codebook, and by knowing the data, to find which variables use 99 as a placeholder. So I'm saying: take these variables that use 99 as a placeholder. This period is the placeholder for each variable in turn; R is going to go through each variable, so if report source equals 99, set it to NA; otherwise just leave it as it is. Because if it is not 99, then it is a valid value.
Okay, so it is going to go through each of the variables listed here and replace all those. Similarly, I'm mutating this second set of variables; it is separated because they don't use 99 as a placeholder, they use 9. So I'm saying: for these variables, if they equal 9, set them to missing; otherwise they're fine, leave them be. That is why I have this mutate twice, over different sets of variables. Again, this is all fancy code just to replace 99s and 9s with NAs. So I'm going to run that, and before I do, I just want to point out: look at our summary; we have these maxes of 99, and there is no missing in these variables. So let me run this, and then we'll do [ONSCREEN code displayed summary(ncands2)] summary of ncands2 down here in my console [ONSCREEN output multiple small tables for each variable containing data under these headers min., 1st Qu., Median, Mean, 3rd Qu., Max], again just to make sure it did what we want. Notice report source's max is 88; if you look in the codebook, that is a valid code. And we have more NAs now; I don't think we had any before. Report disposition is another one we changed: the max is now five, and we have missing values. So again: every time you run something (and I've been programming for years), I do basic checks like this. Just because it ran doesn't mean it did what you wanted; you have to check that it did. So, just being really thorough here. That was all NCANDS. I'm going to fly through AFCARS because it is the same thing: a general cleaning step for demonstration purposes, one cleaning step out of many that I've already done beforehand. So you would do your summary; I've done NCANDS, now I want to do AFCARS. I can delete this, since I don't want to confuse things; we're using the same function here. I've identified these variables [ONSCREEN code displayed afcars2 = afcars %>% dplyr::mutate_at(vars(disreasn, ageatlatrem,ageatend, raceethn, race), ~ifelse(. == 99, NA, .)) %>% mutate(afcars = 1)] that use 99 as a placeholder, and I'm saying: if they equal 99, make them missing; otherwise they're fine. Another thing I've done: when you do linking in R, it is not like some other programming languages, which will tell you which values were linked. You kind of have to observe it yourself, by looking at missingness or by doing more summary checks like this. So what I'm doing here is explicitly creating a variable called afcars, which is just one for every row in AFCARS. This is my way in R to keep track, when we get to linking, of which records came from AFCARS and could be linked to NCANDS [ONSCREEN code highlighted %>% mutate(afcars = 1)]. I'm creating it as a flag to myself to say: okay, this is AFCARS data. It is kind of silly on its own (they're all one, and it is all AFCARS data), but it is going to be helpful when we get to linking. So I'm going to run this [ONSCREEN code highlighted afcars2 = afcars %>% dplyr::mutate_at(vars(disreasn, ageatlatrem,ageatend, raceethn, race), ~ifelse(. == 99, NA, .)) %>% mutate(afcars = 1)], and I have gone through this already, so I'm not going to run the summaries, because I inevitably run out of time; we're going to assume that worked well and keep moving. So we have our cleaned AFCARS and NCANDS data. Notice I've called them afcars2 and ncands2: I like to index by numbers and not overwrite things.
Overwriting can lead to unanticipated problems, and it is best to reduce any potential for errors; that is why I like to create new data sets rather than overwrite. Okay, so we've cleaned the data, and AFCARS is ready to go; AFCARS is one row per child. But NCANDS, remember, covers five years; that is why we have about 2.5 million observations: it is five years' worth of data. So now we need to summarize. We want to summarize a child's five years of risk factors and maltreatment, and we need to decide what kind of summarization we want for each variable. What I mean is: we know we want one row per child in NCANDS, and to do that, we need to summarize for each child. Say a child shows up in five years and has information for each of those years; we need a way to reduce those five years of information to just one summary. So instead of five years of indicators (maybe they were removed in 2015 due to physical abuse and in 2016 due to neglect), maybe our summary variable just says: were they ever removed due to physical abuse, were they ever removed due to neglect? These are summaries that are not year by year, but rather "were they ever," or "what was the total number of times," instead of a year-by-year view. That is what I mean by summary, and hopefully that makes sense, because I'm going to dive into this code, which gets a little tricky, so let's take it one step at a time. We have our ncands2 data, which is cleaned; I did most of the cleaning behind the scenes, but we replaced the missing values. And what I'm telling R [ONSCREEN code displayed ncands_collapse = ncands2 %>% group_by(staterr,stfcid)%>% summarise(fstrpt = min(year), lstrpt = max(year), numrpts = n(), chsex = min(chsex), substrpts = sum(rptvictim == 1,na.rm = TRUE), unsubstrpts = sum(rptvictim == 0,na.rm = TRUE), parentabuse = sum(per1rel == 1 | per2rel == 1 | per3rel == 1), rptneglect = sum(chmal1 %in% 2:3 | chmal2 %in% 2:3 | chmal3 %in% 2:3), fcalc = min(fcalc), ncands = 1)%>%mutate_if(is.numeric, ~ifelse(is.na(.),0,.))] is: okay, take my clean data and group it by state and foster care ID, because we want to work at the child level. I've included state here [ONSCREEN code highlighted staterr]; it is kind of redundant, and I could remove it, but I'm going to keep it because I don't know whether removing it would break any code. It is redundant because state is embedded within the state foster care ID, so each foster care ID is unique within each state; I think I kept it because it keeps the variable around, and I'll get to that in a second. So we are grouping by child: for each child we want one row, and we want to summarize information by child. That is my group_by statement. I'm saying: take my data, and we're looking within each child, so R is going to say, okay, for each foster care ID [ONSCREEN code highlighted stfcid], I'm going to look at those rows. Once we're per child, we're going to summarize; like I said, we're summarizing the last five years into one row of data. I've just picked out a few variables here. It is a little tedious, which is why it is helpful to identify the variables you will need and the ones you won't, because some of these steps, including this summarise, are tedious; so I've included just a few points of interest.
So instead of having each individual year for the child, I said: okay, what is the first report [ONSCREEN code highlighted fstrpt = min(year)] within this five-year frame? That is a summary value: the year of their first report, taking the min. Okay, when was their last, or most recent, report [ONSCREEN code highlighted lstrpt = max(year)]? We take the max. Again, think of R as looking at each child's information. A child could have a lot of rows in this data, because they could appear multiple times within a year: they could have five rows of data, one for each year, or ten rows if they have multiple reports over the years. R is taking all that; R is going child by child, looking at the full extent per child, and now we're summarizing. So over the child's records: when was their first report (the min, the first year observed)? When was the last report? We can count the number of reports: this n() [ONSCREEN code highlighted numrpts = n()] is handy; it is just n and open-close parens, and it tells you the number of observations within each group from your group_by statement, so here, how many rows of data we observe for each child in this five-year period. Child sex: we usually default to taking the min of child sex; I'm going to breeze over that quickly because I'm running out of time. You can take the number of substantiated reports [ONSCREEN code highlighted substrpts = sum(rptvictim == 1,na.rm = TRUE)]: you could show up on a report and not be a victim, so this is asking how many times they were a victim, adding up the number of times they were observed to have been a report victim. And in parallel [ONSCREEN code highlighted unsubstrpts = sum(rptvictim == 0,na.rm = TRUE)], the number of times they weren't a victim. Then, whether they have ever experienced parental abuse [ONSCREEN code highlighted parentabuse = sum(per1rel == 1 | per2rel == 1 | per3rel == 1)]: the number of times a perpetrator is listed as a parent [ONSCREEN code highlighted per1rel == 1]. And the number of times [ONSCREEN code highlighted rptneglect = sum(chmal1 %in% 2:3 | chmal2 %in% 2:3 | chmal3 %in% 2:3)] they have been on reports due to neglect: adding up the number of times their maltreatment is listed as neglect, that is, the child maltreatment code equals two or three, which, if you look in our codebook, corresponds to neglect. Then whether they have ever had an alcohol risk factor [ONSCREEN code highlighted fcalc = min(fcalc)]; I think this is the parent, whether their parent has an alcohol risk factor. And then this last one is our NCANDS indicator, similar to our AFCARS one above: I'm putting in this indicator [ONSCREEN code highlighted ncands = 1)%>%] so that when we do get to linking, we can keep track of where each row of data could be found, whether in NCANDS, AFCARS, or both. So this chunk of code summarizes a child's experience over five years. I have definitely breezed through some of it, but the code is available and I'm available over email, so I'm going to keep going. And then [ONSCREEN code highlighted mutate_if(is.numeric, ~ifelse(is.na(.),0,.))] this last piece says: now that we've summarized a child's experience over the five years, take each variable, and if it is numeric and missing, make it zero.
I don't remember why I did that; I wrote this code a few months ago. I'm going to breeze over it and just run this; I don't think it is important, and I might have just done it as a preference. Notice it is taking a while, because R is going through each child, each foster care ID, and doing this summary. It is as if you were at a computer looking up each kid and then summarizing: R is systematically going through each child and summarizing their experience, in essence. And notice, this is about two and a half million observations, which is not a huge data set in terms of the big-data world we live in today, and it is taking a few minutes. This is why reducing your data, filtering, is going to help computation. I think I had originally built this example with more years, but it took way too long to run. So these computational considerations are not to be ignored, and we can't do anything else until it stops running. But we can peek ahead while that is still running; we're collapsing our NCANDS. [Erin McCauley] Sarah, we have about two minutes until Q&A, FYI. [Sarah Sernaker] Okay, that's running. Like I said before, we don't need to adjust AFCARS, because we're only using one year of AFCARS and a child only appears once per year in AFCARS. I wrote here [ONSCREEN code highlighted afcars2 %>% group_by(stfcid) %>% summarise(n = n()) %>% summary] that we don't need to change it, but you should still double-check, so again, checking. We're grouping AFCARS by the state foster care ID, seeing how many observations there are per child, and then getting a summary [ONSCREEN output summary table for the per-child count n showing min., 1st Qu., Median, Mean, 3rd Qu., and Max all equal to 1]. Notice every child has one observation, so this is, again, a sanity check. Okay, notice that was all cleaning, all of that; now we're getting to the linking. As I said before, the reason I've taken so long is that this is literally one line of code [ONSCREEN code displayed linked = ncands_collapse %>% full_join(afcars2, by = "stfcid")]: once your data are ready, you just need to tell R to join them up. We're saying: take my collapsed data, full join it with AFCARS, and you need to tell it "by" what. You don't strictly have to; R can intuit it, and it will find the variables with the same names, but sometimes it finds variables you don't mean to link on, so it is good to specify what you're linking by. We're linking by child. So I'm going to run this, and that's it: we linked it. And we can see how many successful links there were, that is, how many kids in NCANDS could be found in AFCARS, and vice versa. This is why I have those indicators. We're asking which kids can be found in both data sets, and that is about 70,000 [ONSCREEN output 70898]. Compared to our full AFCARS data, that means roughly 3,500 kids in foster care in 2020 did not have any previous records in NCANDS that could be linked. And, as I introduced earlier, we have what we call the inner join [ONSCREEN code displayed inner_link = ncands_collapse %>% inner_join(afcars2, by = "stfcid") nrow(inner_link)]; it is the same sort of syntax, but notice this inner_link is only 70,898 rows: strictly the ones that could be found in both. Remember, the inner join keeps only what can be linked in both.
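This is where those ncands and afcars indicator flags pay off. As a small sketch, consistent with the filtering shown a moment later, you can cut the full join down to what the other join types would have returned; filter() drops rows where the flag is NA, meaning rows the other file didn't have:

  linked %>% filter(ncands == 1 & afcars == 1)  # the inner join: children in both files
  linked %>% filter(ncands == 1)                # the left join: every NCANDS child
  linked %>% filter(afcars == 1)                # the right join: everyone in foster care in 2020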
And then I guess I’ll just put a pin in that I mean this is here just to see just to get a sense of well let me let me just run this actually. This is to get a sense of how many observations in AFCARS can't be matched to NCANDS [ONSCREEN output 3543] so that's about three point thirty five hundred. [ONSCREEN code displayed linked %>% filter(ncands == 1 & is.na(afcars)) %>% count()] and then how many records in NCANDS can't be matched to AFCARS is what I'm into right here that was a lot more because we had a lot more data right. So I'm just going to re-save this let me just stop there for Q and A. [Erin McCauley] So I’ve answered a few of the questions written in the chat including a question from Catherine about recommendations for software for using multiple years because of issues combining the size of the data. I recommended that it was probably not about software but about computing power and so to try and reduce the data as much as possible you saw how much of Sarah's time went into that data cleaning. And then using institutional computing power if you have access to it through either a vpn or a computer on campus but we did have another question I was hoping to answer [Sarah Sernaker] Well I will say that I mean some programming languages do handle bigger data sets better than others. I think SAS is probably like you're I don't know not really gold standard it's probably the best as far as the ones I listed to handle bigger data sets. R is probably on the like the worse end of things it can't really handle huge data sets this is kind of pushing it to be honest. So that's like a downfall of R downside. [Erin McCauley] I've never heard Sarah give a downside of R so good job Catherine we have another question from Abby: Do AFCARS data include every placement for children or does it include only placement information for children on the dates of reporting example March 30th or September 30th? [Sarah Sernaker] It will give the most current placement as of the last reporting date so as of the whenever the last report came in and where the child was placed so there is there's really no information if they move around a lot it's just going to be the last placement by the time the report went in. If that makes sense. [Erin McCauley] Thank you for that so we are at time but I did want to highlight that next week, doctor Frank Edwards from Rutgers is going to be coming in and link and doing a workshop about linking NDACAN data. You're the same data sets you saw here today to external data. And so that you know would be things like from the American Community Survey or the Census data so if you're interested in linking I recommend coming back next week to see that workshop as well. And you know we're we're always available via email and during the academic year during office hours so thank you so much for joining us today and thank you so much Sarah for that lovely presentation. [Sarah Sernaker] Thank you here's my email up again [sarah.sernaker@duke.edu] just real quick before we all leave and Erin's too [ejm354@cornell.edu]. [Erin McCauley] Yeah if you have follow follow-up questions let us know. We also have a formal technical support email [ndacansupport@cornell.edu] which you can always send questions into as well if you're struggling with linking or coding. And then we also if you look at our website have you know all the prior Summer Training Series workshops on linking as well as some pdfs that give you kind of steps of linking and then code and in different software. 
See you all next week. [Sarah Sernaker] Thanks, Erin. [voiceover, then musical cue] The National Data Archive on Child Abuse and Neglect is a collaboration between Cornell University and Duke University. Funding for NDACAN is provided by the Children's Bureau, an Office of the Administration for Children and Families.