[Erin McCauley] [Michael Dineen] National Data Archive on Child Abuse and Neglect [Erin McCauley] okay everyone welcome to another session of the Summer of NYTD series we're really happy to have you guys here with us today. So this is the fifth in our series that's quite exciting and again this is the Summer of NYTD series. NYTD is one of the data sets we have in the National Data Archive on Child Abuse and Neglect in the Bronfenbrenner Center for Translational Research at Cornell University. And we're really lucky to have Michael Dineen giving us our presentation today so he's going to be talking about linking the NCANDS and AFCARS with our NYTD data set we've been hearing so much about. This is a really unique capacity of our data set and one of the things that I think is gives us a competitive edge and can help us answer some really pressing questions that we often are unable to do in this field. In the next two weeks will be more research presentations so I'll be giving the next presentation on on in progress research project and we'll hear a little bit more about that at the end and then after that we'll have one of our very esteemed professors who used uses our data set quite frequently kind of go over the strengths and weaknesses as she sees it and then tells about some of the research projects that she's published. So we're lucky to have her as well. And Michael is going to be taking it away so thanks everyone. [Michael Dineen] hi everybody this is Michael Dineen times of resident manager for the NYTD and the NCANDS and the AFCARS data sets so I've had some extensive experience unlinking things data sets. This is a family of data sets all managed are administered by the Children's Bureau. The AFCARS has to the AFCARS has two files the Foster Care File in the adoption file. Then there's NYTD witches the one that we are talking about today there's the outcomes file in the service file. The outcomes file is a series of surveys done with children who are probably aged out of foster care. It starts at age 17 in certain years kids who are in foster care at age 17 and then it follows them up at 19 and 21 so it's a set of three surveys and there are multiple cohorts of that. There was a 2011 one there's a 2014 and a 2017 and those will go on after every three years they'll start a new cohort of three outcomes surveys. Then there's the services file. The services file has a record for every child who receives services under the NYTD program which is aimed at kids who are likely to age out of foster care. The services file has a lot more records than the outcomes file and so we provide that separately. Then there's the NCANDS which has the Child File which is every report of maltreatment in the United States by fiscal year. Then there's the agency file which is just a state-level summary so there's only 52 records in that file but it's a summary by state. So those three data sets make up the family that can be linked together through the child identifiers. So the main thing you need to keep in mind if you want to link data sets is that the main work is preparing the linked file before you ever do the link. So you have to I'm going to show you some steps are some things to look out for when you're doing that. So with the AFCARS foster care file you need to know that we distribute the Foster Care File in one fiscal year batches like once a year we release it. There's also a six-month file but we are ignoring that for now the same thing supply. The fiscal year is denoted by a variable in the Foster Care File called FY and that's important when you are stacking foster care files by year because it's the FY variable you'll need to distinguish one year from the other. The child's RecNumbr the variable called RecNumbr will appear in every year the child's in foster care even if nothing happens. So just the fact that they are in foster care and no events occurred, there may be, the least thing that could happen is a periodic review because they have a periodic review is required I think it's every six months. So that might be updated but there are times when no other information other than what was in there the previous year. Years of data can be stacked. The procedure in your stats program will probably be called merge or concatenate. The foster care files it's a census every year you have all the children who are in foster care every year and you will have to you'll probably have to restructure the data to get it to one child per row. Not all years can be linked, and it's not all years can be linked for all states sometimes they change the child identifier. So other things to know the main thing is to try to reduce the file to one record per child. And you have to ask yourself about your analysis what kind of information are you going to need out of the Foster Care File. You want to make this file that you are going to link to the smallest possible and as compact as possible. So say you are interested in length of stay that piece of information will be in the most recent record. So the year before that it would have been one year less so you want the most recent record for something like length of stay. And then if you say you are interested in the reason for entry there are multiple reasons for entry and that might be something you are looking for. That information was cleaned in at the time they were admitted into foster care but that same information carries through every year. So it'll be in the most recent record as well the reason for entry it just gets carried along every year. If you need data for like a previous removal then you'll need to look at earlier year's records and you may have to do some sort of restructuring to get one record per child. This table shows you the years. These columns are pairs of years and the question that this table will answer is "Can you link the child between the two years?". In this column I'm looking at can you link the Foster Care File from 2001 to 2002. In other words does that child have the same ID, does the child ID called the RecNumbr in AFCARS, if a child matches on RecNumbr does it also match on birth date and sex. That's how we define that it's the same child if it happens to match on RecNumbr but also match on these other two variables which would not be if RecNumbr was just a random child. The chances are vanishingly unlikely that they would have the same birthdate and sex. So as you see here in this table this, for state of California, you can't link records across 2001 to 2002. So this string of one's means that ever since 2002, 2002 to 2003 you can link this child all the way through our most recent data. This now Connecticut looks like it broke its chain in 2009 so you won't be able to link back a child say from if you wanted to link a child in the 2016 data back to 2007 or 2006 or any of these previous years you can't do it because that record number is different. That's with the zeros you can't link across a zero so this is here to illustrate that you know you can't just go back all the way to the beginning in every state. Some states you can like here Montana goes all the way back to 1998 and other states do. A lot of states don't. Then the NCANDS Child File the child identifier in the Child File is called ChID are child ID. It's not called AFCARSID. There's a separate variable called AFCARSID in the NCANDS Child File which is the variable that you will use to link to the NYTD. You won't use ChID to link to NYTD but you will use ChID to link to previous years of the Child File. A row in the Child File is not just one child per row it's one report-child pair (RptID/ChID). So a report ID can appear multiple times in a year but it would be for a different child each row. And a child can appear multiple times in a year but it would be on a different report but but a report ID and a child ID pair is unique in a particular fiscal year. A multi-year child file may have duplicate report-child rows. The reason for that is because they are can be a record where the record was substantiated and the parents or whoever was found to be the perpetrator challenges that in court and it's reviewed and then it gets a second judgment about whether it was substantiated or not. So in that might be in a different year. so the report-child pair it's possible to appear in multiple if you stack child files you may get duplicate report-child pairs. That's something to be aware of when you are using multiple years of the NCANDS child file. And then again not all years of the Child File can be linked either. And this is a table that shows for the Child File what years can be linked. And there are more problems in the Child File of linking then there were in the foster care. These gray areas hear these gray rows are years when the state didn't even submit data to the Child File. So for something like here for Oregon you've only got like back to 2013 for data on Oregon. And then for Pennsylvania you basically can't do any links because there's only like you can link 2013 to 2014, 2011 to 2012, but for some reason they keep changing their child identifier and you can't link that. Puerto Rico has got similar issues either you can't link it at all are they don't send in the data. Again you cannot link across the zero in this table. And if you are linking to the Child File or to the foster care and you want copies of this table that shows you what years can be linked for each state you can request that from the archive so I'll send it to you. So for the child file again you want to try to reduce the file to one record per child which is not as easy in the child file because in the AFCARS the child is going to be in there every year and some of that information just repeats so you can get a lot of information from the last record. But with the Child File the record is new if the child appears in their multiple times it's because there were multiple reports of maltreatment. Again you have to ask yourself what information you need. If you are interested in history of abuse you are going to need multiple years obviously. If you are interested in something like the number of reports are number of victimizations for a particular child you can bring that to one row if you can summarize, like if you can count the number of victimizations and reduce it to one record per child that would be the way to do it if you were looking for victimizations. Okay that was within data sets, now were going to look if you've prepared your file that you want to link to and then you want to link it to NYTD. So there's a key. A key in a table is a variable that appears only in that one record. So it makes that row unique. So for the AFCARS foster care file, for the NYTD outcomes file and the NYTD services file both the NYTD files and the foster care file, that variables that child ID is called RecNumbr. For the adoption file it's called RecNum. And for the Child File it's called AFCARSID. So these are all the same identifier but they have different names in the different files so that's important to know. The key part two is the state because it's possible that there could a child identifier just a child identifier part that appears in more than one state so what you want is the state as well. So this state ID is also needed it's called St in all of them but the Child File where it's called StaTerr state or territory. The files can be linked on 2 variables well it's a lot easier to just create one variable, a linking variable. The advantages of that is it simplifies your code when you are doing the linking and filled linking variable you create it to have the same name in all the data sets. And it removes confusion resulting from the different variable names. So what I do it's called St state foster care ID so if you can concatenates ST variable for everything but the Child File and then the StaTerr variable for the Child File with the RecNumbr or the AFCARSID and StaTerr go together those are in the Child File and RecNumbr and St go together to create State Foster Care ID, StFCID. So that's what I would recommend you do prior in both of the files that you want to link so that they have the same variable name and it will be a lot easier and a lot clearer. When you are linking you are doing a join and there are different kinds of joins. There is an outer join. All the joins are based on Venn diagrams because you are dealing with sets. Every table is a set. On outer join means the outside, not the part that's overlapped, but the outside of the overlapped part for one of the tables and only the overlapped part for the other table. So when you link with NYTD you are going to want all the kids in the cohort. For the 2011 there's 15,597 so you are going to want to construct a table that has 15,597 records in it. And so you will have to do an outer join to tell the software however it's expressed that you want all the records from the NYTD cohort and only the records from the Child File or the Foster Care File that link to the cohort, that are in the cohort. So here this is a SQL code I don't know I hope some of you are familiar with SQL. SQL is kind of a common language for dealing with sets of data and it's used, pretty much all of the common statistical programs have a way of SQL code in asking for a file or asking for data or manipulating data. Or doing joins. SPSS has a thing called star join which is basically just a way of writing SQL. And Stata has that and I think R I'm pretty sure R has that. SPSS has a proc SQL. So it's a way they've kind of given up on trying to make their own you know doing data management within the the statue language and conceded that SQL is a much better language for data management, like breaking steps down and getting particular doing filtering and group by and that sort of thing. Anyway so in a this is a SQL statement. This statement here SELECT everything from the Foster Care File where the fiscal year is 2011. So then it's going to join that to the cohort, the 2011 cohort, and it's going to bring these variables this first row comes out of the cohort in the second row comes out of the Foster Care File. And this is saying give us only the ones where it's wave one and they are in cohort that's going to be the 15,597. So let's go over to SQL and we are going to run this. This is to illustrate that a left join is going to produce blank rows on the foster care side. This is the kind of things you need to look out for when you are doing a join. So we run this. Oh first let me show you I am just going to run this inner one and to show you that when we pull data from the 2011 foster care file, look down here we have 643,000 records that we are pulling that were in in the Foster Care File in 2011. So by doing a left join we are saying give us all the records from the cohort and only the records in the foster care file that match. So when I do that and say just give us outcome looked down here there's our 15,597 records. So you needs to do the left join on that. If I took the left out and just did a join which is an inner join I would get fewer records then I would when I do the left join because I'm not getting all the records out of the cohort I'm only getting some of the records out of the cohort. So let's go back. So the results of this query would be something like you are getting all your NYTD variables and some of the AFCARS variables are missing. The reason that some of the AFCARS variables are missing is because this child from Kansas or this child from Louisiana are not in AFCARS 2011. They are in some other year of AFCARS. So if you want to get those you say I want to get those other records that were in some other year then we widened our window here we widened our window to 2010 to 2015. And let's run that query. Now we are getting instead of getting... When we ran a left join on 2011, we got our 15,597 but if we do it on between 2011 to 2015, with gets 33,982 that's because even though we are getting all the records from the cohort because we did a left join, we are getting multiple records from the Foster Care Files for each record in the cohort and that's something you have to be careful of to. So it will look like this, we have all these AFCARS but we are getting duplicates on the NYTD side so that's something you have to be aware of. I'm just trying to tell you how that you have to be aware of all these things when you are trying to do join because you might not get what you think you are trying to get and you might not want multiple that's why I was saying it's important to reduce AFCARS or the NCANDS to one record per child. So just to show you what a middle join, it's only the common rows. And that would be that would look good but you wouldn't be getting all the records from the NYTD file then. And one thing I wanted to tell you to about linking to AFCARS let's go back. Here's our cohort will run that that's our 15,597 IDs this is saying give me records from the Foster Care File that are where the state foster care ID is in the cohort. So and I'm going to group by state foster care ID. So this should get us 15,597 records because this is looking in all the years of foster care and saying give us the record from foster care no matter what year it's in and link it with the NYTD. So look at this has only 15,109 records which means that no matter what you do you are never going to be able to catch all 15,597 NYTD cohort out of the Foster Care File because apparently there are records in the NYTD outcomes file that aren't in foster care. That's peculiar because all these kids come from foster care but they are not... This is trying to find them and they are not in their. So it's saying give us the foster care ID from the Foster Care File for every kid that's in the cohort and we're only getting 15,109. Now I'm going to show you I'm going to open this up and then I'll run it again. And this is going to show you what years those kids are coming from like what year is the most recent year that they were in the Foster Care File. So you have kids you have one child at least in the cohort whose first record whose most recent record 1998. Which means they are that's some sort of error because they wouldn't be or maybe they are in there and they are just they are like 20 some years old or something. Anyway so the maximum like 2011 to 2012 is when you get most of them so you get this kind of like grouping up here when you get to the more recent years. Kind of somewhat of a normal curve. Anyway I just wanted to illustrate to you that you are never going to get all the records out of the Foster Care File with cohort 11. Okay now linking to the Child File. The child file can be linked to the NYTD using the variable AFCARSID. The child file has 2 child identifiers ChID and AFCARSID. It's AFCARSID you want to use to do the linking. Preparation will have to be done on the Child File prior to attempting to link. Think about exactly what variables you need from the Child File for what years and how you can reduce the file to one record per child. You should know that annual patches are determined by the disposition date, not the report date or the incident date in the Child File. So here's an example of linking to the Child File you are not getting you have this would indicate this South Dakota record would indicate a child who is in the cohort but they are not in the Child File they have no records in the Child File ever. Same thing with these other and that's quite possible. So I'm going to open it up for questions if anybody has any questions that's the end of the presentation. [Erin McCauley] wonderful so we have Jamieson asked the question can you go back to specific slide referencing your concatenation techniques. The key part one. [Michael Dineen] this is just showing you for each of the data sets what the child IDs variable name is. And then this next one was showing the advantages of using you know concatenating state and the child identifier to have one variable that you can use to link. [Erin McCauley] do you want to take a minute to maybe talk about why you use both the RecNumbr and the ID and the birth date. I know that we talked about why you do this but I think it would be helpful for everyone to hear. [Michael Dineen] why I use the birth date? [Erin McCauley] yeah [Michael Dineen] well because there could be an error in the RecNumbr perhaps a state used a RecNumbr wants and then they used it again on a different child. So what I do is I include, when I'm doing the linking I include date of birth but I didn't bring that up in this presentation because the data birth that you guys have is not the data of birth so it's not as good of an identifier for a child because it's only to the nearest month. But it's still worth doing if you want a little more if you want to be a little more sure that your RecNumbr is linking to the same child. [Erin McCauley] wonderful thank you we have another question: does NYTD service file contain multiple records per child before merging with another data set, or did the duplicate records occur after the merge. [Michael Dineen] the duplicate records well, the file itself doesn't create records that weren't there before. Those records are all coming from the file that has, from the NYTD file that stays the same, but when you do the merge the result of the query the result of the merge is to have multiple records for the same child because you need a record to match that child's appearance every time a child appears in the Foster Care File. If child appears multiple times in the Foster Care File then it's going to link on that child identifier multiple times to the NYTD file so that's why you get duplicate records on the NYTD side. But they are not duplicate records on the foster care side. [Erin McCauley] right and so for the NYTD series there are not so when you download the NYTD data it's already merged across the waves and so you don't have the same child appearing multiple times just have the child wants and then the response is for the original waves [Michael Dineen] well wait a minute let's, that's not exactly true when you the file that we send out that you get from us has a child in up to three times because it's stacked, and the child can be in wave one wave 2 and wave three so they only appear once in each wave but I was filtering the data I was always filtering on the NYTD side to wave one equals one and cohort 11 equals one so that was only taking that 15,597 kids who were in wave one and they were in the cohort. I was already screening out wave two and wave three which are in your data and you will have to do that too. [Erin McCauley] great thank you thank you. Anyone else have any questions? All right well it seems like that is our questions thank you very much to Michael Dineen that was a wonderful presentation kind of getting in the nitty-gritty I know this is one of our more complicated ones but Michael did a great job over viewing it. And next week we have our second to last presentation. I will be the presenter next week and I'm going to be talking to you guys about study that I'm kind of in the middle of doing the analysis on right now so were going to kind of go over how I came up with my questions and then I'm going to be giving a conference style presentation with what I have so far, and then I'm going to be talking about where I'm going to take it next and how I'm going to kind of link back to what I've learned through this series. And then after that were going to have our last presentation. So we hope that we will see everyone next week, in the meantime please send me an email if you have any questions and thank you so much for joining us today. The National Data Archive on Child Abuse and Neglect is as project of the Bronfenbrenner Center for Translational Research at Cornell University. Funding for NDACAN is provided by the Children's Bureau.