Transcription of Webinar "Linking Administrative Data in SPSS", August 12, 2020
[Clayton Covington] Good afternoon everyone and welcome to the final session of the National Data Archive on Child Abuse and Neglect NDACAN Summer Training Webinar Series. And if you've been here with us all summer, you know that this is our sixth session. This is actually a makeup session due to some technical difficulties with our last session, but we're really excited to go over this information with you all again, as well as do an in-depth hands-on linking of data, and showing you all how to do that in SPSS. So, we are the National Data Archive on Child Abuse and Neglect, which is hosted in the Bronfenbrenner Center for Translational Research at Cornell University. And more recently, our director and co-PI, Dr. Christopher Wildeman, has changed his affiliation to Duke University, so a number of our staff are also affiliated with Duke University, including myself and Sarah. So, the agenda for this session is Barriers to Linking, Process with Linking and also, the SPSS Linking Walkthrough. With that I believe I'll turn it over to the NDACAN statistician, Sarah Sernaker.
[Sarah Sernaker] Hello! Thank you Clayton for the welcome and the introduction. So yeah, today is the session on data linkage and we'll go through general data linking practices and things to consider before we get to an actual practical example of linking data through SPSS. And so, let's start with some definitions. So, today we're talking about linking administrative data. So what exactly do we mean by administrative data? This is just simply data that's collected by large organizations, usually government, and it's created for the purpose of record keeping. It's not created for the purpose of doing statistical analysis; it's more like historical record keeping. And so, by data linking, we just simply mean the combination of two data sets that share at least one common variable or entity. And we'll talk more about variables and entities in a few slides. But think of the simple example: if you had two data sets that measure things on each state, so each row has information about each state, then by data linking you might want to combine these data sets to make use of the information that both data sets hold, and so that's simply what we mean. Just combining two data sets that share at least one common variable, so, a linking variable. And by tables, I'll probably be referencing data tables or data sets; these are all interchangeably the same thing, it's just the data. So when we think of data we think of rows, which are observations, and columns that hold the information of variables. So we have rows of observations and the columns are simply variables. So this stuff down here. And so, I'll be discussing what keys or linking variables are. A key is exactly a linking variable, and a linking variable is the variable that would be found in all of the data sets you hope to link. So it's the common variable that you would use to link the data sets.
So in the simple example I just mentioned with the states, so if you had two datasets where each row corresponds to a state, your linking variable would be states. Right? You're linking, you're combining the data sets based on what state they come from. And so a linking variable may also be more than one variable, so in the simple example of the states, let's say you had information for each state for different years. Right? So now you have multiple information for each state, just based on the year, and so if you wanted to link data sets, you'd need the linking variable to be state by year. Right? You want to make sure your information is going to not only the right state but within the right year. And so, that's what I mean by linking variables. And so, like I mentioned, columns are simply variables. I think I'd probably stick with variables most often but that's simply just the columns of your data set or what is being measured. Right? And so we have three main administrative data sets here at NDACAN or rather our largest administrative data sets and this is NCANDS, that's the National Child Abuse and Neglect Data System. This is records of reports to CPS most often, and reports of maltreatment toward children. We also have AFCARS which is the Adoption and Foster Care Analysis Reporting System, which provides data, separate data sets on adoption and children in foster care, usually waiting to be adopted. And then we have the NYTD data set. I say NYTD. That's the National Youth in Transition Database and this is a longitudinal survey. So we have three cohorts in the NDACAN holdings. And so, each cohort goes through three waves and so, in essence, a child is going through three stages of a survey. And so, we have that which we'll be using in our linking later along with the AFCARS. So, let's first talk about barriers to linking. You know you want to link data, but here's some things that might arise when you're trying to do data linkage. 
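The state-by-year idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tables, the variable names (`state`, `year`, `child_pop`, `n_reports`), and the helper `link` are all invented here and do not come from the webinar slides or from any NDACAN file.

```python
# Two toy state-level tables; every value is invented for illustration.
population = [
    {"state": "NY", "year": 2013, "child_pop": 100},
    {"state": "NY", "year": 2014, "child_pop": 110},
    {"state": "OH", "year": 2014, "child_pop": 90},
]
reports = [
    {"state": "NY", "year": 2014, "n_reports": 7},
    {"state": "OH", "year": 2014, "n_reports": 5},
]


def link(left, right, keys):
    """Left-join two lists of dict rows on a tuple of linking variables.

    Assumes `right` has at most one row per key combination.
    """
    index = {tuple(r[k] for k in keys): r for r in right}
    linked = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        # Values from the left table win if a variable appears in both.
        linked.append({**match, **row})
    return linked


# Linking on state alone would attach 2014 counts to the 2013 row;
# using (state, year) keeps each value in the right state and year.
linked = link(population, reports, keys=("state", "year"))
```

Here the 2013 New York row simply ends up without an `n_reports` value, because the second table has no row for that state and year.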
And I try to talk about this in a general setting with examples in respect, with respect to our data sets specifically. So, the first barrier you might face is, the data structure is different between data sets. So, for example the AFCARS foster care file has one row per child. So if you think of the data set, each row, each observation, corresponds to a single child in foster care. Actually, it's per child per year. Every child in foster care gets a new observation every year. And so, let's just, for simplicity, let's say we're looking at a single year. So, the foster care has one row per child. The NCANDS child file, as opposed to the, what is the other one that we have, we have child file and then like the state, the agency file. But most often we use the child file. The NCANDS child file has one row per report/child. And so, like I mentioned before, NCANDS records maltreatment reports, like to CPS for a child. And so, a child may appear in one year on multiple reports. And so, a unique key or linking variables, you would need to know the report ID along with the child ID if you're really trying to get at the child-level analysis. And last, is the NYTD Outcomes file, has one row per child per wave. Like I mentioned before the NYTD is a longitudinal survey, so you have at most, three observations per child. For each wave, so they're interviewed, we'll be using the 2014 or 2011 cohort and they were interviewed in 2011 and 2014 and then 2017. And so, we have observations for each child, for each wave that they participated in. And so, more generally, back to the data structure, you would need to resolve these data sets to be, let's say we wanted to do analysis at the child level, you would need to resolve each of these data sets to represent one row, or one child per row, rather. So that you would easily be able to link them. 
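As a rough sketch of what resolving a report/child table to the child level could look like, the Python below counts how many reports each child appears on. The rows and the names `report_id` and `child_id` are made up for illustration and are not the actual NCANDS variable names.

```python
from collections import Counter

# Toy report/child table: one row per report per child (invented values).
child_file = [
    {"report_id": "R1", "child_id": "C1"},
    {"report_id": "R1", "child_id": "C2"},
    {"report_id": "R2", "child_id": "C1"},
]

# Count the rows for each child, then emit one row per child.
reports_per_child = Counter(row["child_id"] for row in child_file)
child_level = [
    {"child_id": child, "n_reports": count}
    for child, count in sorted(reports_per_child.items())
]
```

The result has one row per child, which is the shape needed before linking to another child-level table.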
And so, kind of along the same vein, there may be recording errors in the administrative data. This should not arise so much in the NDACAN data because it is cleaned and well maintained by Michael Dineen and myself. But, more generally speaking, if you were to use a unique identifier, let's say child ID, which most often a human being is inputting into the data, errors might arise: there might be a typo or there might be a missing value. And so, when you come across reporting errors or you're suspicious that there might be reporting errors in your data, it's useful to include additional identifiers with your key, or your linking variables. So, for example, maybe you don't trust the child ID to be error free. If you include child ID plus their birthdate and their sex, the likelihood of you matching the child that you want will go up. Right? Not many people have the same child ID plus the same birthday and the same sex. And so, that is something to keep in mind. Additional barriers would be any missing data in your key fields, so any of your linking variables. So child ID, again let's use that as an example: if any of the observations were to be missing child ID, that's just an observation you wouldn't be able to match. You just do not have the information to be able to match it. Right? Another big problem is changes in record keeping. And this goes back to data structure, just because the changes in record keeping may just be changes in data structure. But it's definitely something to keep in mind as you work with data in general. So, let's say you're working with data from AFCARS from 2000 and data from AFCARS from 2010. Just because it's both AFCARS data doesn't mean they have the exact same structure or file format. Recording may have drastically changed over the years. Most often the case is that more variables are recorded. Some variables become obsolete or change. So you might see variables not being recorded any more.
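A minimal sketch of matching on child ID plus birthdate plus sex, and of setting aside rows whose key fields are missing, might look like the following. All field names (`child_id`, `dob`, `sex`) and values are hypothetical, chosen only to illustrate the idea.

```python
def make_key(row, fields=("child_id", "dob", "sex")):
    """Return a composite key, or None if any key field is missing."""
    values = tuple(row.get(f) for f in fields)
    return None if any(v in (None, "") for v in values) else values


file_a = [
    {"child_id": "A1", "dob": "2001-03-02", "sex": "F", "homeless": 0},
    {"child_id": "A2", "dob": "2000-11-15", "sex": "M", "homeless": 1},
    {"child_id": "", "dob": "2001-07-09", "sex": "F", "homeless": 0},  # missing ID
]
file_b = [
    {"child_id": "A1", "dob": "2001-03-02", "sex": "F", "removals": 2},
    # Same ID as A2 above but a different birthdate: treated as a non-match.
    {"child_id": "A2", "dob": "1999-01-01", "sex": "M", "removals": 1},
]

index_b = {make_key(r): r for r in file_b if make_key(r) is not None}

matched, unmatched = [], []
for row in file_a:
    key = make_key(row)
    if key is not None and key in index_b:
        matched.append({**index_b[key], **row})
    else:
        unmatched.append(row)
```

Keeping the unmatched rows in their own list, rather than silently dropping them, makes it easier to see how many observations were lost to missing or inconsistent keys.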
Or, more importantly, and what you see most often, is changed variable codes. This happens a lot with race variables; if you use old data sets, there's more granularity nowadays when recording race. And so, these are all things to keep in mind when trying to link data. And just to touch on our data sets again, our data is resolved; most of these issues are resolved. So if you were to do a data request from our data sets, these aren't things that you should have to worry about so much, because we have taken the time to go through this and make sure that records are consistent. But again, more generally speaking, definitely something to keep in mind. And lastly, and more generally, when you're working with any data in any context you should be familiar with it. And so, that means leaving your user guide or the codebook up while you're working with the data. That's what I do every time I'm working with a data set. So then you can see the list of variables, you can see what variable types they are. Are they numeric or are they character? Is there some string length? Are they factor variables? Or are they binary variables? And it will help you just understand the data better, which is always good, especially when you're trying to link data. And that will kind of help you scope out which variables and filters you might need. And so anytime you're working with our data sets I definitely recommend going to our website; we have tons of documentation and codebooks to help anyone using our data. So we've gone through the barriers, and we're on to linking. So how do you link your data sets? You can link data sets if they share a variable of the same entity. And so by entity I mean the object that the variable contains information about. And so in my simple example with the state, the state would be the entity. Right? You'd be linking on state-level information, so your entity is state.
And linking variables don't necessarily need to be named the same, so the variable names don't necessarily need to be exactly the same. Let's say in one data set your state variable is named state, and in the next data set your state variable is named ST. Right? As long as they are measuring the same thing, the same entity, coming back to that term, you can link the data sets. And similarly, kind of along the same vein, in the state case, if one of our data sets writes out the name of the state and the other simply has the state abbreviation, you would need to change them to the same format. They are measuring the same entity but they should be in the same format in order for you to link. So these are kind of all the nuances to think about when you are linking on some entity. And in our set of data sets, so the foster care, adoption, NYTD and the NCANDS child file, most often we use child as the common entity. So usually we want to do analysis on the child level, whether it's looking at maltreatment toward the child or a child's experience in foster care or adoption, and so Michael Dineen has actually created a variable within, I believe, almost all of the administrative data sets I've listed here. There's a variable in there called state FCID; it's a foster care ID, and this is a unique child identifier comprised of the state abbreviation plus the AFCARS foster care ID. And so if you were to use one of our data sets and wanted to do analysis at the child level, the state FCID is the variable that would facilitate most of your linking. And it is the variable that we'll be using later for linking in SPSS. So, the steps you would take to link your data. Right? First you need to know what you want to do. What analysis do you plan on doing? What variables do you need to accomplish your analysis? What's your research goal? What's your research interest? You shouldn't just go all in on the data and link as much as you can.
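To make the naming and formatting point concrete, here is a small Python sketch that harmonizes two differently named and differently formatted state variables and then builds a combined child key from a state abbreviation plus an ID. The lookup table is deliberately partial, and the key layout produced by `build_child_key` is an assumption for illustration; the exact layout of the NDACAN identifier should be checked in the codebook.

```python
# Partial lookup for illustration only; a real one would cover every state.
STATE_ABBREV = {"New York": "NY", "Ohio": "OH"}


def normalize_state(value):
    """Map a spelled-out state name or an abbreviation to an upper-case abbreviation."""
    value = value.strip()
    return STATE_ABBREV.get(value, value.upper())


def build_child_key(state, foster_care_id):
    """Concatenate state abbreviation and ID (hypothetical layout)."""
    return normalize_state(state) + str(foster_care_id).strip()


# One table calls the variable "state" and spells it out;
# the other calls it "ST" and uses a lower-case abbreviation.
table_a = [{"state": "New York", "fc_id": "000123"}]
table_b = [{"ST": "ny", "fc_id": "000123"}]

keys_a = [build_child_key(r["state"], r["fc_id"]) for r in table_a]
keys_b = [build_child_key(r["ST"], r["fc_id"]) for r in table_b]
```

Once both tables carry the key in the same format, the differing variable names no longer matter for the link.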
It's really helpful to clean the data sets you wish to link before you link them, to make it as clean and as easy as possible, and often to facilitate computation if you're dealing with really large data sets. And so, thinking about your research goals and interests and whatnot, you should specify the data sets you'll need. So if you were planning on using NDACAN's data, do you need an AFCARS data set? Do you need an NCANDS data set? Would you need NYTD? What years would you use? Questions of that nature you should ask yourself. And so once you've identified your research goals and the data sets that will help you accomplish your research goals, you should, like I mentioned, clean each data set separately. So whichever data sets you hope to link, deal with them separately before linking. And so from each of those data sets you should remove any unnecessary variables, just to remove complication. Again, just getting them out of the way; they might complicate the linking process. And again, if you're working with huge data sets this might be computationally necessary. And so kind of along this same vein of removing all unnecessary variables, you should subset or filter, I use those interchangeably, based on your scope of interest. So like I mentioned, do you have a certain range of years you are interested in? Not just everything NDACAN can offer you. Do you have certain states you are interested in, like just observations from New York, for example, or other characteristics? Most often used is child maltreatment, so certain types of maltreatment, or kids just entering foster care or just leaving foster care. So all of these things kind of tie into what your research goal is. And again, this is treating each of the data sets you hope to link separately before you link. Okay, so once you've removed the variables you don't need and filtered based on your scope of interest, you need to resolve the tables to a single row per unique identifier.
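The trimming and filtering steps might be sketched as follows in Python. The variable names, the list `KEEP`, and the filter (New York, years 2012 through 2014) are invented to mirror the examples mentioned in the talk and do not describe a real NDACAN file.

```python
# Variables needed for the (hypothetical) analysis; everything else is dropped.
KEEP = ("child_id", "state", "year", "entered")

raw = [
    {"child_id": "C1", "state": "NY", "year": 2014, "entered": 1, "extra": "x"},
    {"child_id": "C2", "state": "OH", "year": 2014, "entered": 1, "extra": "y"},
    {"child_id": "C3", "state": "NY", "year": 2009, "entered": 0, "extra": "z"},
]

# Keep only the needed variables, and only rows inside the scope of interest.
cleaned = [
    {name: row[name] for name in KEEP}
    for row in raw
    if row["state"] == "NY" and 2012 <= row["year"] <= 2014
]
```

Doing this separately for each source table, before any linking, keeps the later merge small and easier to check.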
So I'll just use child as an example here because that's usually what we deal with in NDACAN. And so if you wanted to link two data sets to do analysis at a child level, so a child's experience, let's say, in foster care and then through adoption, you would need to resolve the tables to a single row per child. And that will look different depending on the data set that you're working with. It might already be in a format where it's a single row per child, or you might be working with the NYTD data where you have three observations per child. And so that is what is meant by resolving tables. I'll be using that term a lot; resolving tables just means reducing your data to a single row per unique identifier, where the unique identifier is your linking variable. And so once you've dealt with all of the data cleaning separately for each of the data sets you hope to link, it's just easiest to save the results as a new table. So if you're working with AFCARS you could save this new table as, like, AFCARS cleaned. Right? And this just helps; you don't usually want to overwrite your data. In case you need to go back to the original data, it'll just be helpful to have a nice clean data set to use moving forward. And so last, you just link them, right? It's as easy as that. But note that it will depend on the programming language and how you hope to link them, like what identifier, and we'll go through that with the example. And so now we can talk about resolving some of the administrative data sets that NDACAN offers, and again, resolving as in reducing all of the observations to a single row per child. And so first we have the AFCARS foster care file, which is already given as one row per child. However, if you are using multiple years you may find duplicate IDs. If the child is in foster care for multiple years within the same state, their state foster care ID will be present multiple times.
And so if you plan on using multiple years, you would need to resolve that down to a single row, or use year and the ID together as linking variables. And some filtering that we generally deal with: you may just want kids who have entered foster care, or are still in foster care at the end of the year, or age out, for example. Yeah. So we'll get into the details of linking AFCARS foster care with the NYTD in the example. So the NYTD outcomes file, like I keep mentioning, is a longitudinal study, and the outcomes file is the data set that provides the longitudinal NYTD outcomes, so the outcomes of this survey given to children over the three waves. And so each row is unique by the child ID and the wave. So not only the ID of the child but what wave the responses given in that row represent. And this is a format called long. That's not even an NDACAN term; that's a general term for longitudinal data. There's long format versus wide format, and most often when you're trying to link longitudinal data you will need it in a wide format. And so again, we'll talk about this in the example and I'll show you how to go from long to wide, but anytime you're dealing with longitudinal data this is going to come up: which format do you want it in, which format is best for however you wish to link it in your analysis. And sometimes it depends on the programming language you want to use. And I will note that the outcomes file includes children who did not respond to the survey at all, so if you were to use the data, that's something to think about; you may want to filter these out. The survey was sent out, and records were made of who the survey was sent to, but that doesn't mean that people always responded to the survey. So again, start thinking about the filtering and how you want to consider these specific data sets.
And kind of picking up from the long versus wide, to resolve to one row per child in this case would exactly mean going from a long format to a wide format, and so creating variables as needed for each wave. So I've given an example: we'll be looking at a homeless variable, and those will be suffixed by wave one, wave two, wave three if we have it, and similarly for the other variables to be used. So in addition to the NYTD outcomes file there's a NYTD services file, which is records of children receiving services through foster care, and this is recorded twice a year, so each six month period, twice a fiscal year. And so that is something to consider if you were working with this data set, because in a one year period you would most likely run into multiple rows per child ID, if they were receiving services throughout the year, that is. And so I won't linger on this too long because we won't see this in the example and this is not a data set that's used as often as the others. I noted here that not all the children from the outcomes file can be found in services. The reason it's all under this umbrella of NYTD is because there is some overlap between the children receiving services and the children who are in the outcomes longitudinal study, but it is not complete overlap. So definitely something to keep in mind if you plan to use this data set. So the NCANDS child file is a popular data set; we use it a lot in NDACAN, or in my experience. And so this is organized by year, so each file is a separate year. If you wanted to use multiple years, you would first need to collect each individual year separately and then essentially just stack them all together. And the child file has one row per report/child. Like I previously mentioned, this is the record of maltreatment against children, reports to CPS.
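Stacking yearly files so that variables line up by name, not by column position, could be sketched like this in Python. The helper `stack`, the added `file_year` variable, and the toy rows are assumptions for illustration only.

```python
def stack(tables, columns):
    """Stack yearly tables row-wise, matching variables by name.

    Raises an error if a year's file lacks an expected variable, so that
    mismatched columns are caught instead of being silently misaligned.
    """
    stacked = []
    for year, rows in tables.items():
        for row in rows:
            missing = set(columns) - set(row)
            if missing:
                raise ValueError(f"{year} file lacks variables: {sorted(missing)}")
            stacked.append({"file_year": year, **{c: row[c] for c in columns}})
    return stacked


# The 2014 file lists its variables in a different order than the 2013 file.
files = {
    2013: [{"child_id": "C1", "report_id": "R1"}],
    2014: [
        {"report_id": "R7", "child_id": "C1"},
        {"report_id": "R8", "child_id": "C2"},
    ],
}
all_years = stack(files, columns=("child_id", "report_id"))
```

Recording which file each row came from makes it possible to use year together with the child ID as linking variables later on.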
And so again, the report could have multiple children, or a child could be on multiple reports, so if you wanted to resolve this you would really need to decide if you want to look at a child level, maybe just creating a variable that counts the number of reports they are on in a certain year, or if you want to look at a report level. So this one is sometimes a little trickier to deal with because of the fact that it's organized by report/child and not just child. So like I said, if you're doing multiple years you should carefully stack them. I say carefully just to make sure that your variables are lined up, so you're not just stacking the wrong variable on top of another wrong variable. Yeah. So just to reiterate, I've gone through a lot of specifics about our data sets: the steps you want to take, what variables you need, what your research goals are, what data set can help you accomplish your research goals, specifically what variables from each data set can accomplish your research goal, and resolving your tables separately. Not only resolving but cleaning your data and then resolving your data into one row, specifically per child here, generally per unique identifier. So whatever data you're working with, you need to resolve your table to one row per unique identifier, in other words resolving your table to one row per linking variable or combination of linking variables. And like I mentioned before, once you've cleaned everything and resolved it, it's easy to just save your cleaned data as new data and then link the clean tables. This facilitates things, makes things a little cleaner and nicer, and again computationally more efficient. So let's just jump into the example. I've talked a lot and I think it'd be helpful to see the practical implementation of this. And so today I'll be showing you how to link NYTD 2014 cohort data with AFCARS foster care data. And so what variables are we going to use today?
I will mention here that the data I will show you has gone through some filtering and a little bit of masking for privacy reasons. I cannot show you just the raw data, so the data I show you here is not true data, but it is very representative of the type of data you would see from NYTD or the foster care file. And so I've kind of created this fake research goal, this kind of fake general research goal, of wanting to look at the relationship between homelessness in our 2014 cohort and foster care. So is there a relationship between being in foster care and then somehow leading to homelessness, while controlling for sex and race, so just including some other variables we would want to include in an analysis of this type. And so, defining that, we would get this information from the NYTD outcomes file, so that's where we'll find the 2014 NYTD cohort, and I just pulled out some relevant variables from this that we would need to answer this sort of research question. And the variables that we'll see today are the wave, so we'll need to know which wave each observation comes from. The state child ID will be important to link. Like I said, we'll control for sex. Homeless is the variable that measures whether or not a child reports being homeless. We'll include their race and then their report year. And this was a variable that I created from the original variable REM 1 Date; just again for privacy reasons I've taken basically a subset of this variable. And so we'll be using the NYTD outcomes file and we want to link it with the foster care file. Like I said, we're looking at the relationship of homelessness with foster care. So from the foster care file we'll be getting the state FCID; notice it's the same variable between the data sets and that will be a linking variable. We have fiscal year, so which year the report is from, and we also have sex and race from this data set as well.
I've included a list of variables that might be interesting within this realm of research, mostly just to show an example of using a bunch of variables within your linking and data analysis. And so I have current placement setting, whether they entered foster care that year, whether they exited foster care that year, the total times they've been removed from their home to go into foster care, and then physical abuse, sexual abuse and neglect are indicator variables, or rather binary variables, that is, just 0/1 for whether they've experienced those things. And so for the rest of the slides, well, I'm not going to go through the rest of the slides because it's just code. I'm going to switch to SPSS right now; the code that I use in SPSS is exactly what you see on the slides, so if you need to come back to it, this will be posted on the website at a later point. So, now we switch over to SPSS. For those who've never used SPSS, this is what it looks like. It's an IBM product. I myself am an R person so I am not as adept in SPSS, but it is really easy to use, in my opinion, if you've ever done any programming. And so let me actually make this screen a little bit bigger, can we do that? I can just do that. Okay, so let's just start at the top. We know that we need the NYTD file and the foster care file you can see down here. So like I said before, we've identified our research goal, we have now identified the data sets we plan to use, and now we're going to treat each data set we plan to use separately, we're going to save a new clean data set, and then we'll get to the linking, okay? So get file will pull your data set. I've already pulled, from the raw data for this presentation, some NYTD data from the 2014 cohort and I saved it in this .sav; that's an SPSS file format. So we're going to say get file, and then I'm naming the data set, and I'm saying window equals front.
So let's run this. I've just highlighted it and I'm going to press run, and it popped out on my other screen, but you get this output, right? And so that is what SPSS is telling you is happening once you run your code. This looks good, this is exactly the code we ran, there are no errors, there are no warnings, so this is a good sign, right? It means something worked. And window equals front sends your data to the front. So I read in the data and we can take a look, and notice we have the state foster care ID, so this is the child ID in essence, we have the waves that they responded in, we have the indicators on race, we have outcomes; I think I've left in all of the variables or almost all the variables that come from the NYTD data set. And again, I'll reiterate that none of this data is true; I basically scrambled all of this information. But this is what the NYTD would look like. And so I just loaded it in. All we've done is loaded it in. So next we're going to trim the variables that we don't need. There's a lot of stuff here that I didn't mention on the previous slide; remember that I said we would use wave, state FCID, sex, homeless, race and then report year. And so that's what we want to do: we want to save a new data set, so I'm saving a subset where we're keeping these variables. And I've kept a few extra just to demonstrate how you might delete them later, so there are a few in here that we'll end up deleting later just for demonstration purposes. But notice this is how you would filter out variables you don't want: by specifying the variables you want to keep. And so I'm going to save this in the same folder and now I'm going to close it. So, data set close. So up here we have active data sets, and we see the name that I've given the data set, and now I want to close it because I just don't want to confuse SPSS. We're going to be loading other data sets, we're actually going to be reloading the subsetted data, and so we just want to get this out of the way.
And so that is why I have this data set close here, so I'm going to highlight that and run it. And so notice again we just see the code, no errors; this is still here but that's fine. Notice we have no active named data sets, okay, and let me just show you, so this is the folder I'm working in. Notice this was just created. This was what I just saved, that subsetted NYTD data set. Okay, I'm going to have to move faster, I didn't realize time is running out. Okay, so we're going to reload what I just saved, that NYTD subset dot sav. So again, get file, we're naming it and then closing it. Again, this is not necessary within the code but it kind of facilitates dealing with multiple data sets. And then it's going to open; let me just run this and I'll show you. So it's going to load the data we just created, the subsetted data set. It closes from the active data set but we are still able to view it, and I've now sorted by the child ID and then the wave. So it sorts first by the child ID, so alphanumeric order. It first puts those in order and then it goes to the waves and puts those in order. So notice now, first of all, we have a subset of variables; this is our subsetted data set. Now we just have sex, race, homeless, report year and then this extra variable that I've kept in for now. And so our state FCID, or state foster care child ID, is now in alphanumeric order. And notice here where we get to a child who is reported on multiple waves. This is where we need to reconcile, right? This is what I mean by long format. Each row corresponds to a child's response within a wave. And so we have multiple responses from this child reporting in multiple waves. And so this is exactly what I mean by the long format, and we essentially just want to expand this into a wide format, which I'll show you in a minute. And we can see this person was in three waves. Not everyone is in three waves; if we were to go down, some people would just be in wave one and wave two.
But yeah. So all I've done, I've gone up to here. We reloaded the file we just saved and I've sorted by child ID and then the wave. So now, like I keep saying, we need to convert this from long format to a wide format, and in SPSS this is accomplished by CASESTOVARS. And so the ID here corresponds with an actual ID; this is not necessarily an ID variable, or it doesn't need to be an ID. So in the example of working with state level data, ID may be just the state name. But here it does correspond exactly with the child ID so it's easy to identify, and then we're indexing by wave. So we're saying we want a unique row per child and we want to expand the variables by the wave, right? Because each row of information is unique to the wave; the information from this child at this wave is different from the same child at the second wave. I guess not actually here, but generally speaking. And so that is what CASESTOVARS accomplishes. I'm going to run this, I think it's just easier to show you, and then we're going to sort again by the child ID. So I've run this, and notice in the output we actually get something different this time, and this is trying to help you understand the process of SPSS going from long to wide. Notice now we went from a single variable called homeless to homeless index one, homeless index two. So this is what is meant by the indexing and what I mean by going from long to wide. We have three variables for each of the observed responses. And it gives you some processing stats, so before we had 9049 observations in the long format. Going to wide we end up with 4129 observations, which is to say within the data that we originally loaded there were only 4129 unique children, so no matter how many waves they were on, this is the number of unique children that appear in our data set. And so cases in over cases out is just saying that on average there are about two waves for each child.
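The long-to-wide step can be illustrated outside SPSS with a short Python sketch. It is not the CASESTOVARS implementation, only the same idea: one output row per ID, with wave-varying variables expanded into one variable per wave. The rows, the underscore naming (`Homeless_1`), and the helper `long_to_wide` are assumptions for illustration. With the counts reported in the output above, 9049 rows in and 4129 rows out works out to roughly 2.19 rows (waves) per child.

```python
# Toy long-format table: one row per child per wave (invented values).
long_rows = [
    {"StFCID": "NY000123", "Wave": 1, "Sex": 2, "Homeless": 0},
    {"StFCID": "NY000123", "Wave": 2, "Sex": 2, "Homeless": 1},
    {"StFCID": "OH000456", "Wave": 1, "Sex": 1, "Homeless": 0},
]


def long_to_wide(rows, id_var, index_var, fixed_vars):
    """Collapse to one row per id_var, expanding other variables by index_var.

    Variables in fixed_vars are assumed constant within an ID and are kept
    once; all other variables get the index value appended to their name.
    Waves a child did not take part in simply produce no variable here.
    """
    wide = {}
    for row in rows:
        out = wide.setdefault(row[id_var], {id_var: row[id_var]})
        for name, value in row.items():
            if name in (id_var, index_var):
                continue
            if name in fixed_vars:
                out[name] = value
            else:
                out[f"{name}_{row[index_var]}"] = value
    return [wide[key] for key in sorted(wide)]


wide_rows = long_to_wide(long_rows, "StFCID", "Wave", fixed_vars={"Sex"})

# Rows in divided by rows out: the average number of waves per child.
waves_per_child = len(long_rows) / len(wide_rows)
```

In this toy table three long rows become two wide rows, so the average is 1.5 waves per child.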
And that gives you some stats just on the number of variables, so before we had eight in the long version, we'll have 15 in the wide, and there were three indexed values, there were three values of the variable wave. So let's actually look at the wide format. Notice now we have the child ID, we have the sex, their race. Notice these do not have indices because these should not be changing depending on what wave we're in, right? Except for extenuating circumstances, I guess, the sex would not change, generally speaking, your race should not change, but the other variables, notice, have been expanded. So we have homeless one, we have homeless two, homeless three, report year one, report year two, report year three. Then we have the substance abuse, incarceration. So for this child these are the responses for each wave. So we can see by the missing values that this child did not respond to waves two and three. That's why we have missing values here. This child did respond to all three waves because we have information across the whole row, and we can see that they did not appear to be homeless in any of the waves that they responded to. They did not appear to have any substance abuse in any of the waves they responded to, they were not incarcerated. And so this is wide format. When you're dealing with longitudinal data, that is the difference: in long, everything is split by the wave, versus wide, where now we've expanded the variables for each wave. So I'm going to go ahead and delete this. This is just to show you we have more variables that we can trim, so I'm just going to go ahead and delete these because again we were looking kind of at homelessness and foster care. And so I've just gone ahead and deleted them. Notice that they are now gone. All we have is sex, race and then the homeless variable and then the report year, if only to just keep track of when these were recorded. Okay, so now I'm going to save this.
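As a rough sketch of the trimming and saving step, assuming the expanded substance abuse and incarceration variables are named SubAbuse_1 through SubAbuse_3 and Incarc_1 through Incarc_3, and that the cleaned file is saved as cohort14_wide.sav (all of these names are hypothetical):

```spss
* Sketch only. Variable and file names are assumptions.
* Drop the wave-expanded variables that are not needed for the linkage.
DELETE VARIABLES SubAbuse_1 SubAbuse_2 SubAbuse_3
  Incarc_1 Incarc_2 Incarc_3.
* Save the cleaned wide file, then close the open data sets.
SAVE OUTFILE='cohort14_wide.sav'.
DATASET CLOSE ALL.
```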
This is our totally clean cohort 14 NYTD that we'll be linking. So I'm going to run that. Go into my folder, notice this is new, this was just created 12:45, right? And I did DATASET CLOSE ALL. This is just kind of a catch-all: instead of DATASET CLOSE and specifying the name, if you just say DATASET CLOSE ALL it just closes everything. And so now we need to prepare the foster care file. Luckily, let me just load this in here to show you really quickly, I should move a little faster. This is the foster care file, and again this is data that I have masked. This is what it would look like but this is not true data. And I've taken a subset of variables. Notice we have fiscal year, the child ID again, if they're in foster care at the end of the year, if they've entered this year, if they've exited this year, et cetera. Notice we also have race and sex. So when linking you could choose which data set you want these from, maybe you trust one data set versus another. Maybe you'd like to include both, where you would need to add an index maybe, so like race index foster care, to show that this is the race variable from foster care, and then you could compare what's recorded in race and sex between the two data sets. But we'll be outputting it here. And so all I've done, I've loaded the foster care file, we're just looking at it right now with the output. Notice no problems. And so this one's easy because it's already one row per child ID, and so all we need to do is specify the variables that we'd like to keep. And so I've just kind of chosen a few here, whether they've entered or exited, the total number of removals they have, things that might be of interest in the general research goals that I outlined. And so I'm keeping the variables that I want, I'm saving it to a new file and then I'm closing everything down because I'm about to reload everything to link it. And so to show you, we have this foster care 14 link, I've just created this.
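A minimal sketch of the foster care preparation described here. The input file name fc14_masked.sav, the output file name fc14_link.sav, and the variable names StFCID, Entered, Exited and TotalRem are assumptions for illustration.

```spss
* Sketch only. File and variable names are assumptions.
GET FILE='fc14_masked.sav'.
DATASET NAME fc_raw.
* Keep the linking key plus the variables of interest in a new file.
SAVE OUTFILE='fc14_link.sav'
  /KEEP=StFCID Entered Exited TotalRem.
DATASET CLOSE ALL.
```

Because this file already has one row per child ID, no restructuring is needed. Keeping the key and a handful of variables is the whole preparation.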
And so that was easy, that's all we need to do with the foster care file, and sometimes it's as simple as that. It's just when dealing with longitudinal data that it's always a little trickier. So now we're going to open the files that we prepared for linking. So we're going to GET this file, so this is our cohort data. We're just going to load it, it's just going to sit in SPSS to be linked. I think it's cohort 14. Next we're going to load fc14, so that's the foster care file. You can't see it but these are popping up in my other window. So we have cohort 14 and then we have foster care 14. And so we could look at these side by side if you needed to do direct comparisons, that sort of thing. And I'm going to activate data set fc14. You need to activate one, and this is a little redundant because we can see the active data set is fc14, but this is just to really make sure that there are no errors that arise when using SPSS. And so I'm going to activate this just to say that that's going to be basically like our baseline data set. We're going to link the cohort 14 data onto the foster care file. And now we're coming down to the actual linking, so STAR JOIN is the function that you can use in SPSS to join the tables. Then we need to select which variables we want to keep. So this is literally just a list of the variables in each of the data sets that we want to end up in the full joined data set. And so notice it's t zero period blah blah blah, t zero period blah blah blah, t one period blah blah blah, with the variable name. And so these are just relabeled down here, so we're selecting these variables from fc14, and we're relabeling fc14 as t zero. This just makes it easier to write everywhere. If you wanted to use fc14 that should be fine, but this is just a lot quicker and simpler to write out, especially if you have a really long data name.
So we're labeling fc14 as t zero, so anything you see with the t zero are variables that come from fc14, and we're joining it with cohort 14 labeled as t one. So anytime you see t one, those are variables coming from cohort 14. And so we're taking these variables from these two data sets, we're joining cohort 14 onto fc14, and we're joining them on these variables. So we're matching, we want to find child IDs within the foster care file that match with child IDs in the cohort file. And so this is where you specify your linking variables. Here this IN statement is totally optional. This adds an indicator as to whether the data could be matched. So in some cases when you combine the data, some of the observations in these files may not have a match but they would still be present and seen within the data, and so this indicator variable will show a one if you were able to match that observation and zero if you were not able to match that observation, and SPSS just fills in blanks where it couldn't be matched. Okay. And then we save it in this joined file, and so I'm going to run this, and we won't really see anything interesting happening. That'll run, we have our code in the output. Notice now we have this new dot sav file just modified. And so now you want to check it. You should never just say, oh, it ran and it worked. No, we want to load it, and I'm going to actually copy and paste this down here to avoid any errors. So I'm going to close everything out again, we have this joined data set saved, okay, so I just closed everything. And now we want to get our joined data set. You should always visually check it to make sure it came out looking right, and you should do frequency checks and whatever other checks you can do to make sure this looks right.
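The join described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the syntax from the session: the file names, the indicator name matched, and every variable name are hypothetical, and the sketch reads the two saved files by path rather than joining open data sets.

```spss
* Sketch only. File, alias, and variable names are assumptions.
* t0 is the foster care file and t1 is the wide NYTD cohort file.
* Key fields named in ON are carried into the result without being listed in SELECT.
STAR JOIN
  /SELECT t0.Entered, t0.Exited, t0.TotalRem,
          t1.Sex, t1.Race, t1.Homeless_1, t1.Homeless_2, t1.Homeless_3
  /FROM 'fc14_link.sav' AS t0
  /JOIN 'cohort14_wide.sav' AS t1
    ON t0.StFCID=t1.StFCID
    IN=matched
  /OUTFILE FILE='fc14_cohort14_joined.sav'.
```

Every row of the FROM file is kept. Rows with no partner in the JOIN file get missing values for the t1 variables and a 0 in the indicator created by IN, while matched rows get a 1.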
So notice, like I said, t one was an indicator as to whether it could be matched. Zero says it couldn't be matched, so the rows where you have a bunch of blanks are where you have it in the foster care file but not in the NYTD file. So it just fills them with a bunch of blanks. Here notice that we have information and an indicator one saying that we were able to find a match, this child was in both of the files. So we have information from both files across the row and we have an indication that this was matched. So this is helpful also because then usually you just want to work with the data that was matched, to make use of all the possible data you have. So now we select if it was matched and we can see how much we actually matched. And so I've done this filtering so it only filters on whether it was matched, so here is like the full data set that was matched for every child. And we can see that there were 455 matches, so only 455 child IDs could be found between both files. And depending on how you look at it that could be a decent sample size or sometimes not. But it is what it is. And so that is the process, put very quickly, to link data in SPSS. Again the main beef here was, once you clean the data, using this STAR JOIN function to actually link the data sets. And I'll just leave it at that. I'll put this back up and we can get to questions.
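A minimal sketch of the checking and filtering step, assuming the joined file was saved as fc14_cohort14_joined.sav and the match indicator is named matched (both names are assumptions):

```spss
* Sketch only. File and variable names are assumptions.
DATASET CLOSE ALL.
GET FILE='fc14_cohort14_joined.sav'.
DATASET NAME joined.
* Count matched (1) versus unmatched (0) rows before dropping anything.
FREQUENCIES VARIABLES=matched.
* Keep only the children found in both files.
SELECT IF (matched = 1).
EXECUTE.
```

Running the frequency table before the SELECT IF gives the match count to compare against the number of rows that remain afterward.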
[Clayton Covington] Thank you so much Sarah for that excellent presentation. I know you had to get through it fast but you were really detailed. And I think that'll really help our users who are both, you know, attending right now but also once this is recorded. So again we have a few minutes left, so if you have any questions please direct them to the Q and A chat box. All right, so we have our first question. So it says "For restricted data sets how do we explain the data linkages we want to make?"
[Sarah Sernaker] I guess that really depends on what you mean, and if you mean you're trying to link data where you need variables that are restricted, that might prohibit your linkage if you simply can't access them because they are restricted from public use. If you can make the argument to use holdings, so if NDACAN is the one preventing you from using a restricted variable, in some cases this is due to the fact that you're simply not allowed to link data. There are some statutes and guidelines that say you can't link data based on sensitive information because it makes things more identifiable or it makes people more identifiable. So in that regard sometimes restricted data is just prohibited from data linkage. In other cases where you are able to get maybe sensitive information for linking, you should be wary and make sure that you are able to do so, both legally and within the guidelines of the data archive. Does that help to answer your question?
[Clayton Covington] Yeah, the person who asked replied "thanks" so I think you answered their question.
[Sarah Sernaker] OK.
[Clayton Covington] There is another question, it says "Is there programming code for other stats software available?" And I can actually answer this too. So this year the linking session was conducted in SPSS, but our previous years of the summer training webinar series have done them in other formats such as Stata and R, and those are all available on our website. So you can access those, and if there's a particular format that we haven't done yet, one of the things that you all will receive at the end of the summer training webinar series is a feedback form in which you can tell us what kind of ideas you have, including using other statistical platforms. So if we haven't done it already please let us know and we are happy to incorporate your feedback into our future program.
[Sarah Sernaker] Yes exactly, and you should be able to link in any statistical language that you come across.
[Clayton Covington] All right, so it looks like we're just about going to wrap things up there. So again thank you everyone for your participation in the NDACAN summer training webinar series 2020. Again please do fill out the feedback form when you receive it in your emails and we look forward to programming with you all in the near future.
[Sarah Sernaker] Yes, thank you for stopping by!
[Clayton Covington] All right, bye everyone!