Transcript for 2025 Summer Training Series, Session 3, Data Linking Presenter: Alexander F. Roehrkasse, Ph.D., Butler University National Data Archive on Child Abuse and Neglect (NDACAN) [MUSIC] [VOICEOVER] National Data Archive on Child Abuse and Neglect. [ONSCREEN CONTENT SLIDE 1] Welcome to the 2025 NDACAN Summer training series! National Data Archive on Child Abuse and Neglect Duke University, Cornell University, UC San Francisco, & Mathematica [Paige Logan Prater] Hi everyone. Welcome. I'm gonna go ahead and get started. I'm sorry I am late. So Alex, are you all good? I want to make sure you're there. [Alexander F. Roehrkasse] All set. [Paige Logan Prater] Perfect. Thank you. Hi everyone. Welcome to session three of our NDACAN Summer Training Series. We are so excited that you're here. My name is Paige Logan Prater. I use she her pronouns and I'm the graduate research associate here at NDACAN which stands for the National Data Archive on Child Abuse and Neglect. We're housed at the institutions and universities listed on this slide and we are funded through the national Children's Bureau. If this is your first time with us, we host two learning series every year. This is our Summer Training Series. We have this week and two more weeks the two remaining Wednesdays in July after today for our Summer Training Series. And then in the academic year we host our monthly Office Hours. So all of our information about these events and offerings can be found on our website. And I emailed everyone that registered for this event about two paper award opportunities. So thank you all so much if you've submitted a nomination already. The due date is today. So if you have any lingering opportunities or papers that you're thinking of potentially nominating, we encourage you to do so. And if you have any specific questions about them, please reach out to me or anyone at NDACAN. Next slide, please. [ONSCREEN CONTENT SLIDE 2] NDACAN Summer Training series schedule July 2nd, 2025 Developing a research question & exploring the data July 9th, 2025 Data management July 16th, 2025 Linking data July 23rd, 2025 Exploratory Analysis July 30th, 2025 Visualization and finalizing the analysis [Paige Logan Prater] So the theme of our summer series is the Life Cycle of an NDACAN Research Project. And as you can see from our schedule, the sessions in these series build off of each other, which is slightly different than our previous formats. And so in session one and two, we talked about developing a research question and exploring the data. Last week, we talked about data management and organization. And today, we'll talk about linking data, the different data sets offered or some of the different data sets offered by the archive. If you've missed any of these sessions, the first two, they're all recorded. And we will be posting the recordings, the transcripts and other supporting material on our websites. On our website, not sites. And I think Oh, next slide please. [ONSCREEN CONTENT SLIDE 3] Life Cycle of an NDACAN research project This session is being recorded. Please submit questions to the Q&A box. See ZOOM Help Center for connection issues: https://support.zoom.us/hc/en-us If issues persist and solutions cannot be found through Zoom, please contact Andres Arroyo at aa17@cornell.edu. [Paige Logan Prater] Just a couple of housekeeping items. So like I said, these sessions are being recorded so that we can post them on our website and refer y'all can refer back to them whenever you want. 
And because we are recording and making them publicly available after today, we ask that you all submit questions as they come up throughout the presentation to the Q&A box. And the way that we'll do Q&A is we'll save about five or 10 minutes at the end to address all of your questions. So, please put them in the Q&A box, which can be found at the bottom of your Zoom screen. There's like a little question mark in a comment bubble and there's a Q&A if you click it the Q&A box will come up. Please put those questions in there as they come up. And if you have if you need any help with Zoom there's some information there. You can also reach out to Andres who's on the call with us today. And otherwise I think we can go ahead and get started. Alex, kicking it to you. [Alexander F. Roehrkasse] All right, great. Thank you, Paige. Thanks everyone for being here today. I'm excited to talk about linking data today. [ONSCREEN CONTENT SLIDE 4] SESSION AGENDA STS recap Linking data Demonstration in R [Alexander F. Roehrkasse] If you've been here in the last couple sessions, you know that each session has built on the previous session. If you weren't here though, as Paige said, you can access the prior presentations on our website. But you needn't have seen those already for today's session to be valuable. So don't freak out if you weren't there last week or the week before. In fact, by way of catch up, I'll offer a brief recap of what we talked about last week. Then we'll focus on some basic principles and strategies for linking data. We'll talk about what linking is, why one would do it, what are some of the benefits, but also some of the pitfalls that come from data linkage. And then I'll actually demonstrate how to do data linkage using some anonymized data. So some fake data that look very real so that we can illustrate some of the different strategies for linking data also see directly some of the challenges that arise. I want to clarify that this is a very large topic. There's a large methodological literature on data linkage and there's a lot of research published using archive data that's been linked. We are just going to be scratching the surface today. I'm going to try to keep things pretty simple and pretty brief today to leave more space for questions at the end. I want to remind you that we are always available if you want to reach out for consultation to archive staff on any projects you may be working on that involve data linkage. We're always happy to help. As you'll see in my presentation today, there are a few points on which it's especially important to reach out to archive staff when it comes to linking data. Okay, let's get started. [ONSCREEN CONTENT SLIDE 5] STS RECAP [Alexander F. Roehrkasse] What did we talk about last week? [ONSCREEN CONTENT SLIDE 6] Data management Develop workflow to guard against mistakes Observe data security protocols fastidiously Verify data features directly Format your data according to your unit of analysis [Alexander F. Roehrkasse] Last week's session was largely about data management and I advised you to think about data management as sort of crisis management to develop a workflow that will help you avoid major problems in your research projects. And some of the advice I gave was to save everything you do and often, never to work from the console, that is to say entering functions into statistical software. Always running code that's been written in a script or a do file. This allows you to go back and see what you've done. 
I advised you always to annotate your code, basically to give yourself and other people who might read your code some guidance about what your code is doing. I also suggested that you keep a research journal where each day that you're working on a research project, you take some notes about what you did, some to-do's that you have, some challenges or questions that arose. I talked about how I have used these research journals to go back years later to understand better what I did in a prior research project. These have been very helpful for me. I also advised you to follow data security protocols carefully. Any data use agreement with NDACAN, particularly regarding sensitive data, includes terms that require you to store those data securely. I proposed a workflow that will allow you to store those data in accordance with your data use agreement while still working quickly and efficiently. I gave some examples where unfortunately real data do not correspond to how those data are described in various metadata sources like a data user guide or a codebook. So we don't want to take codebooks as the gospel truth. We always want to verify for ourselves: how are variables labeled? How are variables named? What values can a variable take, particularly as regards missing values? Data do not always look the way they are described in codebooks. It's important to verify these things ourselves. And then lastly, we worked through a bunch of different strategies for cleaning data and also reformatting your data to suit your research purposes. And I advised you that with very few exceptions, you want to think about organizing your data so that each row corresponds to a unit of analysis. And this of course requires you to think very carefully about what your unit of analysis is and then to format your data accordingly. Okay, that was what we talked about last week. [ONSCREEN CONTENT SLIDE 7] Research question What is the relationship between lifetime incidence of removal and full-time employment among youth three years after aging out of foster care? [Alexander F. Roehrkasse] I want you to recall though that over the last two sessions we've been moving toward an investigation of this research question. What is the relationship between lifetime incidence of removal, that is to say the number of removals a child has experienced over the course of their life, removal being placement into foster care, on the one hand, and full-time employment among youth three years after aging out of foster care on the other? So our predictor here is going to be the total number of removals in a child's life, and the outcome of interest is going to be their full-time employment three years after aging out of foster care. You'll see that we'll be organizing our data linkage today, specifically linking NYTD to AFCARS, in order to connect these two variables. The outcome of interest is measured in NYTD, but the predictor of interest is not available in NYTD. The predictor of interest is available in AFCARS, but the outcome is not available in AFCARS. The predictor is only available in one data set, the outcome in the other, so we'll link these two data sets specifically in order to be able to answer this question. [ONSCREEN CONTENT SLIDE 8] Linking data [Alexander F. Roehrkasse] Okay, let's talk about linking data. [ONSCREEN CONTENT SLIDE 9] What is record linkage? 
Linkage combines multiple data sources based on one or more shared variables Internal record linkage NDACAN administrative data files (NCANDS, AFCARS, NYTD) can be linked to each other at the child level using unique (encrypted) child IDs External record linkage Aggregated NDACAN data can be linked at the aggregate level to external sources using common variables: Time: year, month, half-month Place: state, county Demographic groups: sex, race/ethnicity, age [Alexander F. Roehrkasse] What is linking data? This is also sometimes called record linkage. Linkage is essentially not just about pooling data, not just combining different data sets. It's a way of connecting data from multiple sources about the same unit. So linkage means that we're combining multiple data sources, each of which has information about the same unit. That's what's key to record linkage. Sometimes it's the same unit measured at the same point in time. And what we're interested in are different measures that come from these different data sets we'll link. Sometimes, and this is often the case with archive data, we'll be talking about same unit, say the child, same measure, say child maltreatment investigation, but different points in time, like whether they're four years old or 5 years old or 10 years old. Both of these are examples of record linkage because they involve the same unit. We might be connecting different points in time or different pieces of information at the same point in time. As long as we're connecting data to measure multiple things about a given unit, that's what we mean when we're talking about record linkage. For the purposes of record linkage using NDACAN data, we can talk about internal or external record linkage. All of the administrative data sets that are distributed by NDACAN namely NCANDS, AFCARS and NYTD can be linked to one another at the child level. That's to say we can identify unique children in any of those data sets and connect information about those unique children across data sets. And we do that using a specific identifier variable, a unique child ID. This is an encrypted ID, but one that is consistent across these different data sets. And so when it comes to linking archive data to other archive data, that is linking between these three administrative data sets, we can do these linkages at the child level. We can find out what we sometimes call micro-level information about a given unit, say a child, from different sources. We can of course link archive data to any source of external data, but this cannot be done at the child level because those child IDs are encrypted, because this is very sensitive data, it's not possible to link archive data at the child level to other sources of information about unique children. We can nevertheless link archive data to external sources of information, but we do that at the aggregated level. What this means is that we count up the number of children say in a period of time be that the year or the month or the half month. We can count up the number of children in a place, be that the state or the county. We can count up the number of children belonging to a specific demographic group say a sex group, an ethno-racial group or an age group. And then on the basis of those tabulations, we can connect those data to other information about years, about counties, about age groups that come from really any other source of data. 
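To make the aggregate-level linkage concrete, here is a minimal R sketch of that workflow. Everything in it, the state codes, years, event counts, and population figures, is a hypothetical placeholder rather than archive data; the point is only the pattern: tabulate the child-level file by place and time, then join external counts on those shared variables and compute a rate.

library(tidyverse)

# Hypothetical child-level records, one row per event (placeholder values only)
removals <- tibble(
  St   = c("AL", "AL", "AL", "AK", "AK"),
  Year = c(2020, 2020, 2021, 2020, 2021)
)

# Hypothetical child population counts drawn from some external source
population <- tibble(
  St       = c("AL", "AL", "AK", "AK"),
  Year     = c(2020, 2021, 2020, 2021),
  ChildPop = c(1090000, 1085000, 182000, 181000)
)

# Aggregate the child-level records to the state-year level,
# link the external counts on the shared variables, and compute a rate per 1,000 children
removals |>
  count(St, Year, name = "n_events") |>
  left_join(population, by = c("St", "Year")) |>
  mutate(rate_per_1000 = n_events / ChildPop * 1000)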
External sources might include the American Community Survey, the Current Population Survey, the decennial census, or any number of other data sets that we might be interested in. Of course, it's this external record linkage that's particularly important whenever we're calculating, say, a rate of child maltreatment investigation or a rate of foster care placement. What we're doing there is counting up the number of children who experience some event in the archive's administrative data, and then we usually divide that count by some population count that comes from, say, census data or the Current Population Survey. This is a form of record linkage, linkage that's happening not at the child level but at, say, the state-year level or the county-month level. [ONSCREEN CONTENT SLIDE 10] Joining data Two tables are displayed that will be joined. The first table is NYTD and has columns labeled StFCID and CurrFTE_3. The second table is AFCARS and shows two columns labeled StFCID and TOTREM. Three other tables show the results of inner join (columns StFCID, CurrFTE_3, and TOTREM), left join (columns StFCID, CurrFTE_3, and TOTREM), and full join (columns StFCID, CurrFTE_3, and TOTREM). [Alexander F. Roehrkasse] Okay. There are different ways to link data, and it depends a little bit on the format that your data take. In last week's session, we talked about formatting data and reformatting data. We talked about long data and wide data. Sometimes if our data are long, we might use functions like bind_rows() in R or append in Stata to essentially stack data sets on top of one another. In last week's session, we created a sort of main format for our data that was wide. So, we're going to be working mostly with wide data here. And if our data have a wide structure, we're much more likely to be doing a linkage where we connect our data horizontally rather than vertically. In this case, to connect our data horizontally in R, we'll be using the join functions. These are roughly equivalent to the Stata command merge. What does it mean to join data? What are we actually doing when we talk about joining data? Okay, imagine two very small versions of our NYTD and AFCARS data sets. Now, these are tiny, stylized versions of the data sets we'll actually be linking. They contain the variables we're going to be interested in: the outcome, the predictor, and also the variable we'll use to link these data sets. So even though these data sets are very small, they illustrate on a very tiny scale the exact thing that's going to be happening with our actual data linkage involving tens of thousands of children. Okay. So let's look first at the top row here. We see two little tables. On the left we see NYTD, on the right we see AFCARS. In the NYTD data set we have two variables. On the left we have a variable StFCID. This is going to be our unique child identifier. This is a variable that the archive creates and is consistent across all of the archive's administrative data sets: AFCARS, NCANDS, and NYTD, with some exceptions that I'll talk about shortly. Each child has a unique value of the StFCID variable. So the first row here, A, corresponds to a specific child A. The second row, B, corresponds to a specific child B. The second column, CurrFTE_3, is our variable measuring current full-time employment three years after aging out of foster care. And you can see that child A was full-time employed and child B was not full-time employed. Okay, let's move over to the tiny little AFCARS table. Here you'll notice that we have a row corresponding to child B. 
That means that child B was observed both in NYTD and in AFCARS. You'll notice that we don't have a row for child A, but we do have a row for child C. The second column in that tiny little table, TOTREM, is an abbreviation for total removals. This is the predictor variable that we're interested in associating with our current full-time employment variable. You'll see that child B had three total removals and child C had four total removals. Okay, what happens when we join these data sets? Whenever we join data sets in R, there are actually different ways we can join the data. Let me briefly illustrate three of them. The first way, and now here we're looking at the bottom row, would be to do what is called an inner join. An inner join retains only those observations, or rows, that have shared values of the linking variable. So, we're going to link using the StFCID variable. The only row that has a shared value for that variable is B. And so if we do an inner join, it's only going to keep the observations that are jointly observed in both of the joining data sets. That's row B. If we were to left join our data sets, in that case we would say, okay, keep everything in the NYTD data set and include in our new data set only those observations that can be linked to observations already in that NYTD data set. So essentially keep everything on the left and add to it anything from the right that we can. You'll notice that we still have A and B here. Those were in the original NYTD data. We've linked to it a value of total removals for child B. But because we did not observe total removals for child A in the AFCARS data, R will return a missing value for that variable. We don't observe that variable for child A. You can imagine us doing a right join, which would be the exact same thing except in reverse, where we take everything in AFCARS and add to it whatever we can from NYTD. The opposite of an inner join is a full join. And this is where we keep everything in NYTD and everything in AFCARS. Don't drop anything. Keep everything and connect as much of those data as we can. So here now we have observations for children A, B, and C because they show up in at least one of the two data sets. You'll see that we've been able to link child B, which means that we observe both the outcome and predictor variables for that child. For the other two children, we weren't able to link those records, and so we have only an observation of the outcome or the predictor variable. I'll be illustrating both an example of a full join and a left join when we do our demonstration in R, and a small sketch of these joins on the toy tables follows below. [ONSCREEN CONTENT SLIDE 11] LINKING NDACAN ADMINISTRATIVE records The variable RecNumbr is an encrypted version of the youth's unique identifier used by the state agency. The ID may go by different names in the various linkable files. These are: NYTD Outcomes File: RecNumbr AFCARS Foster Care File: RecNumbr AFCARS Adoption File: RecNum (for some states) NCANDS Child File: AFCARSID To facilitate linking data among this family of files, a common linking variable – StFCID has been added. It consists of concatenating the state's 2-character postal code to the ChildID, resulting in a 14-character variable. Source: NYTD Outcomes File User's Guide [Alexander F. Roehrkasse] Okay, let's now talk about some of the strategies but also the pitfalls involved in linking archive data specifically. 
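Before turning to those pitfalls, here is the promised sketch of the three joins from slide 10, written with the tidyverse join functions. The toy tables and their coded values (1 for employed full time, 0 for not) are assumptions for illustration only, not archive data.

library(tidyverse)

# Toy versions of the two tables on slide 10
nytd_toy   <- tibble(StFCID = c("A", "B"), CurrFTE_3 = c(1, 0))
afcars_toy <- tibble(StFCID = c("B", "C"), TOTREM = c(3, 4))

inner_join(nytd_toy, afcars_toy, by = "StFCID")   # keeps only B, the child observed in both
left_join(nytd_toy, afcars_toy, by = "StFCID")    # keeps A and B; TOTREM is NA for A
full_join(nytd_toy, afcars_toy, by = "StFCID")    # keeps A, B, and C; unlinked cells are NA

Swapping left_join() for right_join() would instead keep everything in afcars_toy and add what it can from nytd_toy.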
This is some text that comes from the NYTD Outcomes File user's guide. I would advise anyone using the NYTD or any other administrative data set to use these user's guides very carefully, to read them thoroughly, and to think carefully about what they imply for your analysis. Essentially what we see here is that each data set has a slightly different way of identifying children. But NDACAN has created this variable StFCID that results from concatenating the state's two-character postal code to a unique child ID that is specific to that state. This is important to understand. Each state generates its own set of child ID codes, so multiple children could have the same record number if they came from different states. And so it's important not to use only the record number to identify children, because that record number could be reproduced across states. It's the combination of a state identifier and a child identifier that creates a true unique identifier for children; a short sketch below illustrates why. The easiest way to do this is just to use the derived variable that the archive creates, StFCID. [ONSCREEN CONTENT SLIDE 12] Caution in linking The […] youth identifier is encrypted for all these datasets, but is encrypted consistently across datasets, so it serves as an indicator of the same youth across datasets and across years. Be careful, however. These commonalities are generally reliable, but are not applicable to all states in all years. Contact NDACAN Support for further information regarding which states can be linked across specific years. Source: NYTD Outcomes File User's Guide [Alexander F. Roehrkasse] There are some downsides here though. You would also see in the NYTD Outcomes user's guide this cautionary text. The youth identifier is encrypted for all these data sets but is encrypted consistently across data sets, so it serves as an indicator of the same youth across data sets and across years. Great. That means we can link the AFCARS to the NCANDS, or we could link the NCANDS 2010 to the NCANDS 2020. But be careful: these commonalities are generally reliable but are not applicable to all states in all years. Contact NDACAN support for further information regarding which states can be linked across specific years. This specific information is not available in the user's guide or the codebook for any of these administrative data sets, but it is a critical piece of information if you're designing a research project that involves linked administrative data. So, I strongly encourage you, as you're developing your research project, maybe you have a research question and you're trying to figure out, okay, specifically which data sets can I use here? What time periods can I use? Reach out to NDACAN support staff. They will be able to tell you directly: yes, these data sets can be linked in these years; no, you should not use these data sets to link information in those years. This is information the archive maintains internally. You should reach out to archive staff to understand whether your project is feasible. If you don't do this, you're very likely to generate false positive links or false negative links. Either of these scenarios could greatly bias your results. So, it's very important to reach out to understand what we know about the linkability of internal records across years and across data sets. 
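As a brief aside on the identifier itself, here is a minimal sketch, not archive code, of why the state code matters. It assumes a toy data frame with a state postal code in St and a state-specific child ID in RecNumbr; the ID values are made up, and in practice you would simply use the StFCID variable the archive already provides rather than rebuilding it.

library(tidyverse)

# Toy rows: the same state-specific ID appearing in two different states
records <- tibble(
  St       = c("NY", "CA"),
  RecNumbr = c("000000012345", "000000012345")
)

n_distinct(records$RecNumbr)                      # 1 -- the record number alone is not unique
n_distinct(str_c(records$St, records$RecNumbr))   # 2 -- attaching the state code makes it unique

# Concatenating the 2-character postal code and the child ID mirrors how StFCID is described
records |> mutate(id_key = str_c(St, RecNumbr))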
[ONSCREEN CONTENT SLIDE 13] Joining multiple observations One to one Linking variable is unique in both datasets One to many Linking variable is unique in one dataset Many to many Linking variable is not unique in either dataset [Alexander F. Roehrkasse] Okay. We won't talk about this too much in our demonstration, but you should understand that there are different ways to link observations. The simplest is a one-to-one linkage, where one record in one data set corresponds to one record in another data set. In this scenario the linking variable has to be unique in both data sets. We need a one-to-one match across these data sets. Of course, it may be the case that we want to link a piece of information covering multiple observations to each of those observations. This is what's called a one-to-many linkage. So for example, maybe we wanted to understand something about the general, say, political environment of the state that a child lived in. This would be a scenario where we could link the child record to some attribute of the state in which that child lives. But every child in that state would be linked to that same observation of the state-level variable. So this would be a one-to-many example. Less common is a scenario where we're linking many records to many other records. In this case, the linking variable will not be unique in either data set. This is a rarer case. I mostly introduce you to these different kinds of joins to illustrate a variety of problems that can arise if you're not careful in cleaning the data sets that you link. We'll talk a little bit more about this in the demonstration. [ONSCREEN CONTENT SLIDE 14] Joining data: One-to-Many Two tables are displayed that will be joined. The first table is NYTD and shows three columns labeled StFCID, Wave, and CurrFTE. The second table is AFCARS and shows two columns labeled StFCID and TOTREM. A table is displayed which shows the results of a full join and contains five columns labeled StFCID, Wave, CurrFTE, StFCID, and TOTREM. [Alexander F. Roehrkasse] For illustration here though, I will say that while we'll be focusing on linking wide data, we did create a long version of our data in last week's presentation. Long formats are common; indeed, long is the default format in which NYTD records are distributed. So suppose we wanted to leave our data long, that is to say, instead of spreading each unit's observations from different waves across different columns, we keep each unit observed multiple times, with each of those observations organized as a distinct row. Here, each row in the AFCARS data set, the data set in the middle, each of the AFCARS records, will be linked to multiple records in the NYTD file. Essentially, we'll be expanding our AFCARS data, multiplying it three times over, so that in our full join data below, we see the values for that total removal variable replicated three times for child B, because we observe child B one, two, three times across multiple waves. A small sketch of this one-to-many join follows below. [ONSCREEN CONTENT SLIDE 15] Benefits to linking Combining sources expands range of measures Individual-level record linkage: within NDACAN administrative data Aggregate-level record linkage: external data sources Repeated observations: Enhance missing data solutions Help identify and address measurement error Enable longitudinal research designs [Alexander F. Roehrkasse] Okay, this is starting to sound kind of complicated. Why would we ever even consider doing a data linkage? It sounds at this point like more hassle than it's worth. 
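Before getting to those benefits, here is the promised sketch of the one-to-many join from slide 14, again with toy values rather than archive data; the wave numbers and employment codes are assumptions for illustration only.

library(tidyverse)

# Toy long-format NYTD: child B observed at waves 1, 2, and 3
nytd_long_toy <- tibble(StFCID = "B", Wave = 1:3, CurrFTE = c(0, 0, 1))

# Toy AFCARS: a single row for child B
afcars_toy <- tibble(StFCID = "B", TOTREM = 3)

# Full join (one-to-many here): the single AFCARS value of TOTREM
# is replicated across all three NYTD waves for child B
full_join(nytd_long_toy, afcars_toy, by = "StFCID")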
What do we get from data linkage? The most obvious thing is that we're able to combine sources of information, which expands the information that's available to us in our analysis. So, for example, in the project that we're working through this summer, we're interested in the influence of lifetime removals on full-time employment. We simply can't analyze that association without doing a record linkage. The NYTD has the information about the outcome. The AFCARS has the information about the predictor. Without linking them, we can't combine information about those two things. Of course, linking NDACAN data to external data affords almost limitless potential. We can link records to economic data, political data, policy data, demographic data, geographic data, social media data, cultural data, almost anything you can imagine that can be measured over a specific period of time, a specific geography, or a specific demographic group. By aggregating NDACAN data, we can link those aggregate records to external data. Another benefit of record linkage is the possibility that it allows us to repeat observations of a particular unit. To an extent, this is already baked into the NYTD, which is a longitudinal survey. It follows children over multiple points in time. And you saw from last week's session that we were actually able to use the repeated observations in NYTD to fill in some missing values of key variables that were observed at one point in time but not in others. This is a common strategy for dealing with missing data in NCANDS. If you link multiple NCANDS records, you may be able to fill in some missing data for variables that are missing in one observation and non-missing in another. We can also link records over time to see whether certain pieces of information are actually changing for a child or are perhaps mismeasured at a particular point in time. By looking at conflicting measures across observations, we can analyze measurement error in administrative data. And then perhaps the most exciting way to leverage repeated observations of children is that repeated observations essentially create a longitudinal data set. Longitudinal research designs offer all kinds of exciting opportunities to study life course change and to do causal inference. So if you're interested in longitudinal research, record linkage can help to enable longitudinal research designs. [ONSCREEN CONTENT SLIDE 16] Pitfalls of linking Myriad errors arise from linking less-than-clean data Non-missing values of shared variables may not agree Linking may result in (systematic) measurement error NDACAN child IDs are state-specific; interstate moves lead to false negative links NDACAN data reflect variation and changes in record-keeping Data linkage can create/amplify missing data problems [Alexander F. Roehrkasse] Okay, as I've already started to hint at, though, there are some serious pitfalls to linkage, and I want to restate and expand upon some of them. If your data are less than squeaky clean, data linkage can, let's say, accentuate some of the issues that come from mismeasurement. So if your data aren't clean and you try to link them, the most common scenario is that you'll fail to link records that you should in fact link. But another very common problem is that linkage will inadvertently create duplicates of specific observations. And so sometimes your data set can expand two-fold, three-fold, four-fold as a result of trying to smash together dirty data. 
Sometimes linking data generates measurement error or missing data problems. By combining multiple sources of data, sometimes you have missing values that you didn't have before. Sometimes you have conflicting values that you didn't have before. One of the most important things to understand about linking NDACAN data, though, arises from what I described earlier: the fact that child IDs are unique only within specific states. So we can use our StFCID variable to uniquely identify children. But what happens if a child moves from one state to another? Then they're assigned a unique ID in the new state using a whole new system. Essentially, our data are unable to track children who move from one state to another. It is possible to analyze things like interstate migration among US children, but of course the children who show up in our data sets are not representative of the US population. We don't know if they move more often or less often. Archive data also reflect variation and changes in record keeping, and to an extent archive staff know about these and can consult with you on these issues. But there may be unknown unknowns. And so it's important to think carefully about what the unique structure of the child ID variable implies for your analysis. What happens if you don't think carefully about children moving from state to state? How does it affect your measures? How does it affect your models? These are the things you should be thinking about as you're considering linking data, as you're linking data, and as you're assessing data linkages. [ONSCREEN CONTENT SLIDE 17] Demonstration in R [Alexander F. Roehrkasse] Okay, that's it for principles. Let's move now to a demonstration in R. So, forgive me while I change my setup here. I'm going to go ahead and switch to R. As you may recall from last week, we'll be working through RStudio. And if you're not already up and running with R and RStudio, you can go back to the slides from last week for some guidance on how to get set up. Or frankly, you can just type RStudio into any search engine and you'll be directed to a website where you can download both R and RStudio. [ONSCREEN]
# NOTES
# This program file demonstrates strategies discussed in
# session 3 of the 2025 NDACAN Summer Training Series
# "Data Management."
# For questions, contact the presenter
# Alex Roehrkasse (aroehrkasse@butler.edu).
# Note that because of the process used to anonymize data,
# all unique observations include partially fabricated data
# that prevent the identification of respondents.
# As a result, all descriptive and model-based results are fabricated.
# Results from this and all NDACAN presentations are for training purposes only
# and should never be understood or cited as analysis of NDACAN data.
[Alexander F. Roehrkasse] This is a typo. This session is actually session three on data linkage, but you'll find my contact information here and some guidance about how to think about the data and models in this demonstration. A brief reminder that the data we'll be using here look very much like real data. That's to make this demonstration as lifelike as possible. But we are not looking at real NDACAN data. And so any figures we look at should not be taken as ground truth. These are not real child records that we're looking at. [ONSCREEN]
# TABLE OF CONTENTS
# 0. SETUP
# 1. LINKING DATA
# 2. LINKING, SAMPLING, AND MISSING DATA
# 0. SETUP
> # Clear environment
> rm(list=ls())
> # Installs packages if necessary, loads packages
> if (!requireNamespace("pacman", quietly = TRUE)){
+   install.packages("pacman")
+ }
> pacman::p_load(data.table, tidyverse, mice)
> # Defines filepaths and working directory
> project <- 'C:/Users/aroehrkasse/Box/Presentations/-NDACAN/2025_summer_series/'
> data <- 'C:/Users/aroehrkasse/Box/NDACAN/'
> # Set working directory
> setwd(project)
> # Set seed
> set.seed(1013)
[Alexander F. Roehrkasse] Okay. We have a brief setup section. Then we'll talk about linking data and some implications of linking for sampling and missing data. As before, we're going to clear our environment, load the packages that we need to run today's analysis, define some working directories, set R's working directory to our project directory, and set a seed to govern any random processes that we might use. Okay, now we're up and running. Let's read in our cleaned, anonymized versions of the 2020 NYTD and the 2020 AFCARS files. These are essentially the files we created in last week's demonstration. [ONSCREEN]
> # Let's read in our cleaned, anonymized
> # versions of the 2020 NYTD and AFCARS files
> nytd <- fread(paste0(data,'nytd_clean_anonymized.csv'))
> afcars <- fread(paste0(data,'afcars_clean_anonymized.csv'))
[Alexander F. Roehrkasse] You'll see these objects pop up in our environment over here. You'll see that the AFCARS object is much larger: we have roughly 600,000 observations, and we've selected only a few variables. And then our NYTD data set is much smaller, about 21,000 observations. Because we've organized our data wide, we have a very large number of variables: each observation is measured three times, so each variable is actually three columns. For demonstration purposes, let's also read in our long data. [ONSCREEN]
> # For demonstration purposes, let's also read in
> # our long version of the NYTD file
> nytd_long <- fread(paste0(data,'nytd_clean_anonymized_long.csv'))
[Alexander F. Roehrkasse] And now what we're going to do is create some indicator variables, each of which will tell us which source any given record comes from. So if a record has a value of one for this new variable data_nytd, it means it comes from the NYTD; ditto for the long NYTD file. And for the AFCARS, we'll create a variable data_afcars: if it is equal to one, it means that record comes from AFCARS. [ONSCREEN]
> # Let's create a variable that will indicate
> # which record (row) corresponds to which dataset
> nytd <- nytd |>
+   mutate(data_nytd = 1)
> nytd_long <- nytd_long |>
+   mutate(data_nytd = 1)
> afcars <- afcars |>
+   mutate(data_afcars = 1)
[Alexander F. Roehrkasse] Okay, let's go ahead and link our data. For all that I've talked about it, the actual syntax for linking data is surprisingly brief. We'll go ahead and create a new object, d, and we'll do that using our NYTD object, to which we'll use the pipe operator to apply the function full_join(). Recall we talked about different kinds of joins: inner join, left join, right join, full join. A full join is going to retain all of the records in NYTD and AFCARS. We'll try to link everything we can, but even if a record doesn't get linked, we'll keep it in this new object d. We're going to full join NYTD to the AFCARS object, and we're going to specify a condition which says we want to use this specific variable as the joining variable. 
If you don't specify this joining variable, R will by default try to identify all of the variables that have shared names and shared formats and link on those variables. I advise you against doing this naively and letting R make these choices for you. You always want to specify the variables that you'll use to link your data. So, let's go ahead and do that. [ONSCREEN]
> # Let's link our data!
> d <- nytd |>
+   full_join(afcars, by = 'StFCID')
[Alexander F. Roehrkasse] This will take a minute. And now you'll see we have this new object here, d, which has slightly more records than the AFCARS object but slightly fewer than the sum of the NYTD and AFCARS objects. And that's because some rows show up in both data sets. Let's illustrate briefly what would happen if we hadn't specified this joining condition. [ONSCREEN]
> # Note that if we don't specify 'by',
> # the function automatically joins on all shared variables.
> d2 <- nytd |>
+   full_join(afcars)
Joining, by = c("StFCID", "St", "RecNumbr")
[Alexander F. Roehrkasse] We would get a similar object, d2, that has the same number of observations but a slightly different number of variables. Fortunately, in this case it joined on StFCID, as we wanted, as well as on St and RecNumbr. Because StFCID is simply a concatenation of those two variables, the two joins turn out to be identical in our specific case, but this isn't always the case. You really do always want to specify this condition and identify a variable on which to join your data. Let's inspect some aspects of this data linkage. What number and proportion of records from each data set were linked? We'll group by each of these source identifiers, then we'll count up the number of records that came from each of those sources, and then we'll generate some proportion variables that just tell us what proportion of records from each source were linked. We run that chunk of code. [ONSCREEN]
> # Let's inspect some aspects of the linkage.
> # What number and proportion of records
> # from each dataset were linked?
> d |>
+   group_by(data_nytd, data_afcars) |>
+   summarize(n = n()) |>
+   ungroup() |>
+   mutate(prop_nytd = ifelse(!is.na(data_nytd),
+                             n/sum(n[!is.na(data_nytd)]),
+                             NA),
+          prop_afcars = ifelse(!is.na(data_afcars),
+                               n/sum(n[!is.na(data_afcars)]),
+                               NA))
`summarise()` has grouped output by 'data_nytd'. You can override using the `.groups` argument.
# A tibble: 3 × 5
  data_nytd data_afcars      n prop_nytd prop_afcars
1         1           1  20057    0.937       0.0318
2         1          NA   1345    0.0628     NA
3        NA           1 611196   NA           0.968
[Alexander F. Roehrkasse] We get here a nice little table that says, okay, about 20,000 observations in our new d object came from both NYTD and AFCARS. This is essentially the number of records that were successfully linked. This is about 93 or 94% of the NYTD records and about 3% of the AFCARS records. Another 1,345 records show up in NYTD, but we weren't able to link them to an AFCARS record. So essentially about 6% of NYTD records we were not able to link, and then about 97% of AFCARS records weren't linked. That's not really a source of concern for us, because the NYTD itself is a very specific subset of all of the children who appear in AFCARS. So we're not too worried about this number or this number. We don't care how many AFCARS children get linked to NYTD. We care how many NYTD children get linked to AFCARS. So we essentially have about a 94% success rate and a 6% failure rate in linking the NYTD to the AFCARS. 
[ONSCREEN]
# Note an important thing about the above merge:
# The structure of the datasets resulted in a one-to-one merge.
# This is always good to check.
> d |> filter(data_nytd == 1) |> nrow() == nrow(nytd)
[1] TRUE
[Alexander F. Roehrkasse] Now I want you to note an important thing about the above merge, or the above join. The structure of the data sets resulted in a one-to-one merge. This is something we can't always assume. This is something we need to check ourselves. Let's take this data object, select only those rows that came from NYTD, and then ask: is the number of rows equal to the number of rows in the original NYTD data set? Yes. Let's do the same for AFCARS. Let's look at our new data object and ask, among those records that came from the AFCARS, is this the same number of records that were in the original AFCARS data object? [ONSCREEN]
> d |> filter(data_afcars == 1) |> nrow() == nrow(afcars)
[1] TRUE
[Alexander F. Roehrkasse] Yes. Again, that's a good thing. We want each record in one source to correspond to one record in the other source. This would appear to result from the fact that our linking variable StFCID is unique, or distinct, in each data set. As I said, a one-to-one merge requires that each data set have a unique child identifier. If the child identifier is not unique, if the value on which you're linking the data is not unique, then records from one source will be multiplied or replicated to match records in the other source. But was it actually the case that our child identifier was distinct in each of the data sets? [ONSCREEN]
> # This would *appear* to be because the value of the
> # linking variable StFCID was distinct in each dataset.
> # But was that actually the case?
> d |>
+   group_by(StFCID) |>
+   filter(n()>1) |>
+   group_by(data_nytd, data_afcars) |>
+   summarize(n = n())
[Alexander F. Roehrkasse] Let's take this new linked data object. We'll group the data object by each unique ID, and then we'll tell R to return all those observations for which there are duplicate values of this variable. We're essentially counting within each value of the identifier and saying, tell me if that count is greater than one. Then we'll group by each data source and we'll ask R to return to us how many observations from each source violate this condition. This chunk of code will take just a second. It's a little more computationally intensive, but the result will be to tell us, from each data source, were there any records that had duplicated child IDs? [ONSCREEN]
`summarise()` has grouped output by 'data_nytd'. You can override using the `.groups` argument.
# A tibble: 1 × 3
# Groups:   data_nytd [1]
  data_nytd data_afcars     n
1        NA           1    32
[Alexander F. Roehrkasse] Ah, okay. This is not so good. You'll notice that there were no duplicate records from NYTD, but there were 32 records coming only from the AFCARS that had at least one other record with the very same value of StFCID. Now, we got lucky here. It just happened to be the case that none of these children ended up getting linked to the NYTD. But if they had, we would have produced erroneous duplicate copies of a particular child. So we only avoided a major error here as a result of pure luck. This is to illustrate that you really want to, before any data linkage, verify that your linking variables are unique, and after any data linkage, you want to inspect the structure of your data to make sure that you haven't erroneously generated fake records in the course of linking different sources. 
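To make that pre-link check concrete, here is a minimal sketch against the nytd and afcars objects from this demonstration; it assumes the packages loaded in the setup section, and it is only one of several reasonable ways to run the check.

# The linking variable should identify each row exactly once in each source
nrow(nytd)   == n_distinct(nytd$StFCID)     # TRUE if StFCID is unique in the NYTD file
nrow(afcars) == n_distinct(afcars$StFCID)   # FALSE would signal duplicate IDs in AFCARS

# List any duplicated IDs so they can be inspected or resolved before joining
afcars |>
  group_by(StFCID) |>
  filter(n() > 1) |>
  arrange(StFCID)

Recent versions of dplyr (1.1.0 and later) can also enforce the expected structure inside the join itself, for example left_join(nytd, afcars, by = 'StFCID', relationship = 'one-to-one'), which raises an error if either side contains duplicate keys.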
Okay, let's briefly talk about some of the implications of linking for sampling and for missing data. [ONSCREEN]
> # Above we did a full join of our data.
> # But we don't want all the AFCARS records that don't get linked.
> # So we can left join to keep all records in NYTD,
> # including ones that don't get linked.
> d_left <- nytd |>
+   left_join(afcars, by = 'StFCID')
[Alexander F. Roehrkasse] Above, we did a full join of our data. That is to say, we linked what we could, but we kept all of the records that still couldn't be linked. For answering our research question, though, we don't actually want to keep all of the AFCARS records. The AFCARS in any particular year includes tons of records that we're not interested in. Our primary source is the NYTD, and we're trying to borrow from the AFCARS what information we can to supplement the information in the NYTD. So what we really want to do here is left join our data, where NYTD is on the left. That is to say, keep all of the NYTD data and add only those AFCARS records that can be linked to the NYTD. Let's create a new object, d_left, that left joins NYTD to AFCARS using the StFCID variable. You'll notice here that this d_left object has the exact same number of observations as the NYTD object. That's by construction. We're saying only keep those observations in NYTD and add to them whatever variables you can from the AFCARS. Let's do a little inspection of this linkage. [ONSCREEN]
> # Let's inspect this linkage:
> d_left |>
+   group_by(data_nytd, data_afcars) |>
+   summarize(n = n())
`summarise()` has grouped output by 'data_nytd'. You can override using the `.groups` argument.
# A tibble: 2 × 3
# Groups:   data_nytd [1]
  data_nytd data_afcars     n
1         1           1 20057
2         1          NA  1345
[Alexander F. Roehrkasse] Okay, we'll see, again by construction, that every row in this data object comes from the NYTD, and about 20,000 of those records were linked and about 1,300 of them were not. This corresponds to what we saw in our earlier full join. The unlinked records from AFCARS we really don't care about. But the unlinked records from NYTD we want to treat essentially as a missing data problem. Let's go ahead and visualize how much missing data we have for this total removal variable, the key variable we're importing from AFCARS, the whole reason we're linking to AFCARS. How many missing values of this variable result from failures to link NYTD to AFCARS? We'll go ahead and take this new left-joined data object. We'll group by this total removal variable. We'll count up the number of records that have each value of that variable. We'll generate a percentage variable that just tells us what percentage of all records in this object have a particular value for this variable. And then we'll use this code to generate a nice little figure, essentially a bar graph of our different values. [ONSCREEN]
> # The unlinked records from NYTD we want to treat
> # as a missing-data problem. 
> # Let's visualize how much missing data
> # for the TOTALREM variable that we have
> # as a result of failed links:
> d_left |>
+   group_by(TOTALREM) |>
+   summarize(n = n()) |>
+   ungroup() |>
+   mutate(pct = n/sum(n)*100) |>
+   ggplot(aes(x = factor(TOTALREM), y = pct, label = as.character(round(pct,1)))) +
+   geom_bar(stat = 'identity') +
+   geom_text(vjust = -.5) +
+   labs(x = 'Values of TOTREM', y = 'Percentage of NYTD records') +
+   theme_classic()
[ONSCREEN IMAGE 01] A bar graph with "Values of TOTREM" on the x-axis in integers from 1-9, and NA. "Percentage of NYTD records" is on the y-axis with the following ten values corresponding to x-axis values 1-9 and NA: 61.9, 22.3, 6.6, 2, 0.5, 0.1, 0.1, 0, 0, 6.4.
[Alexander F. Roehrkasse] So here you'll see a nice little bar graph where the horizontal axis shows the different values that the total removal variable can take, and the vertical axis shows the percentage of NYTD records in this linked data object that have each value of this variable. So about 62% of records were linked and have the value one, 22% of records were linked and have the value two, etc. No one had a value of total removals greater than seven. 6.4% of records in the NYTD, though, have a missing value for the variable total removals because we failed to link those records to AFCARS. We can try to put this in the context of missing data generally in NYTD. Now, missing data is, broadly speaking, beyond the scope of this summer's training series, but we've produced a lot of training on how to deal with missing data in archive data sets. So I encourage you to look on our website for more information about how to do this. Briefly, though, we will use from the mice package the function md.pattern(), which is a helpful way of summarizing patterns in missing data. [ONSCREEN]
> # Let's consider this in the broader context of missing data.
> # The mice package is R's best package for multiple imputation.
> # It also has helpful diagnostic functions.
> d_left |>
+   select(State,
+          Sex_3, RaceEthn_3,
+          CurrFTE_3, CurrenRoll_3,
+          TOTALREM) |>
+   md.pattern(rotate.names = T)
      State Sex_3 RaceEthn_3 TOTALREM CurrFTE_3 CurrenRoll_3
6741      1     1          1        1         1            1     0
90        1     1          1        1         1            0     1
80        1     1          1        1         0            1     1
13114     1     1          1        1         0            0     2
356       1     1          1        0         1            1     1
2         1     1          1        0         1            0     2
7         1     1          1        0         0            1     2
1012      1     1          1        0         0            0     3
          0     0          0     1377     14213        14218 29808
[ONSCREEN IMAGE 02] An 8-by-6 grid of squares colored red and blue. The columns are labeled left to right as State, Sex_3, RaceEthn_3, TOTALREM, CurrFTE_3, CurrenRoll_3. The rows are labeled top to bottom with the values 6741, 90, 80, 13114, 356, 2, 7, 1012. All of the squares in columns State, Sex_3, and RaceEthn_3 are blue. There are both blue and red squares in columns TOTALREM, CurrFTE_3, and CurrenRoll_3.
[Alexander F. Roehrkasse] What we see here is that each row corresponds to a specific missing data pattern. Blue squares mean that we observe the variable. Red squares mean we don't observe the variable. So we have about 6,700 records where we observe every single variable of interest. These are the kinds of variables we'll be using in our main analysis. We have about 13,000 records, though, that are missing values of our current full-time employment variable and another variable asking whether or not the respondent was currently enrolled in education. Now these are missing data that arise simply through non-response or attrition in the NYTD survey. 
We have 356 observations where the total removal variable is missing but we observe those other variables, and about a thousand observations that are missing all three of these key variables. We completely observe variables like state, sex, and children's race and ethnicity. Analyzing these specific missing data patterns can be very helpful for choosing the most appropriate missing data strategy. And again, I would refer you to trainings that are already archived on our website for guidance about how to do missing data analysis. [ONSCREEN]
# Analyzing these patterns helps us choose the best
# missing-data strategy going forward.
# This is beyond the scope of this year's STS,
# but check out NDACAN's various trainings on missing data.
> # Lastly, let's save our linked data for next week's session
> # on exploratory analysis.
> fwrite(d_left, paste0(data,'d_linked_anonymized.csv'))
[Alexander F. Roehrkasse] Lastly, we'll go ahead and save this left-joined data object, and it'll be this linked, anonymized data object that we'll be picking up in next week's session on exploratory analysis. That concludes the demonstration and my presentation. [ONSCREEN CONTENT SLIDE 18] Questions? Alex Roehrkasse aroehrkasse@butler.edu Noah Won noah.won@duke.edu Paige Logan Prater paige.loganprater@ucsf.edu [ONSCREEN CONTENT SLIDE 19] Next week… Date: July 23rd, 2025 Topic: Exploratory Analysis Instructor: Alex Roehrkasse & Noah Won [Alexander F. Roehrkasse] So I'll go ahead and pull up my slides again and remind you that next week we'll be doing a session on exploratory analysis, in which I'll be talking through a slide deck and then Noah Won, our archive statistician, will be working through the actual demonstration in R. So I strongly encourage you to attend that presentation, which I think will be very helpful in getting going on some basic data analysis. That's it for me. So, I think we'll open up the floor to questions if there are any. [Paige Logan Prater] Thanks, Alex. We don't have any questions from throughout the presentation other than: when will the slides from previous sessions be posted online? I think, Andres, maybe you could help us with that. Oh, I see Andres is typing an answer. So, we try to get them out as quickly as possible. There's a process that we have to go through to kind of clean the transcript up and get everything ready for posting. So please bear with us as we're uploading the previous sessions. But they will all be available. That's our only question right now, but we do have about five more minutes. So, I want to just save the time to see if anyone is typing. If there are any lingering questions from the presentation, please put them in the Q&A box now. Okay. And Andres says there's about a 2 to 3 week lag. The first session will be posted today, but if y'all have any specific questions about previous sessions or information shared, you can always reach out to us. I think, Alex, do you have our contact info on the next slide? [Alexander F. Roehrkasse] No, I don't believe so. [Paige Logan Prater] Oh, okay. Well, we will. [Alexander F. Roehrkasse] Apologies. [Paige Logan Prater] No, that's okay. You can always reach out to us. I'll put our info in the chat. [Alexander F. Roehrkasse] And all of our contact information is available on the NDACAN website. Yup, totally. So, you can find contacts for general user support. You can find personal contact information for me and for some of the other research associates and staff at the archive. I'm seeing a question pop up. 
Which data set was for the predictor and which data set was for the outcome variable? Yeah, let me clarify here. So our main analysis this summer is going to be focused on NYTD, and so we'll use NYTD to structure our sample, and NYTD is going to be where we measure the key outcome of interest, namely current full-time employment three years after aging out of foster care. So the outcome variable is going to be measured in NYTD. The main explanatory variable is going to come from AFCARS. So AFCARS is where we're drawing this variable TOTALREM, or total removals. And this is a variable that measures the total number of times a child, up to the point at which it's measured, has been removed and placed into foster care. Now, we will use some other variables from NYTD as explanatory variables, more like control variables, or even variables along which to stratify our analysis. So for example, I showed you some information about race and ethnicity; that information comes from NYTD. And I talked briefly about this other variable, current enrollment (CurrenRoll). If we were interested in analyzing full-time employment, we would want to know, well, is that person enrolled in school? Because if they're enrolled in school, then we would interpret their not being employed full time differently than we would for someone who is not enrolled in school. So we'll draw a few different explanatory variables from NYTD as well to give context to this main relationship between total removals and current full-time employment. Thanks for the question. Great question. [Paige Logan Prater] Alrighty, I am going to take that as no more questions. I see some folks moving on to the rest of their day. So, thank you all so much for coming to this third session. We have two more. Like Alex said, next week we'll be talking about exploratory analysis, and then we'll have a final session on the last Wednesday of July. [Alexander F. Roehrkasse] Thanks so much everyone. Have a great rest of your day. [Voiceover] The National Data Archive on Child Abuse and Neglect is a joint project of Duke University, Cornell University, University of California, San Francisco, and Mathematica. Funding for NDACAN is provided by the Children's Bureau, an office of the Administration for Children and Families. [MUSIC]