Transcript for 2025 Summer Training Series Session 1: Developing a Research Question and Exploring the Data Presenter: Alexander F. Roehrkasse, Ph.D., Butler University National Data Archive on Child Abuse and Neglect (NDACAN) [MUSIC] [VOICEOVER] National Data Archive on Child Abuse and Neglect. [ONSCREEN CONTENT SLIDE 1] Welcome to the 2025 NDACAN Summer training series! The session will begin at 12pm EST. Please submit questions to the Q&A box. This session is being recorded. See ZOOM Help Center for connection issues: https://support.zoom.us/hc/en-us If issues persist and solutions cannot be found through Zoom, please contact Andres Arroyo at aa17@cornell.edu. [Paige Logan Prater] Good morning everybody. Welcome to our 2025 NDACAN Summer Training Series. My name is Paige Logan Prater. I am the graduate research associate here at NDACAN. Before I kick off our time together, I just wanted to give y'all a couple of housekeeping items. We are recording this session so that we can post it to our website and we can refer back to it. There'll be transcripts available if you want to refer back or if you miss a week, things like that. And because we're recording this session, we're using a webinar format, which means that we would like y'all to ask any questions that you have throughout the presentation using the Q&A box function. At the bottom of your Zoom screen, you should see a little word box with a question mark that says Q&A. If you click that, you can submit your questions, and we will let our lovely colleagues here go through their whole presentation and then we will answer questions at the end in the order that they come in. So we really encourage you to ask questions throughout and just know that we'll get to them at the end of our time together. If you have any Zoom issues, please feel free to reach out to Andres. His contact information is here. Next slide, please. [ONSCREEN CONTENT SLIDE 2] NDACAN Summer Training series National Data Archive on Child Abuse and Neglect Duke University, Cornell University, UC San Francisco, & Mathematica [Paige Logan Prater] Great. So, welcome again to the NDACAN Summer Training Series. NDACAN stands for the National Data Archive on Child Abuse and Neglect, and it is housed at these lovely institutions listed here. Next slide. [ONSCREEN CONTENT SLIDE 3] Life Cycle of an NDACAN research project [Paige Logan Prater] Our data archive is funded by the Children's Bureau, an office of the Administration for Children and Families. And our archive does two learning offerings every year. So we have our monthly Office Hour Series during the academic year, and in the summertime we have a Summer Training Series, and so this is our 2025 Summer Training Series. The theme of the next five weeks will be The Life Cycle of an NDACAN Research Project. So our amazing colleagues Alex Roehrkasse and Noah Won will be leading us over the next five weeks, walking us through, kind of soup to nuts, what does a research project using NDACAN data look like? Next slide. [ONSCREEN CONTENT SLIDE 4] NDACAN Summer Training series schedule July 2nd, 2025 Developing a research question & exploring the data July 9th, 2025 Data management July 16th, 2025 Linking data July 23rd, 2025 Exploratory Analysis July 30th, 2025 Visualization and finalizing the analysis [Paige Logan Prater] And this is just a preview of our schedule for the next five weeks.
We will be meeting every Wednesday at 12 Eastern for an hour. And these are the topics that we'll cover. So we'll start with developing a research question. We'll get into data management and some analysis, and we will end with some visualization and concluding our analyses. I think that is really it. A couple of plugs before we get into it. I did just want to say that NDACAN has two paper award opportunities that we wanted to share. If y'all are on our email list, you've probably gotten a few emails from me. If not, let me know and I can add you to our email list or I can give you instructions to get added. But we have two paper awards that are coming up in the middle of this month. The first is an award for outstanding paper, and the next is an award for outstanding graduate student paper. Nominations for those two awards are due Wednesday, July 16th. So if you have any specific questions about them, please reach out to me or anyone at NDACAN. I'll also send a follow-up email to folks that have registered for the series with more information. But please do take a look. We encourage self-nominations. Send that to your colleagues and your networks. And yeah, that's basically it. I'm going to kick it over to Alex to get us started. We're excited to kick off session one of our summer training series. Alex, take it away. [ONSCREEN CONTENT SLIDE 5] Session Agenda Developing a research question Exploring the data [Alexander F. Roehrkasse] Thanks so much, Paige. I'm really excited to be here today. Thank you all for attending. My name is Alex Roehrkasse. I'm an assistant professor of sociology and criminology at Butler University in Indianapolis, Indiana. I'm also a research associate at the archive, where I used to be a post-doctoral associate. I'm trained as a sociologist, but I do a lot of research using archive data, particularly the administrative data sets NCANDS and AFCARS, to study racial and economic inequality in child welfare system contact. I'm really excited for this year's Summer Training Series, which has a slightly different format. Often the trainings interact with each other to a degree, but this summer will be a little bit unusual in that each session will build very closely on the prior one to illustrate kind of the life course of a research project, as Paige indicated. And so today's session is really going to be about how to get started with a research project. This often involves developing a research question and exploring the data, usually kind of in that order, but as you'll see, there is a little bit of an iterative process that can strengthen each component process, but can also lead to some pitfalls. So we'll talk about how to do each of these component parts of starting a research project in tandem with one another in a responsible and productive way. A reminder that we'll leave plenty of time for questions and hopefully answers at the end of today's session, and you can put those questions in the Q&A on the Zoom. [ONSCREEN CONTENT SLIDE 6] Developing a research question [Alexander F. Roehrkasse] Okay, let's start with developing a research question. [ONSCREEN CONTENT SLIDE 7] First principles for good questions Clarity Focus Answerability [Alexander F. Roehrkasse] In very broad terms, I want to start with just some basic first principles for asking good questions. These may seem obvious, but you'd be surprised how many research questions don't actually meet these criteria.
First, your research question should be clear. Often, the question we have in our head or the question we write down is clear to us, but will it be clear to other people? I think it's often helpful to imagine someone else who is reasonably smart but a non-expert in your field, who might not know the terms, might not know the research well. Would they understand your research question? Second feature: your research question should be focused. A lot of research questions are unnecessarily and unproductively broad. A common example of this is what we sometimes call a double-barreled research question, where we're actually asking two or even more research questions in a single research question, the answers to which could be different. And so we need to focus our research question so that people can interact with it more productively. Third feature: your question has to be answerable. Some questions are interesting and important but aren't actually answerable, and that's okay. But in the context of doing empirical research on the child welfare system, we want our questions to be answerable. So questions about the nature of things, or about the whole set of causes that lead to a particular outcome, might not themselves be answerable. So we want to think: is this a question that we can answer with data in the context of a discrete research project? [ONSCREEN CONTENT SLIDE 8] Improving research questions How unequal is child welfare system contact? What is the ratio of annual incidence of maltreatment investigation across ethnoracial groups? How do youth experience instability after aging out of foster care? What is the risk of homelessness three years after aging out of foster care? Why are children placed into different types of out-of-home care? What is the effect of maltreatment type on placement setting? [Alexander F. Roehrkasse] Let me give some examples of how we might improve research questions along these three lines of clarity, focus, and answerability. So consider this first example research question: how unequal is child welfare system contact? Now, broadly speaking, this is a question I ask all the time. But this question isn't very clear. Unequal between whom? Between which groups of children? Unequal with respect to what type of child welfare system contact? There are different ways in which children can come into contact with this system. How would we measure inequality? There are different ways we can measure disparities or inequality between groups. So here's an example of clarifying that same research question: what is the ratio of the annual incidence of maltreatment investigation across ethnoracial groups? Now, this is a little bit wordier, but it's a lot clearer. We've clarified that we're measuring inequality across ethnoracial groups. We've clarified that we're measuring child welfare system contact specifically in terms of maltreatment investigations, and furthermore that we're going to look at the annual incidence of investigations for these different groups. And then we've clarified that we're going to measure inequality in terms of the ratio, as opposed to, say, the difference, of those incidence rates. So this is a much clearer question than the first one, even though it's essentially asking the same thing. The second question is a bit unfocused: how do youth experience instability after aging out of foster care? Well, experience is multifaceted. There's a bunch of different ways we can talk about children's experience aging out of foster care.
Maybe we can focus this question to make it a little more practical. So a more focused version of the same question is: what is the risk of homelessness three years after aging out of foster care? Homelessness is part of the instability that a child might experience aging out of foster care. It's much more focused. It allows us to use data to answer this question about the experience of instability in a much more focused way. The third example is a research question that's arguably not answerable: why are children placed into different types of out-of-home care? Well, the causes of placement are almost certainly multiple. There are many, many different things that cause placements. So we can focus in on studying a cause and its outcome, but we tend to avoid questions about the whole set of causes that lead to a particular outcome, because it's very difficult in any given research project to capture all of those different causes. So a more answerable version of this question might be: what is the effect of maltreatment type on placement setting? Here we're focusing on a specific cause and asking what the outcomes of that cause might be. [ONSCREEN CONTENT SLIDE 9] Nesting research questions Consider organizing your research question as a set of nested questions with different levels of specificity This can help identify both theoretical implications (higher levels) and empirical details (lower levels) of your question(s) What is the relationship between family stability and child well-being? What is the relationship between foster placement and health? What is the association between placement type and health in the transition to adulthood? What is the effect of institutional placement (relative to foster family placement) on the risk of substance abuse referral among children three years after aging out of foster care? [Alexander F. Roehrkasse] One of the best pieces of advice I got in graduate school was to think about research questions as being part of a nested set of research questions. And this has been very helpful for me in my own work, particularly in connecting my work both to theoretical work, sort of exploring the theoretical implications of my research question, and connecting my research question to the nitty-gritty details of empirical measures. So let me give you an example of this. You'll see down below at the bottom of this slide I have four different research questions. These are actually just four different versions of the same research question. And as we move up in that nested structure, we see that the questions get more abstract. And as we move down, we see that they get more specific. Let's focus on the third question in that stack of questions: what is the association between placement type and health in the transition to adulthood? Now, I think this is a great question to get started with. I think it's the kind of question I might write down if I were starting a research project in this area. But we can think about making this question more specific, more focused, clearer. And I think moving down that stack toward the last question is an example of that. What's the effect of institutional placement relative to foster family placement? So we're clarifying what the marginal effect is here that we're interested in. What's the effect of institutional placement on the risk of substance use referral among children three years after aging out of foster care? So we've made this question even more specific.
This is helpful in thinking about how we'll operationalize our question, how we'll use empirical data to answer this question, what kinds of measures, what kinds of methods we might be using. But if we're trying to connect our research question to broader theories about family dynamics, about the transition to adulthood, about child development, we might actually help ourselves by making our question more abstract, by leveling up, so to speak. So we might think even more abstractly: what's the relationship between foster placement and health? Or even more generally: what's the relationship between family stability and child well-being? Moving up toward greater abstraction is kind of like asking yourself, what is my research a case of? What is this an example of? Nesting your question in higher-order research questions can help you connect your research question to different theories that your question might speak to, might contribute to, might help revise or improve. [ONSCREEN CONTENT SLIDE 10] Where to start Interesting topic Influential theory Valuable data source Innovative method Compelling study [Alexander F. Roehrkasse] Okay, but where to start? Like, really, honestly, concretely, where do you start? There's really no single answer to this. I think research projects start and research questions emerge in different ways depending on people's personal style, their stage in their career, their specific field, whether they like to work independently or collaboratively. So let me just illustrate a few different types of starting places for research questions, for research projects. Sometimes questions start from an interesting topic. So I myself do research on the criminal legal system and the child protective system. And so I sometimes think about how those systems interact, or how we can learn from the criminal legal system to ask questions about the child welfare system. Sometimes questions start from an influential theory. So, for example, some of my research lately has been informed by dose-response mechanisms. I've been thinking a lot about theories of dosage and cumulative exposure to generate questions in my own research, like: what is the cumulative exposure, not just the cumulative risk, of contact with the child protective system? How does racial inequality appear differently when we think about exposure or dosage rather than just incidence? Oftentimes research questions emerge from a valuable data source that we gain access to. So maybe, you know, for one reason or another we come across some data that we think are really valuable, interesting, novel. We ask ourselves: what can we do with these data? What kinds of questions can I answer with these data? One of the things I do for the archive is digitize and standardize and publish historical data on child welfare. So a few years ago, we archived something called the Voluntary Cooperative Information System. And you can go access those data now on the archive. When I was working on developing the VCIS, I was asking myself: how have trends in foster care placement changed over the last four decades? That's a question we couldn't answer without those data. That data source makes that question possible, makes it answerable. Other times, a question might emerge from a new or innovative method, or a method that's tried and true in another field but that we bring to research on child maltreatment in a new and helpful way. So, for example, there have been developments recently in tests for discrimination, particularly racial discrimination.
And so I've been developing research projects that would use those new methods to test racial discrimination, to ask questions about child maltreatment investigations. Who gets investigated? What are the results of those investigations? Very often a question emerges through interacting with a compelling existing study. So maybe a new study comes out, or maybe there's a classic study that you engage with, and you want to replicate or emulate or extend or critique that study. Interacting closely with prior research can be a way that research questions emerge. [ONSCREEN CONTENT SLIDE 11] Questions and "framing" Puzzle: tension between theory and observation that your question can resolve Caution: perhaps there's a better-suited theory Gap: absence of prior observation that your question can fill in Caution: perhaps there's a good reason for the gap Further reading: "On Genre: A Few More Tips to Article-Writers" by Ezra W. Zuckerman https://www.dropbox.com/scl/fi/sv1u2txx9djc74gfywj9v/On-Genre.pdf?rlkey=6urcf87h57m4s75wktds4qnqe&e=1 [Alexander F. Roehrkasse] Sometimes researchers spend a lot of time thinking about how to frame their research question, and sometimes this can be frustrating and not very productive, but often it can be quite helpful. Cynically, thinking about framing can sometimes be about appeasing reviewers or heightening the perceived importance of your research. But more sincerely, framing your research is about orienting your study toward the cumulative progression of scientific knowledge, figuring out how to build off of other people's research and contribute to it. There are many different ways to frame a research study, and I've included a link down below to a piece by Ezra Zuckerman. It's a very provocative piece about how to think about framing your research, particularly in a social research context. But I want to draw out a couple of common frames that people often use and point to why they're helpful, but also why they can be a little bit tricky. One common frame is a little more academic. It's often oriented toward audiences focused a little more on basic research, and that's to focus on a puzzle: some tension between theory and observation that your research question can perhaps help resolve. So imagine, for example, a theory X that predicts some outcome Y, but then you observe some different outcome Z. Why did you observe Z and not Y? What needs to be changed about the theory X to account for your observed outcome Z? This can sometimes lead to better theory. But it's important to be aware that maybe there is a better-suited theory out there. Maybe the puzzle results from your poor choice of theory. And it's important to note also that whatever theory you develop to explain your puzzling or anomalous outcome needs to have what we sometimes call excess explanatory power. It needs to be able to explain your new puzzling case but also still explain the stuff it was already explaining. If you develop some theoretical innovation that explains your new case but no longer explains the prior cases that it used to explain, we're kind of just going around in circles. A second frame for a lot of research is the gap frame. And this is to say: well, we lack prior knowledge about some empirical phenomenon, and you're doing your research to fill in that gap, that absence of knowledge in a particular area. This can be especially helpful when that gap is decision-relevant.
We need to know about that thing that we don't yet know in order to make some decision, whether it's policy-relevant or practice-relevant. Sometimes, though, that gap is there for a reason. It's not actually as important or as interesting as you might think it is. And so it's always important to make sure that you've surveyed the literature well enough to be confident that there is in fact a gap, but also to ask yourself long, hard questions about why that gap is there and whether it should be there. [ONSCREEN CONTENT SLIDE 12] Question development and exploratory research Develop research question Identify/explore potential data sources Data documentation Data Revise/refine question Repeat 2-3 as necessary Avoid data dredging Consider pre-registration [Alexander F. Roehrkasse] This brings us to the relationship between question development and exploratory research. So traditionally what we want to do is develop a research question a priori and then set about identifying or exploring potential data sources that would help us answer that question. Usually most helpful in doing that is to look at data documentation: metadata, or information about the data sources in question. Sometimes it will be necessary to actually start looking at data in the question development process, to actually do what we call exploratory analysis. And in looking over the data documentation and possibly even exploring the data itself, we may then revise or refine our question, and then return to the data documentation and even the data to continue exploring and refining that question. This can lead to more powerful, clearer, more impactful questions, but it also has its downsides. In using exploratory research to refine our research question, we run the risk of what's called data dredging, or sometimes p-hacking. And essentially this is, after doing some preliminary analysis, refitting our question or our hypothesis so that our evidence confirms or refutes it more definitively. That is to say, kind of fitting our question to our data instead of using our data to answer our question. The question should in principle always precede and be independent of our findings. And we can't use the same data to both generate and test a hypothesis. This happens sometimes intentionally, when researchers act in bad faith. More often it happens unintentionally: we're doing this iterative process and we start to lose sight of what our original goal was. Whether intentional or unintentional, though, data dredging really does reduce the credibility of your results. And so this is something you want to be aware of as you're doing exploratory research to improve your research question. One very helpful way to avoid data dredging is to pre-register your research. This is common in certain scientific fields. It's becoming more common in child maltreatment research, and there are resources available. I've linked one from ACF on how to pre-register your studies doing research on children and child maltreatment. What is pre-registration? Pre-registration is essentially where you publicly report your research question, your hypothesis, and even your analysis plan. You publish this for the public to see, and you timestamp it. And you do this before you have access to your data, before you engage in analysis of your data. And so pre-registration helps separate the hypothesis-generating process from the hypothesis-testing process. It separates exploratory research from confirmatory research.
Now, pre-registration can be tricky with observational data, because it's difficult to prove that you didn't have access, weren't already doing your analysis, before you did your pre-registration. Archived data actually provide an opportunity, though, to establish that record. All data that's archived with NDACAN requires a data use agreement, and so you could pre-register your study before you execute that data use agreement, and you could prove that you didn't have access to the data before you actually pre-registered your study. So I think there's an opportunity here to increase confidence in your research, to help avoid data dredging, by pre-registering research that's done using archived data. [ONSCREEN CONTENT SLIDE 13] Not reinventing the wheel What sources and methods have prior researchers used to answer similar research questions? Standard library research methods What research questions have prior researchers asked using potential data sources? Child Abuse and Neglect Digital Library (canDL) [Alexander F. Roehrkasse] Lastly, it's really important not to make things harder on yourself than they need to be. I encourage data users to avoid reinventing the wheel. And there are really two ways to think about doing this. The first is to use standard library research methods to ask what sources and methods prior researchers have used to answer research questions that are similar to yours. So this is where you'd be using standard methods to do keyword searches, search based on topics, to try to figure out what researchers are doing in a substantive area or a methodological area that you're working in. The other way to do this, though, is almost the inverse, and that's to ask what research questions prior researchers have asked using a data source that you're considering using. The archive provides a very helpful resource for doing this, and it's what we call canDL, or the Child Abuse and Neglect Digital Library. This is essentially a repository of research that uses archive data to do a wide array of different types of research. And so you can use canDL to say: what kinds of questions do people ask using the AFCARS? What kinds of studies do people do using NCANDS, using NYTD? You can focus in on a particular data set and ask what kinds of things people are doing with that data set. [ONSCREEN CONTENT SLIDE 14] Identifying Relevant research Screenshot of the NDACAN web site for the NYTD Outcomes File 2020 Cohort Waves 1-3. The site describes the data, and provides a link to studies using NYTD, organized in canDL, NDACAN's data library. [Alexander F. Roehrkasse] How do we use canDL? How do we identify relevant research using canDL? I've included here a screenshot of the archive website, the NDACAN website. And what I've done here is navigate to a particular data set, the National Youth in Transition Database, which is the data set that's going to be the focus of the research project that we're going to be developing over the course of the summer. I'll refer to this data set by its acronym, NYTD. So what you're seeing here is a screenshot of the website for NYTD. This data set has a specific data set number. It has a little abstract that describes the data set. And if you look to the very bottom of this screenshot, you'll see publications from this data set and a hyperlink to NYTD publications. If we click that hyperlink, we'll be taken to canDL. [ONSCREEN CONTENT SLIDE 15] CANDL A screenshot of NDACAN's data library canDL, implemented in Zotero and viewed in a web browser.
The image lists the titles, authors, and publication dates for studies using NYTD data. [Alexander F. Roehrkasse] More specifically, all those canDL entries that have been flagged as being based on the NYTD study. canDL is implemented through Zotero, which is a helpful citation management software. What you're looking at here in this list of studies are the titles, authors, and publication dates for studies that use NYTD in any number of ways. If you look to the bottom left of this screenshot, you'll see that there's an NYTD tag that's kind of a magenta color and is highlighted. That means that canDL is filtering in any study that's identified as using NYTD data. If we wanted to look at studies published using NCANDS or AFCARS, or both NCANDS and NYTD, we could click those other tags to filter in or filter out studies based on different data. [ONSCREEN CONTENT SLIDE 16] Additional considerations What are your goals for your research? What will sustain your interest? What is your comparative advantage? [Alexander F. Roehrkasse] I want to leave you with a few additional considerations for developing a research question. But I think you should take these seriously. These aren't afterthoughts. These are pretty important considerations. First consideration: what are your goals for your research? Being honest, thinking concretely and practically about what you're trying to get out of your research project, is important for clarifying your goals, setting some expectations, putting some scope around your research project. If you're writing a job market paper, your goals are going to be different than if this is a side project for you. If you're the lead author on this study or you're the fourth author on this study, maybe your goals are a little bit different. Thinking honestly and clearly about your personal goals for this research question is important. Second consideration: what are you really interested in? And really, how interested in it are you? Research projects are hard. Those of you who have done research projects know they take a long time. Almost always longer than you think they're going to take. It's very important that you ask questions in which you have a true, deep interest that can be sustained over the course of the project. Third, and this is maybe even the most important one: what is your comparative advantage? What makes you the person to do this research? What makes this the right project for you? The research industry is collaborative in many respects, but it's also a competitive ecosystem. What makes your research competitive? So you might ask, of a specific research question or project, what makes you the right person to pursue this question, to ask this question, to do this research? More generally, of yourself, you might ask: what kind of research are you just the right person to do? What methodological skills or data access or unique background or contextual knowledge do you have that's going to contribute specifically to a set of research questions, a set of research projects? There are certain things that I'm interested in that I would like to do research on, but I just don't have a comparative advantage in them. I don't have better data than other people. I don't have better skills than other people. And so I've done well for myself by focusing on the things where I think I can make a more meaningful contribution relative to other people doing similar research.
[ONSCREEN CONTENT SLIDE 17] Preliminary research question What is the relationship between placement history and unemployment among youth aging out of foster care? [Alexander F. Roehrkasse] Okay, this leads us to what I'm going to call a preliminary research question. And so this is the kind of first version of a research question that we'll be exploring over the rest of the summer. This research question will change even over the rest of this presentation, but this is just a starting-point research question. And here's the question: what is the relationship between placement history and unemployment among youth aging out of foster care? What I mean by placement history is the history of children's placement into different foster care settings. Unemployment should be relatively clear. When we're talking about aging out of foster care, we're talking about the process of reaching the age of 18 while still in foster care and moving out of foster care, not because one has been returned to their caregiver or been adopted, but because one reached the age of majority while still in foster care. I think this question has theoretical relevance. We might think about this question as being relevant to attachment theory, to cumulative disadvantage theory, to theories of legal cynicism or institutional disengagement. This question also has policy relevance. If labor force participation is a major and growing issue in our country, if it's a cause and consequence of social inequality, then maybe altering foster care placement strategy could support economic success in the transition to adulthood among people experiencing foster care. So this knowledge would be valuable from a policy perspective. [ONSCREEN CONTENT SLIDE 18] Exploring the data [Alexander F. Roehrkasse] Okay. So let's now talk about exploring the data, how we might do some exploratory analysis, looking particularly at data documentation, maybe a little bit at the data. We won't actually do that directly today. We'll start to do that next week. How might we start to use data documentation in particular to do some exploratory analysis that might help us improve our research question? [ONSCREEN CONTENT SLIDE 19] Finding Data and data Documentation A screenshot of NDACAN's web site indexing all archived datasets. Datasets are numbered and named, and sorted by publication date (most recent first). Recurring administrative datasets are featured at the top. [Alexander F. Roehrkasse] Okay. How do we find data? How do we find data documentation? Of course, there's data everywhere. This presentation is focused on data sets that are archived with NDACAN, the National Data Archive on Child Abuse and Neglect. Those data sets can all be found on the NDACAN website by clicking the datasets tab up at the top of the website. And so you can see another screenshot here of the NDACAN website. And you'll see a blue bar across the top where datasets is highlighted in black. That's because I've clicked on datasets there. And it's taken me to a list of all the data sets that are archived at NDACAN. Now, NDACAN highlights some of its administrative data that in particular have multiple data sets that are generated again and again, year after year. So things like the AFCARS foster care and adoption files, the NCANDS child and agency files, NIS, NSCAW, NYTD, different administrative data sets that have multiple instantiations. Below that, you'll see a list of more specific studies, and those data sets are numbered. They're numbered sequentially.
So the larger the number, the more recent the study. You can click on each of these studies to be taken to a specific website that describes that data set. So let's do that for NYTD, the focal data set for this summer's research project. [ONSCREEN CONTENT SLIDE 20] ACCESSING Data and data Documentation Screenshot of the NDACAN web site for the NYTD Outcomes File 2020 Cohort Waves 1-3. The site describes the data and provides links to the User's Guide, Code Book, and data access application. [Alexander F. Roehrkasse] Now, you've already seen this screenshot when we were navigating to the canDL entry for NYTD through that hyperlink at the very bottom of the page. But let's talk a little bit more about what's actually on this page that can be helpful. The abstract is a great place to get started. To be candid, some abstracts are more helpful than others. Most helpful here are the links to data documentation. Like many data sets, most archive data sets have two kinds of data documentation, both of which are extremely helpful, particularly when used in combination. You'll see down below, toward the bottom of the page, two sources of data documentation: the NYTD Outcomes User's Guide and the NYTD Outcomes Codebook. What are each of these? [ONSCREEN CONTENT SLIDE 21] Using Data Documentation Cover page for the User's Guide for the Outcomes File of the NYTD FY2020 Cohort. The cover page includes revision dates and contact information for support. Cover page for the Code Book for the NYTD Outcomes File. The cover page includes revision dates and contact information for support. [Alexander F. Roehrkasse] The user's guide is a more text-focused document that explains the structure of the study, the choices that the study authors made in developing the study, both the potential uses but also some of the pitfalls that arise through the study's design, and guidance for users of the data in pursuing different types of research. The codebook, on the other hand, is a more cut-and-dried description of what's actually in the data. These are more like lists of variables included in the data set and descriptions of each of those variables, what they measure, descriptions of the possible values that any variable can take, and what those values mean. Let's look a little more at what we can learn from each of these types of data documentation. [ONSCREEN CONTENT SLIDE 22] Understanding study design Illustration of the design and sampling structure of NYTD. The baseline population overlaps with only a small proportion of the population receiving services. Wave 1 is contained in the baseline population; contained within it are Waves 2 and 3, which closely but not completely overlap. [Alexander F. Roehrkasse] One of the most important things to use the user's guide for is to understand the structure of a particular study. Some studies have relatively straightforward designs. Other designs are more complicated. For better or for worse, NYTD has a more complicated structure than many other studies that NDACAN archives. The figure you're looking at comes directly from the NYTD user's guide. Let me talk a little bit about what we can learn from this figure, just to illustrate how you would think about using a user's guide to understand study design. Every archive study has its strengths and its weaknesses, and these are really baked into its research design.
The NYTD study is a little bit unique in that it's a survey based on administrative records, and that survey is therefore linkable to those administrative records. What do I mean by this? So the way NYTD actually works is that every child who is in foster care on their 17th birthday, or who enters foster care within 45 days of their 17th birthday, is part of what's called the baseline population. And this is essentially the sampling frame, the group of people we would like to study if at all possible. That baseline population, that sampling frame, is illustrated with a blue circle here. Anyone in the baseline population, anyone in this sampling frame, gets sent a survey. That survey asks them questions about all different kinds of experiences in foster care, but also different kinds of outcomes that we might be interested in. Not everyone answers that survey. The people who answer that survey are represented by the orange circle here. We call them the age-17 cohort, or wave 1. Wave 1 is everyone who receives a survey and responds to the survey. And so we actually have demographic information for everyone in the baseline population, everyone in the sampling frame, because we have administrative records for those children when they enter foster care. And when they enter foster care, when we create those administrative records, we collect basic demographic information about them. The information that's only collected in the survey questions, though, is only available for people who respond to the survey: those people in the orange circle, those people in the age-17 cohort, or wave 1. Anyone who responds at age 17 to this survey gets followed up two years later and two years after that. We call these wave 2 and wave 3. If you answer the survey two years later, then you're part of wave 2. Even if you don't answer the survey in wave 2, you still get sent the survey in wave 3, as long as you answered the survey in wave 1. And so the observed data in wave 2 and wave 3 are largely overlapping, but not the same. We might have people who are in waves 1, 2, and 3. We might have people who are in waves 1 and 2, but not 3. We might have people who are in waves 1 and 3, but not 2. Without understanding this study design, it's very easy to make methodological errors, to make errors of inference that would lead us to spurious conclusions. When we choose to use a data set, we owe it to ourselves to invest in learning the structure of the data set, the design of the study that yielded the data. It's very important to invest in this knowledge in order to do your research responsibly.
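To make that wave structure concrete, here is a minimal R sketch of how one might tabulate wave-participation patterns once the data are in hand. This sketch is illustrative rather than the session's actual code: the file name is hypothetical, and it assumes the outcomes file is in long format, one row per youth per wave, with the Wave and child-ID (STFCID) columns described in the codebook discussed below.

library(dplyr)
library(tidyr)

# Hypothetical file name; the real NYTD outcomes file is obtained
# from NDACAN under a data use agreement.
nytd <- read.csv("nytd_outcomes.csv")

# Tabulate each youth's pattern of wave participation to see the
# overlapping-but-not-identical membership of waves 1, 2, and 3.
wave_patterns <- nytd %>%
  distinct(STFCID, Wave) %>%        # one row per youth per wave observed
  mutate(present = 1) %>%
  pivot_wider(names_from = Wave, values_from = present,
              names_prefix = "wave",
              values_fill = 0) %>%  # assumes Wave is coded 1, 2, 3
  count(wave1, wave2, wave3)

wave_patterns  # e.g., a row for youth observed in waves 1 and 3 but not 2

A tabulation like this makes the design point above visible in the data itself: some youth appear in all three waves, others only in some.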
[ONSCREEN CONTENT SLIDE 23] Understanding available data: NYTD A table listing the NYTD Outcomes Variables by position. Listed variables include technical information like Wave and unique child ID, demographic information like sex and race, and outcome information like employment and housing status. [Alexander F. Roehrkasse] Okay. What do we learn from a codebook? This is a list of entries in the NYTD outcomes codebook. You'll usually see two such lists in archive data, at least in the administrative data. You'll see the variables listed essentially by the way they're ordered in the data set, and then you can also see an alphabetical list of those variables. The left column tells you what page the entry is on. You'll see the variable name, which should correspond to how a column is labeled in the data set. The way you want to think about this list is that each row is a variable that's available in the NYTD data set. We're going to be focusing in particular on three different variables in the NYTD data set. The first is the Wave variable, in the very first row. And that just tells us at what point in this child's transition to adulthood we're observing them. We'll also be using the STFCID variable, in the second row. That variable will help us identify unique children and link those children to their records in other data sets, specifically AFCARS. Over the course of the summer, in outlining this research project, we'll be linking the NYTD to the AFCARS. This is a very common strategy in research using archived data, at least administrative data. And so we'll be showing you how to do that data linkage, and we'll use the STFCID variable to do that record linkage. And then, of course, our question was about unemployment. And so we're going to be interested in particular in a variable that falls about halfway down this list, CurrFTE, or current full-time employment. I'll talk in a few minutes about how that actually measures employment. [ONSCREEN CONTENT SLIDE 24] Understanding available data: AFCARS A table listing the AFCARS Foster Care variables by position. Listed variables include technical information like unique child ID, demographic information like sex and race, and system contact information like number of removals. [Alexander F. Roehrkasse] Here's a similar list of variables that are available in the AFCARS, as illustrated from the AFCARS codebook. The AFCARS is a data set that measures all children who pass through the foster care system in any given federal fiscal year. We'll link our NYTD records to the AFCARS records to learn something about the child's placement history. Now, if you were to scroll through the AFCARS codebook, you'd notice that there is information in the AFCARS about the number of placements that a child has had. But that's only for the current spell in which that child has been in foster care. There's not information in the AFCARS about all placements that the child experiences across all spells in foster care. We could try to approximate such a measure by linking many different AFCARS records, but it would be imperfect, and explaining why is beyond the scope of today's presentation. What's important to understand is that we set out to measure placement history for a child's entire life. But if we look closely at the AFCARS data, we find that that might not be possible. That might not be feasible. What we do see, though, at the bottom of this table is variable number 27 (TotalRem), in the second-to-last row, called total removals, or the total number of removals from home. And this measure does capture the total number of times that a child has been removed from their home over the course of their life. This is not exactly the same as the number of placements that a child has had over the course of their life, but it does capture something similar in terms of the instability that a child might experience over the course of their life.
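As a preview of the linking session, here is a minimal R sketch of what joining NYTD outcomes to AFCARS removal histories might look like. Again, this is a sketch under stated assumptions, not the session's actual code: the file names are hypothetical, and it assumes the AFCARS extract carries a matching STFCID identifier (constructing and validating that identifier is covered in the linking session).

library(dplyr)

# Hypothetical file names for data obtained from NDACAN.
nytd   <- read.csv("nytd_outcomes.csv")
afcars <- read.csv("afcars_foster_care.csv")

# AFCARS has one record per child per federal fiscal year; TotalRem is a
# cumulative count, so keep each child's maximum recorded value as the
# lifetime number of removals.
afcars_removals <- afcars %>%
  group_by(STFCID) %>%
  summarize(TotalRem = max(TotalRem, na.rm = TRUE), .groups = "drop")

# Attach the removal history to every NYTD outcome record by child ID.
linked <- nytd %>%
  left_join(afcars_removals, by = "STFCID")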
[ONSCREEN CONTENT SLIDE 25] Understanding measures Description of the CurrFTE variable from the NYTD Outcomes Files code book. The code book entry includes a variable label (Current Full Time Employment), variable definition, and description of variable values and value labels. [Alexander F. Roehrkasse] Okay. Once we identify a variable that we think we're interested in using, we might navigate to that variable's entry in the codebook, where we would find additional information about that variable. So, for example, there's an entry for this CurrFTE variable. Here we would find a definition for that variable: a youth is employed full-time if employed at least 35 hours per week, in one or multiple jobs, as of the date of the outcome data collection. Yes means the youth is employed full-time. Declined means the youth didn't answer. Blank means the youth did not participate in the survey. So it's very important to examine what the variables actually measure. It's not always exactly what you think it is. Now, recall we asked a question about unemployment. Technically, unemployment means not only that you don't have a job, but that you don't have a job and are seeking one. You are in the labor force. You are trying to be employed, but you're not. It turns out that's not actually measurable using this current full-time employment variable. We just know whether someone is full-time employed or not. If they're not full-time employed, we don't know if they're unemployed, that is to say, seeking employment but not having it, or if they've left the labor force and are no longer seeking employment, not trying to work. So this tells us we can still ask a question about employment, but we can't ask it in exactly the way we did before. We actually can't ask about unemployment specifically. We need to ask more precisely about employment, or full-time employment, which we can answer. [ONSCREEN CONTENT SLIDE 26] Revised research question What is the relationship between lifetime incidence of removal and full-time employment among youth three years after aging out of foster care? [Alexander F. Roehrkasse] So let's look at our revised research question. In light of what we learned from the NYTD codebook and the NYTD user's guide, and in light of what we learned from the AFCARS codebook, how can we update our question to make it clearer, more focused, more answerable? Here's my best version of an updated research question: what is the relationship between the lifetime incidence of removal and full-time employment among youth three years after aging out of foster care? So you'll notice that I've made a few changes here. Based on data availability, we've chosen to focus on removals rather than placements. We've chosen to focus on full-time employment rather than unemployment. We've also added some precision to this question. We've clarified that we want to look at the lifetime incidence of removals, not just the number of removals, say, in the teen years or in the year before aging out of foster care. We want to know how many removals this child has experienced over the course of their whole life. We've also specified that we want to look at employment three years after aging out of foster care. We measure children in NYTD at two points after they've left foster care: one year after and three years after. So we need to be specific: which of those points are we looking at? So I think this is a good research question from which to proceed, doing a little more exploratory analysis, actually starting to use some data to see whether this research question still makes sense, whether this is a research question we want to finalize and pursue.
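To see how the revised question might be operationalized, here is one more minimal R sketch, continuing from the linked data frame in the previous sketch. It is illustrative only: it assumes wave 3 corresponds to the three-years-out interview and that CurrFTE carries the value labels paraphrased from the codebook entry above ("Yes", "No", declined, blank); the actual file may use numeric codes instead.

library(dplyr)

# Restrict to the wave observed roughly three years after aging out.
analysis <- linked %>%
  filter(Wave == 3) %>%
  mutate(
    fulltime = case_when(
      CurrFTE == "Yes" ~ 1L,         # employed full-time (35+ hours/week)
      CurrFTE == "No"  ~ 0L,         # not employed full-time
      TRUE             ~ NA_integer_ # declined or did not participate
    )
  )

# Exploratory look: share employed full-time by lifetime number of removals.
analysis %>%
  group_by(TotalRem) %>%
  summarize(share_fulltime = mean(fulltime, na.rm = TRUE),
            n = n(), .groups = "drop")

Note that the missing-data handling here follows directly from the codebook: a blank is a non-respondent, not a "no," which is exactly the kind of distinction the measures discussion above warns about.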
One thing you might consider doing with this research question, and I'll even leave this with you as some homework if you're considering coming back for next week's session, which I hope you will: I talked earlier in this presentation about nesting research questions. How can we make this research question even more specific, to clarify how we'll measure these things empirically? How can we make this question broader, more abstract, more conceptual, so that we can link it to broader theoretical considerations? What is this question a case of? What are the broader, deeper constructs or concepts that this question is getting at? So I'd invite you to explore nesting this question, to see how you might connect it to other research questions, other theories that you're interested in. [ONSCREEN CONTENT SLIDE 27] Questions? Alex Roehrkasse aroehrkasse@butler.edu Noah Won noah.won@duke.edu Paige Logan Prater paige.loganprater@ucsf.edu [Alexander F. Roehrkasse] Okay, that brings me to the end of my presentation. So I'll open things up for questions, but before I do, I want to remind you that I'm always available over email, as are Noah and Paige. And so if you didn't get a chance to ask your question today, or if we don't get to your question, you can follow up with us. Or if you feel like you have a question brewing but you maybe don't get to it in the course of this presentation, that's fine. Please feel free to reach out to us over email. Either we'll be able to answer your question or we can connect you with someone who can. So I'll end my presentation there and let's see. [Paige Logan Prater] Thanks, Alex. I'll curate the questions. Yeah, we have some really amazing questions. Some are a little more tactical, others more conceptual. So I'm actually going to ask what I think will be a very quick question first, and then I'll go in the order that they came in. One of the more recent questions that we had is: will you show how to link the data with AFCARS using R this summer? And I think the answer is yes, but I want to make sure with you, Alex. [Alexander F. Roehrkasse] Yes, I'm happy to say the answer is yes. So we'll be using R for our demonstrations this summer. This is the only presentation where I'll just work through a slide deck and that'll be it. In all subsequent presentations, I'll start with a slide deck in which I'll give a kind of conceptual overview of the things you want to be thinking about and paying attention to, and then I'll transition to R. I'll still share my screen, and I'll work through an R script using some anonymized data. So, not real data, but data that very much illustrate the structure and features and pitfalls of real data. And so you'll be able to follow along and see how I actually implement all of the things I'm talking about in R. And one of those things certainly will be linking data sets. We'll specifically be linking NYTD and AFCARS, but much of the strategy in doing that specific linkage applies to linking archive data sets in general. So even if you're interested in linking, say, AFCARS and NCANDS, or NYTD and NCANDS, you'll learn something about how to do those linkages. [Paige Logan Prater] Amazing. The next question, or the first question that came in, was in reference to how to frame research questions. So I don't know if it's helpful to go back to that slide, but I'll read it: I noticed that these questions change from how or why questions to what.
Do strong research questions start with what, and/or can they not start with how or why? [Alexander F. Roehrkasse] This is a great question. Yeah, it's an astute observation, and I think you have identified a change in strategy. Let's focus on the why questions first. Why questions are usually about causality. But a why question is a pretty open-ended question about causality. So say, you know, why are rates of placement into foster care so much higher among American Indian children than white children, for example? This is a clear causal question, but structured as a why question, it invites so many different possible explanations. Now, if we were writing a book about this, that would be great. We would want to explore all these different questions. But if we're trying to design a discrete empirical study using archive data, we're probably going to want to focus on a more discrete set of possible causes and try to really understand what their causal impact is. And so it sounds like you picked up on the fact that I was transitioning from something like "why does X happen?" to "what is the causal impact of Z on X?" This is just a way of making why questions more specific, more tractable, more answerable. How questions are often about, basically, how a cause works. How do we get from X to Y? And these are great questions. They often involve adjudicating between competing explanations for a causal relationship. So I don't mean at all to discourage people from how questions, or even why questions for that matter, but whenever you're answering a why question or a how question, you do want to think very carefully about the answerability of your question in a discrete empirical context. [Paige Logan Prater] Thanks, Alex. The next question we have is: how do you know when you are data dredging? And regarding delegitimizing your findings, how would others know whether you've data dredged? [Alexander F. Roehrkasse] Yeah, this is a great question as well. So, I mean, you can read up on this. We are currently experiencing what's called a crisis of replicability. And this is moving through different research fields, psychology, economics, political science, where folks are going back and trying to replicate others' research and realizing it's very difficult to do. That raises questions about the validity, the reliability, of people's research. And it's widely suspected that part of what's going on here in the replicability crisis is data dredging. So let's start with how others would know if we've data dredged. It's very difficult to know in any particular case if someone has data dredged. It's much easier to know in broad strokes, statistically speaking, whether a large number of studies include data dredging. We know this because we would expect certain distributions of statistical findings across a large number of research studies. If you're reading a research study, though, and you find that every single statistical test, say, is just barely above the threshold of 95% statistical confidence, you know, that would maybe give you some concern that they've fitted their model so that they make it just over this kind of arbitrary, socially agreed-upon threshold for reportable evidence. Sometimes you see that, not that often though. How would you know if you are data dredging? Now this is tricky.
You would know that you are data dredging if you are, say, running a model, sort of estimating a statistical model, and you find: my results aren't statistically significant. Well, what if I add this additional variable? Then do my results become statistically significant? Oh, okay. Well, I set out to answer this one question, but I didn't really find anything. What if I ask the question a little bit differently? Then are my results statistically significant? If you're doing that, then you're data dredging. Often, though, data dredging isn't quite so intentional, isn't so explicit, and arises from this iterative process between exploratory analysis and hypothesis generation that's done in good faith but proceeds too deeply for your research really to be reliable. So, if you're concerned about this, be honest with yourself about how you're developing your questions in the context of your exploratory analysis. I invite you to do more research on p-hacking and data dredging. And if you ever have concerns about that, feel free to reach out and we can talk about it. [Paige Logan Prater] Thanks, Alex. And I'm seeing that, very helpfully, Andres is answering some more of the technical questions about pre-registering data and access to data, things like that. So I am going to skip to another question that is a little bit more conceptual. And it kind of relates back to these how/why research questions. Looking at the preliminary question you provided, and thinking about the metric you provided regarding clarity, focus, and answerability, how do you know that question is focused enough? What is a general test we can do to ensure our questions are focused? Yeah, answer that in one minute, Alex. [Alexander F. Roehrkasse] This is a great question. Yeah, you might have noticed that my presentation is more focused on kind of guiding principles than on hard-and-fast criteria. And so focus is just kind of a best practice, or a principle, that you want to think about your question manifesting. There is no test or threshold or agreed-upon level at which your question is sufficiently or insufficiently focused. I think the way you should think about my advice is: when you're developing a research question, ask yourself, is this question focused? Will the research question be strengthened by focusing it? Maybe your research question is too focused. That's possible. It's not a case of more is more. So more than anything, I'd just like you to consider focus as a dimension along which you can evaluate your research question, rather than there being some specific threshold of focus that your question should satisfy. [Paige Logan Prater] Amazing. You answered it beautifully and on time. But yes, that could be like a whole, you know, 30-minute conversation. That is all of our time today. Thank you all so much for joining us. Alex, I don't know, is there another slide that shows what we're covering next week? Maybe not. [ONSCREEN CONTENT SLIDE 28] Next week… Date: July 9th, 2025 Topic: Data Management Instructor: Alex Roehrkasse [Alexander F. Roehrkasse] There is. Yep. [Paige Logan Prater] Oh, great. So, next week, next Wednesday, July 9th, we will be meeting at the same time. We'll be talking about data management. And there were a couple of questions that we didn't have time to answer. So please do reach out if you have any follow-up questions. Like Alex said, either we can answer it or we can direct you to the right people to get your questions answered.
Also, some of the questions that we weren't able to get at were a little bit more on the career and professional development side of things. And I just want to plug our monthly Office Hour Series during the academic year, which is starting back up in September, I believe. That series has a specific professional development focus. So we really encourage you to, of course, get your questions answered now, but if you want a more ongoing avenue of support, that is available and coming down the pike. So thank you all again so much. We will see you next week, and I hope you all have a nice rest of your week, and hopefully you're able to get some rest for the holiday weekend. [Alexander F. Roehrkasse] Thanks everyone. [Paige Logan Prater] Thanks everybody. [VOICEOVER] The National Data Archive on Child Abuse and Neglect is a joint project of Duke University, Cornell University, University of California, San Francisco, and Mathematica. Funding for NDACAN is provided by the Children's Bureau, an office of the Administration for Children and Families. [Music]