Transcript for 2025 Summer Training Series, Session 2, Data Management
Presenter: Alexander F. Roehrkasse, Ph.D., Butler University
National Data Archive on Child Abuse and Neglect (NDACAN)
[MUSIC]
[VOICEOVER] National Data Archive on Child Abuse and Neglect.
[ONSCREEN CONTENT SLIDE 1]
Welcome to the 2025 NDACAN Summer training series!
National Data Archive on Child Abuse and Neglect
Duke University, Cornell University, UC San Francisco, & Mathematica
[Paige Logan Prater] Hi everybody. Welcome to our NDACAN summer training series. My name is Paige Logan Prater. I am the graduate research associate here at NDACAN. NDACAN stands for the National Data Archive on Child Abuse and Neglect. We're housed across the universities and institutions listed below on the slide, and we're funded through the Children's Bureau. Welcome to session two of our five-session training series. We do this every summer and we're happy to have you here this year. Next slide please.
[ONSCREEN CONTENT SLIDE 2]
NDACAN Summer Training series schedule
July 2nd, 2025 Developing a research question & exploring the data
July 9th, 2025 Data management
July 16th, 2025 Linking data
July 23rd, 2025 Exploratory Analysis
July 30th, 2025 Visualization and finalizing the analysis
[Paige Logan Prater] The theme of our series this summer is "The Life Cycle Of An NDACAN Research Project". And as this slide shows, last session Alex walked us through developing a research question and exploring the data. Today we're going to be talking a little bit more about data management. And the next three weeks are also listed here. Next slide please.
[ONSCREEN CONTENT SLIDE 3]
Life Cycle of an NDACAN research project
This session is being recorded.
Please submit questions to the Q&A box.
See Zoom Help Center for connection issues: https://support.zoom.us/hc/en-us
If issues persist and solutions cannot be found through Zoom, please contact Andres Arroyo at aa17@cornell.edu.
[Paige Logan Prater] Just a few housekeeping items before we jump in. All of our sessions will be recorded and they are available on the NDACAN website. If you'd like to refer back to them or if you aren't able to join us live, everything can be found there. And if this is your first time with us, we host two learning series every year: our monthly Office Hours, which occur during the academic year, and our Summer Training Series, which we're in right now. All of our information about events and offerings can be found on our website, which also includes recordings and transcripts from past series as well as current ones. The next item here is: please submit questions to the Q&A box on Zoom. If you hover over the bottom of your Zoom screen, you should see a Q&A icon. We will be using that to collect questions throughout the presentations. We encourage you to submit questions throughout the presentation as they come up, but we are going to save them for the end, and we'll answer as many as we can. Because we want to capture these questions in our recording for posting on our website, I will read the questions aloud and then Alex or Noah will answer them live. For some of the more technical questions, you know, data request or access questions, Andres has been really helpful in responding to those directly in the Q&A box, so look out for that. But otherwise we will get to as many questions as we can, and please also reach out to Andres if you need any assistance with Zoom or any technical difficulties throughout the session.
Really quickly before we jump in, I just want to do another plug for two paper awards that NDACAN is accepting nominations for right now through July 16th. If you are on our listserv, you've probably seen a bunch of emails from me about these awards. I will be sending out some more information throughout the next couple of days. But they are all due on Wednesday, July 16th. And if you have any specific questions about them, please reach out to me or anyone at the Archive. Okay, those are all of my announcements. Alex, feel free to take it away and get us into our next session on data management.
[ONSCREEN CONTENT SLIDE 4]
SESSION AGENDA
STS recap
Data management
Demonstration in R
[Alexander F. Roehrkasse] Great. Thanks, Paige. Hi, everyone. Thanks for being here today. My name is Alex Roehrkasse. I'm an assistant professor of sociology and criminology at Butler University in Indianapolis. And I'm also a research associate at the National Data Archive on Child Abuse and Neglect. This is the second session in our Summer Training Series this year. And I'm excited about the structure of the STS this year because it follows the life cycle of a research project. And what that means is that each session is going to build on the previous session. If you weren't here for the first session, don't worry. All the material today will still be accessible to you, and I'll be providing a brief recap of some of the major themes that we talked about last week that will segue into our topic for today. So like I said, I'll do a brief recap of the previous session on developing a research question. Then I'll move through a slide deck here in PowerPoint on data management. I'll talk through some basic principles, questions, strategies. Then, differently from last week's session, we'll move over into an actual demonstration of these principles, these strategies, in R. If you want to follow along, you can open R yourself. But as I believe Paige mentioned, we'll be archiving this presentation, and that will include a transcript and also the output of the demonstration in R. So if you're not able to keep up with each line of code, don't worry. You'll be able to go back and find that once we upload this presentation. As Paige said, I'm very happy to take questions and I'll leave plenty of space at the end of the presentation for that. But if questions arise throughout the presentation, please just drop them in the Q&A and we'll do our best to address all the questions.
[ONSCREEN CONTENT SLIDE 5]
STS RECAP
[Alexander F. Roehrkasse] Okay. What did we talk about last week?
[ONSCREEN CONTENT SLIDE 6]
DEVELOPING A RESEARCH QUESTION & EXPLORING THE DATA
Consider clarity, focus, and answerability
Nest questions at different levels of generality
Use data documentation (User Guides, Code Books), limited data analysis, and prior research (canDL) to refine research questions
[Alexander F. Roehrkasse] In broad strokes, we talked about a few different principles and best practices for developing a research question and then honing that research question through an exploration of data documentation and, to a limited extent, the data itself. I encourage you to think about the clarity, focus, and answerability of your research question. This may seem obvious, but many research questions could be clearer, could be more focused, could be more answerable. So, it's not so much of a yes or no, but can we improve along these dimensions?
I also suggested that nesting questions at different levels of generality can be a helpful way to connect your research question at the lower levels to concrete issues of measurement and operationalization and at the higher levels to broader theoretical concerns. So, how might you specify your question, make it more specific, more concrete, to highlight its connection to actual data? How might you make your question broader or more abstract to help identify connections between your research question and larger theoretical concerns? Lastly, we talked about using different forms of data documentation, more specifically the user guides and code books that NDACAN produces and distributes for each of its archived data sets: how to use that documentation to begin to explore your data, and how to use canDL, which is the Archive's repository of published research that analyzes archived data, to better understand what prior researchers have done using archived data. Again, if you missed that presentation or have any questions about that presentation, you'll be able to find it archived on the NDACAN website.
[ONSCREEN CONTENT SLIDE 7]
RESEARCH QUESTION
What is the relationship between lifetime incidence of removal and full-time employment among youth three years after aging out of foster care?
[Alexander F. Roehrkasse] Recall from our prior session that we proposed a preliminary research question and then, on the basis of some of the skills we learned in that presentation, refined that research question to a working question that looked something like this. What is the relationship between lifetime incidence of removal, that is to say removal from the home, placement into foster care? So, what's the relationship between the lifetime incidence of removal and full-time employment among youth 3 years after aging out of foster care? This will be the research question that we'll be focusing on for the rest of the Summer Training Series. So when we're exploring our data, cleaning our data, linking our data, starting to do analysis and visualization, making decisions about how to present our data, all of these are going to be framed around trying to develop a research project that answers this question.
[ONSCREEN CONTENT SLIDE 8]
DATA MANAGEMENT
[Alexander F. Roehrkasse] Okay. So today's presentation is focused on data management. And this is obviously a very large topic. We're just going to be skimming the surface today. Many of the things we talk about are general principles for data management. They'd be good ideas to think about and to implement no matter what kind of data you're working with. Some of my suggestions, though, will be a little more specific to Archive data, and even more specifically to administrative data that's archived with NDACAN. Recall that on the basis of our research question, the main data source we're going to be exploring this summer is NYTD. But we'll also be linking the NYTD to AFCARS.
[ONSCREEN CONTENT SLIDE 9]
DATA MANAGEMENT AS CRISIS MANAGEMENT
You'll make mistakes, You won't remember, They can't read your mind.
Never work from the console, Annotate code liberally, Keep a research journal, Save everything and often.
[Alexander F. Roehrkasse] I like to think about data management as crisis management. There are lots of problems that arise in the research process, and it can be helpful to think about best practices in terms of what will help you avoid different crises, different problems. What are some of these problems?
You're going to make mistakes when you're coding. You're going to make coding errors when you're trying to clean up your code or edit your code. You'll accidentally delete things that you didn't mean to delete. You'll make all kinds of mistakes and you'll need to be able to fix these mistakes. You're also not going to remember what you did. This is maybe one of the most important things I've learned over the course of becoming a professional researcher. It's really easy to convince yourself you're going to remember some really important decision you made. It was obvious why you made that decision. There's no way you'll forget you made that decision. You're going to forget that decision. It's important that you remember each of the consequential decisions you make while doing research. And so it's important to have a record of those decisions. That's more about communicating with your past and future selves. But of course, we want to communicate with other people about our research process as well. Good research is transparent and replicable. And so we want other people to understand what we've done and why we've done it. Other people can't read our minds. They don't know why we've made the decisions we've made. Reasonable people can disagree about good decisions. And so it's important to have a record of the decisions we've made and why we've made them. For these reasons, I recommend a few different basic practices for treating data management as crisis management. The first is to save everything and often. I like to save a version of any file I'm working on, particularly a script, a piece of code that I'm working on. Any day that I work on it, I save a new version of that code. And at the end of that file name, I just add a year-month-day (YYYYMMDD) suffix. And so that means that my code files are kind of organized by date. It makes it really easy to go back and find different versions of my code at different points. Of course, if you work through something like GitHub, archiving of your work is a little bit different. In some ways easier, in some ways more challenging. Whatever you do, the most important thing is to save your work often and ideally to save multiple versions of your work so that it's archived. You can go back and find different versions of your work before you made different consequential decisions. Second principle, never work from the console. It will be a little bit clearer what I mean by this when we move into our demonstration. Most statistical software will allow you to simply enter different commands or different instructions into the graphical user interface. I advise doing this almost never. The reason for this is that when you enter commands directly into the console, there's no record of you having done so. For this reason, it's always helpful to write a script. In R, it's often called a script or a markdown file. In STATA, it's called a do file. And this is a written record. It's essentially a text file, and it's a written record of every command that you told your software to perform over the course of a research project. Third principle, all scripts, whether it's a script or a markdown file in R or a do file in STATA, all of these different types of files allow you to make comments in the file. That is to say, take notes about the code. And R or STATA won't actually run these notes. They won't do anything about these notes.
They're just there to explain what the code is doing and why it's doing it that way. So when we move into this demonstration, you'll see that I've annotated my code liberally to explain to you, to myself, and to other people what the code is doing and why it's doing it that way. Lastly, I find it very helpful to keep a research journal. I do this in a Google doc. You can use whatever works for you, but basically any day where I'm doing any meaningful work on a research project, I just take some notes about what I did, the decisions I made, why I made them. I keep a running list of questions or to-dos or challenges that are live in my research. I can't tell you how many times I've gone back and searched through these research journals to find more information about decisions I made sometimes a month ago, sometimes a year ago, sometimes five years ago. You never know when you're going to need to find information about decisions you made. Keeping a journal is helpful for keeping track of these decisions. This is a way of thinking about data management as a way to avoid major problems in your research projects.
[ONSCREEN CONTENT SLIDE 10]
FILE ORGANIZATION AND WORKFLOW
Project (local machine with backup to cloud server)
Programs (R scripts)
Drafts
Figures
Tables (tabular data)
Data (FedRAMP-authorized server, encrypted external drive)
Raw
Derived (micro-data)
[Alexander F. Roehrkasse] Now, people organize their files and their workflow differently. But I would recommend thinking about basic file organization and workflow in this way. You want to think about having a project folder and a data folder. Your project folder can be on your computer on a local machine, but I recommend that you always have that backed up to a cloud server so that if your computer crashes or you lose your computer, you always have a backup of your project. And in that project folder, you'll have a subfolder that includes all of your scripts or your do files. And then you'll have a drafts folder that includes output like figures that you make or tables that you make. And by tables here, I don't mean your raw data, but maybe you want to generate some summary statistics or a crosstabulation. And you might save that information in your drafts folder. Separately, you want to have a data folder. And it's important that this data is either on a FedRAMP-authorized server or on an encrypted external drive. Those are the terms of your data use agreement for any archived data that you use, at least any administrative data that you use. And this more secure place is where you would keep any raw data and also any derived data that still includes microdata, which would be identifiable. Again, you can have a slightly different file organization or workflow than this, but in basic strokes, this kind of organization and workflow will help you keep track of the different components of your research project and also uphold the requirements of your data use agreement.
[ONSCREEN CONTENT SLIDE 11]
EXAMINING YOUR DATA
What is the structure of your data? What are the columns? What are the rows?
What viewpoint on your data do you need to understand it?
[Alexander F. Roehrkasse] Okay. Once we're up and running, it's time to start examining our data. One of the very first questions that's helpful to ask yourself is: what's the structure of your data? By which I mean, literally, how are the data organized? What are the rows? What are the columns? How many rows are there? How many columns are there?
Most archived data sets are going to be too large to open in a program like Excel, where you would normally look at tabular data. And even in programs like STATA or R that will allow you to open up a spreadsheet and actually look at your data, very often these data are going to be too large to simply open and scan. So you'll need to think about what viewpoint onto your data will help you most understand it. There are different ways we can do this, and I'll be illustrating some of them when we get to our demonstration.
[ONSCREEN CONTENT SLIDE 12]
SUMMARIZING YOUR DATA
What are the distributions of key variables?
How can summarization help understand study design?
How can summarization help understand measurement issues, such as missing data?
[Alexander F. Roehrkasse] The next thing we might want to do is start generating some summaries of our data. So instead of looking at the raw data, how might we generate some summaries? Summary statistics, different counts or crosstabulations of our data to help understand what's actually going on. When we're talking about summarizing, we're talking about generating summary statistics like measures of central tendency or variance, or simply counting up our data. And this can be helpful very often as preliminary results. So you often see summary statistics reported as the first table in published research, but it can also be helpful for understanding the data, any errors or issues in the data, and it can be helpful for understanding study design. So for example, we'll be looking at NYTD today. NYTD is a longitudinal study that follows people over multiple points in time. We can use summarization techniques to ask questions, for example, about how much attrition there is in NYTD across different waves. So I'll be showing you how to generate a crosstabulation of different observations to try to track how many people fall out of the NYTD survey. This is one way we might summarize our data to understand the study design a little bit better and some issues or errors that arise in the data set.
[ONSCREEN CONTENT SLIDE 13]
SUMMARIZING YOUR DATA
Image description: Description of the CurrFTE variable from the NYTD Outcomes Files code book. The code book entry includes a variable label (Current Full Time Employment), variable definition, and description of variable values and value labels. Actual text shown in the image:
CurrFTE
Variable Label: #37: Current Full Time Employment
Definition: A youth is employed full-time if employed at least 35 hours per week, in one or multiple jobs, as of the date of the outcome data collection. "Yes" means the youth is employed fulltime. "Declined" means the youth did not answer this question. "Blank" means the youth did not participate in the survey.
Data Type: TinyInt
NYTD Element: #37
Value 0 Value Label no
Value 1 Value Label yes
Value 2 Value Label declined
Value 77 Value Label blank
[Alexander F. Roehrkasse] An example of a variable that we might want to summarize and begin to explore is one of the main outcomes in, or I guess the main outcome in, our research question this summer, and that's current full-time employment. This screenshot is from the codebook for NYTD, and you saw this image in the previous presentation. So in the codebook, for each variable you'll see the variable name, the variable label, a definition of the variable. You'll also see the kind of data that it is. Here it's an integer. You'll see the different values that the variable can have and the labels that attach to each of those values.
How might we summarize a variable like this? Well, you see that the values of the variable are numbers. So, we could think about this variable as a continuous numerical variable.
[ONSCREEN CONTENT SLIDE 14]
SUMMARIZING YOUR DATA: CONTINUOUS VARIABLES
CurrFTE
Mean 26.1479
Median 0
Standard deviation 36.33093
[Alexander F. Roehrkasse] And then we could summarize our data using different measures of central tendency like mean, median or mode. We could think about calculating different measures of variance like standard deviation. We could have also calculated minimum and maximum values of this variable. But truthfully, this doesn't make very much sense in the context of this variable, because even though this variable has numerical values, this is not a continuous variable. Really, CurrFTE is a categorical variable that has been encoded with different number values. So we want to think about each of the possible values of this variable as being different categories that have been arbitrarily assigned different numbers. There's no meaning to the interval between zero and one, one and two, or two and 77 in this variable. These are simply encodings of different categories.
[ONSCREEN CONTENT SLIDE 15]
SUMMARIZING YOUR DATA: CATEGORICAL VARIABLES
CurrFTE
Value 0 Frequency 25426 Proportion 0.539
Value 1 Frequency 5038 Proportion 0.107
Value 2 Frequency 492 Proportion 0.0104
Value 77 Frequency 15799 Proportion 0.335
Value NA Frequency 434 Proportion 0.0092
[Alexander F. Roehrkasse] And so a more helpful way to think about summarizing a categorical variable like this would be to tabulate the frequency of observations having different values of this variable. That's what you see in the second column here. How many observations in NYTD have the value zero for current full-time employment? How many observations have the value one or two or 77? We can also think about the proportion of observations that have that value. So you'll see about 54% of observations in NYTD have the value zero for the CurrFTE variable. Calculating the frequency or proportion of different values can be more helpful when we're summarizing categorical variables. And I'll illustrate how to do this in R.
[ONSCREEN CONTENT SLIDE 16]
CLEANING DATA
How are variables formatted? String, numeric, labeled factor, etc.
How are values coded? Codes will not always correspond to the code book: always verify
Do you care about the missing data mechanism?
[Alexander F. Roehrkasse] Okay. It's important also to clean our data. Now, we would like our data to be as clean as possible when we distribute them through the Archive, but I'll be candid with you. The data need some additional cleaning. At the very least, you want to look at how variables are formatted. Different variables can be formatted in different ways. They can be text variables, or what we sometimes call strings. They can be numerical variables, basically numbers. They can be factor variables, categories that might have labels. These might be ordered, they might not be. There are different formats that different statistical software will assign to different variables. Even more important is to understand how different values are coded. This is extremely important. Codes will not always correspond to the code book. And so for any variable that you are analyzing, it's important to independently verify the values of the different variables that you might be using. Let me go back just a little bit to illustrate this problem.
In the code book description of CurrFTE, you'll see that there are four possible values: no, yes, declined, and blank. But when we actually tabulate the values, we see that there are actually five possible values of this variable. There are 434 observations for which there is no value for that variable. You see that as NA. That's how R codes missing values. And so the code book doesn't actually reflect the way that missing values have been encoded in the distributed data file. So it's important that we independently verify all the different values that a variable can have. Particularly when it comes to missing data, you'll see that sometimes missing data are encoded in different ways depending on the missing data mechanism. So for example, sometimes the data are missing because the respondent chose not to state. Other times they're missing because a respondent didn't fill out that variable. Other times they might be missing because a respondent who responded to the survey in one wave didn't take this wave at all, didn't respond to any variables in this wave. So you need to ask yourself, do you care about the missing data mechanism? Do you care about the different reasons why data might be missing? If so, you might want to leave those missing data encoded in different ways. If not, you can treat them all as missing data and assign them the value of NA.
[ONSCREEN CONTENT SLIDE 17]
PROGRAMMATIC CODING
Excessive copying/pasting of code can decrease interpretability, increase risk of error
Programmatic coding makes repetitive processes more transparent, malleable
Common examples:
Functions
Loops
Tidyverse functions, especially in the purrr package
[Alexander F. Roehrkasse] As you'll see again when we get to our demonstration, script files and do files can get quite long. Often we're implementing a certain process over and over and over again, and it can be tempting to just copy and paste our code. Sometimes this is appropriate, but it can introduce possibilities for error and decrease interpretability. To reduce the possibility that we make errors, to increase the likelihood that people understand what we're doing, it can be helpful, to the extent possible, to use programmatic coding that makes repetitive processes more transparent and more malleable. What do I mean by programmatic coding? A few different things. One, we might write functions. And what a function is is essentially a chunk of code or an algorithm that we name. We assign it a name. We take a process and we label it. And then whenever we tell R to run that algorithm, we sort of tell R, okay, consider this named algorithm. It will run the same chunk of code every single time. This is one way to reduce the possibility that we make an error in copying and pasting our code many times. Another example of programmatic coding is a loop. A loop is a certain process that we tell R or STATA to run. It's a single process, but then we tell it to perform it over multiple elements. So we say, okay, take this set of elements and do this one thing to it over and over and over again, to each element. That's another way to avoid errors and to increase the interpretability of our code. Lastly, I'll be illustrating today using many functions that belong to what's called the Tidyverse, which is a kind of super family of different packages and programs in R. The Tidyverse has a package called purrr (p-u-r-r-r) that includes many commands, functions that are particularly helpful for programmatic coding, for avoiding the kinds of problems that I'm describing here.
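[SUPPLEMENTARY EXAMPLE] To make the three ideas on this slide concrete, here is a minimal sketch of a function, a loop, and a purrr-style map. This is not part of the demonstration script: the data frame nytd and the variables CurrFTE, CurrPTE, and EmplySklls are assumed from the NYTD examples discussed above, and the helper name n_blank is made up for illustration.
# A function: a named, reusable chunk of code.
# This one counts how many values of a variable use the 77 ("blank") code.
n_blank <- function(x) {
  sum(x == 77, na.rm = TRUE)
}
# A loop: run the same step once for each element of a set of variable names.
blank_counts <- c()
for (v in c("CurrFTE", "CurrPTE", "EmplySklls")) {
  blank_counts[v] <- n_blank(nytd[[v]])
}
# purrr: map_int() applies the same function to each element and collects the results.
library(purrr)
map_int(c("CurrFTE", "CurrPTE", "EmplySklls"), ~ n_blank(nytd[[.x]]))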
So you can look up the purrr package. You can find a cheat sheet for the purrr package. You can find vignettes, illustrations of how to implement different parts of the purrr package to code programmatically.
[ONSCREEN CONTENT SLIDE 18]
FORMATTING DATA
Rows should almost always correspond to your unit of analysis
Columns should almost always correspond to variables
When they don't, pivoting (R) or reshaping (Stata) the data is usually necessary
Example:
For our project, the unit of analysis is the person
In the NYTD 2020 Cohort file, each row represents a person-wave
Reshaping wide will leave persons as rows, with variables observed multiple times represented as multiple columns
[Alexander F. Roehrkasse] How should we think about formatting our data? There may be exceptions to these suggestions, but in my opinion, rows in your data should almost always correspond to your unit of analysis. One of my pieces of advice from the previous presentation was to consider very carefully, and at the beginning of your research process, what is my unit of analysis? This seems like a straightforward question. It isn't always. Getting clear about what your unit of analysis is will help you format your data appropriately. Ideally, each row in your data set should correspond to your unit of analysis. Columns then should almost always correspond to different variables. They may measure variables at different points in time, but more or less, columns should always correspond to different variables. Very often, data sets that we distribute will already be formatted in this way. But when they aren't, it's necessary to reformat the data, sometimes called transposing. In R, we'll use a set of commands called pivoting. In STATA, this is called reshaping our data. Let me give you an example of what it would mean to pivot or reshape our data. So for our project, the unit of analysis is the person. Given our research question, the unit of analysis that we'll be focusing on is the person. In the NYTD 2020 cohort file, each row does not correspond to a person. Each row corresponds to a person-wave. That is to say, each row is a person measured at a specific point in time. So a single person can have multiple rows, which means that our unit of analysis, the person, doesn't correspond to the row. We have too many rows for each person. We should have one row per person. In order to address this problem, we can reshape our data or pivot our data wide. And this will leave persons as rows, with variables observed multiple times represented as multiple columns. Let me try to illustrate what this means to pivot or to reshape our data.
[ONSCREEN CONTENT SLIDE 19]
PIVOTING DATA
Long data
Wave StFCID CurrFTE
1 AL12345 0
1 AL67890 0
2 AL12345 0
2 AL67890 1
3 AL12345 1
3 AL67890 2
Wide data
StFCID CurrFTE_1 CurrFTE_2 CurrFTE_3
AL12345 0 0 1
AL67890 0 1 2
[Alexander F. Roehrkasse] The default formatting of the NYTD file that I just described is what we call long data. We can visualize that on the left. So the table on the left illustrates the basic organization, the default organization of the NYTD file. You'll see here that we basically have two people observed three times. So if you look at that middle column, StFCID, this is the unique identifier for each child in NYTD. You'll see though that each of these IDs appears three times.
That's because each ID is measured three times, as a different value of this wave variable. When we reshape our data wide, what happens is we essentially take that wave variable and we append it to the variable name CurrFTE. So now, instead of having one column corresponding to one variable, CurrFTE, we have three columns, which measure the value of that variable at each wave, at each point in time. Now, in the wide data visualized on the right, we only have one value of StFCID per row. In other words, one person, one row. Our rows now correspond to our unit of analysis.
[ONSCREEN CONTENT SLIDE 20]
DEMONSTRATION IN R
[Alexander F. Roehrkasse] Okay, now it's time to move to a demonstration in R.
[ONSCREEN CONTENT SLIDE 21]
GETTING STARTED WITH R AND RSTUDIO
A screenshot of the web site (https://posit.co/download/rstudio-desktop/) at which R and RStudio can be downloaded and installed. The screenshot includes click buttons for downloading and descriptions of R and RStudio.
[Alexander F. Roehrkasse] Before we start, a brief primer on how to get going with R yourself. There are different implementations of R. Far and away the easiest and most popular is a graphical user interface called RStudio. The way you want to think about this is that R itself is a language, and it needs a sort of vehicle, an implementation that you can interact with. And so getting started with RStudio is essentially a two-step process. There's a link here in the slides, but basically if you type RStudio into any search engine, one of the first options will be to go to Posit, the distributor of RStudio. And there will be two buttons there where you can first download and install R, and then second, download and install RStudio, which will implement R on your desktop. I'll be working with R through RStudio, and I'll use those terms interchangeably. Okay, here we are. This is RStudio. Let me give you a brief introduction to the RStudio environment. In the upper left quadrant here, you'll see I have an R script running, and this is the name of the script. R scripts have the suffix .R. This is essentially a text file, and it includes, as I said, all the annotations for myself and for others, and then the actual code that I'm going to be running. Over on the right is what we call the environment. R is an object-based software, an object-based language. As I run code, I'll be creating objects, and the various objects in my environment will show up here. I'll be able to see them here in the environment. Lastly, down on the bottom left is the console. This is where we'll see all of the code that I run, all of the commands that I tell R to do, and also all the output of those commands. You'll see that I'm able to leave comments or annotations in my code by prefixing them with the pound sign. This tells R: don't do whatever is written here, this is just a note to myself.
[ONSCREEN]
# NOTES
# This program file demonstrates strategies discussed in
# session 2 of the 2025 NDACAN Summer Training Series
# "Data Management."
# For questions, contact the presenter
# Alex Roehrkasse (aroehrkasse@butler.edu).
# Note that because of the process used to anonymize data,
# all unique observations include partially fabricated data
# that prevent the identification of respondents.
# As a result, all descriptive and model-based results are fabricated.
# Results from this and all NDACAN presentations are for training purposes only
# and should never be understood or cited as analysis of NDACAN data.
# TABLE OF CONTENTS
# 0. SETUP
# 1. EXAMINING DATA
# 2. SUMMARIZING DATA
# 3. CLEANING DATA
# 4. FORMATTING DATA
# 5. SAVING DATA
[Alexander F. Roehrkasse] You'll see that very often at the beginning of my script, I leave some notes for myself and for others. I have a table of contents that outlines the different things that we'll be doing in the script. And then each section organizes a different process, a part of the analysis that we'll be doing. I may be moving a bit fast here, so don't stress out if you're not following every little step. My guess is that each of you is coming to this presentation with different levels of familiarity with R and programming in general. Just a reminder that you'll be able to go back and access the transcript of this presentation, which will include all of this code. As always, if you have any questions about programming or analyzing archived data in a programmatic environment, feel free to reach out to us for help. Okay.
[ONSCREEN]
# 0. SETUP
# SETTING UP THE ENVIRONMENT
# Let's clear the environment
rm(list=ls())
# pacman installs packages if necessary, otherwise loads them.
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}
pacman::p_load(data.table, tidyverse)
[Alexander F. Roehrkasse] Very first thing that's helpful to do is to clear the environment. There's nothing in our environment, so nothing to clear. Next, we're going to install our packages. We're going to be using two packages today: data.table and the tidyverse. pacman is a helpful program for loading packages. This chunk of code just says: if we haven't already installed pacman, install pacman, and then use pacman to load the other packages. We've already installed pacman, and now we'll use pacman to load the data.table and tidyverse packages. These are essentially families of commands that we'll call on to analyze our data.
[ONSCREEN]
# Let's define some filepaths (note the organization of project and data folders)
project <- 'C:/Users/aroehrkasse/Box/Presentations/-NDACAN/2025_summer_series/'
data <- 'C:/Users/aroehrkasse/Box/NDACAN/'
> # And set one as the working directory.
> setwd(project)
> # Always set a seed to allow for reproduction of random processes.
> set.seed(1013)
[Alexander F. Roehrkasse] Next, we'll define some file paths. Recall the organization of our projects, our workflow, that we talked about in the slides. We'll create a project folder and a data folder. This is us just telling R: where is the project folder? Where is the data folder? And you see that they now show up in our environment as values. They're just essentially text strings that R is aware of. Now we'll tell R to set our working directory as the project folder. Lastly, it's always helpful to set a seed in case we're doing anything that involves a random process, so that we can replicate that process.
[ONSCREEN]
> # Let's read an anonymized version of the NYTD 2020 Cohort Waves 1-3.
> nytd <- fread(paste0(data,'NYTD/297 NYTD Outcomes 2020 Cohort Wave 1-2-3/Data/Text/Outcomes20_w3_anonymized.tab'))
[Alexander F. Roehrkasse] And now we'll read in our data. I want to clarify here. We'll be working with anonymized data. These are not real data. The data do not reflect real individuals. Please don't take any results or any description of patterns or trends here to be an analysis of real data. These data reflect the structure of actual NYTD and AFCARS data, but I've anonymized them to avoid disclosure risk. Okay, we'll go ahead and read in our data.
"Fread" is a command belonging to the data table package and it's a helpful way to read in a large data files quickly. We'll paste in the the data directory and then specify precisely what file we want to read in. And when we do this, we'll create a data object over here in the environment called NYTD. And we can see that it has 47,000 observations and 51 variables. [ONSCREEN] > # And an anonymized version of the AFCARS 2020 Foster Care Annual File. > Afcars <- fread(paste0(data,'AFCARS/DS255 FC2020v3/Data/Text Files/FC2020v3_anonymized.tab')) [Alexander F. Roehrkasse] We'll also read in an anonymized version of AFCARS. This file is a bit larger and so it takes a little bit longer to read. As you can see, 631,000 observations. Okay, let's start to examine our data. Most NDACAN data files are too large to just open a spreadsheet and look at all of them. One of the very first things that can be helpful is to just look at the dimensions of our data. [ONSCREEN] > # Most NDACAN data files are too large for spreadsheet viewing to be helpful. > Dim(nytd) [1] 47189 51 [Alexander F. Roehrkasse] If we type this dimensions command, we can see that the NYTD data have 47,000 rows and 51 columns. These correspond to observations. These correspond to variables. There are different ways though, helpful ways to view snippets of the data. One is to simply subset our data. [ONSCREEN] > # Subsetting tells R to print only the cells corresponding to certain rows, columns. > Nytd[1:5,1] Wave 1: 1 2: 1 3: 1 4: 1 5: 1 [Alexander F. Roehrkasse] So we can take this NYTD object and if we use these brackets we're essentially using like matrix notation almost to say okay take the first through fifth rows of this object and the first column of this object and output them to me. And here what we get is the first five rows of our data and just the first column of our data which happens to be the wave variable. We can do the same thing say give us the first five rows but instead give us the 1st through 8th and 20th through 22nd columns and here we would get essentially the same first five rows but then the specific variables that we've asked R to report back to us. [ONSCREEN] > nytd[1:5,c(1:8,20:22)] Wave StFCID State St RecNumbr RepDate DOB Sex CurrFTE CurrPTE EmplySklls 1: 1 AL000001456616 1 AL 000001456616 202003 2003-02-15 2 0 0 0 2: 1 AL000001524474 1 AL 000001524474 202003 2003-01-15 1 0 1 0 3: 1 AL000001528009 1 AL 000001528009 202003 2003-01-15 1 0 0 0 4: 1 AL000001597400 1 AL 000001597400 202003 2003-01-15 1 0 1 0 5: 1 AL000001612758 1 AL 000001612758 202009 2003-01-15 2 0 0 0 [Alexander F. Roehrkasse] Note that some variable names for example state don't actually match what's listed in the code book. So this variable state in the code book is listed as "StFIPS". This is another discrepancy between the actual data files and the code book. Something it's important to verify for yourself. [ONSCREEN] > # head() returns the first five rows of all columns. 
> Head(nytd) Wave StFCID State St RecNumbr RepDate DOB Sex AmIAKN Asian BlkAfrAm HawaiiPI White 1: 1 AL000001456616 1 AL 000001456616 202003 2003-02-15 2 0 0 0 0 1 2: 1 AL000001524474 1 AL 000001524474 202003 2003-01-15 1 0 0 0 0 1 3: 1 AL000001528009 1 AL 000001528009 202003 2003-01-15 1 0 0 0 0 1 4: 1 AL000001597400 1 AL 000001597400 202003 2003-01-15 1 0 0 0 0 1 5: 1 AL000001612758 1 AL 000001612758 202009 2003-01-15 2 0 0 0 0 1 6: 1 AL000001634843 1 AL 000001634843 202003 2002-09-15 2 0 0 0 0 1 RaceUnkn RaceDcln HisOrgin OutcmRpt OutcmDte OutcmFCS CurrFTE CurrPTE EmplySklls SocSecrty EducAid 1: 0 0 0 1 2020-02-03 1 0 0 0 0 0 2: 0 0 0 1 2020-02-21 1 0 1 0 0 0 3: 0 0 0 1 2020-02-25 1 0 0 0 0 0 4: 0 0 0 1 2020-03-03 1 0 1 0 0 0 5: 0 0 0 1 2020-04-07 1 0 0 0 0 0 6: 0 0 1 2 2019-12-11 1 77 77 77 77 77 PubFinAs PubFoodAs PubHousAs OthrFinAs HighEdCert CurrenRoll CnctAdult Homeless SubAbuse Incarc Children 1: 88 88 88 0 7 0 1 0 0 1 0 2: 88 88 88 0 7 1 1 0 1 0 0 3: 88 88 88 0 7 1 1 0 0 1 0 4: 88 88 88 0 7 1 1 0 0 0 0 5: 88 88 88 0 7 1 1 0 1 0 0 6: 77 77 77 77 77 77 77 77 77 77 77 Marriage Medicaid OthrHlthIn MedicalIn MentlHlthIn PrescripIn SampleState InSample Baseline FY20Cohort 1: 88 1 0 88 88 88 NA NA 1 1 2: 88 1 0 88 88 88 NA NA 1 1 3: 88 1 1 1 1 1 NA NA 1 1 4: 88 1 3 88 88 88 NA NA 1 1 5: 88 1 0 88 88 88 NA NA 1 1 6: 77 77 77 77 77 77 NA NA 1 0 Elig19 Elig21 Responded Race RaceEthn FIPS5 1: NA NA 1 1 1 8 2: NA NA 1 1 1 8 3: NA NA 1 1 1 8 4: NA NA 1 1 1 8 5: NA NA 1 1 1 8 6: NA NA 0 1 7 8 [Alexander F. Roehrkasse] The head command is a very helpful way to examine our data. It just returns the first five rows of the entire data set. So here's the first five rows of all the different variables that are in our data set. Head can be nicely used with the variable with the command select. And note here we're going to introduce what's called the pipe function. The pipe function has a kind of new encoding in R. And the pipe essentially takes the preceding element as the first input of the following function. That might be a little confusing. It's basically like saying like saying okay and to that thing that you just created now do this new thing. So for example we'll take the NYTD object the NYTD data set and to that object we'll apply the head command and then whatever results from that we will furthermore apply the select command which tells us that we want to pull out just these three variables. [ONSCREEN] > # head() can nicely be combined with select(). > # Note that here we introduce the pipe function '|>' (FKA '%>%'). > # The pipe takes the preceding element as the first input of the following function. > # It's like saying, 'and to that, now do this.' > Nytd |> + head() |> + select(Wave, St, Sex) Wave St Sex 1: 1 AL 2 2: 1 AL 1 3: 1 AL 1 4: 1 AL 1 5: 1 AL 2 6: 1 AL 2 [Alexander F. Roehrkasse] When we run this chunk of code we get essentially that head command but selecting just those three variables. This chunk of code here is just equivalent to this chunk of code here where we put each object inside the previous command. [ONSCREEN] > # So it's equivalent to typing: > select(head(nytd), Wave, St, Sex) Wave St Sex 1: 1 AL 2 2: 1 AL 1 3: 1 AL 1 4: 1 AL 1 5: 1 AL 2 6: 1 AL 2 [Alexander F. Roehrkasse] The glimpse function is another way to summarize our data and what it essentially does is transpose the data on the diagonal. [ONSCREEN] > # glimpse() transposes the dataframe. 
> Glimpse(nytd) Rows: 47,189 Columns: 51 $ Wave 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ StFCID "AL000001456616", "AL000001524474", "AL000001528009", "AL000001597400", "AL000001612758",… $ State 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ St "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",… $ RecNumbr "000001456616", "000001524474", "000001528009", "000001597400", "000001612758", "00000163… $ RepDate 202003, 202003, 202003, 202003, 202009, 202003, 202003, 202003, 202003, 202003, 202009, 2… $ DOB 2003-02-15, 2003-01-15, 2003-01-15, 2003-01-15, 2003-01-15, 2002-09-15, 2003-02-15, 2003… $ Sex 2, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2,… $ AmIAKN 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… $ Asian 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… $ BlkAfrAm 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… $ HawaiiPI 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… $ White 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ RaceUnkn 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… $ RaceDcln 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,… $ HisOrgin 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0,… $ OutcmRpt 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ OutcmDte 2020-02-03, 2020-02-21, 2020-02-25, 2020-03-03, 2020-04-07, 2019-12-11, 2020-02-14, 2020… $ OutcmFCS 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ CurrFTE 0, 0, 0, 0, 0, 77, 0, 77, 0, 0, 0, 0, 0, 0, 0, 0, 0, 77, 0, 77, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ CurrPTE 0, 1, 0, 1, 0, 77, 1, 77, 0, 0, 0, 0, 0, 0, 0, 1, 1, 77, 0, 77, 1, 0, 1, 0, 0, 1, 0, 0, 0… $ EmplySklls 0, 0, 0, 0, 0, 77, 0, 77, 0, 0, 0, 1, 0, 0, 0, 0, 0, 77, 0, 77, 0, 1, 0, 0, 0, 0, 1, 0, 1… $ SocSecrty 0, 0, 0, 0, 0, 77, 0, 77, 0, 0, 0, 0, 0, 0, 0, 0, 0, 77, 0, 77, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ EducAid 0, 0, 0, 0, 0, 77, 0, 77, 0, 0, 0, 0, 0, 0, 0, 0, 0, 77, 0, 77, 0, 0, 0, 0, 0, 0, 0, 0, 0… $ PubFinAs 88, 88, 88, 88, 88, 77, 88, 77, 88, 88, 88, 88, 88, 88, 88, 88, 88, 77, 88, 77, 88, 88, 8… $ PubFoodAs 88, 88, 88, 88, 88, 77, 88, 77, 88, 88, 88, 88, 88, 88, 88, 88, 88, 77, 88, 77, 88, 88, 8… $ PubHousAs 88, 88, 88, 88, 88, 77, 88, 77, 88, 88, 88, 88, 88, 88, 88, 88, 88, 77, 88, 77, 88, 88, 8… $ OthrFinAs 0, 0, 0, 0, 0, 77, 0, 77, 0, 1, 0, 0, 0, 0, 1, 0, 0, 77, 0, 77, 0, 1, 0, 0, 0, 0, 0, 0, 1… $ HighEdCert 7, 7, 7, 7, 7, 77, 7, 77, 7, 7, 7, 7, 7, 7, 7, 7, 7, 77, 7, 77, 7, 7, 7, 7, 7, 7, 7, 7, 7… $ CurrenRoll 0, 1, 1, 1, 1, 77, 1, 77, 1, 1, 1, 1, 1, 1, 1, 1, 1, 77, 1, 77, 1, 1, 1, 1, 0, 1, 1, 1, 1… $ CnctAdult 1, 1, 1, 1, 1, 77, 1, 77, 1, 1, 1, 1, 1, 1, 1, 1, 1, 77, 1, 77, 1, 1, 1, 1, 1, 1, 1, 1, 1… $ Homeless 0, 0, 0, 0, 0, 77, 1, 77, 0, 1, 0, 0, 0, 0, 0, 1, 0, 77, 1, 77, 0, 1, 0, 1, 0, 0, 0, 0, 0… $ SubAbuse 0, 1, 0, 0, 1, 77, 1, 77, 0, 0, 1, 0, 1, 0, 1, 0, 0, 77, 0, 77, 0, 1, 1, 0, 0, 0, 0, 1, 0… $ Incarc 1, 0, 1, 0, 0, 77, 0, 77, 1, 1, 0, 0, 1, 0, 1, 0, 0, 77, 1, 77, 0, 1, 0, 0, 0, 0, 0, 0, 0… $ Children 0, 0, 0, 0, 0, 77, 0, 77, 0, 0, 0, 0, 0, 0, 0, 0, 0, 77, 0, 77, 
0, 0, 0, 0, 0, 0, 0, 0, 0… $ Marriage 88, 88, 88, 88, 88, 77, 88, 77, 88, 88, 88, 88, 88, 88, 88, 88, 88, 77, 88, 77, 88, 88, 8… $ Medicaid 1, 1, 1, 1, 1, 77, 1, 77, 1, 1, 1, 1, 1, 1, 3, 1, 1, 77, 1, 77, 1, 1, 1, 1, 0, 1, 1, 1, 1… $ OthrHlthIn 0, 0, 1, 3, 0, 77, 0, 77, 0, 0, 0, 0, 1, 0, 3, 1, 1, 77, 0, 77, 1, 0, 3, 0, 1, 0, 3, 1, 0… $ MedicalIn 88, 88, 1, 88, 88, 77, 88, 77, 88, 88, 88, 88, 1, 88, 88, 1, 1, 77, 88, 77, 1, 88, 88, 88… $ MentlHlthIn 88, 88, 1, 88, 88, 77, 88, 77, 88, 88, 88, 88, 1, 88, 88, 1, 1, 77, 88, 77, 1, 88, 88, 88… $ PrescripIn 88, 88, 1, 88, 88, 77, 88, 77, 88, 88, 88, 88, 1, 88, 88, 1, 1, 77, 88, 77, 1, 88, 88, 88… $ SampleState NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… $ InSample NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… $ Baseline 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ FY20Cohort 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ Elig19 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… $ Elig21 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N… $ Responded 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ Race 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ RaceEthn 1, 1, 1, 1, 1, 7, 1, 1, 1, 1, 1, 1, 2, 1, 3, 1, 7, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ FIPS5 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,… [Alexander F. Roehrkasse] So now each column gets a row here and we can see the different values in that row. Lastly, to get an overview of our data, it can sometimes be helpful to take a random sample. When we do the head command, we just get the first five rows, but maybe we want to know what things look like further down. [ONSCREEN] > # To get an overview, it can sometimes be helpful to view a random sample > # of the data rather than a block of data. 
> Nytd |> + slice_sample(prop = .001) Wave StFCID State St RecNumbr RepDate DOB Sex AmIAKN Asian BlkAfrAm HawaiiPI White 1: 1 NA00C6WiYw18kK 7 00C6WiYw18kK 202003 2003-04-15 2 0 0 0 0 0 2: 1 NAOOOONMFFMGKM 48 OOOONMFFMGKM 202003 2003-01-15 1 0 0 0 0 1 3: 1 NC010669284150 37 NC 010669284150 202003 2002-12-15 2 0 0 1 0 0 4: 2 UT®©¬®µ¼úüüùþû 49 UT ®©¬®µ¼úüüùþû 202209 2003-05-15 1 0 0 0 0 1 5: 1 AR005000881725 5 AR 005000881725 202009 2003-06-15 2 0 0 0 0 1 6: 1 TXOOOOOGILNFMI 48 TX OOOOOGILNFMI 202009 2003-05-15 1 0 0 1 0 0 7: 2 AR001000481881 5 AR 001000481881 202209 2003-06-15 1 0 0 0 0 1 8: 1 NA239127252248 52 239127252248 202003 2003-04-15 1 0 0 0 0 1 9: 1 GA310005548401 13 GA 310005548401 202009 2003-07-15 1 0 0 0 0 1 10: 1 NDP99052746396 38 ND P99052746396 202003 2003-02-15 2 0 0 0 0 1 11: 1 AK000004932451 2 AK 000004932451 202009 2003-05-15 2 0 0 1 0 0 12: 1 MA100459010445 25 MA 100459010445 202009 2003-08-15 2 0 0 0 0 1 13: 3 NC261481452050 37 NC 261481452050 202409 2003-05-15 1 0 1 0 0 0 14: 1 GA410002184401 13 GA 410002184401 202003 2002-09-15 1 0 0 0 0 1 15: 1 LAª¡¬½¶¿³¹¨´¿ý 22 LA ª¡¬½¶¿³¹¨´¿ý 202009 2003-05-15 1 0 0 0 0 1 16: 1 IN000329611291 18 IN 000329611291 202003 2003-01-15 1 0 0 1 0 0 17: 2 NC022429487010 37 NC 022429487010 202203 2003-02-15 1 0 0 1 0 0 18: 1 MEXXXCZAPPPPSZ 23 ME XXXCZAPPPPSZ 202009 2003-08-15 1 0 0 1 0 0 19: 3 MN999830413874 27 MN 999830413874 202409 2003-06-15 1 0 0 1 0 0 RaceUnkn RaceDcln HisOrgin OutcmRpt OutcmDte OutcmFCS CurrFTE CurrPTE EmplySklls SocSecrty EducAid 1: 1 0 3 1 2020-03-26 1 0 0 0 0 0 2: 0 0 0 1 2020-03-20 1 0 0 0 0 0 3: 0 0 0 1 2019-11-18 1 0 0 1 0 0 4: 0 0 1 9 0 77 77 77 77 77 5: 0 0 0 1 2020-06-30 1 0 0 0 0 0 6: 0 0 0 1 2020-07-13 1 0 0 0 0 0 7: 0 0 0 9 2022-04-01 1 77 77 77 77 77 8: 0 0 0 1 2020-03-27 1 0 0 0 1 0 9: 0 0 0 1 2020-09-04 1 0 1 1 0 0 10: 0 0 0 1 2020-02-14 1 0 0 0 0 0 11: 0 0 0 1 2020-08-04 1 0 0 0 1 0 12: 0 0 0 1 2020-08-11 1 0 1 0 0 0 13: 0 0 0 1 2024-06-21 0 1 0 0 0 0 14: 0 0 0 1 2019-10-15 1 0 0 0 2 2 15: 0 0 0 1 2020-08-03 1 0 1 0 0 0 16: 0 0 0 1 2020-02-21 1 0 0 0 0 0 17: 0 0 0 1 2021-10-29 1 0 0 0 0 1 18: 0 0 0 6 1 77 77 77 77 77 19: 0 0 0 1 2024-09-15 0 1 0 0 0 0 PubFinAs PubFoodAs PubHousAs OthrFinAs HighEdCert CurrenRoll CnctAdult Homeless SubAbuse Incarc Children 1: 88 88 88 0 7 1 1 0 0 1 0 2: 88 88 88 0 7 1 1 0 1 0 0 3: 77 77 77 0 7 1 1 0 0 0 0 4: 77 77 77 77 77 77 77 77 77 77 77 5: 77 77 77 0 2 1 1 1 0 0 0 6: 88 88 88 0 7 1 1 0 1 1 0 7: 77 77 77 77 77 77 77 77 77 77 77 8: 88 88 88 0 7 1 1 0 0 0 1 9: 88 88 88 1 7 1 1 0 0 0 0 10: 77 77 77 0 7 1 1 0 0 0 0 11: 0 0 0 0 7 1 1 0 0 0 0 12: 88 88 88 0 1 1 1 0 1 0 0 13: 0 0 0 0 1 0 1 0 0 0 0 14: 88 88 88 2 7 1 1 0 0 0 0 15: 88 88 88 0 7 1 1 0 0 0 0 16: 88 88 88 0 7 0 1 0 0 0 0 17: 88 88 88 0 1 1 1 0 0 0 0 18: 77 77 77 77 77 77 77 77 77 77 77 19: 0 0 0 0 4 0 1 0 0 0 0 Marriage Medicaid OthrHlthIn MedicalIn MentlHlthIn PrescripIn SampleState InSample Baseline FY20Cohort 1: 88 1 0 88 88 88 NA NA 1 1 2: 88 1 0 88 88 88 NA NA 1 1 3: 88 1 0 88 88 88 NA NA 1 1 4: 77 77 77 77 77 77 1 NA NA 1 5: 88 3 3 88 88 88 NA NA 1 1 6: 88 0 0 88 88 88 NA NA 1 1 7: 77 77 77 77 77 77 1 NA NA 1 8: 0 1 0 88 88 88 NA NA 1 1 9: 88 1 0 88 88 88 NA NA 1 1 10: 88 0 0 88 88 88 NA NA 1 1 11: 88 1 0 88 88 88 NA NA 1 1 12: 88 1 0 88 88 88 NA NA 1 1 13: 88 1 0 88 88 88 0 0 NA 1 14: 88 1 3 88 88 88 NA NA 1 1 15: 88 0 0 88 88 88 NA NA 1 1 16: 88 1 0 88 88 88 NA NA 1 1 17: 88 1 0 88 88 88 0 NA NA 1 18: 77 77 77 77 77 77 NA NA 1 0 19: 88 0 1 1 0 1 0 0 NA 1 Elig19 Elig21 Responded Race RaceEthn FIPS5 1: NA 
NA 1 99 99 6037 2: NA NA 1 1 1 NA 3: NA NA 1 2 2 36061 4: 0 NA 0 1 7 48201 5: NA NA 1 1 1 4013 6: NA NA 1 2 2 8 7: 0 NA 0 1 1 4019 8: NA NA 1 1 1 8 9: NA NA 1 1 1 8 10: NA NA 1 1 1 37051 11: NA NA 1 2 2 1073 12: NA NA 1 1 1 24510 13: 1 1 1 4 4 36029 14: NA NA 1 1 1 8 15: NA NA 1 1 1 8 16: NA NA 1 2 2 17031 17: 1 1 1 2 2 36061 18: NA NA 0 2 2 8 19: 1 1 1 2 2 26081 [ Reached getOption("max.print") -- omitted 29 rows ]
[Alexander F. Roehrkasse] And so, we might take, for example, a 0.1% random sample of our data. And that might provide a different viewpoint onto our data where, for example, we get not just values from Alabama, but values from different states. And we can see the different values that they have. Okay, let's move on to summarizing our data. The most helpful tidyverse command for summarizing our data is, unsurprisingly, summarize. This can be helpful for calculating all kinds of summary statistics. Now, recall that we might naively summarize this variable, CurrFTE. It had numerical values. Why don't we just calculate a mean or a standard deviation or a median? We can tell R: take the NYTD object, summarize it, and in doing so generate a new variable, mean, that's equal to the mean of the CurrFTE variable, where we're going to remove any observations that have a missing value for that variable.
[ONSCREEN]
> # Naively summarizing data can sometimes lead us astray:
> nytd |>
+   summarize(mean = mean(CurrFTE, na.rm = T),
+             sd = sd(CurrFTE, na.rm = T),
+             median = median(CurrFTE, na.rm = T))
     mean       sd median
1 26.1479 36.33093      0
[Alexander F. Roehrkasse] So if we run this chunk of code, you'll see that we get values that correspond to the values I showed in the slide deck. But of course, we said that's kind of a dumb thing to do, because the CurrFTE variable is not in fact a continuous variable. So we can't generate a mean or a mode for that variable. Instead, let's group NYTD according to the values of that variable. And then we'll summarize observations, where we'll generate a new variable, n, which just counts up the number of observations with each value of that CurrFTE variable. The n() function is just a counter. It counts the number of observations corresponding to each group. So we'll group our data by the values of the CurrFTE variable. We'll count the number of observations having each value. We'll then ungroup our data and calculate a new variable, prop, which is essentially the number of observations belonging to each group divided by the number of all observations across all groups.
[ONSCREEN]
> # We need to know a bit about each variable before we can summarize it appropriately.
> # Here we also use the tidyverse function mutate() for creating variables.
> nytd |>
+   group_by(CurrFTE) |>
+   summarize(n = n(), .groups = 'keep') |>
+   ungroup() |>
+   mutate(prop = n/sum(n))
# A tibble: 5 × 3
  CurrFTE     n    prop
1       0 25426 0.539
2       1  5038 0.107
3       2   492 0.0104
4      77 15799 0.335
5      NA   434 0.00920
[Alexander F. Roehrkasse] When we run this chunk of code, we get, as I showed you in the slide deck, a count of the number of observations with each value of CurrFTE, and also the proportion of all observations corresponding to that value. So about 54% of all observations have the value zero for CurrFTE. About a third of observations have the 77 code, and about 1% have a true missing code.
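[SUPPLEMENTARY EXAMPLE] The same frequency-and-proportion table can also be written a bit more compactly with the tidyverse count() function. This sketch is equivalent to the group_by()/summarize() chunk above but does not appear in the original demonstration script; it assumes the tidyverse is loaded and the nytd object exists as above.
# count() groups by CurrFTE and counts rows in one step;
# mutate() then converts the counts to proportions.
nytd |>
  count(CurrFTE) |>
  mutate(prop = n / sum(n))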
We can also use summarization to understand the structure of NYTD.
[ONSCREEN]
> # Let's use summarization to understand the structure of NYTD
> # by counting the observations in different waves.
> nytd |>
+   group_by(Baseline, Wave, FY20Cohort) |>
+   summarize(n = n(), .groups = 'keep')
# A tibble: 4 × 4
# Groups: Baseline, Wave, FY20Cohort [4]
Baseline Wave FY20Cohort n
1 1 1 0 6634
2 1 1 1 14768
3 NA 2 1 14768
4 NA 3 1 11019
[Alexander F. Roehrkasse] So, for example, we can use some of these same principles to see how many observations in the first wave have responses, how many of that first wave responded in the second survey, and how many in the third. We can learn very quickly that between wave 1 and wave 2 of the survey there's a very good response rate; we don't appear to lose any respondents. But in the third wave we do seem to lose a few thousand respondents, and that's an important thing to be aware of.
[ONSCREEN]
> # Let's see if there was non-random attrition between waves 1 and 3
> # with respect to sex
> nytd |>
+   filter(FY20Cohort == 1 & Wave %in% c(1,3)) |>
+   group_by(Wave, Sex) |>
+   summarize(n = n(), .groups = 'keep') |>
+   group_by(Wave) |>
+   mutate(prop = n/sum(n))
# A tibble: 5 × 4
# Groups: Wave [2]
Wave Sex n prop
1 1 1 7725 0.523
2 1 2 7043 0.477
3 3 1 5636 0.511
4 3 2 5155 0.468
5 3 NA 228 0.0207
[Alexander F. Roehrkasse] I'll briefly say that we can add nuance to this to analyze whether or not that attrition depends on the sex of respondents. In this case it happens not to, but these are helpful questions to ask yourself as you're trying to understand the study design and any challenges to your analysis that might arise from it. Okay, let's now clean some data. As I said, most data sets are large, so before cleaning them, and often before doing much of anything, it can be helpful to pull out just the variables you know you're going to be using. If there are a bunch of variables you know you're not interested in, dropping them will free up some memory on your computer and allow you to work a bit faster. So I'm going to pull out just a few variables from NYTD and AFCARS that I know we're going to be interested in.
[ONSCREEN]
> nytd <- nytd |>
+   select(Wave, StFCID, State, St, RecNumbr, Responded, Baseline, FY20Cohort,
+          DOB, Sex, RaceEthn,
+          CurrFTE, CurrPTE, EmplySklls,
+          HighEdCert, CurrenRoll)
> afcars <- afcars |>
+   select(StFCID, STATE, St, RecNumbr,
+          DOB, SEX, RaceEthn,
+          CLINDIS,
+          TOTALREM)
[Alexander F. Roehrkasse] So our objects get a little bit smaller, we're using less memory, and it's a little easier to work with them. As I said, it's important to understand that your data will not always be coded exactly in the manner described in the code book.
[ONSCREEN]
> # It's important to understand that data will not always be coded
> # exactly in the manner they're described in the Code Book.
> nytd |>
+   group_by(HighEdCert) |>
+   summarize(n = n())
# A tibble: 10 × 2
HighEdCert n
1 1 10879
2 2 344
3 3 85
4 4 156
5 5 37
6 6 117
7 7 18670
8 8 643
9 77 15824
10 NA 434
[Alexander F. Roehrkasse] Let's look, for example, at higher education certification. The code book would tell you that all missing values have the code 77. But what we see is that there are over 400 observations that have a true missing value that is not coded as 77.
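One way to make this kind of check systematic is to count the 77 sentinel code and true NA values side by side for several variables at once. This is a minimal sketch, not code shown in the session; it assumes the selected nytd columns above.

library(dplyr)

# For each variable, how many observations carry the 77 code vs. a true NA?
nytd |>
  summarize(across(c(HighEdCert, CurrFTE, CurrPTE, EmplySklls),
                   list(code77  = ~ sum(.x == 77, na.rm = TRUE),
                        true_na = ~ sum(is.na(.x)))))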
And so, if you assumed that all missing values were coded 77, you would miss a small but non-trivial number of observations with missing values for that variable. Let's recode a bunch of variables the way we want them to look.
[ONSCREEN]
> # Let's recode variables how we want them
> nytd <- nytd |>
+   mutate(Sex = factor(Sex, labels = c('Male', 'Female')),
+          RaceEthn = ifelse(RaceEthn == 99, NA, RaceEthn),
+          RaceEthn = factor(RaceEthn,
+                            levels = 1:7,
+                            labels = c('White','Black','AIAN','Asian','NHPI','Multiracial','Hispanic')),
+          HighEdCert = case_when(HighEdCert == 7 ~ 0,
+                                 HighEdCert == 1 ~ 1,
+                                 HighEdCert %in% 2:3 ~ 2,
+                                 HighEdCert %in% 4:6 ~ 3),
+          HighEdCert = factor(HighEdCert,
+                              levels = 0:3,
+                              labels = c('No HS', 'HS', 'Vocational', 'AA+')),
+          CurrenRoll = ifelse(CurrenRoll %in% c(2,77), NA, CurrenRoll))
[Alexander F. Roehrkasse] This is a big chunk of code, and I'm recoding multiple variables in one go. But notice that I'm essentially saying: create a new Sex variable by factorizing the existing one. There already exists a sex variable; we're just telling R to treat it as a factor and give it the labels Male and Female. Then we take the race and ethnicity variable. There already exists a race and ethnicity variable, but sometimes missing values are coded as 99, and we want R to treat missing values as missing values; R understands missing values as NA. So here we're saying: if the race and ethnicity variable has the value 99, replace it with NA; otherwise, keep the value of that variable. Then we take that variable, which has seven numerical levels, and label them according to the categories they encode. We do essentially the same for higher education certification, recoding the variable to four levels corresponding to no high school, high school, vocational, and associate's degree or more. And then again, we recode the missing values in the current-enrollment variable: sometimes they're coded as 2, sometimes as 77; we want to treat them as missing, so we tell R they should be NA. We can run this one chunk of code and R will clean each of these variables in turn.
[ONSCREEN]
# Let's try to recode a few other variables a little more "programmatically"
# to avoid possible errors. Instead of writing the following...
# nytd <- nytd |>
#   mutate(CurrFTE = ifelse(CurrFTE %in% c(2,77), NA, CurrFTE),
#          CurrPTE = ifelse(CurrPTE %in% c(2,77), NA, CurrPTE),
#          EmplySklls = ifelse(EmplySklls %in% c(2,77), NA, EmplySklls))
# We can write:
nytd <- nytd |>
  mutate(across(c(CurrFTE, CurrPTE, EmplySklls),
                ~ ifelse(.x %in% c(2,77), NA, .x)))
[Alexander F. Roehrkasse] Now, there are several other variables we're interested in (CurrFTE, CurrPTE, EmplySklls) that all code missing values as either 2 or 77. We could write a chunk of code that copies and pastes this ifelse command over and over, swapping out the different variable names. But, as I said, that copying and pasting introduces opportunities for error and decreases the interpretability of our code.
So one way to write this chunk of code more programmatically is to say: okay R, we're going to mutate some variables, that is, create new variables or modify existing ones; move across each of these three variables, and as you do, apply this single function, which says that if the value of the variable is in the set 2 or 77, replace it with a missing value, and otherwise leave it as it is. When we run this single chunk of code, it performs the same cleaning operation identically on the three different variables. This is one way to code programmatically so that we increase interpretability and avoid errors.
[ONSCREEN]
> # Now let's examine our new data
> head(nytd)
Wave StFCID State St RecNumbr Responded Baseline FY20Cohort DOB Sex RaceEthn CurrFTE
1: 1 AL000001456616 1 AL 000001456616 1 1 1 2003-02-15 Female White 0
2: 1 AL000001524474 1 AL 000001524474 1 1 1 2003-01-15 Male White 0
3: 1 AL000001528009 1 AL 000001528009 1 1 1 2003-01-15 Male White 0
4: 1 AL000001597400 1 AL 000001597400 1 1 1 2003-01-15 Male White 0
5: 1 AL000001612758 1 AL 000001612758 1 1 1 2003-01-15 Female White 0
6: 1 AL000001634843 1 AL 000001634843 0 1 0 2002-09-15 Female Hispanic NA
CurrPTE EmplySklls HighEdCert CurrenRoll
1: 0 0 No HS 0
2: 1 0 No HS 1
3: 0 0 No HS 1
4: 1 0 No HS 1
5: 0 0 No HS 1
6: NA NA NA
[Alexander F. Roehrkasse] Let's examine our new data. Here's our new data set: a smaller number of variables, all cleaned up in a nice way. We can summarize each of these variables.
[ONSCREEN]
> nytd |>
+   group_by(CurrFTE) |>
+   summarize(n = n(), .groups = 'keep')
# A tibble: 3 × 2
# Groups: CurrFTE [3]
CurrFTE n
1 0 25426
2 1 5038
3 NA 16725
[Alexander F. Roehrkasse] Let's focus just on the CurrFTE variable. We'll see that about 25,000 respondents were not employed full time, about 5,000 were employed full time, and about 16,000 people had missing values for this variable, which is something we're going to have to figure out how to deal with.
[ONSCREEN]
> # And let's also clean our AFCARS data.
> afcars <- afcars |>
+   mutate(SEX = factor(SEX, labels = c('Male', 'Female')),
+          RaceEthn = ifelse(RaceEthn == 99, NA, RaceEthn),
+          RaceEthn = factor(RaceEthn,
+                            levels = 1:7,
+                            labels = c('White','Black','AIAN','Asian','NHPI','Multiracial','Hispanic')),
+          CLINDIS = ifelse(CLINDIS == 3, NA, CLINDIS), # Treats 'not yet diagnosed' as equivalent to missing value
+          CLINDIS = factor(CLINDIS,
+                           levels = 1:2,
+                           labels = c('Yes', 'No')))
> # And inspect it.
> head(afcars)
StFCID STATE St RecNumbr DOB SEX RaceEthn CLINDIS TOTALREM
1: AL000001456616 1 AL 000001456616 2003-02-15 Female White No 2
2: AL000001524474 1 AL 000001524474 2003-01-15 Male White No 2
3: AL000001528009 1 AL 000001528009 2003-01-15 Male White Yes 3
4: AL000001597400 1 AL 000001597400 2003-01-15 Male White No 1
5: AL000001612758 1 AL 000001612758 2003-01-15 Female White No 4
6: AL000001634843 1 AL 000001634843 2002-09-15 Female Hispanic No 5
[Alexander F. Roehrkasse] I won't talk through the cleaning of the AFCARS data; suffice it to say that it's all very similar to the cleaning of the NYTD data. I do want to highlight one trick for dealing with missing data in administrative data where we have repeated observations of the same person. Let's say we observe one person multiple times, as we do in the NYTD.
Let's say in one of those observations we have a missing value for, say, their date of birth, but in another observation that value is not missing. On the assumption that someone's date of birth does not change, we can fill in the missing value at one point in time with the value observed for the same person at a different point in time.
[ONSCREEN]
> # Let's first see how much missing data we have for DOB, sex, and race/ethnicity
> nytd |> group_by(DOB) |> summarize(n = n(), .groups = 'keep')
# A tibble: 13 × 2
# Groups: DOB [13]
DOB n
1 2002-09-15 4267
2 2002-11-15 4258
3 2002-12-15 4019
4 2003-01-15 3763
5 2003-02-15 4042
6 2003-03-15 3881
7 2003-04-15 3569
8 2003-05-15 3712
9 2003-06-15 3769
10 2003-07-15 3898
11 2003-08-15 4159
12 2003-10-15 3418
13 NA 434
[Alexander F. Roehrkasse] Let's see how often this occurs. For date of birth, 434 people are missing a value; 434 people are missing the variable sex; and 1,099 people are missing a value for race and ethnicity.
[ONSCREEN]
> nytd |> group_by(Sex) |> summarize(n = n(), .groups = 'keep')
# A tibble: 3 × 2
# Groups: Sex [3]
Sex n
1 Male 24173
2 Female 22582
3 NA 434
> nytd |> group_by(RaceEthn) |> summarize(n = n(), .groups = 'keep')
# A tibble: 8 × 2
# Groups: RaceEthn [8]
RaceEthn n
1 White 19487
2 Black 11980
3 AIAN 735
4 Asian 341
5 NHPI 120
6 Multiracial 2916
7 Hispanic 10511
8 NA 1099
[Alexander F. Roehrkasse] Let's try to fill in some of these observations based on information we observe for the same person at a different point in time, where that information is not missing. First we'll arrange our data in a logical order.
[ONSCREEN]
> # Let's first arrange things logically
> nytd <- nytd |>
+   arrange(State, StFCID, Wave)
[Alexander F. Roehrkasse] Okay, we'll order the data by state, then individual ID, then wave. Then we'll tell R: first group by individuals, so that there is a group corresponding to each individual; then, for each of these variables, fill in the values wherever they're missing, first by moving down and then by moving up. So we'll go ahead and run this code.
[ONSCREEN]
> # And within individual IDs, fill in any missing values with non-missing
> # preceding or succeeding values.
> nytd <- nytd |>
+   group_by(StFCID) |>
+   fill(c('DOB', 'Sex', 'RaceEthn'), .direction = 'downup') |>
+   ungroup()
[Alexander F. Roehrkasse] This code will take just a second to run; it's a slightly more complex process than anything we've done so far. Once it finishes, we'll re-examine how many missing values we have for each of these variables.
[ONSCREEN]
> # Note that this addressed all missing DOB and sex data,
> # and about two thirds of missing race/ethnicity data!
> nytd |> group_by(DOB) |> summarize(n = n(), .groups = 'keep')
# A tibble: 12 × 2
# Groups: DOB [12]
DOB n
1 2002-09-15 4312
2 2002-11-15 4292
3 2002-12-15 4051
4 2003-01-15 3807
5 2003-02-15 4088
6 2003-03-15 3932
7 2003-04-15 3617
8 2003-05-15 3753
9 2003-06-15 3796
10 2003-07-15 3915
11 2003-08-15 4188
12 2003-10-15 3438
[Alexander F. Roehrkasse] And we'll see that we've actually been able to reduce some of this missing data. In fact, we now have no missing data for date of birth; we were able to fill in every value.
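A compact way to re-check all three filled variables at once is a single across() call. This is a sketch, not code shown in the session; it assumes the filled-in nytd object above.

library(dplyr)

# Count the remaining true missing values in each of the filled variables
nytd |>
  summarize(across(c(DOB, Sex, RaceEthn), ~ sum(is.na(.x))))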
[ONSCREEN]
> nytd |> group_by(Sex) |> summarize(n = n(), .groups = 'keep')
# A tibble: 2 × 2
# Groups: Sex [2]
Sex n
1 Male 24402
2 Female 22787
> nytd |> group_by(RaceEthn) |> summarize(n = n(), .groups = 'keep')
# A tibble: 8 × 2
# Groups: RaceEthn [8]
RaceEthn n
1 White 19957
2 Black 12110
3 AIAN 746
4 Asian 344
5 NHPI 124
6 Multiracial 2980
7 Hispanic 10654
8 NA 274
[Alexander F. Roehrkasse] The same is true for sex: no more missing data on sex. We still have 274 missing values for race and ethnicity, but we were able to reduce the amount of missing data by about two-thirds. That's a pretty cool trick. Of course, the true values of these variables may change.
[ONSCREEN]
> # Of course, the true values of these variables may change,
> # and so you need to decide how to interpret this change
> # and handle it appropriately.
> nytd |>
+   filter((Wave == 2 & lag(Wave) == 1 & StFCID == lag(StFCID) & RaceEthn != lag(RaceEthn)) |
+          (Wave == 3 & lag(Wave) == 2 & StFCID == lag(StFCID) & RaceEthn != lag(RaceEthn)))
# A tibble: 479 × 16
Wave StFCID State St RecNumbr Responded Baseline FY20Cohort DOB Sex RaceEthn CurrFTE CurrPTE
1 3 AK000000… 2 AK 0000009… 1 NA 1 2002-11-15 Male Black 0 0
2 2 AK000008… 2 AK 0000084… 1 NA 1 2003-03-15 Fema… White 0 0
3 2 AR000001… 5 AR 0000013… 0 NA 1 2003-07-15 Fema… Black NA NA
4 3 AR000001… 5 AR 0000013… 0 NA 1 2003-05-15 Fema… White NA NA
5 2 AR000001… 5 AR 0000017… 0 NA 1 2003-05-15 Fema… White NA NA
6 3 AR000001… 5 AR 0000017… 1 NA 1 2003-03-15 Male Hispanic 0 0
7 2 AR001000… 5 AR 0010007… 0 NA 1 2003-07-15 Male Hispanic NA NA
8 3 AR001000… 5 AR 0010007… 1 NA 1 2003-03-15 Fema… White 1 0
9 2 AR001001… 5 AR 0010018… 0 NA 1 2003-05-15 Fema… Hispanic NA NA
10 3 AR001001… 5 AR 0010018… 0 NA 1 2003-03-15 Fema… White NA NA
# … with 469 more rows, and 3 more variables: EmplySklls, HighEdCert, CurrenRoll
[Alexander F. Roehrkasse] This chunk of code will tell you exactly how many observations there are for which an individual actually changes their response to the race and ethnicity question across different waves of NYTD. So we might assume that someone's date of birth doesn't actually change, and fill those values in. But other things, like race and ethnicity, do change across time; people change their racial and ethnic identity. So this strategy is only valid on the assumption that the variables to which we apply it do not change over time. Let's go ahead and save a version of our data.
[ONSCREEN]
> # Let's save a version of this data
> fwrite(nytd, paste0(data,'nytd_clean_anonymized_long.csv'))

# 4. FORMATTING DATA

# We're interested in the outcomes observed in NYTD
# 3 years after aging out of foster care (age 21).
# This is the set of observations in Wave 3.
# But we might also want to use earlier outcomes
# as predictors for later outcomes.
# Here we might want to pivot our data to treat
[Alexander F. Roehrkasse] We'll call this the long data. As you recall, the default organization of the NYTD data is long. But recall that we're interested in outcomes observed in NYTD three years after aging out of foster care, at age 21; this is the set of observations in wave 3. We could just drop all of the earlier waves (drop wave 1 and wave 2 from our data), and then our data would be formatted such that each row corresponds to a person. But we might also want to use earlier outcomes as predictors for later outcomes; that is, we might want to treat information from wave 1 or wave 2 as a predictor of outcomes in wave 3.
If that's the case, what we want to do is pivot, or reshape, our data wide, so that each row corresponds to a person and each variable shows up as multiple columns, one for each point in time at which it was observed. The way we do this is to take our nytd object and pivot it wider. We tell R which columns will continue to identify each row, or each unit of observation: the unique person ID, plus these other identifying variables. We will pull the names for our new variables from the Wave variable; that means the Wave variable goes away, and its values get appended to the variable names. We will then take our values from all of the other variables we want to keep in the analysis. So let's go ahead and run this code, and then compare the structure of our new data to the structure of our old data.
[ONSCREEN]
> nytd_wide <- nytd |>
+   pivot_wider(id_cols = c(StFCID, State, St, RecNumbr),
+               names_from = Wave,
+               values_from = Responded:CurrenRoll)
> head(nytd_wide)
# A tibble: 6 × 37
StFCID State St RecNumbr Responded_1 Responded_2 Responded_3 Baseline_1 Baseline_2 Baseline_3 FY20Cohort_1
1 AL000… 1 AL 0000014… 1 1 1 1 NA NA 1
2 AL000… 1 AL 0000015… 1 1 1 1 NA NA 1
3 AL000… 1 AL 0000015… 1 1 1 1 NA NA 1
4 AL000… 1 AL 0000015… 1 1 1 1 NA NA 1
5 AL000… 1 AL 0000016… 1 1 1 1 NA NA 1
6 AL000… 1 AL 0000016… 0 NA NA 1 NA NA 0
# … with 26 more variables: FY20Cohort_2, FY20Cohort_3, DOB_1, DOB_2, DOB_3,
#   Sex_1, Sex_2, Sex_3, RaceEthn_1, RaceEthn_2, RaceEthn_3,
#   CurrFTE_1, CurrFTE_2, CurrFTE_3, CurrPTE_1, CurrPTE_2, CurrPTE_3,
#   EmplySklls_1, EmplySklls_2, EmplySklls_3, HighEdCert_1, HighEdCert_2,
#   HighEdCert_3, CurrenRoll_1, CurrenRoll_2, CurrenRoll_3
[Alexander F. Roehrkasse] Here's our new wide data. You can see that each value of the StFCID variable is now unique: one person, one row. For all the other variables we're interested in, though, we now have three versions of each variable, observed in the first, second, and third waves. This is what we call wide data, and it differs from the long data.
[ONSCREEN]
> head(nytd)
# A tibble: 6 × 16
Wave StFCID State St RecNumbr Responded Baseline FY20Cohort DOB Sex RaceEthn CurrFTE CurrPTE
1 1 AL0000014… 1 AL 0000014… 1 1 1 2003-02-15 Fema… White 0 0
2 2 AL0000014… 1 AL 0000014… 1 NA 1 2003-02-15 Fema… White 0 0
3 3 AL0000014… 1 AL 0000014… 1 NA 1 2003-02-15 Fema… White 0 0
4 1 AL0000015… 1 AL 0000015… 1 1 1 2003-01-15 Male White 0 1
5 2 AL0000015… 1 AL 0000015… 1 NA 1 2003-01-15 Male White 0 1
6 3 AL0000015… 1 AL 0000015… 1 NA 1 2003-01-15 Male White 0 0
# … with 3 more variables: EmplySklls, HighEdCert, CurrenRoll
[Alexander F. Roehrkasse] Here's the long data, where we have multiple observations of each ID value: a single person has multiple rows, because we observe them in multiple waves, and there is only one column corresponding to each variable. Our wide data have a different structure: for each wave, we have a different column corresponding to each variable.
[ONSCREEN]
> # Let's save our cleaned and reformatted data:
> fwrite(nytd_wide, paste0(data,'nytd_clean_anonymized.csv'))
> fwrite(afcars, paste0(data,'afcars_clean_anonymized.csv'))
[Alexander F. Roehrkasse] Let's go ahead and save our wide data and also save our AFCARS data.
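If you ever need to go back the other way, tidyr's pivot_longer() reverses this reshaping. This is a minimal sketch, not code from the session; it assumes the nytd_wide object created above.

library(tidyr)

# Split column names like "CurrFTE_3" into a CurrFTE column and a Wave
# identifier, recovering one row per person per wave.
nytd_long <- nytd_wide |>
  pivot_longer(
    cols = -c(StFCID, State, St, RecNumbr),    # everything except the ID columns
    names_to = c(".value", "Wave"),            # ".value" keeps each variable stem as its own column
    names_sep = "_",
    names_transform = list(Wave = as.integer)  # store Wave as an integer again
  )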
And it will be these files that we'll pick up next week, when we start to link our data, that is, to connect the NYTD data to the AFCARS data. Okay, that's it for my presentation. We have about six minutes left for questions. I'll let Paige address the questions as she sees fit.
[Paige Logan Prater] Thanks, Alex, and thanks so much for this super helpful walkthrough of R. Most of the questions that have come through are about access to the R script and example code, and I just want to confirm that those will be available and posted on our website. Correct?
[Andres Arroyo] Yes, those will be posted on our website.
[Paige Logan Prater] Okay, great. Everything will be posted: recordings, transcripts. It just takes a little bit of time to prep everything and have it ready for public access, so bear with us, but all of that will be posted online and available. Also, just a plug: if you weren't with us during the academic year for our monthly Office Hours series, there was an R training component to that series, and all of those transcripts, recordings, and course files are also on the website, on our Events page. The link I put in the chat has everything from past series and will have everything from this series as well. Alex, there is one question about some of the coding, and I'm not sure how best to communicate it.
[Alexander F. Roehrkasse] Sure, I can answer that question. First, though, Paige, in response to the earlier question: Andres, if I understood you correctly, it's not just the output of this presentation but the R script itself that will be posted on the archive. Is that right?
[Andres Arroyo] Yes, the R code will be in the PDF slides of the presentation, at the end of the slides.
[Alexander F. Roehrkasse] Perfect. Is it possible to simply post the R script file as well?
[Andres Arroyo] That's possible.
[Alexander F. Roehrkasse] Okay. I just want to clarify that with the code I've shared with you today, we're standing on the shoulders of giants. I am by no means a coding wizard, so you should not take this as gospel truth. Most of what I've learned about coding, I've learned from looking at other people's code. So you should feel free to steal, repurpose, and use this code in whatever way is helpful to you; no attribution necessary. We're very excited to make whatever coding resources we can available, and I have found this to be the most helpful way to learn how to code. On that note, there was an anonymous question about pipe operators. The simple answer is yes: there is a newer way to write a pipe in the tidyverse world, and there are very slight differences between the two pieces of syntax the question asker listed, but for all intents and purposes those operators are exactly the same, so you can use them interchangeably. In fact, as I update old code, you'll see both types of pipe pop up in my code; they behave the same here (a short illustration appears after this exchange).
[Paige Logan Prater] Thank you, Alex. And I just want to give you some flowers, because we have a comment that presentations one and two were very clear and helpful and that you are a great teacher. So I'm really happy that things are clear and helpful.
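For readers of the transcript, a minimal illustration of the two pipe spellings discussed above, not shown during the session; it assumes the cleaned nytd object.

library(dplyr)

# The magrittr pipe (%>%), loaded with the tidyverse, and the native base R
# pipe (|>, available in R >= 4.1) do the same thing in code like this:
nytd %>% count(Sex)
nytd |>  count(Sex)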
And yeah, we're excited to keep this series going. As always, if you have any questions, Alex's, Noah's, and my contact information are here. [ONSCREEN CONTENT SLIDE 22] Questions? Alex Roehrkasse aroehrkasse@butler.edu Noah Won noah.won@duke.edu Paige Logan Prater paige.loganprater@ucsf.edu [Paige Logan Prater] And that brings us right to the end. [ONSCREEN CONTENT SLIDE 23] NEXT WEEK… Date: July 16th, 2025 Topic: Linking Data Instructor: Alex Roehrkasse [Paige Logan Prater] Next week we will be meeting on the 16th. The topic will be linking data, and we'll be back with Alex again. We look forward to seeing you all next week. [Alexander F. Roehrkasse] Thank you all so much. [Paige Logan Prater] Thanks, everybody. [VOICEOVER] The National Data Archive on Child Abuse and Neglect is a joint project of Duke University, Cornell University, University of California San Francisco, and Mathematica. Funding for NDACAN is provided by the Children's Bureau, an office of the Administration for Children and Families.