Transcript for video titled "LeaRn R with NDACAN, Week 4, 'Linking Multiple Years', January 17, 2024". [MUSIC] [VOICEOVER] National Data Archive on Child Abuse and Neglect. [ONSCREEN CONTENT SLIDE 1] WELCOME TO NDACAN MONTHLY OFFICE HOURS! National Data Archive on Child Abuse and Neglect DUKE UNIVERSITY, CORNELL UNIVERSITY, & UNIVERSITY OF CALIFORNIA: SAN FRANCISCO The session will begin at 11am EST 11:00 - 11:30am - LeaRn with NDACAN (Introduction to R) 11:30 - 12:00pm - Office hours breakout sessions Please submit LeaRn questions to the Q&A box This session is being recorded. See ZOOM Help Center for connection issues: https://support.zoom.us/hc/en-us If issues persist and solutions cannot be found through Zoom, contact Andres Arroyo at aa17@cornell.edu. [Paige Logan Prater] Hi everybody, welcome, and happy 2025! My name is Paige Logan Prater. I am a research associate here at the National Data Archive on Child Abuse and Neglect. This is our monthly Office Hours series, and this is week four. We hope that you had a nice break since our last session in November, and we are ready to start the series up again in the new year. Just to give folks who haven't been to any of our sessions yet a quick overview: we'll do about 30 minutes of our R training session with Frank, maybe 30-35 minutes so we can give a little extra time there. Then the second portion of our session today will be an office hours breakout format, where we'll have Frank and another NDACAN colleague, Alex, and potentially Erin, hold breakout sessions for more tailored questions. Anything from R follow-up questions to research design and stats help with the datasets, things like that. So that will be the second half of the session today. If you have any questions throughout the presentation that Frank is going to give, feel free to put them in the chat or in the Q&A. We will read them aloud at the end so that we can capture them for the recording. So yeah, feel free to put any questions in the chat or the Q&A box. The session is being recorded and will be available for reference on our website. All of our sessions are recorded and available on our website if you need to catch up or revisit anything. And if you have any issues with Zoom, feel free to reach out to Andres, who is on this call as well. Yeah, let's get started. Frank, I'll kick it over to you. [Frank Edwards] Sounds good, thanks Paige. All right, so we'll dive in. [ONSCREEN CONTENT SLIDE 2] LeaRn with NDACAN, Presented by Frank Edwards [Frank Edwards] Today we're going to be discussing how to join multiple years of our administrative data. [ONSCREEN CONTENT SLIDE 3] MATERIALS FOR THIS COURSE Course Box folder (https://cornell.box.com/v/LeaRn-with-R-NDACAN-2024-2025) contains Data (will be released as used in the lessons) Census state-level data, 2015-2019 AFCARS state-aggregate data, 2015-2019 AFCARS (FAKE) individual-level data, 2016-2019 NYTD (FAKE) individual-level data, 2017 Cohort Documentation/codebooks for the provided datasets Slides used in each week's lesson Exercises that correspond to each week's lesson An .R file that will have example, usable R code for each lesson - will be updated and appended with code from each lesson [Frank Edwards] And so the lessons we're talking about today would apply primarily to the AFCARS and NCANDS, but could also apply to the NYTD as well.
The data we'll be using today is in the Box link that Andres just shared. We're going to be using individual-year data from AFCARS. That's fake data, it's been de-identified, definitely not the real AFCARS, but of course you can request the real data from the data archive. We've provided you with an aggregate file for 2015 to 2019, but that's not actually how you would receive the data if you requested it from us; you would generally receive it in individual years. And so that's how I've restructured the data to work with it today. If you go to the Box folder you'll see that I've created three data files, AFCARS 2016, 17, and 18, as separate CSV files, and those are the files that we're going to be using today. [ONSCREEN CONTENT SLIDE 4] WEEK 4: LINKING MULTIPLE YEARS November 11, 2024 [Frank Edwards] So again, it's not November 11th, 2024, my apologies. I kind of wish it were, I guess, but it's not. Here we go. [ONSCREEN CONTENT SLIDE 5] DATA USED IN THIS WEEK'S EXAMPLE CODE AFCARS fake individual-level data (afcars_2016_indv_fake.csv, ...) Simulated (fake) individual-level data that mimics AFCARS structure. We will use the 2016, 17, and 18 files today Can order full data from NDACAN: https://www.ndacan.acf.hhs.gov/datasets/request-dataset.cfm [Frank Edwards] We're going to use the AFCARS fake individual-level data, and again those are under afcars_2016_indv_fake.csv, and we'll also use the 2017 and 2018 files today. These are simulated individual-level data. Again, you can request the full data from NDACAN. [ONSCREEN CONTENT SLIDE 6] CONSIDERING TIME [Frank Edwards] Now, time is an incredibly complicated question to deal with when we're working with the administrative data. We want to think of it as being incredibly straightforward: we just have a year in the data, and we want to treat a year as a calendar year. But we need to be very careful with how we think about and manage time in these data. [ONSCREEN CONTENT SLIDE 7] SUBYR IN NCANDS Variable Name: SubYr Variable Label: Submission Year Definition: The submission year is the Federal Fiscal Year (FFY) 12-month period. The FFY is from October 1 through September 30. The year of data submission is based on the report disposition date. All report disposition dates fall within the submission year. The report date may have occurred in a previous fiscal year. All records have the same year in this field. [Frank Edwards] For example, the submission year variable in NCANDS, right? The submission year, the year period that we cover in each of our data releases, is the 12-month federal fiscal year, not the calendar year. So the year that we're going to be looking at any time we pull up an AFCARS file or an NCANDS file is actually October 1st through September 30th, not January 1st through December 31st. Right? We're covering a slightly different year; this is covering the federal fiscal year, and we want to be aware of that for a few reasons. [ONSCREEN CONTENT SLIDE 8] RPTDT IN NCANDS Report Date Variable Name: RptDt Definition: The month, day, and year that the responsible agency was notified of the suspected child maltreatment referral. This is the date that the report of maltreatment was made. If a state combines several allegations into one report, the date of the report is the date of the initial allegation. The determination as to whether additional allegations are considered part of the original report or whether they are considered a new report is left to each state's procedures.
To help assure confidentiality of the data, this date has been rounded to either the 8th or the 23rd of the month. All other dates in this data are then similarly adjusted by the same number of days to maintain the timespan between all dates. [Frank Edwards] For example, maybe we're interested in looking at cases in a particular year, and we want to use the field RptDt. Now, if I'm looking at the 2016 data and I'm using RptDt, I might be interested in how many reports came in in November of 2016. Well, the Child File, the NCANDS file, for 2016 is not going to have that data; for NCANDS, this is the date on which a report was filed. That data will not be contained within the 2016 release if the report was closed after September 30th, right? So all cases that were either reported or closed after the close of the federal fiscal year will end up in the next year's file, so November and December of 2016 will actually end up in the 2017 file. So if you're really interested in going underneath the year level and thinking about time variation, the lowest level we can do that at is every other week. You'll notice here that we round dates to either the 8th or the 23rd of every month to ensure anonymity and confidentiality. This means that the lowest level of time that we can consider is bi-weekly. That is, we can have two observations each month, and they're rounded to either the 8th or the 23rd. When we're considering something like report date, though, we could have the case where a report in NCANDS originated in July but wasn't closed until December. That case will then end up in the 2017 data and not the 2016 data, because the threshold that determines which year a case is included in is the closing date, not necessarily the original report date. So this speaks to the fact that if we're thinking about using time in any analysis, we really do need to be working with multiple years of the data simultaneously, and then evaluating very carefully when the reports occur and when the interventions occur. [ONSCREEN CONTENT SLIDE 9] A screenshot of an RStudio window. The script browser has a file temp.r open. The script contains the following code: library(tidyverse) dat<-read_tsv("~/Projects/ndacan_data/ncands/CF2022v2.tab") p1<-dat %>% mutate(RpDispDt = as_date(RpDispDt)) %>% group_by(RpDispDt) %>% summarize(n = n()) %>% ggplot(aes(x = RpDispDt, y = n)) + geom_line() table(year(dat$RptDt)) The console contains the following output: > table(year(dat$RptDt)) 2016 2017 2018 2019 2020 2021 2022 1 2 2 10 5578 1481174 2265982 > table(dat$subyr) 2022 3752749 The plot window shows a time series plot of the variable RpDispDt. It shows large monthly fluctuations between October 2021 and October 2022, with spikes at the 8th and 23rd of each month reflecting NCANDS rounding procedures. [Frank Edwards] This is what report date, I'm sorry, report disposition date looks like. It tells us effectively when the case is determined to be closed, and that's how the Children's Bureau assigns a particular case to a particular year of the NCANDS data. So I want you to notice a couple of features of the data. First, you can see that we end up with this incredible drop at both the October 2021 threshold and the October 2022 threshold, right? Those represent the beginning and end of the fiscal year; this is off of the 2022 Child File. And you can see also that there are some weird things going on here.
If I look at report date alone within the data that are included in the 2022 Child File, I end up with one 2016 report, two 2017 reports, two 2018 reports, ten 2019 reports, about 5,500 2020 reports, and about 1.5 million 2021 report-child pairs, right? And that's because for 2021 we have those November and December cases, right? Any case that was closed on or after October 1, 2021 ends up in the 2022 file. And so we might have in this case some very old cases, from 2020, that ended up closed in the window that would make them attached to the fiscal year 2022. Same thing for 2019, 2018, 2017, and 2016. So the report date alone can be very misleading. We need to think about how time is defined and very strongly consider working with multiple years of the data. We can also see the rounding feature of our data when we look at those spikes, right? Those are not real spikes that occur periodically over time; that's simply a function of the rounding, right? So in this case, to ensure that we're not treating this as such an incredibly unstable time series, we would probably want to smooth it to a bi-weekly series as we're treating time, rather than thinking about it as daily, which is how R is parsing it at this point. [ONSCREEN CONTENT SLIDE 10] Guidance Carefully consider HOW you want to consider time. For both NCANDS and AFCARS, many cases that occur in the calendar year 2022 will not be present in the 2022 FY data. For annualized analysis, treating SubYr or FY as equal to the calendar year is OK, if not exactly correct. With a long time series, and for large geographies, the 2 month mismatch isn't likely to cause much error. But for smaller geographies and sub-annual analysis, care is required! [Frank Edwards] All right, so here is my general guidance for working with NCANDS and AFCARS data. And I see Alex on here as well, and he's thought a lot about time and how we consider it in the NCANDS and AFCARS, so feel free to chime in if you think I'm missing anything, Alex. We want to be very careful with how we consider time. For both NCANDS and AFCARS, many cases that occur in the calendar year 2022 will not be present in the 2022 FY data, right? So we know that the November and December cases for 2022 are not in the 2022 FY data, and we're also going to have cases that did not originate in 2022 closing in the 2022 FY data. So we want to think about what we're counting as a case incredibly carefully. We always have to do this with NCANDS because of its structure, where we have both unique reports and children within reports; that is, each row of the data is a report-child pair. So we already have to be careful when we're thinking about what constitutes a report, but we also have to be careful about how we're working with time. For annualized analysis we can kind of assume that some of this stuff may average out, and we can treat SubYr or FY as equal to the calendar year as a crude approximation. It's not perfect. It's not exactly correct. So depending on how precise you want to be, you may want to consider not relying on SubYr in NCANDS or FY in AFCARS as being exactly equal to the year that you're trying to capture. With a long time series and for large geographies, the two-month mismatch and the kind of rolling over of old cases isn't likely to cause a tremendous amount of error, but again it's something you want to think about, something that you want to have a response for and address in your data analysis.
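As a minimal sketch of the bi-weekly smoothing described above, and purely as an illustration rather than part of the session's script, one could bin the disposition dates into two periods per month before plotting. This assumes the NCANDS Child File has already been read in as dat with the RpDispDt column shown in the slide (the real file must be requested from NDACAN):

library(tidyverse)
library(lubridate) # for as_date(), floor_date(), day()
# bin each disposition date into the first or second half of its month,
# then count records per bi-weekly period
biweekly <- dat %>%
  mutate(RpDispDt = as_date(RpDispDt),
         period = floor_date(RpDispDt, "month") +
           days(ifelse(day(RpDispDt) <= 15, 0, 15))) %>%
  group_by(period) %>%
  summarize(n = n())
ggplot(biweekly, aes(x = period, y = n)) + geom_line()

For RptDt, which per the documentation is already rounded to the 8th or 23rd, grouping on the date directly gives a bi-weekly series without any binning step.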
For smaller geographies and for sub-annual analysis, you need to be incredibly careful with thinking about time. [ONSCREEN CONTENT SLIDE 11] OVER TO RSTUDIO [Frank Edwards] Okay, so let's pivot over to some live code. [VOICEOVER] The program written in R is included in the downloadable files for the slides and the transcript. [Frank Edwards] Now I'm going to close us out of PowerPoint and show you a script that I've written up that will help us understand some of the basics of how we can do a join of multiple years of data. This code is available for you under "joining years" in the Box folder, and the data you'll need for it is available in the data subdirectory. I will be using tidyverse functions, so I'm going to load those in. You'll also notice, and we covered this in the data management portion last time, that I'm using an RStudio project as well, to orient my paths to the root directory for my LeaRn data files and my scripts. So we'll start, of course, by loading in the tidyverse, and we'll load in each year of the AFCARS individual-level fake data. [ONSCREEN] > # load tidyverse packages > library(tidyverse) > > # Read in data. I am using an RStudio project so that paths are relative > afcars16<-read_csv("./Data/afcars_2016_indv_fake.csv") Rows: 132347 Columns: 20 - Column specification - Delimiter: "," chr (6): phyabuse_f, sexabuse_f, neglect_f, prtsjail_f, aaparent_f, daparent_f dbl (14): fy, id_num, stnum, sex_f, raceth_f, curplset_f, ctkfamst_f, fosfamst_... i Use `spec()` to retrieve the full column specification for this data. i Specify the column types or set `show_col_types = FALSE` to quiet this message. > > afcars17<-read_csv("./Data/afcars_2017_indv_fake.csv") Rows: 85450 Columns: 20 - Column specification - Delimiter: "," chr (6): phyabuse_f, sexabuse_f, neglect_f, prtsjail_f, aaparent_f, daparent_f dbl (14): fy, id_num, stnum, sex_f, raceth_f, curplset_f, ctkfamst_f, fosfamst_... i Use `spec()` to retrieve the full column specification for this data. i Specify the column types or set `show_col_types = FALSE` to quiet this message. > > afcars18<-read_csv("./Data/afcars_2018_indv_fake.csv") Rows: 51549 Columns: 20 - Column specification - Delimiter: "," chr (6): phyabuse_f, sexabuse_f, neglect_f, prtsjail_f, aaparent_f, daparent_f dbl (14): fy, id_num, stnum, sex_f, raceth_f, curplset_f, ctkfamst_f, fosfamst_... i Use `spec()` to retrieve the full column specification for this data. i Specify the column types or set `show_col_types = FALSE` to quiet this message. > [Frank Edwards] So I have those parsed as three separate files, 16, 17, and 18, and let's just take a look at the structure of the data.
[ONSCREEN] > # evaluate data structure > glimpse(afcars16) Rows: 132,347 Columns: 20 $ fy 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016,… $ id_num 8422, 8423, 8424, 8425, 8426, 8427, 8428, 8431, 8435, 8437,… $ stnum 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,… $ sex_f 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2,… $ raceth_f 7, 7, 7, 1, 7, 7, 1, 99, 1, 1, 7, 6, 7, 1, 6, 7, 1, 6, 1, 9… $ phyabuse_f "0", "0", "0", "1", "0", "0", "0", "0", "0", "0", "0", "0",… $ sexabuse_f "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",… $ neglect_f "1", "1", "0", "1", "1", "1", "1", "1", "1", "1", "0", "1",… $ prtsjail_f "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "1", "0",… $ aaparent_f "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0",… $ daparent_f "0", "0", "1", "0", "0", "0", "0", "0", "1", "0", "0", "1",… $ curplset_f NA, 2, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 4, 2, 2… $ ctkfamst_f 2, 2, NA, 2, 1, 2, 2, 2, NA, 2, 2, 2, 1, 2, NA, 1, 2, 1, 2,… $ fosfamst_f 0, 1, 2, 0, 2, 2, 1, 1, 0, 2, 2, 2, 1, 2, 2, 2, 0, 0, 2, 2,… $ agefirstrem_f 4, 3, 10, 16, 7, 8, 16, 9, 8, 12, 7, 15, 12, 14, 13, 8, 12,… $ ageatlatrem_f 11, 6, 10, 16, 7, 10, 16, 9, 8, 12, 7, 15, 12, 14, 13, 8, 1… $ inatend_f 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,… $ inatstart_f 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1,… $ entered_f 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0,… $ exited_f 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,… > [Frank Edwards] I'll use glimpse here because it's a tidyverse function that works similarly to head, but it lets us actually see all of the columns when we have a large number of them, so I think it's a little better than using head. So here, right, we have fy, we have the unique id number, we have the state number, which is a FIPS code for the state, we have the sex variable, and these again are all faked, the race/ethnicity variable, and then we have the set of maltreatment indicators for whether these particular forms of maltreatment were involved in the case. We have the risk factors available; parent in jail, alcohol-abusing parent, and drug-abusing parent are very commonly used ones. The current placement setting variable is always a really interesting one to look at. And then there are a number of others, but today I want us to think about how we're going to look at entries specifically. Entries and exits, as well as inatend_f, are very important measures for thinking about both the processes of flow into a state's foster care system and the processes of flow out of a state's foster care system. So I'd like to look at unique id coverage across years. I'm going to use the %in% operator to do this. This is a funky one that some of you may not have seen before, but R has notation for set operations. So what I'm going to do here is basically say, let's take all of the id numbers in afcars16$id_num and ask R to find all of the matches that occur in afcars17. [ONSCREEN] > table(afcars16$id_num %in% afcars17$id_num) FALSE TRUE 46897 85450 > table(afcars16$id_num %in% afcars18$id_num) FALSE TRUE 80798 51549 > table(afcars17$id_num %in% afcars17$id_num) TRUE 85450 [Frank Edwards] And that's going to give me a vector of TRUEs and FALSEs, but that alone isn't necessarily particularly informative, so let's put that in a table. What this is telling me is that I've got about 85,000 cases where the id number in the afcars16 record also appears in afcars17.
I've got 51,000 from 16 that also appear in 18. And, yeah, that last one is obviously all TRUE, afcars17 against afcars17; I'm not sure why I even did that, sorry, but there we go. So we have some overlap here; we have repeat cases for children over time. Now, the easiest thing to do here to join these together is to take our afcars16, our afcars17, and our afcars18 and simply bind them together. [ONSCREEN] > # Simple case: binding rows to combine into single long table > # Note: columns must identically match. Harmonize beforehand if needed > > afcars_full<-afcars16 %>% + bind_rows(afcars17) %>% + bind_rows(afcars18) [Frank Edwards] We know that they have the same structure in terms of the names of columns and the number of columns, so we don't need to worry about a join here; we can stack them on top of each other vertically, and I'm going to use the bind_rows function for that. So I'll start with afcars16, I'll add afcars17 to the bottom of it, and then add afcars18 to the bottom of that, and now I'll end up with a data set with about 269,000 rows, which should be the sum of the rows for each of the years. So there we go. [ONSCREEN] > nrow(afcars_full) [1] 269346 > nrow(afcars16) [1] 132347 > nrow(afcars17) [1] 85450 > nrow(afcars18) [1] 51549 [Frank Edwards] Now we have a single object, which is, you know, a bit silly because you already had that as a csv, but this is what I would formally do when I receive the data from NDACAN. I would take each year of the data, I would perform any operations I wish to perform over each year of the data, and then I would bind them to each other to create a long table, right? So now we have afcars_full. We can look at the head and tail to confirm that I have 2016 on top and 2018 on the bottom. [ONSCREEN] > head(afcars_full) # A tibble: 6 x 20 fy id_num stnum sex_f raceth_f phyabuse_f sexabuse_f neglect_f prtsjail_f 1 2016 8422 1 2 7 0 0 1 0 2 2016 8423 1 1 7 0 0 1 0 3 2016 8424 1 1 7 0 0 0 0 4 2016 8425 1 1 1 1 0 1 0 5 2016 8426 1 1 7 0 0 1 0 6 2016 8427 1 2 7 0 0 1 0 # i 11 more variables: aaparent_f, daparent_f, curplset_f, ctkfamst_f, fosfamst_f, agefirstrem_f, ageatlatrem_f, inatend_f, inatstart_f, entered_f, exited_f > tail(afcars_full) # A tibble: 6 x 20 fy id_num stnum sex_f raceth_f phyabuse_f sexabuse_f neglect_f prtsjail_f 1 2018 354528 6 2 2 1 0 0 0 2 2018 354529 6 1 2 0 0 0 0 3 2018 354540 6 1 2 0 0 1 0 4 2018 354549 6 1 1 0 1 0 0 5 2018 354550 6 2 1 0 0 0 1 6 2018 7924 6 1 7 0 0 0 0 # i 11 more variables: aaparent_f, daparent_f, curplset_f, ctkfamst_f, fosfamst_f, agefirstrem_f, ageatlatrem_f, inatend_f, inatstart_f, entered_f, exited_f [Frank Edwards] Any questions so far? Not seeing anything in the chat. This is all pretty straightforward stuff thus far. We're going to do something weird next. All right. So what if we wanted to know about the proportion of entries that involved neglect allegations, and we wanted to look at that over time? The first thing I'm going to do is take afcars_full and subset to only entries. I'm going to use filter to subset exclusively to entries; this is what that would look like, right?
[ONSCREEN] > afcars_full %>% + filter(entered_f == 1) # A tibble: 57,711 x 20 fy id_num stnum sex_f raceth_f phyabuse_f sexabuse_f neglect_f prtsjail_f 1 2016 8424 1 1 7 0 0 0 0 2 2016 8425 1 1 1 1 0 1 0 3 2016 8428 1 1 1 0 0 1 0 4 2016 8435 1 1 1 0 0 1 0 5 2016 8440 1 2 7 0 1 0 0 6 2016 8442 1 1 6 0 0 0 0 7 2016 8443 1 2 7 0 0 1 0 8 2016 8464 1 2 2 0 0 1 0 9 2016 8470 1 1 7 0 0 1 0 10 2016 8475 1 1 7 0 0 0 0 # i 57,701 more rows # i 11 more variables: aaparent_f, daparent_f, curplset_f, ctkfamst_f, fosfamst_f, agefirstrem_f, ageatlatrem_f, inatend_f, inatstart_f, entered_f, exited_f # i Use print(n = ...) to see more rows [Frank Edwards] We end up with a much smaller object, 57,000 or so rows compared to the roughly 269,000 that we had, and I'm going to group by fiscal year. group_by, we covered this in our tidyverse sessions, but effectively it allows us to perform summary operations over discrete values of our grouping variable; in this case fy will be treated as the grouping variable. And I'm going to compute the total number of entries using the n() function; that'll just compute how many observations are within each fiscal year for entries. And then I'll also compute the number of neglect-related entries. So we've already subset to only entries, and I'm going to ask R to compute how many cases we observe where neglect was equal to one, which indicates that there was neglect involved in the case. Let's look at what this object looks like before we proceed. [ONSCREEN] > afcars_full %>% filter(entered_f == 1) %>% group_by(fy) %>% summarize(total_entries = n(), neglect_entries = sum(neglect_f == "1")) # A tibble: 3 x 3 fy total_entries neglect_entries 1 2016 50274 30994 2 2017 3925 2160 3 2018 3512 2043 [Frank Edwards] So now what I have is an annual count: fiscal year, total entries, and neglect entries. Now I can do a simple mutate to compute a new column, entries_prop_neglect, which is just going to be neglect entries divided by total entries, and this will tell me what proportion of entries involved a neglect allegation. [ONSCREEN] > afcars_full %>% filter(entered_f == 1) %>% group_by(fy) %>% summarize(total_entries = n(), neglect_entries = sum(neglect_f == "1")) %>% mutate(entries_prop_neglect = neglect_entries / total_entries) # A tibble: 3 x 4 fy total_entries neglect_entries entries_prop_neglect 1 2016 50274 30994 0.617 2 2017 3925 2160 0.550 3 2018 3512 2043 0.582 [Frank Edwards] Right? So in 2016, 62% of entries involved a neglect allegation; in 2017, 55%; and in 2018, 58%. And then I'd like to send this over to ggplot to visualize, and what I'm going to do is put fy on my x-axis as time. Ah, good catch, Brett; Brett mentioned that this doesn't account for individual-level duplicates, right? We could add a filter with distinct() on id_num if we wish to remove any duplicates; we could do it that way. Okay, actually I'd like to use !duplicated() here; yeah, that'll work. Yep, and this is what we'll get when we visualize it now that we've caught the problem that Brett flagged. So once we've done that, we end up with very similar numbers: 62% in 2016, 58% in 2017, and 59% in 2018. And then once we send that over to ggplot we get this line plot, right, that shows us this set of points. Maybe geom_line isn't the most useful here; it's kind of implying that we have more data than we do, so let's just use geom_point to show these three data points.
[ONSCREEN] The following code produces a graph with the Y-axis labeled entries_prop_neglect and the X-axis labeled fy. There are three points displayed at these coordinates: point 1 (2016, 0.617), point 2 (2017, 0.550), point 3 (2018, 0.582). ## Lets evaluate the proportion of entries involving ## neglect allegations afcars_full %>% filter(entered_f == 1) %>% filter(!duplicated(id_num)) %>% group_by(fy) %>% summarize(total_entries = n(), neglect_entries = sum(neglect_f == "1")) %>% mutate(entries_prop_neglect = neglect_entries / total_entries) %>% ggplot(aes(x = fy, y = entries_prop_neglect)) + geom_point() [Frank Edwards] Yeah, okay. So that's one way we can think about computing a simple index to compare over time. And there's of course no reason we couldn't do this by state as well, if we wanted to do that. So I could add group_by(fy, stnum). The only thing I might wish to add here, and now we'll go back to our line plot, is a facet_wrap by stnum so that I get separate plots for each state. And I have six states in my fake data, so now we can separately look at the trends across those six states, right? It looks like in states one and four we have higher levels of neglect-related removals, and in states six, three, and five relatively lower levels. [ONSCREEN] The following code produces a graphic showing six plots. Each plot is a line graph with X-axis fy and Y-axis entries_prop_neglect. afcars_full %>% filter(entered_f == 1) %>% filter(!duplicated(id_num)) %>% group_by(fy, stnum) %>% summarize(total_entries = n(), neglect_entries = sum(neglect_f == "1")) %>% mutate(entries_prop_neglect = neglect_entries / total_entries) %>% ggplot(aes(x = fy, y = entries_prop_neglect)) + geom_line() + facet_wrap(~stnum) [Frank Edwards] And I will add this updated code to the Box if you want to see the facet_wrap and all that as your demo. So let's do a slightly more complex case: let's create wide data by subject. Right now we have long data; afcars_full is set up in a long format, where each individual potentially has more than one observation. So let's just do table(duplicated(afcars_full$id_num)). This tells us that we have about 137,000 observations in this data whose id number occurs more than once. And what we could imagine doing is converting this into a wide format. That is, and for those of you who are pivoting from Stata this might be how you're used to looking at things, we could have an observation for 2016 for a child, and then a second observation for 2017 in a separate column rather than a separate row, and then another column for 2018, right? So we could take each of the variables that we're interested in and create new columns for them rather than representing them as long in the rows. [ONSCREEN] > # slightly more complex case: wide data by subject > # Maybe we want to know about children's placement setting trajectories > > # Start with all unique IDs across years > # then drop duplicates > wide_data<-data.frame(id_num = c(afcars16$id_num, afcars17$id_num, afcars18$id_num)) %>% distinct() > head(wide_data) id_num 1 8422 2 8423 3 8424 4 8425 5 8426 6 8427 [Frank Edwards] So the thing I'm going to do here is first take all of my id numbers; this will give me all of the id numbers, and then distinct will subset it to only include the unique id numbers. Yeah, okay, so this is basically the vector that we're going to be sending to R, all of our ids, and then the distinct function will remove duplicate values, so we'll only end up with unique ids.
And this is a simple vector at this point; well, it's a data frame with one column that includes only the id number. Now what I'd like to do is join the placement setting in 2016. So what we're going to set up is, for each id number, what placement setting were they in in 2016, then what placement setting were they in in 2017, and then what placement setting were they in in 2018, so that with a single row we can see a case trajectory. So I'm going to take my wide data, which is just my id number at this point, and I'm going to set up a left_join to do this, and what I'm going to join to it is afcars16, taking only id number and current placement setting, right? [ONSCREEN] > afcars16 %>% + select(id_num, curplset_f) # A tibble: 132,347 x 2 id_num curplset_f 1 8422 NA 2 8423 2 3 8424 2 4 8425 3 5 8426 2 6 8427 2 7 8428 2 8 8431 2 9 8435 3 10 8437 2 # i 132,337 more rows # i Use print(n = ...) to see more rows [Frank Edwards] So that's the start of my object. I'll join id number and current placement setting, and I'm going to rename placement setting so that it's flagged as a 2016 value. So we're going to end up taking this object [ONSCREEN] > afcars16 %>% + select(id_num, curplset_f) %>% + rename(curplset_f_2016 = curplset_f) # A tibble: 132,347 x 2 id_num curplset_f_2016 1 8422 NA 2 8423 2 3 8424 2 4 8425 3 5 8426 2 6 8427 2 7 8428 2 8 8431 2 9 8435 3 10 8437 2 # i 132,337 more rows # i Use print(n = ...) to see more rows [Frank Edwards] And joining it to our wide data. [ONSCREEN] > wide_data<-wide_data %>% + left_join(afcars16 %>% + select(id_num, curplset_f) %>% + rename(curplset_f_2016 = curplset_f)) Joining with `by = join_by(id_num)` [Frank Edwards] And what we'll end up with is now [ONSCREEN] > head(wide_data) id_num curplset_f_2016 1 8422 NA 2 8423 2 3 8424 2 4 8425 3 5 8426 2 6 8427 2 [Frank Edwards] an id number column that now includes the placement setting variable for the child in 2016. Then we're just going to repeat that with 2017 and 2018. [ONSCREEN] > wide_data<-wide_data %>% + left_join(afcars17 %>% + select(id_num, curplset_f) %>% + rename(curplset_f_2017 = curplset_f)) Joining with `by = join_by(id_num)` > # and 2018 > wide_data<-wide_data %>% + left_join(afcars18 %>% + select(id_num, curplset_f) %>% + rename(curplset_f_2018 = curplset_f)) Joining with `by = join_by(id_num)` [Frank Edwards] And let's see what we get. [ONSCREEN] > head(wide_data) id_num curplset_f_2016 curplset_f_2017 curplset_f_2018 1 8422 NA 3 2 8423 2 NA 3 8424 2 NA 4 8425 3 NA 5 8426 2 2 6 8427 2 1 [Frank Edwards] Now, for child 8422, we know that they have no reported placement setting data in 2016, they were in placement setting 3 in 2017, and no reports in 2018. For child 8427 we have a 2 for 2016, a 1 for 2017, and an NA for 2018. If we wanted to extend this, we could also look at afcars_full again real quick to see that we have variables that can help us out here. We have the inatend and inatstart variables, and those can help us register whether the child was not in foster care; maybe we want to create a new code for whether that child was in foster care that year or not (a brief sketch of one way to flag this follows below). I think we can pretend for now that the NAs indicate that the child was not in foster care, but of course we'd want to investigate that really carefully in real-world data analysis. That's all I have for my demo today, relatively short. The topic of joining multiple years of our data together, once you've considered the complexities of how time is handled in our datasets, is relatively straightforward.
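As that brief sketch, and purely as an illustration rather than part of the session's script, one could flag whether each id appears in each year's file, so that an NA placement setting can be distinguished from a child who simply is not in that year's data. This assumes the wide_data, afcars16, afcars17, and afcars18 objects created above:

# flag whether each child appears in each year's file; an NA placement setting
# in a year where the flag is FALSE likely means the child is not in that
# year's foster care file at all, rather than a missing value
wide_data <- wide_data %>%
  mutate(in_file_2016 = id_num %in% afcars16$id_num,
         in_file_2017 = id_num %in% afcars17$id_num,
         in_file_2018 = id_num %in% afcars18$id_num)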
We're going to bind them together and work with them like we would any other long data format. Working wide gets a little trickier, but not terribly tricky. The big thing to always keep in mind here, again, is that for NCANDS the year of inclusion is determined by the report disposition date, and only cases that have a report disposition within that federal fiscal year, October 1 through September 30, will be included in that year's submission. So if we are interested in doing time series analysis, we really need to work with multiple years of the data, think carefully about which file particular observations will occur in, and not over-rely on a single year of data for time series analysis unless we're willing to truncate and only consider the fiscal year itself. Obviously the way the fiscal year overlaps with the calendar year might make some analyses a little trickier. So as long as you're careful about how time is constructed within the data, you should have a pretty easy course. I do have some homework up for you in Box if you want to check it out. It's pretty straightforward. We are asking you to take a look; we have an R script here, yeah, you can't of course preview R files on Box, can you? But in homework four we're going to ask you to load each file, count unique observations, use unique() to establish unique children in each year of the data, create subsets for only entries, use unique ids to identify and join exits to entries, and then create a new variable to compute average length of stay in foster care for each child across time. So that is, we'll take children who enter in 2016, then we'll use our other data sets to find their exits, we'll join those together, we'll compute the length of time each child stayed in foster care, and then we'll compute the mean time that children stayed in foster care in this simulated data set. I've also provided a solution set for you if you need support for each of the questions. So that's all available. Yes, pivot_wider is great. Doug pointed out to me that we can use pivot_wider very easily to do some of this; I felt the joins would be useful to show, but pivot_wider is a great function to use once you get your head around how it works, and it's handling a lot of the joining business for you. So if you want to check out the pivot_wider and pivot_longer functions, those are incredibly helpful once you get a little more advanced (a short sketch appears below, after this segment). All right, that's all I've got for today if there aren't any questions. [Paige Logan Prater] Thanks, Frank. Let's take just a minute to make sure that folks don't have any questions about the presentation. I think even just conceptualizing how time is taken into account in these datasets can be a little tricky sometimes. Let me just check the Q&A box. Okay, it's looking like... [Frank Edwards] No open questions. [Paige Logan Prater] No open questions, and Frank, you already responded to Brett's question about not accounting for individual duplicates, right? [Frank Edwards] Yeah, yep, and we added that to our code, good catch Brett. Sometimes we care about potentially seeing repeat incidents with children, like reports or foster care placements; we might want to know about every single one that comes through, or we might be interested in counting unique children. And again, that's a researcher question, an analytic question that you need to answer: whether you're interested in distinct children or in repeat occurrences.
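As the short pivot_wider sketch referenced above, and assuming the afcars_full object built earlier in the session with one row per child per fiscal year, the wide placement-setting table from the demo could likely be produced in a single step along these lines; this is illustrative rather than part of the session's script:

# reshape long to wide: one row per child, one placement-setting column per year
# (if a child had multiple rows within a year, add a distinct() or a values_fn)
afcars_full %>%
  select(id_num, fy, curplset_f) %>%
  pivot_wider(names_from = fy,
              values_from = curplset_f,
              names_prefix = "curplset_f_")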
[Paige Logan Prater] Thank you so much, Frank, as always. We will now transition to our breakout rooms. [VOICEOVER] The National Data Archive on Child Abuse and Neglect is a joint project of Duke University, Cornell University, University of California San Francisco, and Mathematica. Funding for NDACAN is provided by the Children's Bureau, an Office of the Administration for Children and Families. [MUSIC]