[Musical cue then voiceover] National Data Archive on Child Abuse and Neglect. [Erin McCauley] All right everyone, we are at time, so we're going to kick things off. Welcome to the 2022 NDACAN Summer Training Series. We have Frank here today who's going to be doing some live coding. Because this is a really technical presentation, we'll be taking questions throughout, but if you have a broad question not specifically about the technical aspects, save it until the end. If you've got a clarifying or technical question as Frank is going through his presentation, feel free to put it into the Q and A. We ask that everyone use the Q and A box instead of the chat box for questions because it allows us to capture those questions when we create the webinar series. Which leads me to the fact that this session is being recorded: we turn the Summer Training Series, which is a live workshop series, into a webinar series at the end of the summer that will be available on the website. So don't panic about taking notes, because the whole thing will be available later. We also have webinar series from the last four years, so if there's anything you hear here about different data sets you can always look at prior Summer Training Series to learn more about those data sets. If you have Zoom issues, I think we're all pretty expert at Zoom after the last two years and probably existentially tired, but if you do have remaining Zoom issues there's a lot of support online, or Andres, who is our big organizer of this event, shout out to Andres, can be reached by email for any support or assistance. Okay. So as I said, this is the NDACAN Summer Training Series. This series is run every summer by the National Data Archive on Child Abuse and Neglect; we are co-hosted at Cornell University and Duke University. The theme of this summer's series of workshops is "The Power of Linking Administrative Data," and so today's presentation will be about linking our administrative data out to external products. This series is run by NDACAN, and we are funded through a contract with the Children's Bureau. Thank you. So here's the overview of this summer: we are at session three, linking NDACAN data with external products, and at the end of this presentation we will be halfway through the Summer Training Series. It's a very exciting time. Clayton, who is not here with us today because he is moving to start graduate school, exciting, helped us plan this series over the last year, and I want to shout out all of our participants from last year because they're the ones who gave suggestions for the topics for this year's series. You will be receiving a survey from me at the end of the summer asking how we did, what you liked, what you didn't like, and gathering ideas for next year, so if you want to see something you haven't seen yet, keep it in the back of your mind and put it into that survey. Next. So I'm going to pass it over to Frank now, who's going to take away this presentation. Thank you to Frank, and thanks everyone for being here. [Frank Edwards] Hi everybody, yeah, great to see you all here, happy to be working with you today on linking the NDACAN administrative data products to external sources of information.
In general, when we link NDACAN data, such as today where we're going to look at the NCANDS child file and the AFCARS foster care file, the first question we want to ask is: what's our key through which we're going to link to an external product? Now obviously we're not able to, nor are we allowed to, link these data at the individual level. Any effort to identify individuals is not allowed under your data use agreement, but we can link the data at the geographic level and the time level pretty typically. We might also consider other levels that we're interested in linking at. The key question when we're thinking about a data join is to find what I'm going to call our keys, that is, the variables that have shared information across the tables that we want to conduct our join across. So we're going to talk about finding those keys, we're going to talk about how to access population data, and then we're going to join it to NDACAN files. The purpose of our exercise today is to compute some state-level rates of exposure to child welfare investigations and to foster care contact. So we're going to construct per capita rates, and we can do that using population data. Okay, so let's dive in, and again, at any point feel free to raise your hand, unmute yourself, or throw it in the chat if you have any questions as we go. I am going to be doing some live coding today in R using the Tidyverse packages and Rstudio. If you're using R and want to follow along, all the software you need is free and open source, but you'll need R, Rstudio, and Tidyverse. The general concepts we're going to be discussing here apply to any statistical package; if you're using SPSS or SAS or Stata the general principles are the same, though obviously the syntax and the code will be different. Okay, so let's find our keys. To identify key variables, what we need to think about is our primary unit of analysis. When we think about key variables, typically we're thinking about units of analysis that we're interested in, that we want to use to bring in additional information beyond what's already available in the NDACAN data products. The key variables in general are going to be used to join one table to another, and when I say table here I effectively mean a flat file. I typically work with csv files, but whatever file you're working with, we can think of a data set as a tabular representation of data, that is, it has columns and rows. So we're going to use key variables, that is columns, to join one table to another. These key variables usually are important characteristics such as time, place, sex, age, race, ethnicity, or other characteristics that we may be interested in learning about. Really critically, when we're designing a join we want our key variables to have identical types and levels in both tables used in the join. So if it's numeric, are we talking about integers? For example, time we might typically treat as an integer variable in single increments, but we could also treat it as a time variable itself; there are a number of formats we can use to document time in data analysis packages, and we want to make sure that we have matching types across those. One problem that routinely occurs is with geography. So for geography, let's say we're using U.S. states.
U.S. states I've seen routinely represented in three ways. The first would just be the full spelled-out name, the second would be the two-letter postal abbreviation, and the third might be a two-digit numeric FIPS code; those are the Federal Information Processing Standards codes used by folks like the Census. And so we want to make sure that we have identical types on both tables that we're going to join so that our match can be successful. We'll walk through that a little bit. Okay, if you'd like to follow along at home, I do have the data available at this Github repository. It's de-identified; these are based on the 2019 child file, a limited subset of variables that we're going to use to aggregate up. Sorry, I see a question in the chat, I don't know if that's for me, yes. So on this repo, which I'm just going to tab over and show you what it looks like, this is the repository on Github if you've never seen Github before. [ONSCREEN Github website file list is a folder called data and two .r scripts] Assuming the internet will work for me, you can see this is my NDACAN workshops repository, and if you have git installed you can actually just click here [ONSCREEN clicks on "code" button, a sub-window called "clone" appears with a web link and a "copy" button and Frank clicks the "copy" button], copy this link, and then we can go to, if you're on mac, I use Terminal, but whatever your terminal is. And again, if this is above your sort of pay grade that's okay too, but I'm going to [ONSCREEN an iterminal window appears and Frank types "git clone" and pastes the git repo address] type git clone, paste that in, and it's going to download all of the software and data off of this repository onto my computer. The Git software is free; just Google git install Windows or git install Mac and you can download it. I'm downloading the data right now onto my machine. Okay, so while that's running I'm going to walk us through some more of the slides. So this is the NCANDS data structure that we're going to be working with. These are effectively child subjects of maltreatment investigations, and they could be repeated. We've removed the unique ids here, so we don't have the identifiers; we just have the age of the child, the state in which the investigation occurred, and the year of the data file. So here we have three variables, and does anyone see something that we could use as a key to join to a population file? What do y'all think? What could make a good key that we could join onto a population data file? So I see subyr here. Yes, state, so we have state, we have subyr, and we have child age. Actually we could use all three of these, and we are going to use all three of these as we go, and I'll show you how we do it. But remember, we talked earlier about making sure our names and labels are all harmonized, and in the case of the NCANDS child file we have some names that are absolutely not going to match what we have in the Census data product we're going to use. So subyr, staterr, chage, these are variable names that are unique to NCANDS, so we're going to need to modify those once we get going. But we have year, we have state, we have the age of the child, and theoretically all of those could be useful for joining to population data.
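As a minimal sketch of what a keyed join looks like in R with the Tidyverse, here is a tiny made-up crosswalk covering the three state representations just mentioned; the rows are illustrative only, and in practice you would build out all states and territories.

library(tidyverse)
# toy crosswalk: full name, two-letter postal code, and two-digit FIPS code (kept as character to preserve leading zeros)
state_xwalk <- tibble(
  state_name = c("Alabama", "Alaska", "Arizona"),
  state      = c("AL", "AK", "AZ"),
  st_fips    = c("01", "02", "04"))
# a file that only carries FIPS codes can pick up the postal abbreviation through the crosswalk
fips_only <- tibble(st_fips = c("02", "01"), some_count = c(10, 20))
fips_only %>% left_join(state_xwalk, by = "st_fips")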
And again, this is the file that you can access through the Github repository, with these three columns, if you want to follow along. Okay, here's the AFCARS data structure. We have four variables here: FY, that is the fiscal year of the reporting; STATE, which is that two-digit FIPS code I mentioned earlier; St, which is the two-letter state postal abbreviation; and ageatstart, which is the age the child was at the start of the reporting period. So let's think briefly about differences between these file structures. We have similar variables, FY and subyr, so we're going to need to harmonize those names, and when I say harmonize I mean make them identical. We have staterr and we have St, so we're going to need to harmonize those names to make a join effective. We also have STATE as a FIPS code here; we can't join STATE as it is to staterr, but we're going to be able to use this additional FIPS code later, so we're going to keep it in the data. Ageatstart is equivalent to child age in the NCANDS file, so we're getting similar kinds of information on different events. Here, this is children who were in foster care at any point during the 2019 reporting period. Okay, so let's talk about what we're going to do with these data to bring in the population data, and the first step is obviously to access that data. I really like using the data from NIH SEER. This is a product that the National Institutes of Health, through the Surveillance, Epidemiology, and End Results program at the National Cancer Institute, has put together. I have a link for you in the slide here [ONSCREEN https://seer.cancer.gov/popdata/yr1990_2020.singleages/us.1990_2020.singleages.adjusted.txt.gz ] but what this file is is a version of the U.S. Census small area population estimates. And the reason this data is so helpful, and I'll just hop over to a link to it real quick, oh, I'm downloading a giant file, which I also needed to do. I'll show you how to access the SEER population data; you can see I go here quite often. Sorry, I'm seeing in the chat, yeah, thank you Erin. So if you want to click that link it'll download the file; now the file is quite large, I believe it's a two gigabyte file once we've extracted it, and 258 megs just as a compressed file. But this is the SEER data repository; I'm going to put this in the chat. And the file that I'm typically going to work with, to do age-specific estimates, is the single-year age groups file, not the 19 age groups file, which groups the whole population into bands like zero to four and five to nine; we don't want that. When we're looking [ONSCREEN at the seer.cancer.gov website, the speaker scrolls to the County-Level Population Files - Single-year Age Groups] to get race-specific and ethnicity-specific estimates as well, then we want this "4 Expanded Races by Origin" file; this is using the bridged race categories. I'm sorry, I'm getting a connection unstable notice, so I'm going to move somewhere where my wi-fi will be a little bit better; I'll keep talking through it. But we're going to use this "4 Expanded Races by Origin" version, and that's going to include information that we can use to join to the race and ethnicity categories that we have available in the AFCARS and the NCANDS. And someone let me know in the chat if you can't hear me, but it seems like it's okay.
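Before writing a join, it can also be worth confirming that the levels of a prospective key actually overlap across tables. A quick check with setdiff(), using the demo tables read in later in the session, might look like this (the variable names assume the pre-harmonization NCANDS and AFCARS columns):

# values of the state key present in one file but not the other; both should come back (nearly) empty
setdiff(unique(ncands$staterr), unique(afcars$St))
setdiff(unique(afcars$St), unique(ncands$staterr))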
So yeah, we're going to download that file. You can see up here it's taking me quite a while; we're not going to wait an hour for that to download, and fortunately I already have it downloaded on my computer. It is a large file, so you want to get that download going early, because it includes data at the county level for each age group and each racial and ethnic group over 30 years, so you can imagine that's quite a large file. Okay, back to Powerpoint. So again, I'm using that single-year age groups "4 Expanded Races by Origin" file, all states combined, adjusted. These are very similar to the SAIPE data, but we get more granular information on child age and child race and ethnicity than we would from the Census products, that is, without directly accessing the microdata. Okay, so SEER's great, we should use SEER. Now, SEER's format is a little funky. [ONSCREEN code displayed: library(tidyverse) pop<-read_fwf ("us.1990_2020.singleages.adjusted.txt", fwf_widths(c(4, 2, 2, 3, 2, 1, 1, 1, 2, 8), c("year", "state", "st_fips", "cnty_fips", "reg", "race", "hisp", "sex", "age", "pop")))] It's a fixed width file, so I'm going to do some live coding for you today where we work through together how to work with these data. Fixed width files, if you haven't seen them before, basically consist of a single block of text where there are no commas or anything else separating the variables, but the data repository gives us information on the width of the columns in the file. So we know, for example, that the first column is year and it is four characters wide. You'll see that I'm giving a vector of widths here, fwf_widths, when I read in the file: the first column is four wide and it's called year, the second column is two wide and it's called state, the third column is two wide and it's called st_fips, the fourth column is three wide, cnty_fips, and so on for all the variables, and this is what it looks like when we read it in. So do we see keys here? Do we see key variables that look similar to the keys that we had in our other files? Well of course: we have year here, which will join easily onto subyr from NCANDS and FY from AFCARS. We have state here, that's the two-letter abbreviation, and we also have st_fips, so we have the possibility of choosing whether we want to use the two-letter abbreviation or the FIPS code to do the join. We also have age, but notice that age right now was read in as a character; yep, all three of the key variables are here. And then we also have population, but notice that it was read in as an eight-character-wide string; it's not currently being treated as a number, it's being treated as a character vector, so we're going to need to fix that. But that's okay, we can fix it. Okay, so to join onto the NDACAN files, first I want to give us a kind of high-level overview of what we're going to do, and then we're just going to start coding. We're going to take three tables, these tabular data files, that is the AFCARS foster care file, the NCANDS child file, and the SEER population data, three data tables, and we're going to join them into one file that allows us to compute per capita measures of screened-in reports and per capita measures of children in foster care by age and state. So first we're going to compute age- and state-specific counts of CPS investigations using NCANDS and children in foster care using AFCARS.
Then we'll join those counts to the population data, and once we've completed the join of NCANDS to AFCARS and the join of NCANDS and AFCARS to the population data, we'll compute population-adjusted rates. So it's time for data cleaning, it's time for some live coding. And remember, our key task here is to make variable names match exactly, variable types match exactly, and the levels of those variables match as appropriate. So I'm going to pivot over to R, and again feel free to hop in the chat if you have questions as we go. So this is Rstudio for those of you who haven't seen it before; I'll quit out and we'll just launch it fresh. Oops, sorry. Okay, let's see, my repo is done, let's see if we have my data file. We will soon, excellent. So I'm going to launch Rstudio. I'm on a mac, but you could use this obviously on pc or linux or whatever platform you're on. [reading chat] Yes, the R script will be posted; whatever I write here I'm just going to save into the project file and then I'll upload it to that Github repository so you can see it when we're done. So the first thing I like to do when I am working in Rstudio is to create a new project. A project helps me keep everything organized within a single folder. I'm going to set this in an existing directory, and we can find that I have this NDACAN workshops folder that I cloned off of Github from the terminal; remember when I was over here and I typed git clone and then put in the URL for the repository. If you want to do that on your own and you have git installed, all you have to do is click this green code button, click this copy button, and then type git clone, and it will make a folder on your computer that has all of the software in it. Okay, so we're going to use this one, so we're going to create a project there. What that's going to do is help me keep everything organized within one folder, and you can see that I've already got some of the stuff from this folder now showing up in my files viewer. Now the first thing I want to do is create a new file, a new R script; we're going to save it as read_NDACAN_data [URL for read_NDACAN_data.R is https://Github.com/f-edwards/ndacan_workshops/blob/main/read_ndacan_data.R]. Okay, and you can see that I've got a new file there. The population data, I showed you the code for how we read it in, and that's available for you right here in make_pop_data [URL for make_pop_data.r is https://Github.com/f-edwards/ndacan_workshops/blob/main/make_pop_data.r], so you can download the latest version of the SEER file, I don't think 2021 is out yet, and there's my data. Right, so I kind of want to show this to you just so you can get a feel for what these fixed width files look like, because they're kind of funky. But yeah, there it is, so that's what the file itself actually looks like. It's wild, it's just, you know, like a bazillion rows of data that's like 24 characters wide, and so we need to tell it exactly how long each variable is for it to parse the file correctly. It doesn't read in automatically; we have to be really specific with how we read these in. Okay so, we are going to just copy this file over into my data folder.
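For reference, a runnable version of the fixed-width read shown on the slide might look roughly like this; the ./data/ path is an assumption about where the downloaded SEER file has been saved.

library(tidyverse)
# the SEER single-age file is fixed width: each column occupies a known number of characters
pop <- read_fwf(
  "./data/us.1990_2020.singleages.adjusted.txt",
  fwf_widths(
    widths    = c(4, 2, 2, 3, 2, 1, 1, 1, 2, 8),
    col_names = c("year", "state", "st_fips", "cnty_fips", "reg",
                  "race", "hisp", "sex", "age", "pop")))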
I'm not going to upload this data onto Github because it's massive, and it's also, you know, not mine; it's not responsible to put a 2-gigabyte file that I don't distribute onto Github. But you can download the file on your own from the URL in the slides [full URL: https://seer.cancer.gov/popdata/yr1990_2020.singleages/us.1990_2020.singleages.adjusted.txt.gz] and then just put that in your data folder. [reading chat] Yes, exactly, I wasn't aware of the particular format, but 80-column-wide data, there you are. So, now that we've got that in there, I actually can change this a bit; I usually keep this in a folder called projects, but it's not there, it's just in data right now. We can confirm the directory I'm in by typing getwd() in my console, and you can see that when I use a project in Rstudio, and I'm going to make this text a little bigger, please let me know if this text is too small for you to read, I can use the getwd() command to show me what my working directory is, where R has its root directory set right now, and it's set to NDACAN workshops, so I can use relative paths to tell it to go look within NDACAN workshops, in the data folder, for this population file. So let's read that. Okay, so we're just going to read this file in and see what it looks like. It's going to take a second, and I'm using the Tidyverse library here; if you have not installed that before, you're going to need to run the install.packages("tidyverse") command to get it. Tidyverse is a collection of packages that all share a similar data philosophy that I really like using, and I think it makes working in R a lot easier. Okay, so we're still processing this ginormous data file, but that's okay. This can be somewhat time-consuming to read in, because you saw it's a 1.8 gigabyte file, and then we're going to do some data processing on it as well. There we go. Okay, Tidyverse, thank you. Yeah, and to install Tidyverse if you haven't done it, you just need to run this line of code [ONSCREEN code typed: install.packages("tidyverse")]. Right there, I'll put that in the chat. I'm not going to run it because it takes a second and my version is fine. Okay, so now that we've got this in, we can look at the file. I like to use the head command [ONSCREEN code typed: head(pop)] to look at the structure of the data file, or we can use glimpse here [ONSCREEN code typed: glimpse(pop)], which lays the columns out sideways. And you can see those problems we noted earlier about having things as characters when they're in fact numeric; because we didn't specify variable types in our read, we have some issues there. But we actually only need a subset of these variables for our purposes: we only need year, state, sex, age, and population.
We don't need counties, we don't need much else, so we're going to just subset this using some Tidyverse functions. We're going to first [ONSCREEN code highlighted: filter(year==2019) %>%] subset it to only the year 2019, [ONSCREEN code highlighted: select(year, state, sex, age, pop) %>%] then we're going to select only the columns year, state, sex, age, and pop, then [ONSCREEN code highlighted: mutate(age = as.numeric(age), pop = as.numeric(pop))] we're going to convert age and population into numeric variables so they're treated as numbers rather than characters, and we're going to have this more compressed file called pop demo that gives us each county's population by sex and by age in 2019, and we're going to write that file out as pop demo. And we can see down here in our data that we've got pop demo [ONSCREEN: pop_demo.csv], and it's 37 megabytes instead of 1.7 gigabytes, which is far more tractable. Okay, so let's read in some NDACAN data. First I'm going to load in Tidyverse because that's how I like to read my data. Let's start with NCANDS, so we're going to read in the csv file that I've created for NCANDS [ONSCREEN code displayed from the file called read_ndacan_data.R]. And if you want to see how I made these files from the raw data, you can see that here [ONSCREEN code displayed from the file called make_sample_data.r]: if you have access to the NCANDS child file 2019 and the AFCARS foster care file 2019, you could use this script to replicate the files I've made here. But we don't need to worry about that right now. Okay, so let's read in the NCANDS data. [ONSCREEN code typed: ncands<-read_csv("./data/ncands_demo.csv")] I had some errors with my path; I need to tell it to go to the data folder. There we go. And that's what's in NCANDS: we have subyr, staterr, chage, okay. The other file we want to read in is AFCARS [ONSCREEN code typed: afcars<-read_csv("./data/afcars_demo.csv")]. Oh sorry, you might hear children screaming in the background. That's my eight-year-old. She is lovely but can be a bit dramatic sometimes. Okay, and then we're going to read in the population data [ONSCREEN code typed: pop<-read_csv("./data/pop_demo.csv")], and that data is pop_demo.csv. We need to make sure to give it the relative path into the data folder. There's nothing necessary about the way I've organized my project folder here, but I always like to keep a separate folder for my data files just to keep things a little bit tidy. Okay, so we can read. [Erin McCauley] Hey Frank, we have a question that came into the Q and A: what did fread do in the previous command? [Frank Edwards] fread, yeah, so let's go back; that only applied to the make sample data over here, and you can see I did fread, so I'm using a different function, from the data table package. Data table is a package in R that is specifically designed to work with very, very large files, so when we're in the kind of big data realm, data table can be a very good friend. Tidyverse is great for manipulating smaller tables, but for very large tables data table has a clear advantage, and fread stands for fast read; it's a function in data table that's able to parse large files like the NCANDS child file very quickly, much more quickly than the functions in Tidyverse.
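As a rough sketch of that approach, using a placeholder file name for a full NCANDS delivery (the small demo extracts used today don't need it):

library(data.table)
# fread() parses very large delimited files much faster than the base R or readr readers
ncands_full <- fread("./data/ncands_child_file_2019.tab")   # placeholder file name
# convert to a tibble if you want to keep using Tidyverse verbs afterwards
ncands_full <- tibble::as_tibble(ncands_full)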
And a lot of times when we're working with the NCANDS child file, for example, we might be working with 10 years of the data. Each of those files may be four gigabytes, so we're talking about potentially 40 gigabytes of data to parse, and using fread can really speed up the process of reading those data and then parsing them. So in that case I like to use fread, which comes from data table. Again, if you're going to be working in R with the NDACAN data, I do recommend getting familiar with the data table package, particularly if you're going to be working with multiple years of the data, because it can make your life a bit easier on the parsing side. Are we good there? Cool. Okay, so we're going to read in the demo files, and in R I can use the hashtag for comments; I'll load libraries, and I like to make a little header for my files, "this script joins NDACAN tables to SEER pop data," just so I can remind myself what's going on here. Okay, so from here we have our three files. Oftentimes I like to confirm that the script works as a whole by sourcing it; if I'm on a mac I can command shift enter or command shift return to source it, to run all of the code, or I can just click the source button, that will do it as well. That will run every line of code in sequence, just to make sure the whole thing works. And so now let's check out our tables with head. We've got NCANDS, we've got AFCARS, and we've got pop. Okay, cool, did I see a question? No. Okay, so those are our three files; now let's start with NCANDS. So let's look at NCANDS and pop. I'm going to use the names function to pull out the names of NCANDS and the names of pop. The first thing we're going to notice is subyr: the submission year in NCANDS doesn't actually line up perfectly with the fiscal year reported in AFCARS, but we're just going to pretend they're the same for our purposes today. If we wanted to be really precise with capturing calendar years, we could use the report date or the entry date in AFCARS and then extract the actual calendar year rather than the reporting year or the submission year, but we're just going to pretend that 2019 is 2019 across the board. So, for example, in AFCARS the reporting year is technically October to October, and so we will end up with some dates in the 2019 file from the end of 2018, and we won't actually end up with all of 2019 in it, because November and December will end up in the 2020 file, but we're going to pretend that doesn't matter for now. Of course, in real data analysis we would want to take that into consideration, but for our purposes today we're just going to smooth over that. So next we're going to harmonize the names in NCANDS and pop. When I'm thinking about working with data I like to keep names simple, and I actually wrote the names for the pop file; you can see that they're year, state, sex, age, and pop, and in the NCANDS file we have subyr, staterr, chage. Now, staterr is descriptive because we have Washington DC and Puerto Rico in the data and those are not technically states, so it's kind of more descriptively accurate. We have chage to distinguish it from, you know, the age of the adults or alleged perpetrators involved in a particular submission, but let's just harmonize these names real quick.
So the way we're going to do that is we're going to take NCANDS and overwrite it with a rename: we're going to rename subyr to year, we're going to rename staterr to state, and we're going to rename chage to age. In the rename, whatever goes on the left is the name you end up with and whatever's on the right is the target. Okay, so now when we look at the names of NCANDS we have year, state, age instead of subyr, staterr, and chage. Those match, so let's confirm that the values match now. Okay, so for ages we see 0 1 2 3 4 5 in one file and 1 0 11 4 13 10 in the other, that's great. 2019 matches, and state is our two-letter postal abbreviation, so that matches. So we're looking good there. Now let's look at the unique values that occur in NCANDS age to see if there's anything strange happening, and I think there's one strange thing that happens here, a particular code for age that we need to adjust for. So unique will pull out all the unique values of age. You can see 77 and 99; these are two that we need to work on. 99 means missing, and 77 actually means prenatal, it means the child is not yet born, so this would be a report about a pregnant person. Yeah, I see a hand raised, go ahead. Maybe throw a question in the box or unmute yourself. No? [Erin McCauley] I don't think participants can unmute themselves, but if you put it in the chat or in the Q and A I can read it aloud. [Frank Edwards] Okay, sounds good, I'm going to keep working on the recode while we do that. So the first thing I'm going to do is the recode, so I'll put this in here, unique NCANDS age [ONSCREEN code typed: unique(ncands$age)], and we're going to note that 77 and 99 have special meaning. We're going to recode 77 into zero, so we're going to treat those reports about a pregnant person as involving an infant, and we're going to recode 99 as NA. NA has special meaning in R, it means missing. So we'll take NCANDS again, we'll overwrite it, we're going to mutate, and we're going to make age equal to case_when. I like case_when as a function for doing recodes, so we will say: when age is equal to 77, I want you to output a value of zero; when age is equal to 99, I want you to output a value of NA; and when age is less than 77, I just want you to leave it alone as age. Okay, so after we run that. Why is that a problem? Yeah, maybe we'll just do it with ifelse; I could try and debug this, but we could also just do a nested ifelse, which will work a little more certainly. Okay, so ifelse: if age equals 77, convert it to 0; then we'll do a nested ifelse: if age equals 99, NA, and if that's not true then just return age. That ought to work. One more parenthesis needed, okay, there we go. There we go, so now our unique values exclude 77 and 99, and we have an NA in there, so that's useful. This will join successfully because the unique values for age in the population data are as follows: that's what we have in pop for unique ages, and it's top coded at 85, but we actually don't care about that because we're only going to go up to 17. Now, what else do we notice about this so far? NCANDS, the way we have it structured, is actually still at the individual child level, and pop is at the county level. So we need to convert those into identical levels: we need to collapse NCANDS to state, collapse pop to state. We'll call our level of analysis state-year.
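A sketch of that rename and the follow-up checks, roughly as described (new name on the left, old NCANDS name on the right):

ncands <- ncands %>%
  rename(year  = subyr,     # new name = old name
         state = staterr,
         age   = chage)
# confirm the harmonized names and inspect the values of the age key
names(ncands)
unique(ncands$age)          # note the special codes 77 (prenatal) and 99 (missing) discussed above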
Because we're going to have a single year at the state level. So how do we do that? We'll call this NCANDS_st, we'll take NCANDS, and we're going to use a function called group_by within Tidyverse; for anyone who's used sql in the past, this is really similar syntax to what you would see in sql. So we're going to group by state and we're going to group by age, two grouping variables, and we're going to summarize. Actually, you know what, we're just going to group by all three, because we want to preserve year. If we were working with multiple years of the child file, we might want to preserve year as a grouping so that we can get annual counts. It's not strictly necessary here because we only have one year of the data, but it doesn't hurt to do it. So we're going to summarize, and we're going to call this investigations, and I'm going to use the function n. What that's going to do is count the number of rows by year, state, and age that occur within the data. So let's look at it now. Now we can see in Alaska in 2019 we had 672 investigations that involved a child who was less than one year old or a pregnant person. In 2019 in Alaska we have 1,279 investigations involving a one-year-old subject, and so on. So we've now collapsed this down to a state-level table, by age, of the number of child subjects of investigations, which we could maybe call child-investigation pairs; what exactly this measures is a little complicated. And we also know that our population data is at the county level, so we want to convert that as well. We're going to take pop, we'll call it pop state now, and we're going to group by year, state, age. Here we don't want to group by sex; we don't have sex data over in our NCANDS file, so there's no point in preserving that distinction in our state population file. The other thing we can go ahead and do is drop ages that we're not interested in, we can drop non-children, so I'm going to say filter for age less than 18. All rows recording population data for ages greater than or equal to 18 will be dropped. We're going to summarize, then pop is equal to the sum of those county population totals. Okay, and here we can see the results of that, so we have in Alaska 9,800 infants in 2019, 10,000 one-year-olds, and so on. So this looks really close to what we have up here from our NCANDS st file, which is exactly what we are after, so that's good. The next step is to join them. [reading from the Q and A] Yes, it is safe to do less than 18; 77 and 99 have been converted. Now, strictly speaking, our counts here are going to be underestimates because those 99s we haven't dealt with. Let's look at how many 99s we have. So let's just do a table in the original NCANDS where age is equal to 99, or I'm sorry, we've already converted it to NA, so we'll do is.na(ncands$age). Okay, so we have 17,000 rows in this file that are missing data on age. Now, in academic analysis, or any analysis where we're really interested in getting these counts right, we would need to think about appropriate strategies to handle missing data. That's a bit beyond the scope today, and it's a relatively small proportion of the total size of the file; we're talking about 0.4 percent of the sample missing data on age. But that may be systematically related to geography in a way that might lead us to biased conclusions if we don't deal with it.
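For reference, a quick way to quantify that missingness, along the lines of the check just described:

# count and share of NCANDS rows missing age after the recode
sum(is.na(ncands$age))    # about 17,000 rows in this extract, per the session
mean(is.na(ncands$age))   # roughly 0.004, i.e. about 0.4 percent of rows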
I'm not going to worry about it today, but for a full analysis you may want to think about those missing data a little more carefully, because effectively what we're doing today is listwise deletion, which is a strategy I never recommend for serious analysis, but this is a workshop, so we're going to proceed as if it doesn't matter. Okay, so let's call this NCANDS pop, and what we're going to do is use a join function; we're going to use a function called left_join, but there are a lot of different kinds of joins we could use. I like to just keep it easy and always use the same type. What we're going to do is take an object on the left and join it to an object on the right, so our left-hand object will be NCANDS_st. And I keep using this operator here, I apologize I haven't explained that [ONSCREEN code highlighted: %>%]. In Tidyverse vernacular that's called a pipe, and it basically allows us to string together multiple commands rather than writing out a new line for each command. You can generate it in Rstudio with command shift m if you're on a mac, or ctrl shift m if you're on a pc. It allows us to give the object, and whatever comes after the pipe is the command we're going to use to modify that object. So from here we're going to left join: we're going to left join pop state onto NCANDS state. [ONSCREEN code: ncands_pop<-ncands_st %>% left_join(pop_st)] Okay, and let's just look at it one more time before we do the join to make sure that it should look right. Right, so we should see something like 17 rows for each state because we have 17 ages, and let's just confirm that we have that. Oh wait, I did it by age, so we have 51, right, or rather 18: we have 0 through 17, so we have 18 ages for each state, so that looks right. And then the same should be true in NCANDS. Okay, we have more than that, oh that's right, because we have some ages beyond 18 in the NCANDS, and that xx you see here is when the state has been de-identified; those are usually cases where there's a child fatality involved, so we're not going to have information on those cases. But yeah, remember in NCANDS we go zero through 23; those are cases we're not going to consider, we're only going to look at people under 18 years old. Okay, let's go ahead and execute this join and see what happens. You can see that it said it joined by year, state, and age. Now, we can manually specify those key columns, but if we've done our work in the way I like to do it, that is, harmonize the names, harmonize the values, harmonize the levels, then R will automatically detect those key columns for us; that is, columns that have the same name and share values will work as keys automatically. And let's look at what we have now. We should have a file that has all the variables from NCANDS and all the variables from pop, and that's exactly what we have. So we have a table now that has added the population column onto our columns from NCANDS. Super cool. Okay, now let's do AFCARS. Let's remind ourselves about AFCARS real quick; this will go a little faster because we already know what we're doing now. So we need to first rename these things. Let's take AFCARS and modify it by renaming year. Now, here we have two state variables; there's a capital-S STATE, and R is case sensitive, so that won't get treated the same as lowercase state.
But the variable we're after for the join is St, so we're going to rename St to state, and we're going to rename ageatstart to age. Okay, so first we'll get that rename looking good. There we have minus ones, that is, the child at the start of the reporting period had not yet been born, so we're going to recode those as zero again. Let's look at the uniques on age to see what we're working with [ONSCREEN code typed: unique(afcars$age)]. Okay, we don't have any missings here, I don't believe, but we do have minus one, so we need to deal with minus one. So, you know what, I'm going to get fancy and show you why I love Tidyverse so much: I can go ahead and just do my recode in the same command, so this is all still kind of a single command from line 49, we can just pipe on a new command. So we're going to mutate: age is equal to ifelse, age less than zero, zero, otherwise leave it alone as age, and then we're also going to select, and I'm going to drop the capital-S STATE column using the select function. And, oh, there is 99 in the AFCARS data as well, yeah, I don't see it here, oh, there it is, thank you. Thank you, helpful commenter, my eyes missed it. [ONSCREEN code displayed: afcars<-afcars %>% rename(year = FY, state = St, age = ageatstart) %>% mutate(age = ifelse(age<0, 0, age), age = ifelse(age==99, NA, age)) %>% select(-STATE)] So we'll make another ifelse command to deal with that, and that can go in mutate, and then we'll select, and you can use the minus in select to drop a column, so we'll select everything but STATE. All right, we've already done this, so I actually need to source the file; we've got some problems because I've already modified the AFCARS table in my session, but if I re-source it and run it from the top it'll re-run everything, and there we go, that's what we're after. So when you're running into errors in your code, I recommend just command shift enter, command shift return, re-run the whole script, start from scratch. Okay, so our table there looks pretty good; however, it's still at the individual level, it's not at the state level, so we're going to collapse to the state level. We'll call it AFCARS_st: it's AFCARS, we're going to group by year, state, age, and we're going to summarize, we'll call it fc is equal to n, that is, it'll count the number of rows for each level that we've grouped by, year, state, and age. So let's see what afcars_st looks like [ONSCREEN code displayed: afcars_st<-afcars %>% group_by(year, state, age) %>% summarize(fc = n())]. Great, that's exactly the data structure we're after; now we can join that. Now join to NCANDS pop. So we'll call it NCANDS_AFCARS_pop, that's a little long, I usually don't like to make my object names that long, but what are you going to do. We're going to join our AFCARS state table onto that [ONSCREEN code typed: ncands_afcars_pop<-ncands_pop %>% left_join(afcars_st)]. Now let's see what we've got. Now we've got a table that includes the year, the state, the age, the number of investigations in that state, the child population in that state, and the number of children in foster care at some point during the reporting year. And AFCARS will have unique children, so each child gets a single row; we can't have duplicate children in the AFCARS foster care file within a year, across years we can have duplicates, but within years we won't. So now let's use this to compute per capita rates. And that's going to be pretty straightforward now that we've got everything done.
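Pulling those steps together, a condensed sketch of the collapses and joins as described, with the key columns written out explicitly rather than left to automatic detection (the investigation count is named child_investigation here to match the column used in the rate code below):

# state-by-age counts of investigations and of children in foster care
ncands_st <- ncands %>%
  group_by(year, state, age) %>%
  summarize(child_investigation = n())

pop_st <- pop %>%
  filter(age < 18) %>%
  group_by(year, state, age) %>%
  summarize(pop = sum(pop))

afcars_st <- afcars %>%
  group_by(year, state, age) %>%
  summarize(fc = n())

# join everything on the shared year, state, and age keys
ncands_afcars_pop <- ncands_st %>%
  left_join(pop_st,    by = c("year", "state", "age")) %>%
  left_join(afcars_st, by = c("year", "state", "age"))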
The scope of exposure to child welfare tends to lend itself nicely to representation as rates per thousand, so we end up with single- or double-digit numbers; we're going to use a thousand children as our denominator. We're going to mutate, and we'll call it inv_pc, or actually we'll be really descriptive: investigation_rate is equal to child investigations divided by pop, and we're going to multiply that by one thousand to convert it into a rate per thousand; fc_rate will be fc divided by pop times one thousand [ONSCREEN code typed: ncands_afcars_pop<-ncands_afcars_pop %>% mutate(investigation_rate = child_investigation / pop * 1000, fc_rate = fc / pop * 1000)]. Okay, now I'm going to use glimpse [ONSCREEN code typed: glimpse(ncands_afcars_pop)], which will show us everything vertically rather than horizontally, and you can see that we've now got investigation rates and foster care rates, so these are population-adjusted, age-specific foster care and investigation rates. [ONSCREEN output Rows: 1,148, Columns: 8, Groups: year, state 53, rows of data for year, state, age, child_investigation, pop, fc, investigation_rate, fc_rate] Now, Erin, are we at time, or do I have a second to just do a quick visual with this? [Erin McCauley] Yeah, you do, we have eight minutes left, but we've been taking questions throughout so we can get a little closer to the back end. [Frank Edwards] Okay, cool, this is the last thing I'm going to do, just to give you a quick visualization, and we've already loaded in ggplot2 as part of Tidyverse, so we'll ggplot NCANDS AFCARS pop and we'll do aes. If you haven't seen ggplot before, I'm not going to explain it in too much detail given time, but I just want to show you the kind of thing that we can do with this. [ONSCREEN code typed: ggplot(ncands_afcars_pop, aes(x = age, y = investigation_rate)) + geom_line() + facet_wrap(~state)] All right, x is going to be age here and y is going to be the investigation rate, we're going to add a geom_line, and then we're going to facet_wrap by state, and we can get a nice visual that [ONSCREEN output: basic line plots for each state showing the trend of lower investigation rates as child age increases] looks something like that. So then we can look at age-specific investigation rates across all of the states in the NCANDS data. This could give us a sense that in most states it's very young children who get referred at the highest rates, but it also gives us a way to compare across states; for example, investigation rates in Alaska are substantially higher than they are in Kansas for infants. So that's where we'll close it for now with R, and yeah, time for questions. [Erin McCauley] If anyone has questions, just put them in that Q and A box and we will answer them. [Frank Edwards] While I'm waiting for questions to come in, I'll go ahead and do foster care too, why not? You want to tell me what that question is, Erin? [Erin McCauley] Yeah, the question says: could you please go back to about line 49 where you converted the 99 values to NA? I was unclear how using is.na worked when the value was 99. [Frank Edwards] Oh, you're correct, you caught an error, thank you. Let's go, Holly, you caught an error there, well done. So yeah, that was an error, thank you Holly.
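For anyone revisiting this later, a corrected version of that NCANDS age recode, testing explicitly against 99 rather than using is.na(), would look something like this (a reconstruction of the fix, not the exact line from the session):

ncands <- ncands %>%
  mutate(age = ifelse(age == 77, 0, age),    # 77 = prenatal, treat as an infant
         age = ifelse(age == 99, NA, age))   # 99 = missing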
Clearly I actually do need much more practice; I mean, yeah, every time you write code you will make mistakes, this is just how the process goes. And we're going to see 99s on this plot, my axis is going to be all messed up, well, maybe not, but it could theoretically have been really messed up, oh, it's because 99 doesn't occur in the population file, so it doesn't get a join, but nonetheless, this is what it looks like for foster care. I'm going to show you one other cool thing while I wait for more questions, because I think this is a really nice visualization tool if we're doing state-level work. Thank you, thank you, other Holly. Yeah, so this geofacet package is super duper cool. It allows us to create a visual that is similar to what I just made, but it actually lays out the states nicely, so we could do facet_geo by state here, I believe, oh right, XX and PR, but those are getting removed, so then, let's see, now we can actually lay these out as 50 states and look at foster care rates by state that way. This is nicer than just an alphabetical layout if we want to look at the age-specific rates across states. We could also see that the axis goes up past 17, so we could add a scale_x_continuous or whatever [ONSCREEN code: library(geofacet) ggplot(ncands_afcars_pop, aes(x = age, y = fc_rate)) + geom_line() + facet_geo(~state)]; we could certainly do some things to clean this plot up, but we'll leave it alone for now. Any other questions on the kind of basic philosophy of the join? I guess the other thing I can do, to ensure that you have access to the code that we've made today, is go ahead and update my Git repository here, so I'm going to git add the read NDACAN data file, it would help if I, nope, don't want the .Rproj file or whatever. [Erin McCauley] Hey Frank, we have a new question that says: how did you get the ggplot graphs to position the states like in the U.S. map? [Frank Edwards] Yeah, so that's what I just showed you, it's using the geofacet library, so instead of using facet_wrap by state we can use facet_geo by state, and it understands the state labels in such a way that we can lay them out like this; it's a really quick and easy trick to make much more interesting and legible plots, in my opinion. So I'm going to go ahead and push these changes, post-workshop changes, so in just a moment that will upload online, and if you go to my Github repo, the code from read_ndacan_data.R will be available for you; you can download this whole script. But yeah, I love the geofacet library, I think it gives us a lot of really cool options for visualizing geographic data without a tremendous amount of additional coding. [Erin McCauley] We've got one last question from Holly: is using your server better than gitkraken? [Frank Edwards] I actually don't use gitkraken, so I'm not sure; this is just, you know, a hosted repo on Github, so I'm not running this off of a server, this is just for ease of sharing the R script. So yeah, you can follow up with me, Holly, if you want to sell me on gitkraken, I am not familiar with it. You can see in just a moment we should have the file available for you all there; I don't know why my push is taking a moment, but certainly within the next hour, if it doesn't go through immediately, I will make sure that that file is uploaded to the repository. Are we?
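A sketch of that cleaned-up geofacet plot, with the age axis capped at 17 as suggested (the exact scale call is an assumption about the cleanup being described):

library(geofacet)
ggplot(ncands_afcars_pop, aes(x = age, y = fc_rate)) +
  geom_line() +
  scale_x_continuous(limits = c(0, 17)) +   # drop the stray ages above 17
  facet_geo(~ state)                        # lay the panels out in a US-map arrangement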
[Frank Edwards] We are at time. So big thank you to Frank for leading this workshop and to everyone for being here today. If you join us next week Sarah Sernaker will be back doing a workshop on using structural equation modeling. We'll be talking about the theory of the method and then an application using NDACAN data so thank you again to Frank and I hope to see you all next week. [Frank Edwards] All right bye everybody. [Erin McCauley] Bye. [voiceover, then musical cue] The National Data Archive on Child Abuse and Neglect is a collaboration between Cornell University and Duke University. Funding for NDACAN is provided by the Children's Bureau, an Office of the Administration for Children and Families.