All right everyone, it is noon so we can get started. First of all I'd like to thank everyone for being here. This is the summer training series, hosted by myself, Erin McCauley. I'm a graduate associate at the National Data Archive on Child Abuse and Neglect, which is housed in the Bronfenbrenner Center for Translational Research at Cornell University. We have speakers from both within our archive and at various stages of the data collection process. Today we're really lucky to have Frank Edwards here with us. He is a statistician in the archive and is also an assistant professor at Rutgers. So far we have had an introduction to the data archive, and then gone over our three administrative data sets. This week we're talking about strategies for data management, which is particularly important when using these large administrative data sets. Next week Michael is going to come back to talk to us about how to link these three data sets, and then Frank will come back for the concluding session to tell us about a project he has done using some of the data from our administrative cluster. Today we're talking about data management because secondary data analysis requires a really high level of data management skills, such as reshaping and collapsing data. This is particularly true because these data sets are just so large. When you collect your own data you might be looking at a fairly narrow area, but the benefit of these data sets is that they're huge and they cover a large group of people. The NCANDS data set is particularly large, and running code on it can be time- and computing-power intensive, so we'll have some strategies on how to save time and computing power. This training should help you increase your data management skills and ease some of the difficulties of conducting secondary analysis with large administrative data sets. We know many of our users haven't necessarily been trained in how to use large administrative data sets, but more on their own data collection side, so we try to help boost the skills of our data users. I know for myself, I learned about data management early in my research career, so getting a refresher on those strategies, especially as they apply to these data sets, can be particularly helpful. So I'm now going to pass control over to Frank, who's going to take it away from here. Thanks, Frank. Yeah, my pleasure, and thanks everyone for attending today, I'm excited to see where the conversation goes. As Erin mentioned, I am a research associate with the data archive, a former postdoc with the data archive, and currently a professor at Rutgers. Most of my work with the data archive uses the administrative data sets, primarily the AFCARS and NCANDS. Today's demos all use a single year of the AFCARS data, but the techniques I'm showing you would easily transfer to the NCANDS or NYTD data as well. And the goal is to think about not just using a single year of the AFCARS data, like I'm going to show you here, but potentially using multiple years of the data, even a 15- or 16-year panel, where we're talking about data sets that can get as large as 10 gigabytes or larger. So these kinds of strategies are really useful when we're dealing with what we might think of as not necessarily big data in the way that term gets thrown around a lot, but quite large administrative data sets.
Okay, so today I'm going to show you how to use the R statistical programming language, and commands that are derived from the SQL database language, to aggregate data. We're going to use R, which is open source and free, so for those of you who are not experienced in R and would like to take that dive, if this is your first exposure to it, please feel free to shoot me an email at Frank.Edwards at Rutgers and I'm happy to point you to some introductory resources that could get you started in the language. I'm assuming here some level of familiarity with programming, but I'm happy to point you in the direction of resources that are a little more introductory. So we're going to use R and the Tidyverse packages, which rely on a language that's similar to the SQL database programming language, to aggregate data to geographic or time units of analysis, and to reshape data from wide to long and from long to wide. If you've never heard those terms before, don't worry, I'll show you what wide and long data look like. We'll also cover how to join data, appending variables or observations to existing data, and how to draw random samples. Random sampling can be a technique to really improve the efficiency of your coding because, as I'm sure many of you are aware, running a script on a 10 gigabyte file can take hours and in some cases days. So sampling can be a technique you use to really improve the efficiency of your development: as you are experimenting with statistical models or various data transformations, you can apply them to a subset of the data first, while you're in development, and then only run the larger operation a single time. For those of you who have never worked with R or have not installed it, R is free and open source, and I think that's the key advantage of using it: you don't have to pay an annual thousand-dollar license to use it. There's also a great integrated development environment, the RStudio IDE, which is also free and open source. So I've given you websites that you can download the software from. The Tidyverse packages extend base R to provide a really nice suite of tools that help with common data management challenges, and they provide more legible, more readable code where the commands you are using are very explicit. I think increasingly R users are heading in a direction where those are the packages we use to do most of our work. These techniques are again adapted from principles that, for those of you who have ever used SQL to manage databases, will make a lot of the syntax you're going to see somewhat familiar. One nice thing about learning to do data management this way is that it sets you up so that if you do use SQL or interact with large databases directly, rather than working with the kinds of flat files you get from the archive, some of the skills you are building here will be transferable. So again, I'm using the AFCARS 2017 child file and have removed identifying variables. If you have access to any of the AFCARS child files, the foster care files, the code I'm using here today should execute on your machine without problem. So you could theoretically copy and paste the code from the slides and run it on your machine. So to get started, I'm just going to be demoing code here, to couple the code and the concepts together.
I'm not going to live-run the code, but I'm showing you all the code and output you would need to execute the kinds of reshapes and transforms I'm going to be talking about. I know this is technical, so please, at any point, feel free to jump in the chat and throw a question out. And without further ado, I'll dive in. In R we load external packages with the library command, so you would need to run install.packages for Tidyverse if you haven't installed it previously; that's a kind of intro R thing, and again I'm happy to point you to resources if useful. NDACAN delivers the AFCARS files in the proprietary formats for SAS, SPSS, and Stata, but we also deliver a tabular file that is a delimited file: a text document where a comma or a tab separates each field, so it's effectively a spreadsheet, but it's just a text document. Here we use the tsv, tab-separated, format, which you can see in "FC2017v2.tab". Since it's a tab-delimited data file, I'm going to use the "read_tsv" function to pull that data into R. Now, some of you have probably run into the problem when working with NDACAN data that the AFCARS files are about 500 MB annually, right? But if we are using, say, 15 years of the data simultaneously, then we're getting to the point where we are working with close to 10 gigs of data. That can chew up the RAM on your laptop or desktop relatively quickly, so, also for data security reasons, I tend to do nearly all of my work on a remote server hosted by my university. I strongly recommend that you find some sort of remote computing environment, both for the security of the data and for the efficiency of some of these large operations. If you're working with a single year of the data, you can usually get away with a laptop or desktop, but some of these operations can quickly chew up 40 or 50 GB of RAM, which is far beyond what most personal computers can handle. Okay, so once we've loaded the data, it's a really good idea in R to go ahead and pare it down to the things you're going to use. So in this case we're going to focus on a few variables. You'll notice I have this percent-greater-than-percent sign after "afcars_17". What I've done is, in the first command, "read_tsv", we've created an object called "afcars_17", which in this case is a data frame object, R's kind of formatted matrix object. Now I'm going to modify "afcars_17" with this %>%, which is what's called a pipe in the Tidyverse vernacular. The pipe links a command to an object. So in this case I'm going to take this "afcars_17" object and select a few columns, or variables: the state variable, the sex variable, "AgeAtEnd", which is how old the child was at the end of the reporting year, "InAtEnd", which is our measure for whether the child was in foster care at the end of the reporting year, and "Entered", which is whether the child entered foster care during the reporting year. And because there are some missing values for sex in the data, I'm going to remove those with a not-"is.na" SEX filter command. So what the Tidyverse system lets me do is link together multiple data manipulation commands.
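A minimal sketch of the pipeline described here, assuming the column names from the AFCARS child file codebook as given in the talk (St, SEX, AgeAtEnd, InAtEnd, Entered) and that the tab file sits in the working directory:

```r
# load the tidyverse suite (run install.packages("tidyverse") once beforehand)
library(tidyverse)

# read the tab-delimited AFCARS child file into a data frame
afcars_17 <- read_tsv("FC2017v2.tab")

# keep only the five variables used in the demo, then drop rows missing sex
afcars_17 <- afcars_17 %>%
  select(St, SEX, AgeAtEnd, InAtEnd, Entered) %>%
  filter(!is.na(SEX))

# peek at the first six rows
head(afcars_17)
```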
So first I'm going to take the "afcars_17" object, select only a few variables, in this case five, and then remove all the rows where sex is missing. Right? We want a complete data set with those five columns and no missing values. Okay, so "afcars_17" is now a subset of the full data with only five variables, and fewer rows than were in the original, having dropped the rows that were missing that variable. Okay. Let's look at what that data looks like. So after I've run that command, this is what we have. Often when I'm running code in R I'll type the "head" function, h-e-a-d, just to see what my data object looks like. And this is what my data object looks like right now: the first six rows of the data. So we have data by state, by sex; right now these are individual-level observations, right? So we have one child in Alabama with sex equal to one, and I believe sex equal to one in AFCARS is male, who is five years old at the end of the reporting period, who was in care at the end of the reporting period, and who entered during the reporting period. The second row is another individual child, the third row is another individual child: the third row is an infant, the fourth row is a four-year-old, the fifth row is a one-year-old. All of these children were in care at the end of the reporting period, and all of them entered care during the reporting period. And we have four boys and two girls. Right? So what are the natural groupings we might have in this data? What we're going to work on here is thinking about how we can move from this individual level to, for example, measures that might help us understand differences across children of different sexes in the foster care system, or children of different ages, or children in different states. We can group at each of those levels; we might think of these as categorical variables in the data. How can we group according to those categorical variables to gain insights about differences between units, how can we aggregate? So let's start. This is something I do really frequently in my work: I will routinely look at, for example, observations at the county and year level. If we had multiple years of data here, we could build out a county-by-year data set, where we could look at time series changes in what's going on in foster care populations across counties or across states. So let's see how we would do that. First, across the entire data set, let's not group by anything; let's just ask what the mean values are for each of the variables we have in the data. Here we'll create an object called table1, and again we'll start with the "afcars_17" object we've created, and we're going to pipe it to the "summarise" command. What "summarise" does is perform some functional operation across the entire object you've given it, and later we'll see how we can couple "summarise" with "group_by" to perform those summary operations across subsets of the data. So here I'm going to create a variable called "MeanAge", which is equal to the mean of "AgeAtEnd", and "PctMale", which is going to be a sum where I sum up all of the times that sex is equal to one. So basically I'm going to count the number of ones and divide by the number of rows in the data.
I should note that "na.rm" is superfluous because I've already filtered out the missing values, so that's redundant code on my part. "TotalEntries" is a sum of "Entered" and "Caseload" is a sum of "InAtEnd", right? So here we have the "MeanAge" in the 2017 AFCARS, a little over eight years old. 52% of the AFCARS 2017 data, the foster care population in AFCARS, not a sample, is male. In all we had about 270,000 entries recorded in the data and 442,000 children who were in care at the end of the reporting year. The reporting year differs across states sometimes, but I think it's usually September 30; someone from the archive correct me if I'm wrong on that, but that's the fiscal year reporting date. So that's the single point in time: entries of 270,000 nationally, and a caseload of 442,000 nationally. Okay. So that's a summary across the entire data table, right? We could also call out the mean for individual variables using other syntax, but this is a nice way to create a new summary object. And this gets more interesting when we group by something. So here, let's group by sex. Let's see if there are differences in the composition of the number of kids in foster care by sex and by age. Maybe we have some hypothesis that girls are likely to enter foster care at a younger age on average, or something like that. Let's group by sex. Here we can see that boys have a slightly lower average age in the foster care system than girls do nationally, and we have more boys entering and more boys on the caseload than we have girls, right? Which we also knew earlier from our percent male variable, but this breaks it down a little further. Okay, so that's not especially helpful on its own, but it might be nice for a descriptive table for your paper. Here's where we can get a little more powerful in terms of generating a time series data set or a 50-state glance at things. Here we can group by St, which is our two-letter state abbreviation, and then calculate the mean age of children in the data by state, the total entries by state, and the caseload by state. So this is getting a little more useful, right? We can also group by state and sex. Now we'll have two rows in the table for each state, one for boys, one for girls, and it'll show us each of those counts broken out by state and sex. If we had year in the data we could easily do that as well. If we had any other grouping variables we wanted to pay attention to, there's no limit to how many variables we can group by in these kinds of summary commands. We can string together as many different categorical variables as we like. I don't recommend grouping by continuous variables generally, with the exception of something like year, because if we're talking about, say, age in days, then we might be grouping by potentially over 1,000 values for age, right? The more values there are in the variable, the more grouping loses its utility. So grouping is most useful when we have categorical variables or a relatively small number of values in a continuous measure, like years. Otherwise you might want to think about, for example, looking at children under five and over five, or some kind of grouping like that, rather than using raw age on its own.
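A sketch of the summary and grouping steps just described, reusing the afcars_17 subset from above (the variable names MeanAge, PctMale, TotalEntries, and Caseload follow the slide output named in the talk):

```r
# ungrouped summary across the whole table
table1 <- afcars_17 %>%
  summarise(MeanAge = mean(AgeAtEnd),
            PctMale = sum(SEX == 1) / n(),
            TotalEntries = sum(Entered),
            Caseload = sum(InAtEnd))

# the same summary broken out by sex
afcars_17 %>%
  group_by(SEX) %>%
  summarise(MeanAge = mean(AgeAtEnd),
            TotalEntries = sum(Entered),
            Caseload = sum(InAtEnd))

# and by state and sex: two rows per state, one for each sex
afcars_17 %>%
  group_by(St, SEX) %>%
  summarise(MeanAge = mean(AgeAtEnd),
            TotalEntries = sum(Entered),
            Caseload = sum(InAtEnd))
```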
Okay, so these are long data, right? When I say long: the default in this kind of framework, which works with a philosophy called tidy data, which you don't need to worry too much about right now but can read plenty about if you like, is that the Tidyverse likes to have its data long. What that means is that every single variable is in a column. So we don't do things like have caseload2012 as one column, caseload2013 as another, and caseload2014 as another column. Instead, each year gets its own observation in the data. Or instead of splitting mean age by sex, where we could imagine combining sex and mean age together to have mean age for boys and mean age for girls as two separate columns, in the long approach we're going to have sex as its own column. So this is long data, and what that means is we're going to have fewer columns but more rows. Our data is going to have as many rows as we have combinations of the St and sex variables, so in this case we'll have about 100 rows in this table, because we have a single year of data: 50 states by the two values of sex gives us two observations per state. But what if we do want the data wide, right? Let's say we want a table that has just mean age for boys and mean age for girls by state. We can use the spread command to do just that, where we specify a key, the variable column whose values we want to spread across new columns, and the value we want to transpose. We can think of this as a bit of a pivot: we take our categorical key column and our numeric value column and rotate them. But in the process we're dropping a couple of columns, because we aren't applying that same operation to entries and caseloads. We could use a series of joins to pull those all together into a single table that would be especially wide, and we could rename so that we have the mean age of boys in Alaska, the mean age of girls in Alaska, Alabama mean age of boys and girls, etc., right? So we turned these three columns, dropping two others, into three new ones, but we removed the sex column entirely. And if we had more than two values in sex, this would now be turning three columns into four. Okay, so how do we do that for the full data set? We're going to do this individually now: we'll take table 1, which was our summary table, and get a wide age table, a wide caseload table, and a wide entries table, right? The syntax is going to be similar in each case. I'm using this sep = argument to place text in the name of the variable so I know which variable it is. And you can see I'm going to use the names function, so instead of displaying what those all look like, if we ran that code we would see that the names of wide_age are St, SEXage1, SEXage2, and wide_caseload would have SEXcaseload1, SEXcaseload2. These are not the best names, but they do indicate which variable they are and which sex they are. So that's how we could do it, but now maybe we want to squish these together, right? Maybe we don't want these to all be separate objects; maybe we just want one object with a single state column, which is going to be identical across all of them, but with six value columns in the resulting table, right?
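A sketch of the three wide tables described here, assuming the state-by-sex summaries above; the sep = argument in spread() builds column names as key name, separator text, key value, which reproduces the SEXage1, SEXcaseload1 style names mentioned in the talk:

```r
# one row per state and sex, then pivot sex across the columns
wide_age <- afcars_17 %>%
  group_by(St, SEX) %>%
  summarise(age = mean(AgeAtEnd)) %>%
  ungroup() %>%
  spread(key = SEX, value = age, sep = "age")        # St, SEXage1, SEXage2

wide_caseload <- afcars_17 %>%
  group_by(St, SEX) %>%
  summarise(caseload = sum(InAtEnd)) %>%
  ungroup() %>%
  spread(key = SEX, value = caseload, sep = "caseload")

wide_entries <- afcars_17 %>%
  group_by(St, SEX) %>%
  summarise(entries = sum(Entered)) %>%
  ungroup() %>%
  spread(key = SEX, value = entries, sep = "entries")
```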
So we'll have one row for each state with the mean age for boys and girls, the total caseload for boys and girls, and the total entries for boys and girls, each as separate columns, right? We can use joins to do that, with St as our key column. When we think about these data frames, the state column across each of them is going to be identical, but the value columns are all going to differ. We want to stick these all together, but we want to make sure that Alabama joins to Alabama and California joins to California. We can use this state key column to do that, and R is going to understand that it should do so by default, by looking for columns that have identical names across the individual tables. For those of you who are users of Stata or SPSS, this is a key difference in R: we can have as many tables as we want in our working environment and manipulate and join them together, rather than working from a single loaded table. We can create as many tables as we wish, and then manipulate those to get whatever object we want to do our analysis on. Okay, so I'm going to use the left_join command. Now this is where, for those of you who are close to SQL, things may start to look really familiar. I'm going to call my new object wide_merge and I'm going to use a left_join; I'll explain what I mean by left in a moment. I'm going to join the objects wide_age and wide_caseload. Remember, wide_age has the columns St, SEXage1, SEXage2, and wide_caseload has St, SEXcaseload1, SEXcaseload2. So when we merge these together we should expect an object that has the columns St, SEXage1, SEXage2, SEXcaseload1, SEXcaseload2, but it's not going to duplicate St; it's going to use St to join the two. The left join takes the first object, calls that the left-hand object, joins it with the right-hand object, and preserves all of the rows in the left-hand object. That means it doesn't drop any observations from the first argument you give it; it retains all of those. You could think of that as having your master table and joining something else onto it. There are other kinds of joins: there are right joins, where the second argument serves as our sort of master, and there's a full join where we say keep everything. But in this context a left join is never going to have fewer rows than are in the left object. You could have elements dropped, though: for example, let's say we had an observation from Puerto Rico in the wide_caseload table that was not in the wide_age table. If there is no PR in the St column of wide_age, it's going to get dropped in the join, because a left join prioritizes keeping the observations from the first argument and drops those that appear only in the second argument. Again, we can use full joins to keep everything, and right joins to keep only those in the right-hand side. In R you can always, at the console, type a question mark and a function name to get the full help file; in this case I believe there are six different kinds of joins built into the package we're using today: left, right, full, anti, semi, and inner.
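A minimal sketch of the nested joins, using the wide tables built above; with no explicit by = argument, left_join matches on the shared column name, St:

```r
# left_join keeps every row of its first (left-hand) argument
wide_merge <- left_join(wide_age, wide_caseload)

# nest a second join to add entries as well
wide_merge <- left_join(wide_merge, wide_entries)

names(wide_merge)
#> "St" "SEXage1" "SEXage2" "SEXcaseload1" "SEXcaseload2"
#> "SEXentries1" "SEXentries2"
```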
But generally speaking, left is the one that I use most frequently. There are applications for the others, but left is the one I think you'll use most frequently. So here I want you to notice that we're nesting our joins, right? First we create this wide_merge object and join age to caseload; that gives us our initial five-column table. But we also want entries on there, so we take the merge object we defined and add entries to it. Now we have a table that has age, caseload, and entries by sex: six value columns plus the state column, so seven columns in the resulting table. Now, here's a lot of code. What if we want to bring in external data, right? When I look at something like this, a state table of entries and caseloads, I see that Alabama has a lot more entries than Alaska, but of course we know we don't have an apples-to-apples comparison here, because Alabama and Alaska have very different underlying populations. So we want to adjust these numbers for population size, minimally, and we might want to do other adjustments later. The simplest thing to do is to recalculate these as per capita rates, to make sure we are comparing apples to apples when we look across states. One way to do that is to use our join commands to append data from other sources. In this case we're going to create a new table called "pop", and I'm going to go to the seer.cancer.gov population data. This is a great micro data source for county-level population data in the United States, broken out by single year of age, by race, and by sex. I'm blanking on what the SEER acronym stands for, but it's a program of the National Cancer Institute that produces these population data. We can go download that data file, and you can obtain it from that URL if you are interested. You could also do this with the American Community Survey or the census just as easily. This is a fixed-width file, so it's a little trickier to read, because it's a text file with no delimiter. Our AFCARS table has tabs between each column in the data, but this has no delimiters at all. It's just fixed width: it goes, say, four spaces and then it's a new variable, and then it's a new variable. So we need to give it the widths of all those columns; in this case you can just copy and paste this code if you want. I'm giving it the widths of all the columns and the names of all the variables, and I'm doing some things to clean up the data: making sure I'm only getting children, that I'm only getting year 2017, and that I'm making the column names match, but don't worry too much about that. Yes, and we will definitely post all of the R code, and feel free to email me that question; I have this as an R Markdown document too, and I can send out the raw code in a document that you can load straight into R. Obviously I can't give you the foster care data, you'll have to apply for that through the data archive, but if you'd like to get this in a native R format, let me know and I can send it your way. Okay. So let's aggregate that data to state, year, and sex.
So currently the population data, as you can see by these names, is at the year and state level, and you see this st_fips, that's the Federal Information Processing Standards code, and cnty_fips, so this is at the county, race, Hispanic origin, sex, and age level, right? It's a long data file: by year, by county, by race, by Hispanic origin, by sex, by age. So this is a very long file; for each county in the data you've got a couple hundred rows, so it's a big file. But we want to aggregate it to state, and we are only looking at 2017. So we want to get state, year, and sex. Let's group by state, sex, and age, because we want age-specific population numbers so that we can look at the number of foster care entries by age per capita across states. Grouping by state, sex, and age means we're going to squish together all the different racial and ethnic groups that are in the data into a single population number, and we'll summarize pop equals sum of pop. This is going to give us a table with a state-level variable; it's going to have, I believe, 21 observations for each state, sorry, 21 times 2: we have two sexes in the data and we filtered to age less than or equal to 21, so it should have about 42 observations per state. Each of those rows will have a population number: males age 0, how many are in the population? Females age 0, how many are in the population? Okay, so now we want to join all of that. Let's do a left join where we take our afcars_17 object again; we're going back to the original object, the big AFCARS file, not the summary we created. And we're going to create an entries measure: we group by sex, state, and AgeAtEnd, and count how many children of each sex, from each state, at each age, entered foster care in that year. We're going to rename the variables to match; you'll notice that the state variables differ in case across the two tables, and I like working with all lowercase variable names so I don't have to worry about being case-sensitive, so I'm going to rename them all to be lowercase. Then the left join: we're grouping by sex, state, and ageatend, creating a count of entries for AFCARS, and notice this parenthesis closes here, right? That's our summarize. Then on the AFCARS object we do the rename, and then the left join; this parenthesis closes here, so we are joining this AFCARS object to the population state table we just created. We're going to smush those two together, and we're going to remove everyone over the age of 18 from the data. You can see that it joined by sex, state, and age, because those are the common columns across the two tables. So what did this make? Here's what we have. For sex equals one, which is boys, in Alaska, at ages 0, 1, 2, 3, 4, 5, this is how many entries there were in the 2017 AFCARS table, and this is the population reported in the SEER population data for those groups for that year. Now from this we can easily calculate per capita numbers, and of course this is going to have roughly 18 times 2 times 50 rows, so 36 times 50, about 1,800 rows in the resulting data frame. You don't see sex equals 2 anywhere because we're only looking at the first six rows of the data. But that's what that looks like.
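A sketch of the aggregation and join just walked through. The fixed-width read of the SEER file (read_fwf with the column widths) is omitted because the widths weren't shown on the slide; this assumes "pop" already holds the cleaned 2017 child population with lowercase names (state, sex, age, pop), and the renamed AFCARS columns here are illustrative of the lowercase convention described:

```r
# collapse the SEER file over race and Hispanic origin
pop_st <- pop %>%
  group_by(state, sex, age) %>%
  summarise(pop = sum(pop))

# count AFCARS entries at the same level, match the names, then join
percap <- afcars_17 %>%
  rename(state = St, sex = SEX, age = AgeAtEnd) %>%
  group_by(state, sex, age) %>%
  summarise(entries = sum(Entered)) %>%
  ungroup() %>%
  left_join(pop_st) %>%   # joins by state, sex, age (the shared columns)
  filter(age <= 18)
```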
So let's calculate a per capita rate. Using the tidyverse syntax, we take our "percap" object, the new table we've been looking at, and mutate. Mutate is how we create a column. We'll take the "percap" object, pipe the mutate command to it, and call our new variable "entries_pc". That's per capita entries, and I like to compare entries using a rate per thousand population, because those tend to be nice whole numbers when we're working with the foster care data. So here I'm going to divide the number of entries by the population size and multiply by a thousand. 1e3 is just scientific notation for 1000; you could put 1000 in there and that would be fine. And then I'm going to plot it. If you've never seen the ggplot package in R before, this is a plot command where I give it the object "percap", and the "aes" is the aesthetics the plot is going to use: I tell it to put age on the x-axis and entries_pc on the y-axis, to use different line types by sex, to draw a line, and also to wrap by state, so we get an individual panel for each state. So we're going to get a line plot that shows entry rates by age on an XY plot for each state, with separate line types for boys and girls. And this is what that produces. Apologies that the resolution is a little low, but you can see that entry rates across the country are much higher at younger ages than at older ages, and this is consistent with what we know about the foster care population: children are at highest risk of entering care in the first years of life, and for some states, like West Virginia here, North Dakota here, Vermont here, and DC, we see upticks in the rates of entry as children get closer to adolescence. We are not seeing strong sex differences, but perhaps if we zoomed in on this plot a bit more we could; it looks like in Wyoming we might be seeing a divergence between boys and girls, where the rate for girls is a little lower. There would theoretically be a dashed line, but the trend lines for the two groups are really similar in this case. This is a really nice way to use a few lines of code to create a 50-state plot and quickly visualize whether there are differences across places in rates of entry by age. And there are some places that look like they might have different patterns going on, such as Montana, West Virginia, and Oklahoma, that might merit a closer look at what's going on with different groups of children. So that's one thing we can quickly do by merging in the population data that we couldn't do otherwise, and ultimately it doesn't take a lot of code to generate these kinds of figures. Okay, so the last topic we're going to cover is drawing a random sample, and then we can open up for Q&A if you have questions that haven't been thrown out yet.
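In sketch form, the per capita calculation and the faceted plot just described, continuing from the "percap" table built above (sex is coded numerically, so it's wrapped in factor() to get discrete line types):

```r
# rate of entries per 1,000 children in the population
percap <- percap %>%
  mutate(entries_pc = entries / pop * 1e3)

# entry rates by age, one panel per state, line type by sex
ggplot(percap,
       aes(x = age, y = entries_pc, linetype = factor(sex))) +
  geom_line() +
  facet_wrap(~ state)
```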
So some of you may have heard of a split-apply-combine approach, which I'll talk a little about at the end for the more advanced case, but here's the simplest thing we can do. Often we're running complex models on large administrative data sets like the AFCARS or NCANDS, and I've run imputation models on the NCANDS and AFCARS that can take as long as a week to run, because these are huge data sets. Now, if we are going to dedicate computing resources and our time to run those kinds of models, we want to make sure they're going to work, right? We don't necessarily have time to run a model for a week, come back and check it, only to find it wasn't exactly what we wanted, or the code didn't work for some reason, and we need to do the whole thing over again. That's a waste of our time, and it's also a waste of computing power. If you're running this on a server, you're potentially taking resources away from some other analysis. So we want to conserve those resources where we can, both time and computing. An easy way to do that is to sample: take a random sample of your administrative data and debug your code on it. In this case, what I'm suggesting is that we use the tidyverse packages and the "sample_frac" command to take a 10% random sample of the entire data set. We know there are about 500,000 observations in that data set, between the children who were in care at the end of the year, the children who entered, and the children who exited; that's a lot of observations. So let's take a 10% sample and get it down to 50,000. That makes the statistical model I'm going to estimate below quite a bit easier to estimate. We're going to estimate a Poisson model, a count model, to look at the number of placement episodes the child had, by their age, by whether they entered, and by their sex. So let's say we wanted to build a simple count model of how many placements a child had within a single episode in foster care. We could run this model on the sample, run it until we were satisfied that the code was working properly on the subset, and then run it on the full data. Now, we could take this a step further and use a split-apply-combine approach, which is something you'll see more of when you get into more formal big data applications, where we might subset the data into smaller baskets, estimate the model or run an analysis on those smaller subsets, and then combine the results. We can also get fancier and do things like parallel computing; those are the kinds of advanced topics you can push this toward. But specifically, if you're thinking about doing things like multiple imputation for missing data, or running hierarchical or mixed-effects multilevel models, those kinds of models can be intensive on computing resources, so think about developing your code using a subset of the data. It'll make your life a lot easier, and it will also make the administrators of the computing systems you're using much happier with you when you don't crash the servers with code that isn't working.
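A sketch of the sampling-for-development workflow. The outcome name "n_placements" is a placeholder for whichever placement-count variable you take from the AFCARS codebook; it would need to be retained in the earlier select() call (or you could sample from the full file before subsetting):

```r
# draw a 10 percent random sample for development runs
afcars_samp <- afcars_17 %>%
  sample_frac(0.1)

# a simple Poisson count model of placements by age, entry status, and sex;
# debug on the sample, then rerun once on the full data
m0 <- glm(n_placements ~ AgeAtEnd + Entered + factor(SEX),
          family = poisson,
          data = afcars_samp)
summary(m0)
```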
But that's what I have for now. I know we've blazed through a ton, so any questions, I'm more than happy to address, and again, I'm happy to share the code used for all of these slides. There is my email address again; we have all of this in an R Markdown file that I'm happy to share with everybody, and I look forward to your questions. So thanks for attending. Well, thank you very much Frank, that was wonderful. So we have our first question: could you explain again how the per capita data was used? Yeah, and Lauren, please clarify if I don't address your question directly. I'll cover both how I calculated the per capita rate and how I used it. So first we need the raw population data, and again I got this from seer.cancer.gov/popdata; you could go download that data right now. The other place I like to get population data, and I'm going to type it in the chat box, is NHGIS, which you can get through IPUMS, ipums.org; that's the Minnesota Population Center. It houses the American Community Survey micro data that we can access through NHGIS, and you could go download that data right now too and create a national extract for the groups you're interested in. But this SEER population data is really easy to use, and it has the structure we want for working with the administrative data we're looking at here. So that's how we create the state-by-age-by-sex totals we want, and as we join them to our AFCARS data, we now have the state-by-sex-by-age counts: here's age, here's sex, here's the number of foster care entries that were in the AFCARS data, and here's our SEER population variable. What we're going to do is divide this column by this column, so we're basically creating a ratio of entries to population. That gives us the proportion of the children in this group: of these 5,600, how many were in foster care? 91 over 5,600, right? But I think it's nicer to talk about that in terms of children per thousand, so I don't have two zeros in front of every number I'm talking about. So I'm going to create a variable "entries_pc" equal to entries divided by pop times 1e3; that is, entries divided by pop times 1,000. That's all it's doing: creating a new column that's the ratio of these two, times a thousand. Then I plot that new entries per capita variable in this plot, and you can imagine we could make this a national plot. And let me explain these axes a little better; this is not a pretty graph by any stretch, and we could make it a lot prettier with some extra code. Per capita entries are on the y-axis and age is on the x-axis, which you can see goes from 0 to 18. So that's how I used the per capita data. But if I were going to construct, for example, a comparison of caseloads across states, let's say I wanted to look at a question a lot of people are asking right now: does exposure to opioids change caseloads? Say I had some variable that captured the number of hospital admissions for opioid overdose in a state in a year. I could use these per capita foster care entry numbers as an outcome in an OLS regression model, because I've adjusted them for the size of the child population.
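In sketch form, the two population adjustments this answer points to, the second anticipating the offset approach described next. The data frame "state_data" and the predictor "opioid_adm" (state-year opioid overdose admissions) are hypothetical placeholders:

```r
# option 1: per capita rate as the outcome in a linear model
m_ols <- lm(entries_pc ~ opioid_adm, data = state_data)

# option 2: raw counts with log population as an exposure offset
m_pois <- glm(entries ~ opioid_adm + offset(log(pop)),
              family = poisson,
              data = state_data)
```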
So when you're building out place-level models like this, like a state or county model, you must adjust for population exposure, right? You must adjust for population size. There are ways to do that with a per capita measure, if you're doing a typical linear regression model, or with an exposure or offset variable, if you're doing a count-based approach with a Poisson or negative binomial model or something like that. I hope that answers your question; if it doesn't, please feel free to follow up. And for those of you who wish to see an application of the techniques I've shown here today, the research paper I'll be presenting in a couple weeks as part of our concluding session is a county-by-year paper looking at sources of child abuse and neglect reports using the NCANDS data. I used exactly these techniques to transform the NCANDS into a county-year time series to do some detailed analysis of differences across counties. So if you're seeing all this and wondering why you would do it, I can show you a direct application in a couple weeks. Wonderful presentation, thank you very much for your time here today, Frank. Next week Michael Dineen will present on linking these three data sets, and then, as Frank said, he'll be back for our final session where he'll walk through an example that is a published paper. So we have one more question: do you know where we can get state-level statistics, such as their foster care population or administration structure? Yeah, so the population side is something we can get from the AFCARS. The best quick source for that right now is the Kids Count data from Casey, so let me get a URL for you real quick. Yeah, the easiest place to get that data is here. The AFCARS and NCANDS are the source of nearly all of the data that the Casey Foundation presents on Kids Count, so you can go get quick state-level statistics there. The data archive is developing a web-based data visualization table maker that we'll have available at some point, hopefully within the year, that will allow you to manipulate state- and county-level administrative data. As far as administration structure variables, those are harder to come by, and there is no real systematic data collection. There is the state file in the NCANDS that you can look at, which has some administrative variables. Child Trends publishes a funding report annually that provides some insight into funding structures across the states, and the National Conference of State Legislatures produces reports that provide insight into state policy changes, legislation that's passed or introduced at the state level, for most years. And then individual reports here and there from organizations like Casey or the Children's Bureau will have some comparisons of administrative structures. It might depend on what you mean by administrative structure, but there's less comprehensive data there. The best source for state-level statistics on the foster care population and those kinds of things is the administrative data we're talking about, the NCANDS, the NYTD, the AFCARS; those are really the only place you can get comparable statistics across all the states. That's the real strength of these data systems, and something I hope you all find useful.
But if you just want to know at a glance what the child welfare population in Maine is, Casey's Kids Count is great. State-run or county-run? Yeah, that changes year to year, so you'd need a time series. There are, I think, something like 12 states that have county rather than state administration. I don't know of a list that tracks that over time, so if someone else does, throw it in the comments. It's a good question. There's not a single place that shows it, and the trend has been toward more state administration, less county administration, more centralization, so you'd have to make sure that even if you know the state you're looking at is state-administered in one year, it might have been county-administered 10 years ago. All right, well, that wraps it up for today. Thank you again everyone for attending, and we'll see you in two weeks, Frank. All right, thanks Erin, thanks everybody.