[MUSIC] [VOICEOVER] National Data Archive on Child Abuse and Neglect. [ONSCREEN slide content 1] WELCOME TO NDACAN MONTHLY OFFICE HOURS! National Data Archive on Child Abuse and Neglect DUKE UNIVERSITY, CORNELL UNIVERSITY, & UNIVERSITY OF CALIFORNIA: SAN FRANCISCO The session will begin at 11am EST 11:00 - 11:30am - LeaRn with NDACAN (Introduction to R) 11:30 - 12:00pm - Office hours breakout sessions Please submit LeaRn questions to the Q&A box This session is being recorded. See ZOOM Help Center for connection issues: https://support.zoom.us/hc/en-us [Paige Logan Prater] Hello everyone I think I'm going to go ahead and get started while folks continue to join. Welcome good morning good afternoon no it's still morning. Welcome to our Archive's monthly office hours. This is session number five. And so welcome if you're returning welcome if you're brand new. We will get started in about a minute after I go through some housekeeping. So today our session will be 30 minutes of our LeaRn with NDACAN series and 30 minutes or so of office hour breakout sessions with some folks with the Archive. We're really excited to have you. And just for reference if you have any questions throughout the first 30 minutes of the R training feel free to put them in the Q&A. You can also put them in the chat or come off mute and ask Frank directly. This session is being recorded and it will be available on our NDACAN website as well as all of the other recordings from past sessions. Any homework or assignments associated with this training are all up and available on our website. If you have any Zoom issues feel free to reach out to us and we can try to help you get squared away that way. And I think without further ado I will take it over to Frank to get started with the R training. [ONSCREEN CONTENT SLIDE 2] LeaRn with NDACAN Presented by Frank Edwards [Frank Edwards] All right thanks Paige no more ados coming your way we're gonna learn with NDACAN. I feel like I want to put some Bob Ross on and do a painting lesson or something but let's do it. [ONSCREEN CONTENT SLIDE 3] Materials for this Course Course Box folder (https://cornell.box.com/v/LeaRn-with-R-NDACAN-2024-2025) contains Data (will be released as used in the lessons) Census state-level data, 2015-2019 AFCARS state-aggregate data, 2015-2019 AFCARS (FAKE) individual-level data, 2016-2019 NYTD (FAKE) individual-level data, 2017 Cohort Documentation/codebooks for the provided datasets Slides used in each week's lesson Exercises that correspond to each week's lesson An .R file that will have example, usable R code for each lesson – will be updated and appended with code from each lesson [Frank Edwards] The course materials as always are up in our Box folder and the data that we'll be using today is the AFCARS state aggregate data 2015 to 2019. A reminder that these are all fake data sets, all simulated to look like the real data, but they are not real data, so please do not use them for actual analysis because your findings will be wrong. But please do request the data sets through our website if you'd like to use them. Again all of the data and all of the code is in week five for today and the slides are available in both PowerPoint and PDF on the Box folder.
[ONSCREEN CONTENT SLIDE 4] Week 5: Descriptive Statistics February 21, 2025 [Frank Edwards] Today we're going to cover descriptive statistics, and before the meeting we were briefly discussing that some folks said, you know, the pace has been a little fast, a little slow, sometimes a little advanced, a little basic. This week it's going to be fairly basic. I'll get to some more advanced applications at the end today but we're really going to cover some of the basics in how we get to know our data sets in R. [ONSCREEN CONTENT SLIDE 5] Data used in this week's example code AFCARS fake aggregated ./Data/afcars_aggreg_suppressed.csv Simulated aggregate data on children in foster care following the AFCARS structure Can order full data from NDACAN: https://www.ndacan.acf.hhs.gov/datasets/request-dataset.cfm [Frank Edwards] And first, we're going to use the afcars_aggreg_suppressed.csv file, and again this is simulated fake data, but what it represents is counts of children in foster care by U.S. state. [ONSCREEN CONTENT SLIDE 6] Basic Descriptive Statistics In R [Frank Edwards] So basic descriptive statistics in R. [ONSCREEN CONTENT SLIDE 7] Central Tendency mean() computes the arithmetic mean of a vector the formula for calculating the arithmetic mean (also known as the average) of a set of numbers is mean = (x1 + x2 + ... + xn) / n Can be directly computed as sum(x) / length(x) median() returns the value at the 0.5 quantile of the data after arranging Can also be computed as quantile(x, 0.5) [Frank Edwards] When I think of descriptive statistics I tend to think in a couple categories. The first is measures of central tendency right? And the mean is our old trusty favorite and in R to compute a mean of a vector we use the mean function. Again in R we're typically going to organize our data sets into data frames where each column and row can be thought of as a vector. So for example if I had my AFCARS data loaded in as afcars and I wanted to compute the mean number of entries in a state or across states over time, then I could compute mean(afcars$entered), where the dollar sign indexes my data frame and entered is the name of the variable, and that would return the average number of entries in a state and year. Sometimes you may want to be fancy and compute the mean yourself and that's just going to be sum(x) over length(x). These functions I tend to refer to as summarizing functions, that is, they take as input a vector, a numeric vector in this case, and return as output a single number. That is they reduce the dimension of it. They take a vector of say 100 values and in return provide one summary statistic. And the mean is going to tell us about the average of the points in that vector. Sometimes when we have distributions that are not symmetric the mean though is not the best measure of central tendency. Means as we all know are highly sensitive to outliers. And so in a lot of cases what we'll see when we look at these administrative data is we'll see distributions of our variables that might have quite long tails. For example if we're looking at foster care entries we're going to have a substantially long right-hand tail because some states have larger populations than others and have larger numbers of children entering foster care than others. So the mean might not give us the most informative measure of central tendency; in that case the median can be more useful. And the median function in R will return the value at the 0.5 quantile of the data.
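As a minimal sketch (this is not part of the session script), the central tendency calls look like this in R; here x is a small made-up vector, but in practice it would be a data frame column such as afcars_demo$entered:
x <- c(2, 4, 4, 10, 100)   # toy numeric vector standing in for a column like afcars_demo$entered
mean(x)                    # arithmetic mean: 24
sum(x) / length(x)         # same value, computed by hand
median(x)                  # value at the 0.5 quantile: 4
quantile(x, 0.5)           # equivalent to median()
Even in this toy vector, the gap between the mean (24) and the median (4) hints at the kind of long right tail described above.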
So median of a vector gives us the 0.5 quantile of the data, that is, if we arrange the data from its minimum to its maximum, it gives us the data point that's exactly in the middle. Again, if you want to be a little more tedious about it, you can compute it as quantile(x, 0.5). The quantile function is more generic than median. With the quantile function we can return any quantile of the data, we could, you know, take the 2.5th quantile, the 97.5th quantile, we can go anywhere we want with that. [ONSCREEN CONTENT SLIDE 8] Dispersion Standard deviation: sd(x) Variance: var(x) Interquartile range: quantile(x, c(0.25, 0.75)) Minimum: min(x) Maximum: max(x) [Frank Edwards] The second category I tend to think of is dispersion. So we have central tendency and mean and median are best there. For dispersion we're measuring how spread out the data are. How much variation do we have in the data? The standard deviation is computed simply with sd so if we provide a vector x we can do sd(x). And typically how you're going to use this in R is with again data frame indexing so it'll be sd of afcars dollar sign and then the name of your variable right? And the variance is simply the square of the standard deviation so we don't necessarily need a separate function for it, we could just, you know, type sd(x) squared. But we can also compute it directly with var(x). Sometimes we might want something like the interquartile range. We might want to know where is the central 50% of the data or where is the central 80% of the data? And in that case the quantile function is very useful. Again we can use the quantile function, provide as input a vector, I'm just calling it x here, but it'll typically be again something like afcars dollar sign entered. And the nice thing about quantile is I can actually give it a vector of values between zero and one and it will return to me all of the quantiles for the specified intervals right? So in this case I'm asking for the 25th and 75th quantiles; that'll give me the middle 50% of the data but there's no reason I couldn't add five more values in there. I could ask for the 0.01 the 0.1 the 0.25 the 0.75 the 0.9 the 0.99 right? I could ask for those all at once and R would happily return to me five or six values. Minimum and maximum are also quite useful and they are quite simply min and max. [ONSCREEN CONTENT SLIDE 9] Crosstabs and grouped summaries For univariate or bivariate crosstabs in a data.frame: table(df$x, df$y) For more advanced applications of grouped operations (beyond frequencies), use tidyverse group_by() %>% summarize() [Frank Edwards] Now those are what we'll use for continuous measures. When we have categorical measures then crosstabs are incredibly useful right? For bivariate crosstabs in a data frame we'll just use the table function. If I provide table with two vectors it will give me a simple crosstab. If I want a univariate crosstab I can provide table with a single measure. But table's really useful when I just want to look at frequencies across a categorical variable; when I want to do something more advanced, a multi-dimensional crosstab with more than one grouping variable, I'll often use the group_by summarize syntax that we covered in our introduction to the Tidyverse. And I'll show you how that works a bit when we get to the code demonstration in a moment.
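A short self-contained sketch of these dispersion and crosstab tools might look like the following; the vectors here are simulated toy data, not the AFCARS file, and the variable names are made up for illustration:
set.seed(1)
x <- rpois(100, lambda = 5)                  # stand-in for a count variable like entries
sd(x)                                        # standard deviation
var(x)                                       # variance; equals sd(x)^2
quantile(x, c(0.25, 0.75))                   # endpoints of the interquartile range
quantile(x, c(0.05, 0.25, 0.5, 0.75, 0.95))  # several quantiles at once
min(x)
max(x)
# crosstabs on made-up categorical variables
state <- sample(c("AL", "AK", "AZ"), 100, replace = TRUE)
year  <- sample(2015:2019, 100, replace = TRUE)
table(state)        # univariate frequency table
table(state, year)  # bivariate crosstab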
So group_by summarize will do something similar to table; it will provide us with cross tabulations if we set that up correctly, but when you want a very simple univariate or bivariate crosstab, table is your go-to. [ONSCREEN CONTENT SLIDE 10] Why not all of them? We can also just use summary() on a data.frame to obtain a good collection of descriptive statistics [Frank Edwards] Now R understands that many of us want to just see a lot of descriptive statistics on our data at once. And so the summary function is a very very handy tool. If we provide a data frame to summary, what R will do is compute the minimum the maximum the interquartile range the mean and the median I believe. So it'll provide you with that kind of slate of descriptive statistics that we all know and love but it will do it for every variable in a data frame. So if you want to just get a quick glance at descriptive statistics across your data, summary on a data frame is a very good way to do that. [ONSCREEN CONTENT SLIDE 11] Over to RStudio [Frank Edwards] So without further ado let's hop on to RStudio for the rest of our time together today. [VOICEOVER] The program written in R is included in the downloadable files for the slides and the transcript. [Frank Edwards] All right let's get started this is up on the Box folder as week5.R. You'll notice that I put this chunk of comments at the top of my code. [ONSCREEN] #### Week 5: descriptive statistics #### project: leaRn #### Author: Frank Edwards #### Email: frank.edwards@rutgers.edu #------- [Frank Edwards] I like to do this with all of my code; I have a sort of template I like to use for my code documentation. You can actually program macros into RStudio to do this automatically which is pretty cool but it's a really good idea to document the heck out of your code especially when you start working across lots of projects. This way anytime I open up a script I know exactly what it's for and if I'm collaborating I know who has touched it last. As always we're going to start by pulling Tidyverse in. And I'm going to execute this sequentially but I could click the source button up here to run the entire script and a lot of times that's what we're going to want to do. For teaching purposes today I'm going to run it line by line, so I'm going to highlight line seven. I'm on a Mac so I'm going to push command-return, on a PC it would be control-enter, to run the Tidyverse line and go ahead and get that in. [ONSCREEN] > library(tidyverse) ── Attaching core tidyverse packages ──────────── tidyverse 2.0.0 ── ✔ dplyr 1.1.4 ✔ readr 2.1.5 ✔ forcats 1.0.0 ✔ stringr 1.5.1 ✔ ggplot2 3.5.1 ✔ tibble 3.2.1 ✔ lubridate 1.9.3 ✔ tidyr 1.3.1 ✔ purrr 1.0.2 ── Conflicts ────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ℹ Use the conflicted package to force all conflicts to become errors [Frank Edwards] and then we're going to read in the AFCARS demo data with read_csv. [ONSCREEN] >afcars_demo<-read_csv("./data/afcars_aggreg_suppressed.csv") Rows: 4100 Columns: 10 ── Column specification Delimiter: "," chr (1): sex dbl (9): fy, state, raceethn, numchild, phyabuse, sexabuse, negl... ℹ Use 'spec()' to retrieve the full column specification for this data. ℹ Specify the column types or set 'show_col_types = FALSE' to quiet this message. [Frank Edwards] You'll notice that I did not specify a full file path here and the reason I don't have to do that is because I'm using an RStudio project.
We talked a little bit about this in our data management session but I strongly recommend that you learn how to work with RStudio projects and use them, because they make relative paths something that's baked into your programming environment and make your life a lot easier. So in this case I have a subdirectory in my leaRn folder called data that contains all of my data files. So let's show how some of these functions work in Base R. And I like to distinguish between Base R, that is the functions that you get with R without loading any external packages, and then I'll show you some of the Tidyverse approaches that we can use. So, mean: we're going to look at exits today. [ONSCREEN] > # base R > ------ > ## central tendency > # mean of exits > mean(afcars_demo$exited) [1] NA [Frank Edwards] So I ask R to provide me with the mean of the exited variable and oops I get an NA returned. [ONSCREEN] > head(afcars_demo) # A tibble: 6 × 10 fy state sex raceethn numchild phyabuse sexabuse neglect 1 2015 1 1 1 2180 352 88 568 2 2015 1 1 2 1245 198 46 331 3 2015 1 1 4 10 NA 0 NA 4 2015 1 1 5 NA 0 0 NA 5 2015 1 1 6 245 30 NA 71 6 2015 1 1 7 204 56 22 60 # ℹ 2 more variables: entered , exited [Frank Edwards] Well part of the problem there is that I have some NA's in my exited column. My apologies, that doesn't show you everything you want to see, so let me use the glimpse function from tidyverse to show all of the variables. [ONSCREEN] > glimpse(afcars_demo) Rows: 4,100 Columns: 10 $ fy 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2… $ state 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2… $ sex "1", "1", "1", "1", "1", "1", "1", "2", "2", "2",… $ raceethn 1, 2, 4, 5, 6, 7, 99, 1, 2, 3, 4, 5, 6, 7, 99, 1,… $ numchild 2180, 1245, 10, NA, 245, 204, NA, 2085, 1120, NA,… $ phyabuse 352, 198, NA, 0, 30, 56, NA, 305, 217, 0, NA, 0, … $ sexabuse 88, 46, 0, 0, NA, 22, 0, 156, 73, 0, 0, 0, 12, 40… $ neglect 568, 331, NA, NA, 71, 60, NA, 572, 251, NA, NA, N… $ entered 1073, 565, NA, NA, 87, 92, NA, 1035, 519, NA, NA,… $ exited 864, 452, NA, 0, 99, 91, NA, 866, 385, 0, NA, NA,… [Frank Edwards] If we look at exited we can see in row three and then again in four five six seven we have NA values. And anytime R tries to compute a numeric quantity over a vector, in this case a column, that contains missing values, it will return a missing. That is, it cannot numerically compute the mean of a vector with a missing value in it. So we need to be explicit that we want R to remove those missing values when it does the calculation. This does not remove them from the data frame, it simply removes them from the calculation of the mean. So let's see what we get now. [ONSCREEN] > # wait what?! oh yeah, missing values! > mean(afcars_demo$exited, na.rm=T) [1] 365.7624 [Frank Edwards] All right it's no longer missing it's 365.8 right? So we have on average for these states, and I like to think of our data here as being at a state and year unit of analysis. That is each observation represents the total number of cases in each state for each year. We have multiple years represented in this data. So the average state year in this data saw 365.8 foster care exits. How many missing values are there in exits? Table is a great tool to use here and we can nest within our table call the is.na function. What is.na does is it returns us a vector of trues and falses after it evaluates the variable exited so that all values that are missing will return a true and all values that are non-missing will return a false.
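For instance, on a tiny made-up vector (not the session data), is.na() behaves like this:
is.na(c(864, NA, 452, NA))  # returns FALSE TRUE FALSE TRUE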
Table will tell us how many trues and falses we have in that vector. [ONSCREEN] > # how many missing values in exits? > table(is.na(afcars_demo$exited)) FALSE TRUE 3077 1023 [Frank Edwards] So in exited we have 1,023 missing values and 3077 non-missing values. So about one in four of our observations are missing on this variable; that's a pretty high proportion of missing data. The actual AFCARS data if you're working from it will not have this much missing data in it. It will have missing data on some variables that is this high, but not on exited. On some of the variables you will experience very high levels of missingness and we have talked in some of our previous training sessions, and are likely to talk again in the future, about how we recommend handling those missing values. Okay for now we're just going to do na.rm because we just want to compute things. [ONSCREEN] # how about the median? > median(afcars_demo$exited, na.rm=T) [1] 106 [Frank Edwards] So we computed a mean of 365 but again we talked about the fact that a lot of these distributions are not symmetric and might have substantial outliers. In this case the median reveals that right away: we saw a mean of 365.8 but a median of 106. Which tells us that we have a lot of mass at low values of exits and a few observations with incredibly high numbers of exits that are pulling that mean up. So in this case the mean and median are telling us something quite different and depending on what our goal is for summarizing central tendency we might want to be very aware of those differences. All right so we know we've got substantial variation in this data already based on those mean and median numbers but let's quantify exactly how much it is. [ONSCREEN] > ## Variation, dispersion > # OK, mean > median, long right tail on this data > # how much variable are exits? > sd(afcars_demo$exited, na.rm=T) [1] 686.2612 [Frank Edwards] Our standard deviation is 686. That is, the average observation differs by 686 from the mean. That's a really high level of variation when our mean is 365 right? The variance is obviously going to be dramatically larger than that because it's the square of the standard deviation; it might not be as directly informative in this case but it's still useful to know how to calculate. [ONSCREEN] > var(afcars_demo$exited, na.rm=T) [1] 470954.5 [Frank Edwards] So let's check out the range of the data. This variable has tremendous variation in it, we can see that already. [ONSCREEN] > # what about the range of the data? > min(afcars_demo$exited, na.rm=T) [1] 0 [Frank Edwards] The minimum value is zero so we have some state years in this data that report no foster care exits [ONSCREEN] > max(afcars_demo$exited, na.rm=T) [1] 7722 [Frank Edwards] and our max is 7,722 right? So we've got a tremendous range here and we can see now how we would end up with a standard deviation that's that high. What's our interquartile range? [ONSCREEN] > # How about the IQR? > quantile(afcars_demo$exited, c(0.25, 0.75), na.rm=T) 25% 75% 22 403 [Frank Edwards] Let's think some more about quantifying the range of the data. Okay so we have a fair amount of mass at low values, that is 25% of the data lies between 0 and 22. We know the minimum is zero right? So the zeroth quantile is at zero and the 25th percentile or 25th quantile is at 22. So a fourth of the observations fall between zero and 22, another fourth of the observations fall between 403 and 7,722, and half of the observations fall between 22 and 403.
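A tiny made-up example (not from the session materials) makes this outlier sensitivity concrete:
x <- c(1, 2, 3, 4, 1000)    # one extreme value on the right tail
mean(x)                     # 202 -- dragged up by the single outlier
median(x)                   # 3 -- barely affected
sd(x)                       # about 446 -- also dominated by the outlier
quantile(x, c(0.25, 0.75))  # 2 and 4 -- the middle 50% ignores the tail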
Maybe I want a broader interval right? That 50% interval tells us a lot but maybe I just want to know something like where is most of the data really, where's 90% of the data? Let's drop those outliers and see where most of the data is. [ONSCREEN] > # How about the central 90% of the data > quantile(afcars_demo$exited, c(0.05, 0.95), na.rm=T) 5% 95% 0.0 1583.6 [Frank Edwards] Okay we still end up with a zero. So those zeros are not that uncommon here. We have, you know, at least 5% of the observations at zero and the 95th percentile of the data is 1583. So this gives us a bit more information about where 90% of the data is; we're now excluding only the most extreme observations. Let's see what our state and year coverage looks like with a crosstab. So now I'm just going to run a univariate crosstab on fy which is our fiscal year and this will tell us how much data we've got for each year. [ONSCREEN] > ## crosstabs > # what does our state and year coverage look like? > table(afcars_demo$fy) 2015 2016 2017 2018 2019 819 817 817 824 823 [Frank Edwards] In this case we have you know about 820 observations for each year but it's not consistent, and so this is telling us this must be reported at something more detailed than one row per state, because we don't have several hundred states, whatever, it's simulated data. This shouldn't look right and it doesn't, because you would immediately think if I'm looking at state data I should have 50 or maybe 51 or maybe 52 observations if I'm including Washington DC and Puerto Rico which are often reported in the data, but I should not have 819. If I saw this I would immediately start to wonder what was going on and go back to square one and look at my data structure and make sure I processed things appropriately. In this case we'll just let it slide. And then let's check out our state coverage. [ONSCREEN] > table(afcars_demo$state) 1 2 4 5 6 8 9 10 11 12 13 15 16 17 18 19 20 21 22 23 24 25 79 78 94 79 89 80 74 46 58 93 80 89 79 83 84 93 73 85 80 76 80 87 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 44 45 46 47 48 75 80 80 80 92 78 81 72 77 70 77 80 81 90 71 80 83 74 85 67 80 99 49 50 51 53 54 55 56 72 80 65 79 85 76 81 72 51 [Frank Edwards] All right I'm seeing 79 observations for state one which again is a little strange given that we have one two three four five years of data; I should expect to see fives for all of these categories if we had what we would call a balanced data set. We often refer to these time and place data sets, that is repeated measures of the same units over time, as panel data, and we often distinguish between balanced panels where we have the same number of observations for every unit and unbalanced panels where we have differing numbers of observations for each unit. In this case we have an unbalanced panel. We do not have the same amount of information for each unit. So let's check on our missing data on exits over time. Now I'm going to return to my is.na formulation for exited but now I want to know is there more missing data in different years of the data, so I'm going to do a bivariate crosstab where effectively I'm going to ask R to compute how many values are missing for each year. [ONSCREEN] > # what about missing data on exits by year? > table(afcars_demo$fy, is.na(afcars_demo$exited)) FALSE TRUE 2015 626 193 2016 608 209 2017 624 193 2018 612 212 2019 607 216 [Frank Edwards] And in this case I can see that I have the most missing data in 2019, the least missing data in 2015 and 2017.
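To turn those counts into proportions, one option not shown in the session is base R's prop.table() applied to the same crosstab:
# share of missing vs. non-missing exited values within each fiscal year
prop.table(table(afcars_demo$fy, is.na(afcars_demo$exited)), margin = 1)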
Okay so what if I want the mean exits for each year? Now we're getting a little advanced, let's pivot over to the Tidyverse. The anatomy of Tidyverse calls is such that I start with the name of my object; it's almost always going to be a data frame or what Tidyverse may call a tibble, which is just their precious name for a data frame; it'll work the same way 99.9% of the time. So I'm going to start with my data frame afcars_demo which we have been using the whole way through and then this little operator is called a pipe. If we're reading the code, we would read this line 52 as: take afcars_demo, and then the pipe effectively means and then do the next thing. So I'm saying take afcars_demo and then group it by fiscal year and then summarize a new variable called exited that will take on the value of the mean of exited after removing missings for each fiscal year. So let's see what we get now. [ONSCREEN] > # tidyverse ---- > # What if we want the mean exits for each year? > afcars_demo %>% + group_by(fy) %>% + summarize(exited = mean(exited, na.rm=T)) # A tibble: 5 × 2 fy exited 1 2015 347. 2 2016 372. 3 2017 361. 4 2018 376. 5 2019 373. [Frank Edwards] Now I can compute the average exits over time and we don't have a whole lot of variation in the mean exits despite having tremendous variation in the untransformed exit counts right? We could also set up a summarize with multiple variables right? So after I've got my group_by fy I can use summarize to define multiple summary outputs. In this case I might want to compute the mean of exits and the median of exits over time and I can return that as a single object. [ONSCREEN] > # and medians! > afcars_demo %>% + group_by(fy) %>% + summarize(exited_mn = mean(exited, na.rm=T), + exited_med = median(exited, na.rm=T)) # A tibble: 5 × 3 fy exited_mn exited_med 1 2015 347. 98 2 2016 372. 109 3 2017 361. 104 4 2018 376. 116. 5 2019 373. 107 [Frank Edwards] So now I have fy, exited mean, exited median. Well, we've gone that far, let's add standard deviations to the mix. [ONSCREEN] > # and standard deviations > afcars_demo %>% + group_by(fy) %>% + summarize(exited_mn = mean(exited, na.rm=T), + exited_med = median(exited, na.rm=T), + exited_sd = sd(exited, na.rm=T)) # A tibble: 5 × 4 fy exited_mn exited_med exited_sd 1 2015 347. 98 679. 2 2016 372. 109 706. 3 2017 361. 104 678. 4 2018 376. 116. 686. 5 2019 373. 107 683. [Frank Edwards] Now I might have a descriptive table that's ready for me to put into a report. If someone wants to know something about the mean exits, median exits, and standard deviation of exits over time, I have that, and I could use for example the kable function provided by knitr to provide me with a markdown table output that could go straight into a markdown document and make a nice report. But maybe we don't want to go through all that trouble; we could just use summary. [ONSCREEN] > # summary will bundle many of these for us > summary(afcars_demo) fy state sex raceethn Min. :2015 Min. : 1.00 Length:4100 Min. : 1.00 1st Qu.:2016 1st Qu.:17.00 Class :character 1st Qu.: 2.00 Median :2017 Median :29.00 Mode :character Median : 4.00 Mean :2017 Mean :29.44 Mean :15.93 3rd Qu.:2018 3rd Qu.:42.00 3rd Qu.: 7.00 Max. :2019 Max. :72.00 Max. :99.00 numchild phyabuse sexabuse neglect Min. : 10 Min. : 0.0 Min. : 0.00 Min.
: 0 1st Qu.: 53 1st Qu.: 11.0 1st Qu.: 0.00 1st Qu.: 31 Median : 279 Median : 40.0 Median : 13.00 Median : 161 Mean : 1016 Mean : 141.6 Mean : 44.36 Mean : 664 3rd Qu.: 1104 3rd Qu.: 147.0 3rd Qu.: 46.00 3rd Qu.: 670 Max. :22237 Max. :3267.0 Max. :1290.00 Max. :17204 NA's :929 NA's :1166 NA's :1340 NA's :951 entered exited Min. : 0.0 Min. : 0.0 1st Qu.: 30.0 1st Qu.: 22.0 Median : 132.0 Median : 106.0 Mean : 426.4 Mean : 365.8 3rd Qu.: 459.5 3rd Qu.: 403.0 Max. :8603.0 Max. :7722.0 NA's :1053 NA's :1023 [Frank Edwards] Summary of afcars_demo will understand whether we have a numeric or a character or a factor variable and again it will provide us with descriptive statistics for every single measure in the data. It'll also count the missing values for us for numeric variables. Sex here is a character variable so we're not getting any information on that. If we converted it to a factor we would get a category count. And you know it doesn't make much sense to take the mean of state because it's an identifier, but for numchild, exited, we can see we're returning a lot of the output that we just spent a lot of time computing. So summary can be really useful for just taking a quick glance at all of these descriptive statistics. But my typical preferred approach is something like this where I can drill down and really focus on exactly what it is that I want to know for particular sets of measures. That's all I've got for y'all today. Homework is posted if you would like some practice; I've given you a few challenges where you can use this data to play around and see if you've mastered the tools. So yeah I'll open it up for questions before we pivot to the second half of our program. [Paige Logan Prater] Yeah I don't see any questions coming in on the chat. [Frank Edwards] No that's great, we covered a lot of the Tidyverse approaches here earlier and so seeing this again is helpful. These really do map on to what we covered before; I like to think of the mean, the sd, all of these as summarizing functions, these collapsing functions that reduce dimension. This group_by summarize is a very very good way to approach those kinds of questions. Sorry Paige I cut you off because I just like talking so much. [Paige Logan Prater] No worries, yeah I think we can move to office hours and if folks do have questions y'all can stay and join Frank or Alex's room to ask about anything we just talked about as well as other things too. So we will transition over to our you know typical office hour setup. [Erin McCauley] I wanted to make a quick announcement that the Summer Research Institute applications are open and are due on March 1st. And so for those of you who are not familiar with the Summer Research Institute, it's a competitive, application-based program where folks submit kind of a research idea of something that they want to do. You know we have a lot of grad students, postdocs, early career folks, but also folks come from across the career spectrum. And if accepted you work closely with staff over about four days to move your idea from familiarity with the data and a strong research question to hopefully having the analyses done. And so if you are interested in using the data or you have an idea but you might need a little support in executing the idea, I highly recommend checking it out and you are welcome to pop into my breakout room to discuss it if that's of interest. [Paige Logan Prater] Cool so now we're going to switch over to breakout rooms.
[VOICEOVER] The National Data Archive on Child Abuse and Neglect is a joint project of Duke University, Cornell University, University of California San Francisco, and Mathematica. Funding for NDACAN is provided by the Children's Bureau, an Office of the Administration for Children and Families. [MUSIC]