Transcript for video titled "LeaRn R with NDACAN, Week 7, 'Introduction to Regression', April 18, 2025". [MUSIC] [VOICEOVER] National Data Archive On Child Abuse And Neglect [ONSCREEN CONTENT SLIDE 1] WELCOME TO NDACAN MONTHLY OFFICE HOURS! National Data Archive on Child Abuse and Neglect DUKE UNIVERSITY, CORNELL UNIVERSITY, & UNIVERSITY OF CALIFORNIA: SAN FRANCISCO The session will begin at 11am EST 11:00 - 11:30am - LeaRn with NDACAN (Introduction to R) 11:30 - 12:00pm - Office hours breakout sessions Please submit LeaRn questions to the Q&A box This session is being recorded. See ZOOM Help Center for connection issues: https://support.zoom.us/hc/en-us If issues persist and solutions cannot be found through Zoom, contact Andres Arroyo at aa17@cornell.edu. [Paige Logan Prater] All righty, I will kick us off, and then hopefully we'll have a couple more people join and we can get started. So welcome, everybody, and thank you so much for joining us for, what is this, our sixth session, our seventh session? This is our NDACAN monthly office hours, NDACAN being the National Data Archive On Child Abuse And Neglect. I don't think this is anyone's first time, but for folks who might be tuning in on the recording later on, this is a monthly office hours session that we hold. This year's is a little bit special because we're doing 30 minutes of intro to R before we jump into 30 minutes of office hours and breakout sessions. So today we will be talking about regression with Frank. As always, there are a couple of housekeeping items: if you have any questions throughout, please put them in the Q&A or in the chat, and I will read them aloud so we can capture them with the recording. The session is being recorded, and everything is available on our website. If you are having any Zoom issues or things like that, please reach out to Andres, who is here with us today, and he can help with troubleshooting any issues you have. And before I kick it off to Frank, I did want to plug our Summer Training Webinar Series that's coming up starting in July of this year; I'll put the link in the chat right now. That is another training offering that we do every year: we take five-ish weeks in the summer, and the theme for this Summer Training Series is 'The Life Cycle Of An NDACAN Research Project'. Alex and Noah will walk us through everything from identifying a research question all the way through data analysis and presentation of findings, so it'll be a really great series this year. Okay, and now with that, Frank, you can take it away. [Frank Edwards] Wonderful, thank you Paige, and welcome everybody. I am more than happy to be interrupted as we go, so if you have questions or want clarification (I don't know if you can unmute yourselves, but you can throw questions in the chat), I'll respond to them during the brief lecture. [ONSCREEN CONTENT SLIDE 2] LeaRn with NDACAN, Presented by Frank Edwards [Frank Edwards] So without further ado, let's learn with NDACAN. I'm Frank Edwards. I'm a research affiliate with NDACAN and an Associate Professor in the School Of Criminal Justice at Rutgers University, and I teach statistics for my day job, so this is a fun segue for me.
[ONSCREEN CONTENT SLIDE 3] MATERIALS FOR THIS COURSE Course Box folder (https://cornell.box.com/v/LeaRn-with-R-NDACAN-2024-2025) contains Data (will be released as used in the lessons) Census state-level data, 2015-2019 AFCARS state-aggregate data, 2015-2019 AFCARS (FAKE) individual-level data, 2016-2019 NYTD (FAKE) individual-level data, 2017 Cohort Documentation/codebooks for the provided datasets Slides used in each week's lesson Exercises that correspond to each week's lesson An .R file with example, usable R code for each lesson that will be updated and appended with code from each lesson [Frank Edwards] We know by now where the course materials are. All of the data, I know I say this every month, but all of the data are fake; do not use them for analysis. They do mirror the structure that you will find in the archive data, so you can certainly learn how to work with the data using these demo data sets, but do not use them for real-world analysis because they are fake. [ONSCREEN CONTENT SLIDE 4] WEEK 7: Introduction to Regression April 18, 2025 [Frank Edwards] All right, let's do some regression. [ONSCREEN CONTENT SLIDE 5] Data used in this week's example code AFCARS fake aggregated data ./data/afcars_aggreg_suppressed.csv Simulated foster care data following the AFCARS structure Can order full data from NDACAN: https://www.ndacan.acf.hhs.gov/datasets/request-dataset.cfm Census data ./data/ Full data available from NIH / NCI SEER [Frank Edwards] So linear regression is obviously a core tool in the statistical analysis toolbox that many of us might have been exposed to previously, but we're going to start from the basics today and relearn it. The core of what we're after with regression is typically one of two goals. Goal number one would be description and prediction. We can use linear regression as a tool to summarize a multivariate function that predicts some outcome variable y, and then think about the multivariate relationships between some set of predictors and that outcome. The other goal of regression is commonly causal inference, right? If we have designed our study in such a way that we have randomization, or as-if randomization, of the primary treatment variable that we're thinking about, we can use linear regression to identify the sample average treatment effect. We don't get that automatically, though; that comes from design. So we're going to talk about regression primarily as a prediction and descriptive tool today, because design is a little bit outside the scope of what we're going to do, but nevertheless our two goals are typically either to describe the data or to develop causal inference and learn about a causal relationship. Today we're going to use the AFCARS aggregated suppressed file, afcars_aggreg_suppressed.csv. We're also going to use the census data, which is available in ./data. The full version of the Census file that we're using is available from the National Institutes Of Health and National Cancer Institute SEER program. And again, you can order the AFCARS data from NDACAN if you want to work with the real data. [ONSCREEN CONTENT SLIDE 6] Linear Regression 101 [Frank Edwards] So, linear regression 101. [ONSCREEN CONTENT SLIDE 7] Lines We can define a line as: y = mx + b Where m is the slope and b is the y-intercept. [Frank Edwards] The core technology that underlies linear regression is the line, right? So, lines: we're going to do a little bit of geometry for a second. We define a line as y = m * x + b.
Where m is a slope and b is an intercept. What m defines is the unit change in y when x changes by one unit: how much does y go up when x increases by one unit? A negative slope indicates that y decreases as x increases; a positive slope indicates that y increases as x increases. And b tells us where the line intersects the y-axis. [ONSCREEN CONTENT SLIDE 8] Six line graphs. Graph 1: showing the function y = x+2, with a line slope 1, intercept 2 Graph 2: showing the function y = 2x+2, with a line slope 2, intercept 2 Graph 3: showing the function y = -2x+2, with a line slope -2, intercept 2 Graph 4: showing the function y = 0.2x+2, with a line slope 0.2, intercept 2 Graph 5: showing the function y = x+0, with a line slope 1, intercept 0 Graph 6: showing the function y = x-3, with a line slope 1, intercept -3 [Frank Edwards] So let's look at a few lines just to get a refresher on what we might see when we estimate linear regression models and the kinds of lines that we're talking about. Here I just want you to think about our mx plus b setup and imagine what m is. For the first graph, in the top left corner, we have y = x + 2, so implicitly m is equal to 1. We have a slope of one and a y-intercept of two. That is, when x increases by one unit, y also increases by one unit, and the line is shifted upward on the y-axis by two; you can see that the line crosses the vertical axis, where x is zero, at y equals 2. In our second plot we've doubled our slope, so now we have a slope of two. This is steeper, right? When x increases by one, now y increases by two. In the third plot we have a slope of magnitude two again, but it's negative, so when x increases by one, when we go to the right with x, y goes down by two, right? And we still have our y-intercept at two. For a slope between 0 and 1, or 0 and negative 1, we get a positive or negative slope that's a bit shallower, right? So a 0.2 slope here means that we need x to increase by five for y to increase by one; a one-unit increase in x is a 0.2 increase in y. I think we get the general idea. But lines are the underlying technology for everything we're doing in regression, and we need to be thinking about linear functional forms when we're estimating these models. We need to have an intuition about the kinds of models that we're working with and what the parameters that we're going to estimate mean. [ONSCREEN CONTENT SLIDE 9] The Linear Regression Model We can describe the relationship between a predictor variable x and an outcome variable y with a line: E(y) = mx + b Where the parameter m describes the expected change in y when x changes by one unit. The parameter b describes the expected value of y when x = 0 [Frank Edwards] So the linear regression model describes the relationship between a predictor variable x and an outcome variable y. I like to use the predictor/outcome language, but you might have heard the outcome referred to as the dependent variable and the predictor as the independent variable. I don't like that language because it implies causal inference when that's not always what we're doing, and we need to be very aware of that, so I like to say outcome and predictor, but you do you. So we're going to describe the linear regression model as: the expected value of y is equal to some slope m, times the variable x, plus an intercept b, right? And again, m describes the expected change in y when x changes by one unit.
This expectation operator, the E, indicates an average. So it tells us that the average value of y is equal to some parameter times x, plus b. That is, the average value of y depends on x. The parameter b again describes the expected value of y when x is zero. [ONSCREEN CONTENT SLIDE 10] Now with Greek! We can describe the relationship between a predictor variable x and an outcome variable y for unit i with the linear regression model: 𝑦_𝑖 = 𝛽_0 + 𝛽_1 𝑥_𝑖 + 𝜀_𝑖, 𝜀 ~ Normal(0, 𝜎^2) In plain English, these equations say: The variable y for unit i is equal to beta zero plus beta one times the variable x for unit i, plus an error term. That error term follows a Normal distribution with variance sigma squared. [Frank Edwards] So let's look at it with Greek, because this is typically how we're going to see it. This is the same basic format, but we're adding a couple of new elements. Now instead of the expected value of y we're describing the actual value of y, right? What we're saying here is, for the outcome variable y for unit i (we can imagine a vector y with, say, 10 rows, and each row of that vector, 1 through 10, is our i index), y sub i equals beta 0, which is our intercept term, plus beta 1 times the value of x at location i, plus this new guy, epsilon. Epsilon is that last Greek letter, and epsilon sub i is what we call our error term. A key assumption in linear regression is that the error terms, or residuals, follow a normal distribution centered at zero, with variance governed by a parameter sigma squared. That is, they have constant variance and are centered on the regression line itself. So we end up with a normal distribution of errors centered on the regression line, and that error term allows us to use the linear functional form for individual observations of y. Without the error term, what we get is the line that predicts the expected, or average, value of y for a particular value of x. The error term lets us turn this into a model that actually generates the observed values. So in plain English these equations say: the variable y for unit i is equal to beta 0 plus beta 1 times the variable x for unit i, plus an error term. That error term follows a normal distribution with variance sigma squared. [ONSCREEN CONTENT SLIDE 11] Regression Assumptions for Prediction Correct functional form (linearity in y as a function of x) Errors follow a Normal distribution centered at zero (identically distributed errors) Errors are not correlated with each other (independent errors) Additional assumptions are required for causal inference!! [Frank Edwards] Here are the key assumptions that we need to meet if we're using regression for prediction or description of our data. We need to assume the correct functional form, that is, that a line describes the relationship between y and x well. That's the first and perhaps most obvious assumption, but something that sometimes gets overlooked. If there are nonlinear relationships between y and x, we need to consider perhaps more complex modeling strategies; we need linearity to make sense as an assumption. We need to assume that the errors follow a normal distribution centered at zero.
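To make that generative model concrete, here is a minimal simulation sketch of the slide 10 setup, with made-up parameter values (illustrative only, not drawn from the AFCARS data):

# Simulate y_i = beta_0 + beta_1 * x_i + epsilon_i, with epsilon ~ Normal(0, sigma^2)
set.seed(7)
n <- 100
beta_0 <- 2    # intercept
beta_1 <- 0.5  # slope: expected change in y for a one-unit change in x
sigma <- 1     # standard deviation of the error term
x <- runif(n, 0, 10)
epsilon <- rnorm(n, mean = 0, sd = sigma)  # iid Normal errors centered at zero
y <- beta_0 + beta_1 * x + epsilon
summary(lm(y ~ x))  # estimates should land close to beta_0 = 2 and beta_1 = 0.5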
This is also called the identical distribution assumption, so if you've heard the phrase identically distributed errors before: we're assuming every error term in the data follows a normal distribution with mean zero and variance sigma squared. We also need to ensure that the errors are not correlated with each other. This is the independence of errors assumption. That is, the error term for observation two cannot be correlated with the error term for observation three. The key way this typically gets violated is through something called clustering. That is, if we have units that are similar to each other, maybe we've measured the same individual over time as in a longitudinal design, or maybe, in the case of AFCARS, we've measured children from the same county or state or family, we can expect that those errors are going to be correlated with each other. So we often violate this independent errors assumption, and when we do, it's okay, but we need to be aware that we're doing it and make adjustments to account for it. Now, additional assumptions are required for causal inference. We're not going to cover those today, but if we want to use linear regression to make an inference about a causal relationship, there are additional assumptions that we need to meet. [ONSCREEN CONTENT SLIDE 12] Over to RStudio [Frank Edwards] So I'm going to pop over to RStudio to show you how this works. All right, here we go. This is week7.R, and I realize I haven't given you all a homework assignment yet; I'll make sure to post one by the end of the day. As always, we're going to add some comments at the top of our script to ensure that we know what we're working with. Comment your code extensively; I cannot emphasize that enough. Let's library in the Tidyverse as we always do, because we're working in a Tidyverse format. I'm going to read in the AFCARS aggregated and census data with read_csv, and then we're just going to take a look. Oops, I will update this code, apologies, I had an error in my code. Let's look at the top of afcars_agg and census. And there we go. [ONSCREEN] > library(tidyverse) ── Attaching core tidyverse packages ──────────────────── ✔ dplyr 1.1.4 ✔ readr 2.1.5 ✔ forcats 1.0.0 ✔ stringr 1.5.1 ✔ ggplot2 3.5.1 ✔ tibble 3.2.1 ✔ lubridate 1.9.4 ✔ tidyr 1.3.1 ✔ purrr 1.0.4 ── Conflicts ─────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ℹ Use the conflicted package to force all conflicts to become errors > ### read data > afcars_agg<-read_csv("./data/afcars_aggreg_suppressed.csv") Rows: 4100 Columns: 10 ── Column specification ───────────────────────────────── Delimiter: "," chr (1): sex dbl (9): fy, state, raceethn, numchild, phyabuse, sex... ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. > census<-read_csv("./data/census_2015_2019.csv") Rows: 6120 Columns: 8 ── Column specification ───────────────────────────────── Delimiter: "," chr (2): state, st dbl (6): cy, stfips, sex, race6, hisp, pop ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
> # take a look > head(afcars_ind) Error: object 'afcars_ind' not found > # take a look > head(afcars_agg) # A tibble: 6 × 10 fy state sex raceethn numchild phyabuse sexabuse 1 2015 1 1 1 2180 352 88 2 2015 1 1 2 1245 198 46 3 2015 1 1 4 10 NA 0 4 2015 1 1 5 NA 0 0 5 2015 1 1 6 245 30 NA 6 2015 1 1 7 204 56 22 # ℹ 3 more variables: neglect , entered , # exited > head(census) # A tibble: 6 × 8 cy stfips state st sex race6 hisp pop 1 2015 1 Alabama AL 1 1 0 331305 2 2015 1 Alabama AL 1 1 1 33105 3 2015 1 Alabama AL 1 2 0 164122 4 2015 1 Alabama AL 1 2 1 2763 5 2015 1 Alabama AL 1 3 0 2540 6 2015 1 Alabama AL 1 3 1 1195 [Frank Edwards] So this is what our data look like. We can see that in AFCARS we have a series of variables that describe the numbers of children in foster care and their reasons for placement in foster care, while in census we have population counts by sex, race, and ethnicity. Now, I'm going to join these tables today, and I immediately see that there are a few things I need to do prior to joining. First, I can see that the year variable in AFCARS and census does not have the same name. I'm going to want to harmonize those, that is, I want them to have an identical name so that they can be joined with year as a key column. So I want to make sure to rename those. I also can see that state is defined in afcars_agg as a numeric, and I know that this is a FIPS code. I have that FIPS code in the census file as stfips, but census also has a variable named state, which is the fully spelled-out state name. So I need to rename things a bit. In afcars_agg I'm going to rename state to stfips so that it matches, and I'm going to rename the year columns, fy and cy, to year in both files. That's what this code block does here. [ONSCREEN] > ## harmonize names > afcars_agg<-afcars_agg %>% + rename(year = fy, + stfips = state) > > census<-census %>% + rename(year = cy) > # take a look > head(afcars_agg) # A tibble: 6 × 10 year stfips sex raceethn numchild phyabuse sexabuse neglect 1 2015 1 1 1 2180 352 88 568 2 2015 1 1 2 1245 198 46 331 3 2015 1 1 4 10 NA 0 NA 4 2015 1 1 5 NA 0 0 NA 5 2015 1 1 6 245 30 NA 71 6 2015 1 1 7 204 56 22 60 # ℹ 2 more variables: entered , exited > head(census) # A tibble: 6 × 8 year stfips state st sex race6 hisp pop 1 2015 1 Alabama AL 1 1 0 331305 2 2015 1 Alabama AL 1 1 1 33105 3 2015 1 Alabama AL 1 2 0 164122 4 2015 1 Alabama AL 1 2 1 2763 5 2015 1 Alabama AL 1 3 0 2540 6 2015 1 Alabama AL 1 3 1 1195 [Frank Edwards] Now our names should be ready to match, and that looks fine to me. I also don't want to work at the race-by-sex level today; I want to work at the total population level. I'm going to show you something a little bit advanced here. I'm going to take my afcars_agg table and do a group_by and a summarize, which we've seen before, where I perform an operation after I group the data by year and stfips; that is, it's going to perform the operations that come next over the groups of data that are subset by their values on those two grouping variables. We read this off in Tidyverse style as: take afcars_agg, and then (the pipe means "and then") group it by year and stfips, and then, here's where it gets a little fancy, summarize. I want to basically add up every single count variable in afcars_agg. Looking at the table, I can see that I have numchild, phyabuse, sexabuse, neglect, entered, and exited, and I know these are all numeric counts.
But I can also see that I've got some missing data in them. So I'm going to use this extra function called across. What across allows me to do is specify a range of columns, or columns that meet a particular condition. In this case I want to summarize across all of the columns between numchild and exited. And what do I want to do there? Well, I want to take a sum of those columns, because I want to collapse my data; I want to get a count of all children in foster care regardless of race and regardless of sex. In order to do that I need to sum across the rows where year is 2015 and stfips is 1, right? These are all the same state, and these are children of sex equal to 1 with various races and ethnicities. So I want to sum this column, this column, this column, this column, and also those other two columns, for all cases where year is equal to 2015 and stfips is equal to 1, and then move on to the next year-by-stfips category. The thing that's a little funky here is this block of code: what I've done is define a special, user-defined function as part of my summarize across. What I'm telling R is: you're going to do some operation on x, and then I define what that operation is; in this case it's a sum of x, making sure that we set na.rm = TRUE, because I want those NAs dropped from the sums, effectively treated as zeros. If I don't do this, anytime I add an NA to a numeric I will get an NA in return, so I'll effectively lose a lot of information. Let's go ahead and run this and see what happens. Now, oops, sorry, I'm going to need to rerun things, so I'm just going to hit Control+Option+B. I kind of broke the sequence of my code and I want to fix it and restart from the top, so Control+Option+B runs all code above that line, and that'll get us back to where we were. I just wanted to show you the head of afcars_agg. [ONSCREEN] ## aggregate over sex, race > afcars_agg<-afcars_agg %>% + group_by(year, stfips) %>% + summarize(across(numchild:exited, \(x) sum(x, na.rm=T))) `summarise()` has grouped output by 'year'. You can override using the `.groups` argument. > # take a look > head(afcars_agg) # A tibble: 6 × 8 # Groups: year [1] year stfips numchild phyabuse sexabuse neglect entered exited 1 2015 1 7509 1218 437 1954 3536 2914 2 2015 2 3588 551 216 2966 1483 922 3 2015 4 27616 2315 722 22601 12553 9880 4 2015 5 7529 887 446 3933 4009 2989 5 2015 6 79740 7479 2012 61328 31258 26832 6 2015 8 9022 860 515 3504 4733 4025 [Frank Edwards] Now you can see we don't have sex and we don't have race/ethnicity; we only have one row for each state-year, so 2015 state 1, state 2, state 4, state 5, and we have the sum of all of the values within those state-years. Great, perfect. I want to do the same thing on the census. Here I only have the pop column, so I don't need to worry about an elaborate summarize across; I can just summarize pop. So I'm going to group by year and stfips again and summarize pop, and let's just take a look at the head of census. That looks fine to me. [ONSCREEN] > ## aggregate over sex, race > census<-census %>% + group_by(year, stfips) %>% + summarize(pop = sum(pop)) `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
> head(census) # A tibble: 6 × 3 # Groups: year [1] year stfips pop 1 2015 1 1103159 2 2015 2 184134 3 2015 4 1629765 4 2015 5 706879 5 2015 6 9118819 6 2015 8 1258312 [Frank Edwards] Now we're going to do a left join; we've seen this before. I'm going to take afcars_agg, which I know has year and stfips, and I have census, which also has year and stfips. What I'm going to do is take afcars_agg and stick census onto it. So now that we've done that, take a quick look at our object, and we can see that we've got this new variable pop here. [ONSCREEN] ## join > dat<-afcars_agg %>% + left_join(census) Joining with `by = join_by(year, stfips)` > head(dat) # A tibble: 6 × 9 # Groups: year [1] year stfips numchild phyabuse sexabuse neglect entered exited 1 2015 1 7509 1218 437 1954 3536 2914 2 2015 2 3588 551 216 2966 1483 922 3 2015 4 27616 2315 722 22601 12553 9880 4 2015 5 7529 887 446 3933 4009 2989 5 2015 6 79740 7479 2012 61328 31258 26832 6 2015 8 9022 860 515 3504 4733 4025 # ℹ 1 more variable: pop [Frank Edwards] I'm going to run a glimpse on this so that you can see all of our variables. Glimpse is in the Tidyverse package. [ONSCREEN] > glimpse(dat) Rows: 260 Columns: 9 Groups: year [5] $ year 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 20… $ stfips 1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 1… $ numchild 7509, 3588, 27616, 7529, 79740, 9022, 4616, 898, 1… $ phyabuse 1218, 551, 2315, 887, 7479, 860, 455, 38, 219, 507… $ sexabuse 437, 216, 722, 446, 2012, 515, 150, 0, 19, 1332, 5… $ neglect 1954, 2966, 22601, 3933, 61328, 3504, 3092, 796, 7… $ entered 3536, 1483, 12553, 4009, 31258, 4733, 1727, 375, 4… $ exited 2914, 922, 9880, 2989, 26832, 4025, 1279, 227, 315… $ pop 1103159, 184134, 1629765, 706879, 9118819, 1258312… [Frank Edwards] And now you can see that our pop column joined on, so I know that state one is Alabama for 2015. We now have the population of the state of Alabama alongside all of our foster care measures, so that's great. What I want to think about today is the relationship between neglect and physical abuse cases. That is, I'm going to ask whether places with more neglect cases have more physical abuse cases. Do we learn anything about the number of physical abuse cases in a state by having information about how many neglect cases there are in that state? Anytime I'm getting set for regression analysis, I always want to do data visualization first. In this case a scatter plot is the perfect visual to think about the bivariate relationship between neglect and physical abuse. So I'm going to use ggplot, where I'll set my x, which is going to be my focal predictor, as neglect, and y, which will be my outcome, as physical abuse. Let's take a look at it. [ONSCREEN] > # first, visualize the bivariate relationship ggplot(dat, + aes(x = neglect, + y = phyabuse)) + + geom_point() Image 01 Description A plot of points on x-axis neglect and y-axis phyabuse showing semi-linear results and outliers in the relationship between the two variables. [Frank Edwards] This is the relationship. And when I'm looking at this plot, I want to think about the linearity assumption that I need to make for regression analysis. Now, it's pretty clear that there are some nonlinear things happening here. We can see obvious curvature, and we have these patterns of outliers over here, but the linearity assumption wouldn't be too terrible in this context. Linearity looks okay-ish, right? It doesn't look great, but it looks okay-ish.
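One quick, optional check on that linearity assumption (not part of the onscreen code) is to overlay an ordinary least squares fit on the raw scatter plot with ggplot2's geom_smooth:

# Optional: overlay a straight-line (OLS) fit to eyeball linearity on the raw scale
ggplot(dat, aes(x = neglect, y = phyabuse)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)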
Often, when I'm thinking about numeric data with some very large values, log transformations are a really good idea, because a logarithmic transformation will take multiplicative or exponential processes in the data and smooth them out as linear. So let's take a quick look at what a log transform would look like here. Rather than transform the variables themselves, I'm just going to do it in my ggplot: I can add the scale_x_log10() and scale_y_log10() functions to change my plot. [ONSCREEN] > ggplot(dat, + aes(x = neglect, + y = phyabuse)) + + geom_point() + + scale_x_log10() + + scale_y_log10() Image 02 Description A plot of points on a log base 10 x-axis neglect, and log base 10 y-axis phyabuse, showing clearer linearity in the relationship between the two variables. [Frank Edwards] And now, when we have a log-log plot of physical abuse and neglect, that is, the log base 10 of physical abuse on the y-axis and the log base 10 of neglect on the x-axis, there is very clear linearity in this relationship. So this is telling me that when I get to regression modeling, I'm going to want to consider a log transform. It should dramatically improve the fit of my models compared to a linear fit; that is, I should have a much better fit of my model to the data when I use a log transform compared to no transform at all. But let's estimate a baseline linear model without the log transform, just to see what we've got going on. I'm going to use the lm function, and we have something fancy going on here. When we use the lm function, we have a new kind of syntax that R uses; this is called a formula. The formula reads like this: estimate a linear model with physical abuse as the outcome, predicted by neglect. So we think about this as the left-hand side, the y, and the right-hand side, the x. We don't need to specify the parameters; R will do that for us, and by parameters I basically mean anything with a Greek letter. R will handle those; we just need to tell it which x variables, and which relationship between the x variables, we are considering. So the recipe for an lm call goes lm(outcome ~ predictor), and then we need to tell R where these variables are, so we give it data = dat; that's the name of the joined data frame that we put together. Once we have a model, the most common thing we're going to do is use summary to look at the parameter estimates. In this case summary is going to give me a whole lot of information about this model, which was fit by ordinary least squares regression. [ONSCREEN] > m1<-lm(phyabuse ~ neglect, + data = dat) > summary(m1) Call: lm(formula = phyabuse ~ neglect, data = dat) Residuals: Min 1Q Median 3Q Max -4039.6 -315.7 -115.9 209.4 7670.2 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2.268e+02 9.889e+01 2.294 0.0226 * neglect 1.705e-01 7.465e-03 22.844 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1267 on 258 degrees of freedom Multiple R-squared: 0.6692, Adjusted R-squared: 0.6679 F-statistic: 521.9 on 1 and 258 DF, p-value: < 2.2e-16 [Frank Edwards] In this case I get two pieces of information that I want to look at right away. First, my estimate for the intercept, which, if we think back to our equation, is beta 0.
This is the expected number of physical abuse cases when there are zero neglect cases. That is, our model says that we should expect to see, that estimate is 2.268e+02, so 226.8, roughly 227 physical abuse cases when there are zero neglect cases. That sounds a little funky, but that's what the intercept means. I strongly encourage you not to over-interpret your intercepts, especially because sometimes they're nonsense, right? We can't imagine a case where there were 227 physical abuse cases and zero neglect cases. That's something we would never actually observe in the data, so what beta 0 is really here for is as an artifact to help make our linear model fit, right? It's helping that slope, which is typically what we're more interested in; it's helping to ensure that the slope can fit the data appropriately. So let's turn to our slope, in this case beta 1 for neglect. This is the term that we multiply by neglect to see our expected change in physical abuse cases. And we have scientific notation here: in the intercept we had e+02, and in the slope we have e-01. That just means times 10 to the power of 2 in the case of the intercept, and in the case of the slope it means times 10 to the power of -1, which means shift the decimal point one place over. So this tells us that our slope is 0.17. What that means is that when we see one more neglect case, on average we will see 0.17 more physical abuse cases. I'll say that again, because this is the core of what we're typically interpreting: when there is one more neglect case, we expect to see, on average, 0.17 more physical abuse cases. We could multiply this by, say, 100, right? And we could say that when there are 100 more neglect cases, on average we expect to see 17 more physical abuse cases at the state-year level. The other thing we have going on here is our standard error. This is a measure of how much uncertainty we have in the location of this regression parameter. We typically want to think about the standard error on the scale of the estimate itself, and that's what the t value tells us: how many standard errors is our estimate away from zero? It's just the ratio of the estimate to the standard error. In this case we have an estimate of 0.17 and a standard error of about 0.0075, so that's a pretty small standard error; the estimate is about 22 standard errors away from zero. Now, the reason why we care about how far away it is from zero is because we're setting up a null hypothesis test, and that's the information right here. We are setting up a hypothesis test, implicitly, R is just doing it for us, where we are asking R to evaluate the probability that this value of the parameter, beta 1 in this case, could have come from a distribution centered at zero with a standard error of about 0.0075. And R is telling us that that's incredibly unlikely. We can reject the null hypothesis of a zero-centered distribution with a high level of confidence. I'm going to push on to the log-log fit. We saw when we visualized the data that the logarithmic transformation really did make the linearity assumption more viable, so let's look at what that model means.
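For reference, the m1 estimates and conventional 95% confidence intervals discussed above can also be pulled straight out of the fitted model object; a minimal sketch:

# Extract model 1's coefficients and 95% confidence intervals directly
coef(m1)    # beta_0 (intercept) and beta_1 (neglect slope)
confint(m1) # 95% confidence intervals for both parameters
coef(m1)["neglect"] * 100 # expected change per 100 additional neglect cases (about 17)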
[ONSCREEN] > # now try the log-log fit > m2<-lm(log(phyabuse) ~ log(neglect), + data = dat) > > summary(m2) Call: lm(formula = log(phyabuse) ~ log(neglect), data = dat) Residuals: Min 1Q Median 3Q Max -1.49030 -0.37036 0.04292 0.32774 1.12613 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.22343 0.24605 -4.972 1.21e-06 *** log(neglect) 0.95085 0.02885 32.959 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4654 on 258 degrees of freedom Multiple R-squared: 0.8081, Adjusted R-squared: 0.8073 F-statistic: 1086 on 1 and 258 DF, p-value: < 2.2e-16 [Frank Edwards] Now we have a very different set of estimates, because our scale is dramatically different. We're on the log physical abuse scale now; that is, our unit is log physical abuse cases. So what our model is saying here is that when the logarithm of neglect is equal to zero, that is, when there is exactly one neglect case, we have an intercept of minus 1.22, so minus 1.22 log physical abuse cases. The interpretation is going to get a little more complicated: for a one-unit increase in the logarithm of neglect, we get a 0.95-unit increase in the logarithm of physical abuse. We still see these estimates showing as statistically significant. The last thing I want to show you is including an extra predictor; that is, now we're doing multiple regression. We could think that maybe this relationship between physical abuse and neglect is just a function of population size, that is, we expect to see more of both when populations are large, so let's check that out. Let's estimate model 3, keeping our log-log relationship but adding a pop predictor, and take a look. [ONSCREEN] > m3<-lm(log(phyabuse) ~ log(neglect) + + pop, + data = dat) > summary(m3) Call: lm(formula = log(phyabuse) ~ log(neglect) + pop, data = dat) Residuals: Min 1Q Median 3Q Max -1.54065 -0.33934 0.04166 0.31900 1.11279 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -7.035e-01 3.543e-01 -1.986 0.0482 * log(neglect) 8.787e-01 4.499e-02 19.533 <2e-16 *** pop 6.047e-08 2.697e-08 2.242 0.0258 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.4646 on 252 degrees of freedom (5 observations deleted due to missingness) Multiple R-squared: 0.8115, Adjusted R-squared: 0.81 F-statistic: 542.6 on 2 and 252 DF, p-value: < 2.2e-16 [Frank Edwards] So population is clearly associated with the number of physical abuse cases, but we still see a pretty clear positive association between neglect cases and physical abuse cases. And let's use the Bayesian information criterion: when we fit two models like model 3 and model 2, we can look at a few things to learn whether we've improved our estimation. The first thing we can look at is the residual standard error, and think about whether we've explained any extra variation that was left over in our error terms across models. In model 2 we had a residual standard error of 0.465; here we have a residual standard error of effectively 0.465 as well, so that doesn't really tell us a whole lot of anything. I like to use what are called information criteria to help me make comparisons when I don't have strong theory telling me which model to prefer. BIC is going to give us a quick estimate that tells us which model fits the data better, and BIC is nice because it penalizes for model complexity.
[ONSCREEN] > ## compare models to see if we've improved our fit BIC(m3, m2) df BIC m3 4 351.8810 m2 3 354.7905 Warning message: In BIC.default(m3, m2) : models are not all fitted to the same number of observations [Frank Edwards] I'm also getting this warning message, and it is important: it's because we have missing data, and we have some zeros. We can't take a log of zero, so we end up with some cases that are dropped in R. Or we must be missing population data somewhere; I'd need to diagnose this a little more carefully, and I want to think carefully about that. But anyway, a lower score indicates an improved model fit, and in this case model 3 has a slightly lower BIC score than model 2, which tells me that it fits the data a little bit better. Lastly, I want to show you how to visualize results from these models. Once we're in a world where we're doing things like log transforms, and especially once we have interactions and other kinds of things going on in our model, it's really important to step back from interpreting our coefficients directly and to think about the outcome itself. So what I typically like to do is use prediction, that is, use the regression equation itself to generate predicted values, to help me understand the functional relationship that I'm describing with my regression model. So I'm going to take my dat data frame, ungroup it (because it still has that group-by-year grouping going on), and use the predict function to generate an expected value of y for each value of x that we observe in the data. Let's take a quick glimpse at that. [ONSCREEN] > glimpse(dat_plot) Rows: 260 Columns: 10 $ year 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015… $ stfips 1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 2… $ numchild 7509, 3588, 27616, 7529, 79740, 9022, 4616, 898, 1131, 358… $ phyabuse 1218, 551, 2315, 887, 7479, 860, 455, 38, 219, 5079, 1717,… $ sexabuse 437, 216, 722, 446, 2012, 515, 150, 0, 19, 1332, 530, 63, … $ neglect 1954, 2966, 22601, 3933, 61328, 3504, 3092, 796, 792, 1731… $ entered 3536, 1483, 12553, 4009, 31258, 4733, 1727, 375, 412, 1751… $ exited 2914, 922, 9880, 2989, 26832, 4025, 1279, 227, 315, 13531,… $ pop 1103159, 184134, 1629765, 706879, 9118819, 1258312, 762174… $ yhat1 560.0843, 732.6728, 4081.2653, 897.5869, 10685.8461, 824.4… [Frank Edwards] We can see that we have this new column, yhat1, which tells me the expected number of physical abuse cases when neglect cases are equal to 1954, 2966, 22601, and so on. So this is what model one says is my expected number of physical abuse cases in Alabama in 2015 based on that number of neglect cases. The difference between the observed value, 1218, and the prediction, 560, is the residual, right? That's our epsilon term; that's our error. So we have a fair amount of error for observation one; for observation two, 551 versus about 733, there's less error, right? We could keep going and think through our residuals, but what I really want to do is plot this. So I'm going to take dat_plot, set neglect as my x aesthetic and physical abuse as my y aesthetic, so this is just the scatter plot we had earlier, right? But now I want to add an extra geom, an extra layer, which is going to be my regression line. I'm going to set the y values for the regression line equal to yhat1, and the x values can stay as neglect.
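The step that creates dat_plot with the yhat1 column wasn't captured onscreen; based on the glimpse() output above and the model 2 version shown later, it was presumably something like:

# Reconstructed, not shown onscreen: add model 1's fitted values to the data
dat_plot <- dat %>%
  ungroup() %>%               # drop the year grouping left over from summarize()
  mutate(yhat1 = predict(m1)) # expected physical abuse cases at each observed neglect value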
[ONSCREEN] > ggplot(dat_plot, + aes(x = neglect, + y = phyabuse)) + + geom_point() > ggplot(dat_plot, + aes(x = neglect, + y = phyabuse)) + + geom_point() + + geom_line(aes(y = yhat1)) Image 03 Description A plot of points on x-axis neglect and y-axis phyabuse, with a regression line overlaid on the results at a 30-degree angle, showing the functional relationship that has been estimated. There is a cluster of points in the lower left that is close to the line, but points are more scattered away from the line as the x-axis value increases. [Frank Edwards] This shows me the functional relationship that I've estimated, right? This helps me see clearly what my regression line is saying. Now, normal errors? I think not. This has a clear case of heteroscedasticity, which is a really silly way of saying that the variance of our errors is not constant. We can see really clearly that for these low-count places our errors, that is, the distances between the regression line and the observed data points, are pretty small, while up here we have tremendously large errors. So this is telling me right away that we have definitely violated the constant error variance assumption. There are other ways to look at that, but that's okay; we'll stop there for now. Next I want to look at model two, which was our log-log fit; this is going to improve a little bit on that heteroscedasticity problem. I'm going to use predict a little differently this time. I'm going to use predict again, but I'm going to ask R to predict on model 2, and this time I'm going to ask it to return a standard error for the expected value; se.fit = TRUE tells R to return the standard error as well. And this is what that looks like. [ONSCREEN] > ## model 2, with uncertainty > m2_preds<-predict(m2, se.fit=T) > glimpse(m2_preds) List of 4 $ fit : Named num [1:260] 5.98 6.38 8.31 6.65 9.26 ... ..- attr(*, "names")= chr [1:260] "1" "2" "3" "4" ... $ se.fit : num [1:260] 0.0387 0.0319 0.0534 0.0294 0.0791 ... $ df : int 258 $ residual.scale: num 0.465 [Frank Edwards] Now we have a fit element, which is our yhat, our expected value of y, and we also have a standard error for that value of yhat. That standard error is going to help us understand the uncertainty we have in our model. I'm going to take those two pieces of this m2_preds object and put them into the data: I'm going to make a new column called yhat2 that's equal to m2_preds$fit, which is this value right here, and we know that's on the log scale, right? That's our expected log physical abuse cases, and se.fit is our uncertainty in that estimate of the mean, the expected log physical abuse cases. So this tells us a little bit about how certain we are about the location of that regression line. Let's just run this and take a quick look at it again.
[ONSCREEN] > dat_plot<-dat %>% + ungroup() %>% + mutate(yhat2 = m2_preds$fit, + se2 = m2_preds$se.fit) > glimpse(dat_plot) Rows: 260 Columns: 11 $ year 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015… $ stfips 1, 2, 4, 5, 6, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19, 2… $ numchild 7509, 3588, 27616, 7529, 79740, 9022, 4616, 898, 1131, 358… $ phyabuse 1218, 551, 2315, 887, 7479, 860, 455, 38, 219, 5079, 1717,… $ sexabuse 437, 216, 722, 446, 2012, 515, 150, 0, 19, 1332, 530, 63, … $ neglect 1954, 2966, 22601, 3933, 61328, 3504, 3092, 796, 792, 1731… $ entered 3536, 1483, 12553, 4009, 31258, 4733, 1727, 375, 412, 1751… $ exited 2914, 922, 9880, 2989, 26832, 4025, 1279, 227, 315, 13531,… $ pop 1103159, 184134, 1629765, 706879, 9118819, 1258312, 762174… $ yhat2 5.981780, 6.378604, 8.309575, 6.646924, 9.258756, 6.537103… $ se2 0.03867443, 0.03195000, 0.05336355, 0.02939348, 0.07913451… [Frank Edwards] There is our yhat2, and there is our standard error for that yhat. Now, what do I do with this? I'm going to go back to ggplot and do our scatter plot again, but this time I'm going to log transform neglect and physical abuse inside my ggplot call, to give me a scatter plot that looks very similar to our log base 10 scatter plot, just with different axes. [ONSCREEN] > ggplot(dat_plot, + aes(x = log(neglect), + y = log(phyabuse))) + + geom_point() Image 04 Description A plot of points on a natural log (base e) x-axis for neglect and a natural log (base e) y-axis for phyabuse, showing clearer linearity in the relationship between the two variables. [Frank Edwards] This is now a natural logarithm, an ln, so it's a base-e logarithm. And I'm going to add a fancy geom down here: geom_ribbon. What I'm doing is drawing my regression line, yhat2, that's just our basic line, but I want to add some uncertainty intervals to it. I'm going to tell R to set ymax equal to yhat2 plus 1.96 times the standard error of the prediction, and ymin equal to yhat2 minus 1.96 times that standard error. 1.96 is a magic number because it represents the upper and lower bounds of a symmetric 95% confidence interval for a normally distributed variable; we can think of 1.96 as the critical value for the upper and lower bounds of a confidence interval. And what this is going to look like is that. [ONSCREEN] > ggplot(dat_plot, + aes(x = log(neglect), + y = log(phyabuse))) + + geom_point() + + geom_ribbon(aes(y = yhat2, + ymax = yhat2 + 1.96 * se2, + ymin = yhat2 - 1.96 * se2), + alpha = 0.5) Image 05 Description A plot of points on a natural log (base e) x-axis for neglect and a natural log (base e) y-axis for phyabuse, showing clear linearity in the relationship between the two variables. There is a ribbon overlaid on the results at a 30-degree angle, with almost all points within a short distance of the ribbon. [Frank Edwards] Now we can see that we really don't have the same problem with non-constant error variance: regardless of our position on the line, the observed values are a pretty similar distance from it. The cloud of values doesn't spread out as we go up the x-axis, which is the pattern we don't want to see. We also can see that we don't have a whole lot of uncertainty about the location of the regression line. That ribbon indicates our 95% confidence interval for the location of the regression line, and we can see here that we should feel pretty confident that the regression line we've estimated fits relatively well; other, more extreme lines are not very likely.
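To make the ribbon arithmetic concrete, here is the interval for the first row (Alabama, 2015) worked out by hand from the yhat2 and se2 values in the glimpse() output above:

# 95% interval for the expected log physical abuse cases in row 1 (Alabama, 2015)
5.981780 + c(-1.96, 1.96) * 0.03867443 # roughly 5.906 to 6.058 on the log scale
# exp() would map these bounds back to the count scale if needed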
So that's it for the session today. [VOICEOVER] The National Data Archive on Child Abuse and Neglect is a joint project of Duke University, Cornell University, University of California San Francisco, and Mathematica. Funding for NDACAN is provided by the Children's Bureau, an Office of the Administration for Children and Families. [MUSIC]