Transcript for 2025 Summer Training Series, Session 5, Visualization and Finalizing the Analysis Presenter: Noah Won, M.S., NDACAN National Data Archive on Child Abuse and Neglect (NDACAN) [MUSIC] [VOICEOVER] National Data Archive on Child Abuse and Neglect. [ONSCREEN CONTENT SLIDE 1] Welcome to the 2025 NDACAN Summer training series! National Data Archive on Child Abuse and Neglect. Duke University, Cornell University, UC San Francisco, & Mathematica. [Paige Logan Prater] Hello, everybody. Good morning, good afternoon. My name is Paige Logan Prater. I am the Graduate Research Associate here at NDACAN. I'm just going to kick us right off. I see people are coming in, so we'll just go ahead and get started. Yeah, so yeah, my name is Paige Logan Prater. This is the NDACAN Summer Training Series. NDACAN is the National Data Archive on Child Abuse And Neglect. We're housed across these universities and institutions, listed on the slide. And we are funded through the National Children's Bureau. [ONSCREEN CONTENT SLIDE 2] NDACAN Summer Training series schedule. July 2nd, 2025 Developing a research question & exploring the data. July 9th, 2025 Data management. July 16th, 2025 Linking data. July 23rd, 2025 Exploratory Analysis. July 30th, 2025 Visualization and finalizing the analysis. [Paige Logan Prater] So, this is our last session of our Summer Training Series for this summer. If this is your first time joining us, don't worry. You will have access to recordings of all of our previous sessions, as well as any materials that we used. All of those will be posted on our website. I will put the link in the chat once I kick it over to Noah, and those recordings and materials will be posted in the next few weeks. Definitely in August though. So, look out for those. And like I said, this is our last session. The theme of our series is the life cycle of an NDACAN research project. Today, we'll be talking about visualization and finalizing analyses with Noah. 
And just a quick plug that we are currently working on our Monthly Office Hours series, which we'll kick off in, I believe, September and will run through the academic year. So, if you don't already have that on your radar, definitely keep an eye out. We'll be doing similar types of learning opportunities with the Monthly Office Hours series. And yeah, all of our information about all of our events, the Summer Training Series, the Monthly Office Hours, can also be found on our website. The next slide, please. [ONSCREEN CONTENT SLIDE 3] Life Cycle of an NDACAN research project. This session is being recorded. Please submit questions to the Q&A box. See ZOOM Help Center for connection issues: https://support.zoom.us/hc/en-us. If issues persist and solutions cannot be found through Zoom, please contact Andres Arroyo at aa17@cornell.edu. [Paige Logan Prater] And just a really quick, sorry, go back one, Noah. Cool. Really quick housekeeping items. All of our sessions will be recorded and they will be available for access later on, if you want to refer back to them or if you aren't able to join live for any of our sessions. And if you have questions throughout the presentation, please put them in the Q&A chat or the Q&A box, which is along the bottom of your Zoom screen. There's a little comment bubble with a question mark. It says Q&A there. As your questions come up throughout the presentation, go ahead and type your questions in the Q&A box and we will address them at the end. We'll save about five or so minutes and we'll get those answered to the best of our ability. If you do have other questions, or you feel like you have more things you want to talk about regarding the archive or any of our presentations, you can always reach out to us as well. I think that is it. I will kick it over to Noah to talk about visualization and finalizing our data analysis. [Noah Won] Thank you, Paige. As Paige said, you know, welcome everyone.
Thank you for joining our last session of the Summer Training Series. Again, my name is Noah. I'm a data analyst here at Duke and I'll be presenting on data visualization and finalizing the analysis. [ONSCREEN CONTENT SLIDE 4] SESSION AGENDA STS Review Regression Review Data Visualization Univariate Plots Bivariate Plots Regression Variable Types Assumption Assessment [Noah Won] So, an overview of our agenda: we're going to review the last Summer Training Series session, which was exploratory analysis, just to refresh our memory on some of the learning we did then with Alex. We'll move into data visualization, namely univariate and bivariate plots, the sorts of plots and the uses for those plots that we will show in the R code example later. Then we're going to go over a couple of other types of regression, besides simple linear regression and multiple linear regression, that you may come across in your research experience. As well as some key assumptions that come with most forms of regression; these are key assumptions that we make that form the backbone of a basic understanding of the regression models we fit. [ONSCREEN CONTENT SLIDE 5] STS REVIEW [ONSCREEN CONTENT SLIDE 6] REGRESSION Regression analysis is a statistical method for estimating the relationship between two (or more) random variables Linear Regression Equation: Y sub i is equal to Beta naught plus Beta sub one times X sub i plus Epsilon sub i. Confounders are variables that affect both the independent and dependent variable Expansions Stratification Fixed-State Effect Models Description of the image of the Equation: Dependent variable is Y sub i. Population Y intercept is Beta naught. Population Slope Coefficient is Beta sub one. Independent variable is X sub i. Random Error Term is Epsilon sub i. The Linear component is Beta naught plus Beta sub one times X sub i. The Random Error component is Epsilon sub i.
[Noah Won] Okay, so as you may remember, last time we built a simple linear regression model and a multiple linear regression model. So, as a refresher, regression analysis is a statistical method we use to show the relationship between an outcome, or dependent variable, and a predictor, or independent variable. Here below is the equation for simple linear regression. As a review, the Y sub i is our outcome variable, and it's a continuous outcome variable. It's followed by an intercept, which is used to adjust the line of best fit, as Alex talked about last lecture. Then our B1 and Xi: B1 is our slope coefficient, which shows whether the linear relationship between our predictor and our outcome variable is positive or negative. So the sign of this slope coefficient, which is a number, will dictate whether there's a positive or negative relationship. And of course, our independent variable Xi, and a random error term, which is estimated by the residuals from the line of best fit. It's essentially the distance between the data points and the line of best fit. So one method we covered last lecture to address confounders is including them in our model. We extended it to multiple linear regression, adding more terms, a B2 X2, a B3 X3, and so on; adding these terms allows the model to accommodate potential confounders. And also stratification models, stratifying on a categorical variable and creating different models based off of it. For example, two different linear regression models run stratified based off of sex, male and female, can also provide insight and control for confounders such as sex or other categorical variables. [ONSCREEN CONTENT SLIDE 7] DATA VISUALIZATION [Noah Won] So moving forward, we will be going into data visualization.
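To make that review concrete, here is a minimal sketch in R using fabricated data (not NDACAN data; every variable name here is invented for illustration) of a simple linear regression, a multiple linear regression that adds a potential confounder, and a pair of sex-stratified models:

```r
# Minimal sketch of the models just reviewed, on fabricated data.
set.seed(1013)
n   <- 200
sex <- sample(c("Male", "Female"), n, replace = TRUE)
x1  <- rnorm(n, mean = 10, sd = 2)           # predictor of interest
x2  <- rnorm(n, mean = 5,  sd = 1)           # potential confounder
y   <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)    # outcome with random error

simple   <- lm(y ~ x1)        # simple linear regression: y = b0 + b1*x1 + e
multiple <- lm(y ~ x1 + x2)   # multiple regression: adjusts for the confounder

# Stratification: fit separate models within levels of a categorical variable
dat    <- data.frame(y, x1, x2, sex)
by_sex <- lapply(split(dat, dat$sex), function(d) lm(y ~ x1 + x2, data = d))

coef(multiple)  # the sign of each coefficient gives the direction of the relationship
```

The sign of the fitted `x1` coefficient is positive and the `x2` coefficient negative, matching the fabricated data-generating equation.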
[ONSCREEN CONTENT SLIDE 8] USES FOR DATA VISUALIZATION Holistic Overview Provides a quick, concise, visual summary of data Association at a Glance Reveal trends or pattern in data Identify General Nature of Relationship e.g. Linear, Quadratic, Cubic Splines Assumption Evaluation Aids in the validation of assumption testing [ONSCREEN IMAGES ALT-TEXT] Three sample graph images shown only for illustrative purposes: a scatter plot with dots on a plane, a bar graph labeled histogram, and a graph labeled "Normal Q-Q Plot" showing dots on a plane. [Noah Won] So data visualization is a very important tool in any statistician's or researcher's repertoire. It has a number of uses. One is it provides a great holistic overview of the relationship between variables. In this example, we have a histogram, a QQ plot, and a scatter plot. And just at a glance, each gives a very quick, concise visual summary of the data. Oftentimes in research, we have to make educated decisions on how we will proceed with our model. So for example, if you look at the scatter plot, without any formal hypothesis testing it could give some insight: okay, maybe there's a negative linear relationship there, and the relationship seems to follow a straight line. Perhaps linear regression would be an apt model to fit to this sort of relationship between these two variables. And likewise, this histogram over here of height shows the distribution of a single variable: roughly normally distributed around what seems to be about 175 centimeters. So we know this distribution seems to be normal, which is an important assumption for some models. The QQ plot, which also aids in normality assessment, plots theoretical quantiles against the actual sample quantiles. Without going too much into it, it's supposed to follow a 45 degree line.
So this QQ plot seems to display a variable, or a model, that shows normality. And as I said before, these relationships can be linear, quadratic, or even more complex than that. We could fit what's known as cubic splines, which I'll go into in detail in the coding example and later in the slides. But essentially, a cubic spline fits cubic curves locally to the data. I'm sure you've seen something similar, like a loess curve: a line of best fit that's not necessarily linear, but kind of squiggles about and follows the data. Those are known as cubic splines. And as I said before, there's assumption evaluation: normality, linearity. These are key assumptions that we can assess holistically before proceeding to more formal hypothesis testing. [ONSCREEN CONTENT SLIDE 9] UNIVARIATE PLOTS Helpful for showing distribution of a single variable Types Histograms – Discrete Geom_histogram (in R) Dot Plots - Discrete Geom_dotplot (in R) Box and Whisker – Continuous Geom_boxplot (in R) [ONSCREEN SLIDE 9 IMAGE 1 ALT-TEXT] A sample histogram: X-axis is Class interval, Y-axis is Frequency A histogram is a type of statistical chart used to visually represent the distribution of a dataset, particularly when the data is continuous or numeric. The chart is comprised of adjacent bars (rectangles), each representing a range or "bin" of values. The height of each bar indicates the frequency or count of data points within that specific range or bin. [ONSCREEN SLIDE 9 IMAGE 2 ALT-TEXT] A Sample Box and Whisker Plot Graphic. A box and whisker plot (or simply box plot) is a graphical representation used to display the distribution of a dataset through its five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The box shows where the central 50% of the data lie, marked by Q1 and Q3, with a line inside the box marking the median value.
The "whiskers" extend from the box to the minimum and maximum values, or sometimes to boundaries for potential outliers, which may be marked individually. [Noah Won] Okay, so, kind of giving it away a little bit, but we're going to go into univariate plots. These are plots that show a single variable. And finding the distribution of a single variable, as I said before, is important: normality assumptions, seeing if there's a bimodal distribution, perhaps two humps. It really gives you a better view: okay, where's the density of values in this variable? And how can I use that information to model this variable? So the most common type of univariate plot you'll see is likely a histogram. Histograms, as a review, count frequency, and they're used for discrete variables. In this example over here, you see the histogram. Histograms can have various bin sizes. You can see every bin here has a 0.02 bin width. So anything that falls within a given bin, say between 9.05 and 9.07, gets counted, and the count is plotted on this Y axis as frequency. Increasing the bin width can be better for larger spreads of data, while making a smaller bin width can give more exact estimates. It really depends on what you're using it for. But generally, this histogram right here shows a roughly normal distribution, so it's good for assessing that sort of thing at a glance. Similarly, I don't have a dot plot in this presentation, but they're very similar to histograms, except instead of counts shown as bars, they show stacked dots. So it's just another way of visualizing the discrete information, with more of a lens on its discrete nature: each dot represents a count. So in that way, it's different.
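As a rough sketch of the two plot types just described, assuming ggplot2 is installed and using fabricated heights (not NDACAN data):

```r
library(ggplot2)

# Fabricated example: heights roughly normal around 175 cm
set.seed(1013)
heights <- data.frame(height = rnorm(300, mean = 175, sd = 8))

# Histogram: bars count observations per bin; binwidth trades off
# smoothing (wider bins) against detail (narrower bins)
p_hist <- ggplot(heights, aes(x = height)) +
  geom_histogram(binwidth = 2)

# Dot plot: one stacked dot per observation instead of bars,
# emphasizing the discrete, count-based nature of the data
p_dot <- ggplot(heights, aes(x = height)) +
  geom_dotplot(binwidth = 2)
```

Printing either object (for example, typing `p_hist` at the console) draws the plot.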
Some people, I think most people, prefer histograms, but dot plots have their time and place as well. And box and whisker plots as well. So box and whisker plots show, not counts, but quartiles. A quick review of box and whisker plots, and you can see this plot: we have the median in the center, which is the 50th percentile. We have Q1, which is the 25th percentile, and Q3, which is the 75th percentile. And then the whiskers are at Q1 minus 1.5 times the interquartile range and Q3 plus 1.5 times the interquartile range, where the interquartile range is Q3 minus Q1. And then of course, we have outliers that may exist outside of this range. So box and whisker plots are great at showing the distribution of a data set in a different way that a histogram might not be able to convey. If the whiskers were unevenly spaced, you could tell that perhaps the distribution is skewed. Or if there were more outliers existing outside of the whisker range, it would likely show that the data is weighted heavily on the upper scale of this variable, or the lower scale of this variable. Or if this interquartile range is huge, what does that say? Is the data more spread out? Is it closer together? So all these things are very helpful when determining distribution, especially in relation to each other. And we'll be going into that in the R coding example as well. [ONSCREEN CONTENT SLIDE 10] BIVARIATE PLOTS Helpful for showing relationship between two variables Linear, Quadratic? Types Scatterplot – Two Continuous Variables Geom_point Cubic Splines – Two Continuous Variables Geom_smooth [ONSCREEN SLIDE 10 IMAGE 1 ALT-TEXT] A cubic spline plot is a smooth curve that passes through a set of given data points, created by piecing together a series of cubic polynomials between each pair of points. Each segment is defined so that the entire curve is smooth. The example shows a line on a plane with x-axis labeled "y1" with values going from 0 to 25, and y-axis labeled "vcr" with values going from 0 to 2500.
There is a shaded band which covers the line from above and below: this represents the standard error band width. [Noah Won] Okay, so we also want to be able to plot two variables together and explore the relationship between them. Is it linear, quadratic? Scatter plots and cubic splines help us determine, at a visual glance, roughly what the relationship between the two variables is. If I had these two variables right here and was looking at this plot, you could clearly see it's not linear. There's some sort of curve here, fitted by a cubic spline, and I would say that the linearity assumption in linear regression would not be met here, and some remedial efforts would be needed to be able to use a linear regression. The two main types that we're going to cover in this lecture are scatter plots, essentially just this graph without the line of best fit: just dots, with the independent values on the x axis and the outcome values on the y axis, plotting the relationship between them. And then we're going to fit cubic splines on top of our scatter plots in the R coding example, to get a chance to show the nonlinear relationship between these two variables. [ONSCREEN CONTENT SLIDE 11] REGRESSION REVIEW [Noah Won] Okay, so we're going to go back into regression. [ONSCREEN CONTENT SLIDE 12] WHAT IS REGRESSION Simple Linear Regression models LINEAR relationship between a dependent and independent variable Composition Right Side Intercept: b0 Slope Coefficient: b1 Independent Variable: X1 Multiple Linear Regression [ONSCREEN SLIDE 12 IMAGE 1 ALT-TEXT] Simple Linear Regression Y equals b naught plus b sub one times X sub i. Dependent variable is Y. Y intercept (constant) is b naught. Slope Coefficient is b sub one. Independent variable is X sub 1.
[ONSCREEN SLIDE 12 IMAGE 2 ALT-TEXT] Graph of Simple Linear Regression Model. On an unlabeled x-y plane, a series of dots are clustered around a line that slopes at an approximate 40 degree angle starting from close to the origin. [Noah Won] Just a quick review, and I know I covered it earlier in the last review, but I think it's very important to hammer down regression before we go into other types, right? So it's a similar equation, just without the error term, though the errors do exist; this is just a simplified version of the equation. As I said before, B0 is the intercept. You see in this graph right here, the regression line starts at a non-zero value, which gives the regression line a better fit through these points if it can determine the best place to intercept. And the slope coefficient is the slope of this line. This is a positive correlation, so this number B1, if it were fit to this simple linear regression, would be a positive number. The magnitude I'm not entirely sure of, as there are no numbers on the axes, but it would definitely be positive. The x1 values are plotted right here and the y values here, and the line gives the predicted y values. And of course, multiple linear regression would add more independent variables, like plus B2 x2, plus B3 x3, et cetera. [ONSCREEN CONTENT SLIDE 13] VARIABLE TYPES Quantitative Discrete - Poisson Regression Continuous - Simple or Multiple Linear Regression Qualitative Binary - Logistic Regression Nominal – Multinomial Logistic Regression Ordinal – Ordinal Logistic Regression [ONSCREEN SLIDE 13 IMAGE 1 ALT-TEXT] Table outlining qualitative data types, namely, nominal, ordinal, and binary, and quantitative data types, namely, discrete and continuous. Qualitative nominal: Variables with no inherent order or ranking sequence. E.g. Gender, Race, etc.
Qualitative ordinal: Variables with an ordered series. E.g. Blood Group, Performance, etc. Qualitative binary: Variables with only two options. E.g. Pass/Fail, Yes/No, etc. Quantitative discrete: aka attribute data. Discrete data is information that can be categorized into a classification. Discrete data is based on counts. A finite number of values is possible, and the values cannot be subdivided meaningfully. E.g. number of parts damaged in shipment. Quantitative continuous: Continuous data is information that can be measured on a continuum or scale. Continuous data can have almost any numeric value and can be meaningfully subdivided into finer and finer increments. E.g. length, size, width. [Noah Won] Okay, so we know that simple linear regression and multiple linear regression are useful when we have a continuous outcome. That is, an outcome like blood pressure or age, something that has continuous values and is not separated into discrete increments. But what if we had a variable that we wanted to model that wasn't continuous? We can see here the definitions of discrete, binary, and ordinal variables. What if our variable is ordinal, that is, a categorical variable with an inherent ordered series? Perhaps letter grades, A through F. There's an inherent order in the American grading scale that we can utilize to extract more information than if we just had a nominal variable, something with no inherent order, like gender or race: something where we can't extract any information based on the order in which the categories come. Binary variables are probably one of the most helpful, most sought-after variable types to model. That's like a zero/one, pass/fail, yes/no.
How do we model something that has one of two values? We'll go into logistic regression and how it solves that problem: if we can't model it like a continuous variable, if it's just one or zero, then how do we model it, right? And of course, there are some other forms of regression that lie outside the scope of this lecture, but are very helpful in many forms of research. [ONSCREEN CONTENT SLIDE 14] ASSUMPTION ASSESSMENT Core Assumptions Homoscedasticity – Variance of residuals is constant for all levels of all independent variables Use plots of residuals. Independence - Each observation is independent of others Verify with study design Linearity Inspect scatter plots of independent and dependent variables No Multicollinearity Independent variables are not correlated with each other Check correlation tables [Noah Won] Okay, so we went over the types of regression. Now, these are a couple of the core assumptions that exist within these different types of regression. I'd say there are four core assumptions. We'll go over homoscedasticity, which is kind of a big word, but essentially it's ensuring that the variance of the residuals is constant for all levels of the independent variables. So, let me just go back right here. [ONSCREEN SLIDE 12 IMAGE 2 ALT-TEXT] Graph of Simple Linear Regression Model. On an unlabeled x-y plane, a series of dots are clustered around a line that slopes at an approximate 40 degree angle starting from close to the origin. [Noah Won] So, residuals are the distance between all these points and the line, right? This point right here is going to have a positive residual because it's above the line; the line is underestimating this point. And this point right here is going to have a negative residual, since the line is overestimating it.
So, we collect all these residual values, and we want to make sure that at any given X, the variance of the residuals is roughly equal. So, what does that mean, exactly? Looking at this graph, you can see that, I would say, more or less it is equal. If you cut thin slices of this, separated by X value, you can see that the spread of the residuals around this loess curve I have here is roughly the same. You could say it increases here, and you could see that the spread between here increases, especially around here. But the general assumption is that you want this to be equal, with an equal standard error around this line, right? And I guess in this example, we can holistically determine that roughly it is the same. This is kind of a large jump right here, compared to this one. But let's say, roughly, we have met the assumption of homoscedasticity. Independence, so this comes down to study design. One of the assumptions is that each observation is independent of the others. So if, in our data collection, we see that one person's data impacts another person's, that's a problem. An example could be: say there's a doctor collecting heights, and a bunch of NBA players come in, so he keeps measuring tall people and stops paying close attention. When someone shorter comes in, maybe he overestimates. Or it doesn't necessarily have to be height; it could be a test that the doctor performs, like a holistic test. He sees a bunch of people that are performing highly, so he's more likely to give someone who doesn't perform as well a higher test value, because he's used to seeing other people score high.
Seeing earlier observations impacts the values recorded for people down the line. But this is something we have to verify with study design. It's not necessarily something we can change or address in our data. There are a few remedial actions, but it comes down to study design. I spoke briefly before about linearity: the relationship between the independent variable, the X values, and the outcome, the Y values, should be linear, because linear regression, all these models, more or less assumes a linear relationship so that we can have one beta value. And, you know, lines of best fit are linear. There are other types of models we can use, and there are transformations we could use to address linearity issues, but at its core, this is the key assumption of linear regression models. And then multicollinearity. There can't be any correlation between the X values themselves. One of the assumptions is that each covariate we're adding provides independent, unique information. So, for instance, if you had BMI in your model and you also had height, those are not independent, because height is a parameter in calculating BMI. To address that, you would want to include only either BMI or height, and not both, because that would be redundant information and a violation of the no-multicollinearity assumption. [ONSCREEN CONTENT SLIDE 15] COEFFICIENTS OF DETERMINATION How do we determine how well our model performs? Coefficients of Determination R squared equation: R squared equals one minus the quotient of the residual sum of squares and the total sum of squares. Adjusted R squared equation: adjusted R squared equals one minus the quotient of the residual variance and the total variance, where the residual variance is equal to the residual sum of squares divided by its degrees of freedom and the total variance is equal to the total sum of squares divided by its degrees of freedom.
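As a small sketch of the two quantities on the slide (on fabricated data), both can be read off a fitted model in R, and R squared can also be computed by hand from the residual and total sums of squares:

```r
# Fabricated data, for illustration only
set.seed(1013)
x <- rnorm(100)
y <- 3 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

# R squared by hand: 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2 <- 1 - ss_res / ss_tot

# summary() reports both; adjusted R squared penalizes extra predictors
summ <- summary(fit)
c(by_hand = r2, reported = summ$r.squared, adjusted = summ$adj.r.squared)
```

The hand-computed value matches `summary()`'s reported R squared, and the adjusted value is always at or below it.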
[Noah Won] And the last concept that we're going over is the coefficient of determination. So, after we fit our model, we want to see how well it performs. The coefficient of determination, also known as R squared, or in another form, adjusted R squared, can give us information on how well our model adheres to the data points we're given. So, the equation is R squared equals one minus the variance of the residuals... I think my mic may have died out for a second. [Paige Logan Prater] Yeah Noah, you cut out, but I think we can hear you now. [Noah Won] Okay, great. Thank you, Paige. So, quickly, the formula for R squared is just: R squared is equal to one minus the variance of the residuals, that is, the kind of spread I was talking about from that line of best fit earlier, divided by the total variance. This ratio gives us a measure of how close our line is to the data points. We want our model to be able to predict the points as closely as possible; we want to minimize residuals. So, this R squared gives us information on how our model performs. But one drawback of R squared is that the more variables you add to your model, as in multiple linear regression, the greater your R squared gets. There's no penalizing term, so you could just add as many variables as you want, and adding variables to your model never decreases the R squared value, just by its inherent nature. So, adjusted R squared accounts for this through the n here, right? It accounts for the number of independent variables we have in our model, and it penalizes your adjusted R squared value for adding more variables to your model.
So, adjusted R squared, I'd say, is more of a standard than R squared, just because of this lack of a penalty in the R squared equation. Researchers could add in a bunch of independent variables and seemingly have an impressive R squared, when really they're taking advantage of R squared's inherent nature of not penalizing additional variables. [ONSCREEN CONTENT SLIDE 16] HELPFUL RESOURCE GGplot2 Cheat Sheet https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf GGplot2 Documentation https://ggplot2.tidyverse.org/reference/index.html [Noah Won] Okay, and these are very helpful reference sheets that are available publicly. I can click on one and show you. As we go into visualization, we're going to use the ggplot2 package, and this, I can show you real quick, is a small cheat sheet to aid you if you perhaps want to make a histogram or a density plot; it's one sheet that combines all the information you can use. It's been helpful for me. Another helpful tidbit is the ggplot2 documentation. One of the benefits of R is that the documentation is excellent. So if you have a question about what variables go where, outside of this R coding demonstration, the official ggplot2 website has very good documentation that you can use as a reference. [ONSCREEN CONTENT SLIDE 17] Questions? Alex Roehrkasse aroehrkasse@butler.edu Noah Won noah.won@duke.edu Paige Logan Prater paige.loganprater@ucsf.edu [Noah Won] And Alex was unable to make this presentation, but Paige's, Alex's, and my emails are here for any questions that may come up after the lecture. But without further ado, I think I will be moving to the coding portion of our demonstration. And yeah, like last week's session, we'll be following a similar template.
[ONSCREEN] # NOTES # # This program file demonstrates strategies discussed in # session 5 of the 2025 NDACAN Summer Training Series # "Data Visualization." # For questions, contact the presenter # Noah Won (noah.won@duke.edu). # Note that because of the process used to anonymize data, # all unique observations include partially fabricated data # that prevent the identification of respondents. # As a result, all descriptive and model-based results are fabricated. # Results from this and all NDACAN presentations are for training purposes only # and should never be understood or cited as analysis of NDACAN data. # TABLE OF CONTENTS # # 0. SETUP # 1. Univariate Plots # 2. Bivariate Plots # 3. Logistic Regression [Noah Won] We'll be going over data visualization, session five. If you have any questions, as I said before, please don't hesitate to reach out to my email noah.won@duke.edu. And we will again be using anonymized data, and any findings from this data aren't indicative of any true trends in childhood maltreatment data. Okay, so we are going to do a similar setup as we did last time. [ONSCREEN] # 0. SETUP # # Clear environment rm(list=ls()) # Installs packages if necessary, loads packages if (!requireNamespace("pacman", quietly = TRUE)){ install.packages("pacman") } pacman::p_load(data.table, tidyverse, mice) # Defines filepaths working directory project <- "C:/Users/nhwn1/Downloads/STS5/data" data <- "C:/Users/nhwn1/Downloads/STS5/data" # Set working directory setwd(project) # Set seed set.seed(1013) [Noah Won] I'm going to clear our environment as per good coding practice. We're going to install some packages that we're going to need: tidyverse, mice, data.table, pacman, but the tidyverse will likely include everything we'll need. We are going to run our project pathways. This is just where I set the project directory where I have my data, and I'll just reference this data line later when I'm reading the data.
I'm just going to set the working directory and set a seed for any randomization that we may need. Okay, so we're going to be going over our univariate plots first. If you remember, those are the histograms, the dot plots, the box and whisker plots. We're going to begin by reading in our anonymized AFCARS data. [ONSCREEN] > afcars <- fread(paste0(data,'/afcars_clean_anonymized_linear.csv')) > head(afcars, 20) StFCID STATE St RecNumbr DOB SEX RaceEthn CLINDIS 1: AL000001456616 1 AL 000001456616 2003-02-15 Female White No 2: AL000001524474 1 AL 000001524474 2003-01-15 Male White No 3: AL000001528009 1 AL 000001528009 2003-01-15 Male White Yes 4: AL000001597400 1 AL 000001597400 2003-01-15 Male White No 5: AL000001612758 1 AL 000001612758 2003-01-15 Female White No 6: AL000001634843 1 AL 000001634843 2002-09-15 Female Hispanic No 7: AL000001699782 1 AL 000001699782 2003-02-15 Male White No 8: AL000001699789 1 AL 000001699789 2003-02-15 Female White No 9: AL000001714510 1 AL 000001714510 2002-09-15 Male White No 10: AL000001718423 1 AL 000001718423 2002-12-15 Female White No 11: AL000001718927 1 AL 000001718927 2003-03-15 Male White No 12: AL000001917909 1 AL 000001917909 2002-11-15 Male White No 13: AL000002036685 1 AL 000002036685 2003-08-15 Female Black Yes 14: AL000002041179 1 AL 000002041179 2003-04-15 Male White No 15: AL000002061766 1 AL 000002061766 2003-03-15 Female AIAN No 16: AL000002239265 1 AL 000002239265 2003-02-15 Male White No 17: AL000002270506 1 AL 000002270506 2003-01-15 Male Hispanic No 18: AL000002276813 1 AL 000002276813 2003-07-15 Female White No 19: AL000002501625 1 AL 000002501625 2003-07-15 Male Black No 20: AL000002788485 1 AL 000002788485 2003-04-15 Male Black Yes TOTALREM FCMntPay AgeAtStart 1: 2 5425 16 2: 2 4371 16 3: 3 732 16 4: 1 732 16 5: 4 3102 16 6: 5 732 16 7: 2 732 16 8: 2 732 16 9: 2 8525 16 10: 1 732 16 11: 2 732 16 12: 2 732 16 13: 2 0 16 14: 2 0 16 15: 2 0 16 16: 2 2961 16 17: 3 0 16 18: 1 0 16
19: 2 0 16 20: NA 938 16 [Noah Won] And then we're just going to check to make sure everything ran well. Okay, great. So we have StFCID, STATE, a character state variable, RecNumbr, date of birth, sex, race and ethnicity, CLINDIS, total removals, foster care monthly pay, age at start. Okay, looks like everything ran well. We're going to look at a couple of frequency tables of the variables of interest. [ONSCREEN] > table(afcars$SEX) Female Male 88 323567 307598 > table(afcars$RaceEthn) AIAN Asian Black Hispanic Multiracial 10255 14846 3393 138569 134412 48829 NHPI White 1698 279251 > table(afcars$FCMntPay) 0 1 2 3 5 6 7 8 9 10 11 265467 858 3 7 4 5 23 10 8 15 8 [a lengthy frequency table of monthly payment values follows, ending with:] [ Reached getOption("max.print") -- omitted 8076 entries ] [Noah Won] It seems like we have a lot, but it looks good. We have our sex variable, with about 307,000 males and 323,000 females. A good spread on our race ethnicity variable. And the pay variable, obviously, is just going to have a lot of values, because it's a continuous variable. Okay, we are going to create a couple of dummy variables, right? So, for example, in our sex variable we have male and female, which is easy to read when we're putting it into tables. But for computers to read, we really want to set it to ones and zeros, that binary variable that we're going to use. This piece of code essentially creates another variable, SEX underscore d, the dummy variable, and sets it to one when sex is male and zero when it's female. Same thing with CLINDIS: we create another variable called CLINDIS underscore d, where yes is going to be a one and no is going to be a zero. Hispanic is one if the race ethnicity is Hispanic, zero if it's anything else. And then we're going to filter by age, keeping the most applicable age groups, since we're more concerned with a more recent age group. So we're going to run this code. [ONSCREEN] > afcars2 <- afcars %>% + mutate(SEX_d = case_when( + SEX == "Male" ~ 1, + SEX == "Female" ~ 0), + CLINDIS_d = case_when( + CLINDIS == "Yes" ~ 1, + CLINDIS == "No" ~ 0), + Hispanic = case_when( + RaceEthn == "Hispanic" ~ 1, + TRUE ~ 0), + age = as.numeric(difftime(Sys.Date(), DOB, units = "days")) / 365.25 + ) %>% + filter(age <= 30) [Noah Won] I'm going to check some of these variables to make sure they were derived correctly. Right, we've got ones and zeros.
[ONSCREEN] > table(afcars2$SEX_d) 0 1 323559 307580 > table(afcars2$CLINDIS_d) 0 1 391018 148137 > table(afcars2$Hispanic) 0 1 496820 134407 > table(afcars2$age) [a lengthy frequency table of fractional ages, running from roughly 4.3 to 30, follows] [Noah Won] So that seems to align with what we had seen for the sex variable. Similarly with CLINDIS, okay, looks good. Hispanic and age, nice. Okay, oh, and I forgot to mention how age is derived: I just took the elapsed time since the date of birth in days and then divided it by 365.25 to get the age. So it all looks good. The variables are derived well, and there are no crazy values. Let's start graphing. We're going to start with the ggplot2 package. We are going to create a very simple histogram. The syntax in ggplot2 with this code is: we're going to assign this plot to the variable hist1. That's just an arbitrary name I came up with; this arrow means we're assigning the plot to hist1. The syntax begins with the ggplot function, and it takes in these parameters: afcars2, which is the name I assigned to this data set. Whatever is in the first parameter position, we're going to use as the data set. And then we're going to use the aes function within it. It's the aesthetics function, and essentially it gives information about the plot that we would want to see. So we're going to do aes, and then in the function, x equals age. So our x variable is our age variable, a continuous variable; we're just going to see how this histogram shows the distribution of age. And then this part right here is adding a geom_histogram, essentially just saying, okay, build a histogram from this information that we have here. So let's see what this makes.
[ONSCREEN] > hist1 <- ggplot(afcars2, aes(x = age)) + geom_histogram() [ONSCREEN R CODE IMAGE 01] A histogram bar graph with age as the x axis, and count as the y axis. Age runs from approximately 1 to 27 and the maximum count is approximately 43,000, occurring around age 5. [Noah Won] Okay, it seems like a relatively simple histogram. Looks good, but we don't have any titles. We don't have any labels; it says count, lowercase, and age. And these bins seem kind of large, so it seems kind of murky. The color is a little bit off too. I think we can do better than this. So let's make an improved version and add some titles. We're going to assign this to hist2, a different histogram that we can call later. And, same thing, we're going to do ggplot, calling the afcars2 data set, using age as the x variable, using geom_histogram. But this time, we're going to add the labs function, which essentially gives information about titles and subtitles that we can add to our graph. So let's go ahead and run this. We're adding a "Histogram of Age" title at the very top. At the x axis, we're just capitalizing age, because by default it's going to use the variable name. In this case age is not necessarily a very bad name, but if you have something like FCMntPay, it's going to look kind of dirty. And then we're going to change count to frequency. So let's see how this looks. [ONSCREEN] > hist2 <- ggplot(afcars2, aes(x = age)) + + geom_histogram() + + labs(title = "Histogram of Age", + x = "Age", + y = "Frequency") > hist2 `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. [ONSCREEN R CODE IMAGE 02] A histogram bar graph with title "Histogram of Age", "Age" as the x axis, and "Frequency" as the y axis. Age runs from approximately 1 to 27 and the maximum count is approximately 43,000, occurring around age 5. [Noah Won] Okay, looks good, looks better.
We have a title, we have a frequency label, we have x and y axis labels, but it seems kind of dull still. It's kind of hard to determine where one bar starts and another ends, and it seems like we could use a larger number of bins; there may be some relationships here that we're missing. So let's try and make this a little bit better. Okay, so we're going to make histogram 3 and add a couple more aesthetic parameters. Again, same thing: ggplot, with the aesthetic x equals age, same as before. And in the geom_histogram function, we can actually add more information about what we want in this histogram. In this example, we're going to be adding more bins; a bin is the length of age that each bar captures. Adding more bins makes the bin width smaller, so you can either decrease the bin width or add bins; we're going to add bins. We're going to fill it with a different color, steel blue, and we're going to make the outline white as well. And of course we're adding the labels that we added earlier. Let's take a look at this. [ONSCREEN] > hist3 <- ggplot(afcars2, aes(x = age)) + + geom_histogram(bins = 50, fill = "steelblue", color = "white") + + labs(title = "Histogram of Age", + x = "Age", + y = "Frequency") > hist3 [ONSCREEN R CODE IMAGE 03] A histogram bar graph with title "Histogram of Age", "Age" as the x axis, and "Frequency" as the y axis. Age runs from approximately 1 to 27 and the maximum count is approximately 43,000, occurring around age 5. Blue bars now appear discretely separated from each other. [Noah Won] Okay, looks much better. The color shows the difference between the bars, so we can actually see them: the outline is white and the fill is steel blue.
It's definitely added a number more bins here, so we can see more of the shape of this distribution; it's more exact, right? And I think this histogram is a lot better than what we started with. So, you know, I'm pretty happy with this, and I think we'll move on. We can see just from the histogram of age that there's a huge modal spike right here, seemingly between zero and 10, then it goes down, and then we see another slight spike here at 20. That tells us this data might not be normally distributed, but more likely bimodally distributed: a smaller hump here and a larger hump here. Okay, but what kind of information can we get from a box and whisker plot? For the sake of time, I'm going to skip over most of these steps, because they're very similar to what we did before. [ONSCREEN] > box1 <- ggplot(afcars2, aes(x = SEX, y = age)) + + geom_boxplot(fill = "skyblue", color = "darkblue") + + labs(title = "Box and Whisker Plot of Age by Sex", + y = "Age", x = "Sex") + + theme_minimal() > box1 [Noah Won] It's changing the color of the box plot and changing the titles, but I will go over the difference in syntax. Box and whisker plots are univariate, but we're actually going to split the plot by sex, because box and whisker plots are most useful when we're comparing them against other box and whisker plots, right? So we're going to create three box and whisker plots of age based on the different values of sex: male, female, and missing. That's why we added this aesthetic value, x equals SEX, y equals age; we added another variable in there. We're going to use geom_boxplot this time, and we're going to add a theme. There are a number of themes; they just make your graphs look prettier.
So, like I said, in the documentation you can see all the themes that ggplot offers. I'll just be using minimal, so let's see. [ONSCREEN R CODE IMAGE 04] A box and whisker plot with x-axis labeled "Sex" and y-axis labeled "Age". There are three boxes. The first box is unlabeled and rises from about age 5 to age 15, with lines that stretch down to 0 and up to 25. The second box is labeled "Female" and rises from about age 7 to about age 18, with lines that stretch down to 0 and up to 29. The third box is labeled "Male" and rises from about age 7 to about age 17, with lines that stretch down to 0 and up to 30. [Noah Won] Okay, nice. So, we can see through this box and whisker plot that male and female are very similar in distribution. Perhaps males have a very slightly higher age, and perhaps a slightly longer whisker over here, showing that maybe they have a heavier distribution over here. And we see that the missing values are distributed lower, so perhaps there's some kind of relationship between a missing sex variable and having a younger age, which is likely related to how the data were collected, but it's hard to tell for certain. But yeah, we can see the distributions and compare them between the various levels of this sex variable. Okay, so that's fine and dandy if you want to look at one variable, but what if we want to measure the relationship between two continuous variables? For that we're going to go over scatter plots, which are very helpful for determining the relationship between two variables. The two continuous variables we're going to compare are age and FCMntPay: essentially, what is the relationship between the foster care monthly payment and the age of the person? We are going to assign this ggplot to scatter1, the first scatter plot that we're making, and we are using the afcars2 data as I said before.
We are comparing age and FCMntPay. Age is our independent variable: we want age to be on our x-axis, the horizontal axis, and the pay to be on our y-axis. We are using geom_point; that's the key word here for scatter plots. And we're filling in the color with steel blue. A new parameter we have here is alpha = 0.6. It's just another aesthetic parameter, a value between zero and one that sets the opacity of the dots we're plotting. So if you have a ton of dots, maybe you'd want them to be less opaque, so that you can see the dots through each other. And the size of the dots we can assign as well, just a value; if there are a lot of dots, we can make them smaller or bigger to see them better. You can play with these parameters to get a visually appealing plot. Again we're going to use the labs function to assign titles, and we're going to be applying the minimal theme. So let's see how this runs. It might take a little bit, just because there's a lot of information in this plot; as you saw, there are like 300,000 males. [ONSCREEN] > scatter1 <- ggplot(afcars2, aes(x = age, y = FCMntPay)) + + geom_point(color = "steelblue", alpha = 0.6, size = 2) + + labs(title = "Scatterplot of Foster Care Monthly Payment vs Age", + x = "Age", + y = "Foster Care Monthly Payment (FCMntPay)") + + theme_minimal() > scatter1 Warning message: Removed 1366 rows containing missing values (`geom_point()`). [Noah Won] And so it takes a lot of time for the R program to run it. We'll wait for a little bit; I'm sure it will be well worth it. Like I said, there are many different options; I can reference this cheat sheet right now while we're waiting. [ONSCREEN] A view of the "Data visualization with ggplot2 CHEAT SHEET" PDF available at the following link: posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf.
[Noah Won] So we have one continuous variable, right? You can do geom_histogram like I did before. Or, let's see, two continuous: we did geom_point, right? And in the geom_point function you could also put the X and Y values. So we did alpha, we've used color, we've used fill, we haven't used shape, we've used size. These are just the parameters that you can put in here. And yeah, there's lots of useful information here that we haven't used yet. We'll be using geom_smooth in a second to show cubic splines, and we used geom_boxplot, which is for one discrete and one continuous variable. Let's see if this is coming along. It should be almost done; I don't remember it taking so long. But I suppose in the meantime, I can cover this other plot that I was going to show. [ONSCREEN] #There seems to be a positive relationship between age and Monthly Foster Care Payment but it is hard to tell #Let's fit a cubic spline in the data to see > scatter2 <- ggplot(afcars2, aes(x = age, y = FCMntPay)) + + geom_point(color = "steelblue", alpha = 0.6, size = 2) + + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), color = "darkred", se = FALSE) + + labs(title = "Scatterplot of FCMntPay vs Age with Cubic Spline Fit", + x = "Age", + y = "Foster Care Monthly Payment (FCMntPay)") + + theme_minimal() [Noah Won] So we're going to add a cubic spline to the graph that we made up here. We're going to follow the same method. We're going to use the same ggplot variables, because we're not trying to change the X and Y variables. We're going to use the same geom_point as we did; this builds the scatter plot. And then we're going to use geom_smooth, which builds the smoothing line, here the cubic spline, on top of the scatter plot.
So that's why we're adding it in addition to geom_point: geom_point makes the scatter plot, and geom_smooth makes the curve. We're going to specify which model, or which method, we're going to use to create the cubic spline. We're going to use the generalized additive model, gam for short. We're going to specify the formula: our outcome Y, FCMntPay, is equal to a smooth function of X, and the basis is going to be cubic splines, cs for short. We're going to set the color to dark red. And this se option stands for standard error. Oh, okay, here it comes; we have it right here actually. [ONSCREEN R CODE IMAGE 05] A graph of blue dots. X-axis is labeled "Age", y-axis is labeled "Foster Care Monthly Payment (FCMntPay)". Most of the dots are between ages 3 and 27, and below FCMntPay value 20,000. [Noah Won] So we see all these dots plotted against age. It's kind of hard to see through this really high density of dots here; like I said, there are more than 500,000 observations, so it's going to be a high density. It's hard to see the relationship, but it seems there's a relatively positive relationship here. We're going to fit this cubic spline to see, more holistically, what the relationship is between the points. As I said, the standard error band is that little bandwidth shown right here. [ONSCREEN SLIDE 10 IMAGE 1 ALT-TEXT] A cubic spline plot is a smooth curve that passes through a set of given data points, created by piecing together a series of cubic polynomials between each pair of points. Each segment is defined so that the entire curve is smooth. The example shows a line on a plane with x-axis labeled "y1" with values going from 0 to 25, and y-axis labeled "vcr" with values going from 0 to 2500.
There is a shaded band which covers the line from above and below: this represents the standard error band width. [Noah Won] With this standard error band width, you can kind of see the uncertainty. I would display it for us, but since this distribution is so heavily populated, I'm going to set se to FALSE, because you can't even see the band if I set it to TRUE. Either way, it's hard to see because of the sheer density of points. We're going to apply our titles, "Scatterplot of FCMntPay vs Age with Cubic Spline Fit" and the axis titles, and then we're going to apply a similar theme. [ONSCREEN] > scatter2 Warning messages: 1: Removed 1366 rows containing non-finite values (`stat_smooth()`). 2: Removed 1366 rows containing missing values (`geom_point()`). [Noah Won] So hopefully this one moves a little bit quicker; I want to leave time for questions. But it should fit a line on top of that scatter plot we had just made. While we're waiting for this, since it's just going to be a cubic spline, I will start going over the logistic regression model. It's fairly quick and simple to run. Oh, that was a little bit quicker than expected. [ONSCREEN R CODE IMAGE 06] A graph of blue dots. X-axis is labeled "Age", y-axis is labeled "Foster Care Monthly Payment (FCMntPay)". Most of the dots are between ages 3 and 27, and below FCMntPay value 20,000. A line is overlaid on the dots which runs along low values of FCMntPay from age 0 to 30, with a slight hump in the line at age 23. [Noah Won] So you can see here the slight positive, very slight positive association between age and foster care monthly payment. This line is so close to the bottom likely because there's a high density of points here that we can't see. But cubic splines fit local trends very well, so you can see that it humps up here and then humps back down. But yeah, that's just how to fit a cubic spline essentially.
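[Transcript editor's note] The presenter fits the spline in R with ggplot2's geom_smooth. As a rough analogue only, not the presenter's code, here is a minimal Python sketch using scipy's CubicSpline, which pieces cubic polynomials between points exactly as the slide's alt-text describes. Note that this spline interpolates through every point, whereas geom_smooth(method = "gam") fits a penalized smoothing spline to noisy data; the age and payment values below are hypothetical, chosen for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical (age, payment) points for illustration only.
age = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 25.0])
pay = np.array([400.0, 450.0, 500.0, 520.0, 700.0, 600.0])

# Piece cubic polynomials between each pair of points; the
# resulting curve is smooth and passes through every point.
spline = CubicSpline(age, pay)

print(spline(age))   # reproduces the original payments
print(spline(12.5))  # a smooth value between the surrounding points
```

Unlike this interpolant, a smoothing spline (as in the R code above) trades exact fit for a smoother curve, which is why it can track the "hump" in dense data without chasing every observation.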
But I'm sorry, I'm kind of jumping around; I thought this would take longer. But our logistic regression model: if you recall, we use logistic regression to model binary outcome variables. Our binary outcome variables take the value of zero or one. And we are going to be modeling clinical disability, this CLINDIS variable. Essentially, the indicator equals one if they have been clinically diagnosed with a disability, or zero if they have not. And we want to see the relationship between being clinically diagnosed with a disability and age. [ONSCREEN IMAGE] Regression Equation: Dependent variable is Y sub i. Population Y intercept is Beta naught. Population Slope Coefficient is Beta sub one. Independent variable is X sub i. Random Error Term is Epsilon sub i. The Linear component is Beta naught plus Beta sub one times X sub i. The Random Error component is Epsilon sub i. [Noah Won] So to go over the model again, it's very similar to the continuous linear regression model, except this dependent variable is not continuous, but we want to make it continuous, because we want to be able to model it the same way as we do linear regression. So how do we turn a variable that goes from zero to one into a continuous variable? Well, instead of using the zero-one binary variable, we can predict the probability of the outcome when given a certain age. We don't exactly use probabilities, however, because probability is restricted between zero and one. So what's something analogous to probability that we can use that's continuous, that can, you know, be negative or positive, be anything? And that's odds. Odds and probability are very closely related. I'm sure you've heard the word odds before, but what are odds?
Odds are a ratio of outcomes that we can use to model chance. [ONSCREEN] # 4. Logistic Regression #What if we want to model an outcome that is NOT continuous but binary (i.e. Has two different values, yes/no, male/female, etc.) #We can use a logistic regression model which converts the outcome variable to log odds #Odds is a ratio of outcomes that we can use to model chance #What is the probability we don't roll a 6 on a fair die? 5/6. What are the odds? 5 To 1. [Noah Won] Probability and odds are related by an equation: the odds equal p over one minus p. But an example that I find helpful for telling the difference between probability and odds is: what is the probability we don't roll a six on a fair die? There are five options where we don't roll a six, out of six total options, so the probability is five sixths. Odds work slightly differently. What are the odds? We take the number of outcomes where we don't roll a six and put that against the number of outcomes where we do roll a six. There are five outcomes where we don't roll a six and one outcome where we do, so the odds are five to one, or five. This is very helpful in our modeling because now we have a way to model data from zero to infinity. And in order to get to negative infinity, we're actually going to be modeling log odds. That's where "logistic" comes from: in logistic regression, we model the log transformation of odds in order to get a continuous variable. That was a mouthful, but getting into the model: we're going to use glm as the function. We're going to use CLINDIS as the outcome, as we had in previous examples; this is the Y variable, it's zero-one. And we're going to use age to model it.
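[Transcript editor's note] The die example above can be checked in a few lines of code. This sketch is not from the session, and the odds helper is hypothetical; it just applies the relation odds = p / (1 − p) described by the presenter.

```python
import math

# Convert a probability to odds: odds = p / (1 - p).
def odds(p: float) -> float:
    return p / (1 - p)

# Probability of NOT rolling a six on a fair die: 5 of 6 outcomes.
p_not_six = 5 / 6
print(odds(p_not_six))   # ~5: five ways to miss a six for every one way to roll it

# Log odds (what logistic regression models) can take any real value.
print(math.log(odds(0.5)))   # 0.0 -- even odds
print(math.log(odds(0.1)))   # negative
print(math.log(odds(0.9)))   # positive
```

The last three lines show why the log transformation matters: probabilities live in (0, 1), odds in (0, ∞), but log odds span the whole real line, so they can be modeled linearly.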
Then we're going to use data = afcars2 as we had created above, and we're going to use family = binomial, which is basically saying, do logistic regression. We're going to run this model. [ONSCREEN] > logmodel <- glm(CLINDIS_d ~ age, data = afcars2, family = binomial) > summary(logmodel) Call: glm(formula = CLINDIS_d ~ age, family = binomial, data = afcars2) Deviance Residuals: Min 1Q Median 3Q Max -1.3933 -0.8209 -0.6331 1.2456 2.0124 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.3358391 0.0088388 -264.3 <2e-16 *** age 0.0944796 0.0005468 172.8 <2e-16 *** --- Signif. Codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 633975 on 539154 degrees of freedom Residual deviance: 602576 on 539153 degrees of freedom (92072 observations deleted due to missingness) AIC: 602580 Number of Fisher Scoring iterations: 4 [Noah Won] Okay, we have some deviance residuals, and we have our coefficients here. Of course, we have our intercept and we have our age coefficient. First we check the p-value: is it less than 0.05, or whatever alpha we set our threshold to be? And yes, it's very much below that. And we have our estimate here. So we're essentially modeling age linearly against log odds, so this unit is in log-odds increments, which is not necessarily helpful for interpretation. So we're going to transform it back to odds in order to be able to interpret it. We see our coefficient here, our estimate, 0.0944796. In order to turn it back, we're just going to exponentiate it; as you know, ln and e cancel each other out. So we're going to exponentiate this estimate, and we see that our odds estimate is 1.099.
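[Transcript editor's note] The back-transformation the presenter describes is a single exponentiation (done in R with exp()). A minimal sketch, here in Python, using the age coefficient from the model output above:

```python
import math

# Age coefficient from the logistic regression output (log-odds scale).
beta_age = 0.0944796

# Exponentiate to undo the log: exp(log-odds coefficient) = odds ratio.
odds_ratio = math.exp(beta_age)
print(round(odds_ratio, 3))   # 1.099 -- the odds multiply by ~1.099 per year of age
```

An odds ratio above 1 means each additional year of age is associated with higher odds of the outcome; a ratio below 1 would mean lower odds.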
And the way to interpret this is: since our p-values are significant, we can say that the odds of being clinically diagnosed with a disability increase by a factor of 1.099 per year of age. So a 26-year-old would have 1.099 times the odds of being clinically diagnosed with a disability compared to a 25-year-old in this data. So we can see there's a positive relationship between being clinically diagnosed with a disability and age. Okay, I'm going to leave some time for questions, but thank you very much for listening. Thank you for attending the Summer Training Series. And, you know, I hope you'll reach out if you have any questions that I can answer. [ONSCREEN CONTENT SLIDE 17] Questions? Alex Roehrkasse aroehrkasse@butler.edu Noah Won noah.won@duke.edu Paige Logan Prater paige.loganprater@ucsf.edu [Paige Logan Prater] Thanks so much, Noah. We only have one question in the Q&A box, but while we're doing that, maybe we'll have a couple more come through. This one is a bit more logistical: someone is asking if the link to those infographics that you were sharing will be included in the course materials. Noah, it might be helpful if you could copy those links from the presentation and put them in the chat for folks to have. When the slides are posted, those links should be clickable because we included them in the presentation. Okay, great; Noah just put those links in the chat. They should be clickable when the materials are posted on our website, but if you have any issues, please let us know. Noah seems to be pretty good at navigating what the infographics offer, so if you have any questions about how to use that tool, definitely reach out as well. Okay, here is a question, Noah: when is it best to visualize continuous variables as categorical?
This may also refer to the previous recording regarding data exploration. [Noah Won] So you mean, like, when to use a histogram, I presume. There is an alternative to a histogram called a density plot. It's essentially like if you made a bunch of little small bins: if you chop the data up into infinitely small bins, it creates a curve in the shape of the histogram. So to answer your question, researchers will generally use their discretion when determining bin widths, because density graphs can get very particular in all the nooks and crannies, and that can distract from the overall distribution. So a lot of researchers prefer histograms, because those bin widths and bin counts provide some leeway in comparison to density graphs. Does that answer your question? [Paige Logan Prater] Hopefully it does, and if not, Antonio, please put another question in the Q&A box. But also, from a less advanced quantitative and statistical researcher: there are all these choices that we can make, right, about how we're approaching data analysis and data visualization. And I think the best practice is to understand why we're choosing to do one thing over the other, having a good justification or rationale for that decision, and making that explicit in any presentations you do, or papers, or things like that. Noah, does that sound accurate? [Noah Won] Yeah, yeah. [Paige Logan Prater] I don't think there's ever an absolute best, only way to do something. [Noah Won] Maybe. Yeah, he says it's referring to the previous recording on exploration, so perhaps that didn't answer your question. [Paige Logan Prater] No, I think it did answer the question.
I was just offering another angle on how to approach a question like that, just from my experiences too. At least in some of the projects I've worked on, there's no clear-cut "we must do this" that emerges; it's a mix of factors. What can the data tell us? Who is our audience, and what do we want them to know? There are a bunch of different considerations, and making sure that we are clear in why we're making certain decisions, in addition to some of the more technical options for data visualization, I think those are two things to consider. [Noah Won] Yeah, I completely agree with Paige. But Antonio, if that doesn't answer your question, my email is always open. It's Noah.Won@duke.edu, so feel free to send me an email and I'd be happy to answer any questions. [Paige Logan Prater] Cool. And they followed up saying both answers help, so that's amazing. Okay, we only have about one more minute, so if anyone has a very last quick burning question that they want to type in, please do; we'll stick around for the next minute or so. But before we close out, thank you all so much for sticking with us through this whole series. Like we mentioned at the top of the hour, everything will be posted; we're working on getting all of that available within the next few weeks, so thank you for your patience. Look out for other learning opportunities: we host our monthly office hours every academic year, and our next summer training series will be similar; the theme will be different, but the format will be the same, every Wednesday for about an hour. So thank you all so much, and I hope you have a great week. [VOICEOVER] The National Data Archive on Child Abuse and Neglect is a joint project of Duke University, Cornell University, University of California San Francisco, and Mathematica.
Funding for NDACAN is provided by the Children's Bureau, an office of the Administration for Children and Families. [MUSIC]