Transcript for 2025 Summer Training Series, Session 5, Visualization and Finalizing the Analysis Presenter: Noah Won, M.S., NDACAN National Data Archive on Child Abuse and Neglect (NDACAN) [MUSIC] [VOICEOVER] National Data Archive on Child Abuse and Neglect. [ONSCREEN CONTENT SLIDE 1] Welcome to the 2025 NDACAN Summer training series! National Data Archive on Child Abuse and Neglect. Duke University, Cornell University, UC San Francisco, & Mathematica. [Paige Logan Prater] Hello, everybody. Good morning, good afternoon. My name is Paige Logan Prater. I am the Graduate Research Associate here at NDACAN. I'm just going to kick us right off. I see people are coming in, so we'll just go ahead and get started. Yeah, so yeah, my name is Paige Logan Prater. This is the NDACAN Summer Training Series. NDACAN is the National Data Archive on Child Abuse And Neglect. We're housed across these universities and institutions, listed on the slide. And we are funded through the National Children's Bureau. [ONSCREEN CONTENT SLIDE 2] NDACAN Summer Training series schedule. July 2nd, 2025 Developing a research question & exploring the data. July 9th, 2025 Data management. July 16th, 2025 Linking data. July 23rd, 2025 Exploratory Analysis. July 30th, 2025 Visualization and finalizing the analysis. [Paige Logan Prater] So, this is our last session of our Summer Training Series for this summer. If this is your first time joining us, don't worry. You will have access to recordings of all of our previous sessions, as well as any materials that we used. All of those will be posted on our website. I will put the link in the chat once I kick it over to Noah, and those recordings and materials will be posted in the next few weeks. Definitely in August though. So, look out for those. And like I said, this is our last session. The theme of our series is the life cycle of an NDACAN research project. Today, we'll be talking about visualization and finalizing analyses with Noah. 
And just a quick plug that we are currently working on our Monthly Office Hours series, which we'll kick off in, I believe, September and will run through the academic year. So, if you don't already have that on your radar, definitely keep an eye out. We'll be doing similar types of learning opportunities with the Monthly Office Hours series. And yeah, all of our information about all of our events, the Summer Training Series, the Monthly Office Hours, can also be found on our website. The next slide, please. [ONSCREEN CONTENT SLIDE 3] Life Cycle of an NDACAN research project. This session is being recorded. Please submit questions to the Q&A box. See ZOOM Help Center for connection issues: https://support.zoom.us/hc/en-us. If issues persist and solutions cannot be found through Zoom, please contact Andres Arroyo at aa17@cornell.edu. [Paige Logan Prater] And just a really quick, sorry, go back one, Noah. Cool. Really quick housekeeping items. All of our sessions will be recorded and they will be available for access later on, if you want to refer back to them or if you aren't able to join live for any of our sessions. And if you have questions throughout the presentation, please put them in the Q&A chat or the Q&A box, which is along the bottom of your Zoom screen. There's a little comment bubble with a question mark. It says Q&A there. As your questions come up throughout the presentation, go ahead and type your questions in the Q&A box and we will address them at the end. We'll save about five or so minutes and we'll get those answered to the best of our ability. If you do have other questions, or you feel like you have more things you want to talk about regarding the archive or any of our presentations, you can always reach out to us as well. I think that is it. I will kick it over to Noah to talk about visualization and finalizing our data analysis. [Noah Won] Thank you, Paige. As Paige said, you know, welcome everyone.
Thank you for joining our last session of the Summer Training Series. Again, my name is Noah. I'm a data analyst here at Duke and I'll be presenting on data visualization and finalizing the analysis. [ONSCREEN CONTENT SLIDE 4] SESSION AGENDA STS Review Regression Review Data Visualization Univariate Plots Bivariate Plots Regression Variable Types Assumption Assessment [Noah Won] So, an overview of our agenda: we're going to review the last Summer Training Series session, which was exploratory analysis, just to refresh our memory on some of the learning we did then with Alex. We'll move into data visualization, namely univariate and bivariate plots, the sorts of plots and the uses for those plots that we will show in the R code example later. Then we're going to go over a couple of other types of regression, besides simple linear regression and multiple linear regression, that you may come across in your research experience. As well as some key assumptions that come with most forms of regression; these are key assumptions that we make that form the backbone of a basic understanding of the regression models we fit. [ONSCREEN CONTENT SLIDE 5] STS REVIEW [ONSCREEN CONTENT SLIDE 6] REGRESSION Regression analysis is a statistical method for estimating the relationship between two (or more) random variables Linear Regression Equation: Y sub i is equal to Beta naught plus Beta sub one times X sub i plus Epsilon sub i. Confounders are variables that affect both the independent and dependent variable Expansions Stratification Fixed-State Effect Models Description of the image of the Equation: Dependent variable is Y sub i. Population Y intercept is Beta naught. Population Slope Coefficient is Beta sub one. Independent variable is X sub i. Random Error Term is Epsilon sub i. The Linear component is Beta naught plus Beta sub one times X sub i. The Random Error component is Epsilon sub i.
[Noah Won] Okay, so as you may remember, last time we built a simple linear regression model and a multiple linear regression model. So, as a refresher, regression analysis is a statistical method we use to show the relationship between an outcome, or dependent variable, and a predictor, or independent variable. Here below is the equation for simple linear regression. As a review, the Y sub i is our outcome variable, and it's a continuous outcome variable. It's followed by an intercept, which is used to adjust the line of best fit, as Alex talked about last lecture. Then our B1 and Xi: B1 is our slope coefficient, which shows whether the linear relationship between our predictor and our outcome variable is positive or negative. So the sign of this slope coefficient, which is a number, will dictate whether there's a positive or negative relationship. And of course, our independent variable Xi, and a random error term, which is estimated by the residuals from the line of best fit. It's essentially the distance between the data points and the line of best fit. So one method we covered last lecture to address confounders is including them in our model. We extended it to multiple linear regression, adding more terms, a B2 X2, a B3 X3, and so on; adding these terms allows the model to accommodate potential confounders. And also stratification models, stratifying on a categorical variable and creating different models based off of it. For example, two different linear regression models run stratified based off of sex, male and female, can also provide insight and control for confounders such as sex or other categorical variables. [ONSCREEN CONTENT SLIDE 7] DATA VISUALIZATION [Noah Won] So moving forward, we will be going into data visualization.
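To make that review concrete, here is a minimal sketch in R using fabricated data (not NDACAN data; every variable name here is invented for illustration) of a simple linear regression, a multiple linear regression that adds a potential confounder, and a pair of sex-stratified models:

```r
# Minimal sketch of the models just reviewed, on fabricated data.
set.seed(1013)
n   <- 200
sex <- sample(c("Male", "Female"), n, replace = TRUE)
x1  <- rnorm(n, mean = 10, sd = 2)           # predictor of interest
x2  <- rnorm(n, mean = 5,  sd = 1)           # potential confounder
y   <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(n)    # outcome with random error

simple   <- lm(y ~ x1)        # simple linear regression: y = b0 + b1*x1 + e
multiple <- lm(y ~ x1 + x2)   # multiple regression: adjusts for the confounder

# Stratification: fit separate models within levels of a categorical variable
dat    <- data.frame(y, x1, x2, sex)
by_sex <- lapply(split(dat, dat$sex), function(d) lm(y ~ x1 + x2, data = d))

coef(multiple)  # the sign of each coefficient gives the direction of the relationship
```

The sign of the fitted `x1` coefficient is positive and the `x2` coefficient negative, matching the fabricated data-generating equation.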
[ONSCREEN CONTENT SLIDE 8] USES FOR DATA VISUALIZATION Holistic Overview Provides a quick, concise, visual summary of data Association at a Glance Reveal trends or pattern in data Identify General Nature of Relationship e.g. Linear, Quadratic, Cubic Splines Assumption Evaluation Aids in the validation of assumption testing [ONSCREEN IMAGES ALT-TEXT] Three sample graph images shown only for illustrative purposes: a scatter plot with dots on a plane, a bar graph labeled histogram, and a graph labeled "Normal Q-Q Plot" showing dots on a plane. [Noah Won] So data visualization is a very important tool in any statistician's or researcher's repertoire. It has a number of uses. One is it provides a great holistic overview of the relationship between variables. In this example, we have a histogram, a QQ plot, and a scatter plot. And just at a glance, each gives a very quick, concise visual summary of the data. Oftentimes in research, we have to make educated decisions on how we will proceed with our model. So for example, if you look at the scatter plot, without any formal hypothesis testing it could give some insight: okay, maybe there's a negative linear relationship there, and the relationship seems to follow a straight line. Perhaps linear regression would be an apt model to fit to this sort of relationship between these two variables. And likewise, this histogram over here of height shows the distribution of a single variable: roughly normally distributed around what seems to be about 175 centimeters. So we know this distribution seems to be normal, which is an important assumption for some models. The QQ plot, which also aids in normality assessment, plots theoretical quantiles against the actual sample quantiles. Without going too much into it, it's supposed to follow a 45 degree line.
So this QQ plot seems to display a variable, or a model, that shows normality. And as I said before, these relationships can be linear, quadratic, or even more complex than that. We could fit what's known as cubic splines, which I'll go into in detail in the coding example and later in the slides. But essentially, a cubic spline fits cubic curves locally to the data. I'm sure you've seen something similar, like a loess curve: a line of best fit that's not necessarily linear, but kind of squiggles about and follows the data. Those are known as cubic splines. And as I said before, there's assumption evaluation: normality, linearity. These are key assumptions that we can assess holistically before proceeding to more formal hypothesis testing. [ONSCREEN CONTENT SLIDE 9] UNIVARIATE PLOTS Helpful for showing distribution of a single variable Types Histograms – Discrete Geom_histogram (in R) Dot Plots - Discrete Geom_dotplot (in R) Box and Whisker – Continuous Geom_boxplot (in R) [ONSCREEN SLIDE 9 IMAGE 1 ALT-TEXT] A sample histogram: X-axis is Class interval, Y-axis is Frequency A histogram is a type of statistical chart used to visually represent the distribution of a dataset, particularly when the data is continuous or numeric. The chart is comprised of adjacent bars (rectangles), each representing a range or "bin" of values. The height of each bar indicates the frequency or count of data points within that specific range or bin. [ONSCREEN SLIDE 9 IMAGE 2 ALT-TEXT] A Sample Box and Whisker Plot Graphic. A box and whisker plot (or simply box plot) is a graphical representation used to display the distribution of a dataset through its five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The box shows where the central 50% of the data lie, marked by Q1 and Q3, with a line inside the box marking the median value.
The "whiskers" extend from the box to the minimum and maximum values, or sometimes to boundaries for potential outliers, which may be marked individually. [Noah Won] Okay, so, kind of giving it away a little bit, but we're going to go into univariate plots. These are plots that show a single variable. And finding the distribution of a single variable, as I said before, is important: normality assumptions, seeing if there's a bimodal distribution, perhaps two humps. It really gives you a better view: okay, where's the density of values in this variable? And how can I use that information to model this variable? So the most common type of univariate plot you'll see is likely a histogram. Histograms, as a review, count frequency, and they're used for discrete variables. In this example over here, you see the histogram. Histograms can have various bin sizes. You can see every bin here has a 0.02 bin width. So anything that falls within a given bin, say between 9.05 and 9.07, gets counted, and the count is plotted on this Y axis as frequency. Increasing the bin width can be better for larger spreads of data, while making a smaller bin width can give more exact estimates. It really depends on what you're using it for. But generally, this histogram right here shows a roughly normal distribution, so it's good for assessing that sort of thing at a glance. Similarly, I don't have a dot plot in this presentation, but they're very similar to histograms, except instead of counts shown as bars, they show stacked dots. So it's just another way of visualizing the discrete information, with more of a lens on its discrete nature: each dot represents a count. So in that way, it's different.
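As a rough sketch of the two plot types just described, assuming ggplot2 is installed and using fabricated heights (not NDACAN data):

```r
library(ggplot2)

# Fabricated example: heights roughly normal around 175 cm
set.seed(1013)
heights <- data.frame(height = rnorm(300, mean = 175, sd = 8))

# Histogram: bars count observations per bin; binwidth trades off
# smoothing (wider bins) against detail (narrower bins)
p_hist <- ggplot(heights, aes(x = height)) +
  geom_histogram(binwidth = 2)

# Dot plot: one stacked dot per observation instead of bars,
# emphasizing the discrete, count-based nature of the data
p_dot <- ggplot(heights, aes(x = height)) +
  geom_dotplot(binwidth = 2)
```

Printing either object (for example, typing `p_hist` at the console) draws the plot.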
Some people, I think most people, prefer histograms, but dot plots have their time and place as well. And box and whisker plots as well. So box and whisker plots show, not counts, but quartiles. A quick review of box and whisker plots, and you can see this plot: we have the median in the center, which is the 50th percentile. We have Q1, which is the 25th percentile, and Q3, which is the 75th percentile. And then the whiskers are at Q1 minus 1.5 times the interquartile range and Q3 plus 1.5 times the interquartile range, where the interquartile range is Q3 minus Q1. And then of course, we have outliers that may exist outside of this range. So box and whisker plots are great at showing the distribution of a data set in a different way that a histogram might not be able to convey. If the whiskers were unevenly spaced, you could tell that perhaps the distribution is skewed. Or if there were more outliers existing outside of the whisker range, it would likely show that the data is weighted heavily on the upper scale of this variable, or the lower scale of this variable. Or if this interquartile range is huge, what does that say? Is the data more spread out? Is it closer together? So all these things are very helpful when determining distribution, especially in relation to each other. And we'll be going into that in the R coding example as well. [ONSCREEN CONTENT SLIDE 10] BIVARIATE PLOTS Helpful for showing relationship between two variables Linear, Quadratic? Types Scatterplot – Two Continuous Variables Geom_point Cubic Splines – Two Continuous Variables Geom_smooth [ONSCREEN SLIDE 10 IMAGE 1 ALT-TEXT] A cubic spline plot is a smooth curve that passes through a set of given data points, created by piecing together a series of cubic polynomials between each pair of points. Each segment is defined so that the entire curve is smooth. The example shows a line on a plane with x-axis labeled "y1" with values going from 0 to 25, and y-axis labeled "vcr" with values going from 0 to 2500.
There is a shaded band which covers the line from above and below: this represents the standard error band width. [Noah Won] Okay, so we also want to be able to plot two variables together and explore the relationship between them. Is it linear, quadratic? Scatter plots and cubic splines help us determine, at a visual glance, roughly what the relationship between the two variables is. If I had these two variables right here and was looking at this plot, you could clearly see it's not linear. There's some sort of curve here, fitted by a cubic spline, and I would say that the linearity assumption in linear regression would not be met here, and some remedial efforts would be needed to be able to use a linear regression. The two main types that we're going to cover in this lecture are scatter plots, essentially just this graph without the line of best fit: just dots, with the independent values on the x axis and the outcome values on the y axis, plotting the relationship between them. And then we're going to fit cubic splines on top of our scatter plots in the R coding example, to get a chance to show the nonlinear relationship between these two variables. [ONSCREEN CONTENT SLIDE 11] REGRESSION REVIEW [Noah Won] Okay, so we're going to go back into regression. [ONSCREEN CONTENT SLIDE 12] WHAT IS REGRESSION Simple Linear Regression models LINEAR relationship between a dependent and independent variable Composition Right Side Intercept: b0 Slope Coefficient: b1 Independent Variable: X1 Multiple Linear Regression [ONSCREEN SLIDE 12 IMAGE 1 ALT-TEXT] Simple Linear Regression Y equals b naught plus b sub one times X sub i. Dependent variable is Y. Y intercept (constant) is b naught. Slope Coefficient is b sub one. Independent variable is X sub 1.
[ONSCREEN SLIDE 12 IMAGE 2 ALT-TEXT] Graph of Simple Linear Regression Model. On an unlabeled x-y plane, a series of dots are clustered around a line that slopes at an approximate 40 degree angle starting from close to the origin. [Noah Won] Just a quick review, and I know I covered it earlier in the last review, but I think it's very important to hammer down regression before we go into other types, right? So it's a similar equation, just without the error term, though the errors do exist; this is just a simplified version of the equation. As I said before, B0 is the intercept. You see in this graph right here, the regression line starts at a non-zero value, which gives the regression line a better fit through these points if it can determine the best place to intercept. And the slope coefficient is the slope of this line. This is a positive correlation, so this number B1, if it were fit to this simple linear regression, would be a positive number. The magnitude I'm not entirely sure of, as there are no numbers on the axes, but it would definitely be positive. The x1 values are plotted right here and the y values here, and the line gives the predicted y values. And of course, multiple linear regression would add more independent variables, like plus B2 x2, plus B3 x3, et cetera. [ONSCREEN CONTENT SLIDE 13] VARIABLE TYPES Quantitative Discrete - Poisson Regression Continuous - Simple or Multiple Linear Regression Qualitative Binary - Logistic Regression Nominal – Multinomial Logistic Regression Ordinal – Ordinal Logistic Regression [ONSCREEN SLIDE 13 IMAGE 1 ALT-TEXT] Table outlining qualitative data types, namely, nominal, ordinal, and binary, and quantitative data types, namely, discrete and continuous. Qualitative nominal: Variables with no inherent order or ranking sequence. E.g. Gender, Race, etc.
Qualitative ordinal: Variables with an ordered series. E.g. Blood Group, Performance, etc. Qualitative binary: Variables with only two options. E.g. Pass/Fail, Yes/No, etc. Quantitative discrete: aka attribute data. Discrete data is information that can be categorized into a classification. Discrete data is based on counts. A finite number of values is possible, and the values cannot be subdivided meaningfully. E.g. number of parts damaged in shipment. Quantitative continuous: Continuous data is information that can be measured on a continuum or scale. Continuous data can have almost any numeric value and can be meaningfully subdivided into finer and finer increments. E.g. length, size, width. [Noah Won] Okay, so we know that simple linear regression and multiple linear regression are useful when we have a continuous outcome. That is, an outcome like blood pressure or age, something that has continuous values and is not separated into discrete increments. But what if we had a variable that we wanted to model that wasn't continuous? We can see here the definitions of discrete, binary, and ordinal variables. What if our variable is ordinal, that is, a categorical variable with an inherent ordered series? Perhaps letter grades, A through F. There's an inherent order in the American grading scale that we can utilize to extract more information than if we just had a nominal variable, something with no inherent order, like gender or race: something where we can't extract any information based on the order in which the categories come. Binary variables are probably one of the most helpful, most sought-after variable types to model. That's like a zero/one, pass/fail, yes/no.
How do we model something that has one of two values? We'll go into logistic regression and how it solves that problem: if we can't model it like a continuous variable, if it's just one or zero, then how do we model it, right? And of course, there are some other forms of regression that lie outside the scope of this lecture, but are very helpful in many forms of research. [ONSCREEN CONTENT SLIDE 14] ASSUMPTION ASSESSMENT Core Assumptions Homoscedasticity – Variance of residuals is constant for all levels of all independent variables Use plots of residuals. Independence - Each observation is independent of others Verify with study design Linearity Inspect scatter plots of independent and dependent variables No Multicollinearity Independent variables are not correlated with each other Check correlation tables [Noah Won] Okay, so we went over the types of regression. Now, these are a couple of the core assumptions that exist within these different types of regression. I'd say there are four core assumptions. We'll go over homoscedasticity, which is kind of a big word, but essentially it's ensuring that the variance of the residuals is constant for all levels of the independent variables. So, let me just go back right here. [ONSCREEN SLIDE 12 IMAGE 2 ALT-TEXT] Graph of Simple Linear Regression Model. On an unlabeled x-y plane, a series of dots are clustered around a line that slopes at an approximate 40 degree angle starting from close to the origin. [Noah Won] So, residuals are the distance between all these points and the line, right? This point right here is going to have a positive residual because it's above the line; the line is underestimating this point. And this point right here is going to have a negative residual, since the line is overestimating it.
So, we collect all these residual values, and we want to make sure that at any given X, the variance of the residuals is roughly equal. So, what does that mean, exactly? Looking at this graph, you can see that, I would say, more or less it is equal. If you cut thin slices of this, separated by X value, you can see that the spread of the residuals around this loess curve I have here is roughly the same. You could say it increases here, and you could see that the spread between here increases, especially around here. But the general assumption is that you want this to be equal, with an equal standard error around this line, right? And I guess in this example, we can holistically determine that roughly it is the same. This is kind of a large jump right here, compared to this one. But let's say, roughly, we have met the assumption of homoscedasticity. Independence, so this comes down to study design. One of the assumptions is that each observation is independent of the others. So if, in our data collection, we see that one person's data impacts another person's, that's a problem. An example could be: say there's a doctor collecting heights, and a bunch of NBA players come in, so he keeps measuring tall people and stops paying close attention. When someone shorter comes in, maybe he overestimates. Or it doesn't necessarily have to be height; it could be a test that the doctor performs, like a holistic test. He sees a bunch of people that are performing highly, so he's more likely to give someone who doesn't perform as well a higher test value, because he's used to seeing other people score high.
Seeing earlier observations impacts the values recorded for people down the line. But this is something we have to verify with study design. It's not necessarily something we can change or address in our data. There are a few remedial actions, but it comes down to study design. I spoke briefly before about linearity: the relationship between the independent variable, the X values, and the outcome, the Y values, should be linear, because linear regression, all these models, more or less assumes a linear relationship so that we can have one beta value. And, you know, lines of best fit are linear. There are other types of models we can use, and there are transformations we could use to address linearity issues, but at its core, this is the key assumption of linear regression models. And then multicollinearity. There can't be any correlation between the X values themselves. One of the assumptions is that each covariate we're adding provides independent, unique information. So, for instance, if you had BMI in your model and you also had height, those are not independent, because height is a parameter in calculating BMI. To address that, you would want to include only either BMI or height, and not both, because that would be redundant information and a violation of the no-multicollinearity assumption. [ONSCREEN CONTENT SLIDE 15] COEFFICIENTS OF DETERMINATION How do we determine how well our model performs? Coefficients of Determination R squared equation: R squared equals one minus the quotient of the residual sum of squares and the total sum of squares. Adjusted R squared equation: adjusted R squared equals one minus the quotient of the residual variance and the total variance, where the residual variance is equal to the residual sum of squares divided by its degrees of freedom and the total variance is equal to the total sum of squares divided by its degrees of freedom.
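As a small sketch of the two quantities on the slide (on fabricated data), both can be read off a fitted model in R, and R squared can also be computed by hand from the residual and total sums of squares:

```r
# Fabricated data, for illustration only
set.seed(1013)
x <- rnorm(100)
y <- 3 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

# R squared by hand: 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2 <- 1 - ss_res / ss_tot

# summary() reports both; adjusted R squared penalizes extra predictors
summ <- summary(fit)
c(by_hand = r2, reported = summ$r.squared, adjusted = summ$adj.r.squared)
```

The hand-computed value matches `summary()`'s reported R squared, and the adjusted value is always at or below it.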
[Noah Won] And the last concept that we're going over is the coefficient of determination. So, after we fit our model, we want to see how well it performs. The coefficient of determination, also known as R squared, or in another form, adjusted R squared, can give us information on how well our model adheres to the data points we're given. So, the equation is R squared equals one minus the variance of the residuals... I think my mic may have died out for a second. [Paige Logan Prater] Yeah Noah, you cut out, but I think we can hear you now. [Noah Won] Okay, great. Thank you, Paige. So, quickly, the formula for R squared is just: R squared is equal to one minus the variance of the residuals, that is, the kind of spread I was talking about from that line of best fit earlier, divided by the total variance. This ratio gives us a measure of how close our line is to the data points. We want our model to be able to predict the points as closely as possible; we want to minimize residuals. So, this R squared gives us information on how our model performs. But one drawback of R squared is that the more variables you add to your model, as in multiple linear regression, the greater your R squared gets. There's no penalizing term, so you could just add as many variables as you want, and adding variables to your model never decreases the R squared value, just by its inherent nature. So, adjusted R squared accounts for this through the n here, right? It accounts for the number of independent variables we have in our model, and it penalizes your adjusted R squared value for adding more variables to your model.
So, adjusted R squared, I'd say, is more of a standard than R squared, just because of this lack of a penalty in the R squared equation. Researchers could add in a bunch of independent variables and seemingly have an impressive R squared, when really they're taking advantage of R squared's inherent nature of not penalizing additional variables. [ONSCREEN CONTENT SLIDE 16] HELPFUL RESOURCE GGplot2 Cheat Sheet https://posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf GGplot2 Documentation https://ggplot2.tidyverse.org/reference/index.html [Noah Won] Okay, and these are very helpful reference sheets that are available publicly. I can click on one and show you. As we go into visualization, we're going to use the ggplot2 package, and this, I can show you real quick, is a small cheat sheet to aid you if you perhaps want to make a histogram or a density plot; it's one sheet that combines all the information you can use. It's been helpful for me. Another helpful tidbit is the ggplot2 documentation. One of the benefits of R is that the documentation is excellent. So if you have a question about what variables go where, outside of this R coding demonstration, the official ggplot2 website has very good documentation that you can use as a reference. [ONSCREEN CONTENT SLIDE 17] Questions? Alex Roehrkasse aroehrkasse@butler.edu Noah Won noah.won@duke.edu Paige Logan Prater paige.loganprater@ucsf.edu [Noah Won] And Alex was unable to make this presentation, but Paige's, Alex's, and my emails are here for any questions that may come up after the lecture. But without further ado, I think I will be moving to the coding portion of our demonstration. And yeah, like last week's session, we'll be following a similar template.
[ONSCREEN] # NOTES # # This program file demonstrates strategies discussed in # session 5 of the 2025 NDACAN Summer Training Series # "Data Visualization." # For questions, contact the presenter # Noah Won (noah.won@duke.edu). # Note that because of the process used to anonymize data, # all unique observations include partially fabricated data # that prevent the identification of respondents. # As a result, all descriptive and model-based results are fabricated. # Results from this and all NDACAN presentations are for training purposes only # and should never be understood or cited as analysis of NDACAN data. # TABLE OF CONTENTS # # 0. SETUP # 1. Univariate Plots # 2. Bivariate Plots # 3. Logistic Regression [Noah Won] We'll be going over data visualization, session five. If you have any questions, as I said before, please don't hesitate to reach out to my email noah.won@duke.edu. And we will again be using anonymized data, and any findings from this data aren't indicative of any true trends in childhood maltreatment data. Okay, so we are going to do a similar setup as we did last time. [ONSCREEN] # 0. SETUP # # Clear environment rm(list=ls()) # Installs packages if necessary, loads packages if (!requireNamespace("pacman", quietly = TRUE)){ install.packages("pacman") } pacman::p_load(data.table, tidyverse, mice) # Defines filepaths working directory project <- "C:/Users/nhwn1/Downloads/STS5/data" data <- "C:/Users/nhwn1/Downloads/STS5/data" # Set working directory setwd(project) # Set seed set.seed(1013) [Noah Won] I'm going to clear our environment as per good coding practice. We're going to install some packages that we're going to need: tidyverse, mice, data.table, pacman, but the tidyverse will likely include everything we'll need. We are going to run our project pathways. This is just where I set the project directory where I have my data, and I'll just reference this data line later when I'm reading the data.
I'm just going to set the working directory and set a seed for any randomization that we may need. Okay, so we're going to be going over our univariate plots first. If you remember, those are the histograms, the dot plots, the box and whisker plots. We're going to begin by reading in our anonymized AFCARS data. [ONSCREEN] > afcars <- fread(paste0(data,'/afcars_clean_anonymized_linear.csv')) > head(afcars, 20) StFCID STATE St RecNumbr DOB SEX RaceEthn CLINDIS 1: AL000001456616 1 AL 000001456616 2003-02-15 Female White No 2: AL000001524474 1 AL 000001524474 2003-01-15 Male White No 3: AL000001528009 1 AL 000001528009 2003-01-15 Male White Yes 4: AL000001597400 1 AL 000001597400 2003-01-15 Male White No 5: AL000001612758 1 AL 000001612758 2003-01-15 Female White No 6: AL000001634843 1 AL 000001634843 2002-09-15 Female Hispanic No 7: AL000001699782 1 AL 000001699782 2003-02-15 Male White No 8: AL000001699789 1 AL 000001699789 2003-02-15 Female White No 9: AL000001714510 1 AL 000001714510 2002-09-15 Male White No 10: AL000001718423 1 AL 000001718423 2002-12-15 Female White No 11: AL000001718927 1 AL 000001718927 2003-03-15 Male White No 12: AL000001917909 1 AL 000001917909 2002-11-15 Male White No 13: AL000002036685 1 AL 000002036685 2003-08-15 Female Black Yes 14: AL000002041179 1 AL 000002041179 2003-04-15 Male White No 15: AL000002061766 1 AL 000002061766 2003-03-15 Female AIAN No 16: AL000002239265 1 AL 000002239265 2003-02-15 Male White No 17: AL000002270506 1 AL 000002270506 2003-01-15 Male Hispanic No 18: AL000002276813 1 AL 000002276813 2003-07-15 Female White No 19: AL000002501625 1 AL 000002501625 2003-07-15 Male Black No 20: AL000002788485 1 AL 000002788485 2003-04-15 Male Black Yes TOTALREM FCMntPay AgeAtStart 1: 2 5425 16 2: 2 4371 16 3: 3 732 16 4: 1 732 16 5: 4 3102 16 6: 5 732 16 7: 2 732 16 8: 2 732 16 9: 2 8525 16 10: 1 732 16 11: 2 732 16 12: 2 732 16 13: 2 0 16 14: 2 0 16 15: 2 0 16 16: 2 2961 16 17: 3 0 16 18: 1 0 16
19: 2 0 16 20: NA 938 16 [Noah Won] And then we're just going to check to make sure everything ran well. Okay, great. So we have StFCID, STATE, a character state variable, RecNumbr, date of birth, sex, race and ethnicity, CLINDIS, total removals, foster care monthly pay, age at start. Okay, looks like everything ran well. We're going to look at a couple of frequency tables of the variables of interest. [ONSCREEN] > table(afcars$SEX) Female Male 88 323567 307598 > table(afcars$RaceEthn) AIAN Asian Black Hispanic Multiracial 10255 14846 3393 138569 134412 48829 NHPI White 1698 279251 > table(afcars$FCMntPay) 0 1 2 3 5 6 7 8 9 10 11 265467 858 3 7 4 5 23 10 8 15 8 [a lengthy frequency table of monthly payment values follows, ending with:] [ Reached getOption("max.print") -- omitted 8076 entries ] [Noah Won] It seems like we have a lot, but it looks good. We have our sex variable, with about 307,000 males and 323,000 females. A good spread on our race ethnicity variable. And the pay variable, obviously, is just going to have a lot of values, because it's a continuous variable. Okay, we are going to create a couple of dummy variables, right? So, for example, in our sex variable we have male and female, which is easy to read when we're putting it into tables. But for computers to read, we really want to set it to ones and zeros, that binary variable that we're going to use. This piece of code essentially creates another variable, SEX underscore d, the dummy variable, and sets it to one when sex is male and zero when it's female. Same thing with CLINDIS: we create another variable called CLINDIS underscore d, where yes is going to be a one and no is going to be a zero. Hispanic is one if the race ethnicity is Hispanic, zero if it's anything else. And then we're going to filter by age, keeping the most applicable age groups, since we're more concerned with a more recent age group. So we're going to run this code. [ONSCREEN] > afcars2 <- afcars %>% + mutate(SEX_d = case_when( + SEX == "Male" ~ 1, + SEX == "Female" ~ 0), + CLINDIS_d = case_when( + CLINDIS == "Yes" ~ 1, + CLINDIS == "No" ~ 0), + Hispanic = case_when( + RaceEthn == "Hispanic" ~ 1, + TRUE ~ 0), + age = as.numeric(difftime(Sys.Date(), DOB, units = "days")) / 365.25 + ) %>% + filter(age <= 30) [Noah Won] I'm going to check some of these variables to make sure they were derived correctly. Right, we've got ones and zeros.
[ONSCREEN] > table(afcars2$SEX_d) 0 1 323559 307580 > table(afcars2$CLINDIS_d) 0 1 391018 148137 > table(afcars2$Hispanic) 0 1 496820 134407 > table(afcars2$age) [a lengthy frequency table of fractional ages, running from roughly 4.3 to 30, follows] [Noah Won] So that seems to align with what we had seen for the sex variable. Similarly with CLINDIS, okay, looks good. Hispanic and age, nice. Okay, oh, and I forgot to mention how age is derived: I just took the elapsed time since the date of birth in days and then divided it by 365.25 to get the age. So it all looks good. The variables are derived well, and there are no crazy values. Let's start graphing. We're going to start with the ggplot2 package. We are going to create a very simple histogram. The syntax in ggplot2 with this code is: we're going to assign this plot to the variable hist1. That's just an arbitrary name I came up with; this arrow means we're assigning the plot to hist1. The syntax begins with the ggplot function, and it takes in these parameters: afcars2, which is the name I assigned to this data set. Whatever is in the first parameter position, we're going to use as the data set. And then we're going to use the aes function within it. It's the aesthetics function, and essentially it gives information about the plot that we would want to see. So we're going to do aes, and then in the function, x equals age. So our x variable is our age variable, a continuous variable; we're just going to see how this histogram shows the distribution of age. And then this part right here is adding a geom_histogram, essentially just saying, okay, build a histogram from this information that we have here. So let's see what this makes.
[ONSCREEN] > hist1 <- ggplot(afcars2, aes(x = age)) + geom_histogram() [ONSCREEN R CODE IMAGE 01] A histogram bar graph with age as the x axis, and count as the y axis. Age runs from approximately 1 to 27 and the maximum count is approximately 43,000, occurring around age 5. [Noah Won] Okay, it seems like a relatively simple histogram. Looks good, but we don't have any titles. We don't have any labels; it says count, lowercase, and age. And these bins seem kind of large, so it seems kind of murky. The color is a little bit off too. I think we can do better than this. So let's make an improved version and add some titles. We're going to assign this to hist2, a different histogram that we can call later. And, same thing, we're going to do ggplot, calling the afcars2 data set, using age as the x variable, using geom_histogram. But this time, we're going to add the labs function, which essentially gives information about titles and subtitles that we can add to our graph. So let's go ahead and run this. We're adding a "Histogram of Age" title at the very top. At the x axis, we're just capitalizing age, because by default it's going to use the variable name. In this case age is not necessarily a very bad name, but if you have something like FCMntPay, it's going to look kind of dirty. And then we're going to change count to frequency. So let's see how this looks. [ONSCREEN] > hist2 <- ggplot(afcars2, aes(x = age)) + + geom_histogram() + + labs(title = "Histogram of Age", + x = "Age", + y = "Frequency") > hist2 `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. [ONSCREEN R CODE IMAGE 02] A histogram bar graph with title "Histogram of Age", "Age" as the x axis, and "Frequency" as the y axis. Age runs from approximately 1 to 27 and the maximum count is approximately 43,000, occurring around age 5. [Noah Won] Okay, looks good, looks better.
We have a title, we have a frequency label, we have x and y axis labels, but it seems kind of dull still. It's kind of hard to determine where one bar starts and another ends, and it seems like we could use a larger number of bins; there may be some relationships here that we're missing. So let's try and make this a little bit better. Okay, so we're going to make histogram 3 and add a couple more aesthetic parameters. Again, same thing: ggplot, with the aesthetic x equals age, same as before. And in the geom_histogram function, we can actually add more information about what we want in this histogram. In this example, we're going to be adding more bins; a bin is the length of age that each bar captures. Adding more bins makes the bin width smaller, so you can either decrease the bin width or add bins; we're going to add bins. We're going to fill it with a different color, steel blue, and we're going to make the outline white as well. And of course we're adding the labels that we added earlier. Let's take a look at this. [ONSCREEN] > hist3 <- ggplot(afcars2, aes(x = age)) + + geom_histogram(bins = 50, fill = "steelblue", color = "white") + + labs(title = "Histogram of Age", + x = "Age", + y = "Frequency") > hist3 [ONSCREEN R CODE IMAGE 03] A histogram bar graph with title "Histogram of Age", "Age" as the x axis, and "Frequency" as the y axis. Age runs from approximately 1 to 27 and the maximum count is approximately 43,000, occurring around age 5. Blue bars now appear discretely separated from each other. [Noah Won] Okay, looks much better. The color shows the difference between the bars, so we can actually see them: the outline is white and the fill is steel blue.
It's definitely added a number more bins here, so we can see more of the shape of this distribution; it's more exact, right? And I think this histogram is a lot better than what we started with. So, you know, I'm pretty happy with this, and I think we'll move on. We can see just from the histogram of age that there's a huge modal spike right here, seemingly between zero and 10, then it goes down, and then we see another slight spike here at 20. That tells us this data might not be normally distributed, but more likely bimodally distributed: a smaller hump here and a larger hump here. Okay, but what kind of information can we get from a box and whisker plot? For the sake of time, I'm going to skip over most of these steps, because they're very similar to what we did before. [ONSCREEN] > box1 <- ggplot(afcars2, aes(x = SEX, y = age)) + + geom_boxplot(fill = "skyblue", color = "darkblue") + + labs(title = "Box and Whisker Plot of Age by Sex", + y = "Age", x = "Sex") + + theme_minimal() > box1 [Noah Won] It's changing the color of the box plot and changing the titles, but I will go over the difference in syntax. Box and whisker plots are univariate, but we're actually going to split the plot by sex, because box and whisker plots are most useful when we're comparing them against other box and whisker plots, right? So we're going to create three box and whisker plots of age based on the different values of sex: male, female, and missing. That's why we added this aesthetic value, x equals SEX, y equals age; we added another variable in there. We're going to use geom_boxplot this time, and we're going to add a theme. There are a number of themes; they just make your graphs look prettier.
So, like I said, in the documentation you can see all the themes that ggplot offers. I'll just be using minimal, so let's see. [ONSCREEN R CODE IMAGE 04] A box and whisker plot with x-axis labeled "Sex" and y-axis labeled "Age". There are three boxes. The first box is unlabeled and rises from about age 5 to age 15, with lines that stretch down to 0 and up to 25. The second box is labeled "Female" and rises from about age 7 to about age 18, with lines that stretch down to 0 and up to 29. The third box is labeled "Male" and rises from about age 7 to about age 17, with lines that stretch down to 0 and up to 30. [Noah Won] Okay, nice. So, we can see through this box and whisker plot that male and female are very similar in distribution. Perhaps males have a very slightly higher age, and perhaps a slightly longer whisker over here, showing that maybe they have a heavier distribution over here. And we see that the missing values are distributed lower, so perhaps there's some kind of relationship between a missing sex variable and having a younger age, which is likely related to how the data were collected, but it's hard to tell for certain. But yeah, we can see the distributions and compare them between the various levels of this sex variable. Okay, so that's fine and dandy if you want to look at one variable, but what if we want to measure the relationship between two continuous variables? For that we're going to go over scatter plots, which are very helpful for determining the relationship between two variables. The two continuous variables we're going to compare are age and FCMntPay: essentially, what is the relationship between the foster care monthly payment and the age of the person? We are going to assign this ggplot to scatter1, the first scatter plot that we're making, and we are using the afcars2 data as I said before.
We are comparing age and FCMntPay. Age is our independent variable: we want age to be on our x-axis, the horizontal axis, and the pay to be on our y-axis. We are using geom_point; that's the key word here for scatter plots. And we're filling in the color with steel blue. A new parameter we have here is alpha = 0.6. It's just another aesthetic parameter, a value between zero and one that sets the opacity of the dots we're plotting. So if you have a ton of dots, maybe you'd want them to be less opaque, so that you can see the dots through each other. And the size of the dots we can assign as well, just a value; if there are a lot of dots, we can make them smaller or bigger to see them better. You can play with these parameters to get a visually appealing plot. Again we're going to use the labs function to assign titles, and we're going to be applying the minimal theme. So let's see how this runs. It might take a little bit, just because there's a lot of information in this plot; as you saw, there are like 300,000 males. [ONSCREEN] > scatter1 <- ggplot(afcars2, aes(x = age, y = FCMntPay)) + + geom_point(color = "steelblue", alpha = 0.6, size = 2) + + labs(title = "Scatterplot of Foster Care Monthly Payment vs Age", + x = "Age", + y = "Foster Care Monthly Payment (FCMntPay)") + + theme_minimal() > scatter1 Warning message: Removed 1366 rows containing missing values (`geom_point()`). [Noah Won] And so it takes a lot of time for the R program to run it. We'll wait for a little bit; I'm sure it will be well worth it. Like I said, there are many different options; I can reference this cheat sheet right now while we're waiting. [ONSCREEN] A view of the "Data visualization with ggplot2 CHEAT SHEET" PDF available at the following link: posit.co/wp-content/uploads/2022/10/data-visualization-1.pdf.
[Noah Won] So we have one continuous variable, right? You can do geom_histogram like I did before. Or, let's see, two continuous: we did geom_point, right? And in the geom_point function you could also put the X and Y values. So we did alpha, we've used color, we've used fill, we haven't used shape, we've used size. These are just the parameters that you can put in here. And yeah, there's lots of useful information here that we haven't used yet. We'll be using geom_smooth in a second to show cubic splines, and we used geom_boxplot, which is for one discrete and one continuous variable. Let's see if this is coming along. It should be almost done; I don't remember it taking so long. But I suppose in the meantime, I can cover this other plot that I was going to show. [ONSCREEN] #There seems to be a positive relationship between age and Monthly Foster Care Payment but it is hard to tell #Let's fit a cubic spline in the data to see > scatter2 <- ggplot(afcars2, aes(x = age, y = FCMntPay)) + + geom_point(color = "steelblue", alpha = 0.6, size = 2) + + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), color = "darkred", se = FALSE) + + labs(title = "Scatterplot of FCMntPay vs Age with Cubic Spline Fit", + x = "Age", + y = "Foster Care Monthly Payment (FCMntPay)") + + theme_minimal() [Noah Won] So we're going to add a cubic spline to the graph that we made up here. We're going to follow the same method. We're going to use the same ggplot variables, because we're not trying to change the X and Y variables. We're going to use the same geom_point as we did; this builds the scatter plot. And then we're going to use geom_smooth, which builds the smoothing line, here the cubic spline, on top of the scatter plot.
So that's why we're adding it in addition to geom_point: geom_point makes the scatter plot, and geom_smooth makes the curve. We're going to specify which model, or which method, we're going to use to create the cubic spline. We're going to use the generalized additive model, gam for short. We're going to specify the formula: our outcome Y, FCMntPay, is equal to a smooth function of X, and the basis is going to be cubic splines, cs for short. We're going to set the color to dark red. And this se option stands for standard error. Oh, okay, here it comes; we have it right here actually. [ONSCREEN R CODE IMAGE 05] A graph of blue dots. X-axis is labeled "Age", y-axis is labeled "Foster Care Monthly Payment (FCMntPay)". Most of the dots are between ages 3 and 27, and below FCMntPay value 20,000. [Noah Won] So we see all these dots plotted against age. It's kind of hard to see through this really high density of dots here; like I said, there are more than 500,000 observations, so it's going to be a high density. It's hard to see the relationship, but it seems there's a relatively positive relationship here. We're going to fit this cubic spline to see, more holistically, what the relationship is between the points. As I said, the standard error band is that little bandwidth shown right here. [ONSCREEN SLIDE 10 IMAGE 1 ALT-TEXT] A cubic spline plot is a smooth curve that passes through a set of given data points, created by piecing together a series of cubic polynomials between each pair of points. Each segment is defined so that the entire curve is smooth. The example shows a line on a plane with x-axis labeled "y1" with values going from 0 to 25, and y-axis labeled "vcr" with values going from 0 to 2500.
There is a shaded band which covers the line from above and below: this represents the standard error band width. [Noah Won] With this standard error band width, you can kind of see the uncertainty. I would display it for us, but since this distribution is so heavily populated, I'm going to set se to FALSE, because you can't even see the band if I set it to TRUE. Either way, it's hard to see because of the sheer density of points. We're going to apply our titles, "Scatterplot of FCMntPay vs Age with Cubic Spline Fit" and the axis titles, and then we're going to apply a similar theme. [ONSCREEN] > scatter2 Warning messages: 1: Removed 1366 rows containing non-finite values (`stat_smooth()`). 2: Removed 1366 rows containing missing values (`geom_point()`). [Noah Won] So hopefully this one moves a little bit quicker; I want to leave time for questions. But it should fit a line on top of that scatter plot we had just made. While we're waiting for this, since it's just going to be a cubic spline, I will start going over the logistic regression model. It's fairly quick and simple to run. Oh, that was a little bit quicker than expected. [ONSCREEN R CODE IMAGE 06] A graph of blue dots. X-axis is labeled "Age", y-axis is labeled "Foster Care Monthly Payment (FCMntPay)". Most of the dots are between ages 3 and 27, and below FCMntPay value 20,000. A line is overlaid on the dots which runs along low values of FCMntPay from age 0 to 30, with a slight hump in the line at age 23. [Noah Won] So you can see here the slight positive, very slight positive association between age and foster care monthly payment. This line is so close to the bottom likely because there's a high density of points here that we can't see. But cubic splines fit local trends very well, so you can see that it humps up here and then humps back down. But yeah, that's just how to fit a cubic spline essentially.
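[Transcript editor's note] The presenter fits the spline in R with ggplot2's geom_smooth. As a rough analogue only, not the presenter's code, here is a minimal Python sketch using scipy's CubicSpline, which pieces cubic polynomials between points exactly as the slide's alt-text describes. Note that this spline interpolates through every point, whereas geom_smooth(method = "gam") fits a penalized smoothing spline to noisy data; the age and payment values below are hypothetical, chosen for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical (age, payment) points for illustration only.
age = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 25.0])
pay = np.array([400.0, 450.0, 500.0, 520.0, 700.0, 600.0])

# Piece cubic polynomials between each pair of points; the
# resulting curve is smooth and passes through every point.
spline = CubicSpline(age, pay)

print(spline(age))   # reproduces the original payments
print(spline(12.5))  # a smooth value between the surrounding points
```

Unlike this interpolant, a smoothing spline (as in the R code above) trades exact fit for a smoother curve, which is why it can track the "hump" in dense data without chasing every observation.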
But I'm sorry, I'm kind of jumping around; I thought this would take longer. But our logistic regression model: if you recall, we use logistic regression to model binary outcome variables. Our binary outcome variables take the value of zero or one. And we are going to be modeling clinical disability, this CLINDIS variable. Essentially, the indicator equals one if they have been clinically diagnosed with a disability, or zero if they have not. And we want to see the relationship between being clinically diagnosed with a disability and age. [ONSCREEN IMAGE] Regression Equation: Dependent variable is Y sub i. Population Y intercept is Beta naught. Population Slope Coefficient is Beta sub one. Independent variable is X sub i. Random Error Term is Epsilon sub i. The Linear component is Beta naught plus Beta sub one times X sub i. The Random Error component is Epsilon sub i. [Noah Won] So to go over the model again, it's very similar to the continuous linear regression model, except this dependent variable is not continuous, but we want to make it continuous, because we want to be able to model it the same way as we do linear regression. So how do we turn a variable that goes from zero to one into a continuous variable? Well, instead of using the zero-one binary variable, we can predict the probability of the outcome when given a certain age. We don't exactly use probabilities, however, because probability is restricted between zero and one. So what's something analogous to probability that we can use that's continuous, that can, you know, be negative or positive, be anything? And that's odds. Odds and probability are very closely related. I'm sure you've heard the word odds before, but what are odds?
Odds are a ratio of outcomes that we can use to model chance. [ONSCREEN] # 4. Logistic Regression #What if we want to model an outcome that is NOT continuous but binary (i.e. Has two different values, yes/no, male/female, etc.) #We can use a logistic regression model which converts the outcome variable to log odds #Odds is a ratio of outcomes that we can use to model chance #What is the probability we don't roll a 6 on a fair die? 5/6. What are the odds? 5 To 1. [Noah Won] Probability and odds are related by an equation: the odds equal p over one minus p. But an example that I find helpful for telling the difference between probability and odds is: what is the probability we don't roll a six on a fair die? There are five options where we don't roll a six, out of six total options, so the probability is five sixths. Odds work slightly differently. What are the odds? We take the number of outcomes where we don't roll a six and put that against the number of outcomes where we do roll a six. There are five outcomes where we don't roll a six and one outcome where we do, so the odds are five to one, or five. This is very helpful in our modeling because now we have a way to model data from zero to infinity. And in order to get to negative infinity, we're actually going to be modeling log odds. That's where "logistic" comes from: in logistic regression, we model the log transformation of odds in order to get a continuous variable. That was a mouthful, but getting into the model: we're going to use glm as the function. We're going to use CLINDIS as the outcome, as we had in previous examples; this is the Y variable, it's zero-one. And we're going to use age to model it.
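[Transcript editor's note] The die example above can be checked in a few lines of code. This sketch is not from the session, and the odds helper is hypothetical; it just applies the relation odds = p / (1 − p) described by the presenter.

```python
import math

# Convert a probability to odds: odds = p / (1 - p).
def odds(p: float) -> float:
    return p / (1 - p)

# Probability of NOT rolling a six on a fair die: 5 of 6 outcomes.
p_not_six = 5 / 6
print(odds(p_not_six))   # ~5: five ways to miss a six for every one way to roll it

# Log odds (what logistic regression models) can take any real value.
print(math.log(odds(0.5)))   # 0.0 -- even odds
print(math.log(odds(0.1)))   # negative
print(math.log(odds(0.9)))   # positive
```

The last three lines show why the log transformation matters: probabilities live in (0, 1), odds in (0, ∞), but log odds span the whole real line, so they can be modeled linearly.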
Then we're going to use data = afcars2 as we had created above, and we're going to use family = binomial, which is basically saying, do logistic regression. We're going to run this model. [ONSCREEN] > logmodel <- glm(CLINDIS_d ~ age, data = afcars2, family = binomial) > summary(logmodel) Call: glm(formula = CLINDIS_d ~ age, family = binomial, data = afcars2) Deviance Residuals: Min 1Q Median 3Q Max -1.3933 -0.8209 -0.6331 1.2456 2.0124 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.3358391 0.0088388 -264.3 <2e-16 *** age 0.0944796 0.0005468 172.8 <2e-16 *** --- Signif. Codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 633975 on 539154 degrees of freedom Residual deviance: 602576 on 539153 degrees of freedom (92072 observations deleted due to missingness) AIC: 602580 Number of Fisher Scoring iterations: 4 [Noah Won] Okay, we have some deviance residuals, and we have our coefficients here. Of course, we have our intercept and we have our age coefficient. First we check the p-value: is it less than 0.05, or whatever alpha we set our threshold to be? And yes, it's very much below that. And we have our estimate here. So we're essentially modeling age linearly against log odds, so this unit is in log-odds increments, which is not necessarily helpful for interpretation. So we're going to transform it back to odds in order to be able to interpret it. We see our coefficient here, our estimate, 0.0944796. In order to turn it back, we're just going to exponentiate it; as you know, ln and e cancel each other out. So we're going to exponentiate this estimate, and we see that our odds estimate is 1.099.
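[Transcript editor's note] The back-transformation the presenter describes is a single exponentiation (done in R with exp()). A minimal sketch, here in Python, using the age coefficient from the model output above:

```python
import math

# Age coefficient from the logistic regression output (log-odds scale).
beta_age = 0.0944796

# Exponentiate to undo the log: exp(log-odds coefficient) = odds ratio.
odds_ratio = math.exp(beta_age)
print(round(odds_ratio, 3))   # 1.099 -- the odds multiply by ~1.099 per year of age
```

An odds ratio above 1 means each additional year of age is associated with higher odds of the outcome; a ratio below 1 would mean lower odds.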
And the way to interpret this is: since our p-values are significant, we can say that the odds of being clinically diagnosed with a disability increase by a factor of 1.099 per year of age. So a 26-year-old would have 1.099 times the odds of being clinically diagnosed with a disability compared to a 25-year-old in this data. So we can see there's a positive relationship between being clinically diagnosed with a disability and age. Okay, I'm going to leave some time for questions, but thank you very much for listening. Thank you for attending the Summer Training Series. And, you know, I hope you'll reach out if you have any questions that I can answer. [ONSCREEN CONTENT SLIDE 17] Questions? Alex Roehrkasse aroehrkasse@butler.edu Noah Won noah.won@duke.edu Paige Logan Prater paige.loganprater@ucsf.edu [Paige Logan Prater] Thanks so much, Noah. We only have one question in the Q&A box, but while we're doing that, maybe we'll have a couple more come through. This one is a bit more logistical: someone is asking if the link to those infographics that you were sharing will be included in the course materials. Noah, it might be helpful if you could copy those links from the presentation and put them in the chat for folks to have. When the slides are posted, those links should be clickable because we included them in the presentation. Okay, great; Noah just put those links in the chat. They should be clickable when the materials are posted on our website, but if you have any issues, please let us know. Noah seems to be pretty good at navigating what the infographics offer, so if you have any questions about how to use that tool, definitely reach out as well. Okay, here is a question, Noah: when is it best to visualize continuous variables as categorical?
This may also refer to the previous recording regarding data exploration. [Noah Won] So you mean, like, when to use a histogram, I presume. There is an alternative to a histogram called a density plot. It's essentially like if you made a bunch of little small bins: if you chop the data up into infinitely small bins, it creates a curve in the shape of the histogram. So to answer your question, researchers will generally use their discretion when determining bin widths, because density graphs can get very particular in all the nooks and crannies, and that can distract from the overall distribution. So a lot of researchers prefer histograms, because those bin widths and bin counts provide some leeway in comparison to density graphs. Does that answer your question? [Paige Logan Prater] Hopefully it does, and if not, Antonio, please put another question in the Q&A box. But also, from a less advanced quantitative and statistical researcher: there are all these choices that we can make, right, about how we're approaching data analysis and data visualization. And I think the best practice is to understand why we're choosing to do one thing over the other, having a good justification or rationale for that decision, and making that explicit in any presentations you do, or papers, or things like that. Noah, does that sound accurate? [Noah Won] Yeah, yeah. [Paige Logan Prater] I don't think there's ever an absolute best, only way to do something. [Noah Won] Maybe. Yeah, he says it's referring to the previous recording on exploration, so perhaps that didn't answer your question. [Paige Logan Prater] No, I think it did answer the question.
I was just offering another angle on how to approach a question like that, just from my experiences too. At least in some of the projects I've worked on, there's no clear-cut "we must do this" that emerges; it's a mix of factors. What can the data tell us? Who is our audience, and what do we want them to know? There are a bunch of different considerations, and making sure that we are clear in why we're making certain decisions, in addition to some of the more technical options for data visualization, I think those are two things to consider. [Noah Won] Yeah, I completely agree with Paige. But Antonio, if that doesn't answer your question, my email is always open. It's Noah.Won@duke.edu, so feel free to send me an email and I'd be happy to answer any questions. [Paige Logan Prater] Cool. And they followed up saying both answers help, so that's amazing. Okay, we only have about one more minute, so if anyone has a very last quick burning question that they want to type in, please do; we'll stick around for the next minute or so. But before we close out, thank you all so much for sticking with us through this whole series. Like we mentioned at the top of the hour, everything will be posted; we're working on getting all of that available within the next few weeks, so thank you for your patience. Look out for other learning opportunities: we host our monthly office hours every academic year, and our next summer training series will be similar; the theme will be different, but the format will be the same, every Wednesday for about an hour. So thank you all so much, and I hope you have a great week. [VOICEOVER] The National Data Archive on Child Abuse and Neglect is a joint project of Duke University, Cornell University, University of California San Francisco, and Mathematica.
Funding for NDACAN is provided by the Children's Bureau, an office of the Administration for Children and Families. [MUSIC]