Transcript for video titled "LeaRn R with NDACAN, Week 6, 'Data Visualization', March 21, 2025".

[MUSIC]

[VOICEOVER] National Data Archive on Child Abuse and Neglect.

[Frank Edwards] All right, Noah, do you want to kick us off?

[Noah Won] Sure, Frank. Thank you. Hello, everyone. My name is Noah Won, and I'm a data analyst here at NDACAN. I would like to welcome you to our first Monthly Office Hours Series. If you're new to our Monthly Office Hours Series, it's a series we hold every year, this year including the addition of our training portion. For the first 30 minutes, Frank Edwards will be leading the R training portion, followed by 30 minutes of our typical office hours. From there we'll be splitting into breakout rooms, guided by 3 NDACAN members, based on admin or survey data sets or career development. If you have any questions, please put them in the Q and A box, or you can come off mute and ask the question. But please do not type your question in the chat, for recording purposes. The session will be recorded, and all materials will be available on the NDACAN website. Without further delay, I'll kick it over to Frank to start the R portion of our Monthly Office Hours Series.

[ONSCREEN CONTENT SLIDE 1]
WELCOME TO NDACAN MONTHLY OFFICE HOURS!
National Data Archive on Child Abuse and Neglect
DUKE UNIVERSITY, CORNELL UNIVERSITY, & UNIVERSITY OF CALIFORNIA: SAN FRANCISCO
The session will begin at 11am EST
11:00 - 11:30am - LeaRn with NDACAN (Introduction to R)
11:30 - 12:00pm - Office hours breakout sessions
Please submit LeaRn questions to the Q&A box
This session is being recorded.
See ZOOM Help Center for connection issues: https://support.zoom.us/hc/en-us
If issues persist and solutions cannot be found through Zoom, contact Andres Arroyo at aa17@cornell.edu.

[Frank Edwards] Thanks, Noah, and good morning, everybody. I am Frank Edwards. I'm a research affiliate with NDACAN.
And I'm faculty at Rutgers.

[ONSCREEN CONTENT SLIDE 2]
LeaRn with NDACAN, Presented by Frank Edwards

[Frank Edwards] So today, we're going to cover one of my favorite topics, and that's data visualization.

[ONSCREEN CONTENT SLIDE 3]
MATERIALS FOR THIS COURSE
Course Box folder (https://cornell.box.com/v/LeaRn-with-R-NDACAN-2024-2025) contains
Data (will be released as used in the lessons)
  Census state-level data, 2015-2019
  AFCARS state-aggregate data, 2015-2019
  AFCARS (FAKE) individual-level data, 2016-2019
  NYTD (FAKE) individual-level data, 2017 Cohort
Documentation/codebooks for the provided datasets
Slides used in each week's lesson
Exercises that correspond to each week's lesson
An .R file that will have example, usable R code for each lesson; will be updated and appended with code from each lesson

[Frank Edwards] As always, the materials are available to you in the Box folder. We are on Week 6 today, and we've created fake data sets that mimic the structure of the AFCARS and NYTD data. Again, don't use these for research. These are just for practice; they mimic the structure but don't contain any real data. We also have up there the R script that I'll be using today, for your reference, as well as a homework assignment that you can work on to practice some of the skills we're going to work on today. A copy of the slides and a PDF of the slides are also available in the Box folder.

[ONSCREEN CONTENT SLIDE 4]
WEEK 6: DATA VISUALIZATION
March 21, 2025

[Frank Edwards] So today is data visualization. Again, it's a topic near and dear to my heart, and it should be the core of your toolkit when you're working with data in R. One of the big upsides of using R in particular is that its data visualization tools are incredibly well-developed and, I would argue, almost certainly best in class.
Compared to other statistical software suites, they certainly outpace Stata and others, and outpace Python, I think, generally as well. The main reason for that is the development of ggplot2, which we're gonna cover how to use today.

[ONSCREEN CONTENT SLIDE 5]
DATA USED IN THIS WEEK’S EXAMPLE CODE
AFCARS fake individual level data
./Data/afcars_2018_indv_fake.csv
Simulated foster care data following the AFCARS structure
Can order full data from NDACAN: https://www.ndacan.acf.hhs.gov/datasets/request-dataset.cfm

[Frank Edwards] The data that I'm going to use for today's demo is afcars_2018_indv_fake.csv. This is simulated foster care data that follows the AFCARS structure, and it's at the individual level rather than the aggregated level. For the homework, you'll work with the aggregated data as well as the individual data. And just to check, is my audio... Yeah, okay. Now the mic light is lighting up; it didn't seem like it was coming through. But okay. If you have questions at any point, feel free to just come off mute and ask your question. I don't mind being interrupted.

[ONSCREEN CONTENT SLIDE 6]
BASIC ANATOMY OF A PLOT IN R

[Frank Edwards] So first, let's talk about the basic anatomy of setting up a plot in R.

[ONSCREEN CONTENT SLIDE 7]
GGPLOT2
ggplot2 uses the following basic ingredients for a plot
1) Data, 2) aesthetic mappings, 3) graphics to draw
This takes the form in syntax of
ggplot(DATA, aes(x = VARIABLE)) + geom_histogram()

[Frank Edwards] Our basic syntax for developing visuals in R uses ggplot. We've been using Tidyverse the whole way through. Anytime you library in the Tidyverse suite of packages, you're actually loading, I believe, 10 packages, and ggplot2 is one of the packages that you pull in when you pull in Tidyverse with a library command. Now, in order to create a plot in ggplot, we need 3 ingredients.
We need a data object that's going to take the form of a data frame or a table. We need aesthetic mappings. And we need graphics to draw. So these are the 3 ingredients. First, we need a source table. Then we need to tell R how to map variables in that table onto particular elements of a data visual. And then we need to tell R what kind of visual to draw. This will take the form in syntax of a command like the one at the bottom of the slide here. We'll use the ggplot function, and the first argument we give to the ggplot function is the name of our data object. We then put a comma, and we use the aes function inside the ggplot function to specify the aesthetic parameters that we're going to use. So aes here stands for aesthetic, and we're going to tell R how to map our variables in the data frame onto particular features of our visual. In this case, I'm going to specify one single parameter. That's x, and x is going to handle the horizontal position of the element on the visual. We'll call it variable name for now. Generally, though, the ggplot function itself will take 2 main features: the name of the data object, comma, aes to specify our aesthetic parameters. Then within the parentheses of the aesthetic call, we'll name out all of our aesthetic parameters. After I've made my initial call to ggplot, I'm going to connect it. We've looked at piping before in using the Tidyverse syntax. ggplot has a similar setup where, instead of using the pipe operator, we use a plus to string together different function calls. So after we make our base call to ggplot, data comma aesthetics with our variables, we use a plus sign and then specify, with a different function, the kind of shape we would like ggplot to draw. These are called geoms in ggplot's vernacular. In this case I'm calling for a histogram, geom_histogram. So we have ggplot plus geom, right? This is going to be the basic structure that all of our calls to ggplot will take.
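The three ingredients described above (data, aesthetic mapping, geom) can be sketched on a tiny made-up table. This is only an illustration under stated assumptions: the data frame `toy` and its column `age` are hypothetical stand-ins and are not part of the course data, and `binwidth = 1` is chosen here simply because the toy values are whole numbers.

```r
library(ggplot2)  # also loaded by library(tidyverse)

# Hypothetical toy data (NOT from AFCARS): ten made-up ages
toy <- data.frame(age = c(0, 0, 1, 1, 1, 2, 3, 5, 8, 13))

# 1) data = toy, 2) aesthetic mapping = aes(x = age), 3) geom = histogram,
# with the base call and the geom joined by +
p <- ggplot(toy, aes(x = age)) +
  geom_histogram(binwidth = 1)

print(p)
```

Swapping `geom_histogram()` for another geom changes the drawing while the data and mapping stay the same, which is the point of keeping the ingredients separate.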

[ONSCREEN CONTENT SLIDE 8]
COMMON AESTHETIC PARAMETERS
For univariate visuals, we will generally only use aes(x = VARIABLENAME)
For bivaraites continuous: use aes(x = VAR1, y = VAR2)
Continuous + categorical: aes(x = VAR1, color = VAR2)
Can also use shape, size, color (for lines), fill (for solid fills)
For continuous ranges, try xmin, xmax, ymin, ymax.
Group is also useful: aes(x = VAR1, y = VAR2, group = VAR3)

[Frank Edwards] The common aesthetic parameters that we might use depend on whether we're looking at bivariates, univariates, or multivariates, right? For univariates, we usually only want one variable, and we'll specify it as x: x is variable name. For bivariates, and apologies for the typo, we most commonly might use an x and y, right? For example, if we're making a scatterplot, we'd need a horizontal position and a vertical position. And we can just use commas to separate the aesthetic parameters. If we have a continuous and a categorical, we could imagine setting up aes x equals var1, color equals var2. So we wouldn't necessarily need to use y for var2; we could use color. We could use shape, size, color, fill, right? So we could say shape equals var2, size equals var2. For continuous ranges, if we are, for example, producing a plot with error bars or something like that, we can use xmin, xmax, ymin, or ymax to set our upper and lower bounds for vertical and horizontal plotting. And group can also be useful when we want to plot out, for example, separate lines for different grouping variables within a data set. So, for example, imagine we wanted to create a line plot over time. We could imagine x equals time, y equals the variable that we're looking at, maybe it's caloric intake per day, and group could be equal to exposure to treatment or control group in an experimental setting. Then we could have separate lines for each group. So that can be incredibly helpful.
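The line-plot example just described (time on x, caloric intake on y, one line per experimental arm) might look something like the sketch below. Everything in it is assumed for illustration: the data frame `intake`, its columns `day`, `calories`, and `arm`, and all of the values are made up.

```r
library(ggplot2)

# Hypothetical data: daily caloric intake over 5 days for two made-up arms
intake <- data.frame(
  day      = rep(1:5, times = 2),
  calories = c(2000, 2050, 2100, 2080, 2120,
               2000, 1950, 1900, 1880, 1850),
  arm      = rep(c("control", "treatment"), each = 5)
)

# group = arm draws one line per arm;
# color = arm is added only so the two lines can be told apart
p <- ggplot(intake, aes(x = day, y = calories, group = arm, color = arm)) +
  geom_line()

print(p)
```

Without a grouping aesthetic, geom_line() would connect all ten points into a single zigzag line, so the group mapping is what separates the two series.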

[ONSCREEN CONTENT SLIDE 9]
COMMON GEOMS
Here are my most commonly used geoms
Histogram: geom_histogram()
Density: geom_density()
Scatterplot: geom_point()
Line plot: geom_line()
Bar plot: geom_col() or geom_bar()
Maps: geom_sf()

[Frank Edwards] Common geoms that we'll see are geom_histogram, obviously, we've already looked at that. geom_density to create a density plot. geom_point creates scatter plots. geom_line will create line plots. For bar plots, we'll usually use geom_bar or geom_col, which stands for column. We'll use geom_bar when we have the data at the variable level. But when we've aggregated it up to a count, that is, we have a count of incidences within a category, maybe we're working from something that already looks like a crosstab, then we'll use geom_col instead of geom_bar. And we can also use ggplot2 for GIS and for mapping. So if we have, for example, shapefiles, maybe shapefiles we've downloaded from the US Census or somewhere else, we can easily attach those to our data objects and use geom_sf to draw maps directly in ggplot. So the sky is really the limit in ggplot. If you can imagine a data visual, you can almost certainly create it using the tools that we've got available to us.

[ONSCREEN CONTENT SLIDE 10]
OVER TO RSTUDIO

[Frank Edwards] So instead of doing a lot of slides today, we're going to go right over into RStudio to start practicing with drawing some plots. And I'm going to walk you through some of the most commonly used visuals that you will see in ggplot that I think will be useful for work with our admin data. All right. This, again, is week 6 R. This script is up on the Box folder if you would like to pull it in and follow along. Now, in RStudio, as always, we have our script over here on my left side. We have my console up here, which is where we're going to see the script executed. And then here today, we're gonna have our plots viewer that is gonna show us the output that we are creating.
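The geom_bar versus geom_col distinction mentioned with the list of geoms can be sketched on toy data. This is an illustration under assumptions, not the course script: the data frame `cases` and its column `setting` are hypothetical, and `count()` from dplyr is used here as one way to produce an already-aggregated table.

```r
library(ggplot2)
library(dplyr)  # both are loaded by library(tidyverse)

# Hypothetical case-level data: one row per made-up case
cases <- data.frame(setting = c("a", "a", "a", "b", "b", "c"))

# geom_bar() counts the rows in each category for us
p_bar <- ggplot(cases, aes(x = setting)) +
  geom_bar()

# If the data are already aggregated to counts (a crosstab-like table),
# map the count column to y and use geom_col() instead
counts <- cases %>% count(setting)
p_col <- ggplot(counts, aes(x = setting, y = n)) +
  geom_col()

print(p_bar)
print(p_col)
```

Both plots should show the same bar heights; the only difference is whether ggplot or the analyst did the counting.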
I'm gonna start off by just running the first 10 lines of code here, and a nice tip for working with RStudio: if I put my cursor on line 10, and I'm on a Mac, so if I push Command-Option-B, it will run all lines of code above line 10. So in this case it'll run this command, library Tidyverse, to pull in the Tidyverse packages, and there's ggplot2 that we're working with today.

[ONSCREEN]
### leaRn week 6
### data visualization with ggplot2
library(tidyverse)
### read data
afcars_ind<-read_csv("./data/afcars_2018_indv_fake.csv")
# take a look
head(afcars_ind)

[Frank Edwards] It'll then read in my AFCARS individual-level data, and anytime I read in a file I like to take a look at it. I used head to do that here, just to see the variables that I've pulled in and make sure that everything loaded in appropriately. Let's start with some univariate visuals. I'm on lines 11 to 15 right now. Again, we need those 3 ingredients: we need the data, we need the aesthetic mappings, and we need the geom. So the data we're working with is called afcars_ind, and then for my aesthetic mappings here, I'd like to look at the age at which a child was first removed from their home and placed into foster care. We don't see it over here. I'm gonna make my text a little smaller for a second, so we can actually just see that variable. Gotta go very small. Or I could make my screen a bit larger. Let's do that.

[ONSCREEN]
# A tibble: 6 x 20
  fy id_num stnum sex_f raceth_f phyabuse_f sexabuse_f neglect_f prtsjail_f aaparent_f daparent_f curplset_f ctkfamst_f fosfamst_f agefirstrem_f ageatlatrem_f inatend_f
1 2018 8435 1 1 1 0 0 1 0 0 1 3 2 0 8 8 0
2 2018 8458 1 1 2 0 0 1 0 0 0 3 2 0 10 10 0
3 2018 2 1 1 7 0 0 0 0 0 0 2 2 2 13 13 0
4 2018 8468 1 1 1 0 0 1 1 1 0 2 2 2 10 10 0
5 2018 3 1 2 7 0 0 1 0 0 0 3 2 0 13 13 0
6 2018 8480 1 2 1 0 0 1 0 0 0 3 2 0 13 13 1

[Frank Edwards] Okay, yeah, here's the agefirstrem_f variable over here, and we can see that this is an integer that takes on, you know, values that represent the numeric age of the child when they were first removed into foster care. So we're gonna look at the distribution of that variable across cases. Anytime we have a continuous variable that... yes. So if we have a continuous numeric variable that's observed across a lot of cases, a histogram is a really great way to get a sense of the distribution of that variable. So here's our histogram.

[ONSCREEN]
## Univariate visuals
# histogram of age at first entry
ggplot(afcars_ind, aes(x = agefirstrem_f)) +
  geom_histogram()

Output Image 1 Bar chart with x-axis "agefirstrem_f", y-axis "count".

[Frank Edwards] This is a great way to get a quick sense of what's going on in the data. On the y-axis here we have count. On the x-axis we have the numeric age, where the height of the bar represents the number of cases that fall into each age integer. So we can see that the most common values for age in the data are 0 and 1. 1 is where the spike is. We also have a lot of zeros. After age 1, we see decreasing numbers going on up to age 17. Now, maybe we want to look at this not as a discrete plot. A histogram will show us the counts on the y-axis, but it doesn't show us the proportion of the data. Maybe we want to know the relative share of the data. We're not so interested in counts.
And maybe we have, in this case we do have, a kind of discrete structure to the data, because age is represented as an integer.

[ONSCREEN]
# density of age at first entry
ggplot(afcars_ind, aes(x = agefirstrem_f)) +
  geom_density()

Output Image 2 Line graph with x-axis "agefirstrem_f", y-axis "density". Shows continuous line with spike at age 1 and declining to age 17.

[Frank Edwards] But perhaps we had a more truly continuous variable. A geom_density, a density plot, will provide similar information to a histogram, right? It'll give us a quick glimpse of the overall structure of the data. We can still see those 2 features, that 0 and 1 is really where we see most of the children entering foster care for the first time. But now we have a sort of different feature of the data. A density plot can be read similarly to a probability distribution, in that the area under the curve of this density must equal one. So if we took, for example, a line drawn at age equals 1 or 2, we could then kind of imagine what the area under this zone of the curve is to get a sense of what proportion of the data falls in that range. So histograms and densities show us really similar sets of information, depending on how you want to look at it. Maybe we want to look at the distribution of a categorical variable in the data.

[ONSCREEN]
# distribution of race/ethnicity
ggplot(afcars_ind, aes(x = raceth_f)) +
  geom_bar()

Output Image 3 Bar chart with x-axis "raceth_f", y-axis "count". Most raceth_f values are less than ~10 and the counts

[Frank Edwards] Here, let's look at race ethnicity. Okay, something weird is going on here. Let's go back to head and think about what might be happening. Race ethnicity, we can see, is represented here as a numeric.
And so when I asked R to give me a geom_bar for race ethnicity here, it did that, but it treated it as a numeric variable, when what we might be more interested in is thinking about race and ethnicity as a categorical variable rather than as a truly numeric variable. In order to force R to handle this as a categorical variable, I'm just gonna put a factor wrapper around race ethnicity. I don't need to transform this prior to the plot. I can do it within the call to ggplot itself, and that'll give me this visual.

[ONSCREEN]
# weird, oh because it is numeric
ggplot(afcars_ind, aes(x = factor(raceth_f))) +
  geom_bar()

Output Image 4 Bar chart with x-axis "factor(raceth_f)" displays eight values i.e. integers 1-7 and 99. y-axis "count" and ranges from near zero for x=5 to greater than 15000 for x=1 and x=7.

[Frank Edwards] This is much more useful. Now I can, at a glance, if I have my codebook in hand, tell what the distribution of race ethnicity is across the data set I'm looking at. Maybe I'm interested in thinking about the 2 variables together. So maybe I want to think about how the distribution of age at first removal differs across racial and ethnic groups. To do that, I can start to add color and other elements to the plot. We're going to start with a simpler one. We're going to start with sex, which takes on 3 values in this data set: it takes on 1, 2, and 99 for missing. Before, we were just using x; we're gonna add color now. So I'm gonna set x again at age at first removal, and I'm going to set color at factor of sex. Again, sex, like race here, is coded numerically, and we need to tell R that we want to handle it categorically rather than numerically. Let's take a look at what we get.

[ONSCREEN]
## Bivariate continuous / categorical
# age at first entry by child sex
ggplot(afcars_ind, aes(x = agefirstrem_f, color = factor(sex_f))) +
  geom_density()

Output Image 5 Line graph with x-axis "agefirstrem_f", y-axis "density".
Shows three colored lines with a legend mapping the color of the factor(sex_f) lines to 1, 2, and NA.

[Frank Edwards] Okay. So now we have 3 density plots. We have the salmon colored line for factor sex is 1, the blue line for factor sex is 2, and then this gray line for the missing values, which we don't have a lot of. There are not very many missing cases in this data, but again, a density plot will show it to us as the proportion of the data, so it may not be as informative as a histogram might be in this case. But let's take a look at this a little differently. We have some trouble here teasing out the difference between the salmon line and the blue line, the pink and the blue line. They really do overlap quite a lot, and so we can't really see them separately. If I am in a situation where I'd like to draw multiple lines, but I don't necessarily want them all to be within the same pane of the plot, I can use this additional function called facet_grid, and then I can tell R to facet the plot by sex. The tilde tells it which variable to use for faceting. Faceting is going to produce a series of what are called small multiple plots. It'll produce lots of different versions of the same plot, one for each value of the categorical that I provide.

[ONSCREEN]
## thats a bit difficult because of overlap.
# let's try small multiples with facet_
# use facet_grid when you want to fix the number of rows or columns
# facet_wrap is more generic
ggplot(afcars_ind, aes(x = agefirstrem_f)) +
  geom_density() +
  facet_grid(~sex_f)

Output Image 6 Three separate line graphs for values 1, 2, and NA. Each has x-axis "agefirstrem_f" and y-axis "density".

[Frank Edwards] In this case I will now get one density plot for each value of sex, and this allows us to see each of those values side by side if we don't want them overlaid on top of each other. Now, maybe we want to think about pivoting over to race ethnicity again.

[ONSCREEN]
# density of age at first entry by child race/ethnicity
ggplot(afcars_ind, aes(x = agefirstrem_f, color = factor(raceth_f))) +
  geom_density()

Output Image 7 Line graph with x-axis "agefirstrem_f", y-axis "density". Shows eight colored lines with a legend mapping the colors of the factor(raceth_f) lines to eight values i.e. integers 1-7 and 99.

[Frank Edwards] This is the density plot when we add race ethnicity as a color aesthetic. And here we can compare these quite nicely. We can see that there are clear differences across groups in terms of the age distribution for the race ethnicity values. But we might want to go back and keep in mind what we saw here on our plot, I'm sorry, this plot: that we have very different amounts of data. For groups 1, 2, and 7, we have the most data. So those might be the ones we want to pay most attention to as we look at the plot. So we have pink, orange, and purple. The pink, orange, and purple, those are the ones up top here. We can see that those distributions look like they do have some slight differences in the proportion of children whose first entry was at one year old. That might be something for us to look at a bit more closely. Maybe we want to look at both race ethnicity and sex. So here we treated race ethnicity as our color parameter. There's no reason we can't combine the color parameter with a facet_wrap.

[ONSCREEN]
# And let's also look at sex
ggplot(afcars_ind, aes(x = agefirstrem_f, color = factor(raceth_f))) +
  geom_density() +
  facet_wrap(~sex_f)

Output Image 8 Three separate line graphs for values 1, 2, and NA. Each has x-axis "agefirstrem_f" and y-axis "density", and each has eight different colored lines for factor(raceth_f) values 1-7 and 99.

[Frank Edwards] So we can look now at 1, 2, 3 variables simultaneously. We didn't have enough data in the NA to represent. You can see here this non missing arguments.
So we didn't have enough information. I think we only had 2 missing sex cases, so those just get dropped from the plotting. But now we have a separate density plot for each racial and ethnic group by sex. We could also flip this to make sex the color and race ethnicity the facet. That would be just as simple as changing around the position of these variable names. Any questions so far? All right, then, let's pivot over to scatterplots. Let's look at the joint distribution of age at first and last removal. So here we're going to have on the x-axis the age when the child was first removed into foster care, and on the y-axis, the age they were at their last removal into foster care. And, of course, for a lot of children these will be the same. If they only have one entry, then x and y will be equal to each other. Let's take a look at what this looks like, though.

[ONSCREEN]
### Two continuous measures
# let's look at the joint distribution of age at first and last rem
ggplot(afcars_ind, aes(x = agefirstrem_f, y = ageatlatrem_f)) +
  geom_point()

Output Image 9 Graph with x-axis "agefirstrem_f" and y-axis "ageatlatrem_f". Shows multiple vertical bars along the agefirstrem_f values from 0 to 17, and their ageatlatrem_f values do not exceed 25, except for two dots that are ageatlatrem_f of 99.

[Frank Edwards] Okay, that's kind of funky. We actually have a problem here, right? These are our codes for missing data. I believe here they take on values of 98 or 99. And so we want to go ahead and pull those out; that really throws off our ability to look at this visual. Here I know that no one over the age of 25 should be in the data. They certainly are not entering foster care for the first time at age 25, and their last removal can't be that age, either. So I'm just going to remove all observations that are over 25, and I'm going to use filter to tell R to only retain those values where the age at first and last removal are less than 25.
And let's take a look at the plot now.

[ONSCREEN]
#### ok those 99s are missing, let's remove them
ggplot(afcars_ind %>%
         filter(agefirstrem_f<25, ageatlatrem_f<25),
       aes(x = agefirstrem_f, y = ageatlatrem_f)) +
  geom_point()

Output Image 10 Graph with x-axis "agefirstrem_f" and y-axis "ageatlatrem_f". Shows multiple vertical series of single dots along the agefirstrem_f values from 0 to 17, and their ageatlatrem_f values do not exceed 25.

[Frank Edwards] Okay, so this is starting to look a little better. But there are still a few things that are difficult about this plot. We don't know how much data is at each of these points, right? So for age 0, we have observations where the first and last entry were both at 0. But we also have observations at 1, 2, 3, all the way up to 17, right? That is, we have observations at all possible ages. But we don't know how many observations from this visual, because all those points are overlapping with each other. So the first thing we want to do is add a little bit of random noise onto each of those points. We can do that in our geom_point function by adding an argument for position, and we're going to tell R to use jitter to add a small amount of random noise to the position of each point, so that the points are no longer exactly overlapping. Let's see what that looks like.

[ONSCREEN]
### this doesn't do a good job of showing the density of data
## at each point because age is an integer
# Let's add some random noise to each observation with a jitter
ggplot(afcars_ind %>%
         filter(agefirstrem_f<25, ageatlatrem_f<25),
       aes(x = agefirstrem_f, y = ageatlatrem_f)) +
  geom_point(position = position_jitter())

Output Image 11 Graph with x-axis "agefirstrem_f" and y-axis "ageatlatrem_f". Shows multiple vertical series of dots along the agefirstrem_f values from 0 to 17, and their ageatlatrem_f values do not exceed 25. The graph now shows how the data density decreases as ageatlatrem_f increases for each agefirstrem_f column.

[Frank Edwards] Here we go. Okay. So now the points have all been slightly spread out, and effectively, the areas of the plot that are darker now have more data, and the areas that are a little more sparse, where we can see some gray shining through, have a little less data. But again, we still have a ton of overlapping data here. So maybe I want to go a step further. I also might want to adjust the transparency of the points. To do that, I'm going to use the alpha parameter. Alpha takes on values between 0 and 1, and here I'm going to use 0.25. Zero means fully transparent. One means fully opaque. 0.25 will give me a fairly transparent point.

[ONSCREEN]
# better! let's make the points a little transparent
# alpha does the trick here, 0 is transparent, 1 is opaque
ggplot(afcars_ind %>%
         filter(agefirstrem_f<25, ageatlatrem_f<25),
       aes(x = agefirstrem_f, y = ageatlatrem_f)) +
  geom_point(position = position_jitter(), alpha = 0.25)

Output Image 12 Graph with x-axis "agefirstrem_f" and y-axis "ageatlatrem_f". Shows multiple vertical series of dots along the agefirstrem_f values from 0 to 17, and their ageatlatrem_f values do not exceed 25. The graph now shows with color transparency how the data density decreases as ageatlatrem_f increases for each agefirstrem_f column by progressively graying out the areas with less data.

[Frank Edwards] Now we can really start to see the density of the data a little more clearly. Most of the data is falling along the y equals x line, right? Not most of the data, but that's our modal kind of value, the y equals x line. That is, you know, age at first removal is equal to age at last removal, and we can see a kind of decrease in the proportion of cases that are very far away from y equals x. But there are still some, of course. This gives us a better sense of the exact location. So maybe let's clean it up a little.
This is starting to look a little better, and if I wanted to present this in a paper, I would think that I'm getting close, but my axis labels are not particularly useful. To get those ready for print, I'm going to use the labs argument, or the labs function. labs allows me to set the labels for the different features of the plot. So I'll just plus labs. Here's my x-axis title. Here's my y-axis title. Let's run that and see what we get.

[ONSCREEN]
# not bad! Let's provide useful axis labels
ggplot(afcars_ind %>%
         filter(agefirstrem_f<25, ageatlatrem_f<25),
       aes(x = agefirstrem_f, y = ageatlatrem_f)) +
  geom_point(position = position_jitter(), alpha = 0.25) +
  labs(x = "Age at first removal", y = "Age at last removal")

Output Image 13 Same as previous graph with x-axis labeled "Age at first removal" and y-axis labeled "Age at last removal". Shows multiple vertical series of dots along the agefirstrem_f values from 0 to 17, and their ageatlatrem_f values do not exceed 25. The graph shows with color transparency how the data density decreases as ageatlatrem_f increases for each agefirstrem_f column by progressively graying out the areas with less data.

[Frank Edwards] Now we have much nicer axis labels. Maybe I decide I don't like that gray background. ggplot has a number of themes that we can switch. theme_minimal is one I often like to use, and we can just add that as a function call after our labs to remove that gray plotting background.

[ONSCREEN]
# and what's with that grey background, I don't like it
# we can swap to a different theme easily
ggplot(afcars_ind %>%
         filter(agefirstrem_f<25, ageatlatrem_f<25),
       aes(x = agefirstrem_f, y = ageatlatrem_f)) +
  geom_point(position = position_jitter(), alpha = 0.25) +
  labs(x = "Age at first removal", y = "Age at last removal") +
  theme_minimal()

Output Image 14 Same as previous graph without a gray background.
Shows multiple vertical series of dots along the agefirstrem_f values from 0 to 17, and their ageatlatrem_f values do not exceed 25. The graph shows with color transparency how the data density decreases as ageatlatrem_f increases for each agefirstrem_f column by progressively graying out the areas with less data.

[Frank Edwards] Now I have a much more minimal plot. It still has the grid lines. Maybe I want to remove those too. I could continue down that road if I like. Okay, so we've done a pretty good job producing this. Once we've got this plot, we might want to think about whether this differs across racial and ethnic groups, whether this basic pattern is the same across groups. We can facet_wrap by race to produce one version of this plot for each group if we like.

[ONSCREEN]
## Ok cool! Is this pattern the same for all groups?
ggplot(afcars_ind %>%
         filter(agefirstrem_f<25, ageatlatrem_f<25),
       aes(x = agefirstrem_f, y = ageatlatrem_f)) +
  geom_point(position = position_jitter(), alpha = 0.25) +
  facet_wrap(~raceth_f) +
  labs(x = "Age at first removal", y = "Age at last removal")

Output Image 15 Eight different graphs, one each for integers 1-7 and 99. Each graph shows multiple vertical series of dots along the agefirstrem_f values from 0 to 17, and their ageatlatrem_f values do not exceed 25. Each graph shows with color transparency how the data density decreases as ageatlatrem_f increases for each agefirstrem_f column by progressively graying out the areas with less data. There are large areas of sparse data in the graphs for 3, 4, 5, and 99.

[Frank Edwards] Maybe we want to look at child sex. There's no reason we can't take our basic plot here, copy and paste it, and then add a color argument.

[ONSCREEN]
# and are there differences by child sex?
ggplot(afcars_ind %>%
         filter(agefirstrem_f<25, ageatlatrem_f<25),
       aes(x = agefirstrem_f, y = ageatlatrem_f, color = sex_f)) +
  geom_point(position = position_jitter(), alpha = 0.25) +
  facet_wrap(~raceth_f) +
  labs(x = "Age at first removal", y = "Age at last removal")

Output Image 16 Eight different graphs, one each for integers 1-7 and 99. Each graph shows multiple vertical series of dots along the agefirstrem_f values from 0 to 17, and their ageatlatrem_f values do not exceed 25. Each graph shows with color transparency how the data density decreases as ageatlatrem_f increases for each agefirstrem_f column by progressively graying out the areas with less data. There are large areas of sparse data in the graphs for 3, 4, 5, and 99. The legend is labeled sex_f and shows a range of continuous values from 1.00 to 2.00.

[Frank Edwards] Now we have color. Okay. But again, we didn't factor it first, so we have this treated as continuous, which is a little funky. We're going to switch it over to factor.

[ONSCREEN]
### oops sex is binary, force it to a factor
ggplot(afcars_ind %>%
         filter(agefirstrem_f<25, ageatlatrem_f<25),
       aes(x = agefirstrem_f, y = ageatlatrem_f, color = factor(sex_f))) +
  geom_point(position = position_jitter(), alpha = 0.25) +
  facet_wrap(~raceth_f) +
  labs(x = "Age at first removal", y = "Age at last removal")

Output Image 17 Eight different graphs, one each for integers 1-7 and 99. Each graph shows multiple vertical series of dots along the agefirstrem_f values from 0 to 17, and their ageatlatrem_f values do not exceed 25. Each graph shows with color transparency how the data density decreases as ageatlatrem_f increases for each agefirstrem_f column by progressively graying out the areas with less data. The graphs contain the colors pink, blue, and gray. The legend is labeled "factor(sex_f)" and shows three values: 1 is pink, 2 is blue, and NA is gray.

[Frank Edwards] Now we have the pinks and the blues. And let's clean up that legend.
I can adjust that with labs, color equals sex. I have an error in my code somewhere. I think it's right there? Hmm, okay, well, I need to debug this, my apologies. That works. Okay. We can also look at placement setting by sex and by age.

[ONSCREEN]
# one more, placement setting by sex and age
ggplot(afcars_ind,
       aes(x = factor(curplset_f))) +
  geom_bar()

Output Image 18 Bar graph with x-axis "factor(curplset_f)" and y-axis "count". Shows five bars for x-axis "factor(curplset_f)" integer values 1 through 4 and value NA. The counts are ~5000 for values 3 and 4, ~10,000 for value 1, ~30,000 for value 2, and just above 0 for value NA.

[Frank Edwards] Right here we're going to look at x as a factor for placement setting. We can make sex color and keep placement setting as x.

[ONSCREEN]
# let's make sex color and keep placement setting as x
ggplot(afcars_ind,
       aes(x = factor(curplset_f), color = factor(sex_f))) +
  geom_bar()

Output Image 19 The bars on the previous graph are now segmented by outline colors pink, blue, and gray. There is now a legend labeled "factor(sex_f)" with pink outline for 1, a blue outline for 2, and a gray outline for NA. Shows five bars for x-axis "factor(curplset_f)" integer values 1 through 4 and value NA. The counts are ~5000 for values 3 and 4, ~10,000 for value 1, ~30,000 for value 2, and just above 0 for value NA.

[Frank Edwards] In this case, though, I might really want to use fill rather than color. Color is going to give us the outline; fill is going to color the interior of the bar.

[ONSCREEN]
## oops we want fill
ggplot(afcars_ind,
       aes(x = factor(curplset_f), fill = factor(sex_f))) +
  geom_bar()

Output Image 20 The bars on the previous graph are now more clearly segmented and stacked by fill colors pink, blue, and gray. There is now a legend labeled "factor(sex_f)" with pink fill for 1, a blue fill for 2, and a gray fill for NA. Shows five stacked bars for x-axis "factor(curplset_f)" integer values 1 through 4 and value NA.
The counts are ~5000 total for values 3 and 4 and equally distributed blue and pink, ~10,000 for value 1 and equally distributed blue and pink, ~30,000 for value 2 and almost equally distributed with pink visibly slightly greater than blue, and just above 0 for value NA.

[Frank Edwards] And I don't want a stacked bar plot here. I really want them side by side, and just like we did position jitter, I can do position dodge to force the bars to be next to each other.

[ONSCREEN]
## and I want to see the bars side by side, not stacked
ggplot(afcars_ind,
       aes(x = factor(curplset_f), fill = factor(sex_f))) +
  geom_bar(position = position_dodge())

Output Image 21 A bar graph for x-axis "factor(curplset_f)" integer values 1 through 4 and value NA and y-axis "count". Each x-axis "factor(curplset_f)" value shows a pink and a blue bar. The counts show ~2500 for pink and blue at values 3 and 4, ~5000 for pink and blue at value 1, about 16000 for pink and 15000 for blue at value 2, and just above 0 for value NA.

[Frank Edwards] And maybe I want to add age at last removal, to think about whether there are differences in placement setting by the age at which the child was removed. Now this plot's getting a bit big. I'm going to need to adjust my window to see it.

[ONSCREEN]
# ok now let's add age at last removal
ggplot(afcars_ind,
       aes(x = factor(curplset_f), fill = factor(sex_f))) +
  geom_bar(position = position_dodge()) +
  facet_wrap(~ageatlatrem_f)

Output Image 22 Twenty-one separate bar graphs, one each for integers 0-18, 98, and NA. Each graph has an x-axis "factor(curplset_f)" and a y-axis of "count". Graphs 15-18, 98, and NA show very low or no counts in comparison to the other graphs because the y-axis scales all go to the same large number.

[Frank Edwards] And here we'll notice that the y-axis is scaled to be constant across groups, and we know we have more 0 and 1 year olds in the data. Maybe I don't want that.
Maybe I'd like to have each group have its own y-axis. To do that I can add the scales equals free y argument.

[ONSCREEN]
# the y axis makes this tough - many more 1 year olds than 15 year olds
# we can let the y axis vary for each facet
ggplot(afcars_ind,
       aes(x = factor(curplset_f), fill = factor(sex_f))) +
  geom_bar(position = position_dodge()) +
  facet_wrap(~ageatlatrem_f, scales = "free_y")

Output Image 23 Twenty-one separate bar graphs, one each for integers 0-18, 98, and NA. Each graph has an x-axis "factor(curplset_f)" and a y-axis of "count". Each graph has a different y-axis scale which now accommodates and displays all of its data.

[Frank Edwards] And now I'll end up with a separate y-axis for each group. Just make sure that your readers are aware that you've done that. And let's get it ready for presentation by cleaning up my labels: y, x, fill.

[ONSCREEN]
# and get it ready for presentation
ggplot(afcars_ind,
       aes(x = factor(curplset_f), fill = factor(sex_f))) +
  geom_bar(position = position_dodge()) +
  facet_wrap(~ageatlatrem_f, scales = "free_y") +
  labs(y = "Number of children",
       x = "Placement setting",
       fill = "Child sex",
       title = "Foster care placement settings for 2018",
       subtitle = "by child age (panels) and sex (color)") +
  theme_bw()

Output Image 24 The previous image of twenty-one separate bar graphs now has the title "Foster care placement settings for 2018", the subtitle "by child age (panels) and sex (color)", the color legend labeled "Child sex", the x-axis labeled "Placement setting", and the y-axis labeled "Number of children".

[Frank Edwards] I can also use title and subtitle to add additional annotations. So this is pretty close to a print-ready visual. That's all I've got for us today. Homework will have you work through a few of these exercises with more of the AFCARS individual and AFCARS aggregated data. Does anybody have any questions? Okay? Well, then, I'm gonna hand it back over to Noah. Thank you all for joining us today.
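
The legend cleanup step in the session ("labs, color equals sex") was debugged live, and its code does not appear on screen in this transcript. As a minimal sketch of what that step could look like, the following uses a small simulated data frame, here called toy, as a hypothetical stand-in for afcars_ind. The column names follow the session, but the values are made up, and the exact code used in the session may have differed.

```r
library(ggplot2)

# Hypothetical stand-in for afcars_ind: simulated values, not real data
set.seed(1)
toy <- data.frame(
  agefirstrem_f = sample(0:17, 200, replace = TRUE),
  sex_f = sample(c(1, 2), 200, replace = TRUE)
)
# Assumption for the simulation: age at last removal is never
# earlier than age at first removal, and is capped at 17
toy$ageatlatrem_f <- pmin(toy$agefirstrem_f + sample(0:5, 200, replace = TRUE), 17)

p <- ggplot(toy,
            aes(x = agefirstrem_f, y = ageatlatrem_f, color = factor(sex_f))) +
  geom_point(position = position_jitter(), alpha = 0.25) +
  labs(x = "Age at first removal",
       y = "Age at last removal",
       color = "Child sex")  # replaces the default legend title "factor(sex_f)"

print(p)
```

The name passed to labs has to match the aesthetic that was mapped: color here, because the points are colored, and fill in the bar plots, as in the final plot of the session.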
[VOICEOVER] The National Data Archive on Child Abuse and Neglect is a joint project of Duke University, Cornell University, University of California San Francisco, and Mathematica. Funding for NDACAN is provided by the Children's Bureau, an Office of the Administration for Children and Families. [MUSIC]