[Voiceover]
National Data Archive on Child Abuse and Neglect.

[Clayton Covington]
All right again welcome everyone to the final session of the 2023 Summer Training Series here at the National Data Archive on Child Abuse and Neglect. So as a reminder this is the conclusion of the Summer Training Series which is hosted by the National Data Archive on Child Abuse and Neglect at Cornell university and Duke University.  We are also an organization that is funded by the Children's Bureau which is an Office of the Administration of Children and Families housed within the U.S. Department of Health and Human Services.
Just to give you all a little preview of where we've been so far. We started off at the beginning of July with an introduction to NDACAN and the administrative data cluster that we hold. We then introduced a new data acquisition the CCOULD data set which is linking child welfare data to Medicaid data. We also did a couple of workshops including one on causal inference using administrative data, evaluating and dealing with missing data in R, and then most recently a time series analysis in stata. In today's presentation will be on data visualization using R.
So I will hand it off to our presenter which is Sarah Sernaker the NDACAN statistician. Here you go Sarah.

[Sarah Sernaker]
Thank you Clayton thanks for the introduction and the intro to the summer training series. Just to reiterate some of the points Clayton mentioned this is the last session of our series. We do this every year. Previous and previous years presentations are posted on our website and available to watch and the slides and the transcript of the presentations are there and Andres has been doing such a good job that some of this summer's presentations are also already up there so you could watch the first few presentations if you miss them which I highly recommend doing. But so today we're going to be talking about data visualization which I really love almost too much to the point where I had to really edit when I put in the slides because there's just so much to talk about in terms of data visualization. It's just fun stuff. So today just generally I want to introduce some topics and data visualization some examples of visuals and then I want to make our way to R pretty quickly so if I talk really fast like please bear with me and feel free to ask questions that Clayton and Alex will field and yeah so let's get started. So topics in data visualization why visualize data. Basically pictures are more fun than words and tables but visualizing data helps us really uncover patterns, gain a better understanding, such as trends or outliers. You can look at a visualization and understand and take away a lot more than if you're just staring at a table of data and numbers in your face. So it's just a great way to organize and understand your data. And you can display results from modeling so if you fit a model and you're looking at your model fit or you can display point estimates or descriptive statistics. So you can visualize all types of things. And in summary I mean they're just more palatable, memorable, and easier to compare trends than tables or looking at the data itself. I'm sure if I asked you to remember one of the best data visualizations you could do it but if I told you to remember the best data table you've ever seen probably nothing comes to mind or maybe like a mess of a data table. So what makes an effective visual so there's plenty of visuals out there plenty of different types of figures and plots and this and that, but really at the end of the day when you're making a visualization for your goals you want to ask yourself these questions about who is the intended audience? Because if you're making a visual that's going to end up in a journal article that's going to look way different than if you need to do a poster presentation and the visual needs to stand completely on its own. What are you trying to convey? Asking yourself what is the story that I'm trying to say with my visual? Not just adding everything you can but if someone were to look at the visual I've made what is the main story I want them to take away if only one thing? What do you want to highlight in your visit visualizations? You might be plotting a time series and have a few different points within each year maybe be sex or by race or things like that. What of these point estimates do you really want to highlight and how can you highlight them? How can you make your point most effectively? And some general guidelines that I try to live by is the simpler the better. If you can convey the same message and story in simpler terms then there's no need to overcomplicate or make a figure too busy. Even if you think you by adding more and more and more labels and information you're helping a person understand it it could really just overwhelm someone and they might just not even want it might just prohibit their brain like mine does from understanding. If I see a really busy figure my brain just goes oof like it's just too much. So sometimes finding a nice balance of including all the information you need and just that. Not anymore to make it busy or overcomplicated. Other you know natural advice add informative titles labels axes legends variables so the figure can stand on its own, if it needs to. Again these are all with the caveats of who is the intended audience and what are you trying to convey. For example in literature articles or journal articles sometimes you submit papers and they don't want any titles like you add the title within the paper. So this is all you know still within the realm of focusing on the goal for the figure you're making. And again not overcomplicating making concise and clear legends, not just you know acronyms that make sense to you and no one else, not just abbreviations that could mean multiple things. So keeping all of this in mind. And then finally the most you know innate thing but to use the appropriate figure for your data. So try to avoid you know using a bar chart for continuous data or using a pie chart for I don't even know just you should probably just avoid using pie charts in general. But my point is keep in mind the type of data that you have: is a categorical? Is it ordered? Is it continuous? Is it scaled? Is it like a is it like a group scale like a range, is it a range of data is the term I'm looking for. And using the most appropriate figure for that whether that's a scatter plot, density plot, bar chart, etc.. Some considerations that you might not have thought about when you're putting together a data visualization and I have to be honest I didn't until the most until some recent years you know considerations that you don't think about if it doesn't affect you such as colorblind accessibility. I gave a presentation one summer I was an intern and one of the guys raised his hand at the end and told me he was colorblind and couldn't see any of my beautifully colored points. And so just being aware that you know people with different accessibility out there it does affect the realm of data science and statistics and you know it's just something to keep in mind and R and ggplot which I'll get into more has specific themes to help you incorporate accessibility. So they have colorblind palettes. So it's not all like black and gray and white but it is colors that you know a colorblind or color someone who can't see the full range of colors could still differentiate. So that's like a huge one obviously because people can't avoid that. Something other to keep in mind is colors may look different on different computer screens. I have two monitors in front of me and if I move my screen from one monitor to the other the color slightly vary like one's monitored the colors are huge they're bright and beautiful and the other looks very dull. So my point is you know don't spend so much time focusing on these precise things that you know ultimately the final product may look slightly different.

Similarly aesthetics is important but subjective. So the best example I could think about is maybe you're into like neon colors and that's your whole vibe and lifestyle and you want to include that in your in your figures. But you know that's maybe not the best for everyone and it's not the best way to demonstrate your data even though it's cool and funky and you really like it. You know again trying to ask yourself who's my intended audience and how can I best convey these things? So keeping things simple, like simple and classic without detracting from your true message. And again reiterating you know what type of graph is appropriate for your data? I think it's something we kind of take advantage of because inherently you know I as a statistician we grow so used to using the same types you may not think about it but just you know taking a second to consider what's best. So really quickly I just have some examples here of different graphs for different types of data. This is not like the full exhaustive list of all possible graphs or data types but for example if you have continuous data you could use a density plot or a scatter plot. If you had continuous time series data you'd usually want to use a scatter plot. If you had ordinal data that'd be using a bar plot or sometimes stacked bar plots. If you have ordinal and continuous and you're trying to incorporate three dimensions think about heat maps or something like that. So again totally not exhaustive but trying to get you to think about what I mean as far as different graphs for different data. Some cautions to keep in mind as you're putting together your figure just things to be aware of are axes scales, so making sure that you're not swamping results. So if you have data let's say a time series and the values range from 50 to 100. You don't make the axes zero to one million because then that 50 to 100 range is going to look like nothing compared to a million. You're going to swamp totally swap the picture. And I'll show you this in practice but on a similar topic, aspect ratio: so even if you get your axes right let's say your y-axis is aligned on the scale that you want your x-axis look good, when you go to save a figure or you go to like place it on your poster place it in your article, you want to be sure that you're not smushing it to use a technical term. So like what I mean is your axes could be right, but if you save the figure such that it's you know smushed too narrow or drawn out too wide or just too tall it could really screw again screw with the way that people interpret your data and again you could have a swapping effect or you could have an effect where you overexaggerate results.
Other stuff trying not to be misleading, being careful with what estimate is shown. For example using rates versus counts and I'll show you in the context of our data this is really important. So making sure that the points that you're plotting on a graph are exactly comparable not that you when a person looks at a graph they want to say point a how does that compare to point b but if point a is apples and point b is oranges they shouldn't be on the same graph to begin with because now you're misleading people into thinking they could be compared. So is it misleading? Is it appropriate? And is it most effective at telling your story? That comes when you're doing a research question and have data and are making figures. I'm sure there are plenty of statistics you could draw means, standard deviations, proportions, and this is where I mean really asking yourself which statistic let's say you can only make one figure for your paper, what statistic is really going to tell the most effective story for your research question? So thinking about things like that, like when you have an option of statistics or results or values what is really going to be most effective? And then also just general caution don't present misleading information or be misleading in the presentation of your data and I have a few examples that I'll quickly go through a bad, bad visualizations. But I want to really quickly go through examples of figures and then jump into r. So here I'm going to quickly go through this is if you have data. So it's an example of a scatter plot for the data and these slides will be posted which is why I'm going to fly through this because you can recreate this yourself. But a really helpful visualization tool is this function called pairs. And if you have data and you're and you just don't even know where to start or what relationships there are, this pairs function will plot every variable against itself so you can look at all of the different types of relationships. I've just highlighted these two here that which are really clear.

[ONSCREEN 72 scatter plots are displayed in a 9 by 9 grid for the following nine variables which appear listed within the grid in their own boxes diagonally from top left to lower: rm, age, dis, rad, tax, ptratio, black, lstat, medv. The speaker has highlighted the two plots. The first plot is at the intersection of the row for variable rm and the column for variable medv. This plot shows data points as a cluster which forms a rough line that slopes up at a 30 degree angle. The second plot is at the intersection of the row for variable lstat and the column for variable medv. This plot shows the data points as a cluster which forms a curve sloping down from high to low. The R code which generates the scatter plots is displayed: # load data from
# mass package
Library(mass)
Data("boston")
# see data codebook
Help(boston)
# make figure
Pairs(boston[,-(1:5)])
]

[Sarah Sernaker]
I've highlighted them also because in the next slide I went ahead and created this linear model to show you that also you can make residuals and diagnostic plots. 

[ONSCREEN Four plots are displayed on screen entitled as follows: residuals versus fitted, normal q–q, scale-location, residuals versus leverage. The residuals versus fitted plot displays fitted values on the x-axis and residuals on the y-axis. The residuals versus fitted values plot shows data points distributed across fitted values zero through 40 but primarily clustered around fitted values 15 to 32 and residuals values date of 10 through 10 on the residuals axis. The normal q–q plot shows the x-axis of theoretical quantities versus the y-axis of standardized residuals. The plot displays a series of points closely matching the straight line ascending from the point at theoretical quantities -2.2 and standardized residuals value -2, to the point at theoretical quantities 1 and standardized residuals 0. The plot of points rises from there at a steeper angle to theoretical quantities 2 and standardized residuals 4. The scale-location plot displays fitted values on the x axis and standardized residuals on the y-axis. The data points form a discernable cloud between fitted values 15 and 33 and standardized residuals values 0.252 and 1.25. The residuals versus leverage plot displays leverage on the x-axis and standardized residuals on the y-axis. Data points cluster heavily between leverage values 0.001 and 0.01 and standardized residuals values -2 and 2. The R code which generates the scatter plots is displayed: 
# fit linear model 
# medv is the response
# lstat and rm are
# significant covariates
M1 = lm(medv ~ lstat + rm, data = boston)
# make 2x2 figure
Par(mfrow = c(2,2))
# plot model results
Plot(m1)
]

[Sarah Sernaker]
So not just plotting results but using visuals to understand your model fit. Again this will be posted online this is this is reproducible on your own computer. And you just use this function plot after you fit a model and the important thing to see here is we have a residuals and this is not a good fit. And this would not be clear because if you look at the data these are significant variables if you look at the p-values here but you can see this fit is not good because you have a clear pattern this curve here. So this is just kind of an aside of visuals are not just for your end result and results in general but you should use visuals to help understand and inform your data decision and modeling processes.

[ONSCREEN two images are displayed on screen. The first is a tableentitled "estimates of relative survival rates, by cancer site" presents the estimates in 5, 10, 15, and 20-year intervals for cancer site locations such as prostate, thyroid, testes, melanomas, breast, hopkins to seize, etc. The cancer locations with the longest percent survival rates are prostate, thyroid, testes, and melanomas. The cancer locations with the lowest percent survival rates are long and bronchus, the savviest, liver; bile duct, pancreas. For more information, see: https://www.edwardtufte.com/bboard/images/0003nk-10291.gif. The table is dense with numbers which do not easily visually connect to the cancer location word. The second image is a slope graph. This graphic displays the cancer location word in the left column, but the estimates of their percent survival rates across four columns of time periods (5-year, 10 year, 15 year, 30 year) are listed as numbers connected by lines. For example, the word prostate is listed with four numbers connected with lines and visually descending as the percent survival rate decreases from 99% ship at 5 years to 81% 30 years.]

[Sarah Sernaker]
So just really quickly an example of going from this big boring table a bunch of numbers overwhelming. I don't even know where to start. To slope graphs so this is the same information now everything's in order you can clearly see trends over time this is just an example and this is the sources down here from Edward Tufte who does these great books on data visualizations. So just will getting to the point of you're conveying the same information but this is so much more palatable on the right and people will see this and take away so much more than if you just put in tables like what you see on the left.

[ONSCREEN a graph which intends to depict changes in the global distribution of major diseases. It is an arrangement of different size rectangles which each contain acronyms or words. They are clustered by colors: blue, red, and green. No discernible pattern or information can readily be understood from this overwhelming data visualization due to its use of excessive text, colors, and shapes. Source: https://vizhub.healthdata.org/gbd-compare/]

[Sarah Sernaker]
This is an example of a confusing and overwhelming visual and I wanted to include it here because someone had emailed me this once and they were like Sarah look at this super cool visual like isn't this great? And I opened it and I just was like this is one of the worst things I've ever seen. But I just find this so confusing and I can't help but laugh because this is a real data set I'm sure there's really great and useful and important data under this but this visualization like gives me a headache. And I just can't even get past that.

[ONSCREEN a line graph entitled "tech stock throwback?" is used in this presentation as an an example of a misleading data visualization due to its use of different scales on the left and right axes. For more information, visit:
Https://fm-static.cnbc.com/awsmedia/chart/2019/5/12/nflx%20tsla.1560348172590.png?w=929&h=523&vtcrop=y]

[Sarah Sernaker]
Example of confusing and misleading visual. This is from CNBC and this is to show you that you can put numbers and lines anywhere but it doesn't need to make sense. But this is they news organizations will make figures like these to seem impactful, but this is me telling you to take a second and ask yourself, not just you know as you're creating visuals but as you're interpreting visuals, what does this mean? Basically they've just overlaid a different stock from a different time period over another stock over a completely different time period and the scales are totally different. So this is what I mean by misleading I mean you can do you can draw whatever picture you want but what should be going out to the public should probably be a little more curated than this. And that's CNBC. 

[ONSCREEN a graph with the title "change in ticket price v distance 2014, selected routes" is used in this presentation as an example of a busy and confusing data visualization do to its use of confusing axes, hard-to-read text, and poor color contrast. The graphic itself appears as a shaded area of a semi-circular arc with different colored lines starting at the arc's origin and stretching toward the edges of the arc to indicate distance travelled. For more information, visit: https://www.economist.com/cdn-cgi/image/width=1424,quality=80,format=auto/sites/default/files/images/2018/12/articles/main/20181208_woc932.png]

[Sarah Sernaker]
This was The Economist. Just really quickly an example of busy and confusing. It took me a minute to try to understand what's happening here. It's not the worst I've seen but it's just definitely like a lot going on. 

[ONSCREEN a bar graph titled "top tax rate," which is used in this presentation as an example of a misleading graph due to its misleading axis. The tax rate of 39.4% appears as a bar labeled "January 1, 2013" and is four times taller than the 35% tax rate bar labeled "now". For more information visit https://flowingdata.com/wp-content/uploads/2012/08/bush-cuts-620x458.png]

[Sarah Sernaker]
And then an example of misleading axes so this is what I was talking about before making sure your x or y axes are on a scale that makes sense. If I just looked at this you'd say oh my gosh wow look that's a huge increase from January 1st 2013 to now or decrease rather sorry I guess this is now. Anyway but you can see from the scale actually they zoomed in so much that it grossly misrepresents the figures. So the only difference 35% versus 39%. And I chose various news sources to really highlight that this is not inherent to any one news source and just to be careful and be aware that you're not creating figures of this level of confusion or misleading. But that there are figures out there and you should always stay vigilant in understanding what people put out there. 
So okay let's jump to the fun practical stuff making visualizations in r. So the to make visualizations in R you can make them really beautiful. R I think of all the programming languages I used which is quite a few R makes the most beautiful figures. And they have so much customization possible for better or worse. Sometimes there's almost too much where you there's a lot of options but it creates it gives you the ability to create really awesome figures and sometimes interactive. R I think is really just the highest standard. And all of it lives within this package ggplot2. And ggplot2 lives within the tidyverse universe. That could be a whole aside but just know tidyverse is a bigger package of a lot of you know really nice code stuff and ggplot2 is basically a package within there. And the way ggplot works is you create layers of your figure. So you start by first specifying the data you specify your x variable, y variable, in a basic situation however you need to add layers to tell it to plot points. So for example the the previous slide would not have plotted anything because I did not have the points. And so I specify the data then I add points then I can add lines and then I can add customization to worlds beyond. So my point is ggpot is very like sequential of building up this figure that you have in mind. And I'm going to jump into R in a second I promise I'm trying to just get it set up here. So the example I've set up today it uses NCANDS data and it's the number of substantiated and unsubstantiated reports by year. If you're not familiar with NCANDS I'm going to refer you back to our website and our slides because we have so much information about it. Basically it's just child abuse reports every year. And we're going to link the NCANDS data with census population data to get respective populations because basically we want to create rates. We want to go from counts to rates and to get rates we need some denominator. So the census population is going to give us those denominators. And to follow this example I cannot provide the data that I'll be using exactly but you can get analogous data online. I'm not going to go through each of these points because again these slides will be posted but you can get analogous data from what I'm using either if you want to directly order our NCANDS data you do need an irb there are some things that go along with it or publicly accessible you can get data from the child maltreatment report and it's analogous to what I have here. It's not exactly the same but it should be you should be able to use that and tweak my code to hopefully get that. And again the second bullet is publicly available. This I just threw on here again so when you're browsing the slides these are just super helpful graphing functions I use a lot font grid, facet grid, if you're graphing by state this facet geo checking monofit so that's I just threw that there again that'll be for when we go back when you go back and look at these slides. There's also references and resources but again I want to get to the fun stuff so let's go into R. 

[Voiceover]
The program written in R is included in the downloadable files for the slides in the transcript.

[Sarah Sernaker]
Let me get set up. And I think this code could be accessible. I'm just trying to zoom in here. There's no reason I see that I can't share this code with users but again you would need to tweak the data that you're reading in because you wouldn't have the data. Okay so this is R I wish I had time to go through the basics of R but that's not we just don't have the time right now. So always feel free to email me I might fly through this or add a question to the chat. I'm just going to load this and then take a sip of my water let's see.
So what I've done so far I've just loaded some libraries that we're going to use. Notice I loaded tidyverse I loaded ggplot I loaded these things that I like scales and then geofacet is for making graphs by state. I've set the directory to where my data are so this is something you would need to change if you adapt this code. So I'm loading in the data that I have and this is very curated NCANDS data you can see that it's only a few variables we have with each within each submission year for each state for each race and sex, so this is the combination of year state race sex how many reports of maltreatment were unsubstantiated and how many reports of maltreatment were substantiated? So this says that 634 white male children in Alaska in 2010 had unsubstantiated reports of maltreatment. And I said male and white because I know what these codes mean but you might not because what does race equal one mean? You would need a codebook which is on our website and you should always have your codebook handy as you use our data but I am going to help show you how to clean that up.
So I've loaded in the data I always look at the head just to make sure I've loaded what I want it looks like how I want any tweaking that I want to do. And I'm going to do a little bit of cleaning right here first I'm going to take out Puerto Rico and that is solely because the census data I have access to doesn't include Puerto Rico for whatever reason so it just doesn't link so I'm just going to drop it from the get-go. I'm also filtering any cases any combination where there are less than 10 cases and that is because I don't want to display that for data protection reasons. And so I've just saved that as NCANDS2. And I'll just I'll just know I I'll go through NCANDS two three four five six I really try to avoid overwriting. Notice I didn't just overwrite this as NCANDS equals NCANDS because sometimes you want to go back or sometimes you know you screwed up at NCANDS3 but NCANDS2 is fine and so unless computationally that is prohibitive which is very real possibility I just like to create a bunch of versions. And you can always go back and edit it but that's my spiel. So I'm loading census data now I'll show you what that looks like. We have similarly the year, we have a state fips code so that's just a code that goes along with every state every state has one. We have the full state name, we have this thing called st that's our state abbreviation we also have sex and race ethnicity. And I am here to tell you I've cleaned this so that the race ethnicity and the sex values match with NCANDS. I don't think that's necessarily true so what I mean is that I've made sure that in my census data raceEthn equals one is also means white non-Hispanic and that is also true of NCANDS. So I'm going to be linking these two but I've done I've done I think a lot of heavy lifting behind the scenes and I just want to stop and point out that when you link two data sets you want to make sure a few things. First of all you need to make sure that the variables that you're linking on are named the same. So I'm going to scroll up here here was our NCANDS data here's our census data. We as human beings I look with my eyes I can see these are both years that's 2010 that's 2010. Staterr that's ak these are in different orders but ak is in there so the state abbreviation staterr and the state abbreviations here is st. I as a human know that that's the same but to R we need to tell R that these are the exact same things and we need to link on those. And so this is where tidyverse comes in or no I mean tidyverse came in up here but so just to take a quick step back. So I'm going to rename those variables to be the same between NCANDS and census because ultimately I want to link them so we need to tell R these are the same variables, the same content within the data sets and we're just going to name them consistently to link them. And to do that so I'm taking NCANDS2 that was like our sort of cleaned up NCANDS I'm using this piping operator this is what this means this is called a pipe and that is that like defines the tidyverse universe these piping operators. And the piping works in a similar way to what I just described as you kind of create layers of code. So I'm taking NCANDS I'm renaming subyr within NCANDS to just be year I'm renaming staterr within NCANDS to just be st. And I'm renaming chsex in NCANDS to just say sex and that is because year state and sex are the variable names in census and that's why I'm saying those are the variables we want to match on. We'll also be matching on raceEthn but I did not need to change that because they are named exactly the same. So we will be matching on them we are linking on them but I did not need to rename them because they are named exactly the same and as I said before I ensured that these same levels of NCANDS were equivalent to the same levels of raceEthn in census that is something you should very much take the time make sure the data you're linking is consistent and coherent. So I've done whole talks on linking and they you can find them on our resources on our website and I encourage you to watch them but ultimately linking boils down to this single line of code right here. All that linking is is you've prepped the data you've made sure the linking variables are there and that they're consistent and then you just tell R to link them and that's what I'm going to do. And here I'm using a left join. So again go back to my slides I talk about right joins, full joins, inner joins, outer joins, left joins. And what a left join means is that I'm holding on to NCANDS so hold NCANDS in your mind that's your main data set and then I'm bringing in census. I'm linking in census to join onto NCANDS and by joining onto NCANDS I'm saying match what you can from the census with data from NCANDS but if you can find something in the census data that can't be matched with NCANDS drop it. I want to draw I want to match only I want to link and match only on what is available in NCANDS I don't want to keep anything else. If you did a full join anything that's not linked from census would still remain there you just have missingness for the NCANDS side. And vice versa if you did right join onto census then you're saying I want to keep everything in census match what you can from NCANDS and drop what you can. So that's the short of my linkage talk. So you prep the data all you say is left join so I can show you that here. The nice thing with the pipes the piping operators is you can test code before you run everything. So right now I've just run this NCANDS up to the join statement so you can see it's joined by year, state, raceEthn, sex and we have our unsubstantiated from NCANDS our substantiated and then our population from census. So the next piece of code I have is just to reorder and then arrange everything. This is just so it looks pretty and is the order that I want. And so when you look at it then you have like all your demographics all your categories and then all your values on the right side. And that's all I've done here reorder. There's this nice function everything that basically I'm saying okay I know I want these in order and then just throw everything else at the end and this makes sure that you don't lose any variables it's just going to throw everything at the end at the end of like the last few columns I mean.
So data cleaning continues on as it always does in any research project. We haven't even gotten to the fun stuff yet. So I've linked the data to census data but now what I want to do is make things informative. So thinking about our visuals if I had put raceEthn equals one two three four five six and put that in a journal article people would be like what is raceEthn one what is sex one sex two. You could probably infer it but my point is this is where we're going to add informative labels to race and sex. Again using piping operators using our mutate functions so we're saying create this new variable raceEthn2. Again I try to avoid overwriting unless I'm feeling bold or if I've got computational problems. Again you can always go back and edit it but like your first run through I try not to overwrite anything. And I like this case_when function this goes you can go step by step to say okay I'm making this new variable when raceEthn equals one that's actually stands for white non-Hispanic, when two black non-Hispanic three Native American. We usually lump in raceEthn four and five those are Asian is raceEthn four and five is native Hawaiian. So we generally lump them together as aapi so I've done that here with this function %in% this is kind of instead of saying equals equals 4 or equals equals five I can use this in operator to check for both or either. And then the same thing for sex. So I'm just going to run this we're just doing more data cleaning. Notice almost at every stage I'm looking at the head of the data don't anytime you run stuff look at your data make sure it did what you want you have what you want. Ask yourself is what I'm looking at what I want it to happen and does it make sense? Okay don't just assume because your code ran that it was correct.
Okay so we have this stuff at the state level right now state by race and sex. But it might be the case that you want to summarize this to a national level or you want to summarize it just over race you don't care about distinction by sex or vice versa. So here I've done some summarization and then I swear we're getting to plots we're so close but just bear with me. So this whole chunk is making different various aggregated data. Again we have by state but we have by year, state, race and sex. And so here we are creating totals within each year grouped by race and sex. So this is national level totals by year, race, sex. So we're collapsing over the states. So we use this group_by statement to say okay take within each year, race and sex add up all of these unsubstantiated, add up all the substantiated, and add up all the population. So that's going to take all of these values for common race and sex in here and collapse them. And so I should just notice they did not have my head here but let's add it. So to show you what I mean and I'm using our new label so everything is much more coherent and easy to understand. All we've done again is collapse to the national level notice we still have year race sex but we do not have state. So I've collapsed our sums to a national level. I'm going to do the same thing again, but I'm not going to include race and sex, so this is purely national level totals.
No. So like I said this ignores race and sex this is just over the national united states in 2010 there were this many unsubstantiated cases of maltreatment and in 2010 there were this many substantiated cases of maltreatment. I should note I should have noted this before this population and for the census this is all for children 17 or younger. I do make that distinct filter you have to be careful with using census data and linking with our data because we generally this is all data mostly about children. And so you don't want to be comparing populations of children to populations of the entire united states that includes adults and all that. So just to quick aside. Okay onwards we move. This piece right here this last piece is pretty important when you're making figures. And so we're looking at our national level data we have by year the number of unsubstantiated the number of substantiated we have the relative population for the year. But when you're making figures this is what's called wide version we have columns and we have values within each columns. When you make figures it is so much easier to work with things and add colors, add themes, add this and that when your data are in long format when you have one single column that has all your values and all of the information is just kind of condensed into other columns. So what I mean is we want to kind of unwrap the values of our unsubstantiated and substantiated cases and just kind of unfold it. I don't know how else to describe it. I've run this let me show you what it looks like and then oops what did I do and then we're gonna get to making figures. So just really quickly notice the difference so this is in our wide format we have different columns holding all of the values but here now we have one column that has our unsubstantiated reports and our substantiated reports. Notice I did not do the same thing with population because population is our denominator and we basically want to be dividing reports by population and it will be the same for both because in the year 2010 the population stays the same. So this is this is not trivial but I don't have the time to go into it. This is okay that these are repeating and are the same. And again this uses this function pivot_longer. There's pivot_wider to go from long to wide. These are really important functions in the tidyverse and I use them a lot. They're also really complicated. I've been coding for like 10 years now and I always have to look at the help file for this. So pivot longer like try it play around with it sometimes I do trial and error and I'm like oh hey there's what I want after a few tries. Like some of this stuff is not is not obvious and I find pivot_longer to be one of those functions. 
Okay but enough with the blabbering let's get to plotting. So figures we've gotten to ggplot we're in a place where we can make some pretty figures. So the first thing you want to tell ggplot all of this first thing is your data. I'm going to be taking the national level long data that I just made but I just want substantial initiated report so that's why I have this filter. And you could do this above and save it as a new data set or you could leave it like this just as specifying the filter here. Whatever makes sense to you. So this is my data I'll just run that really quickly. This is our substantiated reports x is year so this is basically a time series and y is reports. So I am doing that and I'm running my plot and we have a figure we've made a figure and it's it's pretty gross. I mean what it put year with decimals that doesn't make sense and there's no commas and it starts at 590,000 but it's a figure and if you've never used ggplot before hey this is a start. This is we're starting okay. So this is this is how you get started in the ggplot notice I've got my data layer then I've added points. Okay I'm gonna do something similar notice in this p2 so this was p this is just p I don't know why I do p someone someone used this a while ago I've adopted this p and p2 I think plot and plot two. Anyway so for plot two I'm using the national level data but notice I'm not filtering out substantiated or not I want to include both types, substantiated and unsubstantiated, and the way that I'm going to tell R to differentiate them is with color. So notice up here just really quickly in our natdat_tot_long so that's our long version we have this variable that tells R whether it's unsubstantiated or substantiated. And this is why this is so helpful to put this in long version because this is how you can use these options really easily. So this is all the same I've just gotten rid of the filter and I've added color. And I've also added a line. Notice again ggpod doesn't and assume anything you need to build it up. I only put geom_point here but you should I added geom_line and let's see what we get. Aha this looks pretty good it's looking all right. Notice we have substantiated and then unsubstantiated and it's different colors and there's a a whole legend that's automatically generated because I've specified this color option so that automatically happens. And sometimes like this is fine if you just are looking you know on a personal level you're looking through the data this is pretty good this tells a story you see trends but one thing to keep in mind coming back to some of the content from my slides is this substantiated line is totally getting swamped by the unsubstantiated. There are so many more unsubstantiated and because the figure needs to show both include both of these it's it's accounting for the range of unsubstantiated and substantially just kind of gets swamped. So if we were to zoom in on this it would not look so straight like notice there's little bumps and curves and they don't they look really minor from this point of view but the this range is like 500,000 I don't even know because these labels 250,000. So this is what I mean by swamping keeping axes in mind, thinking about what you want to display. I probably wouldn't display these on the same figure just because they're like I said one is getting super swamped by the other. But okay so working from that p2 notice I saved it as p2 and that can be helpful because then if you want to edit or make customizations you can just add on. So you can just do this p2 plus and then any additional themes or options or anything else so I've added a lot of stuff here this is what I call the beautification process. So we have this like kind of basic figure and then I'm going to fix the axes so that it doesn't have decimals because that doesn't make sense when we're talking about years really. I'm going to change the y-axis to go up to 3 million because right now it's getting cut off. I'm also going to add commas to this the millions scale and that's where that package scales comes in it's really handy. I'm going to add notice these are all just separated by these plus I'm just adding on layer by layer so then I'm going to update my x label to say year with a capital y. I'm going to update my y label to say number of children number of children. I'm going to add a title. I'm going to remove the color legend title. I do this a lot just because usually legends if you've labeled them well enough they don't need a legend title and it just declutters it in some way. I'm going to re-label the values of unsubstantiated and substantiated so they don't just say subst and unsubst. And the key with this is the scale color manual and these are again like one of the trickier functions that I find that I always have to look up but basically as long as you're consistent so you specify values, red, blue, your breaks which are the values and the data we have substantiated and unsubstantiated, and then what I really want to label them as. So substantiated so everything in the first like line here corresponds to substantiated and everything this blue corresponds to this unsubstantiated. So that's what I mean as long as you're consistent and you know have everything in the correct order that's the main thing to keep in mind with the scale color manual. And then I really like just putting my legend on the bottom there's this theme option that has endless customization options and within theme you specify legend position equals bottom. So all of that is just to make things look nice.

[ONSCREEN a line graph is displayed which is titled "number of children on reports of maltreatment, substantiated or unsubstantiated". The x-axis is labeled "year" and goes from 2010 to 2020. The y-axis is labeled "number of children" and goes from 0 to 3,000,000. There are two lines: one red, one blue. The red line shows substantiated reports which appear to stay at approximately 600,000 every year across the entire time range. The blue line shows unsubstantiated reports with the number at ~2,125,000 in 2010, rising to a peak of almost 2,500,000 in 2018, and falling to 2,166,000 in 2020.]

[Sarah Sernaker]
And that was a lot of code and I kind of breezed through it knowing that you guys will have access to it. But again this was all just to make this more palatable, easier for a user to understand. The data is there it's not hard to get the data on a graph. The work is really making sure that this graph is palatable, understandable not confusing, easy to read, and I think I've made it a little bit better but I still have this problem with the substantiated being swamped. And that ultimately comes down to having these two lines in the same figure. There's really no way around it because this one's just on a different scale and so that's a question I would come back to of how could I maybe display these separately or different. But onwards we go so just looking at substantiated cases. So this was all really just setting up like a nice pretty basic figure I mean there's nothing fancy here there's like some nice fancy formatting and it's pretty and whatever but there's not really much else going on beyond that. So okay I got a mindful of times I got five more minutes and way too much code which is always the case. But this like I said will be available online. So after looking at that I wanted to take a step back and just look at substantiated cases because like I kept saying the substantiated cases are getting swamped here and it looks flat. If I were to show this in a presentation or a business meeting they'd say okay this looks pretty flat no changes. But if you're just looking at substantiated cases this is what it looks like on a smaller scale, for better or worse. Not to say that this is a more correct scale it's just a smaller scale. 

[ONSCREEN a line graph titled "number of children on reports of substantiated maltreatment". The x-axis is year from 2010 to 2020. The y-axis is "number substantiated" and goes from 590,000 to 640,000. The line appears to change more sharply up and down across the years because the scale is compressed.]

[Sarah Sernaker]
I think it's in a better direction than the original one but you can see what I mean as far as this is not a flat line it's just varying on a smaller degree than the unsubstantiated cases were. And so that I took from p that was our first plot remember this was all p2 and then I made p2 really pretty but p was just our basic substantiated cases. So that was what I'm building on here so I kind of just did the same quote unquote beautification process of our plot p which is just the substantiated cases. This is something also I don't know how comfortable I feel with starting an axis at 590,000. That's just something again to keep in mind you probably don't want to do that that's not really great practice. If you did keep it like that maybe add a note to your figure so people don't just assume that starting at zero. But okay let me just quickly go through this so we have national level data. I wanted to go back and show faceting by race so I went back to national level data collapse by race so this is very analogous code to what I had before but now it's just national level data by race. So unsubstantiated substantiated population.
And I think I guess I just kept it this way for now because I'm just gonna plot substantiated cases. So I did not make this long for the time being. I made color equals raceEthn so you'll see we have different colors for different races and then I've added kind of all the beautification onto the same bit of code here.

[ONSCREEN a line graph titled "number of children on reports of substantiated maltreatment" with six lines in different colors, each representing a different race ethnicity. The counts are explained by the presenter.]

[Sarah Sernaker]
So we get this nice figure with all these lines for race and it automatically did that because again we have the variable in the data that's directly corresponding so R and ggplot can just go in and pull out you know your aapi, non-Hispanics for across the years and similarly for black and Hispanic and multiracial and we get this really nice figure. But this is where I said that you should be careful about what figure like what statistic you're conveying. Whether in this case this is counts this is number of substantiated cases. But I wanted to link the census data remember to make rates. Because notice this is showing whites if I were to show you this you'd say wow whites have the most substantiated reports of maltreament and American no Asian American and pacific islanders have the lowest, Native Americans also really low number of substantiated maltreatment reports. Like and then we have Hispanic and black in the middle. Like if you were to send this to a policy maker and they were just looking at this they'd say wow we need to focus resources on white, maybe Hispanic, and black right? But that's why you should be careful with what statistic you're demonstrating. So I always favor rates because counts are not necessarily comparable. Sometimes you're comparing apples to oranges in a sense and when you make a rate you're making everything, you're kind of standardize everything to make a more comparable comparison. You're putting everyone at the same playing field. And so I've just gone back through my natdat_race so the data I just made and now I'm using my population as a denominator so I'm taking the number of substantiated cases divided by the respective population of the race within that year and I'm multiplying it so it's a rate per 100,000 kids. And so now I have a rate and I'm going to show you how it changes things.

[ONSCREEN a line graph titled "rate of children on reports of substantiated maltreatment" with six lines in different colors, each representing a different race ethnicity. The rates are explained by the presenter.]

[Sarah Sernaker]
Notice Native American one of the highest rates of substantiated reports of maltreatment black native or black non-Hispanic also one of the highest substantiated reports about treatment rates and that is because what we saw before was just a result of the population we saw white as the highest line in counts and that's just inherently because there are more the population of white children is higher than that of Native American and higher than that of black non-Hispanics. And so that this is like this is where I'm going to leave it because I'm hoping this is an impactful takeaway to help you know when you're putting together figures making sure that you're really conveying an appropriate, the correct message, and one that's not going to mislead people. Which hopefully I've made the case for here. And I have a lot more code that I wrote and like I said I was so excited about and I knew I would never get to but like I said this is available. I have more facets. I can just I'll quickly go through this I won't have time to explain it but. Uh oh what happened there that was. So we can facet data so separating by male and female with colors for race. We could fast it by race instead with colors for male or female. So again thinking about what comparisons do you want to make. Do you want to compare within each race male to female, or do you want to compare within male and female the races? So this is the same data just just organized differently and this really gets to asking yourself what is the exact comparison and the story that I'm trying to tell people? And all of your decisions in data visualization matter and it really people will look at a figure and the first thing they see and you know take away even if you've got you know lengths of notes and whatever what they first see really is impactful on people and psychology and their psyche and this and that but. I'll just stop there now I'm just running through this so this is what's all in the code. I really really like these state box figures this is great if you have like time series for each state.

[ONSCREEN a display titled "rate of children on reports of maltreatment, substantiated or unsubstantiated". 50 small boxes are arranged to match roughly a united states map. Each box contains a state abbreviation and two lines: one blue and one red, representing unsubstantiated and substantiated respectively.]

[Sarah Sernaker]
I really love these state figures and that's in this facet geo. I'll stop there I know I said that like four times right now but it's.

[Clayton Covington]
Okay then.

[Sarah Sernaker]
Yes thank you. 

[Clayton Covington]
Well thank you so much Sarah I feel like you gave us a really comprehensive approach to data visualization and reminded us of some very important checks of both our reading of visualizations but also how we make sure we're not creating misinterpreted ones. So I'm going to read try to go through the questions we have so far pretty quickly. So the first question asks people with color blindness have different types and levels of color blindness. Is the cb palette from ggplot readable for all people with color blindness?

[Sarah Sernaker]
That's a great question and something I wish I was more informed about but that their documentation probably has an alternative to working with colors altogether is you can use geom shape which changes instead of colors you could use like triangles, let's say as substantiated and squares is unsubstantiated. So that's something also in thinking about accessibility you can totally avoid using colors if you're unsure or just want to make sure you could use colors and shapes. So I'm not sure to the extent I would refer you to the documentation I'm sure they go into lots of detail there.

[Clayton Covington]
Great the next question says: I work with the state government agency in that role I've noticed that policy makers like maps for understanding the results of studies and analyses. Reporting a single outcome overlaid on a map is nice but sometimes one more measure helps contextualize the results. For example maybe one analysis finds that it would be really important to include both the rate of foster care placements and poverty rate by county. Any tips for how to do this well?

[Sarah Sernaker]
Oh man okay hold on I know it's probably not like that we're putting a really normal sometimes should help contextualize yourself. So in that just really quickly because we're running out of time my first thoughts there are that sometimes it's better to take a step back and ask yourself is if using a map is the best way to do it because sometimes people get really fixated and like okay everything's a map just what's the statistic we're throwing on it. But maybe it would make sense to you know use something like I have here I think you can still see my screen where you still have a geographic representation but notice I've got two different measures here. So I would say maybe rethinking and I know government is immovable sometimes but maybe rethinking the method attack strategy you're starting with. But also yeah I know that's a tricky one. 

[Alexandra Gibbons]
I can jump in with this one.

[Sarah Sernaker]
Yeah this is the new ArcGIS. Yeah.

[Alexandra Gibbons]
Yeah I do some mapping and the two suggestions that I think would be best for this scenario: one you can do a bivariate chloropleth map where kind of two different color scales combine on an axis. So you can represent two different variables like that. And then the other thing that people do pretty frequently is do a chloropleth map with with slightly transparent proportional symbols overlaid on top of that. So like you could have a dot proportional to like the poverty rate for instance and then you'd be able to see through that dot to also look at the chloropleth map underneath it. I hope that helps.

[Sarah Sernaker]
Our guest Alex coming in with our ArcGIS knowledge. 

[Clayton Covington]
Thank you Alex okay so Sarah we're at time but we do have two small questions do you have time to answer these last two?

[Sarah Sernaker]
See them yeah the code should be shared unless Andres or Clayton unless you know otherwise this code will be shared. It's just the data that won't be. When exporting images to create reports the image resolution tends to suffer. Do you have any recommendations? I would just say to that just trying to make images small as they can while still maintaining all of the elements. The larger you make images and then have to resize that's when things get screwy. Or just talking to like the journal article or wherever you're sending this and ask them for their recommendations because it definitely will vary depending on you know where your image ends up and they probably have guidelines themselves.

[Clayton Covington]
All right well thank you so much again Sarah for giving us a deep dive into data visualization and I want to thank all of our attendees who are both here today and over the course of the summer training series is you know one of the highlights of our work at that archive and we're really excited that you all were able to join us this year. Make sure to keep an eye out on the website for all the recordings of this year's sessions as well as previous year sessions and we look forward to seeing you next summer.

[Sarah Sernaker]
That's the last slide. Yay.

[Clayton Covington]
Yeah thanks everyone.

[Sarah Sernaker]
Thank you.

[voiceover]
The National Data Archive on Child Abuse and Neglect is a collaboration between Cornell University and Duke University. Funding for NDACAN is provided by the Children's Bureau an office of the administration for children and families.

The following is the complete R program code for the presentation:

library(data.table)    # package for reading the data  
library(tidyverse)     
library(ggplot2)       
library(scales)        # package for label formats
library(geofacet)      # package for state graphs


# set directory to folder where data are
setwd("C:/Users/ss1216/Box/NDACAN/Presentations/Summer Series 2023/S6 - Data Viz")


#### LOAD DATA ####
# load NCANDS data of number of substantiated/unsubstantiated reports
ncands = fread("CF_summerseries.csv") 
head(ncands)

ncands2 = ncands  %>% # filter out PR (b/c not in census data) and counts less than 10 (for data protection)
                      filter(staterr != "PR",
                            unsubst > 10,
                            subst > 10)

# load census data
census = fread("census_pop.csv") 
head(census)


## join census and ncands
# left join because some states not reported in ncands from 2010-2012
dat  = ncands2 %>% # rename variables to link with census
                      rename(year = subyr,
                            st = staterr,
                            sex = chsex) %>% 
                  # left join because ncands may not have all states all years like census
                  left_join(census) %>%
                  # reorder variables
                  dplyr::select(year, st, state, stfips, everything()) %>%
                  # sort by year, state, and race
                  arrange(year, stfips, raceEthn)
head(dat)


## Data cleaning
dat2 = dat %>% # add informative labels to race and sex
                mutate(raceEthn2 = case_when(raceEthn == 1 ~ "White NH",
                                             raceEthn == 2 ~ "Black NH",
                                             raceEthn == 3 ~ "Native Am NH",
                                             raceEthn %in% 4:5 ~ "AAPI NH",
                                             raceEthn == 6 ~ "Multiracial NH",
                                             raceEthn == 7 ~ "Hispanic"),
                               sex2 = ifelse(sex == 1, "Male", "Female"))
head(dat2)


#### Summarize data to national level
# totals in each year - grouped by race and sex
natdat = dat2 %>% group_by(year, raceEthn2, sex2) %>%
                  summarise(unsubst = sum(unsubst,na.rm = TRUE),
                            subst = sum(subst, na.rm = TRUE),
                            pop = sum(pop,na.rm = TRUE))
head(natdat)

# total in each year - total over everyone
natdat_tot = dat2 %>% group_by(year) %>%
                    summarise(unsubst = sum(unsubst,na.rm = TRUE),
                              subst = sum(subst, na.rm = TRUE),
                              pop = sum(pop,na.rm = TRUE)) 
head(natdat_tot)




# put natdat_tot data in long format
natdat_tot_long = natdat_tot %>% pivot_longer(cols = c(unsubst, subst),
                                              names_to = "rptoutcome",
                                              values_to = "rpts")

head(natdat_tot_long)


#### FIGURES ######
# basic scatter plot of substantiated reports, at national level
p = ggplot(natdat_tot_long %>% filter(rptoutcome == "subst"), 
           aes(x = year, y = rpts)) +
           geom_point()
p


# add unsubstantiated data, at national level
# make the lines different color based on substantiation/outcome
p2 = ggplot(natdat_tot_long, 
            aes(x = year, y = rpts, color = rptoutcome)) +
  geom_point() + 
  geom_line()
p2



# take previous figure but fix labels and some reformatting
p2 + 
  # change x axes lines to 2010-2020, incremented by 1 yr
  scale_x_continuous(breaks = 2010:2020) +
  
  # change y axes to start at 0 and go to 3,000,000, incremented by 500,000 - formatted with commas
  scale_y_continuous(limits = c(0,3e6),
                   breaks = seq(0,3e6, by = 5e5),
                   label = scales::comma) +  
  
  # relabel x and y axes, and title
  xlab("Year") + 
  ylab("Number children") +
  ggtitle("Number of children on reports of maltreatment, substantiated or unsubstantiated") +
  
  # remove the color legend title name 
  labs(color = "") +
  
  # relabel the values of "subst" and "unsubst" respectively, need to specify 'values'/colors for each one too
  scale_color_manual(values = c("red","blue"),
                     breaks =  c("subst", "unsubst"),
                     labels = c("Substantiated", "Unsubstantiated")) +
  
  # put the legend horizontally on the bottom
  theme(legend.position = "bottom")





### just look at substantiated cases
p + 
  geom_line() +
  scale_x_continuous(breaks = 2010:2020) +
  scale_y_continuous(label = scales::comma,
                     #limits = c(0,650000)
                     ) +
  xlab("Year") + 
  ylab("Number substantiated") +
  ggtitle("Number of children on reports of substantiated maltreatment") 









## look at national trends of race

# make national level data - totals by race/ethnicity
natdat_race = natdat %>% group_by(year, raceEthn2) %>% 
                         summarise(unsubst = sum(unsubst, na.rm = TRUE),
                                   subst = sum(subst, na.rm = TRUE),
                                   pop = sum(pop, na.rm = TRUE))


# Plot number substantiated by race
ggplot(natdat_race, aes(x = year, y = subst, color = raceEthn2)) + 
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = 2010:2020) +
  scale_y_continuous(label = scales::comma,
                     breaks = seq(0,300000, by = 50000)) +
  guides(color = guide_legend("Race"))+ 
  xlab("Year") + 
  ylab("Number of substantiated cases") +
  ggtitle("Number of children on reports of substantiated reports of maltreatment")






## Create rates to standardize comparison
# national level rates of substantiated reports per 100k children - by race
natdat_race3 = natdat_race %>% mutate(subst_rate = 100000*subst/pop)


# plot substantiated rate
ggplot(natdat_race3, 
       aes(x = year, y = subst_rate, color = raceEthn2)) + 
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = 2010:2020) +
  scale_y_continuous(label = scales::comma,
                     limits = c(0,1700),
                     breaks = seq(0,1600, by = 400)) +
  guides(color = guide_legend("Race")) + 
  xlab("Year") + 
  ylab("Rate of substantiated cases (per 100k children)") +
  ggtitle("Rate of substantiated reports of maltreatment (per 100,000 children)") +
  theme(legend.position = "bottom")







### look at national trends of race and sex #####
# grouping by sex too now
natdat_race_sex = natdat %>% group_by(year, raceEthn2, sex2) %>% 
                            summarise(unsubst = sum(unsubst, na.rm = TRUE),
                                      subst = sum(subst, na.rm = TRUE),
                                      pop = sum(pop, na.rm = TRUE)) %>% 
                            mutate(subst_rate = 100000*subst/pop)



# facet by sex
ggplot(natdat_race_sex, aes(x = year, y = subst_rate, color = raceEthn2)) + 
  geom_point() +
  geom_line() +
  facet_grid(~sex2) +
  scale_x_continuous(breaks = 2010:2020) +
  scale_y_continuous(label = scales::comma,
                     limits = c(0,1700),
                     breaks = seq(0,1600, by = 400)) +
  guides(color = guide_legend("Race"))+ 
  xlab("Year") + 
  ylab("Rate of substantiated cases (per 100k children)") +
  ggtitle("Rate of substantiated reports of maltreatment (per 100,000 children)") +
  theme(legend.position = "bottom")


# facet by race instead
ggplot(natdat_race_sex, aes(x = year, y = subst_rate, color = sex2)) + 
  geom_point() +
  geom_line() +
  # using facet_wrap now, can easily specify 2 rows and free scales between figures
  facet_wrap(~raceEthn2,nrow = 2, scales = "free") +
  scale_x_continuous(breaks = 2010:2020) +
  scale_y_continuous(label = scales::comma) +
  guides(color = guide_legend(""))+ 
  xlab("Year") + 
  ylab("Rate of substantiated cases (per 100k children)") +
  ggtitle("Rate of substantiated reports of maltreatment (per 100,000 children)") +
  theme(legend.position = "bottom")


#### make figures by state #######
# collapse data over state
statedat = dat %>% group_by(year, st, state, stfips) %>% 
                  summarise(unsubst = sum(unsubst, na.rm = TRUE),
                            subst = sum(subst, na.rm = TRUE),
                            pop = sum(pop, na.rm = TRUE)) %>%
                arrange(year,stfips)


# plot substantiated by state
ggplot(statedat, aes(x = year, y = subst)) +
  geom_point() + 
  geom_line() + 
  facet_geo(~st, grid = "us_state_grid1"#, scales = "free_y" 
            ) + 
  scale_x_continuous(breaks = 2010:2020)+
  scale_y_continuous(label = scales::comma) +
  xlab("Year") + 
  ylab("Number children") +
  ggtitle("Number of children on reports of maltreatment, substantiated or unsubstantiated") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


## make rates instead
statedat2 = statedat %>% mutate(subst_rate = 10000*subst/pop,
                                unsubst_rate = 10000*unsubst/pop)


# plot substantiated rate by state
ggplot(statedat2, aes(x = year, y = subst_rate)) +
  geom_point() + 
  facet_geo(~st, grid = "us_state_grid1") + 
  scale_x_continuous(breaks = 2010:2020) +
  xlab("Year") + 
  ylab("Rate per 10k Children ") +
  ggtitle("Rate of children on reports of maltreatment, substantiated or unsubstantiated") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))




## plot subst and unsubt by color - make long first
statedat_long = statedat2 %>% dplyr::select(year,st,state,stfips,ends_with("rate")) %>%
                              pivot_longer(cols = subst_rate:unsubst_rate,
                                           names_to = "rptoutcome",
                                           values_to = "rate")

# plot substantiated and unsubstantiated rate by state
ggplot(statedat_long, aes(x = year, y = rate, color = rptoutcome)) +
  geom_point() +
  geom_line() + 
  facet_geo(~st, grid = "us_state_grid1") + 
  scale_x_continuous(breaks = 2010:2020) +
  xlab("Year") + 
  ylab("Rate per 10k Children ") +
  ggtitle("Rate of children on reports of maltreatment, substantiated or unsubstantiated") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(color = "") +
  scale_color_manual(values = c("red","blue"),
                     breaks =  c("subst_rate", "unsubst_rate"),
                     labels = c("Substantiated", "Unsubstantiated")) +
  theme(legend.position = "bottom",
        # edit axis text to be a little smaller and vertical
        axis.text.x = element_text(angle = 90, vjust = 0, 
                                   size = 8))