Transcript for video titled "LeaRn R with NDACAN, Week 1, Introduction to R, September 20, 2024". The R code presented in the video is included at the end of this transcript and is marked [R Code].

[VOICEOVER] National Data Archive on Child Abuse and Neglect

[ONSCREEN slide content 1] WELCOME TO NDACAN MONTHLY OFFICE HOURS! National Data Archive on Child Abuse and Neglect. DUKE UNIVERSITY, CORNELL UNIVERSITY, & UNIVERSITY OF CALIFORNIA: SAN FRANCISCO. The session will begin at 11am EST. 11:00 - 11:30am - LeaRn with NDACAN (Introduction to R). 11:30 - 12:00pm - Office hours breakout sessions. Please submit LeaRn questions to the Q&A box. This session is being recorded. See ZOOM Help Center for connection issues: https://support.zoom.us/hc/en-us

[Erin McCauley] Hello and welcome to the first session of the LeaRn with NDACAN Office Hours Series. This will be a year-long course where people can learn how to code in R. My name is Erin McCauley and I am the Co-Director of the National Data Archive on Child Abuse and Neglect. And Paige Logan, our new Graduate Research Associate, who is a Ph.D. student at the University of California San Francisco, will be hosting the series. So let's get started!

[Paige Logan] We are really really excited to be hosting a very special iteration of the monthly Office Hours that we do annually. This special iteration will include 30 minutes of training on R for the first half of the session, and then at around 11:30 Eastern Time we will transition over to kind of the regularly scheduled programming, office hours, which includes breakout rooms with some of our lovely statisticians and staff at the archive to answer any questions about research design, statistics, working with the data sets, things like that.
And I know we have a lot to cover, so just a few very quick reminders and then I will hand it over to Sarah Sernaker. We ask that you put questions in the Q&A function in Zoom throughout the presentation. I think the plan is to try to save some time at the end and answer questions as they come in, so feel free to do that as you're listening and as questions come up. This session will be recorded and all of the recorded sessions will be available on our website. I will put the link in the chat shortly, but if you cannot make it or you want to reference these recordings in the future, everything will be housed on our website and will be available to revisit. And then the last thing is, if you have any Zoom issues please use the link provided on this slide. If you're having a lot of trouble you can reach out to Andres, who is on this call, to answer any issues that you may have. We're really excited to get this started. Our first session is Introduction to R and I am just going to kick it right over to Sarah to get started.

[ONSCREEN slide content 2] LeaRn with NDACAN, Created by SaRah SeRnakeR.

[Sarah Sernaker] Thanks Paige. Hi everyone, I'm Sarah Sernaker, I'm a statistician with NDACAN. You might have seen my presentations before. So just to reiterate quickly what Paige was saying, this first half hour we'll be doing instruction and slides, and the second half hour can be general help with data sets or statistical analysis, or we can talk more about R and the content we go over. So LeaRn with NDACAN. This was all put together by me with some help from references that I have links to. So a lot of you here are hopefully trying to learn R and maybe have not used it at all, maybe are familiar with it, or maybe are advanced users, and hopefully everyone can get something out of this, maybe even just a new function or a different perspective.

[ONSCREEN slide content 3] Why R? Built for statistical computing. Compatible with all computing systems (Windows, Mac, Linux).
Open-source, free. State-of-the-art graphics.

[Sarah Sernaker] But why is R useful? Why not Stata or SPSS, why not SAS or SQL or what else is out there? Mplus. R is a really nice entry-level programming language and it's completely built for statistical computing, and that's similar to SPSS and Stata very much so. But R is also really nice because it's free, it's open source, and that's a big appeal to a lot of people and companies. And when we talk about reproducibility, R is a free tool and a really advanced tool that can do a lot of advanced methodologies very well, and it has a really great state-of-the-art graphics system. So if you're doing any sort of statistical analyses or computing, R is a really great language to get into. Really quickly, some housekeeping. There are materials to go along with this whole course, not just today but for the whole sequence.

[ONSCREEN slide content 4] Materials for this Course. Course Box folder ( https://cornell.box.com/v/LeaRn-with-R-NDACAN-2024-2025) contains: Data (will be released as used in the lessons): Census state-level data, 2015-2019; AFCARS state-aggregate data, 2015-2019; AFCARS (FAKE) individual-level data, 2016-2019; NYTD (FAKE) individual-level data, 2017 Cohort. Documentation/codebooks for the provided datasets. Slides used in each week’s lesson. Exercises that correspond to each week’s lesson. An .R file that will have example, usable R code for each lesson – will be updated and appended with code from each lesson.

[Sarah Sernaker] There's a Box folder that should be publicly accessible with downloadable materials. Data will be released there as we use it in the lessons. So for instance today we'll be using Census-level data, if I can get to it; it's in the example anyway, and in some exercises. There's documentation for all the code, or I'm sorry, all the data that's provided.
So as is the case in anything, regardless of R or not, anytime you have data you should always refer to the codebook to see what the variable definitions are and the category levels. These slides will accompany it. Exercises also, if you're looking just to test yourself and test what you've learned here. I put together examples that hopefully help you do that. And then there will be an example R file along with every week. Additionally there's a reference that I used sort of lightly putting this together, and it's a book called R in Action.

[ONSCREEN slide content 5] Materials for this course. Using R in Action as a guide and reference to go with slides. Link to R in Action: https://www.cs.uni.edu/~jacobson/4772/week11/R_in_Action.pdf Other useful resources: Link to R for Data Science: https://r4ds.had.co.nz/ Link to Intro to R for Social Scientists: https://jaspertjaden.Github.io/course-intro2r/ Link to list of even more useful resources: https://guides.library.brandeis.edu/c.php?g=302090&p=2013481

[Sarah Sernaker] And I actually found it completely online, so the link is right there. And then other useful resources are, you know, R for Data Science, Intro to R, anything you can find online. Every book I find has a unique perspective and way of looking at data, so feel free to search the internet for more resources. So today, week one, is an introduction to R. Hopefully today you at least can download and practice running scripts and code. That's the goal. So we're starting very much at the ground level. Data that we'll eventually use in this week's code is Census data. So this is population data. I just wanted to put this here for thoroughness and for when we go back and reference things.

[ONSCREEN slide content 6] Week 1: Introduction to R, September 30, 2024.
[ONSCREEN slide content 7] Data used in this week’s example code. Census aggregate data from 2015-2019 (census_2015_2019.csv). 8 columns: cy, stfips, state, st, sex, race6, hisp, pop. 6120 rows: population counts for each state from 2015-2019, over sex X race6 X hisp. Publicly available from CDC Wonder at this link https://wonder.cdc.gov/single-race-population.html

[Sarah Sernaker] But this is sort of in the weeds. We're using Census data; it's population, it's broken down by sex, race, Hispanicity, and that's over a few years. And this is publicly available from CDC Wonder. Let me say now, all of the data we provide should not be used for research purposes or publication purposes. Some of the data are masked and obfuscated, so they're fake basically. They mimic real data but they're fake because of all the risks inherent to the data that NDACAN holds, etc. So a lot of the data you'll see, NCANDS and AFCARS and NYTD, would have to be directly ordered through NDACAN. The full research publication data is available through NDACAN, but the data we're using in this class will be in the Box folder. This Census data, to my point, is data that you can directly get from the CDC, and if you are looking to use population data in your publication you should refer to that, is my point.

[ONSCREEN slide content 8] Programming in R. “R” is a programming language, specifically built for statistical computing and analyses. Open-source, fully free and downloadable through The Comprehensive R Archive Network (CRAN). RStudio is the graphical user interface (GUI) that makes writing R code and working with data much easier and more manageable.

[Sarah Sernaker] So R, programming in R. I'm sure you've seen R before if you are here, and when we talk about R, quote unquote, R is actually the programming language that you're using. When you write R code, R is a programming language.
That's like what the communication from you to the computer is, in the R programming language. And it's open, fully free, and downloadable through what's called CRAN, The Comprehensive R Archive Network. CRAN is referenced a lot because you download R through CRAN, and all of the packages that you would download for R, external and whatnot, live on CRAN. And so you hear the term CRAN thrown around a lot when you use R. There's also RStudio. So RStudio is what you should open if you are going to program in R. You should not open the R that you download as a first step. What you really want to work with is RStudio. And RStudio is what's called a GUI, or a graphical user interface, and what it does is it just makes writing R code so much easier and more manageable. Writing in the R language on its own is like writing in a computer terminal, basically. And RStudio is just, you know, a luxurious place to write your R code and see all your variables and output and all of your graphics in one place. So when we talk about writing R code and using R, and whenever anyone's talking about using R, it's almost synonymously used with RStudio. So how do you get this stuff?

[ONSCREEN slide content 9] 1. Download and install R programming language from CRAN link: https://cran.r-project.org/ 2. Download and install RStudio from Posit link: https://posit.co/download/rstudio-desktop/ 3. Open RStudio. 4. Click the “File” button at the top, then “New File”, then “R Script” to open a new R script to work in. R scripts are where we write executable code and programs that we can save and re-run.

[Sarah Sernaker] If you do not have R on your computer, step one is you should download and install the R programming language from CRAN. So I've linked it here. A few things to note: you'll have to download the version for your computer. So I think you have to download the Mac version for a Mac, download the Windows version for Windows, Linux if you have Linux.
And I think it automatically recognizes your default download. I don't think you have to worry about bits or whatnot. So step one, the R programming language. So that's like you're telling your computer this is the R programming language, like you're giving it this sort of language dictionary. Then once you have R, download and install RStudio. And this is available through what's called Posit. I don't know if anyone's familiar, but RStudio used to be available from RStudio, that was the company, but they've changed into what's called Posit, and it's a whole separate topic. They're trying to incorporate Python and expand, etc., etc. But that's where we download RStudio now, from Posit. So download R, download RStudio, and then once you've gone through the installation process for RStudio, go ahead and open it. So now you're ready to write your code. And to open a new script, basically a new working document, you click the File button at the top left, where it usually is, then New File, and you want to open an R script. And this is like opening a new Word document. Whenever you write a new paper or a new manuscript and you want to close it and then open it later and edit it and this and that, that's what a script is. That's where you write code and programs that you save and send to people you're working with. That's called an R script, or script.

[ONSCREEN slide content 10] R studio interface.

[ONSCREEN slide content 11] A computer screenshot of an RStudio work session, with an R script open and the sections of the graphical user interface labeled with text boxes.

[Sarah Sernaker] So what does RStudio look like? Just really quickly, when you open RStudio, I guess I cropped out the File menu, but the File drop-down should be above here, and that's where you can do new file, script. Also there's a quick version: this plus sign with a blank page will open a new R script for you. So this is what this window will look like.
This R script, this is where you would write code and, like I said, write your document, sort of. All of this is green because it's a comment, and we'll get into that, but a comment is made using hashtags, or pound signs for those born before 2000. Pound signs or hashtags. And so that creates a comment, and that's not executable code, that's just to leave comments about your code and details about the document, which I cannot emphasize enough how important that is. But that's your R script at the top here. Your R console down here is where stuff will run. So if you run your code you'll see it pop up down here as a sort of check of, like, yep, we're running this, and warnings and other stuff will pop up here, or it will just complete, and again we'll get into that. But that's where you see things running, that's where warnings will come up, that's where errors will come up, and you should never ignore those. This is where, oops, this is where you can run quick functions like a help function. You don't need to include that in your script necessarily if you're just checking stuff out really quick; you can run quick code here. The R environment is where you'll see any saved variables, any data frames, anything that's been executed. So if you open an R script and nothing's in your environment, that means you haven't run your code yet. It doesn't matter that you've just opened your R script. Just looking at the code and seeing it as a human doesn't do anything until you actually run it. So this is an important button. This is not just writing your code but running it, to tell your computer to go through this code. And so this is where everything will end up, in the R environment. And this is a good check: you can quickly check dimensions, like how many columns and rows, etc. And then we have the graphical output and help system. So this is where, if you're doing plots or whatnot or using your help function, this is where it pops up.
And you have your help tab, a viewer for, actually I don't know what viewer is, because you have plots for plots. In Files you can see the whole tree, the file folder structure. So I'm going to quickly breeze through those because, if you're familiar with my presentations, you know I always run over.

[ONSCREEN slide content 12] A computer screenshot of an RStudio work session with additional details about the R Script section of the GUI. Shows the location where to write code and programs that can be saved as an R file that can be easily shared with others, and re-run as needed.

[Sarah Sernaker] So R script, where to write code. R console, where snippets of code can be run. Output will appear here; errors and warnings will appear here in red.

[ONSCREEN slide content 13] A computer screenshot of an RStudio work session with additional details about the R console section of the GUI. It shows where snippets of code can be run (e.g. help(), install.packages()). Output will appear here, or progress bars (such as from loading packages or data). ERRORS and WARNINGS will appear here in red (meaning something is wrong with your code) – always read and resolve warnings and errors.

[Sarah Sernaker] I put it in red to startle you and make you note that you should not ignore red text.

[ONSCREEN slide content 14] A computer screenshot of an RStudio work session with additional details about the R environment section of the GUI. Shows an arrow to the area in the top right where the results of the executed code are saved, e.g. variables, data matrices/data frames.

[Sarah Sernaker] R environment: executed code, variables, matrices. And then our graphical output.

[ONSCREEN slide content 15] A computer screenshot of an RStudio work session with additional details about the R GUI. An arrow points to the lower right of the interface where figures and visualizations will appear, and where the help() function will output information about how to use functions and packages.
This area will also display graphical output and help systems/details about packages or functions.

[ONSCREEN slide content 16] Programming in R.

[Sarah Sernaker] Okay, so programming in R. And I did this very intro level, but I think there's an assumption of some acquaintance with programming and programming topics, and I'm going to touch on what I think are the important things to at least refresh on as you're learning R. But first, to talk about R functions and packages.

[ONSCREEN slide content 17] R functions and packages. There are a lot of built-in functions for basic statistical analyses – called “base R” functions. Anything not already built-in to R must be installed from external packages from CRAN (or GitHub in some cases). Tidyverse syntax and suite (tidyverse), advanced and niche methodologies (survey, mice), state of the art methods (neuralnet), advanced graphics (ggplot2). install.packages("PACKAGENAME") Must load any needed (and already installed) packages at the start of your script/coding: library(PACKAGENAME) # note there are no quotes here. Can also reference functions within library using double colons: LIBRARYNAME::FUNCTION_in_LIBRARYNAME()

[Sarah Sernaker] So when you're writing R, it has a lot of built-in functions, statistical functions like mean, median, linear models. It can do a lot of stuff. And these are all called base R functions, so if you hear me or anyone refer to base R, this is anything that already exists in R when you download R, the R programming language. And it has a lot but it doesn't have everything. I think you could get by, but packages are helpful because packages provide more functions, and they're usually more advanced or niche methodologies that only a subset of people are using, so it's just not so broad or commonplace that they've included it in base R.
Or it could be stuff that expands on what's in base R or improves on it. It might just make output prettier, or it might just have better functionality than base R, or it's easier to understand. So there's a whole host of reasons why you'd be installing external packages to accompany base R. And almost every external package you could need exists on CRAN, like I mentioned. And if it exists on CRAN, you don't even have to go to the internet or whatnot. You just run this function from within RStudio, install.packages, and then you would put the package name. Because CRAN and R are so intertwined, if it exists on CRAN you can directly download it. The only caveat I'll say here is that some people with very niche or advanced packages, or I guess people who just don't go through the process of CRAN, put together R packages and put them on their GitHub pages, and sometimes you have to download it through that. If you're having trouble with that, reach out to me; I didn't put it here. I think you have to use the devtools package and then there's a function. Long story short, I'd be googling it. So there's a very low prevalence of packages that live on GitHub, but when they do live on GitHub it's just an additional step, really. And I've just listed some examples of very often used external packages: tidyverse, ggplot2, survey if you're using survey data. Mice is for imputation. State-of-the-art methodology such as neuralnet. So anything you would want to use, if you're using an external package, you have to install it, and then when you use it in your script you would include this library(PACKAGENAME) call. You don't need to use install.packages every time. You just need to install it once, and you as the user usually remember, oh, I've installed that, or R will tell you if you haven't. But all you need to include in your script is this reference to the library.
R needs to know, okay, we're using that library, so I know those functions. There are other ways to reference functions within a library, like I've done here, and I use this notation to show functions and what library they come from if it's not base R. So you don't necessarily need to use this notation, but I definitely use it within slides just to show where external functions are coming from. And that is just to list the library name and then the double colon and the function.

[ONSCREEN slide content 18] Documentation and help. Package documentation: CRAN website. Function documentation: Use the help(FUNCTIONNAME) function to access. Use ??SEARCHTERM to browse functions in downloaded packages related to search term. Any supplemental documentation relating to a package published elsewhere (just Google around). For example, MICE has a great published article with lots more context and examples with it: https://www.jstatsoft.org/article/view/v045i03

[Sarah Sernaker] Okay, so documentation and help. When you're writing programs, use help. I use Google for my code I think every single day if I'm programming, and I've been using R for, I don't know, like 10 years now or something. So use help. There's package documentation that's available from the CRAN website. There's function documentation, and that's available within RStudio using the help function, and that tells you what input to use in functions, what options are available with functions, how to basically tailor functions to what you need. For any supplemental documentation, just Google, not just for help and code debugging, but there are also some published articles with more context and examples. The best example I have of this is that mice has this additional published article that's super helpful and provides a lot of details about the package mice.
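As a minimal sketch of the package workflow and help lookups described above (this is not code from the video): it uses the stats package, which ships with R, so it runs without installing anything. The install.packages() and help calls are commented out because they need an internet connection or an interactive session, and the variable names are made up for illustration.

```r
# Install once per computer (commented out: needs internet access to CRAN).
# install.packages("tidyverse")

# Load an installed package at the top of the script; no quotes needed.
library(stats)  # 'stats' ships with R, so this always works

# PACKAGE::FUNCTION() notation makes it explicit where a function comes from.
values <- c(2, 4, 6, 8, 100)
med_values <- stats::median(values)
print(med_values)

# Check whether a package is installed before relying on it.
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)
print(has_tidyverse)

# Documentation lookups, best run from the console in RStudio:
# help(median)    # opens the help page for median(); same as ?median
# ??imputation    # searches installed packages for a term
```

Because the double-colon form names the package each time, it also works without a library() call as long as the package is installed.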
[ONSCREEN] Link to article "Multivariate Imputation by Chained Equations in R" https://www.jstatsoft.org/article/view/v045i03

[ONSCREEN slide content 19] Programming concepts to refresh. Data types: String, characters, numeric, factor, ordered, logical (TRUE/FALSE). Matrix, data frame, vector, lists. Missing/invalid values: NA, NULL, Inf. Variables: Assigning variables, e.g. x <- 3. Using and manipulating stored variables or objects. Conditionals or loops: ‘if else’ statements, ‘for’ loops.

[Sarah Sernaker] So if you're stuck, if you're looking for help, use everything. I cannot emphasize that enough. Okay, so quickly going through the next few slides, these are the programming concepts that I'm not going to drill in here, but if something's popping out, like, oh, I don't know what that is, definitely look into it or reach out to me afterwards. Refresh data types: what's the difference between strings, characters, numeric values versus factor, ordered, logical values? What's the difference between a matrix and a data frame in R? I'll tell you, a matrix has to be all the same type. And usually you don't work with matrices in data analysis. Usually data come in data frames, or tibbles, which we'll get into another week, and that's because there are usually different data types across variables, and data frames have nice labels, etc., etc. Missingness, so NAs, nulls, infinite values. How to assign variables, I'll get to that in a second. There are sort of two ways you can assign variables in R. How to use and access stored variables, conditionals and loops, especially 'if else' statements. 'If else' statements come in handy all the time for filtering or mutating or subsetting, all those things.

[ONSCREEN slide content 20] Programming concepts to refresh.
Operators: <, <=, >, >=, ==, !=, !, a|b, a & b. Coding style: Using spaces, indents, new lines in a way that makes code easier to read. Comments: Thoroughly comment code using # with details about what code does and other relevant information – not just helpful for others but for future you! Seeking programming help: Google, Stack Overflow, help(FUNCTIONNAME), ??SEARCHTERM

[Sarah Sernaker] More stuff. Operators, so really just basic programming, but we use these all the time and they're just like building blocks, so I want to make sure I at least touch on them. Coding style: this is to help you understand your code, this is to help others understand your code. There are basic conventions, using indents and spaces so your code doesn't look like a garble of words and nonsense. Comments, adding comments, even if you work alone. I work alone a lot. I add comments for future Sarah because future Sarah needs those comments to understand what the hell past Sarah has done. It's not just for other people, it's for yourself. And then seeking programming help. Google I use all the time and it usually takes you right to Stack Overflow, so I use those two a lot for help. Okay, quickly, quickly. Reading data into R.

[ONSCREEN slide content 21] Reading data into R.
Link to Documentation for the R package called ‘datasets’ https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html Read from comma or tab separated files (.csv, .tab, .txt): read.table(file = "C:/pathname/DATATABLE.tab"), read.csv(file = "C:/pathname/DATATABLE.csv"). Read excel – need to use external package to read .xlsx files like readxl: readxl::read_xlsx(path = "C:/pathname/DATATABLE.xlsx"). Read data from other programming language formats (Stata, SPSS, SAS) – need to use external package like haven: haven::read_spss(file = "C:/pathname/DATATABLE.sav"), haven::read_stata(file = "C:/pathname/DATATABLE.dta"), haven::read_sas(data_file = "C:/pathname/DATATABLE.sas7bdat").

[Sarah Sernaker] These are just examples. I'm not going to get into the next few slides too much; they're sort of just examples of functions for how to get started when you're reading data. If you have a csv file or an Excel file, or if you have data from another programming language such as Stata or SPSS, R can read in stuff in sav, dta, or SAS formats. To do that you need this external package called haven.

[ONSCREEN slide content 22] Considerations when writing code. Conceptualize what you want to do first. Sketch out plan and pseudo code, especially for figures and tables. Understand what you can get out of your data and any limitations you may face when using it in R and any R package limitations. (Some very niche or highly complex combination of analyses may be lacking from existing R packages.) There are many ways to accomplish the same task and approach writing a program; do what makes most sense to you with however many intermediate steps. Use informative but concise variable naming conventions and formats, use _ in names, upper and lower cases.

[Sarah Sernaker] And then quickly, considerations when writing code. This is sort of like the common practices. Conceptualize what you want to do first.
So this is not just R, but thinking about what you want to do and writing it down can really help you guide your program and reduce errors. And yeah, so conceptualizing, understanding what you can get out of your data. You can be like a dead data set all you want, but it might not have the information that you need to answer your question. And similarly R might not be the best language for some analyses. There are some complex combinations of analyses that I've seen where R is not the best. For instance Mplus comes to mind if we're talking about, what's the big one, structural equation modeling; it always comes back to SEM. Or I like survey analysis in Stata because I like the functionality better. So R has limitations. R is great for so many things, but it's good to understand what you can and can't do, not just with R but with your data also.

[ONSCREEN slide content 23] Considerations when writing code. Use common programming format standards and guidelines to make code consistent, readable, and maintainable. Comment, comment, comment code. Use indentations and line breaks for readability. Use informative but concise variable naming conventions and formats, use _ in names (e.g. var_yr2010), upper and lower cases (e.g. raceEthn). Try to avoid “hard-coding” values, may cause errors later. For example, rather than calculating the mean of a variable as 2.3 and setting x = 2.3, instead define x = mean(VARIABLE) so that if VARIABLE changes at all the mean will update in the code accordingly.

[Sarah Sernaker] Just moving forward, more common considerations. Commenting, indenting, using informative names, don't hard code.

[ONSCREEN slide content 24] Considerations when writing code. Code in R can be split across multiple lines – must be split in such a way that the code would continue on and not just end, lines should not start with operators.
For example: Would just end at line 1 and throw error at line 2: (line 1) X = 1 + 2 + 3 (line 2) + 4. Would evaluate full summation: (line 1) X = 1 + 2 + (line 2) 3 + 4.

[Sarah Sernaker] When writing in R, just in common practice, you can split code across multiple lines. You just need to be careful in the way that you do it. This is just an example. This half will not evaluate as you want because it's just going to finish on this line. It's not going to keep going; there's nothing to keep going here. Whereas this is what you would want to put. The code starts 1 plus 2 plus what, and then it goes to the next line, and it's like, oh, 3 plus 4, okay, then it finishes it up. So this is just when you're writing R code specifically.

[ONSCREEN slide content 25] Additional random R tips. R is case sensitive. Use <- or = to assign or create variables in R. Vectors are created using c(), must all be same data type, for example: c("one", "two", "three", "four") or c(1,2,3,4,5) or c(x,y). Index variables within a data table using dollar signs, DATASET$VAR1, or brackets, DATASET[, "VAR1"].

[Sarah Sernaker] Additional random tips. R is case sensitive, so Cat with a capital C is different than cat with a lowercase c, and all the things with case sensitivity. You can use either this arrow, a less-than sign and a dash, to assign variables, so <-, or I really like the equal sign. To me it's just one character versus two, and that's how you can tell I've added to code, because a lot of my colleagues use the arrow and then I'll throw in my code and it has all the equal signs. So that's just to say you might see this arrow notation used in a lot of other places to assign variables; it's used all over the code. I like to use the equal sign and they are synonymous. I think there are some very niche issues you can run into, but I've never had that issue. Vectors: this notation with c() comes up a lot, and I think it means combine, or concatenate more classically.
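The line-splitting rule and the tips above can be sketched as follows. This is a minimal illustration with made-up object names, not the code shown in the video. One detail worth noting as an aside: with a leading plus sign specifically, R does not stop with an error. It reads the second line as a separate expression, +4, so the first variable silently ends up with the shorter sum. Other operators, such as a leading multiplication sign, would raise an error.

```r
# R is case sensitive: 'Pop' and 'pop' are two different objects.
Pop <- 100
pop <- 200
same_object <- identical(Pop, pop)   # FALSE

# '<-' and '=' both assign at the top level of a script.
x <- 3
y = 3
same_value <- identical(x, y)        # TRUE

# A line that ends with an operator tells R the expression continues.
total <- 1 + 2 +
  3 + 4

# Starting the next line with '+' ends the first line early:
# 'partial' stops at 1 + 2 + 3, and '+ 4' is evaluated on its own.
partial <- 1 + 2 + 3
+ 4

# c() combines values of the same type into a vector.
words <- c("one", "two", "three", "four")
numbers <- c(1, 2, 3, 4, 5)

# Index a column with $ or with brackets; both return the same values.
toy_data <- data.frame(VAR1 = numbers, VAR2 = numbers * 10)
by_dollar <- toy_data$VAR1
by_bracket <- toy_data[, "VAR1"]
```

Comparing total and partial after running this shows why the placement of the operator matters even when no error appears.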
[ONSCREEN slide content 26] Additional random R tips. Check data types, and know how to do type conversions – lots of errors or problems arise because of incompatible or incorrect data types, e.g. categorical variables in a model as numeric. Characters can be referenced with single or double quotes – but if you have quotes within quotes, the outer quotes should differ from the inner quotes, ex. "County's population", or 'The "substantiated" cases'. Many coding techniques can be combined into one line (e.g. simultaneously using logical statements, subsetting syntax, assigning new values).

[Sarah Sernaker] And this is just to show you accessing vectors and whatnot, indexing variables, checking data types, manipulating data. And then the last few slides were just functions, a sort of reference of useful functions as you explore data. I'm just going to quickly throw up my code. We have four minutes left, so I take that as a win.

[ONSCREEN slide content 27] Manipulating data in R. Joining data: merge(DATA_A, DATA_B, by = "shared_variable"), cbind(DATA_A, DATA_B), rbind(DATA_A, DATA_B). Subsetting/filtering data: subset(DATA, var1 == CONDITION & var2 < 100), sample(DATA). Mutating or creating variables, for example: DATA$var1_rate1k = DATA$var1 / 1000, DATA$sex = ifelse(DATA$sex == 1, "Male", "Female").

[ONSCREEN slide content 28] String data in R. Taking substring: substr(). String length: nchar(). Replace string: str_replace(). Make upper or lower case: str_to_upper(), str_to_lower(). Sort: sort(). Look for character or substring: grep(), grepl(). Join strings: paste(). Split: strsplit().

[Paige Logan] Sorry Sarah, I was not being a good timekeeper for you.

[Sarah Sernaker] Oh that's okay. I saw stuff in the chat and I was like, I hope she's not telling me time is running out, I can't see. This code is available in the Box folder, so when you get RStudio up and running you should open it with RStudio. So open it with RStudio; make sure you're not opening R, the R programming language.
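The joining, subsetting, mutating, and string functions listed on the slides above can be sketched with two tiny made-up tables. This is an illustration only, not the course code. The slides list str_replace() and str_to_upper(), which come from the stringr package; to keep the sketch runnable without extra installs, it swaps in the base R functions sub() and toupper(), which do similar jobs.

```r
# Two small made-up tables that share a 'state' column.
data_a <- data.frame(state = c("NY", "CA", "TX"), var1 = c(5000, 12000, 800))
data_b <- data.frame(state = c("NY", "CA", "TX"), sex = c(1, 2, 1))

# Join on the shared column.
joined <- merge(data_a, data_b, by = "state")

# Keep only rows that meet both conditions.
large_male <- subset(joined, sex == 1 & var1 > 1000)

# Create new variables from existing ones.
joined$var1_rate1k <- joined$var1 / 1000
joined$sex_label <- ifelse(joined$sex == 1, "Male", "Female")

# Base R string helpers.
label <- "County's population"
first_six <- substr(label, 1, 6)            # take a substring
n_chars <- nchar(label)                     # string length
upper <- toupper(label)                     # base R analogue of str_to_upper()
replaced <- sub("County", "State", label)   # base R analogue of str_replace()
has_pop <- grepl("population", label)       # TRUE/FALSE search
pasted <- paste("NY", "2019", sep = "_")    # join strings
pieces <- strsplit("NY_2019", "_")[[1]]     # split into a character vector
```

Printing joined and large_male after each step is a quick way to confirm the code did what was intended rather than assuming it did.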
And you should be able to run through this code. The only thing is this is my working directory, so this is the only part you should need to change and everything should work fine. So this is where you would put your folder path to wherever you download our R content. And so there's stuff about installing packages, libraries, so once you've installed them you need to reference them to be able to use them. Basic example of what I mean by assigning variables, using variables. Here, let me just run through this so you can see. So I've opened R, I've run through a lot of stuff, sorry, I'm like very entrenched in R so I overlook the small things. So what I've done is when you are highlighted on a row, I work on a Mac and I do command return and that will run that line of code. And you can see down here this is what ran. This showed up down here because I did command return. Or you can highlight it and click run; notice it does the same thing. And it's fine to run stuff multiple times usually. So this is fine, it's just setting it over and over again. So stuff's running down there, and if you do the quick keyboard, so command return, it'll just take you to the next runnable line so that's why it's sort of jumping down. So I've clicked install packages and I already have Tidyverse so it's actually telling me it could be updated; I'll do that later. But there's that. Okay, 2 minutes left. I know we will usually leave time for questions but I think we'll have time for that also in the second half. Just really quickly, so I'm just going through here. Notice I ran this, x assigned to one, and now we see x up here has a value of one. We can assign y using x. These are just logical statements. So y is greater than x, that's true. X is greater than y, that's false. You can run stuff that doesn't assign to anything and it'll just show up, so it's sometimes helpful just to view stuff quickly. I'm assigning a new variable with my equal sign because that's what I like to use.
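The install-once, load-every-session pattern described above can be sketched as follows. The helper name ensure_package is hypothetical (it is not from the session or from any package), and stats is used only because it ships with R, so nothing is downloaded when the sketch runs.

```r
# Hypothetical helper: install a package only when it is missing, then load it
ensure_package = function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)                 # runs only when the package is absent
  }
  library(pkg, character.only = TRUE)     # still required in every new session
  invisible(pkg %in% loadedNamespaces())
}

# stats ships with R, so this call skips the install step
loaded_ok = ensure_package("stats")
loaded_ok

# Where R resolves relative paths such as "Data/census_2015_2019.csv"
getwd()
```

Swapping "stats" for "tidyverse" would, under the same assumptions, install it on the first run and only load it on later runs.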
Just helpful functions that I use. This colon gives you some consecutive numbers. Looking at data types: these are all numbers, these are all strings or characters; strings are multiple characters basically. Or characters are rather like words and stuff and strings are single letters or numbers. Anyway, so just going through so you can see: numbers, and strings always have quotes around them. You cannot combine types. The default is to set everything to string. So if you're trying to combine anything with a string, strings will nuke whatever you're trying to combine it with. And what's the other way, like how do you convert these strings? You could convert to integer, and that's sometimes useful when you read in data and it didn't read in how you wanted. But if you try to run this, for example, on a string, like, it just doesn't make sense, so this is an example of a warning message. It ran but it was like, hey, this is stupid, rethink what you did there. So yeah, then we get to the actual subs and stuff, so this is using real data, the Census. When you're looking at data sets, never assume just because your code ran that it did what you want. I should repeat that every month because that's like the mantra of programming. Just because it ran does not mean it did what you wanted. So I'm reading data, to my point, and usually when you read data you should get to know it and make sure it looks right, so some helpful functions: looking at the names of the columns. This View function will pop it up in a new window so this is helpful. I will say if your data are really big the View function may choke a little bit. Head gives you the first few, summary gives you a five-number summary, str, accessing stuff, indexing data, tables, subsetting. Doing the same thing, so showing you that code can be written in numerous ways to accomplish the same thing.
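A small sketch of two points from this walkthrough: combining types coerces everything to character (R stores both single letters and whole words as type character), and two differently written subsets can be checked against each other rather than assumed equal. The data frame toy and its values are invented for illustration and are not the census file used in the session.

```r
# Combining numbers and strings coerces everything to character
a = c(1, 2, 3)
b = c("1", "2", "3", "4", "5")
combined = c(a, b)
class(combined)                       # "character"

# Converting back works for digit strings; non-numbers become NA with a warning
b_int   = as.integer(b)
cow_int = suppressWarnings(as.integer("cow"))   # warning silenced for the demo
is.na(cow_int)                        # TRUE

# Toy data (made-up values) standing in for the census file
toy = data.frame(
  st  = c("RI", "RI", "NY", "NY"),
  cy  = c(2018, 2019, 2018, 2019),
  pop = c(100, 110, 900, 950)
)

# Two ways to keep Rhode Island rows in 2019
way1 = subset(toy, st == "RI" & cy == 2019)
way2 = toy[toy$st == "RI" & toy$cy == 2019, ]

# "It ran" is not the same as "it did what I wanted": inspect the result
all.equal(way1, way2)
nrow(way1)
table(way1$st)
```

Checks like nrow() and table() after every subset are a cheap way to confirm the rows kept are the rows intended.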
Sometimes I find I go back to my code and I'm like, why did I do it that way, and I rewrite the code, which is a waste of time. But there are many ways to write code. And so let me stop talking.

[Paige Logan] Alright, we do have one question that came in through the Q&A, so maybe I'll just, can you see the questions or do you want me to read it?

[Erin McCauley] Read it for the transcript.

[Paige Logan] Yes, all right, okay, so the question is: how do you know if you have the latest version of R or RStudio and when does it matter if you have the latest version?

[Sarah Sernaker] That's a great question. When you first open RStudio, every time you open it, doesn't matter how many times you use it, this will be the first text that comes up. I kind of breeze past it because again a lot of things I take for granted. This comes up when you open it and you'll see the version right here and it'll tell you the date of the version, and they're always named silly things so this one's Puppy Cup. I've seen like way weirder stuff. And so that will tell you. It honestly does not usually make a difference. I will say package and library versions, that makes more difference. You might find you go to rerun old code and you're referencing a library and you run through your code and then you get to a point where you see a warning or an error and it's like, oh, this function has been updated, use this one instead, or sometimes it breaks down in unfortunate ways and it needs more touch and go. So that's where I'd say usually the version of RStudio does not matter so much. I would update it once or twice a year but I've gone a long time without updating it. It's more packages, and you'll know when the package is running into problems. If you encounter problems, that's when you should look at the version of the package. How do you know when packages, so I think that addressed that. I'm just looking at the next question: how do you know what packages are already installed? Also a great question.
Usually you just get to know what's already there, but there is an explicit tab, so in your RStudio in the bottom right quadrant or whatever there's a Packages tab and you can see anything that's listed here is installed on your computer. Anything that has a check has been loaded. So you might want a function from this class package and you can see you have it, but you need to reference the library. So you need to say, okay R, we're using this library; for instance I'll just show you library(class). And I'm going to run it, I don't know if everyone saw, but now we have a check mark, so now R is like, okay, we're going to use class, and it knows those functions. The problem is if you don't load a package and you try to reference a function within that package, R is just going to be like, I don't know that function. And R will tell you. I will say a really good thing about R is it's pretty informative in its errors and warnings. It's pretty clear of like, I don't know this variable, or it'll say this is referencing this version, or it'll say variable not found, or sometimes type versions, so.

[Erin McCauley] All right, well thank you Sarah so much for that first session. Now we're gonna transition into our regular office hour series.

[Voiceover] The National Data Archive on Child Abuse and Neglect is a joint project of Duke University, Cornell University, University of California, San Francisco, and Mathematica. Funding for NDACAN is provided by the Children's Bureau, an Office of the Administration for Children and Families.

[R Code]
# # # LeaRn with NDACAN
# by SaRah SeRnakeR
# # # Week 1: Introduction to R, September 30, 2024

## This is a comment created using the pound sign, aka hash tag, in front of text
# Anything that comes after any pound sign will NOT be run and R will ignore
print("hey there") # can put a comment anywhere in line

## Set working directory where R file will be saved.
# Usually the project folder
# Often where the data files are also located
# CHANGE TO YOUR WORKING DIRECTORY
setwd("/Users/ss1216/Library/CloudStorage/Box-Box/NDACAN/Presentations/Office hours/Office hours 2024-2025/LeaRn with NDACAN")

## install packages (only need to successfully run this once - could then delete)
install.packages('tidyverse') # family of functions and special syntax that
                              # can make coding much more efficient and elegant
install.packages('ggplot2') # for beautiful figures

## MUST load installed packages to be able to use them
library(tidyverse)
library(ggplot2)
library(class)

## Basic example of assigning and using variables
# note how output is displayed in the console
# note how variables in memory appear in the environment
x <- 1
y <- 3*x
y > x
x > y
x+y
z = x+y

# quick vector of consecutive numbers
1:10
10:1

# data types
a = c(1,2,3)
b = c("1","2", "3","4","5")
a
b
c(a,b)
as.integer(b)
as.integer(c("cow"))

## use "help" function to see the input and various options,
# what the function will output, help files and documentation
help(table)

# LOADING AND USING DATA

## Read census data that are .csv and in the "Data" folder
census = read.csv("Data/census_2015_2019.csv")

## get variable names
names(census)

## browse full data
View(census)

## look at first few rows of data
head(census)

## get summary statistics of all variables in data - look for missingness and range of values
summary(census)

## get data type of all variables in data
str(census)

## indexing the variable in the data
census$cy # the whole vector of cy
census[,"cy"] # does the same thing as above, warning this method can sometimes cause errors with matrices
head(census$cy) # the first few obs of cy
census$cy[1:100] # the first 100 obs of cy
head(census$st) # first few obs of st variable

## index the data
census[1:50,] # first 50 rows of data
census[1:50, c("cy", "state")] # first 50 rows only of cy and state

## get counts of variable values (ideal for discrete or factor variables)
table(census$cy)
table(census$state)
table(census$sex, useNA = 'ifany')
table(census$race6, useNA = 'ifany')

## subset just Rhode Island
census_RI = subset(census, st == "RI")
head(census_RI)
# accomplishes the same thing as above (only keep rows where statement true)
census[census$st == "RI",]
census$st == "RI" # examine what this does

## subset RI and cy = 2019
census_RI_2019 = subset(census, st == "RI" & cy == 2019)
head(census_RI_2019)

## unique values
unique(census$race6)
unique(census$st)

## get statistics of count/discrete or continuous variables
mean(census$pop)
mean(census_RI$pop)
mean(census_RI_2019$pop)
median(census$pop)
sd(census$pop)
max(census$pop)

## create new variables
census$newVar = census$pop/100000
head(census)

# create new race/ethnicity variable as the combination of race and hispanic identity
# (after the first step, the "else" value is the existing raceethn, so each line
#  keeps the codes already assigned instead of resetting them to 0)
census$raceethn = with(census, ifelse(race6 == 1 & hisp == 0, 1, 0))
census$raceethn = with(census, ifelse(race6 == 2 & hisp == 0, 2, raceethn))
census$raceethn = with(census, ifelse(race6 == 3 & hisp == 0, 3, raceethn))
#............... continue for 6 levels of race & hispanic = 1 to get 7 race/ethnicity levels

# make labeled factor variable of race/ethnicity
census$raceethn_factor = factor(census$raceethn, levels = 0:3,
                                labels = c("Other", "White, Non-Hispanic", "Black, Non-Hispanic",
                                           "Native American/AK Native, Non-Hispanic"))

# END OF CODE
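For the two Q&A questions about versions and installed packages, the information shown in the RStudio startup text and Packages tab can also be pulled from the console. This is a minimal sketch using base R only and is not part of the session's script; the variable names are made up here, and the tools package stands in for class because it always ships with R and is not normally attached at startup.

```r
# Version of R itself (the same text RStudio prints at startup)
r_version = R.version.string
r_version

# Version of one installed package
stats_version = packageVersion("stats")
stats_version

# Names of every package installed on this computer
installed = rownames(installed.packages())
head(installed)
"tools" %in% installed        # is the tools package installed?

# Attached packages (the checked boxes in the Packages tab) appear in search()
attached_before = "package:tools" %in% search()
library(tools)
attached_after = "package:tools" %in% search()
c(before = attached_before, after = attached_after)
```

Replacing "tools" with "class" or "tidyverse" answers the same questions for the packages used in the session, assuming they are installed.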