[musical cue] [Voiceover] National Data Archive on Child Abuse and Neglect. [Clayton Covington] Okay, welcome to our attendees to today's session. My name is Clayton Covington, I'm the Graduate Research Associate with the National Data Archive on Child Abuse and Neglect, and I'll be facilitating this week's session of the Summer Training Series. If you have any questions, please submit them to the Q and A box. Also, just so you know, the session is being recorded. So again, this is the NDACAN Summer Training Series, which is hosted by the National Data Archive on Child Abuse and Neglect, housed at both Cornell University and Duke University. We're also sponsored by the Children's Bureau, which is an office of the Administration for Children and Families at the U.S. Department of Health and Human Services. To give you all a little background on what we've done so far in this year's series and where we're going moving forward: we started out at the beginning of July with an introduction to NDACAN and the administrative data holdings at the archive. We then highlighted a new data acquisition known as the CCOULD data set, which links child welfare data with Medicaid data. Then we also had a couple of workshops, including one on causal inference using administrative data. Last week's session was about evaluating and dealing with missing data in R. And today's presentation is about time series analysis using Stata software. All right, so without further ado, I'm going to pass it over to our presenter, Dr. Alexander F. Roehrkasse, an Assistant Professor at Butler University, who's going to cover this week's session. [Alexander F. Roehrkasse] Thanks, Clayton. Hi everyone, thanks for being here today. I'm really excited to give this presentation and to hear about your ideas and questions. We're talking about time series analysis today, and I just want to be honest right up front: time series analysis is pretty difficult. There are a lot of technical aspects to it, so today's presentation is really going to be about giving you some basic ideas about how you might use time series analysis in the analysis of archive data. What are the reasons you might want to use these kinds of methods? If you decided you did want to use them, how would you get started? What are the things you need to do to organize your data to get it ready for time series analysis? And then we'll do a little bit of time series analysis: we'll do some autoregression modeling and even some forecasting. But there are a lot of technical aspects to these models that we're just not going to be able to cover today. So it's going to be a quick crash course on how to get started doing time series analysis using child maltreatment data provided by NDACAN. And as Clayton mentioned, if you have quick clarifying questions, please feel free to drop them in the chat as we go and I'll try to pick them up, but of course I'll leave plenty of time at the end for Q and A about bigger questions. Okay, so what is time series analysis? What do we mean when we say the words "time series analysis"? Well, first, time series data, the kinds of data that are amenable to time series analysis, are always going to be a series of data points that are indexed in time order. So it's always going to be some variable that we track over time, something that changes over time. So the data are always going to be sequenced.
It follows, then, that time series analysis is just going to be the set of methods we use for extracting statistics and other meaningful information, very often visualizations, from time-ordered data. Time series forecasting is a subset of these methods that entails using a statistical model to predict future or unobserved data points, data points that haven't yet occurred, based on the patterns of past observed data. And sometimes we'll use regression analyses, which you may be familiar with in other contexts, to test the relationships between different points of a time series, or even multiple time series. I just want to clarify what we're not going to be talking about today. Panel data is really the more general case of a time series, where we have multiple units that each have a given time series, so we might be talking about, say, multiple states or multiple counties, each of which has a trend in maltreatment. We'll talk about multiple states today just to visualize some different trends, but panel analysis refers to the set of models used to analyze time series for multiple units at a time. That's not what we're going to be talking about today; we'll be analyzing one trend at a time, so you'll see in our demonstration that we'll focus on one state and just analyze trends of confirmed abuse in California alone. We could, of course, analyze one unit, say California, but multiple different time series in that single unit, and we call this multivariate time series analysis. That very much falls under the umbrella of time series analysis, but it'll be a little bit outside the scope of today's presentation. Okay, so now that we know what time series analysis is, why would we use it? One of the main reasons to think about time series analysis is if your variable of interest is serially correlated. What does that mean? It means that values of your variable at one point in time are correlated with the values of your variable at previous or future points in time. That is to say, your variable is correlated with itself over time. Many variables that we provide in NDACAN data sets are serially correlated, and we can think about different reasons why that might be true, so let's think about NCANDS, which we'll be talking about today. NCANDS is an administrative data set that comprises screened-in reports of child abuse and neglect. Let's say we wanted to count up reports of abuse or neglect, or confirmed abuse or neglect. Now, I think it's fair to say that CPS agencies are probably going to do something this month that looks pretty similar to whatever they did last month. So whatever they did last month is going to positively predict what they're going to do this month, and we might expect rates of confirmed abuse or neglect this month to be correlated with rates of abuse or neglect last month, because CPS agencies are largely doing the same thing: using the same practices, exerting the same amount of effort. That would be an example of positive serial correlation: what happened last month helps predict what happened this month. That example is a one-month lag, month to month, last month to this month. But of course we know that reports of abuse and neglect also have a seasonality to them.
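One quick way to check for this kind of serial correlation in Stata is a correlogram. Here is a minimal sketch, assuming a single state's monthly count variable named abuse in data that have already been tsset (as in the demonstration later):

correlate abuse L1.abuse // the lag-1, month-to-month serial correlation
corrgram abuse, lags(12) // autocorrelations at lags 1 through 12; lag 12 speaks to seasonality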
So one of the main sources of maltreatment reporting is school-based reporters: teachers, counselors. Children are only in school certain months of the year, so we might expect a seasonal pattern in abuse and neglect reporting. That is to say, child abuse reporters are probably going to do the same thing this month that they did this month last year. So August reporting in 2023 is probably going to look similar to August reporting in 2022, because there's some seasonality to reporting: August is going to look something like August each and every year. There might also be negative serial correlation. It might, for example, be the case that if CPS agents in a given county go out and investigate a lot of reports in one month, there might actually be fewer reports the following month as a result of all those investigations that happened in the prior month. You sort of exhaust the stock of available cases, so there are fewer to report as a result of having investigated more in the prior month. That would be an example of negative serial correlation. All of these are good examples of why you might consider using time series analysis to analyze your data. You might, of course, just want to visualize trends in your variable of interest, but as you'll see in a minute, a lot of trends in maltreatment data are kind of noisy: they go up and down quite a lot. So maybe we want to understand the bigger pattern, the broader trend, without losing all of that variation but also without so much noise. Time series analyses can be really helpful for visualizing noisy trends. And then, depending on what our research goals are, and I'm guessing many of you work in academia but maybe some of you also work in policy or administration, we may very well want to understand or develop some expectations about future trends in child maltreatment. Time series analyses can be particularly helpful for forecasting future values of child abuse and neglect using very limited amounts of information. So we'll do a little bit of forecasting today, and we won't use any explanatory variables. We won't draw in information about child poverty or social support. We'll only use prior values of our confirmed abuse variable to predict future values of our confirmed abuse variable. So time series analyses can be a powerful way of predicting the future using only past values of our variable of interest. Here's an example of a time series. This is not archive data; this is Google Trends data. I've used Google Trends to examine how many people Google the term "child maltreatment," and that line represents the trend, the time series, of Google searches for child maltreatment. Now, if you look at the vertical axis, it goes from zero to 100; it's indexed to the day on which there were the most searches for child maltreatment. It looks like a day in mid-to-late 2021 was when it was searched the most, so that value gets 100. So we're just looking at an index of interest in child maltreatment over several years. You can see that this trend is pretty noisy; it goes up and down in kind of weird and wonky ways, and there aren't a lot of clear trends here. There does seem to be some seasonality to it, though: if you squint real hard, you can see that this trend actually does kind of map onto the school year a little bit.
You see some dips in summer, and even briefly over winter break, and you see larger volumes of searches for child maltreatment during the school year. So even Google searches for child maltreatment exhibit some seasonality, some serial correlation, that we would want to try to understand better. This trend is kind of noisy; it's hard to tell when it's going up or going down, so we might also want to use time series methods to smooth this trend to try to understand what's really going on here. If we looked at searches for child maltreatment hour by hour, it'd probably be even noisier and harder to interpret; if we looked at searches year over year, it'd probably be too smooth, there wouldn't be enough variation for it really to be of interest. So we can explore different ways of smoothing this trend to better understand what's really going on. And then we could use time series methods to forecast future trends in searches for child maltreatment. It's a little bit hard, just staring at this graph, to know whether searches for child maltreatment are going to go up or go down. We can build a statistical model to forecast them. [ONSCREEN equation: Abuse_t = α_0 + α_1 Abuse_{t-1} + ... + α_k Abuse_{t-k} + ε_t] Okay, just a very little bit of math to help start an initial understanding of how we'll be thinking about the statistics of time series. In our Stata example we'll actually walk through an autoregression model that also includes a moving average component, but I just want to highlight here what the autoregressive component of our models is going to be. We say univariate because we're really only analyzing one variable in our whole model, and that's abuse. You see on the left-hand side we're trying to predict the number of, in our example, substantiated or indicated reports of abuse at time t: in any given, let's say, month, how many reports of confirmed abuse will there be? We'll model that as a function of a constant, α_0, and then a parameter, α_1, associated with the value of our outcome variable in the prior period, t-1. So we'll be examining how abuse in period t-1, the prior period or prior month, predicts abuse in the current period t. And then, depending on how many lags we include in our model, let's say k lags, we'll have abuse in each of the k prior periods influence abuse in the current period. We won't really get to this in our Stata demonstration today, but you may also come across the term vector autoregression in looking into time series analyses. [ONSCREEN equation: (Abuse_t, Neglect_t)′ = a_0 + A_1 (Abuse_{t-1}, Neglect_{t-1})′ + ... + A_k (Abuse_{t-k}, Neglect_{t-k})′ + (ε_{1,t}, ε_{2,t})′] And just to clarify, here we're not talking about vectors of different units, like a time series in California and Arkansas and Wyoming.
Instead we're talking about multiple time series: time series of different variables in the same unit. How might multiple trends influence one another? We might look at, for example, trends in confirmed abuse and trends in confirmed neglect in the same state, let's say California. And then we would model those trends as a function not only of their own prior values but of the prior values of the other trend. How is abuse influenced not only by prior levels of abuse but also by prior levels of neglect? That's what we mean when we talk about vector autoregression. Okay, let's say you decide that time series analysis is appropriate for your research. What are you going to need to make that happen? It's worth saying right up front that you really do need a fairly large sample of sequenced observations. You're not going to be able to do the methods we're talking about today if you're only examining, say, 12 months in fiscal year 2017; that's just not a long enough time series to implement these methods. So you're going to need a larger sample of sequenced observations. The models we'll run today will have something on the order of 100 months, which is still a pretty small time series, but enough to do some of these things. You'll also need observations that are measured at regular time intervals. It can't be the case that one time interval is three days, another is eight days, and another is 15 days; the intervals at which you measure observations need to be the same. We're going to use NCANDS in our example today, and it's worth noting that the most detailed time variable available in NCANDS data is the half-month. There is a report date variable in NCANDS, but it's coded to the half-month: for example, all reports that occur in the first half of August get coded as August 8th, and all reports that occur in the second half of August get coded as August 23rd. But of course months are not all the same length, some months are 28 days and some are 31 days, so half-months aren't all the same length either; if we just use the half-month, we don't have regular time intervals. You'll see in our demonstration that we're actually going to collapse our data to the month level, which, it's worth noting, is still not actually a regular interval, but it's good enough for our purposes, so we'll acknowledge that limitation and treat months as regular enough intervals. And then, depending on what you're really trying to do, you want to use methods that are dedicated to the analysis of time series. You could probably reverse-engineer a lot of what we're doing with more basic Stata commands, but Stata has a whole suite of time series commands, and we'll want to use those commands because they incorporate a lot of important considerations for time series analysis. Okay, what are some examples of time series questions? I've already given you a few, but let's talk through a few more examples of why you might want to do time series analysis using archive data. Let's say you're interested in how seasonal screened-in reports of maltreatment are. How much do they vary over the course of a calendar year? Is August really a special time? Is December really a special time?
You might use a time series filter to filter state or county trends for that seasonal cyclicality; we'll talk a little bit about how to do that, how to think about a 12-month lag where year-over-year trends in a particular month tend to look the same. You might be interested in how the rate of confirmed maltreatment in a specific time interval might depend on the rate in the previous interval, and here we might use an autoregression model; we'll be demonstrating an ARIMA model to try to understand how much maltreatment in the previous month predicts maltreatment this month. We might want to know what future rates of confirmed neglect might be, and in this case we would build a forecast model to try to predict future state or county trends. And then, again, there's the example of vector autoregression models: if we're interested in how different time series, like physical abuse and neglect, are related to one another, we might use a vector autoregression model to try to understand those trends and how they're related. Okay, so now I'm going to move to a demonstration of time series analysis in Stata. I'm going to drop two links in the chat. You'll see a link to the Stata do file that I'll be using, so you can go ahead and follow that link and download the file. [ONSCREEN link to Stata do file https://drive.google.com/file/d/1kzZl6JmID_gEv8zzdConrgCQJ3X9JRou/view?usp=sharing] You won't be able to actually run the commands I'm going to run, because you don't have the underlying data and I can't share that with you, but you can follow along with the code and annotate it yourself. [ONSCREEN link to slide presentations https://docs.google.com/presentation/d/1b9MPcKcD7_Unfo0IYYIH-rhUoiIP5oSi/edit?usp=sharing&ouid=114322564655947637684&rtpof=true&sd=true] And the slides for this presentation are also available right there. We'll post them on the archive's site later, but for now, if you just want to follow along, they're there as well. Before we begin with the demonstration, I also want to note a few helpful resources for time series analysis in Stata. [ONSCREEN Link to Stata reference manual on time series https://www.stata.com/manuals13/tstimeseries.pdf. Link to Dr. Torres-Reyna's slides on time series analysis in Stata https://www.princeton.edu/~otorres/TS101.pdf. Link to Becketti's Introduction to Time Series Using Stata, Revised Edition https://www.stata-press.com/books/introduction-to-time-series-using-stata/] The Stata reference manual is very good. Dr. Torres-Reyna at Princeton also has slides on time series analysis using Stata; they're a little bit old and Stata's commands have been updated since, but I think it's still a helpful beginner's introduction to a lot of important time series concepts. There is a whole book dedicated to time series analysis in Stata, and you'll see in my Stata code that I've also included a few other resources that I find particularly helpful. Great, okay, so I'm using Stata 16. It's worth noting there actually are some new time series functions in Stata 18 that users might be interested in. We'll be using a do file here; I always recommend using a do file when you're using Stata. You don't want to just type stuff into the command line, and I strongly discourage people from using drop-down menus. The main reason is so that your analyses can be reproducible: you have a document that outlines everything you've done in your analysis.
So again, this file itself is available at this Google Drive link, so you can go ahead and download it and keep it for your own learning going forward. [ONSCREEN link to Stata tutorial series videos https://www.youtube.com/playlist?list=PLN5IskQdgXWlEVJe6t9urIMoJVHdifFuR] Here are a few other resources that you might explore. Stata has its own tutorial series. It's a pretty beginner-level series on how to do time series analysis, and they use the drop-down menus to do most things, but you'll be able to see the command-line code in those videos. There's also a link to the Stata reference manual, and then Juan D'Amico has a really helpful set of more intermediate-level time series analysis tutorials for Stata that I would highly recommend and that have been helpful for me in preparing this presentation. [ONSCREEN link to Juan D'Amico's tutorial series : https://youtube.com/playlist?list=PLsZ8kVwX52ZEFZsVViYs60lf7idJuKKUO] Okay, let's get rolling. Let's first clear any data we have in memory. We'll set more to off, which just means that if we're doing anything computationally intensive, Stata won't prompt us to tell it to keep going. I believe we'll be using some random processes in today's presentation, and it's always helpful to set a seed so that you can reproduce your work. And then we'll set a working directory so that Stata knows where to read files from. I'm going to first walk you through how you would set up data for a time series analysis, and I'm going to read in some example data. These are not real data, they've been anonymized, but they roughly approximate a one percent sample of several key variables from the 2017 NCANDS Child File. So let's go ahead and read that data in. And let's examine the first observation. We have just a few variables here. We have the FIPS code corresponding to the maltreatment report; we see that it came from York County, Maine. We have a report date for that report, November 23rd, 2016, and then we have two variables for each of up to four different maltreatment types. So chmal1 is the kind of maltreatment associated with the first instance of maltreatment on that report, and maltreatment one level (mal1lev) is what the disposition of that report ultimately was; this report got an alternative disposition. Most often it's going to be either substantiated or unsubstantiated. And because there was only one type of abuse or neglect on this report, the remaining maltreatment variables are missing. It's worth noting that what we see for this FIPS variable is actually a value label. If we ask Stata to show us the unlabeled value, we'll see that the FIPS code is actually a state and county code combined: 23 is the state FIPS code for Maine, and 031 is the county FIPS code for York County. You'll see why that's important in just a minute. Okay, so let's clean our data. We're going to analyze trends at the state level, so let's create a state FIPS code, and to do that we're going to want to extract the state FIPS code from the combined state-county FIPS code. The easiest way to do that is to divide the combined code by 1,000 and round to a whole number, and if we do that and re-examine our data, we can see that we have a new state FIPS code here that's just 23, the state FIPS code for Maine. Now, this next step is very important for time series analysis. Because we're dealing with time here, we need Stata to understand that our date variable is a time variable.
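In Stata, that conversion comes down to two built-in functions, date() and mofd(), which we're about to walk through; a minimal sketch from the do file (rptdt is the NCANDS report date string):

gen date = date(rptdt, "YMD") // parse the string report date into a Stata daily date
format date %td               // display it as, e.g., 23nov2016
gen datem = mofd(date)        // convert the daily date to a monthly date
format datem %tm              // display it as, e.g., 2016m11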
And just as with other software, the same is true in R and, I believe, in SPSS and SAS: each piece of software has its own way of dealing with date variables. We're going to need to convert this report date variable into a date variable that Stata understands. So let's first tell Stata that our report date variable is a date; the way we'll do that is to generate a date variable using the date() function, telling Stata that our report date variable is in year-month-day format. Then we'll format it in a way that's a little easier for humans to read and see what it looks like. So now we have this date variable; Stata uses this specific way to represent dates, and now Stata understands that this is literally a point in time, not just a sequence of characters. But because this report didn't actually occur on the 23rd of November, it just occurred sometime in the second half of November, what we're ultimately going to do is analyze our data at the year-month level. We're going to end up collapsing our half-months into full months, and we need to tell Stata that this is actually not a specific date but rather a year-month variable. So we're going to use the month-of-date function, mofd(date), to convert our date into a monthly date. We'll again reformat it and see what it looks like. And now, instead of 23rd November 2016, we have 2016 month 11, in other words, November 2016. So now Stata understands that this report just corresponds to the month of November in general. I'm going to skip over this next bit of code for now; it's not conceptually important for today's demonstration. Suffice it to say that when I run this code, we end up with data where we have a state FIPS code for each report, a monthly date for each report, and then an abuse variable that equals one if the report is an instance of confirmed abuse and equals zero if the report is an instance of confirmed neglect but not abuse. I'm just using the different maltreatment types and disposition levels to create these variables. Okay. Then what we'll do is collapse our data into counts of reports by month. For each state and month, we're going to count up the number of reports corresponding to each category, so we'll generate a counter variable that's equal to one for all observations and then count up those counters by the abuse category, the state FIPS code, and the monthly date. We'll then order and sort our data, just reorganizing it a little bit, and see what it looks like now. Now, for each month, for each state, and for each maltreatment type, we have a count of the number of events that occur. That was a simple example of how to get your data formatted the right way for time series analysis. I'm now going to read in some pre-processed data that actually covers fiscal years 2012 through 2021, so this is the full Child Files for 2012 to 2021.
Note again that I've anonymized these: wherever there are small counts, I have arbitrarily inflated them so that we're not at risk of disclosing any personal information. So let's go ahead and read in that data file. Now I'm going to merge it to a utility file that contains both state FIPS codes and state abbreviations, which is going to allow us to make nice graphs that, instead of state FIPS codes, have nice little state abbreviations, like the letters AL for Alabama instead of the state FIPS code of 1. Now let's use these two-letter abbreviations as labels for our state FIPS code. We're going to label the values of stfips using the two-letter state code, and we'll do that using the labmask command; you'll need to install a user-written package for that, but it should be pretty straightforward. Okay, so here's the first key step in telling Stata not only that we have a time variable but that we're ready to do some time series analysis. Some of you may be familiar with the command xtset, which is how you tell Stata that you have panel data. We're going to similarly tsset our data: we're going to tell Stata that we have time series data, and we're going to tell it what the properties of that time series are. We have to tsset our data in order to run a lot of Stata's custom time series commands. So we'll do that by tsset-ing our data, and because we have data from multiple states, we do have a panel variable here, so we can list the stfips variable as a panel variable, but that's optional; we could drop it if we wanted. Let's leave it in for now. The second variable, which is mandatory, is our time variable. So we're just telling Stata that we have time series data here and that our time variable is this monthly date variable. And then we're furthermore going to tell Stata that our time interval of interest is the month, so the m here tells Stata that we're dealing with monthly data. Let's go ahead and tsset our data. Whoops: "repeated time values within panel". What's going on here? Well, let's look at our data. Here we have Alabama in, let's see, October of 2010. Oh, but look, we have two observations that have the same panel variable and the same time variable. Why? Because we also have this abuse category variable: the n of 7 here means seven counts of confirmed neglect but not abuse, and the 54 here means 54 counts of confirmed abuse. Stata needs each combination of panel variable and date variable to identify a single row in our data set, so it thinks we have multiple panels for any given date, or multiple dates for any given panel. We have to do a little reorganizing of our data: instead of counting up abuse and neglect in a single n variable, we'll reshape our data wide so that we have separate variables, separate columns, that count abuse and that count neglect. If you're not familiar with reshaping data, I would refer you to our prior training series on managing data in Stata, which will walk you through how to reshape data. So we reshape our data and rename the resulting variables neglect and abuse. What we get now is an individual panel variable, Alabama, that corresponds to a single month, October 2010, and we have separate counts for neglect and abuse in separate columns. And now, when we try to tsset our data, Stata is going to go with the flow.
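In code, that fix is the reshape-then-tsset sequence from the do file:

reshape wide n, i(stfips ab datem) j(abuse) // one row per state-month, with a column per abuse category
rename n0 neglect // abuse == 0 was confirmed neglect but not abuse
rename n1 abuse   // abuse == 1 was confirmed abuse
tsset stfips datem, m // panel variable, time variable, monthly interval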
So now we see that our panel variable is stfips; Stata notes that we have an unbalanced panel, meaning some of our time series are longer than others. It understands that datem, our monthly date variable, is our time variable, ranging from October 2010 to September 2021, though there are gaps in our data. And it understands that our time interval is one month. [ONSCREEN Line graphs for six states showing Counts of Confirmed Abuse Reports over the time period 2012 month 1, to 2022 month 1. Graphs for Alabama and California show spikes at the beginning and sharp dropoffs at the end.] Okay, let's say we want to visualize some trends in our data, but we know these trends are going to be noisy. Let's first just look at the trends, so I'm going to graph some trends in maltreatment. Okay, pretty interesting. Along the horizontal axis we have year-months, and on the vertical axis we have counts of confirmed abuse reports. We can see a few things. We get these squiggly lines, kind of like we did in our Google Trends data. You can kind of make out some trends: this one looks pretty flat, this one's kind of going down, but there's also some noise there too. There's something weird here, though. Doesn't it look like there's this huge spike in abuse reports, where there are almost none in 2010 and then suddenly they shoot up, and then in 2021 they just fall off? The same is true in other states as well. This is artificial; it's not real, and it's a function of the fact that reporting to NCANDS is sometimes delayed. Whenever you download a Child File for a given fiscal year, the data submitted in that year include some reports that actually occurred in prior years, and it logically follows that some reports that actually occurred in that year won't be submitted until future submission years. So what's going on here is that you're capturing some lagged or delayed submissions to NCANDS, and this drop-off reflects reports of abuse that actually occurred but have yet to be submitted to NCANDS. For this reason, it's extremely important to censor your data appropriately, and my rule of thumb is that you can really only analyze one fewer fiscal year than any given submission year of data that you have. We're using the 2012 to 2021 Child Files, so we're going to censor our data to fiscal years 2012 to 2020, because reports from fiscal year 2021 itself won't be complete until future submission years fill them in; and we'll make sure we left-censor our data too, so we're not getting those arbitrarily low counts of reports at the start. So let's go ahead and do that. We'll drop reports that are outside our range of interest, and now, when we re-plot our lines, we get some trends that make a lot more sense. [ONSCREEN Line graphs for six states showing Counts of Confirmed Abuse Reports over the time period 2011 month 7, to 2020 month 7. Each graph shows oscillating lines with mostly flat trends with the exception of California which shows a clearly downward trend.] But they look kind of noisy. What if we want a smoother line? Let's say we don't want to lose the variation in this data entirely, but we want to develop some sense of what the cyclicality or the overall trend is. We can do this with Stata's moving average capability. So let's go ahead and try this and see what it looks like.
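The smoother in question is Stata's tssmooth ma; a minimal sketch from the do file, computing a three-month centered moving average and plotting it against the raw series for California:

tssmooth ma abuse_ma1 = abuse, window(1 1 1) // average one lag, the current month, and one lead
twoway (tsline abuse abuse_ma1) if stfips == 6 // raw series and smoothed series together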
[ONSCREEN Graph for California Counts of Confirmed Abuse Reports showing two lines: one line is the raw reports, and the other is the moving average line which overlaps with lower peaks and troughs. The overall trend slopes downward at about a 20 degree angle.] So we're just looking here at data for California, where the state FIPS code is equal to 6. And here we've used a moving average smoother that averages the value of the variable one month to the left and one month to the right of any given month. That legend is mislabeled; it should say a three-month moving average, so let's fix that. And now we see that we have a three-month moving average in red plotted against our raw data in blue. Well, it's still looking pretty noisy; maybe you want to smooth it even more. We can widen our window, and we can also weight our moving average; here we use the weights option on our tssmooth command. I won't get into the details, but this is just how we specify the weights in our moving average smoother. So we can calculate those and re-plot, and now we have an even smoother line, where the green line represents a 12-month weighted moving average. [ONSCREEN Graph for California Counts of Confirmed Abuse Reports showing three lines: one line is the raw reports, one line is the 3-month moving average line which overlaps with lower peaks and troughs, and one line is the 12-month moving average line which has hardly any peaks or troughs. The overall trend slopes downward at about a 20 degree angle.] So at any given point in time, what we're plotting is the six months prior, the month of data at that point, and the five months following, where we've weighted the data so that nearer observations count more and more distant observations count less, and here we get an even smoother line. You see that there's a general downward trend and a little bit of annual cyclicality to our trend, but we've gotten rid of a lot of the noise. Okay, I'm going to very briefly walk through some time series operations that can be very helpful when working in Stata. We're very often interested in leading or lagging our data, and Stata has a specific syntax for this. Let's say we want to create a lag of a variable. All we need to do is take our variable of interest, say neglect, and put L, then a number, then a period before that variable: L2.neglect. When we do that, Stata creates a new variable that lags our variable of interest. [ONSCREEN Table showing Alabama data with the following column headers: "stfips", "datem", "neglect", "L2.neglect".] So notice that we have 249 here in November of 2011; that's just the value of that same variable two months prior. Ditto December and October, et cetera. We can do the same for differences, [ONSCREEN Table showing Alabama data with the following column headers: "stfips", "datem", "abuse", "D.abuse".] so we can use the letter D to calculate the difference between a given month and the prior month: 547 minus 31 is 516. Very often we'll be interested in the month-over-month change. And if we're interested in longer intervals, we can use the S operator for seasonal differences; because we're dealing with monthly data here, we're going to be particularly interested in the 12-month difference. [ONSCREEN Table showing Alabama data with the following column headers: "stfips", "datem", "abuse", "S12.abuse".]
So here we're comparing September 2012 to September 2011, and we see that the difference between 480 and 516 is -36. So our seasonal 12-month difference in confirmed abuse is equal to -36: year over year, the monthly count went down by 36. Let's go ahead and visualize this seasonal difference. We'll generate a variable using that seasonal difference and plot it. [ONSCREEN Line graphs for six states showing 12-month change in confirmed abuse reports over the time period 2011 month 7, to 2020 month 7.] And now we see our trends in a slightly different light: on the vertical axis is the 12-month change in confirmed abuse reports, how much the count changed over the 12-month period. Okay, we're getting low on time, so let's say we want to build a statistical model that captures the properties of our time series. Just to keep things simple, let's limit our data to California alone. The most important requirement when we're fitting these time series models is that the time series be what's called stationary. Stationarity basically means that the trend is independent of time: it can't be going up or down over long periods of time; it has to be roughly flat over long periods of time. Are trends of confirmed abuse in California stationary? Well, when we look at them, it doesn't seem that way; it kind of seems like they're going down, and that would mean it's not a stationary time series. We can run formal statistical tests that would give us more evidence of whether or not our time series is stationary. I won't run those now, but I'll leave them in the code to help you in future analyses. Let's say our process isn't stationary. The most common way to make it stationary is, instead of analyzing the levels of abuse, to analyze the first difference, the month-over-month change in abuse. So what's the difference between, say, September and August in the number of confirmed reports of abuse? Very often, when a given time trend is not stationary, the first difference of that time trend is stationary. Let's plot that first difference and see if it looks a little more stationary. [ONSCREEN Line graph of California 1 month change in confirmed abuse reports from 2011 month 7 to 2020 month 7. The line oscillates around a fixed value but is generally horizontal.] Lo and behold, there's quite a lot of noise here, but it really looks pretty flat. That gives us some strong indication that we're dealing with a stationary process and that we can use autoregression models, some time series models, to analyze it. The most popular time series model here is the ARIMA model. The ARIMA model combines an autoregressive component, where we regress the value at a given point in time on the prior values of that same variable at successive lags (the value in the prior month, the value two months prior, the value three months prior; that's the equation I showed you in the PowerPoint slide), with a moving average component. Moving average processes are a little more complicated to explain; they're a slightly more complicated way that past, and even future, values can affect your present value. I've included some code here for how to diagnose the autoregressive and moving average processes in your data.
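Those diagnostics are the correlogram and partial correlogram of the differenced series, from the do file:

ac D1.abuse  // autocorrelations of the first difference; these suggest the moving-average order ("q")
pac D1.abuse // partial autocorrelations; these suggest the autoregressive order ("p")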
And these can be helpful for choosing the parameters of your ARIMA model, your autoregressive integrated moving average model. This is really more of an art than a science, but you should find some helpful tips here on how to specify your ARIMA model. Let's go ahead and specify that model, where all we're doing is modeling confirmed abuse as a function of prior values of that same variable; the parameters of our ARIMA model are set in this option, and all the diagnostics above did was reveal patterns in our data that help us make these choices about what our model should look like. So let's run that model. [ONSCREEN a table of output with columns labelled "D.abuse", "Coef.", "Std. Err.", "z", "P>|z|", "95% Conf. Interval". Rows are labelled "abuse", "ARMA", "/sigma".] The model fit here isn't great, but the most important thing to understand is the layout: we have a constant term here, and you would read some of these summary statistics the same way you would a linear regression model. Here you have the autoregressive component of your model, with the one-month lag, two-month lag, three-month lag, and so on, and here you have the moving average component of your model, with the one-month lag in the moving average. Our data are pretty seasonal: it turns out that confirmed abuse is pretty seasonal, where, say, the number of reports in August 2012 looks very similar to the level of confirmed abuse in August 2011. So let's say we want to include a seasonal component in our model, where we adjust for the fact that there's a 12-month seasonal lag in our data. We can do that using the sarima() option in Stata to fit a seasonally adjusted autoregression model. I won't talk through these results, but let's use this model to go forward and talk briefly about forecasting. Well, this is all well and good; these models tell us about the autoregressive and moving average processes in our data. But very often we go through all this trouble because we want to predict the future. Here's some more code that would help you evaluate whether your model is a good candidate for forecasting, but let's say we decide that it is. We can tsappend some rows to our data, so now we have some empty rows at the bottom that we can project forecasted values into. Let's project three years of data, which actually brings us up to this month. And let's predict values from our seasonally adjusted autoregressive integrated moving average model, and get confidence intervals for those projections. And then, finally, let's create a plot that includes the actual trend in abuse, the trend in our predicted abuse, and a ribbon that includes the upper and lower bounds of our 95% confidence interval for the predictions. [ONSCREEN Line graph showing California Confirmed Abuse Reports over time period 2011 month 7 through 2020 month 7, with projected reports over time period 2020 month 7 through 2023 month 7. The projected trend is shown to continue the actual trend, which is downward.] And then, very nicely, here we have a red line that shows the true trend in confirmed abuse in California. Our actual data stop here, and our forecasted trend takes over. And here, our blue ribbon is our 95% confidence interval.
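In code, those forecasting steps look like this (from the do file; dynamic() sets the month where the dynamic forecast begins):

tsappend, add(36) // add 36 empty months to forecast into
predict abuse_f, y dynamic(tm(2020m9)) // dynamic predictions from September 2020 onward
predict abuse_fv, mse dynamic(tm(2020m9)) // mean squared error of those predictions
generate ub = abuse_f + 1.96*sqrt(abuse_fv) // upper and lower bounds of a 95% interval
generate lb = abuse_f - 1.96*sqrt(abuse_fv)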
And our model here, statistically speaking, wasn't a great model, but you can see we're not doing badly here, right? Our predictions actually account for the fact that there is this broader general downward trend, but the model didn't just project a straight line; it understands that there's some cyclicality to our data, and this is mostly due to that seasonal component in our model. So it mirrors the seasonality of our observed data in the forecast. Okay, so I'll mostly leave it there, except to say that if you're further interested in time series processes, natural next steps would be to explore vector autoregression models, panel data models, and state-space models, depending on your interests and your level of statistical training. So I'll stop there and open it up for questions, and I'll just thank you for your attention and thank you in advance for any questions. [ONSCREEN Questions? Alex Roehrkasse, Assistant Professor, Butler University, aroehrkasse@butler.edu] [Clayton Covington] Well, thank you, Alex, for that wonderful presentation. I feel like you took a really complicated concept and made it really digestible, so I'm really happy that you were able to share your insights with us. As a reminder to all of our attendees, if you have any questions, please post them in the Q and A box, at which point I'll read them aloud and allow Dr. Roehrkasse to respond. We do have one question so far, and it asks: I have a data set where responses are collected twice a year. Does that work with time series analysis? [Alexander F. Roehrkasse] Yeah, it really is going to depend on how many years of data you have. If your data are collected twice a year for three or four years, I think very much no. But if you have data that are collected twice a year for the last 60 years, say the last half of the 20th century and the beginning of the 21st century, then I think you are in a strong position to do a time series analysis. There are subtle differences, but for the most part, I think you can think about sample size in this context as being very similar to how you would think about sample size in other regression contexts. If you showed me the results of a linear regression model that only had 36 observations, I'd be fairly suspicious of any inferences you made from that model. But if you have hundreds or thousands of observations, my confidence starts to get stronger. So it depends on what you're trying to make inferences about, and it depends on the length of your time series. The interval is not so important as the regularity of that interval and the number of observations you have over time. [Clayton Covington] We do have a question, okay, it asks: I'm sorry if this was already answered, but will recordings for the whole series be accessible to watch later? I can answer this question: yes, all of our presentations, both from this year's series and past years' series, are available on the NDACAN website. If you look in the group chat, Alexandra Gibbons has posted links to both the most recent series occurring this summer and also previous years' series.
Topics change from year to year, so there may be a topic of interest to you, a certain type of analysis or a particular data set, that's not covered this year, and I would encourage you to visit the NDACAN website to look at both the current series' presentations, which are being uploaded periodically as they're processed during the course of this year's series, and all previous years' recordings, which are also available. [Alexander F. Roehrkasse] Thanks, Clayton. I'll also just say this was quite a lot all at once, so if you have follow-up questions that strike you in the next day or month or even next year, please feel free to reach out. Here's my contact information. I'm always happy to talk about time series analysis or, if it's not a good candidate for your research questions, other appropriate methods for analyzing archive data. As I mentioned up front, this is a difficult topic. There aren't too many examples of time series analyses using archive data, so for that reason it can be a little challenging to find support for this kind of analysis. On the other hand, the flip side of that coin is that it's quite an exciting way to think about child maltreatment data, so I think there are quite a few opportunities to analyze maltreatment data in this way. So if you want to talk more about this, please don't hesitate to be in touch. [Clayton Covington] Thank you, Alex. We have another question, and it asks: can you clarify how viewing the data monthly versus bi-monthly resolves the issue of having intervals of different sizes, since having a different number of days every two weeks versus every month is still a concern? And it says thank you for the great talk. [Alexander F. Roehrkasse] Yeah, it's a great point. Right, so half-months are going to range in length from 13 to 16 days, 13 in the second half of February to 16 in the second half of, say, July, whereas full months are going to range in length from 28 to 31 days. So, as a proportion of the length of the interval, full months will vary a little bit less than half-months. But the long and short of it is that it doesn't fully resolve the issue; we still have intervals of slightly different lengths. Even if you were to collapse to the year level, there are leap years, right? Every once in a while, one year is a day longer than the others. I don't think you should be too worried about this. If working in half-months is easier and makes more sense for your research question, you should feel comfortable doing that. If working in months makes more sense, that's fine too. Working in years is going to be challenging for most archive data, because we very infrequently have time series in archive data that are long enough to support a year-by-year time series analysis. So I mostly showed you how to collapse time intervals just so that you know how to do it when it's appropriate, but I think whether you work in half-months, months, or years is mostly a function of how much data you have and what your research question is. [Clayton Covington] All right, we have a two-part question. The first part asks: what is a good way to deal with a monthly time series of a count variable that resets each year at the same month? Will seasonality models suffice? [Alexander F. Roehrkasse] That's a really interesting question. So the count resets each year. I wish we could have a back and forth so I could clarify a little more.
I think the most likely appropriate way to think about this is that you want to transform that variable so that you're capturing the number of events that actually occur in any given interval. So let's say your variable resets on the first of January every year, but then each month you have a count of events that have occurred since the beginning of the calendar year. You can just difference that variable month over month to know how many events actually occurred in each month, and I think that's going to tend to be the more interesting quantity of interest. So this is basically a unit-of-analysis question, and I'd want to know more about your research question, but my intuition is that you probably can, and should, transform your variable so that you're counting the number of events that actually occur in each interval. [Clayton Covington] Okay, I see no additional questions. I just want to thank Dr. Roehrkasse again for an excellent presentation, and also give you all a preview of next week's presentation, which is going to feature NDACAN statistician Sarah Sernaker, who will be doing a presentation on data visualization in R. That will be the last session of this year's series, so make sure you come for a really interesting and, I would say, visually stimulating presentation. So again, thank you to Dr. Roehrkasse for your presentation today, and make sure to join us at the same time next week, 12 to 1 p.m. Eastern Time, here on Zoom for the conclusion of the Summer Training Series. Thank you all so much for your attendance, and we look forward to welcoming you next week. [Alexander F. Roehrkasse] Thanks, everyone. [voiceover] The National Data Archive on Child Abuse and Neglect is a collaboration between Cornell University and Duke University. Funding for NDACAN is provided by the Children's Bureau, an office of the Administration for Children and Families. [musical cue] [STATA do file code]

*************************************
*************************************
* NDACAN SUMMER TRAINING SERIES
* AUGUST 2, 2023
* TIME SERIES (TS) ANALYSIS IN STATA
*************************************
*************************************

*********
* LINKS *
*********

* Stata .do file:
* Powerpoint slides:

*********************
* HELPFUL RESOURCES *
*********************

* Stata tutorial series (beginner): https://www.youtube.com/playlist?list=PLN5IskQdgXWlEVJe6t9urIMoJVHdifFuR
* Stata reference manual: https://www.stata.com/manuals/ts.pdf
* Juan D'Amico's tutorial series (intermediate): https://youtube.com/playlist?list=PLsZ8kVwX52ZEFZsVViYs60lf7idJuKKUO

*********
* SETUP *
*********

* Let's set up our workspace.
clear // clear any data in memory
set more off // avoid having to click 'more' all the time
set seed 1013 // always set a seed for any random processes
cd "C:\Users\aroehrkasse\Box\Presentations\-NDACAN\2023_summer_series" // set your working directory

****************************************
* SET UP DATA FOR TIME SERIES ANALYSIS *
****************************************

* Let's read in some example data, specifically,
* an anonymized 1% sample of several variables
* from the 2017 NCANDS Child File.
use "data\ts_example.dta", clear // read dta file

* Let's examine the first observation.
list in f
list in f, nol

* Now let's clean our data.
* First let's create a state FIPS code variable.
gen stfips = round(rptfips/1000,1)
list in f, nol

* Next, for TS analysis,
* let's reformat the report date variable
* into a monthly format that Stata recognizes as such.
* First tell Stata that our report date variable is a date.
gen date = date(rptdt, "YMD")
format date %td
list in f, nol

* Then convert this date into a year-month variable.
gen datem = mofd(date)
format datem %tm
list in f, nol

* Finally, let's create a binary variable that is
* 1 if there is confirmed abuse, and
* 0 if there is confirmed neglect but not confirmed abuse.
gen abuse = 0
replace abuse = 1 if chmal1 == 1 & mal1lev <= 2 | /// // physical abuse
	chmal1 == 4 & mal1lev <= 2 | /// // sexual abuse
	chmal1 == 5 & mal1lev <= 2 | /// // psych/emo maltreatment
	chmal1 == 7 & mal1lev <= 2 | /// // sex trafficking
	chmal1 == 8 & mal1lev <= 2 | /// // other
	chmal2 == 1 & mal2lev <= 2 | ///
	chmal2 == 4 & mal2lev <= 2 | ///
	chmal2 == 5 & mal2lev <= 2 | ///
	chmal2 == 7 & mal2lev <= 2 | ///
	chmal2 == 8 & mal2lev <= 2 | ///
	chmal3 == 1 & mal3lev <= 2 | ///
	chmal3 == 4 & mal3lev <= 2 | ///
	chmal3 == 5 & mal3lev <= 2 | ///
	chmal3 == 7 & mal3lev <= 2 | ///
	chmal3 == 8 & mal3lev <= 2 | ///
	chmal4 == 1 & mal4lev <= 2 | ///
	chmal4 == 4 & mal4lev <= 2 | ///
	chmal4 == 5 & mal4lev <= 2 | ///
	chmal4 == 7 & mal4lev <= 2 | ///
	chmal4 == 8 & mal4lev <= 2
gen neglect = 0
replace neglect = 1 if chmal1 == 2 & mal1lev <= 2 | /// // neglect
	chmal1 == 3 & mal1lev <= 2 | /// // medical neglect
	chmal2 == 2 & mal2lev <= 2 | ///
	chmal2 == 3 & mal2lev <= 2 | ///
	chmal3 == 2 & mal3lev <= 2 | ///
	chmal3 == 3 & mal3lev <= 2 | ///
	chmal4 == 2 & mal4lev <= 2 | ///
	chmal4 == 3 & mal4lev <= 2
keep if abuse == 1 | neglect == 1

* Let's keep only the variables we need.
* Note that after the previous command, if abuse = 0, neglect = 1.
keep abuse datem stfips
list in f/10, nol

* And finally let's collapse our data into counts of reports by month.
* Note that half-months will be combined.
gen n = 1
collapse (count) n, by(abuse stfips datem)
order stfips abuse datem n
sort stfips abuse datem
list in f/10

* Now let's read in pre-processed count data for FY 2012-2021.
* Note that small counts are arbitrarily inflated to prevent disclosure risk.
use "data\ts.dta", clear // read dta file

* Let's merge it to a utility file that contains
* state FIPS codes and state abbreviations (ab).
merge m:1 stfips using "data\statecodes.dta"
drop if _merge < 3
drop _merge
list in f/3

* And let's label our state FIPS variable and its values
* (this requires installation of the labutil package).
* ssc install labutil // uncomment this to install
label var stfips "State"
labmask stfips, values(ab)

* And now let's tell Stata that our data are time-series data so that we can run
* specialized TS commands. Note that the optional first term is our panel variable,
* and the required second term is our time variable.
tsset stfips datem, m

* Oops! Because our data are long (i.e. "n" counts both abuse and neglect),
* our panel data isn't identified. So let's reshape.
reshape wide n, i(stfips ab datem) j(abuse)
rename n0 neglect
rename n1 abuse

* And try TS setting our data again.
tsset stfips datem, m

***********************
* VISUALIZING TS DATA *
***********************

* Let's say we want to visualize some trends in our data, but they're noisy.
* Let's first visualize raw data on abuse across a few states.
* If we want to visualize the same time series across multiple panels,
* it can actually be easier to use Stata's xt commands
* for panel data. These mostly work with tsset data, but you may have to xtset.
xtline abuse if stfips < 9, ///
	xlabel(, angle(vertical)) ylabel(, angle(horizontal)) xtitle("Time") ytitle("Confirmed abuse reports")

* Note that counts seem very low in early/late months. This is because many reports
* are lagged in their submission to NDACAN relative to the report date.
* For this reason, it is EXTREMELY important to censor your data appropriately.
* My rule of thumb is you can only analyze one fewer fiscal year than submission year.
* We're using the 2012-2021 Child Files (submission year),
* so we'll censor to FY2012-2020 (fiscal year).
drop if datem < tm(2011m9) | datem > tm(2020m8)
xtline abuse if stfips < 9, ///
	xlabel(, angle(vertical)) ylabel(, angle(horizontal)) xtitle("Time") ytitle("Confirmed abuse reports")

* Our data look kinda noisy. What if we want to plot a smoother line?
* We can do this using Stata's moving-average capability.
tssmooth ma abuse_ma1 = abuse, window(1 1 1)
twoway (tsline abuse abuse_ma*) if stfips == 6, ///
	ylabel(, angle(horizontal)) xtitle("Time") ytitle("Confirmed abuse reports") ///
	legend(order(1 "Raw data" 2 "3-mo. moving avg."))

* Or we can compute a weighted moving average, where nearer observations count more.
tssmooth ma abuse_ma2 = abuse, weights(1/6 <7> 6/2)
twoway (tsline abuse abuse_ma*) if stfips == 6, ///
	ylabel(, angle(horizontal)) xtitle("Time") ytitle("Confirmed abuse reports") ///
	legend(order(1 "Raw data" 2 "3-mo. moving avg." 3 "12-mo. weighted moving avg."))

**************************
* TIME-SERIES OPERATIONS *
**************************

* Stata also makes it very easy to calculate common time-series quantities of interest.
* Let's say we want to know the one-month lead of a variable.
* Stata has a specific syntax for this.
list stfips datem abuse F1.abuse in f/10

* We can do the same for lags.
list stfips datem neglect L2.neglect in f/10

* Let's say we want to calculate the difference in values
* across time periods (in our case, months).
* We again use Stata's special TS syntax.
list stfips datem abuse D1.abuse in f/10

* Note that D2 is NOT a two-period difference, but rather
* a second-order difference.
list stfips datem abuse D1.abuse D2.abuse in f/20

* Let's say we want to know the 12-month change,
* i.e. the seasonal difference. Here we use different syntax.
list stfips datem abuse S12.abuse in f/20

* Let's visualize this seasonal difference,
* or year-over-year monthly change.
gen abuse_s12 = S12.abuse
xtline abuse_s12 if stfips < 9, ///
	xlabel(, angle(vertical)) ylabel(, angle(horizontal)) xtitle("Time") ytitle("12mo change in confirmed abuse reports")

*****************************
* UNIVARIATE AUTOREGRESSION *
*****************************

* Let's say we want a statistical model that captures the properties of our maltreatment trends.
* To keep things simple, let's just focus on CA from here on out.
keep if stfips == 6
tsset datem, m

* Time series models generally require that the variable of interest is stationary,
* meaning, roughly, that its statistical properties do not depend on time.
* Are abuse trends in CA stationary? Simply examining the plot, it appears not.
tsline abuse, ///
	xlabel(, angle(vertical)) ylabel(, angle(horizontal)) xtitle("Time") ytitle("Confirmed abuse reports")

* However, formal tests reject the null hypothesis that the abuse trend
* has a unit root (i.e. is not stationary). That double negative is tricky:
* in other words, they seem to indicate that the process is stationary.
dfuller abuse, trend regress
pperron abuse, trend regress

* If the process isn't stationary (which it usually isn't),
* we can often model the first difference, which usually is.
* This difference is also of policy interest: will abuse go up or down this month?
* For illustration, let's model this difference.
tsline D1.abuse, ///
	xlabel(, angle(vertical)) ylabel(, angle(horizontal)) xtitle("Time") ytitle("1mo change in confirmed abuse reports")

* The most popular time-series model is the
* autoregressive integrated moving average (ARIMA) model.
* This model combines analysis of autoregressive and moving-average processes.
* Parametric ARIMA models require us to specify how we want to model these processes.
* How should we choose these parameters? It's more of an art than a science,
* though new versions of Stata include model selection features (arimasoc).

* First, moving-average processes are fundamentally about autocorrelation.
* What does the autocorrelation of our first difference look like?
* We can use a correlogram to see.
ac D1.abuse

* The fact that the first lag is outside the confidence interval
* tells us that 1 is a good starting point for our moving-average parameter ("q").

* Second, autoregressive processes are fundamentally about partial autocorrelation.
pac D1.abuse

* The fact that the first four lags are outside the confidence interval
* tells us that 4 is a good starting point for our autoregressive parameter ("p").
* Our final parameter in the ARIMA model is the integrated (difference) order ("d"), which will be 1.

* Let's fit our model using the (p,d,q) syntax!
arima abuse, arima(4,1,1)

* Note that the above could also be written as the following:
* arima D1.abuse, ar(1/4) ma(1)

* Recall from our correlogram that we had a noticeable 12-month lagged autocorrelation.
* This is seasonality! We can adjust for this using a helpful option in Stata.
arima abuse, arima(4,1,1) sarima(1,1,1,12)

***************
* FORECASTING *
***************

* So what!? Well, learning about time-series processes can help us predict the future,
* based solely on the pattern of trends in the outcome variable.
* To forecast, we would first want to do some diagnostics (beyond today's scope).
predict error, resid
summarize error
tsline error, yline(-22.08081) // Are residuals tightly grouped around the mean (good)?
wntestq error // Do we fail to reject the null hypothesis that our process is white noise (good)?
estat aroots // Are the roots inside the circle (good)?

* If we meet these criteria, we have a good candidate model for forecasting!
* Let's create some empty cells to forecast into.
tsappend, add(36)

* And predict values using our SARIMA model.
predict abuse_f, y dynamic(tm(2020m9))

* Let's get confidence intervals for our forecast.
predict abuse_fv, mse dynamic(tm(2020m9))
generate ub = abuse_f + 1.96*sqrt(abuse_fv)
generate lb = abuse_f - 1.96*sqrt(abuse_fv)

* And finally, plot our forecast against the real data.
twoway (rarea ub lb datem if datem >= tm(2020m8), fcolor(blue%25)) ///
	(tsline abuse) ///
	(tsline abuse_f if datem >= tm(2020m8)), ///
	xlabel(, angle(vertical)) ylabel(, angle(horizontal)) xtitle("Time") ///
	ytitle("Confirmed abuse reports") legend(order(2 "Actual" 3 "Forecast" 1 "95% CI"))

*****************
* GOING FURTHER *
*****************

* NDACAN data users further interested in time-series analysis will likely benefit from exploring:
* 1. Vector autoregression models
* 2. Panel data models
* 3. State-space models
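* For instance, here is a minimal, hypothetical sketch of a bivariate vector
* autoregression relating the abuse and neglect series (assuming both are
* first-differenced to approximate stationarity, and the data are still tsset):
* gen d_abuse = D1.abuse
* gen d_neglect = D1.neglect
* var d_abuse d_neglect, lags(1/4) // VAR with four lags of each series
* vargranger // Granger causality tests after -var-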