[Voiceover] National Data Archive on Child Abuse and Neglect. [Erin McCauley] Hello everyone and welcome to the second-to-last session of the Summer Training Series 2022. This is the NDACAN, or National Data Archive on Child Abuse and Neglect, Summer Training Series. We are co-hosted at Cornell University and Duke University and we run this workshop series every summer. This summer's theme is the power of linking administrative data, so if you've been with us throughout the summer you will have seen that we've done a lot of presentations about linking data, both internally within NDACAN from our multiple administrative data sets and also externally to other data sources. As I said, this is hosted by NDACAN and we are run through a contract with the Children's Bureau. This is our overview of the summer; as you can see we're almost done, in our last two sessions. Today we're going to be having a workshop by Alex Roehrkasse about propensity score matching and, more generally, about matching methods. So now I'm going to pass it over to Alex. Thank you so much, Alex, for being here and for leading this presentation. [Alex Roehrkasse] Thanks Erin, and thanks everyone for being here today. I'm really excited to lead another session of this summer training series. Just to lay out where we're going today: because we're talking about propensity score and matching methods, we're going to be talking broadly about how to estimate causal effects. In order to understand what we mean when we talk about causal effects and causal inference, it's going to be helpful to have some basic language from what's usually called the potential outcomes framework, which is really the predominant conceptual framework for thinking about causal inference today. So we'll overview the potential outcomes framework as a way of understanding a little bit about how we might think about causality. Then we'll move on to talk about propensity score methods. We'll talk a bit about what the propensity score is and what it's good for, and we'll learn that propensity score methods and matching methods have a lot in common. We often talk about propensity score matching when we use the propensity score to match, but we'll also see that the propensity score can be used for non-matching estimators, and we can also do matching estimators that don't actually involve the propensity score. And then we'll do a walkthrough where we'll transition from PowerPoint to RStudio and use some anonymized NCANDS data to demonstrate how we might not only implement but also evaluate various propensity score methods and matching methods in R. I'm going to post a link to the script that I'll be using a little bit later, so that if you want to follow along you should be able to access that script on Google Drive. Okay, so let's talk a bit about how to think about causality from the perspective of the potential outcomes framework. One of the mantras of thinking about causality in this framework is that there is no causation without manipulation, and what that means is that the thing over which we're trying to estimate a causal effect, which we'll call our treatment, has to in some sense be manipulable. Okay, what do we mean by that? Suppose that we have n units, or units of observation; in our example later on these are going to be children, and we'll index children using the subscript i, so that we have n children in our analysis. Then we'll have a treatment condition.
People falling into the treatment will have a value of d_i equal to one. So this means you're in the treated group. People falling into the control condition will have a d value of zero. Another common example that's simple for thinking about causal inference: let's say we wanted to understand the effect of smoking on cardiovascular disease. In that case smoking would be our treatment condition; people who didn't smoke would have the value 0 for d, and people who did smoke would have the value 1 for d. We can then think about the causal effect of treatment d equals one on outcome y, say cardiovascular disease, for each individual person i, which we're going to define as tau: the outcome under the treatment condition minus the outcome under the control condition. So in other words each unit has a potential outcome under each condition and the causal effect is the difference between the two. Taking me for example, let's say the subscript indexes me: the causal effect of smoking on cardiovascular disease for me is going to be my likelihood of having cardiovascular disease if I smoked minus the likelihood that I develop cardiovascular disease if I didn't smoke. You may have already seen, though, that this poses a problem, which is often called the fundamental problem of causal inference: it is empirically impossible to observe both y(1) and y(0). This is to say that I can only ever not smoke or smoke. We never get to observe the world in which I do both and measure the outcomes in both cases. Because of the fundamental problem of causal inference, we have to rely on a variety of tricky solutions instead. The gold standard for causal inference remains randomized controlled trials, which have their own limitations but tend to yield the most reliable and precise causal estimates. The main limitation of randomized controlled trials, though, is that they're not implementable in many of the contexts where we're interested in measuring causal effects, and child welfare is usually one such context. Instead we'll sometimes rely on natural experiments, which are naturally occurring scenarios where we have conditions that look like an experiment even though they don't involve experimental manipulation themselves. Very often a natural experimental environment would be one where we had, say, a policy change, and we can argue that the policy change arises from random or quasi-random processes, and we can then look at the effects of that policy on a variety of child welfare outcomes. But even when we don't have natural experimental conditions we can use a variety of statistical techniques to try to approximate causal effects. I like to divide statistical techniques for causal inference into two broad camps. The first I would call selection on observables, and these are a broad family of techniques that rely on the assumption that people fall into the treated and control groups conditional on a set of attributes or characteristics that are entirely observable. So this is to say, let's say we think that whether or not I smoke can be perfectly predicted by a set of variables that we have about me: my age, my race, my sex, where I'm from.
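Before moving on to those techniques, it may help to collect the notation from the last few minutes in one place (a minimal sketch using standard potential outcomes notation; this was not part of the slides):

  τ_i = Y_i(1) − Y_i(0)                      (the individual causal effect for unit i)
  Y_i = D_i Y_i(1) + (1 − D_i) Y_i(0)        (the outcome we actually observe)

Because Y_i(1) and Y_i(0) are never observed together for the same unit i, τ_i itself is unobservable; in practice we target averages of it, such as the average treatment effect E[Y(1) − Y(0)] or the average treatment effect on the treated E[Y(1) − Y(0) | D = 1], both of which come up again near the end of the session.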
And there's a large family of statistical techniques that are suitable under this assumption of selection on observables, including regression adjustment, which you might just call multiple regression with covariates, and the propensity score methods that we're going to talk about here today. It's important to note that all of the techniques we're going to talk about today depend on the selection on observables assumption. There is a family of methods that don't rely on this particular assumption and which allow units to fall into the treated or control groups according to factors that are unobserved. These include fixed, random, and mixed effects models. We won't be talking about these models here today, but they're also very powerful and frequently used in child maltreatment research. Okay, so let's talk about propensity score methods, but in order to understand them and their relationship to other more common techniques like multiple regression, let's first talk a little bit more about what this selection on observables design entails. There are two key assumptions of all selection on observables designs. The first is called the assumption of conditional independence [onscreen (Y_i(1), Y_i(0)) ⫫ D_i | X_i]. What this notation says is that the potential outcomes, right, the outcome under the treatment condition and the outcome under the control condition, are independent of actual treatment assignment conditional on all of the covariates I observe in my model. In other words, the potential outcomes under the treated and control conditions are unrelated to the actual treatment assignment itself once we account for a set of covariates. The second key assumption is the assumption of overlap [onscreen 0 < Pr(D_i = 1 | X_i = x) < 1], and this assumption, as you'll see as we move into the demonstration, is one of the most important assumptions underlying propensity score designs and matching designs. The notation here is a little counterintuitive. What it says strictly is that the probability of falling into the treatment condition, conditional on x, these various covariates that we might use, is something more than zero but less than one. What that means is that for any given set of covariates, it must be possible that some units having those covariates get treated and others don't. If all units having a particular combination of covariates only ever get treated, or only ever end up in the control group, we won't have a set of units to compare across the treated and control conditions: the treated and control groups will be too different to make meaningful comparisons across them. So in other words, the overlap condition says that there has to be decent overlap between the treated and control groups in the values of x. Right? We're going to measure overlap in a variety of different ways later on, but this is one of the most important things that you'll be examining when you're designing and evaluating your propensity score designs and your matching designs. Okay, so regression adjustment, which many of us know as multiple regression. How does this work?
The basic idea behind multiple regression is that we model the relationship between the treatment, the outcome, and a variety of controls simultaneously, and a formula looking something like this [onscreen Y_i = α + X_i β + τD_i + u_i] is probably familiar to many of you: we have some sort of intercept, we have a set of covariates with a vector of corresponding parameters, we have our treatment d here and we're trying to estimate the causal parameter tau, and then we have an error term. One of the main benefits of regression adjustment is that it's easy to implement and fairly simple to interpret. But one of the main limitations of regression, and one of the main motivations for moving toward propensity score or matching designs, is that our regression may rely on unrealistic assumptions about the model parameters. Whenever we run a regression we're making assumptions about the structure of the relationship between these different parameters. Sometimes we can test whether those assumptions are correct, but sometimes that will be quite difficult, and we may want to employ techniques that allow us to use a more relaxed set of assumptions about the parametric relationship between our treatment and our outcome. Okay, this moves us into the world of matching estimators. [onscreen τ = (1/N_T) ∑_(i∈{D=1}) [ y_i − ∑_(j∈{D=0}) w_i(j) y_j ]] The gist of matching estimators is that we try to find pairs of observations with different values of the treatment, which is to say pairs of people in the treatment and control conditions, but who otherwise have similar or even identical values of all of the covariates, and then we simply compare the values of the outcome between these pairs. The notation here is a little bit tricky, but the most important thing to focus on is that we're summing, over all people in the treatment condition, the difference between their outcome and the outcomes of their matches in the control condition, where w is a weight which tells us how to link particular people in the treatment condition to people in the control condition. And because the basic estimator here is really just a set of arithmetic, where we're subtracting one outcome from another, the main advantage of matching estimators is that they rely on fewer parametric assumptions. The main assumptions we're going to make are about how to match different pairs of individuals in the treated and control groups, which in many respects comes down to how we define these weights. The main limitation of matching estimators is what's sometimes called the curse of dimensionality. This is to say that as we deal with a larger number of variables in our covariate matrix, it becomes less and less likely that we'll find overlap between the treatment and control groups. So imagine, for example, we're looking at people in the treatment and control groups and we're trying to find good pairs based on just a couple of attributes, like, say, in the smoking case, their age and their sex. Well, depending on our sample size we're very likely to find people in both the treatment and control groups who have similar or even exact values of age and sex. But let's say also... [Alexandra Gibbons] Oh sorry, we just have a really quick clarifying question about whether the overlap assumption ensures that we are not comparing apples to oranges?
[Alex Roehrkasse] That's basically it, yeah, that's the basic logic of the overlap assumption. In order to make meaningful comparisons that will allow us to estimate precise causal effects, we need to be comparing like with like: that is to say, people who got the treatment compared to people who didn't get the treatment but who are otherwise similar. So I think that's a good analogy, or a good metaphor, for thinking about the overlap assumption; it's a sort of technical description of the importance of comparing like with like. So the curse of dimensionality basically says, okay, to continue this metaphor of apples and oranges: if we're just trying to find oranges in the treatment and control conditions, if we just need to find units in both conditions that are both oranges, that would be fairly easy. But if we're also trying to match units based on, say, the weight of the orange, the circumference of the orange, the particular type of orange that it is, it becomes increasingly difficult to find exact matches as we add more and more dimensions to this matrix of variables along which we're going to compare different units. This is where propensity score methods come in. Propensity score methods in a matching context can in some respects be thought of as a dimensionality reduction strategy. What we're going to do with a propensity score is use a regression model to estimate the likelihood of treatment assignment conditional on covariates. And this you could think of as the definition of the propensity score: the likelihood that a unit receives the treatment, or does not receive the treatment, conditional on our observed covariates. So we can define this propensity score, where we'll have a propensity score for each particular combination of covariates [onscreen p(x_i) ≡ E[D_i | x_i]], and we'll define it as the expected value of the treatment conditional on those covariates. We can then redefine our independence assumption from earlier such that the potential outcomes are independent of treatment assignment conditional not on the entire matrix of covariates but just on the vector of propensity scores. And this vector of propensity scores is a much lower-dimensional object than the entire matrix of covariates, so we're much more likely to find decent matches. Propensity score methods continue to allow us substantially more flexibility in specifying the functional form of our model, similarly to matching estimators, but they also handle the high dimensionality of our covariates better than a straightforward matching estimator. It's important to remember, though, that propensity score methods, like all selection on observables methods, continue to rely on the conditional independence and overlap assumptions. Okay, I said before that propensity score methods and matching methods are not synonymous. We can have matching estimators that don't use the propensity score, and we can use propensity scores in ways that don't actually involve matching. We can use propensity scores in regression adjustment: we can estimate a propensity score and then simply include it as a regression covariate. We can use it in combination with matching estimators, which we'll do in a little bit, where we use the propensity score as a criterion to identify good matches between observations in the treated and control groups.
We can do things like blocking and stratification using the propensity score, where we use different propensity score values to group observations and then run localized regressions within those strata or blocks. We can use the propensity score as a regression weight, essentially estimating a weighted least squares regression. And we can combine multiple methods: I think one of the most compelling approaches is one where we include the propensity score as a covariate in our regression model but also use it as a weight in that same regression. This often gives us more opportunities to be wrong but still estimate a valid causal effect; for example, we can combine various propensity score methods to increase the likelihood that we get an accurate final estimate even if we've guessed wrong about certain model parameters. Okay, so now I think it's a good time to move toward a demonstration. While I get set up for that, I just want to note that I've included a few links in the slides: here's where you can go and download R, the programming language. I recommend using RStudio as an interface for programming in R, and here are a couple of very helpful introductions to learning how to program in R and RStudio. So give me just one second as I transition to pulling up RStudio to do this demonstration. [Voiceover] The program written in R will be displayed and read aloud at the end of the video. [Alex Roehrkasse] So I've opened a window in RStudio and I've opened this script, which I'll call s5, and this is the script which you should be able to download from the link that we shared in the chat. If you haven't already downloaded RStudio you should be able to open this file in a simple program like TextEdit or something like that if you want to take notes in the program itself. So as I say here, this program is going to demonstrate some of the matching and propensity score methods. I've included my contact information here as well if you have any questions [onscreen afr23@duke.edu, https://pubmed.ncbi.nlm.nih.gov/34268425/, https://cran.r-project.org/web/packages/matchit/vignettes/] and I've also included links to a couple of resources that I think are particularly helpful, both conceptually and in terms of programming, when thinking through the implementation and diagnosis of matching estimators in R. Okay, let's start, as we always should, by clearing our environment. I've listed the packages that we're going to use here today; you would need to install these if you've never done so before, but I've already installed them so I'll skip that. I will, though, go ahead and load these various packages. I'll note that we're going to be working mostly with the MatchIt package today. I've chosen MatchIt because it's a very common package for doing matching in R. There is also Jasjeet Sekhon's 'Matching' library, which is arguably a bit more flexible, but the documentation isn't quite as good. So I'll go ahead and set file paths, which tell R where I keep my data; rather, these define different file paths, and then I'll tell R to use one of those file paths as my working directory. I'll set a seed, which is always good practice even though I don't think it will be relevant today, and then I'll go ahead and read in my data. Okay, I want to pause to say a few things about this data set that we'll be working with today. So I've read in this data, we'll call it d, and you can see it appear as an object here in my environment.
We have about a quarter million observations and about seven variables. The data we're going to use today are anonymized NCANDS data from Texas, from the child file in 2019. Now, I've done a number of things to pre-process the data, so the data you're looking at right now wouldn't be exactly what you would see if you downloaded NCANDS. The first thing I've done is anonymize the data by manipulating some values, and I won't say how I've done that, so it's just important to know that we're not actually looking at child records, but we're looking at data that has a very similar structure to what NCANDS would look like if you downloaded it. For demonstration purposes I've also dropped observations with missing values. You would never want to do this in an actual analysis, but for the purpose of focusing on the key principles we're trying to demonstrate today, namely propensity scores and matching, I've decided to sidestep the question of missing values. [Erin McCauley] Hey Alex, we have another question that came in: is the data file used in the R script available as well? And I believe the answer is no, just because we wouldn't want to release fake versions of the data in case people were to be confused and use the wrong data, but that's just my guess. [Alex Roehrkasse] Yep, that's a great answer. I will say, though, that what I'm going to do here is walk through what I've done to process some of these data. I won't actually run this code, and it's not the code that anonymizes the data, but it is the code that does the important reformatting of the data. So if you were to go and get NCANDS access and you wanted to start from the NCANDS 2019 child file, this program should for the most part work for you; you should be able to actually implement this code if you'd like, you just need to go and get NCANDS access first. [Erin McCauley] And I will also say that during the academic year we have office hours. Our fall season hasn't been announced yet because we're up for the contract this year, but we should be announcing it in the next month or so, and during the office hours series we have 30 minutes of open support time where you can meet with Alex in a small group, as a statistician who could of course guide you on this. And then there is also another question in the Q&A: is listwise deletion never recommended for propensity score matching techniques? [Alex Roehrkasse] Yeah, so this is really a question about missing data, and I don't myself know whether the propensity score techniques or matching techniques we're discussing today have any special bearing on whether any particular missing data strategy is good or bad. So I would say that any best practices around missing data are going to apply equally to propensity score matching scenarios as they would to a multiple regression setting. Okay, so imagine that we have the NCANDS child file. I'll walk through what I've done here to arrive at this particular data object. First, I've pulled out data only from Texas; that's because Texas actually has some pretty good data on some of the variables that we're going to use today. You can see how I've done this by creating a FIPS variable and filtering to keep only observations that have Texas's FIPS code.
And then, because the 2019 child file includes reports that are submitted to NCANDS in 2019, it actually includes reports from earlier years that didn't get reported to NCANDS until 2019, so I'm going to keep only those reports which were actually made in the 2019 fiscal year. Then I'm going to go ahead and collapse the data to the child level. As many of you may know, the unit of analysis in the child file is the child report: any given row in the NCANDS child file is a child report. What that means is that children who experience multiple reports are going to appear multiple times in the child file, and reports that involve multiple children are also going to appear multiple times in the child file. Today the basic question we're going to explore is: what is the causal impact of children's participation in public assistance programs on the likelihood that children who receive a screened-in report in any given year receive a second screened-in report? In other words, among children who receive a screened-in report in a given year, what's the impact of participation in public assistance programs on the likelihood that they're re-reported? I'll say a little bit more in a minute about why we might be interested in that question. But either way, our unit of analysis is going to be the child, and so we have to figure out how to reduce our data from child reports to the child level. This is a little bit tricky because we're using a variety of covariates, and children with multiple reports will have varying values of different covariates, like age and potentially even race and ethnicity, across multiple reports. So we have to decide which covariate values we're going to use. Here, what I do first is pull out for each child the date of the first report in the fiscal year, and I also create a count of repeat reports. So I'm going to group the data by child ID, and then I'm going to summarize the data, reassigning the report date as the minimum value of report date across all of a child's reports. So if a child has multiple report dates we'll take only the minimum value of report date, which is to say the first report that a child receives in a given year. We'll also count the number of child reports that each child has, which is to say the number of times a given child has a screened-in report in a given year, and we'll assign that to the variable n. Then, to bring our covariates back in, we'll join that child-level data to the full data so that we can get our covariates back, namely the covariates corresponding to the first report. So let's say a child in Texas received a screened-in report in January and then again in March, but they had a birthday in February, so they were six in January and seven in March: using this method we'll use their age at the first report, namely their six-year-old age, and the same will be true for all other covariates. Then we'll go ahead and create our outcome variable, re-report, which is going to equal zero if they have only one report in the year, and one if they have two or more reports in the year. So again, we're interested in whether children who received a screened-in report received a second screened-in report in a given fiscal year; that's our outcome. So let's go ahead and take a peek at this data.
So here's our re-report value; this is our outcome variable. We'll see that a few children here don't have re-reports and then this fourth child does. We have a few other variables: report source, so this is who made the first screened-in report, was it educational personnel, was it a law enforcement officer, was it medical personnel? Here we have the child's age in years, their sex, their race and ethnicity, and two other variables: fcmoney is an indicator variable indicating whether a caseworker identified the child as experiencing some sort of financial strain, and fcpublic is also a binary variable indicating whether or not the child was participating in a public assistance program like TANF, SNAP, WIC, SSI, something like that. We can go ahead and summarize our data here and see that roughly nine percent of children who receive a screened-in report in Texas in 2019 are re-reported within the fiscal year, so this is the outcome that we're interested in. We're going to go ahead and clean the data a little bit: we'll recode sex as zero/one instead of one/two. We'll do the same for public assistance and financial strain, so that if the child is not receiving public assistance they'll have a zero, and if they are receiving public assistance they'll have a one, and ditto financial strain. And then we'll also tell R that the race and ethnicity and report source variables, which it currently understands as numerical or continuous variables, should be treated as categorical variables, or factor variables. Okay, so let me go ahead and do that, and the last thing we're going to do is take a 10 percent sample of the data. If you were here for the data management session, the first session in this series, I discussed the importance of sometimes sampling your data, and we're dealing only with one year of data here from one state, Texas, admittedly a large state. Many of the computations we'll be doing today, though, are pretty computationally intensive, and to be honest, even with these smaller samples and these simpler models, if I were to walk through all of these with the full data from Texas it would take my computer, which is a decently powerful computer, too much time to actually pull this demonstration off in an hour. So I think it's very useful to sample your data when you're designing and testing your models, and then use the full sample of data when you're finally ready to generate your final results. That's all the more true when dealing with propensity score methods and matching methods; these tend to be computationally intensive methods. So we'll go ahead and take a 10 percent sample of our data, and you can see here we now have about 22,000 observations instead of 227,000 observations. I see a question about how much the results might differ between multiple regression models and propensity score methods. We're going to do both here in this particular case. There's no clear-cut answer; sometimes they'll be very similar, other times they'll be very different. Okay, so let's first examine, just in a very basic way, what is the relationship between public assistance and reinvestigation, or re-report we might say?
So we might think that public assistance, children's participation in a public assistance program like TANF, would decrease the likelihood that a screened-in child receives another screened-in report within the year, and this is because we think that financial support decreases the likelihood of maltreatment. So maybe we think there's some link between financial strain and the actual underlying incidence of maltreatment, and so children who are receiving more financial support are less likely to have a second screened-in report because they're less likely to be maltreated. Conversely, we might think that public assistance increases the likelihood that a screened-in child receives another screened-in report, because contact with these programs intensifies surveillance: children in these programs are perhaps more likely to interface with caseworkers or other public officials, and so given any underlying incidence of actual maltreatment, that maltreatment may be more likely to be identified. We could come up with other hypotheses about why the relationship between these two variables should be positive or negative; I just mean to illustrate that this is a potentially theoretically interesting question where empirically we might expect it to go one way or the other. So let's first go ahead and estimate a very simple model, a bivariate logistic regression model, where our only regressor is this treatment variable, participation in a public assistance program, and our outcome variable is this re-report variable. We'll use a generalized linear model, we'll use our 10 percent sample of the data, and we'll tell R that we want to use a logit link in our GLM. Let's go ahead and estimate that model, and then we can tell R to give us a summary of the model. We see here we have a coefficient for the intercept and a coefficient for the treatment. Of course, because this is a GLM, these coefficients represent log odds, so we might be more interested in seeing these coefficients as odds ratios, and we can use this little chunk of code here to get exponentiated coefficients. So what this tells us is that, according to this basic bivariate regression model, participation in a public assistance program is associated with about a 57 percent increase in the likelihood that any screened-in child receives a second screened-in report within the fiscal year. Okay, but the relationship between public assistance and reinvestigation is very likely to be confounded by other factors, like whether children experience financial strain, children's age, sex, or race/ethnicity, or the source of their first report. And so this is one of the reasons why we want to use regression adjustment, or multiple regression. So let's go ahead and add some covariates to our model and see how the estimate for the effect might change. So here again our estimated coefficient is about a 57 percent increase; now we're going to add these covariates to our model, otherwise a similar model. And we can go ahead and look at the exponentiated coefficients for our model; this might actually take just a second. Well, hopefully not that long. Okay, and now we see we have coefficients for all of these different model parameters, and we can see that our estimate for our main effect here of public assistance hasn't changed that much: now closer to a 58 percent change.
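As a brief aside, here is a minimal sketch of how an exponentiated coefficient maps onto the "percent increase" language used above; the coefficient value below is illustrative only, not the actual output from the demonstration:

# Hypothetical log-odds coefficient on fcpublic from a model like model1 (illustrative value)
b <- 0.452
exp(b)               # odds ratio, roughly 1.57
(exp(b) - 1) * 100   # read as about a 57 percent increase in the odds of re-report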
Okay, one of the sources I linked up top, though, I think summarizes very nicely why we might want to move from a multiple regression framework to a propensity score or matching framework: "For regression adjustments, it is often difficult to assess if models have correctly specified treatment assignments and covariates to outcomes. Goodness-of-fit measures, such as the coefficient of determination, R squared, or the mean squared error, do not test whether a model is correct for causal inference. Thus a model that accurately predicts outcomes may still have biased estimates of treatment effects." So these are pretty strong effects and they're fairly precisely estimated, but their argument is that even though we may be successfully predicting a certain outcome, that does not necessarily tell us about the causal effect we care about, and that may have something to do with how we've specified the functional form of our model, the parametric relationship between these different inputs. Okay, so now let's get to the heart of it: let's do some matching estimators. Let's first think through the theoretically ideal matching scenario. Theoretically, if we could find observations that were identical on all observed measures except the treatment, that is to say every observed attribute of children except whether or not they're participating in a public assistance program is the same, we could then just take the difference in outcomes between these matched treated and control pairs to estimate the treatment effect, and this is what we were talking about earlier in the PowerPoint presentation. So let's give this a try: let's use matchit to match children with and without public assistance, in other words in the treated and control groups, who otherwise have exactly the same covariate values. How are we going to do this? We're going to use this matchit command from the MatchIt package, and we're going to use a formula here that looks a lot like the equation we would use to estimate a propensity score. Here we're not going to put the outcome of interest on the left-hand side, but rather the treatment condition, so again the treatment that we're interested in is whether or not a child participates in a public assistance program. And then on the right-hand side we'll put all of our other covariates that we think might affect treatment or the outcome, the same covariates that we had in our model up here. We'll use our same data, but we're going to tell matchit that the method we'd like to use when matching is an exact match. So this tells the algorithm to only produce matches between children who have different values of the treatment but otherwise exactly the same values of the covariates. So let's go ahead and try this. That worked pretty fast and it produced a matchit object. Now what we want to do is compare the covariate balance in the full data set and in our new matched data set, and an easy way to do this is to use what's sometimes called a Love plot, so let's go ahead and generate one of those. Ah, here we go. Okay, so we have each of our covariates down here, and then on the horizontal axis we have the absolute standardized mean difference. What this basically measures is how different the covariate values are between the treated and control groups. The hollow squares here are for the entire data set before we did any matching, and the dark circles here are for our matched sample.
Now you can see here that, say with respect to this money variable, children in the treated and control groups had very different values, and you might expect that, right: children who are or are not participating in a public assistance program, you might expect them to also have different values of financial strain. But then you can see all these black dots are perfectly aligned on this zero, right? And that is basically by construction, right? Because we've told this matching algorithm we only want to match children who have exactly the same values of these various covariates, and so as a result the difference between these covariate values is zero. But in order to achieve these perfect matches, you'll see that we've lost some cases. If we summarize this matching object it's going to report a bunch of statistics that we won't get into right now, but what's helpful to see here is that among all children, say the 22,000 in our 10 percent sample, we have about 19,000 children in the control group and about 3,600 in the treated group. I see a question from Yana: are these one-to-one exact matches? Yes; we'll start to move to one-to-one non-exact matches, and at the very end I'll talk about matching across multiple pairs of children. Okay, but what's important to note here is that the set of matched children is smaller than the set of all children we had in our data. That is to say, for 16,000 of the 19,000 children in the control group we found an exact match, but for another twenty-five hundred children we couldn't find an exact match. Similarly, for five hundred of the thirty-six hundred children in the treated group we couldn't find an exact match. Honestly, that's not bad. Most data sets will shrink much more than ours did, and that's partly because we're not using that many covariates, and definitely because we're using predominantly binary or categorical variables, or variables without very many values. But you can imagine, if we started to measure many more variables, let's say we expanded our analysis beyond Texas and we also wanted to match children based on the state they lived in, you could start to imagine that it might be hard to find children with particular combinations of age, race, sex, and report source in smaller states like Wyoming or Rhode Island. Yes, the question about matching with replacement: no, these estimators do not use replacement. We won't talk about replacement much here today; I'll include some code at the very end of the session that does involve some matching with replacement. This is a little bit more of a technical issue than we're going to get to today. The question about an acceptable benchmark for dropping cases is what we'll start to talk about in just a minute, but essentially this is the fundamental question of matching estimators. Here we've decided that an acceptable match is only a match that is identical across all covariates. In most data contexts you're going to lose a lot of cases if you require such a stringent criterion for matching. It's actually a unique attribute of NDACAN data that we have such large data sets, and data without a ton of variation in the values, so I would encourage people to try exact matching estimators when they're starting their matching designs, although you're very likely to have to abandon such a stringent criterion. Let's say that we didn't do so well here and we can't actually match so many children using our exact matching criterion.
We may instead try to find best matches according to some other criterion. One way to do this is to find matches by minimizing what's called the Mahalanobis distance, which you can think of as kind of a Euclidean distance in a multi-dimensional space: so imagine two points in the space defined by all the variables we're interested in, and we're going to try to find what's sometimes called the nearest neighbor according to those values. So let's go ahead and estimate another match, very similar here, except we're going to tell matchit that we want to use the nearest method, and the distance we're going to use to evaluate nearness is the Mahalanobis distance. Let's go ahead and look at our Love plot for this match. So here again we had quite a bit of difference between the treated and control groups in terms of this money covariate. We've improved the covariate balance using this nearest neighbor match, but as you can see it's still not all that well overlapping. Other variables, like whether children have this particular value for race and ethnicity, are now completely overlapping, and so you can see there's some variety of performance across these different criteria. Now, sometimes the Mahalanobis distance will not work so well, or it will be computationally intensive. So then we might try to match based on the propensity score. Here let's go ahead and use the propensity score matching estimator in MatchIt. It's going to look very similar to the Mahalanobis call: we're going to use the nearest neighbor method, but here we're going to say distance is equal to glm. And what this tells matchit is that we're going to use a GLM, by default a logit model, to estimate a propensity score, and then we'll identify nearest neighbors based on minimizing the difference in propensity scores across treated and control units. So this is basically going to be the equation according to which we estimate the propensity score, and then we'll find nearest neighbors across children using that propensity score. So let's go ahead and estimate those matches. And summarize the performance. So here you can see again we're not doing so great, and in some cases our matched data actually has worse overlap than the unmatched data. So if I were to see this Love plot for our propensity score matching, I would say it's probably a good idea to go back to the drawing board and try to figure out a better parameterization for our propensity score model. I just use this as an illustration of how you might test out different matching methods and evaluate how well they've performed. There are a variety of other ways we might evaluate the success of a particular matching exercise; we might look at the density of particular covariates across the full sample or the matched sample. Here we want these lines to be very close to one another. They're already pretty close in the full sample, but the distribution of children's age in the matched sample is very close. I'll skip some of these other visualizations, but I'll leave them here as a sort of primer on how you might think about diagnosing the performance of your matching models. But again, more or less all of these different diagnostics are trying to figure out how well you're satisfying the overlap assumption of this design, which relies on the selection on observables assumptions.
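One further diagnostic along these lines, sketched here as a supplement to the script rather than part of the demonstration: it assumes match.ps was created with distance = 'glm', so that the estimated propensity scores are stored in match.ps$distance, and that d10 is the 10 percent sample used above.

# Compare the distribution of estimated propensity scores
# between treated and control children in the unmatched data
ps <- match.ps$distance
treated <- d10$fcpublic == 1
summary(ps[treated])   # propensity scores among treated children
summary(ps[!treated])  # propensity scores among control children
# Histograms of the two groups: regions where one group has little or no mass
# are regions where the overlap assumption is in doubt
hist(ps[treated], breaks = 30, main = 'Treated', xlab = 'Estimated propensity score')
hist(ps[!treated], breaks = 30, main = 'Control', xlab = 'Estimated propensity score')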
Okay, so we've estimated some matches, which is to say we've used a variety of matching or propensity score methods to match particular cases to one another; let's now go ahead and use these matched data to estimate some models. I see a question about variables that should not be used to match on, for example experiences of the treatment and control groups after assignment. Yeah, so in principle, from a theoretical standpoint, you want the variables in your propensity score model to be ones that could plausibly predict treatment assignment. So I think it's broadly speaking theoretically ideal to have predictors in your propensity score model that would have some meaningful influence on treatment assignment. The sort of example you give, where attributes follow after treatment assignment, would probably not be a good candidate for predicting the treatment assignment itself. Okay, so we have these match objects, but now what we want to do is turn them into data objects involving matched pairs. So let's go ahead and do that: we're going to move forward with both our exactly matched data and our data matched using the propensity score, so let's create an exact match data object and a propensity score matched data object. Let's go ahead and look at what these look like. Okay, so this is our new exactly matched data set. Here's our outcome variable, here are various covariates, but now we have a couple of new variables. This weights variable we're going to use in our regression, and these weights will effectively implement our matching estimator. This subclass variable identifies our various matched sets. So these are all of the children in subclass 1: all of these children have exactly matching values for all the different covariates, but varying values of the treatment, right? If we looked at our propensity score data we'd see something very similar: we have a set of weights here, which in this particular case, because we're not using replacement and we're using nearest neighbor matching, are all going to be one. But if we look at our subclass here, each subclass identifies a unique pair of children, right? One child, two child, one child, two child, and these children in many cases are going to have identical values of our various covariates, right? But they'll always have different values of the treatment, right? We have one child from the control group and one from the treatment group, one from the control group and one from the treatment group. And this distance value is going to tell us more or less how similar these two children are. Okay, now let's go ahead and re-estimate our regression model using our exactly matched data. In this case we're going to go back to estimating only the bivariate relationship between our outcome and our treatment variable; we're going to get rid of our covariates; again we're going to use our matched data; we're going to define our weights using the weights in the new data object; and again we'll use a logit link. So let's go ahead and estimate that model. You can ignore this warning for now. We can go ahead and summarize that model, and again get our exponentiated coefficients. In this matching context, to get valid estimates of standard errors, I think it's usually advisable to cluster your standard errors on each matched pair, which is to say we're going to cluster on this subclass variable.
I've written a chunk of code here that will apply cluster-robust variance estimation and yield exponentiated coefficients. So here's our best estimate for the effect of public assistance participation on re-reporting using our matching estimator based on exact matches. And then we can do the same with our propensity score model here, and we'll see we get a very similar outcome; in this case it's so similar that it may involve a coding error on my part. Either way, you have the code here that should illustrate how to implement not only matching but then estimation based on your matched data sets. We're starting to get low on time here and I want to leave a little bit of time for Q&A, so I'll wrap up by just saying a couple of things. We've left our covariates out of the outcome model here; they're really entering through the estimation of our matches, particularly through this weights variable. We can add precision to our model by including those covariates in our final outcome model. I will say, though, that this requires some additional data preparation, particularly if you're dealing with generalized linear models, and so you'll want to read up on how to properly re-center your variables, and I've put a link here for how to do that. You wouldn't want to just put your covariates right back into this outcome model; it won't yield valid estimates of the treatment effect. I also said that there are opportunities to combine different methods, and so, although we won't get to talking about it here, I've included some code and some description of ways you might consider combining the propensity score in a variety of ways to yield more robust estimates of the treatment effect, and also to move from estimates of the average effect of treatment on the treated to average treatment effects. So I'll pause there and open it up for questions in the hope that I might clarify some of what we talked about. I grant that these methods are quite complicated and it's quite a lot to work through in an hour, so thanks for hanging with me. I'm happy to discuss more detailed questions about this over email [onscreen afr23@duke.edu] or in office hours anytime. But thanks for listening and I'm excited for questions. [Alexandra Gibbons] Yeah, so we have a question about whether you can please explain again what the weights are doing in our re-running of the binary regression model. [Alex Roehrkasse] Yeah, absolutely. It will vary a little bit depending on which particular matching strategy we've used. So for example, in our propensity score matching model, where we're matching based on the propensity score and without replacement, you'll notice that the weights in this particular data object... oh, well, my computer seems to have frozen, but if you were to open this propensity score data object you would notice that all the weights are one. I've included the weights argument nevertheless because I think it's just good practice to always re-weight the data. But if you were to look at the exactly matched data, those weights are essentially telling your GLM estimator how to weight each observation such that you're appropriately comparing each set of matched treated and control units to one another.
So it's a little bit complicated to explain in just a minute or two, honestly, but it's effectively an implementation of the kind of pairing and differencing that we described in mathematical language in the PowerPoint presentation. I'd encourage you to read some of the vignettes that I included at the top of the code if you're interested in learning more about precisely how the weights implement the estimators we discussed. [Alexandra Gibbons] Thanks Alex! Can you also go into a bit more detail about how the results differ from what we can interpret from a logistic regression? [Alex Roehrkasse] I see the question here: a bit more detail about how the results differ from what we can interpret from a logistic regression. Yeah, so there are a number of differences. First, the logistic regression involves all of the observations in our data set, right? We've estimated that model on all observed children. The matching estimator didn't use all children; you'll recall that, particularly with the exact matching estimator, a number of children dropped out for whom we could not find exact matches. The regression adjustment technique, or multiple regression, is particularly designed to predict an outcome, whereas the matching estimator is especially designed to measure a causal effect. There are different ways of specifying causal effects; I sort of hinted at this at the end: the average treatment effect on the treated, the average treatment effect. By default, the various matching estimators we talked about today measure the average treatment effect on the treated. In certain matching estimators you can adjust the estimation technique so that the estimand is instead the average treatment effect, and you can estimate marginal effects. These are all different ways of measuring treatment effects, and I would just refer you to broader discussions of how to think about treatment effects, which apply equally to matching estimators as to other kinds of estimators. [Erin McCauley] All right, great, thank you so much Alex. We are at time so we have to wrap it up for today, but Alex will be back next week leading another workshop. Thank you everyone so much for being here; we hope to see you next week at our final workshop. [Alex Roehrkasse] Thanks everyone!

# NOTES
# THIS PROGRAM FILE DEMONSTRATES SOME OF
# THE MATCHING/PROPENSITY SCORE METHODS DETAILED IN
# SESSION 5 OF THE 2022 NDACAN SUMMER TRAINING SERIES
# FOR QUESTIONS, CONTACT ALEX ROEHRKASSE (AFR23@DUKE.EDU; ALEXR.INFO)

# HELPFUL RESOURCES ON PROGRAMMING MATCHING ESTIMATORS:
# https://pubmed.ncbi.nlm.nih.gov/34268425/
# https://cran.r-project.org/web/packages/MatchIt/vignettes/

# 0. SETUP

# Clear environment
rm(list=ls())

# Install packages (only necessary once)
#install.packages(c('data.table', 'tidyverse', 'MatchIt', 'optmatch', 'lmtest', 'sandwich'))

# Load packages
library(data.table)
library(tidyverse)
library(MatchIt)
library(lmtest)
library(sandwich)
# Cf. Jasjeet Sekhon's 'Matching' library
# Set filepaths
afrmac <- '/Users/Alexanderroehrkasse/Library/CloudStorage/Box-Box/Presentations/-NDACAN/2022_summer_series/'
afrpc <- 'C:/Users/aroehrkasse/Box/Presentations/-NDACAN/2022_summer_series/'

# Set working directory
wd <- afrpc # Toggle filepath here to change working directory
setwd(wd)

# Set seed
set.seed(1013)

# Read the data
d <- fread('data/s5.csv')

# Today's data are anonymized NCANDS data from Texas in 2019
# I've done a number of things to pre-process the data
# First, I've anonymized the data by manipulating some values (I won't say how)
# For demonstration purposes I've also dropped observations with missing values
# (which you shouldn't do!)
# But the other processing is helpful to know
# in case you want to do something similar:

# Select reports from Texas (for which key variables are complete)
# and reports made in 2019 FY
tx <- ncands2019 %>%
  mutate(fips = round(RptFIPS/1000)) %>%
  filter(fips == 48) %>%
  filter(RptDt > '2018-10-1' & RptDt <= '2019-9-30')

# Collapse to child level.
# Children with multiple reports will have varying values of age
# or even race/ethnicity across reports.
# We have to decide which covariate values to use.
# So here I keep the date of the FIRST report
# and create a count of repeat reports
tx_first <- tx %>%
  group_by(StChID) %>%
  summarize(RptDt = min(RptDt),
            n = n())

# Then I join the child-level data to the full Texas data
# to get the covariates observed at first report.
# I create an outcome variable indicating whether
# children received a second screened-in report
d <- tx_first %>%
  left_join(tx) %>%
  mutate(rereport = as.numeric(n > 1))

# Take a peek
head(d)
summary(d)

# Clean the data
d <- d %>%
  mutate(ChSex = ChSex - 1,
         fcpublic = case_when(fcpublic == 2 ~ 0,
                              fcpublic == 1 ~ 1),
         fcmoney = case_when(fcmoney == 2 ~ 0,
                             fcmoney == 1 ~ 1),
         RaceEthn = factor(RaceEthn),
         RptSrc = factor(RptSrc))

# Take a sample of the data
d10 <- d %>%
  slice_sample(prop = .1)

# 1. WHAT IS THE RELATIONSHIP BETWEEN PUBLIC ASSISTANCE AND RE-INVESTIGATION?

# We might think that public assistance decreases the likelihood
# that a screened-in child receives another screened-in report within the year
# because that support decreases the likelihood of maltreatment.
# We might also think that public assistance increases the likelihood
# that a screened-in child receives another screened-in report within the year
# because contact with public programs intensifies surveillance.

# Let's first estimate a bivariate logistic regression model:
model1 <- glm(rereport ~ fcpublic,
              data = d10,
              family = binomial(link = 'logit'))

# Summarize the model
summary(model1)

# Output coefficients and 95% CIs as odds ratios
exp(cbind(coef(model1), confint(model1)))

# But the relationship between public assistance and re-investigation is likely
# confounded by other factors like
# whether children experience financial strain,
# children's age, sex, or race/ethnicity,
# or the source of the first report.

# 2. REGRESSION ADJUSTMENT: CORRECTLY SPECIFIED PARAMETERS?

# Let's add some covariates to our model
# How does our estimate for the effect of public assistance change?
model2 <- glm(rereport ~ fcpublic + fcmoney + ChSex + ChAge + RaceEthn + RptSrc,
              data = d10,
              family = binomial(link = 'logit'))
summary(model2)
exp(cbind(coef(model2), confint(model2)))

# Zhao et al (2021):
# "For regression adjustments, it is often difficult to assess if
# models have correctly specified treatment assignments and covariates to outcomes.
# Goodness-of-fit measures, such as
# the coefficient of determination, R2, and mean squared error (MSE),
# do not test whether a model is correct for causal inference.
# Thus, a model that accurately predicts outcomes may still have
# biased estimates of treatment effects."
# https://doi.org/10.21037%2Fatm-20-3998

# 3. MATCHING: OVERLAP?

# Theoretically, if we could find observations that were identical on
# all observed measures except treatment,
# we could take the difference in outcomes between
# matched treated and control pairs to estimate the treatment effect.

# Let's use matchit to match children
# with and without public assistance (treatment and control groups)
# who otherwise have EXACTLY the same covariate values:
match.exact <- matchit(fcpublic ~ fcmoney + ChSex + ChAge + RaceEthn + RptSrc,
                       data = d10,
                       method = 'exact')

# And then let's compare the covariate balance in the full and matched data
# using a Love plot
plot(summary(match.exact), var.order = "unmatched")

# By construction, using exact matching our covariate balance is perfect.
# Note that our dataset has shrunk because not all observations have exact matches.
summary(match.exact, un = F)

# Most datasets will actually shrink MUCH more than ours did,
# so generalizability is severely limited.
# We may instead try to find "best" matches rather than exact ones.

# One way to do this is to find matches by minimizing the Mahalanobis distance,
# which is similar to a Euclidean distance in multidimensional space.
# This is often called "nearest neighbor" matching.
match.nearest <- matchit(fcpublic ~ fcmoney + ChSex + ChAge + RaceEthn + RptSrc,
                         data = d10,
                         method = 'nearest',
                         distance = 'mahalanobis')
plot(summary(match.nearest), var.order = "unmatched")

# Or we can estimate the propensity score and choose our matches based on the
# minimal difference in scores:
match.ps <- matchit(fcpublic ~ fcmoney + ChSex + ChAge + RaceEthn + RptSrc,
                    data = d10,
                    method = 'nearest',
                    distance = 'glm')
plot(summary(match.ps), var.order = "unmatched")

# We can also summarize the performance of our matching using
# plots of the densities of covariate values in matched observations vs. all units:
plot(match.ps, type = "density", which.xs = c("fcmoney", "ChAge"))

# Or plots of the distribution of PSs in matched/unmatched treated/control units:
plot(match.ps, type = "jitter", interactive = FALSE)

# Other helpful diagnostics:
plot(match.ps, type = "qq")
plot(match.ps, type = "ecdf")

# 4. ESTIMATION AFTER MATCHING

# Let's output a dataset of matched observations
match.exact.data <- match.data(match.exact)
match.ps.data <- match.data(match.ps)
view(match.ps.data)

# And now let's re-estimate our binary regression model, using our exact matches:
model.exact <- glm(rereport ~ fcpublic,
                   data = match.exact.data,
                   weights = weights,
                   family = binomial(link = 'logit'))
summary(model.exact)
exp(cbind(coef(model.exact), confint(model.exact)))

# But let's also cluster our SEs on matched pairs
coeftest(model.exact, vcov. = vcovCL, cluster = ~subclass) %>%
  cbind(coefci(model.exact, vcov. = vcovCL, cluster = ~subclass)) %>%
  as.data.frame() %>%
  mutate_at(c("Estimate", "Std. Error", "2.5 %", "97.5 %"), exp)

# And using our PS matches:
model.ps <- glm(rereport ~ fcpublic,
                data = match.ps.data,
                weights = weights,
                family = binomial(link = 'logit'))
coeftest(model.ps, vcov. = vcovCL, cluster = ~subclass) %>%
  cbind(coefci(model.ps, vcov. = vcovCL, cluster = ~subclass)) %>%
  as.data.frame() %>%
  mutate_at(c("Estimate", "Std. Error", "2.5 %", "97.5 %"), exp)
Error", "2.5 %", "97.5 %"), exp) # Note that adding covariates back into your data model can improve precision, # but valid estimation REQUIRES additional data preparation, # which can be complicated depending on the structure of your data: # https://cran.r-project.org/web/packages/MatchIt/vignettes/estimating-effects.html # 4. ROBUST ESTIMATION # The above methods are based on matched pairs of observations. # They are best suited for the estimation of # average treatment effects on the treated (ATT), the default. # An easy way to do matching that is # less sensitive to misspecification of the PS model # and which allows you to estimate average treatment effects (ATE) is # "full optimal matching" or "full matching". # Full matching is less precise, but less likely to be biased. # NB: Full matching is considerably more computationally intensive!! match.full <- matchit(fcpublic ~ fcmoney + ChSex + ChAge + RaceEthn + RptSrc, data = d10, method = 'full', estimand = 'ATE', distance = 'glm') match.full.data <- match.data(match.full) model.full <- glm(rereport ~ fcpublic, data = match.full.data, weights = weights, family = quasibinomial(link = 'logit')) summary(model.full) exp(cbind(coef(model.full), confint(model.full))) [voiceover] The National Data Archive on Child Abuse Neglect is a collaboration between Cornell University and Duke University. Funding for NDACAN is provided by the Children's Bureau an Office of the Administration for Children and Families. [musical cue]