National Data Archive on Child Abuse and Neglect. [Erin McCauley] Okay everyone, welcome, we will get started. Hopefully you all already know this, but this is the Summer of NYTD 2018 summer training seminar series, hosted by the National Data Archive on Child Abuse and Neglect, which is in the Bronfenbrenner Center for Translational Research at Cornell University. We're really excited to be offering this service to our data users, and this whole series focuses specifically on the NYTD data and what we can do with it. So here is the overview of the summer training series; as you can see, we are about halfway done. Today is our second expert presentation, so Michael Dineen is going to be back at it, teaching us some wonderful things. And just a little preview forward, we'll talk about this more at the end, but the next session will also be a Michael session: he's going to be talking to us about how to link NCANDS and AFCARS with the NYTD, which is one of the really special capacities of this data set. After that we are going to have two research presentations, and I'll be giving the first one. That will be a very in-progress research study, so I'm just a little bit into it, and I'll walk through with you what I did, how I came up with my questions, what I've done with the data so far, the descriptive analysis that I did, my initial results, and where I plan to take it based on those. And then our last presentation will be a full-fledged, finished, conference-style research presentation. For those of you who have attended previous sessions, you know that typically we ask you to keep your questions until the end, and then we have a Q&A session. This time around we'll still be doing a question-and-answer section at the end, so if you have questions, feel free to pop them into our chat box. And now I'm going to allow Michael to take over.
[Michael Dineen] Okay, hi everybody, I'm Michael Dineen. I'm the manager for the NYTD and other administrative data sets here at the National Data Archive. Today I'm going to be telling you about weighting with NYTD, but first I want to tell you a bit about why we need to weight. The purpose of NYTD is to get information on kids who age out of foster care, because that's a particularly vulnerable population facing a lot of issues. So we have a survey that's done for them at ages 17, 19, and 21. Like any survey, the respondents may not be like the population, because it's not a random sample: people will either respond to the survey or not respond, and the people who respond could be different in important ways from those who don't. We want to generalize our findings not to the people who respond but to kids who age out of foster care; that's who we want to know things about, not this particular group that happened to respond and who may differ in certain characteristics. The differences can bias the statistical inferences about the population. We've all seen cases of political surveys that had too many Democrats or too many Republicans in them, and it biased the results, and people made a prediction that didn't bear out. So weighting is used to bring the proportions of certain subgroups, let's say sex and race, back to the proportions they are in the original population. That kind of weighting is called weighting for nonresponse. So what I'm going to do is go through a series of steps with you and show you how to develop the weights. I'm going to be using an Excel spreadsheet to show you, hands-on, how to make the weights; then we'll save that out as a table (every spreadsheet is a table), and we'll transfer that weight table into a stats program. In our case we'll be using SPSS to demonstrate.
Then we're going to merge the weight table with the data table, and once we have the weight variable in the data table, we'll be able to turn on weights using that weight variable. Then we can compare the weighted and the unweighted frequencies to see if the weights are making much of a difference. So let's get started. To compute the weights, we first compute the proportion of the population for each cell in the population table, then we compute the proportion of respondents for each cell, then we compute an expected count for each cell, and from that we can compute the weights. So I'm going to switch to Excel. Okay, here's what we're working with. I've got what is essentially a 2 × 8 contingency table: two values for sex, male and female, and eight values for race, where 1 is white, 2 is African-American, 3 is Native American, 4 is Asian, 5 is Hawaiian or other Pacific Islander, 6 is multiple race, and 7 is Hispanic, and the same eight values for females. So that makes a 16-cell contingency table. In the population there are 6,176 white males, and so forth for each of these categories, and then in the cohort, here are the counts for each of those cells. So the first thing we want to do is sum the population, here 29,104, which is the correct number who are in the wave one baseline of the 2011 outcomes survey. I'm going to name that cell because it'll be clearer: I'm going to call it 'population'. Then I take the sum of the cohort, or the observed values; that's 15,597, which in fact is the number in the cohort, so I'm going to name that cell 'cohort'. You don't have to name the cells, but it's clearer for this demonstration. Okay, now I want to compute the percentage of the population, which here would be C2 divided by 'population', which is that 29,104.
Okay, that's 21%, and then I just drag that down, and these are our population proportions for each of the cells in the sex-by-race contingency table. So now we can compute an expected value: the expected value is how many respondents we expected to get based on the proportion of the population. If our respondents were random, we would expect to get these percentages in each of those cells: we would get 21% white males, and say 1% would be female Native American. Now we can compute the expected values. I'm entering a formula: D2, that's the percentage of white males, times the total of the cohort. And that doesn't look right, I must have done something wrong... oh, I know, I shouldn't have divided, I should have multiplied, because that's a percentage. I want 21% of 15,597, so that's D2 times the cohort total. Okay, now I forgot to put the =. I got it right this time: D2 times 'cohort' gives me the figure of 3,309.75. That's the number of respondents we would have gotten if they were exactly the same proportion as the population. So that was computing an expected value for the number of respondents we expected to get, based on the fact that they are 21% of the population of kids in foster care at age 17. To compute the weight, I use 'expected' divided by 'observed', so 0.92 is the weight. Just to test it, we'll multiply: that equals G2 times F2, and that's 3,309.8, which brings us back to our expected value, so that weight is correct. So we compute these weights all the way down; well, wait, first we have to compute the expected values all the way down, and note that when I sum the expected values, I get the same as the cohort. Then the weight equals 'expected' divided by 'observed', and these are our weights. So that's the weight table; we save that, and then we go to SPSS.
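The spreadsheet steps just described (cell proportion of the population, then expected count, then weight) can be sketched in a few lines of Python. The cell counts below are invented illustration values, not the actual NYTD figures:

```python
# Hypothetical sex-by-race cell counts (illustration only, not NYTD data).
population = {("M", "white"): 6000, ("M", "black"): 4000,
              ("F", "white"): 6000, ("F", "black"): 4000}
cohort = {("M", "white"): 2500, ("M", "black"): 1500,
          ("F", "white"): 3500, ("F", "black"): 2500}

pop_total = sum(population.values())    # like the named cell 'population'
cohort_total = sum(cohort.values())     # like the named cell 'cohort'

weights = {}
for cell, pop_count in population.items():
    pop_prop = pop_count / pop_total          # proportion of the population
    expected = pop_prop * cohort_total        # respondents expected if random
    weights[cell] = expected / cohort[cell]   # weight = expected / observed

# Same sanity check as in the spreadsheet: weighted respondent counts
# sum back to the cohort total.
assert abs(sum(weights[c] * cohort[c] for c in cohort) - cohort_total) < 1e-9
```

A cell that is underrepresented among respondents gets a weight above one, and an overrepresented cell gets a weight below one, exactly as with the 0.92 computed for white males above.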
Oh, is this big enough, can you guys read this? I hope you can see it. First I'm going to go get the data file. The data file is the one you would receive if you downloaded data from the archive, called outcomes cohort 11 wave three version 2. Wave three means all three waves, so I go get that. It went and got it, no errors, and then it looks like this. This is the variable view with all the NYTD variables, and here's the data view. So that's loaded, and we'll run a frequencies on wave just to see what's in there. It shows us that we've got 29,000 at wave one and around 15,000 and 14,000 at waves two and three. So that's our full outcomes table. We are only interested in the wave one people who are in the cohort, and the reason is that those are the only people we're going to give weights to. At least in this demonstration we are interested in that wave one cohort; you could be interested in the wave two cohort or the wave three cohort, but for demonstration purposes we are using the wave one people who are in the cohort, because those are the respondents. The 'select if' means that we're going to select only people who meet those criteria. Then we'll run the wave frequency again. Here are the new frequencies: now we only have wave one, age 17, 15,597. Those are the people we are working with now, who we're going to put weights on. Okay, then we'll save that file out under a new name before we screw everything up, because since we eliminated all those cases, saving over the original would eliminate them from our big data file. So I saved it right away. Now we want to get the table called weights sex race.XLS; that's the table we just created in Excel. So let's go get that. We're importing that Excel table into SPSS.
And then we're going to give that one a name; we're going to call it 'weights'. So here's the table. It's got all the same variables: sex, race, population, and this is where we had percent of population, expected, and cohort, and there's our weight, in a variable called 'weight'. Now we're going to join the two data sets together. This is called a star join in SPSS, which is just using a SQL statement. For those of you unfamiliar with SQL, it starts with the SELECT statement, which tells it which columns, or variables, you want; then the FROM clause tells it which table to pull those variables from, and it renames the table it's going to pull them from as T0. That's why each of these is written T0.wave: table T0, column (or variable) called wave. Then it joins the table called weights, and it calls that table T1. The way I set this up, the only variable I am pulling from that table is the weight itself, because we don't need all those other columns that we had in the spreadsheet. And notice how this works: we're saying race ethnicity in the data table is equal to race ethnicity in the weights table. Race ethnicity in the weights table has those values 1, 2, 3, 4, 5, and whatever row in the data table has a value of one in the column race ethnicity will join with the one in the weights table. Then it also says 'and' sex in the data table equals sex in the weights table. So the result is that anyone who has this combination of values for these two variables will be linked with the corresponding weight, and so forth. Okay, so we'll run that, and now we've got a new table in SPSS which has all the same NYTD variables, but at the very end it has a weight variable. So that's now the weight based on the two variables, sex by race.
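Outside SPSS, that star join amounts to an ordinary lookup keyed on the two linking variables. A minimal sketch, with invented rows and weights (the variable names `raceethn` and `currenroll` are stand-ins, not the exact NYTD column names):

```python
# Toy NYTD-style respondent rows (values invented for illustration).
records = [
    {"id": 1, "sex": 1, "raceethn": 1, "currenroll": 1},
    {"id": 2, "sex": 2, "raceethn": 1, "currenroll": 0},
    {"id": 3, "sex": 1, "raceethn": 2, "currenroll": 1},
]

# Weight table keyed on (sex, raceethn), like the ON clause:
# T0.sex = T1.sex AND T0.raceethn = T1.raceethn.
weight_table = {(1, 1): 0.92, (2, 1): 1.05, (1, 2): 1.10}

for row in records:
    # Pull only the weight across, as in the SELECT that keeps T1.weight.
    row["weight"] = weight_table[(row["sex"], row["raceethn"])]
```

Every row whose (sex, race) pair matches a cell in the weight table ends up carrying that cell's weight, which is exactly what the joined SPSS table shows in its last column.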
Okay, so now, to show you what this does — that custom table is just a demonstration. See, now Hispanic white, sex male, has the weight of 9.19, and female has its weight. So you can see we've successfully moved our weights over to the SPSS table. Okay, now we'll save that out again. And now what we're going to do is run frequencies of current enrollment with weight on, and frequencies of current enrollment with weight off, to show you the difference. We run that, and here's the weighted: we get 14,595, 14,5... So it looks like very little difference in the variable current enrollment by gender and race. Let's do another one: we'll do this for current connection with an adult and see if that makes a difference. Here's the weighted, and this is the unweighted with weight off: 14,522 versus 14,548, it's negligible; 95.1% versus 93.3%. So the weighting that we've done so far with those two variables doesn't seem to be making any difference in our frequencies, and that's a good thing: it means our respondents were pretty random. We'll do it again for 'incarcerated', because that could be skewed by racial and gender issues. This is the weighted up here and the unweighted down here: 61.9% versus 61.8%, again insignificant differences. So the weighting we're doing here isn't going to change anything, because we found out through the weighting process that our respondents are pretty random, at least on the variables we've looked at so far. That's a good thing to know. So, to review: in computing weights, you compute the proportion of the population for each cell, you compute the proportion of the cohort, or the respondents, for each cell in the respondents table, and then you compute the expected count.
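Turning the weight on and off, as in those SPSS runs, amounts to counting each respondent once versus counting them by their weight. A small sketch with invented rows:

```python
from collections import defaultdict

# Invented respondents: an outcome value plus a nonresponse weight.
rows = [
    {"currenroll": 1, "weight": 0.9},
    {"currenroll": 1, "weight": 1.1},
    {"currenroll": 0, "weight": 1.3},
    {"currenroll": 0, "weight": 0.7},
]

unweighted = defaultdict(int)
weighted = defaultdict(float)
for r in rows:
    unweighted[r["currenroll"]] += 1           # weight off: each counts as 1
    weighted[r["currenroll"]] += r["weight"]   # weight on: counts by weight
```

If the weighted and unweighted distributions barely differ, as in the talk, the respondents look close to random with respect to the weighting variables.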
Here's the formula for the weights: weight_ij equals expected_ij divided by observed_ij. We had a two-way contingency table, so that's row i, column j, and the same for the observed: the expected divided by the observed for the same cell. We use the values for the same cell, and that gives us a weight for that cell. There's a slightly different way to do it, where you can compute the proportions, and it gives you exactly the same thing. But I like to use expected counts, because if you get into tiny, tiny proportions you can't really see much difference, whereas with the counts you can see they are different. It's a lot more intuitive, I think. So that ends that segment of the presentation. What I want to tell you now about weighting is that with NYTD we have an unusual situation where we know a lot about the nonrespondents, because they are in AFCARS and we have them in the foster care table. So it's possible to pull variables out of the AFCARS table and weight on those, if you think those might be skewing the data. First you'd have to link the two tables to have an AFCARS-NYTD table, and then you can compute the weights from the linked tables. What I'm going to do next is show you how to do that in the spreadsheet; the rest of the process is the same as I showed you before, which is this: develop a table of weights, transfer them to your stats program, merge the weight table with the data table, turn on weights in the stats program, and then do your analysis, whatever it happens to be. It doesn't have to be just frequencies; it can be something more complicated. So here I have not just sex and race, but these four variables from the AFCARS table that I brought over: do they have a clinical disability, what was the total number of removals for that child, how many placements did they have, and were they flagged as having entered foster care because of physical abuse.
With total removals (total rem) I limited it: if it was more than 10, I just said 10, because otherwise we'd get values of maybe 40 or 50 different total removals, with very low cell counts in them, and they wouldn't help you. Same thing with number of placements: if it was more than 11, I just said 11. So okay, here's the population for this cell. This is a cell in a six-dimensional table, where sex has two values, race and ethnicity has eight values, clinical disability has three values, total removals has, as I said, 10 values, number of placements has 11 values, and physical abuse has its values; going all the way down to the bottom, that gives you a total of 1,673 cells in this multidimensional table. So the first thing we want to do, as before, is sum our population and sum our cohort, the observed, and just for simplicity I'm going to call this 'population' and this 'cohort'; otherwise I'd have to say, for the denominator, G1674, and it's not clear what that is. So the percentage of the population is G2, the count in that cell, divided by 'population', or 0.9%, and I'll drag that down. Now I can compute the expected number of respondents we should have gotten if the respondents responded at the same proportion as they are in the population. That would be H2 times 'cohort', or about 135. So we got a few more than we expected; I'll bring that down. And just to check, our auto sum should be the same as 14,788. It checks. Now we can compute our weight, which is, remember that formula, the expected (I2) divided by the observed (J2). So that's the weight for this cell, this array of values. Now we go down and give weights to all the cells.
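As Michael notes, computing the weight from the two proportions gives the same answer as the expected-over-observed form, because the expected count is just the population proportion times the cohort total. A quick arithmetic check; the cell counts below loosely echo numbers quoted in the talk but are used here only for illustration:

```python
pop_count, pop_total = 6176, 29104     # one cell and the population total
obs_count, cohort_total = 135, 14788   # hypothetical observed cell count

pop_prop = pop_count / pop_total       # proportion in the population
obs_prop = obs_count / cohort_total    # proportion among respondents
expected = pop_prop * cohort_total     # expected respondent count

w_counts = expected / obs_count        # expected-count form of the weight
w_props = pop_prop / obs_prop          # proportion-ratio form

# The two forms are algebraically identical.
assert abs(w_counts - w_props) < 1e-9
```

The count form is often easier to eyeball, as the talk says: a gap of a few dozen respondents is visible at a glance, while the same gap expressed as a difference between two tiny proportions is not.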
Now I just want to show you: if we compute percent of observed, equals J2 divided by 'cohort', and drag that down, what I'm going to demonstrate here is that second formula, where you can use the proportions, the percentages in the two populations, the people who responded and the people in the population. That would be the percent in the population divided by the percent observed, equals H2 divided by K2. And you see I get exactly the same weight if I do it that way. So anyway, that's how you would do it. It's the same process: you save this out, bring it into your statistics software, link the tables in the statistics software, and then you'll have the weight. Then you can use the weight to check whether your respondents are affecting that variable, and if they are, you can use that weight in your counts and in your analyses. So, let's see, that ends the demo. Questions? [Erin McCauley] Is the info from AFCARS what you see for the first weight example — where do these populations come from? [Michael Dineen] In the first example, the gender and race come from the NYTD table, because when you receive the NYTD table for wave one, you get the entire population, the 29,000, not just the 15,000 who responded. And for those people who didn't respond, you get a certain number of demographics: you get their gender and their race, and I think that's about it. So that demonstration was entirely inside of the NYTD outcomes table, the one that you get when you download it from us. [Erin McCauley] Yes, that's the way I interpreted it: the first weighting presentation was kind of weighting across NYTD for nonresponse as we progress through the waves. If you remember back to the first week, Telisa and Tammy talked about how some states that have a ton of people may not follow up with everyone, and they also just have kind of expected attrition working with a vulnerable population.
Whereas, thinking back to the AFCARS, which was the second presentation, we can weight based on things that may have affected response in the NYTD but that we don't have information for there. So, like clinical disability: we don't have that in the NYTD, but we do have it in AFCARS. They say thanks very much; one more: when would you use the AFCARS linking versus not? [Michael Dineen] If you want to use variables from AFCARS, then you have to link to AFCARS for the people in AFCARS who are in the NYTD table. [Erin McCauley] Yeah, I think Michael is totally correct: whatever your question is and where you are going will determine which data set you want to use for linking and for weighting. I think one of the really neat capacities of the NYTD data set is that we do have that history, and so while the NYTD is a wonderful data set on its own and can be weighted across that, some of the really interesting and creative uses of NYTD will involve linking it with the other data sets, and once you start linking with them, you should start thinking about weighting with them. And one thing you can do as well, just like Michael did with these two different versions where we saw where it made a difference and where it didn't, is try multiple types of weighting. And we have another question: do any of the existing cohorts already have weights created? [Michael Dineen] Wave one has weights that were supplied by the Children's Bureau. They are based on something like 24 different variables that are specified in the user guide, but people have found that they don't change anything; plus, I wanted to tell you how to do it yourself, because you may not want to use those 24 variables.
You may want to weight on different variables, the variables that you're going to be using in your analysis. Say you are interested in "does clinical disability affect outcomes for kids who age out of foster care?", or "does length of stay, how long they were in foster care, affect their outcome?", or "does the fact that a parent was imprisoned, and that caused them to go into foster care, affect their outcome?". There is a lot of information in AFCARS that you would want to know whether it affected the outcomes of kids who age out. You wouldn't even have to weight to do those analyses, but you might want to weight to see if the variables of interest that you are working with are unequally represented in the cohort. [Erin McCauley] And we have another question; it's a rather long one, so Michael, if you can pull up the chat it might help you to answer it, but I will read it out loud for everybody: "In your first example you used only sex and race to create the weights, so the expected and actual numbers used to calculate the weights were fairly large. The second example had only one or two respondents in some categories. Does this difference impact the strength of the weights? Is there a limit to how many factors can or should be used to create a weight?" One thing that I would think about is that, as Michael said, he limited some of the categories where we get really high numbers: once someone had 10 or more placements, he was lumping them together. So you might want to think about how many people are going to be in each category. But as Michael said, the weight that comes with that first cohort uses a very large number of variables, so you definitely can use more; I don't know if there would be a limit.
I know for me — I'm doing a presentation on an ongoing research study looking at disability and outcomes after aging out. I haven't started weighting yet, but my plan is to do what Michael is doing here, experiment with a couple of different weighting options, and then meet with the statistical consulting center on Cornell's campus; I think most campuses have something similar, so you can talk it through with somebody who does that specifically. But Michael, if you want to take a crack. [Michael Dineen] Yeah, there is an issue with small cell sizes creating unstable weights. What I mean by an unstable weight is the weight computed when the cell count is very tiny, like say one; then if the count went to two, that would double or halve the weight. For tiny cell counts, little differences in the count can make huge differences in the weight. So small cell counts are in most cases troublesome, because weights won't settle down until you get a significant number. Part of the process of weighting is making a judgment yourself about whether that's too much: you can decide not to use that cell, or you can put a limit on the weight (they have a name for that, something like truncated weights), or you can set it at one so that it just represents one person, or at something more than one but in the vicinity of the other weights, giving it a little more weight but not so much that it's going to throw things off. So it's kind of a judgment call, weighting with those small cell sizes, but they are troublesome, and you do need to look at them, and you do need to do things like I did: you can group things; with age, for example, you can group into categories, and with anything that is a kind of continuous variable you would want to be careful in weighting.
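The weight-capping idea Michael describes, often called truncated or trimmed weights, can be sketched as a simple clamp. The floor and cap values below are arbitrary illustrative choices, not NYTD-prescribed constants:

```python
def truncate_weights(weights, floor=0.5, cap=3.0):
    """Clamp extreme weights so one or two respondents in a tiny
    cell can't dominate the weighted estimates (a judgment call)."""
    return {cell: min(max(w, floor), cap) for cell, w in weights.items()}

# Hypothetical raw weights: one stable cell and two unstable tiny cells.
raw = {"stable_cell": 1.02, "tiny_cell": 9.5, "overfull_cell": 0.05}
trimmed = truncate_weights(raw)
```

After trimming, the stable cell keeps its weight while the extreme values are pulled into the chosen range, which is the trade-off Michael describes: less bias correction in those cells in exchange for much more stable estimates.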
[Erin McCauley] Yeah, and I think you could also look to the literature to see how they recommend grouping, like whether there is a division point in, for example, the number of placements. And just like Michael created that one table with the weights and merged the weights back in, I think you could probably create a data table that has weights using a bunch of different specifications, match it with your data set, and then run what the frequencies would be across three or four different weighting choices. [Andres Arroyo] One question from Lauren: the question is, "If the weighted frequency table is similar to the unweighted frequency table for a particular variable, which table should you utilize/report?" [Michael Dineen] Oh, I would say use either one, because they are not different. But a better answer would be to use the one that's not weighted, because it's just simpler. [Erin McCauley] Yes, and then I would note in your methods section, or in sensitivity analysis checks, that you did try it with weighting, and, if you made the weights yourself, what choices you made, and then note that they are not different. [Michael Dineen] In general, use the simpler, and unweighted is simpler than weighted. Okay, it sounds like there are no more questions, so thank you everybody for attending. I hope it was useful to you, and we'll see you next week. We are going to be talking about linking the NYTD outcomes table to the NCANDS child file and to AFCARS, and what things to look out for when doing that. [Erin McCauley] Yes, thank you Michael, that was an absolutely wonderful presentation.
And I will be sending a reminder out for next week's session, so we hope you can join us again. I think it will be a really helpful presentation: this is one of the number one questions that we get from data users, and also the most unique facet of the NYTD data, this capacity to link backwards and account for people's foster care experiences and histories. So we hope that you'll see us next week; thank you for coming, and thank you Michael for that wonderful presentation. [Michael Dineen] You're welcome. Bye, everybody. The National Data Archive on Child Abuse and Neglect is a project of the Bronfenbrenner Center for Translational Research at Cornell University. Funding for NDACAN is provided by the Children's Bureau.