^M00:00:15 >> So you've heard about what LONGSCAN has to offer in terms of the conceptualization of LONGSCAN, the measures, what we have, and how we try to do what we do, and now I'd like to talk to you as best I can about how to take what we have available and help you apply it to your interests; and the answer to all of it is generally "that's about what we have," "it all depends," and "thank you for your attention. Any questions? Okay, that's it." So I tried to organize this as best I could in a way that makes sense to me, but I have to preface this by saying that I have been working with LONGSCAN for seven years, so I have sort of a unique understanding of the data that, as I'm going through this, might not make any sense to anybody else. I'm hoping in the end it will all come together for you. Some of it may seem a little mind numbing, and I'll try to get through that part as quickly as I can, but I feel that the more you know about it, from the really technical parts about how the datasets are named, to where you can find variables that all of you will need, to the end part, which is talking about attrition in the sample, the more smoothly you'll be able to navigate all of the datasets that we have. Now, I don't know off the top of my head how many datasets there actually are here over the timeframes that we deposited, but I know that where I sit, we're sitting on over 400 of them at this point. That's a lot to keep up with, so the more you know about the general structure, the way things work, and the evolution of how we got to where we are now, the easier things should be for you. So I've organized it into - and some of it will be a little bit of a repeat - when we collected it, how we collected it, and who from; the data structure, in terms of how datasets are named; scored data; recommendations for construction of datasets; linking observations in datasets; documentation; the maltreatment data; datasets that have useful variables; a couple of datasets, or ways to think about some of the data, that are more specific to some of the interests you have, from reading your interest sheets; and then some of the data nuances, which is always just a treat, and attrition. So just to get started, please interrupt me at any point if I'm not entirely clear, and I will not be entirely clear. It's one of those things where you have it all in your head, and getting it out to somebody else is not always that easy. I tend not to see the forest for the trees, so if you need me to back up and take a larger view, I'll try to do that as best I can. Some of this early part is going to be a repeat of what you've heard already, but I want to talk about it in depth because this is going to be important to reference - the data are structured around these things. Our face-to-face interviews were conducted at 4, 6, 8 and 12, our annual contacts in the off years, and our interviews are conducted separately with the participants and their caregivers; and I'm saying that again because the data are going to be structured in part by visit, by who the respondent was, and by the evolution of the data collection methods and the type of data management system in use at the time.
As was alluded to, the data management system has changed over time to keep up with the technology in use in the field. We started in 1991 and it continues currently, so we've had three evolutions of a data management system. In the early one, the data were collected paper and pencil, interview administered, and then entered by hand into the DMS. That covers up through Age 7, our annual contacts through Age 7. At Age 8 it was interview administered but computer assisted, so I think they did it with a computer and entered it as the interview was being conducted, but it was still interview administered. Ages 8 to 11 were administered in that data management system. Then Age 12 marked the beginning of the audio-enabled, computer-assisted self-interview format. That data management system goes from 12 to 18, and I'll tell you, 12 sort of changed everything in terms of the data, from what I have to work with. We'll get to that shortly. In terms of informants, we have the child informant, where data were collected at each face-to-face interview. At Age 12, the time and the amount of data collected from the child participant was astronomically different from any interview prior to that; a lot more data were collected directly from the participant. Those interviews also got a lot more complicated, particularly as administered to the child, so a lot of those questionnaires and measures were branching administrations. That's also going to be important in terms of knowing the data and what the data look like. Caregiver informants: the data were collected at all the face-to-face interviews and the annual contact interviews. CPS record reviews: those review cycles vary, so the data collection does not necessarily correspond to the face-to-face interviews. The reviews were done differently at the different sites. Generally no more than two years would pass without a comprehensive review of the participants, but sometimes those review cycles varied depending on access to CPS agencies at that time. We also collected data from the teachers, and there are the interviewers, who are not technically an informant, but they do provide interviewer ratings, and they also, particularly in the early part and then I think for the Vineland at 12, administered some developmental, cognitive and social measures to the youth participants. We have a combination of different structures of datasets. Some of our datasets are flat, indicating there's only one observation per ID; some people refer to those as wide datasets. Then there are stacked datasets, where there are multiple observations per ID; some refer to those as long datasets. Regardless of who the informant is, all the data are linked by the child's subject identification number, which is a combination of the study site and a unique numeric identifier. Okay, most of this stuff you're going to see as "general" in white, because it's all general, but there are a lot of exceptions. So the general rule of thumb is that there is a dataset for each measure, for each respondent, for each visit. But there are some exceptions. For our stacked datasets, the ones where there are multiple observations per ID, there could be multiple observations because the measure was administered at multiple visits, like the CBCL.
There are multiple referrals, like the CPS data, or, for example at Age 12, there are multiple respondents of the same type, so we may have had more than one teacher responding about the same child at the same visit. That's why there would be multiple observations per ID in some of our bigger, longer, stacked datasets. But regardless, particularly with the latter, with regard to teachers, or if it were the caregiver interview, or the child interview, or the annual contact interviews, all of the IDs are the child subject identification number. They're all linked that way. All the datasets can be merged by ID. Dataset naming. So, how many of you have gotten into looking at the data in the datasets? So you noticed the dataset names are a little bit funky? I'll see if I can tell you how we got there. Generally, dataset names can be broken down into the abbreviation of the measure name - in some cases it might be the construct name, it just depends; I take no credit or responsibility for naming them, I didn't do it - the form version, and the data closure date. For us, it's the data retrieval date, but for here, it's the data closure date, because the data we deposit here are closed out; there's no more data collection for that particular interview. For example, DEMA0404 is the demographics form, form version A, and the data closure date was April of 2004. Most of the dataset names are 8 characters long, except data from the 8 to 11 data management system; for whatever reason, that particular DMS we were working with - I think it was one of the earlier versions of FoxPro - wouldn't allow an 8-character name, so the form name is actually going to be a 3-character name with the 4-digit numeric closure date on the end. ^M00:10:22 So you can see the DEA0708 has a different closure date there, because it was closed out after the Age 4 closure date. And then there are a couple of exceptions, particularly at Age 12, and a couple at Age 4, where there might have been two different form versions, but they were close enough that it doesn't make sense to maintain the two different versions - so you guys don't have 400 datasets that you have to live with, we tried to combine them to make it a little bit easier for you, so that you don't have so much to stack and then merge and transpose and do all the acrobatics that you would need to do - and in those cases what we tried to do is drop the form version at the end, so those will be a 3-character name prior to the closure date. In some cases, we have measures that are really the same thing, but they were administered at different time points that were in different data management systems. So for example, the CESD, the depression scale, was administered at 4 and 6, and again at 12, but there are two different datasets for that, because the data management system changed. So there will be the DEPA and the DEPB, and the difference is the data management system in use at the time. That's not always the case, just generally. For example, the CBCL is collected at all of the time points, it covers all of the data management systems, but we've combined that for you. Same with the teacher form. Dataset names may change, or form versions may change, if there were any variations between form versions to the wording, the response options, or the order of the questions. Any change to a previous version gets a new form version, and a new dataset. That's how we got to 400 of the little buggers.
So for example, the demographics forms are here. We have the DEMA at 4, the DE6A at 6, the DEA at 8, and the DEMB at 12. All of those are caregiver demographics forms; they generally ask the same things, the same questions, but there are some slight differences, either to response options - for example, income: on the DEMA, and maybe the DE6A, I don't think there's a category 12, which is unknown, but in the DEMB there is - so it gets a different form name; and then you have to remember to code that as missing, because 12 doesn't mean anything, so watch out for that stuff. The variable names are going to be a reflection of the form name. So for the DEMA, those variables will be DEMA1, DEMA2, DEMA3. The DE6A will be DE6A1, 2, 3; and they're not going to be in the same order if the question order changed over the form versions, so the data dictionary will be your best friend. This next one is a bad example, so it's only a sort of, kind of example; I'll tell you what I mean, because I looked at it later and thought, okay, that just didn't make sense. If the same construct, like depression, is assessed over time, but we changed the measure, like the CESD at 4 and 6 to the BSI at 8, then back to the CESD at 12, then generally the mnemonic, or the dataset name, is going to represent the measure name, not the construct - and I'm going to follow up with an example that doesn't really show that. So for the CESD at 4 and 6, the dataset name is the DEPA, and then at 8 we changed to the BSI, so that's the BSA, and then back to the DEPB at 12. The part where it doesn't make sense is why the Age 4 and 6 one is DEPA and not CESA - so it's kind of, sort of an example, but anyway. Okay, if we have scored a measure like the CBCL, or the Vineland, or the WPPSI - well, not the WPPSI - those scores are going to be housed in a different dataset, so they're not included with the item-level data; there's a different dataset, and a different data dictionary, for all those scored data. Generally the scored data will have a partial mnemonic of the item-level dataset, but they'll end with an S, generally. So for example, the CBCL is in the CBCL dataset, and the scored data for the CBCL are going to be in the CBCS dataset. In terms of constructing analysis datasets: because some are stacked and some are flat, and particularly when we get to the RNAB, you don't want to try to merge them all together and make one big LONGSCAN dataset. Don't do it. It's a bad, bad, bad, bad idea. Don't try to do that. What you need to do instead is think about what your analysis questions are, look at the measures manual and the data dictionaries, find out what variables you need from what datasets, look at the structure of those datasets, figure out, if they're not in the same format, how to get them into the same format, and then do your merge at that point. That will make your life a lot easier than trying to figure out how to transpose all of this into one big, really monstrous, messy dataset. Alternatively, if you want to do a longitudinal analysis - for example, HO Amerson [phonetic], the GEE models - where you need a stacked dataset, then you again need to look at the structure and get all the datasets in the right structure so you can do a merge. In that case you would do it by ID and visit, so that you end up with this nice stacked dataset.
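To make the two assembly patterns just described concrete, here is a minimal SAS sketch. The dataset names (demo4, cbcs, dep_age4, and so on) and the id and visit variable names are placeholders, not actual LONGSCAN names; take the real ones from the data dictionaries.

```sas
/* Minimal sketch of the two assembly patterns described above.
   Dataset names (demo4, cbcs, dep_age4, ...) and the id/visit variable
   names are placeholders; take the real ones from the data dictionaries. */

/* 1. Flat merge: one observation per ID, variables side by side */
proc sort data=demo4; by id; run;
proc sort data=cbcs;  by id; run;

data analysis_flat;
    merge demo4 cbcs;
    by id;
run;

/* 2. Stacked (long) file for longitudinal models: tag each input with
      its visit, set them together, then sort by id and visit */
data analysis_long;
    set dep_age4 (in=a) dep_age6 (in=b) dep_age12 (in=c);
    if a then visit = 4;
    else if b then visit = 6;
    else if c then visit = 12;
run;

proc sort data=analysis_long; by id visit; run;
```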
And if you have any questions about how to do that, Holly will help you out. So, how to link the datasets and observations. Again, all the datasets can be linked by ID; that's the one thing across all the datasets you can use to merge them together. If you want to do a stacked dataset and pull them together that way, then you need to link by ID and then visit number. The visit number essentially is the interview number, and it corresponds to the age. I might use visit and age sort of interchangeably, but it means the same thing. So visit will be 4, 6, 8 and 12. Any questions so far? Super. Okay. So, critical, critical, critical information. I can't tell you how important it is to reference these pieces of information: the measures manuals and the data dictionaries. And you need both of them, trust me. You need them both, because they each contain unique pieces of information that you need to use in conjunction with each other. The measures manuals - there are three volumes, and that's the reference website; I think they're also available some way through you guys - and then the data dictionaries. The data dictionaries are on our website, but they're on our internal site, so you can't access them through there; we've sent them here, and I think that last document, just for 12, was 400 and something odd pages long. Just for 12. But you need it, trust me, you need it. So let me talk about the structure and what's in these things and why they're really so important and critical. For the measures manuals, again, we have three volumes. They correspond to the developmental periods that we've had over the study. The early one covers up to 3, the middle childhood one covers up to 12, and then the early adolescence one is 12 and 14 - but yeah, 12 and 14, because that will link up to what we have. Within these, there's an overall description of the measure, whether we developed it, or whether it was one that we borrowed or modified, or one that was a standardized measure out there like the CBCL. There will be the purpose, the conceptual organization, the item selection, the materials and administration method and training - that's just generally for whatever measure that was - and the scoring types and score interpretations. What I also want to say about this is that there might be some things in the measures manual about scoring where we did not score it. So there might be some hints about ways that the author, for example, has suggested you might use the data that we did not formally score or include. Good to look there. Norms and comparative data. Then the LONGSCAN use: when we administered it, who to, the mnemonics and versions of the dataset names; and if there's a scored dataset, it will tell you what that dataset is, which makes it easier for you to reference back to what you're trying to look for, because sometimes the measure name and the dataset name don't necessarily correspond very well. ^M00:20:08 The rationale for the administration, and any administration or scoring notes - also very important, very important here, because if there are any site deviations, any differences in the way that particular measure was done, that's going to be there; it will describe any differences and site deviations in administration. For example, it will tell you that the VICA, the caregiver victimization measure, was not administered at the southwest site.
It will tell you that the CTS parent to child at visit 4 was administered at the southwest site, and that at the northwest site the response options were different than at the other sites. So it's always very important to look at any deviations in site administration. Then we provide descriptive statistics and any reliability that we did with our sample - so those are descriptive statistics for our sample, how they performed on the measure - and then references and a bibliography. Okay, so for the data dictionaries, there is detailed information on every single variable in every single dataset that we have; and for the scored data, when we score the data for any measure, it is in its own dataset, it has its own data dictionary, and we include the algorithms for how it was scored and any notes on how to interpret the scores, or any other general information that you would need to know regarding the scored data, if there is anything beyond the obvious. The data dictionaries that you will have, I think, are arranged in the following order: there will be a table of contents, then the item-level data, then the scored data - so always remember your scored data are going to be at the bottom - and then there are appendices that are relevant to these data. For example, I think we sent the codebook for the CPS data, and I think we also included a tutorial on how to use the CPS data, the RNAB dataset I think it's called, so there'll be some additional information in those appendices for your use. This is what one looks like; I'm going to point out some important information, some little ways that you can figure out what's in where. They're all arranged in exactly the same way, where you have the variable name, format, description and the coding categories; they're all set up this way. The center variable is going to tell you who has data for that particular measure. Again, not all of the sites administered all of the measures at all of the time points, so if you look at that, it's going to tell you who has data there. And for those where we have stacked data, where we have multiple observations per kid, this will tell you what sites administered it at what time points. So you can see if there are any deviations, or if one site, for example, doesn't have a 6, and the others have 4 and 8 and 12 or something like that. So if you look here, it's going to tell you who has it and when. And then for each item, you'll get the description of what the item was and then the coding categories, so what all the response options are, because I don't think we have value labels for all the response options, so you can't just do a PROC CONTENTS and figure it out; you kind of need to supplement that. And then what I've highlighted here, this is one of our Age 12 measures, the delinquency survey - I just took part of it. So we go through, and the kid, again at Age 12 in particular, what the kid sees is basically one question coming up at a time. You know, "Did you take part in a gang?" And they answer that, and then the next question comes up, "Did you belong to a group others might consider a gang?" and they answer that. And then when they get to this one, "Were you in a physical fight?" If they answer no, then it goes to item 5; if they answer yes, then they get, "How many times were you in a physical fight?"
So if you, for example, were interested in using as a variable the number of times a kid was in a physical fight, then you have to understand that the DMS skips anybody who said no, and you're going to have missing data on that variable for everybody who said no. So you're going to have to go in and do a little bit of re-coding so you don't lose those subjects; you're going to have to say, okay, well, if ASDA4 is 0, then ASDA4A is going to have to be equal to 0 - sort of change your quasi-continuous scale a little bit, add another coding option - so you don't lose half your sample and then be putting all your datasets together going, "Why did I lose 500 kids? I don't get it." The measures manual has a link right to the form, so you can see on the form where the skips are, and then use this to supplement it; going through it that way will guide you on whether you need to do some re-coding or actual programming in terms of putting your datasets together. So this is a wonderful multi-informant, multi-method study. Yay LONGSCAN. And we have lots of different maltreatment data types. The CPS case record reviews - the very raw data that describe the actual coding and entry of the CPS records - are in the RNAB dataset. Then we have constructed a dataset that aggregates a lot of data out of that CPS data to make it a lot easier to work with. That turns the case record review dataset into a flat data file; it aggregates over all the data. So that's a derived set of data that we've done for you. Then, as Al mentioned earlier, there's some caregiver-reported sexual abuse, there's the [inaudible], and the [inaudible] - my measure name, I've heard of [inaudible]. In these datasets, there's also the youth self-report that we collect at 12. The parentheses refer to the scored dataset, so that's where you'll find the scores for those measures. We did not send the scores for the "about my parents" - technically that's the "multidimensional neglectful scale for parenting," I think is the technical name; it's a modification of the Straus measure - because we're still trying to figure out what we're doing with those scores. And then if there was a positive endorsement on the sexual abuse scale, they also got some supplemental questions on the SASA, but that's not necessarily scored. So those are where you can get some of the maltreatment data from different informants. Then if you were feeling really spunky, you could look at other possible places to get it from, like the conflict tactics scale; another possible place would be the conflict tactics scale parent to child. The 4 is asterisked because, again, the San Diego site did administer that at 4, and the response options at the Seattle site were slightly different than the other response options - there were some IRB concerns at their site, so they could ask whether it happened or not, but not the frequency with which it happened. So there can be some variations by site, all explained in the measures manual. Let's talk about the case record reviews. Al talked to you a little bit about this; I'm going to talk about it a little bit more, just in case you didn't get enough.
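Here is a minimal SAS sketch of the skip-pattern re-code described above. The item names ASDA4 and ASDA4A come from the talk, but the input dataset name (asda) is a placeholder, and the exact missing-value codes should be verified against the data dictionary.

```sas
/* Sketch of the skip-pattern re-code described above. ASDA4 / ASDA4A
   are the item names mentioned in the talk; the dataset name (asda)
   is a placeholder, and the missing-value conventions should be
   checked against the data dictionary. */
data asda_recoded;
    set asda;
    /* Youth who answered "no" to ASDA4 were skipped past the frequency
       item, so ASDA4A is missing rather than zero for them. */
    if ASDA4 = 0 and missing(ASDA4A) then ASDA4A = 0;
run;
```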
So the case record reviews are done for each subject, depending on when the site could get in and review; they're supposed to be lifetime reviews at each cycle, so that they don't miss anything that happened since the previous review. There is really a tri-coding of allegations and findings: we code the CPS label for maltreatment allegations, findings and risk factors; then we code them using the NIS system; and then we code them using the MMCS system for allegations and findings. For the first we use the CPS labels, and for the latter two we use the allegation narrative and the summary findings. The summary findings and substantiations are only going to be available if the case was accepted for investigation - at least that's the way it's supposed to work. Each type of coding offers a different perspective and slightly different information, so it's always good to read what's available and which data your questions would be best suited to. So, the observations in this dataset, the RNAB: any given observation reflects a referral, a new referral to CPS. Any referral can have up to six allegations, and six substantiations. The number of observations and referrals is going to vary across participants depending on how many referrals there were, and the number of allegations for any given referral is going to vary depending on what was alleged when the referent called in. The data in this dataset are not organized by age of the participant or interview cycle; there's nothing in there that you're going to be able to use to merge in. You're going to have to use a combination of the date of the referral and the subject's date of birth to figure out when it happened, and it takes a little work. If you want any tips, you can ask Chris. >> They're not by ID number, child ID? >> Yeah, it's by ID, but you don't want to merge it that way, trust me. I'll show you in a minute why you don't want to do that. ^M00:30:02 You don't want to do that. The visit number in the RNAB kind of reflects which visit the review was done closest to, and the form sequence - how many referrals were coded during that review cycle - so it's going to be something like a visit 220. You can't merge in by that visit; it doesn't fit with any of the other visit numbers for the face-to-face interviews, it doesn't help you, and it doesn't line up with anything. So you can merge by ID, but you're going to have to work with the data before you can merge it at all. The great news is that these data are structured to provide the absolute most flexible use of the data. You can get anything you want, you can structure it any way you want, and you can figure out what timeframes you want out of it. Super, super flexible, but you may have to spend a considerable amount of work massaging that dataset to get it to work the way that you might want it to. It takes a little programming skill to do that. As such, there's a tutorial provided for you on how to use those data if you are brave enough to want to do that. And you might need to do that. For example, Meghan wanted to know the severity of the first incidents of physical neglect reported for the kids in the sample. Well, you have to go to this dataset to get that out. This is an example, a very small piece, of what the structure of this dataset looks like.
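As a rough illustration of the date arithmetic just described, here is a hedged SAS sketch of working out roughly how old the child was at each referral. The variable names (ref_date, dob) are assumptions rather than the actual RNAB names, and because all dates are de-identified to the 15th of the month, the computed age is only approximate.

```sas
/* Hedged sketch of the date arithmetic described above. The variable
   names ref_date and dob are placeholders; check the RNAB data
   dictionary for the actual date variables. Dates are de-identified
   to the 15th of the month, so ages are only approximate. */
data rnab_age;
    set rnab;
    length window $6;
    age_at_referral = (ref_date - dob) / 365.25;   /* age in years */
    /* Group referrals into windows that roughly match the interviews */
    if      age_at_referral <  4 then window = '0-4';
    else if age_at_referral <  6 then window = '4-6';
    else if age_at_referral <  8 then window = '6-8';
    else if age_at_referral < 12 then window = '8-12';
    else                              window = '12+';
run;
```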
So for this one, this little guy has one observation here; there's only one referral of maltreatment for this little guy. And I just took little pieces of what we might have, so it looks like this guy has two allegations in his narrative. We get the maltreatment code, the severity; we have up to two perpetrators coded for any allegation, along with their gender and age. And then we go on to the second allegation, and on out to six. We also get who the referent was, so who called in the allegation; whether the case was investigated; again, the CPS labels, including risk factors and what they labeled the maltreatment; and [inaudible], and then everything is repeated for substantiation. So all of the allegations are repeated, and then there's a conclusion code for whether or not it was substantiated, new severity, perpetrators, etc., etc., etc. This guy had two reviews, and during this one, there were two referrals. In this referral he had one allegation, in this referral there were two allegations. In this review, only one; this guy had two allegations for this particular review. What I'm pointing out here is that the number of observations per ID is going to vary depending on what was found in the review cycles for these individuals, basically from birth through 12. This dataset currently houses 4,000 observations and covers 914 kids. The number of reports per individual: 214 subjects have one report - or referral, I should say - 19 subjects have 10, and then there are two subjects that have 20. Now, you notice this is 914 kids, so if there are no allegations, there's no report, and there's no observation for that kid in this dataset. So if you try to merge it in, you're going to have missing data for the rest of the 1,354, however many that is. If there's no report, there's no observation in this dataset. Now, I will say we have looked at and used number of allegations versus a yes/no, there was an allegation, and really the data are pretty skewed and almost come out bimodal anyway. You can see it's pretty heavy on the less-than-5 end, and then it drops off dramatically from there. Usually we've found that if we've modeled something with number of allegations, and then modeled it with a yes or no, there's no difference in the outcome or associations - but again, it will depend on your research questions. So if we have 1,354, and you were merging in this dataset with 4,000 observations, this is why you just wouldn't want to do that. Okay, so just in case you don't want to deal with that, we've developed some ways to help you work with this a little bit better, and that is the Underbar SD [phonetic] dataset, or we'll call it the MSD dataset. Basically what we've done is try to make this a little bit easier for you to work with, so you don't have to take these and do a bunch of arrays and transpose it and then merge it in. We've tried to do all of that for you, to get it down to one observation per ID. We've aggregated a lot of the information, and we've added, in addition to that, a lot of the variables that were used in the dimensions papers from the 2005 Child Abuse and Neglect special issue, in terms of maximum severity, expanded hierarchical types, single versus multiple types, and chronicity as well. So these are relevant to the data that were collected; we've just aggregated them for you, and we've done it into timeframes that correspond to our face-to-face interviews.
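In the spirit of what the derived file does, here is a rough SAS sketch of collapsing the referral-level RNAB down to one observation per child: a referral count plus a yes/no indicator, with zeros filled in for children who have no row in RNAB at all. The dataset and variable names (rnab, ids, id) are placeholders; the deposited MSD file already carries much richer versions of these variables.

```sas
/* Rough sketch: one row per child, with a referral count and indicator,
   and zeros for children with no RNAB rows. Names are placeholders. */
proc sql;
    create table referral_flat as
    select m.id,
           coalesce(r.n_referrals, 0)   as n_referrals,
           (calculated n_referrals > 0) as any_referral
    from ids as m
         left join
         (select id, count(*) as n_referrals
            from rnab
           group by id) as r
         on m.id = r.id;
quit;
```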
The information contained in the MLTX, which was deposited with the last deposit of the 8 to 11 data, and in the MSD: the variables included are all the same, the kinds of information are the same, but the timeframes are different. In the MLTX, the timeframes were based on the chronological age of the child. The MSD is based on the date of the interview, so the date that the child had an interview. If the child or caregiver did not have an interview in that timeframe, then we used the chronological age to substitute in for when that timeframe would be. It's easier for us when we archive it here, because the interviews do not always occur at or around the child's birthday for that particular interview period, and you don't want to use chronological age if the interview happened prior to that, or you'd be using information that occurred after the fact. So we thought it was easier and made more sense to make these timeframes correspond to the interview - and that was Al's request. He's from the San Diego site; you can thank Al for making your analyses cleaner. So included in this dataset are allegations and referrals, the CPS determinations, and referrals based on the summary and the referral narrative. We have number of allegations, number of reports, and indicators - so if you don't want to use counts, you can have the yes/no indicators. We have indicators and counts for type of maltreatment: physical, sexual, emotional abuse, and neglect broken down into failure to provide and lack of supervision. We have educational neglect, moral/legal, and drugs and alcohol. We have whether there was a single type of abuse referred or multiple types of abuse referred within that particular timeframe. Combinations of maltreatment types - the expanded hierarchical variable, which is: it was sexual abuse alone, or physical abuse alone, or physical and sexual abuse, or neglect and other types together. We have the maximum severity for each type of abuse, and this chronicity variable - which, you should know, only goes to 9-1/2; it doesn't go all the way to 12, it goes to 9-1/2. And all of these are broken down into timeframes from 0 to 4, 4 to 6, 6 to 8, 8 to 10, 10 to 12 and 8 to 12. That makes it very easy to aggregate over timeframes if you're interested in, say, 0 to 6 and 6 to 12 or some combination of those. If you want to go backwards and do 0 to 2, then you need to get friendly with the RNAB dataset, because we don't have that right now. Maybe in your next installment, but not currently. Now, just an example of this structure: see, it's nice and clear, there's one observation per kid, that's how it works. So for this one, the number of physical abuse allegations from 0 to 4 is a 2; substantiations is a 0; maximum severity on across; and then this would be 4 to 6, which is 0. So you have that for each subject, all the way across the timeframes available. And in this one, there are 1,354 subjects. So it's all the subjects, not just the ones with a report; and if they didn't have a report, then they'll have 0's filled in for you. So you don't have to fill in your missing data; it's all there, 1,354. Okay, so that's what I just said. This represents aggregated data, but you won't have data specific to any particular referral or allegation; and if you're interested in a timeframe that we don't have divided up already, then you'll have to go back to the other dataset to work with that.
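Here is a minimal SAS sketch of aggregating the pre-built timeframes into broader windows, such as 0 to 6 and 6 to 12. The variable names used (pa_alleg_0_4 and so on) are illustrative placeholders, not the actual MSD names; take the real ones from the MSD data dictionary.

```sas
/* Minimal sketch of collapsing the pre-built MSD timeframes into
   broader windows. Variable names are illustrative placeholders. */
data msd_windows;
    set msd;
    /* Physical abuse allegation counts collapsed into two windows */
    pa_alleg_0_6  = sum(pa_alleg_0_4, pa_alleg_4_6);
    pa_alleg_6_12 = sum(pa_alleg_6_8, pa_alleg_8_10, pa_alleg_10_12);
    /* Yes/no versions, if indicators work better for your model */
    pa_any_0_6  = (pa_alleg_0_6  > 0);
    pa_any_6_12 = (pa_alleg_6_12 > 0);
run;
```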
^M00:40:08 Okay, so, notes of caution about either of these two datasets. The absence of an observation in the RNAB does not necessarily mean that there was no maltreatment of that subject. You have to remember that these are CPS data we're working with, and there are problems inherent in using data from these social service agencies. In some cases, if we waited too long to review and the case wasn't accepted for investigation, they could have expunged the records, so they're not there to review even though in fact there was a referral; it just wasn't in the records at the time of review. So it doesn't necessarily mean there wasn't any maltreatment, but we work with what we have to work with. In the aggregated data, though, in the MSD, no record in the RNAB is treated as no maltreatment. Not ideal, but to some extent it's the best that we can do. Okay, so let me talk to you a little bit about - thank you. >> Are there datasets with not useful data? >> Yes. I'm sure there are. But these are datasets where you almost always are going to have to pull something out, in terms of reporting your sample or control variables, things like that. So these are places where you can find the data that you're almost always going to end up using. The IDS file is what we refer to as our master file. You could get child gender and child race from other forms at different times; I wouldn't recommend it. This is our master file. This IDS file is invariant, it's not going to change. It's data we collected at baseline, because I tell you, we've got kids that change race throughout the study, I'm not kidding. There might be one or two that change gender; I'm also not kidding. Because, again, depending on the age at the interview, this is data that the caregivers or the youth are entering in themselves. We do checks to the best extent that we can on these, but the master file is right here: child gender, child race, child date of birth. We have interview indicators to help you if you need to do any kind of subsampling. Say, okay, well, I know my outcome is at 8, so if they don't have an Age 8 interview, it makes it a lot easier to subsample my data by dropping anybody who doesn't have CH_AGE8 equal to 1. So there are interview indicators in there that tell you whether the youth or the caregiver completed an interview by our standards, which are pretty lax. If they have a 1 there, that doesn't mean that there aren't going to be missing data for whatever you're interested in; a completed interview doesn't mean there are no missing data. This is, depending on the interview, up to 2-1/2 hours' worth of measures for each respondent, and sometimes they skip things. And then study site. I've asterisked date of birth here, because for any of you that have already been in the data, you know that we try our best to de-identify participant information as much as we can. Any dates - referral dates from CPS, interview dates, children's dates of birth - all of the dates are de-identified to be the 15th of the month, whatever it is. So if you've gone into the data and said, "Wow, look at that, all their birth dates are on the 15th, cool, how did they select that sample?" - that's the way it is. It's not going to make any difference, because it's just going to be off by plus or minus, you know, less than 15 days.
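A small SAS sketch of the subsampling just described, against the IDS master file. CH_AGE8 is the indicator as spelled out in the talk (verify the exact spelling in the data dictionary), and outcome8 is a placeholder for whatever Age 8 dataset holds your outcome.

```sas
/* Sketch of sub-sampling with an interview indicator from the IDS
   master file. CH_AGE8 follows the name spelled out in the talk;
   outcome8 is a placeholder dataset. */
proc sort data=ids;      by id; run;
proc sort data=outcome8; by id; run;

data age8_sample;
    merge ids (in=in_master) outcome8;
    by id;
    /* keep only youth flagged as having a completed Age 8 interview */
    if in_master and CH_AGE8 = 1;
run;
```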
But I did want to point out that it is de-identified to the 15th of the month. Okay, other datasets: the cover sheets. There's a cover sheet completed for each interview for each informant, other than teachers. For the caregiver, it will have what their relationship is to the child, and it will have the date of the interview. So for the child, these are the cover sheets; the 4 is asterisked here because the eastern site did not complete cover sheets on their kids - I don't know why. These are the caregiver cover sheets, and then these are the annual contact cover sheets. So those are there. Another one that we deposited very last minute was the derived household composition dataset. Again, what we tried to do is that at 4, 6, 8 and 12, we gave different measures to try to figure out who was in the household and who was living with the child. The way those data were collected, the response options, and the way you could code who that person was changed from form to form, and if you were looking over time, it was a little difficult to try to merge in all of these forms when the coding options changed. In one, it might have only gone from 1 to 8, and then in the next variation it went from 1 to 15, so it could get a little hairy in terms of getting all this collected into one, so we aggregated all of this, again, in the derived household composition form. What you can get out of this is the respondent's gender and relationship to the child at each time point - annual contact interviews are not included in this - the foster status of the caregiver, the number of adults, children and people in total in the household, indicators for the presence of particular household members, whether it was a multi-generational household, and then there are three variables that attempt to get at the general structure of the household in terms of family structure, household composition and living arrangement. All of this is in the data dictionary. So, a couple of datasets, or ways to conceptualize data, that I thought some of you were particularly interested in, including the caregiver arrangement or the foster issue, and another was history of victimization. I'm going to go quickly through this because Al covered a lot of it. Ways to figure this out, or options to figure this out: first of all, all of the participants from the northwestern site were removed from their homes prior to Age 4. There you go. Southwest site, sorry. >> [Inaudible] just make it easier. >> From the San Diego site. There's one. The caregiver relationship to the youth at the time of the interview - those are in the derived household composition forms; the cover sheets at the face-to-face and the annual contact interviews will indicate on there if that's a foster caregiver; the household composition forms, as well as the life events scale for children - I'll get to that in a minute. So you need to look at the data dictionaries again, very carefully, because from 4 to 6, in the cover sheets, the distinction between kin and non-kin foster was not made, and after that it was made, so it doesn't exactly line up between 4 and 6 and the later time points.
And if you want to look at the life events scale - again, the advantage of that is it was collected every year, so you have the advantage of looking in between the face-to-face interviews - there is a question that asks whether, in the last year, the child has moved away from the family, and if the answer is yes, then there are some questions about the number of times they've moved for various reasons, and one of those would be the number of times moved into foster care or a group home or shelter. These datasets are the LECA, which is at 5, 6 and 7 - and I asterisked 5 because the Baltimore site did not administer it at 5 - the LUB, which is the 8 to 11 data, and then the LECC at 12. I should note that the LECC version asked whether or not they moved to foster care, but doesn't say how many times; there's a slight variation in the response option there. I'll try to move a little bit faster. This is what I promised, in terms of who the caregiver was across time. In most cases it's the bio mom, but you'll see that there are different caregivers represented, and the percentages of who those caregivers are do change to some degree over time. Foster mother is again asterisked because, depending on the time of administration, the characterization of a foster caregiver is different depending on the cover sheet. Okay, going quickly. The caregiver victimization and loss measure was actually split into two separate measures, one assessing loss, one assessing history of victimization. The VICA, the victimization part, was not administered by the San Diego site. Okay, data considerations. The first sort of funky thing is that LONGSCAN considers as its baseline sample anyone participating at 4 or 6. That's alluded to when you look at the number of interviews completed for each timeframe: we don't have 1,354 at 4, and the reason why is that there are 104 participants who don't have a 4, but who have a 6. Site samples vary by maltreatment risk and age at entrance into the study. That means that the ages of our kids are going to vary depending on what site they were from, and so does the age distribution within a site. At one site, the kids were all about the same age, but for all the other sites, there's a continuum of ages within the site, so there could be more than one interview being conducted at any one time - not with the same participants, of course, but there are age ranges in their samples. So you could have younger kids and older kids, but there's one site that only had kids of one age. ^M00:50:01 So only one interview was being administered there. Why is this important? Well, because if you look at things like poverty or welfare, or welfare reform, or service receipt, you might want to keep in mind, if you're just going to look at Age 4, that when we started, the first interview for Age 4 was conducted in '91, and by the time we finished, it was 2000. So you might want to consider the cohort situation if you're looking at something like social services or welfare or poverty or something along those lines. Age 4 doesn't mean Age 4 at one time for all folks; there was quite a range in the administration of those.
Also, in terms of the age range - this is why looking at the interview, or the date of the interview, is important, and not assuming that an Age 12 record means they were 12 at the time that interview was administered - when you look at the means, they fall in about the places they're supposed to, but if you look at the minimum and maximum, we've got kids that were 7 when they got the Age 4 interview, and that can depend on when the interview went into the field and how old the oldest participant was at the oldest site. And it's important to get data, so we got data. Okay, attrition. Attrition at LONGSCAN is very hard to talk about because it's not like our sample falls out and then never returns. That is the case for some kids; in some cases they miss an interview but return for other interviews; in some cases they miss a lot of interviews and return for another interview. The number of interviews may vary. We can have a child interview completed but not the caregiver, or the caregiver and not the child. It's pretty complex to talk about, really, and the study hasn't ended, so it's very hard for me to tell you what the attrition rate is in LONGSCAN - I don't know what the attrition rate is yet. We've got kids coming back at 18, and I don't know how many there are, but I know that there are some kids coming back who we haven't seen since Age 6. So it happens. Overall, there are three types of attrition: those who were approached but didn't consent; those who consented but didn't participate; and those who participated but didn't complete. For LONGSCAN in particular, we have very limited data, and not across sites - it's not in any database - so we can't really talk about who was approached but didn't consent. For consented but didn't participate, we have those who consented and participated in the baseline only, where we have no data up through 12. I can't say that there won't be data at any time point, but I refer to them as suspected drops at this point. And then there's participated but didn't complete; in these cases, the number of completed interviews varies across individuals, and the sequence of responses varies across individuals, so the pattern of completion varies. The reasons for attrition are not any different from any other longitudinal study: death, participant withdrawal, and lack of contact. And participant withdrawal is even a little fuzzy, because at 18 they can consent for themselves, and some do come back into the study whose participation was previously withdrawn for them on behalf of the caregivers. So at 18, they can consent for themselves, and they do come back in and decide to participate. And then lack of contact - those would be cases where the study has attempted to follow them, attempted to re-contact them, and lost touch, and some period has occurred where they just haven't been able to follow them; and a lot of new and interesting things have been occurring in attempts to relocate these people, including Facebook, Myspace, and other efforts to try to track down these missing participants. Some of them do find their way back into contact with us, and the ways to find people evolve.
Just briefly, know there are types of missing data: one in which the participant didn't complete the interview, and one where there was an interview done but they didn't complete all the responses or all of the measures within the interview. So there's item non-response, where they didn't complete items within a measure, and unit non-response, which is when they didn't complete a measure at all. And there are some issues that you need to be concerned about for either one of those, in terms of the randomness of who did or didn't participate, or what data were or were not missing. Our starting baseline sample, sort of kind of technically, is really 2,708, because we collect data from caregivers and from our child participants, or about our child participants. Our child participants stay the same - the 1,354 are the same kids - but our caregiver participants may vary over time. So as long as we have a caregiver interview, it doesn't matter who it was, even if it changed over time; we don't consider that caregiver as having attritted out of the study when a new caregiver replaces them. So we're really only counting the 1,354 here. The way LONGSCAN counts its interviews is whether there is a caregiver or a child interview. However, in the LONGSCAN master file, the IDS file, you can look at those separately; there are indicators for the youth and for the caregiver. But we count them - our interview counts are based on having either one. Okay, so say you want to look at attrition by counting the number of interviews completed. Well, one problem is that we have 104 who were added, or who have a 6 but not a 4, so you can't look at the number of interviews and say, well, they should have had a 4, 6, 8, 12 - four interviews - so anybody with fewer than four is a dropout. That won't work. And also, they may still be active even if they have not completed the full sequence of interviews. So we start with the baseline sample, and you can conceptualize it easily: either they completed it all or they didn't, but you probably won't use your data that way. You're probably not going to start out and say, okay, if they didn't complete all four interviews, or their baseline plus 8 and 12, I'm going to drop them, period - because they could have completed the interview you're interested in, so you don't want to drop them out that way. So it really doesn't make sense to conceptualize it as they either completed it all or they didn't. There are alternative ways: there are completers; then there are drops, those who don't have anything past the baseline; and then there are partial completers, and those partial completers may have been partial sequential completers or partial non-sequential completers. And, you know, does it really matter whether they're partial sequential or partial non-sequential completers? Probably not. So we put them in a partial completer group, plus completers, and then it's just back to drops, where there's no post-baseline data. So here's looking at our completion rates over time. At Age 4, looking at the 1,250 out of the 1,354: 92%, 91%, 84%, 72% - these are all relating back to the 1,354 sample. If you look at the number of interviews completed, again knowing there's the problem of the 104 that have a 6 but not a 4, the mean is about 3.4, and 60% of the sample completed all four, with 26%, 9% and 5% completing fewer of the interviews. I'm going really fast, so just stop me if you have any questions, but all of this is in the handout.
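Here is a hedged SAS sketch of the completer, partial completer, and suspected-drop grouping described above, built from the interview indicators in the IDS master file. The indicator names (CH_AGE4 through CH_AGE12) follow the naming pattern spelled out earlier in the talk but are assumptions to verify against the data dictionary; the grouping rules follow the definitions given here (completer = baseline plus an 8 and a 12; suspected drop = one interview only).

```sas
/* Sketch of the attrition grouping described above, from the IDS
   interview indicators. Indicator names are assumptions to verify. */
data attrition_groups;
    set ids;
    n_completed = sum(CH_AGE4, CH_AGE6, CH_AGE8, CH_AGE12);
    baseline    = (CH_AGE4 = 1 or CH_AGE6 = 1);

    length group $20;
    if      baseline and CH_AGE8 = 1 and CH_AGE12 = 1 then group = 'completer';
    else if n_completed <= 1                          then group = 'suspected drop';
    else                                                   group = 'partial completer';
run;

proc freq data=attrition_groups;
    tables group;   /* distribution of the three attrition groups */
run;
```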
Now, looking at the retention rate between interviews: between 4 and 6, for all those who had a 4 who also had a 6, it's 85%; for all those who had a 6 and an 8, 81%; and then from 8 to 12 it's 67%. So it's interesting: if you look at between 8 and 12, there's a 67% retention rate, but if you look at 12 relative to baseline, it's 72%. So again, it shows that subjects move in and out of the study. Looking at the percentage of completers, partial completers and suspected drops: 65% are considered completers - they have a baseline plus an 8 and a 12; partial completers, 30%; and then only 5% are suspected drops, right now at 12, for what you have. Suspected drops have only completed one interview, partial completers have completed more than one but fewer than four, and completers have completed all of them. And then, what I've done here is I've looked at three columns - the columns sum to 100 for any given demographic or attribute - in terms of completers, partial completers and suspected drops for males/females and our racial distribution, and there are very few differences. I didn't go to the extent of analyzing these or showing you differences, because it really is going to depend on your specific analysis question and what your analysis sample ends up being, so it doesn't make sense for me to go through that exercise; but I can tell you that to some degree there's a site difference, and to a small degree there's an ethnic difference - I think we lose some of our white kids. And then here's the breakdown by recruitment status, maltreatment and site: very few differences. So for the most part, our sample characteristics stay pretty close to the same over time. ^M01:00:08 Okay, and just to show you some of the craziness in terms of our partial completers - so these are just the partial completers, those who didn't complete all of them, and this is a list from cross tabs - this top 49% completed 4, 6 and 8, but no 12. Then these completed a 4 and a 6, but don't have an 8 or a 12. These completed a 4 and a 6, and then a 12, but missed the 8; and you can see the pattern on down, to this one that has a 4 and a 12 but nothing in between - 3% of those. Just an interesting pattern of completion among the subjects. Okay, then in terms of dealing with missing data, there are all sorts of different options available to you - [Inaudible] and some references if you're interested in learning more about dealing with missing data. The end. Any questions? Sorry to go so long. If there are any questions after that, [applause] Holly will answer them for you. ^E01:01:13