[musical cue] [Voiceover] National Data Archive on Child Abuse and Neglect. 
[Clayton Covington] Okay it is that time so welcome everyone to the second session of the 2023 summer training series hosted here at the National Data Archive on Child Abuse and Neglect. We're really excited to get started with this presentation today. Next slide please.

So again this is hosted by the National Data Archive on Child Abuse and Neglect which is housed at Cornell University and Duke University. As indicated on the first slide we're going to have our presentation and that's going to be followed up by Q and A which I will facilitate. We ask that you place all of your questions in the Q and A text box that should be available on your Zoom screen and then I will read them aloud, to which the presenter will then respond. Next slide please.

So again this is just recognition of the archive itself as well as the Children's Bureau that funds the National Data Archive on Child Abuse and Neglect. Next slide please.

So to give you all a little rundown of what to expect for the remainder of the series. Today's presentation is about a new data acquisition we have here at NDACAN with the CCOULD data set. And next week we'll follow up with a presentation about causal inference using administrative data followed by evaluating and dealing with missing data in R, a time series analysis workshop and then closing out with a data visualization Workshop in R by various presenters affiliated with NDACAN. But I'm going to turn it over to Ben Allaire who's with RTI who's going to lead us in this introduction to the data so I'll pass it off to you then.

[Ben Allaire] Thank you. Yeah so let me just kind of give you a little rundown of what we're gonna go over today. So in terms of the session agenda I'm gonna start off with a quick introduction of who I am and then what we're going to talk about today is what is known as the CCOULD data, and it's the child and caregiver outcomes using linked data. And specifically we're going to kind of walk through why it was created and then an introduction to what child welfare data we have in the CCOULD data set. And then you know my hunch is that the audience here probably skews towards child welfare data familiarity and may have less familiarity with Medicaid data so I do want to give a little intro to claims data. And then we're going to talk a little bit about linking child welfare and Medicaid data, and then we'll get into the limitations of the CCOULD and finally we'll hit up how to obtain the CCOULD. So as Clayton said before my name is Ben Allaire. I am a senior research economist at RTI International. I've been there for 16 years. Here is a picture of me with slightly less gray hair, and for the relevance for today I was the data linkages lead for the CCOULD project and so I'll be walking through it today.

The CCOULD project, you can see our logo here, why was it created? We'll talk a little bit about that. So essentially the CCOULD was created because you know research has shown that one in three children in foster care entered due to parental drug use. And in 2018 with the Family First Prevention Services Act there were a bunch of new incentives for states to evaluate the effectiveness of services provided by the child welfare system and/or Medicaid on families' risk of entering or re-entering the child welfare system. So the reason this data was created is because what is known about the combination of child and parental needs is really limited. So like we just don't really have a great view into what those needs are. And unfortunately child welfare data alone cannot provide this kind of holistic picture of the supports and services needed by families in crisis as a result of parental drug use. So you know linked administrative data sources such as CCOULD can help provide state agencies and researchers with data to assess whether parents and caregivers are receiving services that are needed to treat substance use disorder or SUD and the impact that treatment could have on child welfare outcomes. I think this type of evidence is really needed these days and so that was kind of the impetus for why the CCOULD was developed. So what are possible research questions? So these are just a short list of possible research questions. Obviously there are many more and ones that we could not even anticipate for use with the CCOULD data set. So you know things you could look at are what demographic characteristics, clinical characteristics, and medical care utilization patterns are associated with parents and caregivers receiving Medicaid funded SUD or substance use disorder treatment. What demographic characteristics are associated with parents and caregivers receiving Title IV-E funded SUD treatment?
Then you know we have like a list of what impact does the receipt of mental health diagnoses and treatment have on child welfare outcomes? How does child eligibility for Medicaid impact caregiver utilization? So I think that's trying to get at maybe some of these gateway hypotheses that child eligibility for Medicaid also helps caregivers get on Medicaid as well. And you know as an economist I often wondered you know what are medical costs associated with child maltreatment? There are some papers that do this but the hope is that this CCOULD data could really help illuminate what those medical costs are.

So who created the CCOULD? So in the CCOULD there are two states, so we essentially went to states and we recruited two states as part of the CCOULD to link their child welfare administrative data with their Medicaid medical claims data. And the two states that were recruited as part of this were Kentucky and Florida. And so on the left here you'll see kind of the agencies that were involved in this. So for the federal government this was funded by ASPE, which is the Assistant Secretary for Planning and Evaluation, and ACF which is the Administration for Children and Families. They funded RTI and RTI went out and recruited these two states and then we collaborated with those states to develop the data. The states had kind of two different models for developing the CCOULD data. So we worked with the Florida Department of Children and Families, and the University of Florida was predominantly the research partner, and then we also worked with the Florida Agency for Healthcare Administration or AHCA and they provided the Medicaid data. And so both of these two departments, DCF and AHCA, provided data to the University of Florida which conducted the linkage and also did kind of the data management and then passed that data to RTI. So they were the university partner and coordinating center. In Kentucky you can see our federal partners are there and RTI is there. We worked with the Kentucky Cabinet for Health and Family Services and the coordinating agency there was the Office for Health Data and Analytics or OHDA, and so they sort of served as this kind of coordinating agency although they're housed under the state government of Kentucky. And so they were able to pull resources from the Department for Medicaid Services, the Department for Community-based Services or DCBS, and the Office of Application Technology Services or OATS, and they all supplied the child welfare and Medicaid data.

So as an overview of CCOULD. Let me just bring that to you here. So what are CCOULD? CCOULD are de-identified, linked longitudinal child welfare and Medicaid claims data for children and caregivers from the two states that I mentioned previously, Florida and Kentucky, and they're available from NDACAN. It's a really unique data set. I will say that linking child welfare data to Medicaid claims has been done in other data sets previously, and in fact I'll present some prior research of mine that used child welfare data linked to Medicaid data. But what really makes this data set unique is also having those caregivers linked to Medicaid. And you know as you can probably surmise from the research questions that I talked about previously these can be used to examine relationships between medical utilization, health and behavioral services, patient-centered outcomes, and child welfare outcomes. And I'd be remiss if I didn't mention there is a Lessons Learned report on the ACF website and the link is here [ONSCREEN https://www.acf.hhs.gov/opre/report/ccould-lessons-learned] in the slides, and I think these slides will be provided afterwards. You know there's a lot of really interesting detail in there about the process that we went through to link these data and kind of some nuances about the data.

So what files do we include in CCOULD? So this is a table that is kind of an overview of those files. There's a child welfare report file and that's administrative data on child welfare events. There is a child foster care episode file which is administrative data on child placement episodes, so if children are placed outside the home. There is a Title IV-E services provided file and that's administrative data on Title IV-E services provided to families. Those files tend to be kind of a bit limited to whether or not the service was provided and so that is something to consider. And then we have Medicaid enrollment files so that is dedicated data on Medicaid enrollment and eligibility of the person, so how they became enrolled, when they were enrolled and so those reasons there. And also Medicaid claims files so that is data on inpatient, outpatient, long-term care, and prescription drug claims and again this is only for Medicaid.

Okay so an introduction to the child welfare data in CCOULD. So one of the things that we had with this project was trying to recruit states and talk to them about developing a common data model so that data could be consistent across states and so that data elements were the same between those two states. And so in order to make those the same we followed the data structures of NCANDS, the National Child Abuse and Neglect Data System, and AFCARS, which is the Adoption and Foster Care Analysis and Reporting System. So as much as we possibly could we adhered to that because that was data that states already were contributing to national databases and so therefore it would minimize the burden on those states. So I think for child welfare researchers that's a nice feature because if you are familiar with NCANDS and AFCARS, generally speaking you should be able to work with the child welfare data in CCOULD, and it reduces the amount of manipulation you have to go through because you're already familiar with it. So that helped us in aiding data acquisition from states and helps facilitate the research use.

Okay the child welfare variables in CCOULD. There is maltreatment report information including report dates, dispositions, there is maltreatment data, there is also perpetrator data, child and caregiver risk factors, and services information. So there's actually really like a wealth of data in terms of the child welfare variables that you can use in order to answer your research question. There is also foster care and adoption data including reasons for removal, removal placement dates, number of placements and placement settings. And you know I think the foster care and adoption data took a little bit of effort to pull together but it is a very rich set of data to understand child outcomes for this.

Okay because I think this is generally speaking a child welfare audience, I did want to give a brief guide to using Medicaid claims data in hopes of demystifying claims data. Looking at them at first, they are large files and they can be a little daunting to use, but once you get used to them I think you'll find that they can be used for a lot of different research questions. So let's talk a little bit about Medicaid. Medicaid is a government-run program that provides health care to low income individuals and families. And the latest estimates suggest that over 82 million Americans are on Medicaid and of those 82 million, 39 million are children. So how many of those are in CCOULD? So CCOULD contains approximately a million linked children and 90,000 caregivers so there's a lot of sample in here to conduct research with and so that's really good. Children in foster care are generally eligible for Medicaid until they reach the age of 18 regardless of their family's income level or medical needs. So that has to do with the eligibility and enrollment of folks on Medicaid. Claims data in particular are generated when healthcare services are provided to Medicaid beneficiaries. So there is a bill that providers will bill Medicaid for and that's what the claim reflects. And claims data include information about the healthcare provider, the service provided, in particular the diagnosis, and the cost of the service. So that's kind of what's included in a single Medicaid claims file.

So we're going to dig just a little bit deeper. You can dig even deeper than this, I will tell you there's more complexity here, but this is a nice kind of overview of the different files. So as discussed previously we do have enrollment files and in CCOULD these provide information on how long the individual is enrolled in Medicaid, so if you're looking at a change in eligibility for Medicaid or a change in enrollment these are the files that you would look for. Or oftentimes I feel like in research studies there's a certain inclusion criterion for how long people have been enrolled, we want to make sure that they've been enrolled in Medicaid for six months or nine months or 12 months. So these would be the files that you would use for that. There are inpatient files or IP files and these contain information about healthcare services provided to patients during an inpatient hospital stay. And so they have admission and discharge dates and diagnosis and procedure codes and payments. Now for disclosure reasons I will tell you now that the admission and discharge dates are at a monthly level. We do not have specific days because that information is protected by HIPAA, so in order to keep the data de-identified those dates are provided at the monthly level. Then we have the other therapy or OT files. I often think of these as just outpatient but obviously they do include things other than outpatient. So these have information from outpatient care but also emergency department visits, home health services and other care centers are also included in other therapy files. There we also have service dates, which are at the monthly level, and diagnosis and procedure codes, which are critical when you're thinking about Medicaid information, and also payments.
Finally we do also have prescription drug or RX files and those contain information on prescription drugs filled by the person including national drug codes, the fill date, and taxonomy code.
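The enrollment-length inclusion criterion mentioned above can be sketched roughly as follows. This is only a minimal illustration in Python, not the CCOULD file layout: the person IDs, the (year, month) representation of enrolled months, and the function names are all hypothetical, and the six-month threshold is just one of the examples from the talk.

```python
def longest_continuous_enrollment(months):
    """Return the longest run of consecutive enrolled months.

    `months` is an iterable of (year, month) tuples, one per enrolled month
    (a hypothetical representation, not the actual CCOULD layout).
    """
    # Map each (year, month) to a single month index so that consecutive
    # months differ by exactly 1, even across a year boundary.
    indices = sorted({year * 12 + (month - 1) for year, month in months})
    longest = 0
    current = 0
    previous = None
    for idx in indices:
        if previous is not None and idx == previous + 1:
            current += 1
        else:
            current = 1
        longest = max(longest, current)
        previous = idx
    return longest


def meets_enrollment_criterion(months, required_months=6):
    """True if the person has at least `required_months` consecutive months."""
    return longest_continuous_enrollment(months) >= required_months


# Made-up enrollment records keyed by a made-up person ID.
enrollment = {
    "A": [(2016, m) for m in range(7, 13)] + [(2017, 1)],  # 7 straight months
    "B": [(2016, 1), (2016, 2), (2016, 4), (2016, 5)],     # gap in March
}
eligible = sorted(
    pid for pid, months in enrollment.items()
    if meets_enrollment_criterion(months, required_months=6)
)
print(eligible)
```

Because the dates in these files are only available at the month level, counting months rather than days is the natural unit for a check like this.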

So those are the types of files that we have in there. Now for the next couple of slides I just wanted to present some examples of linked claims data for child welfare research to give you a bit of flavor. Right now I will tell you that the CCOULD data is so new that we don't currently have any research released on it just yet. We're hoping to remedy that sometime soon with my colleagues at RTI and at ASPE and ACF. And so I am kind of presenting two studies here to give you a little bit of flavor of things that you could do, using a linked data set that we had previously at RTI using the NSCAW data, which is the National Survey on Child and Adolescent Well-being, and Medicaid. This was a data set that we linked many years ago and I worked with Dr Ramesh Raghavan and Dr Derek Brown on these questions. And so we used these data to estimate, for kids in the child welfare system, what were the costs of psychotropic drugs prescribed to maltreated children? And you can see here you know we've got some cost estimates here all the way on the right hand side, you can sort of ignore the part one and part two there, but you can see the cost estimates for those increased considerably with age and they are rather expensive for those drugs.

Another example of using linked claims data for child welfare research is another paper that Dr. Raghavan and I wrote together looking at enrollment data and how well children in the child welfare system stay enrolled in Medicaid over time. And you know you can kind of see the disenrollment rates here that we put together looking at the amount of churn in Medicaid. So when I speak of churn I speak of people enrolling and then disenrolling, and it can be quite high, so it's not often the most stable population, so we took a look at that and examined it here. But what I really want to emphasize for those who are unfamiliar with claims data is that the real key is that this linked claims data and the enrollment data allow the researcher to follow an individual over time. So you have this really rich, really wonderful longitudinal data. So you can see like an association between something at time Y with an outcome at time Z afterwards. And so it really is kind of nice, as opposed to I think some of our child welfare data sets that tend to be cross-sectional right? So you get to see them at a point in time, whereas with these Medicaid claims you can see the various individuals kind of move over time and what happens to them as they interact with providers.
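To make the idea of following an individual over time concrete, here is a small Python sketch. The person IDs, event labels, and record layout are invented for illustration and are not CCOULD variable names; the sketch just orders one person's events by month and keeps those that fall after an index event.

```python
from collections import defaultdict

# Hypothetical records: (person_id, "YYYY-MM" month, event label).
events = [
    ("p1", "2017-03", "maltreatment_report"),
    ("p1", "2017-01", "outpatient_visit"),
    ("p1", "2017-06", "outpatient_visit"),
    ("p2", "2018-02", "outpatient_visit"),
]


def build_timelines(records):
    """Group events by person and sort each person's events by month."""
    timelines = defaultdict(list)
    for person_id, month, label in records:
        timelines[person_id].append((month, label))
    for person_id in timelines:
        # Zero-padded "YYYY-MM" strings sort in chronological order.
        timelines[person_id].sort()
    return dict(timelines)


def events_after(timeline, index_label):
    """Return events strictly after the month of the first index event.

    With month-level dates, events in the same month as the index event
    cannot be ordered, so this sketch leaves them out.
    """
    index_months = [month for month, label in timeline if label == index_label]
    if not index_months:
        return []
    index_month = min(index_months)
    return [(month, label) for month, label in timeline if month > index_month]


timelines = build_timelines(events)
followup = events_after(timelines["p1"], "maltreatment_report")
print(followup)
```

The choice to drop same-month events is an assumption of this sketch; an analysis could equally treat them as concurrent.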

Okay so the linkage information. So I think understanding how these were linked is also pretty critical to understanding how they can be used. So both Florida and Kentucky linked their child welfare and Medicaid data deterministically. And they used social security number as the primary linkage, and as a secondary linkage they used an identifier created from the first name, last name, date of birth, and sex, and that was used in the absence of the social security number. So deterministic linkage kind of differs from a probabilistic linkage in that it is an exact linkage, and so you might say it's almost like a merge if you were using data sets, so you must have a unique identifier in both data sets in order to identify the people in both. And so the deterministic linking presumes only two outcomes, which is either those people merge together or they don't merge at all and so we do not find a linkage.
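As a rough illustration of what a deterministic linkage with a primary and a secondary key looks like, here is a minimal Python sketch. It is not the code the states used: the field names, the normalization rule, and the handling of duplicates are all assumptions made for the example.

```python
def normalize(value):
    """Uppercase and strip whitespace; missing values become an empty string."""
    return "".join(str(value).split()).upper() if value else ""


def secondary_key(record):
    """Build a key from first name, last name, date of birth, and sex.

    Returns None if any component is missing, so incomplete records
    never match on the secondary key.
    """
    parts = [normalize(record.get(field))
             for field in ("first_name", "last_name", "dob", "sex")]
    return "|".join(parts) if all(parts) else None


def link(child_welfare, medicaid):
    """Exact-match linkage: SSN first, secondary key only when SSN is absent."""
    by_ssn, by_key = {}, {}
    for rec in medicaid:
        ssn = normalize(rec.get("ssn"))
        if ssn:
            # Assumption: keys are unique; a real linkage would flag duplicates.
            by_ssn.setdefault(ssn, rec["medicaid_id"])
        key = secondary_key(rec)
        if key:
            by_key.setdefault(key, rec["medicaid_id"])

    links = {}
    for rec in child_welfare:
        ssn = normalize(rec.get("ssn"))
        if ssn:
            match = by_ssn.get(ssn)
        else:
            key = secondary_key(rec)
            match = by_key.get(key) if key else None
        links[rec["cw_id"]] = match  # None means no linkage was found
    return links


# Entirely made-up records.
medicaid = [
    {"medicaid_id": "M1", "ssn": "111-22-3333", "first_name": "Ana",
     "last_name": "Diaz", "dob": "2010-05", "sex": "F"},
    {"medicaid_id": "M2", "ssn": None, "first_name": "Sam",
     "last_name": "Lee", "dob": "2012-09", "sex": "M"},
]
child_welfare = [
    {"cw_id": "C1", "ssn": "111-22-3333"},
    {"cw_id": "C2", "ssn": None, "first_name": "sam ", "last_name": "LEE",
     "dob": "2012-09", "sex": "m"},
    {"cw_id": "C3", "ssn": None, "first_name": "Joe", "last_name": "Roe",
     "dob": "2011-01", "sex": "M"},
]
links = link(child_welfare, medicaid)
print(links)
```

Each child welfare record ends up either linked to exactly one Medicaid ID or not linked at all, which is the two-outcome behavior described above; a probabilistic linkage would instead score candidate pairs.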

So let's take a little look at our population overview here. The CCOULD Florida data set includes maltreatment reports from 2016 to 2020. And you can see that the Kentucky data set has reports from 2016 to 2021, so that one is a little bit longer. The timelines don't exactly line up but still it's four or five really good years of data and it is like right on the cusp of COVID, so we have a little bit of COVID data in it, not a lot. The child welfare data in Florida has over 800,000 unique children and it has around 650,000 reports, and I will tell you that all of those children have demographic records and so in those demographic records we have information on gender and race and age. And in addition, so the hard part in putting these all together is really linking those caregivers, because the information used to link those caregivers is not as strong obviously as it is for the children. However we do have information over that time span on around 46,000 linked caregivers so a lot of caregivers. And then for Kentucky the maltreatment reports occur from 2016 to 2021 as I said before, and you know we have about 263,000 unique children identified from 461,000, or actually it's almost 462,000, child welfare reports or investigations, so a lot. And then the data set there contains information on almost 44,000 linked caregivers. So again a really robust and linked data set for use. I will tell you that both states also have comparison group data, and I think this is also another really nice feature of this data, that there are adults and children in the data set that have no association, at least that we could find, with the child welfare system, which allows you to kind of look and contrast that population with those that do have interaction with the child welfare system, and that can provide important context and flavor to the analyses that you can do.
Certainly comparison groups are pretty critical.

Okay so right so this is a little bit more information on it too. Florida provided a Medicaid comparison sample not associated with the child welfare system. The sample represents a 10 percent sample of adults and a 10 percent sample of children. So again it's a really robust comparison group sample. Kentucky also provided a 10 percent Medicaid comparison sample of 330,000 individuals, adults and children, not associated with the child welfare system, so again a really robust comparison sample. Which you know again if you're going to try to look at outcomes over time, having a well-selected comparison group helps reduce confounding and can be really critical to understanding trends in your outcome of interest.

Okay so the text on this slide is a little bit small I will tell you, but I did want to kind of just drop a little bit of descriptives in here about the child and then caregiver populations. You know so you can see we've broken this table out into all children, and children with an unsubstantiated report disposition, and then also children with a substantiated report disposition, and then children with neither an unsubstantiated nor a substantiated report disposition, so this is for both states which we've combined here. And you know you can see that the age range tends to favor children who are a little bit older in terms of the population and there's a pretty even split between male and female.

And in terms of race it is predominantly white or black or African-American but we do have some minority populations represented here. And then you can see also all the way down at the bottom there in terms of you know is there any mental health diagnosis in this population, or any SUD diagnosis, or either, and you can see that you know a substantial proportion of these children do have a mental health diagnosis as part of their medical claims.

Okay the caregivers. You can see this is a combined caregiver data set. And you can see that the caregivers tend to be predominantly in the age 26 to 40 age range, almost two-thirds of them, and also almost two-thirds of them are female as well so not very many male caregivers. And also you know I think more than 95 percent of them are white or black so that is a significant portion although there are some even smaller race and ethnicity groups there. Some analyses that we're doing now take a look at the mental health diagnoses in the caregivers and you can see that a substantial portion of them do have a mental health diagnosis or a SUD diagnosis. And three quarters of them have either one, which is really kind of an amazing statistic, and whether or not you know there is this kind of unmet need for them is a question that we're looking at right now. So that is kind of a brief description of the data that goes into this. You can see already just from these pretty straightforward crosstabs, I won't say basic but straightforward crosstabs, that these data are really rich and there are a lot of really important policy implications for a population that truly needs it. So you know I do think that this can and should be used for some good policy relevant questions to be asked.

So as with any data set there are limitations that we should most certainly address with this.

So the first limitation that I would like to just discuss is both data sets are administrative data and so they are not research data. So these are data sets that are used operationally in the state governments of both states. And so you know as a result I think they may take some work to get to an analysis data set. You know they have a lot of very rich detail which is great and really useful but it may take some time to kind of get them into a form in which you can conduct analyses on them. So you should know that before getting into it. And you should also know that claims may not include detailed clinical information or patient-reported outcomes. So you know you will not necessarily have claims for what that person's BMI is, you know although I do think there are ICD codes, or ICD-10 codes, for BMI, I don't know how often those are coded. So you're probably not going to have detailed clinical information and you also won't have patient-reported outcomes from the claims data. You know I would say in some sense that is a limitation of the CCOULD data but in terms of putting it together it is one of the best parts about it because it is administrative data. We didn't have to ask states to go out and collect this data, it was data that already exists and it was linking those data sets to make them that much more powerful. As I've noted previously service dates, so admission dates on inpatient files and service dates on other therapy files, are only available at the month level and that is to help ensure the privacy of the folks that are in the data set. And then this next one is claims data may not capture the full scope of healthcare utilization for certain populations such as those who receive care outside of traditional care settings. So that is definitely true, like unfortunately they're limited to only the stuff that providers bill Medicaid for.
So if there are settings where people are not billing for certain services then we're not going to capture those and you just need to be aware of that when you're working with that. That's true of all claims data but in particular with Medicaid claims as well. So you know if someone is on private insurance that will also obviously not be captured here. And then we have had questions about geographic indicators and for privacy reasons those are limited to state-level indicators, so you know in the data set who was from Kentucky and who was from Florida but you don't know anything more granular than that.

Okay now how do you obtain the CCOULD?

So first of all you need IRB, institutional review board, approval to use the data so you must obtain that first. And then you will also need to put together an application including a data storage plan and research plan for NDACAN to review so you need to have those. These are all things that you should have ready before applying for the data. And then of course you need to have the names and contact information of the people accessing the data. So like I said previously these are very large files and for folks who are more familiar with working with survey data or you know child welfare data the size of the files is non-trivial and so you will have to figure out how to work with them, so a data storage plan is pretty critical and you know there will need to be a certain level of security to ensure that those files are kept under wraps. I think you would need some sort of server to hold them on because they are so large and it is tricky using them. And you will need database management software in order to be able to use that so I'm thinking something like SAS. I myself am a Stata guy and given the size of these files Stata might choke a little bit on it but I do know that is improving over time.
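One common way to work with files that are too large to load at once is to stream them row by row and keep only running summaries in memory. The Python sketch below illustrates that idea under loud assumptions: it supposes a claims file has first been exported to delimited text, and the column names `person_id` and `paid_amount` are hypothetical rather than CCOULD variable names.

```python
import csv
import io
from collections import defaultdict


def total_paid_by_person(handle, id_col="person_id", paid_col="paid_amount"):
    """Sum payments per person while reading one row at a time.

    Only the running totals are held in memory, never the whole file.
    The column names are illustrative defaults, not CCOULD variables.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(handle):
        # Treat a blank payment field as zero.
        totals[row[id_col]] += float(row[paid_col] or 0)
    return dict(totals)


# A tiny in-memory stand-in for a large claims extract.
sample = io.StringIO(
    "person_id,paid_amount\n"
    "p1,10.50\n"
    "p2,4.00\n"
    "p1,2.25\n"
)
totals = total_paid_by_person(sample)
print(totals)
```

With a real extract the same function could be called on a file opened with `open(path, newline="")`; the security and storage requirements described above still apply wherever the file lives.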

All right so that is my last slide. How did I do on time? Oh I am early so that leaves good time for questions.

[Clayton Covington] Yes well first of all let me just say Ben thank you for this excellent presentation. I think that you know it's just so promising what these data can reveal to us as researchers and just general people interested in the child welfare space. So just I also want to echo the message that Erin sent to you all earlier that please do continue to submit your questions to the Q and A box. I'm going to proceed with starting with the first few questions that we already have queued up and then we will continue this seminar either until the time or the questions run out. So to get us started the first question asks: my understanding of Florida's child welfare system is that it has a unique administration from the rest of the nation namely a large emphasis on privatization. Did this structure of the system present challenges when linking for this data?

[Ben Allaire] Yeah so great question, good to hear from you Matthew. So I think that is a great question. I don't think that specific structure presented challenges when linking this data. I mean I will say there were certainly challenges throughout in terms of getting that data linkage. In terms of getting all of the data that we wanted I think that may have been a bit of a challenge but I think that system in general did not necessarily pose an issue for it.

[Clayton Covington] Okay the next question asks do these data include Medicaid enrollment for children and their caregivers or caregivers only?

[Ben Allaire] Yes so these data have both for children and their caregivers. So it's both so you are welcome to use both.

[Clayton Covington] Right thank you for that response the next question states many states are consolidating their coverage of youth in foster care into a single Medicaid managed care organization. This would be as opposed to fragmenting them across several companies. My understanding is that this policy choice is intended to promote specialization and more seamless contracting with medical providers for a population that often has complex health care needs. Do you envision that the CCOULD data could help the field explore the wisdom or potential impact of that policy?

[Ben Allaire] Well I think the answer is yes so long as you have a sense for what those trends were in the two states that are represented in the CCOULD data. So I think you would need to know, I'm trying to think, we may or may not have indicators in there for managed care, just off the top of my head, to see if you could look at trends of the children in there and whether or not they are in managed care. I cannot remember off the top of my head if we did but I do think if those indicators are in there that you could certainly look at that over time. And then you could examine whether or not those changes resulted in you know better care or possibly even you know less expensive care. I do think you could look at those but to be honest I can't remember just off the top of my dome if the enrollment files have a managed care indicator right now. But I do think we could take a look at that. Matthew contact me if that's something you want to look at.

[Clayton Covington] The next question asks is it possible to identify a subset of the data and only receive that smaller subset?

[Ben Allaire] In some sense I think the answer is no but I think that might be a question for NDACAN.

[Clayton Covington] So yeah, other NDACAN folks please feel free to chime in, but I don't believe you can request just a smaller subset of data, especially if it's a restricted data set. You would have to basically demonstrate that you have the capacity to deal with the data, both with the storage plan, as was discussed earlier, and by going through the full application process. Whatever subset of the data set you actually use yourself would be up to you and your research purposes, but I don't believe you can request just a smaller subset of the data.

[Andres Arroyo] That is correct, this is Andres speaking, and I would like to add that we distribute the CCOULD data in different formats, and so that's the smallest unit that we can go to, whether you choose Stata format or SPSS, but we don't distribute subsets. Thank you.

[Erin McCauley] Yeah Thank you Clayton and Andres I'll also add that if you were however interested in just using a subset we could help you develop the code to drop the other fields if that makes sense. So you know like help you with the data management process to end up with the subset that you want after you receive the data but you would have to as Clayton said develop a data management plan for the whole data set.
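As an illustration of the kind of data management step Erin describes, here is a minimal sketch in Python that keeps only a chosen set of fields from a delimited file. The column names (`child_id`, `state`, `dob`, `placement_type`) and the helper `keep_fields` are hypothetical and are not taken from the CCOULD files; the actual variable names are in the user guide distributed with the data.

```python
import csv
import io


def keep_fields(rows, fields):
    """Yield each row reduced to the requested fields (hypothetical helper)."""
    for row in rows:
        yield {name: row[name] for name in fields}


# Stand-in for a delivered file; these column names are made up.
delivered = io.StringIO(
    "child_id,state,dob,placement_type\n"
    "1,A,2010-01-01,foster\n"
    "2,B,2012-05-17,kinship\n"
)

# Keep only the two fields needed for a hypothetical analysis.
subset = list(keep_fields(csv.DictReader(delivered), ["child_id", "placement_type"]))
```

Because the rows are processed one at a time, the same pattern works on a file too large to hold in memory, although the full file still has to be stored under the approved data management plan.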

[Clayton Covington] Right and that's just also a good reminder that at NDACAN, in addition to being a repository of data, we play a really active role in data analysis and also user assistance. So if you do need assistance, as Erin indicated, that's something that our team could assist with as well, but I'll proceed to the next question. It asks does the AFCARS data have indicators of whether a child has been deemed medically complex?

[Ben Allaire] So that's a good question. I don't think that we have indicators in there for that, but I will have to check and circle back with you, Matthew.

[Clayton Covington] Okay next question asks would you have any recommendations or guidance for working with such large files in statistical software like R or Stata?

[Ben Allaire] Yeah, certainly. To the extent that you can, create a subset of your data. I've worked with some of these large claims files, and what I'll do is piece off a sample. So you can use Stata's bsample command: if you've got a data set with 20 million people in it, or however many observations you've got in there, you can sample 500,000 of those, run your code on that 500,000, make sure that you don't have any syntax errors and the code gives you what you want, and then run it on the full file. So that's definitely something that I would recommend when working with these large files. The other thing is not to run them on your personal CPU because it will slow down everything that you're trying to do, so if there's a server instance that you can utilize I think that could be really useful.
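The sample-first workflow Ben describes can be sketched in Python as well. This is an illustration under stated assumptions, not the code used on the CCOULD project: it uses reservoir sampling, which draws a fixed-size random sample without replacement in a single pass and without loading the whole file, rather than the Stata command he mentions. The function name `sample_rows` and the columns `person_id` and `paid_amount` are hypothetical.

```python
import csv
import io
import random


def sample_rows(rows, k, seed=0):
    """Reservoir-sample k rows from an iterable without holding it all in memory."""
    rng = random.Random(seed)  # fixed seed so the test sample is reproducible
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Stand-in for a large claims file: 10,000 made-up rows built in memory.
fake_file = io.StringIO(
    "person_id,paid_amount\n" + "".join(f"{i},{i % 50}\n" for i in range(10_000))
)
test_rows = sample_rows(csv.DictReader(fake_file), k=500, seed=42)
```

The idea is to develop and debug the analysis code against `test_rows`, then rerun the same code on the full file once it behaves as expected.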

[Clayton Covington] Okay that is the conclusion of the questions so far because we do still have time we can stay open for a few more minutes if people want to mull over some questions in their head. 

[Erin McCauley] I also want to add that I posted in the chat that we have had a prior summer training series session on managing administrative data that includes the code that Ben was discussing, in addition to other strategies. I believe it was taught in R, if memory serves, and it is periodically an Office Hours workshop. So for those who are unfamiliar with Office Hours, we have about 30 minutes of open support time where we have breakout groups with statisticians or data analysts for our different types of data sets, and then there's an informal workshop after, and so I'd say about once a year we have a workshop on that as well.

[Clayton Covington] Yes thank you Erin, we have a new question in the chat. It says I might be wrong but it seems there is substantial missingness. Could you give us some more details about the missing pattern or share your experience dealing with the missingness, or is there any resource where we could find more information about dealing with the data?

[Ben Allaire] So great question. I think for some variables there is going to be a lot of missingness in the data, and those predominantly tend to be on the child welfare side, where there just isn't a lot collected by those state agencies. And so I do think that there is some missingness there and that's just a reality. And there are different strategies that people have employed depending on the variable: you can drop all the missings, or you can impute, depending on what it is. So I would advise you to be very careful with whatever strategy you employ because I do think it has a lot of implications for the results that you obtain, right. So if you just drop all of the missings and it's not missing at random, if it is missing for some systematic reason, then your estimates may be biased, so I do think that's something that you want to consider. So the last question that you've got on here is, is there any resource we can find for more information about dealing with the data? So included with the data is a user guide, and we've tried to be very clear about parts where there are gaps in the data and also how to deal with that. As much as possible we've tried not to curate this data, and I say that honestly, because we want to keep things as raw as possible so that researchers can decide what to do for themselves, so that when RTI was working with the data we weren't making those choices for the researchers, because that can have implications for research. So I will tell you that there are some variables that have some unusual values in them, but we've left that in there for the researchers to decide what to do with them.
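Before choosing between dropping and imputing, one simple check in the spirit of Ben's caution is to tabulate how much of each variable is missing and whether that share differs across groups. The sketch below is a minimal illustration in Python; the helper names and the fields `state` and `placement_type` are hypothetical, the rows are made up, and a difference between groups is only a warning sign that missingness may be systematic, not a formal test.

```python
def missing_rates(rows, fields, missing_values=("", None)):
    """Return the share of rows in which each field is missing."""
    total = len(rows)
    rates = {}
    for name in fields:
        n_missing = sum(1 for r in rows if r.get(name) in missing_values)
        rates[name] = n_missing / total if total else 0.0
    return rates


def missing_rate_by_group(rows, field, group, missing_values=("", None)):
    """Return the share of rows missing `field` within each value of `group`."""
    grouped = {}
    for r in rows:
        grouped.setdefault(r[group], []).append(r)
    return {
        key: missing_rates(members, [field], missing_values)[field]
        for key, members in grouped.items()
    }


# Made-up rows: every missing placement_type comes from hypothetical state "A".
rows = [
    {"state": "A", "placement_type": "foster"},
    {"state": "A", "placement_type": ""},
    {"state": "A", "placement_type": ""},
    {"state": "A", "placement_type": "kinship"},
    {"state": "B", "placement_type": "foster"},
    {"state": "B", "placement_type": "foster"},
    {"state": "B", "placement_type": "kinship"},
    {"state": "B", "placement_type": "foster"},
]

overall = missing_rates(rows, ["placement_type"])
by_state = missing_rate_by_group(rows, "placement_type", "state")
```

In this made-up example, dropping the incomplete rows would remove half of one state's records and none of the other's, which is the kind of pattern that can bias estimates.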

[Clayton Covington] And I just want to echo something that Research Aide Alexandra Gibbons just posted in the chat, which is that we do have an upcoming session on July 26th about dealing with missing data, especially given Ben's earlier point about how a lot of the missing data with this particular data set fall on the child welfare side. So we have some more insights about that to share in the near future, so make sure to attend the session later this month. Okay another question is coming in, and it asks what was one of the most meaningful lessons learned from linking the data?

[Ben Allaire] All right so I will tell you Matthew Walton is in here, and he works for the state of Kentucky, in particular he works in the office for OHDA, and so he was one of the key people that helped us link the data. So I would say one of the more meaningful lessons to me was how this child welfare data is really highly protected by the states. And for good reason, right, they're really cautious about making sure that this data cannot be re-identified and can't be used to harm the people in it. And trying to get those states to open up and say, look, we're going to be good stewards of the data and make sure that people cannot be re-identified, and that this data can be used and will be used to help this population, and that not being able to see the data at all is also something that can harm the population. For the folks on the state side to be able to make that case was something that was really interesting to me, and I hadn't appreciated before the project started how protected this data is, and that having such tight protections could do a disservice to those folks, and so I'm hopeful that opening this up will be able to help.

[Clayton Covington] Thank you Ben, we have another question. Who do I contact about more individualized training? I'm a lawyer and a recent PhD in criminology at a national organization. My dissertation used SPSS. I'm trying to expand my role beyond legal into data and evaluation.

[Ben Allaire] That is a great question that I don't know that I have a particular answer for. First of all, if I were this person I would definitely look internally in your organization to see who else is there. And then I also know that NDACAN offers a lot of really nice trainings for that. Then for myself, in terms of trainings that I've had, if you're using Stata the Stata documentation is really good, and all you need really is a research question and a license and that'll help you learn a lot better how to program.

[Erin McCauley] I will also jump in to say that we have not formally announced it yet, but just as a little preview here for the folks at the STS, for the Office Hours series next year we plan to use those half-hour informal workshops that we offer each month to teach a year-long class on learning R. So we will not be teaching statistics itself, we will be teaching coding in R, which would definitely help you advance that project. And then remember, if you're undertaking a project and you get stuck along the way, we are always available for support via our support email. But you can also drop into Office Hours once a month to meet with our statistician or data analyst to help you code through problems.

[Clayton Covington] Yes thank you Erin for sharing that early preview of the upcoming Office Hours events. I think that previous attendees, when we didn't necessarily do the R class, have still received individualized attention, not necessarily training per se, but individualized attention to work through and troubleshoot various issues. So again I don't want to come off as a broken record, but we would like to play a very active role here at NDACAN in assisting our data users with various stages of analysis. I think one other resource that's probably a little bit more far-reaching, but again it's not too early to think about this, is that NDACAN also has a Summer Research Institute. After you have obtained data access, have some familiarity with the data, and have a proposal, you can actually spend, I think it's like a three-to-four-day intensive virtual session with our staff where you will have even more individualized attention for a given project. So the way that you can apply for that is on the NDACAN website, but one of the things Alexandra was again pointing out is that you can keep up to date with all the resources and various opportunities available at NDACAN by both signing up for the listserv and following us on Twitter at NDACAN_CU. But it looks like we still have a few more minutes so if you have any other questions feel free to throw them in the Q and A, otherwise we'll conclude in a little bit. All right there's one more question that asks does NDACAN have an archive where readers can identify papers that have used the CCOULD data?
So excellent question. I'll answer this and say that one of the things that Ben mentioned earlier is that because this data set is so new there aren't necessarily publications using the CCOULD data yet, but they are in the works. One of the resources available at NDACAN is canDL, which is the Child Abuse and Neglect Digital Library, also available on the NDACAN website, and Erin just dropped the link in the group chat. Basically any NDACAN-related publication is accessible via this link, and it will take you to a Zotero website, if you're familiar with that platform, which is essentially bibliographic software where you can look at all of the data and disaggregate it by a specific data set, whether you want to do one data set or a combination of data sets, so I would look to canDL as a resource. So I think this is a good time to wrap up the session. Again I want to thank Ben, and I'm going to ask if you can go to the last slide please Ben. So again thank you everyone so much for attending this week's session. We are so excited to be sharing this new data acquisition with you all, and we're even more excited to continue the series with next week's presenter Garrett Baker, who's a PhD student in sociology and public policy at Duke University, and he will be giving a presentation on causal inference using administrative data. It will happen at the same time next week from 12 to 1 pm Eastern time and we really look forward to seeing you all there. Thanks everyone.

[Ben Allaire] Thank you for listening and just appreciate everyone's time. 

[Voiceover] The National Data Archive on Child Abuse and Neglect is a collaboration between Cornell University and Duke University. Funding for NDACAN is provided by the Children's Bureau, an office of the Administration for Children and Families.
[musical cue]

Presentation References: U.S. Department of Health & Human Services, Administration for Children and Families, Administration on Children, Youth and Families, Children's Bureau. (2019). The AFCARS Report No. 26. Washington, DC: U.S. Department of Health & Human Services. Retrieved from https://www.acf.hhs.gov/sites/default/files/documents/cb/afcarsreport26.pdf.

Mark, T. L., Dolan, M., Allaire, B., & Bradley, C. (2022). Linking Child Welfare and Medicaid Data: Lessons Learned from Two States. Retrieved from https://ASPE.hhs.gov/reports/ccould-lessons-learned-report

Raghavan, R., Brown, D. S., Allaire, B. T., Garfield, L. D., & Ross, R. E. (2014). Medicaid expenditures on psychotropic medications for maltreated children: a study of 36 states. Psychiatric Services, 65(12), 1445-1451.

Raghavan, R., Allaire, B. T., Brown, D. S., & Ross, R. E. (2016). Medicaid disenrollment patterns among children coming into contact with child welfare agencies. Maternal and Child Health Journal, 20, 1280-1287.