[Musical cue] [voiceover] National data archive on child abuse and neglect [Erin McCauley] All right, well, we are at time so we're gonna get started. Thank you everyone so much for being here. This is the ndacan summer training series 2022. For those of you who are new, the summer training series is held every summer as kind of a live training workshop, and then we turn the videos into a webinar series. They are available on our website, and you can also see summer training series dating back to 2018. So as I said, this is the ndacan summer training series. It's hosted by the national data archive on child abuse and neglect, which is co-hosted at Cornell university and duke university. This presentation is about structural equation modeling, and ndacan is run through a contract with the children's bureau. Here's our overview of this summer. We are officially more than halfway through. We started with an introduction to ndacan, which highlights our services and supports, and then a live workshop on data management strategies. This is especially helpful for our administrative data sets which, if you've used them, you know can be quite large. And then the following week we had a presentation by Sarah about the administrative data and how to link them. Then the following week we had frank come and give us a workshop on linking ndacan data with external products. And then today we have the structural equation modeling workshop by Sarah, and then the next two sessions are a workshop on propensity score matching and then studying racial disparities using data, and this will feature the new VCIS data, and these will both be led by Alex. So now I’m going to pass it over to Sarah, and again if you have questions throughout the presentation please put them into the q and a box and then we'll moderate the q and a at the end of the session. Sarah, take it away.
[Sarah Sernaker] Thank you Erin hi everyone I’m Sarah Sernaker I’m a statistician with ndacan you may have heard my voice teaching other sessions this summer or in previous years. So today's session is about structural equation modeling and this is our little agenda and so I’m first going to introduce what is structural equation modeling the process that you would go about in specifying a model. And I have a really basic example using our data. So what is structural equation modeling? You've probably heard of it before or seen that SEM acronym in literature and wondered what it is and so structural equation modeling is kind of an umbrella term for a whole bunch of techniques. So it encompasses a broad array of statistical techniques and frameworks and it's a family of related methods rather than just a single technique. It tries to help explain the relationships between observed or measured variables and latent variables. And latent variables are variables that are not directly measured they cannot be directly measured but have an underlying impact on the response or other variables. So for example intelligence or religiosity or quality of life could all be considered latent variables. So for instance religiosity you can't measure someone's religiosity on what scale would you measure that so you use proxy measures such as how many times you go to church? Or how often do you pray? So those are observable variables that try to understand the latent measure underneath those variables. So that's what a latent variable is and that comes up in a lot of other statistical methods too. And so at the heart of structural equation modeling it's a combination of path analysis, factor analysis, and multiple regression analysis. And if you are trying to use SEM and you don't actually have latent variables or haven't identified any then it's completely analogous to path analysis so. 
So continuing on with what SEM is: the following relationships can be measured via SEM. You can understand observed to observed variables, so that's basically our regression setup: you have measurements of your response and measurements of your explanatory variables, no latent variables there. You can measure the effect of your latent variable on observable variables, and that's simply confirmatory factor analysis, or in other words just factor analysis. Or you can understand how one latent measure might impact another latent measure, and that falls under the umbrella of structural regression. And so there are two parts to SEM: one part is what's called a measurement model and one part is called a structural model. The measurement model relates observed variables to latent variables. So those are the paths that run from a latent variable to measured variables, and when you get to fitting, which we'll get into, you essentially get factor loadings. Again you're measuring the effect of a latent variable on a measured variable. And so that comes out as a factor loading through factor analysis, and this expresses the strength of the relationship between an indicator and that latent variable, so as I said it's analogous to factor analysis. The other side of it is the structural model, which relates latent to latent variables, and the paths between the latent variables express the strength and direction of the relationships as regression coefficients. And so once you fit an SEM model you would end up with coefficients that are pretty interpretable, in the same way as linear regression. So why would you want to use SEM? You've heard of it, you might not know what it is, maybe you're familiar with it. In what setting would you want to use it?
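The two parts just described are often written compactly in matrix form. The notation below is the conventional one from Bollen's "Structural equations with latent variables" and is offered here only as a reference sketch; it was not shown at this point in the talk. Here $x$ and $y$ are observed indicators, $\xi$ and $\eta$ are latent variables, $\Lambda_x$ and $\Lambda_y$ hold the factor loadings, $B$ and $\Gamma$ hold the structural regression coefficients, and $\delta$, $\varepsilon$, $\zeta$ are error terms.

```latex
\begin{align}
x &= \Lambda_x \xi + \delta \\
y &= \Lambda_y \eta + \varepsilon \\
\eta &= B \eta + \Gamma \xi + \zeta
\end{align}
```

The first two lines are the measurement model (latent to observed), and the third is the structural model (latent to latent).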
And SEM as I said it encompasses a few different statistical techniques and so it has more flexibility than strictly staying in the box of factor analysis or piecewise regression analysis. It allows you to understand more underlying structure and complex relationships between the variables and their covariance structures and things like that. And it's really well suited for causal analysis because of this because you can try to identify and control for a lot of underlying effects that might not be your main focus but you want to control and kind of get them out of the way. And thereby leading the way for causal analysis. And it can handle multi-collinearity pretty well so that's always a bonus because that usually pops up in social science data. So I’m going to do a quick side step to path diagrams because this is really the heart of SEM and in specifying an SEM model and so a path diagram is simply just a visual representation of a system of simultaneous equations and each each piece of a path diagram has meaning each shape line and arrow has a specific defined purpose which I’ll get into. So path diagrams there's a lot of Greek symbols here but don't worry we're not totally focused on the Greek symbols more so the shapes and arrows here. And so I’m spending so much time on this because as I said this is really the heart of structural equation modeling and you'll see when you get to specifying a structural equation model this is really where you want to start whether on your computer in stata which I’ll show you or on paper, like you really should try to understand all of the different arrows and directions of the relationship you're trying to understand. And so in building a path diagram as I said everything is very particularly defined. A rectangular or square box signifies an observed variable. So anything with a square would be a variable that you can directly measure. 
So someone's height, someone's weight, someone's, I don't know, years of education, those are directly measurable variables. Anything in a circle or ellipse signifies a latent variable. So remember these are variables that are not directly observed but that you, as a researcher and expert in your field, should know is an underlying effect. So again the religiosity thing: that is a latent variable that we can't directly measure, but we as humans, I’m sure we're all aware of what I’m talking about when I say religiosity or even political affiliation. So those latent variables go in circles. And then unenclosed variables signify a disturbance term. I should just put error, so a disturbance term is just your measurement error. And so for instance this epsilon is the error term, and sometimes these error terms are kind of implied. Anytime you're fitting a linear regression model you have an error term, whether you're aware of it or not, so it's kind of the same deal here. There's always error lurking in the background. In path diagrams sometimes it's just implicitly there, and sometimes you directly want to understand the error, and so it might be worthwhile to explicitly write it, as here for instance. So those I think cover the symbols: the squares, the circles, and then the naked variables that don't have anything around them. And the arrows are equally as important if not more so, because these define the relationship. So straight arrows signify the assumption that the variable at the base of the arrow causes the variable at the head of the arrow. So the direction in which you are pointing your arrow is very important.
So this relationship says that this oh man my Greek is this is eta is it I don't know let's just say eta it's probably not so this eta is causing this y so for instance think of this this is a circle so this is a latent variable so this is a latent variable is affecting some observable response it's observable we know because it's in a square. Okay and curved two-headed arrows so something like we see here signifies unanalyzed associations between two variables. So these usually imply that there's a that they are not independent and that there's some covariant structure there's some relationship between these variables. [onscreen boxes and circles each containing one Greek letter , an arrow points from one unenclosed letter to a square, a curved double-headed arrow points between two variable circles, two straight arrows point between two variable circles] and then we have these two straight single-headed arrows connecting two variables which signify feedback relationship or reciprocal causation so you know maybe you can't identify one that's causing the other you just know that they are interrelated one they could be relating to each other simultaneously in other words. And just a quick aside so this figure and text is from this book by Bollen "structural equations with latent variables" which is a super helpful book and has lots this was like the simplest figure I could choose from there they have a lot of really nice examples like this and even trickier. So yeah that's kind of the very quick basic rundown of path diagrams and like I said this is really where you would start if you wanted to build a structural equation model. And each of these relationships could be written as separate equations. So these are all relationships and so for instance this could be written as a basic factor analysis model that's the easiest one to kind of translate but so every single thing here could be translated to a an equation. 
And that's the idea you have all of these paths going everywhere which are equivalent to equations and then at the end when you're ready to solve for it you're solving for all of the equations simultaneously which hopefully makes more sense with the example. So here's a oh so yes so here's another example that shows how these relationships between the circles and squares and Greek letters and all that translate to simultaneous equations. So we have this path diagram on the left hand side again coming from the same book and like I said these are equivalent to this set of equations. So for instance let's look at this x1. So x1 is this observed variable that's being affected by this latent variable see how the arrow is starting on our latent variable I think that's eta I should have brushed up on my Greek letters before this pointing to x. And then we have an error term delta is the error term for here so this x1 is equal to some coefficient lambda that's working off of eta and it has this error term delta, so that's what this error term is. So this if you were just to fit this because this is a latent variable this would just come out as a factor loading so lambda would be your factor loading. And similarly this eta is going to x also and so we have the same sort of equation we have the simultaneous equation happening for x2 so this is also being affected by this latent variable it has an error term. And so that is this is what I mean by each arrow and relationship in a path diagram is directly translatable to a series of simultaneous equations. I think I guess that's all I’ll say about this because all the equations are mostly the same we're just dealing with different variables here. 
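Written out, the two measurement equations walked through here look like the following. The symbols follow the speaker's reading of the slide ($\eta$ for the latent variable, $\lambda$ for the loadings, $\delta$ for the errors); the subscripts are added for illustration. Bollen's text conventionally writes $\xi$ for a latent variable measured by $x$ indicators, so substitute that symbol if it is the one on the slide.

```latex
\begin{align}
x_1 &= \lambda_1 \eta + \delta_1 \\
x_2 &= \lambda_2 \eta + \delta_2
\end{align}
```

Each arrow from the circle to a square contributes one $\lambda$, and each unenclosed term pointing at a square contributes one $\delta$.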
And so I’ll just take a second here just to acknowledge we have this complex structure and so it's really for helping understand each part of this like inter-working mechanism and so it is flexible and SEM is a nice methodology to keep in mind but it also can be complicated and you know you might think you want to do SEM but if you're not really trying to understand every piece of a relationship or control for each piece of a relationship it might be it might be simpler to just go with basic regression sort of models but. [Erin McCauley] Hey Sarah [Sarah Sernaker] Yes [Erin McCauley] There is a question in the q and a that I think would be helpful to answer now okay it says can you provide some examples for curved arrow and the reciprocal causation please which I believe was the double facing straight arrows. [Sarah Sernaker] Yeah so this would be the case so this the curved arrow remember was an association between two variables so this is really measuring like a covariance structure so it says that they're related so this would be like if you have multi-collinearity for instance that you're trying to understand. I'm trying to think of like a straightforward example and of course I’m blanking yeah yeah yeah so for instance let's say you measured socio-economic status and salary I guess I don't know what is ses I don't have one at the top of my head but this you would use this sort of relationship if you had a multi-collinearity or if you knew that there was some underlying relationship between them but you want to keep them both in the model. [Erin McCauley] Would it be like for kids they're like height and weight like they have a relationship but they're not causing each other like age is causing both. [Sarah Sernaker] Yeah that could be a good example yeah so something like that where you just have these two variables so like Erin's saying height and weight those are two variables that are pretty co-linear you know like one affects the other for sure. 
And so but you want to include them both in the model and so you would just signify within your path diagram that they're related and then yeah I’ll just just stop. [Erin McCauley] Holly also said if these are latent variables what about religiosity and views on abortion? [Sarah Sernaker] Yes well views on abortion not to get into the topic itself but that's directly measurable so that I wouldn't consider a latent variable but what about like religiosity and maybe like political ideology which is a little more directly measurable right like you have in America we have two parties but it's still it's not as dichotomous as democratic republican right. Like that still could be considered a latent variable so something like that religiosity and political ideology are definitely well they're probably related and so there's some covariant structure you could impose to try to understand that or just to control for it in your model. And then the two single arrows that that's a trickier question what's an example of like a feedback loop that one's trickier and the reason I’m also blanking on an example is because the this one particularly doesn't seem to come up as much in using this at least in my experience. Ah sanitize feedback what [Erin McCauley] Someone said here height and weight makes sense I was thinking maybe age in height or I guess height doesn't cause your age so definitely not. [Sarah Sernaker] Yeah yeah I mean I guess height and weight could still be yeah are we still have latent variables so I mean we could still talk about religiosity or political ideology. And again so I’m using the same sort of variables for a different type of relationship but this is kind of like it's like what do you really want to model how do you want to include the model? Are these your main focus? 
So like the difference so let's just so let's stick with political ideology and religiosity let's assume those are latent variables you can consider them in the model and try to understand how their covariance is related just you know how are they basically related you want to control for it and the model overall maybe they're not like the main interest in your sem. In contrast this forward and backwards arrow would be a more direct measure where you're saying I’m really interested in how these relationships work and how one influences the other and vice versa. So you know like in any modeling strategy you can include the same variables just in different ways just depending on you know what your singular focus is and what your research question is and how you want to account for them in the model, if that makes sense. [Erin McCauley] Yeah that makes sense and Felipa gave another great example infant impacts the environment and then environment impacts like child cognitive growth. [Sarah Sernaker] Yeah that's a great example you guys seem to have better examples than I do which means you're already like our awards [Erin McCauley] The spotlight's not on thanks for the participation everyone yes an excellent question original [Sarah Sernaker] That is a great question because well as I said these are all really important specifications when you're specifying a model. But again like I was saying these could be the same variables but it's just a matter of what your goal is and how interested you are in looking at them and the relationships between them or again just for controlling them so yeah good good questions. Okay so then there's all that Greek. So you've decided you want to build a structural equation model what are you going to do? So first you would want to specify your model. As I said before you want to sit down and create a path diagram with your measurements and structural model of interest. So this is just basically organizing your thoughts kind of. 
And in stata, I’ll show an example where you can actually build your path diagram and stata will then fit the model and give you the code for the model. So I’ve chosen stata today because of that really great and helpful built-in functionality. So one, you'd want to build your path diagram. Two, you'd want to check your measures and distributional assumptions of each variable of interest. I’m just going to go back real quick. When you're specifying your path diagram, for instance, this could be a binary variable, this could be a continuous variable, this could be a categorical variable. So each pathway, remember, leads to an equation, and so depending on the variables and their measurements it's going to lead to a different modeling strategy. So for instance if this was a binary variable this could be considered, simplistically, as a logistic regression. Whereas if this was continuous this would be just a regular linear regression. So it's a sequence of different models depending on how the response variable at the end of the arrow is measured and what type of variable that is. So that's what I mean by checking measures and distributional assumptions: you don't want to just say everything's linear regression and fit everything off the bat. You need to understand which model is best for each relationship, basically. So you specify your model, you evaluate and specify the measures and distributions, and then you would fit your model and then assess the fit. So you would assess your fit using basic statistical measures that come up in other modeling strategies, like the likelihood ratio test, aic or bic measures, r squared. There's also the root mean squared error of approximation, which sounds like the square root of mean squared error if you've heard of that, but it's its own fit index, computed from the model chi-square, the degrees of freedom, and the sample size.
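To make the "each pathway is its own model" idea concrete, here is a minimal Stata sketch on simulated data. All variable names (x, y_cont, y_bin) and coefficient values are hypothetical and are not from the nytd file. It fits one continuous and one binary response in a single gsem call and then requests aic and bic.

```stata
* Simulated data; every name and coefficient here is hypothetical
clear
set seed 2022
set obs 500
generate x      = rnormal()
generate y_cont = 1 + 0.5*x + rnormal()                  // continuous response
generate y_bin  = runiform() < invlogit(-0.5 + 0.8*x)    // binary (0/1) response

* One equation per path: the family and link depend on how the
* variable at the head of the arrow is measured
gsem (y_cont <- x, family(gaussian) link(identity)) ///
     (y_bin  <- x, family(bernoulli) link(logit))

* Information criteria (aic, bic) for the fitted model
estat ic
```

The first equation is an ordinary linear regression and the second is a logistic regression; gsem estimates them together.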
And so you'd fit your model, assess the fit and then usually you'd probably go back to two you'd probably evaluate your model, see if there's any better fit any variables that maybe you realize are throwing off the model or don't fit as well or maybe aren't as causally related as you thought and so it's kind of like circling back from two to three until you finally landed a model you're satisfied with. [Erin McCauley] Hey Sarah we have another question is from Felipa and it says all models should start from theory correct? [Sarah Sernaker] Yes yes. And I mean in the general sense I don't know what [Erin McCauley] yeah I think that would inform how you're designing and specifying the model yeah good relationship [Sarah Sernaker] Yes so I mean theory drives it and also I mean just what's in the research field so theory in the sense that like you have latent variables and you know a whole host of observable variables could go into a latent variable so though that sort of theory theory relating to the distributions things like that so yeah I mean theory's driving it but also research might drive it because you might say I know I’m interested in this variable and so I’m including it in the model. Like I don't you know that you want to look at it you kind of make your model work around it which is usually not such a forceful thing it's just. But so in that sense that's where it kind of deviates from theory where you shouldn't get so lost in theory but you know research and literature can also inform how you're specifying your model I guess is my point. So yeah lots of things go into building this. So practically speaking you know you want to do SEM, you're ready to just fit your model. So I put three programming language here in stata there's a function SEM and there's another package gsem which is actually what I’m going to use today. 
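For a model with only continuous variables, the fit-then-assess loop described above might look like the following with stata's sem command. This is a sketch on simulated data: the indicators x1 to x3, the outcome y, and the latent name F are all hypothetical. By default sem treats a capitalized name that is not a variable in the data as latent.

```stata
* Simulated data; all names are hypothetical
clear
set seed 2022
set obs 500
generate truef = rnormal()               // unobserved factor used only to simulate
generate x1 = truef + rnormal()
generate x2 = 0.8*truef + rnormal()
generate x3 = 0.6*truef + rnormal()
generate y  = 0.5*truef + rnormal()
drop truef                               // the analyst never sees the latent variable

* Measurement model: F -> x1 x2 x3; regression of y on the latent F
sem (F -> x1 x2 x3) (y <- F)

* Goodness-of-fit statistics, including the likelihood ratio test,
* rmsea, and aic/bic
estat gof, stats(all)
```

After inspecting the fit you would go back, adjust the paths, and refit until you are satisfied with the model.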
SEM this function SEM and stata assumes continuous variables and gsem allows you to include categorical or binary variables it just kind of expands your modeling choices. So I should have included that but it's all under the same umbrella if you look at the documentation I think it's all under sem. In r there's another SEM package or there's this lavaan package which is really popular I’ve seen and has some really well-written documentation. In SAS it's proc calis I don't I have no idea why it's calis I’m trying to think off the top of my head is this an [inaudible] I don't know and I will say at the time I wrote this it did not seem that SPSS supports SEM so that is something to note. If you are an SPSS user I don't think it supports SEM at this time and definitely not to the complexity that stata or r could handle either if it is in there at this point. So have an example hopefully this will put in concrete some of these ideas that I’ve introduced. And so today's example I’m using our linked nytd services and outcomes data and so our nytd is the national youth in transition database which follows children who were in foster care who are aging out. So they're approaching 17 years old in foster care, they're probably not going to be they're probably not going to have another placement they're going to turn 18 and then they're adults and out of the foster care system. So this nytd services file includes information about children out aging out of foster care and what services they've received while in foster care such as educational mentorship, financial services things like that and our outcomes data follows these kids for three waves so it measures them at age 17 and then 19 and then 21. So every two years they take a survey and the outcomes are basic measures such as whether this person has experienced homelessness, their work status, other just outcomes so that's our outcomes file. 
So I’ve linked the 2014 cohort from the nytd services and outcomes files to try to understand what services they received and how that affects their outcomes later in life. So I think I summarize all of that here: the outcomes component contains the results of surveys conducted with youth to examine certain well-being, financial, and educational outcomes. Each row is one child, so each observation is information about the child's services, and I’ve put the data in wide format so that each child just has one observation. And I’ve used the last outcomes known, so each child would have information if they've responded at three waves, so they have three rounds of information basically, and I just took the last outcome known. So their last outcome we were aware of, at wave three. So as I said, we want to understand the relationship of services received while aging out of foster care, and I specifically wrote having a substance abuse problem. I think I actually changed my example though, so I don't know that that actually still holds, but we're trying to understand the relationship of services received while aging out of foster care and later outcomes. So some observed variables that we can take are delinquency, education level, sex, and race ethnicity. And a latent variable that I included, which is not really a great latent variable, is a measure of mentorship guidance. And I say it's not really a great latent variable because we can basically observe whether they've had mentorship. This is kind of a simplistic example. I just couldn't think of another latent variable, and I thought this was a good idea at the time, and then I stewed on it for a while and realized it didn't really make too much sense, but it's still here so let's just pretend it's a latent variable.
So I did a path diagram and I’m going to walk through this in stata, but just to give you an example, this was all constructed in stata, which is super handy. It's really great functionality because you're able to not only draw your path diagram so that you have it down (again, if you're doing SEM you should draw your path diagram to understand and define all the relationships you want to measure), but it also shows you, for instance, that this is a Bernoulli variable, so this is going to be a logit model. This is also a Bernoulli variable, this education level could be considered Gaussian, we have a Bernoulli, and then the race ethnicity is categorical, so it's a multinomial model. Oh, I did stick with substance abuse, my apologies, I did stick with that, so we're looking at the effect of receiving services on the outcome of substance abuse. So how are all of these things related to whether or not you later have substance abuse? And so that's when we're gonna jump into stata, so this is the example in stata [onscreen code displayed:
import delimited "c:\users\ss3376.ndacan\box\ndacan\presentations\summer series 2022\s4 sem\s4_nytd_ex.csv", stringcols(1) clear
gen mentor = .
replace mentor = 1 if mentorsv == 1 | cnctadult == 1
replace mentor = 0 if mentorsv == 0 & cnctadult == 0
gsem (raceethn -> subabuse, family(bernoulli) link(logit)) ///
     (delinqntsv -> subabuse, family(bernoulli) link(logit)) ///
     (edlevlsv -> subabuse, family(bernoulli) link(logit)) ///
     (mentor -> subabuse, family(bernoulli) link(logit)) ///
     (homeless -> subabuse, family(bernoulli) link(logit)) ///
     (sex -> subabuse, family(bernoulli) link(logit)), nocapslatent
logit subabuse raceethn delinqntsv edlevlsv mentor homeless sex
] [Sarah Sernaker] A quick aside: I’m using stata, as I’ve said, and I have this code here that I think is being made available to participants. So the first thing I’m going to do is load the data.
And so I’m loading this nytd data and I’m going to show you what it looks like in a second [onscreen data table with columns stfcid, delinqntsv, edlevsv, specdsv, ilnasv, acsuppsv, psedsuppsv, careersv] [Sarah Sernaker] But I just want to emphasize that this data was heavily cleaned before this presentation and for this presentation in that all of these state foster care ids this is a child identifier if you're not familiar with our data this is like a child id. These are totally made up for data sensitivity and risk disclosure so these are not real. And then I’ve chosen a few I think I’ve just chosen a subset of variables this is all to say if you were to order nytd data from us and receive nytd data it might not look as neat as this because this has undergone some data cleaning. So this is nytd data as I said we have our state our child identifier and then we have some variables a lot of them are indicator variables such as did they receive the service or not? Did they experience a certain outcome or not? There are a few continuous variables not really many actually more so just categorical I don't know that we have any continuous variables here. So yeah that's our data anytime you load your data I always say this you should look at your data make sure it looks like you expect it to. One thing that I’m not going to get into today but if you were to use just any data. If you're using any data in structural equation modeling or any sort of research purpose make sure that you go through and clean it yourself. For instance we're not going to be using this public financial assistance variable. You can see these 88s those are not really values these are just placeholders for missingness. And so you would not want to include this variable as it is in your model because it's going to get confused with 88s. 
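As a sketch of the cleaning step mentioned here, placeholder codes such as 88 can be converted to stata's missing value before modeling. The variable name pubfinas below is an assumption standing in for the public financial assistance variable, and the five rows are made up for illustration.

```stata
* Toy data; pubfinas is an assumed name and the values are invented
clear
input byte pubfinas
0
1
88
1
88
end

* Turn the 88 placeholder into system missing (.)
mvdecode pubfinas, mv(88)

* Confirm the recode, showing missing values in the table
tabulate pubfinas, missing
```

Which codes mark missingness should always be confirmed against the codebook for the specific file before recoding.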
So since we're not using it I did not spend the time to replace these placeholders, but this is just another reminder, whenever you're using data, to make sure you understand each of the variable types you're using and any sort of processing that needs to be done, such as this replacing of the 88s here. But again we're not using those variables so I didn't bother to edit those. So I’ve loaded in the data and I’m just generating this mentor variable that I use in the structural equation model. It's just a one if they've received mentor services or if they've had a connection, I think that means connection, not contact. [Erin McCauley] It's connection with an adult. [Sarah Sernaker] Yeah, connection with an adult, that's what I thought. So it's sort of a mentor. I think it asks them if they have an adult that they're very close with, so sort of like a mentor, so I’m just creating this mentor variable that summarizes the direct measure of mentorship or this connection with an adult. So I’m just gonna run that. Okay, so as you can see I’m using gsem. Like I said before, SEM is the function if you are just using basic linear models: if you have a bunch of continuous variables you'd be fine using the SEM function. If you have even one single Bernoulli variable, so a binary variable, then you're in the gsem world, the generalized SEM world. It's kind of like lm in r and glm in r, it's the same thing if you're familiar. But I’m gonna pause there because I did not write this myself. I used the structural equation modeling model builder, which we're gonna go through. So I’ve just gone to the statistics tab. I don't know how big this window is, but under the statistics tab if you go down to structural equation modeling they have this model building and estimation button, and I’m going to click it. And I have to admit here I don't know which version this was implemented in.
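The mentor indicator described in this passage can be checked on a few made-up rows. The sketch below reuses the variable names from the onscreen code (mentorsv, cnctadult, mentor); the five rows of data are invented for illustration.

```stata
* Toy rows; values are invented, variable names follow the onscreen code
clear
input byte(mentorsv cnctadult)
1 0
0 1
0 0
1 1
. 0
end

* 1 if either source indicates mentorship, 0 only if both are 0,
* otherwise left missing
generate byte mentor = .
replace mentor = 1 if mentorsv == 1 | cnctadult == 1
replace mentor = 0 if mentorsv == 0 & cnctadult == 0

tabulate mentor, missing
```

Note that a child with one source missing and the other equal to 0 stays missing under this rule, since neither condition is satisfied.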
I'm using stata 16 which I know is not the most up-to-date version and it's in that so I will just say if you're using something older I don't know when this was introduced unfortunately. So that leads us to this grid so this is where you do all of your path diagrams. So you can see that we have our boxes we have an observed variables, generalized response variable, we have latent variables, multi-level latent variables, paths, covariance structures and then you have these we have these multi like this multi specification at once I did not go this route so I don't exactly know I think this is yeah you just add a set at a time. But I did this the tedious way so notice okay so let me take a step back I’m trying to start to build my model so this is where you build your path diagram with the variables and we'll define the relationships but notice I’ve clicked on this box on the left here and I’m hovering over it and I apologize that this is small I don't know how to make this these sidebars bigger but I’m hovering over and it says "add generalized response variable" so that's anything that's not a continuous variable basically so I’m clicking it because I know I have a lot of binary variables and it's telling me "builder must be in generalized SEM mode to use this tool would you like to change to generalize SEM mode?". Yes we're going to do that that's really just telling stata instead of using the SEM function we're going to use the gsem function so that's all that does. So you just start building so I have this box here and I’ll double click up so every time I click it creates a new box so I want to use my mouse again okay so I’ve included two boxes here and so as I was saying we want to take some measures of from the services file and understand how that leads to the outcomes. So some stuff that we can start with so we want to understand how race and ethnicity might relate to substance abuse. 
So race and ethnicity: I double-clicked on my box here and you get this pop-up, and the variable I’m going to use is race ethnicity [onscreen raceethn]. This does get a little tedious, and I’m not going to build the whole model this way, but I just want to show you guys how you would go about it. So my variable is race ethnicity, and I don't even bother adding a label. Because race ethnicity is a categorical variable, if you were to just model race ethnicity, ask yourself: what model would I be using? And that would be a multinomial logit. Oh, I guess that cleared it out, so I guess you should specify your family and link first. Sorry, I did this a few months ago so I forgot the exact order. So as I was saying, I want to start with race ethnicity, and I know that's a categorical variable with multiple levels. If you were modeling that alone you'd be using a multinomial, so it's the same idea here: you're specifying multinomial on the race ethnicity variable and we just say okay. So notice this changed from Gaussian to multinomial and we have race ethnicity in here. Okay, so now let's include a binary variable. What else did I include? Delinquent. Let's do that real quick. It's a Bernoulli variable, and Bernoulli just means binary, really, and I want delinquent. I think delinquent was if they had an official record of delinquency in the general sense. I don't remember exactly if that means a police record or other means, but this is if they have some record of delinquency. So I’m just going to include that. So notice this one's a Bernoulli variable with a logit link, and notice also these are our observed variables, right? They're in squares, as I mentioned before, so they're directly observed. We directly observe race and we directly know if someone has a record of delinquency.
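[Supplementary example] The family and link choices made in the builder so far (Gaussian for continuous variables, Bernoulli logit for binary ones, multinomial logit for an unordered categorical variable) can be summarized as a small lookup. This is a hedged sketch in Python, not Stata syntax; the dictionary keys are labels invented for this illustration. The `inverse_logit` helper shows what the logit link does: it maps a linear predictor onto a probability between 0 and 1.

```python
import math

# Family and link by variable type, mirroring the choices in the walkthrough.
# The keys are descriptive labels chosen for this sketch only.
FAMILY_LINK = {
    "continuous": ("gaussian", "identity"),
    "binary": ("bernoulli", "logit"),
    "unordered_categorical": ("multinomial", "logit"),
}


def family_link_for(variable_type):
    """Look up the (family, link) pair for a declared variable type."""
    return FAMILY_LINK[variable_type]


def inverse_logit(eta):
    """Map a linear predictor eta to a probability under the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


print(family_link_for("unordered_categorical"))  # e.g. race/ethnicity
print(family_link_for("binary"))                 # e.g. a delinquency record
print(inverse_logit(0.0))                        # linear predictor of zero
```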
So if you wanted to include a latent variable you would include the circle, right? So I’ve included the circle, and when you use a latent variable you have to name it, and in Stata usually you just choose a capital letter, so I think here I probably did C or something; that's what I think I chose and what I usually choose. So you give it a name. And why are we naming this, why are we not choosing from a drop-down? Because remember this is a latent variable, so in theory we do not have an observed latent variable. That's the whole definition of a latent variable: we have not observed it. We know it's a phenomenon that's happening that we're trying to understand, but we do not directly observe it. So you're imposing this in the model. That's why it doesn't have a name yet; we're naming it and I’m labeling it, but right now, for all intents and purposes, it's just a placeholder, right? So I’m just going to include that. So remember from my slides I said that our latent variable is mentorship, which I said was not really a great latent variable, but such is life and here we are. So when you have a latent variable it's not observed, so you're trying to include a set of variables that you think are proxies for, or measurements related to, your latent variable. And so this mentorship variable right here is, in theory, our latent variable, and it could be expressed through our mentor variable, which is directly observable. So this was from our mentor services or our adult connection. So again, this is why this is not a great latent variable, because we can kind of observe whether they have mentorship, but. [Erin McCauley] Sarah? [Sarah Sernaker] Yes. [Erin McCauley] There's a question in the q and a that says: is it required to specify what observed variables lead to the latent variable? [Sarah Sernaker] Yes, there needs to be some relationship or some observed variable, so.
When you have a latent variable in your model you need it connected to some observed variables, and I think the literature tells you you should have at least three; three or more is ideal going into your latent variable or related to your latent variable. Otherwise it just doesn't really mean anything. If I just had this latent variable hanging out here, and I’m going to include the response of interest pointing to that, this just doesn't hold any information on its own. Like I said, it's kind of a placeholder to tell Stata this is a latent variable, but what are your proxy measures for your latent variable? You need some proxy measures. So with the religiosity latent variable, like I said, you could measure how many times someone goes to church, but that's not exactly religiosity, because maybe you just can't make it to church or you just don't like the church structure but you're still a religious person. So that's why you have these proxy measures that try to help understand your latent variable. You need something propping up your latent variable, if that makes sense, and hopefully that answers the question. [Erin McCauley] That's great, thank you. [Sarah Sernaker] So this is our latent variable. As I said, this is trying to understand the mentorship, and I’ve created this binary mentor variable, so we have a Bernoulli logit and then we have that mentor variable I created, so I’m going to include that. And so let me just include our response of interest, then I’ll start drawing some lines. Oops, you have to remember to go back to your cursor or you're going to end up with a bunch of boxes. So now I have a box I don't want, so I go to estimation and then clear. Oh no, is it going to clear all? No. How can I do this? This is the fun of point-and-click methods. Object, delete, this object. Okay.
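[Supplementary example] The point about a latent variable needing observed proxies can be illustrated with a small simulation. This is a sketch under stated assumptions, not the model from the session: the latent trait is drawn from a standard normal distribution, three binary proxies depend on it through a logit link, and the loadings (1.5, 1.0, 2.0) are made-up values. Each proxy is a noisy signal, but together they track the unobserved trait.

```python
import math
import random


def inverse_logit(eta):
    """Map a linear predictor to a probability under the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def simulate_proxies(n, loadings, seed=0):
    """Simulate n people: an unobserved trait plus binary proxy measures.

    Each proxy equals 1 with probability inverse_logit(loading * trait).
    The loadings are hypothetical values chosen for this illustration.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        latent = rng.gauss(0.0, 1.0)  # never seen in real data
        proxies = [
            1 if rng.random() < inverse_logit(b * latent) else 0
            for b in loadings
        ]
        rows.append((latent, proxies))
    return rows


def mean_proxy_count(rows, keep):
    """Average number of endorsed proxies among rows whose trait passes keep."""
    counts = [sum(proxies) for latent, proxies in rows if keep(latent)]
    return sum(counts) / len(counts)


rows = simulate_proxies(5000, loadings=[1.5, 1.0, 2.0])
high = mean_proxy_count(rows, lambda z: z > 0)
low = mean_proxy_count(rows, lambda z: z <= 0)
print(round(high, 2), round(low, 2))
```

People with an above-average trait endorse more proxies on average, which is the information a measurement model uses to say something about a variable that was never directly recorded. With no proxies attached, there would be nothing in the data to distinguish one person's latent value from another's.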
So I’m going to include here the main variable of interest, which is our substance abuse, and notice it just uses whatever you used last. Our last variable was Bernoulli logit, so that's why our new box is Bernoulli logit. And so substance abuse, where did it go? And I am totally not a point-and-click type of programmer, so this is a bit tedious, but I think it's super helpful to build this to really understand how this code comes out of it. So that's my aside. So now we're just going to start drawing some paths. We're trying to understand the relationship of race on substance abuse, and we're trying to understand the relationship of delinquency on substance abuse. You cannot create that, okay, why? I think it just got overlapped with that one. And then this one: we're saying this is affecting this latent variable, and when you draw an arrow to a latent variable you automatically get your error term. As I said before, these lines represent individual equations and there are inherent errors there, but they're so implied that they don't even pop up. So I’m not entirely sure why Stata makes the ado about your latent variables, but that's all that is, that's our error term. And then we're saying, okay, how does this affect substance abuse? And so this is a small subset of the model I’ve specified, but this is again just to give everyone a sense of how you would build it and the arrows you would be drawing, including latent variables. Again, this is a super simplistic example and I’m going to stop there. So you can actually save your SEMs; let me see if this works. You can save them in Stata, yeah, okay. [Erin McCauley] Sarah, it's about five minutes. [Sarah Sernaker] Okay, let me see if this is going to work how I want. So you can build these paths and then you can save them, so that you should in theory be able to open them back up if you need to tinker with them. And I think this is what I wanted.
Let's let it load. So this was the full model that I built and this is what is comparable to my code here. But I wanted to show you the full model because once you've, oh no, this was not it. Sorry, bear with me one second. I did a lot of fiddling and playing around with this, because that one did not include our latent variable, our faux latent variable. So this was the final model that relates to the one that I’ve specified. And so once you're ready, once you've defined all the relationships and pathways that you're interested in, then you can go to estimation and estimate it. So what does that mean? It means that Stata is going to systematically go through each relationship and fit the respective model, and it's also going to spit out the code so that you can copy and paste it into a do-file and rerun it. And when you do the estimation and try to fit it, this pop-up comes up, and usually you can just press OK; unless you have very specific changes you would be fine with running it as it is. Some other variations you might want to include are under maximization, so this will be how Stata fits your model; if it's running into convergence problems you might want to tweak this section. If you are worried about certain types of standard errors, or if you have survey data, those are the types of specifications that are in these tabs. But usually you're fine with just clicking OK. So we've translated this path diagram, or rather Stata's translating it for us, into a model. Look at all that it's fitting, so it's going, going, going, and let me just scroll up a little. Oh yeah, where are we at? So this is the translation from our path diagram to Stata code, basically, and notice that Stata is defining each relationship separately. It has the link functions and our latent variable C, and notice it says latent variable C.
And then it just goes ahead and fits it, and this is the log likelihood. So this is where I make my plug for R, because I really like this functionality in Stata and I think it's a really great place to start, because as I said you can draw all these diagrams, you can specify the distributions, and it is just a good place to organize what model you want to fit. I found that Stata's SEM procedure usually comes up with problems, especially the more complex the model that you fit, and I’ve just found that R doesn't have these problems. When R lands at a solution it seems better than Stata's, so that is my plug, but R does not have very nice path diagram functionality. So you get all that fit and then what are you left with? You have a bunch of coefficients, right? And each coefficient would be the coefficient of each specific model. So this is the coefficient modeling race ethnicity on substance abuse, this is the coefficient modeling delinquency on substance abuse. Sorry, I’m trying to balance the output. This is the individual coefficient on substance abuse for education level, and then sex, and then C is our latent variable, and by default it's always constrained to one because of the assumptions of factor analysis, which is not a totally separate topic but it's outside the scope and time of today. So those are the basic, straightforward models, and then down here are basically the factor loadings, because these are the variables going into that latent variable. And so, yeah, I think that's really all I have to say. So you would specify your path model, you would fit it, and then I’ve just copied the output here. I think in this version I took out the latent variable because I really hated it, because it's really not a latent variable, so. [Erin McCauley] Hey Sarah?
[Sarah Sernaker] Yes, okay. [Erin McCauley] We're almost at time and we have a few questions I just wanted to read to you. The first one is: could you share the code to assess model fit, since you are using gsem as well? [Sarah Sernaker] That's a great point. Gsem has some built-in functions, so after you run gsem you should be able to run a function that spits out model metrics. I have not included it here and I cannot think of it off the top of my head, but definitely check out the gsem documentation. I just did help gsem, model description, so it's got to be somewhere in here, but there are functions that you can run right after you fit your model to get those model fits. I apologize for not including that; that's a very good point. [Erin McCauley] Great, thank you. And then we had someone leave a comment saying thank you so much Sarah, and I want to reiterate it because this has been a great live walkthrough. I know I’m going to do an SEM model now. [Sarah Sernaker] Yes. [Erin McCauley] And then we had another question. When you were showing the model chart, the question was: shouldn't there be two error terms for mentorship, one from the association with having mentoring services and one from connection with an adult? [Sarah Sernaker] Yes, so I have two. See, I edited my code after the fact. This was my first round, and I put this connection with an adult and mentorship into this mentorship variable. Like I said, I stewed on this latent variable and I was like, this is not really a latent variable, and that's when I went back and created this mentor variable and simplified this even further.
But I wanted to show how you would include a latent variable, so here with my latent variable specification I do have those two distinct variables going into that latent variable, if that makes sense. [Erin McCauley] Great, and then there's just one more, and I know we're a minute over so be quick, okay? Is the data fit the model? [Sarah Sernaker] Um, sorry, I guess I don't understand. Does the data fit the model, or? [Erin McCauley] Like the measure of the fit in the model, maybe? [Sarah Sernaker] Um, I guess I don't understand the question here. Let me, while we have a minute left, I have these references. Oh sorry, sorry, sorry, I know I’m running over, I always do this. Just a quick note on references: these are super, super helpful. This one's for structural equation modeling in Stata, and then there's one for using R that's super helpful. But my point is, here's my email address. We're running out of time, so if you want to ask me questions or contact me, please feel free to send me an email. I will add a caveat: I’m technically on vacation this week, so it will be a few days before I get back to you. [Erin McCauley] All right, great, thank you so much Sarah for coming in and doing this even on your vacation. Thanks everyone for attending. It was great to have so much participation today and hopefully I’ll see everyone next week. Yes, except for Sarah because she's on vacation. [Sarah Sernaker] Well, I’ll be back next week. I am keen on next week, I think that'll be a good topic, yeah. Okay, bye bye. [Erin McCauley] Thanks everyone, thanks Sarah. [voiceover] The National Data Archive on Child Abuse and Neglect is a collaboration between Cornell University and Duke University. Funding for ndacan is provided by the Children's Bureau, an office of the Administration for Children and Families. [musical cue]
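[Supplementary example] On the Q&A question about assessing model fit: likelihood-based models such as the one fitted in the session are commonly compared with information criteria computed from the log likelihood shown in the output. The sketch below, in Python, applies the standard AIC and BIC formulas to two hypothetical fits; the log likelihoods, parameter counts, and sample size are made-up numbers, not results from the session. Stata's `estat ic` postestimation command reports criteria of this kind, though whether it covers a particular gsem model should be confirmed in the gsem postestimation documentation.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 * logL + 2 * k."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: -2 * logL + k * ln(N)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Hypothetical fits of a simpler and a richer model to the same N records.
n_obs = 2000
fits = {
    "simpler": {"ll": -1250.0, "k": 8},
    "richer": {"ll": -1245.0, "k": 12},
}

for name, fit in fits.items():
    print(
        name,
        aic(fit["ll"], fit["k"]),
        round(bic(fit["ll"], fit["k"], n_obs), 1),
    )
```

Lower values indicate a better trade-off between fit and complexity. The two criteria can disagree, as they do for these made-up numbers, because BIC penalizes extra parameters more heavily once the sample is reasonably large.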