# RITHAC 2017 Coding Workshop

All right, awesome. I just tweeted out the stream. Yeah, we're good to go; it sounded good, I checked it already. You just changed something, so hold on. Okay, yeah, it's good. YouTube backs up everything, right? So let's wait a couple of minutes before everyone gets in. [Laughter] [inaudible]

Welcome, everyone. Thank you for attending, and thank you for joining the stream online. This is the first time we're doing this, and we hope it becomes a series across various hockey conferences. So, just a few introductions. I'm [inaudible]. I'm [inaudible], a senior at the University of Pennsylvania. I write for Hockey Graphs and [inaudible]. Thank you so much to our helpers, Alex and Connor, who will be helping you guys out online.

The main goal of this workshop is to get you excited about statistical ways to explore hockey questions. We want you to be less overwhelmed by this kind of stuff, and to introduce you to it with an easy run-through of what it takes to build a very simple model. The first thing we're going to look at is what linear regression is, or actually, what regression is. Regression analysis, broadly speaking, is predicting a response variable from one or more predictor variables. What that helps us do is identify what is important in predicting something, and it helps us describe the relationship between the predictors and the thing we're predicting. Finally, from that you get an equation that will predict what you're trying to predict, that response variable.

There are many types of regression; these are only some of the most common ones. We're going to go through one of the most common types, called ordinary least squares regression, and specifically we'll be doing the multiple linear model. OLS, broadly speaking, helps us find a weighted sum
of predictor variables. Visually, what we're doing is fitting a line to something. We have data on an x-axis and a y-axis, a predictor and a response variable, in the simplest form, which is simple linear regression. Then we fit a line that minimizes the distance between the points and that line and tries to predict what that relationship is. Don't worry too much about some of the words on the screen; we're just showing you one of the most common methods of statistical analysis.

Once you go forth in the world and explore stats further, you can do things like logistic regression, which is a very common tool for predicting binary outcomes: something either happens or it doesn't. You may have seen prediction models for whether teams will win games or not. One of the presentations tomorrow uses hazards regression to make survival curves. So there's a lot of cool stuff, but here we're working with lines.

Don't worry too much about the Greek letters. What is happening here is that we're trying to come up with a weighted sum of a bunch of predictor variables to get the best sense of what the actual response is. The coding guide goes on to predict points, so I'll just talk about that. Let's say that we have three variables that we think will be very helpful for predicting points. When we fit a regression, we get slope estimates: multipliers for the values of these three variables, which we add together to get our best estimate of the response, the actual predicted value. This will be much easier to see when we actually go through it, but this is just giving you a very broad overview. The betas are those weights, the X's are all the different variables, and there might be an intercept that you have to add to the prediction. Again, don't worry; it's easier when you see it.
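Written out, the weighted sum described here might look like the following, assuming three predictors. The symbols are an assumption based on standard textbook notation, not taken from the slide:

$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3
$$

Here $\hat{y}$ is the predicted response (points, in the coding guide), $\beta_0$ is the intercept, and each $\beta_j$ is the slope estimate that multiplies the predictor $x_j$.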
If you want to look at the coding guide already, just to get a sense of what this looks like, that's totally cool; we're going to jump into that very soon. But again, we're trying to minimize the sum of the squared errors. It truly is just trying to fit a line that gets closest to as many points as possible; in the most basic terms, that's what we're doing. The way it does that is by minimizing the sum of the squared distances between all of the points and the line that you're fitting.

We can get more into this down the road, but the important thing to remember, thinking about this very broadly, is this: you can have data and say, "I'm going to fit a linear regression to this, I'm going to fit a line to it." You can do that. The question is whether you always should, and the answer is unequivocally no.

Here are some of the guidelines and assumptions that we work with when using regression. First, the residuals, which are the actual values minus the predicted values once we have our predictions, should be normally distributed. Basically, that means there aren't clear outliers that are really far away, where it doesn't seem like you can trust the predictions at all. We also want the responses that we're trying to predict to be independent of each other, so that there isn't something we're not taking into account in our model. Then there's linearity. This is a really important point that I think gets lost on a lot of people, because this is one of the most popular models for statistical analysis: when you use it, you have to make sure that the dependent variable is related in a linear fashion to the independent variable. If that's not true, there are other kinds of methods, or ways of transforming the data to get there, but it's really important to look at that and evaluate it before you
end up presenting your results. And lastly, homoscedasticity. Fun word to say, but basically, when you look at the line you're fitting, you want to see constant variance along every part of the line. You don't want to see that the spread is really low for high values while for low values it's all over the place; you want to see some consistency.

We've said that we're going to get into multiple regression. With simple linear regression you're looking at one predictor variable, so that's the X-versus-Y line fitting here. Multiple regression is a little harder to visualize, but it's the same general concept: there's more than one predictor, they're weighted based on how important they are, and it's the same thing, predicting the response.

All right, so I'm going to pull up the coding guide, and hopefully this will make more sense once I do that. If you don't know where to find this, check my timeline; there should be a Dropbox link to these documents. Or check your email if you signed up for this, because I sent out links to all of these as well.

For this guide, what's great to do while you're following along (and trust me, there's basically nothing that you have to do; we're just taking you through this) is open up RStudio. I know that many of you downloaded it beforehand, which is great. Open up the R file that I sent you in addition to this HTML file, and then you can run the commands in that R file as I go through them in this guide. Your output should look very similar to what's on the screen, and hopefully you get a sense of how everything fits together as we go through this. The guide has all the explanations in it as well. Hopefully that sounds good to you guys. And just in a very general, philosophical
sense, as I said on the very first slide, unfortunately you can't learn R, or any coding language, in two hours. Truly, what we should be doing is taking you through a ten-hour overview of data structures, the philosophy of coding, vectors, and all those really fun, amazing things. But what we thought would be more helpful is for you to see the end result, the kinds of projects that you can do. After this, we definitely want you to continue with some of the basic R coding resources and build up your knowledge in other ways, while knowing that this is what you can do and these are the types of projects you can work on. With this particular coding guide, you can go from just downloading some data from Corsica Hockey, a very common resource that a lot of us use, through cleaning it up, to making a linear model, just by following these steps. It's definitely supposed to be reproducible. That was very long-winded, but I hope you understand what the point of this is.

The keyboard shortcut to run one line of code is Ctrl+Enter, or Command+Enter. That's for RStudio. Hit that and you'll see the output that I have in each code chunk. The first thing I did was assign the number 3 to the variable x, then compute 2 times x, and it turned out to be 6, which is great. We're off to an amazing start. You can write a formula like that which refers to this 3. In general, we're assigning things to variables using that little arrow; that's always very important.

Again, like I said, I got this data directly from Corsica. It's a little bit intensive, as you'll see, but what I thought would be most helpful is to do none of the
data cleaning up front, and just show you that you can go directly from a Corsica-downloaded CSV file, through doing all the data cleaning in R, to being able to make use of it. So I decided to go a little nuts with this and get power play, even strength, penalty kill, and all-situations data, and combine it all into one monstrous data set, which you can use for basically any project you want to do.

The first thing we need to do is read in the CSVs that I sent you. One thing about this command: if you just read them in without saying something like check.names = FALSE, R will modify the column names into a form that it likes better, because the Corsica download has percent signs and other things that R doesn't consider totally readable, so it changes them for you. So I just read in the data sets.

This is also really important: if you haven't already done so, you do have to run these commands. I commented them out because I've already installed these packages, but in R you have to run install.packages for ggplot2 and install.packages for dplyr if you want to be able to follow along, because some of the commands I use reference tools from those packages. Then you actually load them in with the library command, and what that does is let you use commands from those particular packages. The great thing about R is that there's a lot you can do out of the box, but also anything that someone has imagined that isn't included in base R, as long as they've made a package for it, you can load in and use as well.

R is really great for regression. There have been 200, or I think now close to 300, functions written in R for regression analysis. Python is also very useful, but for regression I would recommend R and those two packages. And I mean, you know, a
linear regression, being one of the more basic modeling techniques, is obviously pretty easy to implement in Python as well, so if you're more comfortable with Python and not necessarily R, that's okay.

This next example I just put in there because I'm going to be writing a function later on; it's just to show you the syntax for it. I defined it and said, okay, it's a function that you have to give three values: an a, a b, and a c. It does some math on the three values and then returns an output, d. Then I called that function: I printed the test function with 3, 2, 1, it did all those mathematical operations on those values, and it returned 8. One of the important things to know about functions is that it doesn't actually save d as a variable you can use; that's why this error message comes up. In general, functions are a way to do a lot of things quickly, to make a lot of different data sets or clean data sets, whatever the case may be.

So let's actually look at some of the data. One really good way to do that is just to use the head command. It shows you the first six rows, although if you add, say, comma 10, it will show you the first 10 rows, and so forth. This is just good for getting a quick sense of what you're actually working with. But since this data frame has a lot of columns, we won't see all of them when we just print that output. Another thing you can do to get a sense of the data you're working with is get the dimensions. This is rows by columns: as you can see, over 3,000 rows and 44 columns. It's a pretty big data set, which it should be, because it has all the players for the last four seasons. From the column names, you can see that we have a lot of different fancy stats for players
in addition to information like season, team, position, points, et cetera, which makes a lot of sense. Again, this is from Corsica Hockey; there's a link in the guide at the very beginning, and you can also see the site for more information.

Another thing we can do: let's say you want to look at the first 10 rows and 5 columns. We did that using these brackets, rows by columns. This is the way you look at a specific set of rows or a specific set of columns. If you just wanted to look at, let's say, the second row, you could use just 2 instead of 1 to 10. You can play around with that and try different values.

The first thing we actually wanted to do was remove some of the columns to make the data a little easier to work with. I made a vector of all the column names that I decided to remove from the data set. I made a copy of the unclean all-situations data set, just called x, and I'm testing out these commands on x. I removed these columns by making them NULL, so they just don't exist anymore.

Truly, I know this is probably not your favorite part, but what it should do is give you a sense of how much work goes into the data cleaning before you can really get into the modeling. It's really important to do all of these steps so that you have workable data, and so that when you move on to building models you don't encounter errors and so forth.

So I got rid of those columns. Another thing is that R doesn't particularly like percent signs and similar characters in column names, so I ended up substituting them, with the gsub function, with things that R handles fine and that are more readable. As for the way this function works: basically, you look up the way that any function works.
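A minimal sketch of the steps just described, using a made-up data frame in place of the Corsica download. The column names, values, and the replacement string `.pct` are assumptions for illustration, not the guide's actual code:

```r
# Toy data frame standing in for the raw download (all values are made up).
x <- data.frame(
  Player = c("A", "B", "C"),
  Season = c("2016-17", "2016-17", "2016-17"),
  `CF%`  = c(51.2, 48.7, 50.1),
  Extra  = c(1, 2, 3),
  check.names = FALSE  # keep the raw names so we can clean them ourselves
)

# Rows-by-columns indexing with brackets.
x[1:2, 1:3]  # first two rows, first three columns
x[2, ]       # just the second row, all columns

# Remove unwanted columns by assigning NULL to them.
cols_to_remove <- c("Extra")
x[cols_to_remove] <- NULL

# Substitute the percent sign in column names with something more readable.
names(x) <- gsub("%", ".pct", names(x), fixed = TRUE)

names(x)
```

Using `fixed = TRUE` treats the pattern as a literal string rather than a regular expression, which keeps the substitution predictable when the characters being replaced are punctuation.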
You're not expected to memorize any of it. You say, okay, substitute this string with that string, and reassign those to the column names.

Lastly for this part, I'm thinking ahead to the fact that I want to merge all these different data sets: all situations, penalty kill, power play. I'm going to merge them all together, but if you have the same column names in all four data sets, R is going to get confused about what's what. So another thing we need to do is paste on the name of where each column is coming from. Take time on ice, for example: I'll have four different columns for time on ice, for all situations, penalty kill, power play, and so on. So for each of these distinct data sets, I'm pasting on a word that refers to which one it is. The guide has some more explanation of the function, but as you can see here, for the all-situations data, this is how we got the column names.

Then I wrote a function that did all the steps I described and applied it to the four different data sets, and then merged them all together. Some of you are familiar with SQL and know about joining things; these are left joins. I put all the data sets together by telling R that the columns that are the same across all four data sets are player, season, team, and position. Keeping those constant, it can merge all the rest of the columns together into one big, huge data set. This is what you get in terms of all the different columns. Remember that it's very standardized: it's the same variables that we looked at before, just prefaced with which situation you're looking at.

One way to get a quick overview of the features and distributions of the different variables is to use the summary command. This is also useful right
when you read in your data. You can see, for example, that some players have four seasons of data in the data set, so they come up four times; we're looking at 2013-14 all the way up to the end of last season.

Some things to keep an eye out for when analyzing data: for example, what you see when you look at the positions. Most of the players are listed as defensemen, centers, or wingers, but some of them have multiple positions. When you're building a model, that can be tricky, because if you want a coefficient for a specific case and there are only three players in it, that doesn't make much sense, so you might have to reassign them. That's an example to keep in mind.

You also want to look at the general distributions of the variables to see that the values make sense. If you saw something like someone having regular-season games played above 82, you'd say, okay, there's something wrong with the data. Oh, okay, interesting. Well, that's fun. [inaudible] It doesn't affect the actual commands that we run, but it's basically a cautionary tale.

Finally, the specific project that we're doing is predicting a season's worth of points, and the way we're going to do that is using data from past seasons. So what we want to do for these three columns is create lagged columns. The lagged columns tell you what that value was in earlier seasons: if you have points for this season, you can have a lag-one column that tells you what that player's points were last season, and that's what you can use as a predictor variable in your linear regression. I tried to do this in a very quick way, using dplyr, which is a good R package for messing around with data sets, and which lets you do something like
this. It may look weird, and you may not understand the syntax right away, but what I ended up doing is saying: for each player, let's make a bunch of new columns where we lag the values once, twice, and three times. I'm also going to filter it and say that even-strength time on ice must be greater than 400. That's an arbitrary value, but it's worth thinking about, especially when you're doing hockey analyses, because if you have players who haven't played a lot, they're probably going to be adding noise to your model. I gave this data set with the lagged variables its own name, so that's what I'm referring to from here on.

When we look at the dimensions, we now have 64 columns, which is even larger, but we also have fewer rows. If you remember the original dimensions further up, there were 3,000-something rows, so that filter on the time-on-ice minimum took out a lot of players. If you use the head function you can get a sense of what the data look like, and you'll notice that there are NA values in the lagged columns. Usually an NA value means there's nothing in that particular cell, and usually that's a cause for concern, but in this case I know it's fine. The reason I know is that Aaron Ekblad has three seasons of data in this data set, and the lag-three column requires data from three seasons ago. He has data for 2014-15, 2015-16, and 2016-17, so his latest entry can have lag-one and lag-two columns filled in, but he doesn't have any seasons prior to 2014-15. That's why you see those NAs.

You may notice that there are a ton of columns, which is scary, but remember that the way I structured it, the naming is very standardized: you start with the variable itself, like [inaudible] or
something similar, then the situation you're interested in, and then, if there's a lag to it, whether it's from a year ago, two years ago, or three years ago: lag one, two, three.

Okay, so that was a lot of work on data cleaning. Now we get to actually looking at stuff, which is fun. I, in my infinite wisdom, decided to show you guys ggplot, which is a slightly harder way to plot things, but it's also a much prettier way to plot things, which is what I care about. You have some quick instructions here, and I'm showing you what the syntax looks like. It's something you have to get used to, but basically you tell ggplot the data set you're using, the variable of interest (or the two or three variables of interest, depending on what your graph is), and the type of graph it needs to be, and then, if you want it to be pretty, the fill color and outline color. Again, this is an additional package. R has its own base graphics, so if you just run this histogram code in your console you can see the same histogram, but it won't be nearly as pretty.

You can see here, for example, that we have points in all situations, and the histogram counts the players that fall into each bin. The trend is roughly what we'd expect. Because we're using the original data, where we didn't exclude any players, we see a lot of players in this region near zero, maybe because they were just called up for a game and didn't really do much, that sort of thing. So that's what we're seeing there.

In a similar way, we can look at something like even-strength versus power-play points. Again, I'm using the original data set, so I haven't excluded anything yet. geom_point tells it that it's a scatter plot [inaudible], and here I'm telling it that x is even-strength points and y is power-play points. We can see again that the trend makes sense: on average, somebody who scores more even-strength points will also score more
power-play points. If you think back to the regression discussion we had, if you were going to fit a regression to this, you'd be trying to see how well even-strength points can predict power-play points, and how much variation is still left. You'd try to fit a line that minimizes the distances of all the points from that line, and then see what the actual formula is. There are a lot of other commands that you can use with ggplot; all of the plots that I make use it, and the sky's the limit with what you can do with that.

Okay, now we're going to do our first regression. I'm going to start with a simple linear regression, so it's just one predictor. This is actually a slightly Python-oriented way to do it, but I like it as well: I make a copy of the data with just the columns I'm going to use, because that way I don't have to deal with a billion other columns. I just think it's neater, but you can also specify the columns manually [inaudible].

This lm command stands for linear model, which is what we want. We tell it which data to use, and we tell it to predict all-situations points in the season. This dot tells it to use all the other columns as predictors, which in this case is just one column. Then you get this output, and this output will end up looking very familiar to you if you continue to build models. There are a lot of things to keep an eye out for. One of the first things you look at is the coefficient estimate for last year's all-situations points for the whole season. What this is telling you is that the model's best guess for how many points a player will score this season is roughly 0.74 times the number of points he scored last season, plus the intercept, which is about 8. And this p-value being less than 0.05, and these stars (we can talk about p-values later), basically says that it's
statistically significant: the model is very sure that this is a positive relationship, and this is its best guess for the actual number you have to multiply last season's points by.

One thing to keep an eye out for here is the message saying that it deleted some observations due to missingness. That means they had NA values in at least one of the columns. But we know this makes sense, because a player's first season in the data set won't have a lag-one value, since we don't have any information from before that. So it's fine that those are dropped.

We can also look at the R-squared. The R-squared is a measure of the amount of variation that is explained by the model. Think of it this way: before you fit a line, you have a plot, and the points are all over the place; there's a lot of variation. Once you've fitted a line, you have estimates for how y, the response variable you're interested in, varies with x, and the R-squared tells you how much of that initial variation you were able to explain and what's left. This says that 60 percent of the variation in y is explained and 40 percent of it is still unexplained. I think that's good for now.

So again, our estimate of this season's points is the intercept, 8.3, plus the slope, 0.74, times last season's points. I really want you to think about what that implies. As an example, this will be a pretty conservative estimate for someone like Connor McDavid. Say he scores 100 points in a season: this is predicting that he'll score 82 points in the next season, based on the formula. It's assuming that he's going to regress to the mean a little bit, that he can't keep up his performance. Some things to remember about
models in general: they're very skeptical. The model doesn't know who Connor McDavid is, right? It's just assuming that if something is way out of the ordinary, it's going to come back. And that works the other way, as you can see from the intercept: if you had zero points last year, the model expects about eight points next season. Regression to the mean works the other way too. Yeah, absolutely.

This is simple linear regression in every sense of the word. It doesn't explain all of the variation, but there's benefit to using this model to get a sense of what's going on. What I did here is plot the points, with last season's points as the x values and this season's points as the y values, and I also plotted the line that we predicted. You can see exactly what we talked about: it's very good at identifying the general trend of what happens this year, but for any individual point it doesn't do super well, which again is roughly what we expected. We wanted to get a sense of the general trend, which is what we did.

It's also worth looking at some of these really far-away outliers. I was particularly interested in this one, because the points in the year in question are way higher than the points from the year before. I was wondering who that is, and it was Johnny Gaudreau, who funnily enough is one of my favorite players. The reason that happened is the lag: when he came into the league, he played one or two games for Calgary at the end of a season, and then he played a full season and did really well. So his lagged value was really low because he only played a couple of games that season, and that's why this prediction didn't do that well. One of the things to do, going back, is try to prevent that from happening by instituting a more rigorous time-on-ice requirement for the lagged
columns as well. We could also have a games-played requirement, very similar. By looking at some of the cases where the model fails, we can make more rigorous assessments of which points make sense to include and which don't. But it's important to be very cautious about just kicking out points, because once you start taking out values left and right, you're just predicting the ones that you think are going to go well, and that's not super helpful, right?

Okay, then I wanted to do a multiple regression, and I decided to switch it up a bit and predict points per 60. This may be easier to predict in cases where players have partial seasons [inaudible]. The important thing to note is that if you have a lot of different predictor variables, you have a lot of different slopes. You can think of the slope of each variable as what would happen if, holding all other things constant, you were to increase that variable.

I talked about the issue with position before, where you have a couple of players who are listed as having multiple positions, and that would screw up the estimates for the regression. So what I did was just use the first position listed. It's a little extra data cleaning step before we use this. Position is categorical. If we think about what that means in the context of a model: if you're just predicting points based on, let's say, position, you'll have a coefficient for each position, and one of the positions will be used as the default. You can try this: just predict points using position. It'll use one of the values as a default; let's say it'll say our best guess for a center is 20 points or something like that, and then it will give you
coefficient estimates saying like okay if you're a defenseman subtract 10 but if you're a winger maybe add five I don't know I'm just making numbers up but so that's how categorical variables work in the context of this and we can scroll back down and see that so I added a bunch of different predictors for points per 60 last year's points per 60 last year's primary points per 60 I thought you know maybe that would be helpful [inaudible] I also added two years ago points per 60 and position so we can see here that for example the default position because it's not listed is center and then this tells you like what to subtract if you're looking at a defenseman [inaudible] and then what to add or subtract for a left wing or a right wing and then I also included game score just because I felt like it again this is very [inaudible] like you know when you're looking at this it's a lot of exploration it's a lot of testing columns to see what works see what makes sense with your intuition and just you know adding stuff subtracting stuff but basically you want to keep in the predictors that are statistically significant more generally so the reason I chose to keep position in for example is that the coefficient for defensemen was statistically significant so I kept that in but it seems like wingers are not statistically significantly different from centers so one of the things you might do in that situation is saying okay it seems like the additional information we have about whether someone is a winger or a center is not that helpful so maybe we should just reclassify all of them into forwards so that's another step that you take to kind of iterate on this based on what you see in the model results and we can see that our R squared is 0.71 so we're accounting for 71 percent of the variation within points per 60 which again is you know for like a pretty basic model that assumes basically linearity oh and yeah so you know again [inaudible] you know also you want to look at
the slopes of the coefficients and see if they make sense so we would expect the coefficient for defensemen to be negative because we don't expect them to score as much as forwards we would expect the coefficient for last year's points per 60 to be positive because if you scored more points last year you should score more points this year we should expect that last year's points per 60 is more important than two years ago points per 60 meaning last year is probably a better example of what we'll do this year and so all of these trends are shown within the data which is very comforting then also we can see that game score does add a little bit you know maybe you're doing the little things right that aren't necessarily seen in the points [inaudible] last year you'll also have a pretty good year this year and then so again we mentioned residuals in the slides so that's the difference between the predicted value and the actual value so I made a histogram of the residuals from this model and they are centered around zero which is definitely definitely what you want and as you can see it looks you know fairly normally distributed most of the residuals are very close to zero which is great saying that for most of the data points you can get a pretty good estimate of what this year's points per 60 is just based on all of the variables that we have from last year which is neat and I decided to use the color violetred3 because I liked it very much and then the alpha actually makes it slightly transparent so just to [inaudible] and then I did this thing here because I wanted to see the players that have the biggest difference between the model and what they actually did so what I did was I made sure that we're using basically a data set that has all of the predictors that we need so basically I subsetted it so that we're looking at observations that also have lag two terms for all variables and then I made a new column the prediction column that
predicts the points per 60 based on the model that we made today and then I made a residual column which puts in all the residuals that the model produced basically after looking at the predictions and then I got rid of a lot of columns because I just wanted to see these ones and then I looked at the sort of first six values when sorted by largest residual and then the first six values when sorted by smallest residual so a little bit went into that but the point is to see who was exceeding and who was falling short of expectations but you know as we can see we have Viktor Arvidsson on the very top end just because he completely exceeded expectations based on what he had the previous year last season and so we can think about him and just like maybe think about some variables that would have maybe predicted this meteoric rise that he had or maybe there aren't any variables that we can use that would predict that sometimes these things just happen so you know it's like something to think about but also something to keep in mind that we're never going to be able to predict 100% of the variation of anything really if we were you know we probably wouldn't even watch the sport [inaudible] so yeah so these are kind of the overperformers based on the model and these are the underperformers in terms of this season's points per 60 and then I just decided to plot the distribution of all situations points per 60 [inaudible] and then I plotted the distribution of our predictions and one of the things that I wanted to show you guys and this relates to our regression to the mean point earlier is that the predictions are not as spread out as the actual values so again this goes along with that concept because when you're building this model what is definitely going to be happening is that for players who do really poorly it's going to predict that they do slightly less poorly and
for players that do really well it's going to predict that they you know come back down a bit so you know that's another important point to keep in mind and thus ends the guide that I created so again you can refer back to this change whatever like predictors that you want you can change whatever sort of filters that you want if you think that 400 even strength you know minutes on the ice was a completely stupid decision by me you can rerun this with whatever number you want so there are many points at which you can basically play around with this build your own models and really the point of this was for me to do all the dirty work in getting all of this done for you to be able to use [inaudible] okay and so now we can go back to the slides yeah so you built your model it's pretty good but unfortunately it doesn't end there and many people say that this step could be just as long sorry again so you guys remember this slide about the assumptions we have to make to use OLS regression well we have to test that and Namita just like sort of showed us some of that but we're just gonna go over it again and so basically that's the stuff that we're gonna be showing you guys right now it's some of the things you can do to graphically validate your model or check if it's accurate so to test our assumptions there are many functions you can use in R to do it and we're not going to go in detail or describe them at all because it's beyond the scope of this workshop but just know that you have to test your assumptions and make sure that they are being respected these are some of the summary things you can get from a regression so one of the really important things is linearity like in that example if you see a trend basically in your data or in the residuals where it looks curved that tells you like okay maybe a line is not the best fit for this data and what's really important is that you know that you look at that rather than just stopping
at something like R squared because you can have a model with a really good R squared and think like okay we're predicting a lot of variation we're doing a really good job that seems cool but if you look you might see that oh like a polynomial term would help better fit this so it's important not to just stop at something like R squared but to actually look at you know what the data looks like what the residuals look like and again that's the difference between the actual values and the predicted values for the response variable and then also you know around homoscedasticity the reason why that's important is because you know if you have like mostly residuals that look amazing and then just some values that are like all over the place that don't make any sense you know what happens if you're trying to predict like one additional value and it turns out to be the type of observation where your predictions for some reason are way out of the ordinary and don't make any sense you know you have to deal with that right so it's important to test to see if these assumptions are satisfied meaningfully and sometimes if they're not you can still you know talk about your model you can still talk about the general trend but you can say like listen you know this is the general trend we see a lot of outliers here you know these are some of the reasons I think that there are outliers yeah some of these are definitely inflexible and we can talk about the independence assumption as it pertains to players yeah so I guess yeah you know and I'll talk a little bit about that I guess so the independence assumption that we talked about before you might be saying hey wait scoring points isn't independent of all the other observations because if you're the linemate of someone that scores a lot of points you might get a lot of goals and assists off of his amazing production so what that would lead you to do is saying okay the independence assumption isn't necessarily
satisfied what can we do to kind of include that [inaudible] there are many other tests you should be doing on your data and we're just scratching the surface here we're just going to list some of the ones that come up a lot in hockey analytics the first one being multicollinearity and [inaudible] can speak a little bit about their experience with that yeah I mean so I think this you know so again like you do diagnostic tests to test for some of the common problems so these are basically common problems sometimes you can ignore them sometimes you can fix them but multicollinearity when you're making multiple regressions with a lot of predictors that's saying like okay there's two predictors two variables that are very highly correlated with each other and you're using them to predict a third variable you know the model kind of doesn't know what to do with that it doesn't know you know how much of the third variable is due to the first one or the second one you lose a lot of precision and you know I guess there's a ton of like real-world instances if you just think about it and it's really tough [inaudible] and that's a big issue in hockey too and in all sports in general soccer has this as its biggest problem where you have [inaudible] on the pitch and there's three substitutions [inaudible] like a Brad Marchand and like Patrice Bergeron they're always tied at the hip together [inaudible] it's hard to parse out how much credit to give each [inaudible] this is just basic linear regression and there are going to be some instances where it won't handle that well so if at any given moment
you think you have some issues with that you know [inaudible] all this stuff just Google it if you run into any issues if you're in R and you get an error just google it and that's the best way to learn it's just outliers and influential observations I think they kind of go hand in hand we don't want them to throw everything off and they're tough to deal with in both scenarios so one thing is regression to the mean like talking about Gaudreau [inaudible] find ways to deal with that in linear regression yeah there are many types of regressions and some are more sensitive to outliers and influential observations some aren't so this is just leading to the art of modeling you're just going to be playing around with a lot of these models and you're going to have to sometimes pick the best that's an entire conversation that you might want to have at a later conference but picking the best model is also something that's important for sure so [inaudible] I think that's us talking at you for a very long time but you know if you guys wanted to talk about some of your ideas or go ahead and implement them with the code that we've given you so we have a couple questions on the [inaudible] to really just get you guys thinking so one question is like what important variable I mean [inaudible] but you know what important variable have we left out that could be a significant predictor of a player's point production yes that was the answer great job so you know if you have done any research you might have seen stuff about aging curves so that would be you know a next step is to get all of the player ages and try to see you know controlling for player quality how does age impact point production over the years and then you can think of what other variables yeah yeah exactly so we used the variables on Corsica and we deleted some of them that we intuitively thought wouldn't make a huge difference but are there any variables that
you think would make an impact I know there's a couple trackers in this audience anything from your tracking yeah [inaudible] any other recommendation and this is genuinely like an open question like I just threw some variables into this model but I'm curious to see what you guys think could predict points that we haven't talked about number of different teams played for yeah [inaudible] actually it gets into like the really interesting question of like how does like switching teams midseason or across seasons affect the player's performance [inaudible] there might be some effect there of adjusting so yeah sure and that would be actually something that's pretty easy [inaudible] it's already contained within the data you would basically look for instances within the team column of having like more than one team and then you can make that into a new column that's how many teams this player played for this year and then you could add that into the regression yeah absolutely and I think those variables are in there and you can just basically straight up put those column names into the regression and see if they are significant see what the coefficients are and if it makes sense and kind of go from there and do all the things I've just done so definitely so again I even created this like huge data frame for you guys to think about [inaudible] because yeah for all my draft analysis I just look at the skaters yeah so the last question there as you get into hockey analytics you're going to be reading a lot of articles and a lot of them use different inclusion thresholds and so this broad question is can you expect the results to change if you tweak that significantly let's say you use a hundred minutes yep you know that's a good comment so if you decrease the inclusion cutoff and you're just looking at overall points within a season your estimates are generally going to drop because you're going
to be including more players who didn't you know score that many points because you're including all the players basically who played between 100 and 400 even strength minutes you're adding those players back to the data the other thing that's more [inaudible] is that if you have a really low cutoff again you're going to include a lot of players who didn't play a lot of games so probably the games that they did play are not going to be the most representative of their true talent you know going back to the Gaudreau example like if we include the season where he played like one game that really tells us nothing about what Johnny Gaudreau's capabilities actually are so again it goes back to what I said about like that would probably add more noise to your model than would help it but on the flipside if you make your inclusion cutoff pretty high chances are you know the people who do meet that you will be able to predict them pretty well you know they have a good enough amount of games and ice time that they won't shock you the next season by getting suddenly way better but the downside to that of course is that you would be excluding a lot so if someone wants to ask you about a player who had you know less you'd basically just say sorry can't do that so you know it's all about striking that balance of trying to make a model that can be as widely applicable as possible without sort of training it on data that you know is too noisy [inaudible] yeah you can try I guess manually striking that balance but there are also more rigorous quantitative ways to find a suitable cutoff but you can always start by just changing the four hundred number and see all right so [inaudible] but some quick thank yous yeah so thank you to Professor Michael Lopez who provided us with some linear regressions Liza who helped us with an introduction to regression analysis Manny who has an amazing website I'm sure you guys all heard
about Corsica Hockey that's where all the data is from Alex and Connor thank you they're helping us with the online side and our colleagues at Hockey Graphs everyone who participated thank you and of course Ryan and Matt for organizing this conference and we have some references and recommended readings and so you know you can look at this to extend your knowledge go beyond this guide and also if you have any more general questions about like what we do you know what the projects are like what issues we run into [inaudible] careers whatever you can ask those now as well or just play around with your computers either one is good all right thanks [Applause] yeah [inaudible] so speaking for myself I'm a stat concentrator at Wharton and so I took a couple stat classes I definitely learned more of the math side and then I got to coding I don't know that there's like a certainly true right way to do it because I do think that when you're learning about regression it is helpful to just run as many of them as you can and like see how things change when you change the variables in them but you know it is definitely important to get a solid foundation of like what you're actually doing with your models I think that is a problem that people run into because as you can see you downloaded R you downloaded RStudio so the command for linear model is lm like that's pretty easy to type it's very easy for people to do so we wanted to give you guys at least kind of the minimum overview that you would need to be able to do it responsibly so I think it's really important that you understand that like when you're testing new models and you're using models that you never worked with before it can be very easy to look them up and R is built that way so you can try every model [inaudible] but it's important to first get that foundational knowledge of what you're actually doing [inaudible] yeah I think I'd just reiterate the fact that like all the
commands I wrote in that guide if you tested me on them right now I would absolutely not remember or I'd probably remember vaguely 60% because it's like you know it's not important that you have just like encyclopedic knowledge of coding what's important is that you know you understand what you're trying to do and how you can basically find the process to do that and so hopefully you know that guide can get you off to a good start with that yeah I mean don't feel like you have to memorize it or anything like that and yeah Google is your friend any other questions all right I guess we'll wrap this up so thank you guys so much for being here and thank you to the streamers as well for streaming and we hope that this was helpful to you in some way all right [Applause]
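
For anyone following along after the session, here is a minimal sketch of the simple regression walked through above: predicting this season's points from last season's points, then looking at residuals and the spread of the predictions. The workshop guide itself was written in R using lm; this sketch uses plain Python as a stand-in, and the numbers in toy_seasons are hypothetical values made up for illustration, not Corsica data. The function name fit_simple_ols is also just an illustrative choice.

```python
from statistics import mean, pstdev

# Hypothetical toy data, NOT real player data:
# (last season's points, this season's points)
toy_seasons = [(10, 18), (25, 22), (40, 45), (55, 41),
               (70, 66), (2, 30), (33, 28), (61, 58)]


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line through the data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


lag_points = [x for x, _ in toy_seasons]
points = [y for _, y in toy_seasons]

intercept, slope = fit_simple_ols(lag_points, points)
predicted = [intercept + slope * x for x in lag_points]

# Residual = actual value minus predicted value.
residuals = [y - p for y, p in zip(points, predicted)]

# R squared: share of the variation in points explained by the line.
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - mean(points)) ** 2 for y in points)
r_squared = 1 - ss_res / ss_tot

# The observation the line underestimates the most (largest residual).
biggest_overperformer = max(range(len(points)), key=lambda i: residuals[i])

print("intercept:", round(intercept, 2), "slope:", round(slope, 2))
print("R squared:", round(r_squared, 2))
print("largest residual belongs to:", toy_seasons[biggest_overperformer])
print("spread of actual points:", round(pstdev(points), 2))
print("spread of predictions:", round(pstdev(predicted), 2))
```

In this toy data the row with almost no points last season and a solid total this season plays the role of the outlier discussed in the talk, and the predictions come out less spread out than the actual values, which is the regression to the mean point. Changing the toy values, or filtering out rows with a very low lagged total, is a quick way to see how an inclusion cutoff moves the fit.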