Deep dive on the Pandas Groupby Function
There are some functions or methods that you know … kinda; when you think about them its fuzzy and your knowledge is not to the point where five minutes of googling can fill all the gaps. Theres are often the functions that in the heat of a project you just end up avoiding and never getting better at. For me, one of those is the pandas groupby function. Groupby is incredibly powerful and I know how to get a sum or something simple but more advanced operations seem to require hours of trial and error to get exactly what I want from my data. Recently I found a solution to a query I had been working on for hourse using only one line of code with a groupby. So I figured it was time to fill in some of the gaps in my knowledge of groupby: writing a blog for yall is a great way to do just that!
The Fundamentals
Wes McKinney (the creator of Pandas) describes the structure of groupby functions as coming from Hadley Wickham’s (Mckinney is to python as Wickham is to R) split-apply-combine description of the group functions he coded for R. Lets take a deep dive on what he means by this. Often knowing the theory behind a function can help to understand why it's structured the way it is, rather than just understanding on a surface level what people commonly do with it.
Split: the data is split by one or more keys along either the rows or the columns. In other words, you give the group by function a list of column names or row monikers, and the data is divided into these categories.
Apply: a function is applied to these groups
Combine: the results of the function are aggregated
One of the most confusing things about groupby is that it is a “lazy” object or it does not actually compute anything unless explicitly told to do so. When you call groupby you create a groupby object which is not interpretable. Only after you tell it what to compute from this object, ie a sum or mean does it display your results. It’s like if my mom told me to “clean” I’m gonna need some specifics before I actually do the work momma.
Let's dive into a specific example to show you what I mean. Recently I have been working on last year's NFL big data bowl on Kaggle. It's a data set with a ton of information about every player on the field on every run play from 2017–2020. The goal is to predict how many yards each run play gained. In my EDA phase, I wanted to know if how many defenders were in the box affected the yards gained.
As you can see, just making a groupby object does not do anything:
df.groupby("defendersinthebox").yards
When you call the mean function however, the function is applied to the groups defined in the groupby object and your results are combined:
df.groupby("defendersinthebox").yards.mean()
Its really easy to throw results like this into a simple visualization.
df.groupby("defendersinthebox").yards.mean().plot(kind = 'bar')
Keep in mind when you are using groupby, that whenever you call an arithmetic function it will ignore non integer data. These are called “nuisance columns.” Groupby also automatically ignores nans when making aggregate calcualtion like mean.
This is the simplest application of groupby, but it can do so much more. Heres all the functions you can apply (including your own custom ones which we will get into later). Instead of going through each one, I will give you an example of a problem I recently had and how I finally solved it with groupby.
A Recent Example
OK, a little context, the NFL dataset has each row as a player, not as the play itself. The goal of the project is to predict what happened on each play, so I needed a way to efficiently make the player data into additional columns, so that each observation is one play then all the data about the players from that play.
See PlayID is the same for all five rows but the other variables are different; each row is a player, not a play. I got stuck on how to turn each player into a column for longer than I care to admit. I even tried making some custom functions but with 60,000 rows they were all too slow. This is another advantage to using inbuilt pandas or numpy functions like groupby, they are generally much much faster than trying to come up with your own solution in raw python.
Searching around, I finally came across cumcount() a handy groupby function that will keep track of how many times the grouper column has appeared. Lets dive into how I used it to solve my problem!
dfvec["player"] = dfvec.groupby(["playid"]).cumcount()
This is the crux of it, in this one line of code I add a column that has a unique value from 0–22 for each of the 22 players on the field for each play. This was a groupby function I never knew about before I had this problem I made a complext series of if statements to try to add similar labeling without the groupby function but these took forever and where difficult to trouble shoot.
Now it was fairly straightforward to grab all the “1s” or all the “18s” from each play and make them into a data frame and then combine them so that each play has a column for the data of each player.
lsplayersdf = [dfvec[dfvec.player == x] for x in list(range(0,22))]
df_all = lsplayersdf[0]
for x in list(range(1,21)):
df_all = df_all.join(lsplayersdf[x])
Now this may not be the perfect solution to this problem, in fact I get the feeling its pretty clunky. I included it in this blog to show you the power and flexibility of using groupby functions. Take a look through all the options, just so that if you ever get stuck in Pandas, maybe just maybe you remember one of the options in the back of your head and can solve your problem with it. Theres usually a better way than just jumping right to the old if statements, especially with how developed Pandas has become at this point.
Custom Groupby Functions
Before I wrote this blog, I honestly did not know this was something you could do. Using .agg or the aggregate function on a groupby object allows you to create a function, then call it on the groups. Heres some very simple code that would take the distance of the maximum value in each group from the mean of that group.
def distance_from_mean(arr):
return arr.max()-arr.mean()
df.groupby('defendersinthebox')['yards'].agg(distance_from_mean)
I honestly had trouble coming up with a good example for this one, but this code shows you how it could work. Maybe as I get more expierence with it some better ideas will come to me and I can append them to this blog. It’s nice to know the option is there/
Conclusion
Thanks for checking out my blog, I hope that by reading along you learned at least something about the groupby function or got some new ideas on how to use it. I think its a very powerful tool in Pandas and one that I probably underutilize. It can be confusing, but just remember what groups you want to create and what information about those groups you want to get. Diving straight into a for loop can seem fun and programmery (atleast for me) but remember that these powerful modules contain efficient functions that almost always offer better solutions. Happy data scieincing!