Use Machine Learning to Find Players of Similar Profiles: The Streamlit App
When I wrote an article last year titled 'Using Machine Learning To Find A Jules Koundé Alternative for Tottenham', the idea of creating an app that allows anyone to find player alternatives had already been sown in my mind. It was then nurtured by the impending overhaul at Manchester United and has finally blossomed now that the transfer window is around the corner.
I finally got around to making the app, which you can now find on my Twitter profile here. In this article, I will take you through the idea, the execution, and the issues I faced while building the web app.
"My biggest project is finally here! I made an app that employs Machine Learning to group and find players with similar profiles. https://t.co/m4yhib1XvD pic.twitter.com/P9E4BSmVw3" — Anuraag Kulkarni (@Anuraag027), May 19, 2022
The Idea
So, as I stated earlier, the idea came from a previous article. The thought in my mind was simple - is there a way I can tell which players are similar just by looking at their stats? The stats are affected by a number of factors: the inherent quality of the player, the role he is being asked to fulfill in the team's setup, the team's quality, and so on. Given these factors, I did not want to make these 'similarity' decisions myself, as that would leave the process prone to my fickle mind and, more importantly, my biases.
And that's where Machine Learning comes in. Training a model unaffected by human biases and letting it do the 'similarity finding' seemed like the perfect way to bring about what I had in mind. I am not a professional ML engineer or even a data scientist - it's only something I picked up as a hobby over the past year - so apologies if my methodology isn't the best or most optimal, or if I took some steps erroneously. If you do observe any of these, I would really appreciate it if you could let me know on Twitter.
The Machine Learning Part
If you're unaware of what ML is, this definition may be helpful:
'Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so.' [source: Wikipedia]
The ML algorithm I used to accomplish this is called K-Means clustering.
It is an unsupervised learning algorithm, which is a 'type of algorithm that learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a compact internal representation of its world and then generate imaginative content from it. In contrast to supervised learning where data is tagged by an expert, e.g. as a "ball" or "fish", unsupervised methods exhibit self-organization that captures patterns as probability densities or a combination of neural feature preferences.' [source: Wikipedia]
So, basically, it's a type of algorithm that is capable of finding patterns on its own, without the need for human intervention - which is exactly what I meant when I said that I did not want my own biases to creep into the model building.
With that cleared up, let's take a look at what K-Means clustering does.
The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.
A cluster refers to a collection of data points aggregated together because of certain similarities.
You'll define a target number k, which refers to the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of a cluster. Every data point is allocated to one of the clusters by minimizing the in-cluster sum of squares.
In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest cluster, while keeping the clusters as compact as possible. The 'means' in K-means refers to the averaging of the data; that is, finding the centroid.
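To make that concrete, here is a minimal toy sketch of K-Means with scikit-learn - the data and the choice of k = 2 are purely illustrative, not the app's actual inputs:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two loose blobs of points.
points = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
                   [5.0, 5.1], [5.2, 4.9], [4.8, 5.3]])

# Ask for k = 2 centroids; fit_predict assigns each point to its nearest one.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(points)

print(labels)                   # e.g. [0 0 0 1 1 1] - one cluster label per point
print(kmeans.cluster_centers_)  # the two centroids (the cluster means)
```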
The Execution
The entire project was coded in Python and then later turned into an app using Streamlit.
I first started by collecting data from fbref - an excellent site for player statistics, powered by StatsBomb. I had to do a lot of data cleaning to get all the data into the required format, but the 'pandas' Python library is built for exactly this purpose and worked like a charm (by that I mean I spent 5 hours solving logical and syntactical errors. Still worked, though. Not pandas' fault).
Next, since fbref defines player positions quite vaguely, I needed a better source that could give me more accurate player positions. That's where Jase's post came into the picture - his sheet with all the positions already tagged was exactly what I needed for my work.
Now that I had all the data available in CSV/Excel format, it was time to get started with the code.
An important step before feeding data into any ML model is scaling it. This step is often called 'feature scaling', and here is what it means:
Machine learning is like making a mixed fruit juice. If we want to get the best-mixed juice, we need to mix all fruit not by their size but based on their right proportion. We just need to remember apples and strawberries are not the same unless we make them similar in some context to compare their attribute. Similarly, in many machine learning algorithms, to bring all features in the same standing, we need to do scaling so that one significant number doesn’t impact the model just because of its large magnitude. [source: All about Feature Scaling]
To explain its importance, let's consider this example.
If we want to measure the aerial prowess of a player, we can consider two stats - aerial duels won per 90, and the percentage of aerial duels won. Now, while the number of aerial duels won will be around 5 at most (Shane Duffy has the highest at 5.08 p90), the percentage of aerial duels won can be anywhere from 0% to 85%. While this makes sense to us intuitively, to the model it is just confusing - what should be the base number? Why does one metric max out at 5 while another runs up to 85? This can unbalance the model, and that's why feature scaling is important - it brings all metrics into a standardized range that makes sense to the model.
To scale the metrics, Python's scikit-learn library provides the option of StandardScaler, which is what I used to bring the stats into a sensible range.
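As a rough sketch of that step - the column names and numbers below are hypothetical stand-ins, not the app's actual features:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical per-90 stats on very different scales.
stats = pd.DataFrame({
    "aerial_duels_won_p90": [1.2, 5.08, 2.4],
    "aerial_duels_won_pct": [45.0, 78.0, 60.0],
})

# StandardScaler rescales each column to mean 0 and unit variance,
# so no metric dominates purely because of its magnitude.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(stats)
```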
With that done, the data was then ready to be fitted to the model.
The final step before we can go ahead with the fitting, however, is to select an appropriate number of clusters for the procedure - i.e., how many groups the data points should be divided into.
Even the number of clusters should not be hardcoded into the model code, since that too brings in a bias from the coder. One of the better ways to find the ideal number of clusters for the model is the 'Elbow Method':
In the Elbow method, we are actually varying the number of clusters ( K ) from 1 – 10. For each value of K, we are calculating WCSS ( Within-Cluster Sum of Square ). WCSS is the sum of the squared distance between each point and the centroid in a cluster. When we plot the WCSS with the K value, the plot looks like an Elbow. As the number of clusters increases, the WCSS value will start to decrease. WCSS value is largest when K = 1. When we analyze the graph we can see that the graph will rapidly change at a point and thus creating an elbow shape. From this point, the graph starts to move almost parallel to the X-axis. The K value corresponding to this point is the optimal K value or an optimal number of clusters. [source: In-depth Intuition of K-Means Clustering Algorithm in Machine Learning]
For one of my cases, this is what the WCSS vs number of clusters graph looked like:
During the early stages, I had hardcoded the number of clusters (k) to 4, as I thought it was giving me the best results. However, in the back of my mind I knew this was wrong, and I had to make sure I overcame it. Automating the elbow finding for every iteration was probably the biggest challenge for me, and I was happy to finally be able to do it using Yellowbrick. This made sure the clustering process was as bias-free as possible and the results were the best the model was capable of.
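Here is roughly what that automation looks like with Yellowbrick - the k range of 2 to 10 is my assumption, and X_scaled is the scaled feature matrix from the earlier step:

```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Sweep candidate cluster counts and let Yellowbrick locate the elbow.
visualizer = KElbowVisualizer(KMeans(n_init=10, random_state=42), k=(2, 10))
visualizer.fit(X_scaled)

best_k = visualizer.elbow_value_  # the automatically detected elbow point
```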
Finally, I had my scaled data and the number of clusters automated for every situation - all the resources ready to build my model.
I used the K-Means model provided by scikit-learn and fitted it on the scaled data from earlier. Then, I just had to check the model's predictions to assign the appropriate cluster to every player. Once this was done, all I had to do was isolate all the players in the same cluster as the player the web app's user had selected. This was, in essence, the model telling me which players it thought were similar.
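A sketch of that pipeline, assuming a players DataFrame aligned row-for-row with X_scaled, a hypothetical 'name' column, and best_k from the elbow step - the player name is just an example:

```python
from sklearn.cluster import KMeans

# Fit on the scaled stats and assign a cluster label to every player.
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=42)
players["cluster"] = kmeans.fit_predict(X_scaled)

# Keep only the players sitting in the same cluster as the selected player.
target = "Harry Kane"  # hypothetical user selection
target_cluster = players.loc[players["name"] == target, "cluster"].iloc[0]
candidates = players[players["cluster"] == target_cluster]
```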
Next, I needed some sort of scoring system to check which players within the same cluster were most similar to the player in question. For this, I settled on the simple methodology of Euclidean distance. In mathematics, the Euclidean distance between two points in Euclidean space is the length of the line segment between them.
In general, for points given by Cartesian coordinates in n-dimensional Euclidean space, the distance is:

$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$
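Continuing the sketch from above, the scoring step could look something like this - again with the same assumed names:

```python
import numpy as np

# Distance from the target player's scaled stats to each candidate's stats;
# the smaller the distance, the more similar the profile.
target_vec = X_scaled[(players["name"] == target).to_numpy()][0]
cand_vecs = X_scaled[(players["cluster"] == target_cluster).to_numpy()]

candidates = candidates.assign(
    similarity_score=np.linalg.norm(cand_vecs - target_vec, axis=1)
)
# Smallest distance first; iloc[1:4] skips the player himself (distance 0).
most_similar = candidates.sort_values("similarity_score").iloc[1:4]
```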
From this point, all I had to do was create a radar plot comparing the 3 most similar players to the selected player, to provide a visual of how well the model had done. I did this using the excellent mplsoccer library.
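mplsoccer's Radar class handles the drawing; a minimal single-player sketch might look like the following, with the metric names and value ranges purely illustrative:

```python
import matplotlib.pyplot as plt
from mplsoccer import Radar

# Hypothetical metrics and value ranges for the radar axes.
params = ["Aerial duels won p90", "Tackles p90", "Progressive passes p90"]
low, high = [0, 0, 0], [6, 5, 12]

radar = Radar(params, low, high)
fig, ax = radar.setup_axis()                      # blank radar axis
radar.draw_circles(ax=ax, facecolor="#f9f9f9", edgecolor="#c5c5c5")
radar.draw_radar([4.1, 2.3, 7.5], ax=ax)          # one player's values
radar.draw_range_labels(ax=ax)
radar.draw_param_labels(ax=ax)
plt.show()
```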
That was almost it for the inner workings of the app. I was able to scale all the metrics, automate finding the ideal number of clusters (k), build and fit the model, create a scoring system for the players, and plot a radar chart of the 3 most similar players for a visual comparison. All that was left was to build a shareable app that anyone could use.
The App Building
Coming from a non-CS background and never really having done this sort of thing before, this part seemed like a slightly more daunting task to me. But thanks to Streamlit, it became quite easy. Their APIs are easy to use and the documentation is elaborate enough to understand what to pick and how to implement it. I did, however, face some issues during the app-building process.
The biggest one, and what took the most time to figure out, was an error I kept getting because I was importing a number of libraries that I wasn't actually using in the code. The main problem was narrowing down that the error was being caused by those redundant imports, since the error message didn't distinctly say so. After a lot of googling, and even rage-deleting the entire project repo at one point, I was able to figure out what was causing the error and resolve it.
Once this was resolved, it was more about making incremental changes and resolving minor issues to improve performance - like using caching to keep the app under resource limits, and adding a text section describing the metrics used in the cluster formation and the radar plot.
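For the caching part, Streamlit's decorator memoises expensive calls so they don't rerun on every widget interaction. A sketch with a hypothetical loader function and file name - at the time of writing, @st.cache was the standard decorator, with newer Streamlit versions offering st.cache_data instead:

```python
import pandas as pd
import streamlit as st

@st.cache  # cache the result so the CSV isn't re-read on every rerun
def load_player_data(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

df = load_player_data("players.csv")  # hypothetical file name
```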
Another issue pertained to the sample size for certain players. Stats are heavily susceptible to small minute counts, which can lead to inflated numbers. This can confuse the model and lead to incorrect profiling of players. To mitigate this, I decided on twelve full 90s as the basis for player comparison. Things were fine once again, in my mind. What I had forgotten, however, is that this removes a lot of players who have completed fewer than twelve 90s, whom users may want to search for. So I settled on the following criterion to put this problem right once and for all: for players who have completed more than twelve full 90s, twelve is the threshold; for players with fewer, the number of full 90s completed by that specific player is the threshold.
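That criterion boils down to taking the minimum of twelve and the selected player's own total. A sketch continuing the earlier names, where the '90s' column follows fbref's convention but is still an assumption here:

```python
# Threshold: 12 full 90s, unless the selected player has fewer,
# in which case their own total becomes the bar for comparison.
player_90s = players.loc[players["name"] == target, "90s"].iloc[0]
threshold = min(12, player_90s)

eligible = players[players["90s"] >= threshold]
```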
The final important feature was allowing the user to select an age range, with player results shown only within it. This lets users look for 'younger versions' of the player they have selected.
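With Streamlit, this is a range slider; the bounds and defaults below are illustrative:

```python
import streamlit as st

# Passing a tuple as the default value makes st.slider return a (min, max) range.
age_min, age_max = st.slider("Age range", 15, 40, (18, 30))
candidates = candidates[candidates["Age"].between(age_min, age_max)]
```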
Making these upgrades meant users could have an easy and informative experience with the app, while I also learned how to mess around with a working app and how to make changes without the entire thing shutting down (lol).
Conclusion
The journey of turning a simple idea into something I could share and let users all across the world use was a satisfying one. I learned plenty of new things, from collating multiple functions to building a web app and all that goes into deploying one. There are still some ideas in my mind that I plan to release as future updates to the app, so please watch out for those.
Once again, I thank everyone who helped me make the app and everyone who used it. If you find any errors or have any other questions, feel free to reach out to me on Twitter. I hope everyone has a fun time using the app - link :)