EDA
The goal of this notebook is to provide a general overview of the dataset and to select the features that can be used in our model.
We'll focus on:
- Analyzing the differences between national leagues:
- Goal scoring data based on teams and players
- The "closeness" in performance between teams by league (basically whether some leagues are dominated by a small number of teams or whether all teams are similarly strong)
Football Leagues
The colors in the "Number of Matches" chart match those in the goal distribution boxen plot. We can see that, for some reason, data for Belgium is missing for the year 2013.
There are also some significant differences in the number of goals scored between leagues.
Parsing Goal Events and Player Scoring Data
The distribution of the types of goals scored in all leagues. The difference in the total number of penalties can probably be explained by differences in referee strictness and by random variance. The number of own goals seems to be significantly higher in Poland. Additional research might be necessary to explain this (unless there are issues with data quality).
The wildly different proportions of goals that have an assist associated with them are likely due to data quality issues: assists do not appear to be tracked consistently across all leagues.
The chart allows us to compare the 'inequality' of goal scoring across different leagues. For instance, we can see that the top 1% of goal scorers score only ~10% of goals in England, but this increases to ~22% in Switzerland. Based on this we can assume that, at least amongst goal-scoring players, the differences in player skill are much smaller in some leagues than in others (where most games are dominated by a smaller proportion of goal scorers).
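
The "top X% of scorers" share discussed above can be sketched in plain Python. This is only an illustration, not the notebook's actual implementation: the function name `top_scorer_share` and the goal counts are hypothetical. It also guards against a total of zero goals, which would otherwise produce an invalid division.

```python
import math


def top_scorer_share(goals_per_player, top_fraction):
    """Return the share of all goals scored by the top `top_fraction` of scorers."""
    total = sum(goals_per_player)
    if total == 0:
        # No goals recorded: the share is undefined, so avoid dividing by zero.
        return float("nan")
    ranked = sorted(goals_per_player, reverse=True)
    # Always include at least one player so small groups still produce a value.
    n_top = max(1, math.ceil(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / total


# Hypothetical goal counts for ten scorers in one league (100 goals in total).
example_goals = [30, 20, 10, 10, 10, 5, 5, 5, 3, 2]
share_top_10_pct = top_scorer_share(example_goals, 0.10)
share_top_20_pct = top_scorer_share(example_goals, 0.20)
```

A higher share for the same `top_fraction` means that goal scoring is concentrated in fewer players.
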
This chart shows the relationship between the number of goals scored by individual players and the number of matches they have played.
This chart shows the proportion of players who have scored goals in relation to the number of matches they have played in. We can see that around 30% of players who have played 100 games or more still have not scored any goals. In addition to goalkeepers, this probably includes a significant proportion of defenders, who tend not to score many goals.
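
As a rough sketch of how such a proportion could be computed, the snippet below groups players into buckets by matches played and reports the fraction with at least one goal in each bucket. The function name, the bucket size and the sample data are all hypothetical and are not taken from the notebook.

```python
from collections import defaultdict


def scoring_rate_by_bucket(players, bucket_size=50):
    """Map each matches-played bucket to the fraction of players with >= 1 goal."""
    counts = defaultdict(lambda: [0, 0])  # bucket start -> [scorers, players]
    for matches, goals in players:
        bucket = (matches // bucket_size) * bucket_size
        counts[bucket][1] += 1
        if goals > 0:
            counts[bucket][0] += 1
    return {bucket: scorers / n for bucket, (scorers, n) in sorted(counts.items())}


# Hypothetical (matches_played, goals_scored) pairs.
players = [(10, 0), (20, 1), (60, 0), (70, 3), (120, 0), (130, 5), (140, 9), (150, 2)]
rates = scoring_rate_by_bucket(players)
```

Each key in `rates` is the lower bound of a bucket, so `rates[100]` covers players with 100 to 149 matches.
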
Season Points Analysis
Team Performance Analysis
This chart shows the distribution of teams by the number of points collected at the end of the season. The kurtosis value describes the tails of the distribution and the number of outliers (high kurtosis means that there are many teams in the league that score far more or far fewer points than most other teams).
We can see that the Polish, English and Belgian leagues are the most equal in performance, while the Portuguese league was dominated by a small number of teams in all the seasons included in the dataset.
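
To make the kurtosis comparison concrete, here is a minimal plain-Python sketch of excess kurtosis (Fisher's definition, where a normal distribution scores 0). The exact estimator used for the chart is not shown in this notebook, so treat this as an assumption; the two point tables are made up for illustration.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: fourth central moment / variance^2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        # All values identical: kurtosis is undefined.
        return float("nan")
    return m4 / m2 ** 2 - 3.0


# Hypothetical end-of-season points: an evenly spread league vs. one with a runaway leader.
balanced_league = [40, 45, 50, 55, 60]
dominated_league = [40, 42, 44, 46, 48, 90]

k_balanced = excess_kurtosis(balanced_league)
k_dominated = excess_kurtosis(dominated_league)
```

In this toy data the league with a single runaway leader has the higher kurtosis, which matches the reading of the chart given above.
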
Team Rating
The "Team Rating" is the sum of all the individual player ratings (based on EA FIFA games) in every match.
Team Rating vs Team Position at the End of Season
Generally, teams that have higher ratings tend to dominate most of the leagues. This indicates that the team rating might be a somewhat accurate predictor of match results, at least on average over the entire season.
Change in Team Rating Over Season
The chart above shows how the team rating has changed on average in every league during the seasons (over the entire period included in the dataset). The trend in the leagues that are shown individually was significantly different from the average trend across all the remaining leagues (i.e. the slope was significantly different based on its p-value).
To get a slightly clearer picture, we have also shown the 20th and 80th percentiles of the normalized ratings across all seasons and all leagues.
Our next step was to further examine the relationship between the sum of player ratings and the likelihood of the team with the higher rating winning. This is a simple logistic regression showing the probability of a binary outcome (Win or Not Win) based on the difference in team ratings. We can see that the likelihood of a win for a team whose rating is at least 30% higher is almost 80%. We will expand on this relationship when building our multi-class classification model in our next notebook.
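
The idea behind this regression can be sketched without any libraries. The example below fits P(win) = sigmoid(w * x + b) by gradient descent, where x is assumed to be the relative difference in team ratings. The data is synthetic and the coefficients used to generate it (6.0 and -0.3) are arbitrary, so the fitted values say nothing about the real dataset.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=1.0, epochs=2000):
    """Fit a one-feature logistic regression by batch gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log-loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Synthetic data: x is a hypothetical relative rating difference, y is 1 for a win.
rng = random.Random(0)
xs = [rng.uniform(-0.4, 0.4) for _ in range(500)]
ys = [1 if rng.random() < sigmoid(6.0 * x - 0.3) else 0 for x in xs]

w, b = fit_logistic(xs, ys)


def win_probability(rating_diff):
    """Predicted probability of a win for a given relative rating difference."""
    return sigmoid(w * rating_diff + b)
```

Calling `win_probability(0.3)` then gives the modelled chance of a win for a team rated 30% higher, under the assumptions above.
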
PCA Analysis
Number of features: 13
| | var | PC | cum_var |
---|---|---|---|
0 | 0.290852 | PC1 | 0.290852 |
1 | 0.266693 | PC2 | 0.557544 |
2 | 0.126880 | PC3 | 0.684425 |
3 | 0.076688 | PC4 | 0.761113 |
4 | 0.065649 | PC5 | 0.826762 |
5 | 0.058473 | PC6 | 0.885235 |
6 | 0.049256 | PC7 | 0.934491 |
7 | 0.028955 | PC8 | 0.963446 |
8 | 0.013163 | PC9 | 0.976609 |
9 | 0.012951 | PC10 | 0.989560 |
We have attempted to use PCA to determine whether we can simplify our models by reducing the number of features while retaining most of the variance.
When all of the features which will be used in our full model (see TabularModel.ipynb) are included, we need about 8 components to keep at least 95% of the variance. Considering that we only have 13 features in total, this doesn't seem like a reasonable approach.
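
The "about 8 components" figure can be checked directly from the `var` column of the table above. The helper below is a small sketch (the name `components_for` is hypothetical) that walks the cumulative sum until the threshold is reached.

```python
from itertools import accumulate

# Explained variance ratios for PC1..PC10, copied from the table above.
explained = [
    0.290852, 0.266693, 0.126880, 0.076688, 0.065649,
    0.058473, 0.049256, 0.028955, 0.013163, 0.012951,
]


def components_for(threshold, ratios):
    """Return the smallest number of leading components whose cumulative
    explained variance reaches `threshold`, or None if it is never reached."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= threshold:
            return k
    return None


n_95 = components_for(0.95, explained)
```

With only the ten listed components the cumulative variance stops just below 99%, so a 0.99 threshold returns `None` here.
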
| | id | team_fifa_api_id | team_api_id | date | buildUpPlaySpeed | buildUpPlaySpeedClass | buildUpPlayDribbling | buildUpPlayDribblingClass | buildUpPlayPassing | buildUpPlayPassingClass | buildUpPlayPositioningClass | chanceCreationPassing | chanceCreationPassingClass | chanceCreationCrossing | chanceCreationCrossingClass | chanceCreationShooting | chanceCreationShootingClass | chanceCreationPositioningClass | defencePressure | defencePressureClass | defenceAggression | defenceAggressionClass | defenceTeamWidth | defenceTeamWidthClass | defenceDefenderLineClass | days_after_first_date | team_long_name | league_id | id_x | id_y | country_id | league_name |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1112 | 1113 | 874 | 1601 | 2010-02-22 | 30 | Slow | NaN | Little | 40 | Mixed | Organised | 50 | Normal | 35 | Normal | 70 | Lots | Organised | 65 | Medium | 60 | Press | 50 | Normal | Cover | 584 | Ruch Chorzów | 15722 | 120 | 15722 | 15722 | Poland Ekstraklasa |
1113 | 1114 | 874 | 1601 | 2011-02-22 | 48 | Balanced | NaN | Little | 51 | Mixed | Organised | 68 | Risky | 67 | Lots | 51 | Normal | Organised | 46 | Medium | 48 | Press | 50 | Normal | Cover | 949 | Ruch Chorzów | 15722 | 120 | 15722 | 15722 | Poland Ekstraklasa |
1114 | 1115 | 874 | 1601 | 2012-02-22 | 53 | Balanced | NaN | Little | 55 | Mixed | Organised | 44 | Normal | 65 | Normal | 50 | Normal | Organised | 43 | Medium | 44 | Press | 49 | Normal | Cover | 1314 | Ruch Chorzów | 15722 | 120 | 15722 | 15722 | Poland Ekstraklasa |
1115 | 1116 | 874 | 1601 | 2013-09-20 | 53 | Balanced | NaN | Little | 55 | Mixed | Organised | 44 | Normal | 65 | Normal | 50 | Normal | Organised | 43 | Medium | 44 | Press | 49 | Normal | Cover | 1890 | Ruch Chorzów | 15722 | 120 | 15722 | 15722 | Poland Ekstraklasa |
1116 | 1117 | 874 | 1601 | 2014-09-19 | 53 | Balanced | 48.0 | Normal | 38 | Mixed | Organised | 66 | Normal | 65 | Normal | 50 | Normal | Organised | 43 | Medium | 44 | Press | 49 | Normal | Cover | 2254 | Ruch Chorzów | 15722 | 120 | 15722 | 15722 | Poland Ekstraklasa |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
68 | 69 | 112513 | 158085 | 2014-09-19 | 69 | Fast | 66.0 | Normal | 39 | Mixed | Organised | 55 | Normal | 59 | Normal | 46 | Normal | Organised | 35 | Medium | 37 | Press | 37 | Normal | Cover | 2254 | FC Arouca | 17642 | 49 | 17642 | 17642 | Portugal Liga ZON Sagres |
69 | 70 | 112513 | 158085 | 2015-09-10 | 65 | Balanced | 66.0 | Normal | 39 | Mixed | Organised | 55 | Normal | 59 | Normal | 46 | Normal | Organised | 37 | Medium | 39 | Press | 37 | Normal | Cover | 2610 | FC Arouca | 17642 | 49 | 17642 | 17642 | Portugal Liga ZON Sagres |
274 | 275 | 112409 | 208931 | 2014-09-19 | 32 | Slow | 46.0 | Normal | 31 | Short | Organised | 47 | Normal | 36 | Normal | 54 | Normal | Organised | 46 | Medium | 44 | Press | 51 | Normal | Cover | 2254 | Carpi | 10257 | 19 | 10257 | 10257 | Italy Serie A |
275 | 276 | 112409 | 208931 | 2015-09-10 | 80 | Fast | 45.0 | Normal | 65 | Mixed | Organised | 70 | Risky | 40 | Normal | 50 | Normal | Organised | 25 | Deep | 55 | Press | 35 | Normal | Cover | 2610 | Carpi | 10257 | 19 | 10257 | 10257 | Italy Serie A |
858 | 859 | 111560 | 274581 | 2015-09-10 | 50 | Balanced | 50.0 | Normal | 50 | Mixed | Organised | 50 | Normal | 50 | Normal | 50 | Normal | Organised | 45 | Medium | 45 | Press | 50 | Normal | Cover | 2610 | Royal Excel Mouscron | 1 | 30 | 1 | 1 | Belgium Jupiler League |
1458 rows × 32 columns
Team Attributes PCA
Our next step was to look into the various style and tactics attributes that teams have used in their matches:
['buildUpPlaySpeed', 'buildUpPlaySpeedClass', 'buildUpPlayDribbling', 'buildUpPlayDribblingClass', 'buildUpPlayPassing', 'buildUpPlayPassingClass', 'buildUpPlayPositioningClass', 'chanceCreationPassing', 'chanceCreationPassingClass', 'chanceCreationCrossing', 'chanceCreationCrossingClass', 'chanceCreationShooting', 'chanceCreationShootingClass', 'chanceCreationPositioningClass', 'defencePressure', 'defencePressureClass', 'defenceAggression', 'defenceAggressionClass', 'defenceTeamWidth', 'defenceTeamWidthClass', 'defenceDefenderLineClass']
We have previously attempted to include them in our model, but their predictive power was insignificant. PCA would allow us to reduce all these features to a limited number of components that could possibly be compared between individual teams directly.
However, this again hasn't been very successful: 10 components (out of 21 features) only explain ~30% of all variance. This indicates that this data:
- has high dimensionality
- has relationships that might be non-linear or, more likely, non-existent
Clustering
Using hierarchical clustering with the same features doesn't seem to be that effective either. We've attempted to determine whether some teams could be grouped together based on their style, tactics and other attributes. However, the results have proven disappointing, even when multiple different hyperparameter combinations were tried:
| | name | component_method | n_components | method | cutoff | eps | min_samples | n_clusters | min_count_in_cluster | score_silhouette | score_calinski_harabasz | score_davies_bouldin | score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13 | Hierarchical | None | None | ward | 230 | None | None | 2 | 198 | 0.079399 | 44.468856 | 3.078634 | 0.800000 |
14 | Hierarchical | None | None | ward | 240 | None | None | 2 | 198 | 0.079399 | 44.468856 | 3.078634 | 0.800000 |
15 | Hierarchical | None | None | ward | 250 | None | None | 2 | 198 | 0.079399 | 44.468856 | 3.078634 | 0.800000 |
7 | Hierarchical | None | None | ward | 170 | None | None | 5 | 31 | 0.065273 | 37.352303 | 2.342327 | 0.506753 |
5 | Hierarchical | None | None | ward | 150 | None | None | 6 | 31 | 0.067596 | 36.096352 | 2.411916 | 0.506391 |
6 | Hierarchical | None | None | ward | 160 | None | None | 6 | 31 | 0.067596 | 36.096352 | 2.411916 | 0.506391 |
10 | Hierarchical | None | None | ward | 200 | None | None | 3 | 108 | 0.062023 | 41.589361 | 2.797015 | 0.476989 |
11 | Hierarchical | None | None | ward | 210 | None | None | 3 | 108 | 0.062023 | 41.589361 | 2.797015 | 0.476989 |
12 | Hierarchical | None | None | ward | 220 | None | None | 3 | 108 | 0.062023 | 41.589361 | 2.797015 | 0.476989 |
0 | Hierarchical | None | None | ward | 100 | None | None | 16 | 4 | 0.071681 | 26.141644 | 1.886573 | 0.470284 |
3 | Hierarchical | None | None | ward | 130 | None | None | 8 | 27 | 0.062360 | 32.986357 | 2.166632 | 0.396250 |
2 | Hierarchical | None | None | ward | 120 | None | None | 10 | 27 | 0.063856 | 30.742088 | 2.118654 | 0.382587 |
8 | Hierarchical | None | None | ward | 180 | None | None | 4 | 92 | 0.055599 | 39.727458 | 2.583317 | 0.357206 |
9 | Hierarchical | None | None | ward | 190 | None | None | 4 | 92 | 0.055599 | 39.727458 | 2.583317 | 0.357206 |
4 | Hierarchical | None | None | ward | 140 | None | None | 7 | 27 | 0.058556 | 34.458802 | 2.291476 | 0.339944 |
1 | Hierarchical | None | None | ward | 110 | None | None | 11 | 23 | 0.060786 | 29.607235 | 2.129367 | 0.303922 |
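
The silhouette scores in the table above are all below 0.08. The silhouette coefficient ranges from -1 to 1, and values near 0 indicate overlapping clusters. The plain-Python sketch below shows how the metric behaves on a toy example; it is an illustration with made-up points, not the code used to produce the table.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient for points (coordinate tuples) and cluster labels."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: two well-separated groups of points.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the real groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
```

Labels that follow the real groups score close to 1, while labels that cut across them score below 0, which puts the values in the table firmly at the weak end of the scale.
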