Today is the first day of the 2018 FIFA World Cup in Russia. While I do not know a lot about football, I do enjoy cheering for the Red Devils at both the European Championship and the World Cup. And so do most of my friends. That’s why, for the second time in a row, we are organising a small fantasy football league. This year we will be using the Sporza WK manager, provided by our national sports news network. The rules of their game are simple:
- Pick 11 players
- Stay within a budget of 300 million euro
- A maximum of 3 players from the same team are allowed
- Transfers are possible in certain phases of the tournament
You earn points when your players score goals and keep clean sheets, and you lose points when your players score own goals or get cautioned.
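To make the squad rules concrete, here is a minimal sketch of a check for the first three of them. The player names, teams and prices are made up, and `is_valid_squad()` is a hypothetical helper of my own, not something the Sporza game provides. The transfer rule is not modelled.

```r
# Hypothetical squad: names, teams and prices (in million euro) are invented
squad <- data.frame(
  player = paste("Player", 1:11),
  team   = rep(c("Team A", "Team B", "Team C", "Team D"), length.out = 11),
  price  = c(40, 35, 30, 30, 25, 25, 25, 20, 20, 15, 15),
  stringsAsFactors = FALSE
)

# Check squad size, total budget and the per-team limit
is_valid_squad <- function(squad, budget = 300, max_per_team = 3, size = 11) {
  nrow(squad) == size &&
    sum(squad$price) <= budget &&
    max(table(squad$team)) <= max_per_team
}

valid <- is_valid_squad(squad)

# Moving a fourth player to Team A breaks the per-team limit
too_many <- squad
too_many$team[4] <- "Team A"
invalid <- is_valid_squad(too_many)
```

Here `valid` should be `TRUE` and `invalid` should be `FALSE`, because the second squad has four players from the same team.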
1. Data collection
The first step of our journey is, of course, data collection. Searching for a decent dataset was probably one of the hardest tasks. I wanted to use up-to-date player and team ratings, but could not find any decent ones. Since I didn’t have a lot of time, I just settled for this ‘2018 FIFA World Cup squads’ Wikipedia page.
The page has a section for every team that contains the following information:
- The group they are in
- The coach of the team
- A table with player stats such as
  - Date of birth
  - Caps (= matches played)
I’ve never done this before, but it seems like a pretty easy web scraping job. For this I will try out rvest, using this tutorial as a guide. In the end I parsed the whole page using the following script:
|library(tidyverse) # Provides %>%, tibble(), unnest(), stringr and ggplot2|
|library(rvest) # Provides read_html(), html_nodes(), html_text() and html_table()|
|# Read in the website|
|site <- read_html("https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_squads")|
|# Parse website for player tables|
|players <- site %>%|
|  html_table(fill = TRUE) %>%|
|  .[1:32] # Keep only the tables related to the 32 teams|
|# Parse website for team names|
|teams <- site %>%|
|  html_nodes("h3 .mw-headline") %>%|
|  html_text() %>% # Extract the text from the matched nodes|
|  .[1:32] # Keep only the first 32 hits|
|# Parse website for coach names|
|coaches <- site %>%|
|  html_nodes("h3+ p") %>%|
|  html_text() %>% # Extract the text from the matched nodes|
|  .[1:32] %>% # Keep only the first 32 hits|
|  str_replace_all("Coach: ", "") %>% # Clean up the string|
|  str_trim() # Remove leading and trailing whitespace|
|# Parse website to figure out in which group the team competes|
|group <- site %>%|
|  html_nodes("h2 .mw-headline") %>%|
|  html_text() %>% # Extract the text from the matched nodes|
|  .[1:8] %>% # Keep only the first 8 hits|
|  rep(each = 4) # Repeat each group 4 times so it lines up with the team vector|
|# Now that we have all of the tables separately, let's combine them into one|
|table <- tibble(team = teams,|
|                coach = coaches,|
|                group = group,|
|                player = players) %>%|
|  unnest(player) %>% # The players column is a list of tables, so we need to unnest it|
|  rename(position = `Pos.`) %>%|
|  mutate(position = str_sub(position, 2, 3)) %>% # Fix parsing error|
|  rename(age = `Date of birth (age)`) %>%|
|  mutate(age = as.integer(str_sub(age, -4, -2))) # Pull the age out of "(aged NN)"|
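If you want to try the parsing steps without hitting Wikipedia, here is a small offline sketch. The HTML snippet is a made-up stand-in for the assumed page layout (a group heading, a team heading, a coach paragraph and a squad table), and all names and numbers in it are invented. It only illustrates the selectors and string clean-up, so treat it as an approximation of what the real page returns.

```r
library(rvest)    # read_html(), html_nodes(), html_text(), html_table(), %>%
library(stringr)  # str_replace_all(), str_trim(), str_sub()

# Toy page mimicking the assumed structure of one team section
page <- read_html('
<html><body>
  <h2><span class="mw-headline">Group A</span></h2>
  <h3><span class="mw-headline">Team One</span></h3>
  <p>Coach: Coach One</p>
  <table>
    <tr><th>Pos.</th><th>Player</th><th>Date of birth (age)</th><th>Caps</th><th>Goals</th></tr>
    <tr><td>1GK</td><td>Player A</td><td>1 January 1990 (aged 28)</td><td>40</td><td>0</td></tr>
    <tr><td>4FW</td><td>Player B</td><td>2 February 1999 (aged 19)</td><td>15</td><td>4</td></tr>
  </table>
</body></html>')

# Team name: the headline span inside the h3
team <- page %>% html_nodes("h3 .mw-headline") %>% html_text()

# Coach: the paragraph directly after the h3, minus the "Coach: " prefix
coach <- page %>%
  html_nodes("h3 + p") %>%
  html_text() %>%
  str_replace_all("Coach: ", "") %>%
  str_trim()

# Squad table: html_table() on a node set returns a list, so take the first element
squad <- page %>% html_nodes("table") %>% html_table() %>% .[[1]]

# "1GK" -> "GK": drop the leading sort digit
squad$position <- str_sub(squad$`Pos.`, 2, 3)

# "... (aged 28)" -> 28: take the characters just before the closing bracket
squad$age <- as.integer(str_sub(squad$`Date of birth (age)`, -4, -2))

# Group labels must follow team order, so each label is repeated 4 times in a row
groups <- rep(c("Group A", "Group B"), each = 4)
```

Note that `rep(x, each = 4)` gives A, A, A, A, B, B, B, B, whereas `rep(x, 4)` would cycle A, B, A, B and no longer line up with teams listed group by group.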
2. Exploring the players and teams
Before I hire anyone, I’d like to get to know them a little bit more. So let’s do some exploratory analysis. For example, who are the youngest players in the cup and how many goals have they scored?
|table %>%|
|  arrange(age) %>% # Sort on age|
|  .[1:10,] %>% # Select top 10|
|  mutate(Player = factor(str_c(Player, " (", age, ")"))) %>% # Add age to the player's name for plotting|
|  ggplot(aes(x = fct_reorder(Player, Goals), y = Goals)) + # Start plotting|
|  geom_col(aes(fill = team)) + # Add bars|
|  geom_text(aes(y = -0.3, label = position, colour = team)) + # Add text that shows their position|
|  scale_colour_brewer(palette = "Dark2") + # Change colour scheme|
|  scale_fill_brewer(palette = "Dark2") + # Change colour scheme|
|  coord_flip() + # Flip x and y axis|
|  xlab("") + # Remove the label on the player axis|
|  ggtitle("Youngest players in the World Cup") + # Add title to the plot|
|  theme_minimal() + # Use a different theme|
|  theme(panel.grid = element_blank()) # Tweak the theme a little bit|
All right, this 19-year-old Kylian Mbappé from France seems to be doing a great job! Who knows, he might end up on my team!
Now let’s do the same for the oldest players:
Cool! There seem to be 4 goalkeepers in this list. None of them has scored a goal, which is perfectly normal for a player in that position, I guess. In addition, Tim Cahill is one of the oldest players in the World Cup and, with almost 50 goals, has scored the most goals of these 10 players.
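For the oldest players the only real change is the sort direction. Here is a minimal sketch on a toy table; the names and numbers are invented and only the column names follow the scraped table.

```r
library(dplyr)

# Toy stand-in for the scraped table; all values are made up
toy <- tibble(
  Player   = c("Player A", "Player B", "Player C", "Player D"),
  position = c("GK", "FW", "MF", "DF"),
  age      = c(45L, 19L, 38L, 27L),
  Goals    = c(0L, 4L, 50L, 2L)
)

# Sort from oldest to youngest and keep the top 2 (top 10 on the real data)
oldest <- toy %>%
  arrange(desc(age)) %>%
  head(2)
```

On the real table the result would be piped into the same ggplot code as before, with only the title changed.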
Next, let’s have a look at the teams. I wanted to figure out which team was the most experienced. For this, I will just look at the number of caps that each player has:
It’s a little bit overplotted but wow, this is good news! Belgium tops the list and thus has the highest median number of caps. While one could probably interpret this graph in many different ways, to me this means that we’re the most experienced team and thus have a high chance of doing a pretty good job. Or at least, that’s what I choose to believe.
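The ranking behind that graph boils down to one grouped summary. Below is a rough sketch on toy data (team names and cap counts are invented), assuming the scraped table has a `team` and a `Caps` column as above.

```r
library(dplyr)

# Toy data: two teams with three players each; values are made up
toy <- tibble(
  team = c("Team X", "Team X", "Team X", "Team Y", "Team Y", "Team Y"),
  Caps = c(10L, 50L, 90L, 5L, 20L, 30L)
)

# Median number of caps per team, most experienced team first
experience <- toy %>%
  group_by(team) %>%
  summarise(median_caps = median(Caps)) %>%
  arrange(desc(median_caps))
```

I use the median rather than the mean so that one veteran with a huge number of caps does not pull a whole team up the ranking.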
3. Team selection
For my team I’ve decided to go for a 3-4-3 formation because well eeeeuhh…. Just for no reason actually.
I will also start by recruiting midfield players, because back in the days when I played football, I always played that position.
In total there are 234 players to choose from. Let’s go for the obvious ones, the ones that scored the most:
|table %>% filter(position == "MF") %>% # Keep only the midfielders|
|  arrange(desc(Goals)) # Top scorers first|
The first player I wanted to recruit was the top scorer, Thomas Müller, but apparently he’s listed as a forward in the Sporza game. So unfortunately we can’t use him in this position.
- So let’s recruit the second in line: Keisuke Honda! Welcome to the team!
- The third in line is also from Japan, which does not seem like a good choice in this phase of the game, so let’s go for the captain of the Mexican squad Andrés Guardado!
- Up next, from Costa Rica, also the captain: Bryan Ruiz!
- Our final position does not go to Özil, as Germany is in the same group as Mexico and I do not want too much competition within the team, so give a round of applause for Denmark’s own Christian Eriksen!
All of these players’ teams are in the top 8 by median number of caps (see above), so that’s a good sign!
Up next: 3 defenders! Again, I will go for the defenders that scored the highest number of goals. This is actually almost the only option I’ve got with the data I’ve collected.
- The top goalscorer is again a Mexican: Rafael Márquez. Of course, just as a good scientist always BLASTs a sequence first, a good coach always googles his players before picking them for the team. Doing this, I found that Márquez had been banned from playing for his club Atlas for the last 2 months. So maybe that’s not such a strategic choice. Sorry Márquez!
- Instead, I will go for Sergio Ramos!
- Unlucky for Branislav Ivanovic, but I will skip a Serbian player, as they are at the bottom of our “Number caps per team”-graph.
- Although Bruno Alves (Portugal) is in the same group as Sergio Ramos (Spain; group B), I will still go for him, as I think that both Spain and Portugal have a chance of getting through the first round!
- My final player will not be from Panama as they are in the same group as Belgium and I do not think Panama will survive.
So excluding all the players above and the teams from which I’ve already picked a player, I will now focus only on the players with 8 goals and see whether there’s a good fit there:
Well, I’m not going to lie here as this is a little bit more arbitrary, but as a Belgian, I can’t ignore Jan Vertonghen on this list. So welcome to the team, Jan Vertonghen!
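The shortlist described above is just a filter with a few exclusions. Here is a minimal sketch on toy data; the players, teams and goal counts are invented, and `picked_players`, `picked_teams` and `skipped_teams` are hypothetical helper vectors that I would fill in by hand.

```r
library(dplyr)

# Toy stand-in for the scraped table; all values are made up
toy <- tibble(
  Player   = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  team     = c("Team 1", "Team 2", "Team 3", "Team 4", "Team 5"),
  position = c("DF", "DF", "DF", "DF", "MF"),
  Goals    = c(8L, 8L, 8L, 3L, 8L)
)

picked_players <- c("Player A")  # players already on my team
picked_teams   <- c("Team 1")    # teams I already picked a player from
skipped_teams  <- c("Team 2")    # teams I ruled out for other reasons

# Defenders with exactly 8 goals from teams that are still in play
candidates <- toy %>%
  filter(
    position == "DF",
    Goals == 8,
    !Player %in% picked_players,
    !team %in% c(picked_teams, skipped_teams)
  )
```

In this toy example only `Player C` survives all the exclusions.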
That leaves space for three forwards!
Oh, this is definitely a super difficult list to choose from. Let’s make it a little bit easier by looking at which groups our selected players come from:
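|table %>%|
|  filter(selected == "YES") %>% # Keep only the players already on my team|
|  group_by(group) %>% # Count within each group|
|  summarise(count = n()) %>% # Number of selected players per group|
|  ggplot(aes(x = group, y = count)) +|
|  geom_col() # One bar per group (geom assumed, the original line was cut off here)|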
Players from groups A and D are missing!
- Argentina is in group D, and Lionel Messi plays for Argentina. So welcome to him!
- Uruguay is in group A, and Luis Suárez plays for Uruguay. I’ll happily make space for him!
- Now that every group is represented, we’ll have to pick a player that’s already in one of these groups. Starting back at the top, we can’t go for Ronaldo as the count in group B would then be 3. So the next in line is Neymar from Brazil!
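The group check I did by eye for Ronaldo and Neymar can be written as a tiny helper. This is only a sketch: the limit of 2 players per group is my own self-imposed rule, not a Sporza rule, `can_add()` is a hypothetical function, and the group letters come from the 2018 draw.

```r
# Group of each player selected so far (names without accents for simplicity)
selected_groups <- c(
  Honda = "H", Guardado = "F", Ruiz = "E", Eriksen = "C",
  Ramos = "B", Alves = "B", Vertonghen = "G", Messi = "D", Suarez = "A"
)

# Would adding a player from `candidate_group` stay within the per-group limit?
can_add <- function(selected_groups, candidate_group, max_per_group = 2) {
  sum(selected_groups == candidate_group) + 1 <= max_per_group
}

ronaldo_ok <- can_add(selected_groups, "B")  # Portugal is in group B
neymar_ok  <- can_add(selected_groups, "E")  # Brazil is in group E
```

Here `ronaldo_ok` is `FALSE`, because group B would go to 3 players, while `neymar_ok` is `TRUE`.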
Only one final slot to fill: our goalkeeper!
It makes a lot of sense that of all goalkeepers, not a single one has scored a goal. This means that choosing a keeper will be practically impossible with the data we have.
Therefore I’ll just go for our national hero: Thibaut Courtois!
4. Closing remark
So finally, I proudly present my official team lineup!
I’m pretty sure I will do a terrible job in our competition, but at least I’ve learned how to scrape a Wikipedia page using R.