Using rvest to pick my 2018 World Cup fantasy team

Today is the first day of the 2018 FIFA World Cup in Russia. While I do not know a lot about football, I do enjoy cheering for the Red Devils at both the European Championship and the World Cup, and so do most of my friends. That’s why, for the second time in a row, we are organising a small fantasy football league. This year we will be using the Sporza WK manager, provided by our national sports news network. The rules of their game are simple:

  1. Pick 11 players
  2. Stay within a budget of 300 million euro
  3. A maximum of 3 players from the same team are allowed
  4. Transfers are possible in certain phases of the tournament

You earn points when your players score goals and keep clean sheets, while you’ll lose points when your players score own goals or get cautioned.
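The selection rules above can be sketched as a small check in R. This is only an illustration: the team below, its prices and the helper `check_team()` are all made up, since the real player prices live in the Sporza game and not in the data used in this post.

```r
library(tidyverse)

# Hypothetical team: names, countries and prices are invented for illustration
my_team <- tibble(
  player = str_c("Player ", 1:11),
  team   = c("A", "A", "A", "B", "B", "C", "C", "D", "E", "F", "G"),
  price  = c(40, 35, 30, 30, 28, 25, 25, 22, 20, 20, 15) # million euro
)

# Hypothetical helper: check 11 players, the budget cap and the max-3-per-team rule
check_team <- function(team_tbl, budget = 300, max_per_team = 3) {
  per_team <- count(team_tbl, team)
  list(
    eleven_players = nrow(team_tbl) == 11,
    within_budget  = sum(team_tbl$price) <= budget,
    team_limit_ok  = all(per_team$n <= max_per_team)
  )
}

result <- check_team(my_team)
```

Calling `check_team()` on a candidate line-up returns one TRUE/FALSE flag per rule, which makes it easy to see which rule a team breaks.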

Like I said, I do not know a lot about football, so I would not be able to design a team from scratch. Therefore I will be using R, the tidyverse and rvest to help me with this serious task!

1. Data collection

The first step of our journey is, of course, data collection. Searching for a decent dataset was probably one of the hardest tasks. I wanted to use up-to-date player and team ratings, but could not find a decent source. Since I didn’t have a lot of time, I settled for the ‘2018 FIFA World Cup squads’ Wikipedia page.

The page has a section for every team that contains the following information:

  • The group they are in
  • The coach of the team
  • A table with player stats such as
    • Position
    • Date of birth
    • Caps (= matches played)
    • Goals
    • Club


I’ve never done this before, but this seems like a pretty easy web scraping job. For this I will try out rvest, using this tutorial as a guide. In the end I parsed the whole page using the following script:

library(tidyverse) # Pipes, tibble(), unnest() and the str_*() helpers
library(rvest)     # read_html(), html_table(), html_nodes(), html_text()

# Read in the website (put the URL of the '2018 FIFA World Cup squads' Wikipedia page here)
site <- read_html("")

# Parse website for player tables
players <- site %>%
  html_table(fill = TRUE) %>%
  .[1:32] # Keep only the tables related to the 32 teams

# Parse website for team names
teams <- site %>%
  html_nodes("h3 .mw-headline") %>%
  html_text() %>%
  .[1:32] # Keep only the first 32 hits

# Parse website for coach names
coaches <- site %>%
  html_nodes("h3+ p") %>%
  html_text() %>%
  .[1:32] %>% # Keep only the first 32 hits
  str_replace_all("Coach: ", "") %>% # Clean up the string
  str_trim() # Remove leading and trailing whitespace

# Parse website to figure out in which group the team competes
group <- site %>%
  html_nodes("h2 .mw-headline") %>%
  html_text() %>%
  .[1:8] %>% # Keep only the first 8 hits
  rep(each = 4) # Repeat each group 4 times so it lines up with the team vector

# Now that we have all of the tables separately, let's combine them into one
table <- tibble(team = teams,
                coach = coaches,
                group = group,
                player = players) %>%
  unnest(player) %>% # The player column is a list of tables, so we need to unnest it
  rename(position = `Pos.`) %>%
  mutate(position = str_sub(position, 2, 3)) %>% # Fix parsing error
  rename(age = `Date of birth (age)`) %>%
  mutate(age = as.integer(str_sub(age, -3, -2))) # Keep the two digits in the trailing "(aged NN)"
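To see what the two clean-up steps at the end of the script do without hitting Wikipedia, here is a minimal sketch on a hand-written table. The HTML and the player row are placeholders that only imitate the layout I assume the squad tables have (a sort key glued to the position, and an age at the end of the date of birth column).

```r
library(tidyverse)
library(rvest)

# A tiny hand-written table that imitates one squad table.
# The row is a placeholder, not real player data.
html <- '<table>
  <tr><th>No.</th><th>Pos.</th><th>Player</th><th>Date of birth (age)</th>
      <th>Caps</th><th>Goals</th><th>Club</th></tr>
  <tr><td>1</td><td>1GK</td><td>Example Keeper</td>
      <td>(1990-01-01)1 January 1990 (aged 28)</td>
      <td>10</td><td>0</td><td>Example FC</td></tr>
</table>'

squad <- read_html(html) %>%
  html_node("table") %>%
  html_table() %>%
  rename(position = `Pos.`, age = `Date of birth (age)`) %>%
  mutate(position = str_sub(position, 2, 3),       # "1GK" becomes "GK"
         age = as.integer(str_sub(age, -3, -2)))   # "... (aged 28)" becomes 28
```

Negative indices in `str_sub()` count from the end of the string, which is why `-3, -2` picks up the two digits just before the closing bracket. This assumes every age has two digits.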

2. Exploring the players and teams

Before I hire anyone, I’d like to get to know them a little bit more. So let’s do some exploratory analysis. For example, who are the youngest players in the cup and how many goals have they scored?

table %>%
  arrange(age) %>% # Sort on age
  .[1:10,] %>% # Select top 10
  mutate(Player = factor(str_c(Player, "(", age,")"))) %>% # Add age to the players name for plotting
  ggplot(aes(x = fct_reorder(Player, Goals), y = Goals)) + # Start plotting
  geom_col(aes(fill = team)) + # Add bars
  geom_text(aes(y = 0.3, label = position, colour = team)) + # Add text that shows their position
  scale_colour_brewer(palette = "Dark2") + # Change colour scheme
  scale_fill_brewer(palette = "Dark2") + # Change colour scheme
  coord_flip() + # Flip x and y axis
  xlab("") + # remove label on y-axis
  ggtitle("Youngest players in the World Cup") + # Add title to the plot
  theme_minimal() + # Use a different theme
  theme(panel.grid = element_blank()) # Tweak the theme a little bit


All right, this 19-year-old Kylian Mbappé from France seems to be doing a great job! Who knows, maybe he’ll end up in my team!

Now let’s do the same for the oldest players:
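A minimal sketch of how that could look: the same pipeline as for the youngest players, but sorted in descending order of age. The tibble `squads_toy` is a made-up stand-in for the scraped table, with placeholder rows instead of real players.

```r
library(tidyverse)

# Placeholder rows standing in for the scraped table; not real player data
squads_toy <- tibble(
  Player   = c("Player A", "Player B", "Player C", "Player D"),
  team     = c("Team 1", "Team 1", "Team 2", "Team 3"),
  position = c("GK", "FW", "DF", "MF"),
  age      = c(45L, 38L, 24L, 19L),
  Goals    = c(0L, 50L, 3L, 7L)
)

oldest <- squads_toy %>%
  arrange(desc(age)) %>% # Oldest first instead of youngest first
  slice(1:3) %>%         # The post keeps 10; 3 is enough for this toy table
  mutate(Player = str_c(Player, " (", age, ")")) # Add age to the name for plotting

p <- ggplot(oldest, aes(x = fct_reorder(Player, Goals), y = Goals)) +
  geom_col(aes(fill = team)) +
  coord_flip() +
  xlab("") +
  ggtitle("Oldest players in the World Cup") +
  theme_minimal()
```

Printing `p` draws the bar chart; on the real data you would swap `squads_toy` for the scraped table and keep the top 10.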

Cool! There seem to be 4 goalkeepers in this list. None of them has scored a goal, which is perfectly normal for a player in that position, I guess. In addition, Tim Cahill is one of the oldest players in the World Cup and, with almost 50 goals, has scored the most of these 10 players.

Next, let’s have a look at the teams. I wanted to figure out which team was the most experienced. For this, I will just look at the number of caps that each player has:
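One way to do this, sketched on made-up numbers: compute the median number of caps per team and order the teams by it. The boxplot with jittered points is my assumption about a reasonable plot type, and `caps_toy` is a placeholder for the scraped table.

```r
library(tidyverse)

# Placeholder data: two invented teams with three players each
caps_toy <- tibble(
  team = c("Team 1", "Team 1", "Team 1", "Team 2", "Team 2", "Team 2"),
  Caps = c(10, 50, 90, 5, 20, 30)
)

# Median number of caps per team, most experienced team first
team_experience <- caps_toy %>%
  group_by(team) %>%
  summarise(median_caps = median(Caps)) %>%
  arrange(desc(median_caps))

# One box per team, ordered by median caps, with the individual players on top
p <- caps_toy %>%
  ggplot(aes(x = fct_reorder(team, Caps, .fun = median), y = Caps)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, alpha = 0.5) +
  coord_flip() +
  xlab("") +
  theme_minimal()
```

The median is less sensitive than the mean to a single veteran with a huge number of caps, which is why it is a handy summary here.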


It’s a little bit overplotted, but wow, this is good news! Belgium tops the list and thus has the highest median number of caps. While one could probably interpret this graph in many different ways, to me this means that we’re the most experienced team and thus have a high chance of doing a pretty good job. Or at least, that’s what I choose to believe.

3. Team selection

For my team I’ve decided to go for a 3-4-3 formation because well eeeeuhh…. Just for no reason actually.

I will also start by recruiting midfield players, because back in the days when I played football, I always played that position.

3.1 Midfield

In total there are 234 players to choose from. Let’s go for the obvious ones, the ones that scored the most:

table %>%
  filter(position == "MF") %>% # Keep only the midfielders
  arrange(desc(Goals)) # Assumed continuation: rank them by goals scored



The first player I wanted to recruit was the top scorer, Thomas Müller, but apparently he’s listed as a forward in the Sporza game. So unfortunately we can’t use him in this position.

  • So let’s recruit the second in line: Keisuke Honda! Welcome to the team!
  • The third in line is also from Japan, which does not seem like a good choice in this phase of the game, so let’s go for the captain of the Mexican squad Andrés Guardado!
  • Up next, from Costa Rica, also the captain: Bryan Ruiz!
  • Our final position does not go to Özil, as Germany is in the same group as Mexico and I do not want too much competition within the team, so give a round of applause for Denmark’s own Christian Eriksen!

All players are in the top 8 teams regarding their median number of caps (see above), so that’s a good sign!

3.2 Defence

Up next: 3 defenders! Again, I will go for the defenders that scored the highest number of goals. This is actually almost the only option I’ve got with the data I’ve collected.



  • The top goalscorer is again a Mexican: Rafael Márquez. Of course, a good coach always googles his players before picking them for the team (just like a good scientist always BLASTs a sequence first). Doing this, I found that Márquez had been banned from playing for his club Atlas for the last 2 months. So maybe that’s not such a strategic choice. Sorry Márquez!
  • Instead, I will go for Sergio Ramos!
  • Unlucky for Branislav Ivanović, but I will skip a Serbian player, as they are at the bottom of our “Number caps per team”-graph.
  • Although Bruno Alves (Portugal) is in the same group as Sergio Ramos (Spain; group B), I will still go for him, as I think that both Spain and Portugal have a chance of getting through the first round!
  • My final player will not be from Panama as they are in the same group as Belgium and I do not think Panama will survive.

So excluding all the players above and the teams from which I’ve already picked a player, I will now focus only on the players with 8 goals and see whether there’s a good fit there:
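That filter could be sketched as follows. Everything here is illustrative: `defenders_toy` holds placeholder rows, and `picked_teams` and `excluded_players` are hypothetical vectors that you would fill with the teams and players ruled out above.

```r
library(tidyverse)

# Placeholder rows standing in for the scraped table; not real player data
defenders_toy <- tibble(
  Player   = c("Player A", "Player B", "Player C", "Player D", "Player E"),
  team     = c("Team 1", "Team 2", "Team 3", "Team 4", "Team 5"),
  position = c("DF", "DF", "DF", "DF", "MF"),
  Goals    = c(8L, 8L, 8L, 5L, 8L)
)

picked_teams     <- c("Team 1")   # Teams I already picked a player from
excluded_players <- c("Player B") # Players I decided to skip

candidates <- defenders_toy %>%
  filter(position == "DF",                # Defenders only
         Goals == 8,                      # Exactly 8 goals
         !team %in% picked_teams,         # Drop teams already represented
         !Player %in% excluded_players)   # Drop players ruled out earlier
```

Separate conditions inside a single `filter()` call are combined with AND, so only rows passing every check remain.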


Well, I’m not going to lie: this one is a little more arbitrary, but as a Belgian, I can’t ignore Jan Vertonghen on this list. So welcome to the team, Jan Vertonghen!

3.3 Forwards

That leaves space for three forwards!


Oh, this is definitely a super difficult list to choose from. Let’s make it a little bit easier by looking at which groups our players so far come from:

table %>%
  filter(selected == "YES") %>% # Players already picked for my team
  group_by(group) %>%
  summarise(count = n()) %>% # Number of picked players per group
  ggplot(aes(x = group, y = count)) +
  geom_col()



Players from groups A and D are missing!
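The same conclusion can be reached without reading it off a plot. Below is a minimal sketch that assumes a `selected` column with "YES"/"NO" values (how that column is filled is not shown in this post) and group labels of the form "Group A". The rows in `selected_toy` are placeholders.

```r
library(tidyverse)

# Placeholder rows: seven picked players plus one player who was not picked
selected_toy <- tibble(
  Player   = str_c("Player ", 1:8),
  group    = str_c("Group ", c("H", "F", "E", "C", "B", "B", "G", "A")),
  selected = c(rep("YES", 7), "NO")
)

all_groups <- str_c("Group ", LETTERS[1:8]) # Group A to Group H

# Number of picked players per group
group_counts <- selected_toy %>%
  filter(selected == "YES") %>%
  count(group)

# Groups without a single picked player
missing_groups <- setdiff(all_groups, group_counts$group)
```

`group_counts` also makes it easy to spot groups that already hold two of my players before adding a third.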

  • Argentina is in group D, and Lionel Messi plays for Argentina. So welcome to him!
  • Uruguay is in group A, and Luis Suárez plays for Uruguay. I’ll happily make space for him!
  • Now that every group is represented, we’ll have to pick a player that’s already in one of these groups. Starting back at the top, we can’t go for Ronaldo as the count in group B would then be 3. So the next in line is Neymar from Brazil!

3.4 Goalkeeper

Only one final slot to fill: our goalkeeper!


It makes a lot of sense that of all goalkeepers, not a single one has scored a goal. This means that choosing a keeper will be practically impossible with the data we have.

Therefore I’ll just go for our national hero: Thibaut Courtois!

4. Closing remark

So finally, I proudly present my official team lineup!


I’m pretty sure I will do a terrible job in our competition, but at least I’ve learned how to scrape a Wikipedia page using R.


All code can be found on my GitHub page as an RMarkdown file.
