Easy Guide to Web-Scrape Transfermarkt with Python (Easy Steps that anybody can follow!)

Sanjit Varma
6 min read · Sep 16, 2021


This blog is a tutorial for readers who want to collect information from Transfermarkt.co.uk. There is no built-in way to download all of the site's data. Thankfully, by making use of Python and its BeautifulSoup library, even readers with no programming background can scrape data from the website.

For this blog, we will look at how to scrape the data of every player on a given team's page. We will use my favorite team, Manchester United, for this example.

The first step is to install a Python IDE (Integrated Development Environment) on your computer (non-programming readers, it's not as difficult as it sounds!). I would recommend Jupyter Notebook, as it has a simple interface that is easy to navigate.

Here is an easy-to-follow guide on how to set up Jupyter Notebook on your computer.

Once you have Jupyter Notebook installed, you will see something that looks like this:

The first thing you want to do is import the necessary libraries that will help you access a website, scrape its data and store it in a dataframe.

For web-scraping, we are going to make use of 'Requests' to download the page and 'BeautifulSoup' to parse it; to clean and store our messy scraped data, we will use 'Pandas'.

Type the code you see below in your notebook so that we can import these packages.
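The import cell looks like this, with one library for each job:

```python
import requests                  # downloads the raw HTML over HTTP
from bs4 import BeautifulSoup    # parses the HTML so we can search it
import pandas as pd              # cleans and stores the scraped data
```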

If you do not have 'Pandas' installed on your computer, just type '!pip install pandas' in a cell. Since I already have it installed, my IDE tells me "Requirement already satisfied".

Now that we have our packages imported, we need to create a headers dictionary to send along with our request. Transfermarkt rejects requests that do not look like they come from a real browser, so the dictionary carries a browser-style 'User-Agent' string.

You may then copy the code written below and paste it into a new cell in your notebook:
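A request cell along these lines does the job. The URL is Manchester United's squad page copied straight from my browser's address bar (swap in your own team's page if you like), and any reasonably modern browser User-Agent string will do:

```python
import requests
from bs4 import BeautifulSoup

# A browser-style User-Agent so Transfermarkt accepts the request.
headers = {'User-Agent':
           'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

# The squad page URL, copied from the browser's address bar.
page = 'https://www.transfermarkt.co.uk/manchester-united/kader/verein/985'

pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'html.parser')
```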

After running the cell above, you can type 'pageSoup' in the next cell to see what the HTML source code of your webpage looks like.

Yes, I know what you may be thinking: "What on earth is this mess I find myself in, and why can't there be a simple download button on Transfermarkt's website?"

Thankfully, we do not need all the information on this page; all we need is the player information. In this scenario we will scrape each player's name, age, position, nationality and transfer value.

First, create a bunch of empty lists for data we want to gather. In a cell below, type the following code:
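Here is the cell, with one empty list per column of the final dataset (the list names are my choices and are reused for the rest of this guide):

```python
# One empty list per column of the final dataset.
PlayersList = []
AgesList = []
PositionsList = []
NationalityList = []
ValuesList = []
ClubsList = []
```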

Given that all our data is stored in a table, we can filter the page to only look for objects that are of the type we are looking for.

We will now filter the webpage's source code to find the information we need and add it to the lists above. Enter the following code in a new cell.

Each of the variables created below filters the 'pageSoup' source code down to a single kind of object using the 'find_all()' method.

This is what you should see if you display the 'Players' variable in a cell below.

You can see in the snapshot above that our player names are there. However, there is still a lot of unnecessary code that we don't want; we simply want a list of player names. The same goes for all the other variables we created above: 'Age', 'Positions', 'Values', 'Nationality' and 'Club'.

Run the code below in a new cell to filter for only the information we want:

The code above iterates through each item in the variables set in the previous cell, keeps only its readable text, and crops out the unwanted markup, as in the example of the 'Players' variable seen above. All of this cropped information is then stored in the empty lists we created earlier with the help of the 'append()' method. We can check that this worked by entering the name of the player names list, 'PlayersList', and running the cell.

Pretty cool right? Let’s look at the list of the players’ respective positions:

And their transfer values:

Whoops! That doesn't look so good. Ideally, we would like the values stored as numbers rather than strings. A single splitting rule won't work for all of them, because some values are in millions while others are in thousands, so we need a conditional loop to clean this up. Copy the code below into your notebook to do this.

The above loop checks for two conditions. It iterates through every item in the list of scraped transfer values and sees if it contains the substring ‘m’ or ‘Th.’ in it. If either condition is met, they are cleaned up accordingly. Below is a cell that shows us what our “cleaned_values” list looks like:

That's more like what we were looking for! Now that we have all the data we want, we can combine it all into one dataframe. This is where we make use of the Pandas library that we loaded earlier.

In the cell above we created a Pandas dataframe from a dictionary, with the column names as keys and our lists of scraped information as the values. Each column now consists of the elements of its list as rows of data. Let's look at what our final dataset looks like.

Now that's a clean-looking dataset! If you want, you can export it to an Excel spreadsheet using the code below:

Now you can open the folder your notebook was launched from (often your user or Documents folder) and find the Excel file stored there.

Hopefully, this guide to scraping Transfermarkt was easy to follow. Feel free to follow my Medium profile for more how-to guides in the future. Cheers :)
