
Building a Continuous Webscraping System

[Header image: word cloud generated from Deutsche Bahn delay messages]

As a Data Scientist, you need consistent data at scale before you can start doing the work you actually want to do. But reliable sources are hard to find - or have already been used in many projects before. That’s why gathering your own data sometimes becomes so important.

In this tutorial, I show you how to create a robust system for continuously scraping online resources. It is a great technique for tracking how a web page develops over time, so that you can build time series data and eventually predict future changes. I will also show you a quick example of how to analyze the extracted data. You don’t need to be a programming expert, but there are a few requirements:

  • You need a system that is available 24x7, preferably running Linux. I’m using an old Raspberry Pi (version 1) with a Raspbian distribution.
  • Since you are going to harvest a lot of data (I'm currently collecting about 3 GB per week), an external hard drive is usually needed.
  • For the data analysis, you need a Python installation (download here), the scientific library pandas (link) and BeautifulSoup. For visualization purposes, you need matplotlib and wordcloud.
  • Optional: In case you want to access your machine remotely, you can use e.g. dataplicity (find the docs here).

Once you have decided which websites you want to scrape, we can start with the implementation. The resources must either be publicly accessible online, or your device must be inside the same network as the site.



Step 1 - Connecting your external hard drive (NAS specific)

I'm using a network attached storage (NAS) to store the extracted data. I was using it anyway and it has around 4 TB of free space. At my current rate it will take around 25 years until the disk is full - that sounds fair to me. To connect the NAS to your system, you need to look up the IP address of the storage (you can typically find it in the setup page of your router). Once you've got the IP address, execute the following commands on your 24x7 system:

$ sudo mkdir /data
$ sudo mount -t cifs -o username=USERNAME_NAS,password=PASS_NAS //192.168.**.**/data /data

The first command creates a folder where you will store the data later on. The second command mounts the "data" folder on your NAS to the "/data" folder of your system.

To check whether the mounting was successful, type sudo mount. This should print a line like the following:

//192.168.**.**/data on /data type cifs (rw,relatime,vers=default,cache=strict,username=*****,domain=,uid=0,...)
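
If you want to sanity-check that rough "25 years" figure against your own disk, a few lines of Python will do (a minimal sketch; the ~3 GB/week rate is just my own number from the requirements above, so adjust it to yours):

import shutil

# Free space on the mounted share and a rough estimate of how long it will
# last at the scraping rate mentioned above (~3 GB per week).
free_bytes = shutil.disk_usage('/data').free
gb_per_week = 3  # assumption: my own scraping rate, replace with yours
weeks_left = free_bytes / (gb_per_week * 1024**3)
print(f'~{weeks_left / 52:.1f} years until /data is full at {gb_per_week} GB/week')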



Step 2 - Setting up the crontab

Now that we have enough space ready, we can start scraping. This is a fairly simple process: just execute sudo crontab -e. You will be asked which editor you prefer to use. Choose option #1 (nano) if you are not familiar with vim. You should see the following screen now:

[Screenshot: empty crontab file]


Here you can add your cron jobs (cron is a time-based job scheduler) in the following format:

┌───────────── minute (0 - 59)
│ ┌───────────── hour (0 - 23)
│ │ ┌───────────── day of the month (1 - 31)
│ │ │ ┌───────────── month (1 - 12)
│ │ │ │ ┌───────────── day of the week (0 - 6) (Sunday to Saturday)
│ │ │ │ │
│ │ │ │ │
│ │ │ │ │
* * * * * command to execute

Since we want to download a website, add the following line at the end of the file (scroll down with your arrow keys). Remember to replace the URL with your desired web page.

16 6,8,10,14,18,22 * * * /usr/bin/wget http://reddit.com -O "/data/reddit/$(date +"\%Y\%m\%d_\%H\%M\%S").html"

This will download the website "reddit.com" every day at 6:16 am, 8:16 am, 10:16 am and so on. Be aware that cron uses your system's time zone - UTC in my case. Also make sure that you have created the folder "/data/reddit". The downloaded files will follow the naming convention 20180928_081602.html.

I've set up many cron jobs in my crontab file. It looks like this:

[Screenshot: my crontab with several scraping jobs]


After you are done editing, save and close the file (Ctrl + X in nano). You will see the line crontab: installing new crontab. Depending on the schedule you have set up, you will soon see the first downloaded files in your /data folder or on your NAS.
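
To quickly verify that the cron job really produces files, a few lines of Python can list what has arrived so far (a small sketch, assuming the same /data/reddit folder and timestamped .html naming as above):

import os

# Count the downloaded snapshots and show the newest one.
folder = '/data/reddit/'
files = sorted(f for f in os.listdir(folder) if f.endswith('.html'))
print(f'{len(files)} snapshots downloaded, newest: {files[-1] if files else "none"}')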

Here you can find more info about cron jobs.



Step 3 - Analyzing the data

Once you have a solid foundation of data, the fun part starts - analyzing the files. The reason why I don't do this on the fly, while downloading the HTML files, is simple: parsing the sites can be error-prone, since the file format or the design of the website can change. Of course, this way you store a lot of unused information. But I think it's still a good trade-off compared to storing no data at all (which would be the case if an error crept in somewhere that you don't catch immediately).

I have downloaded the status website of the "Deutsche Bahn" (the German railway company). More specifically, I have downloaded the page ~1500 times (every hour for the last 2 months). At first glance, this site seems uninteresting, but over time you get a good picture of which train connections are constantly delayed and which cities are affected (I know there is an API - but this is a demo of web scraping, and the API never really worked for me anyway...). The page usually looks like this:

[Screenshot: Deutsche Bahn status website]


I've done an example analysis of the files in Python. You need to install pandas, beautifulsoup4, wordcloud and matplotlib. Via pip you can do it like this:

$ pip install pandas beautifulsoup4 wordcloud matplotlib

Now we have all the dependencies for the analysis. Create a text file named analyze_sites.py and write the following code into it (in case you are lazy, you can download the file here):

import os
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime
import matplotlib.pyplot as plt
from wordcloud import WordCloud

folder = '/data/<FOLDERNAME>/'  # the folder your cron job downloads into
directory = os.fsencode(folder)
df = pd.DataFrame()

for file in os.listdir(directory):
    f_name = os.fsdecode(file)
    soup = BeautifulSoup(open(folder + f_name), 'html.parser')

The first part of the script is generic and can be applied to any website you are scraping. The only thing you need to do is replace the folder variable with the folder your files are actually located in. The script loops through all files in this folder and opens each one with BeautifulSoup, which lets us search for specific HTML tags.

The next part is specific to the "Deutsche Bahn" page:

    text = list()
    for item in soup.find_all('span'):
        if 'class' in item.parent.parent.attrs:
            if item.parent.parent.attrs['class'][0] == 'bullet-list':
                text.append(item.text)


soup.find_all('span') returns all <span> tags. You can loop through them and check whether the grandparent element carries the class 'bullet-list'. If you look at the source code of the website, this is where the specific delay information of "Deutsche Bahn" is displayed (like train RB 85 from XX to YY). I take each of these strings and append it to a list.
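
If you prefer CSS selectors, BeautifulSoup's select method can express roughly the same lookup in a single line - roughly, because it matches <span> tags anywhere below an element with class 'bullet-list', not only exactly two levels down:

    # Inside the same loop: all <span> tags that sit below a 'bullet-list' element.
    text = [span.get_text() for span in soup.select('.bullet-list span')]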

Afterwards, I append the text list, the date and the number of delays to the pandas DataFrame:

    df = df.append({
        'date': datetime.strptime(f_name[:-5], '%Y%m%d_%H%M%S'),
        'number_of_delays': len(text),
        'text': text,
    }, ignore_index=True)
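
A small caveat for newer environments: DataFrame.append was removed in pandas 2.0. If your pandas version no longer has it, the same loop can be written by collecting plain dicts in a list and building the DataFrame once at the end (a sketch using the same imports and variables as above):

records = []
for file in os.listdir(directory):
    f_name = os.fsdecode(file)
    soup = BeautifulSoup(open(folder + f_name), 'html.parser')
    text = [span.get_text() for span in soup.select('.bullet-list span')]
    # one dict per downloaded snapshot instead of df.append(...)
    records.append({
        'date': datetime.strptime(f_name[:-5], '%Y%m%d_%H%M%S'),
        'number_of_delays': len(text),
        'text': text,
    })
df = pd.DataFrame(records)
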
Now we have a nice DataFrame to make some visualizations. Firstly, I want to create a word cloud with all the text strings from the delay notifications. We need to flatten the 'text' column and join the list of strings to one big string. Afterwards, we can use the WordCloud package to generate the image:

wordcloud = WordCloud(max_words=10000, max_font_size=60, width=2000, height=1000).generate(
%t%' '.join([item for sublist in list(df['text']) for item in sublist])
)
image = wordcloud.to_image()
image.save('wordcloud.png', format='png', optimize=True)
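
Since the delay messages are German sentences, filler words like "von" and "nach" will probably dominate the cloud. WordCloud accepts a stopwords parameter to filter them out - a sketch, with a stopword list that is purely my own guess and should be extended as needed:

from wordcloud import WordCloud, STOPWORDS

# Add a few common German filler words (my own guess) to the built-in English stopword list.
german_fillers = {'von', 'nach', 'und', 'der', 'die', 'das', 'bis', 'in', 'am', 'um'}
wordcloud = WordCloud(max_words=10000, max_font_size=60, width=2000, height=1000,
                      stopwords=set(STOPWORDS) | german_fillers).generate(
    ' '.join([item for sublist in list(df['text']) for item in sublist])
)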

Execute the file with the command $ python analyze_sites.py. This takes a while (but not more than a few minutes). The image will be saved in your current folder. The more frequently a phrase occurs, the bigger and more prominent it appears in the image. The generated word cloud looks like this:

[Image: word cloud generated from the Deutsche Bahn delay messages]


Since we have also saved the date and the number of delay messages in our DataFrame, I want to plot a graph that shows the course over time. You can do this with matplotlib:

df.set_index('date', inplace=True)
ts = pd.Series(df['number_of_delays'], index=df.index)
ts.sort_index(inplace=True)
ax = ts.plot(kind='line', title='Delays at Deutsche Bahn status website', grid=True, rot=45)
ax.set_ylabel("# of delays")
plt.savefig('course_over_time.png')
plt.close()

This graph actually looks pretty boring at first, but I'm sure it gets interesting with a few tweaks (like showing during which hours of the day most delays arise, or what happened during that spike with around 30 delays).

[Plot: number of delay messages on the Deutsche Bahn status page over time]
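
As a first step towards the "which hours are worst" question: since the DataFrame is already indexed by date, a simple groupby on the hour of the day is enough (a sketch, reusing the df and imports from the script above):

# Average number of delay messages per hour of the day (0-23).
hourly = df['number_of_delays'].groupby(df.index.hour).mean()
ax = hourly.plot(kind='bar', title='Average delays by hour of day', grid=True)
ax.set_xlabel('hour of day')
ax.set_ylabel('avg # of delays')
plt.savefig('delays_by_hour.png')
plt.close()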



What's next...

As I said, this is just an example evaluation of the extracted data. It demonstrates how a complete web scraping process can work. In my future blog posts, I will show you more useful analyses of scraped HTML files.

Until then, feel free to share your thoughts and comments.
