In this tutorial we continue with our Python Django web app development series. We look into Python web and PDF scraping techniques that we can use to collect data for our web app.
Data can come in many formats, mostly HTML from other websites – but in some cases you get lists of PDF documents and have to extract the data for your website. Hence we will be demonstrating both techniques in this tutorial.
Web App Initial Data – where to get it from
When you build a web application like we are doing, you need to get initial traction from users. The users we need are:
- Clients who will apply for jobs
- Companies to post those jobs
However, when we start we have neither. And people are unlikely to use our application if there is no job data in it. This is the challenge most web app companies face: how do you get that initial traction of traffic?
Web Scraping
Web scraping has been around for as long as there have been websites. People collected data from online sources manually until it made more sense to automate the process with code. Google is the biggest web scraper of all. This is a skill you can monetise in many ways, from employment to creating your own opportunities.
Job Data from Government Website
We are going to start our web data gathering efforts with government jobs. They are usually the easiest to find in one place. This is the website we will be working from: http://www.dpsa.gov.za/dpsa2g/vacancies.asp
We already have a Python Django application running, check out our previous tutorials to catch up to where we are:
- Creating a Django Project from scratch
- Add user registration, login and logout
- Creating complex Django models with slug field
- Python Django user registration emails
The data we will be getting is in PDF format – and we will be using the Python PyPDF2 library to extract the data from the PDF files using the steps below:
Step 1: Download the Document
import subprocess

url = 'http://www.dpsa.gov.za/dpsa2g/documents/vacancies/2021/05/a.pdf'
location = 'filepath-location-on-your-computer-documentName.pdf'

def run_command(command):
    # Run a shell command and return its standard output
    p = subprocess.Popen(command, stdout=subprocess.PIPE)
    out, err = p.communicate()
    return out

run_command(["wget", "-O", location, url])
We are going to download the document located at the provided URL using wget. So before you run this code, make sure you have wget installed on your machine.
The run_command function simply lets us run a shell command using the subprocess library, which we imported at the top of the file.
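If you would rather not depend on wget being installed, the same download can be done in pure Python with the standard library's urllib.request. This is a sketch of an alternative, not the approach used in the rest of the tutorial; the URL and file path are the same placeholders as above.

```python
import urllib.request

def download_file(url, location):
    """Download the file at url, save it to location,
    and return the number of bytes written."""
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(location, 'wb') as f:
        f.write(data)
    return len(data)

# Example usage with the tutorial's URL and placeholder path:
# download_file('http://www.dpsa.gov.za/dpsa2g/documents/vacancies/2021/05/a.pdf',
#               'filepath-location-on-your-computer-documentName.pdf')
```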
Step 2: Read the PDF Document using PyPDF2
Once the document is downloaded, we can open it using the Python library PyPDF2. We will then be able to run queries against the document to extract the data.
from PyPDF2 import PdfFileReader

location = 'filepath-location-on-your-computer-documentName.pdf'
content_list = []

with open(location, 'rb') as f:
    doc = PdfFileReader(f)
    pages = doc.numPages
    count = 0
    while count < pages:
        the_page = doc.getPage(count)
        the_text = the_page.extractText()
        a_list = the_text.replace('\n', '').split(' : ')
        for x in a_list:
            content_list.append(x)
        count += 1
The code above opens the PDF document from where it was saved when we downloaded it in Step 1. After opening the document, we query the number of pages in the document using the PyPDF2 library.
We then iterate through those pages one at a time and extract the text from each page. We then split the text into a list and collect the items into our content_list variable.
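To make the replace/split step concrete, here is a tiny illustration with a made-up fragment in the style of the vacancy circular (the real extracted text will be messier):

```python
# A made-up fragment in the style of the vacancy circular
sample = 'POST 05/38 : ASSISTANT DIRECTOR : PRETORIA : R376 596 per annum'

# Strip newlines, then split on the ' : ' delimiter used throughout the document
fields = sample.replace('\n', '').split(' : ')
# fields -> ['POST 05/38', 'ASSISTANT DIRECTOR', 'PRETORIA', 'R376 596 per annum']
```

Note that removing newlines outright can fuse words that the PDF wrapped across lines, which is part of the data clean-up mentioned later.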
Step 3 – Parse the PDF document to get the data out
Parsing the document is the complicated part. The code will depend on the document we are parsing. We basically need to find elements in the document that are consistent to anchor on and use them to navigate through the content of the file.
The delimiter we decided on is " : ". After splitting the document on it, we can navigate through the content and find the titles, paragraphs, etc.
We then coerce this data into an object that matches the database model we are using in our Django app.
post_string = 'POST 05/'
post_list = []
index_list = []
final_jobs = []

# Collect every entry that marks the start of a job post
for x in content_list:
    if post_string in x:
        post_list.append(x)

# Record the index of each post marker in content_list
for x in post_list:
    the_index = content_list.index(x)
    index_list.append(the_index)

# The six entries after each marker hold the job fields, in order
for x in index_list:
    obj = {}
    obj['title'] = content_list[x+1]
    obj['salary'] = content_list[x+2]
    obj['location'] = content_list[x+3]
    obj['requirements'] = content_list[x+4]
    obj['duties'] = content_list[x+5]
    obj['enquiries'] = content_list[x+6]
    final_jobs.append(obj)
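The offset-based extraction can also be written as a small helper that pairs the six field names with the entries following each POST marker. This is an equivalent sketch of the same idea, not a change to the tutorial's approach:

```python
# Field names in the order they follow each 'POST 05/' marker
FIELD_NAMES = ['title', 'salary', 'location', 'requirements', 'duties', 'enquiries']

def job_from_index(content_list, x):
    """Build one job dict from the six entries after the marker at index x."""
    return {name: content_list[x + offset]
            for offset, name in enumerate(FIELD_NAMES, start=1)}
```

With this helper, `final_jobs = [job_from_index(content_list, x) for x in index_list]` produces the same list of dictionaries.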
Step 4 – Save the data into the database
Adding the data to our Django database is an involved process. We need to:
- Create a Python Django model that will fit the data we want to upload
- Import the Django model in to the python file we are going to be using
- Run all Django-related functions that talk to our application inside the Django environment
- Collect the data from the PDF documents and parse it into the variables in the Job model. This involves data clean-up and PDF parsing.
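The "inside the Django environment" point deserves a quick illustration. A standalone script can enter the Django environment with the standard django.setup() boilerplate; note that 'jobsite.settings' below is a placeholder assumption, so substitute your own project's settings module.

```python
import os
import django

# Point Django at your project's settings module before importing any models.
# 'jobsite.settings' is a placeholder -- replace it with your own project's module.
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'jobsite.settings')
django.setup()

# Only after django.setup() is it safe to import and query the models:
# from jobs.models import Jobs
```

Alternatively, you can run the snippets in this step from the Django shell (python manage.py shell), which does this setup for you.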
#import models
from datetime import date
from jobs.models import *
from django.contrib.auth.models import User

the_user = User.objects.get(email='admin@skolo.online')
the_company = Company.objects.get(uniqueId='**********')
the_category = Category.objects.get(uniqueId='***********')

for test_job in final_jobs:
    newjob = Jobs.objects.create(
        title = test_job['title'],
        location = test_job['location'],
        salary = test_job['salary'],
        requirements = test_job['requirements'],
        duties = test_job['duties'],
        date_posted = date.today(),
        enquiries = test_job['enquiries'],
        company = the_company,
        category = the_category,
        owner = the_user,
    )
Extracting data from a company website's Careers Portal
The next example is scraping data from the Vodacom careers portal website. I am not going to share that code specifically – but you can follow our web scraping tutorial on our YouTube channel for more specifics about web scraping.
Once the data has been collected, we follow the same steps as above to add it to the database.
Dividing the data into specific job categories
When we collected the data, we introduced the ability to classify it into different job categories. Instead of creating the job category as a CharField on the Django model, we chose to create it as a separate model and use a ForeignKey to link it to the Job model.
The code below needs to be run inside the Django environment. Also, the job categories must already exist in the database before running this code.
from datetime import date
from jobs.models import *
from django.contrib.auth.models import User

the_user = User.objects.get(email='admin@skolo.online')
the_company = Company.objects.get(uniqueId='*********')

# Loop over the scraped job dictionaries, choosing a category from the title
for test_job in final_jobs:
    if 'Director' in test_job['title']:
        the_category = Category.objects.get(title='Director')
    elif 'Engineer' in test_job['title']:
        the_category = Category.objects.get(title='Engineer')
    elif 'Developer' in test_job['title']:
        the_category = Category.objects.get(title='Developer')
    elif 'Databases' in test_job['title']:
        the_category = Category.objects.get(title='Databases')
    elif 'Business Development' in test_job['title']:
        the_category = Category.objects.get(title='Business Development')
    elif 'Technology' in test_job['title']:
        the_category = Category.objects.get(title='Technology')
    elif 'Research' in test_job['title']:
        the_category = Category.objects.get(title='Research')
    elif 'Trainee' in test_job['title']:
        the_category = Category.objects.get(title='Trainee')
    elif 'Specialist' in test_job['title']:
        the_category = Category.objects.get(title='Specialist')
    elif 'Manager' in test_job['title']:
        the_category = Category.objects.get(title='Manager')
    else:
        the_category = Category.objects.get(title='Uncategorised')

    newjob = Jobs.objects.create(
        title = test_job['title'],
        location = test_job['location'],
        type = test_job['type'],
        contract_type = test_job['contract_type'],
        urlLink = test_job['urlLink'],
        date_posted = date.today(),
        description = test_job['description'],
        company = the_company,
        category = the_category,
        owner = the_user,
    )
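The long elif chain can also be driven by a list of keywords, which makes adding a new category a one-line change. Here is a sketch using the same keywords in the same order:

```python
# Keywords checked in order; the first match wins, mirroring the elif chain
CATEGORY_KEYWORDS = [
    'Director', 'Engineer', 'Developer', 'Databases',
    'Business Development', 'Technology', 'Research',
    'Trainee', 'Specialist', 'Manager',
]

def category_title_for(job_title):
    """Return the first matching category keyword, or 'Uncategorised'."""
    for keyword in CATEGORY_KEYWORDS:
        if keyword in job_title:
            return keyword
    return 'Uncategorised'
```

Inside the loop, the lookup then becomes `the_category = Category.objects.get(title=category_title_for(test_job['title']))`.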