Web Scraping (HTML parsing and JSON API) using Scrapy Python

A. M. Lani Anhar
Published in Analytics Vidhya · 11 min read · Mar 3, 2021

Introduction

Web scraping is a technique for extracting data from a website. Many tools can be used to scrape a website, and in this article I want to explain how we can extract data from a website using Scrapy in Python.
We will scrape data with Scrapy from https://www.jobstreet.vn/j?sp=search&q=C%C3%B4ng+ngh%E1%BB%87+th%C3%B4ng+tin&l.

We will collect the URL for each job title, such as Giang vien…, Nhan vien…, and many more. After that, we will extract the data from each of those pages.

Requirements

  1. You must understand the basics of Scrapy (https://docs.scrapy.org/en/latest/index.html).
  2. You must understand the Python programming language (especially OOP).
  3. Of course, you need a code editor and Python installed on your PC/laptop.
  4. The browser in this case is Google Chrome, so the options mentioned in this article are the ones available in Google Chrome.

What you will learn

  1. Web crawling with a Scrapy spider.
  2. Scraping with the HTML parsing method.
  3. Scraping with a JSON API.
  4. Debugging Scrapy in the terminal.

Project’s steps

Here are the project's steps:

  • Finish reading this article first, then do the practice.
  • Scrape the main page and get the URLs of all the job titles there.
  • Scrape each of those URL pages.
  • Scrape the texts on the pages that have the ads-post label.
  • Scrape the texts on the pages that have the non-ads-post label.

The job titles on the main page are divided into two categories: ads-post and non-ads-post. The ads posts are the job titles that have a sponsor and an ad sign on each of them.

That's the point! We can scrape the data from the non-ads posts using the HTML parsing method, but that doesn't apply to the ads posts: in this case, the data from the ads posts can only be obtained with the JSON API method.

In this case, I assume that you have already read and understood the Scrapy documentation linked above.

Preparing the environment

First, we have to create a new folder to work on the project in. For example, my new folder's name is myscrapproject.
Open the terminal and type:

mkdir myscrapproject

and then switch to the target directory:

cd myscrapproject

In this case, I assume that Python's virtual environment support is already installed on your PC. If it isn't, you can search for a tutorial on installing it. Then create a new virtual environment.

For example, I will name the new virtual environment scrapy_env. We can type:

python3 -m venv scrapy_env

And then we must activate the virtual environment:

source scrapy_env/bin/activate

We can now see that the virtual environment has been activated. Next, install Scrapy:

pip install scrapy

Let’s create a new scrapy project that represents our project. For example, our new scrapy project’s name is jobstreetvn.

scrapy startproject jobstreetvn

Then type:

cd jobstreetvn

Now that we are in the project directory, minimize the terminal and open VS Code or another code editor. Import the jobstreetvn folder and all the files in it into the code editor.
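
For reference, scrapy startproject generates a standard skeleton like this:

jobstreetvn/
    scrapy.cfg            # deploy configuration
    jobstreetvn/          # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # the folder where our spiders will live
            __init__.py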

Then we must create a new file with the .py extension in the spiders folder. For example, our new file is posts_spider.py.
Then write the code in it as below:

import scrapy  # import the scrapy module
import json  # import the json module

HTML parsing method

(For the detailed steps in this case, see the Getting the text from HTML section further down. Just scroll down.)
Since our chosen approach for getting the link of each job title is to get its URL, we must find the HTML code for each job title on the main page.

We can inspect an element by right-clicking on the main page and choosing Inspect (or pressing Ctrl+Shift+I) in Google Chrome. Now we can see the HTML code for all the data on the main page.

In this case, we must get the HTML code for each job title by picking the element from the page.
Let's click the element picker:

and then choose one of the job titles there (for this step, we must click a title with the non-ads-post label), then click as below:

On the inspect-element side, we will see the HTML code. After that, we must find the parent element of each job title there, and check the class that represents that parent element.

Now we have a class named job-item on an <a> element. In Scrapy selector syntax, we write it as a.job-item, which represents all of the job titles with the non-ads-post label.

Just as a reminder: for the detailed steps in this case, see the Getting the text from HTML section below.
So, the code is:
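
The code itself appeared as a screenshot in the original article. Below is a minimal sketch of what that first version of posts_spider.py plausibly looked like; the field names title and url are my own choice, and the selectors are the ones derived later in this article, so verify them in the scrapy shell first:

import scrapy


class PostsSpider(scrapy.Spider):
    name = "posts"
    start_urls = [
        "https://www.jobstreet.vn/j?sp=search&q=C%C3%B4ng+ngh%E1%BB%87+th%C3%B4ng+tin&l"
    ]

    def parse(self, response):
        # every job title with the non-ads-post label is an <a class="job-item"> element
        for post in response.css("a.job-item"):
            yield {
                "title": post.css("h3.job-title ::text").get(default="").strip(),
                "url": post.attrib.get("href"),
            }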

Now, open the terminal again (stay in the jobstreetvn directory). Let's check the result of our code by typing:

scrapy crawl posts -o mainpage.json

Now we can check the result in the code editor. Open the folder in the code editor and look for the file named mainpage.json.
Let's see! BOOM!

We can see 10 URLs and job titles in there. But wait. Why were only 10 URLs scraped, when 15 job titles are posted on the main page? Because only 10 of the job titles have the non-ads-post label; the other 5 are ads posts.

So, for now, the code in posts_spider.py can only get the URLs of job titles with the non-ads-post label. What about the others (the ads-post label)?

JSON API

In this case, we can get the URLs of job titles with the ads-post label by parsing a JSON API.

(For this section, I will not give a detailed description/explanation. If there are any questions, you can email me at ma.arryanda@gmail.com.)

Now, to get the URLs for the ads-post label on the main page, we need this URL: https://jupiter.jora.com/api/v1/jobs?keywords=C%C3%B4ng%20ngh%E1%BB%87%20th%C3%B4ng%20tin&page_num={page_number}&session_id=1f4498b9c6f2ebda3cd5dcdf8ef6b15f&search_id=3yAkpixVHSHokFUnNESz-1f4498b9c6f2ebda3cd5dcdf8ef6b15f-X86gxLy3TuLx42PSU59a&session_type=web&user_id=3yAkpixVHSHokFUnNESz&logged_user=false&mobile=false&site_id=1&country=VN&host=https://jupiter.jora.com&full_text_only_search=true&ads_per_page=5&callback=_jsonp_0

We can actually find this URL for the ads posts by inspecting elements (on the main page, right-click and choose Inspect). On the inspect-element side, choose the Network tab and reload/refresh the page. Then choose the API link (jobs?keywords=……..) as in the image below:

and copy the link address (right-click -> Copy link address) and paste it into a browser tab. Let's see, as in the image below:

The image above shows the dictionary of all the ads posts. If we check the URL, we can see the page_num parameter, which represents the website's page number, while the number of ads posts per page is five.
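
Note that the response is not plain JSON: it is wrapped in a JSONP callback, /**/_jsonp_0( … ). Before json.loads can read it, that wrapper has to be stripped. Here is a small sketch of the idea (the full spider in the Conclusion does the same thing inside parse_item_json):

import json

def jsonp_to_dict(text):
    # remove the leading "/**/_jsonp_0(" and the trailing ")" of the JSONP wrapper
    text = text.replace("/**/_jsonp_0(", "", 1).rstrip()
    if text.endswith(")"):
        text = text[:-1]
    return json.loads(text)  # now a normal dict with an 'ads' list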

Getting the text from HTML

In this section, we will learn how to get the texts from each scraped page; I will explain only one case. For example, let's try to find the CSS selector for the URLs on the main page. If we check the code we scraped with, we can see the URL's CSS selector is (".job-item ::attr(href)"). So how do we get it?

Let's go to https://www.jobstreet.vn/j?sp=search&q=C%C3%B4ng+ngh%E1%BB%87+th%C3%B4ng+tin&l. Then right-click and choose Inspect, and press Ctrl+Shift+C to select an element on the page.

After that, click on one of the job titles there. (Remember! This step is for the non-ads posts only; it is not applicable to the ads posts.)

Then click a job title that has no ad label. For example, let's click the job title named Nhan Vien Kien….. as in the image below:

Now we can see that the clicked job title has an HTML element, namely h3.job-title.heading-large.-no-margin-bottom. Then check the element sidebar and do the same as in the image below:

The image above shows how to get the job title's element by looking for the required parent element.

For example, Nhan Vien Kien….. has the element h3.job-title.heading-large.-no-margin-bottom, as in the image above. That means we have the element for one job title only, named Nhan Vien Kien….. . But wait, don't forget about the number of job titles with the non-ads-post label.

There are 10 of them on the main page. So we must find the parent element, since we already have the child element of a single job title. Now, let's check the image above.

As we know, the h3 element with the class job-title heading-large -no-margin-bottom is a child of the div element with the class job-item-top-container. Meanwhile, that div element represents only one job title and its information.

So, we must look for the element whose class name represents all the job titles with the non-ads-post label.

If we scroll up to the top of the picture, an element with a class named job-item appears. It seems we have found the parent element that represents all of the job titles we need.

And now we have the class name to use as the CSS selector for getting the URLs of all the job titles with the non-ads-post label (there are 10 URLs).

Since we know that we get the text of a selection by appending ::text to the CSS selector, we append ::attr(href) instead to get the URL.
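
For example, given a response for the main page (inside parse, or in the scrapy shell from the debugging section below):

# all relative URLs of the non-ads job titles (10 on the main page)
urls = response.css(".job-item ::attr(href)").getall()
# for comparison, ::text would extract the visible text instead of an attribute
titles = response.css("h3.job-title ::text").getall()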

Getting the URL for the next-page pagination HTML

This section explains how to get the next-page URL by looking for the required element's CSS selector. The first step is no different from the steps for getting the text in the previous section: we must look for an element that represents the next-page pagination on each page.

For example, let's try it on the first page only. The picture below shows the element for the next-page pagination:

and the next-page sign on the page looks like this (it is located at the bottom of the page):

According to the picture, the next-page pagination has an element with the class next-page-button. So, the code is:

next_page = response.css("a.next-page-button::attr(href)").get()
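
The href we get back is relative, so it still has to be joined with the page URL and fed back into parse, as the full spider in the Conclusion does:

if next_page is not None:
    next_page = response.urljoin(next_page)  # make the relative href absolute
    yield scrapy.Request(next_page, callback=self.parse)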

How to debug Scrapy in the terminal?

In addition, we need a technique for checking the CSS selector of each item we have to get. As an example, let's take one of the many cases here and debug it.

Now, as a trial, we will test the CSS selector of the next-page pagination that the previous section explained.

So, open the terminal as in the image below:

Then we must type:

scrapy shell "https://www.jobstreet.vn/j?sp=search&q=C%C3%B4ng+ngh%E1%BB%87+th%C3%B4ng+tin&l"

and if it succeeds, a shell prompt will appear as in the picture below:

then we must type the CSS selector code like below:

response.css("a.next-page-button::attr(href)").get()

and if it succeeds, we will see a result like below:

And congrats!!! We've got the trick. We must do the same trick for every variable or item needed in this project.
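
We can stay in the same shell session and test any other selector in this project the same way, for example the main-page selector from earlier:

response.css(".job-item ::attr(href)").getall()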

Conclusion

Since we have gotten all the URLs on the first page, we can also get all the URLs from all the pages. So, the code for scraping all of the pages is below:

import scrapy
import json


class PostsSpider(scrapy.Spider):
    name = "posts"
    start_urls = [
        "https://www.jobstreet.vn/j?sp=search&q=C%C3%B4ng+ngh%E1%BB%87+th%C3%B4ng+tin&l"
    ]

    # NON ADS
    def parse_item(self, response):
        item = {}
        company_name1 = response.css("#company-location-container > span.company::text").get()
        company_name2 = response.xpath("//*[@id='job-description-container']/div/div/p[17]/b/text()").get()
        company_name3 = response.css("#job-description-container > div > div > strong ::text").get()
        if company_name1:
            # no ads, company name at the top
            item["type"] = "no ads"
            item["jobtitle"] = response.css("h3.job-title.heading-xxlarge ::text").get()
            item["company_name"] = company_name1
            item["location"] = response.css("#company-location-container > span.location ::text").get()
            item["site"] = response.css("#job-meta > span.site ::text").get()
            item["desc"] = ''.join(response.css("#job-description-container ::text").getall())
        elif company_name2:
            # no ads, company name at the bottom
            item["type"] = "no ads, company name at the bottom side"
            item["jobtitle"] = response.css("h3.job-title.heading-xxlarge ::text").get()
            item["company_name"] = company_name2
            item["location"] = response.css("div #company-location-container > span.location ::text").get()
            item["site"] = response.css("div #job-meta > span.site ::text").get()
            item["desc"] = ''.join(response.css("#job-description-container ::text").getall())
        else:
            # no ads, no description
            item["type"] = "no ads, no desc"
            item["jobtitle"] = response.css("h3.job-title.heading-xxlarge ::text").get()
            item["company_name"] = company_name3
            item["location"] = response.css("#company-location-container > span.location ::text").get()
            item["site"] = response.css("#job-meta > span.site ::text").get()
            item["desc"] = "no desc"
        return item

    # ADS
    def parse_item_ads(self, response):
        item = {}
        company_ads = response.css(".job-title::text").get()
        if company_ads:
            item["type"] = "ads"
            item["jobtitle"] = response.css(".job-title::text").get()
            item["company_name"] = company_ads
            item["location"] = response.css(".location::text").get()
            item["site"] = response.css(".site::text").get()
            item["desc"] = ''.join(response.css("#job-description-container ::text").getall())
        return item

    def parse_item_json(self, response):
        # strip the JSONP wrapper "/**/_jsonp_0(...)" so json.loads can parse the body
        text_clean = response.text.replace("/**/_jsonp_0(", "", 1).rstrip()
        if text_clean.endswith(")"):
            text_clean = text_clean[:-1]
        result_json = json.loads(text_clean)
        for data in result_json['ads']:
            yield scrapy.Request(url=data['url'], callback=self.parse_item_ads)

    def parse(self, response):
        # each results page has 15 job titles: 10 non ads + 5 ads
        page_number = response.meta.get("page_number", 1)
        # non ads: follow the href of every <a class="job-item"> element
        for post in response.css('a.job-item'):
            url = post.attrib.get("href")
            if url is not None:
                yield scrapy.Request(url=response.urljoin(url), callback=self.parse_item)
        # ads: query the JSON API once per results page (5 ads per page)
        linkads = f"https://jupiter.jora.com/api/v1/jobs?keywords=C%C3%B4ng%20ngh%E1%BB%87%20th%C3%B4ng%20tin&page_num={page_number}&session_id=1f4498b9c6f2ebda3cd5dcdf8ef6b15f&search_id=3yAkpixVHSHokFUnNESz-1f4498b9c6f2ebda3cd5dcdf8ef6b15f-X86gxLy3TuLx42PSU59a&session_type=web&user_id=3yAkpixVHSHokFUnNESz&logged_user=false&mobile=false&site_id=1&country=VN&host=https://jupiter.jora.com&full_text_only_search=true&ads_per_page=5&callback=_jsonp_0"
        yield scrapy.Request(url=linkads, callback=self.parse_item_json)
        # pagination: follow the next-page button and carry the page number along
        next_page = response.css("a.next-page-button::attr(href)").get()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse,
                                 meta={"page_number": page_number + 1})

Don't forget the steps for getting the CSS selector of each item in there: inspect the element, check the HTML code for the texts we need, and then debug the selectors in the terminal.
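
As before, run the spider from the jobstreetvn directory and export the items to a file; the output name here, alljobs.json, is just an example:

scrapy crawl posts -o alljobs.json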

(For this section, I will not give a detailed description/explanation. If there are any questions, you can email me at ma.arryanda@gmail.com.)

Happy Coding!
