Best practices and respectful ways to collect data from the web (Python, Beautiful Soup 4, and Selenium)

In the world of machine learning, one of the roadblocks is getting the actual data. Whatever we want to do, from exploratory analysis to predicting the value of a business, we need data. The world wide web is one of the authentic data sources available to anyone who wants to learn machine learning (and not only machine learning, but also named entity recognition or text mining).

There are plenty of resources on the web that suggest best practices for web scraping. I will reference a few of the articles I followed and share my understanding.

Robots.txt

RESPECT ROBOTS.TXT

The robots.txt file tells search engines which URLs they may crawl. We can find this file at www.example.com/robots.txt (for instance, www.google.com/robots.txt). The "Allow" and "Disallow" directives tell us which URLs should or should not be crawled. Another important directive is "Crawl-delay": it asks crawlers to wait a certain number of seconds between requests so that the server does not slow down for regular users.
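
Python ships with urllib.robotparser, which can read these rules before we crawl anything. Below is a minimal sketch against the Google robots.txt mentioned above; the URLs being checked are only examples.

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt and download it.
rp = RobotFileParser("https://www.google.com/robots.txt")
rp.read()

# Ask whether a generic crawler ("*") may fetch a given URL.
print(rp.can_fetch("*", "https://www.google.com/"))
print(rp.can_fetch("*", "https://www.google.com/search"))

# Crawl-delay for our user agent; returns None if the site does not define one.
print(rp.crawl_delay("*"))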

time.sleep

If we want to navigate multiple URLs on the same server, adding a random time.sleep between requests is always a better choice. It may increase the time needed to extract the information, but it is good practice because it puts a much lighter load on the web server.

import time
from random import randint

time.sleep(randint(30, 60))
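
For example, when visiting several pages on the same host, the pause goes between consecutive requests. The URL list below is purely hypothetical.

import time
from random import randint
from urllib.request import urlopen

# Hypothetical pages on the same server.
urls = [
    "https://www.example.com/page-1",
    "https://www.example.com/page-2",
    "https://www.example.com/page-3",
]

for url in urls:
    html = urlopen(url).read()  # fetch the page, then parse it as needed
    # Wait 30-60 seconds so consecutive requests don't hammer the server.
    time.sleep(randint(30, 60))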

Fake-useragent

A User-Agent request header should be provided by the scraper while extracting information. fake-useragent is a simple user-agent database that is easy to use (https://github.com/fake-useragent/fake-useragent). The code below loads the IMDb homepage and checks whether the parsed response is an instance of bs4.BeautifulSoup.

from urllib.request import Request, urlopen
from selenium import webdriver

from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from fake_useragent import UserAgent

from bs4 import BeautifulSoup


def get_driver():
    # Pick a random User-Agent string from the fake-useragent database.
    ua = UserAgent()
    user_agent = ua.random

    options = Options()
    # Headless mode (replaces the deprecated options.headless attribute).
    options.add_argument("--headless=new")
    options.add_argument("--disable-infobars")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-gpu")
    options.add_argument("--window-size=1920,1200")
    # Send the randomized User-Agent with every request the browser makes.
    options.add_argument(f"user-agent={user_agent}")

    # webdriver-manager downloads a matching chromedriver automatically.
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()), options=options
    )
    return driver


def test_drive():
    driver = get_driver()
    # Plain urllib check that the site responds, sending a browser-like User-Agent.
    req = Request("https://www.imdb.com/", headers={"User-Agent": UserAgent().random})
    status_code = urlopen(req).getcode()
    assert status_code == 200

    driver.get("https://www.imdb.com/")
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()
    assert isinstance(soup, BeautifulSoup)
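
As a quick sanity check in the same spirit (a sketch that does not depend on IMDb's exact markup), the parsed page source can also be queried for something concrete, such as the page title:

def test_page_title():
    driver = get_driver()
    driver.get("https://www.imdb.com/")
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()
    # Any rendered HTML page should expose a <title>; its exact text may vary.
    assert soup.title is not None
    assert soup.title.get_text().strip() != ""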

Use of pytest

pytest can be effective for writing a good crawler. Instead of running trial and error against the live site to see whether we are receiving the correct content, we can store the HTML in a text file and write test cases against it. The snippet below extracts the text of a span nested inside a div with the "test-div" class.

def read_html():
    # Parse the saved page so the tests run without a live network connection.
    with open("html_content.html", "r") as f:
        soup = BeautifulSoup(f, "html.parser")
        return soup


def extract_name_inner_element():
    soup = read_html()
    # Text of the <span> nested inside <div class="test-div">.
    name = (
        soup.find("div", class_="test-div")
        .find("span")
        .get_text()
        .strip()
    )
    return str(name)
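
The saved file can be produced once from the Selenium driver, and a pytest case can then assert on the extracted value. The expected string below is a placeholder for whatever text actually sits inside that span in html_content.html.

def save_html(driver, path="html_content.html"):
    # Write the rendered page to disk once; the tests then run offline.
    with open(path, "w") as f:
        f.write(driver.page_source)


def test_extract_name_inner_element():
    name = extract_name_inner_element()
    # "Expected Name" is a placeholder; replace it with the real span text.
    assert name == "Expected Name"

Running pytest from the project directory will pick up any function whose name starts with test_.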

Reference

  1. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
  2. https://selenium-python.readthedocs.io/getting-started.html
  3. https://docs.pytest.org/en/7.2.x/

Update – 17 November 2023

I received a message from one of the mavens, who advised me to use the following link to locate a website's robots.txt file.

https://www.websiteplanet.com/webtools/robots-txt/