In the world of machine learning, one of the roadblocks is getting the actual data. Whatever we want to do, from exploratory analysis to predicting business value, we need data. The web is one of the most accessible data sources for anyone who wants to learn machine learning (and not only machine learning, but also named entity recognition or text mining).
There are a lot of resources on the web that suggest best practices for web scraping. I will reference a few of the articles I followed and share my understanding.
Robots.txt
RESPECT ROBOTS.TXT
The robots.txt file is how a site tells search engines which URLs are accessible. We can find this file at www.example.com/robots.txt (for example, www.google.com/robots.txt). The "Allow" and "Disallow" directives tell us which URLs should or should not be crawled. Another important directive is "Crawl-Delay", which tells a crawler to wait a certain amount of time between requests so that the server doesn't slow down for regular users.
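As a rough sketch of how to honor these rules programmatically, Python's standard library ships urllib.robotparser; the site URL and paths below are placeholders, not a specific recommendation.
from urllib.robotparser import RobotFileParser

# Placeholder site used only for illustration.
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()  # download and parse robots.txt

# Check whether a generic crawler ("*") may fetch a given URL.
if parser.can_fetch("*", "https://www.example.com/some/page"):
    print("Allowed to crawl this URL")

# crawl_delay() returns the Crawl-Delay value for the agent, or None if not set.
print("Suggested crawl delay:", parser.crawl_delay("*"))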
time.sleep
If we want to navigate multiple URLs on the same server, using time.sleep with a random delay is always a better choice. It may increase the time needed to extract the information, but it is a good practice that puts much less load on the web server.
import time
from random import randint

# Pause for a random interval of 30 to 60 seconds between requests.
time.sleep(randint(30, 60))
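A minimal sketch of how this might look when walking through several pages; the list of URLs is a made-up placeholder and the parsing step is omitted.
from urllib.request import urlopen
import time
from random import randint

# Placeholder URLs on the same server, for illustration only.
urls = [
    "https://www.example.com/page/1",
    "https://www.example.com/page/2",
]

for url in urls:
    html = urlopen(url).read()
    # ... parse html here ...
    time.sleep(randint(30, 60))  # random pause so the server isn't hammered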
Fake-useragent
A User-Agent request header should be provided by the scraper while extracting information. fake-useragent is a simple user-agent database that is easy to use (https://github.com/fake-useragent/fake-useragent). The code below loads the IMDb homepage and checks whether the parsed response is an instance of bs4.BeautifulSoup.
from urllib.request import Request, urlopen
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from fake_useragent import UserAgent
from bs4 import BeautifulSoup


def get_driver():
    # Pick a random User-Agent string from the fake-useragent database.
    ua = UserAgent()
    user_agent = ua.random

    options = Options()
    options.add_argument("--headless")
    options.add_argument("--disable-infobars")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-gpu")
    options.add_argument("--window-size=1920,1200")
    options.add_argument(f"user-agent={user_agent}")

    # webdriver-manager downloads a matching chromedriver automatically.
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()), options=options
    )
    return driver


def test_drive():
    driver = get_driver()

    # A plain request first, to confirm the page responds successfully.
    req = Request("https://www.imdb.com/")
    status_code = urlopen(req).getcode()
    assert status_code == 200

    # Load the same page through Selenium and parse it with BeautifulSoup.
    driver.get("https://www.imdb.com/")
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()

    assert isinstance(soup, BeautifulSoup)
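The same idea applies without a browser: when fetching pages directly with urllib, the random user agent can be attached as a request header. A minimal sketch, where the target URL is only an example:
from urllib.request import Request, urlopen
from fake_useragent import UserAgent

ua = UserAgent()

# Attach a random User-Agent header to a plain urllib request.
req = Request(
    "https://www.example.com/",  # placeholder URL
    headers={"User-Agent": ua.random},
)
with urlopen(req) as resp:
    html = resp.read()
print(len(html))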
Use of pytest
pytest can be effective for writing a good crawler. Instead of doing trial and error against the live site to understand whether we are receiving the correct content, we can store the HTML in a local file and write some test cases against it. The snippet below finds the text of a span nested under a div with the "test-div" class.
from bs4 import BeautifulSoup


def read_html():
    # Parse a locally saved copy of the page instead of hitting the live site.
    with open("html_content.html", "r") as f:
        soup = BeautifulSoup(f, "html.parser")
    return soup


def extract_name_inner_element():
    soup = read_html()
    # Locate the div with class "test-div" and read the text of its span.
    name = (
        soup.find("div", class_="test-div")
        .find("span")
        .get_text()
        .strip()
    )
    return str(name)
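A pytest test case around this could look like the following sketch. The sample HTML and expected text are assumptions for illustration, and it assumes read_html and extract_name_inner_element live in the same test module; the fixture writes a tiny HTML file and changes into a temporary directory before each test so read_html() finds it.
import pytest


@pytest.fixture(autouse=True)
def sample_html(tmp_path, monkeypatch):
    # Write a tiny HTML sample and run the test from that directory.
    html = '<div class="test-div"><span> Example Name </span></div>'
    (tmp_path / "html_content.html").write_text(html)
    monkeypatch.chdir(tmp_path)


def test_extract_name_inner_element():
    assert extract_name_inner_element() == "Example Name"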
Reference
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- https://selenium-python.readthedocs.io/getting-started.html
- https://docs.pytest.org/en/7.2.x/
Update – 17 November 2023
I received a message from one of the mavens who advised me to use the following link to locate the robots.txt file.