Web Scraping

What is Web Scraping

Web Scraping Tools

Python libraries for web scraping include requests (to fetch web pages), BeautifulSoup (to parse HTML), Selenium (to automate a browser), Scrapy, Caqui, and others.

Selenium

Selenium automates a real browser, letting you perform web browsing tasks as a human would, such as clicking links and performing searches.

https://pypi.org/project/selenium/

 

BeautifulSoup

BeautifulSoup is a Python library for pulling data out of HTML and XML documents.
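A minimal sketch of pulling data out of HTML with BeautifulSoup; the HTML snippet, tag names, and class names below are made up for illustration:

```python
from bs4 import BeautifulSoup

# A small, hard-coded HTML document; in practice you would fetch one
# with requests.get(url).text
html = """
<html>
  <body>
    <h1>Example Page</h1>
    <ul>
      <li class="item">alpha</li>
      <li class="item">beta</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h1").get_text()  # text of the first <h1>
items = [li.get_text() for li in soup.find_all("li", class_="item")]
print(heading)  # Example Page
print(items)    # ['alpha', 'beta']
```

The same `find`/`find_all` calls work on any parsed page; CSS-style selection is also available via `soup.select()`.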

Proxy Rotating

A proxy server acts as an intermediary (gateway) between you and the internet, often also performing the function of a firewall and filter.  Using a proxy lets you make requests as if from a specific geographical region or device, so you can see the content a website serves for that location or device.  Some sites limit your activity by tracking your IP address; by rotating your IP address through a pool of proxies, you can avoid this limitation.

How To Use A Proxy With Python Requests

How To Rotate Proxies and change IP Addresses using Python 3

https://free-proxy-list.net/
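The rotation idea above can be sketched as a round-robin pool of proxies; the proxy addresses below are placeholders, and the actual `requests` call is shown only in a comment so the sketch stays self-contained:

```python
import itertools

# Placeholder proxy addresses -- in practice, pull these from a source
# such as https://free-proxy-list.net/
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

# itertools.cycle yields the proxies in round-robin order, so each
# request can go out through a different IP address
proxy_pool = itertools.cycle(PROXIES)

def next_proxy_config():
    """Return a requests-style proxies dict using the next proxy in the pool."""
    proxy = next(proxy_pool)
    return {"http": proxy, "https": proxy}

# Usage with requests (not executed here):
#   import requests
#   resp = requests.get("https://example.com",
#                       proxies=next_proxy_config(), timeout=10)

first = next_proxy_config()
second = next_proxy_config()
print(first["http"], second["http"])
```

Each call to `next_proxy_config()` advances the pool, so successive requests originate from different IPs.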

 

Overcome Cloudflare Blocking


from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Raw string avoids backslash-escape problems in the Windows path
ser = Service(r"C:\users\denni\documents\Python Scripts\ucc\chromedriver.exe")
options = webdriver.ChromeOptions()
# Hide the "Chrome is being controlled by automated software" banner
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# Remove the navigator.webdriver automation flag that sites check for
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(service=ser, options=options)

 

Pausing


from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 30 seconds for the "Next 10 Records" button to appear
element = WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.XPATH, '//input[@value="Next 10 Records"]')))

 

Newspaper3k

The newspaper3k library provides an easy way to scrape and extract content from news articles.
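The newspaper3k workflow is download, parse, then read attributes; a minimal sketch follows, with the URL a placeholder and the fetch wrapped in a function so nothing is downloaded at import time:

```python
def extract_article(url):
    """Download and parse a news article, returning its key fields."""
    from newspaper import Article  # newspaper3k package

    article = Article(url)
    article.download()   # fetch the raw HTML
    article.parse()      # extract title, authors, body text, publish date
    return {
        "title": article.title,
        "authors": article.authors,
        "text": article.text,
        "published": article.publish_date,
    }

# Usage (placeholder URL, not fetched here):
#   info = extract_article("https://example.com/some-news-story")
#   print(info["title"])
```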

 

Related Links

Web Scraping in Python: Avoid Detection Like a Ninja

Stream Your Data Using Nothing But Python’s Requests Library

(using Python to scrape a webpage)

Web scraping in 2023 — Breaking it down to basics

Web Scraping With Python: Beginner to Advanced

 

Build a Data Catalog with Cloud Pickle

Free and open-source code that builds a fully functional data catalog.  It has four main data structures: Datasets, Dataset, Table, and Column.  A data catalog is a list of my datasets and a description of what’s in them.

The code uses cloudpickle.  `cloudpickle` makes it possible to serialize Python constructs not supported by the default `pickle` module from the Python standard library.  Pickle in Python is primarily used for serializing and deserializing a Python object structure.  In other words, it converts a Python object into a byte stream so it can be stored in a file or database, used to maintain program state across sessions, or transported over a network.
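The pickle round trip described above looks like this; the lambda at the end shows the kind of construct plain pickle rejects but cloudpickle can handle (the cloudpickle call is left as a comment in case the package is absent):

```python
import pickle

# Serialize a Python object to a byte stream...
catalog_entry = {"name": "sales_2023", "columns": ["date", "region", "amount"]}
blob = pickle.dumps(catalog_entry)

# ...and deserialize it back into an equivalent object
restored = pickle.loads(blob)
print(restored == catalog_entry)  # True

# Plain pickle cannot serialize constructs such as lambdas:
try:
    pickle.dumps(lambda x: x + 1)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_pickled = False
print(lambda_pickled)  # False

# cloudpickle handles them:
#   import cloudpickle
#   blob = cloudpickle.dumps(lambda x: x + 1)
```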

Build a data catalog in 383 lines of Python

GitHub source