Scrape a Website With This Beautiful Soup Python Tutorial

https://ift.tt/38hRUx0

Beautiful Soup is an open-source Python library that parses HTML and XML documents and lets you navigate and search the resulting tree to extract content. Scraped data is useful for many analytical purposes, and if you’re new to Python and web scraping, Beautiful Soup is a good library to start with.

With Python’s open-source Beautiful Soup library, you can get data by scraping any part or element of a webpage with maximum control over the process. In this article, we look at how you can use Beautiful Soup to scrape a website.

How to Install Beautiful Soup and Get Started With It

This tutorial uses Python 3 and beautifulsoup4, the latest version of Beautiful Soup. Create a Python virtual environment first to isolate your project and its packages from the ones installed globally on your machine.

To get started, you must install the Beautiful Soup library in your virtual environment. Beautiful Soup is available as a PyPI package for all operating systems, so you can install it with the pip install beautifulsoup4 command via the terminal.

If you’re on Debian or Ubuntu, the above command still works, but you can also install the library with the system package manager by running apt-get install python3-bs4.

Beautiful Soup doesn’t fetch URLs itself. It only works with HTML or XML content that you already have, so you can’t pass a URL straight into it. To solve that problem, you need to download the content of the target webpage with the requests library before feeding it to Beautiful Soup.

To make that library available for your scraper, run the pip install requests command via the terminal.

To parse XML, Beautiful Soup relies on the lxml library, which you can install by running pip install lxml.
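Before moving on, it can help to confirm that the packages are available in the active environment. The sketch below is one way to check, using only the standard library. The helper name is_installed is made up for this illustration.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if the module can be imported in this environment."""
    return find_spec(module_name) is not None


# bs4 is the import name of the beautifulsoup4 package
for name in ("bs4", "requests", "lxml"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If any line reports missing, rerun the matching pip install command inside the virtual environment.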

Inspect the Webpage You Wish to Scrape

Before scraping any website you’re not familiar with, a best practice is to inspect its elements. You can do this by opening your browser’s developer tools. If you’re using Google Chrome, Chrome DevTools makes this pretty easy.

Inspecting a webpage tells you more about its HTML tags, attributes, classes, and ids, and exposes the core elements of the page and its content types.

It also helps you work out the best strategy for getting the exact data you want from a website.

How to Scrape a Website’s Data With Beautiful Soup

Now that you have everything ready, open your preferred code editor and create a new Python file with a name of your choice. You can also use a web-based tool like Jupyter Notebook if you’re not familiar with running Python via the command line.

Next, import the necessary libraries:

from bs4 import BeautifulSoup
import requests

First off, let’s see how the requests library works:

from bs4 import BeautifulSoup
import requests
website = requests.get('http://somewebpages.com')
print(website)

When you run the code above, it prints <Response [200]>, indicating that your request was successful. Otherwise, you get a status code in the 400 or 500 range, which indicates a failed GET request.

Remember to always replace the website’s URL in the parentheses with your target URL.
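In a real scraper, you usually want to act on the status code rather than just print the response. The following is a minimal sketch, assuming a hypothetical helper named fetch_html and an arbitrary 10-second timeout. It returns None instead of raising an exception when the request fails.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # raise_for_status() raises an HTTPError for 4xx and 5xx responses
        response.raise_for_status()
    except requests.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text


# Example call (replace the placeholder with your target URL):
# html = fetch_html('http://somewebpages.com')
```

Catching requests.RequestException covers connection errors, timeouts, malformed URLs, and the HTTPError raised for bad status codes, so the caller only has to check for None.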

Once you get the website with the get request, you pass its content to Beautiful Soup, which parses it as HTML or XML using Python’s built-in HTML parser or the lxml XML parser, depending on your chosen format.

Take a look at this next code snippet to see how to do this with the HTML parser:

from bs4 import BeautifulSoup
import requests
website = requests.get('http://somewebpages.com')
soup = BeautifulSoup(website.content, 'html.parser')
print(soup)

The code above prints the entire parsed HTML of the webpage, including its content.
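You can also try the parser without making a network request. The sketch below feeds Beautiful Soup a small, made-up HTML string in place of website.content, which makes it easier to see what the parsed object gives you.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document standing in for website.content
html = """
<html>
  <head><title>Shirt Shop</title></head>
  <body>
    <h2>Blue shirt</h2>
    <p class="price">$20</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.title)         # the whole <title> element
print(soup.title.string)  # only the text inside it
```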

You can also get a neatly indented version of the DOM by using the prettify method. You can try this out to see its output:

from bs4 import BeautifulSoup
import requests
website = requests.get('http://somewebpages.com/')
soup = BeautifulSoup(website.content, 'html.parser')
print(soup.prettify())

You can also get just the text of a webpage, without its tags, using the .text attribute:

from bs4 import BeautifulSoup
import requests
website = requests.get('http://somewebpages.com/')
soup = BeautifulSoup(website.content, 'html.parser')
print(soup.text)
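The .text attribute joins every string exactly as it appears, which can leave stray whitespace or run words together. As a hedged alternative, the get_text method accepts a separator and a strip flag. The HTML string below is made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<div><h2> Shirts </h2><p>Blue shirt</p><p>Red shirt</p></div>"
soup = BeautifulSoup(html, "html.parser")

# .text concatenates every string in the document as-is
raw_text = soup.text

# get_text() can strip each string and join them with a separator
clean_text = soup.get_text(separator=" | ", strip=True)

print(repr(raw_text))
print(clean_text)
```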

How to Scrape the Content of a Webpage by the Tag Name

You can also scrape the content in a particular tag with Beautiful Soup. To do this, you reference the name of the target tag on the soup object.

For example, let’s see how you can get the content in the h2 tags of a webpage.

from bs4 import BeautifulSoup
import requests
website = requests.get('http://somewebpages.com/')
soup = BeautifulSoup(website.content, 'html.parser')
print(soup.h2)

In the code snippet above, soup.h2 returns the first h2 element of the webpage and ignores the rest. To load all the h2 elements, you can use the find_all method and a Python for loop:

from bs4 import BeautifulSoup
import requests
website = requests.get('http://somewebpages.com/')
soup = BeautifulSoup(website.content, 'html.parser')
h2tags = soup.find_all('h2')
for h2tag in h2tags:
    print(h2tag)

That block of code returns all h2 elements and their content. However, you can get the content without the surrounding tag by using the .string attribute:

from bs4 import BeautifulSoup
import requests
website = requests.get('http://somewebpages.com/')
soup = BeautifulSoup(website.content, 'html.parser')
h2tags = soup.find_all('h2')
for h2tag in h2tags:
    print(h2tag.string)

You can use this method for any HTML tag. All you need to do is replace the h2 tag with the one you like.
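One caveat worth knowing: .string only returns text when a tag has a single child. If the tag contains nested tags, it returns None, and get_text is the safer choice. The sketch below shows the difference on a made-up HTML string.

```python
from bs4 import BeautifulSoup

html = "<h2>Plain heading</h2><h2>Heading with <em>nested</em> tag</h2>"
soup = BeautifulSoup(html, "html.parser")

for heading in soup.find_all("h2"):
    # .string is None when a tag holds more than one child node
    print(repr(heading.string), "|", heading.get_text())

strings = [heading.string for heading in soup.find_all("h2")]
texts = [heading.get_text() for heading in soup.find_all("h2")]
```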

You can also scrape several tags at once by passing a list of tag names to the find_all method. For instance, the block of code below scrapes the content of a, h2, and title tags:

from bs4 import BeautifulSoup
import requests
website = requests.get('http://somewebpages.com/')
soup = BeautifulSoup(website.content, 'html.parser')
tags = soup.find_all(['a', 'h2', 'title'])
for tag in tags:
    print(tag.string)
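When you scrape several tag types in one call, the results come back mixed together in document order. Each result keeps its tag name in the .name attribute, so you can tell the matches apart. Here is a short sketch using a made-up HTML string:

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>Shop</title></head>
<body><h2>Shirts</h2><a href="/blue">Blue shirt</a></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Pair each match's tag name with its stripped text
found = [(tag.name, tag.get_text(strip=True))
         for tag in soup.find_all(["a", "h2", "title"])]

for name, text in found:
    print(f"{name}: {text}")
```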

How to Scrape a Webpage Using the ID and Class Name

Inspecting a website with DevTools tells you more about the id and class attributes holding each element in its DOM. Once you have that information, you can scrape the webpage by targeting those attributes. This is useful when the content of a target component is generated dynamically from a database.

You can use the find method for the id and class scrapers. Unlike the find_all method, which returns a list of every match, the find method returns only the first matching element (or None if nothing matches). Since an id should be unique on a page, you don’t need a for loop with it.

Let’s look at an example of how you can scrape the content of an element using its id:

from bs4 import BeautifulSoup
import requests
website = requests.get('http://somewebpages.com/')
soup = BeautifulSoup(website.content, 'html.parser')
# Use a name other than id so Python's built-in id() function isn't shadowed
element = soup.find(id = 'enter the target id here')
print(element.text)
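Because find returns None when no element matches, reading .text on the result raises an AttributeError if the id isn’t on the page. The sketch below guards against that. The helper name text_by_id and the sample HTML are assumptions made for this example.

```python
from bs4 import BeautifulSoup

html = '<div id="price">$20</div>'
soup = BeautifulSoup(html, "html.parser")


def text_by_id(soup, element_id):
    """Return the text of the element with this id, or None if absent."""
    element = soup.find(id=element_id)
    return element.text if element is not None else None


print(text_by_id(soup, "price"))     # id exists on the page
print(text_by_id(soup, "discount"))  # id does not exist
```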

To do this for a class name, replace id with class. However, class is a reserved keyword in Python, so writing it directly causes a syntax error. To avoid that, add an underscore after class, like this: class_.

In essence, the line containing the id becomes:

my_classes = soup.find(class_ = 'enter the target class name here')
print(my_classes.text)

You can also scrape a webpage by combining a particular tag name with its corresponding id or class:

data = soup.find_all('div', class_ = 'enter the target class name here')
print(data)
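To see how the tag and class filter work together, here is a small sketch on made-up HTML. It also shows the select method, which takes a CSS selector and is an alternative way to express the same query.

```python
from bs4 import BeautifulSoup

html = """
<div class="price sale">$15</div>
<div class="price">$20</div>
<span class="price">$25</span>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches any div whose class list contains "price"
by_keyword = soup.find_all("div", class_="price")

# The CSS selector div.price targets the same elements
by_selector = soup.select("div.price")

print([div.text for div in by_keyword])
print([div.text for div in by_selector])
```

Note that the span is skipped because the tag name doesn’t match, while the div with two classes is still included.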

How to Make a Reusable Scraper With Beautiful Soup

You can put all the previous code together into a function inside a class to make a reusable scraper that gets the content of two sets of tags. The function accepts five arguments: a URL, two tag names, and a value identifying each tag. Note that find_all treats a plain string passed as its second argument as a class name, so the id1 and id2 parameters in the code below match class attributes.

Assume you want to scrape the price of shirts from an e-commerce website. The example scraper class below extracts the price and shirt-name elements and then prints them as a pandas DataFrame with Price and Shirt_name as the column names.

Ensure that you pip install pandas via the terminal if you’ve not done so already.

import pandas as pd
import requests
from bs4 import BeautifulSoup

class scrapeit:
    # A static method lets you call scrapeit.scrape() without creating an instance
    @staticmethod
    def scrape(website=None, tag1=None, id1=None, tag2=None, id2=None):
        # Proceed only if every argument was supplied
        if all([website, tag1, id1, tag2, id2]):
            try:
                page = requests.get(website)
                soup = BeautifulSoup(page.content, 'html.parser')
                # A string in the second position is matched against the class attribute
                infotag1 = soup.find_all(tag1, id1)
                infotag2 = soup.find_all(tag2, id2)
                priced = [prices.text for prices in infotag1]
                shirt = [shirts.text for shirts in infotag2]
                data = {
                    'Price': priced,
                    'Shirt_name': shirt}
                info = pd.DataFrame(data, columns=['Price', 'Shirt_name'])
                print(info)
            except Exception as error:
                print('Not successful:', error)
        else:
            print('Oops! Please enter a website, two tags and their corresponding ids')

The scraper you just made is a reusable module that you can import and use in another Python file. To call the scrape function from its class, use scrapeit.scrape('Website URL', 'price_tag', 'price_id', 'shirt_tag', 'shirt_id'). If you don’t provide the URL and the other parameters, the else statement prompts you to do so.

To use that scraper in another Python file, you can import it like this:

from scraper_module import scrapeit
scrapeit.scrape('URL', 'price_tag', 'price_id', 'shirt_tag', 'shirt_id')

Note: scraper_module is the name of the Python file holding the scraper class.
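One possible refinement is to separate the parsing from the download, so you can test the extraction logic on a saved or made-up HTML string. The sketch below is an illustration under that assumption, not part of the scraper above. The function name extract_rows, its parameters, and the sample markup are all hypothetical.

```python
from bs4 import BeautifulSoup


def extract_rows(html, price_tag, price_class, shirt_tag, shirt_class):
    """Pair each price with a shirt name, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    prices = [tag.get_text(strip=True)
              for tag in soup.find_all(price_tag, class_=price_class)]
    shirts = [tag.get_text(strip=True)
              for tag in soup.find_all(shirt_tag, class_=shirt_class)]
    # zip() stops at the shorter list, so unequal lists don't raise an error
    return [{"Price": price, "Shirt_name": shirt}
            for price, shirt in zip(prices, shirts)]


# Made-up markup standing in for a product listing page
sample = """
<div><h3 class="name">Blue shirt</h3><span class="cost">$20</span></div>
<div><h3 class="name">Red shirt</h3><span class="cost">$25</span></div>
"""

rows = extract_rows(sample, "span", "cost", "h3", "name")
print(rows)
```

Passing rows to pd.DataFrame(rows) gives a two-column table like the one the class prints. Keep in mind that zip pairs items purely by position, so if a product is missing its price, later rows can be mispaired without any warning.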

You can also check the Beautiful Soup documentation if you want to dive deeper into how you can make the best use of it.

Beautiful Soup Is a Valuable Web Scraping Tool

Beautiful Soup is a powerful Python screen scraper that gives you control over how your data comes through during scraping. It’s a valuable business tool, as it can give you access to competitors’ web data like pricing, market trends, and more.

Although we’ve made a tag scraper in this article, you can still play around with this powerful Python library to make more useful scraping tools.


via MakeUseOf.com https://ift.tt/1AUAxdL

January 6, 2021 at 01:05PM