When we browse a webpage, we often see blue, underlined text. This is called anchor text, because clicking it takes you to another webpage.
The anchor tags, or the <a> tags of HTML, are used to create anchor texts, and the URL of the webpage that is to be opened is specified in the href attribute.
Refer to the below image to understand it better.
In almost all web scraping projects, fetching the URLs from the href attribute is a common task.
In today’s article, let’s learn different ways of fetching the URL from the href attribute using Beautiful Soup.
To fetch the URL, we first have to find all the anchor tags (the tags that carry an href attribute) on the webpage, and then read the value of each href attribute.
Two ways to find all the anchor tags or href entries on the webpage are:
soup.find_all()
SoupStrainer class
Once all the href entries are found, we fetch the values using one of the following methods:
tag['href']
tag.get('href')
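The two steps above can be sketched in a few lines. This is a minimal illustration, assuming a made-up HTML string (the example.com URLs are placeholders, not part of the original article) instead of a live webpage.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
html = """
<p>Read the <a href="https://example.com/docs">docs</a> or
visit the <a href="https://example.com/blog">blog</a>.</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: find every <a> tag. Step 2: read each tag's href attribute.
urls = [tag["href"] for tag in soup.find_all("a")]
print(urls)
```

The sections below look at each step in more detail.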
Prerequisite: Install and Import requests and BeautifulSoup
Throughout the article, we will use the requests module to access the webpage and BeautifulSoup for parsing and pulling the data from the HTML file.
To install requests on your system, open your terminal window and enter the below command:
pip install requests
To install Beautiful Soup in your system, open your terminal window and enter the below command:
pip install bs4
To import both libraries into your program, use the following statements:
import requests
from bs4 import BeautifulSoup
Find the href entries from a webpage
The href entries are always present within the anchor tag (<a> tag). So, the first task is to find all the <a> tags within the webpage.
Using soup.find_all()
Here, soup is the BeautifulSoup object that represents the parsed document. The method soup.find_all() returns all the tags and strings that match the given criteria.
Let’s say we want to find all the <a> tags in a document. We can do as shown below.
import requests
from bs4 import BeautifulSoup
url = "https://www.wikipedia.org/"
# retrieve the data from URL
response = requests.get(url)
# parse the contents of the webpage
soup = BeautifulSoup(response.text, 'html.parser')
# filter all the <a> tags from the parsed document
for tag in soup.find_all('a'):
    print(tag)
Output:
<a class="link-box" data-slogan="The Free Encyclopedia" href="//en.wikipedia.org/" id="js-link-box-en" title="English — Wikipedia — The Free Encyclopedia">
<strong>English</strong>
<small><bdi dir="ltr">6 383 000+</bdi> <span>articles</span></small>
</a>
. . .
<a href="https://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-ShareAlike License</a>
<a href="https://meta.wikimedia.org/wiki/Terms_of_use">Terms of Use</a>
<a href="https://meta.wikimedia.org/wiki/Privacy_policy">Privacy Policy</a>
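To see the shape of what find_all() returns without hitting the network, here is a small sketch on a made-up HTML list (the paths /one, /two, /three are illustrative only). It also shows the optional limit argument, which stops the search after a given number of matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
html = """
<ul>
  <li><a href="/one">One</a></li>
  <li><a href="/two">Two</a></li>
  <li><a href="/three">Three</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")           # every <a> tag in the document
first_two = soup.find_all("a", limit=2)  # stop after the first two matches

print(len(all_links))
print([tag.get_text() for tag in first_two])
```

The result behaves like a list, so it can be indexed, sliced, and iterated over.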
Using SoupStrainer class
We can also use the SoupStrainer class. To use it, we have to first import it into the program using the below command.
from bs4 import SoupStrainer
Now, you can opt to parse only the required tags using the SoupStrainer class, as shown below.
import requests
from bs4 import BeautifulSoup, SoupStrainer
url = "https://www.wikipedia.org/"
# retrieve the data from URL
response = requests.get(url)
# parse-only the <a> tags from the webpage
soup = BeautifulSoup(response.text, 'html.parser', parse_only=SoupStrainer("a"))
for tag in soup:
    print(tag)
Output:
<a class="link-box" data-slogan="The Free Encyclopedia" href="//en.wikipedia.org/" id="js-link-box-en" title="English — Wikipedia — The Free Encyclopedia">
<strong>English</strong>
<small><bdi dir="ltr">6 383 000+</bdi> <span>articles</span></small>
</a>
. . .
<a href="https://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-ShareAlike License</a>
<a href="https://meta.wikimedia.org/wiki/Terms_of_use">Terms of Use</a>
<a href="https://meta.wikimedia.org/wiki/Privacy_policy">Privacy Policy</a>
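The difference from find_all() is that SoupStrainer filters while parsing, so the other tags never make it into the tree. The following sketch tries to make that visible on a made-up HTML string (the headings, paragraphs, and paths are assumptions for illustration).

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up HTML used only for illustration
html = """
<h1>Title</h1>
<p>Intro with a <a href="/first">first link</a>.</p>
<p>More text and a <a href="/second">second link</a>.</p>
"""

# Keep only <a> tags while parsing
only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html, "html.parser", parse_only=only_a_tags)

links = soup.find_all("a")
print([tag["href"] for tag in links])
print(soup.find("p"))  # the <p> tags were never added to the tree
```

This can save time and memory on large pages when only the links are of interest.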
Fetch the value of href attribute
Once we have fetched the required tags, we can retrieve the value of the href attribute.
All the attributes and their values are stored in the form of a dictionary. Refer to the below:
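from bs4 import BeautifulSoup
sample_string = """<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>"""
soup = BeautifulSoup(sample_string, 'html.parser')
# take the first (and only) <a> tag in the document
atag = soup.find_all('a')[0]
print(atag)
# attrs holds every attribute of the tag as key-value pairs
print(atag.attrs)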
Output:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
{'href': 'http://example.com/elsie', 'class': ['sister'], 'id': 'link1'}
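Note that class appears as a list in the output above while href and id are plain strings. The sketch below, which adds a hypothetical second class name ("active") to the same sample tag, illustrates why: class may hold several values, so Beautiful Soup stores it as a list by default.

```python
from bs4 import BeautifulSoup

# Same sample tag as above, with a hypothetical extra class "active"
html = '<a href="http://example.com/elsie" class="sister active" id="link1">Elsie</a>'
atag = BeautifulSoup(html, "html.parser").a

print(isinstance(atag.attrs, dict))  # attrs behaves like a dictionary
print(atag.attrs["class"])           # multi-valued attribute: a list of strings
print(atag.attrs["id"])              # single-valued attribute: a plain string
```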
Using tag['href']
As seen in the output, the attributes and their values are stored in the form of a dictionary.
To access the value of the href attribute, index the tag with the attribute name, just as you would with a dictionary:
tag_name['href']
Now, let’s tweak the above program to print the href values.
sample_string="""<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>"""
soup= BeautifulSoup(sample_string,'html.parser')
atag=soup.find_all('a')[0]
print(atag['href'])
Output:
http://example.com/elsie
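One caveat: because the tag is indexed like a dictionary, an <a> tag without an href attribute raises a KeyError. The sketch below shows one way to handle that, assuming a made-up snippet in which the first anchor has no href.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the first <a> tag has no href attribute
html = '<a id="top">Back to top</a> <a href="/home">Home</a>'
soup = BeautifulSoup(html, "html.parser")

found = []
for tag in soup.find_all("a"):
    try:
        found.append(tag["href"])
    except KeyError:
        # This <a> tag has no href attribute
        found.append(None)

print(found)
```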
Using tag.get('href')
Alternatively, we can use the tag's get() method, which works like the get() method of a dictionary: it returns the value of 'href', or None if the attribute is missing.
sample_string = """<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>"""
soup = BeautifulSoup(sample_string,'html.parser')
atag = soup.find_all('a')[0]
print(atag.get('href'))
Output:
http://example.com/elsie
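The practical difference from tag['href'] shows up when the attribute is absent. A minimal sketch, assuming a made-up anchor without an href and an arbitrary fallback value "#":

```python
from bs4 import BeautifulSoup

# Made-up HTML: an <a> tag without an href attribute
html = '<a id="top">Back to top</a>'
atag = BeautifulSoup(html, "html.parser").a

missing = atag.get("href")        # no exception, just None
fallback = atag.get("href", "#")  # a default value can be supplied instead

print(missing)
print(fallback)
```

This makes get() convenient inside loops over many tags, where some anchors may not link anywhere.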
Real-World Examples
Now that we know how to fetch the value of the href attribute, let’s look at some real-world use cases.
Example 1: Fetch all the URLs from the webpage.
Let’s scrape the Wikipedia main page to find all the href entries.
from bs4 import BeautifulSoup
import requests
url = "https://www.wikipedia.org/"
# retrieve the data from URL
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # match every tag that has an href attribute
    for tag in soup.find_all(href=True):
        print(tag['href'])
Output:
//cu.wikipedia.org/
//ss.wikipedia.org/
//din.wikipedia.org/
//chr.wikipedia.org/
. . . .
//www.wikisource.org/
//species.wikimedia.org/
//meta.wikimedia.org/
https://creativecommons.org/licenses/by-sa/3.0/
https://meta.wikimedia.org/wiki/Terms_of_use
https://meta.wikimedia.org/wiki/Privacy_policy
As you can see, all the href entries are printed.
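Many of these entries start with "//" and are not complete URLs on their own. If absolute URLs are needed, the standard library's urljoin() can resolve each href against the page URL. The sketch below uses a made-up HTML string rather than the live page, and the /portal/about path is purely hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://www.wikipedia.org/"

# Made-up HTML used only for illustration
html = """
<a href="//en.wikipedia.org/">English</a>
<a href="/portal/about">About</a>
<a href="https://meta.wikimedia.org/wiki/Terms_of_use">Terms of Use</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Resolve protocol-relative and relative hrefs against the page URL
absolute_urls = [urljoin(base_url, tag["href"]) for tag in soup.find_all(href=True)]

for link in absolute_urls:
    print(link)
```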
Example 2: Fetch all URLs based on some condition
Let’s say we need to find only the outbound links. From the output above, we can see that most of the inbound links are protocol-relative: they start with "//" rather than "https://".
Thus, we can use the regular expression "^https://" to match only the URLs that start with "https://", as shown below.
In addition, we skip any URL that contains 'wikipedia', so that links pointing back to Wikipedia itself are excluded from the result.
from bs4 import BeautifulSoup
import requests
import re
url = "https://www.wikipedia.org/"
# retrieve the data from URL
response = requests.get(url)
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # match only the href values that start with "https://"
    for tag in soup.find_all(href=re.compile("^https://")):
        if 'wikipedia' in tag['href']:
            continue
        else:
            print(tag['href'])
Output:
https://meta.wikimedia.org/wiki/Special:MyLanguage/List_of_Wikipedias
https://donate.wikimedia.org/?utm_medium=portal&utm_campaign=portalFooter&utm_source=portalFooter
. . .
https://meta.wikimedia.org/wiki/Terms_of_use
https://meta.wikimedia.org/wiki/Privacy_policy
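The substring test above looks at the whole URL, so it would also drop an external link that merely mentions "wikipedia" in its path. A possible refinement, sketched here on made-up links (the example.org URL is hypothetical), is to check only the host name with urlparse().

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Made-up HTML used only for illustration
html = """
<a href="https://en.wikipedia.org/wiki/Main_Page">Main page</a>
<a href="https://meta.wikimedia.org/wiki/Terms_of_use">Terms of Use</a>
<a href="https://example.org/articles/wikipedia-history">History</a>
"""
soup = BeautifulSoup(html, "html.parser")

outbound = []
for tag in soup.find_all("a", href=True):
    href = tag["href"]
    host = urlparse(href).hostname or ""
    # Keep https links whose host name does not contain "wikipedia"
    if href.startswith("https://") and "wikipedia" not in host:
        outbound.append(href)

print(outbound)
```

Here the example.org link is kept even though its path contains "wikipedia", while the en.wikipedia.org link is still filtered out.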
Example 3: Fetch the URLs based on the value of different attributes
Consider a file as shown below:
Let’s say we need to fetch the URL from the <a> tag that has class="sister" and id="link2". We can do that by specifying the conditions as shown below.
from bs4 import BeautifulSoup
# open the html file
with open("sample.html") as f:
    # parse the contents of the html file
    soup = BeautifulSoup(f, 'html.parser')
# find the tags with matching criteria
for tag in soup.find_all('a', {'href': True, 'class': 'sister', 'id': 'link2'}):
    print(tag['href'])
Output:
http://example.com/lacie
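The same filter can also be written as a CSS selector with soup.select(). The sketch below is self-contained: instead of reading sample.html, it uses an inline HTML string that is only an assumption about what such a file might contain (three links with class "sister" and ids link1 to link3).

```python
from bs4 import BeautifulSoup

# Assumed contents of sample.html, inlined so the example runs on its own
html = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# <a> tags with class "sister", id "link2", and an href attribute
matches = [tag["href"] for tag in soup.select("a.sister#link2[href]")]
print(matches)
```

Whether to use find_all() with a dictionary or a CSS selector is largely a matter of taste; selectors tend to be more compact when several conditions are combined.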
Conclusion
That brings us to the end of this tutorial. In this short tutorial, we have learned how to fetch the value of the href attribute within the HTML <a> tag. We hope this article has been informative. Thanks for reading.
Finxter