
Getting Started with Beautiful Soup

Nithya Rekha


Before diving into Beautiful Soup, we should know what is meant by scraping.
In today's world, extracting data from websites plays a major role in generating revenue. This extraction of useful information from a website is known as scraping, and Beautiful Soup makes it much more efficient.
Beautiful Soup is a Python library that helps pull data out of HTML and XML files. It works with a parser to build a tree structure that makes navigating the document and extracting information easy, saving programmers a great deal of time.
This blog helps you understand how to use Beautiful Soup.
Let's start with the installation:
pip install beautifulsoup4  # For Windows (or any system with pip)
$ apt-get install python3-bs4  # For Debian/Ubuntu-based Linux
After installing Beautiful Soup, a third-party parser should also be installed; one popular choice is the lxml parser:
$ apt-get install python-lxml

$ easy_install lxml

$ pip install lxml
Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands:
$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib
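Whichever parser you install, you select it by passing its name as the second argument to the BeautifulSoup constructor. Here is a small sketch; the markup string is just a made-up snippet for illustration:
from bs4 import BeautifulSoup

markup = "<p class='title'>Hello, parsers</p>"  # hypothetical snippet for illustration

# Built-in parser: no extra install needed
soup_std = BeautifulSoup(markup, 'html.parser')

# lxml: very fast and lenient (requires the lxml package above)
soup_lxml = BeautifulSoup(markup, 'lxml')

# html5lib: slower, but parses exactly like a web browser
soup_html5 = BeautifulSoup(markup, 'html5lib')

print(soup_std.p.string)  # Hello, parsers
If you leave the parser name out, Beautiful Soup picks the best parser it can find and prints a warning, so naming one explicitly keeps the behaviour predictable across machines.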
Here’s an HTML document I’ll be using as an example throughout this post. It’s part of a story from Alice in Wonderland:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
Running the “three sisters” document through Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
Here are some simple ways to navigate that data structure:
soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
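Beyond these basics, find_all() and find() also accept attribute filters, which is handy when a page has many kinds of links. A quick sketch on the same soup:
soup.find_all('a', class_='sister')  # filter by CSS class (note the trailing underscore)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find('a', id='link1').get('href')  # pull out a single attribute value
# 'http://example.com/elsie'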
One common task is extracting all the URLs found within a page’s <a> tags:
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
Another common task is extracting all the text from a page:
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
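In a real scraping job the HTML usually comes from the web rather than from a string in your script. Below is a minimal sketch using only the standard library; the URL is a placeholder, so swap in the page you actually want to scrape (and check its terms of use first):
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://example.com/"  # placeholder URL for illustration
html = urlopen(url).read()   # fetch the raw HTML bytes

soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)          # the page title
for link in soup.find_all('a'):   # every hyperlink on the page
    print(link.get('href'))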
