Python - Reading HTML Pages

Raj Mehta

2 years ago

Table of Contents
  • Introduction
  • Install Beautifulsoup
  • Reading the HTML file
  • Extracting Tag Value
  • Extracting All Tags

Introduction

In this article, we will learn how to work with HTML pages in Python.
Beautiful Soup is a third-party Python library (imported as bs4) for parsing HTML. With it, we can search for HTML tags and extract specific data, such as the title of a page or the list of headers on it.
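
As a quick illustration of the kind of lookup described above, here is a minimal sketch that parses a small inline HTML string instead of a live page. The sample markup and the variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Main Heading</h1>
    <h2>First Section</h2>
    <h2>Second Section</h2>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Text of the <title> tag
page_title = soup.title.string

# Text of every <h1> and <h2> tag, in document order
headers = [tag.get_text() for tag in soup.find_all(["h1", "h2"])]

print(page_title)
print(headers)
```

Working on an inline string like this is a convenient way to try out lookups without depending on a network connection.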

Install Beautifulsoup

Use the Anaconda package manager to install the package and its dependencies.
conda install beautifulsoup4
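
After installing, it can be useful to confirm that the package is importable from the environment you are running. The sketch below is one possible check; the helper name is_installed is hypothetical and not part of Beautiful Soup.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


# Beautiful Soup is imported under the module name "bs4"
if is_installed("bs4"):
    import bs4
    print("Beautiful Soup version:", bs4.__version__)
else:
    print("bs4 was not found; try: conda install beautifulsoup4")
```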

Reading the HTML file

In the example below, we request a URL and read the response into Python. We then pass the HTML to Beautiful Soup with the 'html.parser' argument to parse the whole page. Finally, we print the first few characters of the formatted HTML.
import urllib.request
from bs4 import BeautifulSoup

# Request the page and read the raw HTML
request_url = urllib.request.urlopen('https://insideaiml.com/')
html_doc = request_url.read()

# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')

# Format the parsed html file
strhtm = soup.prettify()

# Print the first few characters
print(strhtm[:225])
When we execute the above code, it produces the following result.
<!DOCTYPE html>
<html lang="en">
 <head>
  <!-- Google Tag Manager -->
  <!-- <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
    new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
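
A real request can fail, for example when the site is unreachable or returns an error status. The following is a minimal sketch of one way to wrap the fetch-and-parse steps above in a function. The function name load_soup and the 10-second default timeout are assumptions made for this example, not part of the original code.

```python
import urllib.request
from urllib.error import URLError

from bs4 import BeautifulSoup


def load_soup(url, timeout=10):
    """Fetch url and return the parsed page, or None if the request fails."""
    try:
        # The with-block closes the connection once the body has been read
        with urllib.request.urlopen(url, timeout=timeout) as response:
            html_doc = response.read()
    except URLError as err:
        # URLError also covers HTTPError (for example a 404 response)
        print(f"Could not fetch {url}: {err.reason}")
        return None
    return BeautifulSoup(html_doc, "html.parser")


# Example call (needs network access, so it is left commented out):
# soup = load_soup("https://insideaiml.com/")
```

Returning None on failure keeps the calling code simple, but the caller then has to check the result before using it.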

 
  
  
  

Extracting Tag Value

We can extract the value of the first instance of a tag using the following code. Here, soup.title returns the first <title> tag, and soup.title.string returns the text inside it.
import urllib.request
from bs4 import BeautifulSoup

# Request the page and read the raw HTML
request_url = urllib.request.urlopen('https://insideaiml.com/')
html_doc = request_url.read()

# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the whole <title> tag, then only its text
print(soup.title)
print(soup.title.string)
When we execute the above code, it produces the following result.
<title>InsideAIML</title>
InsideAIML
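
The difference between a tag and its text, and what happens when a tag is missing, can be seen in a small offline sketch. The sample markup below is hypothetical and stands in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html_doc = (
    "<html><head><title>Sample Page</title></head>"
    "<body><p>First</p><p>Second</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access returns the first matching tag, markup included
title_tag = soup.title
print(title_tag)          # <title>Sample Page</title>
print(title_tag.string)   # Sample Page

# Only the first <p> is returned, even though the page has two
first_paragraph = soup.p
print(first_paragraph.string)   # First

# A tag that does not exist comes back as None, so check before using it
missing = soup.h1
heading_text = missing.string if missing is not None else "(no h1 found)"
print(heading_text)       # (no h1 found)
```

Checking for None first avoids an AttributeError when a page does not contain the tag you expected.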

Extracting All Tags

We can extract the values from all instances of a tag with find_all(). The following code prints the text of every <b> tag on the page.
import urllib.request
from bs4 import BeautifulSoup

# Request the page and read the raw HTML
request_url = urllib.request.urlopen('https://insideaiml.com/python/python_overview')
html_doc = request_url.read()

# Parse the html file
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the text of every <b> tag
for x in soup.find_all('b'):
    print(x.string)
When we execute the above code, it produces the following result.

Python is Interpreted
Python is Interactive
Python is Object-Oriented
Python is a Beginner's Language
Easy-to-learn
Easy-to-read
Easy-to-maintain
A broad standard library
Interactive Mode
Portable
Extendable
Databases
GUI Programming
Scalable
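
One detail worth knowing when looping over find_all() results: .string is None when a tag contains more than one child, such as text plus a nested tag. The sketch below, on hypothetical sample markup, compares .string with get_text(), which joins all the text inside a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
sample_html = """
<ul>
  <li><b>Portable</b></li>
  <li><b>Easy-to-<i>learn</i></b></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

bold_tags = soup.find_all("b")

# .string works for the first tag but is None for the second,
# because that <b> holds both text and a nested <i> tag
strings = [tag.string for tag in bold_tags]

# get_text() joins all nested text, so both tags yield a value
texts = [tag.get_text() for tag in bold_tags]

print(strings)
print(texts)
```

If a page mixes plain and nested tags, get_text() is usually the safer choice for printing.
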
I hope you enjoyed this article and now know how to read HTML pages in Python.

For more blogs and courses on data science, machine learning, artificial intelligence, and emerging technologies, visit us at InsideAIML.
Thanks for reading…
Happy Learning…
