By: Noushid Khan

APR 8,2020

Use of BeautifulSoup in Python

Technical

In Python, BeautifulSoup is a library for parsing HTML and XML documents. It lets you pull elements out of the markup by tag name, or by attributes such as id and class. Each match is returned as an object on which several operations can be performed.

* To parse a document, pass it to the BeautifulSoup constructor either as an open file or as a string.

#Code

    from bs4 import BeautifulSoup

    # Parse from an open file
    with open("index.html") as fp:
        soup = BeautifulSoup(fp, "html.parser")
    # Parse from a string
    soup = BeautifulSoup("<html>data</html>", "html.parser")
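The two ways of parsing can be compared side by side. The sketch below is only an illustration: a temporary file stands in for a real index.html so that it runs anywhere, and the variable names are made up for this example.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = "<html><p>data</p></html>"

# Parse markup held in a string; "html.parser" is Python's built-in parser.
soup_from_string = BeautifulSoup(markup, "html.parser")

# Parse the same markup from a file. The temporary file is a stand-in
# for a real HTML file on disk (an assumption made for this sketch).
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as tmp:
    tmp.write(markup)
    path = tmp.name

with open(path) as fp:
    soup_from_file = BeautifulSoup(fp, "html.parser")
os.remove(path)

print(soup_from_string.p)  # <p>data</p>
print(soup_from_file.p)    # <p>data</p>
```

Both calls produce the same tree, so every operation described below works the same way whichever source the document came from.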

Operations:

* To find a tag in the HTML or XML, call the find method with the tag name.

#Code

from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
print(soup.find('tag_to_find'))

#Example

    html_cont = """<div>
    <p>HTML FILE</p>
    <span>Image</span>
    <p>END</p>
    </div>"""
    soup = BeautifulSoup(html_cont, 'html.parser')
    print(soup.find('p'))

#Output

	<p>HTML FILE</p>

* The find method returns only the first matching tag in the HTML code.

* If you want to find all the matching tags, call the find_all method.

* It returns a list of the matching tags.

#Example

    html_cont = """<div>
    <p>HTML FILE</p>
    <span>Image</span>
    <p>END</p>
    </div>"""
    soup = BeautifulSoup(html_cont, 'html.parser')
    print(soup.find_all('p'))

#Output

	 [<p>HTML FILE</p>, <p>END</p>]
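Because find_all returns a list, the result can be looped over like any other list. The following is a small sketch of that idea, not part of the original example. It also shows the limit argument and what both methods return when nothing matches.

```python
from bs4 import BeautifulSoup

html_cont = """<div>
<p>HTML FILE</p>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, "html.parser")

# Loop over every matching tag and collect its text.
texts = [p.get_text() for p in soup.find_all("p")]
print(texts)  # ['HTML FILE', 'END']

# limit stops the search after the given number of matches.
first_only = soup.find_all("p", limit=1)
print(first_only)  # [<p>HTML FILE</p>]

# When nothing matches, find returns None and find_all returns an empty list.
missing_one = soup.find("table")
missing_all = soup.find_all("table")
print(missing_one)       # None
print(len(missing_all))  # 0
```

Checking for None before using the result of find avoids an AttributeError when the tag is not present.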

* A tag may have any number of attributes. We can access them by treating the tag as a dictionary.

#Code

from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
tag = soup.find('tag_to_find')
print(tag['id'])

#Example

    html_cont = """<div>
    <p id="boldest">HTML FILE</p>
    <span>Image</span>
    <p>END</p>
    </div>"""
    soup = BeautifulSoup(html_cont, 'html.parser')
    tag = soup.find('p')
    print(tag['id'])

#Output

	boldest
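Dictionary-style access raises a KeyError when the attribute is missing, just as a real dictionary would. The sketch below shows one way to handle that with the get and has_attr methods. The fallback value "no-id" is only a placeholder chosen for this illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="boldest">HTML FILE</p><p>END</p>', "html.parser")
first, second = soup.find_all("p")

print(first["id"])  # boldest

# get returns None (or a chosen default) instead of raising an error.
print(second.get("id"))           # None
print(second.get("id", "no-id"))  # no-id

# has_attr checks for an attribute without reading its value.
print(second.has_attr("id"))  # False

# Square-bracket access on a missing attribute raises KeyError.
try:
    second["id"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```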

* To get all the attributes of a tag at once, use its attrs property, which is a dictionary.

#Code

from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
tag = soup.find('tag_to_find')
print(tag.attrs)

#Example

    html_cont = """<div>
    <p id="boldest" class="bold-class">HTML FILE</p>
    <span>Image</span>
    <p>END</p>
    </div>"""
    soup = BeautifulSoup(html_cont, 'html.parser')
    tag = soup.find('p')
    print(tag.attrs)

#Output

    {'id': 'boldest', 'class': ['bold-class']}
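The class attribute behaves a little differently from id: an HTML element can carry several classes, so BeautifulSoup gives its value as a list even when there is only one. The short sketch below illustrates this. The second class name, highlighted, is made up for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p id="boldest" class="bold-class highlighted">HTML FILE</p>',
    "html.parser",
)
tag = soup.find("p")

# id holds a single string value.
print(tag["id"])  # boldest

# class holds a list with one entry per class name.
classes = list(tag["class"])
print(classes)  # ['bold-class', 'highlighted']

# Join the list to get back the original attribute text.
print(" ".join(classes))  # bold-class highlighted
```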

* To add a new attribute to a tag, assign to it as you would to a dictionary key.

#Code

from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
tag = soup.find('tag_to_find')
tag['your_attribute'] = 'attribute_value'

#Example

    html_cont = """<div>
    <p id="boldest" class="bold-class">HTML FILE</p>
    <span>Image</span>
    <p>END</p>
    </div>"""
    soup = BeautifulSoup(html_cont, 'html.parser')
    tag = soup.find('p')
    tag['new_attribute'] = '1'
    print(tag)

#Output

    <p class="bold-class" id="boldest" new_attribute="1">HTML FILE</p>

* To remove an attribute from a tag, delete it with del, again treating the tag as a dictionary.

#Code

from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
tag = soup.find('tag_to_find')
del tag['your_attribute']

#Example

    html_cont = """<div>
    <p id="boldest" class="bold-class">HTML FILE</p>
    <span>Image</span>
    <p>END</p>
    </div>"""
    soup = BeautifulSoup(html_cont, 'html.parser')
    tag = soup.find('p')
    del tag['id']
    print(tag)

#Output

    <p class="bold-class">HTML FILE</p>
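Adding and deleting attributes changes the tag inside the parsed document, not a copy of it. The sketch below is one way to see this. The attribute name data-state is hypothetical and used only for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p id="boldest">HTML FILE</p></div>', "html.parser")
tag = soup.find("p")

# Add a new attribute (hypothetical name) and confirm it is there.
tag["data-state"] = "draft"
added = tag.has_attr("data-state")
print(added)  # True

# The change is visible when the tag is looked up again from the soup.
print(soup.find("p")["data-state"])  # draft

# Remove the attributes again, one at a time.
del tag["data-state"]
del tag["id"]
print(tag.attrs)  # {}
print(soup)       # <div><p>HTML FILE</p></div>
```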

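* Tags can be found not only by tag name but also by id and class. Because class is a reserved word in Python, the keyword argument is spelled class_. When several filters are passed to one find call, a tag must match all of them.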

#Code

from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
print(soup.find(id='id_to_find', class_='class_to_find'))

#Example

    html_cont = """<div>
    <p class="p_class">HTML FILE</p>
    <span id="img">Image</span>
    <p>END</p>
    </div>"""
    soup = BeautifulSoup(html_cont, 'html.parser')
    print(soup.find(id='img'))
    print(soup.find(class_='p_class'))

#Output

    <span id="img">Image</span>
    <p class="p_class">HTML FILE</p>
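The same lookups can also be written with CSS selectors through the select and select_one methods. This is an alternative to the find calls shown above, sketched here for comparison. It assumes the soupsieve package is available, which is normally installed together with BeautifulSoup.

```python
from bs4 import BeautifulSoup

html_cont = """<div>
<p class="p_class">HTML FILE</p>
<span id="img">Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, "html.parser")

# Combine a tag name with a class filter.
by_name_and_class = soup.find("p", class_="p_class")
print(by_name_and_class)  # <p class="p_class">HTML FILE</p>

# select_one returns the first match for a CSS selector.
by_id = soup.select_one("#img")
print(by_id)  # <span id="img">Image</span>

# select returns a list of every match.
by_class = soup.select("p.p_class")
print(len(by_class))  # 1
```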

* Sometimes we need all the text in a document without the markup. BeautifulSoup makes this easy with the get_text method, which returns the text of every tag joined into one string.

#Code

from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code,  'html.parser')
print(soup.get_text())

#Example

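    html_cont = """<div>
    <p>HTML FILE</p>
    <p></p>
    <span>Image</span>
    <p>END</p>
    </div>"""
    soup = BeautifulSoup(html_cont, 'html.parser')
    print(soup.get_text())

#Output

    HTML FILE

    Image
    END

* The output also begins and ends with an empty line, because the line breaks between the tags are part of the text.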
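When the extra whitespace is not wanted, get_text accepts a separator and a strip flag. The sketch below shows one possible way to get a compact result. The choice of a single space as separator is just an example.

```python
from bs4 import BeautifulSoup

html_cont = """<div>
<p>HTML FILE</p>
<p></p>
<span>Image</span>
<p>END</p>
</div>"""
soup = BeautifulSoup(html_cont, "html.parser")

# strip=True trims each piece of text and drops the empty ones;
# separator is placed between the pieces that remain.
compact = soup.get_text(separator=" ", strip=True)
print(compact)  # HTML FILE Image END

# stripped_strings yields the same pieces one by one.
pieces = list(soup.stripped_strings)
print(pieces)  # ['HTML FILE', 'Image', 'END']
```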

* To change the name of a tag in the HTML, assign a new value to its name attribute.

#Code

from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
tag = soup.p
tag.name = 'h1'

#Example

    html_cont = """<div>
    <p>HTML FILE</p>
    <p></p>
    <span>Image</span>
    <p>END</p>
    </div>"""
    soup = BeautifulSoup(html_cont, 'html.parser')
    tag = soup.p
    tag.name = 'h1'
    print(tag)

#Output

	<h1>HTML FILE</h1>
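Renaming a tag changes the parsed document itself, so later searches see the new name. A minimal sketch of that effect, using a shortened version of the markup above:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>HTML FILE</p><p>END</p></div>", "html.parser")

# soup.p is shorthand for the first <p> tag in the document.
tag = soup.p
tag.name = "h1"

# The whole document now contains the renamed tag.
print(soup)  # <div><h1>HTML FILE</h1><p>END</p></div>

# Only one <p> is left, and the <h1> can be found by its new name.
remaining = soup.find_all("p")
print(len(remaining))          # 1
print(soup.find("h1").string)  # HTML FILE
```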

* To get the text inside a tag, use its string attribute.

#Code

from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_code, 'html.parser')
print(soup.find('p').string)

#Example

    html_cont = """<div>
    <p>HTML FILE</p>
    <p></p>
    <span>Image</span>
    <p>END</p>
    </div>"""
    soup = BeautifulSoup(html_cont, 'html.parser')
    print(soup.find('p').string)

#Output

	HTML FILE
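The string attribute only has a value when the tag contains exactly one piece of content. For an empty tag, or a tag that mixes text with other tags, it is None. The sketch below illustrates the difference with markup invented for this example, and shows get_text as a fallback.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><p>HTML FILE</p><p>END <b>now</b></p><p></p></div>",
    "html.parser",
)
first, second, third = soup.find_all("p")

# One text child: string returns it.
print(first.string)  # HTML FILE

# Text plus a nested <b> tag: string is None.
print(second.string)  # None

# No children at all: string is None as well.
print(third.string)  # None

# get_text always returns a str, joining all nested text.
print(second.get_text())  # END now
```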

* A tag's string cannot be edited in place, but it can be replaced with another string by calling replace_with on the string.

#Code

from bs4 import BeautifulSoup
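soup = BeautifulSoup(your_html_code, 'html.parser')
# Call replace_with on the string, not on the tag, so that the tag itself is kept
soup.find('p').string.replace_with('new_string')
print(soup.find('p'))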

#Example

    html_cont = """<div>
    <p>HTML FILE</p>
    <p></p>
    <span>Image</span>
    <p>END</p>
    </div>"""
    soup = BeautifulSoup(html_cont, 'html.parser')
    soup.find('p').string.replace_with('edited file')
    print(soup.find('p'))
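It matters whether replace_with is called on the string or on the tag. The sketch below, using shortened markup, compares the two cases and also shows direct assignment to string as another way to change the text. The replacement values are placeholders.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>HTML FILE</p><p>END</p></div>", "html.parser")
first, second = soup.find_all("p")

# Replace only the text inside the first <p>; the tag is kept.
first.string.replace_with("edited file")
print(first)  # <p>edited file</p>

# Assigning to string replaces the tag's contents in the same way.
first.string = "edited again"
print(first)  # <p>edited again</p>

# Calling replace_with on the tag swaps out the whole element.
second.replace_with("plain text")
print(soup)  # <div><p>edited again</p>plain text</div>
```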