Web Scraping and Removing Tags with Python

Overview

Python has a rich set of packages for web crawling, so scraping is easy to get started with. Let's read a web page and strip out its HTML tags.

Example

Code

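import re

import requests
from bs4 import BeautifulSoup

# Fetch the Korean Wikipedia article on Oh My Girl.
rq = requests.get("https://ko.wikipedia.org/wiki/%EC%98%A4%EB%A7%88%EC%9D%B4%EA%B1%B8")
rqctnt = rq.content
soup = BeautifulSoup(rqctnt, "html.parser")

# Collect every <p> element and convert the result list to a single string.
OMG = str(soup.find_all("p"))

# Strip anything of the form <...>; the non-greedy .+? stops at the first '>'.
OMG = re.sub('<.+?>', '', OMG).strip()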

Result

20180521_143907.png

  • As an example, let's read the Oh My Girl article on Wikipedia. As the code shows, the required packages are requests and bs4.

20180521_143926.png

  • If you simply fetch the page and print it, HTML tags are scattered throughout the output, as shown above. To remove them, use a regular expression as in the example code, which requires the re package.
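To see what the regular expression does without downloading anything, here is a minimal sketch that applies the same pattern to a short, made-up HTML string. The sample markup and the helper name strip_tags are assumptions for illustration only and are not part of the original code.

```python
import re

# Hypothetical sample markup standing in for the downloaded page.
sample_html = '<p><b>Oh My Girl</b> is a <a href="/wiki/Girl_group">girl group</a>.</p>'

# '<', then as few characters as possible, then '>':
# each tag is matched on its own.
TAG_PATTERN = re.compile(r"<.+?>")


def strip_tags(markup: str) -> str:
    """Remove everything that looks like an HTML tag and trim whitespace."""
    return TAG_PATTERN.sub("", markup).strip()


plain_text = strip_tags(sample_html)
print(plain_text)

# For comparison: the greedy form '<.+>' runs from the first '<'
# to the last '>' on the line, so it swallows the text as well.
greedy_result = re.sub(r"<.+>", "", sample_html)
print(repr(greedy_result))
```

The question mark in .+? is what keeps the text between tags. Without it, the pattern is greedy and on this sample removes the whole line.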

20180521_143943.png

  • After the tags are removed, the printed output is clean and contains only the content we want, as shown above. One thing is left over: in <<Banana Allergy Monkey>>, each < appears as &lt; and each > as &gt;, so only this part still needs to be fixed.
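One possible way to fix the leftover &lt; and &gt; is the standard library's html.unescape, which turns character references back into the characters they stand for. The sketch below is a suggestion rather than part of the original code. The sample string and the helper names strip_tags and to_plain_text are made up for illustration.

```python
import html
import re

# Hypothetical sample: the title as it appears in the page source,
# with its angle brackets written as character references.
sample_html = "<p>Title: &lt;&lt;Banana Allergy Monkey&gt;&gt;</p>"


def strip_tags(markup: str) -> str:
    """Remove everything that looks like an HTML tag and trim whitespace."""
    return re.sub(r"<.+?>", "", markup).strip()


def to_plain_text(markup: str) -> str:
    """Strip tags first, then decode references such as &lt; and &gt;."""
    return html.unescape(strip_tags(markup))


clean_title = to_plain_text(sample_html)
print(clean_title)

# Reversed order, for comparison: once decoded, the title's own
# angle brackets look like a tag to the regular expression.
wrong_order = strip_tags(html.unescape(sample_html))
print(wrong_order)
```

The order matters here: decoding before stripping lets the pattern treat the title itself as a tag and delete it. As an alternative to both steps, BeautifulSoup's get_text() method returns an element's text with the tags dropped and the references already decoded.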