Web Scraping and Removing Tags with Python
Overview
Python has mature packages for crawling, so the steps are easy to follow. Let's read a web page and strip its HTML tags.
Example
Code
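import requests
from bs4 import BeautifulSoup
import re

# Fetch the Korean Wikipedia entry for Oh My Girl
rq = requests.get("https://ko.wikipedia.org/wiki/%EC%98%A4%EB%A7%88%EC%9D%B4%EA%B1%B8")
rqctnt = rq.content
# Parse the raw HTML with Python's built-in parser
soup = BeautifulSoup(rqctnt, "html.parser")
# Collect every <p> element and turn the result into one string (tags included)
OMG = str(soup.find_all("p"))
# Delete anything that looks like a tag: "<", the shortest run of characters, ">"
OMG = re.sub(r'<.+?>', '', OMG).strip()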
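To see what the last two lines do without making a network request, here is a minimal sketch that applies the same pattern to a hand-written string. The sample string and the names sample_html and strip_tags are hypothetical, made up for illustration; the string imitates what str(soup.find_all("p")) produces, which is a list rendered with square brackets and commas.

```python
import re


def strip_tags(markup: str) -> str:
    """Remove everything from a '<' up to the nearest '>' (non-greedy match)."""
    return re.sub(r"<.+?>", "", markup).strip()


# Hypothetical stand-in for str(soup.find_all("p")): a list of <p> elements as text
sample_html = (
    '[<p><b>Oh My Girl</b> is a <a href="/wiki/Group">group</a>.</p>, '
    "<p>Second paragraph.</p>]"
)

cleaned = strip_tags(sample_html)
print(cleaned)  # [Oh My Girl is a group., Second paragraph.]
```

Note that the surrounding brackets and the comma between paragraphs survive, because they come from converting the list to a string rather than from the HTML itself.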
Result
- As an example, let's read the Oh My Girl entry from Wikipedia. As you can see, the necessary packages are requests and bs4.
- If you only read the page and print it, the output has HTML tags attached all over it. To remove them, use a regular expression as in the example code, which requires the re package.
- After removing the tags and printing the result, the output is clean and contains only the necessary content. In <<Banana Allergy Monkey>>, however, < has been converted to &lt; and > to &gt;, so only this part still needs to be fixed.
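The leftover entities mentioned in the last point can be converted back with the standard library's html.unescape. The sketch below is one possible approach, not part of the original code: it assumes the unescaping runs after the tags have been stripped, and the fragment string is hypothetical.

```python
import html
import re

# Hypothetical fragment: a title whose angle brackets are escaped in the markup
fragment = "<p>Title: <i>&lt;&lt;Banana Allergy Monkey&gt;&gt;</i></p>"

# Step 1: strip the tags while the title's brackets are still escaped
no_tags = re.sub(r"<.+?>", "", fragment).strip()

# Step 2: turn character references such as &lt; and &gt; back into characters
restored = html.unescape(no_tags)
print(restored)  # Title: <<Banana Allergy Monkey>>
```

The order matters here. Unescaping first would put real < and > characters around the title, and the tag pattern would then delete the title as if it were a tag.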