Removing HTML Tag removes additional words

Question:

I am working on a data-cleaning problem where I need to remove HTML tags from strings while keeping the text content.

An example string is given below. I tried removing the "pre" tags, but the result comes back empty.

import re

x = '<pre>i am </pre><p>  siddharth </p><pre> sid </pre>'
re.sub(r'<pre>.*</pre>', '', x)

If I put back the "\n" that I had deleted earlier, I do get output, as shown below:

x = '<pre>i am </pre>\n<p>  siddharth </p><pre> sid </pre>'
re.sub(r'<pre>.*</pre>', '', x)

Output: '\n<p>  siddharth </p>'

A string from the dataset is given below for reference:

'<p>I've written a database generation script in <a href="http://en.wikipedia.org/wiki/SQL">SQL</a> and want to execute it in my <a href="http://en.wikipedia.org/wiki/Adobe_Integrated_Runtime">Adobe AIR</a> application: </p> <pre> <code> Create Table tRole (      roleID integer Primary Key      ,roleName varchar(40));Create Table tFile (    fileID integer Primary Key    ,fileName varchar(50)    ,fileDescription varchar(500)    ,thumbnailID integer    ,fileFormatID integer    ,categoryID integer    ,isFavorite boolean    ,dateAdded date    ,globalAccessCount integer    ,lastAccessTime date    ,downloadComplete boolean    ,isNew boolean    ,isSpotlight boolean    ,duration varchar(30));Create Table tCategory (    categoryID integer Primary Key    ,categoryName varchar(50)    ,parent_categoryID integer);... </code> </pre> <p> I execute this in Adobe AIR using the following methods: </p> <pre> <code> public static function RunSqlFromFile(fileName:String):void {    var file:File = File.applicationDirectory.resolvePath(fileName);    var stream:FileStream = new FileStream();    stream.open(file, FileMode.READ)    var strSql:String = stream.readUTFBytes(stream.bytesAvailable);    NonQuery(strSql);}public static function NonQuery(strSQL:String):void{    var sqlConnection:SQLConnection = new SQLConnection();    sqlConnection.open(File.applicationStorageDirectory.resolvePath(DBPATH);    var sqlStatement:SQLStatement = new SQLStatement();    sqlStatement.text = strSQL;    sqlStatement.sqlConnection = sqlConnection;    try    {        sqlStatement.execute();    }    catch (error:SQLError)    {        Alert.show(error.toString());    }} </code> </pre> <p> No errors are generated, however only <code>tRole</code> exists. It seems that it only looks at the first query (up to the semicolon- if I remove it, the query fails). Is there a way to call multiple queries in one statement?</p>'

The full cleanup code is given below. The list "arr" contains all the strings that need to be cleaned.

arr = [i.replace('\n', '') for i in arr]
arr = [re.sub(r'<pre>.*</pre>', '', i) for i in arr]
arr = [re.sub(r'<code>.*</code>', '', i) for i in arr]
arr = [re.sub('<[^<]+?>', '', i) for i in arr]

Please let me know if anyone has run into the same issue and found a way around it.

Asked By: Siddharth vij

Answers:

Since the question is tagged BeautifulSoup: to remove a specific tag while keeping its content, you can use .unwrap():

from bs4 import BeautifulSoup

html = '''<pre>i am </pre><p>  siddharth </p><pre> sid </pre>'''
soup = BeautifulSoup(html, 'html.parser')

for tag in soup.select('pre'):
    tag.unwrap()

soup

->

i am <p>  siddharth </p> sid 

Or, to extract only the text, use .get_text():

from bs4 import BeautifulSoup

html = '''<pre>i am </pre>\n<p>  siddharth </p><pre> sid </pre>'''
soup = BeautifulSoup(html, 'html.parser')

soup.get_text(' ', strip=True)

->

i am siddharth sid
Answered By: HedgeHog

Another method, using .extract(), which removes the matched tags together with their contents:

from bs4 import BeautifulSoup

html_doc = '<pre>i am </pre><p>  siddharth </p><pre> sid </pre>'

soup = BeautifulSoup(html_doc, 'html.parser')

# remove <pre> and <code> tags:
for tag in soup.select('pre, code'):
    tag.extract()

# get remaining text:
text = soup.get_text(strip=True, separator=' ')
print(text)

Prints:

siddharth
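
As a rough sketch, the same idea could be wrapped in a helper and applied to every string in the question's "arr". The function name strip_code_blocks and the two sample strings are illustrative assumptions, not part of the original code:

```python
from bs4 import BeautifulSoup


def strip_code_blocks(html):
    """Hypothetical helper: drop <pre>/<code> elements (contents included)
    and return the remaining text."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.select('pre, code'):
        tag.extract()
    return soup.get_text(strip=True, separator=' ')


# Sample data standing in for the question's "arr" (assumed for illustration)
arr = [
    '<pre>i am </pre><p>  siddharth </p><pre> sid </pre>',
    '<p>only <code>tRole</code> exists</p>',
]

cleaned = [strip_code_blocks(s) for s in arr]
print(cleaned)  # ['siddharth', 'only exists']
```

Parsing each string separately keeps one malformed entry from affecting the others.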
Answered By: Andrej Kesely
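
A note on why the original regex behaved as it did: ".*" is greedy, so it runs from the first <pre> to the last </pre> and removes everything in between, and "." does not match a newline unless re.DOTALL is set. That would explain why putting the "\n" back changed the result. The sketch below illustrates this with a non-greedy pattern; the variable names are illustrative, and a parser such as BeautifulSoup is likely still the more robust choice for real HTML:

```python
import re

x = '<pre>i am </pre><p>  siddharth </p><pre> sid </pre>'

# Greedy: .* spans from the first <pre> to the LAST </pre>, so the whole string goes.
greedy = re.sub(r'<pre>.*</pre>', '', x)

# Non-greedy: .*? stops at the nearest </pre>, so each <pre> block is removed on its own.
lazy = re.sub(r'<pre>.*?</pre>', '', x)

# Without re.DOTALL, "." cannot cross a newline inside a <pre> block.
multi = '<pre>a\nb</pre><p>keep</p>'
no_dotall = re.sub(r'<pre>.*?</pre>', '', multi)
with_dotall = re.sub(r'<pre>.*?</pre>', '', multi, flags=re.DOTALL)

print(repr(greedy))       # ''
print(repr(lazy))         # '<p>  siddharth </p>'
print(repr(no_dotall))    # '<pre>a\nb</pre><p>keep</p>'
print(repr(with_dotall))  # '<p>keep</p>'
```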