The idea is to write a simple function that strips all HTML tags from a given string.
I know the following code might not look like the best code in the world, but keep in mind that I’m just looking for a simple function that does the job.
    import re

    def stripHTMLTags(html):
        """
        Strip HTML tags from any string and transform special entities
        """
        text = html

        # apply rules in given order!
        rules = [
            (r'>\s+', u'>'),                              # remove spaces after a tag opens or closes
            (r'\s+', u' '),                               # replace consecutive spaces
            (r'\s*<br\s*/?>\s*', u'\n'),                  # newline after a <br>
            (r'</(div)\s*>\s*', u'\n'),                   # newline after </div>
            (r'</(p|h\d)\s*>\s*', u'\n\n'),               # blank line after </p>, </h1>, </h2>...
            (r'<head>.*<\s*(/head|body)[^>]*>', u''),     # remove <head> to </head>
            (r'<a\s+href="([^"]+)"[^>]*>.*</a>', r'\1'),  # show links instead of texts
            (r'[ \t]*<[^<]*?/?>', u''),                   # remove remaining tags
            (r'^\s+', u''),                               # remove spaces at the beginning
        ]
        for pattern, replacement in rules:
            text = re.sub(pattern, replacement, text)

        # replace special strings; '&amp;' goes last so '&amp;lt;' is not decoded twice
        special = [
            ('&nbsp;', ' '),
            ('&quot;', '"'),
            ('&lt;', '<'),
            ('&gt;', '>'),
            ('&amp;', '&'),
        ]
        for entity, char in special:
            text = text.replace(entity, char)
        return text
To see this function in action (and get an idea of what it does), here is an example of an HTML document containing some JavaScript, styles, classes, and several kinds of tags:
    print(stripHTMLTags('''
    <html>
      <head>
        <title> Whatever </title>
        <script language="javascript">
          var j = 0;
          function asdf() {
            if (j < 0) alert ("Whatever");
          }
        </script>
      </head>
      <body style="background: red;">
        <div class="container">
          <h1>This is the title</h1>
          <p>
            This is the first paragraph with little text. <br/>
            This is the second line of the first paragraph (note the BR tag).
          </p>
          <p>This is the second paragraph with some extra text.</p>
        </div>
        <div>
          <p>
            <span>Follow this link:</span>
            <a href="http://www.codigomanso.com/">Codigo Manso</a>
          </p>
          <p>Do you like it?</p>
        </div>
      </body>
    </html>'''))
And this is the output of stripHTMLTags:
    This is the title

    This is the first paragraph with little text.
    This is the second line of the first paragraph (note the BR tag).

    This is the second paragraph with some extra text.

    Follow this link: http://www.codigomanso.com/

    Do you like it?
My requirement was a simple function that transforms HTML into text. I know it won’t handle all cases, and certainly not badly formatted HTML. It’s OK! I’m fine with that.
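One of the cases the function does not cover is any entity outside its small replacement table. As a rough sketch, assuming Python 3.4 or later, the standard library's `html.unescape` can decode named and numeric entities in one call. The wrapper name `decode_entities` is only an illustration and is not part of the original function.

```python
import html


def decode_entities(text):
    """Decode named and numeric HTML entities (hypothetical helper name)."""
    # html.unescape makes a single pass, so '&amp;lt;' becomes '&lt;', not '<'.
    return html.unescape(text)


if __name__ == "__main__":
    print(decode_entities("caf&eacute; &copy; 2010 &lt;b&gt; &#8364;5"))
```

Note that `html.unescape` turns `&nbsp;` into a non-breaking space (`\xa0`) rather than a plain space, so it may still need a `replace` afterwards if plain spaces are wanted.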
If your project needs a function that handles malformed HTML, I recommend looking for a good HTML or XML parser (BeautifulSoup is a great HTML parser for Python).
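To give an idea of what the parser-based route looks like without installing anything, here is a minimal sketch built on Python 3's standard `html.parser` module instead of BeautifulSoup. The names `TextExtractor` and `html_to_text` are made up for this example, and the set of tags treated as line breaks is an assumption. It does not reproduce the exact output of `stripHTMLTags` (for instance, it keeps link text rather than the URL).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, skipping <head>, <script> and <style> (hypothetical helper)."""

    SKIPPED = {"head", "script", "style"}
    BLOCK = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs defaults to True, so entities arrive already decoded
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Ignore text found inside skipped elements
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse whitespace inside each line and drop empty lines
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    print(html_to_text("<p>Hello <b>world</b> &amp; all</p><p>Bye<br>now</p>"))
```

The parser tracks which element it is in, so the contents of `<script>` are skipped even when they contain a `<` character, which is exactly the kind of input that trips up regular expressions.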

