The idea is to remove all HTML tags from a given string.
The following is not the best code in the world, but it does the job.

    import re

    def stripHTMLTags(html):
        """
        Strip HTML tags from any string and transform special entities
        """
        text = html

        # apply rules in given order!
        rules = [
            (r'>\s+', '>'),                               # remove spaces after a tag opens or closes
            (r'\s+', ' '),                                # replace consecutive spaces
            (r'\s*<br\s*/?>\s*', '\n'),                   # newline after a <br>
            (r'</(div)\s*>\s*', '\n'),                    # newline after </div>
            (r'</(p|h\d)\s*>\s*', '\n\n'),                # blank line after </p>, </h1>, </h2>...
            (r'<head>.*<\s*(/head|body)[^>]*>', ''),      # remove <head> to </head>
            (r'<a\s+href="([^"]+)"[^>]*>.*</a>', r'\1'),  # show links instead of texts
            (r'[ \t]*<[^<]*?/?>', ''),                    # remove remaining tags
            (r'^\s+', ''),                                # remove spaces at the beginning
        ]
        for pattern, replacement in rules:
            text = re.sub(pattern, replacement, text)

        # replace special entities; &amp; goes last so that text such as
        # "&amp;lt;" is not decoded twice
        special = [
            ('&nbsp;', ' '),
            ('&quot;', '"'),
            ('&lt;', '<'),
            ('&gt;', '>'),
            ('&amp;', '&'),
        ]
        for entity, char in special:
            text = text.replace(entity, char)

        return text

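As an aside, the entity step can probably be left to the standard library: in Python 3, `html.unescape` decodes named and numeric character references in a single pass, so the order of the replacements stops mattering. A minimal sketch, assuming Python 3.4 or later; the helper name `decode_entities` is made up for illustration and is not part of the function above.

```python
import html

def decode_entities(text):
    # html.unescape handles named (&amp;) and numeric (&#233;) references
    # in a single pass, so "&amp;lt;" becomes "&lt;" and not "<".
    decoded = html.unescape(text)
    # &nbsp; decodes to U+00A0 (no-break space); map it to a plain space,
    # which is what a hand-written replacement table would do.
    return decoded.replace('\u00a0', ' ')

print(decode_entities('Tom &amp; Jerry &lt;3&nbsp;&quot;cheese&quot;'))
```

Unlike a fixed table of five entities, this also covers references such as `&eacute;` or `&#8217;`, which may or may not be what you want for your input.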
Sample HTML text with a bit of JavaScript and several kinds of tags:

    print(stripHTMLTags('''
    <html>
      <head>
        <title> Whatever </title>
        <script language="javascript">
          var j = 0;
          function asdf() {
            if (j < 0) alert ("Whatever");
          }
        </script>
      </head>
      <body style="background: red;">
        <div class="container">
          <h1>This is the title</h1>
          <p>
            This is the first paragraph with little text. <br/>
            This is the second line of the first paragraph (note the BR tag).
          </p>
          <p>This is the second paragraph with some extra text.</p>
        </div>
        <div>
          <p>
            <span>Follow this link:</span>
            <a href="http://www.codigomanso.com/">Codigo Manso</a>
          </p>
          <p>Do you like it?</p>
        </div>
      </body>
    </html>'''))

Output of the stripHTMLTags function:

    This is the title

    This is the first paragraph with little text.
    This is the second line of the first paragraph (note the BR tag).

    This is the second paragraph with some extra text.


    Follow this link:http://www.codigomanso.com/

    Do you like it?

So there you have it: the function is a bit of a hack, but it is very simple and does the job (at least for my purposes). Anyone who wants something better should look for a good HTML parser (BeautifulSoup is a great HTML parser for Python).
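For readers who would rather not install anything, here is a rough sketch of the parser-based route using Python 3's built-in `html.parser` module instead of BeautifulSoup. It is only an illustration under my own assumptions: the names `TextExtractor` and `html_to_text` are invented, the set of block tags is arbitrary, and link URLs are not surfaced the way the regex version does.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content, ignoring anything inside <head>, <script>, <style>."""

    SKIP = {'head', 'script', 'style'}
    BLOCK = {'p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities for us
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == 'br':
            self.parts.append('\n')

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCK:
            self.parts.append('\n')

    def handle_data(self, data):
        # keep text only when we are not inside a skipped element
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # collapse whitespace within each line and drop empty lines
    lines = (' '.join(line.split()) for line in ''.join(parser.parts).split('\n'))
    return '\n'.join(line for line in lines if line)


print(html_to_text('<h1>Title</h1><p>First line.<br/>Second &amp; last.</p>'))
```

Because the parser tracks where tags start and end, a stray `<` inside a script (as in the sample above) or an unusual attribute cannot confuse it the way it can confuse a regular expression.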

