Manso Trick: a simple way to strip HTML tags using Python

The idea is to make a simple function that strips all HTML tags from a given string.

I know the following code might not seem the best code in the world, but keep in mind that I’m just looking for a simple function that does the job.

def stripHTMLTags (html):
  """
    Strip HTML tags from any string and transform special entities
  """
  import re
  text = html
 
  # apply rules in given order!
  rules = [
    { r'>\s+' : u'>'},                  # remove spaces after a tag opens or closes
    { r'\s+' : u' '},                   # replace consecutive spaces
    { r'\s*<br\s*/?>\s*' : u'\n'},      # newline after a <br>
    { r'</(div)\s*>\s*' : u'\n'},       # newline after </div>
    { r'</(p|h\d)\s*>\s*' : u'\n\n'},   # blank line after </p> and </h1>, </h2>...
    { r'<head>.*<\s*(/head|body)[^>]*>' : u'' },     # remove <head> to </head>
    { r'<a\s+href="([^"]+)"[^>]*>.*?</a>' : r'\1' }, # show links instead of texts (non-greedy, so several links on one line stay separate)
    { r'[ \t]*<[^<]*?/?>' : u'' },            # remove remaining tags
    { r'^\s+' : u'' }                   # remove spaces at the beginning
  ]
 
  for rule in rules:
    for (k,v) in rule.items():
      regex = re.compile (k)
      text  = regex.sub (v, text)
 
  # replace special strings
  special = {
    '&nbsp;' : ' ', '&amp;' : '&', '&quot;' : '"',
    '&lt;'   : '<', '&gt;'  : '>'
  }
 
  for (k,v) in special.items():
    text = text.replace (k, v)
 
  return text
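
The special table above covers only five entities, and the replacement order matters: if &amp;amp; happens to be decoded before &amp;lt;, a literal "&amp;amp;lt;" in the source gets decoded twice. As a hedged alternative, assuming Python 3.4 or later (the function above targets Python 2), the standard library's html.unescape decodes named and numeric entities in a single pass. The helper name decode_entities is only illustrative and is not part of the original function.

```python
import html

def decode_entities(text):
    """Decode HTML entities in one pass (assumes Python 3.4+)."""
    # html.unescape handles named (&amp;), decimal (&#233;) and hex (&#xe9;) entities
    decoded = html.unescape(text)
    # &nbsp; decodes to U+00A0; map it to a plain space, as the special table does
    return decoded.replace('\xa0', ' ')

print(decode_entities('Tom &amp; Jerry&nbsp;&lt;3'))
print(decode_entities('&amp;lt;'))  # decoded once only, so the result is '&lt;'
```

In stripHTMLTags this would replace the final loop over special, at the cost of dropping Python 2 support.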

To see this function in action (and get an idea of what it does), here is an example of an HTML document containing some JavaScript, styles, classes, and several kinds of tags:

print stripHTMLTags ('''
<html>
  <head>
  <title> Whatever </title>
  <script language="javascript">
  var j = 0;
  function asdf() {
    if (j < 0)
      alert ("Whatever");
  }
  </script>
  </head>
  <body style="background: red;">
    <div class="container">
      <h1>This is the title</h1>
      <p>
        This is the first paragraph with little text.
        <br/>
        This is the second line of the first paragraph (note the BR tag).
      </p>
      <p>This is the second paragraph with some extra text.</p>     
    </div>
    <div>
      <p>
        <span>Follow this link:</span>
        &nbsp;&nbsp;<a href="http://www.codigomanso.com/">Codigo Manso</a>
      </p>
 
      <p>Do you like it?</p>
    </div>
  </body>
</html>''')

And this is the output of stripHTMLTags:

This is the title
 
This is the first paragraph with little text.
This is the second line of the first paragraph (note the BR tag). 
 
This is the second paragraph with some extra text.
 
Follow this link:  http://www.codigomanso.com/
 
Do you like it?

My requirement was a simple function that transforms HTML into text. I know it won’t handle every case, and it won’t cope with badly formatted HTML. That’s OK! I’m fine with that.
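
To make those limits concrete, here is a small sketch that applies only the "remove remaining tags" rule, in isolation, to two inputs that trip it up. The names TAG_RULE and strip_remaining_tags are made up for this illustration, and the snippet assumes nothing beyond the re module.

```python
import re

# The "remove remaining tags" rule from stripHTMLTags, applied on its own
TAG_RULE = re.compile(r'[ \t]*<[^<]*?/?>')

def strip_remaining_tags(html):
    return TAG_RULE.sub('', html)

# 1. A '>' inside an attribute value ends the match too early,
#    so the rest of the tag leaks into the text
broken_attr = strip_remaining_tags('<img alt="a > b" src="x.png">caption')

# 2. A <script> placed in the body: the tags are removed, the code stays
leaked_script = strip_remaining_tags('<script>if (a < b) go();</script>done')

print(repr(broken_attr))    # ' b" src="x.png">caption'
print(repr(leaked_script))  # 'if (a < b) go();done'
```

In the example page above the script sits inside the head, so the head rule removes it before this rule runs; a script in the body would not be that lucky.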

If you need a function that handles malformed HTML for your project, I recommend looking for a good HTML or XML parser (BeautifulSoup is a great HTML parser for Python).
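
As a rough idea of what the parser route looks like without installing anything, here is a minimal sketch built on html.parser.HTMLParser from the Python 3 standard library. This is not BeautifulSoup and not a port of stripHTMLTags: the names TextExtractor and html_to_text are invented for the example, and unlike the regex version it keeps the link text instead of substituting the URL.

```python
from html.parser import HTMLParser  # Python 3 standard library

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping everything inside head, script and style."""
    SKIP = {'head', 'script', 'style'}
    BLOCK = {'p', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'li', 'tr'}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities arrive already decoded
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag == 'br':
            self.parts.append('\n')

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append('\n')  # end of a block element starts a new line

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = ''.join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines
    lines = [' '.join(line.split()) for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)

print(html_to_text('<p>Hello <b>world</b></p>'
                   '<script>if (a < b) x();</script>'
                   '<p>Bye&nbsp;now &amp; then</p>'))
```

The parser treats script and style contents as raw text, so a stray "<" inside JavaScript no longer confuses the tag handling.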

  1. Philip
    12/10/2010 at 1:45 pm

    Cool approach! Works well enough for text only view of HTML emails.

  2. Patrick
    09/11/2010 at 5:13 pm

    Hi there, I am just learning Python for a project that does just this, but with XML. When I try this code I get an "invalid syntax" error with no more information, but it highlights the last ‘ of each line in the rules. Any idea why that might be happening?

    Thanks!
    Patrick

  3. Pau Sánchez
    13/11/2010 at 3:19 am

    Might it be a problem with spaces? I don’t know. Which version of Python are you using?

    That code should work fine on Python 2.5 to Python 2.7 for sure.

  4. e1000
    02/03/2012 at 2:06 am

    Cool! I simplified the rules a bit, btw, making them just a list of tuples:


    # apply rules in given order!
    rules = [
      (r'\s+', ' '),                                # replace consecutive whitespace
      (r'\s*<br\s*/?>\s*', '\n'),                   # newline after a <br>
      (r'\s*</(div)\s*>\s*', '\n'),                 # newline after </p> and </div> and <h1/>...
      (r'\s*</(p|h\d)\s*>\s*', '\n\n'),             # newline after </p> and </div> and <h1/>...
      (r'<head>.*<\s*(/head|body)[^>]*>', ''),      # remove <head> to </head>
      (r'<a\s+href="([^"]+)"[^>]*>.*</a>', r'\1'),  # show links instead of texts
      (r'<[^<]*?/?>\s*', ''),                       # remove remaining tags
    ]

    for rgx, val in rules:
      regex = re.compile(rgx)
      text = regex.sub(val, text)

  5. bj
    16/07/2012 at 9:52 am

    Does not work with tables.

  6. Abhiram
    31/10/2012 at 11:44 am

    Thank you. Your approach reminds me of my old lex days 🙂