Manso Trick: simple strip HTML tags using Python

The idea is to write a simple function that strips all HTML tags from a given string.

I know the following code might not seem the best code in the world, but keep in mind that I’m just looking for a simple function that does the job.

def stripHTMLTags (html):
  """
    Strip HTML tags from any string and transform special entities
  """
  import re
  text = html
 
  # apply rules in given order!
  rules = [
    { r'>\s+' : u'>'},                  # remove spaces after a tag opens or closes
    { r'\s+' : u' '},                   # replace consecutive spaces
    { r'\s*<br\s*/?>\s*' : u'\n'},      # newline after a <br>
    { r'</(div)\s*>\s*' : u'\n'},       # newline after </div>
    { r'</(p|h\d)\s*>\s*' : u'\n\n'},   # blank line after </p> and </h1>, </h2>...
    { r'<head>.*<\s*(/head|body)[^>]*>' : u'' },     # remove <head> to </head> (or up to <body>)
    { r'<a\s+href="([^"]+)"[^>]*>.*?</a>' : r'\1' }, # show link URLs instead of texts (non-greedy, one link at a time)
    { r'[ \t]*<[^<]*?/?>' : u'' },            # remove remaining tags
    { r'^\s+' : u'' }                   # remove spaces at the beginning
  ]
 
  for rule in rules:
    for (k,v) in rule.items():
      regex = re.compile (k)
      text  = regex.sub (v, text)
 
  # replace special strings, in order: '&amp;' goes last so that
  # text such as '&amp;lt;' is not unescaped twice
  special = [
    ('&nbsp;', ' '), ('&quot;', '"'),
    ('&lt;',   '<'), ('&gt;',   '>'), ('&amp;', '&')
  ]
 
  for (k,v) in special:
    text = text.replace (k, v)
 
  return text

To see this function in action (and get an idea of what it does), here is an example of an HTML document containing some JavaScript, styles, classes, and several kinds of tags:

print stripHTMLTags ('''
<html>
  <head>
  <title> Whatever </title>
  <script language="javascript">
  var j = 0;
  function asdf() {
    if (j < 0)
      alert ("Whatever");
  }
  </script>
  </head>
  <body style="background: red;">
    <div class="container">
      <h1>This is the title</h1>
      <p>
        This is the first paragraph with little text.
        <br/>
        This is the second line of the first paragraph (note the BR tag).
      </p>
      <p>This is the second paragraph with some extra text.</p>     
    </div>
    <div>
      <p>
        <span>Follow this link:</span>
        &nbsp;&nbsp;<a href="http://www.codigomanso.com/">Codigo Manso</a>
      </p>
 
      <p>Do you like it?</p>
    </div>
  </body>
</html>''')

And this is the output of stripHTMLTags:

This is the title
 
This is the first paragraph with little text.
This is the second line of the first paragraph (note the BR tag). 
 
This is the second paragraph with some extra text.
 
Follow this link:  http://www.codigomanso.com/
 
Do you like it?

My requirement was a simple function that transforms HTML into text. I know it won’t handle all cases, and certainly not badly formatted HTML. That’s OK! I’m fine with that.

If you need a function that handles badly formatted HTML for your project, I recommend looking for a good HTML or XML parser (BeautifulSoup is a great HTML parser for Python).
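As a rough illustration of the parser-based route, here is a minimal sketch that uses html.parser from the Python 3 standard library rather than BeautifulSoup. The names TextExtractor and strip_tags are made up for this example, and the choice of which tags to skip or to treat as line breaks is an assumption, not a rule.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping <head>, <script> and <style>."""

    SKIPPED = {"head", "script", "style"}
    BLOCK = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        # Text inside a skipped element is dropped
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of `html`, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Unlike the regex version, this sketch drops link URLs and keeps the link text; it is meant only to show the shape of a parser-based approach.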


Manso Trick: Pad a number with leading zeroes in javascript

I was missing a simple and elegant method for padding a number with leading zeroes in javascript.

A typical example of leading zeroes is when you want to show the current time formatted like hh:mm. There is no problem when it’s 12:40, but when it is five minutes past one, you get 1:5, which is not what you would expect. You want it formatted like 01:05. The same happens with dates and in several other situations.

Yesterday I found the best solution I’ve seen for this problem. Before that, let’s see possible solutions.

Usually, needing only one zero, I would do something like:

h = (h < 10) ? ("0" + h) : h;

This is not bad at all, but it is not elegant. The problem is, what if we want it to be formatted with 3 digits? Then, using the same scheme, we can end up with something like:

h = (h < 100) ? ( (h >= 10) ? ("0" + h) : ("00" + h) ) : h;

Man, this is ugly and unreadable!! Maybe it would improve with an if/else, but it would be bad anyway. If you need to do something like that, you will probably create a function that handles this problem, and then call that function saying you want h padded to 3 digits. But there is another solution that does not require you to use or create such a function…

Have a look at the two-digit and three-digit versions:

("0" + h).slice (-2);  // returns "01" if h=1; "12" if h=12
("00" + h).slice (-3); // "001" if h=1; "012" if h=12; "123" if h=123

I don’t know what you are thinking after seeing this solution, but I was stunned when I saw it. Just amazing!

Let’s explain what’s going on. The code above simply concatenates "00" with the number we want to format (e.g. "00" + 12 = "0012") and then takes the last three characters using the string slice function (that’s why there is a -3). Guess what? The last three characters are "012", which is the string we want.

Of course, this solution is not limited to leading zeroes; you can use it with leading spaces or whatever character you want.
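Since most of the code on this blog is Python, here is a minimal sketch of the same concatenate-and-slice trick in Python 3. The helper name pad_left is made up for this example. One caveat it makes explicit: slicing alone would cut digits off a value that is already wider than the requested width, so the sketch returns such values unchanged.

```python
def pad_left(value, width, fill="0"):
    """Pad str(value) on the left with `fill` up to `width` characters."""
    text = str(value)
    if len(text) >= width:
        # Slicing alone would truncate here (e.g. 1234 -> "234"),
        # so values that are already wide enough are returned as-is.
        return text
    # Same idea as ("00" + h).slice(-3): prepend fillers, keep the tail
    return (fill * width + text)[-width:]


print(pad_left(1, 2))       # two-digit padding
print(pad_left(12, 3))      # three-digit padding
print(pad_left(5, 3, " "))  # leading spaces instead of zeroes
```

In everyday Python code the built-in str.zfill and str.rjust methods cover the same need; the sketch only mirrors the JavaScript trick.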

Update: Milan Adamovsky has created a web page, http://jsperf.com/zero-padding, where you can compare the performance of several zero-padding functions and try them directly in your browser ;)


How to get user-agent in Google App Engine using Python

The User-Agent header identifies the client application that is making the request. It tells you whether it’s a browser (and which one) or a robot (for example a Google/Yahoo/Bing spider), and so on. In theory the string is present in every HTTP request, no matter which client makes it, but it might be omitted.

To find out which browser the user is using, read the "User-Agent" string. When programming in Python under Google App Engine, you just have to include the following line inside any "get" or "post" method of a class that inherits from webapp.RequestHandler:

self.request.headers['User-Agent']

Another useful thing is to know the IP of the client who is making the request. This is even easier to do:

self.request.remote_addr

The IP can be used to geolocate whoever is making the request.
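Because the header might be omitted, indexing headers['User-Agent'] directly can fail with a KeyError. Here is a minimal sketch that tolerates a missing header, assuming the headers object offers a dict-style get method (a plain dict is used below to keep the example self-contained). The helper name describe_client is made up for this example.

```python
def describe_client(headers, remote_addr):
    """Return (user_agent, ip), using '' when User-Agent is missing."""
    user_agent = headers.get("User-Agent", "")
    return user_agent, remote_addr


# Inside a webapp.RequestHandler method this would be called roughly as:
#   describe_client(self.request.headers, self.request.remote_addr)
print(describe_client({"User-Agent": "ExampleBrowser/1.0"}, "72.14.235.121"))
print(describe_client({}, "72.14.235.121"))
```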


Max URL (or GET) length in Google App Engine

I ran some tests to find the maximum URL length that different browsers can handle, and I ended up hitting a restriction on the server side.

According to the tests I’ve been doing, the maximum length of a URL in Google App Engine is 2048 characters.

So, if you are going to do a GET request on Google App Engine, make sure that the address and the data you are sending are, together, less than 2048 bytes long.
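A check like this is easy to automate. Here is a minimal Python 3 sketch that builds the full URL and tests it against the limit; the helper names and the example address are made up, and using a strict "less than 2048 bytes" comparison is the conservative reading of the numbers above.

```python
from urllib.parse import urlencode

MAX_URL_LENGTH = 2048  # limit observed in the tests described above


def build_get_url(base_url, params):
    """Return base_url with params appended as a query string."""
    if not params:
        return base_url
    return base_url + "?" + urlencode(params)


def fits_in_get(url):
    """True if the whole URL stays under the observed limit."""
    return len(url.encode("utf-8")) < MAX_URL_LENGTH


url = build_get_url("http://example.com/search", {"q": "manso"})
print(url, fits_in_get(url))
```

When fits_in_get returns False, sending the data in the body of a POST request is the usual way out.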


GeoIP in Google App Engine

It is surprising to me that Google does not offer any service or method in Google App Engine to get the geographic location from an IP. This is something they have in tons of products, and offering it would be really simple for them. I don’t see why they are not offering this to GAE developers (they do offer a solution on the client side, if you want to use their JavaScript: google.loader.ClientLocation).

Fortunately, this problem has two easy solutions.

Solution 1: do a request to another server

You can do an HTTP request to http://geoip.wtanaka.com/ and this server will return the country code.

For example, to find the country of the IP 72.14.235.121 you only have to do a request to http://geoip.wtanaka.com/cc/72.14.235.121

If you want to know how to do this kind of request, take a look at http://code.google.com/p/geo-ip-location/wiki/GoogleAppEngine
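Here is a minimal Python 3 sketch of such a request using the standard library. The helper names are made up, the response is assumed to be a short plain-text country code, and the service may no longer be available. On App Engine itself you would probably go through its URL Fetch service instead of urllib.

```python
from urllib.request import urlopen

GEOIP_SERVICE = "http://geoip.wtanaka.com/cc/"  # URL pattern from the post


def country_lookup_url(ip):
    """Build the lookup URL for a given IP address."""
    return GEOIP_SERVICE + ip


def fetch_country_code(ip, timeout=5):
    """Ask the remote service for the country code of `ip`.

    Assumes the body of the response is just the country code as text.
    """
    with urlopen(country_lookup_url(ip), timeout=timeout) as response:
        return response.read().decode("ascii", "replace").strip()


print(country_lookup_url("72.14.235.121"))
```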

Solution 2: implement it in your own server

This is the solution I like the most. You just download the latest version of GeoIP.dat and use it with the pygeoip.py library.

Using it is as easy as:

import pygeoip

def getCountryByIP (remote_addr):
  GEOIP = pygeoip.Database('GeoIP.dat')
  info = GEOIP.lookup(remote_addr)
  return info.country

Please note that using the function defined above is not the best practice for this library. The library was designed to load everything into memory in order to speed up lookups. Each time pygeoip.Database is called, the GeoIP.dat file is loaded into memory, and as you might guess, you only want to do that once; otherwise it makes no sense.
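For the many-lookups case, the usual way to load the file only once is to keep the database object in a module-level variable. Here is a minimal sketch of that pattern in Python 3; get_database and the loader argument are made up for this example, and a stand-in loader replaces pygeoip so the snippet runs on its own.

```python
_database = None  # shared instance, created on first use


def get_database(loader):
    """Return the shared database, calling `loader` only the first time."""
    global _database
    if _database is None:
        _database = loader()
    return _database


# With the library from the post, the loader would be something like:
#   lambda: pygeoip.Database('GeoIP.dat')
load_count = []


def fake_loader():
    # Stand-in for the expensive load of GeoIP.dat
    load_count.append(1)
    return object()


first = get_database(fake_loader)
second = get_database(fake_loader)
print(first is second, len(load_count))
```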

On the other hand, for the application I am developing, I only wanted to do one lookup, so it made no sense to load this file in memory. It just won’t speed up anything, and will load 1MB into memory.

My solution has been to update this library to allow two modes of operation. The first mode behaves the same way as the original library: it loads everything into memory, and it works better if you are going to do hundreds of lookups. The second mode accesses the disk for each lookup, and it works better if you are only going to do a few lookups.

Here you have pygeoip.py with the changes I’ve made. For convenience, you can use disk_lookup:

pygeoip.disk_lookup (remote_addr)

This function is faster and consumes fewer resources, but only when you want to do just a couple of lookups.

I want to congratulate David Wilson, the author of this library, because it has been really easy to implement all the changes and improvements (of course I’ve sent him an e-mail with the changes, but now it is up to him to include them or not).


Python and utf-8: force_unicode

Character encoding is the worst thing ever invented. Fortunately, somebody invented Unicode, UTF-8 and UTF-32.

For me, UTF-8 is, and should be, the standard for saving and sending text strings all over the world.

Programming languages (even Python) should only support UTF-8 as input, be it from the console, from a file, or from a socket. I think I’m going to stop here, because each time I think about how languages support the different encodings I get sick.

The point is that I think Python’s support for character encodings is not very good. Why? It is not smart enough to work around simple concatenations, and it throws exceptions constantly. I cannot have a good programming experience dealing with strings in Python.

Probably I’m missing something about Python, but I used to do str.decode('utf-8', 'ignore'), where str is the UTF-8 byte string to be converted to unicode, and even with that I sometimes still got exceptions.

Today, looking for a solution again, I found force_unicode in the Django framework. After some tests, it seems to work really well.

I just wanted to share the code (slightly modified):

def force_unicode(s, encoding='utf-8', errors='ignore'):
    """
    Returns a unicode object representing 's'. Treats bytestrings using the
    'encoding' codec.
    """
    if s is None:
        return u''
 
    try:
        if not isinstance(s, basestring,):
            if hasattr(s, '__unicode__'):
                s = unicode(s)
            else:
                try:
                    s = unicode(str(s), encoding, errors)
                except UnicodeEncodeError:
                    if not isinstance(s, Exception):
                        raise
                    # If we get to here, the caller has passed in an Exception
                    # subclass populated with non-ASCII data without special
                    # handling to display as a string. We need to handle this
                    # without raising a further exception. We do an
                    # approximation to what the Exception's standard str()
                    # output should be.
                    s = ' '.join([force_unicode(arg, encoding, errors) for arg in s])
        elif not isinstance(s, unicode):
            # Note: We use .decode() here, instead of unicode(s, encoding,
            # errors), so that if s is a SafeString, it ends up being a
            # SafeUnicode at the end.
            s = s.decode(encoding, errors)
    except UnicodeDecodeError, e:
        if not isinstance(s, Exception):
            # Re-raise the original error: UnicodeDecodeError cannot be
            # rebuilt from (s, *e.args), that would raise a TypeError.
            raise
        else:
            # If we get to here, the caller has passed in an Exception
            # subclass populated with non-ASCII bytestring data without a
            # working unicode method. Try to handle this without raising a
            # further exception by individually forcing the exception args
            # to unicode.
            s = ' '.join([force_unicode(arg, encoding, errors) for arg in s])
    return s

I am sure that if you have trouble dealing with strings in Python, as I did, you will appreciate and use this code.
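The function above is Python 2 code (it relies on basestring and unicode). For readers on Python 3, where text and bytes are separate types, here is a much smaller sketch of the same idea. It is not Django's code; the name to_text is made up for this example.

```python
def to_text(value, encoding="utf-8", errors="ignore"):
    """Return `value` as a str, decoding bytes with the given codec."""
    if value is None:
        return ""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        # errors="ignore" silently drops bytes that cannot be decoded
        return bytes(value).decode(encoding, errors)
    # Numbers, exceptions and other objects: use their string form
    return str(value)


print(to_text(b"caf\xc3\xa9"))  # valid UTF-8 bytes
print(to_text(b"\xffabc"))      # the invalid byte is dropped
print(to_text(42))
```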


mansofk: the super mega ultra lightweight js framework

I needed a JavaScript framework that could change the CSS of elements, do AJAX requests, load external JS and CSS dynamically, add or change HTML on the fly, handle events, do animations, and avoid collisions with other frameworks or with other versions of itself. On top of that, I wanted it to be ultra lightweight and to work on IE6+, FF, Safari, Chrome and Opera.

After getting tired of looking for this, I finally decided to write it myself, and in honor of the blog I called it manso framework, mansofk for friends.

I got all that functionality in just 1.5 KB!!

The main features are:

  • Easy to rename the framework to avoid collisions (with other frameworks or other versions of mansofk)
  • Chaining support
  • Dynamic load of external elements
    • Supports loading external CSS files upon request
    • Supports loading external JS files upon request
  • Simple DOM manipulations
    • Select elements by ID
    • Add new HTML on an element
    • Replace the HTML of an element
  • Support CSS manipulations
    • Get the current property of an element
    • Change the property of an element
    • Change several properties at once
  • Simple CSS animations
    • Supports several attributes at once
    • Several parameters, supports changing the duration or the frames per second
    • You can choose between the linear function and the cubic function
  • Event support
    • bind
    • unbind
  • AJAX calls
    • Using GET and POST
    • Supports parsing XML data
    • Supports parsing JSON data
    • Supports getting plain text
  • Super lightweight
    • 3.3 KB minified
    • 1.5 KB gzipped!

You are all free to use this framework for whatever you want, but don’t blame me if it fails (although bug reports are welcome).

Below you have the minified version (built with Google Closure Compiler) and the development version:

So…  that’s it!  I think it would be great to give you a demo, but right now I’m out of time, so I leave it for another day.


Running Google App Engine in Ubuntu 10.04 Lucid Lynx

It is nothing new; the same thing always happens to me. After I update my computer to the latest version of Ubuntu (in this case version 10.04), I always have to spend a couple of days reconfiguring things or reinstalling packages.

The thing is that right now I am developing an application using Google App Engine (GAE for friends), and to avoid problems when deploying the next version, it is recommended to use Python 2.5 when developing.

As you might have guessed from the title, Canonical has removed the Python 2.5 package from the latest Ubuntu Lucid release, so I cannot run the local Google App Engine web server.

Luckily, after searching for a while on launchpad.net I found a solution. Somebody has created python2.4 and python2.5 packages for Ubuntu Lucid Lynx.

The only thing you have to do is add the following two lines at the end of your /etc/apt/sources.list:

deb http://ppa.launchpad.net/fkrull/deadsnakes/ubuntu lucid main
deb-src http://ppa.launchpad.net/fkrull/deadsnakes/ubuntu lucid main

And finally run:

$ sudo apt-get update
$ sudo apt-get install python2.5

And that’s it, you can now run Google App Engine.


Ubuntu 10.04: How to place the window close button on the right again

Ubuntu 10.04 is live. I’ve been using Ubuntu for around 4 to 5 years, and I am a pretty happy user.

From my point of view, the big problem I found in this release (which I just installed 5 minutes ago) is that they moved the window buttons from the right to the left side of the window. I think 95% of users are used to the buttons on the right side, and I don’t see the point of moving them to the left.

Anyway, I don’t like it, so I searched around and found a quick and easy solution.

Press Alt+F2, then type gconf-editor and press enter. This opens the configuration editor.

Once in the editor, in the item tree on the left, look for the path apps -> metacity -> general and double-click on the field named button_layout.

Then you only have to change the value field to this:

menu:minimize,maximize,close

Save the changes, and voilà!! All the windows are reconfigured!

Now all the windows show the minimize, maximize and close buttons on the right again ;)


http_build_query implemented in Python

I implemented a function in Python that mimics PHP’s http_build_query function.

##
# Mimics the behaviour of http_build_query PHP function
# This method can be useful for sending data to flash applications
##################################################
def http_build_query(params, topkey = ''):
  from urllib import quote
 
  if len(params) == 0:
    return ""
 
  result = ""
 
  # is a dictionary?
  if type (params) is dict:
    for key in params.keys():
      newkey = quote (key)
      if topkey != '':
        newkey = topkey + quote('[' + key + ']')
 
      if type(params[key]) is dict:
        result += http_build_query (params[key], newkey)
 
      elif type(params[key]) is list:
        i = 0
        for val in params[key]:
          result += newkey + quote('[' + str(i) + ']') + "=" + quote(str(val)) + "&"
          i = i + 1
 
      # boolean should have special treatment as well
      elif type(params[key]) is bool:
        result += newkey + "=" + quote (str(int(params[key]))) + "&"
 
      # assume string (integers and floats work well)
      else:
        result += newkey + "=" + quote (str(params[key])) + "&"
 
  # remove the last '&'
  if (result) and (topkey == '') and (result[-1] == '&'):
    result = result[:-1]
 
  return result

If you find the code useful, feel free to use it in any kind of software product, be it commercial or not :)
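The function above targets Python 2 (quote lives in urllib there). For anyone on Python 3, here is a compact sketch of the same idea with a usage example. It is an adaptation, not a drop-in copy: it collects the pieces in a list and joins them with "&" instead of trimming the trailing separator, and it assumes Python 3.7+ so that dictionaries keep insertion order.

```python
from urllib.parse import quote


def http_build_query(params, topkey=""):
    """Build a PHP-style query string from a (possibly nested) dict."""
    parts = []
    for key, value in params.items():
        if topkey:
            # Nested key, e.g. c[d], with the brackets percent-encoded
            newkey = topkey + quote("[" + str(key) + "]")
        else:
            newkey = quote(str(key))

        if isinstance(value, dict):
            nested = http_build_query(value, newkey)
            if nested:
                parts.append(nested)
        elif isinstance(value, list):
            for i, item in enumerate(value):
                parts.append(newkey + quote("[%d]" % i) + "=" + quote(str(item)))
        elif isinstance(value, bool):
            # Booleans are sent as 1 or 0, as in the original function
            parts.append(newkey + "=" + str(int(value)))
        else:
            parts.append(newkey + "=" + quote(str(value)))
    return "&".join(parts)


print(http_build_query({"a": 1, "b": [1, 2], "c": {"d": True}}))
```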
