Python and utf-8: force_unicode

Character encodings are the worst thing ever invented. Fortunately, somebody invented Unicode, along with UTF-8 and UTF-32.

For me, UTF-8 is, and should be, the standard for saving and sending text strings all over the world.

Programming languages (even Python) should only support UTF-8 as input, be it from the console, from a file, or from a socket. I think I'm going to stop here, because each time I think about how languages support the different encodings I get sick.

The point is that I think Python's support for character encodings is not very good. Why? It is not smart enough to work around simple concatenations, and it throws exceptions constantly. I cannot have a good programming experience dealing with strings in Python.
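
To make the concatenation complaint concrete, here is a minimal sketch. The variable names are made up for illustration, and the snippet is written so that it behaves the same way on Python 2 and Python 3, even though the exception type differs between them.

```python
raw = b'caf\xc3\xa9'      # UTF-8 encoded bytes for "café"
suffix = u' au lait'      # a text (unicode) string

# Mixing the two directly fails: Python 2 attempts an implicit ASCII decode
# and raises UnicodeDecodeError; Python 3 refuses outright with TypeError.
try:
    combined = raw + suffix
    mixing_failed = False
except (UnicodeDecodeError, TypeError):
    mixing_failed = True

# Decoding explicitly first gives a clean text result on both versions.
combined = raw.decode('utf-8') + suffix
```

The explicit decode is the step that a helper such as force_unicode is meant to take care of.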

Probably I'm missing something about Python, but I used to do str.decode('utf-8', 'ignore'), where str is the byte string to be decoded from UTF-8, but even with that I sometimes still got exceptions.
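
As a rough illustration of what the 'ignore' argument does, the sketch below decodes a byte string that contains one invalid byte. The sample value is invented for the example.

```python
raw = b'caf\xc3\xa9 \xff'   # UTF-8 bytes for "café " plus one invalid byte

# errors='ignore' silently drops bytes that are not valid UTF-8.
text = raw.decode('utf-8', 'ignore')

# The default, errors='strict', raises on the same input.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

One likely cause of the remaining exceptions, on Python 2, is calling .decode() on a value that is already a unicode object: Python first encodes it implicitly with the ASCII codec, and that step raises UnicodeEncodeError for non-ASCII characters regardless of the errors argument.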

Today, looking for a solution again, I found force_unicode in the Django framework. After doing some tests, it seems to work very well.

I just wanted to share the code (slightly modified):

def force_unicode(s, encoding='utf-8', errors='ignore'):
    """
    Returns a unicode object representing 's'. Treats bytestrings using the
    'encoding' codec.
    """
    if s is None:
        return u''
 
    try:
        if not isinstance(s, basestring):
            if hasattr(s, '__unicode__'):
                s = unicode(s)
            else:
                try:
                    s = unicode(str(s), encoding, errors)
                except UnicodeEncodeError:
                    if not isinstance(s, Exception):
                        raise
                    # If we get to here, the caller has passed in an Exception
                    # subclass populated with non-ASCII data without special
                    # handling to display as a string. We need to handle this
                    # without raising a further exception. We do an
                    # approximation to what the Exception's standard str()
                    # output should be.
                    s = ' '.join([force_unicode(arg, encoding, errors) for arg in s])
        elif not isinstance(s, unicode):
            # Note: We use .decode() here, instead of unicode(s, encoding,
            # errors), so that if s is a SafeString, it ends up being a
            # SafeUnicode at the end.
            s = s.decode(encoding, errors)
    except UnicodeDecodeError:
        if not isinstance(s, Exception):
            # A plain bytestring that cannot be decoded: re-raise the
            # original error unchanged.
            raise
        else:
            # If we get to here, the caller has passed in an Exception
            # subclass populated with non-ASCII bytestring data without a
            # working unicode method. Try to handle this without raising a
            # further exception by individually forcing the exception args
            # to unicode.
            s = ' '.join([force_unicode(arg, encoding, errors) for arg in s])
    return s
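
The function above targets Python 2. On Python 3, where unicode and basestring no longer exist, the same idea can be sketched much more briefly. What follows is a minimal, hypothetical adaptation: the name force_text_sketch is made up for illustration and is not Django's API, and it does not reproduce the exception-handling branches above.

```python
def force_text_sketch(s, encoding='utf-8', errors='ignore'):
    """Hypothetical Python 3 counterpart of force_unicode; always returns str."""
    if s is None:
        return ''
    if isinstance(s, str):
        return s                              # already text, nothing to decode
    if isinstance(s, (bytes, bytearray)):
        return s.decode(encoding, errors)     # bytes are decoded with the codec
    return str(s)                             # numbers, exceptions, other objects


samples = [None, 'already text', b'caf\xc3\xa9\xff', 42, ValueError('bad input')]
results = [force_text_sketch(value) for value in samples]
```

With the default errors='ignore', the invalid trailing byte in the third sample is dropped instead of raising an exception.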

I am sure that if you have had trouble dealing with strings in Python, as I did, you will appreciate and use this code.
