Character encoding is the worst thing ever invented. Fortunately, somebody invented Unicode, UTF-8 and UTF-32.
For me, UTF-8 is, and should be, the standard for saving and sending text strings all over the world.
Programming languages (even Python) should only support UTF-8 as input, whether it comes from the console, from a file, or from a socket. I think I'm going to stop here, because every time I think about how languages support the different encodings I get sick.
The point is that I think Python's support for character encodings is not very good. Why? It is not smart enough to work around simple concatenations of byte strings and unicode strings, and it throws exceptions constantly. I cannot have a good programming experience dealing with strings in Python.
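To make the concatenation problem concrete, here is a minimal sketch. It is written for Python 3 so it can be run today, and it only imitates what Python 2 does behind the scenes: when a byte string meets a unicode string, Python 2 decodes the bytes with the ASCII codec. The helper name concat_like_python2 is made up for this illustration.

```python
# UTF-8 bytes for "cafe" with an accented e, plus a plain text prefix.
raw = "caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9'
prefix = "menu: "


def concat_like_python2(text, data):
    """Hypothetical helper: imitate Python 2's implicit ASCII decode."""
    return text + data.decode("ascii")


try:
    concat_like_python2(prefix, raw)
    failed = False
except UnicodeDecodeError:
    # The non-ASCII bytes cannot be decoded with the ASCII codec.
    failed = True

# Decoding explicitly with the right codec avoids the error.
fixed = prefix + raw.decode("utf-8")
```

The exception comes from the hidden ASCII decode, not from the concatenation itself, which is why decoding explicitly first makes it go away.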
Probably I'm missing something about Python, but I used to do s.decode('utf-8', 'ignore'), where s is the byte string to be decoded from UTF-8 into a unicode object, and even then I sometimes still got exceptions.
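A likely explanation (an assumption, since the failing input is not shown here) is that the value was already a unicode object. In Python 2, calling .decode() on a unicode object first encodes it with the ASCII codec, and that hidden step does not honour the 'ignore' argument. The sketch below imitates this in Python 3; decode_like_python2 and to_text are hypothetical names.

```python
def decode_like_python2(value, encoding="utf-8", errors="ignore"):
    """Hypothetical helper: imitate Python 2's .decode() on any string."""
    if isinstance(value, str):
        # Python 2 did this step silently, always with strict ASCII,
        # so the 'errors' argument below never got a chance to help.
        value = value.encode("ascii")
    return value.decode(encoding, errors)


def to_text(value, encoding="utf-8", errors="ignore"):
    """Hypothetical safer variant: decode only real byte strings."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    return value


# Invalid bytes are dropped quietly, as 'ignore' promises.
dropped = decode_like_python2(b"caf\xff!")

# Text that is already decoded still fails in the hidden encode step.
try:
    decode_like_python2("caf\u00e9")
    raised = False
except UnicodeEncodeError:
    raised = True

# Checking the type first leaves already-decoded text untouched.
safe = to_text("caf\u00e9")
```

Checking the type before decoding is the same idea force_unicode relies on.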
Today, looking for a solution again, I found force_unicode in the Django framework. After doing some tests, it seems to work really well.
I just wanted to share the code (slightly modified):
    def force_unicode(s, encoding='utf-8', errors='ignore'):
        """
        Returns a unicode object representing 's'. Treats bytestrings using
        the 'encoding' codec.
        """
        if s is None:
            return u''
        try:
            if not isinstance(s, basestring):
                if hasattr(s, '__unicode__'):
                    s = unicode(s)
                else:
                    try:
                        s = unicode(str(s), encoding, errors)
                    except UnicodeEncodeError:
                        if not isinstance(s, Exception):
                            raise
                        # If we get to here, the caller has passed in an
                        # Exception subclass populated with non-ASCII data
                        # without special handling to display as a string.
                        # We need to handle this without raising a further
                        # exception. We do an approximation to what the
                        # Exception's standard str() output should be.
                        s = ' '.join([force_unicode(arg, encoding, errors)
                                      for arg in s])
            elif not isinstance(s, unicode):
                # Note: We use .decode() here, instead of unicode(s, encoding,
                # errors), so that if s is a SafeString, it ends up being a
                # SafeUnicode at the end.
                s = s.decode(encoding, errors)
        except UnicodeDecodeError:
            if not isinstance(s, Exception):
                # Re-raise the original error unchanged. Building a new
                # UnicodeDecodeError(s, *e.args) would fail with a TypeError,
                # because that constructor takes exactly five arguments.
                raise
            else:
                # If we get to here, the caller has passed in an Exception
                # subclass populated with non-ASCII bytestring data without a
                # working unicode method. Try to handle this without raising a
                # further exception by individually forcing the exception args
                # to unicode.
                s = ' '.join([force_unicode(arg, encoding, errors)
                              for arg in s])
        return s
I am sure that if you have trouble dealing with strings in Python, as I was, you will appreciate and use this code.
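The function above targets Python 2 (it relies on unicode and basestring). For anyone on Python 3, the same idea can be sketched much more briefly. This is a simplified approximation under my own assumptions, not Django's code, and the name force_text_py3 is hypothetical.

```python
def force_text_py3(s, encoding="utf-8", errors="ignore"):
    """Return a str for 's', decoding byte strings with 'encoding'."""
    if s is None:
        return ""
    if isinstance(s, str):
        # Already text: nothing to decode.
        return s
    if isinstance(s, (bytes, bytearray)):
        # Undecodable bytes are handled according to 'errors'.
        return bytes(s).decode(encoding, errors)
    # Any other object (numbers, exceptions, ...) uses its str() form.
    return str(s)


examples = [
    force_text_py3(None),
    force_text_py3(b"caf\xc3\xa9"),
    force_text_py3("already text"),
    force_text_py3(42),
    force_text_py3(ValueError("caf\u00e9")),
]
```

Because Python 3 keeps bytes and str strictly apart, the special handling for exceptions carrying non-ASCII data is no longer needed in this sketch.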