Serializing native data in Python

For a project I’m currently working on, I needed to serialize some structures in Python. The only constraint is that they are built exclusively from basic/native Python types, such as lists, dicts, strings, integers, …

The things I was looking for were:

  • compact representation
  • fast serialization/deserialization
  • human-readable / human-editable output (a must)

After thinking for a while, I found three possible solutions for serializing arbitrary data:

  • the pickle/cPickle modules, which serialize to Python’s internal pickle format and ship with Python by default
  • simplejson, which serializes to JSON, a human-readable format
  • PyYAML, which serializes to YAML, also a human-readable format
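All three follow the same round-trip shape. Here is a minimal sketch using the stdlib json and pickle modules (which on modern Pythons stand in for the simplejson and cPickle mentioned above), plus PyYAML if it is installed:

```python
import json
import pickle

# a small sample structure made only of native Python types
data = {"name": "example", "values": [1, 2, 3], "nested": {"ok": True}}

# JSON round-trip (stdlib json; simplejson exposes the same API)
assert json.loads(json.dumps(data)) == data

# pickle round-trip (binary, Python-specific format)
assert pickle.loads(pickle.dumps(data, pickle.HIGHEST_PROTOCOL)) == data

# YAML round-trip, if the third-party PyYAML package is available;
# the safe_* variants avoid constructing arbitrary Python objects
try:
    import yaml
    assert yaml.safe_load(yaml.safe_dump(data)) == data
except ImportError:
    pass
```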

The next step was to gather a large amount of data and test both the speed and the length of the resulting output. Below are the speed results for two data sets:

[Chart: Python Serialization Speed Test (tiny dataset)]

[Chart: Python Serialization Speed Test (huge dataset)]

The bars above show the time, in seconds, it took to encode/decode the dataset 100 times (although the axis runs from 0 to 100, these are absolute times, not percentages).

Apart from the speed, I also measured the length of the serialized data obtained:

  • yaml            tiny = 32604 // huge = 912912
  • json            tiny = 31305 // huge = 876540
  • pickle-raw      tiny = 34504 // huge = 986531
  • pickle-highest  tiny = 37541 // huge = 1101480
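Lengths like these can be measured simply by taking len() of the encoded output. A sketch, using a small stand-in dataset (the post’s actual tiny/huge datasets aren’t shown) and the stdlib json and pickle modules:

```python
import json
import pickle

# stand-in dataset; the post's real test data is not included here
data = {"users": [{"id": i, "name": "user%d" % i} for i in range(50)]}

encoded = {
    "json": json.dumps(data),
    "pickle-raw": pickle.dumps(data, 0),  # protocol 0, ASCII-based
    "pickle-highest": pickle.dumps(data, pickle.HIGHEST_PROTOCOL),
}

for name, blob in encoded.items():
    print("%-15s length = %d" % (name, len(blob)))
```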

Conclusion:

According to the results, JSON is the best choice for serialization/deserialization. It encodes/decodes slightly slower than the pickle module on huge datasets (about 4% slower, which is not much), and the resulting string is not only human-readable but also smaller than the pickle- and yaml-encoded strings.

All tests were run on Python 2.5.

The code

This is a fragment of the code used to get the speed results:

    import time
    import pickle  # cPickle in Python 2 for extra speed
    import yaml
    from simplejson import JSONDecoder, JSONEncoder  # json.JSONDecoder/JSONEncoder on newer Pythons

    # self.write() is the test harness's output method
    start = time.time()
    for i in range(100):
        yaml.safe_load(yaml.safe_dump(data))
    self.write("YAML TIME = %.2f\n" % (time.time() - start))

    start = time.time()
    for i in range(100):
        JSONDecoder().decode(JSONEncoder().encode(data))
    self.write("JSON TIME = %.2f\n" % (time.time() - start))

    start = time.time()
    for i in range(100):
        pickle.loads(pickle.dumps(data, pickle.HIGHEST_PROTOCOL))
    self.write("PICKLE TIME = %.2f\n" % (time.time() - start))

    start = time.time()
    for i in range(100):
        pickle.loads(pickle.dumps(data, 0))
    self.write("PICKLE-RAW TIME = %.2f\n" % (time.time() - start))


  1. David A
    28/06/2011 at 7:36 pm

    Another option is to serialize with repr() and un-serialize with eval(). My small test suggests it’s at least twice as fast as pickle.

    start = time.time()
    for i in range (0, 100):
        eval(repr(data))
    print ("REPR/EVAL TIME = %.2f\n" % (time.time() - start))

    Of course, this method has its drawbacks. Unlike yaml, json and pickle, the representation is Python-specific, so it cannot be used to communicate with non-Python programs. It is also unsafe to eval() data from untrusted sources.
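
    A safer variant of the commenter’s approach (not mentioned in the post) is the stdlib ast.literal_eval, which parses Python literals back from their repr() without executing arbitrary code:

    ```python
    import ast

    data = {"name": "example", "values": [1, 2, 3], "flag": True}

    # repr() emits a Python-literal string; ast.literal_eval parses it
    # back without executing code, unlike eval()
    restored = ast.literal_eval(repr(data))
    assert restored == data

    # anything beyond plain literals is rejected, so malicious
    # payloads raise ValueError instead of running
    try:
        ast.literal_eval("__import__('os').system('echo pwned')")
    except ValueError:
        print("rejected non-literal input")
    ```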