Manso Hack: Speedup Google App Engine SDK SQLite Database

I was trying to initialize a local database using the Google App Engine SDK, and I was going crazy.

The --use_sqlite parameter did not even solve my problem. Inserts into the database were really slow, around 10 per second. A nightmare.

OK, OK, it is an SDK, it emulates the server, it is meant for developing your application, and speed is not the most important thing. I agree. But if you have to wait 10 minutes to insert 1000 data entries, you can get desperate. If it were 1 million in 10 minutes I would understand; it would still be slow, but I would understand. But 1000 inserts? 10 minutes? Are we all crazy?

Anyway, since SQLite is usually a good database that works great, I thought I could improve that speed. Looking at the SQLite FAQ I found the following question:

(19) INSERT is really slow – I can only do few dozen INSERTs per second

Basically, SQLite should be fine doing 50,000 inserts per second on a normal computer, BUT to guarantee data integrity it blocks each insert until it is sure the data has been written to disk, so your data remains safe if your computer hangs.

This is good news, because I don’t care about the data in my local development environment: in the event of a crash I can rebuild it in a minute instead of an hour.

Finally, the solution is to use “PRAGMA synchronous=OFF”, and the only way to apply this pragma is to add a line to the Google App Engine SDK, right after the SQLite database is initialized.

So you’ll have to edit the following file:

google_appengine/google/appengine/datastore/datastore_sqlite_stub.py

Look for the “__init__” constructor and, right after the connection to the database is initialized (“self.__connection = sqlite3.connect [...]”), place the following statement:

self.__connection.execute('PRAGMA synchronous=OFF')

Changing that line I went from about 10 inserts/second to more than 100 inserts/second. More than a 10x gain for changing a single line. Great!
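
If you want to see the effect of the pragma outside the SDK, here is a minimal sketch using only the standard sqlite3 module. The function name timed_inserts, the table name and the number of rows are my own choices for illustration, and the timings will depend heavily on your disk and operating system, so take it as a rough experiment rather than a benchmark:

```python
import os
import sqlite3
import tempfile
import time


def timed_inserts(synchronous, rows=200):
    """Insert `rows` rows, committing after each one.

    Returns (elapsed_seconds, rows_stored). `synchronous` is the value
    passed to PRAGMA synchronous, for example "FULL" or "OFF".
    """
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        conn = sqlite3.connect(path)
        # Same statement as the one added to the SDK stub
        conn.execute("PRAGMA synchronous=%s" % synchronous)
        conn.execute("CREATE TABLE entries (id INTEGER PRIMARY KEY, value TEXT)")
        start = time.time()
        for i in range(rows):
            conn.execute("INSERT INTO entries (value) VALUES (?)", ("row %d" % i,))
            conn.commit()  # one transaction per insert: the slow case
        elapsed = time.time() - start
        stored = conn.execute("SELECT COUNT(*) FROM entries").fetchone()[0]
        conn.close()
        return elapsed, stored
    finally:
        os.remove(path)


if __name__ == "__main__":
    for mode in ("FULL", "OFF"):
        elapsed, stored = timed_inserts(mode)
        print("synchronous=%s: %d inserts in %.2f seconds" % (mode, stored, elapsed))
```

Remember that with synchronous=OFF a crash or power loss can corrupt the database file, which is only acceptable here because the local data can be rebuilt.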

Now I can continue with what I was working on without going crazy.


Manso Trick: Easy way to generate random hex string in Python

Generating a random hexadecimal string is one of those things you have to do from time to time.
As far as I know, the easiest way to do this in Python is to import the uuid module and run the following code:
uuid.uuid4().hex

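The call above gives us a string of 32 hexadecimal characters (16 bytes). Note that a version 4 UUID fixes 6 of its 128 bits for the version and variant, so 122 bits are actually random.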
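
One other approach, as a small sketch: ask the operating system for random bytes with os.urandom and convert them to hex with binascii.hexlify. The function names random_hex_uuid and random_hex are mine, just for illustration. This version lets you choose the length, and all the bits are random:

```python
import binascii
import os
import uuid


def random_hex_uuid():
    """32 hex characters taken from a random (version 4) UUID."""
    return uuid.uuid4().hex


def random_hex(nbytes=16):
    """2 * nbytes hex characters built from operating system randomness."""
    return binascii.hexlify(os.urandom(nbytes)).decode("ascii")


if __name__ == "__main__":
    print(random_hex_uuid())
    print(random_hex())
    print(random_hex(4))  # 8 hex characters
```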

Do you know any other approach?


MansoTrick: Convert 24h time string to 12h AM/PM format (Python)

Manso trick to convert from 24h time format to 12h (AM/PM) time format in Python


def ampmformat (hhmmss):
  """
    This method converts time in 24h format to 12h format
    Example:   "00:32" is "12:32 AM"
               "13:33" is "01:33 PM"
  """
  ampm = hhmmss.split (":")
  if (len(ampm) == 0) or (len(ampm) > 3):
    return hhmmss

  # is AM? from [00:00, 12:00[
  hour = int(ampm[0]) % 24
  isam = (hour >= 0) and (hour < 12)

  # 00:32 should be 12:32 AM not 00:32
  if isam:
    ampm[0] = ('12' if (hour == 0) else "%02d" % (hour))
  else:
    ampm[0] = ('12' if (hour == 12) else "%02d" % (hour-12))

  return ':'.join (ampm) + (' AM' if isam else ' PM')

Now some examples of the returned values:

ampmformat ("00:00:00") # returns "12:00:00 AM"
ampmformat ("12:00:00") # returns "12:00:00 PM"

ampmformat ("01:23:45") # returns "01:23:45 AM"
ampmformat ("13:23:45") # returns "01:23:45 PM"
ampmformat ("05:43:21") # returns "05:43:21 AM"
ampmformat ("17:43:21") # returns "05:43:21 PM"

ampmformat ("11:59:59") # returns "11:59:59 AM"
ampmformat ("23:59:59") # returns "11:59:59 PM"
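
For comparison, here is a sketch of the same conversion using the standard datetime module. The function name ampmformat_strptime is mine. Unlike the function above, it only accepts the full "HH:MM:SS" form and raises ValueError on anything else, and the AM/PM text produced by %p depends on the locale (it is "AM"/"PM" in the default C locale):

```python
from datetime import datetime


def ampmformat_strptime(hhmmss):
    """Convert "HH:MM:SS" in 24h format to "HH:MM:SS AM/PM" in 12h format."""
    parsed = datetime.strptime(hhmmss, "%H:%M:%S")
    # %I is the zero-padded hour on a 12-hour clock, %p is the AM/PM marker
    return parsed.strftime("%I:%M:%S %p")


if __name__ == "__main__":
    for value in ("00:00:00", "12:00:00", "13:23:45", "23:59:59"):
        print(value, "->", ampmformat_strptime(value))
```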


Google App Engine SDK 1.4.0 released

Version 1.4.0 of the Google App Engine SDK has been released.

I would highlight 3 things in this release:

  • Task queues are now part of the official API; they are not experimental anymore ;)
  • The maximum execution time for cron tasks and task queues has been multiplied by 20, from 30 seconds to 600 seconds (10 minutes). This is awesome, although it can be dangerous. We’ll have to use this wisely ;)
  • Finally Channel API is now officially available to everybody.

The Channel API impressed me when I saw it in May. There is a talk about the Channel API from Google I/O; it is a technical talk, and it only makes sense to watch it if you program on GAE or if you want to build an API for Comet.


Manso Trick: Getting the latitude/longitude from an address

For a project soon to be born (I hope), I’ve been doing some quick research into how to obtain the latitude and longitude of a geographical address. For example, to determine the latitude/longitude of a totally random address like “Emerson Street, Palo Alto, California”. BTW, Hi Mariano! Thanks for the stay in the USA last year :D

This is called geocoding, and after spending half a second searching the Internet I found two simple REST APIs. Yahoo! and Google provide their own APIs, and I’m sure a lot of companies out there use them because of their accuracy.

If you want to use the Yahoo! API you need to sign up so they give you an appid before you can start making queries.

On the other hand, Google lets you do this as easily as:

http://maps.googleapis.com/maps/api/geocode/json?address=Emerson+Street,+Palo+Alto,+california&sensor=false

This URL returns a JSON response which can be easily parsed in all languages (you can also ask for the XML, but I sort of hate XML so I prefer JSON).

Apart from returning the latitude and longitude, it also returns the formatted address, which can be great for normalizing the addresses in your database, in case you need such a thing. For example, for the query above, it says the formatted address is:

“Emerson St, Palo Alto, CA, USA”
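
As a rough sketch of how this could be used from Python, the code below builds the query URL and pulls the formatted address and coordinates out of a JSON response. The function names and SAMPLE_RESPONSE are mine; the sample is a trimmed-down stand-in with placeholder coordinates (0.0), not real output from the service, and the code makes no network call. I am also assuming the response layout (status, results, formatted_address, geometry/location/lat/lng) stays as documented by Google:

```python
import json
from urllib.parse import urlencode

GEOCODE_URL = "http://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address):
    """Build the geocoding URL for an address (same parameters as the URL above)."""
    query = urlencode([("address", address), ("sensor", "false")])
    return GEOCODE_URL + "?" + query


def extract_location(response_text):
    """Return (formatted_address, lat, lng) from a JSON response, or None."""
    data = json.loads(response_text)
    if data.get("status") != "OK" or not data.get("results"):
        return None
    first = data["results"][0]
    location = first["geometry"]["location"]
    return first["formatted_address"], location["lat"], location["lng"]


# Hypothetical, trimmed-down response: the coordinates are placeholders.
SAMPLE_RESPONSE = json.dumps({
    "status": "OK",
    "results": [{
        "formatted_address": "Emerson St, Palo Alto, CA, USA",
        "geometry": {"location": {"lat": 0.0, "lng": 0.0}},
    }],
})

if __name__ == "__main__":
    print(build_geocode_url("Emerson Street, Palo Alto, california"))
    print(extract_location(SAMPLE_RESPONSE))
```

To query the real service you would fetch the URL (for example with urllib.request.urlopen) and pass the body to extract_location, always within the Terms Of Service.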

On top of that, you should always read the Terms Of Service to make sure your app stays aligned with the fair use of the API, but that’s up to you.

Although less accurate, another interesting project is GeoNames.


Unicode support in Python: Frustrations and Solutions

To be totally honest, I think Unicode support in Python before Python 3 is a f*cking nightmare. I prefer PHP 5.x support for Unicode (= zero support).

One can go crazy: sometimes things work, sometimes things crash somewhere unexpectedly, or they don’t really crash until you want to show some string on the console with a print…

Anyway, while trying to solve my frustrations with Unicode strings in Python, I found a link that explains lots of things and helped me a lot:

Overcoming unicode frustrations in Python 2

The previous link is a must read for people getting UnicodeEncodeError exceptions, like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 654: ordinal not in range(128)
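
To make the error concrete, here is a small sketch that triggers it on purpose and shows two common ways around it: encoding explicitly as UTF-8, or replacing the characters that ASCII cannot represent. The sample text and the function name to_ascii_safe are mine, and the snippet is written for Python 3, where the same exception still appears if you encode to ASCII explicitly:

```python
def to_ascii_safe(text):
    """Return `text` with every non-ASCII character replaced by '?'."""
    return text.encode("ascii", "replace").decode("ascii")


text = u"M\xe1laga"  # contains the same character u'\xe1' as the error above

try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError as error:
    # The 'ascii' codec cannot represent u'\xe1'
    failed = True
    message = str(error)

# Encoding explicitly with UTF-8 always works and is reversible
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

if __name__ == "__main__":
    print(failed, message)
    print(utf8_bytes)
    print(to_ascii_safe(text))
```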

Finally, my hope is that Python 3 solves all the evil in the world of Unicode… but I need to see it working to truly believe what I’ve been reading so far…


Serializing native data in Python

For a project I’m currently working on I needed to serialize some structures in Python. The only premise was that they were made only of basic/native Python types, such as lists, dicts, strings, integers, …

The things I was considering were:

  • compact representation
  • fast serialization/deserialization
  • human-readable / human-editable (a must)

So after thinking for a while, I found 3 possible solutions to serialize arbitrary data:

  • pickle/cPickle module for serializing in the internal pickle format which comes with python by default
  • simplejson for serializing in JSON format which is human readable
  • PyYAML for serializing in YAML format which is human readable

The next step was to get a huge amount of data and test both the speed and the length of the resulting data. Below are the speed results for two data sets:

Python Serialization Speed Test (Tiny Dataset)

Python Serialization Speed Test (Huge Dataset)

The bars above represent the time in seconds it took to encode/decode the dataset 100 times (although the numbers go from 0 to 100, they are absolute times, not percentages).

Apart from the speed, I also measured the length of the serialized data obtained:

  • yaml tiny = 32604 // huge = 912912
  • json tiny = 31305 // huge = 876540
  • pickle-raw tiny = 34504 // huge = 986531
  • pickle-highest tiny = 37541 // huge = 1101480

Conclusion:

According to the results, JSON serialization/deserialization is the best choice. It encodes/decodes slightly slower than the pickle module on huge datasets (about 4% slower, not that much), and the resulting string is not only human-readable but also smaller than the pickle- and YAML-encoded strings.

All tests were run on Python 2.5.

The code

This is a fragment of the code used to get the speed results:

    start = time.time()
    for i in range (0, 100):
      yaml.safe_load (yaml.safe_dump (data))
    self.write ("YAML TIME = %.2f\n" % (time.time() - start))
 
    start = time.time()
    for i in range (0, 100):
      JSONDecoder().decode (JSONEncoder().encode (data))
    self.write ("JSON TIME = %.2f\n" % (time.time() - start))
 
    start = time.time()
    for i in range (0, 100):
      pickle.loads (pickle.dumps (data, pickle.HIGHEST_PROTOCOL))
    self.write ("PICKLE TIME = %.2f\n" % (time.time() - start))
 
    start = time.time()
    for i in range (0, 100):
      pickle.loads (pickle.dumps (data, 0))
    self.write ("PICKLE-RAW TIME = %.2f\n" % (time.time() - start))
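
If you want to repeat a similar measurement today, the following is a self-contained sketch of the same idea. It is not the original test: it assumes Python 3, uses the standard json module instead of simplejson, leaves YAML out so that it has no external dependencies, and the dataset built by make_dataset is a made-up one, so the numbers will not match the ones above:

```python
import json
import pickle
import time


def make_dataset(items=200):
    """Build a made-up dataset using only native types (dicts, lists, str, int)."""
    return [{"id": i, "name": "item %d" % i, "tags": ["a", "b", i]} for i in range(items)]


def benchmark(encode, decode, data, rounds=100):
    """Encode/decode `data` `rounds` times; return (seconds, encoded length, decoded)."""
    start = time.time()
    for _ in range(rounds):
        decoded = decode(encode(data))
    elapsed = time.time() - start
    return elapsed, len(encode(data)), decoded


CODECS = [
    ("JSON", json.dumps, json.loads),
    ("PICKLE", lambda d: pickle.dumps(d, pickle.HIGHEST_PROTOCOL), pickle.loads),
    ("PICKLE-RAW", lambda d: pickle.dumps(d, 0), pickle.loads),
]

if __name__ == "__main__":
    data = make_dataset()
    for name, encode, decode in CODECS:
        elapsed, length, _ = benchmark(encode, decode, data)
        print("%s TIME = %.2f, LENGTH = %d" % (name, elapsed, length))
```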


bbcodeutils: BBCode parser and BBCode to HTML for Python

Yesterday I was looking for a Python module able to parse Bulletin Board Code (bbcode to its friends) or to transform bbcode into HTML.

After looking on the Internet I decided to create bbcodeutils, a Python module to parse, generate and transform bbcode. I created it in a way that I think is really simple to use.

Inside bbcodeutils you will find the following classes:

  • bbcodeparser: parses bbcode and fixes invalid strings
  • bbcode2html: gets a parsed bbcode string and generates HTML from it
  • bbcodebuilder: good to generate BBCode programmatically

Here you can find the link to download bbcodeutils v1.0. You can use this module freely (see readme.txt for details).

Below are some examples of the most common operations.

Examples:

Before you try any example I’m assuming you will do something to load the bbcodeutils module, like:

from bbcodeutils import bbcodebuilder, bbcodeparser

Parsing bbcode strings

Parsing a bbcode string can be achieved using the bbcodeparser constructor, or by calling the parse method on a bbcodeparser object:

> bbcode = bbcodeparser('[b]bold string[/b]')

or:

> bbcode = bbcodeparser()
> bbcode.parse ('[b]first string[/b]')

This is just to illustrate how to parse a string, but this won’t produce any output at all. What you are probably looking for is to fix an invalid bbcode string or to transform a bbcode string into valid HTML. Here we go…

Fix an invalid BBCode string

To fix an invalid bbcode string you just have to parse it and then either stringify the parser object or call the bbcode method.

> bbcode = bbcodeparser("This is [b]bold and [i]this bold and italics[/b].[/color]")
> str(bbcode)
"This is [b]bold and [i]this bold and italics[/i][/b]."
 
> bbcode.bbcode()
"This is [b]bold and [i]this bold and italics[/i][/b]."

Please note that the [/color] tag is removed and the closing [/i] tag is added properly.

Transform bbcode to HTML

The class bbcodeparser implements an html method, which uses bbcode2html internally. Calling this method returns valid HTML code.

> bbcode = bbcodeparser('[b]bold string[/b]')
> bbcode.html()
"<b>bold string</b>"

Create BBCode programmatically

> bbcode = bbcodebuilder()
> bbcode.b('string in bold')
"[b]string in bold[/b]"
 
> bbcode.url('http://www.codigomanso.com/')
"[url]http://www.codigomanso.com/[/url]"
 
> bbcode.url('Welcome to Codigo Manso', 'http://www.codigomanso.com/')
"[url=http://www.codigomanso.com/]Welcome to Codigo Manso[/url]"
 
> bbcode.list('item 1', 'item 2')
"[list]
   [*]item 1
   [*]item 2
[/list]"

Common operations in one line

Fix invalid BBCode in one line:

> str(bbcodeparser("[color=red]Fix this [b]string"))
"[color=red]Fix this [b]string[/b][/color]"

Transform BBCode to HTML in one line:

> bbcodeparser("[color=red]Fix this [b]string").html()
'<span style="color: red;">Fix this <b>string</b></span>'

That’s it!

If you have any questions, feel free to leave a comment on this post.


[SOLVED] Add Unique Constraints to Google App Engine databases

The problem:

Google App Engine rules! The truth is that I’m starting to feel comfortable programming in Python, although I still prefer curly braces for delimiting blocks.

Anyway, the datastore used by Google is superpowerful and supersimple to use, but it has some limitations. With the App Engine SDK you can easily say which attributes you want to index (indexed = True), which ones you want to be required (required = True), and which is the default value (default="whatever"), but you cannot add a unique constraint on an attribute or set of attributes (there is no unique=True).

The typical case is a nick or e-mail field in a database. You don’t want two users to share the same nick or the same e-mail. You want all users to be identified by their nicks or their e-mails.

What’s the problem? As I said, you can’t do that using the Google App Engine SDK… fortunately the solution is really easy ;)

The solution:

The only thing you have to do is override put inside your model and raise an exception when a new element that violates the unique constraint is about to be added to the database.

In order to provide a simple example, let’s create a model named User using name, email, language and creation_date as the attributes:

class User (db.Model): 
   creation_date = db.DateTimeProperty (auto_now_add=True)
   name          = db.StringProperty (required = True)
   email         = db.EmailProperty (required = True)
   language      = db.StringProperty (default="en")

Now, to reproduce the problem in GAE, the following code will create two different entries:

User (name="Mark Johnson", email="test@codigomanso.com").put()
User (name="JohnMarkinson", email="test@codigomanso.com", language="es").put()

Now let’s add a constraint to forbid two users from sharing the same e-mail. With this constraint the second put should fail or raise an exception. How do we do that? It’s as simple as it seems. We just need to redefine put so that, when a new element is about to be added to the database, we check whether it violates our rules and raise an exception if it does.

The new class with the constraints would be as follows:

class User (db.Model): 
   creation_date = db.DateTimeProperty (auto_now_add=True)
   name          = db.StringProperty (required = True)
   email         = db.EmailProperty (required = True)
   language      = db.StringProperty (default="en")
 
   def put (self):
     # Make sure e-mails are unique for each user
     if (not self.is_saved()) and (User.gql ('WHERE email = :1', self.email).count() > 0):
       raise DuplicatedInstanceError ('User.email', self.email)
 
     # call the parent method
     return db.Model.put (self)

We needed only 3 lines of code (4 counting the def line)… it can’t be easier

Some inconveniences

After thinking about it for a while, this implementation is far from perfect, although it will work in most cases. I see two problems:

  • You have to make an extra query before inserting an item (updates behave as always)
  • With this solution you can still end up with two users having the same e-mail. How can that be? Have you heard about race conditions? The count-put operation is not atomic, so if two users try to register the same e-mail at almost the same time (with two instances of the application running), both counts can run at the same time and return zero, and then neither of them will raise an exception.

The first problem is not a problem at all, just something to be aware of.

The second problem can be a huge one. Two solutions come to mind right now: one is to run the count-put in a transaction for new instances (transactions are atomic), and the other is to put a mutex around the count-put. Personally, I’m not worried about this, but you need to know that it can happen.
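
To illustrate the mutex idea, here is a minimal sketch in plain Python, outside App Engine. UserRegistry and DuplicateEmailError are hypothetical names, and the set of e-mails stands in for the datastore. Be aware that a threading.Lock only protects threads inside a single process, so this does not solve the problem across several application instances; it only shows why the check and the insert have to happen as one atomic step:

```python
import threading


class DuplicateEmailError(Exception):
    """Raised when an e-mail is already registered (hypothetical name)."""


class UserRegistry(object):
    """In-memory stand-in for the datastore, guarded by a mutex."""

    def __init__(self):
        self._emails = set()
        self._lock = threading.Lock()

    def register(self, email):
        # The check ("count") and the insert ("put") run while holding the
        # lock, so no other thread can slip in between the two steps.
        with self._lock:
            if email in self._emails:
                raise DuplicateEmailError(email)
            self._emails.add(email)


if __name__ == "__main__":
    registry = UserRegistry()
    registry.register("test@codigomanso.com")
    try:
        registry.register("test@codigomanso.com")
    except DuplicateEmailError as error:
        print("duplicated e-mail: %s" % error)
```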

Appendix: DuplicatedInstanceError

If you are smart, you have probably realized that I raised an exception of type DuplicatedInstanceError. This exception does not exist in the SDK; I created it to be able to detect duplicated instances. My implementation is below, but you can use whatever you want:

class DuplicatedInstanceError (Exception):
  def __init__(self, attrpath, value = None):
    '''
    Accept a dict of attribute and value pairs, or just one attribute and a value:
 
    Example:
       # user unique constraint has been violated (User.nick to be unique)
       DuplicatedInstanceError ('User.nick', 'jsmith')
 
       # student unique constraint has been violated (student nicks are fine for different schools)
       DuplicatedInstanceError ({
         'Student.nick'   : 'jsmith',
         'Student.school' : 'demo-primary-school'
       })
    '''
    self.attrpath = ''
 
    # accept a dict as the only parameter
    if isinstance (attrpath, dict):
      for (k, v) in attrpath.items():
        self.attrpath += ', ' + str (k) + ' = ' + str(v)
      self.attrpath = self.attrpath[2:]
    # accept a couple of strings
    else:
      self.attrpath = str(attrpath) + ' = ' + str(value) 
 
    self.attrpath = '(' + self.attrpath + ')'
    return
 
  def __str__ (self):
    return str (self.attrpath)

That’s all!


[SOLVED] Number pad is not working in ubuntu

It’s the second time that I have had to solve the problem of the number pad not working in Ubuntu. From what I’ve seen in some forums, it happens in other Linux distributions as well.

The solution in 4 simple steps:
– Open System -> Preferences -> Keyboard
– Click on the Mouse Keys tab
– Deselect "Pointer can be controlled using the keypad"
– And, finally, just click on Close
