Superfast tokenizer in PHP

I am currently working on a PHP library/framework (I think it looks more like a library than a framework, and I think it’s better this way).

For some of its components I needed to parse some data (see the definition of a parser), and for that task I needed a PHP tokenizer (or PHP lexer, if you prefer).

As PHP is an interpreted language, the faster the implementation, the better. So I had two options: start from scratch using the standard string functions that come with PHP (which I expected to be slow), or use token_get_all, which has been available since PHP 4.2.0.

Personally, I tried the latter. token_get_all is intended to tokenize PHP code, but what I did was create a wrapper class that encapsulates a call to that function and returns a normalized list of tokens, each one a pair of identifier and text.

The trick when calling token_get_all is to add “<?php ” at the beginning of the string you want to parse and append “?>” at the end (the “?>” at the end is probably optional).
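To see why the opening tag matters, here is a minimal sketch comparing the two calls. The input string is just a made-up example; without the tag, the PHP lexer treats the whole input as inline HTML.

```php
<?php
// Hypothetical input: any text whose tokens are a subset of PHP's.
$input = 'price + 10';

// Without the opening tag the lexer sees plain HTML, not code.
$raw = token_get_all($input);

// With the wrapper, the same text is split into real PHP tokens.
$wrapped = token_get_all('<?php ' . $input . '?>');

echo count($raw), ' token without the tag: ', token_name($raw[0][0]), "\n";
echo count($wrapped), " tokens with the tag\n";
```

The first and last tokens of the wrapped result are the opening and closing tags themselves, which is why a wrapper has to skip them.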

If the tokens you need for your own tokenizer are a subset of the tokens that PHP supports, then using this function is your best choice. Otherwise you will have to work a little harder on another implementation (which I think would be slower). Fortunately, I think the PHP grammar and tokens are standard enough to cover most people’s needs.

The problem with token_get_all is that it returns an inconsistent array. Some items are plain one-character strings (such as “+” or “;”), while others are arrays containing a PHP token identifier and the text for that token (plus the line number as a third element since PHP 5.2.2).
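A quick way to look at that mixed shape is to dump the tokens of a tiny snippet. This is only a sketch: normalize_token() is a hypothetical helper (not part of PHP) that gives every item the same (id, text) form, using NULL as the id for bare one-character strings.

```php
<?php
// normalize_token() is an illustrative helper, not a PHP built-in.
function normalize_token($token) {
    if (is_string($token)) {
        // One-character tokens such as '+' or ';' arrive as bare strings.
        return array(null, $token);
    }
    // Everything else is array(id, text, line); keep the first two items.
    return array($token[0], $token[1]);
}

foreach (token_get_all('<?php a + "b";') as $token) {
    list($id, $text) = normalize_token($token);
    // token_name() turns a numeric token id into its T_* name.
    $name = ($id === null) ? 'CHAR' : token_name($id);
    echo str_pad($name, 28), var_export($text, true), "\n";
}
```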

Let’s look at a possible implementation:

<?php
class jtokenizer {
  const TK_UNKNOWN  = 0x0000;
  const TK_ID       = 0x0001;
  const TK_STRING   = 0x0002;
 
  const TK_PLUS     = 0x0003;
  const TK_MINUS    = 0x0004;
 
  // ...
 
  public static function tokenize ($string) {
    $tokens = array();
    $phptokens = token_get_all ('<?php ' . $string . '?>');
 
    foreach ($phptokens as $ptoken) {
      $id = self::TK_UNKNOWN;
      if (is_string ($ptoken)) {
        $text = $ptoken;
        switch ($ptoken) {
          case '+': $id = self::TK_PLUS;  break;
          case '-': $id = self::TK_MINUS; break;
 
          ////////////////////////////////////////
          // Add more tokens here!
          // E.g: '.', ',', ';', ':', '=', ...
          ////////////////////////////////////////
 
          default: /** handle error here! */ break;
        }
      }
      else { // this should be an array (tokenid, text)
        list ($tokenid, $text) = $ptoken;
        switch ($tokenid) {
          // ignore opening/closing tag
          case T_OPEN_TAG:   $id = NULL; break;
          case T_CLOSE_TAG:  $id = NULL; break;
 
          // ignore white spaces
          case T_WHITESPACE: $id = NULL; break;
 
          case T_CONSTANT_ENCAPSED_STRING:
            $id = self::TK_STRING;
            // remove the surrounding ' or " (first and last character only;
            // trim() would also eat escaped quotes next to the delimiters)
            $text = substr ($text, 1, -1);
            break;
 
          case T_STRING:
            $id = self::TK_ID;
            break;
 
          ///////////////////////////////////////////
          // Add more tokens here!
          // Get a complete list from:
          // http://uk3.php.net/manual/en/tokens.php
          ///////////////////////////////////////////
 
          default: /** handle error here! */ break;
        }        
      }
 
      // append the token
      if ($id !== NULL) {
        array_push ($tokens, array ($id, $text));
      }
    }
 
    return $tokens;
  }
}

The code above is a really simple skeleton that shows how easy it is to build a tokenizer on top of that native PHP function. There is still a lot to be done in that function, of course, but you get the idea, right?

Then, tokenizing a string is as easy as calling jtokenizer::tokenize ($string).

The advantage is that handling the array returned by the jtokenizer::tokenize method is really easy (the first element is always an ID, and the second is always the text). We can also define our own tokens, so if you want to interpret “return” as a normal ID, you can do so by returning jtokenizer::TK_ID for it.
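As a rough illustration of that last point, the sketch below maps both ordinary identifiers and the “return” keyword to the same ID. The TK_* constants here only mirror the ones in jtokenizer, and classify() is a hypothetical helper, not part of the class above.

```php
<?php
// Hypothetical constants mirroring jtokenizer::TK_UNKNOWN and jtokenizer::TK_ID.
const TK_UNKNOWN = 0x0000;
const TK_ID      = 0x0001;

// classify() is an illustrative helper: it maps a PHP token id to our own id.
function classify($tokenid) {
    switch ($tokenid) {
        case T_STRING:  // ordinary identifiers
        case T_RETURN:  // the "return" keyword, treated as a plain identifier
            return TK_ID;
        default:
            return TK_UNKNOWN;
    }
}

// Print the text of every token that we consider an identifier.
foreach (token_get_all('<?php return value;') as $token) {
    if (is_array($token) && classify($token[0]) === TK_ID) {
        echo $token[1], "\n";
    }
}
```

In jtokenizer itself this would amount to one extra case T_RETURN next to case T_STRING in the second switch.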

I think that example is enough for anybody who wants to write their own parser. Anyway, if you do not want to return an array of arrays, feel free to change the code to return whatever suits you better 😉

For more information, take a look at the PHP documentation for token_get_all and the list of parser tokens linked in the code above.

Note: If you want a faster tokenizer, use associative arrays instead of switches (see “Is it better to use associative arrays or switches?”; unfortunately there is no English translation yet).
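Here is one way the associative-array variant could look. It is a sketch under assumptions: tokenize_with_maps() and its lookup tables are made up for this example, the numeric ids simply repeat the jtokenizer constants, and whether it really beats the switch version is something you should measure on your own data.

```php
<?php
// Table-driven sketch; the function name and tables are assumptions.
function tokenize_with_maps($string) {
    // Ids as in jtokenizer: 0 = unknown, 1 = id, 2 = string, 3 = plus, 4 = minus.
    static $chars = array('+' => 3, '-' => 4);
    static $ids   = array(T_STRING => 1, T_CONSTANT_ENCAPSED_STRING => 2);
    static $skip  = array(T_OPEN_TAG => true, T_CLOSE_TAG => true, T_WHITESPACE => true);

    $tokens = array();
    foreach (token_get_all('<?php ' . $string . '?>') as $ptoken) {
        if (is_string($ptoken)) {
            // One-character token: look it up, fall back to "unknown".
            $tokens[] = array(isset($chars[$ptoken]) ? $chars[$ptoken] : 0, $ptoken);
            continue;
        }
        list($tokenid, $text) = $ptoken;
        if (isset($skip[$tokenid])) {
            continue; // tags and white space are ignored
        }
        if ($tokenid === T_CONSTANT_ENCAPSED_STRING) {
            $text = substr($text, 1, -1); // drop the surrounding quotes
        }
        $tokens[] = array(isset($ids[$tokenid]) ? $ids[$tokenid] : 0, $text);
    }
    return $tokens;
}

print_r(tokenize_with_maps('a + b - "c"'));
```

Adding a new token then means adding one entry to a table instead of one more case to a switch.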

  1. Jean-Marc Fontaine
    23/01/2009 at 5:38 am

    You could take a look at the tokenizer used by PHP_CodeSniffer.
    It works the same way your class does.

  2. Pau Sanchez
    23/01/2009 at 6:26 am

    Jean-Marc, thanks for the information, I’ll take a look 😉
