<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Codigo Manso &#187; tokenizer</title>
	<atom:link href="http://www.codigomanso.com/en/tag/tokenizador/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.codigomanso.com</link>
	<description>Programming, computing and technology</description>
	<lastBuildDate>Sun, 21 Aug 2011 10:54:29 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Superfast tokenizer in PHP</title>
		<link>http://www.codigomanso.com/en/2008/11/tokenizador-superrapido-en-php/</link>
		<comments>http://www.codigomanso.com/en/2008/11/tokenizador-superrapido-en-php/#comments</comments>
		<pubDate>Thu, 27 Nov 2008 19:00:25 +0000</pubDate>
		<dc:creator>Pau Sanchez</dc:creator>
				<category><![CDATA[PHP]]></category>
		<category><![CDATA[analizador lexico]]></category>
		<category><![CDATA[analizador sintactico]]></category>
		<category><![CDATA[lexer php]]></category>
		<category><![CDATA[parser]]></category>
		<category><![CDATA[php tokenizer]]></category>
		<category><![CDATA[tokenizador]]></category>
		<category><![CDATA[tokenizer]]></category>

		<guid isPermaLink="false">http://www.codigomanso.com/?p=162</guid>
		<description><![CDATA[I am currently working on a PHP library/framework (I think it looks more like a library than a framework, and I think it&#8217;s better this way).
The thing is that for some of the components I needed to parse some data (look at parser definition), and for that task I needed a PHP tokenizer (or PHP [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: justify;">I am currently working on a PHP library/framework (I think it looks more like a library than a framework, and I think it&#8217;s better this way).</p>
<p style="text-align: justify;">The thing is that for some of the components I needed to parse some data (<a href="http://en.wikipedia.org/wiki/Parsing" target="_blank">see the definition of parsing</a>), and for that task I needed a PHP tokenizer (or PHP <a href="http://en.wikipedia.org/wiki/Lexical_analysis">lexer</a>, if you prefer).</p>
<p style="text-align: justify;">As PHP is an interpreted language, the faster the implementation, the better. So I had two options: start from scratch, using the standard string functions that come with PHP (which I suspected would be slow), or use <strong>token_get_all</strong>, which has been available since PHP 4.2.0.</p>
<p style="text-align: justify;">I went with the latter. <strong>token_get_all</strong> is intended to tokenize PHP code, but what I&#8217;ve done is create a wrapper class that encapsulates a call to that function and returns a normalized list of entries, each holding an identifier and the text for that token.</p>
<p style="text-align: justify;">The trick to calling <strong>token_get_all</strong> is to prepend <strong>&#8220;&lt;?php &#8221;</strong> to the string you want to parse and append <strong>&#8220;?&gt;&#8221;</strong> at the end (the trailing &#8220;?&gt;&#8221; is probably optional).</p>
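<p style="text-align: justify;">To see why the prefix matters, here is a minimal sketch (assuming PHP 7 or later; the input <strong>foo + 1</strong> is just an illustrative string). Without the opening tag the whole input should come back as a single inline-HTML token, while with it the input is split into real tokens, even when no closing tag is appended:</p>

```php
<?php
// Without the opening tag the lexer never enters PHP mode, so the whole
// input comes back as one T_INLINE_HTML token.
$without = token_get_all('foo + 1');
echo count($without), ' token: ', token_name($without[0][0]), "\n";

// With the opening tag prepended the input is lexed as PHP code.
// No closing tag is appended here, and the text is still tokenized.
$with = token_get_all('<?php ' . 'foo + 1');
echo count($with), " tokens\n";
```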
<p style="text-align: justify;">If the tokens you need for your own tokenizer are a subset of the tokens that PHP supports, then using this function is your best choice. Otherwise you will have to work a little harder on another implementation (which I think would be slower). Fortunately, I think PHP&#8217;s grammar and tokens are standard enough to cover most people&#8217;s needs.</p>
<p style="text-align: justify;">The problem with <strong>token_get_all</strong> is that it returns an inconsistent array: some items are plain single-character strings, while others are arrays containing a PHP token identifier and the text for that token.</p>
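<p style="text-align: justify;">The mixed shape is easier to see when printed. The sketch below assumes PHP 7.1 or later; <strong>describe_tokens</strong> and the <strong>CHAR</strong> label are hypothetical names used only for this illustration. It flattens both kinds of item into uniform pairs. On current PHP versions the array items also carry a line number as a third element, which is ignored here.</p>

```php
<?php
// Hypothetical helper (not part of the original class): turns the mixed
// output of token_get_all() into uniform [name, text] pairs.
function describe_tokens(string $source): array
{
    $pairs = [];
    // Wrap the input in PHP tags so the lexer treats it as code.
    foreach (token_get_all('<?php ' . $source . '?>') as $token) {
        if (is_string($token)) {
            // Single-character tokens ('+', ';', ...) arrive as plain strings.
            $pairs[] = ['CHAR', $token];
        } else {
            // Everything else is an array: token id, text, line number.
            $pairs[] = [token_name($token[0]), $token[1]];
        }
    }
    return $pairs;
}

// Print one line per token: its name, then its text.
foreach (describe_tokens('foo + "bar"') as [$name, $text]) {
    printf("%-28s %s\n", $name, var_export($text, true));
}
```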
<p style="text-align: left;">Let&#8217;s look at a possible implementation:</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">&lt;?php</span>
<span style="color: #000000; font-weight: bold;">class</span> jtokenizer <span style="color: #009900;">&#123;</span>
  <span style="color: #000000; font-weight: bold;">const</span> TK_UNKNOWN  <span style="color: #339933;">=</span> <span style="color: #208080;">0x0000</span><span style="color: #339933;">;</span>
  <span style="color: #000000; font-weight: bold;">const</span> TK_ID       <span style="color: #339933;">=</span> <span style="color: #208080;">0x0001</span><span style="color: #339933;">;</span>
  <span style="color: #000000; font-weight: bold;">const</span> TK_STRING   <span style="color: #339933;">=</span> <span style="color: #208080;">0x0002</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">const</span> TK_PLUS     <span style="color: #339933;">=</span> <span style="color: #208080;">0x0003</span><span style="color: #339933;">;</span>
  <span style="color: #000000; font-weight: bold;">const</span> TK_MINUS    <span style="color: #339933;">=</span> <span style="color: #208080;">0x0004</span><span style="color: #339933;">;</span>
&nbsp;
  <span style="color: #666666; font-style: italic;">// ...</span>
&nbsp;
  <span style="color: #000000; font-weight: bold;">public</span> static <span style="color: #000000; font-weight: bold;">function</span> tokenize <span style="color: #009900;">&#40;</span><span style="color: #000088;">$string</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #000088;">$tokens</span> <span style="color: #339933;">=</span> <span style="color: #990000;">array</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #000088;">$phptokens</span> <span style="color: #339933;">=</span> <span style="color: #990000;">token_get_all</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'&lt;?php '</span> <span style="color: #339933;">.</span> <span style="color: #000088;">$string</span> <span style="color: #339933;">.</span> <span style="color: #0000ff;">'?&gt;'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
    <span style="color: #b1b100;">foreach</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$phptokens</span> <span style="color: #b1b100;">as</span> <span style="color: #000088;">$ptoken</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #000088;">$id</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">self</span><span style="color: #339933;">::</span><span style="color: #004000;">TK_UNKNOWN</span><span style="color: #339933;">;</span>
      <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #990000;">is_string</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$ptoken</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #000088;">$text</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$ptoken</span><span style="color: #339933;">;</span>
        <span style="color: #b1b100;">switch</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$ptoken</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
          <span style="color: #b1b100;">case</span> <span style="color: #0000ff;">'+'</span><span style="color: #339933;">:</span> <span style="color: #000088;">$id</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">self</span><span style="color: #339933;">::</span><span style="color: #004000;">TK_PLUS</span><span style="color: #339933;">;</span>  <span style="color: #b1b100;">break</span><span style="color: #339933;">;</span>
          <span style="color: #b1b100;">case</span> <span style="color: #0000ff;">'-'</span><span style="color: #339933;">:</span> <span style="color: #000088;">$id</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">self</span><span style="color: #339933;">::</span><span style="color: #004000;">TK_MINUS</span><span style="color: #339933;">;</span> <span style="color: #b1b100;">break</span><span style="color: #339933;">;</span>
&nbsp;
          <span style="color: #666666; font-style: italic;">////////////////////////////////////////</span>
          <span style="color: #666666; font-style: italic;">// Add more tokens here!</span>
          <span style="color: #666666; font-style: italic;">// E.g: '.', ',', ';', ':', '=', ...</span>
          <span style="color: #666666; font-style: italic;">////////////////////////////////////////</span>
&nbsp;
          <span style="color: #b1b100;">default</span><span style="color: #339933;">:</span> <span style="color: #009933; font-style: italic;">/** handle error here! */</span> <span style="color: #b1b100;">break</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>
      <span style="color: #009900;">&#125;</span>
      <span style="color: #b1b100;">else</span> <span style="color: #009900;">&#123;</span> <span style="color: #666666; font-style: italic;">// this should be an array (tokenid, text)</span>
        <span style="color: #990000;">list</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$tokenid</span><span style="color: #339933;">,</span> <span style="color: #000088;">$text</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">=</span> <span style="color: #000088;">$ptoken</span><span style="color: #339933;">;</span>
        <span style="color: #b1b100;">switch</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$tokenid</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
          <span style="color: #666666; font-style: italic;">// ignore opening/closing tag</span>
          <span style="color: #b1b100;">case</span> T_OPEN_TAG<span style="color: #339933;">:</span>   <span style="color: #000088;">$id</span> <span style="color: #339933;">=</span> <span style="color: #009900; font-weight: bold;">NULL</span><span style="color: #339933;">;</span> <span style="color: #b1b100;">break</span><span style="color: #339933;">;</span>
          <span style="color: #b1b100;">case</span> T_CLOSE_TAG<span style="color: #339933;">:</span>  <span style="color: #000088;">$id</span> <span style="color: #339933;">=</span> <span style="color: #009900; font-weight: bold;">NULL</span><span style="color: #339933;">;</span> <span style="color: #b1b100;">break</span><span style="color: #339933;">;</span>
&nbsp;
          <span style="color: #666666; font-style: italic;">// ignore white spaces</span>
          <span style="color: #b1b100;">case</span> T_WHITESPACE<span style="color: #339933;">:</span> <span style="color: #000088;">$id</span> <span style="color: #339933;">=</span> <span style="color: #009900; font-weight: bold;">NULL</span><span style="color: #339933;">;</span> <span style="color: #b1b100;">break</span><span style="color: #339933;">;</span>
&nbsp;
          <span style="color: #b1b100;">case</span> T_CONSTANT_ENCAPSED_STRING<span style="color: #339933;">:</span>
            <span style="color: #000088;">$id</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">self</span><span style="color: #339933;">::</span><span style="color: #004000;">TK_STRING</span><span style="color: #339933;">;</span>
            <span style="color: #666666; font-style: italic;">// remove the ' or &quot; at the beginning and at the end (substr, because</span>
            <span style="color: #666666; font-style: italic;">// trim would also eat an escaped quote right next to them)</span>
            <span style="color: #000088;">$text</span> <span style="color: #339933;">=</span> <span style="color: #990000;">substr</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$text</span><span style="color: #339933;">,</span> <span style="color: #cc66cc;">1</span><span style="color: #339933;">,</span> <span style="color: #339933;">-</span><span style="color: #cc66cc;">1</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
            <span style="color: #b1b100;">break</span><span style="color: #339933;">;</span>
&nbsp;
          <span style="color: #b1b100;">case</span> T_STRING<span style="color: #339933;">:</span>
            <span style="color: #000088;">$id</span> <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">self</span><span style="color: #339933;">::</span><span style="color: #004000;">TK_ID</span><span style="color: #339933;">;</span>
            <span style="color: #b1b100;">break</span><span style="color: #339933;">;</span>
&nbsp;
          <span style="color: #666666; font-style: italic;">///////////////////////////////////////////</span>
          <span style="color: #666666; font-style: italic;">// Add more tokens here!</span>
          <span style="color: #666666; font-style: italic;">// Get a complete list from:</span>
          <span style="color: #666666; font-style: italic;">// http://uk3.php.net/manual/en/tokens.php</span>
          <span style="color: #666666; font-style: italic;">///////////////////////////////////////////</span>
&nbsp;
          <span style="color: #b1b100;">default</span><span style="color: #339933;">:</span> <span style="color: #009933; font-style: italic;">/** handle error here! */</span> <span style="color: #b1b100;">break</span><span style="color: #339933;">;</span>
        <span style="color: #009900;">&#125;</span>        
      <span style="color: #009900;">&#125;</span>
&nbsp;
      <span style="color: #666666; font-style: italic;">// append the token</span>
      <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$id</span> <span style="color: #339933;">!==</span> <span style="color: #009900; font-weight: bold;">NULL</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #990000;">array_push</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$tokens</span><span style="color: #339933;">,</span> <span style="color: #990000;">array</span> <span style="color: #009900;">&#40;</span><span style="color: #000088;">$id</span><span style="color: #339933;">,</span> <span style="color: #000088;">$text</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #009900;">&#125;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #b1b100;">return</span> <span style="color: #000088;">$tokens</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p style="text-align: justify;">The code above is a really simple skeleton that shows how easy it is to build a parser on top of that native PHP function. There is a lot left to do in that function, of course, but you get the idea, right?</p>
<p style="text-align: justify;">Then, tokenizing a string is as easy as calling <strong>jtokenizer::tokenize ($string)</strong>.</p>
<p style="text-align: justify;">The advantage is that the array returned by the jtokenizer::tokenize method is really easy to handle: in every entry the first element is an ID and the second is always the text. We can also define our own tokens, so if you want to interpret &#8220;return&#8221; as a normal ID, you can do so by returning jtokenizer::TK_ID for it.</p>
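<p style="text-align: justify;">As a rough sketch of that last idea (assuming PHP 7 or later; the standalone <strong>TK_ID</strong> constant and the <strong>identifiers</strong> function are hypothetical stand-ins for the class above), the keyword token T_RETURN can simply share a case with T_STRING:</p>

```php
<?php
// Hypothetical stand-in for jtokenizer::TK_ID from the skeleton above.
const TK_ID = 0x0001;

// Returns [id, text] pairs for identifiers only, treating the PHP keyword
// "return" like any other identifier.
function identifiers(string $source): array
{
    $ids = [];
    foreach (token_get_all('<?php ' . $source) as $token) {
        if (!is_array($token)) {
            continue; // single-character tokens are never identifiers
        }
        switch ($token[0]) {
            case T_STRING: // ordinary names such as "foo"
            case T_RETURN: // the keyword "return", demoted to a plain ID
                $ids[] = [TK_ID, $token[1]];
                break;
        }
    }
    return $ids;
}

var_export(identifiers('return foo'));
```

<p style="text-align: justify;">Any other keyword token from the PHP token list could be demoted the same way by adding one more case.</p>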
<p style="text-align: justify;">I think this example is enough for anybody who wants to write their own parser. And if you do not want to return an array of arrays, feel free to change the code to return whatever suits you better <img src='http://www.codigomanso.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p style="text-align: justify;">For more information, take a look at the PHP documentation:</p>
<ul style="text-align: justify;">
<li><a href="http://uk3.php.net/token_get_all" target="_blank"><strong>token_get_all</strong> function in php.net</a></li>
<li><a href="http://uk.php.net/manual/en/book.tokenizer.php" target="_blank">PHP tokenizer</a></li>
<li><a href="http://uk3.php.net/manual/en/tokens.php" target="_blank">List of tokens returned by <strong>token_get_all</strong> (very useful if you plan to extend the skeleton above)</a></li>
</ul>
<p style="text-align: justify;"><b>Note:</b> If you want a faster tokenizer, use associative arrays instead of switches (see <a href="http://www.codigomanso.com/es/2008/11/php-switch-vs-array-asociativo/">is it better to use associative arrays or switches?</a>; unfortunately there is no English translation yet).</p>
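<p style="text-align: justify;">Here is one way the skeleton might look with lookup tables instead of switches. It is only a sketch, assuming PHP 7 or later; the class name <strong>TableTokenizer</strong> and the two map variables are hypothetical. Whether it is actually faster on your PHP version is something to measure rather than assume.</p>

```php
<?php
// Hypothetical variant of the skeleton class that uses lookup tables.
class TableTokenizer
{
    const TK_UNKNOWN = 0x0000;
    const TK_ID      = 0x0001;
    const TK_STRING  = 0x0002;
    const TK_PLUS    = 0x0003;
    const TK_MINUS   = 0x0004;

    public static function tokenize(string $source): array
    {
        // Single-character tokens, keyed by the character itself.
        $charMap = ['+' => self::TK_PLUS, '-' => self::TK_MINUS];
        // PHP token ids; a null value marks a token to skip.
        $phpMap = [
            T_OPEN_TAG                 => null,
            T_CLOSE_TAG                => null,
            T_WHITESPACE               => null,
            T_CONSTANT_ENCAPSED_STRING => self::TK_STRING,
            T_STRING                   => self::TK_ID,
        ];

        $tokens = [];
        foreach (token_get_all('<?php ' . $source . '?>') as $ptoken) {
            if (is_string($ptoken)) {
                $text = $ptoken;
                $id = $charMap[$text] ?? self::TK_UNKNOWN;
            } else {
                list($tokenid, $text) = $ptoken;
                // array_key_exists() rather than isset(): null is a valid entry.
                $id = array_key_exists($tokenid, $phpMap)
                    ? $phpMap[$tokenid]
                    : self::TK_UNKNOWN;
                if ($id === self::TK_STRING) {
                    $text = substr($text, 1, -1); // drop the surrounding quotes
                }
            }
            if ($id !== null) {
                $tokens[] = [$id, $text];
            }
        }
        return $tokens;
    }
}

var_export(TableTokenizer::tokenize("foo + 'bar'"));
```

<p style="text-align: justify;">Adding a token then means adding one entry to a table instead of one more case.</p>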
]]></content:encoded>
			<wfw:commentRss>http://www.codigomanso.com/en/2008/11/tokenizador-superrapido-en-php/feed/</wfw:commentRss>
		<slash:comments>814</slash:comments>
		</item>
	</channel>
</rss>

