Convert UTF-8 string into an array of chars in PHP

I hope PHP 6 will solve all my problems with UTF-8, but it has not been released yet, and one has to keep coding with the tools one has available.

I will probably share the string-manipulation class that I’ve been coding (which of course is 100% compatible with UTF-8).

For now, I’m going to share a simple function to split a UTF-8 string into an array of characters. This is not something you want to do all over your code, because arrays consume a HUGE amount of memory in PHP, but sometimes it can be useful for specific purposes.

Converting a UTF-8 string into an array, in any version before PHP 6, has two big advantages:

  • each element of the array holds exactly one character, so $array[$i] works the way $string[$i] does on a single-byte string
  • count on the array is equivalent to mb_strlen on the original string
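
To make the two points concrete, here is a minimal sketch. It assumes the mbstring extension is loaded and the source file is saved as UTF-8, and it splits with preg_split and the /u modifier purely for illustration; that is not one of the three functions discussed in this post.

```php
<?php
// Illustrative only: the sample string and the preg_split-based split are
// assumptions made for this example.
$string = "añb€";                        // 4 characters, 7 bytes in UTF-8
$chars  = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

var_dump(strlen($string));               // number of bytes
var_dump(mb_strlen($string, 'UTF-8'));   // number of characters
var_dump(count($chars));                 // matches the character count
var_dump($chars[1] === 'ñ');             // an array index addresses a whole character
var_dump($string[1] === 'ñ');            // a string index addresses a single byte
```

On a multi-byte string, the array index and the string index disagree, which is the whole reason for building the array.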

The downsides are the time spent on the conversion and the memory consumption (as you will be using an array instead of a native string).
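
If you want a feel for the memory cost, something like the following rough sketch can help. The figures depend heavily on the PHP version and build, so treat it as a way to measure on your own setup rather than as a reference number. The split is done with preg_split only to keep the example short, and the file is assumed to be saved as UTF-8.

```php
<?php
// Rough measurement only: memory_get_usage() reports what PHP's allocator
// has handed out, so results vary between versions and builds.
$string = str_repeat('ñ', 100000);       // 100,000 characters, 200,000 bytes

$before     = memory_get_usage();
$chars      = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);
$arrayBytes = memory_get_usage() - $before;

printf("string: %d bytes, array: about %d bytes\n", strlen($string), $arrayBytes);
```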

Yesterday I wrote a couple of functions, and I borrowed a third one that a user had shared in the PHP manual.

These are the three versions in PHP:

function getCharArray1 ($jstring)
{
  // Number of characters (not bytes) in the string
  $len = mb_strlen ($jstring, 'UTF-8');
  if ($len == 0)
    return array();
 
  $ret = array();
  for ($i = 0; $i < $len; $i++) {
    // Extract the single character at position $i
    $ret[] = mb_substr ($jstring, $i, 1, 'UTF-8');
  }
 
  return $ret;
}
 
// code from: http://uk3.php.net/manual/en/function.mb-split.php#80046
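function getCharArray2 ($jstring)
{
  $len = mb_strlen ($jstring, 'UTF-8');
  if ($len == 0)
    return array();
 
  $ret = array();
  while ($len) {
    // Take the first character, then drop it from the string
    $ret[]   = mb_substr ($jstring, 0, 1, 'UTF-8');
    $jstring = mb_substr ($jstring, 1, $len, 'UTF-8');
    // Pass the encoding explicitly so the count does not depend on
    // mb_internal_encoding()
    $len = mb_strlen ($jstring, 'UTF-8');
  }
  return $ret;
}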
 
// using mb_check_encoding instead of mb_substr ;)
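function getCharArray3 ($jstring)
{
  // Cast first: indexing a non-string (e.g. an integer) with [] does not
  // return its digits, which would produce empty elements
  $jstring = (string) $jstring;
  $alen = strlen ($jstring);   // length in bytes
  if ($alen == 0)
    return array();
 
  $ret  = array ();
  $char = '';
  for ($i = 0; $i < $alen; $i++) {
    // Append one byte at a time; the buffer only becomes valid UTF-8
    // once it holds a complete character. Assumes valid UTF-8 input.
    $char .= $jstring[$i];
    if (mb_check_encoding ($char, 'UTF-8')) {
      $ret[] = $char;
      $char = '';
    }
  }
 
  return $ret;
}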

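The fastest function is the last one, which uses a small trick I came up with (and which in fact worked great): append the string one byte at a time to a buffer, and emit the buffer as a character as soon as mb_check_encoding reports it as valid UTF-8.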
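
To see why checking the encoding byte by byte finds character boundaries, here is a small trace on a single character. It is only a sketch of the idea, assuming the mbstring extension is available; the variable names are made up for this example.

```php
<?php
// Trace of the byte-accumulation idea on one character.
// "€" is encoded as the three bytes E2 82 AC in UTF-8.
$euro   = "\xE2\x82\xAC";
$buffer = '';
$trace  = array();

for ($i = 0; $i < strlen($euro); $i++) {
    $buffer .= $euro[$i];
    // A truncated multi-byte sequence is not valid UTF-8, so the check
    // stays false until the last byte of the character has been appended.
    $trace[] = mb_check_encoding($buffer, 'UTF-8');
}

var_dump($trace);   // false, false, true
```

The buffer validates exactly once, at the end of the character, and that is the moment getCharArray3 pushes it into the result and starts over.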

The following chart shows the execution time of each function when it is run repeatedly over a huge string.


Legend:

  • getCharArray1: in red
  • getCharArray2: in green
  • getCharArray3: in blue

As you can see, getCharArray3 (blue) is the fastest (around 4x faster than getCharArray1, and 6x faster than getCharArray2).

Finally, you can check the benchmark script I’ve used for generating the chart.
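
For readers who just want to time a function on their own data, a minimal harness could look like the sketch below. This is not the benchmark script mentioned above: timeSplitter and splitWithMbSubstr are hypothetical names introduced for this example, and the input size and iteration count are arbitrary.

```php
<?php
// Hypothetical helper, not the benchmark script from this post: runs
// $splitter $iterations times over $input and returns the elapsed
// wall-clock time in seconds.
function timeSplitter($splitter, $input, $iterations)
{
    $start = microtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        call_user_func($splitter, $input);
    }
    return microtime(true) - $start;
}

// Stand-in splitter so the example runs on its own; pass the name of
// getCharArray1, getCharArray2 or getCharArray3 instead to compare them.
function splitWithMbSubstr($string)
{
    $chars = array();
    $len   = mb_strlen($string, 'UTF-8');
    for ($i = 0; $i < $len; $i++) {
        $chars[] = mb_substr($string, $i, 1, 'UTF-8');
    }
    return $chars;
}

$input   = str_repeat('añb€', 250);      // 1,000 characters
$seconds = timeSplitter('splitWithMbSubstr', $input, 10);
printf("splitWithMbSubstr: %.4f s for 10 iterations\n", $seconds);
```

Timings from a harness like this are only comparable when every function is run on the same input and the same machine.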


  1. Jose Luis Anaya
    22/02/2009 at 5:42 pm Permalink

    The chart is made with Open Flash 2!!
    Excellent charts. I started using them less than 3 months ago with jQuery and have gotten some very dynamic charts 🙂 !

  2. Pau Sanchez
    23/02/2009 at 2:43 am Permalink

    I knew I had to include a chart 🙂

    The truth is that John Glazebrook has put a great deal of work into it.

  3. Steve
    25/03/2009 at 3:15 pm Permalink

    Thank you!! That was very, very helpful.

  4. Pau Sanchez
    26/03/2009 at 12:57 am Permalink

    You are welcome 😉

  5. Francesco
    23/04/2009 at 5:31 am Permalink

    This was very helpful Pau, thanks a lot.

    The last function is ingenious, really appreciated it. I will use it instead of the classic mb_substr.

  6. Pau Sanchez
    25/04/2009 at 11:25 am Permalink

    You are welcome Francesco.

    Although these functions are helpful one has to be aware they consume a huge amount of memory 😉

  7. Vlad
    06/12/2010 at 6:47 am Permalink

    Thanks, very very helpful 🙂

  8. lubosdz
    07/02/2011 at 4:11 pm Permalink

    Hi,
    your getCharArray3 is buggy – it returns array with empty values for supplied numbers. getCharArray1 + getCharArray2 are slower, but work correctly.
    cheers
    lubos

  9. Pau Sánchez
    21/02/2011 at 1:40 pm Permalink

    @lubosdz could you provide an example string that fails on getCharArray3?

  10. gilberto
    02/02/2012 at 7:44 pm Permalink

    Wow, incredible post. I will recommend this site to my friends.

  11. gilberto
    02/02/2012 at 8:11 pm Permalink

    Excellent, I liked it a lot.

  12. Huh?
    02/08/2012 at 11:38 pm Permalink

    What’s wrong with using //u in preg_split?

    $chars_array = preg_split('//u', $utf8_string, -1, PREG_SPLIT_NO_EMPTY);