I hope PHP 6 will solve all my problems with UTF-8, but it has not been released yet, and one has to keep coding with the tools one has available.
I will probably share the string manipulation class that I’ve been coding (which is, of course, 100% compatible with UTF-8).
For now, I’m going to share a simple function to split a UTF-8 string into an array of strings. This is not something you want to do all over your code, because arrays consume a HUGE amount of memory in PHP, but sometimes it can be useful for specific purposes.
Converting a UTF-8 string into an array, in any version before PHP 6, has two big advantages:
- each element of the array holds exactly one character, so $array[$i] gives you what $string[$i] would give you in a single-byte string (on a UTF-8 string, $string[$i] only returns one byte)
- count over the array is equivalent to mb_strlen over the original string
The downsides are the time spent on the conversion and the memory consumption (as you will be using an array instead of a native string).
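To make the two advantages concrete, here is a minimal sketch. The helper name splitUtf8 is hypothetical and only exists for this illustration; it assumes PHP 7 or later with the mbstring extension loaded.

```php
<?php
// Hypothetical helper for illustration: one array element per UTF-8 character.
function splitUtf8(string $text): array
{
    $chars = [];
    $length = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        $chars[] = mb_substr($text, $i, 1, 'UTF-8');
    }
    return $chars;
}

$text  = "ma\xC3\xB1ana";   // "mañana": the ñ takes two bytes in UTF-8
$chars = splitUtf8($text);

var_dump(strlen($text));              // int(7): bytes, not characters
var_dump(mb_strlen($text, 'UTF-8'));  // int(6): characters
var_dump(count($chars));              // int(6): same as mb_strlen
var_dump(bin2hex($text[2]));          // string(2) "c3": only the first byte of ñ
var_dump($chars[2] === "\xC3\xB1");   // bool(true): the whole ñ
```

Indexing the raw string cuts the ñ in half, while indexing the array returns the complete character.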
Yesterday I wrote a couple of functions, and I borrowed a third one that a user had shared in the PHP manual.
These are the three versions in PHP:
    // Version 1: read one character at a time with mb_substr.
    function getCharArray1 ($jstring) {
        $len = mb_strlen ($jstring, 'UTF-8');
        if ($len == 0)
            return array();
        $ret = array();
        for ($i = 0; $i < $len; $i++) {
            $char = mb_substr ($jstring, $i, 1, 'UTF-8');
            array_push ($ret, $char);
        }
        return $ret;
    }

    // code from: http://uk3.php.net/manual/en/function.mb-split.php#80046
    // Version 2: take the first character, then drop it from the string.
    function getCharArray2 ($jstring) {
        $len = mb_strlen ($jstring, 'UTF-8');
        if ($len == 0)
            return array();
        $ret = array();
        while ($len) {
            $ret[] = mb_substr($jstring,0,1,"UTF-8");
            $jstring = mb_substr($jstring,1,$len,"UTF-8");
            // pass the encoding explicitly so the count does not depend
            // on mb_internal_encoding()
            $len = mb_strlen($jstring, "UTF-8");
        }
        return $ret;
    }

    // using mb_check_encoding instead of mb_substr ;)
    // Version 3: append bytes to $char until it is a valid UTF-8 string.
    function getCharArray3 ($jstring) {
        if (mb_strlen ($jstring, 'UTF-8') == 0)
            return array();
        $ret = array ();
        $alen = strlen ($jstring);   // length in bytes, not characters
        $char = '';
        for ($i = 0; $i < $alen; $i++) {
            $char .= $jstring[$i];
            if (mb_check_encoding ($char, 'UTF-8')) {
                array_push ($ret, $char);
                $char = '';
            }
        }
        // bytes that never form valid UTF-8 stay in $char and are not returned
        return $ret;
    }
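The fastest function is the last one, which uses a small trick I came up with (and which in fact worked great): it appends one byte at a time to a buffer and stores the buffer as a character as soon as mb_check_encoding accepts it as valid UTF-8.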
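The byte-by-byte trick in getCharArray3 is easier to see with a trace. The sketch below is not part of the original functions; traceCharBoundaries is a hypothetical name, and it assumes PHP 7.1 or later with mbstring, where mb_check_encoding rejects a truncated multi-byte sequence.

```php
<?php
// Hypothetical tracer: records the buffer (as hex) after every appended byte
// and whether mb_check_encoding accepts it as a complete UTF-8 string.
function traceCharBoundaries(string $input): array
{
    $steps = [];
    $buffer = '';
    $byteCount = strlen($input);          // length in bytes
    for ($i = 0; $i < $byteCount; $i++) {
        $buffer .= $input[$i];
        $complete = mb_check_encoding($buffer, 'UTF-8');
        $steps[] = [bin2hex($buffer), $complete];
        if ($complete) {
            $buffer = '';                 // character finished, start a new one
        }
    }
    return $steps;
}

// "a" followed by the euro sign, which is the three bytes E2 82 AC in UTF-8.
$steps = traceCharBoundaries("a\xE2\x82\xAC");
foreach ($steps as [$hex, $complete]) {
    echo $hex, ' => ', $complete ? 'complete character' : 'incomplete', PHP_EOL;
}
```

The ASCII byte is accepted immediately, while the euro sign is only accepted once its third byte has arrived. That is why every buffer flush lands on a character boundary when the input is valid UTF-8.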
The following chart shows the execution time of each function over a number of iterations on a huge string.
Legend:
- getCharArray1: in red
- getCharArray2: in green
- getCharArray3: in blue
As you can see, getCharArray3 (blue) is the fastest (around 4x faster than getCharArray1, and 6x faster than getCharArray2).
Finally, you can check the benchmark script I used to generate the chart.
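For readers who want to take their own measurements, a timing harness along these lines might be a starting point. It is only a sketch and not the script behind the chart: benchmarkSplitter and the stand-in splitter are hypothetical names, and the sample text and iteration count are arbitrary. It assumes PHP 7 or later with mbstring.

```php
<?php
// Hypothetical harness: runs a splitter several times over the same text and
// reports elapsed time, memory still held by the result, and element count.
function benchmarkSplitter(callable $splitter, string $text, int $iterations): array
{
    $memoryBefore = memory_get_usage();
    $start = microtime(true);
    $result = [];
    for ($i = 0; $i < $iterations; $i++) {
        $result = $splitter($text);
    }
    $elapsed = microtime(true) - $start;
    // $result is still alive here, so the difference includes the array
    $memoryUsed = memory_get_usage() - $memoryBefore;
    return ['seconds' => $elapsed, 'bytes' => $memoryUsed, 'count' => count($result)];
}

// Stand-in splitter so the sketch runs on its own; any function that takes a
// string and returns an array of characters can be passed instead.
$splitter = function (string $text): array {
    $chars = [];
    $length = mb_strlen($text, 'UTF-8');
    for ($i = 0; $i < $length; $i++) {
        $chars[] = mb_substr($text, $i, 1, 'UTF-8');
    }
    return $chars;
};

$sample = str_repeat("ma\xC3\xB1ana ", 200);   // 1400 characters, 1600 bytes
$report = benchmarkSplitter($splitter, $sample, 5);
printf("%d chars, %.4f s, %d bytes\n", $report['count'], $report['seconds'], $report['bytes']);
```

Passing 'getCharArray1', 'getCharArray2' or 'getCharArray3' as the first argument would time each version on the same input once those functions are defined.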
22/02/2009 at 5:42 pm
The chart is made with Open Flash 2!!
Excellent charts. I started using them less than 3 months ago with jQuery and have gotten some very dynamic charts.
23/02/2009 at 2:43 am
I knew I had to include a chart.
John Glazebrook really did put a lot of work into it.
25/03/2009 at 3:15 pm
Thank you!! That was very, very helpful.
26/03/2009 at 12:57 am
You are welcome
23/04/2009 at 5:31 am
This was very helpful Pau, thanks a lot.
The last function is ingenious, really appreciated it. I will use it instead of the classic mb_substr.
25/04/2009 at 11:25 am
You are welcome, Francesco.
Although these functions are helpful, one has to be aware that they consume a huge amount of memory.
06/12/2010 at 6:47 am
Thanks, very very helpful
07/02/2011 at 4:11 pm
Hi,
your getCharArray3 is buggy: it returns an array with empty values for supplied numbers. getCharArray1 and getCharArray2 are slower, but work correctly.
cheers
lubos
21/02/2011 at 1:40 pm
@lubosdz could you provide an example string that fails on getCharArray3?
02/02/2012 at 7:44 pm
Wow, incredible post. I will recommend this site to my friends.
02/02/2012 at 8:11 pm
Excellent, I liked it a lot.