
Fixing PHP DOMXPath Unicode issue

I wrote a little script a couple of months back to grab my Friendster friend listings and bubble up the friends who had logged in within the last 24 hours. It worked fine until recently, when some of the usernames started showing up garbled. I checked their profiles and found that a few of them use Chinese and Korean characters (Unicode).

The relevant parts of the script are shown below. It starts by grabbing the first page of my Friendster friend list and runs a regular expression match against the returned content, using a pattern that matches the paging links for pages 1 to 999. Its purpose is to get the total number of pages. The script then loops over every page, grabs the returned content, strips the unused HTML and stores what is left in the $content variable, which is appended to $fullContent. At this point the content is a mix of ASCII and Unicode characters, but it displays correctly when output directly.

 
  ..
  $pattern = '<a href="/friends/'.$friendsterID.'/[0-9]{1,3}" />';
  $null = preg_match_all($pattern, $content, $pages);
  ..
  $getUrl = "friends/$friendsterID/$i";
  ..
  //start looping
  $fp = fsockopen ("www.friendster.com", 80, $errno, $errstr, 30);
  fputs ($fp, "GET /$getUrl HTTP/1.0\r\nHost: www.friendster.com\r\nUser-Agent: MS Internet Explorer\r\n\r\n");
  while (!feof($fp)) {
      $content .= fgets($fp,1024);
  }
  ..
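  // eregi() was removed in PHP 7; preg_match() with the i flag replaces it,
  // and the s flag lets . span line breaks
  $null = preg_match("~$fsStart(.*)$fsEnd~is", $content, $fsArray);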
  $content = $fsArray[1];
  ..
  $fullContent .= $content;
  //end looping
  ..
  $dom = new DOMDocument();
  @$dom->loadHTML($fullContent );
  $xpath = new DOMXPath($dom);
  $xpath_query = '//html/body/div/div';
  $entries = $xpath->query($xpath_query);
  ..
  $user = $entry->nodeValue;
  ..
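
The page-counting step described earlier can be tried on its own, without touching the network. This is only a minimal sketch: the function name count_friend_pages, the sample markup and the ID 12345 are all made up for illustration, and the real Friendster HTML may well have looked different. It adds a capture group around the page number so the highest page can be read straight from the matches.

```php
<?php
// Return the highest page number found in the paging links, or 1 if none.
// (Hypothetical helper, not part of the original script.)
function count_friend_pages(string $content, string $friendsterID): int
{
    // '~' is the delimiter so the slashes in the URL need no escaping;
    // the parentheses capture the page number itself.
    $pattern = '~<a href="/friends/' . preg_quote($friendsterID, '~') . '/([0-9]{1,3})"~';
    preg_match_all($pattern, $content, $pages);

    // $pages[1] holds the captured page numbers as strings.
    return $pages[1] ? max(array_map('intval', $pages[1])) : 1;
}

// Made-up markup and ID, for illustration only.
$sample = '<a href="/friends/12345/1">1</a> <a href="/friends/12345/2">2</a> '
        . '<a href="/friends/12345/3">3</a>';
echo count_friend_pages($sample, '12345'), "\n"; // prints 3
```

Taking the maximum rather than counting the matches keeps the result right even if the same paging link appears twice on the page.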
 

So I checked further. The content, with its ASCII and Unicode characters intermingled, is fed into DOMDocument and queried with DOMXPath. The result of the query is stored in the $entries variable. The problem appears when we display the value of each entry: the text is garbled wherever it contains a Unicode character.

I am not sure what went wrong, since DOMDocument and DOMXPath are both UTF-8 friendly, so I tried out a few methods. One workable solution is to convert the Unicode characters into their equivalent HTML entities before feeding them into DOMDocument and DOMXPath. The following function does just that: it tells plain single-byte characters apart from multi-byte UTF-8 sequences and converts only the latter.

 
function smarty_modifier_utf8 ($source) {
    // array used to figure what number to decrement from character order value 
    // according to number of characters used to map unicode to ascii by utf-8 
    $decrement[4] = 240; 
    $decrement[3] = 224; 
    $decrement[2] = 192; 
    $decrement[1] = 0; 
    // the number of bits to shift each charNum by 
    $shift[1][0] = 0; 
    $shift[2][0] = 6; 
    $shift[2][1] = 0; 
    $shift[3][0] = 12; 
    $shift[3][1] = 6; 
    $shift[3][2] = 0; 
    $shift[4][0] = 18; 
    $shift[4][1] = 12; 
    $shift[4][2] = 6; 
    $shift[4][3] = 0; 
    $pos = 0; 
    $len = strlen ($source); 
    $encodedString = ''; 
    while ($pos < $len) {
        $asciiPos = ord (substr ($source, $pos, 1)); 
        //only modify characters not in normal charsets 
        //to reduce pagesize and runtime 
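        if (($asciiPos >= 240) && ($asciiPos <= 247)) {
            // 4 chars representing one unicode character 
            $thisLetter = substr ($source, $pos, 4); 
            $pos += 4; 
        } 
        else if (($asciiPos >= 224) && ($asciiPos <= 239)) {
            // 3 chars representing one unicode character 
            $thisLetter = substr ($source, $pos, 3); 
            $pos += 3; 
        } 
        else if (($asciiPos >= 192) && ($asciiPos <= 223)) {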
            // 2 chars representing one unicode character 
            $thisLetter = substr ($source, $pos, 2); 
            $pos += 2; 
        } 
        else {
            // 1 char (lower ascii) 
            $thisLetter = substr ($source, $pos, 1); 
            $pos += 1; 
        } 
        // process the string representing the letter to a unicode entity 
        $thisLen = strlen ($thisLetter); 
        $thisPos = 0; 
        $decimalCode = 0; 
        while ($thisPos < $thisLen) {
            $thisCharOrd = ord (substr ($thisLetter, $thisPos, 1)); 
            if ($thisPos == 0) {
                $charNum = intval ($thisCharOrd - $decrement[$thisLen]); 
                $decimalCode += ($charNum << $shift[$thisLen][$thisPos]); 
            } 
            else {
                $charNum = intval ($thisCharOrd - 128); 
                $decimalCode += ($charNum << $shift[$thisLen][$thisPos]); 
            } 
            $thisPos++; 
        } 
        // single-byte characters pass through unchanged so the HTML markup survives; 
        // only multi-byte sequences become numeric entities 
        if ($thisLen == 1)
            $encodedLetter = $thisLetter; 
        else
            $encodedLetter = "&#". str_pad($decimalCode, 5, "0", STR_PAD_LEFT) . ';'; 
        $encodedString .= $encodedLetter; 
    } 
    return $encodedString; 
}
 
Source : http://smarty.incutio.com/?page=UTF8Encoding
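
To see what the $decrement and $shift tables in the function are doing, it helps to work through one character by hand. The sketch below is my own condensed illustration, not part of the Smarty code; the helper name utf8_sequence_to_codepoint is hypothetical. It applies the same subtract-and-shift arithmetic to the three UTF-8 bytes of the Chinese character U+4E2D: (228 - 224) shifted left by 12 gives 16384, (184 - 128) shifted left by 6 gives 3584, and (173 - 128) gives 45, which add up to 20013.

```php
<?php
// Decode a single UTF-8 sequence into its code point using the same
// subtract-and-shift steps as the function above.
// (Hypothetical helper for illustration; it does no validation.)
function utf8_sequence_to_codepoint(string $bytes): int
{
    // Value to subtract from the lead byte, indexed by sequence length.
    $decrement = [1 => 0, 2 => 192, 3 => 224, 4 => 240];
    $len = strlen($bytes);

    // Lead byte: strip the length marker and shift into the top position.
    $code = (ord($bytes[0]) - $decrement[$len]) << (6 * ($len - 1));

    // Continuation bytes: strip the 10xxxxxx marker (128); each carries 6 bits.
    for ($i = 1; $i < $len; $i++) {
        $code += (ord($bytes[$i]) - 128) << (6 * ($len - 1 - $i));
    }
    return $code;
}

$zhong = "\xE4\xB8\xAD"; // the three UTF-8 bytes of U+4E2D
echo '&#' . utf8_sequence_to_codepoint($zhong) . ';', "\n"; // prints &#20013;
```

The shift amounts 18, 12, 6 and 0 in the original table are simply 6 bits per remaining byte, which is why the loop here can compute them instead of looking them up.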
 
 
  ..
  $fullContent = smarty_modifier_utf8($fullContent);
  ..
 

Now my Friendster listing page shows those Unicode characters correctly.
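
For what it is worth, PHP builds that include the mbstring extension can do the same conversion with a built-in function. The following is a minimal sketch under that assumption (it also needs the dom extension). The wrapper name non_ascii_to_entities and the sample fragment are made up for illustration, and I have not run it against real Friendster pages. It relies on mb_encode_numericentity() to turn every character from U+0080 upwards into a decimal entity while leaving ASCII, and therefore the markup, alone.

```php
<?php
// Assumes the mbstring and dom extensions are loaded.
// Convert every non-ASCII character to a decimal entity; ASCII
// (including the markup itself) passes through unchanged.
// (Hypothetical wrapper name, for illustration.)
function non_ascii_to_entities(string $html): string
{
    // Conversion map: [first code point, last code point, offset, mask]
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($html, $map, 'UTF-8');
}

// Made-up fragment containing two Chinese characters (U+4E2D, U+6587).
$fragment = "<div><div>\xE4\xB8\xAD\xE6\x96\x87 user</div></div>";
$encoded = non_ascii_to_entities($fragment);
echo $encoded, "\n"; // prints <div><div>&#20013;&#25991; user</div></div>

$dom = new DOMDocument();
@$dom->loadHTML('<html><body>' . $encoded . '</body></html>');
$xpath = new DOMXPath($dom);

$users = [];
foreach ($xpath->query('//html/body/div/div') as $entry) {
    // The parser decodes the entities, so nodeValue is UTF-8 text again.
    $users[] = $entry->nodeValue;
}
echo implode("\n", $users), "\n";
```

Because the entities are pure ASCII, the parser no longer has to guess the encoding of the input, which is the same reason the hand-written function works.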
