Nov 14

Today I had a small problem removing a byte-order mark (BOM) from a UTF-8 encoded CSV file. The BOM came from a stupid application written in Microsoft Visual Basic, which cannot do a simple export of a data spreadsheet without adding a byte-order mark.

I want to use the CSV file, which stores employee information for my company, to create a sortable HTML table with PHP5. So before the file can be read into a string variable and processed, I had to remove the byte-order mark. For this little task I wrote a small function:

// Strip a leading UTF-8 byte-order mark (bytes EF BB BF) from $string, if present.
function rmBOM($string) {
    if (substr($string, 0, 3) === pack("CCC", 0xef, 0xbb, 0xbf)) {
        $string = substr($string, 3);
    }
    return $string;
}
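
As a quick sanity check, the value the function compares against can be inspected on its own. The snippet below is only an illustrative sketch (the variable name $bom is mine); it shows that the pack() call yields exactly the three bytes EF BB BF, the same as the escaped string literal "\xEF\xBB\xBF":

```php
<?php
// Pack three unsigned chars ("CCC") into a binary string.
$bom = pack("CCC", 0xef, 0xbb, 0xbf);

var_dump(strlen($bom));            // int(3)
var_dump(bin2hex($bom));           // string(6) "efbbbf"
var_dump($bom === "\xEF\xBB\xBF"); // bool(true)
```

So the literal "\xEF\xBB\xBF" could be used in place of the pack() call if that reads better.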

The arguments after the format string in the pack() call are the three bytes of the BOM in a UTF-8 encoded file, written in hexadecimal. To simply cut off the BOM, I read the file into a string, remove the byte-order mark and write the string back to the file:

$string = file_get_contents('/full/path/to/utf8-file.csv');
$string = rmBOM($string);
file_put_contents('/full/path/to/utf8-file.csv', $string);
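
The three lines above rewrite the file even when it has no BOM at all. A slightly more careful variant might look like the sketch below. The helper name stripBomFromFile() is hypothetical and not part of the code above; it assumes the file is small enough to be held in memory, and it only writes the file back when a BOM was actually found:

```php
<?php
// Hypothetical helper: removes a leading UTF-8 BOM from the file at $path.
// Returns true if a BOM was found and removed, false otherwise
// (file unreadable, no BOM present, or writing failed).
function stripBomFromFile($path) {
    $bom = "\xEF\xBB\xBF"; // same bytes as pack("CCC", 0xef, 0xbb, 0xbf)

    $contents = file_get_contents($path);
    if ($contents === false) {
        return false; // file could not be read
    }
    if (substr($contents, 0, 3) !== $bom) {
        return false; // no BOM, leave the file untouched
    }
    // Write everything after the three BOM bytes back to the same file.
    return file_put_contents($path, substr($contents, 3)) !== false;
}
```

Calling it a second time on the same file returns false, because the BOM is already gone.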

PHP6 is supposed to come with Unicode support and should handle UTF-8 encoded files with a BOM correctly (PHP Bug #22108).

written by phi.mic