glasswalk3r has asked for the wisdom of the Perl Monks concerning the following question:

Greetings monks,

I'm working on a CSV to XML converter, but I'm having problems reading UTF-8 files that have a BOM character at the beginning of the data.

I'm using the module Text::CSV_XS to read and parse the CSV file. I'm expecting the CSV file to be in UTF-8 and I'm handling that as follows:

$csv_file = $cfg->val( 'General', 'inputDir' ) . '\\' . $csv_file;

my $csv = Text::CSV_XS->new( { binary => 1 } )
    or die "Cannot use CSV: " . Text::CSV_XS->error_diag();

open( my $fh, '<:utf8', $csv_file )
    or die "Cannot read $csv_file: $!\n";

As soon as I run the program, I get the following error:

# CSV_XS ERROR: 2034 - EIF - Loose unescaped quote @ pos 4

The CSV file does have double quotes around the fields, but there are no loose unescaped quotes inside them, contrary to what the error message suggests.

If I remove the double quotes from the records, everything works as expected.

Further testing showed me that if I remove the BOM from the UTF-8 file while keeping the double quotes (which is desirable), the program processes the records without any issue.

While I have a workaround, this seems a bit odd to me. I have had problems with the BOM before when dealing with UTF-16 files, and the fix for those was opening the file with:

open( my $fh, '<:raw:encoding(utf16)', $csv_file ) or die "Cannot read $csv_file: $!\n";

Doing the same, but using utf8 instead of utf16, didn't bring the expected result (no error).

Do you have any hint about that? Should I write specific code to remove the BOM from the beginning of the file? I don't need it anyway.
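For reference, this is the kind of BOM-stripping code I have in mind — a rough sketch only (the helper name is mine, and I switched to the stricter :encoding(UTF-8) layer instead of :utf8):

```perl
use strict;
use warnings;

# Hypothetical helper: open a UTF-8 file and position the handle
# just past a leading BOM, if one is present
sub open_utf8_skip_bom {
    my ($path) = @_;
    open( my $fh, '<:encoding(UTF-8)', $path )
        or die "Cannot read $path: $!\n";

    # U+FEFF is the BOM when it appears as the first character
    my $first = getc $fh;

    # Rewind only when the file did not start with a BOM
    seek $fh, 0, 0
        unless defined $first && $first eq "\x{FEFF}";
    return $fh;
}
```

The returned handle could then be passed straight to Text::CSV_XS->getline, BOM already consumed.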

Thanks,

Alceu Rodrigues de Freitas Junior
---------------------------------
"You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill

Replies are listed 'Best First'.
Re: Problems reading UTF-8 file with BOM
by bart (Canon) on Mar 25, 2010 at 21:04 UTC

      That could be an option. But shouldn't this have a solution in the standard Perl distribution?

      I thought I was doing something wrong, because it did not occur to me that such specific details would be necessary just to read a UTF-8 file. I had never had this kind of problem with UTF-8 files (with or without a BOM) before, but I found some posts here about Text::CSV_XS and UTF-8. It looks like this module cannot deal with UTF-8 at all. Could it be because of the "XS" part of the module?

      Alceu Rodrigues de Freitas Junior
      ---------------------------------
      "You have enemies? Good. That means you've stood up for something, sometime in your life." - Sir Winston Churchill

        That could be an option. But shouldn't this have a solution in the standard Perl distribution?

        There are many problems with including modules in core.

        • A certain level of quality is assumed.
        • A high level of maintenance is demanded.
        • Endorsement is assumed.
        • Presence in core in perpetuity is expected.
        • Even if better alternatives surface.
        • etc

        There are also problems with selecting which modules to include in the core. Keep in mind that Perl is used for a wide variety of applications, and including everything is just not an option.

        The focus is on making it easy to install modules rather than including everything in core.

        You can install it with ppm install File::BOM (ActiveState) or cpan File::BOM (elsewhere). And if you have a distro that requires File::BOM, all you need to do is add one line to your Makefile.PL.
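        A sketch of how File::BOM could feed Text::CSV_XS (the helper sub and its name are mine; open_bom() is File::BOM's documented interface, and it dies if the file cannot be opened):

```perl
use strict;
use warnings;
use File::BOM qw(open_bom);
use Text::CSV_XS;

# Hypothetical helper: open a CSV file, letting File::BOM detect
# and consume any BOM; ':utf8' is the fallback layer used when
# the file has no BOM at all
sub read_csv_rows {
    my ($path) = @_;
    open_bom( my $fh, $path, ':utf8' );

    my $csv = Text::CSV_XS->new( { binary => 1 } )
        or die "Cannot use CSV: " . Text::CSV_XS->error_diag();

    my @rows;
    while ( my $row = $csv->getline($fh) ) {
        push @rows, $row;
    }
    close $fh;
    return \@rows;
}
```

        With this, the BOM never reaches the parser, so the quoted fields parse cleanly.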

        Looks like this module cannot deal with UTF-8 at all.

        It can now. There are some (documented) limits on which characters can be used as quotes and separators, but that's it.
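        For instance, something along these lines should work — a minimal sketch with made-up sample data (the non-ASCII characters are spelled with escapes to keep the snippet ASCII-safe):

```perl
use strict;
use warnings;
use Text::CSV_XS;

# With binary => 1, Text::CSV_XS parses decoded (character) strings,
# including non-ASCII data inside quoted fields
my $csv = Text::CSV_XS->new( { binary => 1 } )
    or die "Cannot use CSV: " . Text::CSV_XS->error_diag();

# qq{"caf\x{e9}","na\x{ef}ve"} is the line  "café","naïve"
$csv->parse( qq{"caf\x{e9}","na\x{ef}ve"} )
    or die "parse failed";
my @fields = $csv->fields;
```

        The quoting and separator characters here are the defaults; only exotic choices for those run into the documented limits.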