Re^3: Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma?

Let me try to simplify that a bit ...

use strict;
use warnings;
use English qw( -no_match_vars ); # provides $INPUT_RECORD_SEPARATOR
use Text::CSV_XS;

my $csv = Text::CSV_XS->new ({
    auto_diag    => 1, # Let Text::CSV_XS do the analysis
    always_quote => 1,
    binary       => 1,
    eol          => $INPUT_RECORD_SEPARATOR,
    });

binmode STDOUT, ':encoding(UTF-8)';

for my $file (@ARGV) {
    # auto_diag only covers CSV errors, so check open ourselves
    open my $fh, '<:encoding(UTF-8)', $file or die "$file: $OS_ERROR";

    while (my $fields = $csv->getline ($fh)) {
        $csv->print (*STDOUT, $fields); # no need for a reference
        }
    # due to auto_diag, no need for error checking here
    close $fh;
    }

If this script is meant to sanitize CSV data, I'd advise TWO csv objects: one for parsing, which does not get the always_quote and eol attributes, and one for output. The advantage is that all legal line endings are then parsed correctly and automatically, even if mixed.
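A minimal sketch of that two-object setup follows. The variable names `$csv_in` and `$csv_out` and the fixed "\n" output line ending are assumptions made for illustration; they are not taken from the original script.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Parser: no eol and no always_quote, so the usual line endings
# are recognized on input without further configuration.
my $csv_in = Text::CSV_XS->new ({
    auto_diag => 1,
    binary    => 1,
    });

# Writer: quotes every field and emits one fixed line ending
# ("\n" is an assumption; pick whatever the consumer expects).
my $csv_out = Text::CSV_XS->new ({
    auto_diag    => 1,
    always_quote => 1,
    binary       => 1,
    eol          => "\n",
    });

binmode STDOUT, ':encoding(UTF-8)';

for my $file (@ARGV) {
    open my $fh, '<:encoding(UTF-8)', $file or die "$file: $!";

    while (my $fields = $csv_in->getline ($fh)) {
        $csv_out->print (*STDOUT, $fields);
        }
    close $fh;
    }
```

Keeping the two roles apart means the output settings can never influence how the input is split into records.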

I have no neat solution to the BOM problem other than what you already use.

Enjoy, Have FUN! H.Merijn


Replies are listed 'Best First'.

Re^4: Why Doesn't Text::CSV_XS Print Valid UTF-8 Text When Used With the open Pragma?
by Jim (Curate) on Oct 03, 2011 at 14:16 UTC

Thank you, again, Tux. I genuinely appreciate the tips. I'll brush up on auto_diag.

The BOM is a nuisance, especially in CSV files. In one of my real programs that uses Text::CSV_XS (what I posted here is a reduction that simply demonstrates a specific problem I was having), I'm stymied by the confluence of byte order marks in UTF-8 files that force me to use File::BOM and, unfortunately, some malformed UTF-8 text in the data that kills CSV parsing with this unforgiving error message:

utf8 "\xEC" does not map to Unicode at C:/strawberry/perl/lib/Encode.pm line 176.

I don't know how to tell Text::CSV_XS or File::BOM to tell Encode to lighten up already about one or two bogus characters! :-(
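One possible direction, offered only as a sketch: decode the raw bytes yourself with the core Encode module before the CSV parser sees them. The helper name `decode_lenient` is hypothetical. The sketch assumes it is acceptable to replace each malformed sequence with U+FFFD and to hold the data in memory. It bypasses File::BOM rather than configuring it.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: turn raw UTF-8 bytes into a character string.
# With FB_DEFAULT, Encode substitutes U+FFFD for malformed input
# instead of dying with "does not map to Unicode".
sub decode_lenient {
    my ($bytes) = @_;
    my $text = decode ('UTF-8', $bytes, Encode::FB_DEFAULT);
    $text =~ s/\A\x{FEFF}//; # drop a leading byte order mark, if present
    return $text;
    }
```

The file would be opened with the ':raw' layer so the bytes reach the helper untouched. The substituted characters remain visible in the output, which makes the damaged records easy to find afterwards.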
