Double-bytes handling with Perl

fredo2906 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have a very simple script that reads the output of a shell command, and this command may return lines containing double-bytes charcters.

my @o = `./convert.sh \"$filename\"`;
my @array;
foreach my $line (@o) {
    next if ($line =~ /^\s*$/);
    my %obj;
    my $error = getRecordObject($line,\%obj);
    print "Error code returned is $error for record\n$line\n" if ($error<0);
    push @array, \%obj;
}
.... doing something

print "Total Records : ".scalar(@array)."\n";

The above prints "Total Records : 12708". However, the converted file has more lines:

cat ../file.converted | wc -l
12715

The 6 missing records are the lines containing double-byte characters.

20000 1 0 928 20 20131008100121164+09 79a87b78ade9ea2f4c059055ace98046.У‚ТБУ‚ТПУ‚ТБУƒТ„У‚ТЛУƒТЖУƒТŒУ‚ТГУ‚ТЖУƒТ‰_<> 902

How can I make Perl accept those lines? Thanks a lot.


Re: Double-bytes handling with Perl by kcott (Archbishop) on Oct 10, 2013 at 08:29 UTC
G'day fredo2906, It sounds like the open pragma would possibly resolve your issue. If you need further help, please advise (as already requested) what specific encoding you're dealing with: "double-bytes charcters" [sic] is not particularly helpful. Also, a *small* sample of the data could be useful. Please post this within `<pre>...</pre>` tags. Yes, that's contrary to other advice you may have read about posting data within `<code>...</code>` tags; however, Unicode characters are often rendered as character entity references when `<code>...</code>` tags are used. Your data won't wrap within `<pre>...</pre>` tags: one of the reasons I stressed "a *small* sample". -- Ken
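A minimal sketch of the idea behind that suggestion: read the command's output through an explicit decoding layer, which is the kind of layer the open pragma sets by default. This is an illustration under stated assumptions, not a diagnosis of the original problem. The helper name `read_command_lines` is hypothetical, the encoding is assumed to be UTF-8 (the thread never establishes the real one), and the current perl interpreter stands in for `./convert.sh`.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and return its output lines,
# decoded from the given encoding into Perl character strings.
sub read_command_lines {
    my ($encoding, @cmd) = @_;

    # List-form pipe open bypasses the shell, so arguments such as a
    # filename need no manual quoting.
    open(my $fh, '-|', @cmd) or die "Cannot run $cmd[0]: $!";

    # Decode incoming bytes. ASSUMPTION: the caller knows the encoding.
    binmode($fh, ":encoding($encoding)");

    my @lines = <$fh>;
    close($fh) or warn "Command exited with status $?\n";
    return @lines;
}

# Stand-in for ./convert.sh: a child perl that prints one line
# containing a single two-byte UTF-8 sequence (0xC3 0xA9).
my @lines = read_command_lines('UTF-8', $^X, '-e', 'print "caf\xc3\xa9\n"');
chomp @lines;

printf "%d line(s); first line has %d characters\n",
    scalar @lines, length $lines[0];
```

If the data turns out to be in another encoding, only the first argument changes, for example `'euc-jp'`. Checking the result of `close` also surfaces a non-zero exit status from the command, which backticks report only through `$?`.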
Re: Double-bytes handling with Perl by Corion (Patriarch) on Oct 10, 2013 at 07:59 UTC
Maybe you've already determined the cause of your problems, but there is one potential difference between `wc -l` and `print scalar @array`: In your code, you skip empty lines: `next if ($line =~ /^\s*$/);` ... but `wc -l` will count these. I would keep separate counts for all types of lines encountered and output these as statistics.
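The per-type counting suggested above could be sketched roughly as follows. The function name `classify_lines` and the three categories are assumptions chosen for illustration; the sample lines are made up and are not taken from the original data.

```perl
use strict;
use warnings;

# Hypothetical helper: tally each kind of line instead of silently
# skipping some, so the totals can be compared against `wc -l`.
sub classify_lines {
    my (@lines) = @_;
    my %count = (total => 0, blank => 0, non_ascii => 0, ascii => 0);

    for my $line (@lines) {
        $count{total}++;
        if    ($line =~ /^\s*$/)        { $count{blank}++ }      # empty or whitespace only
        elsif ($line =~ /[^\x00-\x7F]/) { $count{non_ascii}++ }  # has a byte/char above 0x7F
        else                            { $count{ascii}++ }      # plain ASCII record
    }
    return \%count;
}

# Made-up sample input: two records, two blank lines.
my $stats = classify_lines("a 1\n", "\n", "caf\xc3\xa9 2\n", "   \n");

printf "%-10s %d\n", $_, $stats->{$_} for sort keys %$stats;
```

Comparing `total` with the `wc -l` figure shows whether lines are lost before the loop runs (in the command call) or inside it (by the `next`).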
Re^2: Double-bytes handling with Perl by fredo2906 (Acolyte) on Oct 10, 2013 at 08:02 UTC
Yes, I skip empty lines in Perl, just in case. But in this example there are no empty lines in the file. Thanks for your concern.
Re: Double-bytes handling with Perl (UTF16 or UTF8)? by Anonymous Monk on Oct 10, 2013 at 07:54 UTC
What is "double-bytes characters"? UTF16 or UTF8? Tutorials: perlunitut: Unicode in Perl, perluniintro/perlunitut, Perl Unicode Essentials (and whatever other docs they reference, unicodespec...), http://blogs.perl.org/users/brian_d_foy/2011/07/toms-unicode-scripts-so-life-is-easier.html...
Re^2: Double-bytes handling with Perl (UTF16 or UTF8)? by fredo2906 (Acolyte) on Oct 10, 2013 at 08:05 UTC
I am potentially not supposed to know the encoding of those characters. Going to try `use encoding "euc-jp"`. Thanks.
Re^3: Double-bytes handling with Perl (UTF16 or UTF8)? by fredo2906 (Acolyte) on Oct 10, 2013 at 08:10 UTC
It doesn't work. It looks like the issue is in the call to the shell command. I piped the command to "tee" and then ran a diff, and I get 6 lines that are not in the file. Very strange, because if I execute the command directly from my shell, it works perfectly.