The module version works, but the standalone version crashes with "Malformed UTF-8 character"

Perlfan52 has asked for the wisdom of the Perl Monks concerning the following question:

I have two versions of the same Perl search script: one is standalone, and the other is a three-line Perl script that calls the main function in a Perl module. Both versions have the same code. The module version works as expected, and the standalone version crashes with the error "Malformed UTF-8 character" at the line with the regex /romantic/.

This must be an internal bug of perl. It is well known that perl had (and probably has) unicode/utf8 issues. I am using Strawberry Perl 5.30.0 (built for MSWin32-x64-multi-thread) on a Windows 10 Pro client with recent updates.

Here are the code and the files needed to reproduce this problem.

- The searched files are in http://ftp.freedb.org/pub/freedb/freedb-update-20200201-20200301.tar.bz2
I extract this archive into C:/MyScripts/freedb-update-20200201-20200301

- The module version of the script C:/MyScripts/search_script_with module.pl is:

BEGIN{push(@INC,'C:/MyScripts');}

use searchFreedb;

mainSearchFreedb('C:/MyScripts/freedb-update-20200201-20200301');

print "End of script\n";

- The corresponding module C:/MyScripts/searchFreedb.pm is

package searchFreedb;

use strict;
use utf8;

use vars qw($VERSION @ISA @EXPORT @EXPORT_OK);
require Exporter;
@ISA = qw(Exporter);
@EXPORT =    qw(mainSearchFreedb);
@EXPORT_OK = qw(mainSearchFreedb);
$VERSION = 1.0;

$| = 1;

#############################################################
#  mainSearchFreedb
#############################################################
sub mainSearchFreedb {
    my ($searchdir) = @_;
    open(FILE, ">C:/MyScripts/sresult_module.txt") || die "$!\n";
    binmode FILE, ":utf8";
    recursivSearchFreedb($searchdir);
    close(FILE);
}

#############################################################
#  recursivSearchFreedb
#############################################################

sub recursivSearchFreedb {
    my ($dir) = @_;
    die "dir $dir!\n" if(!$dir || !(-e $dir && -d $dir));
    $dir =~ s/[\/\\]+/\//og;
    $dir = $dir . '/' if( $dir !~ /\/$/o );

    my ($dirname) = ( $dir =~ /^.*\/([^\/]+?)\/*$/o );

    opendir(DIR,$dir) || warn __LINE__."$!\n";
    my @all_dir_files = readdir(DIR);
    closedir(DIR);

    print "Folder: $dir => $dirname\n";

    foreach my $dir_file ( sort @all_dir_files ) {
        $dir_file =~ /^\.+$/o && next;

        my $abspath = $dir . $dir_file;

        if( -d $abspath ) {
            recursivSearchFreedb($abspath);
        }
        else {
            if($dir_file =~ /(^COPYING$|^README$)/io) {
                print "skipping $dir_file\n";
                next;
            }
            elsif(-z $abspath) {
                next;
            }

            my ($content);
            open(IN, "<$abspath") || die "$!\n";
            while(my $line = <IN>) {
                next if not $line =~ /^#\s+xmcd/o;
                $content .= $line;
                my ($TITLEALL,$DISCID,$GENRE);
                for(;;) {
                    my $line2 = <IN>;
                    if($line2=~/^\s*DTITLE\s*=(.*)$/o) {$TITLEALL .= $1;}
                    if($line2=~/^\s*DISCID=\s*(.+?)\s*$/o) {$DISCID = $1;}
                    if($line2=~/^\s*DGENRE\s*=(.*)$/o) {$GENRE .= $1;}
                    $content .= $line2;
                    if($line2 =~ /^PLAYORDER=/o) {
                        if( $TITLEALL =~ /Romanti[cqk]/io ) {
                            print FILE "$content\n";
                        }
                        last;
                    }
                }
            }
            close(IN);
        }
    }
}

##############################################################
#  end of package
##############################################################

1;

- The standalone version of the script C:/MyScripts/search_script_standalone.pl is:

use strict;
use utf8;

$| = 1;

#############################################################
#  recursivSearchFreedb
#############################################################

sub recursivSearchFreedb {
    my ($dir) = @_;
    die "dir $dir\n" if(!$dir || !(-e $dir && -d $dir));
    $dir =~ s/[\/\\]+/\//og;
    $dir = $dir . '/' if( $dir !~ /\/$/o );

    my ($dirname) = ( $dir =~ /^.*\/([^\/]+?)\/*$/o );

    opendir(DIR,$dir) || warn __LINE__."$!\n";
    my @all_dir_files = readdir(DIR);
    closedir(DIR);

    print "Folder: $dir => $dirname\n";

    foreach my $dir_file ( sort @all_dir_files ) {
        $dir_file =~ /^\.+$/o && next;

        my $abspath = $dir . $dir_file;

        if( -d $abspath ) {
            recursivSearchFreedb($abspath);
        }
        else {
            if($dir_file =~ /(^COPYING$|^README$)/io) {
                print "skipping $dir_file\n";
                next;
            }
            elsif(-z $abspath) {
                next;
            }

            my ($content);
            open(IN, "<$abspath") || die "$!\n";
            while(my $line = <IN>) {
                next if not $line =~ /^#\s+xmcd/o;
                $content .= $line;
                my ($TITLEALL,$DISCID,$GENRE);
                for(;;) {
                    my $line2 = <IN>;
                    if($line2=~/^\s*DTITLE\s*=(.*)$/o) {$TITLEALL .= $1;}
                    if($line2=~/^\s*DISCID=\s*(.+?)\s*$/o) {$DISCID = $1;}
                    if($line2=~/^\s*DGENRE\s*=(.*)$/o) {$GENRE .= $1;}
                    $content .= $line2;
                    if($line2 =~ /^PLAYORDER=/o) {
                        if( $TITLEALL =~ /Romanti[cqk]/io ) {
                            print FILE "$content\n";
                        }
                        last;
                    }
                }
            }
            close(IN);
        }
    }
}

############################################################
#  main starts here
############################################################

open(FILE, ">C:/MyScripts/sresult_standalone.txt") || die "$!\n";
binmode FILE, ":utf8";
recursivSearchFreedb('C:/MyScripts/freedb-update-20200201-20200301');
close(FILE);

print "End of script\n";

I am starting the module version with
"perl -CDS search_script_with module.pl"

The result is:

Folder: C:/MyScripts/freedb-20200201-20200301/
Folder: C:/MyScripts/freedb-20200201-20200301/blues/
Folder: C:/MyScripts/freedb-20200201-20200301/classical/
Folder: C:/MyScripts/freedb-20200201-20200301/country/
Folder: C:/MyScripts/freedb-20200201-20200301/data/
Folder: C:/MyScripts/freedb-20200201-20200301/folk/
Folder: C:/MyScripts/freedb-20200201-20200301/jazz/
Folder: C:/MyScripts/freedb-20200201-20200301/misc/
Folder: C:/MyScripts/freedb-20200201-20200301/newage/
Folder: C:/MyScripts/freedb-20200201-20200301/reggae/
Folder: C:/MyScripts/freedb-20200201-20200301/rock/
Folder: C:/MyScripts/freedb-20200201-20200301/soundtrack/
End of script

I am starting the standalone version with
"perl -CDS search_script_standalone.pl"

The result is (it crashes very quickly):

Folder: C:/MyScripts/freedb-20200201-20200301/
Folder: C:/MyScripts/freedb-20200201-20200301/blues/
Malformed UTF-8 character: \xf6\x6e\x20\x26 (unexpected non-continuation byte 0x6e, immediately after start byte 0xf6; need 4 bytes, got 1) in pattern match (m//) at C:\MYSCRI~1\SEARCH~2.PL line 55, <IN> line 67.
Malformed UTF-8 character (fatal) at C:\MYSCRI~1\SEARCH~2.PL line 55, <IN> line 67.

Any ideas why the standalone version crashes? Can you reproduce the problem on your own PC? Thank you for your answers or ideas.


Replies are listed 'Best First'.
Re: The module version works, but the standalone version crashes with "Malformed UTF-8 character" by choroba (Cardinal) on Mar 09, 2020 at 21:51 UTC
> perl -CDS

"D" corresponds to "i + o", whose documentation in perlrun states (emphasis mine):

> The "io" options mean that any subsequent open() (or similar I/O operations) in the current file scope will have the ":utf8" PerlIO layer implicitly applied to them, in other words, UTF-8 is expected from any input stream, and UTF-8 is produced to any output stream.

If you put the call to open into a module, it falls out of the current file scope. The -C is intended for one-liners; in larger programs and modules, use binmode, explicit layers with 3-arg open, or open.pm.

`map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`
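The "explicit layers with 3-arg open" advice can be sketched as follows. This is only an illustration: the helper name read_lines and the sample data are assumptions, not part of the original scripts. Because the layer is named in the open() call itself, the result should not depend on the -C switch or on which file the call lives in.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not from the original post): read a file with an
# explicit encoding layer instead of relying on -CD.
sub read_lines {
    my ($path, $encoding) = @_;
    open(my $in, "<:encoding($encoding)", $path)
        or die "Cannot open $path: $!\n";
    my @lines = <$in>;
    close($in);
    return @lines;
}

# Demo data: a temporary file holding Latin-1 bytes (0xF6 is "ö" in Latin-1).
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh, ':raw');
print {$tmp_fh} "DTITLE=Sch\xF6n / Romantik\n";
close($tmp_fh);

# The caller states the encoding it expects; the regex then sees characters.
my @lines = read_lines($tmp_path, 'ISO-8859-1');
print "match\n" if $lines[0] =~ /Romanti[cqk]/i;
```

The same pattern works for output, for example `open(my $out, '>:encoding(UTF-8)', $file)`, which would replace the bareword FILE handle plus the separate binmode call.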
Re^2: The module version works, but the standalone version crashes with "Malformed UTF-8 character" by Perlfan52 (Novice) on Mar 09, 2020 at 22:15 UTC
You are right, my friend. I must have overlooked it. If I start with "perl -CS" instead of "perl -CDS", it works in both versions. Thank you very much.
Re: The module version works, but the standalone version crashes with "Malformed UTF-8 character" by 1nickt (Canon) on Mar 09, 2020 at 16:11 UTC
Hi, since you are on Windows I would pay attention to the encoding of your file. MSFT often uses UTF-16, and if you try to decode that as if it were UTF-8 you could see that error IIUC. Related info on "middle byte" that has bitten me with JSON data: https://tools.ietf.org/html/rfc4627#section-3. Hope this helps! The way forward always starts with a minimal test.
Re: The module version works, but the standalone version crashes with "Malformed UTF-8 character" by jo37 (Curate) on Mar 09, 2020 at 18:52 UTC
The error message states: Malformed UTF-8 character: \xf6\x6e\x20\x26. In Latin-1 encoding this would be "ön &". This string (or something similar in another encoding) apparently occurs in one of your files, and at least this file is not UTF-8 encoded. -jo
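A quick way to check this reading of the error message is to decode the four reported bytes both ways with the core Encode module. This is a minimal sketch; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# The four bytes quoted in the error message.
my $bytes = "\xf6\x6e\x20\x26";

# As Latin-1, every byte maps to exactly one character.
my $latin1 = decode('ISO-8859-1', $bytes);
print join(' ', map { sprintf 'U+%04X', ord } split //, $latin1), "\n";
# prints: U+00F6 U+006E U+0020 U+0026   (that is, "ön &")

# As strict UTF-8, 0xF6 is not followed by continuation bytes,
# so decoding with FB_CROAK dies and the eval returns false.
my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
print $ok ? "valid UTF-8\n" : "not valid UTF-8\n";
# prints: not valid UTF-8
```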
Re^2: The module version works, but the standalone version crashes with "Malformed UTF-8 character" by choroba (Cardinal) on Mar 09, 2020 at 20:51 UTC
Yes, it's file blues/020d9511, which is Latin-1 encoded. In UTF-8, the problematic line would be `TTITLE2=Jung, schön & stylish feat. Justus`
Re^2: The module version works, but the standalone version crashes with "Malformed UTF-8 character" by Perlfan52 (Novice) on Mar 09, 2020 at 21:15 UTC
The question is not why the standalone version does not work or where it doesn't work. If you delete blues/020d9511, it crashes immediately at the next UTF-8 encoded file. The question is why the standalone version does not work while the module version works without any problem, although the code is exactly the same. What makes a module so different internally that perl interprets the two cases differently? For me as a developer, a particular piece of code must always give the same result, but in this case I am really at a loss.
Re^3: The module version works, but the standalone version crashes with "Malformed UTF-8 character" by jo37 (Curate) on Mar 10, 2020 at 15:18 UTC
Here I disagree. The input is malformed and a crashing program is the right thing™ here. I would not care why the other version does not crash but instead correct the input data. -jo
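If editing the archive itself is not practical, the input can also be normalised while reading. The sketch below assumes that every file is either valid UTF-8 or Latin-1; the thread only confirms Latin-1 for blues/020d9511, so treat that as an assumption. The helper name decode_line is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: decode raw bytes as UTF-8 when they are valid UTF-8,
# otherwise fall back to Latin-1 (assumed to be the only other encoding).
sub decode_line {
    my ($bytes) = @_;
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

# The same title stored in both encodings ends up as the same characters.
my $utf8_bytes   = "DTITLE=Sch\xC3\xB6n\n";
my $latin1_bytes = "DTITLE=Sch\xF6n\n";
print decode_line($utf8_bytes) eq decode_line($latin1_bytes)
    ? "same text\n"
    : "different text\n";
# prints: same text
```

For this to work, the files have to be read as bytes, for example with `open(my $in, '<:raw', $abspath)`, so that no :utf8 layer tries to interpret them first.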
Re: The module version works, but the standalone version crashes with "Malformed UTF-8 character" by LanX (Saint) on Mar 09, 2020 at 16:06 UTC
> This must be an internal bug of perl. It is well known that perl had (and probably has) unicode/utf8 issues.

You mean "well known" among first-time posters who still use the /o modifier and can't boil down their problem to an SSCCE?

Edit: did you really save both files as UTF-8?

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
Re^2: The module version works, but the standalone version crashes with "Malformed UTF-8 character" by Perlfan52 (Novice) on Mar 09, 2020 at 20:52 UTC
All three files are saved as UTF-8 without BOM and with Unix line terminators (with UltraEdit). All scripts contain only ASCII characters, so I could also save them as ANSI/ASCII. I tested that too: no change in the result. Without the /o modifier, the result is the same.
