Perlfan52 has asked for the wisdom of the Perl Monks concerning the following question:

I have two versions of the same Perl search script: one is standalone, and the other is a three-line Perl script that calls the main function in a Perl module. Both versions have the same code. The module version works as expected, while the standalone version crashes with the error "Malformed UTF-8 character" at the line with the regex /Romanti[cqk]/

This must be an internal bug of perl. It is well known that perl had (and probably has) unicode/utf8 issues. I am using Strawberry perl 5.30.0 (built for MSWin32-x64-multi-thread) on windows 10 pro client with recent updates.

Here are the code and the files needed to reproduce this problem.

- The searched files are in http://ftp.freedb.org/pub/freedb/freedb-update-20200201-20200301.tar.bz2
I extracted this archive into C:/MyScripts/freedb-update-20200201-20200301

- The module version of the script C:/MyScripts/search_script_with module.pl is:

BEGIN{push(@INC,'C:/MyScripts');}
use searchFreedb;
mainSearchFreedb('C:/MyScripts/freedb-update-20200201-20200301');
print "End of script\n";

- The corresponding module C:/MyScripts/searchFreedb.pm is:

package searchFreedb;

use strict;
use utf8;
use vars qw($VERSION @ISA @EXPORT @EXPORT_OK);
require Exporter;

@ISA       = qw(Exporter);
@EXPORT    = qw(mainSearchFreedb);
@EXPORT_OK = qw(mainSearchFreedb);
$VERSION   = 1.0;

$| = 1;

#############################################################
# mainSearchFreedb
#############################################################
sub mainSearchFreedb {
    my ($searchdir) = @_;
    open(FILE, ">C:/MyScripts/sresult_module.txt") || die "$!\n";
    binmode FILE, ":utf8";
    recursivSearchFreedb($searchdir);
    close(FILE);
}

#############################################################
# recursivSearchFreedb
#############################################################
sub recursivSearchFreedb {
    my ($dir) = @_;
    die "dir $dir!\n" if(!$dir || !(-e $dir && -d $dir));
    $dir =~ s/[\/\\]+/\//og;
    $dir = $dir . '/' if( $dir !~ /\/$/o );
    my ($dirname) = ( $dir =~ /^.*\/([^\/]+?)\/*$/o );
    opendir(DIR,$dir) || warn __LINE__."$!\n";
    my @all_dir_files = readdir(DIR);
    closedir(DIR);
    print "Folder: $dir => $dirname\n";
    foreach my $dir_file ( sort @all_dir_files ) {
        $dir_file =~ /^\.+$/o && next;
        my $abspath = $dir . $dir_file;
        if( -d $abspath ) {
            recursivSearchFreedb($abspath);
        }
        else {
            if($dir_file =~ /(^COPYING$|^README$)/io) {
                print "skipping $dir_file\n";
                next;
            }
            elsif(-z $abspath) {
                next;
            }
            my ($content);
            open(IN, "<$abspath") || die "$!\n";
            while(my $line = <IN>) {
                next if not $line =~ /^#\s+xmcd/o;
                $content .= $line;
                my ($TITLEALL,$DISCID,$GENRE);
                for(;;) {
                    my $line2 = <IN>;
                    if($line2=~/^\s*DTITLE\s*=(.*)$/o) {$TITLEALL .= $1;}
                    if($line2=~/^\s*DISCID=\s*(.+?)\s*$/o) {$DISCID = $1;}
                    if($line2=~/^\s*DGENRE\s*=(.*)$/o) {$GENRE .= $1;}
                    $content .= $line2;
                    if($line2 =~ /^PLAYORDER=/o) {
                        if( $TITLEALL =~ /Romanti[cqk]/io ) {
                            print FILE "$content\n";
                        }
                        last;
                    }
                }
            }
            close(IN);
        }
    }
}

##############################################################
# end of package
##############################################################
1;

- The standalone version of the script C:/MyScripts/search_script_standalone.pl is:
use strict;
use utf8;

$| = 1;

#############################################################
# recursivSearchFreedb
#############################################################
sub recursivSearchFreedb {
    my ($dir) = @_;
    die "dir $dir\n" if(!$dir || !(-e $dir && -d $dir));
    $dir =~ s/[\/\\]+/\//og;
    $dir = $dir . '/' if( $dir !~ /\/$/o );
    my ($dirname) = ( $dir =~ /^.*\/([^\/]+?)\/*$/o );
    opendir(DIR,$dir) || warn __LINE__."$!\n";
    my @all_dir_files = readdir(DIR);
    closedir(DIR);
    print "Folder: $dir => $dirname\n";
    foreach my $dir_file ( sort @all_dir_files ) {
        $dir_file =~ /^\.+$/o && next;
        my $abspath = $dir . $dir_file;
        if( -d $abspath ) {
            recursivSearchFreedb($abspath);
        }
        else {
            if($dir_file =~ /(^COPYING$|^README$)/io) {
                print "skipping $dir_file\n";
                next;
            }
            elsif(-z $abspath) {
                next;
            }
            my ($content);
            open(IN, "<$abspath") || die "$!\n";
            while(my $line = <IN>) {
                next if not $line =~ /^#\s+xmcd/o;
                $content .= $line;
                my ($TITLEALL,$DISCID,$GENRE);
                for(;;) {
                    my $line2 = <IN>;
                    if($line2=~/^\s*DTITLE\s*=(.*)$/o) {$TITLEALL .= $1;}
                    if($line2=~/^\s*DISCID=\s*(.+?)\s*$/o) {$DISCID = $1;}
                    if($line2=~/^\s*DGENRE\s*=(.*)$/o) {$GENRE .= $1;}
                    $content .= $line2;
                    if($line2 =~ /^PLAYORDER=/o) {
                        if( $TITLEALL =~ /Romanti[cqk]/io ) {
                            print FILE "$content\n";
                        }
                        last;
                    }
                }
            }
            close(IN);
        }
    }
}

############################################################
# main starts here
############################################################
open(FILE, ">C:/MyScripts/sresult_standalone.txt") || die "$!\n";
binmode FILE, ":utf8";
recursivSearchFreedb('C:/MyScripts/freedb-update-20200201-20200301');
close(FILE);
print "End of script\n";

I start the module version with
"perl -CDS search_script_with module.pl"

The result is:

Folder: C:/MyScripts/freedb-20200201-20200301/
Folder: C:/MyScripts/freedb-20200201-20200301/blues/
Folder: C:/MyScripts/freedb-20200201-20200301/classical/
Folder: C:/MyScripts/freedb-20200201-20200301/country/
Folder: C:/MyScripts/freedb-20200201-20200301/data/
Folder: C:/MyScripts/freedb-20200201-20200301/folk/
Folder: C:/MyScripts/freedb-20200201-20200301/jazz/
Folder: C:/MyScripts/freedb-20200201-20200301/misc/
Folder: C:/MyScripts/freedb-20200201-20200301/newage/
Folder: C:/MyScripts/freedb-20200201-20200301/reggae/
Folder: C:/MyScripts/freedb-20200201-20200301/rock/
Folder: C:/MyScripts/freedb-20200201-20200301/soundtrack/
End of script

I start the standalone version with
"perl -CDS search_script_standalone.pl"

The result is (it crashes very quickly):

Folder: C:/MyScripts/freedb-20200201-20200301/
Folder: C:/MyScripts/freedb-20200201-20200301/blues/
Malformed UTF-8 character: \xf6\x6e\x20\x26 (unexpected non-continuation byte 0x6e, immediately after start byte 0xf6; need 4 bytes, got 1) in pattern match (m//) at C:\MYSCRI~1\SEARCH~2.PL line 55, <IN> line 67.
Malformed UTF-8 character (fatal) at C:\MYSCRI~1\SEARCH~2.PL line 55, <IN> line 67.

Any ideas why the standalone version crashes? Can you reproduce the problem on your own PC? Thank you for your answers or ideas.

Replies are listed 'Best First'.
Re: The module version works, but the standlone version crashes with "Malformed UTF-8 character"
by choroba (Cardinal) on Mar 09, 2020 at 21:51 UTC
    > perl -CDS

    "D" corresponds to "i + o" whose documentation in perlrun states (emphasis mine):

    > The "io" options mean that any subsequent open() (or similar I/O operations) in the current file scope will have the ":utf8" PerlIO layer implicitly applied to them, in other words, UTF-8 is expected from any input stream, and UTF-8 is produced to any output stream.

    If you put the call to open into a module, it falls out of the current file scope.

    The -C switch is intended for one-liners; in larger programs and modules, use binmode, explicit layers with 3-arg open, or open.pm.
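
    A minimal sketch of the "explicit layers with 3-arg open" option: the encoding is named in the open() call itself, so decoding no longer depends on -C or on which file the call was compiled in. The helper name read_decoded and the scratch file are made up for illustration; they are not part of the original scripts.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not from the thread): read all lines of a file,
# decoding them with an explicitly named encoding layer.
sub read_decoded {
    my ($path, $encoding) = @_;
    open(my $in, "<:encoding($encoding)", $path)
        or die "Cannot open $path: $!\n";
    my @lines = <$in>;
    close($in);
    return @lines;
}

# Demo: write the UTF-8 bytes for "schön" to a scratch file, then read it back.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh);                     # raw bytes, no encoding layer
print {$tmp_fh} "sch\xC3\xB6n\n";     # U+00F6 encoded as UTF-8 is C3 B6
close($tmp_fh);

my @lines = read_decoded($tmp_path, 'UTF-8');
chomp(my $word = $lines[0]);
printf "%d characters\n", length $word;   # prints "5 characters"
```

    With the layer spelled out like this, the same sub should behave identically whether it lives in the main script or in a module.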

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      You are right, my friend. I must have overlooked it. If I start with "perl -CS" instead of "perl -CDS", it works in both versions. Thank you very much.
Re: The module version works, but the standlone version crashes with "Malformed UTF-8 character"
by 1nickt (Canon) on Mar 09, 2020 at 16:11 UTC

    Hi, since you are on Windows I would pay attention to the encoding of your file. MSFT often uses UTF-16, and if you try to decode that as if it were UTF-8 you could see that error IIUC.

    Related info on "middle byte" that has bitten me with JSON data: https://tools.ietf.org/html/rfc4627#section-3.

    Hope this helps!


    The way forward always starts with a minimal test.
Re: The module version works, but the standlone version crashes with "Malformed UTF-8 character"
by jo37 (Curate) on Mar 09, 2020 at 18:52 UTC

    The error message states:

    Malformed UTF-8 character: \xf6\x6e\x20\x26

    In Latin-1 encoding this would be "ön &". This string (or something similar in another encoding) apparently occurs in one of your files, and at least that file is not UTF-8 encoded.
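
    The reading above can be checked with the core Encode module. This is only a sketch: it decodes the four bytes from the error message once as Latin-1 and once as strict UTF-8, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The four bytes quoted in the error message.
my $bytes = "\xf6\x6e\x20\x26";

# Interpreted as Latin-1 (ISO-8859-1), every byte maps to one character.
my $latin1 = decode('ISO-8859-1', $bytes);

# Interpreted as strict UTF-8, decoding fails: 0xF6 is not followed by
# continuation bytes. FB_CROAK makes decode() die instead of substituting.
my $copy    = $bytes;   # decode() may modify its input when a CHECK is given
my $utf8_ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;

binmode(STDOUT, ':encoding(UTF-8)');
print "As Latin-1: $latin1\n";                            # As Latin-1: ön &
print "Valid UTF-8: ", ($utf8_ok ? "yes" : "no"), "\n";   # Valid UTF-8: no
```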

    -jo

      Yes, it's file blues/020d9511 which is Latin-1 encoded. In UTF-8, the problematic line would be
      TTITLE2=Jung, schön & stylish feat. Justus

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      The question is not why or where the standalone version fails. If you delete blues/020d9511, it crashes immediately at the next UTF-8 encoded file.
      The question is why the standalone version fails while the module version works without any problem, although the code is exactly the same. What makes a module so different internally that perl interprets the same code differently in the two cases? For me as a developer, identical code must always give the same result, but in this case I am at a loss.

        Here I disagree. The input is malformed and a crashing program is the right thing™ here. I would not care why the other version does not crash but instead correct the input data.
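
        One possible way to cope with such mixed input, sketched under the assumption that anything that is not well-formed UTF-8 is Latin-1 (the files may in fact use other legacy encodings): read raw bytes and decode each line yourself. The helper name decode_line is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode one raw line as UTF-8 if it is well-formed,
# otherwise fall back to Latin-1 (an assumption, see above).
sub decode_line {
    my ($raw) = @_;
    my $copy = $raw;    # decode() may modify its input when a CHECK is given
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return defined $text ? $text : decode('ISO-8859-1', $raw);
}

# The same title, once as UTF-8 bytes and once as Latin-1 bytes.
my $from_utf8   = decode_line("TTITLE2=Jung, sch\xC3\xB6n & stylish");
my $from_latin1 = decode_line("TTITLE2=Jung, sch\xF6n & stylish");

print(($from_utf8 eq $from_latin1) ? "same text\n" : "different text\n");
# prints "same text"
```

        The guess can be wrong for Latin-1 text that happens to form valid UTF-8 sequences, so converting the data once and checking the result is safer than guessing on every run.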

        -jo

Re: The module version works, but the standlone version crashes with "Malformed UTF-8 character"
by LanX (Saint) on Mar 09, 2020 at 16:06 UTC
    > This must be an internal bug of perl. It is well known that perl had (and probably has) unicode/utf8 issues.

    you mean "well known" among first time posters who still use the /o modifier and can't boil down their problem to an SSCCE ?

    edit

    did you really save both files as UTF-8?

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      All three files are saved as UTF-8 without BOM and with Unix line terminators (using UltraEdit).
      All scripts contain only ASCII characters, which means I could also save them as ANSI/ASCII. I tested that too, and the result did not change.
      Without the /o modifier, the result is the same.