graff has asked for the wisdom of the Perl Monks concerning the following question:

I'm out of my depth on this one, my friends. I'm actually a little surprised that I was able to boil it down to a simple, replicable test case.

Whenever I try to use "perl -d" on a script that handles utf8 data, I get the damnedest bad behavior from perl 5.8.6 on my mac. The error messages when the script dies are not always the same, but they seem to share a common theme of running out of memory. Here's the test script:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $filedata = "Here is XYZ_ABC_123456.7890 Foo Bar";
    my $mode = ( @ARGV and $ARGV[0] =~ /u/ ) ? ":utf8" : '';

    open I, "<$mode", \$filedata ;
    while (<I>) {
        my ( $id ) = ( /(XYZ_ABC_[\d.]+)/ );
        print "ID is $id\n";
    }

That will use utf8 mode to read $filedata if there's a "u" on the command line. Here's the test sequence to show what I'm up against -- everything works fine till I use the debugger with utf8 mode:

    $ perl test.pl        # no debugger, not utf8
    ID is XYZ_ABC_123456.7890
    $ perl test.pl u      # no debugger, utf8
    ID is XYZ_ABC_123456.7890
    $ perl -d test.pl     # debugging, not utf8

    Loading DB routines from perl5db.pl version 1.28
    ...
    main::(test.pl:6):      my $filedata = "Here is XYZ_ABC_123456.7890 Foo Bar";
      DB<1> c
    ID is XYZ_ABC_123456.7890
    Debugged program terminated...
      DB<1> q

    ## and now the kicker:

    $ perl -d test.pl u   # debugging, utf8

    Loading DB routines from perl5db.pl version 1.28
    ...
    main::(test.pl:6):      my $filedata = "Here is XYZ_ABC_123456.7890 Foo Bar";
      DB<1> c
    perl(18841) malloc: *** vm_allocate(size=4294832128) failed (error code=3)
    perl(18841) malloc: *** error: can't allocate region
    perl(18841) malloc: *** set a breakpoint in szone_error to debug
    Out of memory!
    Debugged program terminated.  Use q to quit or R to restart,
      use O inhibit_exit to avoid stopping after program termination,
      h q, h R or h O to get additional info.
      DB<1> q

    $ perl -v

    This is perl, v5.8.6 built for darwin-thread-multi-2level
    (with 2 registered patches, see perl -V for more detail)

Note that actual presence of wide characters is not required to break the debugger. As for "perl -V"...

    Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
      Platform:
        osname=darwin, osvers=8.0, archname=darwin-thread-multi-2level
        uname='darwin b28.apple.com 8.0 darwin kernel version 7.5.0: thu mar 3 18:48:46 pst 2005; root:xnuxnu-517.99.13.obj~1release_ppc power macintosh powerpc '
        config_args='-ds -e -Dprefix=/usr -Dccflags=-g -pipe -Dldflags=-Dman3ext=3pm -Duseithreads -Duseshrplib'
        hint=recommended, useposix=true, d_sigaction=define
        usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
        useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
        use64bitint=undef use64bitall=undef uselongdouble=undef
        usemymalloc=n, bincompat5005=undef
      Compiler:
        cc='cc', ccflags ='-g -pipe -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -I/usr/local/include',
        optimize='-Os',
        cppflags='-no-cpp-precomp -g -pipe -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -I/usr/local/include'
        ccversion='', gccversion='3.3 20030304 (Apple Computer, Inc. build 1809)', gccosandvers=''
        intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
        d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=8
        ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
        alignbytes=8, prototype=define
      Linker and Libraries:
        ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags ='-L/usr/local/lib'
        libpth=/usr/local/lib /usr/lib
        libs=-ldbm -ldl -lm -lc
        perllibs=-ldl -lm -lc
        libc=/usr/lib/libc.dylib, so=dylib, useshrplib=true, libperl=libperl.dylib
        gnulibc_version=''
      Dynamic Linking:
        dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
        cccdlflags=' ', lddlflags='-bundle -undefined dynamic_lookup -L/usr/local/lib'

    Characteristics of this binary (from libperl):
      Compile-time options: MULTIPLICITY USE_ITHREADS USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
      Locally applied patches:
        23953 - fix for File::Path::rmtree CAN-2004-0452 security issue
        33990 - fix for setuid perl security issues
      Built under darwin
      Compiled at Mar 20 2005 16:34:19
      @INC:
        /System/Library/Perl/5.8.6/darwin-thread-multi-2level
        /System/Library/Perl/5.8.6
        /Library/Perl/5.8.6/darwin-thread-multi-2level
        /Library/Perl/5.8.6
        /Library/Perl
        /Network/Library/Perl/5.8.6/darwin-thread-multi-2level
        /Network/Library/Perl/5.8.6
        /Network/Library/Perl
        /System/Library/Perl/Extras/5.8.6/darwin-thread-multi-2level
        /System/Library/Perl/Extras/5.8.6
        /Library/Perl/5.8.1/darwin-thread-multi-2level
        /Library/Perl/5.8.1
        .

Well, maybe this is moot... I just tried the same test script with perl 5.8.7 and 5.8.8 on a couple different i386-freebsd boxes, and there seems to be no problem there (so far).

(update:) when I tried a more realistic script on 5.8.7, I got a different error, which is similar to one of the variations I had seen on the mac -- line 61 of my real script had just the sort of regex match shown in the test script above:

    panic: pp_match start/end pointers at my_real_script.perl line 61, <I> chunk 2.
     at my_real_script.perl line 61

    ### or, alternatively (if I try to single-step past line 61):

    Out of memory during ridiculously large request at my_real_script.perl line 61, <I> chunk 2.

I'm puzzled why it didn't fail till chunk 2 of my input data -- i.e. the regex match succeeded once on chunk 1. Ick. (end of update)

Still, if anyone can offer suggestions on what the hell is going on here, and/or other ideas for work-arounds or fixes, I'm all ears. I could try getting a newer perl installed on my mac... (but why doesn't this just work?!)
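
On the work-around question, one possible stopgap is sketched below. It rests on the observation that the ID pattern is pure ASCII, so the match can run on the raw bytes (no ":utf8" layer) and the line can be decoded afterward with Encode::decode. The helper name extract_id is made up for illustration, the sketch assumes the input really is UTF-8, and it has not been checked against the failing 5.8.6 debugger, so treat it as something to try rather than a known fix.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of the original test script): match the
# ASCII-only ID pattern against raw bytes, so the regex engine never
# sees a utf8-flagged string.
sub extract_id {
    my ($bytes) = @_;
    my ($id) = $bytes =~ /(XYZ_ABC_[\d.]+)/;
    return $id;    # undef when the line has no ID
}

my $filedata = "Here is XYZ_ABC_123456.7890 Foo Bar";

# No :utf8 layer here: lines arrive as plain byte strings.
open my $in, '<', \$filedata or die "Cannot open in-memory file: $!";
while ( my $line = <$in> ) {
    my $id = extract_id($line);
    next unless defined $id;

    # Decode only after matching, for any later character-level work.
    # Assumption: the data is valid UTF-8.
    my $text = decode( 'UTF-8', $line );
    printf "ID is %s (line is %d characters)\n", $id, length $text;
}
close $in;
```

Matching on bytes is safe for this pattern because ASCII bytes never occur inside a multi-byte UTF-8 sequence. It would not be safe for patterns that need to match non-ASCII characters.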

Replies are listed 'Best First'.
Re: Perl debugging vs. utf8: I'm losing
by graff (Chancellor) on Nov 08, 2006 at 06:30 UTC
    Okay, I guess I should have looked a little further before posting... here's a relevant snippet from the "perldelta" man page that comes with 5.8.8 (though it's a little bit less than fully reassuring, and doesn't actually mention the specific problem I reported above):
        Debugger and Unicode slowdown

        It had been reported that running under perl's debugger when processing
        Unicode data could cause unexpectedly large slowdowns. The most likely
        cause of this was identified and fixed by Nicholas Clark.

    Well, on to getting 5.8.8 installed on my mac... Sorry about the noise.
Re: Perl debugging vs. utf8: I'm losing
by Tanktalus (Canon) on Nov 08, 2006 at 06:02 UTC

    Your last question may be key: why doesn't it just work? Could be simple: it was fixed in 5.8.7? ;-)

    (For the record, no problems on 5.8.8 on my i386-linux box.)

Re: Perl debugging vs. utf8: I'm losing
by jbrugger (Parson) on Nov 08, 2006 at 06:42 UTC
    Still, I think it's an interesting topic, since I have various problems using utf8 with our system.
    For example, using mod_perl, some utf-8 strings just lose their utf-8 flag. I can't do a thing about it; it's compiled and cached in Apache's mod_perl.
    Seems to have to do with this or this
    I wonder if there are more issues (and, if possible, solutions!) reported by others who try to run their programs under mod_perl with utf-8 data.

    Because of the first-mentioned issue, we have to use PerlModule ModPerl::PerlRun instead of the preferred PerlModule ModPerl::Registry, since sometimes well-formed Unicode data comes out crippled.
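
    For narrowing down where a string loses its flag, a small diagnostic along these lines may help. It is only a sketch: the helper name describe_string is invented here, and it uses utf8::is_utf8 and Encode::decode, both of which ship with perl 5.8. Whether re-decoding is the right repair under ModPerl::Registry is an assumption, not something established in this thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical diagnostic: report whether a scalar carries perl's internal
# UTF-8 flag, together with its length as perl currently sees it.
sub describe_string {
    my ($str) = @_;
    return sprintf "%s, length %d",
        ( utf8::is_utf8($str) ? "flagged" : "unflagged" ), length $str;
}

my $bytes = "caf\xc3\xa9";                # the UTF-8 encoded bytes of "café"
my $chars = decode( 'UTF-8', $bytes );    # character string, flag set

print describe_string($bytes), "\n";      # unflagged, length 5
print describe_string($chars), "\n";      # flagged, length 4
```

    A string that reports "unflagged" with a byte-count length at the point where characters were expected is one that has not been decoded (or has been re-encoded) somewhere upstream.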

    "We all agree on the necessity of compromise. We just can't agree on when it's necessary to compromise." - Larry Wall.
Re: Perl debugging vs. utf8: I'm losing
by jethro (Monsignor) on Nov 08, 2006 at 12:55 UTC
    I ran into a similar problem just this morning. I have a perl script that has problems getting parsed correctly (on i386 SUSE Linux in a utf8 environment). In one case 'perl -c script.pl' even results in a segmentation fault.

    The effect vanishes as soon as I add 'use utf8;' to the script which hints at the cause of the problem.

    Will update to perl 5.8.8 and see what that does.

    Update:

    My problem still exists with perl 5.8.8:

        > perl -c EB/vb.pm
        *** glibc detected *** malloc(): memory corruption (fast): 0x000000000071f880 ***
        Abbruch

    If someone wants to try it out, this produces a segmentation fault on my machine:

        package EB::Parse;
        use strict;
        use 5.008;
        use warnings;
        use encoding 'utf8';
        use Carp;

        my($parser);
        $parser= q{
        { my $execute=0; }
        all : /[:\\s]*/ zeitortobjekt(s? /[:\\s]*/) befehl(s? /[:\\ ]*/) /.*/
        gug : '1'
        #------------
        };
        1;