Perl debugging vs. utf8: I'm losing

graff has asked for the wisdom of the Perl Monks concerning the following question:

I'm out of my depth on this one, my friends. I'm actually a little surprised that I was able to boil it down to a simple, replicable test case.

Whenever I try to use "perl -d" on a script that handles utf8 data, I get the damnedest bad behavior from perl 5.8.6 on my mac. The error messages when the script dies are not always the same, but they seem to share a common theme of running out of memory. Here's the test script:

#!/usr/bin/perl

use strict;
use warnings;

my $filedata = "Here is XYZ_ABC_123456.7890 Foo Bar";

my $mode = ( @ARGV and $ARGV[0] =~ /u/ ) ? ":utf8" : '';

open I, "<$mode", \$filedata ;
while (<I>) {
    my ( $id ) = ( /(XYZ_ABC_[\d.]+)/ );
    print "ID is $id\n";
}

That will use utf8 mode to read $filedata if there's a "u" on the command line. Here's the test sequence to show what I'm up against -- everything works fine till I use the debugger with utf8 mode:

$ perl test.pl            # no debugger, not utf8
ID is XYZ_ABC_123456.7890

$ perl test.pl u          # no debugger, utf8
ID is XYZ_ABC_123456.7890

$ perl -d test.pl         # debugging, not utf8

Loading DB routines from perl5db.pl version 1.28
...

main::(test.pl:6):      my $filedata = "Here is XYZ_ABC_123456.7890 Foo Bar";

DB<1> c
ID is XYZ_ABC_123456.7890
Debugged program terminated...

DB<1> q


## and now the kicker:

$ perl -d test.pl u       # debugging, utf8

Loading DB routines from perl5db.pl version 1.28
...

main::(test.pl:6):      my $filedata = "Here is XYZ_ABC_123456.7890 Foo Bar";

DB<1> c
perl(18841) malloc: *** vm_allocate(size=4294832128) failed (error code=3)
perl(18841) malloc: *** error: can't allocate region
perl(18841) malloc: *** set a breakpoint in szone_error to debug
Out of memory!
Debugged program terminated.  Use q to quit or R to restart,
  use O inhibit_exit to avoid stopping after program termination,
  h q, h R or h O to get additional info.

DB<1> q

$ perl -v

This is perl, v5.8.6 built for darwin-thread-multi-2level
(with 2 registered patches, see perl -V for more detail)

Note that actual presence of wide characters is not required to break the debugger. As for "perl -V"...

Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
  Platform:
    osname=darwin, osvers=8.0, archname=darwin-thread-multi-2level
    uname='darwin b28.apple.com 8.0 darwin kernel version 7.5.0: thu mar 3 18:48:46 pst 2005; root:xnuxnu-517.99.13.obj~1release_ppc power macintosh powerpc '
    config_args='-ds -e -Dprefix=/usr -Dccflags=-g  -pipe  -Dldflags=-Dman3ext=3pm -Duseithreads -Duseshrplib'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-g -pipe -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -I/usr/local/include',
    optimize='-Os',
    cppflags='-no-cpp-precomp -g -pipe -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='3.3 20030304 (Apple Computer, Inc. build 1809)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=8
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=8, prototype=define
  Linker and Libraries:
    ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags ='-L/usr/local/lib'
    libpth=/usr/local/lib /usr/lib
    libs=-ldbm -ldl -lm -lc
    perllibs=-ldl -lm -lc
    libc=/usr/lib/libc.dylib, so=dylib, useshrplib=true, libperl=libperl.dylib
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
    cccdlflags=' ', lddlflags='-bundle -undefined dynamic_lookup -L/usr/local/lib'


Characteristics of this binary (from libperl): 
  Compile-time options: MULTIPLICITY USE_ITHREADS USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
  Locally applied patches:
        23953 - fix for File::Path::rmtree CAN-2004-0452 security issue
        33990 - fix for setuid perl security issues
  Built under darwin
  Compiled at Mar 20 2005 16:34:19
  @INC:
    /System/Library/Perl/5.8.6/darwin-thread-multi-2level
    /System/Library/Perl/5.8.6
    /Library/Perl/5.8.6/darwin-thread-multi-2level
    /Library/Perl/5.8.6
    /Library/Perl
    /Network/Library/Perl/5.8.6/darwin-thread-multi-2level
    /Network/Library/Perl/5.8.6
    /Network/Library/Perl
    /System/Library/Perl/Extras/5.8.6/darwin-thread-multi-2level
    /System/Library/Perl/Extras/5.8.6
    /Library/Perl/5.8.1/darwin-thread-multi-2level
    /Library/Perl/5.8.1
    .
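On the point that no wide characters are needed to break the debugger: a rough sketch for checking what the `:utf8` open mode actually changes on the handle and on the lines read from it. It uses the core functions `PerlIO::get_layers` and `utf8::is_utf8`; the helper name `report_handle` is made up for illustration. If the flag turns out to be set even on a pure-ASCII line, that would at least be consistent with the failure not depending on the data itself, though this sketch says nothing about the debugger.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# report_handle() is a hypothetical helper, not part of the original script.
# It opens the same in-memory data with the given mode suffix ("" or ":utf8")
# and returns the PerlIO layer names plus the UTF-8 flag of the first line.
sub report_handle {
    my ($mode) = @_;
    my $filedata = "Here is XYZ_ABC_123456.7890 Foo Bar";
    open my $fh, "<$mode", \$filedata or die "open failed: $!";
    my @layers = PerlIO::get_layers($fh);    # core function, no module needed
    my $line   = <$fh>;
    close $fh;
    return ( \@layers, utf8::is_utf8($line) ? 1 : 0 );
}

for my $mode ( '', ':utf8' ) {
    my ( $layers, $flag ) = report_handle($mode);
    printf "mode '%s': layers=(%s) utf8_flag=%d\n",
        $mode, join( ',', @$layers ), $flag;
}
```

Running this outside the debugger and comparing the two lines of output shows whether the two modes differ only in the layer list or also in the flag on the string.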

Well, maybe this is moot... I just tried the same test script with perl ~~5.8.7 and~~ 5.8.8 on a couple of different i386-freebsd boxes, and there seems to be no problem there (so far).

(update:) when I tried a more realistic script on 5.8.7, I got a different error, which is similar to one of the variations I had seen on the mac -- line 61 of my real script had just the sort of regex match shown in the test script above:

panic: pp_match start/end pointers at my_real_script.perl line 61, <I> chunk 2.
 at my_real_script.perl line 61

### or, alternatively (if I try to single-step past line 61):

Out of memory during ridiculously large request at my_real_script.perl line 61, <I> chunk 2.

I'm puzzled why it didn't fail till chunk 2 of my input data -- i.e. the regex match succeeded once on chunk 1. Ick. (end of update)

Still, if anyone can offer suggestions on what the hell is going on here, and/or other ideas for work-arounds or fixes, I'm all ears. I could try getting a newer perl installed on my mac... (but why doesn't this just work?!)
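One work-around idea, sketched here and not tested against the failure above: since the pattern in the test script is pure ASCII, the match could be done on the undecoded bytes and only the captured piece decoded afterwards. The helper name `extract_id` is made up for illustration, and the sketch assumes the input really is UTF-8. Whether this avoids the debugger problem is unverified; all it does is keep the regex away from strings read through the `:utf8` layer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# extract_id() is a hypothetical helper: match against raw bytes,
# then decode only the captured ID (assumes the input is UTF-8).
sub extract_id {
    my ($raw_line) = @_;
    my ($id_bytes) = $raw_line =~ /(XYZ_ABC_[\d.]+)/;
    return unless defined $id_bytes;
    return decode( 'UTF-8', $id_bytes );
}

my $filedata = "Here is XYZ_ABC_123456.7890 Foo Bar";

# Plain "<" mode: no :utf8 layer, so lines arrive as bytes.
open my $fh, '<', \$filedata or die "open failed: $!";
while ( my $line = <$fh> ) {
    my $id = extract_id($line);
    print "ID is $id\n" if defined $id;
}
close $fh;
```

This only makes sense when the pattern itself contains no non-ASCII characters; anything that needs character semantics would still have to work on decoded strings.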


Replies are listed 'Best First'.
Re: Perl debugging vs. utf8: I'm losing by graff (Chancellor) on Nov 08, 2006 at 06:30 UTC
Okay, I guess I should have looked a little further before posting... here's a relevant snippet from the "perldelta" man page that comes with 5.8.8 (though it's a little bit less than fully reassuring, and doesn't actually mention the specific problem I reported above):

Debugger and Unicode slowdown

It had been reported that running under perl's debugger when processing Unicode data could cause unexpectedly large slowdowns. The most likely cause of this was identified and fixed by Nicholas Clark.

Well, on to getting 5.8.8 installed on my mac... Sorry about the noise.
Re: Perl debugging vs. utf8: I'm losing by Tanktalus (Canon) on Nov 08, 2006 at 06:02 UTC
Your last question may be key: why doesn't it just work? Could be simple: it was fixed in 5.8.7? ;-) (For the record, no problems on 5.8.8 on my i386-linux box.)
Re: Perl debugging vs. utf8: I'm losing by jbrugger (Parson) on Nov 08, 2006 at 06:42 UTC
Still, I think it's an interesting topic, since I have various problems using utf8 with our system. For example, using mod_perl, some utf-8 strings just lose their utf-8 flag. I can't do a thing about it; it's compiled and cached in Apache's mod_perl. It seems to have to do with this or this.

I wonder if there are more issues (and if possible, solutions!) reported by others who try to run their programs with mod_perl and utf-8 data. Because of the first-mentioned issue, we have to use `PerlModule ModPerl::PerlRun` instead of the preferred `PerlModule ModPerl::Registry`, since sometimes well-formed Unicode data comes out crippled.

"We all agree on the necessity of compromise. We just can't agree on when it's necessary to compromise." - Larry Wall.
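For the lost-flag symptom described above, a minimal sketch of checking for the flag and decoding again when it is missing. The helper name `ensure_decoded` is made up, and it assumes that any string arriving without the flag holds UTF-8 encoded bytes, which will not be true for every application (an unflagged string can also be legitimate Latin-1 text). It is a diagnostic aid, not a fix for whatever mod_perl is doing.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# ensure_decoded() is a hypothetical helper. It returns a character string,
# decoding only when the UTF-8 flag is off. Assumption: unflagged input is
# UTF-8 encoded bytes; malformed input makes decode() die (FB_CROAK).
sub ensure_decoded {
    my ($str) = @_;
    return $str if utf8::is_utf8($str);
    return decode( 'UTF-8', $str, FB_CROAK );
}

# Usage: "caf\xC3\xA9" is the UTF-8 byte encoding of a four-character word.
my $chars = ensure_decoded("caf\xC3\xA9");
print length($chars), "\n";
```

Calling the helper twice on the same value is harmless, because the second call sees the flag and returns the string unchanged.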
Re: Perl debugging vs. utf8: I'm losing by jethro (Monsignor) on Nov 08, 2006 at 12:55 UTC
I got a similar problem just this morning. I have a perl script that has problems getting parsed correctly (on i386 Suse-Linux in a utf8 environment). In one case 'perl -c script.pl' even results in a segmentation fault. The effect vanishes as soon as I add 'use utf8;' to the script, which hints at the cause of the problem. Will update to perl 5.8.8 and see what that does.

Update: My problem still exists with perl 5.8.8:

> perl -c EB/vb.pm
*** glibc detected *** malloc(): memory corruption (fast): 0x000000000071f880 ***
Abbruch

If someone wants to try it out, this produces a segmentation fault on my machine:

package EB::Parse;
use strict;
use 5.008;
use warnings;
use encoding 'utf8';
use Carp;

my($parser);

$parser= q{
{ my $execute=0; }
all : /[:\\s]/ zeitortobjekt(s? /[:\\s]/) befehl(s? /[:\\ ]/) /./
gug : '1'
#------------
};
1;