Berislav has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone. I need to parse a rather large text file, and I've managed to write a regex that matches exactly what I want. However, with files that contain a lot of information, I sometimes get a segmentation fault during the regex match. Here is the part of the code that does the matching and prints the match. Please note that I frequently need to match a string that spans more than 32k lines of text, and when I do, I get a segmentation fault. Can anyone propose a solution to this problem? Thank you very much in advance.
while ($dna=~ m/(^\w{2}\s\(\d+\)\sLocus:(\s\d+)+)/mg) { print "\n", $1, "\n";}

2006-06-29 Retitled by broquaint, as per Monastery guidelines
Original title: 'Segmentation fault during rexeg match'

Re: Segmentation fault during regex match
by Corion (Patriarch) on Jun 29, 2006 at 09:32 UTC

    The problem Perl has is not with the length of the matching text, but with the size of the regular expression. Your regular expression obviously is shorter than 32k. Please tell us the output of perl -V and post some small example data on which the problem occurs.

      This is the output of perl -V
      Summary of my perl5 (revision 5 version 8 subversion 6) configuration:
        Platform:
          osname=linux, osvers=2.4.21-27.0.2.elsmp, archname=i386-linux-thread-multi
          uname='linux decompose.build.redhat.com 2.4.21-27.0.2.elsmp #1 smp wed jan 12 23:35:44 est 2005 i686 i686 i386 gnulinux '
          config_args='-des -Doptimize=-O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m32 -march=i386 -mtune=pentium4 -fasynchronous-unwind-tables -Dversion=5.8.6 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl=n -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dd_gethostent_r_proto -Ud_endhostent_r_proto -Ud_endprotoent_r_proto -Ud_endservent_r_proto -Ud_sethostent_r_proto -Ud_setprotoent_r_proto -Ud_setservent_r_proto -Dinc_version_list=5.8.5 5.8.4 5.8.3'
          hint=recommended, useposix=true, d_sigaction=define
          usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
          useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
          use64bitint=undef use64bitall=undef uselongdouble=undef
          usemymalloc=n, bincompat5005=undef
        Compiler:
          cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
          optimize='-O2 -g -pipe -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -m32 -march=i386 -mtune=pentium4 -fasynchronous-unwind-tables',
          cppflags='-D_REENTRANT -D_GNU_SOURCE -DDEBUGGING -fno-strict-aliasing -pipe -I/usr/local/include -I/usr/include/gdbm'
          ccversion='', gccversion='4.0.0 20050516 (Red Hat 4.0.0-6)', gccosandvers=''
          intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
          d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
          ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
          alignbytes=4, prototype=define
        Linker and Libraries:
          ld='gcc', ldflags =' -L/usr/local/lib'
          libpth=/usr/local/lib /lib /usr/lib
          libs=-lresolv -lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
          perllibs=-lresolv -lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
          libc=/lib/libc-2.3.5.so, so=so, useshrplib=true, libperl=libperl.so
          gnulibc_version='2.3.5'
        Dynamic Linking:
          dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E -Wl,-rpath,/usr/lib/perl5/5.8.6/i386-linux-thread-multi/CORE'
          cccdlflags='-fPIC', lddlflags='-shared -L/usr/local/lib'

      Characteristics of this binary (from libperl):
        Compile-time options: DEBUGGING MULTIPLICITY USE_ITHREADS USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
        Built under linux
        Compiled at May 18 2005 18:21:23
        @INC:
          /usr/lib/perl5/site_perl/5.8.6/i386-linux-thread-multi
          /usr/lib/perl5/site_perl/5.8.5/i386-linux-thread-multi
          /usr/lib/perl5/site_perl/5.8.4/i386-linux-thread-multi
          /usr/lib/perl5/site_perl/5.8.3/i386-linux-thread-multi
          /usr/lib/perl5/site_perl/5.8.6
          /usr/lib/perl5/site_perl/5.8.5
          /usr/lib/perl5/site_perl/5.8.4
          /usr/lib/perl5/site_perl/5.8.3
          /usr/lib/perl5/site_perl
          /usr/lib/perl5/vendor_perl/5.8.6/i386-linux-thread-multi
          /usr/lib/perl5/vendor_perl/5.8.5/i386-linux-thread-multi
          /usr/lib/perl5/vendor_perl/5.8.4/i386-linux-thread-multi
          /usr/lib/perl5/vendor_perl/5.8.3/i386-linux-thread-multi
          /usr/lib/perl5/vendor_perl/5.8.6
          /usr/lib/perl5/vendor_perl/5.8.5
          /usr/lib/perl5/vendor_perl/5.8.4
          /usr/lib/perl5/vendor_perl/5.8.3
          /usr/lib/perl5/vendor_perl
          /usr/lib/perl5/5.8.6/i386-linux-thread-multi
          /usr/lib/perl5/5.8.6
          .
      And this is the sample file
      AT (327658) Locus: 1 3 8 17 26 31 48 54 58 87 96 103 110 116 130 145 156 168 176 194 200 241 246 260 266 297 312 339 341 351 354 365 371 401 416 443 448 555 560 567 607 635 690 703 705 737 757 777 788 792 794 819 882 897 902 940 948 964 969 973 983 986 1005 1015 1029 1060 1063 1086 1088 1100 1106 1128 1140 1152 1159 1166 1188 1215 1229 1253 1257 1263 1276 1287 1291 1298 1313 1326 1328 1348 1350 1358 1376 1397 1401 1424 1429 1452 1466 1472 1491 1505 1545 1550 1557 1563 1566 1585 1597 1614 1616 1635 1641 1649 1654 1666 1686 1727 1729 1737 1744 1748 1751 1753 1757 1762 1782 1787 1798 1845 1867 1879 1906 1915 1917 1927 1935 1945 1962 1970 1975 2006 2021 2025 2068 2070 2078 2100 2104 2112 2122 2133 2149 2188 2195 2270 2280 2293 2305 2318 2328 2332 2338 2362 2368 2396 2399 2409 2427 2453 2472 2479 2482 2499 2513 2518 2526 2532 2539 2545 2557 2562 2585 2589 2593 2608 2611 2622 2638 2652 2659 2694 2704 2711 2723 2731 2737 2743 2764 2779 2787 2794 2796 2802 2807 2809 2815 2819 2829 2839 2848 2857 2875 2881 2884 2892 2898 2911 2932 2941 2947 2956 2964 2970 2981 2985 2994 2997 3003 3018 3034 3059 3073 3075 3082 3111 3127 3160 3162 3172 3177 3187 3196 3201 3204 3231 3237 3251 3263 3267 3277 3283 3305 3331 3352 3354 3357 3361 3364 3394 3402 3427 3440 3442 3445 3469 3471 3528 3539 3549 3592 3607 3620 3628 3636 3647 3664 3677 3680 3697 3700 3706 3712 3727 3773 3781 3787 3790 3794 3815 3822 3825 3830 3846 3852 3859 3868 3884 3887 3892 3895 3908 3913 3936 3944 3953 3972 3985 3988 4006 4012 4023 4030 4034 4037 4042 4048 4051 4063 4079

        I can't reproduce that error with the program below, though I have only tried it with 5.8.2 and 5.8.3 on Win32. Does the program below crash for you? Does it crash every time or only from time to time? If it works for you, please try to find some other complete example that crashes predictably. If it crashes for you, that's "good", as the bug is then reproducible.

        use strict;
        use warnings;

        undef $/;
        my $dna = <DATA>;
        print $dna;
        while ($dna =~ m/(^\w{2}\s\(\d+\)\sLocus:(\s\d+)+)/mg) {
            print "Found: $1\n";
        };
        __DATA__
        AT (327658) Locus: 1 3 8 17 26 31 48 54 58 87 96 103 110 116 130 145 156 168 176 194 200 241 246 260 266 297 312 339 341 351 354 365 371 401 416 443 448 555 560 567 607 635 690 703 705 737 757 777 788 792 794 819 882 897 902 940 948 964 969 973 983 986 1005 1015 1029 1060 1063 1086 1088 1100 1106 1128 1140 1152 1159 1166 1188 1215 1229 1253 1257 1263 1276 1287 1291 1298 1313 1326 1328 1348 1350 1358 1376 1397 1401 1424 1429 1452 1466 1472 1491 1505 1545 1550 1557 1563 1566 1585 1597 1614 1616 1635 1641 1649 1654 1666 1686 1727 1729 1737 1744 1748 1751 1753 1757 1762 1782 1787 1798 1845 1867 1879 1906 1915 1917 1927 1935 1945 1962 1970 1975 2006 2021 2025 2068 2070 2078 2100 2104 2112 2122 2133 2149 2188 2195 2270 2280 2293 2305 2318 2328 2332 2338 2362 2368 2396 2399 2409 2427 2453 2472 2479 2482 2499 2513 2518 2526 2532 2539 2545 2557 2562 2585 2589 2593 2608 2611 2622 2638 2652 2659 2694 2704 2711 2723 2731 2737 2743 2764 2779 2787 2794 2796 2802 2807 2809 2815 2819 2829 2839 2848 2857 2875 2881 2884 2892 2898 2911 2932 2941 2947 2956 2964 2970 2981 2985 2994 2997 3003 3018 3034 3059 3073 3075 3082 3111 3127 3160 3162 3172 3177 3187 3196 3201 3204 3231 3237 3251 3263 3267 3277 3283 3305 3331 3352 3354 3357 3361 3364 3394 3402 3427 3440 3442 3445 3469 3471 3528 3539 3549 3592 3607 3620 3628 3636 3647 3664 3677 3680 3697 3700 3706 3712 3727 3773 3781 3787 3790 3794 3815 3822 3825 3830 3846 3852 3859 3868 3884 3887 3892 3895 3908 3913 3936 3944 3953 3972 3985 3988 4006 4012 4023 4030 4034 4037 4042 4048 4051 4063 4079

        Update: I realized that your regular expression is too strict to match the data given, at least on Win32 (and I guess on other OSes as well): each \s allows only a single whitespace character between fields. With the code changed to the version below, it matches but still doesn't crash for me:

        use strict;
        use warnings;

        undef $/;
        my $dna = <DATA>;
        while ($dna =~ m/(^\w{2}\s+\(\d+\)\s+Locus:(\s+\d+)+)/smg) {
            print "Found: $1\n";
        };
        __DATA__
        ... dataset elided ...
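
A possible workaround, offered as a sketch rather than a confirmed fix for this crash: regex engines in Perl releases before 5.10 recursed on the C stack when iterating a quantified group such as (\s+\d+)+, so a very long run of numbers could plausibly exhaust the stack. The code below matches only the record header with the regex and then consumes the numbers with a single character class, [\s\d]+, which has no group to iterate. The sub name extract_loci and the returned hash layout are hypothetical, and the record format is assumed from the sample posted above (in particular, that record names never start with a digit).

```perl
use strict;
use warnings;

# Hypothetical helper: returns one hash ref per record found in $dna.
# Assumes each record starts a line with two word characters, a count in
# parentheses and "Locus:", followed by whitespace-separated integers.
sub extract_loci {
    my ($dna) = @_;
    my @records;

    # [\s\d]+ is a plain character class, so no group is repeated here.
    while ($dna =~ m/^(\w{2})\s+\((\d+)\)\s+Locus:([\s\d]+)/mg) {
        my ($name, $count, $numbers) = ($1, $2, $3);
        push @records, {
            name  => $name,
            count => $count,
            loci  => [ split ' ', $numbers ],   # split ' ' skips leading whitespace
        };
    }
    return @records;
}
```

With the whole file slurped into $dna as in the programs above, each returned record carries its name, the count from the parentheses, and the list of locus numbers, so the original text can be printed or processed from there.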
Re: Segmentation fault during regex match
by Moron (Curate) on Jun 29, 2006 at 11:21 UTC
    Your example doesn't span any lines of text, so we don't appear to have an example of where it actually fails...

    -M

    Free your mind
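
Since none of the posted examples actually spans multiple lines, one way to get a reproducible test case is to generate one. The sketch below is only an illustration under stated assumptions: make_record and count_matched_numbers are hypothetical helper names, and the ten-numbers-per-line layout is a guess at what the real file looks like. It builds a single record of any requested size and runs the relaxed pattern from the thread against it.

```perl
use strict;
use warnings;

# Hypothetical generator: one record holding the numbers 1..$n, ten per
# line, loosely imitating the layout of the sample data in this thread.
sub make_record {
    my ($n) = @_;
    my $text = "AT ($n) Locus:";
    for my $i (1 .. $n) {
        $text .= "\n" if $i > 1 && ($i - 1) % 10 == 0;   # new line every 10 numbers
        $text .= sprintf ' %6d', $i;                     # always at least one space
    }
    return "$text\n";
}

# Runs the relaxed pattern from the thread and counts the numbers it covered.
sub count_matched_numbers {
    my ($dna) = @_;
    return 0 unless $dna =~ m/(^\w{2}\s+\(\d+\)\s+Locus:(\s+\d+)+)/m;
    my $match = $1;
    my ($numbers) = $match =~ /Locus:(.*)/s;
    my @numbers = split ' ', $numbers;
    return scalar @numbers;
}
```

Calling count_matched_numbers(make_record($n)) with increasing values of $n (for example 1_000, 10_000, 100_000) should show at what size, if any, a given perl build starts to misbehave.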