tshabet has asked for the wisdom of the Perl Monks concerning the following question:

I have written the following code to turn brace delimited code like this
{bob is {a cool guy}}
into XML like this
<bob> is <a> cool guy </a></bob>
Here's the code I've written (edited for stuff that I know is fine/user outputs/etc)
#!/usr/bin/perl -w
use diagnostics;
use Parse::RecDescent;
use Text::Balanced qw( extract_bracketed );

#input
print "Name of file to be inputted? :";
$infile=<>;
print "What should the output file be named? The file will be automatically created. :";
$outfile=<>;
chomp $infile;
$/=undef;
open INFILE, "<$infile";
$text=<INFILE>;
close INFILE;

#processing
my $counter = 0;
while($next = (extract_bracketed($text, '{}', '[^{}]*' ))[0])
{
    $holder = $next;
    while($bext = (extract_bracketed($next, '{}', '(?s).*?(?=\{code)' ))[0])
    {
        $bolder = $bext;
        while($cext = (extract_bracketed($bext, '{}', '(?s).*?(?=\{escape)' ))[0])
        {
            $colder = $cext;
            $cext =~ s/\{([^ \s|\}]*?)\}/<$1\/>/gix;
            $cext =~ s/\{([\w|-]*)(.*)\}/<$1>$2<\/$1>/osi;
            $bext =~ s/$colder/$cext/sgi;
        }
        $bext =~ s/\{(\w*?)\s(.*)\}/\<$1\>$2<\/$1>/gosix;
        $bext =~ s/\{metavar(.*?)\}/<metavar>$1<\/metavar>/gosix;
        $bext =~ s/\}/ebrac/g;
        $bext =~ s/\{/obrac/g;
        $next =~ s/$bolder/$bext/sgi;
    }
    $next =~ s/\{([^ \s|\}]*?)\}/<$1\/>/gix;
    $next =~ s/\{([\w|-]*)(.*)\}/<$1>$2<\/$1>/osi;
    $text =~ s/$holder/$next/sgi;
    print "Sync check \#$counter\n";
    print "$next\n";
    $counter++;
}

#output
open FILEOUT, ">$outfile";
print FILEOUT $text;
close FILEOUT;
print "\nYour result is stored in file $outfile\nGoodbye.\n";

OK, so this code does the job fine. The problem is that the application of this program needs to be very broad, including processing some truly huge files. When I run this script on a truly humongous file (my test file is 19,000 lines), the script runs until a certain point and then stalls and goes no further. The first two times this happened it stopped on the exact same call/line number, which seemed pretty fishy. So as an experiment (and to make sure it wasn't the input at fault) I cut out everything for about 50 lines surrounding the stalling line and ran everything again. This time it got about 10 more calls/60 more lines and stalled. I ran it again and it stalled in the same place. So, in a fit of annoyance (from start to stall takes about 30 minutes) I cut the original input file down the center and made it into 2 files, which ran through perfectly.

So I'm thinking maybe a memory leak in my program or something(?). I've read a bit about the subject, but in terms of its manifestations in Perl I'm pretty lost. Now, for the short term, I suppose I can just cut up especially big files, but in the long term I hope to not be the sole user of this program and I don't want to have to tell everyone that they need to butcher their input.

So my question is... SOLVE MY PROBLEM FOR ME! DO IT! DROP WHAT YOU'RE DOING AND DO IT NOW! hahhaha, just kidding, but if some knowledgeable Monk should come along and see something in my code, I'd sure appreciate a helpful suggestion or two. I'd assume that since Text::Balanced is a well respected and widely used module that my problem does not originate there. Any ideas? Thanks, Monks!

Re (tilly) 1: Code stalls...possible memory leak?
by tilly (Archbishop) on Aug 21, 2001 at 21:38 UTC
    I wouldn't assume that about Damian's module. Damian writes really cool string processing, but he usually uses algorithms that do a lot of recopying. That can lead to excessive memory use, particularly on long strings.

    So if you need efficiency, you probably need to roll your own or use someone else's. Everyone is saying to use while, but nobody gave you sample code. Let me rectify that with a simple script that should run fairly efficiently:

    #! /usr/bin/perl -w
    use strict;
    use HTML::Entities qw(encode_entities);

    my @stack;
    while (<>) {
      while (/\G([^{}]+)|\G([{}])/g) {
        if (length($1)) {
          print encode_entities($1);
        }
        elsif ("{" eq $2) {
          my $pos = pos($_);
          if (/\G(\w+)/g) {
            print "<$1>";
            push @stack, $1;
          }
          else {
            die "Unnamed opening brace at character $pos, line $.";
          }
        }
        elsif (@stack) {
          my $tag = pop(@stack);
          print "</$tag>";
        }
        else {
          my $pos = pos($_);
          die "Unmatched closing brace at character $pos, line $.";
        }
      }
    }
    if (@stack) {
      die "Unclosed tags at end of file: '@stack'";
    }
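
    To try the stack idea above without touching any files, the same logic can be wrapped in a function that takes a string and returns the converted text. This is only a sketch under a few assumptions: the name braces_to_xml is made up for illustration, the three s/// lines are a hand-rolled stand-in for encode_entities (so the example needs no non-core module), and error messages are simplified.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): convert one brace-delimited
# string to XML-style tags, using a stack of currently open tag names.
sub braces_to_xml {
    my ($text) = @_;
    my @stack;
    my $out = '';
    pos($text) = 0;
    # /c keeps pos() where it was when the match finally fails.
    while ($text =~ /\G(?:([^{}]+)|\{(\w+)|(\}))/gc) {
        if (defined $1) {
            # Plain text: escape the characters that are special in XML.
            my $plain = $1;
            $plain =~ s/&/&amp;/g;
            $plain =~ s/</&lt;/g;
            $plain =~ s/>/&gt;/g;
            $out .= $plain;
        }
        elsif (defined $2) {
            # "{name": open a tag and remember it for the matching "}".
            push @stack, $2;
            $out .= "<$2>";
        }
        else {
            # "}": close the most recently opened tag.
            die "Unmatched closing brace\n" unless @stack;
            $out .= '</' . pop(@stack) . '>';
        }
    }
    # The only way to stop early is a "{" with no tag name after it.
    die "Unnamed opening brace\n" if pos($text) < length($text);
    die "Unclosed tags: @stack\n" if @stack;
    return $out;
}

print braces_to_xml('{bob is {a cool guy}}'), "\n";
```

    The only state carried between matches is the stack of open tags, which is why this style needs no recopying of the input and can be fed one line at a time.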
Re: Code stalls...possible memory leak?
by John M. Dlugosz (Monsignor) on Aug 21, 2001 at 20:52 UTC
    The first thing I would do is bring up some performance monitoring tools. Does it chew up memory? When it stalls, is it pegging the CPU? Does it chew up handles or other OS resources?

    Also, post your OS details and Perl version number. —John

      Hi John, I am running 2000 NT on my personal machine, though the script performs with similar results on Unix variants. I'm running 5.005_02 Perl. As soon as I start to execute the script, my CPU pegs to 100%, memory hits about 50% and stays there. At any given point, Perl will be using 85-99% of my CPU and it stays pegged after the script stalls. I would expect (although I could be wrong) that a genuine memory leak would show steadily larger increments of memory usage, but in this case everything pretty much stays the same from start to abortive finish. What tools would you recommend for benchmarking code?
        http://www.sysinternals.com.

        Get the (freshly updated) tools to view handles, and see if there is a handle leak. Also "process explorer" will give more details than just looking at the totals in Task Manager.

        Reduce the available memory and run again. Does the problem spot pull back as well? Is it the same spot on Unix as well? If always the same spot, it sounds like something in the logic itself, not memory.

        —John

Re: Code stalls...possible memory leak?
by filmo (Scribe) on Aug 21, 2001 at 21:18 UTC
    Instead of slurping the entire file into a scalar all at once, try line processing using 'while'. This will take far less memory.
    open INFILE, "<$infile";
    while ($line_in_file = <INFILE>) {
        # do processing here...
    }
    close INFILE;

    --
    Filmo the Klown
Re: Code stalls...possible memory leak?
by Cine (Friar) on Aug 21, 2001 at 22:22 UTC
    Assuming that your file is always balanced and there are no quoted { or }, it can be made much simpler. (I didn't bother looking into your regexes or Text::Balanced more than briefly, so bear with me if this is not what you want.)
    #!/usr/bin/perl -w
    use strict;

    #input
    print "Name of file to be inputted? :";
    $infile=<>;
    print "What should the output file be named? The file will be automatically created. :";
    $outfile=<>;
    chomp $infile;
    chomp $outfile; ##<<<<<<<<<You forgot this one
    open INFILE, "<$infile";

    #output
    open FILEOUT, ">$outfile";

    my @stack;
    while(<INFILE>) {
      my @op = ();
      my $len = length($_);
      my $minsta = 0;
      my $minsto = 0;
      my $lastpos = 0;
      while(true) {
        $minsta = index $_,'{',$minsta;
        $minsto = index $_,'}',$minsto;
        last if ($len == $lastpos || ($minsta == -1 && $minsto == -1));
        if ($minsta < $minsto) {
          my $nextsp = index $_,' ',$minsta;
          $nextsp = length($_) if ($nextsp == -1);
          my $stack = substr $_,$minsta+1,$nextsp-$minsta;
          push @op, (substr $_,$lastpos,$minsta-$lastpos), '<',$stack,'>';
          push @stack, $stack;
          $lastpos = $minsta+1;
        } else {
          my $stack = pop @stack;
          die "Not balanced" unless (defined $stack);
          push @op, (substr $_,$lastpos,$minsto-$lastpos), '</',$stack,'>';
          $lastpos = $minsto+1;
        }
      }
      print OUTFILE join "",@op,(substr $_,$lastpos,$len-$lastpos);
    }
    close INFILE;
    close FILEOUT;
    die "Not balanced" if (@stack);
    print "\nYour result is stored in file $outfile\nGoodbye.\n";
    I may have some off-by-one errors; fix them if you find it useful...

    This will use almost no memory and will probably do the same job a LOT faster.
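
    For anyone who wants to experiment with the index/substr approach, here is one way the per-line step might look once the search position is advanced on every pass. It is a sketch, not the code as posted: convert_line and its argument list are hypothetical, the tag name is taken to be the run of word characters after the brace (a small regex is used for just that step), and the stack is passed in by reference so it survives from line to line.

```perl
use strict;
use warnings;

# Hypothetical helper: convert one line using index/substr to find braces.
# $stack is an array ref of open tag names that persists across lines.
sub convert_line {
    my ($line, $stack) = @_;
    my $out = '';
    my $pos = 0;    # everything before $pos has already been emitted
    while (1) {
        my $open  = index $line, '{', $pos;
        my $close = index $line, '}', $pos;
        last if $open == -1 && $close == -1;

        if ($open != -1 && ($close == -1 || $open < $close)) {
            # "{" comes first: the tag name is the word right after it.
            my ($tag) = substr($line, $open + 1) =~ /^(\w+)/
                or die "Unnamed opening brace\n";
            $out .= substr($line, $pos, $open - $pos) . "<$tag>";
            push @$stack, $tag;
            $pos = $open + 1 + length($tag);
        }
        else {
            # "}" comes first: close the most recently opened tag.
            my $tag = pop @$stack;
            die "Not balanced\n" unless defined $tag;
            $out .= substr($line, $pos, $close - $pos) . "</$tag>";
            $pos = $close + 1;
        }
    }
    # Emit whatever follows the last brace on this line.
    return $out . substr($line, $pos);
}

my @stack;
my $xml = join '', map { convert_line($_, \@stack) } "{bob is\n", "{a cool guy}}\n";
die "Not balanced\n" if @stack;
print $xml;
```

    Both searches restart from $pos, which moves past each brace as it is handled, so the loop always makes progress.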

    Update:
    Well, seems tilly had the same idea as I, and dinner interrupted me so he got it first ;)

    T I M T O W T D I
Re: Code stalls...possible memory leak?
by jryan (Vicar) on Aug 21, 2001 at 21:11 UTC
    open INFILE, "<$infile"; $text=<INFILE>; close INFILE;

    That's one hefty scalar to have to sort through, especially after it is expanded into XML. You are probably eating most of your system memory by doing this. Perhaps it would be faster to put your data into an array instead:

    open INFILE, "<$infile"; @text=<INFILE>; close INFILE;

    and then print to the output file in your while loop.

      Arrays use (slightly) more memory than scalars. If you've diagnosed his trouble correctly, the cure is (slightly) worse than the disease. Of course, the extra several bytes per array element (one SV per line, as opposed to one SV total) can start to add up if there are several lines.

      If it's possible to process on a line by line or chunk by chunk basis, with a limited state machine or a stack of some sort, a while read is the better approach. (I suspect that's what you meant.)

Re: Code stalls...possible memory leak?
by tshabet (Beadle) on Aug 21, 2001 at 21:58 UTC
    Thanks for the responses, and special thanks to tilly for the code! So while I play with tilly's code, another sort of followup question: It seems that everyone thinks that my script is biting off more than it can chew (or chomp) so to speak. What if instead of using
    $text=<INFILE>;
    I used a while loop of the Text::Balanced module or tilly's code to bite off balanced brace chunks and run them through the rest of the code one at a time? This would seem to me to have the memory advantages of the
    while ($line_in_file = <INFILE>)
    line while more suited to my needs. Does that thinking make sense? A lot of little processes instead of one big one? This is extremely helpful advice, thanks again guys!
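
    One way to sketch that "one balanced chunk at a time" idea is to read line by line, keep a running count of open braces, and hand each chunk to the existing processing code as soon as the count returns to zero. Everything named here is an assumption made for illustration: for_each_balanced_chunk and its callback interface are hypothetical, and the demo reads from an in-memory filehandle, which needs Perl 5.8 or later (a real file opened the usual way works the same). The sketch only compares brace counts at line ends, so checking the order of braces inside a chunk is left to whatever processes it.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh line by line and call $callback with each
# chunk of lines whose braces balance, so that only one chunk is held in
# memory at a time.
sub for_each_balanced_chunk {
    my ($fh, $callback) = @_;
    my $chunk = '';
    my $depth = 0;    # number of braces opened but not yet closed
    while (my $line = <$fh>) {
        $chunk .= $line;
        # tr/// with an empty replacement just counts the characters.
        $depth += ($line =~ tr/{//) - ($line =~ tr/}//);
        die "Unmatched closing brace near line $.\n" if $depth < 0;
        if ($depth == 0) {
            $callback->($chunk);
            $chunk = '';
        }
    }
    die "Unclosed brace at end of input\n" if $depth != 0;
}

# Demo input: one chunk spanning two lines, one plain line, one short chunk.
my $data = "{bob is\n{a cool guy}}\nplain line\n{c d}\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @chunks;
for_each_balanced_chunk($fh, sub { push @chunks, $_[0] });
close $fh;
print scalar(@chunks), " chunks\n";
```

    In real use the callback would run the per-chunk conversion and print the result straight to the output file instead of collecting chunks in an array.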