apok has asked for the wisdom of the Perl Monks concerning the following question:

Right now I'm working on a text parser that converts a PDF of table metadata (because that is the only way to view it outside of the stupid proprietary software) to a tab-separated flat file for documentation purposes. I've hit a stumbling block that I'm having trouble getting around.

@meta is a 3-dimensional array of pages, lines, and fields. The parsing and recombining of lines works as desired; the problem happens when the loop reaches the end of a page. From that point on, new elements keep getting added to the end of @{$meta[$page]}, which prevents the inner for loop from terminating as it should.

Here is the relevant section of problem code:

open OUT, ">${new_file}.meta" or die "DIAF OUT: $!";
for (my $page = 0; $page < scalar @meta; ++$page) {
    for (my $line = 0; $line < scalar @{$meta[$page]}; ++$line) {
        print "$page:$line/",scalar @{$meta[$page]},"\n";

        while ($meta[$page]->[$line+1]->[0] and !$meta[$page]->[$line+1]->[1] and !$meta[$page]->[$line+1]->[2]) {
            print "while loop #1\n";
            $meta[$page]->[$line]->[0] .= " $meta[$page]->[$line+1]->[0]";
            del($meta[$page],$line+1);
        }

        while (!$meta[$page]->[$line+1]->[1] and $meta[$page]->[$line+1]->[2] and $line+1 < scalar @{$meta[$page]}) {
            print "while loop #2\n";
            $meta[$page]->[$line]->[0] .= " $meta[$page]->[$line+1]->[0]" if ($meta[$page]->[$line+1]->[0]);
            $meta[$page]->[$line]->[2] .= " $meta[$page]->[$line+1]->[2]";
            del($meta[$page],$line+1);
        }

        if (!$meta[$page]->[$line+1]) {
            print "last if check\n";
            while (!$meta[$page+1]->[0]->[1] and $meta[$page+1]->[0]->[2]) {
                print "while loop #3\n";
                $meta[$page]->[$line]->[0] .= $meta[$page+1]->[0]->[0] if ($meta[$page+1]->[0]->[0]);
                $meta[$page]->[$line]->[2] .= $meta[$page+1]->[0]->[2];
                del($meta[$page+1],0);
            }
        }
        print OUT "$meta[$page]->[$line]->[0]\t$meta[$page]->[$line]->[1]\t$meta[$page]->[$line]->[2]\n";
    }
}
close OUT;

sub del {
    my $rArr = shift;
    my $ele = shift;
    my $last = scalar @$rArr - 1;
    if ($ele > $last) {
        warn "Invalid element removal attempted: $ele > $last\n";
        return 0;
    }
    for my $num ($ele..$last-1) {
        $rArr->[$num] = $rArr->[$num+1];
    }
    pop @$rArr;
    return 1;
}

Which produces output (continues on infinitely, cut at $line = 21):

0:0/36
0:1/36
0:2/36
0:3/36
... blah blah blah ...
0:13/17
0:14/17
while loop #1
while loop #1
0:15/16
0:16/17
0:17/18
0:18/19
0:19/20
0:20/21
0:21/22

I'm pretty stumped. The output shows that the while loops aren't being entered, and that is the only place where @meta is modified. Any ideas?

Re: Undesired array growth
by jdporter (Paladin) on Jan 09, 2009 at 02:20 UTC

    It sure is unclear what you're trying to do here, and I have to believe there's got to be a better way to do it.

    But I've identified a bug which may be responsible at least in part for whatever problem you're seeing. You're missing a backwhack in front of the array in the third call to del(), at line 25 of the posted code. That is,

    del(@{$meta[$page+1]},0);
    should be
    del(\@{$meta[$page+1]},0);
    Of course, it could really be simply
    shift @{$meta[$page+1]};
    Couldn't it? Otherwise, it looks like your sub del is attempting to re-implement the splice built-in function.
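To illustrate the point about built-ins, here is a small sketch comparing the posted del() with splice and shift. The sub is copied from the question; the arrays and their names (@with_del, @with_splice, and so on) are made up for the comparison.

```perl
use strict;
use warnings;

# del() as posted in the question: shift everything after $ele down
# by one slot, then drop the now-duplicated last element.
sub del {
    my $rArr = shift;
    my $ele  = shift;
    my $last = scalar @$rArr - 1;
    if ( $ele > $last ) {
        warn "Invalid element removal attempted: $ele > $last\n";
        return 0;
    }
    for my $num ( $ele .. $last - 1 ) {
        $rArr->[$num] = $rArr->[ $num + 1 ];
    }
    pop @$rArr;
    return 1;
}

# Hypothetical sample arrays, one per removal technique.
my @with_del    = qw(a b c d);
my @with_splice = qw(a b c d);
del( \@with_del, 1 );          # remove index 1 by hand
splice @with_splice, 1, 1;     # remove index 1 with the built-in

my @del_front   = qw(a b c d);
my @shift_front = qw(a b c d);
del( \@del_front, 0 );         # remove the first element by hand
shift @shift_front;            # same thing with the built-in

print "middle: @with_del | @with_splice\n";    # middle: a c d | a c d
print "front:  @del_front | @shift_front\n";   # front:  b c d | b c d
```

One difference worth knowing: splice and shift return the removed element(s), while del() returns only a success flag.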

    Good luck!

      should be
      del(\@{$meta[$page+1]},0);

      $meta[$page+1] already contains an array reference so it should be:

      del($meta[$page+1],0);
        Good point, thanks. Updated those, but the problem is still occurring. =/
        Good catch. And the same is true for the other two calls to del() as well.

      Good catch on the backwhack, thanks! Fixed that typo, but the problem is still happening. =/ I'd love to find a simpler way to do it, but this is the best way I could come up with. Granted, I do have a tendency to over-complicate things.

      So at this point in the script, the data is stored in a 3d array. @meta holds the pages, each page holds an array of lines, and each line points to an array of 3 values: Field Name, Field Type, and Field Details. Since it is originally raw text pulled from a pdf, each line is not necessarily a field's full information. Some field names are long enough to be split into two lines, and some of the field details can take up to 15 or 20 lines, but the field type is always only one line.

      The script uses that to check whether the line below the current one is the beginning of another field definition (fields 0, 1, and 2 defined) or a continuation of the current one (1 undefined, either or both of 0 and 2 defined). If it is a continuation, it concats the lower line's values onto the current line ($line_num) and then removes the lower line ($line_num+1).
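The classification rule described above (TYPE present means a new field, TYPE missing means a continuation) can also be applied without deleting anything in place. What follows is only a sketch of that alternative, not the code from this thread: the sample @meta contents, the has_text() helper, and the @fields result array are all invented for illustration, and it assumes a continuation line never carries a TYPE.

```perl
use strict;
use warnings;
use constant { NAME => 0, TYPE => 1, DETAILS => 2 };

# Hypothetical sample: two pages, each a list of [ NAME, TYPE, DETAILS ].
my @meta = (
    [
        [ 'cust_id',  'int',  'Primary key' ],
        [ 'customer', 'char', 'Customer name' ],
        [ 'name',     undef,  'as printed on' ],    # continuation
    ],
    [
        [ undef,    undef,  'the invoice' ],        # continues across pages
        [ 'status', 'char', 'A or I' ],
    ],
);

# Hypothetical helper: true for a defined, non-empty string.
sub has_text {
    my ($value) = @_;
    return defined($value) && length($value);
}

my @fields;    # one merged [ NAME, TYPE, DETAILS ] per field
for my $page (@meta) {
    for my $row (@$page) {
        if ( has_text( $row->[TYPE] ) ) {
            push @fields, [@$row];    # a TYPE starts a new field
            next;
        }
        next unless @fields;          # stray continuation, nothing to attach to
        for my $col ( NAME, DETAILS ) {
            next unless has_text( $row->[$col] );
            $fields[-1][$col] = has_text( $fields[-1][$col] )
                ? "$fields[-1][$col] $row->[$col]"
                : $row->[$col];
        }
    }
}

for my $field (@fields) {
    print join( "\t", map { defined $_ ? $_ : '' } @{$field}[ NAME, TYPE, DETAILS ] ), "\n";
}
```

Because the loops only read rows that exist and only append to @fields, there is no index arithmetic past the end of a page, and page boundaries need no special case.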

      I tried using splice and undef, but that just leaves $line_num+1 undefined. I was hoping to get around that with sub del, so it could just concat and loop until a new definition (0,1,2 defined) was next, and then move down and repeat. It works great... until the end when the array of lines on the page starts growing.
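It isn't clear from the post exactly which call was tried, so the following is a guess at the difference being described: assigning undef to a slot (or passing undef to splice as a replacement) keeps the slot, while splice with only an offset and a length removes it. The array names are made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical three-line page.
my @lines = ( 'first', 'second', 'third' );

# Assigning undef empties the slot but keeps it in the array.
my @cleared = @lines;
$cleared[1] = undef;
my $cleared_count = scalar @cleared;

# splice with a replacement list swaps the element for undef,
# so the slot is still there.
my @replaced = @lines;
splice @replaced, 1, 1, undef;
my $replaced_count = scalar @replaced;

# splice with only OFFSET and LENGTH removes the slot entirely
# and moves the following elements down.
my @removed = @lines;
splice @removed, 1, 1;
my $removed_count = scalar @removed;

print "cleared=$cleared_count replaced=$replaced_count removed=$removed_count\n";
print "slot 1 after removal: $removed[1]\n";
```

Only the last form shortens the array, which is what the merging loop needs.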

Re: Undesired array growth
by BrowserUk (Patriarch) on Jan 09, 2009 at 05:42 UTC

    Assuming I've made no mistakes, this should replicate what your current code is doing, including the errors.

    But (IMO) it is a little simpler (uses splice), and rather clearer, and may allow you to see your logic error?

    use constant { NAME=>0, TYPE=>1, DETAILS=>2 };

    open OUT, ">${new_file}.meta" or die "DIAF OUT: $!";

    for( my $page = 0; $page < @meta; ++$page ) {
        for( my $line = 0; $line < @{ $meta[ $page ] }; ++$line ) {
            print "$page:$line/",scalar @{ $meta[ $page ] },"\n";

            ## while the NAME is defined
            ## and neither TYPE nor DETAILS are
            while( $meta[ $page ][ $line +1 ][ NAME ]
                and ! $meta[ $page ][ $line +1 ][ TYPE ]
                and ! $meta[ $page ][ $line +1 ][ DETAILS ]
            ) {
                print "while loop #1\n";

                ## Concatenate the NAME from the next line
                ## to the current line
                $meta[ $page ][ $line ][ NAME ] .= ' ' . $meta[ $page ][ $line +1 ][ NAME ];

                ## and delete the next line
                splice @{ $meta[ $page ] }, $line +1, 1;
            }

            ## while TYPE is undefined, but DETAILS are
            ## and there is another line in this page
            while( ! $meta[ $page ][ $line +1 ][ TYPE ]
                and $meta[ $page ][ $line +1 ][ DETAILS ]
                and $line +1 < @{ $meta[ $page ] }
            ) {
                print "while loop #2\n";

                ## Concatenate the NAME from the next line
                ## to the current if it exists
                $meta[ $page ][ $line ][ NAME ] .= ' ' . $meta[ $page ][ $line +1 ][ NAME ]
                    if $meta[ $page ][ $line +1 ][ NAME ];

                ## concatenate the details from the next line
                ## to the current
                $meta[ $page ][ $line ][ DETAILS ] .= ' ' . $meta[ $page ][ $line +1 ][ DETAILS ];

                ## and delete the next line
                splice @{ $meta[ $page ] }, $line +1, 1;
            }

            ## if the next line is undefined
            if( ! $meta[ $page ][ $line +1 ] ) {
                print "last if check\n";

                ## while the first line of the next page
                ## has no TYPE but does have DETAILS
                while( ! $meta[ $page +1 ][ 0 ][ TYPE ]
                    and $meta[ $page +1 ][ 0 ][ DETAILS ]
                ) {
                    print "while loop #3\n";

                    ## concatenate the NAMEs if it exists
                    ## in the first line of the next page
                    $meta[ $page ][ $line ][ NAME ] .= $meta[ $page +1 ][ 0 ][ NAME ]
                        if $meta[ $page +1 ][ 0 ][ NAME ];

                    ## And the DETAILS
                    $meta[ $page ][ $line ][ DETAILS ] .= $meta[ $page +1 ][ 0 ][ DETAILS ];

                    ## And delete the first line of the next page
                    splice @{ $meta[ $page +1 ] }, 0, 1;
                }
            }
            print OUT join( "\t", @{ $meta[ $page ][ $line ] }[ NAME, TYPE, DETAILS ] ), "\n";
        }
    }
    close OUT;

    I can not, in the abstract, see the error in your logic, but maybe a different view of your code will make it stand out for you.

    Update: One thing that looks wrong is that you are checking whether another line exists, after you've already looked at the elements of that line:

    ## while TYPE is undefined, but DETAILS are
    ## and there is another line in this page
    while( ! $meta[ $page ][ $line +1 ][ TYPE ]
        and $meta[ $page ][ $line +1 ][ DETAILS ]
        and $line +1 < @{ $meta[ $page ] }
    ) {

    That would probably be better as:

    ## while there is another line in this page
    ## and its TYPE is undefined, but its DETAILS are
    while( $line +1 < @{ $meta[ $page ] }
        and ! $meta[ $page ][ $line +1 ][ TYPE ]
        and $meta[ $page ][ $line +1 ][ DETAILS ]
    ) {
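The order of those tests matters because of autovivification: when Perl reads a nested element through an array slot that does not exist yet, it creates that slot as an empty array reference first. That would seem to explain the posted output, where the line count climbs by one per iteration (16/17, 17/18, ...) without any while loop being entered. The sketch below uses a made-up @page array to show the effect and the guarded form.

```perl
use strict;
use warnings;

# Hypothetical two-line page; each line is [ NAME, TYPE, DETAILS ].
my @page = ( [ 'id', 'int', 'key' ], [ 'name', 'char', 'label' ] );
my $line = $#page;    # index of the last real line

# Unguarded read one past the end: Perl creates $page[ $line + 1 ] as an
# empty array reference just so it can look up element [1] inside it.
my $type  = $page[ $line + 1 ][1];
my $grown = scalar @page;    # the page has silently gained an empty line

# Guarded read: the bounds test comes first, so "and" short-circuits
# and the nested lookup is never evaluated.
my @safe = ( [ 'id', 'int', 'key' ] );
my $is_continuation = ( 1 < @safe and !$safe[1][1] and $safe[1][2] );
my $unchanged = scalar @safe;

print "unguarded: $grown lines, guarded: $unchanged line\n";
# prints "unguarded: 3 lines, guarded: 1 line"
```

The same reasoning applies to every place the loop looks at $line +1 or $page +1, including the first while loop and the check against the next page, so each of those would need its own bounds test.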


      Your version exits the for loop correctly! It has a glitch or two in the concatenation, but I should be able to combine the logic of the two into one completely working script. Many, many thanks. =)

      EDIT: though really, all the concat logic was from my script in the first place so it was my bad it's not working completely correct. Silly me.