babelfish has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow monks,

i have a problem collecting data from a record-oriented stream.

In particular, i need to collect strings that appear on different lines within each paragraph and match a common regex. But my current approach does not work properly: only the first occurrence of the regex is found, and the other ones are skipped or ignored.

The data stream i want to process looks like this:

### HEADING OF RECORD 1 ####
Logical device ID=08E1
LINE_THAT_DOES_NOT_BOTHER_ME
ANOTHER_LINE_THAT_DOES_NOT_BOTHER_ME
29 8/0/2/1/0.18.152.0.0.6.1 c29t6d1 FA 5eA
30 8/0/3/1/0.17.152.0.0.6.1 c30t6d1 FA 12e
31 8/0/8/1/0.17.150.0.0.6.1 c31t6d1 FA 10eA
32 8/0/9/1/0.18.150.0.0.6.1 c32t6d1 FA 11eA

### HEADING OF RECORD 2 ####
Logical device ID=08E2
LINE_THAT_DOES_NOT_BOTHER_ME
ANOTHER_LINE_THAT_DOES_NOT_BOTHER_ME
29 8/0/2/1/0.18.152.0.0.4.1 c29t4d1 FA 5eA
30 8/0/3/1/0.17.152.0.0.4.1 c30t4d1 FA 12eA
31 8/0/8/1/0.17.150.0.0.4.1 c31t4d1 FA 10eA
32 8/0/9/1/0.18.150.0.0.4.1 c32t4d1 FA 11eA

### HEADING OF RECORD 3 ####
(...)

The task is as follows:

Create a hash of arrays from that stream, where the "Logical device ID" numbers are the keys and the cXtYdZ strings are collected in arrays as the respective values:

%hash = (
    '08E1' => ['c29t6d1','c30t6d1','c31t6d1','c32t6d1'],
    '08E2' => ['c29t4d1','c30t4d1','c31t4d1','c32t4d1'],
    (...)
)

I am using this code for processing the stuff:

use strict;
use warnings;
use Data::Dumper;

my %hash;
open ( FH, "powermt display dev=all|"); # data stream comes from here
$/ = '';
while (<FH>) {
    my ($id) = ( $_ =~ /Logical device ID=(\w+)/ );
    push (@{$hash{$id}}, $1) if /(c\d+t\d+d\d+)/;
}
print Dumper (\%hash);

But when using this code, i get a HoA containing only the first occurrence of the regex within each paragraph, like this:

$VAR1 = {
    '08E1' => [
        'c29t6d1'
    ],
    '08E2' => [
        'c29t4d1'
    ],
    (...)

So far, the record processing itself seems to work ok, but i am missing something in the while loop when trying to catch all cXtYdZ strings. I must also mention that the number of lines with that string may vary: there might be just one line, but there could also be 2, 3, 4, 5 or more lines containing these strings.

The problem seems to be that i need to execute the push statement as often as the regex pattern appears within each loop.

Can somebody enlighten me for perhaps improving my loop-control skills?

TIA!

Replies are listed 'Best First'.
Re: How can i catch strings matching a regex across multiple lines?
by aaron_baugher (Curate) on Jun 30, 2012 at 23:38 UTC

    The problem is this line:

     push (@{$hash{$id}}, $1) if /(c\d+t\d+d\d+)/;

    That checks the input record for your pattern, captures it, and pushes it onto the array pointed to by that ID. But it only does that once, so it finds the first one, pushes it, and moves on. To find them all, you'll need to tell your regex to repeat the search:

    push @{$hash{$id}}, /(c\d+t\d+d\d+)/g;

    Aaron B.
    Available for small or large Perl jobs; see my home node.
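
To illustrate the /g fix suggested above, here is a minimal sketch, not the poster's exact program. It assumes records are separated by blank lines and reads a shortened copy of the sample data from an in-memory filehandle (the $sample variable is made up for this example) instead of the powermt pipe. In list context, a match with /g returns every captured string, so one push collects them all.

```perl
use strict;
use warnings;

# Shortened sample input based on the question; the real data would come
# from the "powermt display dev=all" pipe instead of this string.
my $sample = <<'END';
### HEADING OF RECORD 1 ####
Logical device ID=08E1
LINE_THAT_DOES_NOT_BOTHER_ME
29 8/0/2/1/0.18.152.0.0.6.1 c29t6d1 FA 5eA
30 8/0/3/1/0.17.152.0.0.6.1 c30t6d1 FA 12e

### HEADING OF RECORD 2 ####
Logical device ID=08E2
29 8/0/2/1/0.18.152.0.0.4.1 c29t4d1 FA 5eA
30 8/0/3/1/0.17.152.0.0.4.1 c30t4d1 FA 12eA
END

my %hash;
{
    local $/ = '';    # paragraph mode: one blank-line-separated record per read
    open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
    while ( my $record = <$fh> ) {
        # Skip any record that carries no device ID.
        my ($id) = $record =~ /Logical device ID=(\w+)/ or next;

        # List context plus /g: all captures are returned at once.
        push @{ $hash{$id} }, $record =~ /(c\d+t\d+d\d+)/g;
    }
    close $fh;
}

print "$_ => @{ $hash{$_} }\n" for sort keys %hash;
```

Using local for $/ keeps the paragraph-mode setting confined to the block, so later reads elsewhere in a larger script still see normal line-by-line behaviour.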

Re: How can i catch strings matching a regex across multiple lines?
by Anonymous Monk on Jun 30, 2012 at 21:38 UTC
      you can see that $id gets reinitialized upon each iteration of the loop (with each new line read)
      But he is reading records, not lines and therefore it is perfectly OK --even recommended-- to re-initialize $id each time through the loop.

      The solution is to add the g modifier to the regex.
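
As a side note on the g modifier: it can also be used in scalar context, which runs the loop body once per occurrence and is closer to what the original question describes. A small sketch, with a made-up $record string standing in for one paragraph of input:

```perl
use strict;
use warnings;

# One hypothetical record, built by hand for illustration.
my $record = join "\n",
    'Logical device ID=08E1',
    '29 8/0/2/1/0.18.152.0.0.6.1 c29t6d1 FA 5eA',
    '30 8/0/3/1/0.17.152.0.0.6.1 c30t6d1 FA 12e';

my @devices;

# In scalar context, each /g match resumes where the previous one ended,
# so the loop body runs once for every occurrence in the record.
while ( $record =~ /(c\d+t\d+d\d+)/g ) {
    push @devices, $1;
}

print "@devices\n";
```

This form is handy when each match needs extra processing before it is stored; for plain collecting, the list-context form is shorter.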

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
      I don't think the problem has anything to do with $id.

      The much bigger potential perpetrator is the "$/" setting.
      Since you expect to be reading lines, you need a record separator - otherwise, the entire stream will be "slurp"ed, and you will see only one record, which matches your symptoms.

                   I hope life isn't a big joke, because I don't get it.
                         -SNL

        Not quite. See perlvar, "INPUT_RECORD_SEPARATOR": setting $/ to an empty string turns on paragraph mode, where blank lines end a record and two or more consecutive empty lines are treated as a single one. That's what he wants here, since his records are separated by a blank line.
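
A quick way to check how an empty $/ behaves is to read a small string through an in-memory filehandle. This is only a sketch; the $text contents are invented for the test, with a single blank line in the first gap and two blank lines in the second.

```perl
use strict;
use warnings;

# Three paragraphs; the second gap deliberately contains two blank lines.
my $text = "a1\na2\n\nb1\n\n\nc1\nc2\n";

my @records;
{
    local $/ = '';    # paragraph mode, restored when this block ends
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    @records = <$fh>;    # one element per paragraph, not per line
    close $fh;
}

printf "%d records\n", scalar @records;
```

Each element of @records holds a whole paragraph, which is why a single match per loop iteration only ever sees the first cXtYdZ string.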

        Aaron B.
        Available for small or large Perl jobs; see my home node.

Re: How can i catch strings matching a regex across multiple lines?
by 2teez (Vicar) on Jul 01, 2012 at 11:25 UTC

    You can achieve what you want like so:

    use warnings;
    use strict;
    use Data::Dumper;

    my $device_id = {};
    my $id = "";
    while (<DATA>) {
        chomp;
        if (m/Logical.+=(.+?)$/) {
            $id = $1;
        }
        else {
            if (m/.+?\s+?(c.+?)\s+?.+?$/) {
                push @{ $device_id->{$id} }, $1;
            }
        }
    }
    print Dumper($device_id);

    __DATA__
    ### HEADING OF RECORD 1 ####
    Logical device ID=08E1
    LINE_THAT_DOES_NOT_BOTHER_ME
    ANOTHER_LINE_THAT_DOES_NOT_BOTHER_ME
    29 8/0/2/1/0.18.152.0.0.6.1 c29t6d1 FA 5eA
    30 8/0/3/1/0.17.152.0.0.6.1 c30t6d1 FA 12e
    31 8/0/8/1/0.17.150.0.0.6.1 c31t6d1 FA 10eA
    32 8/0/9/1/0.18.150.0.0.6.1 c32t6d1 FA 11eA

    ### HEADING OF RECORD 2 ####
    Logical device ID=08E2
    LINE_THAT_DOES_NOT_BOTHER_ME
    ANOTHER_LINE_THAT_DOES_NOT_BOTHER_ME
    29 8/0/2/1/0.18.152.0.0.4.1 c29t4d1 FA 5eA
    30 8/0/3/1/0.17.152.0.0.4.1 c30t4d1 FA 12eA
    31 8/0/8/1/0.17.150.0.0.4.1 c31t4d1 FA 10eA
    32 8/0/9/1/0.18.150.0.0.4.1 c32t4d1 FA 11eA

    ### HEADING OF RECORD 3 ####
    (...)

    output:

    $VAR1 = {
              '08E2' => [
                          'c29t4d1',
                          'c30t4d1',
                          'c31t4d1',
                          'c32t4d1'
                        ],
              '08E1' => [
                          'c29t6d1',
                          'c30t6d1',
                          'c31t6d1',
                          'c32t6d1'
                        ]
            };

    Please note: test your match regexes.
    Also check perldoc perldsc.
    Hope this helps
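
For readers following the perldsc pointer, here is a short sketch of walking a hash of arrays once it has been built. The %hash contents are a hand-written subset of the expected result from the question, and the report format is just an illustration.

```perl
use strict;
use warnings;

# Hand-written subset of the structure the question asks for.
my %hash = (
    '08E1' => [ 'c29t6d1', 'c30t6d1' ],
    '08E2' => [ 'c29t4d1', 'c30t4d1' ],
);

my @report;
for my $id ( sort keys %hash ) {
    # Dereference the array stored under this ID.
    my @devices = @{ $hash{$id} };
    push @report,
        sprintf( '%s has %d entries: %s', $id, scalar @devices, join( ', ', @devices ) );
}

print "$_\n" for @report;
```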

      /me .oO ' ... at last -- a tested solution!'

              + +

      Thanks, that one works well for me!

      So, i need to give myself an hour's detention on proper data munging :)