multiple regexp matches in multi line string

wirelesscharlie has asked for the wisdom of the Perl Monks concerning the following question:

I have a file which needs to be parsed to get some parameters. It looks like

   1,               ,    ,   Transport,         TCP,Total Packets with Errors = 0
   1,               ,    ,   Transport,         TCP,Packets Received with Checksum Errors = 0
   1,               ,    ,   Transport,         TCP,Packets Received with Bad Offset = 0
   1,               ,    ,   Transport,         TCP,Packets Received that are Too Short = 0
   1,               ,[1024], Application,TRAFFIC-GEN Server,Client Address = 192.0.0.6
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session Start Time (s) = 0.226811597
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session End Time (s) = 19.909754409
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session Status = Closed
   1,               ,[1024], Application,TRAFFIC-GEN Server,Total bytes Received = 8320
   1,               ,[1024], Application,TRAFFIC-GEN Server,Total Data Units Received = 260
   1,               ,[1024], Application,TRAFFIC-GEN Server,Throughput (bits/s) = 3381
   1,               ,[1024], Application,TRAFFIC-GEN Server,Average End-to-End Delay (s) = 0.026102975
   1,               ,[1024], Application,TRAFFIC-GEN Server,Average Jitter (s) = 0.070900925
   1,               ,[1024], Application,TRAFFIC-GEN Server,Client Address = 192.0.0.4
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session Start Time (s) = 0.107537260
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session End Time (s) = 19.984968334
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session Status = Closed
   1,               ,[1024], Application,TRAFFIC-GEN Server,Total bytes Received = 11232
   1,               ,[1024], Application,TRAFFIC-GEN Server,Total Data Units Received = 351
   1,               ,[1024], Application,TRAFFIC-GEN Server,Throughput (bits/s) = 4520
   1,               ,[1024], Application,TRAFFIC-GEN Server,Average End-to-End Delay (s) = 0.020293675
   1,               ,[1024], Application,TRAFFIC-GEN Server,Average Jitter (s) = 0.056769093
   1,               ,[1024], Application,TRAFFIC-GEN Server,Client Address = 192.0.0.5
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session Start Time (s) = 0.058634166
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session End Time (s) = 19.798208565
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session Status = Closed
   1,               ,[1024], Application,TRAFFIC-GEN Server,Total bytes Received = 6624
   1,               ,[1024], Application,TRAFFIC-GEN Server,Total Data Units Received = 207
   1,               ,[1024], Application,TRAFFIC-GEN Server,Throughput (bits/s) = 2684
   1,               ,[1024], Application,TRAFFIC-GEN Server,Average End-to-End Delay (s) = 0.026288118
   1,               ,[1024], Application,TRAFFIC-GEN Server,Average Jitter (s) = 0.090685171
   1,               ,[1024], Application,TRAFFIC-GEN Server,Client Address = 192.0.0.2
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session Start Time (s) = 0.028508654
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session End Time (s) = 19.981800333
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session Status = Closed
   1,               ,[1024], Application,TRAFFIC-GEN Server,Total bytes Received = 12832
   1,               ,[1024], Application,TRAFFIC-GEN Server,Total Data Units Received = 401
   1,               ,[1024], Application,TRAFFIC-GEN Server,Throughput (bits/s) = 5144
   1,               ,[1024], Application,TRAFFIC-GEN Server,Average End-to-End Delay (s) = 0.009312223
   1,               ,[1024], Application,TRAFFIC-GEN Server,Average Jitter (s) = 0.046495978
   1,               ,[1024], Application,TRAFFIC-GEN Server,Client Address = 192.0.0.3
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session Start Time (s) = 0.017999448
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session End Time (s) = 19.943126887
   1,               ,[1024], Application,TRAFFIC-GEN Server,Session Status = Closed
   1,               ,[1024], Application,TRAFFIC-GEN Server,Total bytes Received = 10880
   1,               ,[1024], Application,TRAFFIC-GEN Server,Total Data Units Received = 340
   1,               ,[1024], Application,TRAFFIC-GEN Server,Throughput (bits/s) = 4368
   1,               ,[1024], Application,TRAFFIC-GEN Server,Average End-to-End Delay (s) = 0.019138684
   1,               ,[1024], Application,TRAFFIC-GEN Server,Average Jitter (s) = 0.060386082
   2,               , [0],    Physical,    802_15_4,Signals transmitted = 1571
   2,               , [0],    Physical,    802_15_4,Signals detected = 1732
   2,               , [0],    Physical,    802_15_4,Signals locked on by PHY = 1191

From this data file I need to extract the client address 192.0.0.(\d+) and the corresponding data units received. I tried the following:

open(STATFILE,$statfile) or die ("could not open the stat file");
my $lines = do { local $/; <STATFILE> };

my @ones;
my @twos;

    while($lines =~ m/traffic-gen.*client address = 192\.0\.0\.(\d+).*data units received = (\d+).*jitter/scgi)
    { 
        print $1, "\tyes\t",$2,"\n";
        push @ones, $1;
        push @twos, $2;


    }

    print @ones,@twos;

but it is printing only the last address and the corresponding data units received: that is, 3 and 340. I am new to regular expressions and Perl in general. Kindly help! Thanks in advance :)

Re: multiple regexp matches in multi line string by moritz (Cardinal) on May 07, 2010 at 09:57 UTC
Please read Death to Dot Star! to learn why you get only one match.
Re^2: multiple regexp matches in multi line string by wirelesscharlie (Initiate) on May 10, 2010 at 06:39 UTC
Thanks a lot!
Re: multiple regexp matches in multi line string by k_manimuthu (Monk) on May 07, 2010 at 10:12 UTC
In your code, you used the greedy match (.*). If you use the non-greedy match (.*?) instead, you will get all the addresses. `while($lines =~ m/traffic-gen.*?client address = 192\.0\.0\.(\d+).*?data units received = (\d+).*?jitter/scgi) { print $1, "\tyes\t",$2,"\n"; push @ones, $1; push @twos, $2; }`
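To make the difference between the two quantifiers concrete, here is a minimal sketch. The two-record `$text` sample is a shortened, hypothetical stand-in for the stat file, and the `@greedy` and `@lazy` names are only for illustration; the patterns are the one from the question and its non-greedy counterpart.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample containing two client records.
my $text = <<'END';
TRAFFIC-GEN Server,Client Address = 192.0.0.6
TRAFFIC-GEN Server,Total Data Units Received = 260
TRAFFIC-GEN Server,Average Jitter (s) = 0.070900925
TRAFFIC-GEN Server,Client Address = 192.0.0.4
TRAFFIC-GEN Server,Total Data Units Received = 351
TRAFFIC-GEN Server,Average Jitter (s) = 0.056769093
END

# Greedy: with /s each .* first runs to the end of the string and then
# backtracks, so the single match ends up on the LAST record.
my @greedy = $text =~
    /traffic-gen.*client address = 192\.0\.0\.(\d+).*data units received = (\d+).*jitter/sgi;

# Non-greedy: each .*? stops at the first place the rest of the pattern
# can match, so /g finds one match per record.
my @lazy = $text =~
    /traffic-gen.*?client address = 192\.0\.0\.(\d+).*?data units received = (\d+).*?jitter/sgi;

print "greedy: @greedy\n";   # greedy: 4 351
print "lazy:   @lazy\n";     # lazy:   6 260 4 351
```

In list context with /g, the match returns every captured group from every match, which is why the non-greedy version yields address/units pairs in file order.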
Re^2: multiple regexp matches in multi line string by wirelesscharlie (Initiate) on May 10, 2010 at 06:38 UTC
Yes that worked!! Thanks :)
Re: multiple regexp matches in multi line string by Marshall (Canon) on May 07, 2010 at 18:43 UTC
Welcome to Perl! This regex stuff can get tricky, and as you have seen, your regex is only matching the LAST address/unit combination (3, 340).

In general, when processing log files, "slurping" (reading the whole file into a single variable) is a bad idea. It is usually better to process the log file one line at a time. One reason is so that your code's memory use doesn't depend upon the size of the log file. Another reason is exactly the problem that you are experiencing: the regex gets more complex. There are also some issues about dealing with corrupted lines or records and such things, but that is not relevant to your current problem.

I don't know where you heard about `my $lines = do { local $/; <STATFILE> };`, but that uses two relatively rare constructs in the same statement! It is highly likely that you can write Perl code for a number of years and never need either "do" or "local".

Below, I wrote a simple parser for you. There is a predefined handle called DATA (it reads whatever follows the __DATA__ marker at the end of the script), which I used instead of opening an external file. The code loops on each line of DATA and searches for a line that ends with "Client Address = some_ip_address". Then the code calls get_units_rcvd() to get that number, and the results are printed. That's it. Well, the code does loop and do the same thing again!

The subroutine get_units_rcvd() could be written more compactly, as could all of this code. But more compact doesn't mean "faster", and there is no need here. This is a demo of a common methodology: look for something that "starts the record" and then call a subroutine to complete the job.

#!/usr/bin/perl -w
use strict;

while (<DATA>)
{
    if (m/Client Address = ([0-9.]+)\s$/)
    {
        my $ip_adr = $1;
        my $units_rcvd = get_units_rcvd();
        print "$ip_adr => $units_rcvd\n";
    }
}

sub get_units_rcvd
{
    while (<DATA>)
    {
        if (/Data Units Received = ([0-9]+)\s$/)
        {
            return ($1);
        }
    }
}

=prints:
192.0.0.6 => 260
192.0.0.4 => 351
192.0.0.5 => 207
192.0.0.2 => 401
192.0.0.3 => 340
=cut
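For comparison, the same line-at-a-time idea can be sketched with a single loop and a state variable instead of a helper subroutine. This is a variant, not the code from the post: the shortened `$sample` data, the `%units_for` hash and the `$current_ip` variable are all assumptions made for illustration, and an in-memory filehandle stands in for the real stat file.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample; a real script would open the stat file instead.
my $sample = <<'END';
   1, ,[1024], Application,TRAFFIC-GEN Server,Client Address = 192.0.0.6
   1, ,[1024], Application,TRAFFIC-GEN Server,Session Status = Closed
   1, ,[1024], Application,TRAFFIC-GEN Server,Total Data Units Received = 260
   1, ,[1024], Application,TRAFFIC-GEN Server,Client Address = 192.0.0.4
   1, ,[1024], Application,TRAFFIC-GEN Server,Total Data Units Received = 351
END

# Reading the string through an in-memory filehandle keeps the sketch self-contained.
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";

my %units_for;     # client address => data units received
my $current_ip;    # address of the record currently being read, if any

while (my $line = <$fh>) {
    if ($line =~ /Client Address = ([0-9.]+)\s*$/) {
        $current_ip = $1;                  # this line starts a new record
    }
    elsif (defined $current_ip
           and $line =~ /Data Units Received = ([0-9]+)\s*$/) {
        $units_for{$current_ip} = $1;      # this line completes the record
        undef $current_ip;
    }
}
close $fh;

print "$_ => $units_for{$_}\n" for sort keys %units_for;
```

Only one line is held in memory at a time, and the results end up in a hash that later code can query by address.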
Re^2: multiple regexp matches in multi line string by ambrus (Abbot) on May 08, 2010 at 15:09 UTC
Quoting Marshall: I don't know where you heard about `my $lines = do { local $/; <STATFILE> };`, but that uses two relatively rare constructs in the same statement! It is highly likely that you can write Perl code for a number of years and never need either "do" or "local".

That's a common idiom mentioned in perlfaq5 under the question "How can I read in an entire file all at once?". A Super Search confirms that this idiom is quite popular and well-known.
Re^3: multiple regexp matches in multi line string by Marshall (Canon) on May 10, 2010 at 12:38 UTC
Yes, this idiom, albeit well known, is often misused. I added some more comments at: Re^3: multiple regexp matches in multi line string.
Re^2: multiple regexp matches in multi line string by wirelesscharlie (Initiate) on May 10, 2010 at 07:08 UTC
Thanks a lot!! I guessed that loading the complete file into a variable would be bad practice... thanks for pointing out the better method. I got this idiom `my $lines = do { local $/; <STATFILE> };` from perlfaq5, but I confess that I did not understand how it works. I just had a vague idea that to use a multi-line regexp, there should be multiple lines in the variable. But I am curious to know how it works, especially "do" and "local". And also thanks for the tip-off on `<DATA>` :)
Re^3: multiple regexp matches in multi line string by Marshall (Canon) on May 10, 2010 at 11:46 UTC
$/ is called the "input record separator". Every text line that you read from a text file will contain "\n" at the end, and Perl says "hey, that's one 'record'" because $/ is "\n" by default.

`my $line_count=1; while (<FILEHANDLE>) { print $line_count++, " :$_"; }`

The code above prints a "line number" for each line. And yes, there is a special Perl variable ($.) that represents that without having to use $line_count, but I figure that Perl is terse enough without needing it here. Each thing in $_ is delimited by the input record separator; I used $line_count to show that there are indeed separate input lines being processed.

If you "undefine" the input record separator from its default of "\n", then the whole file will be read in one single shot! That's because there is no "stopping place" to define where an input record (in this case a "line") ends.

local is a weird critter: it says that within this scope, a global variable temporarily gets a new value (undef if you don't assign one). In functional terms this is like: push the current value, use the new value, pop the previous value when the scope ends. In the previous code, `$/ = undef; my $lines = <STATFILE>;` would read the file the same way as `my $lines = do { local $/; <STATFILE> };`, except that without local, $/ stays undefined afterwards. $/'s meaning is not FILEHANDLE specific: if you had multiple files open for read at once, the current value of $/ would apply to any subsequent read from any file, but you only had one file.

This "do {code}" is like an inline subroutine: it runs the block and returns the value of the last statement. You will probably write a lot of code before you really need this. Anyway, after this statement, $/ is auto-magically set back to its previous value, here the default of "\n".

Reading a whole file at a time into memory is good for some situations, and is good for records where fields span multiple lines. But you don't have that. You have a record that, although it contains multiple input lines, has nothing that spans a line (a "\n" boundary). This is typical of well designed log file output.
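The behaviour of `local $/` inside a `do` block can be checked with a small sketch. The three-line `$text` string and the variable names are assumptions for illustration; an in-memory filehandle replaces STATFILE so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical three-line "file" held in a string.
my $text = "line one\nline two\nline three\n";

# Default behaviour: $/ is "\n", so one read returns one line.
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my $first = <$fh>;
close $fh;

# Slurp idiom: inside the do block, local makes $/ undef, so a single
# read returns the whole file. do returns the value of its last statement.
open $fh, '<', \$text or die "cannot open in-memory file: $!";
my $all = do { local $/; <$fh> };
close $fh;

# Once the block has ended, $/ holds its previous value again.
my $restored = defined $/ && $/ eq "\n";

print "first read: $first";
print "slurped ", length($all), " characters\n";
print "\$/ restored: ", ($restored ? "yes" : "no"), "\n";
```

Because the change to $/ is confined to the block, later reads elsewhere in the program still see one line per read.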
Log files often get very large, and the ability to process them while only requiring a small subset of lines at a time in memory is often well worth the trouble! I demonstrated one classic way to process a multi-line record; the technique works in C, Java or any language.

Perl has some other cool ways. For more reading, check out Flipin good, or a total flop?. Perl has a "range" operator (the flip-flop, `..` in scalar context) that decides whether you are between two lines, each identified by a regular expression. This allows processing in a similar way to the "classic" method and does not require that the entire log file be read into memory at once, which again is often a huge advantage!

Hope this lengthy reply helped. There is a BIG difference between records whose fields span multiple lines and a record that contains a collection of more than one line. Well designed text log file formats do not have the former.
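As a rough illustration of the range operator mentioned above, the sketch below treats everything from a "Client Address" line through the next "Jitter" line as one record. The shortened `$sample` data and the `@pairs` and `$ip` names are assumptions for illustration, and the choice of the "Jitter" line as the end of a record is inferred from the sample data in the question.

```perl
use strict;
use warnings;

# Hypothetical, shortened sample with two client records and unrelated lines around them.
my $sample = <<'END';
TCP,Total Packets with Errors = 0
TRAFFIC-GEN Server,Client Address = 192.0.0.6
TRAFFIC-GEN Server,Session Status = Closed
TRAFFIC-GEN Server,Total Data Units Received = 260
TRAFFIC-GEN Server,Average Jitter (s) = 0.070900925
TRAFFIC-GEN Server,Client Address = 192.0.0.4
TRAFFIC-GEN Server,Total Data Units Received = 351
TRAFFIC-GEN Server,Average Jitter (s) = 0.056769093
802_15_4,Signals detected = 1732
END

open my $fh, '<', \$sample or die "cannot open in-memory file: $!";

my (@pairs, $ip);
while (my $line = <$fh>) {
    # In scalar context, .. is the flip-flop: it becomes true on a
    # "Client Address" line and stays true through the next "Jitter" line.
    if ($line =~ /Client Address/ .. $line =~ /Jitter/) {
        $ip = $1 if $line =~ /Client Address = ([0-9.]+)/;
        push @pairs, "$ip => $1" if $line =~ /Data Units Received = ([0-9]+)/;
    }
}
close $fh;

print "$_\n" for @pairs;
```

Lines outside a record, such as the TCP and 802_15_4 statistics, never enter the `if` block, and the file is still read one line at a time.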