Define string on current line, then match other lines with string below the line

Mashed Potato has asked for the wisdom of the Perl Monks concerning the following question:

Hello Sirs - please take it easy on me if this is an elementary question, I'm fairly new to Perl. I could not find any relevant topics.

I am trying to define a string from a subset of each line which matches my regex, and then read down in the file to find the other lines which match the string. If there is more than one match, print the number of matches and the lines.

My data:

src=2.2.2.2 dst=1.1.1.1 src_port=50232 dst_port=514 reason=AGE OUT
src=2.2.2.2 dst=1.1.1.1 src_port=50232 dst_port=514 reason=Traffic Denied
src=3.3.3.3 dst=4.4.4.4 src_port=50235 dst_port=123 reason=AGE OUT
src=3.3.3.3 dst=4.4.4.4 src_port=50235 dst_port=123 reason=AGE OUT

My attempt seems to only print the first match. Is that because the inside file loop passes EOF to the outside? I believe I have used this before with arrays and it worked, but this file is too large to take into memory.

#!/usr/bin/perl


use strict;
use warnings;

my $file = 'tmpfile';
my $match;
my $numelements;


open my $info, $file or die "Could not open $file: $!";

while( my $line = <$info>) {
        if ( $line =~ m/(src\=\d+\.\d+\.\d+\.\d+\sdst\=\d+\.\d+\.\d+\.\d+\ssrc\_port\=\d+\sdst\_port\=\d+).*AGE OUT/ ) {
                $match = "$1";
                push (my @dups, $line);
                while( my $twoline = <$info> ) {
                        if ( $twoline =~ /$match/ ) {
                                push @dups, $twoline;
                        }
                }               
        $numelements = @dups;
        print "$match has $numelements elements\n";
        if ( $numelements > 1 ) {
        print join("\n", @dups);
        }
        }

}

close $info;

Output I get:

$./tcprst.pl
src=2.2.2.2 dst=1.1.1.1 src_port=50232 dst_port=514 has 2 elements
src=2.2.2.2 dst=1.1.1.1 src_port=50232 dst_port=514 reason=AGE OUT
src=2.2.2.2 dst=1.1.1.1 src_port=50232 dst_port=514 reason=Traffic Denied
$

Output I expect:

 
src=2.2.2.2 dst=1.1.1.1 src_port=50232 dst_port=514 has 2 elements
src=2.2.2.2 dst=1.1.1.1 src_port=50232 dst_port=514 reason=AGE OUT
src=2.2.2.2 dst=1.1.1.1 src_port=50232 dst_port=514 reason=Traffic Denied

src=3.3.3.3 dst=4.4.4.4 src_port=50235 dst_port=123 has 2 elements
src=3.3.3.3 dst=4.4.4.4 src_port=50235 dst_port=123 reason=AGE OUT
src=3.3.3.3 dst=4.4.4.4 src_port=50235 dst_port=123 reason=AGE OUT

Thank you in advance


Re: Define string on current line, then match other lines with string below the line
by GotToBTru (Prior) on Oct 02, 2014 at 18:27 UTC

Your interior while loop runs to the end of the file. You need to seek back to the point you found your last match (assuming they always come in groups) to find the next string to search for. It looks like you could use the number of elements in the first line to know how many lines to search.
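
To see why only the first match is printed, here is a minimal sketch of the same nested-read pattern. The three-line input and the counter names are made up for illustration, and an in-memory filehandle stands in for the real file. Both loops read from one handle, so the inner loop uses up every remaining line and the outer loop finds nothing left on its second read.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical three-line input, read through an in-memory filehandle.
my $data = "first\nsecond\nthird\n";
open my $fh, '<', \$data or die "Could not open in-memory file: $!";

my $outer_passes = 0;    # how many times the outer loop body runs
my $inner_reads  = 0;    # how many lines the inner loop consumes

while ( my $line = <$fh> ) {
    $outer_passes++;
    # Same handle as the outer loop: this reads every remaining line.
    while ( my $next = <$fh> ) {
        $inner_reads++;
    }
}
close $fh;

print "outer loop ran $outer_passes time(s), inner loop read $inner_reads line(s)\n";
```

The outer loop body runs once and the inner loop reads the other two lines, which matches the behaviour described in the question.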

1 Peter 4:10


Re: Define string on current line, then match other lines with string below the line
by QM (Parson) on Oct 03, 2014 at 10:36 UTC

Use tell to save the position and seek to return to it:

while (my $line = <$info>) {
    my $next_line_start = tell($info); # save start position of next line
    if ($line =~ /regex/) {
        # do something here, e.g. read ahead in $info looking for more matches
    }
    # reposition at $next_line_start (0 means an absolute offset, SEEK_SET)
    seek($info, $next_line_start, 0);
}

Update: removed defined in the while test, as it is included (here, at least) in while magic.
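
For a fuller picture, the tell/seek idea might be combined with the original inner loop roughly as follows. This is only a sketch under some assumptions: the sample data is held in memory instead of in 'tmpfile', and the %seen hash and @report array are helpers introduced here so each connection is reported once. They are not part of the original code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample records from the question, kept in memory so the sketch is self-contained.
my $data = <<'END';
src=2.2.2.2 dst=1.1.1.1 src_port=50232 dst_port=514 reason=AGE OUT
src=2.2.2.2 dst=1.1.1.1 src_port=50232 dst_port=514 reason=Traffic Denied
src=3.3.3.3 dst=4.4.4.4 src_port=50235 dst_port=123 reason=AGE OUT
src=3.3.3.3 dst=4.4.4.4 src_port=50235 dst_port=123 reason=AGE OUT
END

open my $info, '<', \$data or die "Could not open in-memory file: $!";

my $key_re = qr/(src=(?:\d+\.){3}\d+ dst=(?:\d+\.){3}\d+ src_port=\d+ dst_port=\d+)/;
my %seen;      # connections already reported (assumed helper)
my @report;    # collected output lines (assumed helper)

while ( my $line = <$info> ) {
    my $next_line_start = tell($info);    # offset just after the current line
    if ( $line =~ /$key_re.*AGE OUT/ ) {
        my $match = $1;
        if ( !$seen{$match}++ ) {
            my @dups = ($line);
            # Read ahead to the end of the file, keeping lines for this connection.
            while ( my $twoline = <$info> ) {
                push @dups, $twoline if $twoline =~ /\Q$match\E\b/;
            }
            push @report, "$match has " . scalar(@dups) . " elements\n";
            push @report, @dups if @dups > 1;
        }
    }
    seek( $info, $next_line_start, 0 );    # 0 is SEEK_SET: jump back to the saved offset
}
close $info;

print @report;
```

Note that this rescans the rest of the file once for every new connection, so on a large file it does far more I/O than a single pass would.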

-QM
--
Quantum Mechanics: The dreams stuff is made of


Re: Define string on current line, then match other lines with string below the line
by mr_mischief (Monsignor) on Oct 03, 2014 at 17:14 UTC

Maybe I'm being dense, but why are you reading sections of the file more than once anyway? You only need to read from where you are and below as I read your node. This is usually done by setting some sort of flag value and testing that flag. Your I/O system will thank you.

The following code produces very close to your expected output from your sample input. There may be an extraneous newline at the end if you care about that. I've heavily commented this to make it easier to follow. I also threw in some quite simple debugging for the data structure and made the regex a bit easier (for me) to read.

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper; # added for debugging the hash

my $DEBUG = 0;    # enable debugging if true
my $file = 'tmpfile';
my %connection;   # This becomes the central data structure. Called it 'connection' because it represents a typical TCP connection

open my $info, '<', $file or die "Could not open $file: $!"; # favor three-argument open when you're not using open's magic

# Read all the info in a single pass by use of a start flag, put it in the data structure.
while ( my $line = <$info> ) {
    if ( $line =~ m/(src=(?:\d+\.){3}\d+ dst=(?:\d+\.){3}\d+ src_port=\d+ dst_port=\d+) reason=(.*)/ ) { # capture the reason
        my $match = $1;

        if ( $2 eq 'AGE OUT' ) { # test to see if the reason is what we're looking for at the start
            $connection{ $match }{ 'aged_out' } = 1;
        }
        if ( exists $connection{ $match } and $connection{ $match }{ 'aged_out' } ) {
            # If we've received an 'AGE OUT' reason, then $connection{ $match }{ 'aged_out' } has been autovivified and we can start counting and pushing.
            # Keep track of this and following lines for this connection in this sub-hash.
            $connection{ $match }{ 'count' }++;
            push @{ $connection{ $match }{ 'line' } }, $line; # The actual lines are in a HoHoA here.
        }
    }
}
close $info;

print Dumper \%connection if $DEBUG; # pass a reference so the hash is dumped as one structure

# Now there's a data structure from the above loop we can loop over without accessing the file any longer.
for my $con ( keys %connection ) {
    print "$con has " . $connection{ $con }{ 'count' } . " elements\n";
    if ( $connection{ $con }{ 'count' } > 1 ) {
        print @{ $connection{ $con }{ 'line' } }; # Doesn't need to be joined because the newlines were never stripped.
    }
    print "\n";
}

Re^2: Define string on current line, then match other lines with string below the line

by Mashed Potato (Initiate) on Oct 05, 2014 at 14:14 UTC

First: thank you.

Second: HoHoA? That's one Ho short of a Santa-A (Canadian?). Seriously though, I have yet to venture into hash-land, never mind hashes of arrays, and certainly not Santas who are not playing with a full deck of Ho's.

Since it's obviously time for me to get into hashes, do you have any examples like this one where data from a file is pushed to the hash, as opposed to the user defining it? Unfortunately nobody at my work cares that I can create a hash with different fruits and vegetables from my mind.

And third, thanks for saying TCP because it made me realize I don't need to look for UDP connections. I will work that into the regex.

Great learning experience, thanks again to everyone who replied.


Re^3: Define string on current line, then match other lines with string below the line

by Athanasius (Archbishop) on Oct 05, 2014 at 16:04 UTC

Hello Mashed Potato,

HoHoA? ... I have yet to venture into hash-land ...

A good place to start would be chromatic’s book Modern Perl, which is available for free download from http://onyxneon.com/books/modern_perl/. Look at Chapter 3, “The Perl Language,” subsections “Hashes” and “Nested Data Structures.” Then work through these Perl documentation tutorials: perlreftut and perldsc.
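
As a small starting point for filling a hash from file data instead of typing it in by hand, something like the following might help. It is a sketch only: the sample lines come from the original post, an in-memory filehandle replaces the real file, and the names %count_by_src and %lines_by_reason are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines from the original post, read through an in-memory filehandle.
my $data = <<'END';
src=2.2.2.2 dst=1.1.1.1 src_port=50232 dst_port=514 reason=AGE OUT
src=2.2.2.2 dst=1.1.1.1 src_port=50232 dst_port=514 reason=Traffic Denied
src=3.3.3.3 dst=4.4.4.4 src_port=50235 dst_port=123 reason=AGE OUT
src=3.3.3.3 dst=4.4.4.4 src_port=50235 dst_port=123 reason=AGE OUT
END

open my $fh, '<', \$data or die "Could not open in-memory file: $!";

my %count_by_src;       # plain hash: source address => number of lines
my %lines_by_reason;    # hash of arrays: reason => list of lines

while ( my $line = <$fh> ) {
    chomp $line;
    next unless $line =~ /^src=(\S+) .* reason=(.+)$/;
    my ( $src, $reason ) = ( $1, $2 );
    $count_by_src{$src}++;                         # the key is created on first use
    push @{ $lines_by_reason{$reason} }, $line;    # the array is created on first push
}
close $fh;

for my $src ( sort keys %count_by_src ) {
    print "$src appears on $count_by_src{$src} line(s)\n";
}
for my $reason ( sort keys %lines_by_reason ) {
    my $n = @{ $lines_by_reason{$reason} };
    print "$reason: $n line(s)\n";
}
```

Every key here comes from the data itself; nothing is listed in the program ahead of time.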

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^3: Define string on current line, then match other lines with string below the line

by mr_mischief (Monsignor) on Oct 06, 2014 at 18:14 UTC

Yeah, I was a bit concerned that a hash of hashes of arrays was a bit complex in this case. Sometimes I find it easier to think about the levels backward. There's an array of the lines kept in 'line', and a reference to each 'line' is kept in its own $match hash. A reference to each $match is kept in %connection to hold it all together. The 'count' is just another branch of that tree. Set $DEBUG to 1 and look at the data structure.
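
Reading the levels backward can be tried out on a single connection. The sketch below fills the same three fields ('aged_out', 'count', 'line') for one key taken from the sample data and then looks at each level in turn. The variable names are chosen for illustration and the two input lines are hard-coded.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $key = 'src=3.3.3.3 dst=4.4.4.4 src_port=50235 dst_port=123';
my %connection;

# Fill the same three fields for one connection, from two hard-coded lines.
for my $line ( "$key reason=AGE OUT\n", "$key reason=AGE OUT\n" ) {
    $connection{$key}{aged_out} = 1;
    $connection{$key}{count}++;
    push @{ $connection{$key}{line} }, $line;
}

# Innermost level: the array of lines for this connection.
my $lines_ref = $connection{$key}{line};       # an array reference
my $first     = $connection{$key}{line}[0];    # one element of that array

# Middle level: the hash holding everything known about this connection.
my $conn_ref = $connection{$key};              # a hash reference
my @fields   = sort keys %$conn_ref;           # the names of its branches

# Outermost level: %connection maps each key to its own middle-level hash.
print ref($lines_ref), " inside ", ref($conn_ref), "\n";
print "fields: @fields\n";
print "count: $connection{$key}{count}\n";
print "first line: $first";
```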

I've found some quotes about data structures I'd like to share before I start giving bibliography.

"Bad programmers worry about the code. Good programmers worry about data structures and their relationships." -- Linus Torvalds
"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious." -- Fred Brooks
"Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming." -- Rob Pike
"It is better to have 100 functions operate on one data structure than to have 10 functions operate on 10 data structures." -- Alan J. Perlis

Besides the wonderful Modern Perl already mentioned in the thread, there are other resources, too.

Chapter 9 of Programming Perl is all about data structures as they apply to Perl.
perldsc is too.
The data structures portion of Tom Christiansen (tchrist)'s Far More Than You Ever Wanted To Know series.
