
Thank you for all of your responses.
So... more detail is clearly needed.
My input comes from tcpdump (reading a capture file). I won't say the standard set of pcap modules from CPAN couldn't work for
reading the file, but they would be very hard for me to use: we don't use standard frame types, only custom protocols,
so I would have to write all my own frame types for the Perl modules.

I have been working on getting Parallel::Fork::BossWorker working, but it spawns never-ending defunct processes, despite setting the number of workers.
There was a bug posted with a patch addressing that, but with that patch applied the Perl script does nothing (the child PIDs
hang). I am working on that, but I'm getting side-tracked.

I am using a slightly modified form of the afterglow tcpdump Perl parsing script. By itself, it was writing data to the output file
at around 2 MB/s, so for a 1.5 GB file, it was taking around 12 minutes to finish.
When I tried to implement threads, it wrote at around 200 KB/s, so it was much slower.
(By the way, my box is an 8-CPU Harpertown box with 8 GB of RAM, and the files are all on an EMC SAN.)
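
To put those two rates side by side, here is a rough back-of-the-envelope sketch. It assumes the 1.5 GB figure is the amount of data written at the quoted rate, and it takes 1 GB as 1000 MB; `minutes_at` is just a helper name made up for this example.

```perl
use strict;
use warnings;

# Figures quoted above; 1 GB is taken as 1000 MB for a rough estimate.
my $file_mb = 1.5 * 1000;

# Minutes needed to write $file_mb at a given rate in MB/s.
sub minutes_at {
    my ($mb_per_sec) = @_;
    return $file_mb / $mb_per_sec / 60;
}

printf "single process at 2 MB/s: %.1f minutes\n", minutes_at(2);
printf "threaded at 200 KB/s:     %.1f minutes\n", minutes_at(0.2);
```

So the threaded version, if the rate held for the whole file, would take roughly ten times as long as the single-process run.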

I don't blame the threads, and being such a novice at this, I'm sure it's the way I did it, so I will post the code.
Please forgive my fumbling attempts at all of this; following examples and such, it's the best I could figure out.
There are some commented-out parts that I normally would not leave in when posting code, but I'm
leaving them in as they show that I also (for the heck of it) tried using semaphore locking on the output file to
make sure there was no contention on the file by multiple threads (it didn't improve anything).
I have also tried file locking, which didn't improve anything either.
Right now, the sub that prints the output is locked so only one thread can access the print at a time, but that hasn't
had any effect either; removing it wouldn't change the results.

I have tried select on STDIN, STDOUT, both, and neither, which didn't have any effect either.
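
For anyone following along, this is a small sketch of what the `select(...); $| = 1;` idiom does, written with `IO::Handle` instead. `$|` only affects output handles (it forces a flush after every print on the selected handle), so setting it on STDIN does not change how input is read. The temp file here is a stand-in for my real output file, not part of the script.

```perl
use strict;
use warnings;
use IO::Handle;
use File::Temp qw(tempfile);

# Hypothetical scratch file standing in for /crnch_data/foo.csv.
my ( $out, $path ) = tempfile( UNLINK => 1 );

# Same effect as: my $old = select($out); $| = 1; select($old);
# but without changing the default handle used by later print calls.
$out->autoflush(1);

print {$out} "row $_\n" for 1 .. 3;

# With autoflush on, the rows have already been written out before close().
my $size = -s $path;
print "bytes on disk before close: $size\n";

close($out) or die "close failed: $!";
```

With autoflush on, every print becomes its own write, so on a large output file leaving it off and letting Perl buffer may be the cheaper choice; I have not measured that here.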

I would also say that this is entirely CPU bound, as
the CPU definitely pegs at 100%. With the threads, I do see the kernel spreading the load out across the CPUs.

Also, putting the pipe in the Perl script makes no difference: running tcpdump (blah) | script vs. open (blah |) doesn't change any timings.
One other note: I have tried various numbers of workers, with no effect.
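
To be clear about the two ways of feeding the script, here is a minimal sketch of the "open (blah |)" variant using the list form of a piped open. `open_command` and `count_lines` are names made up for this example, and the current perl interpreter stands in for tcpdump so the sketch runs anywhere; in my case the command would be /usr/sbin/tcpdump with the options shown in the script below.

```perl
use strict;
use warnings;

# Start a command and return a handle on its stdout.
# The list form of open runs the command without going through a shell.
sub open_command {
    my @cmd = @_;
    open( my $fh, '-|', @cmd ) or die "Cannot run $cmd[0]: $!";
    return $fh;
}

# Count lines from a handle; a stand-in for the real per-line parsing.
sub count_lines {
    my ($fh) = @_;
    my $n = 0;
    $n++ while <$fh>;
    return $n;
}

# Stand-in for tcpdump: the current perl prints three lines.
my @cmd = ( $^X, '-e', 'print "line $_\n" for 1 .. 3' );
my $fh  = open_command(@cmd);
printf "%d lines read\n", count_lines($fh);
close($fh) or warn "command exited with status $?";
```

The other variant (tcpdump ... | script) is the same loop reading from STDIN instead of `$fh`.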

Anyway, this has been long enough. FWIW here's the script. Please be gentle :-)

--- NOTE: credit to http://afterglow.sourceforge.net for their tcpdump2sql.pl script, which is mostly what you see below

use strict;
use Thread::Pool;
#use Thread::Semaphore;

open( OUTFILE, ">/crnch_data/foo.csv" ) or die "Cannot open output";
print OUTFILE "\n";

#select(STDIN); $| = 1;
open( STDIN, "/usr/sbin/tcpdump -vttttnner /crnch_data/tcpdump_infile |" );
select(OUTFILE); $| = 1;

my $res = "";
my $pool = Thread::Pool->new(
    {
       workers => 10,
       do      => \&do,
       stream  => \&monitor,
    }
);

$pool->job($_) while (<STDIN>);
$pool->shutdown;

sub do {
    # The line to parse arrives as an argument; a bare chomp would act on $_ instead.
    my $input = shift;
    chomp $input;

    if ( $input =~ /(^\d\d\d\d-\d\d-\d\d .*)/ ) {
        if ( $input =~
/(.*) \(tos (\S+), ttl +(\d+), id (\d+), offset (\d+), flags ([\S\+]+)(?:, ([\S\+]+))?, proto +(\S+ \(\d+\)), length (\d+)(?:, .*?)?\) (.*)/
           )
        {
            $input = "$1 $2 $3 $4 $5 $6$7 $8 $9 $10";
        }
        else {
            if ( $input =~
/(.*) (((?:(\d{1,2}|[a-fA-F]{1,2}){2})(?::|-*)){6}) (\>) (((?:(\d{1,2}|[a-fA-F]{1,2}){2})(?::|-*)){6}), (.*?), length (\d+)\:(.*)/
               )
            {
                $input = "$1 $2 x $6 x x x x $10";
            }
            else {
                # Return the marker so monitor() can skip lines that did not parse.
                return "error";
            }
        }
    }
    else {
        return "error";
    }

    my @fields = split( " ", $input );
    my $timestamp   = $fields[0] . " " . $fields[1];
    my $microsecond = $fields[1];
    $timestamp   =~ s/(.*?)\.\d+$/$1/;
    $microsecond =~ s/(.*?)\.(\d+$)/$2/;
    my $sourcemac = $fields[2];
    my $destmac   = $fields[4];
    $destmac =~ s/,//g;
    $fields[4] =~ s/,$//;
    my $len = $fields[9];
    $len =~ s/:$//;
    my $tos = $fields[11];
    my $ttl     = $fields[12];
    my $id      = $fields[13];
    my $offset  = $fields[14];
    my $ipflags = $fields[15] . " " . $fields[16];

    $ipflags =~ s/\[(.*)\]/$1/g;

    my $sip = $fields[18];
    $sip =~ s/([^\.]+\.[^\.]+\.[^\.]+\.[^\.]+).*/$1/;
    $sip =~ s/:$//;

    my $sport = $fields[18];
    if ( $sport =~ /[^\.]+\.[^\.]+\.[^\.]+\.[^\.]+\.(.*)/ ) {
        $sport = $1;
        $sport =~ s/:$//;
    }
    else { $sport = "null"; }

    my $dip = $fields[20];

    $dip =~ s/([^\.]+\.[^\.]+\.[^\.]+\.[^\.]+).*/$1/;
    $dip =~ s/:$//;

    my $dport = $fields[20];
    if ( $dport =~ /[^\.]+\.[^\.]+\.[^\.]+\.[^\.]+\.(.*)/ ) {
        $dport = $1;
        $dport =~ s/:$//;
    }
    else { $dport = "null"; }

    # Declare each variable once; "my ... if" leaves the value undefined when the test fails.
    my $proto = "null";
    my $flags = "";
    if ( $fields[21] =~ /[SRPU.]+/ ) {
        $flags = $fields[21];
        $proto = "tcp";
    }
    $_ = "//$timestamp//$microsecond//$sourcemac//$destmac//$sip//$dip//$sport//$dport//$proto//$flags//$len//$ttl//$id//$tos//$ipflags//$offset?";
}

sub monitor : locked {
    my $line = $_;

# my $semaphore = new Thread::Semaphore;
    $line = shift;
    unless ( $line eq "error" ) {
# $semaphore->down;
        print OUTFILE $line;
# $semaphore->up;
    }
}
close(OUTFILE);

In reply to Re: How do you parallelize STDIN for large file processing? by forsaken75
in thread How do you parallelize STDIN for large file processing? by forsaken75
