PerlMonks  

How to split file for threading?

by diyaz (Beadle)
on Jun 23, 2015 at 14:07 UTC ( [id://1131634] )

diyaz has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have to filter files larger than 20GB, pulling out any line that contains a user-specified keyword. The script processes a tab-delimited file and searches each field for the keyword. Here is my single-threaded version (please excuse my novice coding/commenting):
use strict;
use warnings;
use Data::Dumper;

#declarations
my @rawfile;
my $filter_count = 0;

#input
open(INFILE, "<$ARGV[0]") || die "cannot open file: $!";

####
#filter file
#------------------
# parse entire file
my $header = <INFILE>;    #grabbing header
chomp $header;
my @headerArray = split("\t", $header);
my $sizeheader  = @headerArray;

for my $line (<INFILE>) {
    chomp $line;
    my @splitline = split("\t", $line);
    #my $poskey = $splitline[0] . ":" . $splitline[1];
    for (@splitline) {
        if (/^$ARGV[1]/) {
            $filter_count++;
            push(@rawfile, $line);
        }
    }
}
close INFILE;

print "Completed filtering $ARGV[0]\n";
print "Found $filter_count elements\n";

my $outfilename = substr($ARGV[0], 0, length($ARGV[0]) - 4) . "_filter.txt";
print "Filtering to output file: $outfilename\n";

####
#output file
#------------------
open(OUTFILE, ">$outfilename") || die "cannot open file to write: $!";
print OUTFILE "$header\n";
for (@rawfile) {
    print OUTFILE "$_\n";
}
close OUTFILE;
I hope this will be a good example to learn threading from. Thanks!

Replies are listed 'Best First'.
Re: How to split file for threading?
by BrowserUk (Patriarch) on Jun 23, 2015 at 14:16 UTC

    If you don't care which field the keyword is found in, why are you bothering to split the line?

    It would be much faster to just test if the keyword appears in the whole line.
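    For instance, a minimal sketch of that idea (assuming, as in the OP's script, that the filename and keyword arrive as the two command-line arguments; `count_matches` is a hypothetical helper name):

```perl
use strict;
use warnings;

# count lines of $filename that contain the literal $keyword;
# index() is a plain substring search, so no per-field split is needed
sub count_matches {
    my ( $filename, $keyword ) = @_;
    open my $fh, '<', $filename or die "cannot open $filename: $!";
    my $header = <$fh>;    # skip the header line
    my $count  = 0;
    while ( my $line = <$fh> ) {
        ++$count if index( $line, $keyword ) >= 0;
    }
    close $fh;
    return $count;
}

print count_matches( $ARGV[0], $ARGV[1] ), "\n" if @ARGV == 2;
```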

    With respect to threading the application: where are the files located? I.e. on a local disk, a local SSD, or remote?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
      Ah, good point. I could regex the whole line, right? File is remote, but I think everything I do is remote since I have to log in to the server
        File is remote, but I think everything I do is remote since I have to log in to the server

        The question is whether the file is local or remote to the processor running the program; not the human who initiates it.

        You could try this, but it is doubtful if it will be any faster when using multiple threads than just one unless the file is on a local, fast, SSD:

        #! perl -slw
        use strict;
        use threads;

        sub worker {
            my( $filename, $target, $start, $end ) = @_;
            open my $fh, '<', $filename or die $!;
            seek $fh, $start, 0;
            <$fh> if $start > 0;    ## discard first partial line
            my $count = 0;
            1+index( <$fh>, $target ) and ++$count while tell( $fh ) < $end;
            return $count;
        }

        our $T //= 4;
        my( $filename, $target ) = @ARGV;
        my $fsize = -s $filename;
        my $chunksize = int( $fsize / $T );
        my @chunks = map{ $_ * $chunksize } 0 .. $T-1;
        push @chunks, $fsize;

        my @threads = map{
            threads->new( \&worker, $filename, $target, $chunks[ $_ ], $chunks[ $_+1 ] )
        } 0 .. $T-1;

        my $total = 0;
        $total += $_->join for @threads;
        print "Found $total '$target' lines";

        Usage:

        thisScript.pl -T=n theFile.txt "the string"

        Note: The count is printed to stdout. Redirect it if you need it in a file.



        If you can search the whole line without breaking it up, this:

        for my $line (<INFILE>) {
            chomp $line;
            my @splitline = split("\t", $line);
            #my $poskey = $splitline[0] . ":" . $splitline[1];
            for (@splitline){
                if (/^$ARGV[1]/){
                    $filter_count++;
                    push(@rawfile,$line);
                }
            }
        }

        can become this:

        while (my $line = <INFILE>) {
            chomp $line;
            if ($line =~ /^$ARGV[1]/){
                $filter_count++;
                push(@rawfile,$line);
            }
        }

        -stevieb
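        A caveat worth noting: the original /^$ARGV[1]/ is anchored at the start of each field, so applied to the whole line it only ever matches the first column. A closer whole-line equivalent (a sketch; `field_starts_with` is a hypothetical helper name) anchors the match at the start of the line or immediately after a tab:

```perl
use strict;
use warnings;

# true if any tab-delimited field of $line starts with $keyword,
# mimicking the per-field /^keyword/ test of the original script
sub field_starts_with {
    my ( $line, $keyword ) = @_;
    # \Q...\E quotes any regex metacharacters in the keyword;
    # (?:^|\t) anchors at line start or just after a tab
    return $line =~ /(?:^|\t)\Q$keyword\E/ ? 1 : 0;
}
```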

Re: How to split file for threading?
by Anonymous Monk on Jun 23, 2015 at 14:19 UTC
    for my $line (<INFILE>) {

    Have you tried actually running this code on a 20GB file?

    Do you remember the responses to your previous question?

      It doesn't take hours, only maybe 10-15 minutes. I was thinking this would be an easy example for learning threading. I did see the previous posts. Are you implying I should use Bio::Seq to thread?

        No. Anonymonk is suggesting not using for() to read a file.

        Use while (my $line = <INFILE>) { ... } instead.

Re: How to split file for threading?
by wollmers (Scribe) on Jun 23, 2015 at 15:02 UTC

    IMHO you can just use grep and wc.
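    For example, a sketch of that approach (assuming the input is input.txt and the keyword is a fixed string; -F makes grep do a plain substring search, like Perl's index):

```shell
# keep the header line, then append every data line containing the keyword
head -n 1 input.txt > input_filter.txt
tail -n +2 input.txt | grep -F 'keyword' >> input_filter.txt

# count matching data lines
tail -n +2 input.txt | grep -cF 'keyword'
```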

Re: How to split file for threading?
by marioroy (Prior) on Jun 26, 2015 at 20:04 UTC

    Update: the OP mentions searching for a keyword, so I changed the first if statement to use index instead of a regex.

    Parallelism is beautiful. IO reads against the input file are sequential in MCE no matter the number of workers. Workers communicate the next offset position for the next worker waiting to read. The chunk_id value makes it possible to have ordered output.

    Here is a parallel implementation of the OP's script. Chomping isn't necessary after all: why chomp each line only to append the line feed back at output time? But please add it back if you need it. Perhaps the chomp in the OP's script was meant to be there; if so, I apologize for taking it out. :-)

    The initial pattern matching against the slurped chunk is likely beneficial. But, please feel free to comment out the outer-most if statement and closing brace and give it a try. Scripting is fun.

    Kind regards, Mario

    use strict;
    use warnings;

    use MCE::Loop;
    use MCE::Candy;

    # ensure two arguments are provided to the script
    my ($fileName, $keyword) = @ARGV;
    die "usage: $0 file keyword\n" if @ARGV != 2;

    # grab header line
    open INFILE, '<', $fileName or die "cannot open file to read: $!";
    my $header = <INFILE>;
    close INFILE;

    # utilize many-core engine to filter file
    # out_iter_array returns a closure for gathering orderly
    my @rawData;

    MCE::Loop::init {
        max_workers => 'auto',   # note: 'auto' is never higher than 8
        gather      => MCE::Candy::out_iter_array(\@rawData),
        use_slurpio => 1,
    };

    mce_loop_f {
        my ($mce, $slurped_ref, $chunk_id) = @_;

        # quickly determine if the keyword is found; this is fast
        # think of this as short-circuiting unnecessary work
        my ($count, $foundData) = (0, '');

        if ( 1 + index($$slurped_ref, $keyword) ) {
            open my $MEM_FH, '<', $slurped_ref;
            binmode $MEM_FH, ':raw';

            # skip header line for the first chunk only
            if ($chunk_id == 1) {
                while (<$MEM_FH>) {
                    if (/$keyword/) {
                        next if $. == 1;     # skip header line
                        $foundData .= $_;    # append line
                        $count++;            # increment count
                    }
                }
            }
            # otherwise, the line number check is not necessary
            else {
                while (<$MEM_FH>) {
                    if (/$keyword/) {
                        $foundData .= $_;    # append line
                        $count++;            # increment count
                    }
                }
            }

            close $MEM_FH;
        }

        # gathers two elements; count and rawData in an anonymous array
        # gather must be called regardless of found or not found:
        # the manager process needs to know this chunk_id has completed
        # when gathering results orderly
        MCE->gather($chunk_id, [ $count, $foundData ]);

    } $fileName;

    MCE::Loop::finish;   # shutdown MCE workers

    # each element in rawData is an array ref [ $count, $foundData ]

    # output count
    my $filterCount = 0;
    $filterCount += $_->[0] for @rawData;   # $count
    print "Completed filtering $keyword\n";
    print "Found $filterCount elements\n";

    # output found data
    my $outFileName = substr($fileName, 0, length($fileName) - 4) . "_filter.txt";
    print "Filtering to output file: $outFileName\n";

    open OUTFILE, '>', $outFileName or die "cannot open file to write: $!";
    print OUTFILE $header;
    print OUTFILE $_->[1] for @rawData;   # $foundData
    close OUTFILE;

Node Type: perlquestion [id://1131634]
Approved by ww
Front-paged by GotToBTru