PerlMonks  

How to split file for threading?

by diyaz (Beadle)
on Jun 23, 2015 at 14:07 UTC ( [id://1131634] )

diyaz has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I have to filter files larger than 20GB, pulling out any line that contains a user-specified keyword. The script processes a tab-delimited file and searches each field for the keyword. Here is my single-threaded version (please excuse my novice coding/commenting):
use strict;
use warnings;
use Data::Dumper;

#declarations
my @rawfile;
my $filter_count = 0;

#input
open(INFILE, "<$ARGV[0]") || die "cannot open file: $!";

####
#filter file
#------------------
# parse entire file
my $header = <INFILE>;    #grabbing header
chomp $header;
my @headerArray = split("\t", $header);
my $sizeheader  = @headerArray;

for my $line (<INFILE>) {
    chomp $line;
    my @splitline = split("\t", $line);
    #my $poskey = $splitline[0] . ":" . $splitline[1];
    for (@splitline) {
        if (/^$ARGV[1]/) {
            $filter_count++;
            push(@rawfile, $line);
        }
    }
}
close INFILE;

print "Completed filtering $ARGV[0]\n";
print "Found $filter_count elements\n";

my $outfilename = substr($ARGV[0], 0, length($ARGV[0]) - 4) . "_filter.txt";
print "Filtering to output file: $outfilename\n";

####
#output file
#------------------
open(OUTFILE, ">$outfilename") || die "cannot open file to write: $!";
print OUTFILE "$header\n";
for (@rawfile) {
    print OUTFILE "$_\n";
}
close OUTFILE;
I hope this will be a good example to learn threading from. Thanks!

Replies are listed 'Best First'.
Re: How to split file for threading?
by BrowserUk (Patriarch) on Jun 23, 2015 at 14:16 UTC

    If you don't care which field the keyword is found in, why are you bothering to split the line?

    It would be much faster to just test if the keyword appears in the whole line.
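    For instance, a minimal sketch of that idea (assuming, as in the OP's script, that the filename and keyword arrive as the two command-line arguments; `count_matches` is a hypothetical helper name):

```perl
use strict;
use warnings;

# count lines of $filename that contain the literal $keyword;
# index() is a plain substring search, so no per-field split is needed
sub count_matches {
    my ( $filename, $keyword ) = @_;
    open my $fh, '<', $filename or die "cannot open $filename: $!";
    my $header = <$fh>;    # skip the header line
    my $count  = 0;
    while ( my $line = <$fh> ) {
        ++$count if index( $line, $keyword ) >= 0;
    }
    close $fh;
    return $count;
}

print count_matches( $ARGV[0], $ARGV[1] ), "\n" if @ARGV == 2;
```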

    With respect to threading the application: where are the files located? I.e. on a local disk, a local SSD, or remote?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
      Ah, good point. I could regex the whole line, right? File is remote, but I think everything I do is remote since I have to log in to the server
        File is remote, but I think everything I do is remote since I have to log in to the server

        The question is whether the file is local or remote to the processor running the program; not the human who initiates it.

        You could try this, but it is doubtful if it will be any faster when using multiple threads than just one unless the file is on a local, fast, SSD:

        #! perl -slw
        use strict;
        use threads;

        sub worker {
            my( $filename, $target, $start, $end ) = @_;
            open my $fh, '<', $filename or die $!;
            seek $fh, $start, 0;
            <$fh> if $start > 0;    ## discard first partial line
            my $count = 0;
            1+index( <$fh>, $target ) and ++$count while tell( $fh ) < $end;
            return $count;
        }

        our $T //= 4;
        my( $filename, $target ) = @ARGV;
        my $fsize = -s $filename;
        my $chunksize = int( $fsize / $T );
        my @chunks = map{ $_ * $chunksize } 0 .. $T-1;
        push @chunks, $fsize;

        my @threads = map{
            threads->new( \&worker, $filename, $target, $chunks[ $_ ], $chunks[ $_+1 ] )
        } 0 .. $T-1;

        my $total = 0;
        $total += $_->join for @threads;
        print "Found $total '$target' lines";

        Usage:

        thisScript.pl -T=n theFile.txt "the string"

        Note: The count is printed to stdout. Redirect it if you need it in a file.



        If you can search the whole line without breaking it up, this:

        for my $line (<INFILE>) {
            chomp $line;
            my @splitline = split("\t", $line);
            #my $poskey = $splitline[0] . ":" . $splitline[1];
            for (@splitline){
                if (/^$ARGV[1]/){
                    $filter_count++;
                    push(@rawfile,$line);
                }
            }
        }

        can become this:

        while (my $line = <INFILE>) {
            chomp $line;
            if ($line =~ /^$ARGV[1]/){
                $filter_count++;
                push(@rawfile,$line);
            }
        }

        -stevieb
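        A caveat worth noting: the original /^$ARGV[1]/ is anchored at the start of each field, so applied to the whole line it only ever matches the first column. A closer whole-line equivalent (a sketch; `field_starts_with` is a hypothetical helper name) anchors the match at the start of the line or immediately after a tab:

```perl
use strict;
use warnings;

# true if any tab-delimited field of $line starts with $keyword,
# mimicking the per-field /^keyword/ test of the original script
sub field_starts_with {
    my ( $line, $keyword ) = @_;
    # \Q...\E quotes any regex metacharacters in the keyword;
    # (?:^|\t) anchors at line start or just after a tab
    return $line =~ /(?:^|\t)\Q$keyword\E/ ? 1 : 0;
}
```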

Re: How to split file for threading?
by Anonymous Monk on Jun 23, 2015 at 14:19 UTC
    for my $line (<INFILE>) {

    Have you tried actually running this code on a 20GB file?

    Do you remember the responses to your previous question?

      It doesn't take hours, only maybe 10-15 minutes. I was thinking this would be an easy example for learning threading. I did see the previous posts. Are you implying I should use Bio::Seq to thread?

        No. Anonymonk is suggesting not using for() to read a file.

        Use while (my $line = <INFILE>) { ... } instead.

Re: How to split file for threading?
by wollmers (Scribe) on Jun 23, 2015 at 15:02 UTC

    IMHO you can just use grep and wc.
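    For example, a sketch of that approach (assuming the input is input.txt and the keyword is a fixed string; -F makes grep do a plain substring search, like Perl's index):

```shell
# keep the header line, then append every data line containing the keyword
head -n 1 input.txt > input_filter.txt
tail -n +2 input.txt | grep -F 'keyword' >> input_filter.txt

# count matching data lines
tail -n +2 input.txt | grep -cF 'keyword'
```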

Re: How to split file for threading?
by marioroy (Prior) on Jun 26, 2015 at 20:04 UTC

    Update: the OP mentions searching for a keyword, so I changed the first if statement to use index instead of a regex.

    Parallelism is beautiful. IO reads against the input file are sequential in MCE no matter the number of workers. Workers communicate the next offset position for the next worker waiting to read. The chunk_id value makes it possible to have ordered output.

    Here is a parallel implementation of the OP's script. Chomping isn't necessary after all: why chomp each line only to append the line feed back at output time? But please add it back if you need it. Perhaps the chomp in the OP's script was meant to be there; if so, I apologize for taking it out. :-)

    The initial pattern matching against the slurped chunk is likely beneficial. But, please feel free to comment out the outer-most if statement and closing brace and give it a try. Scripting is fun.

    Kind regards, Mario

    use strict;
    use warnings;

    use MCE::Loop;
    use MCE::Candy;

    # ensure two arguments are provided to the script
    my ($fileName, $keyword) = @ARGV;
    die "usage: $0 file keyword\n" if @ARGV != 2;

    # grab header line
    open INFILE, '<', $fileName or die "cannot open file to read: $!";
    my $header = <INFILE>;
    close INFILE;

    # utilize many-core engine to filter file
    # out_iter_array returns a closure for gathering orderly
    my @rawData;

    MCE::Loop::init {
        max_workers => 'auto',   # note: 'auto' is never higher than 8
        gather      => MCE::Candy::out_iter_array(\@rawData),
        use_slurpio => 1,
    };

    mce_loop_f {
        my ($mce, $slurped_ref, $chunk_id) = @_;

        # quickly determine if the keyword is found; this is fast
        # think of this as short-circuiting unnecessary work
        my ($count, $foundData) = (0, '');

        if ( 1 + index($$slurped_ref, $keyword) ) {
            open my $MEM_FH, '<', $slurped_ref;
            binmode $MEM_FH, ':raw';

            # skip header line for the first chunk only
            if ($chunk_id == 1) {
                while (<$MEM_FH>) {
                    if (/$keyword/) {
                        next if $. == 1;     # skip header line
                        $foundData .= $_;    # append line
                        $count++;            # increment count
                    }
                }
            }
            # otherwise, the line number check is not necessary
            else {
                while (<$MEM_FH>) {
                    if (/$keyword/) {
                        $foundData .= $_;    # append line
                        $count++;            # increment count
                    }
                }
            }

            close $MEM_FH;
        }

        # gathers two elements; count and rawData in an anonymous array
        # gather must be called regardless of found or not found:
        # the manager process needs to know this chunk_id has completed
        # when gathering results orderly
        MCE->gather($chunk_id, [ $count, $foundData ]);

    } $fileName;

    MCE::Loop::finish;   # shutdown MCE workers

    # each element in rawData is an array ref [ $count, $foundData ]

    # output count
    my $filterCount = 0;
    $filterCount += $_->[0] for @rawData;   # $count
    print "Completed filtering $keyword\n";
    print "Found $filterCount elements\n";

    # output found data
    my $outFileName = substr($fileName, 0, length($fileName) - 4) . "_filter.txt";
    print "Filtering to output file: $outFileName\n";

    open OUTFILE, '>', $outFileName or die "cannot open file to write: $!";
    print OUTFILE $header;
    print OUTFILE $_->[1] for @rawData;   # $foundData
    close OUTFILE;

Node Type: perlquestion [id://1131634]
Approved by ww
Front-paged by GotToBTru