Re: How to split file for threading?

Re^2: How to split file for threading? by diyaz (Beadle) on Jun 23, 2015 at 14:46 UTC
Ah, good point. I could regex the whole line, right? The file is remote, but I think everything I do is remote, since I have to log in to the server.
Re^3: How to split file for threading? by BrowserUk (Patriarch) on Jun 23, 2015 at 15:15 UTC
"File is remote, but I think everything I do is remote since I have to log in to the server"

The question is whether the file is local or remote to the processor running the program, not to the human who initiates it.

You could try this, but it is doubtful that it will be any faster with multiple threads than with just one, unless the file is on a local, fast SSD:

    #! perl -slw
    use strict;
    use threads;

    sub worker {
        my( $filename, $target, $start, $end ) = @_;
        open my $fh, '<', $filename or die $!;
        seek $fh, $start, 0;
        <$fh> if $start > 0; ## discard first partial line
        my $count = 0;
        1+index( <$fh>, $target ) and ++$count while tell( $fh ) < $end;
        return $count;
    }

    our $T //= 4;
    my( $filename, $target ) = @ARGV;

    my $fsize = -s $filename;
    my $chunksize = int( $fsize / $T );
    my @chunks = map{ $_ * $chunksize } 0 .. $T-1;
    push @chunks, $fsize;

    my @threads = map{
        threads->new( \&worker, $filename, $target, $chunks[ $_ ], $chunks[ $_+1 ] )
    } 0 .. $T-1;

    my $total = 0;
    $total += $_->join for @threads;

    print "Found $total '$target' lines";

Usage: `thisScript.pl -T=n theFile.txt "the string"`

Note: The count is printed to stdout. Redirect it if you need it in a file.
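One edge case seems worth checking in a chunked reader like the one above. As far as I can tell, if a chunk boundary lands exactly on the first byte of a line, "discard first partial line" throws away a whole line that the previous chunk has already stopped short of. The following is a minimal sketch of one way around that, not the original poster's code: it seeks to `$start - 1` before discarding, so a newline sitting just before the boundary is all that gets consumed. The sub names `count_in_range` and `chunk_bounds` are made up for illustration, and the ranges are processed sequentially here so that the arithmetic can be checked without a threaded perl.

```perl
use strict;
use warnings;

# Count lines containing $target among the lines that START in [$start, $end).
sub count_in_range {
    my ( $filename, $target, $start, $end ) = @_;
    open my $fh, '<', $filename or die "Cannot open $filename: $!";
    if ( $start > 0 ) {
        # Back up one byte, then discard through the next newline. If byte
        # $start-1 is itself a newline, none of the following line is lost.
        seek $fh, $start - 1, 0;
        my $discard = <$fh>;
    }
    my $count = 0;
    while ( tell( $fh ) < $end ) {
        my $line = <$fh>;
        last unless defined $line;
        ++$count if index( $line, $target ) >= 0;
    }
    close $fh;
    return $count;
}

# Split [0, $fsize) into $n byte ranges; the last range absorbs the remainder.
sub chunk_bounds {
    my ( $fsize, $n ) = @_;
    my $chunksize = int( $fsize / $n );
    my @bounds = map { $_ * $chunksize } 0 .. $n - 1;
    push @bounds, $fsize;
    return @bounds;
}
```

Each range could then be handed to its own thread exactly as in the script above. Every line belongs to the one range that contains its first byte, so the per-range counts should add up to the whole-file count.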
Re^4: How to split file for threading? by diyaz (Beadle) on Jun 23, 2015 at 16:48 UTC
Thank you very much. I like to learn by example, and this is great. I'll be honest: I'm trying to understand how the code works, as I am still very much a novice at Perl and programming in general. What do these lines mean?

`1+index( <$fh>, $target ) and ++$count while tell( $fh ) < $end;`

Is this reading the file line by line, returning the line number where $target is found, and increasing $count every time $target is found? Does the tell function return the index position just to check for end of file? Why couldn't you do while (<$fh>)?

`our $T //= 4;`

I couldn't really google two forward slashes, so I don't know what this means.

`my @threads = map{ threads->new( \&worker, $filename, $target, $chunks[ $_ ], $chunks[ $_+1 ] ) } 0 .. $T-1;`

OK, map will transform arrays, but here it seems like it is assigning 3 threads to @threads? Does threads->new() create a new thread while passing the worker sub and 4 variables? Or is it creating a worker sub and passing those variables to worker?

`my $total = 0; $total += $_->join for @threads;`

Does this concatenate the thread results? But I don't see where or when the lines with $target were passed back and stored.

Thanks again
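For readers with the same questions, here is a small sketch of the individual constructs, run in isolation. The variable names and sample strings are made up for illustration. In short: `//=` is defined-or assignment (Perl 5.10 and later); `index` returns a character position within the string (not a line number), or -1 when the substring is absent; and `map` over `0 .. $T-1` runs its block once per element, which is `$T` times. Two things the snippet does not demonstrate: the `join` method of the threads module returns whatever the thread's sub returned, so `$total += $_->join` adds up the per-thread counts rather than collecting lines; and, as far as I can tell from the code, the loop tests `tell( $fh ) < $end` instead of using `while (<$fh>)` because each worker has to stop at the end of its own chunk, not at end of file.

```perl
use strict;
use warnings;

# //= assigns only when the left-hand side is undef.
my $t;
$t //= 4;    # $t was undef, so it becomes 4
$t //= 8;    # $t is already defined, so it stays 4

# index() returns the position of the substring, or -1 if it is absent,
# so 1 + index(...) is 0 (false) only when the target is not found.
my $hit  = 1 + index( "chr1\tfoo", "foo" );    # position 5, plus 1
my $miss = 1 + index( "chr1\tbar", "foo" );    # -1, plus 1

# map runs the block once for each element of the list 0 .. $n-1.
my $n   = 4;
my @ids = map { "worker $_" } 0 .. $n - 1;     # four elements
```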
Re^5: How to split file for threading? by BrowserUk (Patriarch) on Jun 23, 2015 at 19:14 UTC
Re^6: How to split file for threading? by Anonymous Monk on Jun 23, 2015 at 19:16 UTC
Some notes below your chosen depth have not been shown here
Re^4: How to split file for threading? by marioroy (Prior) on Jun 26, 2015 at 17:22 UTC
Update: Added chunk_size option to the example.

Update: Added note on one worker reading at any given time.

The example by BrowserUk is awesome. Perhaps you have to work with a Perl that was not compiled with threads, e.g. running Perl on Solaris. A pattern counter can be done with MCE using the following code. The usage is the same between the two scripts.

This is safe from random I/O thrashing. The read pattern is sequential, not random between workers, because only one worker is reading at any given time.

    #! perl -slw
    use strict;
    use warnings;
    use MCE::Loop;

    our $T //= 'auto';
    my ( $file_name, $target ) = @ARGV;

    MCE::Loop::init {
        use_slurpio => 1, max_workers => $T,
        chunk_size => 1024 * 1024 * 16,
    };

    my @result = mce_loop_f {
        my ( $mce, $slurped_chunk, $chunk_id ) = @_;
        my $count = 0;
        $count++ while ( $$slurped_chunk =~ /$target/g );
        MCE->gather($count);
    } $file_name;

    my $total = 0;
    $total += shift @result while @result;
    print "Found $total '$target' lines";

Usage: `mce_script.pl -T=n theFile.txt "the string"`
Re^3: How to split file for threading? by stevieb (Canon) on Jun 23, 2015 at 14:53 UTC
If you can search the whole line without breaking it up, this:

    for my $line (<INFILE>) {
        chomp $line;
        my @splitline = split("\t", $line);
        #my $poskey = $splitline[0] . ":" . $splitline[1];
        for (@splitline){
            if (/^$ARGV[1]/){
                $filter_count++;
                push(@rawfile,$line);
            }
        }
    }

can become this:

    while (my $line = <INFILE>) {
        chomp $line;
        if ($line =~ /^$ARGV[1]/){
            $filter_count++;
            push(@rawfile,$line);
        }
    }

-stevieb
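One thing to keep in mind with this rewrite: the whole-line version only tests the start of the line, while the per-field version tests the start of every tab-separated field, so the two agree only when the match is expected in the first column. A small sketch of the difference, using an in-memory filehandle and made-up sample data (the `chr1` pattern and the three rows are assumptions for illustration):

```perl
use strict;
use warnings;

my $data = "chr1\t10\tA\nchr2\tchr1\tC\nchr3\t30\tG\n";
my $pat  = qr/^chr1/;

# Whole-line test: matches only when the line itself starts with the pattern.
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
my @line_hits;
while ( my $line = <$in> ) {
    chomp $line;
    push @line_hits, $line if $line =~ $pat;
}
close $in;

# Per-field test: matches when any tab-separated field starts with the pattern.
open $in, '<', \$data or die "Cannot open in-memory file: $!";
my @field_hits;
while ( my $line = <$in> ) {
    chomp $line;
    push @field_hits, $line if grep { $_ =~ $pat } split /\t/, $line;
}
close $in;
```

Here the second row is picked up only by the per-field test, because its match sits in the second column.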
Re^4: How to split file for threading? by afoken (Chancellor) on Jun 23, 2015 at 16:40 UTC
    while (my $line = <INFILE>) {
        chomp $line;
        if ($line =~ /^$ARGV[1]/){
            $filter_count++;
            push(@rawfile,$line);
        }
    }

A little bit more fine-tuning:

If the line does not match, there is no point in chomping it. So move `chomp` into the `if` block. (This assumes the search pattern has no problems with the trailing line-end character(s) in `$line`.)

Using a variable inside the regexp forces perl to recompile it several times (I think). Putting `my $pattern=qr/^$ARGV[1]/;` in front of the `while` loop and using `$line=~$pattern` in the `if` would avoid that.

If the search is for a keyword at the first character of the line, and not a pattern, using a regexp may be slower than a simple string operation. Try using `if (index($line,$ARGV[1])==0)` or `if (substr($line,0,length($ARGV[1])) eq $ARGV[1])` instead.

Hint: Use Devel::NYTProf to locate hot spots in your code.

Alexander
-- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
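The three prefix tests mentioned above can be compared side by side. This is a minimal sketch on made-up sample lines. It treats the search term as a literal string, so the regex variant wraps it in `\Q...\E`; that quoting is my addition and would be wrong if the term is meant to be a pattern. Which variant is actually fastest is something to measure on real data rather than assume.

```perl
use strict;
use warnings;

my @lines  = ( "chr1\t100\tA\n", "chr10\t5\tC\n", "chr2\t7\tG\n", "xchr1\t1\tT\n" );
my $prefix = "chr1";

# Compile the pattern once, outside any loop over the lines.
my $pattern = qr/^\Q$prefix\E/;

# Three ways to keep only the lines that start with $prefix.
my @by_regex  = grep { $_ =~ $pattern } @lines;
my @by_index  = grep { index( $_, $prefix ) == 0 } @lines;
my @by_substr = grep { substr( $_, 0, length $prefix ) eq $prefix } @lines;
```

All three select the same lines here: the `chr1` and `chr10` rows, but not `xchr1`, where the term appears later in the line.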
Re^4: How to split file for threading? by diyaz (Beadle) on Jun 23, 2015 at 15:52 UTC
thank you!