in reply to How to split file for threading?

If you don't care which field the keyword is found in, why are you bothering to split the line?

It would be much faster to just test if the keyword appears in the whole line.

With respect to threading the application: where are the files located? I.e. local disk, local SSD, or remote?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!

Re^2: How to split file for threading?
by diyaz (Beadle) on Jun 23, 2015 at 14:46 UTC
    Ah, good point. I could regex the whole line, right? File is remote, but I think everything I do is remote since I have to log in to the server.
      File is remote, but I think everything I do is remote since I have to log in to the server

      The question is whether the file is local or remote to the processor running the program, not to the human who initiates it.

      You could try this, but it is doubtful that multiple threads will be any faster than one unless the file is on a fast, local SSD:

      #! perl -slw
      use strict;
      use threads;

      sub worker {
          my( $filename, $target, $start, $end ) = @_;
          open my $fh, '<', $filename or die $!;
          seek $fh, $start, 0;
          <$fh> if $start > 0; ## discard first partial line
          my $count = 0;
          1+index( <$fh>, $target ) and ++$count while tell( $fh ) < $end;
          return $count;
      }

      our $T //= 4;
      my( $filename, $target ) = @ARGV;

      my $fsize = -s $filename;
      my $chunksize = int( $fsize / $T );
      my @chunks = map{ $_ * $chunksize } 0 .. $T-1;
      push @chunks, $fsize;

      my @threads = map{
          threads->new( \&worker, $filename, $target, $chunks[ $_ ], $chunks[ $_+1 ] )
      } 0 .. $T-1;

      my $total = 0;
      $total += $_->join for @threads;

      print "Found $total '$target' lines";

      Usage:

      thisScript.pl -T=n theFile.txt "the string"

      Note: The count is printed to stdout. Redirect it if you need it in a file.
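To make the chunking step concrete, here is a minimal sketch of how the byte offsets come out. The helper name chunk_offsets and the 103-byte file size are made up for illustration; the arithmetic is the same as in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: byte offsets that split $fsize bytes into $T chunks.
sub chunk_offsets {
    my( $fsize, $T ) = @_;
    my $chunksize = int( $fsize / $T );
    my @offsets = map { $_ * $chunksize } 0 .. $T - 1;
    push @offsets, $fsize;    # the last chunk absorbs the remainder
    return @offsets;
}

# A 103-byte file split four ways: each worker gets [ $offsets[$i], $offsets[$i+1] ).
my @offsets = chunk_offsets( 103, 4 );
print "@offsets\n";    # prints: 0 25 50 75 103
```

The offsets rarely fall on line boundaries, which is why each worker except the first throws away the partial line it lands in.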


        Thank you very much. I like to learn by example and this is great. I'll be honest, I'm trying to understand how the code works, as I am still very much a novice at Perl and programming in general. What do these lines mean?
        1+index( <$fh>, $target ) and ++$count while tell( $fh ) < $end;
        Is this reading the file line by line, returning the line number where $target is found, and increasing $count every time $target is found? Does the tell function return the index position just to check for end of file? Why couldn't you do while (<$fh>)?
        our $T //= 4;
        I couldn't really google two forward slashes, so I don't know what this means.
        my @threads = map{ threads->new( \&worker, $filename, $target, $chunks[ $_ ], $chunks[ $_+1 ] ) } 0 .. $T-1;
        OK, map will transform arrays, but here it seems like it is assigning 3 threads to @threads? Does threads->new() create a new thread while passing the worker sub and 4 variables, or is it creating a worker sub and passing those variables to worker?
        my $total = 0; $total += $_->join for @threads;
        This concatenates the thread results? But I don't see where or when the lines with $target were passed back and stored. Thanks again.
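In case it helps, here is a spelled-out sketch of the idioms asked about. The name count_matches and the sample data are made up for illustration and are not part of the original script. index() returns the position of $target within the line (not a line number), or -1 when it is absent, so 1+index() is false exactly when there is no match. tell() reports the current byte offset, which lets a worker stop at the end of its own chunk rather than at end of file; a plain while (<$fh>) would read on into the next worker's chunk. Each thread's join hands back whatever worker returned, which here is only the count, so the matching lines themselves are never passed back.

```perl
use strict;
use warnings;

# "//=" is defined-or assignment (Perl 5.10+): assign only if the left side is undef.
# Under the -s switch, -T=8 on the command line would already have set $T.
our $T;
$T //= 4;

# Hypothetical helper: count the lines containing $target among the lines
# that start before byte offset $end. Same logic as the one-liner, unrolled.
sub count_matches {
    my( $fh, $target, $end ) = @_;
    my $count = 0;
    while( tell( $fh ) < $end ) {
        my $line = <$fh>;
        last unless defined $line;                  # guard against end of file
        ++$count if index( $line, $target ) >= 0;   # -1 means "not found"
    }
    return $count;
}

# In-memory "file" for the demo; the lines start at bytes 0, 8, 12 and 21.
my $data = "foo bar\nbaz\nbar none\nqux bar\n";
open my $fh, '<', \$data or die $!;

# With $end = 20 the loop stops before the fourth line, so two lines match.
print count_matches( $fh, 'bar', 20 ), " matching lines, T=$T\n";
```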

        Update: Added chunk_size option to the example.

        Update: Added note on one worker reading at any given time.

        The example by BrowserUk is awesome. Perhaps you have to work with a Perl that was not compiled with threads, e.g. Perl on Solaris. A pattern counter can be done with MCE using the following code. The usage is the same for both scripts.

        This is safe from random I/O thrashing: only one worker reads at any given time, so the read pattern stays sequential rather than jumping between workers.

        #! perl -slw
        use strict;
        use warnings;
        use MCE::Loop;

        our $T //= 'auto';
        my ( $file_name, $target ) = @ARGV;

        MCE::Loop::init {
            use_slurpio => 1, max_workers => $T,
            chunk_size => 1024 * 1024 * 16,
        };

        my @result = mce_loop_f {
            my ( $mce, $slurped_chunk, $chunk_id ) = @_;
            my $count = 0;
            $count++ while ( $$slurped_chunk =~ /$target/g );
            MCE->gather($count);
        } $file_name;

        my $total = 0;
        $total += shift @result while @result;

        print "Found $total '$target' lines";

        Usage:

        mce_script.pl -T=n theFile.txt "the string"

      If you can search the whole line without breaking it up, this:

      for my $line (<INFILE>) {
          chomp $line;
          my @splitline = split("\t", $line);
          #my $poskey = $splitline[0] . ":" . $splitline[1];
          for (@splitline){
              if (/^$ARGV[1]/){
                  $filter_count++;
                  push(@rawfile,$line);
              }
          }
      }

      can become this:

      while (my $line = <INFILE>) {
          chomp $line;
          if ($line =~ /^$ARGV[1]/){
              $filter_count++;
              push(@rawfile,$line);
          }
      }

      -stevieb

        while (my $line = <INFILE>) {
            chomp $line;
            if ($line =~ /^$ARGV[1]/){
                $filter_count++;
                push(@rawfile,$line);
            }
        }

        A little bit more fine-tuning:

        • If the line does not match, there is no point in chomping it. So move chomp into the if block. (This assumes the search pattern has no problems with the trailing line-end character(s) in $line.)
        • Using a variable inside the regexp forces perl to recompile it several times (I think). my $pattern=qr/^$ARGV[1]/; in front of the while loop and using $line=~$pattern in if would avoid that.
        • If the search is for a keyword at the first character of the line, and not a pattern, using a regexp may be slower than a simple string operation. Try using if (index($line,$ARGV[1])==0) or if (substr($line,0,length($ARGV[1])) eq $ARGV[1]) instead.
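The last two points can be sketched as follows. The helper names filter_regex and filter_prefix and the sample lines are made up for illustration, and $keyword stands in for $ARGV[1]. The \Q...\E quoting is an assumption that the keyword should be matched literally; drop it if the argument is meant to be a pattern.

```perl
use strict;
use warnings;

# Variant 1: compile the pattern once with qr// and reuse it for every line.
sub filter_regex {
    my( $keyword, @lines ) = @_;
    my $pattern = qr/^\Q$keyword\E/;    # \Q..\E: treat the keyword literally (assumption)
    return grep { $_ =~ $pattern } @lines;
}

# Variant 2: plain string comparison of the line's prefix, no regexp at all.
sub filter_prefix {
    my( $keyword, @lines ) = @_;
    my $len = length $keyword;
    return grep { substr( $_, 0, $len ) eq $keyword } @lines;
}

# Made-up tab-separated sample; only the first line starts with "chr1".
my @sample = ( "chr1\t100\tA\n", "chr2\t200\tC\n", "xchr1\t300\tG\n" );

my @by_regex  = filter_regex(  'chr1', @sample );
my @by_prefix = filter_prefix( 'chr1', @sample );
print scalar( @by_regex ), " and ", scalar( @by_prefix ), " line(s) kept\n";
```

Which variant is faster depends on the data and the Perl build, so it is worth measuring both on a real file.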

        Hint: Use Devel::NYTProf to locate hot spots in your code.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
        thank you!