Re^2: Process large text data in array

Try this out:

my (@dat) = ();

my @filters;
# Each filter tests the line passed in as its argument ($_ is not set here).
push @filters, sub { $_[0] =~ /active/ ? 1 : undef };
push @filters, sub { $_[0] =~ /anotherfilter/ ? 1 : undef };

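open my $DATF, '<', $file_name or die "Cannot open $file_name: $!";
while ( my $line = <$DATF> ) {
    chomp $line;    # chomp returns 0 for a final line with no newline, so don't loop on it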

    foreach my $filter (@filters) {
        my $newline = $filter->($line) or next;
        push (@dat, $line);
        last;
    }

}

close($DATF);

An alternative is this:

use threads;
use Thread::Queue;

use constant MAXTHREADS => 2;

my $workQueue = Thread::Queue->new();
my $outQueue = Thread::Queue->new();

my @threads = map { threads->new( \&worker ) } 1..MAXTHREADS;

open my $DATF, '<', $file_name or die "Cannot open $file_name: $!";
while ( <$DATF> ) {
    $workQueue->enqueue($_);
}
close $DATF;

$workQueue->end();

$_->join for @threads;

$outQueue->end();

my @dat;
while ( defined( my $line = $outQueue->dequeue() ) ) {
    push @dat, $line;
}


sub worker {
    my @filters;
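    push @filters, sub { $_[0] =~ /active/ ? 1 : undef };
    push @filters, sub { $_[0] =~ /anotherfilter/ ? 1 : undef };

    # dequeue() returns undef once the queue has been ended and drained.
    while ( defined( my $line = $workQueue->dequeue() ) ) {
        chomp $line;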
        foreach my $filter (@filters) {
            my $newline = $filter->($line) or next;
            $outQueue->enqueue($line);
            last;
        }
    }
}

The benefit to multithreading is you can dial your performance up and down depending on how many resources are available to you. As written, this reads the entire file into memory first; pushing the read process into a separate thread resolves that issue, and pushing the outqueue processing into a separate thread also helps reduce the memory footprint (assuming you're doing something like writing the data into a filtered output file).
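
A minimal sketch of that arrangement, with a reader thread, the filter workers, and a writer thread. It assumes a Perl built with ithreads and a Thread::Queue recent enough to support the `limit` attribute; the names `filter_file`, `reader`, `writer` and `QUEUE_LIMIT` are made up for illustration, and output order is not preserved once more than one worker is running.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

use constant MAXTHREADS  => 2;
use constant QUEUE_LIMIT => 1000;    # hypothetical cap on queued lines

# Copy the lines of $in_name that match any filter into $out_name.
sub filter_file {
    my ( $in_name, $out_name ) = @_;

    my $workQueue = Thread::Queue->new();
    my $outQueue  = Thread::Queue->new();

    # Assumption: this Thread::Queue supports limit, so enqueue() blocks
    # while a queue is full and neither queue can grow without bound.
    $workQueue->limit = QUEUE_LIMIT;
    $outQueue->limit  = QUEUE_LIMIT;

    my $reader  = threads->create( \&reader, $in_name, $workQueue );
    my @workers = map { threads->create( \&worker, $workQueue, $outQueue ) }
        1 .. MAXTHREADS;
    my $writer = threads->create( \&writer, $out_name, $outQueue );

    $reader->join;
    $_->join for @workers;
    $outQueue->end();    # every worker has finished, so no more output
    $writer->join;
    return;
}

# Feed the input file into the work queue one line at a time.
sub reader {
    my ( $in_name, $workQueue ) = @_;
    if ( open my $in, '<', $in_name ) {
        while ( my $line = <$in> ) {
            chomp $line;
            $workQueue->enqueue($line);
        }
        close $in;
    }
    else {
        warn "Cannot open $in_name: $!\n";
    }
    $workQueue->end();    # always signal the workers, even on failure
    return;
}

# Pass on each line that matches at least one filter.
sub worker {
    my ( $workQueue, $outQueue ) = @_;
    my @filters = (
        sub { $_[0] =~ /active/ },
        sub { $_[0] =~ /anotherfilter/ },
    );
    while ( defined( my $line = $workQueue->dequeue() ) ) {
        foreach my $filter (@filters) {
            next unless $filter->($line);
            $outQueue->enqueue($line);
            last;
        }
    }
    return;
}

# Write matching lines out as they arrive.
sub writer {
    my ( $out_name, $outQueue ) = @_;
    my $ok = open my $out, '>', $out_name;
    warn "Cannot open $out_name: $!\n" unless $ok;

    # Keep draining even if the open failed, so the workers never block.
    while ( defined( my $line = $outQueue->dequeue() ) ) {
        print {$out} "$line\n" if $ok;
    }
    close $out if $ok;
    return;
}

filter_file(@ARGV) if @ARGV == 2;
```

Saved as a script, this would be run with an input and an output file name as its two arguments. Only about `QUEUE_LIMIT` lines sit in each queue at any time, which is what keeps the footprint down; it does nothing for the per-line queueing overhead.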


Replies are listed 'Best First'.
Re^3: Process large text data in array by BrowserUk (Patriarch) on Mar 11, 2015 at 15:17 UTC
"The benefit to multithreading is you can dial your performance up and down depending on how many resources are available to you."

Sorry, but have you actually run and timed that code? Because it will, unfortunately, run anything from 5 to 50 times slower than the single threaded version on any build of Perl, or OS, I am familiar with.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re^4: Process large text data in array by SimonPratt (Friar) on Mar 11, 2015 at 16:01 UTC
Exactly as it is written, yes it is slower and I expect it to be slower. The model is sound, as I use it in other applications that are actually CPU bound and it does provide a scalable performance improvement.

Testing it just now, I get 0.9s run time on a 24MB input file for the single threaded code. I get 6.5s run time on the same input using the multi-threaded code, so massively worse performance, as expected.

When I increase the number of filters to 83 and skew the ordering to match towards the end (to simulate a CPU bound process), I get 5.8s run time for the same input for the single threaded code. I get 6.1s run time for the multi-threaded code.

An admittedly more realistic scenario would be either to match randomly across the full range of filters, or to long-tail the matching, with the majority skewed to the front, depending on how much effort has been made by OP to enhance performance, but I'm not sure how much effort I want to put into this right now :-)

Given the performance figures OP was quoting, I made the assumption that the process is CPU bound in their case, in which case (as my figures above indicate), a multi-threaded approach can definitely assist and is certainly worth exploring.
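
For anyone wanting to try something similar, here is a hypothetical sketch of how the single threaded side of such a test might be set up. It is not the code behind the figures above: the synthetic input, the `nomatchN` patterns and the helper names `build_filters` and `filter_lines` are all assumptions made for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Build $count filters where only the last one can match the test data,
# so every kept line first pays for all the earlier failed attempts.
sub build_filters {
    my ($count) = @_;
    my @filters = map {
        my $re = qr/nomatch${_}\b/;
        sub { $_[0] =~ $re }
    } 1 .. $count - 1;
    push @filters, sub { $_[0] =~ /active/ };
    return @filters;
}

# Return a reference to the lines accepted by at least one filter.
sub filter_lines {
    my ( $lines, $filters ) = @_;
    my @kept;
    for my $line (@$lines) {
        for my $filter (@$filters) {
            next unless $filter->($line);
            push @kept, $line;
            last;
        }
    }
    return \@kept;
}

# Synthetic input: every third line contains "active".
my @lines = map { $_ % 3 ? "record $_ idle" : "record $_ active" } 1 .. 30_000;

# Time the same input against a short and a long filter list.
for my $count ( 2, 83 ) {
    my @filters = build_filters($count);
    my $start   = time();
    my $kept    = filter_lines( \@lines, \@filters );
    printf "%2d filters: %d lines kept in %.3fs\n",
        $count, scalar @$kept, time() - $start;
}
```

Timings will vary with the machine and Perl build, so the useful number is the ratio between the two runs rather than either figure on its own.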
Re^5: Process large text data in array by BrowserUk (Patriarch) on Mar 11, 2015 at 17:24 UTC
"When I increase the number of filters to 83 and skew the ordering to match towards the end (to simulate a CPU bound process), I get 5.8s run time for the same input for the single threaded code. I get 6.1s run time for the multi-threaded code."

Could you post those versions of the two programs (save me trying to reproduce them from your descriptions), as I'd like to do a little more analysis on them.
Re^6: Process large text data in array by SimonPratt (Friar) on Mar 11, 2015 at 18:12 UTC