in reply to Re^3: Process large text data in array
in thread Process large text data in array

Exactly as it is written, yes, it is slower, and I expect it to be slower. The model is sound: I use it in other applications that are genuinely CPU bound, and there it does provide a scalable performance improvement.

Testing it just now, I get a 0.9s run time on a 24MB input file for the single-threaded code, and a 6.5s run time on the same input for the multi-threaded code: massively worse performance, as expected.

When I increase the number of filters to 83 and skew the ordering so that matches fall towards the end of the list (to simulate a CPU-bound process), I get a 5.8s run time on the same input for the single-threaded code, and 6.1s for the multi-threaded code. Admittedly, a more realistic scenario would be either random matching across the full range of filters, or long-tailed matching with the majority of hits skewed towards the front, depending on how much effort OP has already made to enhance performance, but I'm not sure how much effort I want to put into this right now :-)
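
For instance, the long-tailed variant might build the filter list along the following lines; the patterns and the 2/81 split are purely illustrative:

    # Illustrative long-tail ordering: the two filters that catch the
    # bulk of the lines sit at the front, with a long tail of
    # rarely-matching filters behind them. The patterns are made up.
    my @filters;
    push @filters, sub { $_[0] =~ /active/ };          # most hits land here
    push @filters, sub { $_[0] =~ /anotherfilter/ };   # second most common
    for my $n ( 1 .. 81 ) {
        push @filters, sub { $_[0] =~ /rarepattern$n/ };   # rarely hit
    }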

Given the performance figures OP was quoting, I assumed that the process is CPU bound in their case, in which case (as my figures above indicate) a multi-threaded approach can definitely assist and is certainly worth exploring.

Re^5: Process large text data in array
by BrowserUk (Patriarch) on Mar 11, 2015 at 17:24 UTC
    When I increase the number of filters to 83 and skew the ordering to match towards the end (to simulate a CPU bound process), I get 5.8s run time for the same input for the single threaded code. I get 6.1s run time for the multi-threaded code.

    Could you post those versions of the two programs (to save me trying to reproduce them from your descriptions), as I'd like to do a little more analysis on them.



      Sure, this is the multi-threaded code:

      use strict;
      use threads;
      use Thread::Queue;
      use Time::HiRes 'time';

      use constant MAXTHREADS => 2;

      my $workQueue = Thread::Queue->new();
      my $outQueue  = Thread::Queue->new();

      # 83 never-matching filters in front, so matching lines are only
      # caught near the end of the list; each filter tests the line
      # passed as its first argument.
      my @filters;
      push @filters, sub { $_[0] =~ /blahdeblah/ } for 1 .. 83;
      push @filters, sub { $_[0] =~ /active/ };
      push @filters, sub { $_[0] =~ /anotherfilter/ };

      my @threads = map { threads->new( \&worker ) } 1 .. MAXTHREADS;

      my $file_name = 'test.txt';
      open my $DATF, '<', $file_name or die "Can't open $file_name: $!";
      while ( <$DATF> ) {
          $workQueue->enqueue($_);
      }
      close $DATF;
      $workQueue->end();

      $_->join for @threads;
      $outQueue->end();

      my @dat;
      while ( defined( my $line = $outQueue->dequeue() ) ) {
          push @dat, $line;
      }

      print( time - $^T, "\n" );

      sub worker {
          while ( defined( my $line = $workQueue->dequeue() ) ) {
              chomp $line;
              foreach my $filter (@filters) {
                  $filter->($line) or next;
                  $outQueue->enqueue($line);
                  last;
              }
          }
      }

      This is the single-threaded code:

      use strict;
      use Time::HiRes 'time';

      my @dat;

      # Same filter set as the multi-threaded version: 83 never-matching
      # filters in front, with the two matching filters at the end.
      my @filters;
      push @filters, sub { $_[0] =~ /blahdeblah/ } for 1 .. 83;
      push @filters, sub { $_[0] =~ /active/ };
      push @filters, sub { $_[0] =~ /anotherfilter/ };

      my $file_name = 'test.txt';
      open my $DATF, '<', $file_name or die "Can't open $file_name: $!";
      while ( my $line = <$DATF> ) {
          chomp $line;
          foreach my $filter (@filters) {
              $filter->($line) or next;
              push @dat, $line;
              last;
          }
      }
      close $DATF;

      print( time - $^T, "\n" );

      And the data file I used was made up of the following two lines:

      active=sync|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53||foo=bar=bam|sync=53
      anotherfilter=forest|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53
      repeated ~100,000 times, to generate a file of ~300,000 lines at roughly 23MB
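
      For reference, a test file like that can be generated with something along these lines (a sketch only: the repeat count and file name match the figures above, and the two template lines are the sample pair just shown):

      # Sketch of a generator for the test file described above. The two
      # template lines are the sample pair, and the repeat count is the
      # approximate figure quoted; line counts and file size will only
      # roughly match the numbers given.
      use strict;
      use warnings;

      my @templates = (
          'active=sync|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53||foo=bar=bam|sync=53',
          'anotherfilter=forest|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53|foo=bar=bam|sync=53',
      );

      open my $OUT, '>', 'test.txt' or die "Can't write test.txt: $!";
      for my $rep ( 1 .. 100_000 ) {
          print {$OUT} "$_\n" for @templates;
      }
      close $OUT;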

      I initially came up with this threading model to parse I/B/E/S monthly financial data files, which are pretty hefty (roughly 30GB all up) and require a lot of processing for each line (anywhere between 40 and 200 lines of code), ultimately winding up as a massively CPU-bound operation. Given the overall dataset size, though (and the fact that it is already broken up into multiple files), the final model I went with, which provides the best speed enhancement for this scenario, is a multi-threaded model that divides the work by file rather than by line. Splitting on lines was good for processing an individual file, but not good enough for the overall job, and ultimately not as scalable, due to the natural limits on the gains available when processing is split into such tiny work units.
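
      The shape of that final, file-based model is roughly as follows. This is a sketch of the approach rather than the actual I/B/E/S code; process_file() is a hypothetical stand-in for the real per-file processing, and the data/*.txt glob is illustrative:

      use strict;
      use warnings;
      use threads;
      use Thread::Queue;

      use constant MAXTHREADS => 4;   # tune to the number of cores

      # The queue carries whole file names, so each work unit is an
      # entire file rather than a single line.
      my $fileQueue = Thread::Queue->new();

      my @threads = map { threads->new( \&worker ) } 1 .. MAXTHREADS;

      $fileQueue->enqueue($_) for glob 'data/*.txt';
      $fileQueue->end();
      $_->join for @threads;

      sub worker {
          while ( defined( my $file = $fileQueue->dequeue() ) ) {
              process_file($file);
          }
      }

      sub process_file {
          my ($file) = @_;
          open my $FH, '<', $file or die "Can't open $file: $!";
          while ( my $line = <$FH> ) {
              # ... heavy per-line processing goes here ...
          }
          close $FH;
      }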

      Ultimately, you need to know the underlying environment and the task very well to make a good decision about what can and should be multi-threaded, how the work should be split up, and where the sweet spot for performance enhancement lies.

      edit: Removed use warnings, as I didn't actually run this code with use warnings enabled