in reply to Process large text data in array

**Fastest approach tested so far**

As suggested by BrowserUk, I ran a test using the file-reading method he described. The results are absolutely encouraging: reading + processing previously took about 21 seconds, and it is now down to 15 seconds or less, even though the data set has grown from 300k to 400k lines. (A timing sketch follows the code below.)

my @dat;
open my $DATF, '<', $file_name or die "Cannot open $file_name: $!";
while ( my $line = <$DATF> ) {
    chomp $line;
    my %trec = line2rec($line);   # parse the line into a record

    # just do some filtering here
    if ( $trec{'active'} ) {
    }

    # just testing: move every data line into the array
    push @dat, $line;
}
close($DATF);
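For reference, wall-clock timings like the 21s/15s figures above can be taken with the core Time::HiRes module; this harness is illustrative rather than the exact one used:

use Time::HiRes qw(gettimeofday tv_interval);

my $t0 = [gettimeofday];
# ... the reading + processing loop above goes here ...
printf "reading + processing took %.1fs\n", tv_interval($t0);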

Re^2: Process large text data in array
by SimonPratt (Friar) on Mar 11, 2015 at 14:49 UTC

    Try this out:

    my @dat;

    # each filter receives the line as its argument and returns true on a match
    my @filters;
    push @filters, sub { $_[0] =~ /active/        ? 1 : undef };
    push @filters, sub { $_[0] =~ /anotherfilter/ ? 1 : undef };

    open my $DATF, '<', $file_name or die "Cannot open $file_name: $!";
    while ( my $line = <$DATF> ) {
        chomp $line;
        foreach my $filter (@filters) {
            $filter->($line) or next;
            push @dat, $line;   # keep any line that passes a filter
            last;
        }
    }
    close($DATF);

    An alternative is this:

    use threads;
    use Thread::Queue;

    use constant MAXTHREADS => 2;

    my $workQueue = Thread::Queue->new();
    my $outQueue  = Thread::Queue->new();

    my @threads = map { threads->new( \&worker ) } 1 .. MAXTHREADS;

    # feed every line to the worker threads
    open my $DATF, '<', $file_name or die "Cannot open $file_name: $!";
    while ( <$DATF> ) {
        $workQueue->enqueue($_);
    }
    close $DATF;
    $workQueue->end();

    $_->join for @threads;
    $outQueue->end();

    # collect the lines that passed a filter
    my @dat;
    while ( defined( my $line = $outQueue->dequeue() ) ) {
        push @dat, $line;
    }

    sub worker {
        my @filters;
        push @filters, sub { $_[0] =~ /active/        ? 1 : undef };
        push @filters, sub { $_[0] =~ /anotherfilter/ ? 1 : undef };

        while ( defined( my $line = $workQueue->dequeue() ) ) {
            chomp $line;
            foreach my $filter (@filters) {
                $filter->($line) or next;
                $outQueue->enqueue($line);
                last;
            }
        }
    }

    The benefit to multithreading is that you can dial your performance up and down depending on how many resources are available to you. As written, this requires the entire file to be read into the work queue first; pushing the read into a separate thread resolves that issue, and pushing the outQueue processing into its own thread also helps reduce the memory footprint (assuming you're doing something like writing the data into a filtered output file). A sketch of that refinement follows.
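    A sketch of that refinement, with the read and the write each in their own thread (the filtered.out name, the queue limit, and the single /active/ filter are mine, for illustration):

    use threads;
    use Thread::Queue;

    use constant MAXTHREADS => 2;

    my $workQueue = Thread::Queue->new();
    my $outQueue  = Thread::Queue->new();
    $workQueue->limit = 10_000;   # bound the backlog (needs Thread::Queue >= 3.01)

    # reader thread: streams lines in, so the file never sits wholly in memory
    my $reader = threads->new( sub {
        open my $DATF, '<', $file_name or die "Cannot open $file_name: $!";
        $workQueue->enqueue($_) while <$DATF>;
        close $DATF;
        $workQueue->end();
    } );

    # writer thread: drains matches straight to a filtered output file
    my $writer = threads->new( sub {
        open my $OUT, '>', 'filtered.out' or die "Cannot open filtered.out: $!";
        while ( defined( my $line = $outQueue->dequeue() ) ) {
            print $OUT "$line\n";
        }
        close $OUT;
    } );

    my @threads = map { threads->new( \&worker ) } 1 .. MAXTHREADS;

    $reader->join;
    $_->join for @threads;
    $outQueue->end();
    $writer->join;

    sub worker {
        while ( defined( my $line = $workQueue->dequeue() ) ) {
            chomp $line;
            $outQueue->enqueue($line) if $line =~ /active/;
        }
    }

    With the queue limit in place, memory use is bounded by the backlog cap rather than by the size of the input file.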

      The benefit to multithreading is that you can dial your performance up and down depending on how many resources are available to you.

      Sorry, but have you actually run and timed that code?

      Because it will, unfortunately, run anything from 5 to 50 times slower than the single-threaded version on any build of Perl, or OS, I am familiar with: the per-line cost of enqueueing a line, copying it between thread interpreters, and dequeueing it again dwarfs the cost of the matching itself.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
      In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

        Exactly as it is written, yes, it is slower, and I expect it to be slower. The model itself is sound: I use it in other applications that are genuinely CPU-bound, and there it does provide a scalable performance improvement.

        Testing it just now, I get a 0.9s run time on a 24MB input file for the single-threaded code and a 6.5s run time on the same input for the multi-threaded code: massively worse performance, as expected.

        When I increase the number of filters to 83 and skew the ordering so that matches come towards the end (to simulate a CPU-bound process), I get a 5.8s run time on the same input for the single-threaded code and 6.1s for the multi-threaded code. Admittedly, a more realistic scenario would either match randomly across the full range of filters or long-tail the matching with the majority skewed to the front, depending on how much effort OP has already made to enhance performance, but I'm not sure how much effort I want to put into this right now :-) The sort of filter setup I used is sketched below.
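        For reference, this is the sort of setup I mean by skewing the matches towards the end of the filter list (the token names are invented for illustration):

        # 80 filters that never match, followed by the 3 that do, so most
        # lines pay the cost of running nearly every regex before a match.
        my @filters;
        for my $i ( 1 .. 80 ) {
            push @filters, sub { $_[0] =~ /no_such_token_$i/ ? 1 : undef };
        }
        push @filters, sub { $_[0] =~ /active/        ? 1 : undef };
        push @filters, sub { $_[0] =~ /anotherfilter/ ? 1 : undef };
        push @filters, sub { $_[0] =~ /stillpending/  ? 1 : undef };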

        Given the performance figures OP was quoting, I assumed that the process is CPU-bound in their case, in which case (as my figures above indicate) a multi-threaded approach can definitely assist and is certainly worth exploring.