in reply to buffering from a large file

As BrowserUk and moritz said, it makes no sense to read the whole file into a single variable. It is much better to read the file in line by line, discarding the lines you don't need and storing each line you are interested in as a separate scalar. You can then keep those scalars in an array or another data structure.
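For instance, the basic pattern looks something like this (the regular expression is just a placeholder for whatever "interesting" means in your case, and the file name is made up):

use strict;
use warnings;

my @wanted;    # the lines we care about, one scalar per element

open my $fh, '<', 'Huge_source_file.txt' or die "Can't open file: $!";
while ( my $line = <$fh> ) {
    next unless $line =~ /pattern/;    # placeholder criterion
    push @wanted, $line;               # keep only the interesting lines
}
close $fh;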

If each line requires a lot of work to process, then your overall problem sounds ideal for Parallel::ForkManager. I would try something like:

use strict;
use warnings;
use English;
use Parallel::ForkManager;

# Experiment with this value. I suggest you initially try setting
# it to twice the number of CPU threads you have.
my $MAX_PROCESSES = 8;
my $pm = Parallel::ForkManager->new($MAX_PROCESSES);

open my $src_FH, '<', 'Huge_source_file.txt'
    or die "Error opening source file: $!";

LINE: while( my $line = <$src_FH> ) {
    # Only the first line of each group of four is of interest.
    next LINE unless 1 == ($INPUT_LINE_NUMBER % 4);

    my $worker_pid = $pm->start;
    if( $worker_pid ) {
        # In the parent
        next LINE;
    }
    else {
        # In the child: call a subroutine to process the line
        process_line($line);
        $pm->finish;    # Terminates the child process
    }
}

close $src_FH;
$pm->wait_all_children;    # Wait for the remaining children to finish

Parallel::ForkManager will maintain a pool of worker processes (it forks rather than using threads) and pass processing jobs to each, so you don't need to worry about creating a fork bomb by mistake. If you need to get results back from the workers, the module's documentation on CPAN explains how.
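For example, here is a minimal sketch of getting data back: the parent registers a run_on_finish callback, and each child passes a reference back through finish(). (The doubling "work" and the hash key are placeholders.)

use strict;
use warnings;
use Parallel::ForkManager;

my $pm = Parallel::ForkManager->new(8);

# Runs in the parent each time a child exits; the last argument is
# the data structure the child handed to finish(), if any.
$pm->run_on_finish( sub {
    my ($pid, $exit_code, $ident, $exit_signal, $core_dump, $data) = @_;
    print "Child $pid returned: $data->{result}\n" if defined $data;
} );

for my $job (1 .. 10) {
    $pm->start and next;                    # parent continues the loop
    my %answer = ( result => $job * 2 );    # placeholder "work"
    $pm->finish(0, \%answer);               # child exits, sending data back
}
$pm->wait_all_children;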

Re^2: buffering from a large file
by cedance (Novice) on Mar 17, 2011 at 11:45 UTC
    Hi all,

    Thank you very much. I'll change the code to read line by line, and I'll check out the Parallel::ForkManager module as well. What I failed to mention is that while the processing is done only on every 2nd line, the other lines are not discarded. Let me explain:

    Assume the file has n quadruples: each group of 4 lines (lines 1-4, 5-8, etc.) belongs together as one entity. From each group I check/edit the 2nd line; if it satisfies the criteria, I write all 4 lines (3 of them untouched, yes) to another output file, otherwise the whole group is discarded from the output.

    Thanks again.

      Based on your further description, I would say that you should probably forget about ForkManager and multiple processes.

      If multiple worker processes all try to write to the same output file, then you will need to worry about locking the file so they don't corrupt it. The locking overhead will kill performance, and even if you solve that somehow, random differences in how long each worker takes to run will mean that the order of the lines in the output file gets partly randomised, which you probably don't want.

      Instead I suggest you go for a single-process solution that reads the input line by line and only keeps one group of four lines in memory at any one time. That way everything stays simple and reasonably quick.

      I suggest that you add some regular expressions or other tests when you read lines, so that if a stray extra newline creeps in somehow the script can re-sync with the line groups rather than break.
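      Something along these lines, perhaps. This is only a sketch: the blank-line re-sync test and the /criteria/ match on the 2nd line are placeholders for whatever your real format requires, and the file names are made up.

        use strict;
        use warnings;

        open my $in,  '<', 'Huge_source_file.txt' or die "Can't open input: $!";
        open my $out, '>', 'filtered_output.txt'  or die "Can't open output: $!";

        while ( my $first = <$in> ) {
            # Sanity check: skip stray blank lines so we stay in sync
            # with the groups. Adjust to match your format's first line.
            next if $first =~ /^\s*$/;

            # Read the remaining three lines of the group.
            my @group = ( $first, scalar <$in>, scalar <$in>, scalar <$in> );
            last if grep { !defined } @group;    # truncated final group

            # Placeholder criterion, applied to the 2nd line of the group.
            print {$out} @group if $group[1] =~ /criteria/;
        }

        close $in;
        close $out;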

      If you are really desperate for maximum performance then you could investigate chopping your raw file up into chunks and then having separate scripts process each. If you do that then the code you have written to find where a group of four lines starts will come in handy.
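      A rough sketch of that splitting step, in case it helps. It assumes the group boundary rule is simply "every 4 lines"; the chunk count and file names are made up, and a real version would use your group-start test instead.

        use strict;
        use warnings;

        my $num_chunks = 4;    # how many pieces to split the input into

        open my $in, '<', 'Huge_source_file.txt' or die "Can't open input: $!";

        # First pass: count the lines so each chunk gets a fair share.
        my $lines = 0;
        $lines++ while <$in>;
        seek $in, 0, 0;        # rewind for the second pass
        $. = 0;                # reset the line counter as well

        # Split only on group boundaries, i.e. multiples of 4 lines.
        my $lines_per_chunk = 4 * ( int( ($lines / 4) / $num_chunks ) + 1 );

        my ($chunk, $out) = (0, undef);
        while ( my $line = <$in> ) {
            if ( ($. - 1) % $lines_per_chunk == 0 ) {
                close $out if $out;
                open $out, '>', 'chunk_' . $chunk++ . '.txt'
                    or die "Can't open chunk file: $!";
            }
            print {$out} $line;
        }
        close $out if $out;
        close $in;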

        Chrestomanci, you're right. Probably not so wise to do it here. I recoded it to read the file line by line and it's already really fast. However, I would like to learn about threads and forks in Perl. Could you point me to a good source? Thanks.