cedance has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have quite a large file, approx. 4 GB. I am running on a cluster, so memory is not an issue. So I went ahead and buffered the whole file into a variable. I hope this is much better than reading line by line?

My task is this: starting from the second line of the file ($i=1), I read every 4th line from then on ($i=1, 5, 9, 13, etc.), check for some patterns, and do some operations depending on whether the pattern was present or not, replacing strings, etc.

Now, since these operations are mostly independent (that is, the pattern check on line 2 ($i=1) can be done independently of the one on line 6 ($i=5), and so on), is it possible to create something like threads, or to do multiple checks at the same time? If it is possible, then reading the data in small chunks and assigning them to each thread would be a good idea; the number of threads would of course depend on the total memory available.

I hope my question is clear; if not, please point it out and I'll try to clarify. It's just that I have the resources, and I wonder whether it could be run in something other than a totally sequential manner.

Thank you!

Re: buffering from a large file
by BrowserUk (Patriarch) on Mar 17, 2011 at 11:00 UTC
    I have quite a large file, approx. 4 GB. I am running on a cluster, so memory is not an issue. So I went ahead and buffered the whole file into a variable. I hope this is much better than reading line by line?

    Probably not. Discarding 3 lines will likely be far less costly than allocating memory to hold them when you are not going to use them.
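
    A minimal sketch of what I mean (the file name and pattern are only placeholders): read line by line and simply skip the three lines of each group you don't care about.

    use strict;
    use warnings;

    open my $fh, '<', 'Huge_source_file.txt' or die "Cannot open file: $!";
    while ( my $line = <$fh> ) {
        next unless $. % 4 == 2;    # keep only the 2nd line of each group of 4
        if ( $line =~ /SOME_PATTERN/ ) {
            # do the replacement / other work here
        }
    }
    close $fh;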

    Now, since these operations are mostly independent (that is, the pattern check on line 2 ($i=1) can be done independently of the one on line 6 ($i=5), and so on), is it possible to create something like threads, or to do multiple checks at the same time?

    It would certainly be possible to use a separate thread to process each selected line, but starting a new thread to process a single line--unless the processing of that line is very cpu-intensive--is unlikely to save any time.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: buffering from a large file
by moritz (Cardinal) on Mar 17, 2011 at 11:01 UTC
    So I went ahead and buffered the whole file into a variable. I hope this is much better than reading line by line?

    I don't think so.

    If you read line by line, you can do some processing while the operating system pre-fetches the next blocks of the file in the background.

    If you read the whole thing into memory at once, the entire file must be read before any processing can start.

    is it possible to create something like threads or do multiple checks at the same time?

    Yes. Or separate processes. Your operating system should keep the read blocks in its buffer cache, so that only the first process actually reads them from disk, and subsequent ones get them from the cache.
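
    A minimal sketch of the separate-process idea, assuming the work is split by group number across a few forked children; the file name and process_line() are just placeholders for your own code.

    use strict;
    use warnings;

    my $N_WORKERS = 4;
    my $file      = 'Huge_source_file.txt';

    my @pids;
    for my $worker ( 0 .. $N_WORKERS - 1 ) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid) {                 # parent: remember the child and move on
            push @pids, $pid;
            next;
        }
        # child: each worker opens the file itself; the OS buffer cache
        # means only the first reader actually hits the disk
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while ( my $line = <$fh> ) {
            next unless $. % 4 == 2;              # 2nd line of each quadruple
            my $group = int( ( $. - 1 ) / 4 );    # 0-based group number
            next unless $group % $N_WORKERS == $worker;
            process_line($line);                  # stand-in for the real checks
        }
        close $fh;
        exit 0;
    }
    waitpid $_, 0 for @pids;

    sub process_line { }    # placeholder for the real work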

Re: buffering from a large file
by chrestomanci (Priest) on Mar 17, 2011 at 11:34 UTC

    As BrowserUk and moritz said, it makes no sense to read the whole file into a single variable. It is much better to read the file line by line, discarding the lines you don't need and storing each line you are interested in as a separate scalar. You can keep this list of scalar strings in an array or another data structure.

    If each line requires lots of work to process, then your overall problem sounds ideal for Parallel::ForkManager. I would try something like:

    use strict;
    use warnings;
    use English;
    use Parallel::ForkManager;

    # Experiment with this value. I suggest you initially try setting
    # it to twice the number of CPU threads you have.
    my $MAX_PROCESSES = 8;
    my $pm = Parallel::ForkManager->new($MAX_PROCESSES);

    open my $src_FH, '<', 'Huge_source_file.txt'
        or die "Error opening source file: $!";

    LINE: while ( my $line = <$src_FH> ) {
        # 2nd line of each group of 4 (lines 2, 6, 10, ...)
        next LINE unless 2 == ( $INPUT_LINE_NUMBER % 4 );

        my $worker_pid = $pm->start;
        if ($worker_pid) {
            # In parent
            next LINE;
        }
        else {
            # In child: call a subroutine to process the line
            process_line($line);
            $pm->finish;    # Terminates the child process
        }
    }
    $pm->wait_all_children;

    Parallel::ForkManager will maintain a pool of worker processes and pass processing jobs to each, so you don't need to worry about creating a fork bomb by mistake. If you need to get results back from the worker processes, the module's documentation on CPAN explains how; a rough sketch follows.
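
    A minimal sketch of collecting results, assuming a reasonably recent Parallel::ForkManager (results are passed back via finish() and gathered in a run_on_finish() callback); the job loop and the result layout here are just placeholders.

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(8);

    my %results;
    $pm->run_on_finish( sub {
        my ( $pid, $exit_code, $ident, $exit_signal, $core_dump, $data ) = @_;
        $results{$ident} = $data if defined $data;    # $data is the ref passed to finish()
    } );

    for my $job ( 1 .. 10 ) {
        $pm->start($job) and next;                     # $job becomes $ident in the callback
        my $result = { job => $job, status => 'ok' };  # pretend work happens here
        $pm->finish( 0, $result );                     # second argument is sent back to the parent
    }
    $pm->wait_all_children;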

      Hi all,

      Thank you very much. I'll change the code to read line by line, and I'll check out the ForkManager module as well. What I failed to mention is that, while the processing is done only on the 2nd line of each group, the other lines are not discarded. Let me explain:

      Assume the file has n quadruples: each group of 4 lines (1:4, 5:8, etc.) belongs together as one entity. From each group I check/edit the 2nd line, and if it satisfies the criteria I write all 4 lines (3 of them untouched, yes) to another output file; otherwise I don't, meaning those 4 lines are left out of the output file.

      Thanks again.

        Based on your further description, I would say that you should probably forget about ForkManager and multiple threads.

        If multiple threads are all trying to write to the same output file, then you will need to worry about locking the file so they don't corrupt it. The locking overhead will kill performance, and even if you solve that somehow, random differences in how long each thread takes to run will mean that the order of the lines in the output file gets partly randomised, which you probably don't want.

        Instead I suggest you go for a single threaded solution that reads the input line by line and only keeps one group of four lines in memory at any one time. That way everything should be simple, and reasonably quick.

        I suggest that you add some regular expressions or other tests when you read the lines, so that if an extra newline creeps in somehow, the script can re-sync with the line groups rather than break.
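
        A minimal sketch of that single-threaded approach, assuming the input really is strict groups of four lines; the file names and the substitution are placeholders for your own.

        use strict;
        use warnings;

        open my $in,  '<', 'Huge_source_file.txt' or die "Cannot open input: $!";
        open my $out, '>', 'filtered_output.txt'  or die "Cannot open output: $!";

        while ( my $first = <$in> ) {
            my @group = ( $first, scalar <$in>, scalar <$in>, scalar <$in> );

            # basic sanity check: a truncated or shifted group aborts loudly
            die "Incomplete group ending at line $.\n" if grep { !defined } @group;

            # check/edit the 2nd line of the group
            if ( $group[1] =~ s/OLD_PATTERN/NEW_STRING/ ) {
                print {$out} @group;    # keep the whole group
            }
            # otherwise the whole group is discarded
        }
        close $in;
        close $out;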

        If you are really desperate for maximum performance then you could investigate chopping your raw file up into chunks and then having separate scripts process each. If you do that then the code you have written to find where a group of four lines starts will come in handy.

Re: buffering from a large file
by JavaFan (Canon) on Mar 17, 2011 at 11:38 UTC
    My initial thought was to use different processes, and have each process read a different part of the file.

    But it seems you're only interested in every 4th line, and while it's fairly easy to find line boundaries when you start reading from the middle of the file, you cannot know which line is a 4th one without having counted from the beginning.

    So, perhaps you should first look where the bottleneck is. CPU? IO? If it's CPU-bound, using threads or forks may improve things. If it's IO-bound, then using multiple threads/processes on different controllers may help, but only if they read different parts of the file, and then you're back to the sync problem. There are still many factors that play a role (one disk, multiple disks, mirrors, striping, disk/controller caches, other IO, etc.) in determining how much there is to gain.
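
    One cheap way to check, for what it's worth: compare wall-clock time with CPU time for a plain read-and-match pass (file name and pattern are placeholders). If the CPU time is much smaller than the wall time, the job is IO-bound and more threads on the same disk won't buy much.

    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my $t0 = [gettimeofday];

    open my $fh, '<', 'Huge_source_file.txt' or die "Cannot open file: $!";
    my $hits = 0;
    while ( my $line = <$fh> ) {
        next unless $. % 4 == 2;
        $hits++ if $line =~ /SOME_PATTERN/;    # stand-in for the real check
    }
    close $fh;

    my $wall = tv_interval($t0);
    my ( $user, $system ) = times();
    printf "wall %.2fs, cpu %.2fs (user %.2f + sys %.2f), %d hits\n",
        $wall, $user + $system, $user, $system, $hits;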

Re: buffering from a large file
by ikegami (Patriarch) on Mar 17, 2011 at 13:48 UTC

    is it possible to create something like threads or do multiple checks at the same time?

    The sharing mechanism used by threads would make a copy of the string when you access it. We're talking about repeatedly copying 4GB into each thread.

      I was thinking that if I use threads, I would buffer 20, 100, or 1000 lines at a time, run the code with a certain number of threads, and repeat the process. That should be feasible, I suppose. I ran the code reading the file line by line and it seems pretty fast already; I would just like to write code using some other concept (threads, forks, etc.) to get some experience and see the performance.
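
      If you do go down the threads route, here is a minimal sketch of that batching idea, assuming a reasonably recent Thread::Queue (for end()); the batch size, thread count, file name, and pattern are all placeholders.

      use strict;
      use warnings;
      use threads;
      use Thread::Queue;

      my $N_THREADS  = 4;
      my $BATCH_SIZE = 1000;

      my $queue = Thread::Queue->new;

      my @workers = map {
          threads->create( sub {
              while ( defined( my $batch = $queue->dequeue ) ) {
                  for my $line (@$batch) {
                      # pattern check / replacement goes here
                      $line =~ /SOME_PATTERN/ and do { };
                  }
              }
          } );
      } 1 .. $N_THREADS;

      open my $fh, '<', 'Huge_source_file.txt' or die "Cannot open file: $!";
      my @batch;
      while ( my $line = <$fh> ) {
          next unless $. % 4 == 2;            # 2nd line of each group of 4
          push @batch, $line;
          if ( @batch >= $BATCH_SIZE ) {
              $queue->enqueue( [@batch] );    # only the batch is copied, not the whole file
              @batch = ();
          }
      }
      $queue->enqueue( [@batch] ) if @batch;
      close $fh;

      $queue->end;                            # no more work; dequeue then returns undef
      $_->join for @workers;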