cedance has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have quite a large file, approx. 4 GB. I am running on a cluster, so memory is not an issue. So I went ahead and buffered the whole file into a variable. I hope this is much better than reading line by line?

My task is this: starting from the second line of the file ($i=1), I read every 4th line from then on ($i=1, 5, 9, 13, etc.), check for some patterns, and do some operations depending on whether the pattern was present or not, replacing strings, etc.

Now, since these operations are mostly independent (that is, the pattern check on line 2 ($i=1) can be done independently of the one on line 6 ($i=5), and so on), is it possible to create something like threads, or to do multiple checks at the same time? If it is possible, then reading the data in small chunks and assigning them to each thread would be a good idea; the number of threads would of course depend on the total memory available.

I hope my question is clear; if not, please point it out and I'll try to clarify. It's just that I have the resources, and I wonder whether it could be run in something other than a totally sequential manner.

Thank you!

Re: buffering from a large file
by BrowserUk (Patriarch) on Mar 17, 2011 at 11:00 UTC
    I have quite a large file, approx. 4 GB. I am running on a cluster, so memory is not an issue. So I went ahead and buffered the whole file into a variable. I hope this is much better than reading line by line?

    Probably not. Discarding 3 lines will likely be far less costly than allocating memory to hold them when you are not going to use them.
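
    A minimal sketch of what I mean (the file name and pattern are only placeholders): read line by line and simply skip the three lines of each group you don't care about.

    use strict;
    use warnings;

    open my $fh, '<', 'Huge_source_file.txt' or die "Cannot open file: $!";
    while ( my $line = <$fh> ) {
        next unless $. % 4 == 2;    # keep only the 2nd line of each group of 4
        if ( $line =~ /SOME_PATTERN/ ) {
            # do the replacement / other work here
        }
    }
    close $fh;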

    Now, since these operations are mostly independent (that is, the pattern check on line 2 ($i=1) can be done independently of the one on line 6 ($i=5), and so on), is it possible to create something like threads, or to do multiple checks at the same time?

    It would certainly be possible to use a separate thread to process each selected line, but starting a new thread to process a single line--unless the processing of that line is very cpu-intensive--is unlikely to save any time.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: buffering from a large file
by moritz (Cardinal) on Mar 17, 2011 at 11:01 UTC
    So I went ahead and buffered the whole file into a variable. I hope this is much better than reading line by line?

    I don't think so.

    If you read line by line, you can do some processing while the operating system pre-fetches the next blocks of the file in the background.

    If you read the whole thing into memory at once, the entire file must be read before any processing can start.

    is it possible to create something like threads or do multiple checks at the same time?

    Yes. Or separate processes. Your operating system should keep the read blocks in its buffer cache, so that only the first process actually reads them from disk, and subsequent ones get them from the cache.
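
    A minimal sketch of the separate-process idea, assuming the work is split by group number across a few forked children; the file name and process_line() are just placeholders for your own code.

    use strict;
    use warnings;

    my $N_WORKERS = 4;
    my $file      = 'Huge_source_file.txt';

    my @pids;
    for my $worker ( 0 .. $N_WORKERS - 1 ) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid) {                 # parent: remember the child and move on
            push @pids, $pid;
            next;
        }
        # child: each worker opens the file itself; the OS buffer cache
        # means only the first reader actually hits the disk
        open my $fh, '<', $file or die "Cannot open $file: $!";
        while ( my $line = <$fh> ) {
            next unless $. % 4 == 2;              # 2nd line of each quadruple
            my $group = int( ( $. - 1 ) / 4 );    # 0-based group number
            next unless $group % $N_WORKERS == $worker;
            process_line($line);                  # stand-in for the real checks
        }
        close $fh;
        exit 0;
    }
    waitpid $_, 0 for @pids;

    sub process_line { }    # placeholder for the real work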

Re: buffering from a large file
by chrestomanci (Priest) on Mar 17, 2011 at 11:34 UTC

    As BrowserUk and moritz said, it makes no sense to read the whole file into a single variable. It is much better to read the file line by line, discarding the lines you don't need and storing each line you are interested in as a separate scalar. You can keep this list of scalar strings in an array or another data structure.

    If each line requires lots of work to process, then your overall problem sounds ideal for Parallel::ForkManager. I would try something like:

    use strict;
    use warnings;
    use English;
    use Parallel::ForkManager;

    # Experiment with this value. I suggest you initially try setting
    # it to twice the number of CPU threads you have.
    my $MAX_PROCESSES = 8;
    my $pm = Parallel::ForkManager->new($MAX_PROCESSES);

    open my $src_FH, '<', 'Huge_source_file.txt'
        or die "Error opening source file: $!";

    LINE: while ( my $line = <$src_FH> ) {
        # 2nd line of each group of 4 (lines 2, 6, 10, ...)
        next LINE unless 2 == ( $INPUT_LINE_NUMBER % 4 );

        my $worker_pid = $pm->start;
        if ($worker_pid) {
            # In parent
            next LINE;
        }
        else {
            # In child: call a subroutine to process the line
            process_line($line);
            $pm->finish;    # Terminates the child process
        }
    }
    $pm->wait_all_children;

    Parallel::ForkManager will maintain a pool of worker processes and pass processing jobs to each, so you don't need to worry about creating a fork bomb by mistake. If you need to get results back from the worker processes, the module's documentation on CPAN explains how; a rough sketch follows.
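
    A minimal sketch of collecting results, assuming a reasonably recent Parallel::ForkManager (results are passed back via finish() and gathered in a run_on_finish() callback); the job loop and the result layout here are just placeholders.

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(8);

    my %results;
    $pm->run_on_finish( sub {
        my ( $pid, $exit_code, $ident, $exit_signal, $core_dump, $data ) = @_;
        $results{$ident} = $data if defined $data;    # $data is the ref passed to finish()
    } );

    for my $job ( 1 .. 10 ) {
        $pm->start($job) and next;                     # $job becomes $ident in the callback
        my $result = { job => $job, status => 'ok' };  # pretend work happens here
        $pm->finish( 0, $result );                     # second argument is sent back to the parent
    }
    $pm->wait_all_children;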

      Hi all,

      Thank you very much. I'll change the code to read line by line, and I'll check out the ForkManager module as well. What I failed to mention is that, while the processing is done only on the 2nd line of each group, the other lines are not discarded. Let me explain:

      Assume the file has n quadruples: each group of 4 lines (1:4, 5:8, etc.) belongs together as one entity. From each group I check/edit the 2nd line, and if it satisfies the criteria I write all 4 lines (3 of them untouched, yes) to another output file; otherwise I don't, meaning those 4 lines are left out of the output file.

      Thanks again.

        Based on your further description, I would say that you should probably forget about ForkManager and multiple threads.

        If multiple threads are all trying to write to the same output file, then you will need to worry about locking the file so they don't corrupt it. The locking overhead will kill performance, and even if you solve that somehow, random differences in how long each thread takes to run will mean that the order of the lines in the output file gets partly randomised, which you probably don't want.

        Instead I suggest you go for a single threaded solution that reads the input line by line and only keeps one group of four lines in memory at any one time. That way everything should be simple, and reasonably quick.

        I suggest that you add some regular expressions or other tests when you read the lines, so that if an extra newline creeps in somehow, the script can re-sync with the line groups rather than break.
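
        A minimal sketch of that single-threaded approach, assuming the input really is strict groups of four lines; the file names and the substitution are placeholders for your own.

        use strict;
        use warnings;

        open my $in,  '<', 'Huge_source_file.txt' or die "Cannot open input: $!";
        open my $out, '>', 'filtered_output.txt'  or die "Cannot open output: $!";

        while ( my $first = <$in> ) {
            my @group = ( $first, scalar <$in>, scalar <$in>, scalar <$in> );

            # basic sanity check: a truncated or shifted group aborts loudly
            die "Incomplete group ending at line $.\n" if grep { !defined } @group;

            # check/edit the 2nd line of the group
            if ( $group[1] =~ s/OLD_PATTERN/NEW_STRING/ ) {
                print {$out} @group;    # keep the whole group
            }
            # otherwise the whole group is discarded
        }
        close $in;
        close $out;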

        If you are really desperate for maximum performance then you could investigate chopping your raw file up into chunks and then having separate scripts process each. If you do that then the code you have written to find where a group of four lines starts will come in handy.

Re: buffering from a large file
by JavaFan (Canon) on Mar 17, 2011 at 11:38 UTC
    My initial thought was to use different processes, and have each process read a different part of the file.

    But it seems you're only interested in every 4th line, and while it's fairly easy to find line boundaries when you start reading from the middle of the file, you cannot know which line is a 4th one without having counted from the beginning.

    So, perhaps you should first look where the bottleneck is. CPU? IO? If it's CPU-bound, using threads or forks may improve things. If it's IO-bound, then using multiple threads/processes on different controllers may help, but only if they read different parts of the file, and then you're back to the sync problem. There are still many factors that play a role (one disk, multiple disks, mirrors, striping, disk/controller caches, other IO, etc.) in determining how much there is to gain.
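
    One cheap way to check, for what it's worth: compare wall-clock time with CPU time for a plain read-and-match pass (file name and pattern are placeholders). If the CPU time is much smaller than the wall time, the job is IO-bound and more threads on the same disk won't buy much.

    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my $t0 = [gettimeofday];

    open my $fh, '<', 'Huge_source_file.txt' or die "Cannot open file: $!";
    my $hits = 0;
    while ( my $line = <$fh> ) {
        next unless $. % 4 == 2;
        $hits++ if $line =~ /SOME_PATTERN/;    # stand-in for the real check
    }
    close $fh;

    my $wall = tv_interval($t0);
    my ( $user, $system ) = times();
    printf "wall %.2fs, cpu %.2fs (user %.2f + sys %.2f), %d hits\n",
        $wall, $user + $system, $user, $system, $hits;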

Re: buffering from a large file
by ikegami (Patriarch) on Mar 17, 2011 at 13:48 UTC

    is it possible to create something like threads or do multiple checks at the same time?

    The sharing mechanism used by threads would make a copy of the string when you access it. We're talking about repeatedly copying 4GB into each thread.

      I was thinking that if I use threads, I would buffer 20, 100, or 1000 lines at a time, run the code with a certain number of threads, and repeat the process. That should be feasible, I suppose. I ran the code reading the file line by line and it seems pretty fast already; I would just like to write code using some other concept (threads, forks, etc.) to get some experience and see the performance.
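
      If you do go down the threads route, here is a minimal sketch of that batching idea, assuming a reasonably recent Thread::Queue (for end()); the batch size, thread count, file name, and pattern are all placeholders.

      use strict;
      use warnings;
      use threads;
      use Thread::Queue;

      my $N_THREADS  = 4;
      my $BATCH_SIZE = 1000;

      my $queue = Thread::Queue->new;

      my @workers = map {
          threads->create( sub {
              while ( defined( my $batch = $queue->dequeue ) ) {
                  for my $line (@$batch) {
                      # pattern check / replacement goes here
                      $line =~ /SOME_PATTERN/ and do { };
                  }
              }
          } );
      } 1 .. $N_THREADS;

      open my $fh, '<', 'Huge_source_file.txt' or die "Cannot open file: $!";
      my @batch;
      while ( my $line = <$fh> ) {
          next unless $. % 4 == 2;            # 2nd line of each group of 4
          push @batch, $line;
          if ( @batch >= $BATCH_SIZE ) {
              $queue->enqueue( [@batch] );    # only the batch is copied, not the whole file
              @batch = ();
          }
      }
      $queue->enqueue( [@batch] ) if @batch;
      close $fh;

      $queue->end;                            # no more work; dequeue then returns undef
      $_->join for @workers;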