avanta has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,

Question 1: I have a huge text file (approx. 50MB) that I need to parse once in one process and then pass the same content to another process. I have sample code that parses a file, but how to duplicate the output is where I am stuck.
sub processInputFiles {
    # <--- doing something --->
    LogInfoMsg("Processing $file Started");
    if ($f_type eq ".gz") {
        $fp  = new IO::Zlib($working_file, "rb");
        $tfp = new IO::Zlib($working_file, "rb");
    }
    elsif ($f_type eq ".log") {
        $fp  = new IO::File($working_file, "r");
        $tfp = new IO::File($working_file, "r");
    }
    else {
        LogCriticalMsg("Unsupported file_type($f_type), Skip parsing $file.");
        return;
    }
    if (! $fp) {
        # if it fails for some reason, rename to its original name.
        rename($working_file, $file);
        LogCriticalMsg("Open input file $file failed. $!");
    }
    else {
        # pick the record separator based on the line-ending style
        if (grep m/[\r]/, $line) {
            IO::File->input_record_separator("\r\n\r\n");
        }
        else {
            IO::File->input_record_separator("\n\n");
        }
        close($tfp);
        while ($record = $fp->getline()) {
            # <--- parsing file per record --->
        } # end of while
    }
}
# <--- doing something --->
The code may have errors, but what I want is the concept of duplicating the data to two or more processes (forking) so that I parse the input file only once. The input file can be huge, so I can't store it in a variable, array, or hash.
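To make the goal concrete, here is a rough sketch of what I think I need (the file name input.log, the record separator, and the consumer() handler are placeholders, not my real code): the parent reads the file exactly once and writes every record down a pipe to each forked child.

#!/usr/bin/perl
use strict;
use warnings;

$/ = "\n\n";    # record separator; the children inherit this via fork

# Placeholder per-record work for one consumer process.
sub consumer {
    my ($name) = @_;
    while ( my $record = <STDIN> ) {
        # <--- per-record parsing for this consumer --->
        print "$name got a record\n";
    }
    exit 0;
}

# Fork one child per consumer; keep the write end of each pipe.
my @writers;
for my $name (qw(consumer_A consumer_B)) {
    my $pid = open( my $wh, '|-' );    # implicit fork, pipe to the child's STDIN
    die "fork failed: $!" unless defined $pid;
    consumer($name) if $pid == 0;      # child reads records and never returns
    push @writers, $wh;                # parent keeps the write handle
}

# Read the input file exactly once, fanning each record out.
open( my $fp, '<', 'input.log' ) or die "open input.log failed: $!";
while ( my $record = <$fp> ) {
    print {$_} $record for @writers;
}
close($fp);
close($_) for @writers;    # EOF lets each child drain its pipe and exit
wait() for @writers;       # reap both children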

Question 2: Which will be faster: parsing the huge input file in two different processes, or parsing it once and sending the data to the different processes record by record?

Re: passing a file content from one process to another process.
by jethro (Monsignor) on May 03, 2010 at 08:56 UTC

    Normally you will spend more time waiting for the disk to deliver your data than parsing it. So the answer to question 2 is that you shouldn't care; better to optimize for readability and maintainability.

    Parsing is a wide field. It might mean searching for a few strings, or parsing a programming language like Perl. I assume here that your parsing needs are rather simple.

    In that case you wouldn't need two processes. You would use two finite-state machines working in parallel:

    my $state1 = 0;
    my $state2 = 0;
    while ( $record = $fp->getline() ) {
        $state1 = statemachine1( $state1, $record );
        $state2 = statemachine2( $state2, $record );
    } # end of while

    Here is a short description and more links about finite state machines: Re^3: How to parse a text file.
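    For example, one such machine (entirely made up: it prints the records found between BEGIN and END markers) could be as small as this:

    sub statemachine1 {
        my ( $state, $record ) = @_;
        if    ( $state == 0 && $record =~ /^BEGIN/ ) { $state = 1 }    # enter a block
        elsif ( $state == 1 && $record =~ /^END/ )   { $state = 0 }    # leave it
        elsif ( $state == 1 )                        { print $record } # inside a block
        return $state;
    }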

    UPDATE: I read your question again, and if that file is just a long list of "records", i.e. lines without any structure that connects two or more lines, you don't even need a finite state machine. Just use the two subroutines. You do know about the split() function? It is very helpful for splitting a line into its components.
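    For instance (field names made up, assuming whitespace-separated fields):

    # split ' ' splits on runs of whitespace and ignores leading whitespace
    my ( $timestamp, $level, $message ) = split ' ', $record, 3;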

Re: passing a file content from one process to another process.
by CountZero (Bishop) on May 03, 2010 at 08:53 UTC
    50 MB is not "huge", but even then it is nice to be parsimonious with system resources.

    Rather than going the "forking" way, can you not just package each of your processes into a subroutine, call both subroutines one after the other, and give them the content of the record to work with as a parameter?
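    A rough sketch (process_one() and process_two() stand in for whatever your two processes currently do with a record):

    while ( my $record = $fp->getline() ) {
        process_one($record);    # first "process", now just a subroutine
        process_two($record);    # second "process", sees the same record
    }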

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James