dideod.yang has asked for the wisdom of the Perl Monks concerning the following question:
###### test.txt ########
    sample AA
    sample BB
    Not sample CC
    good boy
    good yyy
    bad aaa
    open(FILE,"test.txt");
    while(<FILE>){
        if(/^sample\s+(\S+)/){ push @sample, $1 }
        if(/^good\s+(\S+)/){ push @good, $1 }
    }
    close(FILE);
Replies are listed 'Best First'.
Re: About text file parsing -- MCE
by Discipulus (Canon) on Aug 29, 2018 at 07:27 UTC
If your file is huge, line-by-line processing will be slow with any variation of the algorithm. But you can throw more CPUs at it with, hopefully, better results. While parallel programming is not so easy to implement correctly in Perl, a gentle monk, marioroy, spent a lot of time and energy to help us, producing MCE, and the second example in its documentation can easily be modified to suit your needs. That example uses MCE::Loop to work on a file in chunks: pay attention to the OS-dependent implementation inside the mce_loop_f call and choose the one appropriate for your OS (a sketch in that spirit follows the update below).
UPDATE: you may also be interested in some other techniques you can find in my library.
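A minimal sketch in the spirit of that documentation example, adapted to the two patterns here; the worker count and the "tag:value" output format are assumptions, not Discipulus's original code:

    use strict;
    use warnings;
    use MCE::Loop;

    MCE::Loop->init( max_workers => 4, use_slurpio => 1 );

    my @result = mce_loop_f {
        my ( $mce, $slurp_ref, $chunk_id ) = @_;
        my @matches;

        # Fast on Unix: open a memory handle on the slurped chunk.
        # On Windows, consider splitting $$slurp_ref on newlines
        # instead; see the MCE::Loop docs for the OS-dependent variants.
        open my $MEM_FH, '<', $slurp_ref;
        while ( <$MEM_FH> ) {
            push @matches, "sample:$1" if /^sample\s+(\S+)/;
            push @matches, "good:$1"   if /^good\s+(\S+)/;
        }
        close $MEM_FH;

        MCE->gather(@matches);
    } 'test.txt';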
There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: About text file parsing
by Corion (Patriarch) on Aug 29, 2018 at 08:45 UTC
Have you timed how fast you can read the file at all? Maybe reading the file is what limits your speed? Do you have enough RAM to keep all the data you are extracting in arrays? Maybe writing the output into separate files immediately makes things faster. At least it makes certain that your program uses far less RAM.
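A minimal sketch of the write-immediately idea; the output file names are assumptions of my own:

    use strict;
    use warnings;

    open my $in,     '<', 'test.txt'   or die "test.txt: $!";
    open my $sample, '>', 'sample.out' or die "sample.out: $!";  # name is an assumption
    open my $good,   '>', 'good.out'   or die "good.out: $!";    # name is an assumption

    while (<$in>) {
        print {$sample} "$1\n" if /^sample\s+(\S+)/;
        print {$good}   "$1\n" if /^good\s+(\S+)/;
    }
    close $_ for $in, $sample, $good;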
Re: About text file parsing
by davido (Cardinal) on Aug 29, 2018 at 18:45 UTC
I did the following:
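Presumably something along these lines; a hypothetical reconstruction that writes a 50-million-line test file (the line count is inferred from the figures below, and the line contents are assumptions):

    use strict;
    use warnings;

    # Hypothetical reconstruction: write 50 million lines of test data.
    open my $out, '>', 'test.txt' or die "test.txt: $!";
    print {$out} "sample AA\ngood boy\nNot sample CC\nbad aaa\n"
        for 1 .. 12_500_000;    # 4 lines x 12.5 million = 50 million
    close $out;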
On my laptop with an SSD that took about fifteen seconds to run. Then I did this:
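Presumably a bare read of the same file, something like this hypothetical reconstruction:

    use strict;
    use warnings;

    # Hypothetical reconstruction: read every line, do nothing else.
    open my $in, '<', 'test.txt' or die "test.txt: $!";
    1 while <$in>;
    close $in;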
And that took about eight seconds to run.

In the case of your code, within the while() {...} loop you're invoking the regex engine, doing a capture, and pushing onto two arrays. If you have hits on, say, 50% of the lines in your file, you'll be pushing 25 million captures into the arrays. Depending on the size of your captures, you could have one to several gigabytes stored in the arrays. If your run-times for the code segment you demonstrated are under 30-45 seconds, you're probably doing about as well as can be expected for a single process working with a file. If the time is over a couple of minutes, you're probably swamping memory and doing a lot of paging out behind the scenes.

If that's the case, consider writing entries to a couple of output files instead of pushing into the @good and @sample arrays. This will add IO overhead to the process, but will remove the memory impact, which is probably generating even more IO overhead behind the scenes at a much lower layer. Once the 'sample' and 'good' files are written, you can process them line by line to do with them what you would have done with the arrays.

Another alternative: instead of pushing onto @sample and @good, do the processing that would later happen on @sample and @good just in time for each line of the input file, i.e.:
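A sketch of that just-in-time alternative, assuming the same patterns as the original code:

    use strict;
    use warnings;

    open my $fh, '<', 'test.txt' or die "test.txt: $!";
    while (<$fh>) {
        if (/^sample\s+(\S+)/) {
            my $capture = $1;
            # do something with $capture
        }
        elsif (/^good\s+(\S+)/) {
            my $capture = $1;
            # do something with $capture
        }
    }
    close $fh;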
As long as "# do something with $capture" does not include storing the entire capture into an array, this should pretty much wipe out the large memory footprint. Dave
Re: About text file parsing
by bliako (Abbot) on Aug 29, 2018 at 10:52 UTC
If you put the file in a RAM disk, you take the hard disk out of the equation entirely. An additional benefit of the RAM-disk route is that you can keep your input files there for multiple perl runs, until the next reboot or until you remove them from RAM. So the second time you run a similar script to find different patterns, you will see a better time benefit because the input is already in RAM.

If you go the parallel way (as Discipulus mentioned), then you are bound by the total IO bandwidth of your hard disk, and so the benefits may be different from simply dividing the run time by the number of workers.

If your Not sample CC lines are numerous, you can filter them out before running all the different regexes on each line of input, or even before running the perl script at all: for example, via grep -v 'Not sample CC' input.txt | perl ... or with a perl one-liner filter, though I am not sure perl beats grep. Of course the lines to be filtered out need a common regex to match them.

And finally, if you do manage to remove all the Not sample CC lines, it is worth trying the following and seeing if it is faster (caveat: results in %inp will be in random order, not the order of insertion as with arrays):
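A minimal sketch of that idea, assuming a single alternation regex keyed on the leading word:

    use strict;
    use warnings;

    # One regex per line instead of two; captures land in a hash of
    # arrays keyed by the leading word ('sample' or 'good').
    my %inp;
    open my $fh, '<', 'test.txt' or die "test.txt: $!";
    while (<$fh>) {
        push @{ $inp{$1} }, $2 if /^(sample|good)\s+(\S+)/;
    }
    close $fh;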
Edit: If you want to pass the output of your command above to another command for further processing, then the problem of waiting for a process to finish in order to get all its output and run it through another command, and so on, was solved a long time ago: it is called a pipeline, and it is essentially what you see in unix-style cmd1 | cmd2 | cmd3 .... cmd1 starts outputting results as soon as it reads its input (if it is a simple program like yours above); its output is immediately read by cmd2, which spits out its own output as soon as the first line is read, and on to cmd3, which finally gives you output almost as soon as the first line of input is read by cmd1, plus the propagation time. So you save a lot of time and you have results coming out almost immediately. The proviso is that processing one line or chunk of input must be independent of the following lines of input.
by TheloniusMonk (Sexton) on Aug 29, 2018 at 12:48 UTC
Re: About text file parsing
by SuicideJunkie (Vicar) on Aug 29, 2018 at 17:42 UTC
XY problem question here... Perhaps consider something more like the sketch below: it keeps little more than one line in memory at a time, and you can then deal with the pieces separately. Uncomment the commented lines to get a progress display.
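A minimal sketch along those lines; the split-based parsing, the counters, and the progress interval are assumptions of my own:

    use strict;
    use warnings;

    my %count;
    open my $fh, '<', 'test.txt' or die "test.txt: $!";
    while (my $line = <$fh>) {
        #print STDERR "at line $.\n" unless $. % 1_000_000;  # progress display
        my ($type, $value) = split ' ', $line, 2;
        next unless defined $value;
        # deal with the pieces separately as they arrive
        $count{$type}++ if $type eq 'sample' or $type eq 'good';
    }
    close $fh;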
Re: About text file parsing
by TheloniusMonk (Sexton) on Aug 29, 2018 at 08:31 UTC
Re: About text file parsing
by stevieb (Canon) on Aug 29, 2018 at 14:36 UTC
Although I really like Discipulus's approach to parallel processing the file, I thought I'd throw out Tie::File as an option. I've used it successfully a couple of times years ago. It doesn't load the entire file at once; instead, it reads it in chunks and presents the file as an array. Instead of reading from a filehandle as in your original code, you'd do something like the following after tying the file with the distribution:
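A minimal sketch of the Tie::File variant; the loop body mirrors the original code, while the tie itself is the distribution's documented interface:

    use strict;
    use warnings;
    use Tie::File;

    tie my @lines, 'Tie::File', 'test.txt' or die "test.txt: $!";

    my (@sample, @good);
    for my $line (@lines) {    # records are fetched lazily, not slurped
        push @sample, $1 if $line =~ /^sample\s+(\S+)/;
        push @good,   $1 if $line =~ /^good\s+(\S+)/;
    }
    untie @lines;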
by haukex (Archbishop) on Aug 30, 2018 at 21:53 UTC
Unfortunately, Tie::File also adds significant overhead, so it's pretty safe to assume that it would slow things down and burn more memory.
Re: About text file parsing
by marioroy (Prior) on Aug 30, 2018 at 12:48 UTC
Greetings, dideod.yang. The regular expressions in your code present an opportunity for running in parallel. With parallel cores among us (our friends), let us take Perl for a spin. Please find below the serial and parallel demonstrations.

Serial
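A sketch of what the serial demonstration presumably looked like; the summary line at the end is an addition of my own:

    use strict;
    use warnings;

    my (@sample, @good);
    open my $fh, '<', 'test.txt' or die "open error: $!";
    while (<$fh>) {
        if    (/^sample\s+(\S+)/) { push @sample, $1 }
        elsif (/^good\s+(\S+)/)   { push @good,   $1 }
    }
    close $fh;
    printf "sample: %d  good: %d\n", scalar(@sample), scalar(@good);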
Parallel
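A sketch of a parallel version using MCE::Loop with a gather callback, so the manager process assembles @sample and @good; the worker count and chunk size are assumptions:

    use strict;
    use warnings;
    use MCE::Loop;

    my (@sample, @good);

    MCE::Loop->init(
        max_workers => 4,
        chunk_size  => '1m',
        gather      => sub {          # runs in the manager process
            my ($s_ref, $g_ref) = @_;
            push @sample, @{$s_ref};
            push @good,   @{$g_ref};
        },
    );

    mce_loop_f {
        my ($mce, $chunk_ref, $chunk_id) = @_;
        my (@s, @g);
        for my $line (@{$chunk_ref}) {
            if    ($line =~ /^sample\s+(\S+)/) { push @s, $1 }
            elsif ($line =~ /^good\s+(\S+)/)   { push @g, $1 }
        }
        MCE->gather(\@s, \@g);
    } 'test.txt';

    MCE::Loop->finish;
    printf "sample: %d  good: %d\n", scalar(@sample), scalar(@good);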
50 million line test

The tests were timed on a system with an NVMe SSD. Notice the user times: MCE has low overhead.
Regards, Mario
by marioroy (Prior) on Aug 30, 2018 at 13:41 UTC
Hi again. One may want to have the manager process receive and loop through @sample and @good. That will occupy an additional CPU core for the manager process itself.
The extra time comes from the workers appending to local arrays and, likewise, from the manager process receiving and looping through the arrays. There are 4 workers plus the manager process running simultaneously on a machine with 4 real cores.
Update: Interestingly, Perl v5.20 and higher take 2x longer to run; I'm not sure why. Yikes, possibly the regular expression engine? This is on my TODO list to check. The above was captured with Perl v5.18.2 on the same machine.
Regards, Mario
by marioroy (Prior) on Aug 30, 2018 at 14:17 UTC
Once again, hi :) Using a simplified demonstration, regular expression matching appears to be 3x slower in Perl v5.20 and higher. I'm not sure why.
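A simplified benchmark of the sort described; the loop count and the timing method are assumptions:

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $t0   = time;
    my $hits = 0;
    for (1 .. 10_000_000) {
        my $line = "sample AA\n";
        $hits++ if $line =~ /^sample\s+(\S+)/;
    }
    printf "%0.3f seconds, %d matches\n", time() - $t0, $hits;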
Results
Regards, Mario
Re: About text file parsing
by tybalt89 (Monsignor) on Aug 30, 2018 at 18:58 UTC
See if it is faster to read big chunks at a time, as in this simple test case (of course, modify it for your file).
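A sketch of the chunked approach: slurp about 1 MB per read, top the chunk up to the next newline so no record is split, then run the regexes with the /m and /g modifiers over the whole chunk. The chunk size is an assumption:

    use strict;
    use warnings;

    my (@sample, @good);
    open my $fh, '<', 'test.txt' or die "test.txt: $!";
    local $/ = \ (1024 * 1024);              # read ~1 MB per <$fh>
    while (my $chunk = <$fh>) {
        if (substr($chunk, -1) ne "\n") {    # finish the partial last line
            local $/ = "\n";
            my $rest = <$fh>;
            $chunk .= $rest if defined $rest;
        }
        push @sample, $chunk =~ /^sample\s+(\S+)/mg;
        push @good,   $chunk =~ /^good\s+(\S+)/mg;
    }
    close $fh;
    printf "sample: %d  good: %d\n", scalar(@sample), scalar(@good);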
Outputs:
by marioroy (Prior) on Aug 30, 2018 at 20:52 UTC
That's cool, tybalt89. Each day I learn something new about Perl. I ran serial and parallel versions with "test.txt" containing 50 million lines. There is no slowness using Perl v5.20 and higher.

Serial
Parallel
Demo
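A sketch combining tybalt89's chunked /mg matching with MCE, gathering per-chunk results keyed by chunk id; the worker count is an assumption:

    use strict;
    use warnings;
    use MCE::Loop;

    MCE::Loop->init( max_workers => 4, use_slurpio => 1 );

    # Each worker matches over its whole slurped chunk; the manager
    # collects [ \@sample, \@good ] pairs keyed by chunk id.
    my %res = mce_loop_f {
        my ($mce, $slurp_ref, $chunk_id) = @_;
        MCE->gather( $chunk_id, [
            [ $$slurp_ref =~ /^sample\s+(\S+)/mg ],
            [ $$slurp_ref =~ /^good\s+(\S+)/mg   ],
        ] );
    } 'test.txt';

    my (@sample, @good);
    for my $id (sort { $a <=> $b } keys %res) {  # restore input order
        push @sample, @{ $res{$id}[0] };
        push @good,   @{ $res{$id}[1] };
    }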
Regards, Mario
Re: About text file parsing
by Marshall (Canon) on Aug 29, 2018 at 19:01 UTC
Do not push to an array if you can process the line right now!
Re: About text file parsing
by Anonymous Monk on Aug 29, 2018 at 06:57 UTC
Hi. How many gigabytes of RAM do you have? How many gigabytes is your file (4.7GB)?