MelaOS has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow monks, I have a huge txt file (1XX MB in size), and I need to loop through the whole file extracting data line by line. I've tried a few times, and currently it takes about one minute plus to extract the whole thing.

My algorithm is a simple one: I just open the file, loop through it, regex match each line, and extract the data by printing directly to the OUT output file, as memory cannot hold the whole file all at once.

I have a limit file which tells my script what to extract so that I don't have to extract everything. The limit file contains test names which I can use to regex match against the txt file.

What I want to do here is find a way to speed up the extraction. I'm thinking of using multiple threads, so that each thread extracts only a few of the test names specified in the limit file into a few output files. Is this a good idea? Is there a better way to do this? Thanks in advance~

Replies are listed 'Best First'.
Re: How to speed up/multi thread extract from txt files?
by perrin (Chancellor) on Jan 09, 2008 at 03:13 UTC

    The mistake you're making here is trying to optimize without any idea what is actually slowing it down. Give it a run with Devel::DProf and then you'll know.
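
    For reference (not from perrin's post), a typical Devel::DProf session looks something like this, with hypothetical script and file names:

    # run the script under the profiler; this writes tmon.out
    perl -d:DProf extract.pl bigfile.txt

    # summarise the profile, slowest subroutines first
    dprofpp tmon.out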

    When I worked on a similar problem, I found that most of the time was spent reading lines from disk, not doing the regexes. I sped it up by using the sliding window technique to read in larger chunks.
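
    A minimal sketch of that chunked approach (the buffer size and line handling are my own guesses, not perrin's actual code): read a large block, process only up to the last complete line, and carry the partial tail over to the next block.

    #!/usr/bin/perl
    use strict;
    use warnings;

    open my $fh, '<', $ARGV[0] or die $!;
    my $tail = '';
    while (read($fh, my $chunk, 1_048_576)) {   # 1MB blocks
        $chunk = $tail . $chunk;
        my $last_nl = rindex($chunk, "\n");
        if ($last_nl == -1) { $tail = $chunk; next; }
        $tail = substr($chunk, $last_nl + 1);   # partial last line
        for (split /\n/, substr($chunk, 0, $last_nl)) {
            # ... per-line regex matching goes here ...
        }
    }
    # $tail may still hold a final line with no trailing newline
    close $fh;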

Re: How to speed up/multi thread extract from txt files?
by BrowserUk (Patriarch) on Jan 09, 2008 at 04:19 UTC

    How many CPUs do you have? Neither threads nor processes are likely to help unless you have more than one, and even then it's unlikely to get much quicker using the method you are considering, as they would all be accessing the same disc and file. You would likely just slow things down.

    One thing that might help is to load the file as a single string. I know you said you do not have enough memory, but I am guessing that you have been trying to load the file as an array of lines, which takes a lot more RAM than loading it as a single string. It's a rare machine these days that doesn't have 200MB available.

    The following shows my fairly average machine loading a 200MB file consisting of 100-character lines of random digits, then searching the resultant string and finding 2000+ occurrences, all in just under 3/4 of a second.

    #! perl -slw
    use strict;
    use Benchmark::Timer;

    my $T = new Benchmark::Timer;

    $T->start( 'read' );
    open IN, '<', $ARGV[ 0 ] or die $!;
    my $text;
    sysread IN, $text, -s( $ARGV[ 0 ] ) or die $!;
    $T->stop( 'read' );

    printf "file contains %d bytes\n", length $text;

    my $count = 0;
    $T->start( 'regex' );
    $count++ while $text =~ m[(12345)]g;
    $T->stop( 'regex' );

    print "$count matches found";
    $T->report;

    __END__
    C:\test>junk junk.dat
    file contains 213909504 bytes
    2016 matches found
    1 trial of read (452.624ms total)
    1 trial of regex (260.872ms total)

    If you are searching for multiple texts and need to capture/output whole lines then things will be slower, but it's not possible to be more realistic without better information.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I first load the hash with only the test names I'm interested in extracting, then use the code below to do the actual extraction into a text file.
      I ran just a bare while loop through the file, and it takes about 25 seconds; when I ran it through my code below, it took about 52 seconds.
      Any other helpful advice? Thanks~
      ## extract all the param values from the stdf
      sub get_paramValue {
          my ($stdf, $lot, $oper, $sum, %param_flag) = @_;
          my ($output);

          print "Running with stdf:$stdf.\n";
          &log("get_paramValue", "Running with stdf:$stdf.");

          if (-e $stdf) {
              ## create the output file name, similar to the stdf name but with .log ext
              $output = $outputdir . $lot . "_" . $oper . "_" . $sum . ".log";
              open(OUT, ">$output")
                  or &log("get_paramValue", "Can't write to output: $output");
              print OUT "tname,idx,param_val\n";

              open(STDF, $stdf)
                  or &log("get_paramValue", "Die can't read from stdf:$stdf.");

              my (@tmp, $testname, $testFound, $paramVal, $unit_count);
              while (<STDF>) {
                  if (/3_prtnm_/) {
                      @tmp        = split(/3_prtnm_/);
                      $unit_count = &trim($tmp[1]);
                  }
                  elsif (/2_tname_/) {
                      @tmp      = split(/2_tname_/);
                      $testname = &trim($tmp[1]);
                      if (exists $param_flag{$testname}) {
                          $testFound = 1;
                      }
                  }
                  elsif ($testFound) {
                      if (/2_mrslt_/) {
                          @tmp      = split(/2_mrslt_/);
                          $paramVal = &trim($tmp[1]);
                          print OUT "$testname,$unit_count,$paramVal\n";
                          $testFound = 0;
                      }
                  }
              } ## END while
              close(STDF);
              close(OUT);
          } ## END IF

          return $output;
      } ## end sub
        I have not really understood your code, but sometimes it helps a lot to work with call by reference instead of call by value.
        Call by value means an additional copy of your data is made when the function is called.
        Given your amount of data, this can speed things up big time.
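
        Purely as an illustration (the names here are made up, not from the OP's code), passing a reference means only one scalar goes into @_ instead of a copy of every key and value:

        use strict;
        use warnings;

        my %param_flag = map { $_ => 1 } qw(t_alpha t_beta);

        # the hash is passed as a single reference, not flattened into @_
        sub is_wanted {
            my ($testname, $flags) = @_;
            return exists $flags->{$testname};
        }

        print is_wanted('t_beta', \%param_flag) ? "found\n" : "not found\n";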

        Ignore (most) of this!

        That's the trouble with running code in your head: you don't always notice scoping issues. An answer to the question at the end would still help, though.

        Did you tidy your code up for posting? I ask because there is a logic error in what you've posted that (I think) means it cannot do what you want it to do.

        if(/3_prtnm_/){
            @tmp = split(/3_prtnm_/);
            $unit_count = &trim($tmp[1]);
        }
        elsif(/2_tname_/){
            @tmp = split(/2_tname_/);
            $testname = &trim($tmp[1]);
            if(exists $param_flag{$testname}){
                $testFound = 1;
            }
        }
        elsif($testFound){
            if(/2_mrslt_/){
                @tmp = split(/2_mrslt_/);
                $paramVal = &trim($tmp[1]);
                print OUT "$testname,$unit_count,$paramVal\n";
                $testFound = 0;
            }
        }
        1. You will only ever print anything to the output file if $testFound is true.
        2. But $testFound is only ever set true inside another branch of the same if/else cascade?
        3. The same is also true for the values of both $testname and $unit_count.

        Also, how many keys are there in %param_flag and what do they look like?

        If you can clarify those, I'll try and adapt the logic of your subroutine to use the big string technique I mentioned above.



        In order to help you further, I would need to see a short example of the contents of the input file, and also the number of keys in %param_flag and what they look like.


Re: How to speed up/multi thread extract from txt files?
by ysth (Canon) on Jan 09, 2008 at 03:03 UTC
Re: How to speed up/multi thread extract from txt files?
by KurtSchwind (Chaplain) on Jan 09, 2008 at 03:05 UTC

    You don't mention which OS you are on. If you are on a *nix box, you can use the split(1) utility.

    Split the file into equal parts and just run multiple instances of your application.
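
    Here's a rough sketch of that idea in Perl (the file names and the worker script are hypothetical): assume the big file has already been split into part.aa, part.ab, ... with split(1), and fork one child per piece.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @parts = glob 'part.*';          # pieces produced by split(1)

    for my $part (@parts) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                # child: process one piece
            exec 'perl', 'extract.pl', $part
                or die "exec failed: $!";
        }
    }

    1 while wait() != -1;               # parent: wait for all children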

    At least that's one way. However, have you considered that the extraction process isn't actually what needs optimizing? If you are running a list of regexes against each line, I'd hazard a guess that that consumes more of your time than reading each line in. It would help if you posted some of your code so we can take a gander and see where your bottleneck is.

    --
    I used to drive a Heisenbergmobile, but every time I looked at the speedometer, I got lost.
Re: How to speed up/multi thread extract from txt files?
by Cody Pendant (Prior) on Jan 09, 2008 at 04:02 UTC
    It's almost too obvious to ask, but have you considered that you need a database now that your data is this size?


    Nobody says perl looks like line-noise any more
    kids today don't know what line-noise IS ...
Re: How to speed up/multi thread extract from txt files?
by holli (Abbot) on Jan 09, 2008 at 02:56 UTC
    Buy a faster hard disk ;)


    holli, /regexed monk/
Re: How to speed up/multi thread extract from txt files?
by Anonymous Monk on Jan 09, 2008 at 19:14 UTC

    You may want to try compiling your regexps before the loop with something like:

    my $re1 = qr/3_prtnm_/;
    my $re2 = qr/2_tname_/;
    # ...
    while( <STDF> ) {
        # ...
        if( m/$re1/ ) {
            @tmp = split($re1);
            # ...
        }
        elsif( m/$re2/ ) {
            @tmp = split($re2);
            # ...
        }
    }

    Another trick, if you have a few million lines to parse, could be to rewrite the inside of the loop as something more like:

    my $re = qr/(3_prtnm_|2_tname_|2_mrslt_)/;
    # ...
    while( my $line = <STDF> ) {
        if( $line =~ m/$re/ ) {
            my $sep = $1;
            my @tmp = split( /$sep/, $line );
            # ...
        }
    }

    I'm not sure whether the speedup would be noticeable; because of the varying split, this should be benchmarked/optimized.

    Have a look at http://www.stonehenge.com/merlyn/UnixReview/col28.html for regexp compilation.

    One sure thing is that threading is hard to get right, and in big loops there are generally many other possible optimizations to try first.

Re: How to speed up/multi thread extract from txt files?
by guaguanco (Acolyte) on Jan 09, 2008 at 18:53 UTC
    Have you considered modifying your tests to use index() ?

    if(/3_prtnm_/)

    becomes:

    if (index($_, '3_prtnm_') != -1)

    If I read your code right, you are using regex matches to look for fixed strings, which is not the fastest approach.
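
    If you want to check that claim on your own data, a quick Benchmark comparison along these lines (the sample line is invented) would settle it:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $line = '42_3_prtnm_0133611';

    # run each test for at least 2 CPU seconds and compare rates
    cmpthese(-2, {
        regex => sub { my $hit = $line =~ /3_prtnm_/ },
        index => sub { my $hit = index($line, '3_prtnm_') != -1 },
    });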

Re: How to speed up/multi thread extract from txt files?
by guaguanco (Acolyte) on Jan 09, 2008 at 21:23 UTC
    Another thing to think about: since you know that you're spending 25 seconds on file IO alone, you might consider reading the file in larger chunks than one line (perhaps 64K chunks). Might shave off a few seconds.
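
    One idiomatic way to get fixed-size reads in Perl (a sketch, not tested against the OP's data): setting $/ to a reference to an integer makes the readline operator return blocks of that many bytes. Blocks will usually end mid-line, so the partial tail line has to be carried over, as in the chunked sketch further up the thread.

    use strict;
    use warnings;

    open my $fh, '<', 'input.txt' or die $!;
    {
        local $/ = \65536;              # read 64K blocks, not lines
        while (my $block = <$fh>) {
            # ... split into lines, keeping the partial tail ...
        }
    }
    close $fh;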
Re: How to speed up/multi thread extract from txt files?
by BrowserUk (Patriarch) on Jan 10, 2008 at 20:02 UTC

    Since you haven't provided the requested further information, here's an indication of what is possible. The following adaptation of your posted sub, run against 200MB of simulated data, shows it doing single searches in ~1/2 second and 10 searches in ~4 seconds. Several runs of 100 searches all came in under 36 seconds.

    This includes locating and extracting 3 values, per your original. It could probably be optimised further, but there's no point with guessed-at test data.

    The simulated data I used looks like this:

    C:\test>u:head 661250.dat
    1_3_prtnm_1441794
    1_2_tname_1441794
    1_2_mrslt_1
    2_3_prtnm_0133611
    2_2_tname_0133611
    2_2_mrslt_2
    3_3_prtnm_1469079
    3_2_tname_1469079
    3_2_mrslt_3
    4_3_prtnm_0852340
    ...

    Probably not very realistic data, but also possibly more testing than the real stuff. Enjoy!
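
    BrowserUk's adapted sub itself doesn't appear to have survived in this copy of the thread, so purely as an illustration of the big-string technique against the simulated format above (the test names and all other details are guesses):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = shift or die "usage: $0 datafile\n";
    my %param_flag = map { $_ => 1 } qw( 1441794 0133611 );  # wanted test names

    # slurp the whole file into one string
    open my $fh, '<', $file or die $!;
    sysread $fh, my $text, -s $file or die $!;
    close $fh;

    # for each wanted name, match the prtnm/tname/mrslt triple in one go
    for my $tname (keys %param_flag) {
        while ($text =~ /^\d+_3_prtnm_(\S+)\n\d+_2_tname_\Q$tname\E\n\d+_2_mrslt_(\S+)/mg) {
            print "$tname,$1,$2\n";
        }
    }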


Re: How to speed up/multi thread extract from txt files?
by sundialsvc4 (Abbot) on Jan 10, 2008 at 18:13 UTC

    The speed of this operation is going to rely on the speed of the hard-drive and the efficiency of the operating-system's buffering operations.

    If it takes “one minute plus” to process a 1XX megabyte file, I think I'd be pretty happy with that ...

    You are probably not going to improve upon that. By fiddling with the program you could easily slow it down.