Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks,

I'm retrieving a large amount of data from a server and, after some preprocessing, saving it locally in a text file (~3 GB). I process one small chunk at a time, and I'm wondering whether it's better to print each chunk straight after processing, to print only after every 10,000 chunks or so (out of ~70,000), or to store them all and print everything at once. Speed is the most important part, though I'm not sure that creating a 3 GB array would be wise.

As for the data: each chunk is split into hundreds of lines. The first line is an identifier and the rest should all be concatenated. I've used split on newline and then .= for that, but perhaps join would be better? Or split with a limit of 2 and then use s/\n//g/ on the second part? Again, speed is crucial.

Thanks for all the advice.

  • Comment on Print in loop or store and print once done

Re: Print in loop or store and print once done
by kcott (Archbishop) on Feb 08, 2014 at 00:59 UTC

    From the description of your data, it sounds like you may be dealing with FASTA format. Even if you're not, the following technique (with a little modification) may do what you want.

    #!/usr/bin/env perl
    use strict;
    use warnings;

    while (<DATA>) {
        if (/^>/) {
            print "\n" unless $. == 1;
            print;
        }
        else {
            chomp;
            print;
        }
    }
    print "\n";

    __DATA__
    >chunk1
    c1_line1
    c1_line2
    c1_line3
    >chunk2
    c2_line1
    c2_line2
    c2_line3

    Output:

    >chunk1
    c1_line1c1_line2c1_line3
    >chunk2
    c2_line1c2_line2c2_line3

    Use Benchmark to compare this with any other solutions and determine which runs fastest.
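As a rough sketch of such a comparison, the script below times the three concatenation approaches mentioned in the question using the core Benchmark module. The sample chunk and the sub names (by_concat, by_join, by_tr) are made up for illustration; real timings depend entirely on your own data, so treat the resulting table as indicative only.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample chunk: an identifier line followed by 300 data lines.
my $chunk = ">chunk1\n" . join('', map { "line$_\n" } 1 .. 300);

# Approach 1: split on newlines, then append each line with .=
sub by_concat {
    my ($id, @rest) = split /\n/, $_[0];
    my $body = '';
    $body .= $_ for @rest;
    return "$id\n$body\n";
}

# Approach 2: split on newlines, then join the remaining lines
sub by_join {
    my ($id, @rest) = split /\n/, $_[0];
    return "$id\n" . join('', @rest) . "\n";
}

# Approach 3: split with a limit of 2, then delete newlines with tr
sub by_tr {
    my ($id, $body) = split /\n/, $_[0], 2;
    $body =~ tr/\n//d;
    return "$id\n$body\n";
}

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    concat => sub { by_concat($chunk) },
    join   => sub { by_join($chunk) },
    tr     => sub { by_tr($chunk) },
});
```

A negative count tells cmpthese to run each sub for at least that many CPU seconds, which tends to give steadier numbers than a fixed iteration count.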

    For the split and join (or .=) operations you mention, I'd have to guess you're using some form of 'local $/ = ...' (see perlvar: Variables related to filehandles) — I'd need more information to comment further on that. Of course, there's no reason why you can't compare those options with any others.
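For reference, here is a minimal sketch of what reading whole chunks with 'local $/ = ...' might look like. It assumes records start with ">" as in FASTA; the sub name read_records and the sample data are hypothetical, and this is only a guess at the OP's setup.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Read ">"-delimited records from a filehandle and return them as
# "identifier line, newline, concatenated body, newline" strings.
sub read_records {
    my ($fh) = @_;
    local $/ = "\n>";          # a record ends where the next header begins
    my @records;
    while (my $record = <$fh>) {
        chomp $record;         # removes a trailing "\n>" separator, if present
        $record =~ s/^>//;     # only the very first record still has its ">"
        my ($id, $body) = split /\n/, $record, 2;
        $body = '' unless defined $body;
        $body =~ tr/\n//d;     # concatenate the remaining lines
        push @records, ">$id\n$body\n";
    }
    return @records;
}

# Demo with an in-memory filehandle holding made-up data.
my $sample = ">chunk1\nc1_line1\nc1_line2\n>chunk2\nc2_line1\nc2_line2\n";
open my $in, '<', \$sample or die "Cannot open in-memory file: $!";
print read_records($in);
close $in;
```

Because $/ is localized inside the sub, the normal line-by-line behaviour is restored as soon as read_records returns.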

    Two comments regarding stripping embedded newlines from a string using "s/\n//g/":

    1. That's incorrect syntax. There's no slash after the modifier (i.e. it's just s/\n//g). See perlop: Regexp Quote-Like Operators.
    2. Transliteration (e.g. y/\n//d) is probably faster. See perlop: Quote-Like Operators. [Note: y/// and tr/// are synonymous.]
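To illustrate the two points above, this small sketch strips newlines both ways and checks that the results agree. The helper names strip_with_s and strip_with_tr are made up for the example; it makes no claim about relative speed, which is best measured on your own data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Substitution: note there is no trailing slash after the /g modifier.
sub strip_with_s {
    my ($text) = @_;
    $text =~ s/\n//g;
    return $text;
}

# Transliteration: /d deletes every character in the search list
# that has no counterpart in the (empty) replacement list.
sub strip_with_tr {
    my ($text) = @_;
    $text =~ tr/\n//d;
    return $text;
}

my $text = "c1_line1\nc1_line2\nc1_line3\n";

print strip_with_s($text),  "\n";   # prints c1_line1c1_line2c1_line3
print strip_with_tr($text), "\n";   # prints c1_line1c1_line2c1_line3

# Without /d and with an empty replacement list, tr only counts matches
# and leaves the string unchanged.
my $newlines = ($text =~ tr/\n//);
print "newlines: $newlines\n";      # prints newlines: 3
```

Transliteration works on fixed characters only, so it is not a drop-in replacement when the pattern needs real regex features.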

    If my guess at your requirements is wrong, please supply more information to help us help you. The guidelines in "How do I post a question effectively?" should point you in the right direction with respect to this.

    -- Ken

      Hi Ken,

      Thanks for your comments, really helpful. I've compared different methods, but the difference is negligible in my case. It seems the bottleneck was in a different place.

      Great to know about transliteration; it seems to be a faster, if simpler, form of substitution.

Re: Print in loop or store and print once done
by Laurent_R (Canon) on Feb 08, 2014 at 00:05 UTC
    Perl's I/O is buffered anyway (unless you specify otherwise), so although you might find some gains, don't expect too much: they are likely to be marginal.
      Thanks Laurent, exactly what I found when I compared different approaches.
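To make the buffering point concrete, here is a small sketch of the two strategies from the original question: printing every chunk immediately versus collecting chunks and printing in batches. The sub names (write_each, write_batched), the batch size and the sample chunks are all hypothetical. With Perl's default buffering, both end up as a modest number of larger writes to the operating system, which is probably why the measured difference was small; turning on autoflush (for example via $|) would disable that buffering.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Print every chunk as soon as it is ready; Perl's buffered I/O layer
# collects these small writes before handing them to the operating system.
sub write_each {
    my ($fh, @chunks) = @_;
    print {$fh} $_ for @chunks;
}

# Collect chunks in memory and print them in batches of $batch_size.
sub write_batched {
    my ($fh, $batch_size, @chunks) = @_;
    my @pending;
    for my $chunk (@chunks) {
        push @pending, $chunk;
        if (@pending >= $batch_size) {
            print {$fh} @pending;
            @pending = ();
        }
    }
    print {$fh} @pending if @pending;   # write the final partial batch
}

# Made-up data: 25 tiny chunks.
my @chunks = map { ">chunk$_\ndata$_\n" } 1 .. 25;

my ($fh1, $file1) = tempfile(UNLINK => 1);
write_each($fh1, @chunks);
close $fh1 or die "close failed: $!";

my ($fh2, $file2) = tempfile(UNLINK => 1);
write_batched($fh2, 10, @chunks);
close $fh2 or die "close failed: $!";

# Both strategies should produce files of identical size and content.
printf "%s: %d bytes\n", 'each',    -s $file1;
printf "%s: %d bytes\n", 'batched', -s $file2;
```

Wrapping either sub in a Benchmark run against realistic chunk sizes would show whether batching is worth the extra memory in a given case.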
Re: Print in loop or store and print once done (bench)
by Anonymous Monk on Feb 07, 2014 at 21:49 UTC
Re: Print in loop or store and print once done
by Kenosis (Priest) on Feb 07, 2014 at 21:20 UTC

    Without knowing your data and your operations on it, it's difficult to provide suggestions. Can you provide a data sample and your wanted outcome? Doing so may open the door to proposing efficient algorithms--or at least well-(in)formed suggestions.

      Hi Kenosis, sorry for the vague description. I perform this sort of operation on very diverse datasets, so it was hard to describe. As people mentioned below, it turns out the differences from method to method are very small.