khaunda has asked for the wisdom of the Perl Monks concerning the following question:

I have around 30-35 CSV files. Some sample data:

    First file:
    NAME CLIENT VAL1 VAL2 VAL3
    aa bb 4 4 5
    aa cc 4 4 5

    Second file:
    aa bb 4 4 5.34
    aa dd 4 4 4.05

    Third file:
    aa bb 4 4 5.34
    aa dd 4.1 4.3 4.05
    aa ff 4.3 0 3.4
My task is to merge them into a single file in a format like:

    NAME CLIENT VAL1 VAL2 VAL3
    aa bb <sum from all files> <sum from all files> <sum from all files>
    aa cc <sum from all files> <sum from all files> <sum from all files>
    aa dd <sum from all files> <sum from all files> <sum from all files>
In practice each file is very big, containing around 8-9k lines. I have written some code which produces the correct result, but it takes around 25-30 minutes to run. I am looking for a solution that takes less time, perhaps by using parallel processing or some other technique.
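For reference, a minimal sketch of the kind of merge being described (not the poster's actual code, which wasn't shown): accumulate running sums in a hash keyed on NAME and CLIENT. It assumes comma-separated fields and a header row in each file; adjust the split pattern for the actual delimiter.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Accumulate running sums per (NAME, CLIENT) pair across all files.
    my %sum;    # key: "NAME,CLIENT", value: arrayref of column totals

    for my $file (glob '*.csv') {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        <$fh>;    # skip the header row (assumed present in every file)
        while (my $line = <$fh>) {
            chomp $line;
            my ($name, $client, @vals) = split /,/, $line;
            my $totals = $sum{"$name,$client"} //= [ (0) x @vals ];
            $totals->[$_] += $vals[$_] for 0 .. $#vals;
        }
        close $fh;
    }

    # Print the merged result, sorted for stable output.
    print "NAME,CLIENT,VAL1,VAL2,VAL3\n";
    for my $key (sort keys %sum) {
        print join(',', $key, @{ $sum{$key} }), "\n";
    }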

Replies are listed 'Best First'.
Re: Fast Processing
by BrowserUk (Patriarch) on Nov 23, 2010 at 11:37 UTC
30-35 CSV files ... each file is very big, containing around 8-9k lines ... takes around 25-30 minutes

    35 * 9000 = 315000 lines. That is not big. This shows perl processing 40 million lines in just over 10 seconds:

    C:\test>wc -l bigfile
    40000000 bigfile

    [11:29:51.26] C:\test>perl -nlE"}{say $." bigfile
    40000000

    [11:30:02.41] C:\test>

    So, the cause of your slow processing is not Perl, nor the size of those files, but whatever you are doing in your script. Throwing parallel processes or threads at it should be your last resort, not your first.

    Your first resort should be to correct whatever is wrong with your code that is causing it to be so slow. The quickest way to do that would be to post the code and let us help you with it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fast Processing
by moritz (Cardinal) on Nov 23, 2010 at 10:46 UTC

    Please read Markup in the Monastery and update the formatting of your node. Right now it's not readable.

    As for parallelization, if the processing of files is independent of each other, just start a new process for each file. See for example Parallel::ForkManager.
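    A minimal sketch of that approach, assuming comma-separated input with a header row in each file (the file pattern and worker count are illustrative). Each child sums one file and passes its partial results back to the parent via finish(), which requires a reasonably recent Parallel::ForkManager:

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my @files = glob '*.csv';
    my $pm    = Parallel::ForkManager->new(8);    # at most 8 workers

    my %merged;
    # Merge each child's partial sums as it exits; the module passes
    # the data structure back to the parent via a temporary file.
    $pm->run_on_finish(sub {
        my ($pid, $exit, $ident, $signal, $core, $partial) = @_;
        return unless $partial;
        while (my ($key, $vals) = each %$partial) {
            my $totals = $merged{$key} //= [ (0) x @$vals ];
            $totals->[$_] += $vals->[$_] for 0 .. $#$vals;
        }
    });

    for my $file (@files) {
        $pm->start and next;    # parent keeps looping; child continues
        my %sum;
        open my $fh, '<', $file or die "Cannot open $file: $!";
        <$fh>;                  # skip the header row
        while (my $line = <$fh>) {
            chomp $line;
            my ($name, $client, @vals) = split /,/, $line;
            my $totals = $sum{"$name,$client"} //= [ (0) x @vals ];
            $totals->[$_] += $vals[$_] for 0 .. $#vals;
        }
        $pm->finish(0, \%sum);  # ship this file's sums to the parent
    }
    $pm->wait_all_children;

    Whether this actually wins anything depends on where the time goes; at 315k lines total, a correct serial version should already finish in seconds.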

    Or you can use a completely external solution. For example on linux, xargs can parallelize for you:

    ls *.csv | xargs -n 1 -P 8 perl yourscript.pl

    I'd also recommend profiling your code to identify the bottlenecks.
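    For example, with Devel::NYTProf from CPAN (the script name is a placeholder):

    perl -d:NYTProf yourscript.pl
    nytprofhtml     # turns the generated nytprof.out into an HTML report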

Re: Fast Processing
by marto (Cardinal) on Nov 23, 2010 at 10:43 UTC
Re: Fast Processing
by anonymized user 468275 (Curate) on Nov 23, 2010 at 15:01 UTC
    I also can't imagine what you are doing to make it run that slow. The term 'CSV file' covers a multitude of delimiters, but this should still be practically a one-line program that performs well; for the comma-delimited case, perhaps:
    cat *.csv | perl -lane '/^(\S+\,\s*\S+)/; $_{ $1 }++; END{ for $v ( sort keys %_ ) { print "$v $_{$v}" } };' > result_csv.txt
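    Note that as written this counts how often each NAME/CLIENT pair occurs rather than summing the value columns. A variant that does the summing — assuming whitespace-separated fields (swap in -F',' for genuinely comma-separated input) and a literal NAME header line — might be:

    cat *.csv | perl -lane 'next unless @F; next if $F[0] eq "NAME"; $k = "$F[0] $F[1]"; $s{$k}[$_ - 2] += $F[$_] for 2 .. 4; END { for $k (sort keys %s) { print join " ", $k, @{ $s{$k} } } }' > result_csv.txt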

    One world, one people

Re: Fast Processing
by fisher (Priest) on Nov 23, 2010 at 10:42 UTC