khaunda has asked for the wisdom of the Perl Monks concerning the following question:

I have around 30-35 CSV files. Some sample data:

    First file:
    NAME CLIENT VAL1 VAL2 VAL3
    aa bb 4 4 5
    aa cc 4 4 5

    Second file:
    aa bb 4 4 5.34
    aa dd 4 4 4.05

    Third file:
    aa bb 4 4 5.34
    aa dd 4.1 4.3 4.05
    aa ff 4.3 0 3.4
My task is to merge them into a single file in a format like:

    NAME CLIENT VAL1 VAL2 VAL3
    aa bb <sum from all files> <sum from all files> <sum from all files>
    aa cc <sum from all files> <sum from all files> <sum from all files>
    aa dd <sum from all files> <sum from all files> <sum from all files>
In practice each file is very big, containing around 8-9k lines. I have written some code which produces the correct result, but it takes around 25-30 minutes to run. I am looking for a solution that takes less time, perhaps by using parallel processing or some other technique.
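For reference, a minimal sketch of the kind of merge being described (not the poster's actual code, which wasn't shown): accumulate running sums in a hash keyed on NAME and CLIENT. It assumes comma-separated fields and a header row in each file; adjust the split pattern for the actual delimiter.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Accumulate running sums per (NAME, CLIENT) pair across all files.
    my %sum;    # key: "NAME,CLIENT", value: arrayref of column totals

    for my $file (glob '*.csv') {
        open my $fh, '<', $file or die "Cannot open $file: $!";
        <$fh>;    # skip the header row (assumed present in every file)
        while (my $line = <$fh>) {
            chomp $line;
            my ($name, $client, @vals) = split /,/, $line;
            my $totals = $sum{"$name,$client"} //= [ (0) x @vals ];
            $totals->[$_] += $vals[$_] for 0 .. $#vals;
        }
        close $fh;
    }

    # Print the merged result, sorted for stable output.
    print "NAME,CLIENT,VAL1,VAL2,VAL3\n";
    for my $key (sort keys %sum) {
        print join(',', $key, @{ $sum{$key} }), "\n";
    }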

Replies are listed 'Best First'.
Re: Fast Processing
by BrowserUk (Patriarch) on Nov 23, 2010 at 11:37 UTC
30-35 CSV files ... each file is very big, containing around 8-9k lines ... takes around 25-30 minutes

    35 * 9000 = 315000 lines. That is not big. This shows perl processing 40 million lines in just over 10 seconds:

    C:\test>wc -l bigfile
    40000000 bigfile

    [11:29:51.26] C:\test>perl -nlE"}{say $." bigfile
    40000000

    [11:30:02.41] C:\test>

    So, the cause of your slow processing is not Perl, nor the size of those files, but whatever you are doing in your script. Throwing parallel processes or threads at it should be your last resort, not your first.

    Your first resort should be to correct whatever is wrong with your code that is causing it to be so slow. The quickest way to do that would be to post the code and let us help you with it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fast Processing
by moritz (Cardinal) on Nov 23, 2010 at 10:46 UTC

    Please read Markup in the Monastery and update the formatting of your node. Right now it's not readable.

    As for parallelization, if the processing of files is independent of each other, just start a new process for each file. See for example Parallel::ForkManager.
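    A minimal sketch of that approach, assuming comma-separated input with a header row in each file (the file pattern and worker count are illustrative). Each child sums one file and passes its partial results back to the parent via finish(), which requires a reasonably recent Parallel::ForkManager:

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my @files = glob '*.csv';
    my $pm    = Parallel::ForkManager->new(8);    # at most 8 workers

    my %merged;
    # Merge each child's partial sums as it exits; the module passes
    # the data structure back to the parent via a temporary file.
    $pm->run_on_finish(sub {
        my ($pid, $exit, $ident, $signal, $core, $partial) = @_;
        return unless $partial;
        while (my ($key, $vals) = each %$partial) {
            my $totals = $merged{$key} //= [ (0) x @$vals ];
            $totals->[$_] += $vals->[$_] for 0 .. $#$vals;
        }
    });

    for my $file (@files) {
        $pm->start and next;    # parent keeps looping; child continues
        my %sum;
        open my $fh, '<', $file or die "Cannot open $file: $!";
        <$fh>;                  # skip the header row
        while (my $line = <$fh>) {
            chomp $line;
            my ($name, $client, @vals) = split /,/, $line;
            my $totals = $sum{"$name,$client"} //= [ (0) x @vals ];
            $totals->[$_] += $vals[$_] for 0 .. $#vals;
        }
        $pm->finish(0, \%sum);  # ship this file's sums to the parent
    }
    $pm->wait_all_children;

    Whether this actually wins anything depends on where the time goes; at 315k lines total, a correct serial version should already finish in seconds.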

    Or you can use a completely external solution. For example on linux, xargs can parallelize for you:

    ls *.csv | xargs -n 1 -P 8 perl yourscript.pl

    I'd also recommend profiling your code to identify the bottlenecks.
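    For example, with Devel::NYTProf from CPAN (the script name is a placeholder):

    perl -d:NYTProf yourscript.pl
    nytprofhtml     # turns the generated nytprof.out into an HTML report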

Re: Fast Processing
by marto (Cardinal) on Nov 23, 2010 at 10:43 UTC
Re: Fast Processing
by anonymized user 468275 (Curate) on Nov 23, 2010 at 15:01 UTC
    I also can't imagine what you are doing to make it run that slow. The term 'CSV file' covers a multitude of delimiters, but this should still be practically a one-line program that performs well; for the comma-delimited case, perhaps:
    cat *.csv | perl -lane '/^(\S+\,\s*\S+)/; $_{ $1 }++; END{ for $v ( sort keys %_ ) { print "$v $_{$v}" } };' > result_csv.txt
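    Note that as written this counts how often each NAME/CLIENT pair occurs rather than summing the value columns. A variant that does the summing — assuming whitespace-separated fields (swap in -F',' for genuinely comma-separated input) and a literal NAME header line — might be:

    cat *.csv | perl -lane 'next unless @F; next if $F[0] eq "NAME"; $k = "$F[0] $F[1]"; $s{$k}[$_ - 2] += $F[$_] for 2 .. 4; END { for $k (sort keys %s) { print join " ", $k, @{ $s{$k} } } }' > result_csv.txt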

    One world, one people

Re: Fast Processing
by fisher (Priest) on Nov 23, 2010 at 10:42 UTC