Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, please help me with the code below. It deals with files having more than 1 million records; the first 6 lines and the last 2 lines are of no use to me for processing. The major issue I am facing while writing files is data loss. I am working on multiple files. Code:
use POSIX;

$num_args = $#ARGV + 1;
if ($num_args != 1) {
    print "\nUsage: Spool.pl Require two argumrnts \n";
    exit;
}

@SMSFileList = `ls | grep CSV`;
chop(@SMSFileList);

my $FileCount = 0;
my %myhash;
my %myhash1;
my @lines;
my ($lines,$CDR,$uniquekey,$uniquekey1,$file_name,$file_name1,$CircleGroupHandle,$CircleGroupHandle1,$uniq1,$uniq2);
my $targetdir=$ARGV[0];

foreach my $SMSFileName (@SMSFileList) {
    $FileCount++;
    print "Reading File ($FileCount)\n";
    print "$SMSFileName\n";
    sysopen (SOURCE_SMS_FILE,"$SMSFileName",O_RDONLY) or die "Error opening $SMSFileName";
    while(<SOURCE_SMS_FILE>) {
        if ($_ =~ m/^,/) {
            @lines = split(",",$_);
            chop($_);
            if ($lines[14] =~ /^I$/ ) {
                $uniquekey = $lines[17].$lines[14].$lines[21].$lines[44].substr($lines[27],0,8);
                $CDR = substr($lines[27],0,8).','.$lines[28].','.$lines[36].','.$lines[23].','.$lines[24].','.$lines[91].','.$lines[92].','.$lines[101].','.$lines[15].','.$lines[18].','.$lines[75].','.$lines[21].','.$lines[44].','.$lines[14].','.substr($lines[69],0,8).','.$lines[1].','.$lines[13].','.$lines[50];
                $file_name = $lines[17].'_'.$lines[14].'_'.$lines[21].'_'.$lines[44].'_'.substr($lines[27],0,8);
                $myHash{$uniquekey}++;
            }
            if ($lines[14] =~ /^O$/ ) {
                $uniquekey1 = $lines[17].$lines[14].$lines[22].$lines[45].substr($lines[27],0,8);
                $CDR1 = substr($lines[27],0,8).','.$lines[28].','.$lines[37].','.$lines[23].','.$lines[24].','.$lines[91].','.$lines[92].','.$lines[101].','.$lines[16].','.$lines[19].','.$lines[75].','.$lines[22].','.$lines[45].','.$lines[14].','.substr($lines[69],0,8).','.$lines[1].','.$lines[13].','.$lines[50];
                $file_name1 = $lines[17].'_'.$lines[14].'_'.$lines[22].'_'.$lines[45].'_'.substr($lines[27],0,8);
                $myHash1{$uniquekey1}++;
            }
        }
        foreach $uniquekey (keys %myHash) {
            $uniq1 = $file_name.'.csv';
            sysopen($CircleGroupHandle,"$ARGV[0]/$uniq1",O_WRONLY|O_APPEND|O_CREAT) or die "Error writing to $!";
            print $CircleGroupHandle "$CDR\n";
            # print "$uniquekey\n"
            # print "$key\n";
        }
        %myHash = ();
        %$uniquekey = ();
        $uniquekey = {};
        foreach $uniquekey1 (keys %myHash1) {
            $uniq2 = $file_name1.'.csv';
            sysopen($CircleGroupHandle1,"$ARGV[0]/$uniq2",O_WRONLY|O_APPEND|O_CREAT) or die "Error writing to $!";
            print $CircleGroupHandle1 "$CDR1\n";
            # print "$uniquekey1\n"
        }
        %myHash1 = ();
        %$uniquekey1 = ();
        $uniquekey1 = {};
    }
    close(SOURCE_SMS_FILE);
    print "$SMSFileName closed\n";
    close($CircleGroupHandle) or die "Error closing to $!";
    close($CircleGroupHandle1) or die "Error closing to $!";
    print " All file handles closed\n";
}

Re: <speed a major issue with below code also loosing data while writing in files>
by bluescreen (Friar) on Jul 25, 2011 at 15:03 UTC

    You seem to be opening the output files multiple times in the foreach loop at the end; that will not only affect the running time but can also be the source of your data loss issue. So open the file before looping:

    open my $fh, '>>', $filename;
    foreach my $uniquekey ( keys %myHash ) {
        ...
        print $fh ...;
    }
    close($fh);

    If this is a one-time script then don't worry *much* about the following, but if you have to maintain it over time, consider:

    • use strict and use warnings
    • Naming conventions: variables like myHash, uniquekey, ... complicate understanding the script; instead, use names that reflect their contents
    • Use proper scope and define variables only when you need them
    • Instead of `ls | grep CSV`, use the opendir and readdir functions
    • Use functions to divide the code into meaningful pieces
    • Use autodie
    • Use three-arg open calls instead of sysopen (a minimal sketch combining several of these points follows this list)
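
    For illustration only, here is a minimal sketch of how the directory scan and output writing might look with those points applied. The field indices, the CSV filter, and the one-handle-per-output-file cache are assumptions based on the original script, not a drop-in replacement:

    use strict;
    use warnings;
    use autodie;                       # open/close/opendir now die on failure

    my $targetdir = shift @ARGV
        or die "Usage: Spool.pl <target_dir>\n";

    # Replace `ls | grep CSV` with opendir/readdir
    opendir my $dh, '.';
    my @sms_files = grep { /CSV/ } readdir $dh;
    closedir $dh;

    my %out_handles;                   # one open handle per output file name

    for my $sms_file (@sms_files) {
        open my $in, '<', $sms_file;   # three-arg open instead of sysopen
        while ( my $line = <$in> ) {
            chomp $line;
            next unless $line =~ /^,/;
            my @fields = split /,/, $line;

            # Assumption: the output name is built from the same key fields
            # the original script uses (indices 17, 14, 21, 44).
            my $file_name = join '_', @fields[ 17, 14, 21, 44 ];

            # Open each output file only once and reuse the handle afterwards.
            my $out = $out_handles{$file_name} //= do {
                open my $fh, '>>', "$targetdir/$file_name.csv";
                $fh;
            };
            print {$out} "$line\n";    # stand-in for the real $CDR record
        }
        close $in;
    }

    close $_ for values %out_handles;

    Caching the handles in %out_handles keeps each output file opened exactly once per run, which addresses the repeated-open problem described above.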

      Dear monk, thanks for your reply, but my major concern is that I am using a file name taken from the file data itself, which is a unique key for my lookup; the CDR is the content of that file and also contains the file name hidden in it. Help will be highly appreciated, as I am dealing with a source file having 1+ million records.

        This code:

        foreach $uniquekey1 (keys %myHash1) {
            $uniq2 = $file_name1.'.csv';
            sysopen($CircleGroupHandle1,"$ARGV[0]/$uniq2",O_WRONLY|O_APPEND|O_CREAT) or die "Error writing to $!";

        If you look closely, the file name is composed as "$ARGV[0]/$file_name1.csv". All of the variables involved are defined outside the foreach loop, and if they don't change, that means that within the loop you are opening the same file over and over.

        The first optimization would be to extract the constants from the loop. That means that $file_name and the file handle can live outside the foreach loop, which will prevent your program from opening the same file multiple times (once per key in %myHash). Let's say %myHash has 10k keys; that means you are opening (without closing) the same file 10k times.
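
        As a rough sketch of that change (the surrounding variables $file_name1, %myHash1 and $CDR1 come from the original script, the O_* constants from its existing `use POSIX;`, and the per-key print simply mirrors the existing logic):

        # Build the path and open the handle once, before the loop.
        my $uniq2 = $file_name1 . '.csv';
        sysopen( my $CircleGroupHandle1, "$ARGV[0]/$uniq2",
                 O_WRONLY | O_APPEND | O_CREAT )
            or die "Error opening $ARGV[0]/$uniq2: $!";

        foreach my $uniquekey1 ( keys %myHash1 ) {
            print $CircleGroupHandle1 "$CDR1\n";   # same record as in the original loop
        }

        close($CircleGroupHandle1) or die "Error closing $uniq2: $!";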