pr19939 has asked for the wisdom of the Perl Monks concerning the following question:

Hi, can you please help me optimize the code given below? The code iterates through a single 3 GB log file whose lines each contain one of the businesses specified in the array. For each business in turn, it scans the log file, picks up the lines containing that particular business, and writes them into a new file named after the business. I will then use the consolidated business files to calculate the number of hits on each site.

Code :
#!/usr/bin/perl

@businesses = (
    ["\"corp.home.ge.com\"",         "new_corp_home_ge_com.log"],
    ["\"scotland.gcf.home.ge.com\"", "new_scotland_gcf_home_ge_com.log"],
    ["\"marketing.ge.com\"",         "new_marketing_ge_com.log"]
);
$rows = scalar(@businesses);

#---------------Code to get todays date ---------------------------------------
$today = time();
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime($today);
$year += 1900;
$mon++;
if ($mday < 10) { $mday = "0$mday"; }
if ($mon  < 10) { $mon  = "0$mon";  }
$today = $year . $mon . $mday;

#---------------Code to get yesterdays date -----------------------------------
$yesterday = time() - ( 24 * 60 * 60 );
($sec1,$min1,$hour1,$mday1,$mon1,$year1,$wday1,$yday1,$isdst1) = localtime($yesterday);
$year1 += 1900;
$mon1++;
if ($mday1 < 10) { $mday1 = "0$mday1"; }
if ($mon1  < 10) { $mon1  = "0$mon1";  }
$yesterday = $year1 . $mon1 . $mday1;
#-------------------------------------------------------------------------------

$outfile = "consolidatedlog.txt";
open (OUT, ">$outfile") or die ("Cud not open the $outfile");
opendir(DIR, "/inside29/urchin/test/logfiles") or die ("couldn't open igelogs");
while ( defined ($filename = readdir(DIR)) ) {
    $index = index($filename, $yesterday);
    if ($index > -1) {
        $date1 = localtime();
        print "The log $filename started at $date1.\n";
        open(OLD, "/inside29/urchin/test/logfiles/$filename") || die ("Cud not open the $filename");
        while (<OLD>) {
            print OUT $_;
        }
        close OLD;
        $date = localtime();
        print "The log $filename ended at $date.\n";
    }
}
closedir (DIR);
close OUT;

#-------------------------------------------------------------
$ct = 0;
while ($ct < $rows) {
    $outfile1 = "/inside29/urchin/test/newfeed/$today-$businesses[$ct][1]";
    $newigebusiness = "$businesses[$ct][0]";
    $date2 = localtime();
    print "$ct log started for $newigebusiness at $date2\n";
    open(OUT, ">>$outfile1") || die("Could not open out file!$outfile1");
    open(OLD, "consolidatedlog.txt") || die ("Could not open the consolidated file");
    while (<OLD>) {
        if ((index($_, $newigebusiness)) > -1) {
            print OUT $_;
        }
    }
    close OLD;
    close OUT;
    $date3 = localtime();
    print "$ct log created for $newigebusiness at $date3\n";
    $ct += 1;
}

End of Code.
Sample Log file :

Line 1:
16/Jan/2005:00:00:40 -0500 "GET /ge/ige/1/1/4/common/cms_portletview2.html HTTP/1.1" 200 1702 0 "http://erc.home.ge.com/portal/site/insurance/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" "-" "erc.home.ge.com"

Line 2:
16/Jan/2005:00:00:40 -0500 "GET /portal/site/transportation/menuitem.8c65c5d7b286411eb198ed10b424aa30/ HTTP/1.1" 200 7596 0 "http://geae.home.ge.com/portal/site/transportation/" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" "-" "geae.home.ge.com"

Line 3:
16/Jan/2005:00:00:41 -0500 "GET /ge/ige/26/83/409/common/cms_portletview.html HTTP/1.1" 200 7240 0 "http://erc.home.ge.com/portal/site/insurance/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" "-" "erc.home.ge.com"

There will be around a million lines in the log file.

Please help.

Thanks

Replies are listed 'Best First'.
Re: Reduce the time taken for Huge Log files
by BrowserUk (Patriarch) on Mar 18, 2005 at 10:05 UTC
    Can you please help me in optimizing the code given below.

    Sure, but clean up your code first then re-post it. As is, it is almost unreadable.

    • Add use strict; and eliminate these warnings:

      That will save everybody here from having to track down that the reason for this:

      Global symbol "$sourcefile" requires explicit package name at - line 158.

      Is because of lines like this:

      open(OLD,"/inside29/urchin/test/logfiles/$filename") || die ("Cud not +open the $sourcefile");

      and this

      open(OLD,"consolidatedlog.txt") || die ("C not open the $sourcefile");

      It is quite handy if your error messages refer to the same files as those that cause the errors!
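      For example, a die that names the file it actually tried to open, and includes $! so you can see why the open failed (just the standard idiom, not from your post):

      my $path = "/inside29/urchin/test/logfiles/$filename";
      open(OLD, "<", $path) or die "Could not open $path: $!";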

      And sorry for mentioning this, but what is all this?

      my $yesterday = time() - ( 24 * 60 * 60 ); ## Why use 'my' here #################
      ## And not here -v-v-v-v-v-v-v-v- #####################
      ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime($yesterday);
      $year += 1900;
      $mon++;
      $d = "$mday";   ## <<< Why copy $mday to $d? And why quote "$mday"? #########################
      if ($d < 10) {
          $d = "0$d"; ## <<< What does this do? #################
      }
      $m = "$mon";
      if ($m < 10) {
          $m = "0$m"; ## <<< And this? ####################
      }
      $yesterday = "$year" . "$m" . "$d"; ## Why all those "s? #######

      Please don't take this to heart, but asking us to optimise your code, when it is so ... um ... messy, is rather pushing your luck a little.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco.
      Rule 1 has a caveat! -- Who broke the cabal?
Re: Reduce the time taken for Huge Log files
by Random_Walk (Prior) on Mar 18, 2005 at 10:52 UTC

    As mentioned, your code is hard to read: it is long, it does not use strict and warnings, and it does not adhere to the smallest-possible-test-case ideal. That said, here are a couple of hints to tidy things up.

    It does look like you are taking each business and comparing all the file entries against it before reading the next business off the array. This means you are reading the input file $number_of_business times. Reading the file is slow; iterating through an array held in memory is fast.

    • For getting a formatted date/time, consider use POSIX; and the strftime function it provides.
    perl -MPOSIX -le 'print strftime "%X %x", localtime(time)'
    12:28:59 18/03/05
    • If you need to match each line against all those businesses, consider building an array of pre-compiled regexes using the qr// operator, then read each line in turn and run it against each compiled regex until you get a match (assuming only one business can match each line).
    If I missed the boat completely sorry.

    Hopefully this code can give you a couple of ideas ...

    #!/usr/bin/perl
    use warnings;
    use strict;

    my @businesses = (
        ['foo',      'bar',    'baz'],
        ['een',      'twee',   'drie'],
        ['ichi',     'ni',     'san'],
        ['hydrogen', 'helium', 'lithium'],
    );

    my %regexen;
    foreach my $group (@businesses) {
        print "making regex from group @{$group} ... ";
        my $regex = join "|", @$group;
        print "\\$regex\\\n";
        my $compiled_re = qr/$regex/;
        $regexen{$regex} = $compiled_re;
    }

    while (my $line = <DATA>) {
        for my $group (keys %regexen) {
            next unless $line =~ /$regexen{$group}/;
            print "The line $line matched the business group $group\n";
        }
    }

    __DATA__
    nosuch
    foo
    this
    that
    helium ballon
    ichi
    foot
    een

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
Re: Reduce the time taken for Huge Log files
by holli (Abbot) on Mar 18, 2005 at 11:24 UTC
    Your script is a good candidate for dominus' red flag articles :-)
    I guess you're better off writing that from scratch. This might give you an idea of an efficient way.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use FileHandle;

    # create a hash of businesses and their target files
    my %businesses = (
        "corp.home.ge.com"         => "new_corp_home_ge_com.log",
        "scotland.gcf.home.ge.com" => "new_scotland_gcf_home_ge_com.log",
        "marketing.ge.com"         => "new_marketing_ge_com.log",
    );

    # loop over the data
    while (<DATA>) {
        # check if we have a line that contains a http:// address
        # and capture the host part in $1
        # \" is just to prevent the editor from screwing up the syntax highlighting
        if ( m-http://([^/\"]+)- ) {
            # do we have an entry in the hash for that business?
            if ( $businesses{$1} ) {
                # yes, so create a filehandle in the hash for writing to it
                unless ( ref $businesses{$1} ) {
                    my $fh = new FileHandle;
                    $fh->open(">$businesses{$1}");
                    $businesses{$1} = $fh;
                }
                # print to the business' filehandle
                $businesses{$1}->print($_);
            }
            else {
                # no, so emit a warning
                warn "unknown business in logfile";
            }
        }
    }

    __DATA__
    16/Jan/2005:00:00:40 -0500 "GET /ge/ige/1/1/4/common/cms_portletview2.html HTTP/1.1" 200 1702 0 "http://marketing.ge.com/portal/site/insurance/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" "-" "marketing.ge.com"
    16/Jan/2005:00:00:40 -0500 "GET /portal/site/transportation/menuitem.8c65c5d7b286411eb198ed10b424aa30/ HTTP/1.1" 200 7596 0 "http://geae.home.ge.com/portal/site/transportation/" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" "-" "geae.home.ge.com"
    16/Jan/2005:00:00:41 -0500 "GET /ge/ige/26/83/409/common/cms_portletview.html HTTP/1.1" 200 7240 0 "http://marketing.ge.com/portal/site/insurance/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" "-" "marketing.ge.com"


    Question: Why can't I write
    print $businesses{$1} $_;

    # yields:
    # Scalar found where operator expected at C:\t.pl line 36, near "} $_"
    # (Missing operator before $_?)
    ??

    That's odd.
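    (A side note: print only accepts a bareword or a plain scalar variable in its filehandle slot; a more complex expression such as a hash element has to be wrapped in a block so the parser can tell the handle apart from the list to print. Something like this parses:)

    print { $businesses{$1} } $_;   # braces around the handle expression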


    Update:
    Changed the print to the print-method of FileHandle as per castaway's suggestion.


    holli, /regexed monk/
      Hi Holli,
      I tried your code with a little modification and it was really helpful. Thanks a lot. But I am stuck:
      I have some lines containing https:// and some lines with no http:// at all.
      I tried out some combinations of regexes, but no luck.
      Please advise.
      Thanks
        I just noticed the address is also at the very end of every string, so:
        m-"([^\"]+)"$-
        will do.


        holli, /regexed monk/

        If you do not need to know whether it was https, you can reduce them all to http:// as you read each line in, with something like s!https://!http://!; just after the while(<DATA>) {. Note that using ! as the regex delimiter stops you having a quoting nightmare with the //.

        Cheers,
        R.

        Pereant, qui ante nos nostra dixerunt!
      Just an aside for the OP. If you are just starting out with Perl, don't get into the habit of saying "use FileHandle". The FileHandle module was replaced by IO::File some considerable time ago.
        NOTE: This class is now a front-end to the IO::* classes.
        (from the POD of FileHandle.pm)
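        A minimal sketch of the same open-and-print done with IO::File instead of FileHandle (the filename is just the one from the example above):

        use IO::File;

        my $fh = IO::File->new(">new_marketing_ge_com.log")
            or die "could not open log: $!";
        $fh->print($_);
        $fh->close;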


        holli, /regexed monk/
Re: Reduce the time taken for Huge Log files
by deibyz (Hermit) on Mar 18, 2005 at 09:19 UTC
    It's too early here to read all your code ( :P ), but I see you're using an AoA. If you have to perform any kind of match against it, you can gain a lot of performance by changing it to a HoA (hash of arrays). That gives you "direct" access by key, so you don't have to search the entire array. See perldoc perldsc and perldoc perldata for more info. Also, you should loop over the log only once and write each line to the appropriate file, as in the sketch below.
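    A minimal sketch of that single-pass, hash-lookup idea (using plain hashes rather than an HoA; the output file names and the host-extraction regex are made up for illustration and assume the quoted host at the end of each log line):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # hypothetical business => output file map, keyed for direct lookup
    my %outfile_for = (
        'corp.home.ge.com'         => 'new_corp_home_ge_com.log',
        'scotland.gcf.home.ge.com' => 'new_scotland_gcf_home_ge_com.log',
        'marketing.ge.com'         => 'new_marketing_ge_com.log',
    );

    # open every output file once, up front
    my %fh_for;
    for my $host (keys %outfile_for) {
        open $fh_for{$host}, '>>', $outfile_for{$host}
            or die "cannot open $outfile_for{$host}: $!";
    }

    # single pass over the log (filenames given on the command line);
    # the last quoted field of each line is the host, and therefore the hash key
    while (my $line = <>) {
        next unless $line =~ /"([^"]+)"\s*$/;
        my $host = $1;
        print { $fh_for{$host} } $line if $fh_for{$host};
    }

    close $_ for values %fh_for;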
Re: Reduce the time taken for Huge Log files
by Anonymous Monk on Mar 18, 2005 at 23:12 UTC
    Unless most of the lines match what you're looking for, you'll also save a lot of time by first filtering out the lines that don't. For example, if you're using Unix:
    my $log = '/var/log/httpd-access.log';
    open(my $fh, "/usr/bin/egrep 'corp.home.ge.com|scotland.gcf.home.ge.com' $log |")
        or die "egrep failed on $log: $!";
    while ( <$fh> ) {
        ...
    }
    close($fh);
Re: Reduce the time taken for Huge Log files
by TedPride (Priest) on Mar 18, 2005 at 18:59 UTC
    I'm not going to try modifying your code, but here's an example of what you could do with the URL portion of each line from the log file. This allows for both http and https, non-subdomain URLs which might or might not contain www, capitalization, etc. I've compiled the counts here rather than writing the matched lines to a secondary log file, but it would be easy to modify the code to do so.
    use strict;
    use warnings;

    my @b = (
        "corp.home.ge.com",
        "scotland.gcf.home.ge.com",
        "marketing.ge.com",
        "home-school.com"
    );
    my %b;
    $b{$_} = 0 for @b;

    for (<DATA>) {
        $_ = lc $_;
        m/^https?:\/\/(?:www\.)?(.*?)[\/\n]/;
        $b{$1}++ if exists $b{$1};
    }
    print "$_ => $b{$_}\n" for @b;

    __DATA__
    http://corp.home.ge.com/page/whatever.php3
    https://scotland.gcf.home.ge.com
    http://sub.marketing.ge.com/
    HTTP://marketing.ge.com/
    http://www.home-school.com/mypage.html
    http://home-school.com/mypage.html
    https://MARKETING.ge.com/testpage.html
Re: Reduce the time taken for Huge Log files
by Anonymous Monk on Mar 21, 2005 at 01:09 UTC
    I think this consolidate-and-split approach is very inefficient, but I can at least help by cleaning up the code to some degree...
    #!/usr/bin/perl
    use strict;
    use warnings;

    #-------------------------------------------------------
    sub generate_date_str ($) {
        my ($time) = @_;
        my ($mday, $mon, $year) = (localtime($time))[3,4,5];
        $year += 1900;
        $mon++;
        return sprintf("%04d%02d%02d", $year, $mon, $mday);
    }

    #-------------------------------------------------------
    sub get_matching_filenames ($$) {
        my ($dir, $match_str) = @_;
        opendir(DIR, $dir) or die "couldn't open directory \"$dir\"";
        my @names = grep {/$match_str/} readdir(DIR);
        closedir(DIR);
        return @names;
    }

    #-------------------------------------------------------
    sub consolidate_logs ($$$) {
        my ($destination_file, $dir, $filename_str) = @_;
        my @files = get_matching_filenames($dir, $filename_str);
        open(OUT, "> $destination_file")
            or die "Could not open file \"$destination_file\" for writing";
        foreach my $source_file (@files) {
            print "Processing of log \"$source_file\" started at " . localtime() . "\n";
            open(OLD, "< $dir/$source_file")
                or die("Could not open file \"$dir/$source_file\" for reading");
            while (<OLD>) {
                print OUT $_;
            }
            close(OLD);
            print "Processing of log \"$source_file\" ended at " . localtime() . ".\n";
        }
        close(OUT);
    }

    #-------------------------------------------------------
    sub split_logs ($$$) {
        my ($source_file, $business_list, $filename_prefix) = @_;
        foreach my $business (@$business_list) {
            my ($domain, $file) = @$business;
            my $outfile = "/inside29/urchin/test/newfeed/$filename_prefix-$file";
            my $newigebusiness = $domain;
            print "Creating of log for $newigebusiness started at " . localtime() . "\n";
            open(OUT, ">> $outfile")
                || die("Could not open out file \"$outfile\" for appending");
            open(OLD, "< $source_file")
                || die("Could not open the consolidated file \"$source_file\" for reading");
            while (<OLD>) {
                if ((index($_, $newigebusiness)) > -1) {
                    print OUT $_;
                }
            }
            close(OLD);
            close(OUT);
            print "Log for $newigebusiness created at " . localtime() . "\n";
        }
    }

    #-------------------------------------------------------
    my @businesses = (
        [ "\"corp.home.ge.com\"",         "new_corp_home_ge_com.log" ],
        [ "\"scotland.gcf.home.ge.com\"", "new_scotland_gcf_home_ge_com.log" ],
        [ "\"marketing.ge.com\"",         "new_marketing_ge_com.log" ]
    );

    my $consolidated_log = "consolidatedlog.txt";
    my $logfiles_dir     = '/inside29/urchin/test/logfiles';

    my $today     = generate_date_str( time() );
    my $yesterday = generate_date_str( time() - (24 * 60 * 60) );

    consolidate_logs($consolidated_log, $logfiles_dir, $yesterday);
    split_logs($consolidated_log, \@businesses, $today);
    hope that helps.
    bartek
      The consolidate_logs function could then be optimized to this:
      sub consolidate_logs ($$$) {
          my ($destination_file, $dir, $filename_str) = @_;
          my @files = get_matching_filenames($dir, $filename_str);
          open(OUT, "> $destination_file")
              or die "Could not open file \"$destination_file\" for writing";
          foreach my $source_file (@files) {
              print "Processing of log \"$source_file\" started at " . localtime() . "\n";
              system("cat $dir/$source_file >> $destination_file");
              print "Processing of log \"$source_file\" ended at " . localtime() . ".\n";
          }
          close(OUT);
      }
      (using "cat" program instead of perl code for simply transferring large quantities of data)
      or even to this
      sub consolidate_logs ($$$) {
          my ($destination_file, $dir, $filename_str) = @_;
          system("ls $dir | grep $filename_str | xargs -iX cat $dir/X >> $destination_file");
      }
      The split_logs function could be simplified to this:
      sub split_logs ($$$) {
          my ($source_file, $business_list, $filename_prefix) = @_;
          foreach my $business (@$business_list) {
              my ($name, $file) = @$business;
              my $outfile = "/inside29/urchin/test/newfeed/$filename_prefix-$file";
              print "Creating of log for $name started at " . localtime() . "\n";
              system("grep \"$name\" $source_file >> $outfile");
              print "Log for $name created at " . localtime() . "\n";
          }
      }
      Again, this uses an external program ("grep" this time) for simple string matching over large quantities of data.

      bartek