Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

#!/usr/bin/perl
use Date::Manip;

$date_2_days_ago = ParseDate("3 days ago");
$date_converted = UnixDate($date_2_days_ago, "%e/%h/%Y");

open DATA, ">$ARGV[1]";
open FH, "$ARGV[0]";
while (<FH>) {
    @tab_delimited_array = split(/\t/, $_);
    $tab_delimited_array[3] =~ s/^\[//;
    $tab_delimited_array[3] =~ s/^\-//;
    chomp($tab_delimited_array[3]);
    if (length($tab_delimited_array[3]) > 1) {
        $date_format = UnixDate($tab_delimited_array[3], "%Y%m%d%H:%M:%S");
        $converted_date = Date_ConvTZ("$date_format", 'GMT', 'PST');
        $pst_converted_date = UnixDate($converted_date, "%e/%h/%Y:%H:%M:%S");
        $pst_converted_date =~ s/^\s//g;
        $extracted_YMD = UnixDate($converted_date, "%e/%h/%Y");
        $_ =~ s/$arr[3]/$pst_converted_date/g;
        if ($extracted_YMD =~ m/$date_converted/) {
            print DATA $_;
        }
    }
}
close DATA;
close FH;

Please tell me how I can optimize this code. The file is very large, around 11989364 lines, so processing is very slow.

Replies are listed 'Best First'.
Re: optimize the code
by BrowserUk (Patriarch) on Jun 24, 2010 at 12:24 UTC

    Untested, but I think this should run substantially more quickly.

    The basic idea is instead of converting all 11 million GMT dates to match your PST target date, you convert the target date to GMT and use a simple regex to do the matching:

    my $target = UnixDate(
        Date_ConvTZ( ParseDate("3 days ago"), 'PST', 'GMT' ),
        "%e/%h/%Y"
    );
    open DATA, ">$ARGV[1]";
    open FH, "$ARGV[0]";
    m/\[$target:/ and print DATA $_ while <FH>;
    close DATA;
    close FH;

    If there might be other dates embedded in the log that would be matched by the regex [...:, then you might need to elaborate the regex to isolate the required date.

    Alternatively, if as your sample suggests the required date is at a set offset from the start of the line, you might use:

    substr( $_, 34, 11 ) eq $target and print DATA $_ while <FH>

    which as a straight string compare would be even quicker.
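
    A minimal sketch of the "build the target once, then do a plain string compare" idea, using only core Perl rather than Date::Manip (that swap is an assumption made so the example runs anywhere). The helper names `target_date` and `line_matches` are hypothetical, and the timestamp is located with `index` instead of a hard-coded offset, on the assumption that the first `[` on each line opens the date. It matches on the GMT date only, as in the midnight-to-midnight GMT case.

```perl
use strict;
use warnings;

my @MON = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

# Format an epoch time as dd/Mon/yyyy in GMT, zero-padded like the log dates.
sub target_date {
    my ($epoch) = @_;
    my ($day, $mon, $year) = (gmtime $epoch)[3, 4, 5];
    return sprintf '%02d/%s/%04d', $day, $MON[$mon], $year + 1900;
}

# Compare the text right after the first '[' against the target date.
sub line_matches {
    my ($line, $target) = @_;
    my $pos = index $line, '[';
    return 0 if $pos < 0;    # no timestamp on this line
    return substr($line, $pos + 1, length $target) eq $target ? 1 : 0;
}

# Hypothetical fixed "now" (24 Jun 2010 00:00:00 GMT) so the sample is
# repeatable; real code would use time() here.
my $now    = 1_277_337_600;
my $target = target_date($now - 3 * 86_400);    # computed once, outside the loop

my @sample = (
    "74.13.151.1 - - [22/Jun/2010:06:00:00 +0000] GET\n",
    "67.195.112.248 - - [21/Jun/2010:20:09:42 +0000] GET\n",
    "99.138.106.5 - - [21/Jun/2010:23:10:18 +0000] GET\n",
    "99.138.106.5 - - [21/Jun/2010:09:10:18 +0000] GET\n",
);

print grep { line_matches($_, $target) } @sample;
```

    For the real file, the `grep` over `@sample` would become a `while` loop over the input filehandle, so that only one line is held in memory at a time.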

    This assumes that your "3 days earlier" runs midnight to midnight GMT on that day. If you need to cater for the timezone shift of the start and end of day, then things get more complicated. But your code doesn't appear to be doing that.

    In that case I would probably calculate the unixtime (seconds since epoch) of the start and end times, convert the log date/times to the same and use a numeric compare:

    print if $logSecs > $startSecs && $logSecs < $endSecs;
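
    One way the numeric compare might look, sketched with the core Time::Local module. The helpers `log_secs` and `day_window` are hypothetical names, PST is treated as a fixed UTC-8 offset (an assumption that ignores daylight saving), and the window is half-open (`>=` start, `<` end) so that a line stamped exactly at midnight is counted once.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my %MON;
@MON{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 0 .. 11;

# Convert the first "[dd/Mon/yyyy:HH:MM:SS +ZZZZ]" on a line to epoch seconds.
# Returns undef if the line has no timestamp in that form.
sub log_secs {
    my ($line) = @_;
    my ($d, $mon, $y, $h, $mi, $s, $sign, $oh, $om) = $line =~
        m{\[(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-])(\d{2})(\d{2})\]}
        or return;
    return unless exists $MON{$mon};
    my $secs   = timegm($s, $mi, $h, $d, $MON{$mon}, $y);
    my $offset = ($oh * 60 + $om) * 60;
    # A "+" offset means the local clock is ahead of UTC, so subtract it.
    return $sign eq '+' ? $secs - $offset : $secs + $offset;
}

# Start (inclusive) and end (exclusive), in epoch seconds, of one calendar
# day in a zone with a fixed offset from UTC given in hours.
sub day_window {
    my ($year, $month, $day, $utc_offset_hours) = @_;
    my $start = timegm(0, 0, 0, $day, $month - 1, $year) - $utc_offset_hours * 3600;
    return ($start, $start + 86_400);
}

# Sample: keep lines that fall on 21 Jun 2010 in a UTC-8 zone.
my ($startSecs, $endSecs) = day_window(2010, 6, 21, -8);

my @sample = (
    "74.13.151.1 - - [22/Jun/2010:06:00:00 +0000] GET\n",
    "67.195.112.248 - - [21/Jun/2010:20:09:42 +0000] GET\n",
    "99.138.106.5 - - [21/Jun/2010:05:10:18 +0000] GET\n",
);

for my $line (@sample) {
    my $logSecs = log_secs($line);
    next unless defined $logSecs;
    print $line if $logSecs >= $startSecs && $logSecs < $endSecs;
}
```

    The window is computed once, so the per-line work is one regex match, one `timegm` call and two integer comparisons.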

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      This is the input file; it contains a mix of GMT dates.
      74.13.151.1 - - [22/Jun/2010:06:00:00 +0000] GET
      67.195.112.248 - - [21/Jun/2010:20:09:42 +0000] GET
      99.138.106.5 - - [21/Jun/2010:23:10:18 +0000] GET
      99.138.106.5 - - [21/Jun/2010:09:10:18 +0000] GET
      When I run the script using the code below.
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Date::Manip;
      my $date_converted = UnixDate(ParseDate("3 days ago"), "%e/%h/%Y");
      open DATA, ">$ARGV[1]";
      open FH, "$ARGV[0]";
      while (<FH>) {
          my @tab_delimited_array = split(/\t/, $_);
          $tab_delimited_array[3] =~ s/^\[//;
          $tab_delimited_array[3] =~ s/^\-//;
          chomp($tab_delimited_array[3]);
          if (length($tab_delimited_array[3]) > 1) {
              my $date_format = UnixDate($tab_delimited_array[3], "%Y%m%d%H:%M:%S");
              my $converted_date = Date_ConvTZ("$date_format", 'GMT', 'PST');
              my $pst_converted_date = UnixDate($converted_date, "%e/%h/%Y:%H:%M:%S");
              $pst_converted_date =~ s/^\s//g;
              my $extracted_YMD = UnixDate($converted_date, "%e/%h/%Y");
              $_ =~ s/$tab_delimited_array[3]/$pst_converted_date/g;
              if ($extracted_YMD =~ m/$date_converted/) {
                  print DATA $_;
              }
          }
      }
      close DATA;
      close FH;
      output is
      74.13.151.1 - - [21/Jun/2010:22:00:00 +0000] GET
      67.195.112.248 - - [21/Jun/2010:12:09:42 +0000] GET
      99.138.106.5 - - [21/Jun/2010:15:10:18 +0000] GET
      When I use the code below, it matches the input file only against the date 3 days ago.
      my $target = UnixDate(Date_ConvTZ( ParseDate("3 days ago"), 'GMT', 'PST' ), "%e/%h/%Y");
      print $target;
      open DATA, ">$ARGV[1]";
      open FH, "$ARGV[0]";
      m/\[$target:/ and print DATA $_ while <FH>;
      close DATA;
      close FH;
      Output is:
      67.195.112.248 - - [21/Jun/2010:20:09:42 +0000] GET
      99.138.106.5 - - [21/Jun/2010:23:10:18 +0000] GET
      Please tell me how to optimize the code that reads the date/time from the input file and converts it to PST.
      while (<FH>) {
          my @tab_delimited_array = split(/\t/, $_);
          $tab_delimited_array[3] =~ s/^\[//;
          $tab_delimited_array[3] =~ s/^\-//;
          chomp($tab_delimited_array[3]);
          if (length($tab_delimited_array[3]) > 1) {
              my $date_format = UnixDate($tab_delimited_array[3], "%Y%m%d%H:%M:%S");
              my $converted_date = Date_ConvTZ("$date_format", 'GMT', 'PST');
              my $pst_converted_date = UnixDate($converted_date, "%e/%h/%Y:%H:%M:%S");
              $pst_converted_date =~ s/^\s//g;
              my $extracted_YMD = UnixDate($converted_date, "%e/%h/%Y");
              $_ =~ s/$tab_delimited_array[3]/$pst_converted_date/g;
              if ($extracted_YMD =~ m/$date_converted/) {
                  print DATA $_;
              }
          }
      }
        Please help me with this. It takes a long time to read the input file, convert each date to PST and match it against the date 3 days ago.
Re: optimize the code
by almut (Canon) on Jun 24, 2010 at 11:41 UTC

    If performance is relevant, maybe you shouldn't be using Date::Manip.  In SHOULD I USE DATE::MANIP, it's explicitly stated that the module isn't one of the fastest.

    You could try Date::Calc(::XS), for example, which is more focused on speed  (just one suggestion, there are quite a number of other date modules on CPAN...)

Re: optimize the code
by BrowserUk (Patriarch) on Jun 24, 2010 at 10:58 UTC

    You'd make helping you a lot easier if you would also post

    1. half a dozen lines of the file;
    2. a brief description of the purpose of the program.

      66.249.65.27 - - [21/Jun/2010:08:00:00 +0000] GET - + - -
      66.250.65.27 - - [21/Jun/2010:08:00:00 +0000] GET - + - -
      66.244.65.27 - - [21/Jun/2010:08:00:00 +0000] GET - + - -
      In the input file, the 4th column is a GMT time. It is converted to PST, and a line is written to the output file only if the converted date in the 4th column matches the date 3 days ago.
Re: optimize the code
by Khen1950fx (Canon) on Jun 24, 2010 at 11:41 UTC
    First, always use strictures like this:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Date::Manip;
    Second, fix your lexical problem. For example,
    $date_two_days_ago = ParseDate('3 days ago');
    should be
    my $date_two_days_ago = ParseDate('3 days ago');
    Third, you used $arr[3] but forgot to give the array.
    Fourth, run perltidy to clean up. Here's what I got:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Date::Manip;

    my $date_2_days_ago = ParseDate('3 days ago');
    my $date_converted = UnixDate( $date_2_days_ago, '%e/%h/%Y' );

    open DATA, '>', $ARGV[1];
    open FH, $ARGV[0];
    while (<FH>) {
        my (@tab_delimited_array) = split( /\t/, $_ );
        $tab_delimited_array[3] =~ s/^\[//;
        $tab_delimited_array[3] =~ s/^\-//;
        chomp $tab_delimited_array[3];
        if ( length $tab_delimited_array[3] > 1 ) {
            my $date_format = UnixDate( $tab_delimited_array[3], '%Y%m%d%H:%M:%S' );
            my $converted_date = Date_ConvTZ( $date_format, 'GMT', 'PST' );
            my $pst_converted_date = UnixDate( $converted_date, '%e/%h/%Y:%H:%M:%S' );
            $pst_converted_date =~ s/^\s//g;
            my $extracted_YMD = UnixDate( $converted_date, '%e/%h/%Y' );
            my @arr;
            $_ =~ s/$arr[3]/$pst_converted_date/g;
            if ( $extracted_YMD =~ /$date_converted/ ) {
                print DATA $_;
            }
        }
    }
    close DATA;
    close FH;
    __DATA__
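
    On the third point: declaring an empty `@arr` satisfies `strict`, but `$arr[3]` is then undefined, so the substitution no longer targets the timestamp. A small sketch of what the replacement step could look like when the original timestamp text is used as the pattern. The helper name `replace_stamp` is hypothetical. `\Q...\E` is used on the assumption that the timestamp may contain regex metacharacters such as `+` or `]`, which should be matched literally.

```perl
use strict;
use warnings;

# Replace the first literal occurrence of $old in $line with $new.
# \Q...\E quotes regex metacharacters so "+0000" is matched as plain text.
sub replace_stamp {
    my ($line, $old, $new) = @_;
    $line =~ s/\Q$old\E/$new/;
    return $line;
}

print replace_stamp(
    "74.13.151.1 - - [22/Jun/2010:06:00:00 +0000] GET\n",
    '22/Jun/2010:06:00:00 +0000',
    '21/Jun/2010:22:00:00 -0800',
);
```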
Re: optimize the code
by BioLion (Curate) on Jun 24, 2010 at 12:02 UTC

    Code optimisation is always easier if you can properly compare the speed of different approaches; sometimes there are surprising results. See Benchmark and Devel::NYTProf. The second is a very good code profiler and might help, but in your case optimising the IO seems to be the important thing. Comparing the different approaches and checking with benchmarks that 'improvements' actually *improve the running speed* is the best approach. Hope this helps.
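
    As an illustration of that kind of comparison, here is a minimal Benchmark sketch that times a regex match against a `substr` compare on in-memory sample lines. The sample data, the target date and the sub names `count_regex` and `count_substr` are assumptions for the example. It measures only the matching step, not file IO, so the numbers say nothing about the whole script.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $target = '21/Jun/2010';

my @sample = (
    "74.13.151.1 - - [22/Jun/2010:06:00:00 +0000] GET\n",
    "67.195.112.248 - - [21/Jun/2010:20:09:42 +0000] GET\n",
    "99.138.106.5 - - [21/Jun/2010:23:10:18 +0000] GET\n",
    "99.138.106.5 - - [21/Jun/2010:09:10:18 +0000] GET\n",
);
my @lines = (@sample) x 5_000;    # 20,000 lines of test data

# Approach 1: regex anchored on the opening bracket.
sub count_regex {
    my $n = 0;
    for my $line (@lines) {
        $n++ if $line =~ m/\[\Q$target\E:/;
    }
    return $n;
}

# Approach 2: find the bracket with index, then a plain string compare.
sub count_substr {
    my $n   = 0;
    my $len = length $target;
    for my $line (@lines) {
        my $pos = index $line, '[';
        $n++ if $pos >= 0 && substr($line, $pos + 1, $len) eq $target;
    }
    return $n;
}

# Check that both approaches agree before trusting any timings.
die "the two approaches disagree\n" unless count_regex() == count_substr();

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, { regex => \&count_regex, substr => \&count_substr });
```

    Checking that the candidates give the same answer first matters as much as the timing itself: a faster version that selects different lines is not an improvement.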

    Just a something something...
Re: optimize the code
by wfsp (Abbot) on Jun 24, 2010 at 12:22 UTC
    Reducing the amount of work done inside the loop can help.

    Are all the timezones the same? If so you could calculate/build a string for three days ago outside the loop.

    If there are different timezones you could use a hash to cache each one so that you only carry out the costly calculation once for each timezone encountered.
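
    A sketch of the caching idea, in a variation keyed on the whole timestamp string rather than on the timezone alone (named plainly: that is a different key from the one suggested above). The assumption is that a busy log has many lines per second, so most timestamps repeat. The helper name `to_pst`, the fixed UTC-8 offset and the `-0800` suffix in the output are all choices made for this example.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

my @MON = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
my %MON_NUM;
@MON_NUM{@MON} = 0 .. 11;

my %cache;              # timestamp string => converted string (or undef)
my $conversions = 0;    # how many times the costly path actually ran

# Convert "dd/Mon/yyyy:HH:MM:SS +ZZZZ" to the same layout at a fixed UTC-8.
# Each distinct input string is converted once; repeats come from %cache.
sub to_pst {
    my ($stamp) = @_;
    return $cache{$stamp} if exists $cache{$stamp};
    $conversions++;

    my ($d, $mon, $y, $h, $mi, $s, $sign, $oh, $om) = $stamp =~
        m{^(\d{2})/(\w{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-])(\d{2})(\d{2})$}
        or return $cache{$stamp} = undef;
    return $cache{$stamp} = undef unless exists $MON_NUM{$mon};

    my $offset = ($oh * 60 + $om) * 60;
    my $utc    = timegm($s, $mi, $h, $d, $MON_NUM{$mon}, $y);
    $utc += $sign eq '+' ? -$offset : $offset;

    my @t = gmtime($utc - 8 * 3600);    # shift to UTC-8, then read the fields
    return $cache{$stamp} = sprintf '%02d/%s/%04d:%02d:%02d:%02d -0800',
        $t[3], $MON[ $t[4] ], $t[5] + 1900, @t[ 2, 1, 0 ];
}

my @stamps = (
    '21/Jun/2010:08:00:00 +0000',
    '21/Jun/2010:08:00:00 +0000',
    '21/Jun/2010:08:00:00 +0000',
    '21/Jun/2010:20:09:42 +0000',
);
print to_pst($_), "\n" for @stamps;
print "conversions performed: $conversions for ", scalar(@stamps), " lines\n";
```

    The cache holds one entry per distinct second seen, so its memory use grows with the time span of the log rather than with the number of lines.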

    Are they fixed width fields/records? If so you could use substr to just test that part of the record i.e. dd/mmm/yyyy and possibly +0000.