Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

67.162.10.216 - - [21/Aug/2010:00:00:00 +0000] GET /2010-08-18/news/ct-met-barrington-student-death-20100818_1_mental-illness-suicide-prevention-teen-suicides HTTP/1.1 200 6826
67.162.10.216 - - [21/Aug/2010:00:01:00 +0000] GET /2010-08-18/news/ct-met-barrington-student-death-20100818_1_mental-illness-suicide-prevention-teen-suicides HTTP/1.1 200 6826
67.162.10.216 - - [22/Aug/2010:01:00:00 +0000] GET /tracker.js.php?45aa01ed37b58d2a537b1ba12bb97fe2e5695a8c HTTP/1.1 200 2915
67.162.10.216 - - [22/Aug/2010:02:00:00 +0000] GET /tracker.js.php?45aa01ed37b58d2a537b1ba12bb97fe2e5695a8c HTTP/1.1 200 2882
66.249.71.98 - - [22/Aug/2010:03:04:00 +0000] GET /ad-openx.php?out=js&d=mod-top-hdr-defer&z-i=24809&z-n=top-leaderboard&i-w=728&i-h=90&i-e=pi%3D45%26amp%3Btv%3Dkp-CT1-G%26amp%3Bpm_mode%3Dp&i-s=pgtp%3Dkeyword%26pi%3D45%26pe_id%3Dcarrot-cake%26tn%3Dnone%26tv%3Dkp-CT1-G HTTP/1.1 200 1020
I have a file sorted by timestamp, i.e. column 4.
Please tell me how to extract the lines whose timestamp is
greater than or equal to 21/Aug/2010:06:00:00 and less than or equal to 22/Aug/2010:09:00:00.

Re: Extract the lines
by Ratazong (Monsignor) on Aug 26, 2010 at 10:16 UTC

    Where do you have problems? What did you try?

    The general approach is simple:

    • read the file line-by-line
      • for each line, extract the date (e.g. by a regex)
      • compare the date with your target-dates (Date::Calc has some handy functions for this, e.g. Delta_YMDHMS )
      • process the fitting lines
    Of course, since your lines are sorted, you can optimize, e.g. stop processing once you find a line with a date later than your second one.
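    The steps above could be sketched roughly as follows. This is only one possible reading: it swaps in the core module Time::Piece for Date::Calc, the helper names log_epoch and lines_in_range are made up for illustration, and it assumes every line carries the same UTC offset (+0000 in the sample), which is simply ignored.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: return the bracketed timestamp of a log line as epoch
# seconds, or nothing if the line has no timestamp. The UTC offset is ignored,
# assuming all lines share the same one.
sub log_epoch {
    my ($line) = @_;
    my ($stamp) = $line =~ m{\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})}
        or return;
    return Time::Piece->strptime($stamp, '%d/%b/%Y:%H:%M:%S')->epoch;
}

# Hypothetical helper: collect the lines of a time-sorted handle whose
# timestamp lies in [$from, $to], both given as epoch seconds.
sub lines_in_range {
    my ($fh, $from, $to) = @_;
    my @keep;
    while (my $line = <$fh>) {
        my $t = log_epoch($line);
        next unless defined $t;    # skip lines without a timestamp
        last if $t > $to;          # sorted input: nothing later can match
        push @keep, $line if $t >= $from;
    }
    return @keep;
}

# Three of the sample lines from the question, read via an in-memory handle.
my $sample = <<'LOG';
67.162.10.216 - - [21/Aug/2010:00:00:00 +0000] GET /2010-08-18/news/ct-met-barrington-student-death-20100818_1_mental-illness-suicide-prevention-teen-suicides HTTP/1.1 200 6826
67.162.10.216 - - [22/Aug/2010:01:00:00 +0000] GET /tracker.js.php?45aa01ed37b58d2a537b1ba12bb97fe2e5695a8c HTTP/1.1 200 2915
67.162.10.216 - - [22/Aug/2010:02:00:00 +0000] GET /tracker.js.php?45aa01ed37b58d2a537b1ba12bb97fe2e5695a8c HTTP/1.1 200 2882
LOG
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";

my @hits = lines_in_range(
    $fh,
    log_epoch('[21/Aug/2010:06:00:00'),
    log_epoch('[22/Aug/2010:09:00:00'),
);
print @hits;
```

    With these three lines, only the two tracker.js.php requests from 22/Aug fall inside the window. For a real file, open it by name instead of the in-memory string.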

    HTH, Rata
Re: Extract the lines
by Utilitarian (Vicar) on Aug 26, 2010 at 10:18 UTC
    • open the file
    • while the file hasn't matched your final time
      • if you have matched your start time
        • push the line onto an array of interesting records
    • Carry out any processing you need to do with the array of interesting records
    Code that up and come back with any issues you have with your code.

    print "Good ",qw(night morning afternoon evening)[(localtime)[2]/6]," fellow monks."
Re: Extract the lines
by jethro (Monsignor) on Aug 26, 2010 at 10:28 UTC

    In the documentation of Date::Calc, there is a recipe for just this question:

    How do I check whether a given date lies within a certain range of dates?

    use Date::Calc qw( Date_to_Days );

    $lower = Date_to_Days($year1,$month1,$day1);
    $upper = Date_to_Days($year2,$month2,$day2);
    $date  = Date_to_Days($year,$month,$day);

    if (($date >= $lower) && ($date <= $upper))
    {
        # ok
    }
    else
    {
        # not ok
    }

    There are other modules, such as Date::Manip or Class::Date, if you want alternatives, but it is sensible to use a module instead of "reinventing the wheel".

Re: Extract the lines
by roboticus (Chancellor) on Aug 26, 2010 at 10:18 UTC
Re: Extract the lines
by ww (Archbishop) on Aug 26, 2010 at 10:27 UTC
    1. Read the file, line by line (if it's large) or into an array, if small enough to handle in RAM.
    2. Test the data for your first condition.
    3. When found, test the same data for your second condition.
    4. If both are satisfied, write to new file, STDOUT or whatever
    5. Repeat until the date-time is greater than your second condition.

    This is not a code writing service.

    So, based on the sequence of hints above, read the docs, write some code, and come back with specific questions about specific functions or operations if you get stuck.

Re: Extract the lines
by RMGir (Prior) on Aug 26, 2010 at 11:46 UTC
    Unlikely in a homework exercise, but if your file were REALLY large, this would be a perfect place to use the brilliant File::SortedSeek, which would let you find your matching lines in O(log n) time.
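    To illustrate where the O(log n) comes from, here is a hand-rolled binary search over an in-memory list of sortable timestamp strings. It does not use or reproduce the File::SortedSeek interface (that module works on positions in the file itself); lower_bound is a hypothetical helper, and the keys are assumed to be in "YYYY-MM-DD hh:mm:ss" form and already sorted ascending.

```perl
use strict;
use warnings;

# Return the index of the first element of @$keys that is ge $target,
# or scalar(@$keys) if there is none. @$keys must be sorted ascending.
sub lower_bound {
    my ($keys, $target) = @_;
    my ($lo, $hi) = (0, scalar @$keys);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($keys->[$mid] lt $target) { $lo = $mid + 1 }   # answer is to the right
        else                          { $hi = $mid }       # $mid may be the answer
    }
    return $lo;
}

my @keys = (
    '2010-08-21 00:00:00',
    '2010-08-21 00:01:00',
    '2010-08-22 01:00:00',
    '2010-08-22 02:00:00',
    '2010-08-22 03:04:00',
);

my $first = lower_bound(\@keys, '2010-08-21 06:00:00');
print "first candidate: $keys[$first]\n";   # first candidate: 2010-08-22 01:00:00
```

    From that index you would read forward until a key exceeds the upper bound, so only the matching lines plus a logarithmic number of probes are touched.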

    Mike
Re: Extract the lines
by locked_user sundialsvc4 (Abbot) on Aug 26, 2010 at 13:22 UTC

    How about using something like Apache::ParseLog?

    I searched for “Apache log” on CPAN and found a nice number of hits to interesting-looking packages. I always strongly encourage folks to look first at CPAN before venturing any distance at all down any “primrose path.”

    “Two paths diverged in a wood, and I ...
    ... found that CPAN had already been hundreds of times before down both of them.”
Re: Extract the lines
by Marshall (Canon) on Aug 26, 2010 at 22:37 UTC
    I suspect that your sort of the "date/time" field didn't work as well as you think! That is because the format in the file won't sort in ascending date order when using a plain alpha-numeric sort! I mean "Aug" will sort less than "Jan" although we know that's not right!
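    A quick way to see the problem is to run a plain string sort over a few raw timestamps. The two August values come from the sample data; the September one is made up for the demonstration. Note that a log written by the web server is normally already in chronological order, so this only bites if the lines are re-sorted or compared as raw strings.

```perl
use strict;
use warnings;

# '01/Sep/2010:00:00:00' is a hypothetical later entry, not from the sample log.
my @stamps = (
    '22/Aug/2010:01:00:00',
    '01/Sep/2010:00:00:00',
    '21/Aug/2010:00:00:00',
);

# Default sort compares strings character by character, so the day-of-month
# decides first and the September entry lands in front of both August ones.
my @sorted = sort @stamps;
print "$_\n" for @sorted;
```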

    When comparing date/times you need to convert to something that can be compared. There are two basic options:
    1. convert to epoch time (an integer count of seconds since 1970-01-01 UTC) and use numeric compares on that number
    2. convert to a text representation that allows you to use string compares.

    Below I show method(2) because if you have any influence upon the format of this log file, this is a HUGE hint on what would be better!

    A string like "2010-08-22 01:00:00" can be compared with "2010-08-21 06:00:00" using the string operators le, gt and cmp, without calling any Perl module or function. It is also "human readable", as opposed to an integer epoch time. Please note that leading zeroes are important in this type of format!
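    As a small illustration of that point, the following sketch checks a couple of reformatted timestamps against the window from the question using only ge and le, and then shows what goes wrong when a leading zero is missing. The helper name in_window is hypothetical.

```perl
use strict;
use warnings;

my ($from, $to) = ('2010-08-21 06:00:00', '2010-08-22 09:00:00');

# True if the "YYYY-MM-DD hh:mm:ss" string lies inside [$from, $to].
sub in_window {
    my ($stamp) = @_;
    return ($stamp ge $from) && ($stamp le $to);
}

print in_window('2010-08-22 01:00:00') ? "in\n" : "out\n";   # in
print in_window('2010-08-21 00:01:00') ? "in\n" : "out\n";   # out

# Without the leading zero, the 9th compares as later than the 10th,
# because '9' is greater than '1' at the first differing character.
my $misordered = '2010-08-9 01:00:00' gt '2010-08-10 01:00:00';
print $misordered ? "misordered\n" : "ordered\n";            # misordered
```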

    Below I just showed one way to do a format conversion like this. I didn't spend a billion hours making this as efficient as possible. Just trying to demonstrate the idea.

    I think your "sort" was just wasted CPU MIP's. Write code that processes the file, use the reformat_date_time() subroutine to get the reformatted date/time for that line and look for lines that are gt or eq "2010-08-21 06:00:00" and lt or eq "2010-08-22 09:00:00" using string compare functions.

    If I were doing some huge sort, I would be tempted to, and probably would, convert times to epoch values to speed up the compares in the sort. But here you are going to "touch" each input line exactly once to convert the date/time info into a better string, and then either save that line or not.

    #!/usr/bin/perl -w
    use strict;

    my %month2numstring = (Jan => '01', Feb => '02', Mar => '03',
                           Apr => '04', May => '05', Jun => '06',
                           Jul => '07', Aug => '08', Sep => '09',
                           Oct => '10', Nov => '11', Dec => '12',
                          );

    while (<DATA>)
    {
       my $datefield = (split)[3];    # e.g. "[21/Aug/2010:00:00:00"
       my ($datestring) = $datefield =~ m|\[([\w/:]+)|;
       my $new_date_field = reformat_date_time($datestring);
       print "$new_date_field\n";
    }

    sub reformat_date_time
    {
       my $date_time = shift;
       # split the argument (not $_) into its date and time parts
       my ($date,$time) = $date_time =~ m|([\w/]+):([\d:]+)|;
       my ($day,$month_text,$year) = split(m|/|,$date);
       $day = "0$1" if $day =~ m|^(\d)$|;   #force leading zero
       my $month = $month2numstring{$month_text};
       return ($year.'-'.$month.'-'.$day." $time");
    }

    =prints
    2010-08-21 00:00:00
    2010-08-21 00:01:00
    2010-08-22 01:00:00
    2010-08-22 02:00:00
    2010-08-22 03:04:00
    =cut

    __DATA__
    67.162.10.216 - - [21/Aug/2010:00:00:00 +0000] GET /2010-08-18/news/ct-met-barrington-student-death-20100818_1_mental-illness-suicide-prevention-teen-suicides HTTP/1.1 200 6826
    67.162.10.216 - - [21/Aug/2010:00:01:00 +0000] GET /2010-08-18/news/ct-met-barrington-student-death-20100818_1_mental-illness-suicide-prevention-teen-suicides HTTP/1.1 200 6826
    67.162.10.216 - - [22/Aug/2010:01:00:00 +0000] GET /tracker.js.php?45aa01ed37b58d2a537b1ba12bb97fe2e5695a8c HTTP/1.1 200 2915
    67.162.10.216 - - [22/Aug/2010:02:00:00 +0000] GET /tracker.js.php?45aa01ed37b58d2a537b1ba12bb97fe2e5695a8c HTTP/1.1 200 2882
    66.249.71.98 - - [22/Aug/2010:03:04:00 +0000] GET /ad-openx.php?out=js&d=mod-top-hdr-defer&z-i=24809&z-n=top-...blah....
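    Putting the conversion and the string compares together, the extraction itself might look something like the sketch below. It is not the code above verbatim: sortable_stamp is a hypothetical helper that does the reformatting with a single regex and sprintf, and the three log lines are taken from the sample data.

```perl
use strict;
use warnings;

my %month_num = (Jan => '01', Feb => '02', Mar => '03', Apr => '04',
                 May => '05', Jun => '06', Jul => '07', Aug => '08',
                 Sep => '09', Oct => '10', Nov => '11', Dec => '12');

# Hypothetical helper: turn the "[21/Aug/2010:00:00:00" field of a log line
# into "2010-08-21 00:00:00"; returns nothing if no timestamp is found.
sub sortable_stamp {
    my ($line) = @_;
    my ($day, $mon, $year, $time) =
        $line =~ m{\[(\d{1,2})/(\w{3})/(\d{4}):(\d{2}:\d{2}:\d{2})}
        or return;
    my $month = $month_num{$mon} or return;
    return sprintf '%04d-%s-%02d %s', $year, $month, $day, $time;
}

my ($from, $to) = ('2010-08-21 06:00:00', '2010-08-22 09:00:00');

my @log = (
    '67.162.10.216 - - [21/Aug/2010:00:00:00 +0000] GET /2010-08-18/news/ct-met-barrington-student-death-20100818_1_mental-illness-suicide-prevention-teen-suicides HTTP/1.1 200 6826',
    '67.162.10.216 - - [22/Aug/2010:01:00:00 +0000] GET /tracker.js.php?45aa01ed37b58d2a537b1ba12bb97fe2e5695a8c HTTP/1.1 200 2915',
    '67.162.10.216 - - [22/Aug/2010:02:00:00 +0000] GET /tracker.js.php?45aa01ed37b58d2a537b1ba12bb97fe2e5695a8c HTTP/1.1 200 2882',
);

# Keep only the lines whose reformatted timestamp lies inside the window.
my @hits = grep {
    my $key = sortable_stamp($_);
    defined $key && $key ge $from && $key le $to;
} @log;

print "$_\n" for @hits;
```

    Only the two 22/Aug requests are printed. For a large file you would read line by line instead of building @log, and could stop as soon as a key is gt the upper bound.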