in reply to Log parsing by timestamp dilema

Sorting all the results chronologically would mean reading them all in first (they could get quite large), performing the sort, and then printing the output. I thought of the following alternative options:
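(For concreteness, the read-everything approach being avoided here would look something like the sketch below - assuming a get_time() routine that pulls a sortable epoch timestamp out of a log line; the helper name is illustrative only:)

    use strict;
    use warnings;

    # Slurp every line of every log into memory, then sort the lot.
    # Memory use grows with the combined size of all the logs.
    my @all;
    for my $file (@ARGV) {
        open my $fh, '<', $file or die "could not open $file: $!\n";
        push @all, <$fh>;
        close $fh;
    }

    # A Schwartzian transform calls get_time() once per line rather
    # than once per comparison.
    print map  { $_->[1] }
          sort { $a->[0] <=> $b->[0] }
          map  { [ get_time($_), $_ ] } @all;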

Assuming each log file is in chronological order, wouldn't it be easier to go through them in parallel, writing out the entries in order as you go? Something like this:

    use strict;
    use warnings;
    use IO::File;

    # A date "bigger than anything in the logs" - the original left this
    # as a placeholder; it doubles as the marker for exhausted files.
    my $MAX = 9**9**9;

    # MAKE A FILEHANDLE FOR EACH FILE WE WERE GIVEN
    my @files = map { IO::File->new($_) or die "could not open $_\n" } @ARGV;
    my $num_logs = @files;

    # READ IN A LINE FOR EACH FILE
    my @lines = map { scalar(<$_>) } @files;

    # GET THE DATES FOR EACH LINE (get_time() is assumed to turn a log
    # line into a numeric timestamp; empty files start out exhausted)
    my @dates = map { defined $_ ? get_time($_) : $MAX } @lines;

    my $found;
    do {
        # FIND THE LINE WITH THE EARLIEST DATE
        my $min = $MAX;
        $found = undef;
        for ( my $i = 0; $i < $num_logs; $i++ ) {
            if ( $dates[$i] < $min ) {
                $found = $i;
                $min   = $dates[$i];
            }
        }
        if ( defined $found ) {
            # IF WE FOUND A LINE, SHOW IT AND READ THE NEXT
            # LINE IN FOR THAT LOG FILE
            print $lines[$found];
            my $io = $files[$found];
            $lines[$found] = <$io>;
            $dates[$found] =
              defined $lines[$found] ? get_time( $lines[$found] ) : $MAX;
        }
    } while ( defined $found );

I did something similar to this to merge multiple Apache access logs together a few years back.

Re: Re: Log parsing by timestamp dilema
by Limbic~Region (Chancellor) on Feb 01, 2003 at 12:02 UTC
    I am not sure I understand how your code processes the log files in parallel. The following logic came to mind as I read your post, which may be what your code snippet does - correct me if I am wrong.

  • Open up all files and read the first line
  • Determine which line/file had the earliest stamp
  • Print that line, while keeping the other lines in the array
  • Read the next line from the matched file back into the array
  • Repeat from step 2

    This appears to be good logic, even though I can't discern how it works from your code. Of course, for this to work for me I would have to add a fair bit more code, since I need to keep the file name/path associated with each entry - you can't get a filename/path back from a filehandle. For this reason, I would probably use a hash. If this is not what you meant and I am completely off base - please let me know.
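    A minimal sketch of that hash idea, assuming the same hypothetical get_time() timestamp extractor (the field names here are illustrative, not from either script):

        use strict;
        use warnings;

        # One hash entry per log, keyed by filename/path, so the path
        # stays attached to its filehandle and current line.
        my %log;
        for my $path (@ARGV) {
            open my $fh, '<', $path or die "could not open $path: $!\n";
            my $line = <$fh>;
            next unless defined $line;    # skip empty files
            $log{$path} = {
                fh   => $fh,
                line => $line,
                time => get_time($line),
            };
        }

        while (%log) {
            # Pick the file whose current line is earliest.
            my ($path) = sort { $log{$a}{time} <=> $log{$b}{time} } keys %log;
            print "$path: $log{$path}{line}";

            # Advance that file; drop it once exhausted.
            my $next = readline $log{$path}{fh};
            if ( defined $next ) {
                $log{$path}{line} = $next;
                $log{$path}{time} = get_time($next);
            }
            else {
                delete $log{$path};
            }
        }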

    I was thinking that I was going to have to resort (pun intended) to telling them to | sort.

    Cheers - L~R

    UPDATE: edited to make the logic clear

      Hi.

      Sorry, I couldn't resist rewriting your code. :-) The problem "got at me". It uses adrianh's solution, but translating it into your script, you would end up with something like the rewrite below.

      First, I removed the whole while loop (around lines 86-90); all the code in between was cut out and saved for later. A lot of the repeated code was moved into subroutines. I have tested it as best I can, and it works for me. I took advantage of the fact that you had already done the work of finding the files, which were stored in @Logs - this was used instead of @ARGV. I tried not to impose my coding style on the script, but it has been run through PerlTidy, which may have moved stuff around a bit.

      The other main change was the handling of specified date ranges. Whilst the 'if' logic remains, I generalised it into a subroutine and made use of a new %Range hash to store the 'begin' and 'end' dates (which are updated if the '-t' option is specified). By defaulting these appropriately, the code can check whether a date is in the range the user wants in just one line. Also, this means that the complicated regexes that parse the command-line args only run once, rather than for every line of every file.
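      (Roughly this shape - a sketch only, with illustrative values, not the actual rewrite:)

          use strict;
          use warnings;

          # Default to an all-inclusive range; the -t handling would
          # overwrite these once, at option-parsing time.
          my %Range = (
              begin => 0,          # earlier than any real timestamp
              end   => 9**9**9,    # later than any real timestamp
          );

          # With the defaults in place, the per-line test is one line.
          sub in_range {
              my ($time) = @_;
              return $time >= $Range{begin} && $time <= $Range{end};
          }

          print "in range\n" if in_range(1_044_100_000);    # sample epoch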

        DaveH,
        Thanks!
        My logical interpretation of adrianh's solution was pretty much correct - I just couldn't see it in the code. This works as is, but it runs considerably slower, so I am going to test its speed against tall_man's suggestion. I know that it is doing a lot more work, so this is expected, and with the $|++ the humans viewing it shouldn't really notice a difference. Nonetheless, I am going to code my own version of the logic to see if I can't speed it up, in addition to benching it against a version using File::MergeSort. If I can't do any better than your integration of adrianh's solution, the only change I will make is to have it be an option rather than the default. That way it will not affect the overall speed if someone chooses to do a -c and only look at one connector log.
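        (For the benching, core Perl's Benchmark module would do - a sketch, where merge_by_hand() and merge_with_module() are hypothetical wrappers around the two approaches:)

            use strict;
            use warnings;
            use Benchmark qw(cmpthese);

            my @logs = @ARGV;

            # merge_by_hand() and merge_with_module() stand in for the
            # two implementations; each should merge @logs the same way.
            cmpthese( -5, {    # run each candidate for >= 5 CPU seconds
                handrolled     => sub { merge_by_hand(@logs) },
                file_mergesort => sub { merge_with_module(@logs) },
            } );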

        Cheers - L~R

      Summary of logic spot on. Sorry my code wasn't clear enough :-)

      However, as tall_man pointed out, File::MergeSort (which I had somehow managed to miss) does exactly the same thing - and is nicely encapsulated. So I'd use that instead if it were me :-)
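      If I remember its interface correctly, the whole merge collapses to something like this (a sketch - check the File::MergeSort docs; get_time() is again the assumed timestamp extractor):

          use strict;
          use warnings;
          use File::MergeSort;

          # The coderef pulls the merge key out of a line; reuse the
          # assumed get_time() timestamp extractor.
          my $merger = File::MergeSort->new(
              [@ARGV],
              sub { get_time( $_[0] ) },
          );

          # Emit lines in merged (chronological) order.
          while ( defined( my $line = $merger->next_line ) ) {
              print $line;
          }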