Suspect FH issues

kdmurphy001 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Great and Divine Monks,

I once again must beseech you for your aid and ask that you bestow upon me a small fraction of your boundless knowledge. I have partaken on a humble quest but have stumbled near the end.

I am seeking to calculate the length of sessions by scanning a logFile (logFile.txt). It's an apache log file and I have used other versions of this crude script successfully in the past. The one big change here being that I must now look through the file a second time for the last occurrence of the session ID.

To this end I have posted my writing below. Currently the script works only on compressed files (with slight modification) or if I remove/comment out the FILELEN file handle. If I run it as shown below it will simply spin (running from a Jenkins env) with no errors and no output. Please forgive me for showing such ugliness to your divine eyes. It is only out of great necessity that I do so.

#!/usr/bin/perl
use strict;
use warnings;
use DBD::mysql;
use File::Find;
use File::Find::Rule;
use POSIX qw(strftime);
use PerlIO::gzip;
use DateTime ();
use Date::Parse qw(str2time);

#Get yesterdays date
my $epoc = time() - 86400;   # one day before of current date.
my $yesterday = strftime "%F", localtime($epoc);

#REGEX Patterns
my $pattern = '(\d+-\d+-\d+ \d+:\d+:\d+).*login:(\S+).*UID:(\S+).*sess
+ion:(\S{32}).*type:(\d).*InstID:(\S+)';
my $pattern_session = '(\d+-\d+-\d+).*';

my $output_file= 'FileList_with_Length.txt';
open OUTPUT, ">$output_file";

#Locate Files
my $dir = $ARGV[0];
my $ffr_obj = File::Find::Rule->file()
                              ->name( "logFile.txt" )
                              ->maxdepth( 5 )
                              ->start ( $ARGV[0] );

while (my $file = $ffr_obj->match() ) {
    if($file =~ /\/net\/bard2\/home\/bca\/logs\/ilrn-(east|west|sjc|cv
+g)(\d+)\/logFile.txt/) { 
        my $node = $2;
        my $pool = $1;
        print "In File:  $file\n";
                
        open FILE, "< $file" or die $!;
        while(<FILE>) {            
            if ($_ =~ /$pattern/) {
                my $currentline = $. ; 
                my $currentsession = $4 ; 
                my $starttime = $1 ;   #Sets date and time when first 
+session ID found
                my $endtime = 0;       #Will hold  date and time when 
+last session ID found
                my $duration = 0;
                
                #Get Sesion Length
                my $lastseen = 0;    #If > 100000 will assume session 
+has ended
                
                open FILELEN, "< $file" or die $!;                
                #open FILELEN, "<:gzip", "$file" or die $!;
                while (<FILELEN>) { 
                    #Skip all lines up to current line (where session 
+ID was first seen)
                    next if (1..$currentline);
                    if ($_ =~ /(\d+-\d+-\d+ \d+:\d+:\d+).*$currentsess
+ion/) {
                        $lastseen = 0;
                        $endtime = $1;    #Sets endtime to last date a
+nd time seen
                    }
                    else { 
                        if($lastseen < 100000) {
                            $lastseen++;
                        }    
                        if($lastseen >= 100000) {    #Exit if 100,000 
+lines were searched with out seeing any further 
                            next;                    #of the session i
+d
                        }                    
                    }
                }
                close FILELEN;
                $duration = str2time($endtime) - str2time($starttime);
+    #convert to epoch time and subtract to get length in seconds     
                if($duration > 5) {    print OUTPUT "$1\t$2\t$3\t$4\t$
+5\t$6\t$duration\n" ; print "$1\t$2\t$3\t$4\t$5\t$6\t$duration\n" }  
+              
            }
        }
        close FILE;    
    }
}
close OUTPUT;
[download]

I thank you again for your blessings.
UPDATE: Sample of the log files. I do apologize, I meant to include this originally.

2014-09-28 16:49:55" INFO com.xxx.session.Session.login login:someone@email.com UID:1263498014 session:647D67F6F741B57C8ACDC05F13D1C3C0 browser:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36 type:1 remote-ip:00.000.00.000 InstID:003675 region:US|| Skin:UNIVERSITY OF SomeSchool SessionObjectID:396640419

2014-09-28 16:51:08" INFO AssignmentSessionLifecycleLogger Start UserActionStart instID:003675 userID:1263498014 userLogin:some@email.com assignmentID:1277035616 assignmentResultsID:1883245099 sessionID:647D67F6F741B57C8ACDC05F13D1C3C0 requestURI:/xxx/takeAssignment/takeCovalentActivity.do

2014-09-28 20:10:21" INFO AssignmentSessionLifecycleLogger Stop SessionCloseCleanup instID:003675 userID:1263498014 userLogin:some@email.com assignmentID:1277035616 assignmentResultsID:1883245099 sessionID:647D67F6F741B57C8ACDC05F13D1C3C0 requestURI -Kevin Murphy

Comment on Suspect FH issues Download Code

Replies are listed 'Best First'.
Re: Suspect FH issues by Corion (Patriarch) on Oct 23, 2014 at 07:59 UTC
From what I read from your code, your approach is the following: `read through the file "$file" when a starting session is found open the file "$file" again, as file2 read through file2 until the end of the session is found` [download] I would approach the problem differently. I would try to read through the file only once, remembering the start and ending time for each session: my %session_start; my %session_end; while( <FILE> ) { if( /$pattern/ ) { # Session starts here my $started_session= $1; # actually, you do the counting of pa +rentheses my $session_start_time= $2; # actually, you do the counting of + parentheses if( ! $session_start{ $started_session }) { # That session really is new $session_start{ $started_session }= $session_start_time; } else { # That session should be new but we already know it warn "Line $.: Session '$started_session' already started +at $session_start{ $started_session }, but also (re)starts at $sessio +n_start_time. See [$_]"; }; } elsif( /(\d+-\d+-\d+ \d+:\d+:\d+).*session:(\S{32})/) my $session_end= $1; # actually, you do the counting of parent +heses my $ended_session= $2; # actually, you do the counting of pare +ntheses # Session was seen here, or might end here if( ! $session_start{ $ended_session }) { # That session is unknown?!really is new warn "Line $.: Session '$ended_session' never started, but + ends at $session_end. See [$_]"; } else { # That session was seen here $session_end{ $ended_session }= $session_end; }; } }; [download] Once you have collected all the start and end times for the sessions (and refined your regular expressions so they only match the start/end lines), you just need to output the start/end hashes: `for my $session (sort keys %session_start) { print "Session: $session - $session_start{ $session } - $session_e +nd{ $session }\n"; };` [download] Sanity checking that every session that was started also ends, and that every session that ended also started is left to the user.	[reply] [d/l] [select]
Re: Suspect FH issues by McA (Priest) on Oct 23, 2014 at 07:55 UTC
Hi, in this case it would be good to have an example file of your logs to be able to see what you want to do and where the problem is. But, if I understand the intent of the program right you have a certain pattern to identify session starts and you have a certain pattern to identify session continuation entries. Why don't you do it in one rush? It seems that you can always identify a line by a session id. As soon as you hit a line you decide whether it's a session start or session cont line. If it's a session start you do somthing like: `my %sessions; $sessions{$id}->{'start'}->{'line'} = $lineno; $sessions{$id}->{'start'}->{'timestamp'} = $timestamp;` [download] For every cont entry you do the following: `$sessions{$id}->{'last'}->{'line'} = $lineno; $sessions{$id}->{'last'}->{'timestamp'} = $timestamp;` [download] assuming that the timestamps in the log file a monotonous rising. In the end you have a hash `%sessions` where every key is a session-id and with the value hash you can do all calculations you want. Am I wrong in my assumptions? Regards McA	[reply] [d/l] [select]
Re: Suspect FH issues by Discipulus (Canon) on Oct 23, 2014 at 07:55 UTC
even if i'm an humble monk, i can try some answer: you are reopening your file inside the while loop. this is not correct or safe. You have to seek the filehandle in such a case, i think, not reopen the FH. If you really need to know the lenght of a session you can build up a hash of arrays with the session as key, start time as first index of the array and update the second element while iterating. Something like `$sessions{$cur_id}[1] = $1 if $1 > $sessions{$cur_id}[1];` [download] HtH L* There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.	[reply] [d/l]