Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have two sets of files. The first set is a series of 6 files that contain a user_id,session_id,dept_code. All of the entries in these files will be unique because of the session_id. So across all six files I end up with around 10K unique entries per day.

The other set of files are two apache access logs. They are about 250 megs each per day. What I need to do is break down the apache access logs into department specific files based on the session_id. The session_id appears in the url portion of the access log entry.

For instance, I may have a session entry like (user,session,dept): user_x,sess12345,tools. I then need to match up every url in the access logs that the user hit for this session_id and print them out to a tools file.

I originally tried putting all of the session information into a single hash and then reading the access logs, but then I end up reading the 250 meg files 10,000 times (once for each unique session id). This seems plain silly to me. Looking around, it appears another way to go would be to create a series of hashes and then reference them when reading the access logs, but I haven't been able to get this to work.
sub make_session {
    my ($u,$a,$s)=@_;
    %session_hash={ user=>$u, dept=>$a };
    $s=\%session_hash;
}
During the reading of the session logs I call:
&make_session($user,$dept,$sid);

I then move to the access log
open (HTTP,$access_log)||die ("unable to open $access_log $!\n");
while (my $line2=<HTTP>) {
    chomp;
    my @access_fields = split /\s/, $line2;
    my $session=@access_fields[6];
    my $session=substr($session,(index($session,"?")+12),(index($session,"|")-1)-(index($session,"?")+12) );
    if (exists($$session_id{'user_id'})) {
        print "match found\n";
    }
}

I seem to be having issues with referencing the hashes that have been created. I am open to new ideas; I really just want to limit the number of times I have to read the large files to decrease processing time, as the goal is to do this daily.

Replies are listed 'Best First'.
Re: hash referencing...best approach?
by Roy Johnson (Monsignor) on Nov 24, 2003 at 20:36 UTC
    sub make_session {
        my ($u,$a,$s)=@_;
        %session_hash={ user=>$u, dept=>$a };
        $s=\%session_hash;
    }
    You've muddled several things here. The hash should be assigned from a list in parentheses, not from braces, which build an anonymous hash reference. There's no reason to assign to $s, since it is a local copy that you're just throwing away. You can return the reference by just saying
    \%session_hash;
    as the last line. You don't say where the result of make_session goes, and you don't declare %session_hash as a lexical in the sub, which means you're going to be returning a reference to the same thing every time.
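    To make that last point concrete, here is a minimal sketch contrasting a shared package hash with a lexical one. The name make_session_shared and the sample users are invented for illustration; only make_session comes from the question.

```perl
use strict;
use warnings;

# Buggy shape: one package-level hash shared by every call.
our %shared;
sub make_session_shared {
    my ($user, $dept) = @_;
    %shared = (user => $user, dept => $dept);
    return \%shared;               # always the same reference
}

# Fixed shape: a fresh lexical hash on every call.
sub make_session {
    my ($user, $dept) = @_;
    my %session_hash = (user => $user, dept => $dept);
    return \%session_hash;         # a new reference each time
}

my $a1 = make_session_shared('user_x', 'tools');
my $a2 = make_session_shared('user_y', 'help');
print "shared:  $a1->{user} $a2->{user}\n";   # the first session was overwritten

my $b1 = make_session('user_x', 'tools');
my $b2 = make_session('user_y', 'help');
print "lexical: $b1->{user} $b2->{user}\n";   # each session keeps its own data
```

    With the shared hash both references point at the same storage, so the second call silently replaces the first session's data.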

    while (my $line2=<HTTP>) { chomp;
    I'm guessing that here you want to chomp $line2.

    I think I know what you want to do, but I'm not sure: Create a hash entry for each unique session, and associate a filename with it. Then, when you're reading the log files, you'll parse out the session name, and write (append) the line onto the associated file. Is that right?


    The PerlMonks advocate for tr///
      "I think I know what you want to do, but I'm not sure: Create a hash entry for each unique session, and associate a filename with it. Then, when you're reading the log files, you'll parse out the session name, and write (append) the line onto the associated file. Is that right? "

      This assessment is pretty accurate. What I want to have at the end of processing is a series of dept. specific access files:

      tools_access.log
      help_access.log
      ...

      Even better would be the full line from the access log file with the user_code and dept_code added to the end of it, so we could track not only dept usage but also user usage within each dept.

      I don't know if that was any clearer or not...
        Ok, let's see if this gets you started. I'm going to write this as mostly pseudocode comments. You get to fill in the code.
        my %session_hash;
        my %departments;
        # for each of the 6 session files,
        #   open and read line-by-line
        #   parse out user_id, session_id, dept_code
        $session_hash{$session_id} = $dept_code;
        $departments{$dept_code}++;
        #
        # for each apache log
        #   open and read line-by-line
        #   parse out session_id
        #   append the line to the file associated with $session_hash{$session_id}
        If you have a relatively small number of departments (keys %departments), you can keep all the output files open for writing. Otherwise, you'll need to open for append each time you want to write a line of output. (You could also hold some number of lines in memory and write them out every so often, for a little less open and closing action.)
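        One way the outline above might be filled in is sketched below. It is not a definitive implementation: the helper names (load_sessions, extract_session_id, route_log), the "?session_id=<id>|" URL layout, and all sample data are assumptions made for illustration. The demo uses in-memory filehandles so it runs without any files; each log line costs one hash lookup, so every log is read only once.

```perl
use strict;
use warnings;

# Read "user_id,session_id,dept_code" lines from each handle into
# session_id => { user => ..., dept => ... }.
sub load_sessions {
    my @fhs = @_;
    my %session;
    for my $fh (@fhs) {
        while (my $line = <$fh>) {
            chomp $line;
            my ($user, $sid, $dept) = split /\s*,\s*/, $line;
            next unless defined $dept;            # skip malformed lines
            $session{$sid} = { user => $user, dept => $dept };
        }
    }
    return \%session;
}

# ASSUMPTION: the request URL is whitespace field 6 and looks like
# "...?session_id=<id>|..."; adjust the pattern to the real URL layout.
sub extract_session_id {
    my ($line) = @_;
    my $url = (split ' ', $line)[6];
    return unless defined $url;
    return $url =~ /\?session_id=([^|\s]+)/ ? $1 : undef;
}

# One pass over a log. $out_for is a code ref that returns the
# output handle for a given department.
sub route_log {
    my ($log_fh, $session, $out_for) = @_;
    while (my $line = <$log_fh>) {
        chomp $line;
        my $sid = extract_session_id($line);
        next unless defined $sid;
        my $info = $session->{$sid} or next;      # unknown session: skip
        print { $out_for->($info->{dept}) } "$line $info->{user} $info->{dept}\n";
    }
}

# Demo on in-memory data (all values invented for illustration).
my $sessions_txt = "user_x,sess12345,tools\nuser_y,sess67890,help\n";
my $log_txt = join '',
    qq{10.0.0.1 - - [24/Nov/2003:20:00:00 -0500] "GET /app?session_id=sess12345|p=1 HTTP/1.0" 200 512\n},
    qq{10.0.0.2 - - [24/Nov/2003:20:00:01 -0500] "GET /app?session_id=sess67890|p=2 HTTP/1.0" 200 128\n},
    qq{10.0.0.3 - - [24/Nov/2003:20:00:02 -0500] "GET /index.html HTTP/1.0" 200 64\n};

open my $sfh, '<', \$sessions_txt or die "open: $!";
my $session = load_sessions($sfh);

my (%buffer, %handle);
my $out_for = sub {
    my ($dept) = @_;
    unless ($handle{$dept}) {
        $buffer{$dept} = '';
        # For real output, open '>>' on "${dept}_access.log" instead.
        open my $fh, '>', \$buffer{$dept} or die "open: $!";
        $handle{$dept} = $fh;
    }
    return $handle{$dept};
};

open my $lfh, '<', \$log_txt or die "open: $!";
route_log($lfh, $session, $out_for);
close $_ for values %handle;
print "$_:\n$buffer{$_}" for sort keys %buffer;
```

        Keeping one cached handle per department matches the "keep all the output files open" option; with many departments the code ref could open for append and close on each call instead.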

        HTH.


        The PerlMonk tr/// Advocate
Re: hash referencing...best approach?
by tcf22 (Priest) on Nov 24, 2003 at 20:04 UTC
    You are assigning the hash reference to $s, which is local to the sub, so the caller never sees it.
    You could pass in a ref to $sid, or return the hash ref. I prefer the latter.

    Also, you are assigning a hash ref to a hash, which is probably not what you want. {} creates a hash reference; () is used to build the list that initializes a hash.

    One of these 2 options should work:
    #Passing scalar ref
    sub make_session {
        my ($u,$a,$s)=@_;
        my $session_hash={ user=>$u, dept=>$a };
        $$s=$session_hash;
    }
    &make_session($user,$dept,\$sid);
    or
    #Returning hash ref
    sub make_session {
        my ($u,$a)=@_;
        my $session_hash={ user=>$u, dept=>$a };
        return $session_hash;
    }
    $sid = &make_session($user,$dept);
    Update: Changed %session_hash to $session_hash after realizing that you are creating a hash ref by using {}.
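    Either option hands back a hash ref, but the ref still has to be stored somewhere it can be found from a session id later. A minimal sketch of one way to do that, assuming a hypothetical %sessions hash keyed by session id and invented sample rows:

```perl
use strict;
use warnings;

sub make_session {
    my ($user, $dept) = @_;
    return { user => $user, dept => $dept };   # fresh anonymous hash ref
}

my %sessions;    # session_id => { user => ..., dept => ... }

# Invented sample rows standing in for the parsed session files.
for my $row (['user_x', 'tools', 'sess12345'], ['user_y', 'help', 'sess67890']) {
    my ($user, $dept, $sid) = @$row;
    $sessions{$sid} = make_session($user, $dept);
}

# Later, with a session id parsed from an access log line:
my $seen = 'sess12345';
if (exists $sessions{$seen}) {
    print "match found: $sessions{$seen}{user} in $sessions{$seen}{dept}\n";
}
```

    Storing the ref under the session id keeps $sid itself intact and turns the match test into a plain exists check.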

    - Tom