
I have two sets of files. The first set is a series of six files, each containing lines of the form user_id,session_id,dept_code. All of the entries in these files are unique because of the session_id, so across all six files I end up with around 10K unique entries per day.

The other set is two Apache access logs, about 250 MB each per day. What I need to do is break the Apache access logs down into department-specific files based on the session_id. The session_id appears in the URL portion of the access log entry.

For instance, I may have a session entry like (user,session,dept): user_x,sess12345,tools. I then need to match up every URL in the access logs that the user hit for this session_id and print them out to a tools file.

I originally tried putting all of the session information into a single hash and then reading the access logs, but I ended up reading the 250 MB files 10,000 times (once for each unique session id). This seems plain silly to me. Looking around, it appears another way to go would be to create a series of hashes and then reference them when reading the access logs, but I haven't been able to get this to work.

sub make_session {

    my ($u,$a,$s)=@_;

    %session_hash={
        user=>$u,
        dept=>$a
    };
        $s=\%session_hash;

}
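
Two things in this sub stand out: `%session_hash={...}` assigns a single hash reference to a hash instead of a list of key/value pairs, and `$s=\%session_hash` only overwrites the sub's private copy of the session id, so nothing is stored where later code can find it. A minimal sketch of one way around that, assuming a file-scoped lookup hash keyed by session id (the name `%sessions` is hypothetical and not part of the original code):

```perl
use strict;
use warnings;

# Hypothetical lookup table: session_id => { user => ..., dept => ... }
my %sessions;

sub make_session {
    my ($user, $dept, $sid) = @_;
    # Store an anonymous hash under the session id so it can be found later
    $sessions{$sid} = { user => $user, dept => $dept };
    return $sessions{$sid};
}

make_session('user_x', 'tools', 'sess12345');

# A lookup is a single hash access; there is no scan over all sessions
if (exists $sessions{'sess12345'}) {
    print "dept: $sessions{'sess12345'}{dept}\n";
}
```

With roughly 10K sessions per day this table stays small, and each access log line needs only one `exists` check against it.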

While reading the session logs I call:

&make_session($user,$dept,$sid);

I then move on to the access log:

open (HTTP,$access_log)||die ("unable to open $access_log $!\n");
    while (my $line2=<HTTP>) {
        chomp($line2);
        my @access_fields = split /\s/, $line2;
        my $session=$access_fields[6];
        $session=substr($session,(index($session,"?")+12),(index($session,"|")-1)-(index($session,"?")+12) );
        if (exists($$session_id{'user_id'})) {
            print "match found\n";
        }
    }

I seem to be having issues with referencing the hashes that have been created. I am open to new ideas; I really just want to limit the number of times I have to read the large files to cut processing time, as the goal is to do this daily.
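
For illustration, the single-pass idea could be sketched roughly as follows: load every session into one hash keyed by session id, read each access log once, and append every matching line to a per-department file. The URL pattern `?session_id=<id>|` is only a guess at the log format, and `extract_session_id`, `split_by_dept`, and the `<dept>.log` file names are hypothetical names introduced for this sketch.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Assumption: the id appears in the URL as "?session_id=<id>|...".
# Adjust the pattern to whatever the real logs contain.
sub extract_session_id {
    my ($url) = @_;
    return $url =~ /\?session_id=([^|&\s]+)/ ? $1 : undef;
}

# Read the log once and append each matching line to its department file.
sub split_by_dept {
    my ($log_fh, $sessions, $out_dir) = @_;
    my %out;    # dept => output filehandle, opened on first use
    while (my $line = <$log_fh>) {
        chomp $line;
        my @fields = split ' ', $line;
        my $url = $fields[6];    # request URL in the common log format
        next unless defined $url;
        my $sid = extract_session_id($url);
        next unless defined $sid && exists $sessions->{$sid};
        my $dept = $sessions->{$sid}{dept};
        unless ($out{$dept}) {
            open $out{$dept}, '>>', "$out_dir/$dept.log"
                or die "unable to open $out_dir/$dept.log: $!\n";
        }
        print { $out{$dept} } "$line\n";
    }
    close $_ for values %out;
}

# Demo data, invented for this example
my %sessions = (
    sess12345 => { user => 'user_x', dept => 'tools' },
);
my $log = join "\n",
    '1.2.3.4 - - [01/Jan/2000:00:00:00 +0000] "GET /a?session_id=sess12345|x HTTP/1.0" 200 10',
    '1.2.3.4 - - [01/Jan/2000:00:00:01 +0000] "GET /b?session_id=unknown|x HTTP/1.0" 200 10',
    '';

my $out_dir = tempdir(CLEANUP => 1);
open my $log_fh, '<', \$log or die "unable to open in-memory log: $!\n";
split_by_dept($log_fh, \%sessions, $out_dir);
close $log_fh;
```

Each log is read exactly once, so the cost grows with the size of the logs rather than with the number of sessions. Keeping one open filehandle per department avoids reopening an output file for every matching line.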

In reply to hash referencing...best approach? by Anonymous Monk
