Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I have a log file which records system usage on a server. I have a session start time and an end time. I would like to know how many concurrent users are accessing the system. Here is what I have done so far. I have created a new field for the duration of the session. Any transaction that does not last more than one second (i.e. 00:00:00) is ignored. I converted my dates from ISO format (yyyy-mm-dd hh:mm:ss to epoch time and am hoping to create a data structure to iterate over each enumeration in order to establish an intersection/concurrence. Logically, it seems like such a daunting task. So I wanted to discuss this case with the pros to find out how you all would address this problem. I am thinking of what would be the best way to implement this. For instance, most users would be connected for less than a couple of hours. So there is no reason to analyze data from the stack that is irrelevant! Any ideas or advise to guide this poor, lost monk? Thanks a lot.
  • Comment on How to Iterate to Identify Concurrent Users

Replies are listed 'Best First'.
Re: How to Iterate to Identify Concurrent Users (trees, merge)
by tye (Sage) on Oct 26, 2007 at 01:36 UTC

    This is exactly the type of problem that I've wanted to solve with Perl that lead me to propose that Perl add tree-based, very efficient, sorted hashes in a future version (such that entries in the hash are sorted by key).

    Since Perl doesn't have that, you could certainly find a module that implements hashes with sorted keys and use that (if the max number of concurrent users is not huge, then the performance should be acceptable). Then you read the "start" records in chronological order and insert them into a sorted hash with the "end" date as the key. Then you basically do a merge on the "start" records that you are reading and on the ever-changing hash of still-current users. Each time you insert a new "start" record into the hash, you first remove all of the items having an "end" date prior to the new "start" date. Track the size of the hash at each step to see how many concurrent users you have as you progress through (past) time.

    The other route I'd go is producing a second stream of records sorted by "end" time. You could do this by putting the records into a database with an index on the "start" time and an index on the "end" time (or just do an "order by" on the computed "end" time, if your problem size isn't too large for that to be reasonable). Or you could just produce a second log file with the "end" time up front and use an efficient external 'sort' command to put that second log in the desired order.

    Then you merge the two streams. Reading a record from the "start" stream adds an item to your hash of "concurrent users" (keyed by user ID this time). Reading a record from the "end" stream deletes an item from your hash.

    If you have the added complexity of perhaps multiple transactions per user, then you'd add a second hash recording the number of concurrent transactions for each user and delete each user when their count goes to zero.

    - tye        

      tye,
      Since Perl doesn't have that, you could certainly find a module that implements hashes with sorted keys and use that...

      I wrote Tie::Hash::Sorted as a result of someone at this site wanting a sorted hash. It was written with every effort at optimizations a monk with a little over a year of perl under his belt could muster.

      Cheers - L~R

Re: How to Iterate to Identify Concurrent Users (SQL)
by fenLisesi (Priest) on Oct 26, 2007 at 08:50 UTC
    You could use SQL, as tye hinted. Suppose you load your data into a table named sesn (session) with fields sesn_id, usr_id, sesn_bgn, sesn_end and whatever else you may need, properly indexed etc. Then, a query like the following should do what you want (untested):
    SELECT foo.usr_id AS user, foo.sesn_id AS session1, bar.sesn_id AS session2, foo.sesn_bgn AS start1, bar.sesn_bgn AS start2, foo.sesn_end AS end1, bar.sesn_end AS end2 FROM sesn foo, sesn bar WHERE foo.usr_id = bar.usr_id AND bar.sesn_bgn >= foo.sesn_bgn AND bar.sesn_bgn <= foo.sesn_end AND foo.sesn_id != bar.sesn_id -- AND (conditions on min duration etc here)

    Here I have assumed that you are trying to track multiple simultaneous sessions by the same user. If that's not what you meant, adjust accordingly.

    Hope this helps. Cheers.

      Thanks tye and fenLisesi. I initially started with a database model but found it to be more cumbersome. After your good advice, I created a view and ran fenLisesi query with some modifications. It did the trick. Its slow though but it gets me what I need. I'll benchmark it with Perl code to see which is more efficient. You monks rock!!!
      create view foo as select session_id, start_session, end_session, user +_id from bar;
      </code> Then I ran
      SELECT DISTINCT foo.user_id AS user, foo.session_id AS session1, bar.session_id AS session2, foo.start_session AS start1, bar_.start_session AS start2, foo.end_session AS end1, bar_.end_session AS end2 FROM foo,bar WHERE foo.user_id != bar_.user_id AND bar_.start_session >= foo.start_session AND bar_.start_session <= foo.end_session AND foo.session_id != bar_.session_id order by start1 ASC;
Re: How to Iterate to Identify Concurrent Users
by GrandFather (Saint) on Oct 25, 2007 at 23:58 UTC

    What do you want to output? A single number indicating the greatest number of concurrent users? A list indicating the number of concurrent users whenever that number changes? A list of hourly (or some other period) reports indicating the maximum number of concurrent users in that period? Something else?

    If you have a list of connect/disconnect times then you can sort the list by event, then iterate through the list incrementing or decrementing a connected count as appropriate for each event.


    Perl is environmentally friendly - it saves trees
      Quite simply, I would like to output the total number of concurrent users. The list is sorted since it is chronologically based. Thanks Grandfather.
        "The total number of concurrent users" doesn't define much. If you mean "the number of users ever", then say that. If you mean "the number of users that had at least one other user logged in at the same time", that might make sense. For example, if one user logs in at 5, and leaves at 8, and another user logs in at 6, and leaves at 9, and a third user logs in from 10 to 11, how many is that? Two, or Three, or none? :)

        Or maybe you mean average users over time, which from 5 to 11 pm on that example is (calculating... 3 + 3 + 1 over 6) 2.333 or so.

        Do you mean "the maximum number of concurrent users over the span of the log file"? The total of one number is rather uninteresting, especially when you're not clear on what that number is.