cosmicperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi All,
  I've written a piece of tracking software that records clicks to a database, checking to see which are unique. Basic logic:-
Click comes in
Check IP against database
Record click to database

This works fine for low-traffic sites, but when you get higher traffic, with potentially several hits a second, the database quickly gets up to 100MB and checking for that unique IP before recording the click really hits the CPU.
I've come up with this solution, which uses a temporary flatfile before bulk-checking and inserting into the database.
    my $maxsize = 1024;
    my $maxtime = 30;
    my ($fsize, $ftime) = (stat("click.save"))[7,9];   # stat before appending this hit
    $fsize ||= 0;                                      # defaults in case the file doesn't exist yet
    $ftime ||= 0;
    open(OUTF, ">>click.save") or die "click.save: $!";
    print OUTF "$ENV{REMOTE_ADDR}|$date|$timestamp\n";
    close(OUTF);
    if ($fsize > $maxsize || ($ftime + $maxtime) < time) {
        #### Check all buffered IPs against the database using one
        #### SQL 'WHERE ip IN ('111.111.111.111', ...)' to find which are unique
        #### Do a bulk insert in one call to the db
    }#if
I skipped a load of code so it would be easier to read. I've benchmarked it with use Benchmark; and it comes out about 10 times faster.
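To give a rough idea of the kind of bulk step I mean (the table and column names below are made-up placeholders, not my real schema):

    # Simplified sketch of the bulk check/insert step; 'clicks' and its
    # columns are placeholder names, and most error handling is trimmed.
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:tracker', 'user', 'password',
                           { RaiseError => 1 });

    open(INF, "<click.save") or die "click.save: $!";
    my @rows = map { chomp; [ split /\|/, $_, 3 ] } <INF>;
    close(INF);

    if (@rows) {
        # One query to find which buffered IPs are already in the database
        my @ips          = map { $_->[0] } @rows;
        my $placeholders = join ',', ('?') x @ips;
        my %known        = map { $_->[0] => 1 } @{ $dbh->selectall_arrayref(
            "SELECT ip FROM clicks WHERE ip IN ($placeholders)", undef, @ips) };

        # One bulk insert for the whole batch, flagging the unique hits
        my $values = join ',', ('(?,?,?,?)') x @rows;
        $dbh->do("INSERT INTO clicks (ip, date, ts, is_unique) VALUES $values",
                 undef, map { (@$_, $known{$_->[0]} ? 0 : 1) } @rows);
    }
    unlink("click.save");   # the real code would lock or rotate the buffer first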

What do you guys think? Comments and suggestions welcome.


Lyle

Replies are listed 'Best First'.
Re: Hit tracking optimization...
by kyle (Abbot) on Jan 24, 2008 at 01:26 UTC

    The problem you may have is when two or more of these all try to run at once. The usual solution to that is based on flock.

    Another way to deal with concurrency that I've used in the past is to write each item to a unique file in a particular directory. Then a single cron/batch job comes along later and processes each file, unlinking them as they're finished. The only race condition is if the batch processor tries to work on a file the writer hasn't finished writing yet. I'd avoid that by not working on any file that's less than (say) a minute old.
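    Untested, but the writer and the batch job could each be just a few lines ('spool/', the naming scheme and the one-minute cutoff are only examples):

        # Writer: one tiny file per hit, so there's nothing shared to lock
        use strict;
        use warnings;
        use Time::HiRes qw(time);

        my $file = sprintf 'spool/%.6f-%d', time, $$;   # timestamp + pid keeps names unique
        open my $out, '>', $file or die "$file: $!";
        print {$out} "$ENV{REMOTE_ADDR}|" . time() . "\n";
        close $out or die "close $file: $!";

        # Batch job (cron): skip anything under a minute old, process, then unlink
        for my $f (glob 'spool/*') {
            next if -M $f < 1 / 1440;          # -M is in days; 1/1440 of a day is one minute
            # ... check the IP and insert into the database here ...
            unlink $f or warn "unlink $f: $!";
        }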

      You can also address that using the method that qmail uses: create a tmp file in another directory on the same device, finish writing it, hard link it into your work directory (using the inode number as the file name) on completion, and unlink it from the tmp directory.
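      Roughly like this ('tmp/' and 'new/' are example directory names; both must be on the same filesystem):

          # Sketch of the qmail-style handoff: write in tmp/, then hard link
          # into new/ under the inode number once the file is complete
          use strict;
          use warnings;

          my $tmp = "tmp/$$-" . time;
          open my $out, '>', $tmp or die "$tmp: $!";
          print {$out} "$ENV{REMOTE_ADDR}|" . time . "\n";
          close $out or die "close: $!";

          my $inode = (stat $tmp)[1];                  # inode number becomes the final name
          link $tmp, "new/$inode" or die "link: $!";   # the file appears in new/ complete or not at all
          unlink $tmp or warn "unlink $tmp: $!";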

      --MidLifeXis

        I like the sound of that; it sounds simple and effective, and I've done something similar in the past for an SE script. I'll use flock for now, but if I find problems in the future I'll try this.
      I do flock in the version I'm using. Should have mentioned it. But thanks for the heads up. I should probably also mention that I do a seek after the flock, just in case the file changed while waiting for a lock:-
      flock(OUTF, 2); seek(OUTF, 0, 2);   # LOCK_EX, then seek to the (possibly new) end of file
Re: Hit tracking optimization...
by chromatic (Archbishop) on Jan 24, 2008 at 02:51 UTC

    Call me naive, but do you have an index on the IP address column in your database?
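    If not, adding one is a one-liner (guessing at your table and column names):

        # Guessing at the schema: a 'clicks' table with an 'ip' column
        use DBI;
        my $dbh = DBI->connect('dbi:mysql:tracker', 'user', 'password', { RaiseError => 1 });
        $dbh->do('CREATE INDEX idx_clicks_ip ON clicks (ip)');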

      Call ME lacking in MySQL knowledge ;) Thanks for the pointer
Re: Hit tracking optimization...
by moritz (Cardinal) on Jan 24, 2008 at 08:03 UTC
    If you want to cache something, look at memcached - that's much faster than a file, because it keeps the data in RAM.
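    With Cache::Memcached the uniqueness check is only a couple of lines; the server address and expiry time below are just examples:

        use Cache::Memcached;

        my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

        # add() only succeeds if the key doesn't already exist, so it doubles
        # as an atomic "have we seen this IP recently?" check
        my $is_unique = $memd->add("ip:$ENV{REMOTE_ADDR}", 1, 24 * 60 * 60);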
Re: Hit tracking optimization...
by roboticus (Chancellor) on Jan 24, 2008 at 13:50 UTC
    Another way to help your process is to use a hash to accumulate all the IP addresses, and each <unit of time> swap the hash with a blank one. Then, while you're accumulating into the new hash, you can bulk-load the old list of IP addresses into a temporary table and integrate that table into your production table.
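    In rough outline, assuming a persistent process (mod_perl or a daemon) so the hash survives between hits:

        my %pending;               # IPs seen since the last flush
        my $last_flush = time;
        my $interval   = 30;       # seconds between bulk loads (pick to taste)

        sub record_hit {
            my ($ip) = @_;
            $pending{$ip}++;
            flush_pending() if time - $last_flush >= $interval;
        }

        sub flush_pending {
            my %batch   = %pending;    # grab the accumulated IPs...
            %pending    = ();          # ...and start accumulating into a blank hash
            $last_flush = time;
            # bulk-load the keys of %batch into a temporary table here, then
            # merge that table into the production table in one statement
        }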

    ...roboticus

Re: Hit tracking optimization...
by BrowserUk (Patriarch) on Jan 24, 2008 at 19:25 UTC

    If you want really fast lookups, are only dealing with IPv4 addresses, and have half a gig of disk space to spare, then you can have them run at 22,000/second for IPs that have already been seen and 120/second for those that haven't.

    This works by using a single 512MB binary file that uses 1 bit to represent each of the ~4 billion possible IPv4 addresses. Write contention only occurs if two or more new IPs that fall within the same byte (an 8-address range) connect at exactly the same time. This is probably a very rare occurrence and can be dealt with by a re-read, back-off and retry scheme (think Nagle).

    However, the contention risk can be reduced to zero if you are using an OS that supports byte-range file locking. This slows things down a little, but not hugely.
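    The core of each lookup is just a seek, a one-byte read and maybe a one-byte write. A rough, untested sketch of the idea (the filename is arbitrary, and it uses a whole-file flock rather than byte-range locking):

        use strict;
        use warnings;
        use Socket qw(inet_aton);
        use Fcntl qw(:flock O_RDWR O_CREAT);

        sub seen_before {
            my ($ip) = @_;
            my $n      = unpack 'N', inet_aton($ip);   # IPv4 address as a 32-bit integer
            my $offset = $n >> 3;                      # byte holding this IP's bit
            my $mask   = 1 << ($n & 7);                # bit within that byte

            sysopen my $fh, 'seen_ips.bit', O_RDWR | O_CREAT or die "seen_ips.bit: $!";
            flock $fh, LOCK_EX or die "flock: $!";

            my $byte = "\0";
            sysseek $fh, $offset, 0 or die "seek: $!";
            sysread($fh, $byte, 1) or $byte = "\0";    # past EOF reads nothing: treat as unseen

            my $seen = ord($byte) & $mask;
            unless ($seen) {                           # first sighting: set the bit and write it back
                sysseek  $fh, $offset, 0 or die "seek: $!";
                syswrite $fh, chr(ord($byte) | $mask), 1 or die "write: $!";
            }
            close $fh;
            return $seen ? 1 : 0;
        }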


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      It's worth pointing out that if your system supports sparse files, the disk space requirement drops considerably. I added 1e6 random hits and the on-disk usage was only 64 MB.
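      Creating the file sparse in the first place is also trivial (same made-up filename as the sketch above):

          # Seek past the end and write one byte; filesystems with sparse-file
          # support only allocate the blocks that are actually written to
          open my $fh, '>', 'seen_ips.bit' or die $!;
          seek $fh, 2**29 - 1, 0;   # last byte of the 512MB range
          print {$fh} "\0";
          close $fh;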

