Bod has asked for the wisdom of the Perl Monks concerning the following question:

We want to supplement Google Analytics for a few reasons, not least because we want site traffic information held in our own database so we can interrogate it automagically. We've created a database table to hold this data.

Within the common header method, we've added some code that sets a cookie with a max age of 2 hours, or refreshes the cookie if it is already set. If the cookie isn't already there, we write a row to the database table with the entry time, entry page, etc. If the cookie exists, we update the row with the exit page and exit time and bump the page count.
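A rough sketch of that flow (not the production code - CGI::Cookie and DBI are assumed, the connection details are placeholders, the 'orsa' cookie name and most column names come from the code shown later, and lastVisit/pageCount are guesses at the "exit time"/"page count" columns):

use strict;
use warnings;
use CGI::Cookie;
use DBI;

# placeholder connection - the real code already has $dbh available
my $dbh = DBI->connect('dbi:mysql:database=site', 'user', 'password', { RaiseError => 1 });

my %cookies = CGI::Cookie->fetch;                      # cookies the client sent, if any
my $orsa    = $cookies{'orsa'} ? $cookies{'orsa'}->value : undef;

if (defined $orsa) {
    # cookie still alive: record the current exit page/time and bump the page count
    $dbh->do("UPDATE Site_Visit SET lastVisit = NOW(), lastPage = ?, pageCount = pageCount + 1
               WHERE orsa = ?",
             undef, $ENV{'REQUEST_URI'}, $orsa);
}
else {
    # no cookie: start a new visit row with the entry time and entry page
    $orsa = join '-', time, $$;
    $dbh->do("INSERT INTO Site_Visit SET firstVisit = NOW(), firstPage = ?, orsa = ?",
             undef, $ENV{'REQUEST_URI'}, $orsa);
}

# set or refresh the cookie so it expires two hours after the last page view
my $cookie = CGI::Cookie->new(-name => 'orsa', -value => $orsa, -expires => '+2h');
print "Set-Cookie: $cookie\n";                         # sent with the rest of the headers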

This approach is working and it's been running for a week.

But it is reporting site traffic about 11 times higher than Google Analytics. I'd expect some discrepancy, but not that much. Looking at the visits, we are getting quite a few with the same or very similar timestamps, so my best guess is that it's a client that isn't accepting the cookie - perhaps a web crawler. To check this out, I've added IP and User Agent to the database table and, sure enough, these visits have the user agent of a crawler/bot.

To solve this, I've added a condition to the statement that writes the new row to the database:

$dbh->do("INSERT INTO Site_Visit SET firstVisit = NOW(), lastPage = ?, + firstPage = ?, IP = ?, userAgent = ?, orsa = ?, orta = ?, Person_idP +erson = ?", undef, $ENV{'REQUEST_URI'}, $ENV{'REQUEST_URI'}, $ENV{'REMOTE_ADDR' +}, $ENV{'HTTP_USER_AGENT'}, $cookie{'orsa'}, $data{'orta'}, $user) unless $ENV{'HTTP_USER_AGENT'} =~ /bot/i or $ENV{'HTTP_USER_AGEN +T'} =~ /facebook/i or $ENV{'HTTP_USER_AGENT'} =~ /dataprovider/i;
This seems to be working...but...the list of 'blocked' user agent strings could get quite large.

Is there a more Perlish way to write this condition?

I did think of putting them all in a database table and querying the user agent string against that table:

SELECT ? IN ( SELECT userAgent FROM Blocked_Users )
untested

But, that would mean having the full and exact user agent strings instead of using a regexp.
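If the database is MySQL (which the INSERT ... SET syntax suggests), its REGEXP operator would let the table hold patterns rather than exact strings. An untested sketch of that lookup from DBI:

# untested: Blocked_Users.userAgent here holds regexp fragments
# such as 'bot', 'facebook', 'dataprovider' rather than full UA strings
my ($blocked) = $dbh->selectrow_array(
    "SELECT COUNT(*) FROM Blocked_Users WHERE ? REGEXP userAgent",
    undef, $ENV{'HTTP_USER_AGENT'}
);
# skip the INSERT when $blocked is true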

Note that I don't want to block crawlers, I just don't want them written to the site visit logs. This makes it quite difficult to Google because most articles are about blocking crawlers and bots from a website.

Re: Bot vs human User Agent strings
by hippo (Archbishop) on Feb 09, 2024 at 20:14 UTC

    It's pretty easy if you don't use GA but instead use your own logging. You can configure the webserver only to log requests from user agents you are interested in (or conversely omit those you are not). You can do this based on a regexp, see e.g. BrowserMatch for Apache.

    You can use any number of these to mark the user agents as you so wish and then log only the unmarked ones. This also has the bonus effect that you will still see requests even from users (such as I) who block GA.
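    For instance, a minimal httpd.conf sketch of that idea (BrowserMatchNoCase and CustomLog are the real directives; the marker name, pattern list and log path are just illustrative):

    # mark the user agents we don't want in the visit log...
    BrowserMatchNoCase "bot|facebook|dataprovider" dontlog
    # ...and log only the requests that were not marked
    CustomLog "logs/site_visits.log" combined env=!dontlog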


    🦛

      You can configure the webserver only to log requests from user agents you are interested in

      Oh yes - I'd overlooked doing this in Apache. Thanks for the suggestion.

      My first thought is that I'd still rather record the information in a DB table, as we want to be able to query it. A DB query will be much easier than parsing the Apache logs.

        Sure, but once you have the data in whatever form it shouldn't be too hard to import it into the DB of your choice. Remember that you can also set Apache to log to a pipe so you could load it into a DB without going to the local filesystem in the first place. Lots of options. :-)
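        A rough, untested sketch of the piped-log idea - the CustomLog directive and the DBI calls are real, but the script path, DSN and the crude log parsing are only illustrative:

        #!/usr/bin/perl
        # Apache side:  CustomLog "|/usr/local/bin/log2db.pl" combined
        use strict;
        use warnings;
        use DBI;

        # placeholder DSN and credentials
        my $dbh = DBI->connect('dbi:mysql:database=site', 'user', 'password',
                               { RaiseError => 1, AutoCommit => 1 });

        while (my $line = <STDIN>) {
            # crude parse of the combined log format: host, request, user agent
            my ($ip, $request, $agent) = $line =~
                m/^(\S+) \S+ \S+ \[[^\]]+\] "([^"]*)" \d{3} \S+ "[^"]*" "([^"]*)"/
                or next;

            next if $agent =~ /bot|facebook|dataprovider/i;   # same filter as the post

            my ($page) = $request =~ m{^\S+ (\S+)};           # path out of "GET /foo HTTP/1.1"

            $dbh->do("INSERT INTO Site_Visit SET firstVisit = NOW(), firstPage = ?, IP = ?, userAgent = ?",
                     undef, $page, $ip, $agent);
        }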


        🦛

Re: Bot vs human User Agent strings
by InfiniteSilence (Curate) on Feb 10, 2024 at 02:17 UTC

    Alternation?

    #!/usr/bin/perl -w
    use strict;

    my @bad = ();
    my ($f, $fh, $baddies, $sql, $badregex) = ('./badactors.txt', undef, undef, '', undef);

    open($fh, '<', $f) or die $!;
    while (<$fh>) {
        chomp;
        next if m/^$/;
        push @bad, $_;
    }
    close($fh);

    $baddies  = join qq{\|}, @bad;
    $badregex = qr~$baddies~;

    # easy test
    my $junk = 'doodle bot';
    if ($junk =~ m/$badregex/) {
        print qq~\nSee....?\n~;
    }
    1;
    __DATA__
    $dbh->do("INSERT INTO Site_Visit
              SET firstVisit = NOW(), lastPage = ?, firstPage = ?, IP = ?,
                  userAgent = ?, orsa = ?, orta = ?, Person_idPerson = ?",
             undef,
             $ENV{'REQUEST_URI'}, $ENV{'REQUEST_URI'}, $ENV{'REMOTE_ADDR'},
             $ENV{'HTTP_USER_AGENT'}, $cookie{'orsa'}, $data{'orta'}, $user)
        unless $ENV{'HTTP_USER_AGENT'} =~ /bot/i
            or $ENV{'HTTP_USER_AGENT'} =~ /facebook/i
            or $ENV{'HTTP_USER_AGENT'} =~ /dataprovider/i;

    Celebrate Intellectual Diversity

      Thanks...based on this and after sleeping on it, I've implemented the following:

      open my $fh, '<', "....data/UserAgents/block.dat";
      my @agent = <$fh>;
      close $fh;
      chomp @agent;

      my $invalid = grep { $ENV{'HTTP_USER_AGENT'} =~ /$_/i } @agent;
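      That flag then guards the original INSERT, along the lines of (the wiring isn't shown above, but this is the obvious shape):

      $dbh->do("INSERT INTO Site_Visit
                SET firstVisit = NOW(), lastPage = ?, firstPage = ?, IP = ?,
                    userAgent = ?, orsa = ?, orta = ?, Person_idPerson = ?",
               undef,
               $ENV{'REQUEST_URI'}, $ENV{'REQUEST_URI'}, $ENV{'REMOTE_ADDR'},
               $ENV{'HTTP_USER_AGENT'}, $cookie{'orsa'}, $data{'orta'}, $user)
          unless $invalid;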

      The solution from hippo prompted me to revisit and investigate what Apache can do in this situation. However, I decided to do it this way rather than in Apache because it keeps all the logic in the method that processes the page headers. That is where the session cookie is set, so it makes sense (to me) to keep this code there as well. I feel this should be easy to maintain and easy to find.

      I am a bit confused by a line in the above code:

      if ($junk =~ m/$badregex/) { print qq~\nSee....?\n~; }
      Isn't the m operator redundant here or is it doing something subtle that I have overlooked?

        Isn't the m operator redundant here

        Yes, you are correct. However, it does no harm. The explicit m mainly comes into play when you want a delimiter other than //, or when the regex needs disambiguating for the parser in circumstances where it might not be clear. This is not one of those circumstances AFAICT.
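        For example (an illustration, not code from the thread), a non-slash delimiter is only recognised with an explicit m, which is handy when the pattern itself contains slashes:

        # braces as delimiters require the m; useful for patterns containing '/'
        print "matched\n" if $ENV{'HTTP_USER_AGENT'} =~ m{facebookexternalhit/1\.1};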


        🦛