Fortificus has asked for the wisdom of the Perl Monks concerning the following question:

Good morning Monks, I have created a small subroutine to obtain the country from a GeoIP database (below). If I run the query alone (with explain) the response is that it is using its index (Start_int) and the response is 0.00 sec (second paste). The issue that I am getting is that on large scale (over one million records), the response is so slow that it will fail to finish in one day. I put a timer on the executing of 100 entries of the logs (Cisco ASA logs) and when the subroutine is off, the execution time for 100 entries is around 0.03x seconds but as soon as I execute the subroutine, it goes to 1.x to 3.x seconds, which explains why on larger loads it cannot finish. I have added indexes, tweeked the query (using ranges or "between") with no luck. I also tried MyISAM and InnoDB (to try to cache the full table in memory) with no luck either. Please notice that a hash was added to minimize the number of queries, but it does not seem to make a major difference. RedHat 6.6 with latest patches installed. Any advice would be welcome!
sub Obtain_GeoIP { $GeoIP_resolved++; $Scountry=""; $Dcountry=""; my ($IP) = @_; if($IP=~/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/ &&(($1< +=255 && $2<=255 && $3<=255 &&$4<=255 ))) { my $sip = 0; my @sip = split(/\./,$IP); if (($sip[0] == "10") || ($sip[0] == "192" && $sip[1] == "168" +) || (($sip[0] == "172" && $sip[1] >= "16") && ($sip[0] == "172" && $ +sip[1] <= "31"))) { return "Internal Network"; } else { if ( exists $geoip_hash{IP} ) { my $country_hash = $ge +oip_hash{IP}; return $country_hash; } my $lip = ( ($sip[0]*($n * $n * $n))+($sip[1]*($n * $n +))+($sip[2] * $n) + ($sip[3]) ); # my $obtain_country = $dba->prepare("select Country f +rom GeoIP where ('$lip' > Start_int) and ('$lip' < End_int) limit 1") +; my $obtain_country = $dba->prepare("select Country fro +m GeoIP where $lip between Start_int and End_int limit 1"); $obtain_country->execute(); my @country = $obtain_country->fetchrow(); $obtain_country->finish; chomp @country; $geoip_hash{$IP} = $country[0]; return $country[0]; } } else { print EFILE "IP address sent to Obtain_GeoIP routine w +as incorrect: $IP - $counter\n"; } } mysql> explain select Country from GeoIP where ('2345565466' > Start_i +nt) and ('2345565466' < End_int) limit 1; +----+-------------+-------+-------+-------------------+-----------+-- +-------+------+-------+-------------+ | id | select_type | table | type | possible_keys | key | k +ey_len | ref | rows | Extra | +----+-------------+-------+-------+-------------------+-----------+-- +-------+------+-------+-------------+ | 1 | SIMPLE | GeoIP | range | Start_int,End_int | Start_int | 5 + | NULL | 40541 | Using where | +----+-------------+-------+-------+-------------------+-----------+-- +-------+------+-------+-------------+ 1 row in set (0.00 sec) mysql> desc GeoIP; +------------------+------------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +------------------+------------------+------+-----+---------+-------+ | Start_int | int(10) unsigned | YES | MUL | NULL | | | End_int | int(10) unsigned | YES | MUL | NULL | | | Country_Initials | varchar(3) | YES | | NULL | | | Country | varchar(30) | YES | | NULL | | +------------------+------------------+------+-----+---------+-------+ 4 rows in set (0.01 sec)

Replies are listed 'Best First'.
Re: Slow response on DBI MySQL query
by Corion (Patriarch) on Mar 24, 2015 at 13:31 UTC

    There are many things you can cache to make executing your subroutine faster.

    For example, you can cache the statement handle $obtain_country to fetch a single country:

    use vars '$obtain_country'; sub Obtain_GeoIP { ... $obtain_country ||= $obtain_country = $dba->prepare("select Countr +y from GeoIP where $lip between Start_int and End_int limit 1"); };

    You could also restructure the query and instead of issuing a query for each IP address, insert all IP addresses into a table and then query the countries for them through SQL.

    I'm not sure if a table is the best data structure for a range query and personally, I would just use the Geo::IP module.

      Corion, Thank you for the response! I am unable to do a single query as there is a lot more processing being done on the rest of the script. Also, I looked at Geo::IP, but the site where the database needs to be downloaded seems down. Are you using it? Any pointers into that module?

        Have you tried to install the module via cpan?

        I just now installed the module and that installation process also included the database without me needing to do anything else.

        unable to do a single query as there is a lot more processing being done

        One does not preclude the other. Retrieving multiple responses in one batch allows the script to work on all of them as well. If the DB needs to do even more work, perhaps the RDBMS should share more of the load, for example, an SP or more intelligent query be written.

Re: Slow response on DBI MySQL query
by chacham (Prior) on Mar 24, 2015 at 13:51 UTC

    "select Country from GeoIP where ('$lip' > Start_int) and ('$lip' < End_int) limit 1");

    Aside from adding insecurity with dynamic SQL, there's little benefit from possible RDBMS caching. That is, preparing a query that uses palceholders allows the RDBMS to create a PLAN that (it thinks) is best for varying arguments. Also, rerunning the query does not require a new prepare.

    And, large loads can be done with a bulk collect.

    '$lip' > Start_int) and ('$lip' < End_int)

    BETWEEN is inclusive. So a BETWEEN b and c means a >= b AND a <= c. So, there's a difference between (no pun intended) the two statements.

      Great point in the between query! It should be a >= b AND a <= c and not a>b and a<c Sorry for the lack of knowledge, but when you say that it does not require a new prepare, how would that work? How would I need to change the query so that $lip is updated? Again, sorry for my lack of knowledge.

        sorry for my lack of knowledge.

        No reason to be sorry about not knowing. Ignorance is not an excuse (to receive pardon) though it is an excellent reason to ask a question!

        Change $lip to a question mark: WHERE ? BETWEEN ... Then, when execute()ing the query, pass the parameter. If in a loop to run multiple times, for example, the prepare would exist before the loop, the execute (and fetch) within it.

Re: Slow response on DBI MySQL query
by fishmonger (Chaplain) on Mar 24, 2015 at 14:37 UTC

    Besides the Gep::IP module, there is also the IP::Country module for "fast lookup of country codes from IP addresses".

    I have not used either of them, but they both will do what you need.

    I would also use the is_ipv4() and is_private_ipv4() functions from the Data::Validate::IP module to handle the validation of the IP.

    Also, instead of manually converting the IP to an int, I'd use the built-in mysql function inet_aton() in the query.

      Fishmonger, lots of really good recommendations! It will take me a little bit to change the script with all of them, and then I will come back to post the speed improvement of the recommendations. Thank you very much!
        Monks, after doing all recommended changes I was able to process 3,000 records per second (for a total of 259,200,000 records a day). All changes had some positive effect, but the one that had the most dramatic effect was moving to Geo::IP. As it was mentioned on the thread by a few of you guys, the relational database was certainly not the correct place to do this operation. The finalized code looks like this now:
        sub Obtain_GeoIP { $GeoIP_resolved++; my ($IP) = @_; if($IP=~/^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/ &&(($1< +=255 && $2<=255 && $3<=255 &&$4<=255 ))) { my @sip = split(/\./,$IP); if (($sip[0] == "10") || ($sip[0] == "192" && $sip[1] == "168" +) || (($sip[0] == "172" && $sip[1] >= "16") && ($sip[0] == "172" && $ +sip[1] <= "31"))) { return "Internal Network"; } else { if ( exists $geoip_hash{IP} ) { my $country_hash = $ge +oip_hash{IP}; return $country_hash; } my $country = $gi->country_name_by_addr($IP); $geoip_hash{$IP} = $country; return $country; } } else { print EFILE "IP address sent to Obtain_GeoIP routine w +as incorrect: $IP - $counter\n"; } }
        Thank you very much for everybody's help!