Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to pipe a simple Apache log of IP addresses to a Perl program for further processing. In practice it should record each IP once in a hash tied to an SDBM_File. If the IP is new it forks with Proc::Fork for a DNS lookup using `host` and a possible whois lookup with Net::Whois::IANA. If the host or whois info matches my $problem a mask is generated with Net::Netmask so `iptables` can drop packets from that entire subnet.

This code runs and produces no visible errors, but the SDBM_File never gets bigger than 0 bytes after being created, even in /tmp with plenty of free disk. Permission to write the files exists, but this code runs as root anyway, via Apache. I also noticed it takes quite a bit of CPU from time to time, and eventually Apache stops responding. What am I doing wrong?

In httpd.conf:

LogFormat "%a" problem CustomLog "|/path/to/this/script.pl" problem
The CustomLog target:
#!/usr/bin/perl -w $|=1; use strict; use Fcntl qw(:DEFAULT); use Net::Netmask; use Net::Whois::IANA; use Proc::Fork; use SDBM_File; my $iana = Net::Whois::IANA->new; my $bye = join '|', ( 'strings known to occur', 'in custname and orgname', ); tie (my %seen, 'SDBM_File', '/some/sdbm', O_RDWR|O_CREAT, 0666) or die "$!"; open my $fh, "< /dev/stdin" or die "$!"; while (<$fh>) { chomp; next if exists $seen{$_}; $seen{$_} = 1; my $ip = $_; child { my $problem = 0; my $host = `host $ip`; $problem = 1 if $host =~ /some host name regex/; if ($problem or $host =~ /not found/) { $iana->whois_query(-ip=>$ip); my $inetnum = $iana->inetnum || ''; my $custname = $iana->{QUERY}->{custname} || ''; my $orgname = $iana->{QUERY}->{orgname} || ''; next unless $inetnum and ($custname or $orgname); if ($custname =~ /$bye/i or $orgname =~ /$bye/i){ my $mask = Net::Netmask->new($inetnum); `iptables -A INPUT -s $mask -j DROP`; } } } } close $fh or die "$!"; untie %seen;

I have a feeling, even if this worked, there's a better way to do it. Maybe something in mod_perl or an existing Apache module? Meanwhile I can't figure out why this SDMB_File doesn't grow past 0 bytes. Thanks for your consideration.

Replies are listed 'Best First'.
Re: Apache log piped to Perl
by gellyfish (Monsignor) on Aug 07, 2006 at 21:19 UTC

    I'm not quite sure why you are opening /dev/stdin when you can just read from the STDIN filehandle.

    /J\

      Thank you gellyfish. I was reading /dev/stdin instead of STDIN because that part is pure cargo from my inspiration at ApacheLogsWithoutIPs. After fixing that and learning from a horde of zombies to call "exit" instead of "next" from inside the child to continue looping in the parent, it seems to function as intended and very efficiently.
      #!/usr/bin/perl -w $SIG{CHLD} = 'IGNORE'; use strict; use Fcntl qw(:DEFAULT); use Net::Netmask; use Net::Whois::IANA; use Proc::Fork; use SDBM_File; my $iana = Net::Whois::IANA->new; my $bye = join '|', ( 'strings known to occur', 'in custname and orgname', ); my $badhosts = 'some\.bad|other\.worse\d+'; tie (my %seen, 'SDBM_File', '/some/sdbm', O_RDWR|O_CREAT, 0666) or die "$!"; while (<STDIN>) { chomp; next if exists $seen{$_}; $seen{$_} = 1; my $ip = $_; child { my $problem = 0; my $host = `host $ip` or die "$!"; $problem = 1 if $host =~ /$badhosts/i; if ($problem or $host =~ /not found/) { $iana->whois_query(-ip=>$ip); my $inetnum = $iana->inetnum || ''; my $custname = $iana->{QUERY}->{custname} || ''; my $orgname = $iana->{QUERY}->{orgname} || ''; exit unless $inetnum and ($problem or $custname or $orgname); if ($problem or $custname =~ /$bye/i or $orgname =~ /$bye/i){ my $mask = Net::Netmask->new($inetnum); `iptables -A INPUT -s $mask -j DROP`; } } exit }; } untie %seen;
      ...or better yet, use the handy, automatic <>


      --isotope
Re: Apache log piped to Perl
by pileofrogs (Priest) on Aug 08, 2006 at 16:44 UTC

    This is a totally wild guess, so please feel free to throw rocks at my head if I'm wrong.

    Is it OK to open a file and then write to it from forked children and then close it from the parent? It seems to me that you'd get all kinds of horrible problems with multiple children trying to write at the same time... Maybe SDBM has a locking system and the parent process holds the lock? That would cause what you're seeing.

    I think you need some kind of IPC to organize everyone so they don't step on eachother while writing to the data base file.

    If I were doing this, first I'd seriously reconsider the value of forking off those children. If this were only one process without the forking it would be way easier. If I decided that I really needed that higher performance, I'd use threads with queues (Thread::Queue). I don't know what the equivalent to thread queues in a forking system is, but maybe that'd be easier.

    Threads aren't really any harder than forking, in my experience.

    See: perlthrtut, perlipc, Thread::Queue

      Is it OK to open a file and then write to it from forked children and then close it from the parent?

      That's a good question. My instinct is not to touch the hash tied by the parent from inside children in any way. I don't fork until the parent is done accessing %seen, which is ultimately untied from the parent as well. It seems logical to divide the labor by having the parent alone handle a very fast and reliable tied hash while forking off slower DNS queries to children who may add firewall rules based on the results.

      I'd seriously reconsider the value of forking off those children.

      Something like it has to be done or else a slow function in that while loop prevents Apache from serving requests. So far I'm seeing very few children in "top" on 1 second delay on a busy webserver, one or two at most, with no noticable slowdown.

Re: Apache log piped to Perl
by gcalexander (Novice) on Aug 09, 2006 at 02:47 UTC
    I don't think its a good idea to have apache forking and doing crazy things.

    I'd consider de-coupling this from the apache log handler. Set up a CustomLog to a file and write/borrow:) a log tail-er(perldoc -f seek), which runs from inittab (or a daemon), that processes log entries without intefering with Apache. This has the added benefit, of serialising the process (no fork required), and you won't hit process/resource limits when your webserver gets slammed.

    That aside, I don't see why the above script shouldn't work. Are you defining the CustomLog in the server config, or a VirtualHost section? If you set the CustomLog to a file, does anything get written to it?

      I'd consider de-coupling this from the apache log handler.

      That makes so much sense I rewrote it immediately using File::Tail and Proc::Daemon. By the way this piped version works, but probably won't scale so well as you generously pointed out. Thank you.

      #!/usr/bin/perl -w use strict; use Fcntl qw(:DEFAULT); use File::Tail; use Net::Netmask; use Net::Whois::IANA; use Proc::Daemon; use SDBM_File; Proc::Daemon::Init; my $bye = join '|', ( 'strings known to occur', 'in custname and orgname', ); my $bad = 'some\.bad\.host|other\.worse\d+'; my $file = File::Tail->new("/path/to/ip_log"); my $iana = Net::Whois::IANA->new; tie (my %seen, 'SDBM_File', '/some/sdbm', O_RDWR|O_CREAT, 0666) or die "$!"; while (defined (my $ip = $file->read)) { chomp $ip; next if exists $seen{$ip}; $seen{$ip} = 1; my $problem = 0; my $host = `host $ip` or next; $problem = 1 if $host =~ /$badhosts/i; if ($problem or $host =~ /(not found|timed out)/) { $iana->whois_query(-ip=>$ip); my $inetnum = $iana->inetnum || ''; my $custname = $iana->{QUERY}->{custname} || ''; my $orgname = $iana->{QUERY}->{orgname} || ''; next unless $inetnum and ($problem or $custname or $orgname); if ($custname =~ /($bye)/i or $orgname =~ /($bye)/i){ my $mask = Net::Netmask->new($inetnum); `iptables -A INPUT -s $mask -j DROP`; } } } untie %seen;