TiffanyButterfly has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

This is a general perl question regarding execution speed.

I wrote a bash script a short while ago (more as a proof-of-concept) that I use as part of a custom IDS for my webserver. The script works so well that it is now my main IDS. But parts of it are slow. Noticeably, a loop that performs a single-pass scan over a file containing a list of IPv4 addresses.

This list presently has 1,473 entries and the related bash script function takes about 18 seconds to process on my webserver. My fastest workstation can process it in about 5 seconds. But I suspect perl could do this even quicker.

I'm hoping someone can tell me if this would be the case. The function does very simple integer comparisons between consecutive lines, looking for matches in the first 3 octets.

I've never programmed in perl but would be prepared to put in the effort to learn it if I knew that 'the end justifies the means'... :)

All opinions are welcome. Thank you for your wisdom.

Re: perl quicker than bash?
by johngg (Canon) on Jan 06, 2015 at 00:00 UTC

    Just taking a guess at what you are doing, I created a file of 1500 random IP-like addresses

    $ perl -E 'say join q{.}, map { int rand 256 } 1 .. 4 for 1 .. 1500;' > spw1112250.dat

    I then put a script together to read the file and construct a hash keyed on the first three octets then look up another octet to see if it occurred in the data file. I timed the process in the script which took milliseconds rather than seconds on a fairly old laptop.

    use strict;
    use warnings;
    use 5.014;
    use Time::HiRes qw{ gettimeofday tv_interval };

    my $startTV = [ gettimeofday() ];

    my $inFile = q{spw1112250.dat};
    open my $inFH, q{<}, $inFile or die qq{open: < $inFile: $!\n};
    my @lines = <$inFH>;
    close $inFH or die qq{close: < $inFile: $!\n};

    chomp @lines;           # strip newlines so the hash keys are the bare octets
    s{\.\d+$}{} for @lines; # drop the 4th octet, leaving the first three

    my %lookupIPs;
    @lookupIPs{ @lines } = ( 1 ) x @lines;

    my $lookFor = q{17.23.213};
    say qq{$lookFor }, $lookupIPs{ $lookFor } ? q{} : q{not }, qq{found in $inFile};
    say qq{Process took @{ [ tv_interval( $startTV, [ gettimeofday() ] ) ] } seconds};

    The output.

    17.23.213 not found in spw1112250.dat
    Process took 0.002678 seconds

    Of course, your integer comparison may be more complex than this simple lookup. Perhaps you could show us exactly what your comparison does.

    Cheers,

    JohnGG

Re: perl quicker than bash?
by TiffanyButterfly (Novice) on Jan 06, 2015 at 04:23 UTC

    Hi and thank you everyone for all your great suggestions! :)

    It sounds like perl is definitely quicker than what I'm doing now.

    As a few have asked to see the current code, I'll include it here (be nice...)

    It's part of my system for blocking IP addresses when rather dodgy requests come from them (looking for login and admin pages for example). It's been stripped right back to only the slow function and some supporting bits. I think I've managed to remove everything else around it that isn't directly related. It's not pretty but it's easy to read.

    A short explanation (which I'm sure you lot won't really need) is as follows:

    A list of IP addresses is compiled into a file (for this example it's called input.list). The list is sorted with a key on each octet and duplicate entries removed.
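    For reference, the sort/dedupe step described above could be done with sort(1) alone, one numeric key per octet and -u to drop duplicates (a guess at the pre-processing, not necessarily the OP's exact command):

```shell
# Hypothetical reconstruction of the sort/dedupe step (not the OP's command):
# numeric sort on each dot-separated octet, -u removes duplicate lines.
printf '2.3.2.1\n1.2.3.4\n1.2.3.4\n1.2.3.6\n' |
sort -t. -k1,1n -k2,2n -k3,3n -k4,4n -u
# -> 1.2.3.4
#    1.2.3.6
#    2.3.2.1
```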

    The script shown below then reads input.list line-by-line.

    The output goes into output.list

    The basic concept here is to condense the input list by applying a /24 CIDR mask where possible. This is done whenever there are 2 or more entries whose first 3 octets are the same. So, instead of single IP addresses being blocked, the whole block of 256 will be blocked: if the first 3 octets match, the 4th octet is replaced with the mask.

    e.g. If we have these 7 lines in the input:

    1.2.3.0/24
    1.2.3.4
    1.2.3.6
    1.4.3.5
    2.3.1.2
    2.3.2.1
    2.3.2.10
    the output will be these 4 lines:
    1.2.3.0/24
    1.4.3.5
    2.3.1.2
    2.3.2.0/24

    Just to make it more difficult, each line also has a tab character and then a comment after the IP address (the comment records the reason my IDS chose to block that IP). The comment is enclosed in C-style comment delimiters (/* ... */). This comment also has to be carried over to the output file.

    So each input line actually looks like this:

    1.161.169.75 /* wp-login.php */
    1.168.230.73 /* wp-login.php */
    1.174.218.109 /* wp-login.php */
    1.192.128.23 /* /manager/ */
    1.214.212.74 /* .cgi */
    1.234.20.151 /* ZmEu */
    1.249.203.135 /* .cgi */
    2.61.137.117 /* wp-login.php */
    2.77.94.236 /* wp-login.php */
    2.90.252.253 /* wp-login.php */
    2.139.237.110 /* /manager/ */
    2.176.166.94 /* wp-login.php */
    2.180.21.24 /* wp-login.php */
    2.182.209.107 /* wp-login.php */
    2.187.171.182 /* wp-login.php */
    2.229.27.202 /* /manager/ */
    2.237.24.187 /* wp-login.php */
    5.9.136.55 /* SlowLoris */
    5.20.156.72 /* SlowLoris */
    5.34.57.96 /* GET /?author= */

    A complete list can be seen here

    And this is the slow script:

    #!/bin/bash

    debug=false
    #debug=true

    function Init {
        # constants
        cidr_input_file="./input.list"
        cidr_output_file="./output.list"
        default_maskvalue=24

        # variables
        prev_first3=""
        prev_octet4=""
        prev_comment=""
        match_found=false
        prev_match_found=false
        prev_has_been_written=false
        current_maskvalue=0
        active_maskvalue=0
        exitcode=0
    }

    function CondenseOnThreeOctets {
        $verbose && echo -n "["$(date)"] -- applying CIDR mask where possible ..."

        # delete and create a new output file
        rm -f "${cidr_output_file}" && touch "${cidr_output_file}"

        while read line ; do
            if [ ! -z "$line" ] ; then # ignore empty lines
                if [[ $line != \#* ]] ; then # ignore lines that begin with a # character

                    # only take first word on each line as IP address
                    current_ip=$( cut -f1 <<< "${line}" )

                    # take everything after /* as a comment but only if it exists
                    [[ ${line} == */\** ]] && comment=$( sed 's|^.*/\*|/\*|' <<< "${line}" ) || comment=""

                    while IFS=. read octet1 octet2 octet3 octet4 ; do
                        $debug && echo
                        first3="${octet1}.${octet2}.${octet3}"
                        $debug && echo "-- now checking - IP entry: ${first3}.${octet4}"

                        if [ -z "$prev_first3" ] ; then
                            # first time through the loop - no previous values have been saved yet.
                            SaveThisIpAsPrev
                        else
                            if [ "$first3" = "$prev_first3" ] ; then
                                # if here then first 3 octets matched so we can combine this IP with the previous IP
                                match_found=true
                                SaveThisIpAsPrev
                            else
                                # if here then first3 octets are different so it's OK to save the previous IP
                                match_found=false
                                WriteIpAsCIDR
                            fi
                        fi
                    done <<< "${current_ip}"
                fi
            fi
        done < "${cidr_input_file}"

        # write out last IP
        WriteIpAsCIDR

        $verbose && echo " done!"
    }

    function CalcMasks {
        [[ $active_maskvalue -eq 0 ]] && active_maskvalue=${default_maskvalue}
        if [[ $octet4 == *"0/"* ]] ; then
            current_maskvalue=${octet4:2}
            [[ $current_maskvalue -lt $active_maskvalue ]] && active_maskvalue=${current_maskvalue}
        else
            current_maskvalue=0
        fi
    }

    function SaveThisIpAsPrev {
        $debug && echo "-- saving current IP as previous IP"
        CalcMasks
        prev_first3=$first3
        prev_octet4=$octet4
        prev_comment="$comment"
        prev_match_found=$match_found
        prev_has_been_written=false
    }

    function WriteIpAsCIDR {
        if [ ! -z "$prev_first3" ] ; then
            if ! $prev_has_been_written ; then
                if $prev_match_found ; then
                    buildline="$prev_first3.0/$active_maskvalue\t$prev_comment"
                else
                    buildline="$prev_first3.$prev_octet4\t$prev_comment"
                fi
                echo -e "${buildline}" >> "${cidr_output_file}"
                prev_has_been_written=true
                active_maskvalue=0
            fi
        fi
        SaveThisIpAsPrev
    }

    echo "["$(date)"] >> started [$0@$HOSTNAME]"
    Init
    CondenseOnThreeOctets
    echo "["$(date)"] << finished [$0@$HOSTNAME]"
    exit $exitcode
    Hopefully I've included all relevant information. Please let me know if there's something I missed.

    Thanks everyone. :)

    update: just replaced that previous code with a newer one in response to an idea I saw in Anonymous Monk's solution below (namely: perform a single search for first 3 octets). And it will now correctly parse more restrictive masks such as 0/21. So, it's gone from 18 seconds to 17 seconds. lol...

      Yes, you can totally do that in Perl, and with much better performance. Here's the gist of how it can be done:
      use strict;
      use warnings;

      my ( %masked, @results );
      while ( my $line = <STDIN> ) {
          my ($ip) = $line =~ / (\d+ \. \d+ \. \d+ \. \d+) /x
              or next;
          my $mask = pack 'C3', split /\./, $ip;
          if ( $masked{$mask} ) {
              if ( not $masked{$mask}{repeat} ) {
                  $masked{$mask}{ip} =~ s{ \d+ \z }{0/24}x;
                  $masked{$mask}{repeat} = 1;
              }
          }
          else {
              $masked{$mask} = { ip => $ip };
              push @results, $masked{$mask};
          }
      }
      print $_->{ip}, "\n" for @results;
      output:
      1.2.3.0/24
      1.4.3.5
      2.3.1.2
      2.3.2.0/24
      This way you also don't need to depend on the ordering of the source file (works with any order). As for dealing with comments and different cidrs I'll "leave it as an exercise for the reader" :)
        Okay, that was a different anonymonk explaining pack and recommending Modern Perl, but now it's me again :) I had nothing better to do and decided to actually write this thing for you, seeing your enthusiasm (to give you a taste of Perl). Here it goes. The program accepts command line options '-i' and '-o', meaning input and output. Otherwise it operates on stdin and stdout. Usage:
        $ perl squeeze_ips.pl -i input.list -o output.list
        It preserves biggest found cidr and the first encountered comment (changing it to the last found comment is easy enough). Processing a file of one million ips takes about 15 seconds on my laptop.

        Wow! That is so much shorter than what I wrote... lol

        I've been reading it through and re-reading... parts of the code I can understand but others are completely new to me (pack, push, my). The bits that I can understand also show how I really could have written that bash script better. :)

        I'll keep researching the bits that I don't understand and let the learning begin!

        Thank you for your great solution!

Re: perl quicker than bash?
by Anonymous Monk on Jan 05, 2015 at 23:29 UTC

    If you know shell scripting and do a significant amount of it, then it's definitely worth it to learn Perl :-)

    Speaking about performance in general is always very difficult, but yes, Perl can often be faster than shell scripts, one of the reasons being that Perl has a lot of functionality built in that shell scripts only support by executing external processes.

    Could you show us the bash script? Having a real test case to benchmark would really help.

    Without knowing more, processing 1500 records in 18 seconds sounds slow. Is the script doing any kind of network activity that could be the bottleneck?

Re: perl quicker than bash?
by davido (Cardinal) on Jan 05, 2015 at 23:35 UTC

    Maybe you could post the relevant portion of the bash script, an explanation of what it does, and a sample of typical inputs and sample output. Then we might help come up with a Perl solution, and at that point let the benchmarking begin. ;)


    Dave

Re: perl quicker than bash?
by Anonymous Monk on Jan 05, 2015 at 23:26 UTC

    But I suspect perl could do this even quicker.

    Yup, under one second on a really really old laptop to both generate a file of IPs and then read and compare them

    $ perl -le " for( 1 .. 10 ){ printf qq{%d.%d.%d.%d\n}, $_, $_, $_, $_ for 1..255; } warn time-$^T; " > 2
    0.265625 at -e line 1.
    $ perl -lne " $tri = substr $_, 0, 3; if( $tri > 255 ){ die } END{ warn time-$^T; } " 2
    0.359375 at -e line 1, <> line 2550.