Re: perl quicker than bash?

Hi and thank you everyone for all your great suggestions! :)

It sounds like Perl is definitely quicker than what I'm doing now.

As a few have asked to see the current code, I'll include it here (be nice...)

It's part of my system for blocking IP addresses when rather dodgy requests come from them (looking for login and admin pages for example). It's been stripped right back to only the slow function and some supporting bits. I think I've managed to remove everything else around it that isn't directly related. It's not pretty but it's easy to read.

A short explanation (which I'm sure you lot won't really need) is as follows:

A list of IP addresses is compiled into a file (for this example it's called input.list). The list is sorted with a key on each octet and duplicate entries removed.
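
For anyone who wants to reproduce that step, here is a rough sketch of the sort-and-dedupe, assuming a `sort` that supports `-t`, `-k` and `-u` (GNU and BSD both do). The function name `sort_ip_list` and the file name `raw.list` are made up for illustration; they are not part of my actual system.

```shell
#!/bin/bash
# Hypothetical helper: sort "IP<TAB>comment" lines numerically on each octet
# and keep only one line per distinct IP.
sort_ip_list() {
    # -t.      split each line into fields on dots
    # -kN,Nn   compare field N numerically (so 2.x sorts before 10.x)
    # -u       keep only one of any lines whose four keys compare equal
    # LC_ALL=C avoids locale-dependent number parsing
    LC_ALL=C sort -t. -k1,1n -k2,2n -k3,3n -k4,4n -u "$@"
}
```

Usage would be something like `sort_ip_list raw.list > input.list`. One caveat: because `-u` compares only the numeric keys, `1.2.3.0` and `1.2.3.0/24` count as duplicates, and so do two lines for the same IP with different comments.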

The script shown below then reads input.list line-by-line.

The output goes into output.list

The basic concept is to condense the input list by using a CIDR mask (0/24) where possible. This is done whenever 2 or more entries share the same first 3 octets: the 4th octet is replaced with the mask, so instead of single IP addresses being blocked, the whole range of 256 addresses is blocked.

e.g. If we have these 7 lines in the input:

1.2.3.0/24
1.2.3.4
1.2.3.6
1.4.3.5
2.3.1.2
2.3.2.1
2.3.2.10

the output will be these 4 lines:

1.2.3.0/24
1.4.3.5
2.3.1.2
2.3.2.0/24
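
The rule above can be sketched in a few lines of pure bash. This is only a minimal illustration, not the script I actually run: it assumes the input is already sorted, it looks only at the IP field and drops any comment, and the function name `condense_sorted` is hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch: read sorted entries on stdin and collapse every run of
# two or more entries sharing the first three octets into "a.b.c.0/24".
condense_sorted() {
    local line ip prefix prev_prefix="" prev_ip="" repeated=false
    while IFS= read -r line || [[ -n $line ]]; do
        [[ -z $line || $line == \#* ]] && continue   # skip blank and # lines
        ip=${line%%[[:space:]]*}    # keep only the first word (the IP)
        prefix=${ip%.*}             # drop the 4th octet (and any /mask)
        if [[ $prefix == "$prev_prefix" ]]; then
            repeated=true           # same /24 as the previous entry
            continue
        fi
        # a new prefix starts here, so the previous group is complete
        if [[ -n $prev_prefix ]]; then
            if $repeated; then echo "${prev_prefix}.0/24"; else echo "$prev_ip"; fi
        fi
        prev_prefix=$prefix
        prev_ip=$ip
        repeated=false
    done
    # write out the last group
    if [[ -n $prev_prefix ]]; then
        if $repeated; then echo "${prev_prefix}.0/24"; else echo "$prev_ip"; fi
    fi
}
```

Everything here is done with parameter expansion, so no external command is started per line.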

Just to make it more difficult, each line also has a tab char then a comment after the IP address (this comment contains the reason that my IDS chose to block this IP). The comment is enclosed within the C comment structure. This comment also has to be carried over to the output file.

So each input line actually looks like this:

1.161.169.75    /* wp-login.php */
1.168.230.73    /* wp-login.php */
1.174.218.109    /* wp-login.php */
1.192.128.23    /* /manager/ */
1.214.212.74    /* .cgi */
1.234.20.151    /* ZmEu */
1.249.203.135    /* .cgi */
2.61.137.117    /* wp-login.php */
2.77.94.236    /* wp-login.php */
2.90.252.253    /* wp-login.php */
2.139.237.110    /* /manager/ */
2.176.166.94    /* wp-login.php */
2.180.21.24    /* wp-login.php */
2.182.209.107    /* wp-login.php */
2.187.171.182    /* wp-login.php */
2.229.27.202    /* /manager/ */
2.237.24.187    /* wp-login.php */
5.9.136.55    /* SlowLoris */
5.20.156.72    /* SlowLoris */
5.34.57.96    /* GET /?author= */

A complete list can be seen here
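
As a side note, a line in this format can be split into its IP and comment without calling any external program. The sketch below is just one way to do it, using bash parameter expansion; the function name `parse_entry` and the variables `entry_ip` and `entry_comment` are hypothetical names for illustration.

```shell
#!/bin/bash
# Hypothetical sketch: split one "IP<TAB>/* comment */" line into two
# global variables, entry_ip and entry_comment.
parse_entry() {
    local line=$1
    # everything before the first whitespace character is the IP (or IP/mask)
    entry_ip=${line%%[[:space:]]*}
    if [[ $line == *'/*'* ]]; then
        # strip everything up to the first "/*", then put the "/*" back
        entry_comment="/*${line#*'/*'}"
    else
        entry_comment=""            # no comment on this line
    fi
}
```

For example, after `parse_entry $'1.161.169.75\t/* wp-login.php */'` the variable `entry_ip` holds `1.161.169.75` and `entry_comment` holds `/* wp-login.php */`.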

And this is the slow script:

#!/bin/bash

debug=false
#debug=true
verbose=true        # print progress messages (used in CondenseOnThreeOctets)

function Init
    {

    # constants
    cidr_input_file="./input.list"
    cidr_output_file="./output.list"
    default_maskvalue=24

    # variables
    prev_first3=""
    prev_octet4=""
    prev_comment=""
    match_found=false
    prev_match_found=false
    prev_has_been_written=false
    current_maskvalue=0
    active_maskvalue=0
    exitcode=0

    }

function CondenseOnThreeOctets
    {

    $verbose && echo -n "["$(date)"] -- applying CIDR mask where possible ..."

    # delete and create a new output file
    rm -f "${cidr_output_file}" && touch "${cidr_output_file}"

    while read line ; do
        if [ ! -z "$line" ] ; then                # ignore empty lines
            if [[ $line != \#* ]] ; then        # ignore lines that begin with a # character
                # only take first word on each line as IP address
                current_ip=$( cut -f1 <<< "${line}" )

                # take everything after /* as a comment but only if it exists
                [[ ${line} == */\** ]] && comment=$( sed 's|^.*/\*|/\*|' <<< "${line}" ) || comment=""

                while IFS=. read octet1 octet2 octet3 octet4 ; do
                    $debug && echo

                    first3="${octet1}.${octet2}.${octet3}"
                    $debug && echo "-- now checking - IP entry: ${first3}.${octet4}"

                    if [ -z "$prev_first3" ] ; then
                        # first time through the loop - no previous values have been saved yet.
                        SaveThisIpAsPrev
                    else
                        if [ "$first3" = "$prev_first3" ] ; then
                            # if here then first 3 octets matched so we can combine this IP with the previous IP
                            match_found=true
                            SaveThisIpAsPrev
                        else
                            # if here then first3 octets are different so it's OK to save the previous IP
                            match_found=false
                            WriteIpAsCIDR
                        fi
                    fi
                done <<< "${current_ip}"
            fi
        fi
    done < "${cidr_input_file}"

    # write out last IP
    WriteIpAsCIDR

    $verbose && echo " done!"

    }

function CalcMasks
    {

    [[ $active_maskvalue -eq 0 ]] && active_maskvalue=${default_maskvalue}

    if [[ $octet4 == *"0/"* ]] ; then
        current_maskvalue=${octet4:2}
        [[ $current_maskvalue -lt $active_maskvalue ]] && active_maskvalue=${current_maskvalue}
    else
        current_maskvalue=0
    fi

    }

function SaveThisIpAsPrev
    {

    $debug && echo "-- saving current IP as previous IP"

    CalcMasks

    prev_first3=$first3
    prev_octet4=$octet4
    prev_comment="$comment"
    prev_match_found=$match_found
    prev_has_been_written=false

    }

function WriteIpAsCIDR
    {

    if [ ! -z "$prev_first3" ] ; then
        if ! $prev_has_been_written ; then
            if $prev_match_found ; then
                buildline="$prev_first3.0/$active_maskvalue\t$prev_comment"
            else
                buildline="$prev_first3.$prev_octet4\t$prev_comment"
            fi

            echo -e "${buildline}" >> "${cidr_output_file}"

            prev_has_been_written=true
            active_maskvalue=0
        fi
    fi

    SaveThisIpAsPrev

    }

echo "["$(date)"] >> started [$0@$HOSTNAME]"

Init
CondenseOnThreeOctets

echo "["$(date)"] << finished [$0@$HOSTNAME]"

exit $exitcode

Hopefully I've included all relevant information. Please let me know if there's something I missed.

Thanks everyone. :)

Update: I've just replaced the previous code with a newer version in response to an idea I saw in Anonymous Monk's solution below (namely: perform a single search for the first 3 octets). It will now also correctly parse more restrictive masks such as 0/21. So, it's gone from 18 seconds to 17 seconds. lol...


Replies are listed 'Best First'.
Re^2: perl quicker than bash? by Anonymous Monk on Jan 06, 2015 at 10:27 UTC
Yes you can totally do that in Perl and with much better performance. Here's the gist of how it can be done:

```
use strict;
use warnings;

my ( %masked, @results );
while ( my $line = <STDIN> ) {
    my ($ip) = $line =~ / (\d+ \. \d+ \. \d+ \. \d+) /x
        or next;
    my $mask = pack 'C3', split /\./, $ip;
    if ( $masked{$mask} ) {
        if ( not $masked{$mask}{repeat} ) {
            $masked{$mask}{ip} =~ s{ \d+ \z }{0/24}x;
            $masked{$mask}{repeat} = 1;
        }
    }
    else {
        $masked{$mask} = { ip => $ip, };
        push @results, $masked{$mask};
    }
}
print $_->{ip}, "\n" for @results;
```

output:

```
1.2.3.0/24
1.4.3.5
2.3.1.2
2.3.2.0/24
```

This way you also don't need to depend on the ordering of the source file (works with any order). As for dealing with comments and different cidrs I'll "leave it as an exercise for the reader" :)
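
For comparison with the original script, the same order-independent idea can be sketched in bash too, assuming bash 4.0 or later for associative arrays. This is a rough equivalent and not a translation that anyone in the thread posted; the function name `condense_any_order` is hypothetical. Like the Perl gist, it ignores comments and drops an existing mask on an entry that has no siblings.

```shell
#!/bin/bash
# Hypothetical sketch (needs bash 4.0+): condense IPs read from stdin in any
# order, keyed on the first three octets.
condense_any_order() {
    local -A seen=()       # first three octets -> index into results
    local -a results=()    # output lines, in order of first appearance
    local line prefix ip
    while IFS= read -r line || [[ -n $line ]]; do
        # skip lines that do not contain a dotted quad
        [[ $line =~ ([0-9]+\.[0-9]+\.[0-9]+)\.[0-9]+ ]] || continue
        ip=${BASH_REMATCH[0]}
        prefix=${BASH_REMATCH[1]}
        if [[ -n ${seen[$prefix]+set} ]]; then
            # second or later hit for this prefix: replace with the /24
            results[${seen[$prefix]}]="${prefix}.0/24"
        else
            seen[$prefix]=${#results[@]}
            results+=("$ip")
        fi
    done
    if (( ${#results[@]} > 0 )); then
        printf '%s\n' "${results[@]}"
    fi
}
```

It reads from stdin, e.g. `condense_any_order < input.list`. It avoids per-line external commands, but whether it gets anywhere near the Perl version's speed is something to measure rather than assume.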
Re^3: perl quicker than bash? by Anonymous Monk on Jan 08, 2015 at 02:35 UTC
Okay, that was a different anonymonk explaining pack and recommending Modern Perl, but now it's me again :) I had nothing better to do and decided to actually write this thing for you, seeing your enthusiasm (to give you a taste of Perl). Here it goes. The program accepts command line options '-i' and '-o', meaning input and output. Otherwise it operates on stdin and stdout. Usage: `$ perl squeeze_ips.pl -i input.list -o output.list` It preserves biggest found cidr and the first encountered comment (changing it to the last found comment is easy enough). Processing a file of one million ips takes about 15 seconds on my laptop.
Re^4: perl quicker than bash? by Anonymous Monk on Jan 08, 2015 at 22:23 UTC
`elsif ( $old_cidr > $new_cidr ) {` should be `elsif ( $new_cidr and $old_cidr > $new_cidr ) {` come to think of it.
Re^4: perl quicker than bash? by Anonymous Monk on Jan 08, 2015 at 02:42 UTC
"It preserves biggest found cidr": s/biggest/smallest/ :)
Re^5: perl quicker than bash? by TiffanyButterfly (Novice) on Jan 08, 2015 at 04:35 UTC
Re^6: perl quicker than bash? by TiffanyButterfly (Novice) on Jan 08, 2015 at 04:50 UTC
Re^3: perl quicker than bash? by TiffanyButterfly (Novice) on Jan 07, 2015 at 19:56 UTC
Wow! That is so much shorter than what I wrote... lol. I've been reading it through and re-reading it. Parts of the code I can understand, but others are completely new to me (pack, push, my). The bits that I can understand also show how I really could have written that bash script better. :) I'll keep researching the bits that I don't understand and let the learning begin! Thank you for your great solution!
Re^4: perl quicker than bash? by Anonymous Monk on Jan 07, 2015 at 21:19 UTC
`my` just defines a lexical variable in the current block. `push` is hopefully easy to understand; `pack` can get a little complicated. In this case it's turning the first three octets of the IP address into a byte string, so e.g. `pack "C3", split /\./, "80.114.108.33"` returns the bytes/chars "`Prl`" (the last octet is ignored because the template is "C3" and not "C4"). In addition to the general documentation such as perlsyn and perldata (Syntax and Data Types), perldsc (Data Structures) will probably be useful for understanding that code; an introduction to the references used to create those data structures is in perlreftut. And a regular expression tutorial is at perlretut. See also Modern Perl 2014 (free online edition), Learning Perl, and some of the links on this site: Getting Started with Perl
Re^5: perl quicker than bash? by TiffanyButterfly (Novice) on Jan 07, 2015 at 23:13 UTC
Re^6: perl quicker than bash? by Anonymous Monk on Jan 08, 2015 at 02:29 UTC