PerlMonks  

Seeker of Regex Wisdom (strings which don't form specific patterns)

by ilcylic (Scribe)
on Aug 12, 2015 at 19:36 UTC ( #1138337=perlquestion )

ilcylic has asked for the wisdom of the Perl Monks concerning the following question:

Edited again to add more clarity. Hopefully. ;)

I ran the command # cat * |grep -P '\d+\.\d+\.\d+\.\d+' on a directory containing a bunch of iptables configuration files, in order to pull out all of the lines which had something that looked like an IP address on them. Given that most of the lines were of the form

ADMIN__SANDRO_DESK="123.456.78.90"   #  ilcylic Desktop

my first thought was to then pipe the output through | awk '{print $1;}' to eliminate the trailing comments, but I shortly realised that this eliminated a bunch of stuff I actually wanted.

In then attempting to see what I would be eliminating, I tried to create a regex for grep -P that would show me lines which had, roughly, ^[one or more non-whitespace chars, which also do not form an IP address][one or more whitespace chars][zero or more chars][an IP address]

The first part proved to be the most difficult. How would one say "a string of one or more non-whitespace chars, which are also not an IP address" since, of course, an IP address is a string of one or more non-whitespace chars?

Thanks in advance, Monks. :)

Edited to add some examples (examples also edited):

So, if I were to cat a file and run it past grep -P 'My_Regex', it would not match lines of the form

ADMIN__SANDRO_DESK="123.156.78.90"   #  ilcylic Desktop

but would match lines like

# ADMIN__SANDRO_DESK="123.156.78.91"   #  ilcylic Desktop (old)

or

# 06/06/15    SR    Chgd SANDRO 121.123.154.99

or

ADMIN__SANDRO_DESK = "123.156.78.90"

Replies are listed 'Best First'.
Re: Seeker of Regex Wisdom (strings which don't form specific patterns)
by kcott (Archbishop) on Aug 12, 2015 at 20:38 UTC

    G'day ilcylic,

    While it may be possible to use 'grep -P ...', I suspect using 'perl ...' (on the command line) would be a lot simpler.

    Regexp::Common provides often-used regexes; Regexp::Common::net has regexes for IP addresses.

    Using this module, I'd eliminate lines matching /^$RE{net}{IPv4}/ first; then attempt to match /^\S+\s+.*?$RE{net}{IPv4}/.

    For use on the command line:

    $ perl -MRegexp::Common=net -wnE '/^$RE{net}{IPv4}/ and next; /^\S+\s+.*?$RE{net}{IPv4}/ and say'
    127.0.0.1aaa 127.0.0.1
    256.0.0.1aaa 127.0.0.1
    256.0.0.1aaa 127.0.0.1
    aaa 127.0.0.1
    aaa 127.0.0.1

    Having tested that solution, I noticed your update with examples. For IPv4, all of the octets should be in the range 0-255. I suspect you generated quick-and-dirty examples (by running along the number keys) which has produced invalid IP addresses. Changing 456 to 156 and 654 to 154, the same one-liner matches the way you want:

    $ perl -MRegexp::Common=net -wnE '/^$RE{net}{IPv4}/ and next; /^\S+\s+.*?$RE{net}{IPv4}/ and say'
    ADMIN__SANDRO_DESK="123.156.78.90" # ilcylic Desktop
    # ADMIN__SANDRO_DESK="123.156.78.90" # ilcylic Desktop
    # ADMIN__SANDRO_DESK="123.156.78.90" # ilcylic Desktop
    # 06/06/15 SR Chgd SANDRO 121.123.154.99
    # 06/06/15 SR Chgd SANDRO 121.123.154.99
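For anyone without Regexp::Common installed, here is a rough sketch of the same two-step filter as a small script. The hand-rolled `$ipv4` pattern and the name `ip_after_first_field` are my own stand-ins, not part of the module or the one-liner above, and the sample lines come from the question.

```perl
use strict;
use warnings;

# Hand-rolled IPv4 pattern: an assumption standing in for $RE{net}{IPv4}.
my $octet = qr/(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/;
my $ipv4  = qr/$octet(?:\.$octet){3}/;

# True when the line does not begin with an IPv4 address but does have one
# somewhere after the first run of whitespace.
sub ip_after_first_field {
    my ($line) = @_;
    return 0 if $line =~ /^$ipv4/;               # step 1: skip lines that start with an address
    return $line =~ /^\S+\s+.*?$ipv4/ ? 1 : 0;   # step 2: address after the first whitespace
}

my @lines = (
    'ADMIN__SANDRO_DESK="123.156.78.90"   #  ilcylic Desktop',
    '# ADMIN__SANDRO_DESK="123.156.78.91"   #  ilcylic Desktop (old)',
    '# 06/06/15    SR    Chgd SANDRO 121.123.154.99',
    'ADMIN__SANDRO_DESK = "123.156.78.90"',
);

# Prints every sample line except the first one.
print "$_\n" for grep { ip_after_first_field($_) } @lines;
```

Note that step 1 only rejects lines that begin with an address; the first sample line is dropped because nothing after its first whitespace looks like an address.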

    — Ken

Re: Seeker of Regex Wisdom (strings which don't form specific patterns)
by BrowserUk (Patriarch) on Aug 12, 2015 at 20:01 UTC
      Yeah, sorry, my examples were apparently not as good as I was hoping for. Post updated.
Re: Seeker of Regex Wisdom (strings which don't form specific patterns)
by Laurent_R (Canon) on Aug 12, 2015 at 20:42 UTC
    The question is far from clear, and I don't understand whether you want to keep lines starting with #.

    Assuming you just want to keep all the lines which have an IP address, or something looking very much like an IP address, you might try something like this:

    perl -ne 'print if /(\d{1,3}\.){3}\d{1,3}/;'
    Of course, if you want to be more selective, you could check that the captures are smaller than 256:
    perl -ne 'print if /(\d{1,3}\.){3}(\d{1,3})/ and $1 < 256 and ... and $4 < 256;'
    It really depends on your data. In many cases, the first simple regex is just sufficient, in others, you really need to be sure that you don't keep something like "345.765.5.34", which is obviously not an IP address, whatever it is.

    Update: my code line above is wrong, as explained and shown below by AnomalousMonk: capture groups don't change their numbering under a counted quantifier.

    It would have to be something like this:

    perl -ne 'print if /(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/ and $1 < 256 and ... and $4 < 256;'
    but that's probably getting a bit too convoluted for a one-liner.
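Written out as a script rather than a one-liner, the capture-and-compare idea might look like the sketch below. The sub name `has_ipv4` is hypothetical, and `[0-9]` is used instead of `\d` so that only ASCII digits are accepted.

```perl
use strict;
use warnings;

# True when the line contains four dot-separated numbers of 1-3 digits
# and each of the four captured numbers is smaller than 256.
sub has_ipv4 {
    my ($line) = @_;
    my @octets = $line =~ /([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})/
        or return 0;                             # no dotted quad at all
    return (grep { $_ > 255 } @octets) ? 0 : 1;  # reject if any part is out of range
}

for my $line ('# 06/06/15 SR Chgd SANDRO 121.123.154.99', '345.765.5.34', 'no address here') {
    printf "%-8s %s\n", has_ipv4($line) ? 'keep' : 'drop', $line;
}
```

This only inspects the first dotted quad found on a line, which matches what the one-liner does.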
      or, more selectively and more verbosely, as found in 'Mastering Regular Expressions':
      ^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$


      L*
      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

        As selective, less verbose, non-repetitive, more readable, better IMHO:

            my $octet_dec = qr{ [01]?\d\d? | 2[0-4]\d | 25[0-5] }xms;

            my $ipv4_dec = qr{ $octet_dec (?: [.] $octet_dec){3} }xms;
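A small usage sketch for the two definitions above, anchored so that the whole string must be an address. The sub name `is_ipv4` is made up for illustration.

```perl
use strict;
use warnings;

my $octet_dec = qr{ [01]?\d\d? | 2[0-4]\d | 25[0-5] }xms;
my $ipv4_dec  = qr{ $octet_dec (?: [.] $octet_dec){3} }xms;

# Each qr// object interpolates as its own group, so the alternation in
# $octet_dec stays contained when it is reused inside $ipv4_dec.
sub is_ipv4 {
    my ($s) = @_;
    return $s =~ m{ \A $ipv4_dec \z }xms ? 1 : 0;   # anchored: nothing before or after
}

for my $candidate ('123.156.78.90', '255.255.255.255', '123.456.78.90', '1.2.3') {
    printf "%-16s %s\n", $candidate, is_ipv4($candidate) ? 'ok' : 'not IPv4';
}
```

Without the `\A` and `\z` anchors the pattern can match a shorter substring, so anchoring (or adding lookarounds) is needed when validating rather than searching.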

        Update: Or better yet, as already mentioned, Regexp::Common::net.


        Give a man a fish:  <%-(-(-(-<

        Hi Discipulus,

        your suggestion is a pure regex, which makes perfect sense in J. Friedl's book, but I do not think it is more selective than my proposal, which mixes a regex with some arithmetic and looks for four dot-separated integers smaller than 256. (Except that I used \d instead of [0-9] for brevity, so that my regex might match (non-Arabic) Unicode digits, but that's easily fixed.)

Re: Seeker of Regex Wisdom (strings which don't form specific patterns)
by KurtSchwind (Chaplain) on Aug 12, 2015 at 20:16 UTC

    Do you want to match lines with IP addresses that start with a comment?

    grep -P '^#.*(?:[0-9]{1,3}\.){3}[0-9]{1,3}'

    If you want you can 'trap' the IP address and just print that as well. Or did I miss the question?

    --
    “For the Present is the point at which time touches eternity.” - CS Lewis

      No, although I can see why my examples would lead one to think that. Doh. :(

      My question is really more about testing for the effects of awk. Since awk by default splits lines on whitespace, if I were to cat a file that contained a line that looked like MY_IP = 123.134.234.90, and then do  |awk '{print $1;}', I'd only get the "MY_IP" part.

      So, in order to see what I would be losing by doing that, I decided to try and write a regex that would show me lines that contained an IP address, but only after the first whitespace on the line, which itself needed to follow some non-whitespace stuff, that did not itself contain an IP address.

      This turns out to be difficult to do by hand, but easy if you use regex modules, as shown below by AnomalousMonk :D
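The awk effect described above can also be checked field by field instead of with one big regex. This is only a sketch under my own assumptions: `split ' '` is used to mimic awk's default whitespace splitting, and the `$ipv4` pattern and the name `lost_by_first_field` are hypothetical, not taken from any module.

```perl
use strict;
use warnings;

# Hand-rolled IPv4 pattern with digit boundaries (an assumption, not Regexp::Common).
my $octet = qr/(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/;
my $ipv4  = qr/(?<![0-9])$octet(?:\.$octet){3}(?![0-9])/;

# True when keeping only the first whitespace-separated field (what
# awk '{print $1;}' does) would throw away the only address(es) on the line.
sub lost_by_first_field {
    my ($line) = @_;
    my ($first, @rest) = split ' ', $line;   # awk-style: skip leading blanks, split on runs
    return 0 unless defined $first;          # blank line
    return 0 if $first =~ $ipv4;             # the address survives in field 1
    return (grep { $_ =~ $ipv4 } @rest) ? 1 : 0;
}

while (my $line = <DATA>) {
    chomp $line;
    print "$line\n" if lost_by_first_field($line);
}

__DATA__
ADMIN__SANDRO_DESK="123.156.78.90"   #  ilcylic Desktop
MY_IP = 123.134.234.90
# 06/06/15    SR    Chgd SANDRO 121.123.154.99
```

Reading from `<>` instead of `DATA` would let the same check run over real files.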

Re: Seeker of Regex Wisdom (strings which don't form specific patterns)
by AnomalousMonk (Bishop) on Aug 13, 2015 at 11:58 UTC
    ... a regex for grep -P that would show me lines which had, roughly, ^[one or more non-whitespace chars, which also do not form an IP address][one or more whitespace chars][zero or more chars][an IP address]

    My 0.02USD. It's possible to rigorously express all the stated requirements as regexes. It's highly convenient to do so by building upon existing Perl modules such as Regexp::Common.

    The requirement
        [one or more non-whitespace chars, which also do not form an IP address]
    can be exactly expressed as
        my $S_not_IPv4 = qr{ (?! $RE{net}{IPv4}) \S }xms
    for a single such character, and
        $S_not_IPv4+
    (within a regex) as "one or more" such characters.

    The requirements  [one or more whitespace chars] and  [zero or more chars] are exactly met by  \s+ and  .* respectively.

    For
        [an IP address]
    it's convenient to use  $RE{net}{IPv4} defined in Regexp::Common::net, but this regex has a subtlety: it intentionally does not include boundary conditions and so may match what might be considered a non-IPv4 string in certain circumstances:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use Regexp::Common qw(net);
     ;;
     print 'match' if '54321.2.3.45678' =~ m{ $RE{net}{IPv4} }xms;
    "
    match
    The programmer must determine the proper match criteria for each circumstance. I have used a "loose" criterion in the  $S_not_IPv4 definition, and also have a tighter  $ip definition that would exclude the  '54321.2.3.45678' match above.
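To make the loose/tight difference concrete without depending on the module, here is a sketch with a hand-rolled pattern. `$loose`, `$tight` and `first_match` are names I made up for this illustration; `$loose` merely approximates the unanchored behaviour of `$RE{net}{IPv4}`.

```perl
use strict;
use warnings;

# Hand-rolled stand-in for an unanchored IPv4 pattern (an assumption).
my $octet = qr/(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)/;
my $loose = qr/$octet(?:\.$octet){3}/;
# Same pattern, but refusing to start or end in the middle of a digit run.
my $tight = qr/(?<![0-9])$loose(?![0-9])/;

# Returns the text matched by $re inside $s, or undef when nothing matches.
sub first_match {
    my ($re, $s) = @_;
    return $s =~ /($re)/ ? $1 : undef;
}

for my $s ('54321.2.3.45678', 'Chgd SANDRO 121.123.65.92') {
    for my $pair ([loose => $loose], [tight => $tight]) {
        my ($name, $re) = @$pair;
        my $hit = first_match($re, $s);
        printf "%-5s on '%s': %s\n", $name, $s, defined $hit ? "matched $hit" : 'no match';
    }
}
```

The loose pattern picks a plausible-looking quad out of the middle of the long digit runs, while the lookbehind and lookahead in the tight pattern reject it.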

    So finally:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use Regexp::Common qw(net);
     ;;
     my $ip = qr{ (?<! \d) $RE{net}{IPv4} (?! \d) }xms;
     my $S_not_IPv4 = qr{ (?! $RE{net}{IPv4}) \S }xms;
     ;;
     for my $s (
       'ADMIN__SANDRO_DESK=\"123.45.78.90\" # ilcylic Desktop',
       '# ADMIN__SANDRO_DESK=\"123.45.78.91\" # ilcylic Desktop (old)',
       '# 06/06/15 SR Chgd SANDRO 121.123.65.92',
       @ARGV,
       ) {
       my $match = $s =~ m{ \A $S_not_IPv4+ \s+ .* $ip }xms;
       printf qq{%8s '%s' \n}, $match ? 'MATCH' : 'no match', $s;
       }
    "
    no match 'ADMIN__SANDRO_DESK="123.45.78.90" # ilcylic Desktop'
       MATCH '# ADMIN__SANDRO_DESK="123.45.78.91" # ilcylic Desktop (old)'
       MATCH '# 06/06/15 SR Chgd SANDRO 121.123.65.92'
    It's a bit wordy, but still possible to express as a CLI one-liner. I think it can be said to exactly meet the stated requirements. It does so in Perl and not for grep, but that's life.


    Give a man a fish:  <%-(-(-(-<

Node Type: perlquestion [id://1138337]
Approved by Corion