Seeker of Regex Wisdom (strings which don't form specific patterns)

ilcylic has asked for the wisdom of the Perl Monks concerning the following question:

Edited again to add more clarity. Hopefully. ;)

I ran the command # cat * |grep -P '\d+\.\d+\.\d+\.\d+' on a directory containing a bunch of iptables configuration files, in order to pull out all of the lines which had something that looked like an IP address on them. Given that most of the lines were of the form

ADMIN__SANDRO_DESK="123.456.78.90" # ilcylic Desktop

my first thought was to then |awk {print $1;} to eliminate the trailing comments, but then shortly realised that eliminated a bunch of stuff I actually wanted.

In then attempting to see what I would be eliminating, I tried to create a regex for grep -P that would show me lines which had, roughly, ^[one or more non-whitespace chars, which also do not form an IP address][one or more whitespace chars][zero or more chars][an IP address]

The first part proved to be the most difficult. How would one say "a string of one or more non-whitespace chars, which are also not an IP address" since, of course, an IP address is a string of one or more non-whitespace chars?

Thanks in advance, Monks. :)

Edited to add some examples (examples also edited):

So, if I were to cat a file and run it past grep -P 'My_Regex', it would not match lines of the form

ADMIN__SANDRO_DESK="123.156.78.90" # ilcylic Desktop

but would match lines like

# ADMIN__SANDRO_DESK="123.156.78.91" # ilcylic Desktop (old)

# 06/06/15 SR Chgd SANDRO 121.123.154.99

ADMIN__SANDRO_DESK = "123.156.78.90"


Replies are listed 'Best First'.
Re: Seeker of Regex Wisdom (strings which don't form specific patterns) by kcott (Archbishop) on Aug 12, 2015 at 20:38 UTC
G'day ilcylic,

While it may be possible to use `grep -P ...`, I suspect using `perl ...` (on the command line) would be a lot simpler. Regexp::Common provides often-used regexes; Regexp::Common::net has regexes for IP addresses. Using this module, I'd eliminate lines matching `/^$RE{net}{IPv4}/` first; then attempt to match `/^\S+\s+.*?$RE{net}{IPv4}/`.

For use on the command line (each line typed on standard input is echoed back only if it matches):

`$ perl -MRegexp::Common=net -wnE '/^$RE{net}{IPv4}/ and next; /^\S+\s+.*?$RE{net}{IPv4}/ and say'`
127.0.0.1aaa 127.0.0.1
256.0.0.1aaa 127.0.0.1
256.0.0.1aaa 127.0.0.1
aaa 127.0.0.1
aaa 127.0.0.1

Having tested that solution, I noticed your update with examples. For IPv4, all of the octets should be in the range 0-255. I suspect you generated quick-and-dirty examples (by running along the number keys), which has produced invalid IP addresses. Changing `456` to `156` and `654` to `154`, the same one-liner matches the way you want:

`$ perl -MRegexp::Common=net -wnE '/^$RE{net}{IPv4}/ and next; /^\S+\s+.*?$RE{net}{IPv4}/ and say'`
ADMIN__SANDRO_DESK="123.156.78.90" # ilcylic Desktop
# ADMIN__SANDRO_DESK="123.156.78.90" # ilcylic Desktop
# ADMIN__SANDRO_DESK="123.156.78.90" # ilcylic Desktop
# 06/06/15 SR Chgd SANDRO 121.123.154.99
# 06/06/15 SR Chgd SANDRO 121.123.154.99

-- Ken
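The two-step filter described above (skip lines that start with an address, then keep lines with an address after the first whitespace) can be sketched as a small script. This is only a sketch: to stay free of non-core modules it swaps in a hand-rolled `$ipv4` pattern where the post uses `$RE{net}{IPv4}`, and `keep_line` is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;

# Assumption: hand-rolled stand-in for $RE{net}{IPv4}; octets limited to 0-255.
my $octet = qr/(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])/;
my $ipv4  = qr/$octet(?:\.$octet){3}/;

# keep_line() is a hypothetical helper wrapping the two steps of the one-liner.
sub keep_line {
    my ($line) = @_;
    return 0 if $line =~ /^$ipv4/;                  # step 1: line starts with an address
    return $line =~ /^\S+\s+.*?$ipv4/ ? 1 : 0;      # step 2: address after the first whitespace
}

my @samples = (
    'ADMIN__SANDRO_DESK="123.156.78.90" # ilcylic Desktop',
    '# ADMIN__SANDRO_DESK="123.156.78.91" # ilcylic Desktop (old)',
    '# 06/06/15 SR Chgd SANDRO 121.123.154.99',
    'ADMIN__SANDRO_DESK = "123.156.78.90"',
);

my @kept = grep { keep_line($_) } @samples;
print "$_\n" for @kept;
```

With the four sample lines from the question, only the first one is dropped, because its only address sits before the first whitespace.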
Re: Seeker of Regex Wisdom (strings which don't form specific patterns) by BrowserUk (Patriarch) on Aug 12, 2015 at 20:01 UTC
On the basis of your limited sample, it looks like all you need is to exclude lines that start with a #. I.e.: `perl -nle"/^#/ or print" infile > outfile`

If there are cases that doesn't work for, post some more samples.
Re^2: Seeker of Regex Wisdom (strings which don't form specific patterns) by ilcylic (Scribe) on Aug 13, 2015 at 15:14 UTC
Yeah, sorry, my examples were apparently not as good as I was hoping for. Post updated.
Re: Seeker of Regex Wisdom (strings which don't form specific patterns) by Laurent_R (Canon) on Aug 12, 2015 at 20:42 UTC
The question is far from clear, and I don't understand whether or not you want to keep a line starting with #. Assuming you just want to keep all the lines which have an IP address, or something looking very much like an IP address, you might try something like this: `perl -ne 'print if /(\d{1,3}\.){3}\d{1,3}/;'`

Of course, if you want to be more selective, you could check that the captures are smaller than 256: `perl -ne 'print if /(\d{1,3}\.){3}(\d{1,3})/ and $1 < 256 and ... and $4 < 256;'`

It really depends on your data. In many cases, the first simple regex is sufficient; in others, you really need to be sure that you don't keep something like "345.765.5.34", which is obviously not an IP address, whatever it is.

Update: my code line above is wrong, as explained and shown below by AnomalousMonk: capture groups don't change their numbering under a counted quantifier. It would have to be something like this: `perl -ne 'print if /(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/ and $1 < 256 and ... and $4 < 256;'` but that's probably getting a bit too convoluted for a one-liner.
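The point of the update can be illustrated with a short script: a group under a counted quantifier keeps only its last iteration, so the quantified pattern yields two captures, not four. The sample string and the helper name `has_ipv4` are assumptions made for this sketch. The helper checks only the leftmost candidate on the line.

```perl
use strict;
use warnings;

my $line = 'SANDRO 121.123.154.99';

# A counted group keeps only its last iteration, so this pattern
# fills $1 and $2 and never sets $3 or $4.
my @quantified = $line =~ /(\d{1,3}\.){3}(\d{1,3})/;
print scalar(@quantified), " captures: @quantified\n";   # 2 captures: 154. 99

# has_ipv4() is a hypothetical helper: four explicit groups, then a range check.
sub has_ipv4 {
    my ($text) = @_;
    my @octets = $text =~ /(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/
        or return 0;                        # no dotted quad at all
    my $bad = grep { $_ > 255 } @octets;    # count out-of-range octets
    return $bad ? 0 : 1;
}

print has_ipv4($line) ? "address found\n" : "no address\n";
```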
Re^2: Seeker of Regex Wisdom (strings which don't form specific patterns) by Discipulus (Canon) on Aug 13, 2015 at 08:01 UTC
Or, more selectively and more verbosely too, as found in 'Mastering Regular Expressions': `^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$`

L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re^3: Seeker of Regex Wisdom (strings which don't form specific patterns) by AnomalousMonk (Archbishop) on Aug 13, 2015 at 09:01 UTC
As selective, less verbose, non-repetitive, more readable, better IMHO:

`my $octet_dec = qr{ [01]?\d\d? | 2[0-4]\d | 25[0-5] }xms;`
`my $ipv4_dec = qr{ $octet_dec (?: [.] $octet_dec){3} }xms;`

Update: Or better yet, as already mentioned, Regexp::Common::net.

Give a man a fish: `<%-(-(-(-<`
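A brief sketch of how the composed pattern behaves with and without anchoring. The two `qr` definitions are the ones from the post; the `$ipv4_bounded` wrapper and the test strings are assumptions added here, using "not adjacent to another digit" as the boundary rule.

```perl
use strict;
use warnings;

my $octet_dec = qr{ [01]?\d\d? | 2[0-4]\d | 25[0-5] }xms;
my $ipv4_dec  = qr{ $octet_dec (?: [.] $octet_dec){3} }xms;

# Assumption: "not touching another digit" is the boundary rule wanted here.
my $ipv4_bounded = qr{ (?<! \d) $ipv4_dec (?! \d) }xms;

for my $s ('123.156.78.90', '256.1.1.1', '1.2.3.4567') {
    printf "%-14s loose=%s bounded=%s\n", $s,
        ($s =~ $ipv4_dec     ? 'yes' : 'no'),
        ($s =~ $ipv4_bounded ? 'yes' : 'no');
}
```

Unanchored, the loose pattern still finds `56.1.1.1` inside `256.1.1.1`, which is why the book version quoted earlier wraps the whole thing in `^` and `$`.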
Re^3: Seeker of Regex Wisdom (strings which don't form specific patterns) by Laurent_R (Canon) on Aug 13, 2015 at 09:26 UTC
Hi Discipulus, your suggestion is a pure regex, which makes perfect sense in J. Friedl's book, but I do not think it is more selective than my proposal, which mixes a regex with some arithmetic and looks for four dot-separated integers smaller than 256. (Except that I used `\d` instead of `[0-9]` for brevity, so that my regex might match (non-Arabic) Unicode digits, but that's easily fixed.)
Re^4: Seeker of Regex Wisdom (strings which don't form specific patterns) by AnomalousMonk (Archbishop) on Aug 13, 2015 at 10:04 UTC
Re^5: Seeker of Regex Wisdom (strings which don't form specific patterns) by shmem (Chancellor) on Aug 13, 2015 at 12:29 UTC
Re^5: Seeker of Regex Wisdom (strings which don't form specific patterns) by Laurent_R (Canon) on Aug 13, 2015 at 10:20 UTC
Re: Seeker of Regex Wisdom (strings which don't form specific patterns) by KurtSchwind (Chaplain) on Aug 12, 2015 at 20:16 UTC
Do you want to match lines with IP addresses that start with a comment? `grep -P '^#.*(?:[0-9]{1,3}\.){3}[0-9]{1,3}'` If you want you can 'trap' the IP address and just print that as well. Or did I miss the question?

-- "For the Present is the point at which time touches eternity." - CS Lewis
Re^2: Seeker of Regex Wisdom (strings which don't form specific patterns) by ilcylic (Scribe) on Aug 13, 2015 at 15:43 UTC
No, although I can see why my examples would lead one to think that. Doh. :(

My question is really more about testing for the effects of awk. Since awk by default splits lines on whitespace, if I were to cat a file that contained a line that looked like `MY_IP = 123.134.234.90`, and then do `|awk '{print $1;}'`, I'd only get the "MY_IP" part. So, in order to see what I would be losing by doing that, I decided to try and write a regex that would show me lines that contained an IP address, but only after the first whitespace on the line, which itself needed to follow some non-whitespace stuff, that did not itself contain an IP address.

This turns out to be difficult to do by hand, but easy if you use regex modules, as shown below by AnomalousMonk :D
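To make the awk effect concrete, here is a minimal Perl sketch that imitates `awk '{print $1;}'` on a single line. `first_field` is a hypothetical helper, and the second sample line is invented for illustration.

```perl
use strict;
use warnings;

# first_field() is a hypothetical helper imitating awk '{print $1;}':
# split ' ' skips leading whitespace and splits on runs of whitespace.
sub first_field {
    my ($line) = @_;
    my ($first) = split ' ', $line;
    return $first // '';    # empty string for blank lines
}

print first_field('MY_IP = 123.134.234.90'), "\n";          # MY_IP
print first_field('MY_IP="123.134.234.90" # note'), "\n";   # MY_IP="123.134.234.90"
```

The first line loses its address because the address comes after the first whitespace; the second keeps it.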
Re: Seeker of Regex Wisdom (strings which don't form specific patterns) by AnomalousMonk (Archbishop) on Aug 13, 2015 at 11:58 UTC
... a regex for grep -P that would show me lines which had, roughly, `^[one or more non-whitespace chars, which also do not form an IP address][one or more whitespace chars][zero or more chars][an IP address]`

My 0.02USD. It's possible to rigorously express all the stated requirements as regexes. It's highly convenient to do so by building upon existing Perl modules such as Regexp::Common.

The requirement `[one or more non-whitespace chars, which also do not form an IP address]` can be exactly expressed as `my $S_not_IPv4 = qr{ (?! $RE{net}{IPv4}) \S }xms` for a single such character, and `$S_not_IPv4+` (within a regex) as "one or more" such characters. The requirements `[one or more whitespace chars]` and `[zero or more chars]` are exactly met by `\s+` and `.*` respectively.

For `[an IP address]` it's convenient to use `$RE{net}{IPv4}` defined in `Regexp::Common::net`, but this regex has a subtlety: it intentionally does not include boundary conditions and so may match what might be considered a non-IPv4 string in certain circumstances:

`c:\@Work\Perl\monks>perl -wMstrict -le "use Regexp::Common qw(net); ;; print 'match' if '54321.2.3.45678' =~ m{ $RE{net}{IPv4} }xms; "`
match

The programmer must determine the proper match criteria for each circumstance. I have used a "loose" criterion in the `$S_not_IPv4` definition, and also have a tighter `$ip` definition that would exclude the `'54321.2.3.45678'` match above. So finally:

`c:\@Work\Perl\monks>perl -wMstrict -le "use Regexp::Common qw(net); ;; my $ip = qr{ (?<! \d) $RE{net}{IPv4} (?! \d) }xms; my $S_not_IPv4 = qr{ (?! $RE{net}{IPv4}) \S }xms; ;; for my $s ( 'ADMIN__SANDRO_DESK=\"123.45.78.90\" # ilcylic Desktop', '# ADMIN__SANDRO_DESK=\"123.45.78.91\" # ilcylic Desktop (old)', '# 06/06/15 SR Chgd SANDRO 121.123.65.92', @ARGV, ) { my $match = $s =~ m{ \A $S_not_IPv4+ \s+ .* $ip }xms; printf qq{%8s '%s' \n}, $match ? 'MATCH' : 'no match', $s; } "`
no match 'ADMIN__SANDRO_DESK="123.45.78.90" # ilcylic Desktop'
   MATCH '# ADMIN__SANDRO_DESK="123.45.78.91" # ilcylic Desktop (old)'
   MATCH '# 06/06/15 SR Chgd SANDRO 121.123.65.92'

It's a bit wordy, but still possible to express as a CLI one-liner. I think it can be said to exactly meet the stated requirements. It does so in Perl and not for grep, but that's life.

Give a man a fish: `<%-(-(-(-<`
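The same single-regex idea can be laid out as a script for readers who find the one-liner hard to follow. This is a sketch under assumptions, not the post's exact code: a hand-rolled `$ipv4` pattern replaces `$RE{net}{IPv4}` so that only core Perl is needed, and `lost_by_awk` is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumption: hand-rolled stand-in for $RE{net}{IPv4}; octets limited to 0-255.
my $octet = qr{ 25[0-5] | 2[0-4][0-9] | 1[0-9][0-9] | [1-9]?[0-9] }xms;
my $ipv4  = qr{ $octet (?: [.] $octet ){3} }xms;

# Tight: an address with no digit directly before or after it.
my $ip = qr{ (?<! [0-9] ) $ipv4 (?! [0-9] ) }xms;

# Loose: one non-whitespace character at which no address starts.
my $S_not_IPv4 = qr{ (?! $ipv4 ) \S }xms;

# lost_by_awk() is a hypothetical helper: true when the first field holds
# no address but an address appears later on the line.
sub lost_by_awk {
    my ($line) = @_;
    return $line =~ m{ \A $S_not_IPv4+ \s+ .* $ip }xms ? 1 : 0;
}

for my $s (
    'ADMIN__SANDRO_DESK="123.45.78.90" # ilcylic Desktop',
    '# ADMIN__SANDRO_DESK="123.45.78.91" # ilcylic Desktop (old)',
    '# 06/06/15 SR Chgd SANDRO 121.123.65.92',
) {
    printf "%8s '%s'\n", lost_by_awk($s) ? 'MATCH' : 'no match', $s;
}
```

The negative lookahead is applied at every character of the first field, so the field is rejected as soon as an address begins anywhere inside it, which is what makes the first sample line a non-match.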
