Guildenstern has asked for the wisdom of the Perl Monks concerning the following question:

I've got some quick and dirty code that I'm using to parse some logs. A typical line that I'm feeding to my regex looks like this:
ASCII data in TCP pkt from 111.11.111.111/1162 to local port 7070: !
The regex I'm using looks like this:
/[^\d]+([^ ]+)[^\d](\d+)(.*?)$/;
print "$1\n$2\n$3\n";
This should yield:
111.11.111.111/1162
7070
!

It doesn't.
What I get instead is this:
111.11.111.111
1162
to local port 7070: !

For some reason the ([^ ]+) section stops at the forward slash, even though it's clearly not a space.
If I change that part of the RE to ([^ ]+?), making it less greedy, it stops matching at the first period in the IP address, even though a period is still not a space!
I'm at a loss to explain why this seemingly simple RE insists on frappeing my brain. Anybody run into something similar? Am I missing some key knowledge that I should be smacking my forehead and saying "D'oh!" for? FWIW, I'm using ActiveState's Perl 5.6 in case this may be one of those bugs I keep hearing about.
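
A minimal way to reproduce both results is sketched below. The variable names ($line, @greedy, @lazy) are illustrative only, and the captures are joined with | so that leading spaces stay visible.

```perl
use strict;
use warnings;

# Sample log line copied from the post above.
my $line = 'ASCII data in TCP pkt from 111.11.111.111/1162 to local port 7070: !';

# Original pattern, with the greedy ([^ ]+).
my @greedy = $line =~ /[^\d]+([^ ]+)[^\d](\d+)(.*?)$/;

# Same pattern with the less greedy ([^ ]+?).
my @lazy = $line =~ /[^\d]+([^ ]+?)[^\d](\d+)(.*?)$/;

print join('|', @greedy), "\n";   # 111.11.111.111|1162| to local port 7070: !
print join('|', @lazy),   "\n";   # 111|11|.111.111/1162 to local port 7070: !
```
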
TIA


Guildenstern
Negated character class uber alles!

Replies are listed 'Best First'.
(Ovid) Re: Matching (non)spaces in regex?
by Ovid (Cardinal) on Sep 28, 2000 at 00:06 UTC
    For some reason the ([^ ]+) section stops at the forward slash, even though it's clearly not a space.
    Actually, it's not stopping at the forward slash. The regex tries to match as much as possible, so it does match up to the space. However, that causes the rest of the regex to fail, so the regex engine keeps backtracking and retrying the regex. In this case, backtracking to the / allows the regex to match.
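
    The backtracking described above can be imitated by hand. This is only a rough sketch and not how the regex engine is actually implemented: it tries the longest candidate for $1 first and gives back one character at a time until the next piece of the pattern, [^\d](\d+), fits. The names $token, $after and $found are made up for the illustration.

    ```perl
    use strict;
    use warnings;

    # What ([^ ]+) grabs at first, and what follows it on the sample line.
    my $token = '111.11.111.111/1162';
    my $after = ' to local port 7070: !';

    # Hand-rolled imitation of backtracking (not the real engine): start with
    # the longest candidate and shorten it until "one non-digit, then digits"
    # matches at the start of whatever is left over.
    my $found;
    for (my $len = length $token; $len >= 1; $len--) {
        my $candidate = substr($token, 0, $len);
        my $remainder = substr($token, $len) . $after;
        if ($remainder =~ /^[^\d]\d+/) {
            $found = $candidate;
            last;
        }
    }

    print "first candidate that lets the rest match: $found\n";
    ```

    The loop rejects every candidate that ends inside "/1162", because the character after it is either a digit or is not followed by one, and settles on 111.11.111.111, where the / can serve as the single non-digit.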

    What happened was, you forgot a quantifier. Try the following regex:

    $data =~ /[^\d]+([^ ]+)[^\d]+(\d+)(.*?)$/;
    Here's the cleaned up version of the regex:
    $data =~ /\D+(\S+)\D+(\d+)(.*)$/;
    You'll notice that the [^\d] between the first and second set of parens now has a + after it (it appears as \D+ in the cleaned-up version).

    Breaking out the regex for those who prefer it:

    $data =~ /
        \D+      # One or more non-digits
        (        # Capture to $1
            \S+  # non-spaces
        )
        \D+      # One or more non-digits
        (        # Capture to $2
            \d+  # one or more digits
        )
        (        # Capture to $3
            .*   # rest of string
        )$       # Anchor above to end of string
    /x;
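
    To check the result against the sample line, something like the following should do. It is a sketch, with $data filled in from the original post. Note that $3 comes back as ": !" rather than the bare "!" that was asked for; the second pattern is an assumed variation, not part of the regex above, that swallows the colon and any whitespace before the last capture.

    ```perl
    use strict;
    use warnings;

    my $data = 'ASCII data in TCP pkt from 111.11.111.111/1162 to local port 7070: !';

    # Cleaned-up regex from above, with the quantifier restored.
    my @fixed = $data =~ /\D+(\S+)\D+(\d+)(.*)$/;

    # Assumed variation: consume ":" and any whitespace after the port so
    # that $3 holds only the trailing text.
    my @trimmed = $data =~ /\D+(\S+)\D+(\d+):\s*(.*)$/;

    print join('|', @fixed),   "\n";   # 111.11.111.111/1162|7070|: !
    print join('|', @trimmed), "\n";   # 111.11.111.111/1162|7070|!
    ```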

    Cheers,
    Ovid

    Update: Darn it! Guildenstern beat me to it!

    Join the Perlmonks Setiathome Group or just go to the link and check out our stats.

(Guildenstern) Re: Matching (non)spaces in regex?
by Guildenstern (Deacon) on Sep 27, 2000 at 23:58 UTC
    *sigh*
    I think there's a rule somewhere that the most frustrating problems have answers that appear 5 minutes after you've asked the question to everyone in the place.
    In the above RE, I forgot to add one simple stupid + sign. The new RE looks like this:
    /[^\d]+([^ ]+)[^\d]+(\d+)(.*?)$/;
    ##                 ^ added plus here (duh)

    Works perfectly now. "D'oh!"

    Guildenstern
    Negated character class uber alles!
Re: Matching (non)spaces in regex?
by Fastolfe (Vicar) on Sep 28, 2000 at 00:05 UTC
    The behavior you're describing certainly sounds weird to me. I suspect it's your [^\d] and [^ ] character classes, which are goofy. I changed them as described below and your regular expression worked great for me.

    Note that \d has a ready-made opposite, \D (non-digits). Your [ ] is legal, but it matches only a literal space; the broader shorthand is \s (any whitespace), and its opposite is \S (non-whitespace), which does the job of [^ ] here.

    In your case here, your regular expression could be better written like:

    /(\d+\S+)\D+(\d+): (.*)$/
    Perhaps someone else has a better optimization, but this will essentially start your search on the first number, continue for the rest of that "field", and then match non-numbers until it gets your 2nd numeric, following up with everything else in $3. Modify to taste.
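
    As a quick check, the regex above can be run against the sample line from the original post. This is a sketch: the names $line, $source, $port and $payload are placeholders, and the warn branch is just one way to handle a line that does not match.

    ```perl
    use strict;
    use warnings;

    my $line = 'ASCII data in TCP pkt from 111.11.111.111/1162 to local port 7070: !';

    # In list context a successful match returns the three captures;
    # a failed match returns an empty list, leaving all three undefined.
    my ($source, $port, $payload) = $line =~ /(\d+\S+)\D+(\d+): (.*)$/;

    if (defined $source) {
        print "$source\n$port\n$payload\n";
    } else {
        warn "line did not match: $line\n";
    }
    ```

    On the sample line this prints 111.11.111.111/1162, 7070 and ! on separate lines.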

    You may also want to consider using split to break your line up into fields separated by spaces. The only processing you'd have to do at that point is to get rid of the colon following your last numeric.
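
    A rough sketch of the split approach follows. It assumes every log line has exactly the same layout as the sample, so the field positions (6, 10 and 11) are hard-coded guesses that would need adjusting for other message types.

    ```perl
    use strict;
    use warnings;

    my $line = 'ASCII data in TCP pkt from 111.11.111.111/1162 to local port 7070: !';

    # split ' ' breaks the line on runs of whitespace and ignores any
    # leading whitespace.
    my @fields = split ' ', $line;

    # Assumption: the layout never changes, so the address/port pair is
    # field 6, the local port is field 10 and the data is field 11.
    my $source  = $fields[6];
    my $port    = $fields[10];
    $port =~ s/:$//;               # get rid of the colon after the port
    my $payload = $fields[11];

    print "$source\n$port\n$payload\n";
    ```

    If the trailing data can itself contain spaces, it would be spread over several fields; joining the remaining fields with join(' ', @fields[11 .. $#fields]) is one way to put it back together, at the cost of collapsing runs of whitespace.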

    Note: Code samples are for conceptual use only and generally are not meant to be cut/pasted into a production application. Always 'use strict', have a thorough understanding of the code you use, and check the return values of functions and handle errors according to your needs. - Fastolfe