Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I am a new Perl user and am having a hard time with a regex. I am trying to extract certain elements of a file in order to pass them as arguments to another program.
Here is an example of the config file:
newjersey-ab1.net agg nj1-dpta1 net 10
newyork-cd1.net agg nyc-dpt1 net NA
I need a regex which will return "nj" and "nyc" from the above.
....snip
open (CONFIG, $config_file) || die "Cannot open $config_file:$!\n";
while (my $line = <CONFIG>) {
    chomp $line;
    next if ($line =~ /^\#/);
    my ($router, $cache, $tmp, $as, $sample) = split (' ', $line);
    if ($tmp =~ /(\w+)(.*)\-(.*)dta/) {
        my $host = $1;
        print "$host\n";
    }
}
doesn't quite do it. Can anyone help out?

Re: Simple Regex Question
by Fastolfe (Vicar) on Jan 30, 2001 at 00:36 UTC
    Update: I didn't realize a single space behaved like /\s+/, so ignore that bit below. That seems counter-intuitive to me, but whatever. That, and I'm used to specifying "real" patterns to split, not this awk-compatibility bit.

    You are splitting on a single space, which messes up if you have multiple spaces between your "fields" (u: this is the incorrect bit). You might use /\s+/ as your split delimiter instead.

    A regex to get the first set of non-numerics out of your 3rd field could be /([a-z]+)/ or /(\D+)/. Avoid the use of .* as you're doing, since it involves a bit of back-tracking and is generally less efficient than explicitly mapping out what you do want.
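As a minimal sketch of that suggestion, assuming the two sample lines from the question and whitespace-separated fields, the following pulls the leading lowercase letters out of the third field. The sub name leading_letters is hypothetical, and the ^ anchor is an addition to the pattern suggested above.

```perl
use strict;
use warnings;

# Sample lines copied from the question.
my @lines = (
    "newjersey-ab1.net agg nj1-dpta1 net 10",
    "newyork-cd1.net agg nyc-dpt1 net NA",
);

# Hypothetical helper: return the leading lowercase letters of field 3.
sub leading_letters {
    my ($line) = @_;
    my $field = (split ' ', $line)[2];   # third whitespace-separated field
    return unless defined $field;        # line had fewer than three fields
    return $field =~ /^([a-z]+)/ ? $1 : undef;
}

print leading_letters($_), "\n" for @lines;   # prints nj, then nyc
```

Whether [a-z]+ or \D+ is the better character class depends on what else can show up in that field.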

Re: Simple Regex Question
by arturo (Vicar) on Jan 30, 2001 at 00:39 UTC

    "Doesn't quite do it" isn't a lot of information to go on, I'm afraid. Next time you post (and I do encourage you to post again), please explain *what* the problem is (what results did you get, what else have you tried, etc.).

    Are you sure the fields in each line are separated by a space? Maybe it's tabs?

    You know that what you want is in the third field, whichever way the line ends up being split, and you can narrow that down further to what's to the left of the minus sign. So you can use split all the way through. Here's a snippet which makes a few assumptions, which I've tried to document.

    # assuming it's tabs; change /\t/ to /\s+/ as appropriate
    my ($router, $cache, $tmp, $as, $sample) = split (/\t/, $line);
    my $host = (split "-", $tmp)[0];   # grab LHS of $tmp
    $host =~ tr/0-9//d;                # strip any digits -- whether this is right
                                       # REALLY depends on your data

    HTH

    Philosophy can be made out of anything. Or less -- Jerry A. Fodor

Re: Simple Regex Question
by KM (Priest) on Jan 30, 2001 at 00:51 UTC
    Please, read the perlre man page, as well as pick up a copy of Mastering Regular Expressions.

    This should do what you want..

    while (my $line = <CONFIG>) {
        chomp $line;
        if ((split /\s+/,$line)[2] =~ /(\w+)-/i) {
            print $1;
        }
    }

    Cheers,
    KM

Re: Simple Regex Question
by lemming (Priest) on Jan 30, 2001 at 00:39 UTC
    I'm hoping it's just a spelling problem, but shouldn't you be looking for "dpt" instead of "dta"?

    By the way, the ' ' in split is the same as saying /\s+/, except that /\s+/ would produce a null field if there is leading white space.
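A small sketch of that difference, assuming a made-up line with two leading spaces (the variable names are only for illustration):

```perl
use strict;
use warnings;

my $line = "  newjersey-ab1.net agg nj1-dpta1";   # note the leading spaces

my @awk_style = split ' ', $line;     # leading whitespace is skipped
my @regex     = split /\s+/, $line;   # leading whitespace yields an empty first field

print scalar(@awk_style), "\n";   # 3
print scalar(@regex), "\n";       # 4
```

With /\s+/ the third field would therefore sit at index 3 instead of 2 on such a line.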

    Follow the rest of their advice though.
Re: Simple Regex Question
by sierrathedog04 (Hermit) on Jan 30, 2001 at 01:26 UTC
    Inside your if statement I would say:
    my $area = $tmp;
    # anchor the pattern at the start of the line using ^
    # then look for the third clump of characters and pick everything through up to the hyphen.
    # the ? turns off greedy matching, so you do not get messed up by duplicate occurrences of -dpt on the same line.
    $area =~ s/^.*\s+.*\s+(.*?)-dpt/\1/;
    return $area;
      $area =~ s/^.*\s+.*\s+(.*?)-dpt/\1/;

      Not very efficient. The RE engine will have to work more than you think to match that pattern. From my test of it, it will turn a line like this:

      asdasd	egg	nyc-dpt	net	10
      
      into
      nyc	net	10
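The test described here can be reproduced with a sketch like the one below, assuming the substitution is run against a whole tab-separated line. It uses $1 in the replacement rather than \1, which gives the same result without a warning.

```perl
use strict;
use warnings;

my $line = "asdasd\tegg\tnyc-dpt\tnet\t10";

# Copy the line, then apply the substitution to the copy.
(my $area = $line) =~ s/^.*\s+.*\s+(.*?)-dpt/$1/;

# Only the matched prefix is replaced; everything after -dpt survives.
print "$area\n";   # nyc<TAB>net<TAB>10
```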
      

      Did you test this before posting, or look at the other answers? :)

      Cheers,
      KM

        Your point is well taken. From now on I will test my answers first.

        My proposed solution is:

        use strict;
        my $row1 = "newjersey-ab1.net agg nj1-dpta1 net 10";
        $row1 =~ s/^.+\s+.+\s+(.+)-dpt.*$/\1/;
        print $row1;

        As far as looking at the other answers, yes I look at them. If my approach differs from the other answers then I like to throw it out there and see what people say.

        I agree with you that my answer may be inefficient; I really only started doing Perl seriously last year. My question to anyone who cares to answer is, why is it inefficient? And is this inefficiency lost in the noise of overall execution times, or would it be a problem in real life?
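One way to put numbers on that question is to time both approaches with the core Benchmark module. This is only a sketch: the sub names via_subst and via_split are hypothetical, the split variant uses the pattern /^([^-]+)-/ as a stand-in for the other answers, and the results will vary with the machine, the Perl version, and the data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $row = "newjersey-ab1.net agg nj1-dpta1 net 10";

# The substitution from the post above, applied to a copy of the row.
sub via_subst {
    (my $copy = $row) =~ s/^.+\s+.+\s+(.+)-dpt.*$/$1/;
    return $copy;
}

# Split first, then match only inside the third field.
sub via_split {
    my $field = (split ' ', $row)[2];
    return $field =~ /^([^-]+)-/ ? $1 : undef;
}

# Both return "nj1" for this row.
print via_subst(), " ", via_split(), "\n";

# Run each sub for at least one CPU second and print a comparison table.
cmpthese(-1, { subst => \&via_subst, split => \&via_split });
```

For a config file of a few hundred lines, either approach is likely to finish faster than the file can be opened, so the table matters mostly for much larger inputs.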