perlavi has asked for the wisdom of the Perl Monks concerning the following question:

I have an issue parsing strings held in a variable. I have a text file containing machine names, and I'm reading it line by line. When I try to match these machine names, it somehow does not work properly. For example, my text file has machine names such as machine1, machine11, machine12, machine2, machine20, machine23, machine30 (ef0a-14f09a-dfe230d-ea5f8d), etc. When I parse for machine11, the code still matches machine1 and not machine11. I have also used \Q and \E to escape special character sequences. I tried using \b in combination with them to resolve the issue, but it did not work.

my $machine = shift @_;
foreach my $line (sort @::Lines) {
    if ($line =~ m/\Q$machine\E\b/) {
        print "MACHINE = $machine \n";
    }
}

$machine is taken from the array which contains machine names. @::Lines is the array which looks like this.
\\hostname\Cpu(123:machine1)\%Ready
\\hostname\Cpu(3545:machine11)\%Used
\\hostname\Memory(3244:machine30 (ef0a-14f09a-dfe230d-ea5f8d))\Swapped MBytes
I have used \Q and \E in the pattern because my machine names sometimes contain special characters such as -, (, ) and :. Can \Q, \E and \b be used in combination with each other? If I use only \Q and \E, it doesn't work for machine names like machine1, machine11, etc., and when I use \b with \Q and \E, it doesn't work for machine names like machine30 (ef0a-14f09a-dfe230d-ea5f8d). Any idea how to match this kind of pattern when the pattern comes from a variable?
Thanks

Re: Parsing Issue while using Variable name in Pattern
by NetWallah (Canon) on Feb 16, 2012 at 04:37 UTC
    "\b" expects a transition between a "\w" and a "\W", so it will not match the transition between the ")" and ")" in your "machine30" case.

    If you know there is something that is not a "Word" character, immediately following the machine name, you can use:

    m/\Q$machine\E\W/
    to match all your cases.
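
A minimal sketch of that suggestion, run against the three sample lines from the question. The helper name lines_for and the variable names are made up for this illustration; only the pattern itself comes from the post above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines copied from the question; <<'END' keeps backslashes literal.
my @lines = split /\n/, <<'END';
\\hostname\Cpu(123:machine1)\%Ready
\\hostname\Cpu(3545:machine11)\%Used
\\hostname\Memory(3244:machine30 (ef0a-14f09a-dfe230d-ea5f8d))\Swapped MBytes
END

# Hypothetical helper: return the lines in which $machine is immediately
# followed by a non-word character.
sub lines_for {
    my ($machine, @candidates) = @_;
    return grep { /\Q$machine\E\W/ } @candidates;
}

for my $machine ('machine1', 'machine11', 'machine30 (ef0a-14f09a-dfe230d-ea5f8d)') {
    print "$machine : $_\n" for lines_for($machine, @lines);
}
```

Each name should pick out exactly one line here. Note that \W has to consume a character, so a name sitting at the very end of a line would not match; a negative lookahead such as (?!\w) is one alternative if that can happen in your data.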

                “PHP is a minor evil perpetrated and created by incompetent amateurs, whereas Perl is a great and insidious evil perpetrated by skilled but perverted professionals.”
            ― Jon Ribbens

      Hi NetWallah,
      Using \W worked perfectly fine. But I could not get the difference between \b and \W even though I was able to partially resolve the issue using \b. Using \b only worked for common strings in a machine name but not for machine names with special characters.

        From http://perldoc.perl.org/perlre.html#Regular-Expressions :
        \b Match a word boundary

        A word boundary (\b ) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W .

        So "xxyy\b" will match if the character following "xxyy" is NOT a word character, i.e. not in [a-zA-Z0-9_], or if the string ends there.

        However, if the terminating character in the match is NOT a "word" character, as in your case, where the terminating character is ")", then the "\b" expects the NEXT character to be a "word" character.

        Hope this makes sense. If not, please experiment.
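
One way to experiment, as a small sketch: the strings and the helper name matches below are invented for this illustration and are not taken from the original data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: does $pattern (a regex source string) match $text?
sub matches {
    my ($text, $pattern) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

my @cases = (
    [ 'abc)',  'abc\b'   ],  # "c" then ")"           : \w next to \W, boundary
    [ 'abc))', 'abc\)\b' ],  # ")" then ")"           : \W next to \W, no boundary
    [ 'abc)x', 'abc\)\b' ],  # ")" then "x"           : \W next to \w, boundary
    [ 'abc)',  'abc\)\b' ],  # ")" then end of string : counts as \W, no boundary
);

for my $case (@cases) {
    my ($text, $pattern) = @$case;
    printf "%-6s %-9s %s\n", $text, $pattern,
        matches($text, $pattern) ? 'match' : 'no match';
}
```

The second and fourth cases are the situation in the "machine30" name: the pattern ends in ")", so \b only succeeds when a word character follows.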


Re: Parsing Issue while using Variable name in Pattern
by kino (Initiate) on Feb 16, 2012 at 05:13 UTC

    I'm probably misunderstanding what you want, I'm rather new at this, but why not this?

     if ($line =~ /\((.*)\)\\/ ) {print "MACHINE=$1 \n";}

    That way it gets anything between '(' and ')\'
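
A rough sketch of how that idea could be taken one step further: capture the name once, then compare it with eq so that no boundary assertion is needed. This assumes every line has the shape (digits:name) followed by a backslash, which is only a guess based on the three sample lines; the helper name machine_of is made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper. Assumes every counter line looks like
#   ...(<digits>:<machine name>)\...
# and returns the machine name, or undef if the line does not fit.
sub machine_of {
    my ($line) = @_;
    return $line =~ /\(\d+:(.*)\)\\/ ? $1 : undef;
}

# Sample lines copied from the question; <<'END' keeps backslashes literal.
my @lines = split /\n/, <<'END';
\\hostname\Cpu(123:machine1)\%Ready
\\hostname\Cpu(3545:machine11)\%Used
\\hostname\Memory(3244:machine30 (ef0a-14f09a-dfe230d-ea5f8d))\Swapped MBytes
END

my $wanted = 'machine1';
for my $line (@lines) {
    my $name = machine_of($line);
    next unless defined $name;
    # Exact string comparison: "machine11" can never be mistaken for "machine1".
    print "MACHINE = $name\n" if $name eq $wanted;
}
```

The greedy (.*) backs off to the last ")" that is followed by a backslash, which is why the nested parentheses in the machine30 name stay inside the capture.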

    He picked the perl from the dying flesh and held it in his palm, and he turned it over and saw that its curve was perfect in the hand he had smashed against the gate; the torn flesh of the knuckles was turned grayish white by the sea water. ~Steinbeck "The Perl"
Re: Parsing Issue while using Variable name in Pattern
by Eliya (Vicar) on Feb 16, 2012 at 04:23 UTC

    Hm, I cannot reproduce the issue:

    #!/usr/bin/perl -w
    use strict;

    while (<DATA>) {
        chomp;
        push @::Lines, $_;
    }

    for my $machine (qw(machine1 machine11 machine12 machine2 machine20 machine23 machine30)) {
        for my $line (sort @::Lines) {
            if ($line =~ m/\Q$machine\E\b/) {
                print "$machine : $line\n";
            }
        }
    }

    __DATA__
    \\hostname\Cpu(123:machine1)\%Ready
    \\hostname\Cpu(3545:machine11)\%Used
    \\hostname\Memory(3244:machine30 (ef0a-14f09a-dfe230d-ea5f8d))\Swapped MBytes

    Output (as expected):

    machine1 : \\hostname\Cpu(123:machine1)\%Ready
    machine11 : \\hostname\Cpu(3545:machine11)\%Used
    machine30 : \\hostname\Memory(3244:machine30 (ef0a-14f09a-dfe230d-ea5f8d))\Swapped MBytes

    Update: sorry, misread your post... (quotes with "machine30 (ef0a-14f09a-dfe230d-ea5f8d)" would've made it clearer — which only goes to show that a piece of runnable code is often better than prose).

Re: Parsing Issue while using Variable name in Pattern
by tchrist (Pilgrim) on Feb 16, 2012 at 19:18 UTC
    The metasymbol pair \b and \B used in Perl patterns, zero-width assertions that respectively match a word boundary and a non-word boundary, causes no end of confusion. The precise definition of \b is:
    (?(?<=\w)(?!\w)|(?=\w)) # \b equivalent

    The corresponding definition for \B is:

    (?(?<=\w)(?=\w)|(?!\w)) # \B equivalent

    If, like most compassionate human beings, you prefer your regexes written for legibility and maintainability, you would write those in /x mode:

    # \b equivalent:
    (?(?<= \w)      # if there is a word character left
         (?! \w)    # then there must be no word character right
      |  (?= \w)    # else there must be a word character right
    )

    # \B equivalent:
    (?(?<= \w)      # if there is a word character left
         (?= \w)    # then there must be a word character right
      |  (?! \w)    # else there must be no word character right
    )
    Which should now presumably be more scrutable.

    Please note how both \b and \B alike are defined solely in terms of \w characters. There is absolutely no mention of \W in either of those definitions, let alone of ^ or $. This catches many people by surprise.
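
As a quick sanity check, the spelled-out forms can be compared with the built-ins by collecting every offset at which each one succeeds. This is only a sketch: the sample string and the helper name offsets are invented for the illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every offset in $text at which the
# zero-width pattern $re succeeds.
sub offsets {
    my ($re, $text) = @_;
    my @found;
    push @found, $-[0] while $text =~ /$re/g;
    return @found;
}

# Made-up sample with word characters, punctuation and spaces.
my $sample = 'machine30 (ef0a-14f09a))\Swapped MBytes';

my @pairs = (
    [ '\b', qr/\b/, qr/(?(?<=\w)(?!\w)|(?=\w))/ ],
    [ '\B', qr/\B/, qr/(?(?<=\w)(?=\w)|(?!\w))/ ],
);

for my $pair (@pairs) {
    my ($name, $builtin, $spelled) = @$pair;
    my $got_builtin = join ',', offsets($builtin, $sample);
    my $got_spelled = join ',', offsets($spelled, $sample);
    printf "%s: %s\n", $name,
        $got_builtin eq $got_spelled ? 'same offsets' : 'DIFFERENT offsets';
}
```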

    Now that you know exactly how word boundaries and nonboundaries work, you can craft your own boundaries by swapping in your own condition for wherever you see \w in the patterns above. You just need to be careful to specify a fixed-width condition so that it can be used in a lookbehind. That means you can’t use things like \X or \R, which are variable-width. The easiest way to do that is to use a property or other character class. For example, you could use \p{Greek} for characters in the Greek script—but best add Inherited so you don’t miss the combining characters, so use [\p{Greek}\p{Inherited}] instead.
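
For instance, here is a sketch of a boundary in which the hyphen counts as part of a "word", obtained by swapping [\w-] in for \w. The class [\w-], the variable $hyphen_boundary and the helper found_with are assumptions chosen for this illustration, not something taken from the text above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Custom boundary: same shape as the \b equivalent, with [\w-] instead of \w.
my $hyphen_boundary = qr/(?(?<=[\w-])(?![\w-])|(?=[\w-]))/;

# Hypothetical helper: is $word present in $text with $boundary on both sides?
sub found_with {
    my ($boundary, $word, $text) = @_;
    return $text =~ /$boundary\Q$word\E$boundary/ ? 1 : 0;
}

my $text = 'machine30 (ef0a-14f09a-dfe230d-ea5f8d)';

for my $word ('dfe230d', 'ef0a-14f09a-dfe230d-ea5f8d') {
    printf "%-28s \\b: %d  custom: %d\n", $word,
        found_with(qr/\b/, $word, $text),
        found_with($hyphen_boundary, $word, $text);
}
```

With \b, the fragment dfe230d is found because the hyphens around it are non-word characters. With the custom boundary only the whole hyphenated token is accepted.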

    Perhaps the most common custom boundary that people want to craft is the one that they thought \b was doing all along, but which, as has just been demonstrated, it is not.

    That is, they want a custom boundary that asserts that they are touching either whitespace or the edge of the string, in whichever direction makes sense there.

    # space boundary
    (?(?<= \S)      # if there is a nonspace character left
         (?! \S)    # then there must be no nonspace character right
      |  (?= \S)    # else there must be a nonspace character right
    )

    To show how that version operates, consider this:
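
A minimal sketch of such a demonstration. This is not the book's own example; the sample strings and the helper name found_with are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The space boundary from above, written without /x.
my $space_boundary = qr/(?(?<=\S)(?!\S)|(?=\S))/;

# Hypothetical helper: is $token present in $text with $boundary on both sides?
sub found_with {
    my ($boundary, $token, $text) = @_;
    return $text =~ /$boundary\Q$token\E$boundary/ ? 1 : 0;
}

my @cases = (
    [ '(beta)', 'alpha (beta) gamma'  ],  # whitespace on both sides
    [ '(beta)', 'alpha (beta), gamma' ],  # a comma touches the token
    [ 'gamma',  'alpha (beta) gamma'  ],  # token ends at the edge of the string
);

for my $case (@cases) {
    my ($token, $text) = @$case;
    printf "%-7s in %-22s \\b: %d  space: %d\n", $token, "'$text'",
        found_with(qr/\b/, $token, $text),
        found_with($space_boundary, $token, $text);
}
```

The first case is the one \b cannot handle: the token starts and ends with a non-word character, so there is no word boundary between the space and the parenthesis, while the space boundary only asks whether whitespace or the string edge is adjacent.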

    Whether that’s quite what you’re looking for, I cannot say. But you should have enough in your armament now to craft whatever sort of boundary you might desire.

    Most of the preceding text is excerpted from the section on “Building Custom Boundaries” beginning on page 308 of the just-released 4th Edition to Programming Perl.

    Enjoy.