http://qs1969.pair.com?node_id=699261


in reply to Re^3: Regex problems using '|'
in thread Regex problems using '|'

I'm not sure why the match is failing with the alternation when it matches correctly without it. I understand that if the first pattern does not match, it will go on to the second half. That's the behavior I want with regards to the rest of the records. Why is the match failing? That's what I do not understand, since it works correctly as long as the alternation is removed.

Re^5: Regex problems using '|'
by moritz (Cardinal) on Jul 22, 2008 at 11:04 UTC
    If I understood your earlier reply correctly, the regex does match (with the alternation), but it doesn't match the way you want. That's a big difference, and what I tried to explain to you.

    "That's the behavior I want with regards to the rest of the records."

    After looking at the updated data I think that you need two regexes for that:

    use strict;
    use warnings;
    my $str = do { local $/; <DATA> };
    if ($str =~ m/Remediation Report\n\n(.+?)\n/g){
        print $1, $/;
        while ($str =~ m/\n\n(.*)\n/g){
            print $1, $/;
        }
    }
    __DATA__
    thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
    MIME-Version: 1.0
    # rest of data goes here

    The output is:

    Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
    Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
    Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x

    The trick is to use the /g-modifier on the first regex although it matches only once. That way pos $str will not be reset, and the next regex match starts where the previous left off.

    Also note that ^ will anchor to the start of the string (not to the start of a line) unless the /m modifier is present.
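A small sketch of the difference the /m modifier makes. The two-line sample string is made up for illustration and is not taken from the data in this thread.

```perl
use strict;
use warnings;

# Hypothetical two-line sample; not taken from the original data.
my $text = "first line\nsecond line\n";

# Without /m, ^ can only match at the very start of the string.
my $without_m = ($text =~ /^second/)  ? 1 : 0;

# With /m, ^ also matches right after each embedded newline.
my $with_m    = ($text =~ /^second/m) ? 1 : 0;

print "without /m: $without_m\n";   # 0
print "with /m:    $with_m\n";      # 1
```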

      I think I see where I'm not being clear in my question. Please bear with me on this, as I don't understand why it isn't working, and I'm trying more for understanding than function. I can always beat at it until it functions; I'd rather understand why it doesn't work the way I expect, i.e. where are my expectations wrong?

      The record in question is one large string with newlines inside it, right? I'm not sure why the first part of the pattern with alternation would not match the "Remediation Report" and instead use the second half (because it *does* match if there isn't an alternative), unless... does the regex engine still treat this one large string as multiple strings, separated by the newlines? In other words, why would it skip over a match that works? Does it evaluate each "internal string" in turn?

      I updated the example data above to explain how each record is broken up better. I think the Data::Dumper output was a bit confusing. There is only one vulnerability name in each record, so /g shouldn't apply (I believe).

        Your expectation is that if you have /pat1|pat2/, the regexp engine will first try to find 'pat1' anywhere in the string, and only then pat2.

        That's not how the regexp engine works. Instead, it will find pat1 or pat2, whichever starts first (leftmost) in the string. Only if pat1 and pat2 can both match starting at the same character does the order around | become important: the engine then picks the alternative that is written first in the pattern, i.e. pat1.

        Now, in your example, pat2 is a pattern that matches at the beginning of the string, while pat1 doesn't. So pat2 matches, not pat1.
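To see the "leftmost start wins" rule in isolation, here is a minimal sketch. The literal strings pat1 and pat2 simply stand in for the two alternatives, and the sample string is invented.

```perl
use strict;
use warnings;

# Hypothetical sample: 'pat2' occurs earlier in the string than 'pat1'.
my $str = "xx pat2 yy pat1 zz";

my ($matched, $offset);
if ($str =~ /(pat1|pat2)/) {
    $matched = $1;      # the text that was actually matched
    $offset  = $-[0];   # offset at which the match starts
}

# pat1 is written first in the pattern, yet pat2 is what matches,
# because it starts further to the left in the string.
print "matched '$matched' at offset $offset\n";
```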

        If you want pat1 to match whenever it occurs anywhere in the string, and pat2 to be tried only if pat1 doesn't match at all, use two separate patterns:

        $str =~ /pat1/ or $str =~ /pat2/;
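Spelled out with capture groups, the two-pattern approach might look like the sketch below. The helper name first_choice and both sample records are hypothetical; only the two regexes come from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: prefer the line after "Remediation Report",
# and fall back to the first line of the record otherwise.
sub first_choice {
    my ($record) = @_;
    return $1 if $record =~ /Remediation Report\n\n(.+?)\n/;
    return $1 if $record =~ /^(.+?)\n/;
    return;
}

# Invented records, only loosely modelled on the thread's data.
my $with_report    = "Header: x\nRemediation Report\n\nSome Vulnerability\nmore\n";
my $without_report = "Another Vulnerability\nmore text\n";

print first_choice($with_report), "\n";
print first_choice($without_report), "\n";
```

Because each pattern is a separate match, the first one gets to scan the whole string before the second one is tried at all.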

        Alternatively, make pat1 start matching at the beginning of the string as well. The leading part has to be allowed to cross newlines, hence the (?s:...) group:

        $vulnerabilityText =~ m/^(?s:.+?)Remediation Report\n\n(.+?)\n|^(.+?)\n/;

        "i.e. where are my expectations wrong?"

        You seem to assume that, for an alternation $a|$b, the regex engine does the following:

        1. It searches for alternative $a in the string
        2. If it doesn't find a match, it tries alternative $b

        However, that's not the case. It does this:

        1. anchor pattern at start of string
        2. try to match alternative $a
        3. if it fails, try to match alternative $b
        4. if there's still no match, anchor pattern at the second character in the string, and start again from No. 2
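
The four steps above can be mimicked in plain Perl. This is only a rough simulation with a hypothetical function name (the real engine is heavily optimized), but it tries the alternatives in the order described.

```perl
use strict;
use warnings;

# Rough simulation only; simulate_alternation is a made-up name and the
# alternatives are passed in as plain pattern strings.
sub simulate_alternation {
    my ($str, @alternatives) = @_;
    for my $start (0 .. length($str)) {      # steps 1 and 4: move the anchor
        for my $alt (@alternatives) {        # steps 2 and 3: try each alternative
            pos($str) = $start;
            # \G pins the attempt to exactly the current start position.
            if ($str =~ /\G$alt/gc) {
                return ($start, $alt);
            }
        }
    }
    return;
}

my ($where, $which) = simulate_alternation("xx pat2 yy pat1 zz", 'pat1', 'pat2');
print "alternative $which matched at offset $where\n";
```

pat1 is tried first at every position, but it only gets a chance to succeed at an offset the loop never reaches, because pat2 succeeds earlier.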

        Perhaps you want something along these lines:

        m/(?s:.*)Remediation Report\n\n(.+?)\n|^(.+?)\n/;

        That searches for the Remediation Report\n\n(.+?)\n part of the regex anywhere in your string, and only if that fails it tries the second alternative.
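
A quick way to check that behaviour is to run the pattern against one record that contains the marker and one that does not. Both records here are invented; note that the result lands in $1 or $2 depending on which alternative matched.

```perl
use strict;
use warnings;

my $re = qr/(?s:.*)Remediation Report\n\n(.+?)\n|^(.+?)\n/;

# Hypothetical records, only loosely modelled on the thread's data.
my $with_report    = "MIME-Version: 1.0\n\nRemediation Report\n\nSome Vulnerability\nrest\n";
my $without_report = "Another Vulnerability\nrest\n";

my @names;
for my $record ($with_report, $without_report) {
    if ($record =~ $re) {
        # Only the capture group of the alternative that matched is defined.
        push @names, defined $1 ? $1 : $2;
    }
}
print "$_\n" for @names;
```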

        "The record in question is one large string with newlines inside it"

        In the example script I posted, yes.

        "I updated the example data above to explain how each record is broken up better. I think the Data::Dumper output was a bit confusing. There is only one vulnerability name in each record, so /g shouldn't apply (I believe)."

        In scalar context the /g modifier doesn't mean "match as often as you can", but rather "start your match at pos $str, and set pos $str after the match". That means you can say stuff like this:

        while ($str =~ m/($regex)/g){
            print $1, "\n";
        }

        But it's not the only application. You can use it to preserve the pos $str value, and then apply a different regex against it.
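
As a minimal sketch of that second use: the first match only positions pos $str, and a different regex then continues from there. The input string and the key=value format are assumptions made up for this example.

```perl
use strict;
use warnings;

# Hypothetical input: a marker followed by key=value pairs.
my $str = "junk=0 START a=1 b=2";

my @pairs;
if ($str =~ /START/g) {                  # /g leaves pos($str) just after "START"
    while ($str =~ /(\w+)=(\d+)/g) {     # continues from pos($str), so junk=0 is skipped
        push @pairs, "$1:$2";
    }
}
print join(',', @pairs), "\n";           # a:1,b:2
```

Without the /g on the first match, the loop would start from the beginning of the string and pick up junk=0 as well.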