Sorry about my earlier posts. Here is the regex in a nice format:
/^\s*(\w+),\s*(\w+ \w+)(.+?\s*LLP)/m;
My question now is: why did I extract "New" instead of "New York"? Any suggestions?

Re^3: String contents
by davido (Cardinal) on Jun 29, 2012 at 09:26 UTC

    My suggestion is to post, in a single node, both the sample text and the regular expression that fails to match where you expect it to. Format them well, using the tips found in Writeup Formatting Tips, and be sure that you're posting exact copies of the text and code that fail.

    When I tested the text you provided against the regexp you provided in the preceding node, I got the following results:

    Capture variables:

    • Digit Captures
      • $1 => Melville
      • $2 => New York
      • $3 =>                             /s/KPMG LLP
    • ${^PREMATCH}  => ended June 30, 2001 in conformity with accounting principles generally accepted
      in the United States of America. Also in our opinion, the related financial
      statement schedule, when considered in relation to the basic consolidated
      financial statements taken as a whole, presents fairly, in all material
      respects, the information set forth therein.
      
    • ${^MATCH}     => 
      Melville, New York                            /s/KPMG LLP
    • ${^POSTMATCH} => 
      September 26, 2001
      STR
    • $^N           =>                             /s/KPMG LLP
    • @- => (352,354,364,372)
    • @+ => (411,362,372,411)

    The text I used was exactly this:

    ended June 30, 2001 in conformity with accounting principles generally accepted
    in the United States of America. Also in our opinion, the related financial
    statement schedule, when considered in relation to the basic consolidated
    financial statements taken as a whole, presents fairly, in all material
    respects, the information set forth therein.

    Melville, New York                            /s/KPMG LLP
    September 26, 2001
    STR

    And the regexp I used was exactly this:

    /^\s*(\w+),\s*(\w+ \w+)(.+?\s*LLP)/m

    Try it yourself with my regexp tester, here: Perl Regex Tester
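The same check can also be run locally. What follows is only a minimal sketch: it assumes the sample text has the line breaks shown in the ${^PREMATCH} output above, and the variable names ($text, $city, $state, $signature) are made up for illustration.

```perl
use strict;
use warnings;

# Sample text from this node, one row per line as in the capture dump.
my $text = <<'END_TEXT';
ended June 30, 2001 in conformity with accounting principles generally accepted
in the United States of America. Also in our opinion, the related financial
statement schedule, when considered in relation to the basic consolidated
financial statements taken as a whole, presents fairly, in all material
respects, the information set forth therein.

Melville, New York                            /s/KPMG LLP
September 26, 2001
STR
END_TEXT

# The regexp exactly as posted; the m modifier lets ^ match at the start of any line.
my ($city, $state, $signature) =
    $text =~ /^\s*(\w+),\s*(\w+ \w+)(.+?\s*LLP)/m;

if (defined $city) {
    print "city:      '$city'\n";
    print "state:     '$state'\n";
    print "signature: '$signature'\n";
}
else {
    print "no match\n";
}
```

Because the second group is a fixed `\w+ \w+`, it should report 'New York' for this text, with the run of spaces before /s/KPMG ending up in the third capture.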


    Dave

      Sorry, first time here. Still trying to find my way around. Yes, that code works now, but when I was trying to generalize it, I failed. Here is the new code:
      /^\s*(\w+|\w+ \w+|\w+ \w+ \w+),\s*(\w+|\w+ \w+|\w+ \w+ \w+)\s*(.+?\s*LLP)/m

        So in generalizing it you regressed. That happens. You could revert to your previous regexp, and then start again at trying to generalize it, but this time keeping closer track of the spaces, word characters, etc.

        It doesn't seem to me that the more complicated solution (this most recent one) is actually superior. It's just more confusing. When regexps start getting too confusing, it's time to try again, or to break the problem up into smaller chunks. ...and of course it's also time for the "/x" modifier. :)


        Dave
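One likely reason for the regression: Perl tries alternatives left to right and keeps the first one that lets the rest of the pattern succeed. With `\w+` listed first, the state group can stop at "New", and the lazy `.+?` then absorbs "York" along with the signature. The sketch below lays the generalized regexp out with /x and compares it with a longest-first ordering. The names $shortest_first and $longest_first are made up for illustration, and the reordering is one possible fix rather than the only one.

```perl
use strict;
use warnings;

my $line = 'Melville, New York                            /s/KPMG LLP';

# The generalized regexp from the thread, spread out with the x modifier.
# Under x a bare space in the pattern is ignored, so a literal space is written [ ].
my $shortest_first = qr/
    ^ \s*
    ( \w+ | \w+[ ]\w+ | \w+[ ]\w+[ ]\w+ ) ,     # capture 1: city
    \s*
    ( \w+ | \w+[ ]\w+ | \w+[ ]\w+[ ]\w+ )       # capture 2: state
    \s*
    ( .+? \s* LLP )                             # capture 3: signature
/mx;

# Same pattern with the alternatives reordered so the longest is tried first.
my $longest_first = qr/
    ^ \s*
    ( \w+[ ]\w+[ ]\w+ | \w+[ ]\w+ | \w+ ) ,     # capture 1: city
    \s*
    ( \w+[ ]\w+[ ]\w+ | \w+[ ]\w+ | \w+ )       # capture 2: state
    \s*
    ( .+? \s* LLP )                             # capture 3: signature
/mx;

my @before = $line =~ $shortest_first;
my @after  = $line =~ $longest_first;

print "shortest first: state = '$before[1]'\n";
print "longest first:  state = '$after[1]'\n";
```

On this line the first pattern should give 'New' for the state and the second 'New York'. Note that comments inside an /x pattern are still subject to variable interpolation and delimiter matching, so they are best kept free of `$`, `@` and `/`.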

      I have to modify it because the state could sometimes be "District of Columbia" or "Virginia".
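One way to allow one to three words without spelling out every alternative is a greedy quantified group. This is only a sketch: $name and $re are hypothetical names, the second and third sample lines are invented for illustration, and only the first comes from the thread.

```perl
use strict;
use warnings;

# One word, optionally followed by up to two more space-separated words.
# Greedy, so it takes as many words as the rest of the pattern allows.
my $name = qr/\w+(?:[ ]\w+){0,2}/;

my $re = qr/
    ^ \s*
    ( $name ) , \s*      # capture 1: city
    ( $name )            # capture 2: state
    \s*
    ( .+? \s* LLP )      # capture 3: signature
/mx;

my @lines = (
    'Melville, New York                            /s/KPMG LLP',     # from the thread
    'Washington, District of Columbia              /s/Example LLP',  # invented sample
    'Richmond, Virginia                            /s/Example LLP',  # invented sample
);

my @results;
for my $line (@lines) {
    if (my ($city, $state, $sig) = $line =~ $re) {
        push @results, [$city, $state, $sig];
        print "$city | $state | $sig\n";
    }
    else {
        print "no match: $line\n";
    }
}
```

A caveat with this approach: if only a single space separates the state from a signature that begins with a word character, the greedy group can swallow part of the signature. If the real data always has a run of spaces there, requiring `\s{2,}` before the third group may be a safer choice.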