inblosam has asked for the wisdom of the Perl Monks concerning the following question:

This should be a relatively simple question for the regex-minded. I have a lot of learning to do still. I have a document that I am parsing and I will have the people who write the documents put in a pseudo-XML tag if they want a blank page, like so:

<blank title="Title of My Page">

However, right now it only works if there are no spaces in the attribute title's value. So, TitleofMyPage would work with the following code:
#text is put in an array called page_text
foreach my $new_line(@page_text) {
    #put each line into another array
    my @blankPages = ($new_line =~ m/<blank[^>]+?title\s*=\s*["']?([^'" >]+?)[ '"]?>/sig);
    foreach my $blankTitle(@blankPages) {
        print "$blankTitle is the title\n";
        #do some logic to add a blank page with that title
    }
}

Of course I want it so I can put spaces in there. I was also wondering if it would be difficult to allow " and ' marks in there, because those throw it off too.
The last thing I couldn't figure out was how to write an if statement that fails when the line contains the tag, because of course I don't want my text to include the xml tag. The following doesn't seem to work, even when I take out the parentheses that grab the text I want (refer to the regex in the code above):
if ($new_line =! m/sameregexasabove?/sig) {
    #don't write a new line with the xml tag, but any other text is okay
}
Your help would be much appreciated!


Michael Jensen
michael at inblosam.com
http://www.inblosam.com
ipod user
powerbook user

Replies are listed 'Best First'.
Re: Regex and a pseudo-XML tag
by Abigail-II (Bishop) on Nov 20, 2003 at 15:17 UTC
    Well, your regex explicitly forbids spaces:
    [^'" >]+
    so I wonder why you are surprised it only 'works' if there are no spaces in the value. If you want spaces, don't forbid them.

    Having said that, there are many more problems with your regexp. Why don't you use one of the many XML parsing modules from CPAN?

    Abigail

      That works great! I didn't know that the "space greater than" excluded whitespace, so I learned something new! THANKS!
        "space greater than" does not exclude whitespace.

        If you use a character class (the thing inside [ ... ]), you exclude the characters listed in it when the first character after the [ is a ^. The ^ negates the list within the square brackets, so the class matches any character not in the list.
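A minimal sketch of the point, assuming a hypothetical helper named leading_allowed that is not part of the original code. It applies the same negated class from the question to the start of a string, so you can see the match stop at the first space:

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): return the run of
# characters at the start of $text that are NOT ', ", space, or >.
sub leading_allowed {
    my ($text) = @_;
    my ($run) = $text =~ /^([^'" >]+)/;
    return $run;    # undef if the string starts with an excluded character
}

print leading_allowed('TitleofMyPage">'), "\n";     # prints TitleofMyPage
print leading_allowed('Title of My Page">'), "\n";  # prints Title
```

The second call stops after "Title" because the space is one of the characters listed after the ^, which is why the original regex only handled titles without spaces.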

        CountZero

        "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

Re: Regex and a pseudo-XML tag
by Roy Johnson (Monsignor) on Nov 20, 2003 at 15:22 UTC
    For robustness, you should probably look into Text::Balanced. For quick and dirty, which means putting more limits on what you will accept (no nesting or line breaks, for example), this might get you there:
    m/<blank\s+title\s*=\s*(["'])?(.*?)$1\s*>/
    Update: As noted in the reply below, the $1 should be a \1.
      In order to use a captured string as part of a regex match, you need to use a backslash in front of the digit, not a dollar sign:
      m/<blank\s+title\s*=\s*(["'])?(.*?)\1\s*>/
      To show a simpler example, consider:
      $_ = "here are ::framed words:: to capture";
      if ( /(::)(.*?)\1/ ) {
          print "text framed by $1 was <$2>\n";
      }
      If you try it with $ instead of \, it doesn't work.
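Putting the backreference to work on the original question, here is a rough sketch rather than a definitive solution. It assumes the title value is always quoted (the opening quote is required here, since in Perl a \1 that refers to a group that did not take part in the match makes the match fail), and the names $blank_tag, blank_titles, and is_plain_text are made up for illustration. It also uses the negated binding operator !~ for the "skip lines with the tag" test; the =! in the question parses as an assignment of the negated match result, which is presumably not what was intended.

```perl
use strict;
use warnings;

# Assumed pattern: a quoted title whose closing quote must be the same
# character as the opening one (\1), so the other quote type and spaces
# are allowed inside the value.
my $blank_tag = qr/<blank\s+title\s*=\s*(["'])(.*?)\1\s*>/i;

# Hypothetical helper: collect every title found on one line.
sub blank_titles {
    my ($line) = @_;
    my @titles;
    while ($line =~ /$blank_tag/g) {
        push @titles, $2;    # $2 is the text between the matching quotes
    }
    return @titles;
}

# Hypothetical helper: true when the line does NOT contain the tag.
sub is_plain_text {
    my ($line) = @_;
    return $line !~ $blank_tag ? 1 : 0;
}

print join(", ", blank_titles(q{<blank title="Title of My Page">})), "\n";  # prints Title of My Page
print join(", ", blank_titles(q{<blank title="Bob's Page">})), "\n";        # prints Bob's Page
print is_plain_text("Just some text"), "\n";                                # prints 1
```

This still carries the limits mentioned above: no line breaks inside the tag, and a value cannot contain the same quote character that delimits it.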