Regex and a pseudo-XML tag

inblosam has asked for the wisdom of the Perl Monks concerning the following question:

This should be a relatively simple question for the regex-minded. I have a lot of learning to do still. I have a document that I am parsing and I will have the people who write the documents put in a pseudo-XML tag if they want a blank page, like so:

<blank title="Title of My Page">

However, right now it only works if there are no spaces in the title attribute's value. So, TitleofMyPage would work with the following code:

#text is put in an array called page_text
foreach my $new_line(@page_text) {
        #put each line into another array
        my @blankPages = ($new_line =~ m/<blank[^>]+?title\s*=\s*["']?([^'" >]+?)[ '"]?>/sig);

        foreach my $blankTitle(@blankPages) {
                print "$blankTitle is the title\n";
                #do some logic to add a blank page with that title
        }
}

Of course I want it so I can put spaces in there. I was also wondering if it would be difficult to allow " and ' marks in there, because those throw it off too.
The last thing I couldn't figure out was how to write an if statement that fails when the tag is found, because of course I don't want my text to include the XML tag. The following doesn't seem to work, even when I take out the parentheses that grab the text I want (refer to the regex in the code above):

if ($new_line =! m/sameregexasabove?/sig) {
       #don't write a new line with the xml tag, but any other text is okay
}

Your help would be much appreciated!

Michael Jensen
michael at inblosam.com
http://www.inblosam.com
ipod user
powerbook user

Re: Regex and a pseudo-XML tag by Abigail-II (Bishop) on Nov 20, 2003 at 15:17 UTC
Well, your regex explicitly forbids spaces: `[^'" >]+` so I wonder why you are surprised it only 'works' if there are no spaces in the value. If you want spaces, don't forbid them. Having said that, there are many more problems with your regexp. Why don't you use one of the many XML parsing modules from CPAN?

Abigail
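
As a rough illustration of "don't forbid them", here is a minimal sketch in which the space is dropped from the negated class. The sample lines in `@page_text` are made up for the demo, and the pattern assumes the title is always quoted, which is stricter than the original regex.

```perl
use strict;
use warnings;

# Hypothetical sample input; only the tag format comes from the question.
my @page_text = (
    'Some ordinary text',
    '<blank title="Title of My Page">',
    "<blank title='TitleofMyPage'>",
);

my @titles;
foreach my $new_line (@page_text) {
    # [^'">] no longer lists a space, so spaces are allowed in the value.
    # Requiring a quote on both sides is an assumption about the input.
    push @titles, $new_line =~ m/<blank\s+title\s*=\s*["']([^'">]+)["']\s*>/ig;
}

print "$_ is the title\n" for @titles;
```

This still does not allow a quote character inside the title, and it is no substitute for a real parser if the tags get more complicated.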
Re: Re: Regex and a pseudo-XML tag by inblosam (Monk) on Nov 20, 2003 at 16:27 UTC
That works great! I didn't know that the "space greater than" excluded whitespace, so I learned something new! THANKS!
Re: Re: Re: Regex and a pseudo-XML tag by CountZero (Bishop) on Nov 20, 2003 at 20:31 UTC
"space greater than" does not exclude whitespace. What matters is the character class (the part inside `[ ... ]`): if the first character after the `[` is a `^`, the class is negated, so it matches any character that is not in the list. Your class listed a space, so spaces were excluded.

CountZero
"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
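
To make the negation concrete, the following small sketch runs two negated classes against the same string. The string in `$value` is an invented example; the first class is the one from the question, the second is the same class with the space removed from the list.

```perl
use strict;
use warnings;

# Invented example: the tail of a tag, starting at the attribute value.
my $value = 'Title of My Page">';

# [^'" >]+ : one or more characters that are NOT a quote, a space, or '>'.
my ($without_space) = $value =~ /^([^'" >]+)/;

# [^'">]+ : the same class without the space in the list,
# so a space now counts as an allowed character.
my ($with_space) = $value =~ /^([^'">]+)/;

print "class listing a space matched:     $without_space\n";
print "class not listing a space matched: $with_space\n";
```

The first match stops at the first space, while the second runs up to the closing quote.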
Re: Re: Re: Re: Regex and a pseudo-XML tag by inblosam (Monk) on Nov 21, 2003 at 03:11 UTC
Re: Regex and a pseudo-XML tag by Roy Johnson (Monsignor) on Nov 20, 2003 at 15:22 UTC
For robustness, you should probably look into Text::Balanced. For quick and dirty, which means putting more limits on what you will accept (no nesting or line breaks, for example), this might get you there: `m/<blank\s+title\s*=\s*(["'])?(.*?)$1\s*>/` Update: As noted in the reply below, the `$1` should be a `\1`.
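
A sketch of the quick-and-dirty approach, using the `\1` form mentioned in the update. It differs from the pattern above in one respect: the quote is required rather than optional, which is an assumption made to keep the backreference simple. `extract_blank_title` is a hypothetical helper name, and the sample lines are invented.

```perl
use strict;
use warnings;

# Returns the title if the line holds a quoted <blank title=...> tag,
# otherwise undef. The helper name is made up for this sketch.
sub extract_blank_title {
    my ($line) = @_;
    # (["']) captures the opening quote; \1 requires the same character
    # to close the value, so the other kind of quote may appear inside.
    if ( $line =~ m/<blank\s+title\s*=\s*(["'])(.*?)\1\s*>/i ) {
        return $2;
    }
    return;
}

my @samples = (
    q{<blank title="Michael's Page">},
    q{<blank title='The "Big" Page'>},
    q{<blank title="mismatched'>},
);

for my $line (@samples) {
    my $title = extract_blank_title($line);
    print defined $title ? "title: $title\n" : "no match: $line\n";
}
```

Because the closing quote must equal the opening one, a double-quoted title can contain apostrophes and a single-quoted title can contain double quotes. A title containing both kinds of quote is still out of reach for this pattern.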
Re: Re: Regex and a pseudo-XML tag by graff (Chancellor) on Nov 21, 2003 at 03:28 UTC
In order to use a captured string later in the same regex, you need to use a backslash in front of the digit, not a dollar sign: `m/<blank\s+title\s*=\s*(["'])?(.*?)\1\s*>/` To show a simpler example, consider: `$_ = "here are ::framed words:: to capture"; if ( /(::)(.*?)\1/ ) { print "text framed by $1 was <$2>\n"; }` If you try it with $ instead of \, it doesn't work.
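
The original question also asked how to skip lines that contain the tag. One point worth noting is that `=!` is not a match operator: `$new_line =! m/.../` parses as `$new_line = !(m/.../)`, which matches against `$_` and overwrites `$new_line`. The negated binding operator is `!~`. A minimal sketch, assuming quoted titles and using invented sample lines and the hypothetical names `$blank_tag` and `@kept`:

```perl
use strict;
use warnings;

# Hypothetical sample input.
my @page_text = (
    'First paragraph',
    '<blank title="Title of My Page">',
    'Second paragraph',
);

# Precompiled pattern; \1 refers to the quote captured by the first group.
my $blank_tag = qr/<blank\s+title\s*=\s*(["'])(.*?)\1\s*>/i;

my @kept;
foreach my $new_line (@page_text) {
    # !~ is true when the pattern does NOT match the left-hand string.
    if ( $new_line !~ $blank_tag ) {
        push @kept, $new_line;
    }
}

print "$_\n" for @kept;
```

Only the lines without the tag end up in `@kept`.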