regex in html

voyager has asked for the wisdom of the Perl Monks concerning the following question:

(jeffa) Re: regex in html by jeffa (Bishop) on Apr 01, 2001 at 20:27 UTC
You are close. First, you have to undefine the input record separator if you want to slurp the whole file in one read; otherwise you will only get data up to the first newline encountered: `undef $/; $current = <DATA>; $start = '<!---CURCON-->'; $end = '<!---END CURCON-->'; my ($match) = $current =~ m/$start(.*)$end/s; print $match; __DATA__ stuff i don't want <!---CURCON--> stuff i do want <!---END CURCON--> more stuff i don't want` You are correctly using the 's' modifier for your regex, but instead of using s///, use m// and capture $1 in another variable. The trick is that you have to catch $1 in list context: `my ($match) = $current =~ m/$start(.*)$end/s; # note the parens around $match` Otherwise the match is evaluated in scalar context, and $match will only hold a true or false value (1 if the match succeeded). Now $match will contain a newline at the beginning as well as one at the end. To strip them (note that both forms remove every newline in $match, not only the outer two): `$match =~ tr/\n//d; # or $match =~ s/\n//g;` Jeff R-R-R--R-R-R--R-R-R--R-R-R--R-R-R-- L-L--L-L--L-L--L-L--L-L--L-L--L-L--
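A minimal sketch of the list-context versus scalar-context behaviour described above. The sample string and the sub name `between_markers` are assumptions made for illustration; only the two markers come from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample page; only the markers are taken from the thread.
my $page = "stuff i don't want\n<!---CURCON-->\nstuff i do want\n<!---END CURCON-->\nmore stuff\n";

# Return the text between the markers, or undef if they are absent.
sub between_markers {
    my ($text) = @_;
    # List context: the match returns the captured group(s).
    my ($match) = $text =~ m/<!---CURCON-->(.*)<!---END CURCON-->/s;
    return $match;
}

# Scalar context: the same match only reports success (1) or failure ('').
my $found = $page =~ m/<!---CURCON-->(.*)<!---END CURCON-->/s;

print "list context:   [", between_markers($page), "]\n";
print "scalar context: [$found]\n";
```

The captured text still carries the newline after the opening marker and the one before the closing marker, which is the point raised at the end of the post.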
Re: regex in html by cLive ;-) (Prior) on Apr 02, 2001 at 04:33 UTC
TIMTOWTDI... But also, I think Jeff misunderstands how your data is coming in. I'm assuming you're opening a file before the code you listed, and not using the __DATA__ token in your script. I don't like redefining $/, especially as shown by Jeff, because it's not local and may cause issues later in your program. If you insist, use: `# assuming DATA filehandle opened for reading... # declare my $current; # begin local code block { # locally define $/ local $/ = undef; # slurp $current = <DATA>; # end local code block }` For more on $/, see '6.7. Reading Records with a Pattern Separator' in The Perl Cookbook. But I'd do it this way, anyway... `# open open (DATA,"/path/to/webpage.htm") || die "Can't open page - $!"; # slurp $current = join '', (<DATA>); # close close(DATA); # match $current =~ /<!---CURCON-->\n(.*?)\n<!---END CURCON-->/s; # store my $match = $1;` Jeff's match also grabs an extra \n at the beginning and end, which you may not need (small point :) Hope this makes sense. cLive ;-)
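One way the localized slurp could be packaged, as a sketch only. The sub name `slurp` and the sample text are hypothetical, and an in-memory filehandle (Perl 5.8 or later) stands in for the real web page so the example runs on its own.

```perl
use strict;
use warnings;

# Read a whole filehandle into one string without changing $/ globally.
sub slurp {
    my ($fh) = @_;
    local $/;              # undefined only until this sub returns
    return scalar <$fh>;   # one read returns the entire contents
}

# Stand-in for the real file (assumption for the demo): an in-memory filehandle.
my $html = "header\n<!---CURCON-->\nstuff i do want\n<!---END CURCON-->\nfooter\n";
open my $fh, '<', \$html or die "Can't open in-memory file: $!";
my $current = slurp($fh);
close $fh;

# Same pattern as in the post: the \n on each side keeps the outer newlines out.
my ($match) = $current =~ /<!---CURCON-->\n(.*?)\n<!---END CURCON-->/s;
print "$match\n";
```

Because `local` restores `$/` when the sub returns, later line-by-line reads elsewhere in the program are unaffected.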
(jeffa) Re: Re: regex in html by jeffa (Bishop) on Apr 02, 2001 at 17:29 UTC
"I think Jeff misunderstands how your data is coming in" Nope. You said it right the first time: TIMTOWTDI ;) I mentioned the extra newlines; I did not address them because I did not know EXACTLY how the data will look EVERY time. What if there are multiple blank lines? `my ($match) = $current =~ m/$start\s*(.*?)\s*$end/s;` But thanks for sharing comments and criticisms, don't get me wrong, ++cLive ;-) :) Jeff R-R-R--R-R-R--R-R-R--R-R-R--R-R-R-- L-L--L-L--L-L--L-L--L-L--L-L--L-L--
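A small sketch of the whitespace-tolerant match when several blank lines surround the wanted text. The sub name `trimmed_between` and the sample data are made up for illustration. It assumes a non-greedy capture so that the trailing `\s*` can absorb the whitespace before the end marker, and `\Q...\E` is added as a precaution in case the markers ever contain regex metacharacters.

```perl
use strict;
use warnings;

my $start = '<!---CURCON-->';
my $end   = '<!---END CURCON-->';

# Capture the text between the markers, minus surrounding whitespace.
sub trimmed_between {
    my ($text) = @_;
    # The leading \s* eats blank lines after the start marker; the non-greedy
    # capture leaves the whitespace before the end marker to the trailing \s*.
    my ($match) = $text =~ m/\Q$start\E\s*(.*?)\s*\Q$end\E/s;
    return $match;
}

# Hypothetical input with several blank lines around the wanted text.
my $sample = "junk\n$start\n\n\n  stuff i do want\nsecond line\n\n$end\njunk\n";
print "[", trimmed_between($sample), "]\n";
```

Newlines inside the wanted text are preserved; only the whitespace touching the markers is dropped.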
Re: regex in html by Trimbach (Curate) on Apr 01, 2001 at 20:30 UTC
Don't use substitution when you really just want to match: `($current) = $everything =~ m/<!---CURCON-->(.*)<!---END CURCON-->/s;` ...should work just fine. $current will now contain everything between the comments. If you want to insert the contents of $current somewhere else, there's no need to use another regex: `$new = $start . $current . $end;` ...which will sandwich $current between $start and $end, which is what it looks like you want. Gary Blackburn Trained Killer
Re: regex in html + from ... to by bjelli (Pilgrim) on Apr 02, 2001 at 13:39 UTC
If you are processing big files you might want to avoid slurping the whole thing at once. Here the range operator `...` comes in handy. When used in a scalar context it returns a boolean and does just what you need here: `while (<DATA>) { if (/$start/.../$end/) { print; } }` I'll try to explain what happens in detail: The magic is in the three dots: When the first line is processed, the three dots are in the "false" state. They take the expression on the left (`/$start/`) and evaluate it. If the expression returns false, everything stays the same: the three dots return false. If the expression returns true, the three dots return true and go into the "true" state. The next time we come to the three dots, the expression on the right is evaluated. If it returns false, everything stays the same: the three dots continue to return true. If the expression returns true, the three dots return true one last time and go back into the "false" state. But once you've grokked all that, you just think of the whole while + if + ... construct as "from /$start/ to /$end/" -- Brigitte 'I never met a chocolate I didnt like' Jellinek http://www.horus.com/~bjelli/ http://perlwelt.horus.at
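The line-by-line approach can be sketched as below, with one variation: the marker lines themselves are skipped. The sub name `lines_between` and the sample data are hypothetical, and an in-memory filehandle (Perl 5.8 or later) stands in for the real file. The sketch relies on the documented return value of the range operator in scalar context: a sequence number starting at 1, with "E0" appended on the final line of the range.

```perl
use strict;
use warnings;

my $start = '<!---CURCON-->';
my $end   = '<!---END CURCON-->';

# Collect the lines strictly between the marker lines, one line at a time.
sub lines_between {
    my ($fh) = @_;
    my @kept;
    while (my $line = <$fh>) {
        # False outside the range; 1 on the start line, 2, 3, ... inside,
        # and a value ending in "E0" on the line that matches the end marker.
        my $seq = ($line =~ /\Q$start\E/) ... ($line =~ /\Q$end\E/);
        next unless $seq;                     # outside the range
        next if $seq == 1 or $seq =~ /E0$/;   # skip the two marker lines
        push @kept, $line;
    }
    return @kept;
}

# Stand-in for the real file (assumption for the demo).
my $html = "junk\n$start\nline one\nline two\n$end\njunk\n";
open my $fh, '<', \$html or die "Can't open in-memory file: $!";
print lines_between($fh);
close $fh;
```

The operator keeps its state between calls to the sub, so input that ends before the closing marker would leave it in the "true" state for the next call.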
Re: Re: regex in html + from ... to by AidanLee (Chaplain) on May 04, 2001 at 17:18 UTC
The Camel, 2nd ed., states that the range operator is two dots '..', not three?
Re: Re: Re: regex in html + from ... to by davorg (Chancellor) on May 04, 2001 at 17:28 UTC
It can be both `..` and `...`. They have subtly different effects. With two dots, it's possible for both the start and end checks to be true on the same line. This means that the operator goes from false to true and back to false again on one evaluation. With three dots, if the start check is true, then the end check isn't checked until the next evaluation - thus forcing at least one iteration with the operator returning true. -- <http://www.dave.org.uk> "Perl makes the fun jobs fun and the boring jobs bearable" - me
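A short sketch of that difference, using made-up input in which the first line satisfies both the start and the end check. The sub names `two_dots` and `three_dots` and the BEGIN/END markers are invented for the demo.

```perl
use strict;
use warnings;

# Hypothetical input: the first line matches BOTH the start and the end check.
my @lines = ("BEGIN a END\n", "b\n", "c END\n", "d\n");

# Lines selected by the two-dot form.
sub two_dots {
    my @kept;
    for (@_) {
        # The end check runs on the same line that turned the range on.
        push @kept, $_ if /BEGIN/ .. /END/;
    }
    return @kept;
}

# Lines selected by the three-dot form.
sub three_dots {
    my @kept;
    for (@_) {
        # The end check is first evaluated on the line after the start line.
        push @kept, $_ if /BEGIN/ ... /END/;
    }
    return @kept;
}

print "two dots:\n",   two_dots(@lines);
print "three dots:\n", three_dots(@lines);
```

With this input the two-dot form selects only the first line, while the three-dot form stays on through "c END".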