Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:
|
Re: help match this
by GrandFather (Saint) on Jul 04, 2006 at 11:01 UTC
Generally using regexen to extract data from HTML is hard enough that it is worth leaving to tools designed for the purpose. In this case HTML::TreeBuilder is one way to do the job:
Prints:
DWIM is Perl's answer to Gödel
|
Re: help match this
by ysth (Canon) on Jul 04, 2006 at 07:54 UTC
|
Re: help match this
by bart (Canon) on Jul 04, 2006 at 08:34 UTC
:) Honestly, you don't show where it should stop matching, so I can just grab the rest of the string. I assume that you actually want to match up to the next "<a", but that's just a guess. Since there may not be a next anchor, that part should be optional, in which case the match runs to the end of the string. So, try one of:
If your "html" is much more complex than this, you should look into using an HTML parser. I've used HTML::TokeParser::Simple with success in similar tasks in the past, so it's my first recommendation.
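A minimal sketch of the idea described above, not necessarily bart's own code: capture from "<a" up to the next "<a", and fall back to the end of the string when there is no later anchor. The sample string and the sub name anchor_chunks are assumptions made for illustration, since the original poster's input is not shown.

```perl
use strict;
use warnings;

# Hypothetical sample input; the thread does not show the real HTML.
my $html = '<p>intro</p><a href="one.html">First</a> NGC:003288 text'
         . '<a href="two.html">Second</a> tail';

# Return every chunk that starts at "<a" and runs up to (but not including)
# the next "<a", or to the end of the string if no further anchor follows.
# Call this in list context; anchor_chunks is an illustrative name.
sub anchor_chunks {
    my ($text) = @_;
    return $text =~ /(<a\b.*?)(?=<a\b|\z)/gs;
}

print "$_\n" for anchor_chunks($html);
```

The lookahead (?=<a\b|\z) is what makes the next anchor optional, and the /s modifier lets '.' cross newlines if the input spans several lines.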
by Silent-monk (Novice) on Jul 04, 2006 at 12:41 UTC
Your first answer is very close, but the question is how to extract the contents from "<a" to "NGC:003288", not to the next "<a", so I would change your first answer to this:
and it should turn up the desired result. Now, back to my meditation.
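One way to read that change, as a hedged sketch rather than Silent-monk's exact code: stop the lazy match at the literal marker instead of at the next anchor. The marker NGC:003288 comes from the post; the sample string and the sub name anchor_to_marker are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample input; only the marker text is taken from the thread.
my $html = '<p>intro</p><a href="one.html">First</a> catalogue NGC:003288 and more';

# Return the text from the first "<a" through the first following
# occurrence of $marker, or undef if no such span exists.
# \Q...\E quotes the marker so it is matched literally.
sub anchor_to_marker {
    my ($text, $marker) = @_;
    my ($chunk) = $text =~ /(<a\b.*?\Q$marker\E)/s;
    return $chunk;
}

my $found = anchor_to_marker($html, 'NGC:003288');
print defined $found ? "$found\n" : "no match\n";
```

Note that the lazy quantifier only minimises the right-hand end: the match still begins at the first "<a" in the string, so several anchors ahead of the marker would all be included.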
|
Re: help match this
by lima1 (Curate) on Jul 04, 2006 at 08:04 UTC
|
Re: help match this
by Moron (Curate) on Jul 04, 2006 at 13:40 UTC
-M Free your mind
by bobf (Monsignor) on Jul 04, 2006 at 20:49 UTC
In cases like this the flip-flop operator ('..' in scalar context, see perlop) can really help to clean things up by eliminating the need to track state. For example, the entire body of your while loop can be replaced with this:
Now that is easier and more maintainable! The snippet above appends $_ to $content from the line where $_ matches $start through the line where it matches $finish. Note that $content will therefore include the text matched by $start and $finish themselves, so a simple regex at the end can be used to eliminate them:
This does, of course, simply reduce to the examples above that use the /s modifier to allow '.' to match newlines, but it illustrates how the flip-flop operator could be used in this situation. No more need for $phase, no more funky loop control in nested 'if' statements, and no manual resetting of $_. See also the very nice discussion in Flipin good, or a total flop?.
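A rough sketch of the flip-flop approach described above, under stated assumptions: the names $start, $finish and $content come from the post, while the patterns, the sample lines, and the subs collect_range and between are invented here for illustration and are not bobf's original code.

```perl
use strict;
use warnings;

# Assumed patterns and sample lines; the thread's real data is not shown.
my $start  = qr/<a\b/;
my $finish = qr/NGC:003288/;
my @lines  = (
    "<p>intro</p>\n",
    "<a href=\"x.html\">Galaxy</a>\n",
    "catalogue id NGC:003288\n",
    "<p>outro</p>\n",
);

# Append every line from the one matching $start through the one
# matching $finish. The scalar-context '..' keeps the on/off state itself.
sub collect_range {
    my ($start, $finish, @lines) = @_;
    my $content = '';
    for (@lines) {
        $content .= $_ if /$start/ .. /$finish/;
    }
    return $content;
}

# Keep only the text strictly between the two patterns (one possible
# reading of "eliminate them"); returns undef if either is missing.
sub between {
    my ($start, $finish, $content) = @_;
    my ($inner) = $content =~ /$start(.*?)$finish/s;
    return $inner;
}

my $content = collect_range($start, $finish, @lines);
print $content;
print between($start, $finish, $content), "\n";
```

One caveat with wrapping the operator in a named sub: the flip-flop's state belongs to the operator itself, so a call whose input never matches $finish leaves the range open for the next call.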
by Moron (Curate) on Jul 05, 2006 at 09:15 UTC
-M Free your mind