help match this

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: help match this by GrandFather (Saint) on Jul 04, 2006 at 11:01 UTC
Generally, using regexen to extract data from HTML is hard enough that it is worth leaving to tools designed for the purpose. In this case HTML::TreeBuilder is one way to do the job:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $html = <<HTML;
    <p><a id="GuidelineDataList__ctl1_doctitlelink" href="/summary/summary.aspx?doc_id=4363">(1) Pertussis vaccination: use of acellular pertussis vaccines among infants and young children.
    <br />(2) Use of diphtheria toxoid-tetanus toxoid-acellular pertussis vaccine as a five-dose series.
    (Addendum)</a>
    Centers for Disease Control and Prevention - Federal Government Agency [U.S.]. 1997 Mar 28 (revised 2000 Nov; addendum released 2003 Sep 26) . 25 pages. NGC:003288</p>
    HTML

    my $tree = HTML::TreeBuilder->new_from_content ($html);

    # Get a list of anchors that have an id attribute starting 'GuidelineDataList'
    my @anchors = $tree->look_down (id => qr/^GuidelineDataList/);

    for my $anchor (@anchors) {
        next if ! defined $anchor || ! defined $anchor->parent ();

        print 'HRef: ', $anchor->attr ('href'), "\n" if defined $anchor->attr ('href');
        print 'Text: ', $anchor->parent ()->as_text (), "\n" if defined $anchor->parent ();
    }

Prints:

    HRef: /summary/summary.aspx?doc_id=4363
    Text: (1) Pertussis vaccination: use of acellular pertussis vaccines among infants and young children. (2) Use of diphtheria toxoid-tetanus toxoid-acellular pertussis vaccine as a five-dose series. (Addendum) Centers for Disease Control and Prevention - Federal Government Agency [U.S.]. 1997 Mar 28 (revised 2000 Nov; addendum released 2003 Sep 26) . 25 pages. NGC:003288

DWIM is Perl's answer to Gödel
Re: help match this by ysth (Canon) on Jul 04, 2006 at 07:54 UTC
Since you don't show what you tried or say how it didn't work, it's hard to help, but perhaps adding an `s` flag after the regex is what you want (like `/match.*this/s`). Without that, `.` will only match non-newline characters.
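A minimal sketch of the difference the `s` flag makes, assuming a made-up two-line string (the original question's data is not shown in the thread, so `$text` here is purely illustrative):

```perl
use strict;
use warnings;

# Hypothetical two-line sample; not the data from the original question.
my $text = "match\nthis";

# Without /s, '.' cannot cross the newline, so the pattern fails.
my $without_s = $text =~ /match.*this/ ? 1 : 0;

# With /s, '.' also matches "\n", so the pattern spans both lines.
my $with_s = $text =~ /match.*this/s ? 1 : 0;

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```

The only thing that changes between the two tests is the trailing flag; the pattern itself is identical.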
Re: help match this by bart (Canon) on Jul 04, 2006 at 08:34 UTC
`/<a\s(.*)/s`

:) Honestly, you don't show where it should stop matching, so I can just grab the rest of the string. I assume that you actually want to match up to the next "`<a`", but that's just a guess. As there may not be a next anchor, it should be optional, so that the match otherwise runs to the end of the string. So, try one of:

    /<a\s((?:(?!<a\s).)*)/s
    /<a\s(.*?)(?:<a\s|$)/s

If your "html" is much more complex than this, you should look into using an HTML parser. I've used HTML::TokeParser::Simple with success in similar tasks in the past, so it's my first recommendation.
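To see how the two "stop at the next anchor" patterns behave, here is a small sketch. It assumes a made-up string with two anchors (`$html` below is hypothetical, not the poster's data) and compares a negative-lookahead pattern with a non-greedy one:

```perl
use strict;
use warnings;

# Hypothetical sample with two anchors; not the data from the original question.
my $html = qq{<a href="/one">first\nlink</a> text one\n<a href="/two">second</a> text two};

# Variant 1: take one character at a time, as long as it does not start "<a ".
my ($tempered) = $html =~ /<a\s((?:(?!<a\s).)*)/s;

# Variant 2: non-greedy match up to the next "<a " or the end of the string.
my ($lazy) = $html =~ /<a\s(.*?)(?:<a\s|$)/s;

print "same capture\n" if $tempered eq $lazy;
print $tempered;
```

On this input both variants capture the text from just after the first `<a ` up to, but not including, the second anchor. Note that a closing `</a>` does not stop either pattern, since it does not begin with `<a` followed by whitespace.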
Re^2: help match this by Silent-monk (Novice) on Jul 04, 2006 at 12:41 UTC
Hey there bart, your first answer is very close, but the question is how to extract the contents between "<a" and "NGC:003288", not up to the next "<a", so I would change your first answer to this:

`/<a\s(.*)NGC:003288/s;`

and it should turn up the desired result. Now, back to my meditation.
Re: help match this by lima1 (Curate) on Jul 04, 2006 at 08:04 UTC
See perldoc perlre. The `s` modifier is what you want...

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Slurp the whole DATA section into one string
    my $text = join '', <DATA>;

    # /x ignores the spaces in the pattern, /s lets '.' match newlines
    my ( $match ) = $text =~ m{<a( .*? ) NGC:003288 }xms;

    print $match;

    __DATA__
    <a id="GuidelineDataList__ctl1_doctitlelink" href="/summary/summary.aspx?doc_id=4363">(1) Pertussis vaccination: use of acellular pertussis vaccines among infants and young children.
    <br />(2) Use of diphtheria toxoid-tetanus toxoid-acellular pertussis vaccine as a five-dose series.
    (Addendum)</a>
    Centers for Disease Control and Prevention - Federal Government Agency [U.S.]. 1997 Mar 28 (revised 2000 Nov; addendum released 2003 Sep 26) . 25 pages. NGC:003288
Re: help match this by Moron (Curate) on Jul 04, 2006 at 13:40 UTC
Given two delimiters that may be separated by more than one line, it seems easier and more maintainable to me to separate the matching operations, for example:

    my $phase   = 0;
    my $start   = '<a';
    my $finish  = 'NGC:003288';
    my $content = '';

    while ( <> ) {
        # Phase 0: skip lines until the start delimiter is seen
        if ( $phase == 0 ) {
            if ( /$start(.*)$/ ) {
                $phase++;
                $_ = $1;    # keep only what follows the start delimiter
            }
            else {
                next;
            }
        }
        # Phase 1: accumulate until the finish delimiter is seen
        if ( $phase == 1 ) {
            if ( /(.*)$finish/ ) {
                $content .= $1;
                last;
            }
            $content .= $_;
        }
    }

-M

Free your mind
Re^2: help match this by bobf (Monsignor) on Jul 04, 2006 at 20:49 UTC
In cases like this the flip-flop operator ('..' in scalar context, see perlop) can really help to clean things up by eliminating the need to track state. For example, the entire body of your while loop can be replaced with this:

    if ( /$start/ .. /$finish/ ) {
        $content .= $_;
    }

Now that is easier and more maintainable! The snippet above appends `$_` to `$content` from the first line that matches `$start` through the line that matches `$finish`. Note that `$content` will then include the `$start` and `$finish` patterns themselves, along with anything else on those two lines, so a simple regex at the end can be used to eliminate them, leaving the text between the delimiters in `$1`:

`$content =~ m/$start(.*)$finish/s;`

This does, of course, simply reduce to the examples above which use the `s` modifier to allow '.' to match newlines, but it illustrates how the flip-flop operator could be used in this situation. No more need for `$phase`, no more funky loop control in nested 'if' statements, and no manual resetting of `$_`.

See also the very nice discussion in Flipin good, or a total flop?.
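A self-contained sketch of the flip-flop approach. Two things here are assumptions made for the sake of a runnable example: the input is a made-up string read through an in-memory filehandle instead of `<>`, and the final capture goes into a hypothetical `$between` variable.

```perl
use strict;
use warnings;

# Hypothetical multi-line input; only the delimiters come from this thread.
my $input = <<'END';
ignored line
before <a href="/x">title
middle line
tail NGC:003288 after
ignored again
END

my $start   = '<a';
my $finish  = 'NGC:003288';
my $content = '';

# Read the string line by line through an in-memory filehandle.
open my $fh, '<', \$input or die "Cannot open in-memory handle: $!";
while ( my $line = <$fh> ) {
    # True from the first line matching $start through the line matching $finish.
    if ( $line =~ /$start/ .. $line =~ /$finish/ ) {
        $content .= $line;
    }
}
close $fh;

# Trim the delimiters and whatever surrounds them on the first and last lines.
my ($between) = $content =~ /$start(.*)$finish/s;
print $between;
```

The flip-flop keeps whole lines, so `$content` still contains "before " and " after"; the final regex is what narrows the result to the text between the two delimiters.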
Re^3: help match this by Moron (Curate) on Jul 05, 2006 at 09:15 UTC
The purpose of my post was to demonstrate that it was easy enough to split the matching into two. I am normally a fan of terse code, especially if the usage of an operator is unambiguous, but because of the danger of the unwary reader confusing it with the `..` range operator, this requires significant explanation which would not be germane to the simple point I was making. I don't rule it out, though, and will bear it in mind for such time as I can think of a way to present it easily enough to potentially beginner OPs.

-M

Free your mind
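For readers worried about mixing up the two meanings of `..`, a short side-by-side sketch may help. The sample values and the `BEGIN`/`END` marker strings are made up for illustration:

```perl
use strict;
use warnings;

# List context: '..' is the range operator and builds a list.
my @range = ( 2 .. 5 );

# Scalar (boolean) context with non-constant operands: '..' is the flip-flop.
# Hypothetical sample lines, chosen only to show the on/off behaviour.
my @lines = ( 'a', 'BEGIN', 'b', 'c', 'END', 'd' );
my @kept;
for my $line (@lines) {
    # Switches on at 'BEGIN', stays on, and switches off after 'END'.
    push @kept, $line if ( $line eq 'BEGIN' ) .. ( $line eq 'END' );
}

print "range: @range\n";
print "kept:  @kept\n";
```

The operator is the same in both places; only the context decides whether Perl builds a list or tracks an on/off state.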