
Generally, using regexen to extract data from HTML is hard enough that the job is worth leaving to tools designed for the purpose. In this case, HTML::TreeBuilder is one way to do it:

use strict;
use warnings;
use HTML::TreeBuilder;

my $html = <<HTML;
<p><a id="GuidelineDataList__ctl1_doctitlelink" href="/summary/summary.aspx?doc_id=4363">(1) Pertussis vaccination: use of acellular pertussis vaccines among infants and young children. <br />(2) Use of diphtheria toxoid-tetanus toxoid-acellular pertussis vaccine as a five-dose series. (Addendum)</a>
Centers for Disease Control and Prevention - Federal Government Agency [U.S.]. 1997 Mar 28 (revised 2000 Nov; addendum released 2003 Sep 26) . 25 pages. NGC:003288</p>
HTML

my $tree = HTML::TreeBuilder->new_from_content ($html);

# Get a list of anchors that have an id attribute starting 'GuidelineDataList'
my @anchors = $tree->look_down (id => qr/^GuidelineDataList/);

for my $anchor (@anchors) {
    next if ! defined $anchor || ! defined $anchor->parent ();
    print 'HRef: ', $anchor->attr ('href'), "\n" if defined $anchor->attr ('href');
    print 'Text: ', $anchor->parent ()->as_text (), "\n" if defined $anchor->parent ();
}

Prints:

HRef: /summary/summary.aspx?doc_id=4363
Text: (1) Pertussis vaccination: use of acellular pertussis vaccines among infants and young children. (2) Use of diphtheria toxoid-tetanus toxoid-acellular pertussis vaccine as a five-dose series. (Addendum) Centers for Disease Control and Prevention - Federal Government Agency [U.S.]. 1997 Mar 28 (revised 2000 Nov; addendum released 2003 Sep 26) . 25 pages. NGC:003288
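
If you only want the link title rather than the whole paragraph, something along these lines may help. It is a rough sketch, not tested against the real page: the extract_links helper and the shortened sample markup are made up for illustration. It adds _tag => 'a' to the look_down criteria so that only anchor elements match, and calls as_text on the anchor itself instead of on its parent.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: returns one { href, title } hash per matching anchor.
sub extract_links {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content ($html);
    my @links;

    # _tag restricts the match to <a> elements whose id starts 'GuidelineDataList'
    for my $anchor ($tree->look_down (_tag => 'a', id => qr/^GuidelineDataList/)) {
        push @links, {
            href  => $anchor->attr ('href'),
            title => $anchor->as_text (),    # link text only, not the whole <p>
        };
    }

    $tree->delete ();    # release the parse tree once the data is copied out
    return @links;
}

# Shortened sample markup, made up for this sketch
my $sample = '<p><a id="GuidelineDataList__ctl1_doctitlelink" href="/summary/summary.aspx?doc_id=4363">Pertussis vaccination</a> Centers for Disease Control and Prevention</p>';

for my $link (extract_links ($sample)) {
    print 'HRef:  ', $link->{href} // '(none)', "\n";
    print 'Title: ', $link->{title}, "\n";
}
```

Copying the values into plain hashes before deleting the tree means the caller never holds a reference to a destroyed element.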

DWIM is Perl's answer to Gödel