Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

This node falls below the community's threshold of quality. You may see it by logging in.

Replies are listed 'Best First'.
Re: help match this
by GrandFather (Saint) on Jul 04, 2006 at 11:01 UTC

    Generally using regexen to extract data from HTML is hard enough that it is worth leaving to tools designed for the purpose. In this case HTML::TreeBuilder is one way to do the job:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $html = <<HTML;
    <p><a id="GuidelineDataList__ctl1_doctitlelink"
    href="/summary/summary.aspx?doc_id=4363">(1) Pertussis vaccination: use of acellular pertussis vaccines among infants and young children.
    <br />(2) Use of diphtheria toxoid-tetanus toxoid-acellular pertussis vaccine as a five-dose series.
    (Addendum)</a>
    Centers for Disease Control and Prevention - Federal Government Agency [U.S.].
    1997 Mar 28 (revised 2000 Nov; addendum released 2003 Sep 26) . 25 pages. NGC:003288</p>
    HTML

    my $tree = HTML::TreeBuilder->new_from_content ($html);

    # Get a list of anchors that have an id attribute starting 'GuidelineDataList'
    my @anchors = $tree->look_down (id => qr/^GuidelineDataList/);

    for my $anchor (@anchors) {
        next if ! defined $anchor || ! defined $anchor->parent ();
        print 'HRef: ', $anchor->attr ('href'), "\n" if defined $anchor->attr ('href');
        print 'Text: ', $anchor->parent ()->as_text (), "\n" if defined $anchor->parent ();
    }

    Prints:

    HRef: /summary/summary.aspx?doc_id=4363
    Text: (1) Pertussis vaccination: use of acellular pertussis vaccines among infants and young children. (2) Use of diphtheria toxoid-tetanus toxoid-acellular pertussis vaccine as a five-dose series. (Addendum) Centers for Disease Control and Prevention - Federal Government Agency [U.S.]. 1997 Mar 28 (revised 2000 Nov; addendum released 2003 Sep 26) . 25 pages. NGC:003288

    DWIM is Perl's answer to Gödel
Re: help match this
by ysth (Canon) on Jul 04, 2006 at 07:54 UTC
    Since you don't show what you tried or say how it didn't work, it's hard to help, but perhaps adding an s flag after the regex might be what you want (like /match.*this/s). Without that, . will only match non-newline characters.
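
    To make the effect of the s flag concrete, here is a minimal sketch. The two-line input string and the variable names are made up for illustration; only the pattern shape comes from this thread.

```perl
use strict;
use warnings;

# Hypothetical two-line input, for illustration only.
my $text = "<a href=\"x\">first line\nsecond line NGC:003288\n";

# Without /s, '.' stops at the newline, so the match cannot span lines.
my ($without_s) = $text =~ /<a\s(.*)NGC:003288/;

# With /s, '.' also matches "\n", so the capture crosses the line break.
my ($with_s) = $text =~ /<a\s(.*)NGC:003288/s;

print defined $without_s ? "without /s: matched\n" : "without /s: no match\n";
print defined $with_s    ? "with /s: matched\n"    : "with /s: no match\n";
```

    With this input only the second pattern should match, because the start and end markers sit on different lines.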
Re: help match this
by bart (Canon) on Jul 04, 2006 at 08:34 UTC
    /<a\s(.*)/s
    :)

    Honestly, you don't show where it should stop matching, so I can just grab the rest of the string.

    I assume that actually, you want to match up to the next "<a", but that's just a guess. As there may not be a next anchor, the latter should be optional, thus, match up to the end of the string. So, try one of:

    /<a\s((?:(?!<a\s).)*)/s
    /<a\s(.*?)(?:<a\s|$)/s
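
    As a rough check of how those two patterns behave, the sketch below runs both against a made-up string containing two anchors. The input and the variable names are assumptions for illustration, not the original poster's data.

```perl
use strict;
use warnings;

# Hypothetical input with two anchors, spread over several lines.
my $text = "<a href=\"one\">first</a>\ntext one\n"
         . "<a href=\"two\">second</a>\ntext two\n";

# Pattern 1: take one character at a time, as long as it does not begin "<a ".
my ($tempered) = $text =~ /<a\s((?:(?!<a\s).)*)/s;

# Pattern 2: non-greedy match that stops at the next "<a " or at end of string.
my ($lazy) = $text =~ /<a\s(.*?)(?:<a\s|$)/s;

print "both patterns captured the same text\n" if $tempered eq $lazy;
print $tempered;
```

    For this input both captures end just before the second anchor. When there is no following anchor the two can differ slightly at the very end of the string, since $ may match before a trailing newline.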

    If your "html" is much more complex than this, you should look into using an HTML parser. I've used HTML::TokeParser::Simple with success in similar tasks in the past, so it's my first recommendation.

      Hey there bart,

      Your first answer is very close, but the question is how to extract the contents between "<a" and "NGC:003288", not up to the next "<a", so I would change your first answer to this:

      /<a\s(.*)NGC:003288/s;
      and it should turn up the desired result.
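
      One caveat worth sketching: .* is greedy, so if the end marker ever appeared more than once, the capture would run to the last occurrence. The example below uses a made-up input with the marker repeated (an assumption; the original data shows it only once) to contrast the greedy form with the non-greedy .*? form.

```perl
use strict;
use warnings;

# Hypothetical input in which the end marker appears twice.
my $text = "<a href=\"one\">first</a> NGC:003288\n"
         . "<a href=\"two\">second</a> NGC:003288\n";

# Greedy: .* runs to the LAST occurrence of the marker.
my ($greedy) = $text =~ /<a\s(.*)NGC:003288/s;

# Non-greedy: .*? stops at the FIRST occurrence of the marker.
my ($lazy) = $text =~ /<a\s(.*?)NGC:003288/s;

printf "greedy captured %d characters\n", length $greedy;
printf "lazy captured %d characters\n",   length $lazy;
```

      If the marker is guaranteed to be unique, both forms capture the same text.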

      Now, back to my meditation

Re: help match this
by lima1 (Curate) on Jul 04, 2006 at 08:04 UTC
    see perldoc perlre. the s modifier is what you want...
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $a = join '', <DATA>;
    my ( $b ) = $a =~ m{<a( .*? ) NGC:003288 }xms;
    print $b;

    __DATA__
    <a id="GuidelineDataList__ctl1_doctitlelink" href="/summary/summary.aspx?doc_id=4363">(1) Pertussis vaccination: use of acellular pertussis vaccines among infants and young children.
    <br />(2) Use of diphtheria toxoid-tetanus toxoid-acellular pertussis vaccine as a five-dose series.
    (Addendum)</a>
    Centers for Disease Control and Prevention - Federal Government Agency [U.S.].
    1997 Mar 28 (revised 2000 Nov; addendum released 2003 Sep 26) . 25 pages. NGC:003288
Re: help match this
by Moron (Curate) on Jul 04, 2006 at 13:40 UTC
    Given two delimiters over potentially more than one line, it seems easier and more maintainable to me to separate the matching operations, for example:
    my $phase = 0;
    my $start = '<a';
    my $finish = 'NGC:003288';
    my $content = '';
    while( <> ) {
        if ( $phase == 0 ) {
            if ( /$start(.*)$/ ) {
                $phase++;
                $_ = $1;
            }
            else {
                next;
            }
        }
        if ( $phase == 1 ) {
            if ( /(.*)$finish/ ) {
                $content .= $1;
                last;
            }
            $content .= $_;
        }
    }

    -M

    Free your mind

      In cases like this the flip-flop operator ( '..' in scalar context, see perlop) can really help to clean things up by eliminating the need to track state. For example, the entire body of your while loop can be replaced with this:

      if( /$start/ .. /$finish/ ) { $content .= $_; }
      Now that is easier and more maintainable!

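      The snippet above appends $_ to $content for every line from the one that matches $start through the one that matches $finish, inclusive. Note, however, that $content will therefore also contain the text matched by $start and $finish themselves, plus anything before $start on the first line and after $finish on the last, so a simple regex at the end can be used to capture just the part in between (the result is left in $1):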

      $content =~ m/$start(.*)$finish/s;
      This does, of course, simply reduce to the examples above which use the s modifier to allow '.' to match newlines, but it illustrates how the flip-flop operator could be used in this situation. No more need for $phase, no more funky loop control in nested 'if' statements, and no manual resetting of $_.
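
      Putting the pieces together, a self-contained sketch might look like the following. It loops over a made-up array of lines rather than reading from <>, and it wraps the markers in \Q...\E so they are treated as literal text; both choices are assumptions added for illustration.

```perl
use strict;
use warnings;

my $start  = '<a';
my $finish = 'NGC:003288';

# Hypothetical input lines standing in for what <> would return.
my @lines = (
    "header line\n",
    "<a href=\"x\">first\n",
    "second\n",
    "third NGC:003288\n",
    "footer line\n",
);

my $content = '';
for (@lines) {
    # True from the line matching $start through the line matching $finish.
    $content .= $_ if /\Q$start\E/ .. /\Q$finish\E/;
}

# Keep only what lies between the two markers.
my ($inner) = $content =~ /\Q$start\E(.*)\Q$finish\E/s;

print $inner, "\n";
```

      Unlike the explicit $phase version, this loop does not stop after the first $finish, so a later line matching $start would switch collection back on.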

      See also the very nice discussion in Flipin good, or a total flop?.

        The purpose of my post was to demonstrate that it was easy enough to split the matching into two. I am normally a fan of terse code, especially if the usage of an operator is unambiguous, but because of the danger of the unwary reader confusing it with the .. range operator, this requires significant explanation which would not be germane to the simple point I was making. I don't rule it out though and will bear it in mind for such time as I can think of a way to present it easily enough to potentially beginner OPs.

        -M

        Free your mind