http://qs1969.pair.com?node_id=222730

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Perl Gods, I need some help. Suppose I want to search a variable containing the contents of an HTML page: what would be the best way? For example, the HTML page has "NOTE: Something something", but NOTE: is surrounded by a bunch of HTML gook. All I want is the stuff after "NOTE:". How would I do that?
<font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/><B>Note:</B> &nbsp;</font><font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/>By arrangement.</font>
And I want to extract "By arrangement." from the above .... how would I do that? Thanks, Helpless (and Stupid)

Re: Extract text from HTML
by jdporter (Paladin) on Dec 28, 2002 at 15:29 UTC
    Here's a nice little function that does it.

    (FYI - the faq noted by Juerd is obsolete. Not only does it not give an actual solution, but it recommends HTML::Parse, which is now deprecated.)
    use HTML::Parser;

    sub extract_html_text {
        my $html = shift;
        my $text = '';
        HTML::Parser->new(
            api_version => 3,
            text_h      => [ sub { $text .= "@_"; }, "dtext" ],
        )->parse( $html )->eof;
        $text
    }

    UPDATE: Here's another (imho, nicer) little function that does it:
    use HTML::TreeBuilder;

    sub extract_html_text {
        HTML::TreeBuilder->new_from_content($_[0])->as_text
    }

    jdporter
    ...porque es dificil estar guapo y blanco.

      jdporter, thanks a ton for your help, that did it. I have a follow-up question, though. Based on your answer, I am using:
      my $we = extract_html_text($browser->{res}->content);
      my @note = $we =~ m/Note:\s*([^<]+)/gi;
      to first strip the HTML gook and then search the remainder for "Note". The thing is that "@note" prints out everything after Note: i.e. all the other coding in the remainder of the (formerly) HTML file, etc. Is there any way I can get it to search for Note:, collect all the information after it and stop when it reaches "Pre" or "Attrib" or "Link"? Would greatly appreciate your feedback. Thanks.
        How about:
        my ($note) = $we =~ m/Note:\s*(.+?)(?:Pre|Attrib|Link)/sgi;
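A minimal sketch of why the first pattern overshoots and how the non-greedy version stops early. Once the tags are stripped there is no "<" left in the text, so `[^<]+` has nothing to stop at. The sample string and its field labels are assumptions made up for illustration, not the poster's actual page, and the extra `\s*` before the alternation is an addition to keep trailing whitespace out of the capture.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the text returned by extract_html_text();
# the labels after the note are assumed for illustration only.
my $we = "Units: 3 Note: By arrangement. Pre: None Attrib: Lab Link: none";

# Original pattern: no "<" remains after stripping the HTML,
# so the capture runs to the end of the string.
my ($greedy) = $we =~ m/Note:\s*([^<]+)/i;

# Non-greedy capture that stops at the first following label.
my ($note) = $we =~ m/Note:\s*(.+?)\s*(?:Pre|Attrib|Link)/si;

print "greedy: $greedy\n";
print "note:   $note\n";
```

One caveat with this approach: because of `/i`, a note containing a word such as "express" would be cut short at "pre". If the labels are always followed by a colon (an assumption about the page), matching `(?:Pre|Attrib|Link):` instead would be stricter.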
Re: Extract text from HTML
by Juerd (Abbot) on Dec 28, 2002 at 14:47 UTC
Re: Extract text from HTML
by vek (Prior) on Dec 29, 2002 at 14:09 UTC
Re: Extract text from HTML
by osama (Scribe) on Dec 29, 2002 at 20:56 UTC
    I used to like reinventing the wheel every time... I used to do something like this:
    # THIS IS BAD
    s/(\s|\&nbsp;)+/ /g;
    s/<(BR|P)>/\n/ig;
    s/<.+?>//g;
    Now I just:
    use HTML::TokeParser;
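The post stops at the `use` line, so what follows is only a rough sketch of how HTML::TokeParser might be used for the same job, not osama's actual code. It assumes the HTML-Parser distribution from CPAN is installed; the helper name `tokeparser_text` and the simplified sample markup are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TokeParser;                   # CPAN, part of the HTML-Parser distribution
use HTML::Entities qw(decode_entities); # same distribution

# Hypothetical helper: gather the text tokens of an HTML string,
# decode entities, and collapse runs of whitespace.
sub tokeparser_text {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "Cannot create parser";
    my @chunks;
    while ( my $token = $p->get_token ) {
        # Text tokens have the form ["T", $text, $is_data]
        next unless $token->[0] eq 'T';
        push @chunks, $token->[1];
    }
    # Token text is not entity-decoded, so decode it here
    my $text = decode_entities( join '', @chunks );
    $text =~ s/[\s\xA0]+/ /g;    # \xA0 is what &nbsp; decodes to
    $text =~ s/^ | $//g;         # trim the ends
    return $text;
}

# Simplified version of the markup from the question
my $html = "<font size='2'><B>Note:</B> &nbsp;</font>"
         . "<font size='2'>By arrangement.</font>";
my $result = tokeparser_text($html);
print "$result\n";
```

Walking the tokens by hand like this is more verbose than `as_text`, but it leaves room to react to particular tags, for example emitting a newline on `<br>` or `<p>`.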