PerlMonks  

Extract text from HTML

by Anonymous Monk
on Dec 28, 2002 at 14:40 UTC ( [id://222730]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Perl Gods, I need some help. Suppose I want to search a variable containing the contents of an HTML page; what would be the best way? For example, the HTML page has "NOTE: Something something", but "NOTE:" is surrounded by a bunch of HTML gook. All I want is the stuff after "NOTE:". How would I do that?
<font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/><B>Note:</B> &nbsp;</font><font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/>By arrangement.</font>
And I want to extract "By arrangement." from the above. How would I do that? Thanks, Helpless (and Stupid)

Replies are listed 'Best First'.
Re: Extract text from HTML
by jdporter (Paladin) on Dec 28, 2002 at 15:29 UTC
    Here's a nice little function that does it.

    (FYI: the FAQ noted by Juerd is obsolete. Not only does it not give an actual solution, but it recommends HTML::Parse, which is now deprecated.)
    use HTML::Parser;

    sub extract_html_text {
        my $html = shift;
        my $text = '';
        HTML::Parser->new(
            api_version => 3,
            text_h      => [ sub { $text .= "@_"; }, "dtext" ],
        )->parse( $html )->eof;
        $text
    }

    UPDATE: Here's another (imho, nicer) little function that does it:
    use HTML::TreeBuilder;

    sub extract_html_text {
        HTML::TreeBuilder->new_from_content($_[0])->as_text
    }

    jdporter
    ...porque es dificil estar guapo y blanco.

      jdporter, thanks a ton for your help; that did it. I have a follow-up question, though. Building on your answer, I am using:
      my $we = extract_html_text($browser->{res}->content);
      my @note = $we =~ m/Note:\s*([^<]+)/gi;
      to first strip the HTML gook and then search the remainder for "Note:". The trouble is that @note ends up holding everything after "Note:", i.e. all the remaining text of the (formerly) HTML file. Is there any way I can get it to search for "Note:", collect the information after it, and stop when it reaches "Pre", "Attrib" or "Link"? I would greatly appreciate your feedback. Thanks.
        How about:
        my ($note) = $we =~ m/Note:\s*(.+?)(?:Pre|Attrib|Link)/sgi;
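
For anyone trying the suggestion above, here is a minimal, self-contained sketch of the same non-greedy capture using only core Perl. The sample string in $we is a made-up stand-in for whatever extract_html_text() returns on the real page, and the leading \b is an extra assumption (not in the reply above) so that a stop word is only recognised at the start of a word.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the stripped page text; the real text will differ.
my $we = "Units: 3 Note:  By arrangement. Pre: none Attrib: W Link: n/a";

# (.+?) is non-greedy, so the capture ends at the first stop word.
# /s lets . match newlines, /i ignores case.
# \b (an assumption) keeps "pre" inside a word like "express" from ending the capture.
my ($note) = $we =~ m/Note:\s*(.+?)\s*\b(?:Pre|Attrib|Link)/si;

print defined $note ? "$note\n" : "no note found\n";   # prints "By arrangement."
```

If none of the stop words follows "Note:", the match fails and $note stays undefined, so it is worth checking before using it.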
Re: Extract text from HTML
by Juerd (Abbot) on Dec 28, 2002 at 14:47 UTC
Re: Extract text from HTML
by vek (Prior) on Dec 29, 2002 at 14:09 UTC
Re: Extract text from HTML
by osama (Scribe) on Dec 29, 2002 at 20:56 UTC
    I used to like reinventing the wheel every time... I used to do something like this:
    # THIS IS BAD
    s/(\s|\&nbsp;)+/ /g;
    s/<(BR|P)>/\n/ig;
    s/<.+?>//g;
    Now I just:
    use HTML::TokeParser;
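
The post does not show how HTML::TokeParser was used, so the following is only one possible sketch, applied to a simplified version of the markup from the question. It assumes the label always sits in a <B> element and the value in the next <font> element; the variable names are illustrative.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Simplified version of the snippet from the question (attributes trimmed).
my $html = q{<font size='2'><B>Note:</B> &nbsp;</font><font size='2'>By arrangement.</font>};

my $p = HTML::TokeParser->new(\$html) or die "cannot create parser";

my $note;
while ( $p->get_tag('b') ) {                    # tag names are matched in lower case
    my $label = $p->get_trimmed_text('/b');     # text up to the closing </b>
    next unless $label =~ /^Note:/i;
    $p->get_tag('font') or last;                # skip ahead to the next <font> start tag
    $note = $p->get_trimmed_text('/font');      # text up to its closing </font>
    last;
}

print defined $note ? "$note\n" : "no note found\n";
```

Working on tokens this way avoids running a regex over the whole stripped page, but it does depend on the page keeping the same tag layout.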
