Extract text from HTML

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Perl Gods, I need some help. Suppose I want to search a variable containing the contents of an HTML page; what would be the best way? For example, the HTML page has "NOTE: Something something", but "NOTE:" is surrounded by a bunch of HTML gook. All I want is the stuff after "NOTE:". How would I do that?

<font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/><B>Note:</B>  &nbsp;</font><font face='Arial,Helvetica,Sans Serif' color='#000055' size='2'/>By arrangement.</font>

And I want to extract "By arrangement." from the above .... how would I do that? Thanks, Helpless (and Stupid)


Replies are listed 'Best First'.
Re: Extract text from HTML by jdporter (Paladin) on Dec 28, 2002 at 15:29 UTC
Here's a nice little function that does it. (FYI - the faq noted by Juerd is obsolete. Not only does it not give an actual solution, but it recommends HTML::Parse, which is now deprecated.) `use HTML::Parser; sub extract_html_text { my $html = shift; my $text = ''; HTML::Parser->new( api_version => 3, text_h => [ sub { $text .= "@_"; }, "dtext" ] )->parse( $html )->eof; $text }` UPDATE: Here's another (imho, nicer) little function that does it: `use HTML::TreeBuilder; sub extract_html_text { HTML::TreeBuilder->new_from_content($_[0])->as_text }` jdporter ...porque es dificil estar guapo y blanco.
Re: Re: Extract text from HTML by Anonymous Monk on Dec 28, 2002 at 18:07 UTC
jdporter, thanks a ton for your help, that did it. I have a follow-up question, though. Building on your answer, I am using `my $we = extract_html_text($browser->{res}->content); my @note = $we =~ m/Note:\s*([^<]+)/gi;` to first strip the HTML gook and then search the remainder for "Note". The thing is that "@note" ends up holding everything after "Note:", i.e. all the other text in the remainder of the (formerly) HTML file. Is there any way I can get it to search for "Note:", collect all the information after it, and stop when it reaches "Pre" or "Attrib" or "Link"? Would greatly appreciate your feedback. Thanks.
Re: Re: Re: Extract text from HTML by Anonymous Monk on Dec 28, 2002 at 19:07 UTC
How about: `my ($note) = $we =~ m/Note:\s*(.+?)(?:Pre|Attrib|Link)/sgi;`
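A minimal sketch of that pattern at work, using only core Perl. The sample string in `$we` is made up for illustration; it stands in for whatever `extract_html_text` returns for the real page. The extra `\s*` before the stop words is an addition here, meant only to trim trailing whitespace from the capture.

```perl
use strict;
use warnings;

# Hypothetical stripped text, standing in for the output of extract_html_text().
my $we = "Course info Note: By arrangement. Pre: none Attrib: W Link: n/a";

# Non-greedy capture after "Note:", stopping before the first stop word.
# /s lets "." match newlines; /i makes the match case-insensitive.
my ($note) = $we =~ m/Note:\s*(.+?)\s*(?:Pre|Attrib|Link)/si;

print defined $note ? "$note\n" : "no note found\n";   # prints "By arrangement."
```

Because of `/i`, the stop words also match inside longer words (for example "pre" in "prepare"), so the capture can end early on real pages. Wrapping the alternation in `\b...\b` is one way to tighten it if that turns out to matter.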
Re: Extract text from HTML by Juerd (Abbot) on Dec 28, 2002 at 14:47 UTC
See the FAQ: How do I remove HTML from a string? - Yes, I reinvent wheels. - Spam: Visit eurotraQ.
Re: Extract text from HTML by vek (Prior) on Dec 29, 2002 at 14:09 UTC
For this sort of thing I usually recommend HTML::TokeParser or Ovid's slightly more intuitive HTML::TokeParser::Simple. -- vek --
Re: Extract text from HTML by osama (Scribe) on Dec 29, 2002 at 20:56 UTC
I used to like reinventing the wheel every time... I used to do something like this (THIS IS BAD): `s/(\s|\ )+/ /g; s/<(BR|P)>/\n/ig; s/<.+?>//g;` Now I just: `use HTML::TokeParser;`
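For readers who have not used HTML::TokeParser, here is a rough sketch of how it could replace the regex approach. The helper name `extract_text_tokens` and the shortened sample HTML are assumptions made for illustration; the sketch relies on `HTML::TokeParser->new` accepting a scalar reference, on `get_token` returning text tokens as `["T", $text, ...]`, and on `decode_entities` from HTML::Entities, which ships in the same distribution.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: concatenate the text tokens of an HTML string,
# decoding entities such as &nbsp; along the way.
sub extract_text_tokens {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser";
    my $text = '';
    while (my $token = $parser->get_token) {
        next unless $token->[0] eq 'T';          # skip tags, comments, etc.
        $text .= decode_entities($token->[1]);   # text tokens arrive undecoded
    }
    return $text;
}

# Shortened version of the markup from the original question.
my $html = q{<font size='2'><B>Note:</B> &nbsp;</font><font size='2'>By arrangement.</font>};
my $text = extract_text_tokens($html);
print "$text\n";
```

Note that `&nbsp;` decodes to a non-breaking space (character 160), which `\s` does not necessarily match on a plain byte string, so a later `Note:\s*` match may leave that character at the start of the capture.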

