HTML Parser print text

Vanquish has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: HTML Parser print text by Joost (Canon) on Jul 07, 2004 at 21:01 UTC
Ok, so based on your replies above, given:

    <HTML>
    <title>My Page</title>
    </head>
    <body>
    <center>
    <h1>Brand.com Production Instances</h1>
    <br>
    <table border=1>
    <tr><td></td><td><b> Service   </b></td><td><b>Instance </b></td>
    <tr><td align="right">1</td><td> app2<br></td><td> prd-1</td><td> </td> </tr>
    <tr><td align="right">2</td><td> app2  <br></td><td> prd-2</td><td> </td></tr>
    <tr><td align="right">3</td><td> app3<br></td><td> prd-1</td><td>

etc. etc., you want to print out the text in the `<td>` tags that have align="right" as an attribute. This code will do that:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTML::Parser;

    # True while the most recent start tag was <td align="right">.
    # Declared before parsing so the handlers never see it uninitialised.
    my $get_next_text = 0;

    # Create instance
    my $p = HTML::Parser->new(
        api_version     => 3,
        marked_sections => 1,
        unbroken_text   => 1,
        start_h         => [\&start, "tagname, attr"],
        text_h          => [\&text, 'text'],
    );

    # Start parsing the following HTML file
    $p->parse_file("testpage.html");

    sub start {
        # Execute when start tag is encountered
        my ($tagname, $attr) = @_;
        if ($tagname eq 'td' && exists $attr->{align} && $attr->{align} eq 'right') {
            $get_next_text = 1;
        }
        else {
            $get_next_text = 0;
        }
    }

    sub text {
        # Execute for every piece of text between tags
        my $text = shift;
        print "$text\n" if $get_next_text;
    }

What it does is this:

1. Set up HTML::Parser so that for each start tag, &start gets called with the tag name ("td" or something else) and the attributes (as a hash-ref) as arguments, and for every text part, &text gets called with the text as its argument. Note that a start tag is ANY tag that doesn't begin with `</`, so `<p>` is a start tag and `<td>` is a start tag, but `</p>` is not. A "text" part is anything that is not a tag.
2. Test in &start if the current tag is a `<td>` with an align="right" attribute. If yes: set $get_next_text to true. If no: set $get_next_text to false.
3. Test in &text if the previous tag was a `<td align="right">` (via the $get_next_text variable). If yes, print; otherwise do nothing.

Hope this clears it up :-)

Joost.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^2: HTML Parser print text by Vanquish (Novice) on Jul 07, 2004 at 21:13 UTC
Great, it works. Thanks a lot. Have a nice day. MQ
Re^3: HTML Parser print text by Joost (Canon) on Jul 07, 2004 at 21:16 UTC
Great to be of help. Note that, depending on the input, this code might not be 100% fail-safe. You should be able to figure out how to fix it, though. Hint: at least put in a handler for end tags.
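One way to act on the end-tag hint, as a minimal sketch and not necessarily the fix Joost had in mind: keep the flag set only between a `<td align="right">` and its closing tag. The helper name `right_aligned_cells`, the choice to also reset on `</tr>` and `</table>` (for cells that are never closed), and the case-insensitive match on the attribute value are all assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not from the thread): returns the text of every
# <td align="right"> cell found in the given HTML string.
sub right_aligned_cells {
    my ($html) = @_;
    my @cells;
    my $in_cell = 0;    # true only inside a <td align="right"> cell

    my $p = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,
        start_h => [ sub {
            my ($tagname, $attr) = @_;
            # Only a <td> changes the state; nested tags such as <b>
            # inside the cell leave the flag alone.
            if ($tagname eq 'td') {
                $in_cell = (lc($attr->{align} // '') eq 'right') ? 1 : 0;
                push @cells, '' if $in_cell;    # start collecting a new cell
            }
        }, 'tagname, attr' ],
        end_h => [ sub {
            my ($tagname) = @_;
            # Assumption: closing the cell, row or table ends the cell.
            $in_cell = 0 if $tagname =~ /^(?:td|tr|table)$/;
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            $cells[-1] .= $text if $in_cell;
        }, 'dtext' ],
    );

    $p->parse($html);
    $p->eof;    # signal end of input so buffered text is flushed

    s/^\s+|\s+$//g for @cells;    # trim surrounding whitespace in place
    return @cells;
}

my $sample = '<table><tr><td align="right"><b>1</b></td> <td> app2<br></td></tr></table>';
print "$_\n" for right_aligned_cells($sample);
```

Compared with the original, text that follows a `</td>` is no longer printed, and a nested tag inside the cell no longer switches collection off. To run it against a file, read the file into a string first, or build the same parser and call `parse_file` on it.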
Re: HTML Parser print text by Joost (Canon) on Jul 07, 2004 at 18:14 UTC
*sigh* Well, which is it? Do you want plain text or tags with attributes? And what's wrong with what your code is doing now, and what's in testpage.html? J.
Re^2: HTML Parser print text by Anonymous Monk on Jul 07, 2004 at 19:20 UTC
Thanks for the reply. I need the plain text embedded in tags with attributes. My code won't print any output. testpage.html is nothing but a simple hello world page. Thanks again.
Re^3: HTML Parser print text by Joost (Canon) on Jul 07, 2004 at 19:29 UTC
testpage.html CAN'T be a simple "Hello World" page, because, by my definition of simple, that would look like:

    <html>
    <head>
    <title>Some title</title>
    </head>
    <body>
    Hello, World
    </body>
    </html>

and would not contain any `<td>` tags. Besides, I still don't know WHAT the tags are that you want around it (or do you want the plain text in an attribute? I've no clue).

So please:

Show the input file. Completely.
Show the code. Completely.
Show your intended output. Completely.

And tell us a bit about why it should be exactly that output.
Re^4: HTML Parser print text by Anonymous Monk on Jul 07, 2004 at 20:13 UTC