Please help with a fetching issue

mahira has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I have an issue and after several hours of "googling" and "cpanning" I was not able to solve it :(

My aim is to fetch a specific, single HTML page and copy the exact, as-is content of a specific table cell (td) into a scalar variable.

The document has several tables. Neither my table nor any of the others has an id, but the table cell (td) I am interested in has a specific class defined.

Thanks for your help...

PS: I am aware of LWP::Simple and am currently using it to fetch the page to a file. But the rest :((

Re: Please help with a fetching issue by marto (Cardinal) on Mar 24, 2010 at 15:22 UTC
You could use one of the HTML parsing/manipulating modules such as HTML::TableExtract or HTML::TokeParser. Update: oops, fixed copy/paste error.
Re: Please help with a fetching issue by Your Mother (Archbishop) on Mar 24, 2010 at 19:01 UTC
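Either of these (tuned to your need) should do the trick: HTML::TokeParser::Simple or XML::LibXML.

```perl
use LWP::Simple qw( get );
use HTML::TokeParser::Simple;

# Take the URI from the command line and fetch the page into a scalar.
my $page = get( +shift || die "Gimme URI!\n" );

# Option 1: walk the <td> start tags until one carries the wanted class.
my $p = HTML::TokeParser::Simple->new(\$page);
while ( my $token = $p->get_tag("td") ) {
    # "|| ''" avoids an undef warning for cells without a class attribute.
    next unless ( $token->get_attr("class") || "" ) =~ /\bsomeClass\b/;
    my $first_child = $p->get_token();    # first token inside the cell
    print $first_child->as_is, $/;
    last;
}

# Option 2: parse leniently as HTML and select the cell with XPath.
use XML::LibXML;
my $parser = XML::LibXML->new;    # renamed so it does not mask $p above
$parser->recover_silently(1);
my $doc = $parser->parse_html_string($page);
my ( $td ) = $doc->findnodes('//td[@class="someClass"]');
print $td->textContent, $/;
```

(update: rolled into single `<code/>` and fixed tag name.)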
Re^2: Please help with a fetching issue by mahira (Acolyte) on Mar 25, 2010 at 06:50 UTC
Thank you very much. I was not able to make use of the solutions above. I don't know why, but I think it is something related to the page... In the end, I was able to fix the issue with some regexes. This time I anchored on a tag right before the table cell (td). After fetching the page:

```perl
$page =~ s/(\n|\r)/<!--xxx-->/g;    # hide line breaks so "." can match across them
$page =~ s/.*<!--start\stag-->(.*)<!--end\stag-->.*/$1/;    # keep only the text between the markers
$page =~ s/<!--xxx-->/\n/g;    # restore the line breaks
```

Thanks again for your help.
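
For anyone trying the same marker-based approach, here is a minimal sketch of a variant that uses the `/s` modifier, so the line breaks do not have to be swapped out and back in. It assumes the page really does contain literal `<!--start tag-->` and `<!--end tag-->` comments around the cell, as in the post above. The helper name `extract_between` and the sample page are made up for illustration and are not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text between two literal markers,
# or undef if the pair is not found.
sub extract_between {
    my ( $page, $start, $end ) = @_;

    # \Q...\E quotes the markers so they are matched literally.
    # /s lets "." match newlines, so the page needs no pre-processing.
    # The non-greedy (.*?) stops at the first end marker after the start.
    return $page =~ /\Q$start\E(.*?)\Q$end\E/s ? $1 : undef;
}

# Made-up sample page; in practice $page would come from LWP::Simple's get().
my $page = join "\n",
    '<table><tr>',
    '<!--start tag--><td class="someClass">',
    '  <b>Hello</b>',
    '</td><!--end tag-->',
    '</tr></table>';

my $cell = extract_between( $page, '<!--start tag-->', '<!--end tag-->' );
print defined $cell ? $cell : "markers not found", "\n";
```

Note one difference from the three-step version: the capture here is non-greedy, so it ends at the first end marker rather than the last one. If the markers can repeat on the page, that choice matters. A regex like this only works because the markers are unique strings; for matching on the td class itself, one of the parser modules suggested above is likely the more robust route.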