Cutting big HTML file

HTTP-404 has asked for the wisdom of the Perl Monks concerning the following question:

Hello again. I have a page from which I want to rip out all the content between a few lines:

JUNK HTML
..............
<td class="ContenidoTitulo" colspan="3">Trucos de  A 10 CUBA para PC </td>
................
MORE NEEDED HTML
................
<td class="ContenidoTexto" colspan="3"> CODIGOS<br>Some Text </td>
.............
JUNK HTML

I have the following code, which fetches the page I want to cut the needed part out of:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

surf();

sub surf {
    # Create a user agent object and give it a browser-like agent string
    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000) Opera 5.12  [en]');
    my $req = HTTP::Request->new(GET => 'http://www.chollonet.com/trucoteca/truco.php?idjuego=598&plataforma=4');

    # Fetch the page; as_string returns the response headers followed by the HTML
    my $html = $ua->request($req)->as_string;
    print $html;
}

I think that the following regex should be fine:

 /^<td class="ContenidoTitulo" colspan="3">Trucos de  A 10 CUBA para PC <\/td>(.*)<td class="ContenidoTexto" colspan="3"> CODIGOS<br>Some Text <\/td>/

But how would I use it in my script? Thanks a lot.


Replies are listed 'Best First'.
Re: Cutting big HTML file by !me (Acolyte) on Jul 29, 2001 at 05:03 UTC
Here is my non-regex solution. It's long, ugly and not very efficient, but it works!

my $page = "12345abc67890";
print get_page_chunk('12345', '67890', $page);

# Return the text between $start_marker and $end_marker,
# or '' if either marker is missing.
sub get_page_chunk {
    my ($start_marker, $end_marker, $page) = @_;
    my ($x1, $x2) = (-1, -1);
    $x1 = index($page, $start_marker);
    if ($x1 != -1) {
        # Skip past the start marker, then look for the end marker after it
        $x1 += length($start_marker);
        $x2 = index($page, $end_marker, $x1);
        if ($x2 != -1 && $x2 > $x1) {
            return substr($page, $x1, $x2 - $x1);
        }
    }
    return '';
}
Re: Cutting big HTML file by andye (Curate) on Jul 29, 2001 at 01:59 UTC
Hi again, I'd do something like this...

use LWP::Simple;

my $html = get('http://www.chollonet.com/trucoteca/truco.php?idjuego=598&plataforma=4');
$html =~ /<td class="ContenidoTitulo" colspan="3">Trucos de  A 10 CUBA para PC <\/td>(.*)<td class="ContenidoTexto" colspan="3"> CODIGOS<br>Some Text <\/td>/s;
print $1;

Notice that I've changed your regexp a little, at the beginning and at the end (no leading ^, and an /s modifier so that . also matches newlines). I hope I've helped.

andy.
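As a minimal sketch of the same regex idea, the version below runs offline against a small stand-in string rather than the live page. The `$html` sample and the `extract_between` helper are assumptions made up for illustration, not part of the original script. It differs from the regex above in three ways: the markers are quoted with `\Q...\E`, the capture is non-greedy, and the match is checked before `$1` is used.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the fetched page; a real script would get this from LWP.
my $html = <<'END_HTML';
<tr><td>JUNK HTML</td></tr>
<td class="ContenidoTitulo" colspan="3">Trucos de  A 10 CUBA para PC </td>
<tr><td>MORE NEEDED HTML</td></tr>
<td class="ContenidoTexto" colspan="3"> CODIGOS<br>Some Text </td>
<tr><td>JUNK HTML</td></tr>
END_HTML

# Hypothetical helper: returns the text between two literal markers,
# or undef when the text is undefined or the markers are not found.
sub extract_between {
    my ($text, $start, $end) = @_;
    return unless defined $text;
    # \Q...\E quotes the markers so characters such as / . ( are taken literally.
    # (.*?) is non-greedy, so it stops at the first end marker after the start.
    # /s lets . match newlines, so the capture may span several lines.
    if ($text =~ /\Q$start\E(.*?)\Q$end\E/s) {
        return $1;
    }
    return;
}

my $start = '<td class="ContenidoTitulo" colspan="3">Trucos de  A 10 CUBA para PC </td>';
my $end   = '<td class="ContenidoTexto" colspan="3"> CODIGOS<br>Some Text </td>';

my $chunk = extract_between($html, $start, $end);
print defined $chunk ? $chunk : "markers not found\n";
```

Quoting the markers avoids having to escape the `/` in `</td>` by hand, and the explicit check avoids printing a stale or undefined `$1` when the page layout changes.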
Re: Cutting big HTML file by mitd (Curate) on Jul 29, 2001 at 11:46 UTC
Since your target data appears to be contained entirely within tables, you may find HTML::TableExtract very useful.

mitd-Made in the Dark
'My favourite colour appears to be grey.'
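A rough sketch of what the HTML::TableExtract suggestion could look like, assuming the module is installed from CPAN. The sample table is a made-up stand-in for the real page, and `@lines` is just an illustrative variable. With no constraints passed to `new`, every table in the document is extracted.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical stand-in for the fetched page.
my $html = <<'END_HTML';
<table>
<tr><td class="ContenidoTitulo" colspan="3">Trucos de  A 10 CUBA para PC </td></tr>
<tr><td>MORE</td><td>NEEDED</td><td>HTML</td></tr>
<tr><td class="ContenidoTexto" colspan="3"> CODIGOS<br>Some Text </td></tr>
</table>
END_HTML

# No headers/depth/count constraints: extract every table found.
my $te = HTML::TableExtract->new;
$te->parse($html);

my @lines;
for my $table ($te->tables) {
    for my $row ($table->rows) {
        # Cells can be undef (for example where a cell spans columns), so skip those.
        push @lines, join ' | ', grep { defined } @$row;
    }
}

print "$_\n" for @lines;
```

Note that this selects by table position or header text rather than by the `class` attribute, so on the real page it would probably need `depth`, `count` or `headers` constraints to pick out the right table.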
Re: Cutting big HTML file by earthboundmisfit (Chaplain) on Jul 29, 2001 at 02:54 UTC
Not much to add, except: have you taken a look at HTML::Parser?
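For what it's worth, a minimal sketch of the HTML::Parser route might look like the following, assuming the module is installed. It collects the text of every `<td>` whose `class` is one of the two names from the question. The sample `$html`, the `%wanted` set and the `@cells` array are illustrative assumptions, and the simple in-cell flag does not handle nested tables.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical stand-in for the fetched page.
my $html = '<table>'
         . '<tr><td class="ContenidoTitulo" colspan="3">Trucos de  A 10 CUBA para PC </td></tr>'
         . '<tr><td class="Otro">JUNK</td></tr>'
         . '<tr><td class="ContenidoTexto" colspan="3"> CODIGOS<br>Some Text </td></tr>'
         . '</table>';

my %wanted  = map { $_ => 1 } qw(ContenidoTitulo ContenidoTexto);
my @cells;         # text of each matching cell, in document order
my $in_cell = 0;   # true while the parser is inside a wanted <td>

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [ sub {
        my ($tag, $attr) = @_;
        if ($tag eq 'td') {
            # Start a new entry only for cells with a wanted class.
            $in_cell = $wanted{ $attr->{class} // '' } ? 1 : 0;
            push @cells, '' if $in_cell;
        }
        elsif ($tag eq 'br' && $in_cell) {
            $cells[-1] .= "\n";   # keep <br> as a line break
        }
    }, 'tagname, attr' ],
    text_h => [ sub {
        my ($text) = @_;
        $cells[-1] .= $text if $in_cell;
    }, 'dtext' ],
    end_h => [ sub {
        my ($tag) = @_;
        $in_cell = 0 if $tag eq 'td';
    }, 'tagname' ],
);

$parser->parse($html);
$parser->eof;

print "$_\n" for @cells;
```

Unlike a literal regex, this keeps working if the whitespace or attribute order inside the tags changes, at the cost of a bit more code.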