in reply to [Re7]: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1
in thread Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1
my (@text, $found_start);
while (my $t = $p->get_token) {
    my $txt;
    if ($t->is_text) {
        $txt = $t->as_is;
        for ($txt) {
            s/^\s+//;
            s/\s+$//;
        }
        next unless $txt;
        $found_start++ if $txt =~ /^Hit/;
    }
    elsif ( $found_start and $t->is_start_tag(q{a}) and $t->get_attr(q{href}) ) {
        my $href = $t->get_attr(q{href});
        if ($href =~ /mailto:/i) {
            $txt = $href;
        }
        else {
            next;
        }
    }
    else {
        next;
    }
    next unless $found_start;
    push @text, $txt;
    last if $txt =~ /Listed since/;
}
Hit 7 out of 120517
name 1
type: one (for example)
Adress: Paris, 3ne Boulevard Saint Lo
Telefon:048 + 334555664 , Fax: 048 + 334555667
MyWeb-Nummer: 222237520031111
Webmaster:
mailto: webmaster@demosite.fr
master
Listed since: 20.08.2002
"All the output should be written in only one new text file."
Well, open a new text file for writing. :-) See perldoc -f open for how to do that.
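Opening a file for writing could look roughly like the sketch below. The file name results.txt is made up for illustration, and the two-element @text only stands in for the array the token loop fills; adapt both to your script.

```perl
use strict;
use warnings;

# Hypothetical output file name; pick whatever suits your setup.
my $outfile = 'results.txt';

# Stand-in for the @text array filled by the token loop.
my @text = ('Hit 7 out of 120517', 'Listed since: 20.08.2002');

# '>' creates the file or truncates an existing one; use '>>' to append.
open my $out, '>', $outfile
    or die "Cannot open '$outfile' for writing: $!";

# Write one collected item per line.
print {$out} "$_\n" for @text;

close $out
    or die "Cannot close '$outfile': $!";
```

The three-argument form of open with a lexical filehandle keeps the mode separate from the file name, and checking the return value tells you right away if the file could not be created.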
Bart has given some excellent tips on how to get a list of HTML files so that you can loop over them.
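Putting the two pieces together might look something like this. It is only a sketch, and not necessarily the approach Bart described: it assumes the pages sit in a directory called html and uses glob to find them, and extract_text is a hypothetical placeholder for the token loop that just returns the file name so the example runs on its own.

```perl
use strict;
use warnings;

# Hypothetical directory and output file names.
my $dir     = 'html';
my $outfile = 'all_results.txt';

# Placeholder for the token loop: takes a file name and
# returns the list of lines to keep for that file.
sub extract_text {
    my ($file) = @_;
    return ("File: $file");
}

# Open the single output file once, before the loop,
# so every record ends up in the same file.
open my $out, '>', $outfile
    or die "Cannot open '$outfile' for writing: $!";

my @files = glob "$dir/*.html";
for my $file (sort @files) {
    print {$out} "$_\n" for extract_text($file);
    print {$out} "\n";    # blank line between records
}

close $out
    or die "Cannot close '$outfile': $!";
```

Opening the output file inside the loop with '>' would truncate it on every pass and leave only the last record, which is why the open sits outside.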
Good luck!
Re^2: [Re7]: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1
by Perlbeginner1 (Scribe) on Sep 26, 2010 at 18:45 UTC