Re: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1

This extracts the text using HTML::TokeParser::Simple, which is a wrapper around HTML::TokeParser (itself built on HTML::Parser). I've added whitespace to the HTML for clarity.

#! /usr/bin/perl

use strict;
use warnings;

use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new(*DATA)
  or die qq{can't parse html: $!\n};

my @text;
while (my $t = $p->get_token){
  next unless $t->is_text;
  my $txt = $t->as_is;
  if ($txt =~ /Hit/ .. $txt =~ /Listed since/){
    for ($txt){
      s/^\s+//;
      s/\s+$//;
    }
    next unless $txt;
    push @text, $txt;
  }
}

print qq{$_\n} for @text;

  
__DATA__
<br><br>

<h2>Hit 7 out of 120517</h2>

<img 
  src="http://myweb.org/images/wappen/ni.gif" 
  class="wappen_pos" 
  width="45" 
  height="53" 
  alt="country" 
  title="countryname" 
/>

<br>

<div style="width: 40em;"><br>
  
  <div style="display: inline;">
    
    <div class="logo_homepage">
      <a 
        class="img_inl" 
        href="http://myWeb.org/222237520031111"
      >
      </a>
    </div>
    
    <br>
    
    <div class="fm_linkeSpalte">
      <h2>name 1</h2>
      <br>
      <span class="schulart_text">type: one (for example)</span>
      <p class="einzel_text">
        Adress: Paris, 3ne Boulevard Saint Lo<br /><br>
        Telefon:048 + 334555664  , Fax: 048 + 334555667<br />
        MyWeb-Nummer:  222237520031111   <br />
        Webmaster: 
        <a 
          href="mailto: webmaster@demosite.fr" 
          class="p1"
        >
          master
        </a>
        <br />
      </p>
    </div>
    
  <div>
    
  <p class="ta_left einzel_text"></p>
  
</div>

<br />
  
<div>
  <p class="ta_left einzel_text">Listed since: 20.08.2002</p>
</div>
   
<br><br><br><br>
Output:

Hit 7 out of 120517
name 1
type: one (for example)
Adress: Paris, 3ne Boulevard Saint Lo
Telefon:048 + 334555664  , Fax: 048 + 334555667
MyWeb-Nummer:  222237520031111
Webmaster:
master
Listed since: 20.08.2002

One of the anchor tags has an email address as the href attribute. Do you need to collect that as well?

[Re7]: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by Perlbeginner1 (Scribe) on Sep 26, 2010 at 12:03 UTC
Hello wfsp, hello roboticus, hello dear community!

Many thanks for the quick replies, and many thanks to all the other posters as well. I am very happy to be here; this is a great place to be.

I am a beginner on Linux (I run OpenSuse 11.4 Milestone 1, and OpenSuse 11.3 on a second machine). wfsp and roboticus, both of your approaches look very impressive, and I want to try them both out.

One question comes to mind; perhaps you have already answered it in your code and I have not seen it, since I am a complete newbie. Where do I put the large number of HTML files that need to be parsed? Do I have to name them in the script, and if so, how? At the moment they are all in one folder (note: more than 10,000 files). I want to read each HTML file, extract its content, and write all the results to a single new text file. I am only interested in the content containing the words mentioned above.

And yes, the anchor tag with the e-mail address is important too. I want to collect that e-mail address as well.

It is important to have clean output, that is, text with line breaks. wfsp, the output of your approach is exactly what I want. This is the format I prefer:

Hit 7 out of 120517
name 1
type: one (for example)
Adress: Paris, 3ne Boulevard Saint Lo
Telefon:048 + 334555664  , Fax: 048 + 334555667
MyWeb-Nummer:  222237520031111
Webmaster:
master
Listed since: 20.08.2002

Superb! I need the parsing results written in this format, and all of them in only one text file. That is important.

Again (you can probably see that I am new to Linux too): where do I store the HTML files that need to be parsed, and where are the results written to? Do I have to write both locations into the code?

BTW, on a Windows machine it would have to look something like the following, wouldn't it?

`my $HTML_dir = 'C:\htmlperl';
my $output   = 'C:\htmlperl\output.txt';
my $file     = $ARGV[0];`

or in general:

`use File::Find::Rule;

# folder where the HTML files that need to be parsed are stored
my $html_dir = '/path/to/dir/with/html.files';

# fetch all .html files from the directory
my @html_files = File::Find::Rule->file->name('*.html')->in($html_dir);

for my $file (@html_files) {
    # parse the files
    # store all results from the HTML files in only one txt file
}`

Sorry for the newbie question ;-) I am very glad to have found such a superb place to ask all the questions I have in mind. This is a great place to learn!

Many thanks to all of you. Looking forward to hearing from you.

Best regards,
perlbeginner1
Re^8: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by bart (Canon) on Sep 26, 2010 at 13:25 UTC
"Where to put the large number of HTML-files, that need to be parsed: Do i have to call them in the script? How to do that!?"

I assume you mean "Do I have to name them all in a script?" No, you don't. You can put them anywhere you like (but preferably not mixed in with unrelated files) and use glob in your script to get a complete list of those files in one directory, or possibly even in adjacent directories:

`# all html files in one directory
my @files = glob 'path/to/dir/*.html';`

or

`# all html files in all (direct, sibling) subdirectories of a directory
my @files = glob 'path/to/dir/*/*.html';`

If you need an even more elaborate directory structure, you can use File::Find or one of its derivatives to find the names of all HTML files recursively.

You then parse each file, one at a time. You can use a regexp substitution such as `s/\.html$/.txt/` to produce the name for the text file, if you want to put it right beside the original file. To put the new file in a different directory while preserving the directory structure, you can do a path substitution using `abs2rel`/`rel2abs` from File::Spec/File::Spec::Functions:

`use File::Spec::Functions qw(rel2abs abs2rel);
my $txt = rel2abs(abs2rel($file, $htmlroot), $txtroot);   # relocate
$txt =~ s/\.html$/.txt/;                                  # extension`

If your directory tree is deep, you may have to create the target directory first, for example with mkpath (from File::Path), before attempting to open the text file.

If you want all text files to be in one and the same directory, you can just use File::Basename's `basename` to strip the directory from the path.
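A minimal sketch of the two path-mapping ideas above, using only core modules and assuming Unix-style paths. The directory names and the helper subs `txt_path_for` and `flat_txt_path_for` are hypothetical and exist only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use File::Basename qw(basename);
use File::Spec::Functions qw(catfile rel2abs abs2rel);

# Map an HTML path under $htmlroot to a .txt path under $txtroot,
# preserving the subdirectory structure.
sub txt_path_for {
    my ($file, $htmlroot, $txtroot) = @_;
    my $txt = rel2abs(abs2rel($file, $htmlroot), $txtroot);   # relocate
    $txt =~ s/\.html$/.txt/;                                  # extension
    return $txt;
}

# Same idea, but every text file ends up in one directory.
sub flat_txt_path_for {
    my ($file, $txtdir) = @_;
    (my $name = basename($file)) =~ s/\.html$/.txt/;
    return catfile($txtdir, $name);
}

# Hypothetical locations; adjust them to your own layout.
my $htmlroot = '/path/to/dir';
my $txtroot  = '/path/to/txt';

# HTML files directly in $htmlroot plus one level of subdirectories.
my @files = (glob("$htmlroot/*.html"), glob("$htmlroot/*/*.html"));

for my $file (@files) {
    print "$file -> ", txt_path_for($file, $htmlroot, $txtroot), "\n";
}
```

Note that glob splits its pattern on whitespace, so a directory name containing spaces needs extra quoting or a different approach such as File::Find.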
Re: [Re7]: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by wfsp (Abbot) on Sep 26, 2010 at 13:51 UTC
To get the email address as well, replace my while loop with

`my (@text, $found_start);
while (my $t = $p->get_token){
  my $txt;
  if ($t->is_text){
    $txt = $t->as_is;
    for ($txt){
      s/^\s+//;
      s/\s+$//;
    }
    next unless $txt;
    $found_start++ if $txt =~ /^Hit/;
  }
  elsif (
    $found_start and
    $t->is_start_tag(q{a}) and
    $t->get_attr(q{href})
  ){
    my $href = $t->get_attr(q{href});
    if ($href =~ /mailto:/i){
      $txt = $href;
    }
    else {
      next;
    }
  }
  else{
    next;
  }
  next unless $found_start;
  push @text, $txt;
  last if $txt =~ /Listed since/;
}`

Output:

`Hit 7 out of 120517
name 1
type: one (for example)
Adress: Paris, 3ne Boulevard Saint Lo
Telefon:048 + 334555664  , Fax: 048 + 334555667
MyWeb-Nummer:  222237520031111
Webmaster:
mailto: webmaster@demosite.fr
master
Listed since: 20.08.2002`

"All the output should be written in only one new text file."

Well, open a new text file for writing. :-) See open for how to do that. Bart has given some excellent tips on how to get a list of HTML files so that you can loop over them.

Good luck!
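Putting the pieces together, here is a rough sketch of looping over a directory of HTML files and writing everything to one text file. It assumes HTML::TokeParser::Simple is installed and that all files sit directly in one directory; the sub name `extract_text` and the command-line convention (HTML directory, then output file) are made up for this example. The extraction logic mirrors the loop above, lightly condensed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser::Simple;

# Collect the text between "Hit" and "Listed since", plus any
# mailto: link found in between, from one parser object.
sub extract_text {
    my ($p) = @_;
    my (@text, $found_start);
    while (my $t = $p->get_token) {
        my $txt;
        if ($t->is_text) {
            $txt = $t->as_is;
            $txt =~ s/^\s+//;
            $txt =~ s/\s+$//;
            next unless length $txt;
            $found_start++ if $txt =~ /^Hit/;
        }
        elsif ($found_start and $t->is_start_tag('a')) {
            my $href = $t->get_attr('href');
            next unless defined $href and $href =~ /mailto:/i;
            $txt = $href;
        }
        else {
            next;
        }
        next unless $found_start;
        push @text, $txt;
        last if $txt =~ /Listed since/;
    }
    return @text;
}

# Hypothetical usage: perl extract.pl /dir/with/html output.txt
if (@ARGV == 2) {
    my ($html_dir, $output) = @ARGV;
    open my $out, '>', $output or die "Cannot write $output: $!\n";
    for my $file (sort glob "$html_dir/*.html") {
        my $p = HTML::TokeParser::Simple->new(file => $file)
            or do { warn "Cannot parse $file: $!\n"; next };
        print {$out} "$_\n" for extract_text($p);
        print {$out} "\n";    # blank line between records
    }
    close $out or die "Cannot close $output: $!\n";
}
```

Opening the output file once, before the loop, is what keeps all results in a single file; opening it with '>' inside the loop would overwrite it for every HTML file.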
Re^2: [Re7]: Parsing with HTML::TreeBuilder::LibXML on OpenSuse Linux 11.4 Milestone 1 by Perlbeginner1 (Scribe) on Sep 26, 2010 at 18:45 UTC
Hello bart, hello wfsp! Many, many thanks for your help! I will try out these hints and your code, then come back and report the results. See you soon. Best regards, perlbeginner1