Re^7: path-names [a very easy question of a true beginner]

Hello all,

Many thanks to you! I did as you advised, and now it works! I changed from

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use File::Find::Rule;
my @files = File::Find::Rule->file()
                 ->name('einzelergebnis*.html')
                 ->in( '/home/usr/perl/htmlfiles' );
foreach my $file(@files) {
        print $file, "\n";

}

to this:

#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
use File::Find::Rule;
my @files = File::Find::Rule->file()
                 ->name('einzelergebnis*.html')
                 ->in( '.' );
foreach my $file(@files) {
        print $file, "\n";

}

and then I got the following output:

htmlfiles/einzelergebnis80b5.html
htmlfiles/einzelergebnisa0ef.html
htmlfiles/einzelergebnis1b42.html
htmlfiles/einzelergebnis5960.html
htmlfiles/einzelergebnise523.html
htmlfiles/einzelergebnis2c7e.html
htmlfiles/einzelergebnisdf57.html
htmlfiles/einzelergebnis2b53-2.html
htmlfiles/einzelergebnisb1c0-2.html
htmlfiles/einzelergebnis8e8b.html
htmlfiles/einzelergebnisdcc1.html
htmlfiles/einzelergebnis1dae-2.html
htmlfiles/einzelergebnisa70d.html
htmlfiles/einzelergebnis3cec.html
htmlfiles/einzelergebnis3f1f.html
htmlfiles/einzelergebnis1d2b.html
htmlfiles/einzelergebnis396c.html
htmlfiles/einzelergebnis2592.html
htmlfiles/einzelergebnisdee0.html
htmlfiles/einzelergebnis987b-2.html
htmlfiles/einzelergebnise20b.html

...and 22 thousand more lines... ;-)
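
A side note on the output above: the paths are relative, because the search started at '.', so they are only valid from the directory the script was run in. If absolute paths are needed later, a minimal sketch with the core module File::Spec could look like this. The two file names are samples copied from the output above; the rest is only an illustration.

```perl
use strict;
use warnings;
use File::Spec;

# Relative paths as returned by ->in('.'); two samples from the output above.
my @files = (
    'htmlfiles/einzelergebnis80b5.html',
    'htmlfiles/einzelergebnisa0ef.html',
);

# rel2abs() prepends the current working directory to a relative path.
my @absolute = map { File::Spec->rel2abs($_) } @files;

print "$_\n" for @absolute;
```

rel2abs() does not check that the files exist; it only builds the path strings.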

This seems to be the starting point! Now I can continue figuring out how to configure Keath's script (see this link to another thread with the little script: http://forums.devshed.com/showpost.php?p=2538358&postcount=12). As that previous thread is very, very long, I think it is worth beginning a new one.

Note: many, many thanks to Keath and Axldrweil for their great and generous help!

So, having nailed down the I/O handle issues and the path names in general, the parser script now has to be configured.

Well, this means I have to set $file to the file or directory, including its path, and furthermore define a path in $html_dir.

BTW, what does the array @html_files do?

Here is the full code of the HTML parser:


#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser;

my $file = 'school.html';
my $p = HTML::TokeParser->new($file) or die "Can't open: $!";

my %school;
while (my $tag = $p->get_tag('div', '/html')) {
    # first move to the right div that contains the information
    last if $tag->[0] eq '/html';
    next unless exists $tag->[1]{'id'} and $tag->[1]{'id'} eq 'inhalt_large';
    
    $p->get_tag('h1');
    $school{'location'} = $p->get_text('/h1');
    
    while (my $tag = $p->get_tag('div')) {
        last if exists $tag->[1]{'id'} and $tag->[1]{'id'} eq 'fusszeile';
        
        # get the school name from the heading
        next unless exists $tag->[1]{'class'} and $tag->[1]{'class'} eq 'fm_linkeSpalte';
        $p->get_tag('h2');
        $school{'name'} = $p->get_text('/h2');
        
        # verify format for school type
        $tag = $p->get_tag('span');
        unless (exists $tag->[1]{'class'} and $tag->[1]{'class'} eq 'schulart_text') {
            warn "unexpected format: parsing stopped";
            last;
        }
        $school{'type'} = $p->get_text('/span');
        
        # verify format for address
        $tag = $p->get_tag('p');
        unless (exists $tag->[1]{'class'} and $tag->[1]{'class'} eq 'einzel_text') {
            warn "unexpected format: parsing stopped";
            last;
        }
        $school{'address'} = clean_address($p->get_text('/p'));
        
        # find the description
        $tag = $p->get_tag('p');
        $school{'description'} = $p->get_text('/p');
    }
}

print qq/$school{'name'}\n/;
print qq/$school{'location'}\n/;
print qq/$school{'type'}\n/;

foreach (@{$school{'address'}}) {
    print "$_\n";
}

print qq/\nDescription: $school{'description'}\n/;

sub clean_address {
    my $text = shift;
    my @lines = split "\n", $text;
    foreach (@lines) {
        s/^\s+//;
        s/\s+$//;
    }
    return \@lines;
}
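
One possible way to connect the file list with the parser is to wrap the parsing code in a subroutine and call it once per file. The following is only a sketch under assumptions: parse_school is a hypothetical name, its body is a placeholder, and the variable names $html_dir and @html_files are borrowed from the question above (how Keath's script really uses them is not shown here). It uses the core module File::Find in place of File::Find::Rule so that it runs without extra modules.

```perl
use strict;
use warnings;
use File::Find;   # core module, used here in place of File::Find::Rule

# Hypothetical wrapper: the body of the parser script would go here,
# with the hard-coded 'school.html' replaced by the argument.
sub parse_school {
    my ($file) = @_;
    my %school = (source => $file);
    # ... HTML::TokeParser code from the parser script ...
    return \%school;
}

my $html_dir = 'htmlfiles';   # assumed directory name, taken from the output above
my @html_files;

# Collect every einzelergebnis*.html below $html_dir (if it exists).
find(sub {
    return unless -f $_;
    return unless /^einzelergebnis.*\.html\z/;
    push @html_files, $File::Find::name;
}, $html_dir) if -d $html_dir;

# Run the parser once per file.
for my $file (sort @html_files) {
    my $school = parse_school($file);
    print "parsed $school->{source}\n";
}
```

Returning a hash reference per file keeps the results of different files apart, instead of overwriting one global %school on every pass.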

Note: I can provide much more information on what the script does!

I look forward to any and all help! This is a very, very great place to share knowledge! Many, many thanks for this great place!
perlbeginner1

Re^8: path-names [a very easy question of a true beginner]
by morgon (Priest) on Oct 02, 2010 at 23:02 UTC

This is probably beside the point, but instead of

my @files = File::Find::Rule->file()
                 ->name('einzelergebnis*.html')
                 ->in( '.' );

I would simply use

my @files = <einzelergebnis*.html>;

as all your files seem to be in one directory (which also seems to be your current working directory).

The difference is that your code would also find files that reside in subdirectories, and that may or may not be what you want. (I would not want subdirs, as I could then simply create a subdir and move files I want to exclude from processing there, but you may think differently about this; you just have to be aware of it.)
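
To make the difference concrete, here is a small sketch that builds a throwaway directory, puts one matching file at the top level and one in a subdirectory, and compares the two approaches. The file names are invented for the test, and the core module File::Find stands in for File::Find::Rule so the sketch runs without CPAN modules. Judging from the output posted above, the real files seem to sit in htmlfiles/ below the working directory, so a glob pattern would presumably need that prefix (htmlfiles/einzelergebnis*.html).

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Throwaway directory with one file at the top level and one in a subdirectory.
my $dir = tempdir(CLEANUP => 1);
make_path("$dir/htmlfiles");
for my $path ("$dir/einzelergebnis0001.html",
              "$dir/htmlfiles/einzelergebnis0002.html") {
    open my $fh, '>', $path or die "Can't create $path: $!";
    close $fh;
}

# glob only looks in the one directory named in the pattern.
# (Assumes the temp path contains no whitespace, which glob would split on.)
my @globbed = glob("$dir/einzelergebnis*.html");

# A recursive search also descends into subdirectories.
my @found;
find(sub {
    push @found, $File::Find::name
        if -f $_ && /^einzelergebnis.*\.html\z/;
}, $dir);

printf "glob: %d file(s), recursive: %d file(s)\n",
    scalar @globbed, scalar @found;
```

With this layout the glob should report one file and the recursive search two, which is exactly the behaviour to be aware of when choosing between them.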