Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

Good evening, hello dear community. I am currently working out how to parse a site with a table (containing 6150 records), see here: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20, in order to get a good result with comma-separated values. I have to make use of the Text::CSV module.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Text::CSV;
use Cwd;
use POSIX qw(strftime);

my $te             = HTML::TableExtract->new;
my $total_records  = 0;
my $suchbegriffe   = "e";
my $treffer        = 50;
my $range          = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?e=";
my $processdir     = "processing";
my $counter        = 50;
my $displaydate    = "";
my $percent        = 0;

&workDir();
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime( '%Y%m%d%H%M%S', localtime );
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";    # NOTE: this ends the program -- the pasted-in CSV code below never runs

sub processURL() {
    print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
    getstore( "$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html' )
        or die 'Unable to get page';

    while (<tempfile.html>) {
        open( FH, "$_" ) or die;
        while (<FH>) {
            if ( $_ =~ /^.*?(Treffer <b>)(\d+)( - )(\d+)(<\/b> \w+ \w+ <b>)(\d+).*/ ) {
                $total_records = $6;
                print "Total records to process is $total_records\n";
            }
        }
        close FH;
    }
    unlink 'tempfile.html';
}

sub processData() {
    while ( $range <= $total_records ) {
        getstore( "$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html' )
            or die 'Unable to get page';
        $te->parse_file('processing.html');
        my ($table) = $te->tables;
        for my $row ( $table->rows ) {
            cleanup(@$row);
            print OUTFILE "@$row\n";
        }
        $| = 1;
        print "Processed records $range to $counter";
        print "\r";
        $counter = $counter + 50;
        $range   = $range + 50;
        $te      = HTML::TableExtract->new;
    }
}

sub cleanup() {    # was commented out here, but processData() still calls it
    for (@_) {
        s/\s+/ /g;
    }
}

# ----- pasted in from the second script (unreachable: the die above ends the run) -----
my $html = get 'http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20';
$html =~ tr/\r//d;        # strip carriage returns
$html =~ s/&nbsp;/ /g;    # expand spaces

my $te2 = HTML::TableExtract->new;    # renamed: $te is already declared above
$te2->parse($html);

my @cols   = qw( rownum number name phone type website );
my @fields = qw( rownum number name street postal town phone fax type website );
my $csv    = Text::CSV->new( { binary => 1 } );

foreach my $ts ( $te2->table_states ) {
    foreach my $row ( $ts->rows ) {
        # trim leading/trailing whitespace from base fields
        s/^\s+//, s/\s+$// for @$row;

        # load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;

        # derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /\n+/, $h{name};
        @h{qw/phone fax/}               = split /\n+/, $h{phone};

        # trim leading/trailing whitespace from derived fields
        s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

        $csv->combine( @h{@fields} );
        print $csv->string, "\n";
    }
}

# what do you think!?

sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( !-d $processdir ) {
        mkdir( "$ENV{HOME}/$processdir", 0755 )
            or die "Cannot make directory $processdir: $!";
    }
}




What do you think? I get some nasty errors with this, but the concept is all right! I have some encoding errors: UTF-8 / ISO-8859-1.
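For what it's worth, a likely cause of UTF-8 / ISO-8859-1 trouble is that the page is served as ISO-8859-1 (Latin-1) while the output file is treated as UTF-8, or the other way around. A minimal sketch of the usual fix, decoding the fetched bytes before parsing and encoding only on output (the sample string here is mine, not from the site):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as LWP::Simple::get would return them if the server sends
# ISO-8859-1: "Muenchen" with u-umlaut is a single byte, 0xFC.
my $raw = "M\xFCnchen";

# Decode into Perl's internal character string before any parsing,
# so regexes and HTML::TableExtract see characters, not raw bytes.
my $text = decode('iso-8859-1', $raw);

# Encode to UTF-8 only when writing the CSV out; alternatively open
# the output file with '>:encoding(UTF-8)' and skip manual encoding.
my $utf8 = encode('utf-8', $text);

printf "chars=%d bytes=%d\n", length($text), length($utf8);
```

Mixing the two steps up (decoding UTF-8 bytes as Latin-1, or double-encoding) is what usually produces the garbled umlauts.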




Re: HTML::TableExtract - combined with Text::CSV - character issues rise
by roboticus (Chancellor) on Feb 19, 2011 at 23:20 UTC

    Perlbeginner1:

    My thoughts:

    • You gave us the code, but no error messages. It's frequently possible to diagnose problems from the error messages and code. I don't particularly feel like installing all the modules I would need to execute your script and see them myself, so there's not a lot I can do for you. Seeing as your post has been here for five or so hours with no responses, I'm not the only one who doesn't feel particularly motivated to dig into it.
    • That's a large chunk of code. Generally, you should cut out everything you don't need to give us less code to dig through to find the problem. That practice has two great side effects:
      • Shorter examples are easier for us to diagnose, so you'll probably get better responses,
      • Better yet: while you're in the process of chopping out unnecessary code, you'll often find and fix the error yourself!

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Hello dear roboticus, and hello (anonymous)!

      many many thanks for the reply!

      The thing is that I have had very good results with the first script! It ran great! It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
      (note: 6142 records). But note: the data are not separated!

      And I have a second script. This part can do the CSV format. I want to combine it with the spider logic.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use HTML::TableExtract;
      use LWP::Simple;
      use Text::CSV;
      use Cwd;
      use POSIX qw(strftime);

      my $te             = HTML::TableExtract->new;
      my $total_records  = 0;
      my $suchbegriffe   = "e";
      my $treffer        = 50;
      my $range          = 0;
      my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?e=";
      my $processdir     = "processing";
      my $counter        = 50;
      my $displaydate    = "";
      my $percent        = 0;

      &workDir();
      chdir $processdir;
      &processURL();
      print "\nPress <enter> to continue\n";
      <>;
      $displaydate = strftime( '%Y%m%d%H%M%S', localtime );
      open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
      &processData();
      close OUTFILE;
      print "Finished processing $total_records records...\n";
      print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
      unlink 'processing.html';
      die "\n";

      sub processURL() {
          print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
          getstore( "$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html' )
              or die 'Unable to get page';

          while (<tempfile.html>) {
              open( FH, "$_" ) or die;
              while (<FH>) {
                  if ( $_ =~ /^.*?(Treffer <b>)(\d+)( - )(\d+)(<\/b> \w+ \w+ <b>)(\d+).*/ ) {
                      $total_records = $6;
                      print "Total records to process is $total_records\n";
                  }
              }
              close FH;
          }
          unlink 'tempfile.html';
      }

      sub processData() {
          while ( $range <= $total_records ) {
              getstore( "$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html' )
                  or die 'Unable to get page';
              $te->parse_file('processing.html');
              my ($table) = $te->tables;
              for my $row ( $table->rows ) {
                  cleanup(@$row);
                  print OUTFILE "@$row\n";
              }
              $| = 1;
              print "Processed records $range to $counter";
              print "\r";
              $counter = $counter + 50;
              $range   = $range + 50;
              $te      = HTML::TableExtract->new;
          }
      }

      sub cleanup() {
          for (@_) {
              s/\s+/ /g;
          }
      }

      sub workDir() {
          # Use home directory to process data
          chdir or die "$!";
          if ( !-d $processdir ) {
              mkdir( "$ENV{HOME}/$processdir", 0755 )
                  or die "Cannot make directory $processdir: $!";
          }
      }



      But as this above script unfortunately does not take care of the separators, I had to find a method that does, in order to get the data (output) separated. With the separation I am able to work with the data, store it in a MySQL table, or do something else.

      So here are the bits that work out the CSV format. Note: I want to put the code below into the code above, to combine the spider logic of the above-mentioned code with the logic of outputting the data in CSV format.


      Where to set it in the code? Question: can we identify the point to migrate the one into the other? That would be amazing... I hope I could make clear what I have in mind. Are we able to use the benefits of both parts (scripts) by migrating them into one? So the question is: where does the CSV script go in the script above?


      #!/usr/bin/perl
      use warnings;
      use strict;
      use LWP::Simple;
      use HTML::TableExtract;
      use Text::CSV;

      my $html = get 'http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20';
      $html =~ tr/\r//d;        # strip carriage returns
      $html =~ s/&nbsp;/ /g;    # expand spaces

      my $te = HTML::TableExtract->new;
      $te->parse($html);

      my @cols   = qw( rownum number name phone type website );
      my @fields = qw( rownum number name street postal town phone fax type website );
      my $csv    = Text::CSV->new( { binary => 1 } );

      foreach my $ts ( $te->table_states ) {
          foreach my $row ( $ts->rows ) {
              # trim leading/trailing whitespace from base fields
              s/^\s+//, s/\s+$// for @$row;

              # load the fields into the hash using a "hash slice"
              my %h;
              @h{@cols} = @$row;

              # derive some fields from base fields, again using a hash slice
              @h{qw/name street postal town/} = split /\n+/, $h{name};
              @h{qw/phone fax/}               = split /\n+/, $h{phone};

              # trim leading/trailing whitespace from derived fields
              s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

              $csv->combine( @h{@fields} );
              print $csv->string, "\n";
          }
      }



      Where to set in?

              $te = HTML::TableExtract->new;
          }
      }

      sub cleanup() {
          for (@_) {
              s/\s+/ /g;
          }
      }

      sub workDir() {
          # Use home directory to process data
          chdir or die "$!";
          if ( !-d $processdir ) {
              mkdir( "$ENV{HOME}/$processdir", 0755 )
                  or die "Cannot make directory $processdir: $!";
          }
      }
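      One possible merge point, sketched below and not tested against the live site: keep the paging loop of the first script's processData(), and replace its print OUTFILE "@$row\n" body with the hash-slice/Text::CSV logic of the second script. The globals ($range, $total_records, $te, $url_to_process, $suchbegriffe, $treffer, $counter, OUTFILE) are assumed to be declared in the surrounding script exactly as in the first one.

```perl
# Sketch: processData() from the first script, with the row handling
# replaced by the Text::CSV logic from the second script.
sub processData {
    my @cols   = qw( rownum number name phone type website );
    my @fields = qw( rownum number name street postal town phone fax type website );
    my $csv    = Text::CSV->new( { binary => 1 } );

    while ( $range <= $total_records ) {
        getstore( "$url_to_process$suchbegriffe&a=$treffer&s=$range",
                  'processing.html' ) or die 'Unable to get page';
        $te->parse_file('processing.html');
        my ($table) = $te->tables;
        for my $row ( $table->rows ) {
            s/^\s+//, s/\s+$// for @$row;
            my %h;
            @h{@cols} = @$row;
            @h{qw/name street postal town/} = split /\n+/, $h{name};
            @h{qw/phone fax/}               = split /\n+/, $h{phone};
            s/^\s+//, s/\s+$// for @h{qw/name street postal town/};
            $csv->combine( @h{@fields} ) or die $csv->error_diag;
            print OUTFILE $csv->string, "\n";    # CSV line instead of "@$row"
        }
        $range   += 50;
        $counter += 50;
        $te = HTML::TableExtract->new;    # fresh extractor for the next page
    }
}
```

      Note that cleanup() (which collapses all whitespace) must not run before the splits on /\n+/, since those rely on the newlines inside the name and phone cells.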


      Where is the part to insert?
      Roboticus, I would be glad if you can assist me here.
      Thanks in advance!
      pb1

        Perlbeginner:

        One of the great things about unix is the philosophy that a program should do one thing well, and be easy to combine with other programs to do work. For example, you can accomplish many tasks with a combination of sort, grep, cut, paste and join without writing a line of code. So if you're finding merging your programs to be difficult, you could instead tune them up such that they work well with each other.

        After a brief glance it appears that the first script can work on all the files in a directory. So perhaps the second one should simply drop files into that directory for processing.
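        That drop-directory split could look something like this; a sketch only, with made-up file names, where the spider saves each fetched page into a directory and a second pass later converts whatever it finds there:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Sketch of the "one tool, one job" split: the spider drops each
# fetched page into a directory, and the CSV converter processes
# everything it finds there. Directory and naming are illustrative.
my $dir = tempdir( CLEANUP => 1 );

# Pass 1 (spider) would do: getstore($url, "$dir/page_$range.html");
# Simulate two dropped-off pages:
for my $offset ( 0, 50 ) {
    open my $fh, '>', "$dir/page_$offset.html" or die $!;
    print $fh "<html>...</html>\n";
    close $fh;
}

# Pass 2 (converter): pick up whatever the spider left behind.
my @pages = sort glob "$dir/page_*.html";
print scalar(@pages), " page(s) waiting for CSV conversion\n";
```

        Either script can then be fixed or rerun independently, which is exactly the point of the philosophy above.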

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: HTML::TableExtract - combined with Text::CSV - character issues rise
by Anonymous Monk on Feb 19, 2011 at 23:35 UTC