Hello dear roboticus, and hello (anonymous),

many thanks for the reply!

The thing is that I have had very good results with the first script. It runs well and fetches the data from the page http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note: 6142 records). But note that the fields in its output are not separated!

And I have a second script, which can produce the CSV format. I want to combine it with the spider logic.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Text::CSV;
use Cwd;
use POSIX qw(strftime);

my $te             = HTML::TableExtract->new;
my $total_records  = 0;
my $suchbegriffe   = "e";
my $treffer        = 50;
my $range          = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?e=";
my $processdir     = "processing";
my $counter        = 50;
my $displaydate    = "";
my $percent        = 0;

&workDir();
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";

sub processURL() {
    print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
    # getstore returns the HTTP status code, so test it with is_success
    is_success( getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') )
        or die 'Unable to get page';
    while( <tempfile.html> ) {
        open( FH, "$_" ) or die;
        while( <FH> ) {
            # the hit-count line carries the total number of records
            if( $_ =~ /^.*?(Treffer <b>)(\d+)( - )(\d+)(<\/b> \w+ \w+ <b>)(\d+).*/ ) {
                $total_records = $6;
                print "Total records to process is $total_records\n";
            }
        }
        close FH;
    }
    unlink 'tempfile.html';
}

sub processData() {
    while ( $range <= $total_records) {
        is_success( getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') )
            or die 'Unable to get page';
        $te->parse_file('processing.html');
        my ($table) = $te->tables;
        for my $row ( $table->rows ) {
            cleanup(@$row);
            print OUTFILE "@$row\n";
        }
        $| = 1;
        print "Processed records $range to $counter";
        print "\r";
        $counter = $counter + 50;
        $range   = $range + 50;
        $te      = HTML::TableExtract->new;
    }
}

sub cleanup() {
    for ( @_ ) {
        s/\s+/ /g;
    }
}

sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( ! -d $processdir ) {
        mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
    }
}
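To make the paging logic of the spider easier to follow, here is a minimal sketch of its two core steps in isolation: reading the total hit count from the result line, and listing the `s=` offsets the loop would request. The helper names `total_from_line` and `page_offsets` and the sample line are my own inventions for illustration; the real page markup may differ.

```perl
use strict;
use warnings;

# Extract the total hit count from the result-count line.
# Returns undef when the line does not match.
sub total_from_line {
    my ($text) = @_;
    return $text =~ m{Treffer <b>\d+ - \d+</b> \w+ \w+ <b>(\d+)} ? $1 : undef;
}

# List the offsets the paging loop would request, mirroring
# "while ( $range <= $total_records )" with a fixed step.
sub page_offsets {
    my ($total, $step) = @_;
    die "step must be positive\n" unless $step > 0;
    my @offsets;
    for ( my $s = 0; $s <= $total; $s += $step ) {
        push @offsets, $s;
    }
    return @offsets;
}

# Invented sample line (assumption), only shaped like the pattern expects.
my $line    = 'Treffer <b>1 - 50</b> von insgesamt <b>6142</b>';
my $total   = total_from_line($line);
my @offsets = page_offsets($total, 50);
printf "%d records, %d requests, last offset %d\n",
    $total, scalar @offsets, $offsets[-1];
```

With 6142 records and 50 hits per page this gives 123 requests, the last one starting at offset 6100.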



Unfortunately, the script above does not put separators between the fields, so I had to find a method that does, in order to get separated output. With separated data I can work with it further: store it in a MySQL table, or do something else.

So here are the bits that produce the CSV format. Note: I want to put the code below into the code above, to combine the spider logic of the first script with the CSV output logic of the second.


Question: where in the first script should the CSV code go? Can we identify the point at which to merge the one into the other? That would be amazing. I hope I have made clear what I have in mind: I would like to use the benefits of both scripts by merging them into one.


#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my $html = get 'http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20';
$html =~ tr/\r//d;       # strip carriage returns
$html =~ s/&nbsp;/ /g;   # expand spaces

my $te = new HTML::TableExtract();
$te->parse($html);

my @cols   = qw( rownum number name phone type website );
my @fields = qw( rownum number name street postal town phone fax type website );

my $csv = Text::CSV->new({ binary => 1 });

foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {
        # trim leading/trailing whitespace from base fields
        s/^\s+//, s/\s+$// for @$row;

        # load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;

        # derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /\n+/, $h{name};
        @h{qw/phone fax/}               = split /\n+/, $h{phone};

        # trim leading/trailing whitespace from derived fields
        s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

        $csv->combine(@h{@fields});
        print $csv->string, "\n";
    }
}
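One possible way to make the CSV part reusable is to pull the row handling out into a helper. The sketch below is only an illustration: `row_to_fields` is a name I made up, the sample row is invented, and it assumes the name cell holds four lines (name, street, postal code, town) and the phone cell two, as in the script above. It uses only core Perl, so the CSV encoding itself is left to Text::CSV.

```perl
use strict;
use warnings;

my @cols   = qw( rownum number name phone type website );
my @fields = qw( rownum number name street postal town phone fax type website );

# Turn one raw table row (array ref) into the ordered list of output fields.
# The caller's row is not modified.
sub row_to_fields {
    my ($row) = @_;

    # copy the cells and trim leading/trailing whitespace
    my @cells = map {
        my $c = defined $_ ? $_ : '';
        $c =~ s/^\s+|\s+$//g;
        $c;
    } @$row;

    # load the base fields with a hash slice; missing cells become ''
    my %h;
    @h{@cols} = @cells;
    $h{$_} //= '' for @cols;

    # derive the extra fields from the multi-line cells
    @h{qw/name street postal town/} = split /\n+/, $h{name};
    @h{qw/phone fax/}               = split /\n+/, $h{phone};

    # trim every output field and fill in parts that were missing
    for my $value ( @h{@fields} ) {
        $value = '' unless defined $value;
        $value =~ s/^\s+|\s+$//g;
    }
    return @h{@fields};
}

# Invented sample row (assumption), shaped like the six table columns.
my @sample = ( '1', '0001', "Testschule\n Musterstr. 1\n 12345 \n Musterstadt",
               "Tel. 111\nFax 222", 'Gymnasium', 'http://example.org' );
print join( '|', row_to_fields( \@sample ) ), "\n";
```

In the first script, the natural place to call such a helper would probably be inside the `for my $row` loop of `processData`, where `print OUTFILE "@$row\n";` could become `$csv->combine( row_to_fields($row) ); print OUTFILE $csv->string, "\n";`, with `$csv` created once near the top as in the second script. The `cleanup(@$row)` call would then have to go, because it collapses the newlines that the split relies on.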



Where should I insert it? Somewhere in this part?

        $te = HTML::TableExtract->new;
    }
}

sub cleanup() {
    for ( @_ ) {
        s/\s+/ /g;
    }
}

sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( ! -d $processdir ) {
        mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
    }
}


Where is the right place to insert it?
Roboticus, I would be glad if you could assist me here.
Thanks in advance!
pb1

In reply to Re^2: HTML::TableExtract - combined with Text::CSV - character issues rise by Perlbeginner1
in thread HTML::TableExtract - combined with Text::CSV - character issues rise by Perlbeginner1
