in reply to Re^5: collect data from web pages and insert into mysql
in thread collect data from web pages and insert into mysql
I haven't tried it yet, but I saw some things I'd like to comment on (assuming I've understood the code correctly; this new one was a bit above my level).
    sub get_next_page{
        my ($t) = @_;

        # we want table 9
        my @tables = $t->look_down(_tag => q{table});
        my $table = $tables[8];

        # first row
        my @trs = $table->look_down(_tag => q{tr});
        my $tr = $trs[0];

        # second column
        my @tds = $tr->look_down(_tag => q{td});
        my $td = $tds[1];

        # get any text
        my $page_number_txt = $td->as_text;

        # and test if it is a page number
        # will be undef otherwise
        my ($page) = $page_number_txt =~ /PAGE (\d) >/;
        return $page;
    }
If I understand correctly, you load the next page, walk through the source to a particular spot on the page, and look at the page number? This will fail in my scenario: the server keeps serving whatever page number you request even if it contains no data, so no matter what page number you enter you get a valid answer. I just used

    last if $content =~ /No sorties/;

which seems to do the trick.
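To make the idea concrete, here is a minimal sketch of a paging loop that stops on that marker text. The `fetch_page` sub is a hypothetical stand-in for the real HTTP request (it pretends pages 1 and 2 hold data), and `collect_pages` and the page cap of 50 are names and values made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real HTTP fetch: pages 1 and 2 hold data,
# every later page is the server's "empty" page.
sub fetch_page {
    my ($page) = @_;
    return $page <= 2
        ? "<td>sortie data for page $page</td>"
        : "<p>No sorties</p>";
}

# Collect page contents until the marker text appears.
# $max_pages is only a safety cap so the loop cannot run forever.
sub collect_pages {
    my ($max_pages) = @_;
    my @pages;
    for my $page (1 .. $max_pages) {
        my $content = fetch_page($page);
        last if $content =~ /No sorties/;
        push @pages, $content;
    }
    return @pages;
}

my @pages = collect_pages(50);
printf "%d page(s) with data\n", scalar @pages;
```

The cap matters because the server never reports an error for a page number that is too high, so the marker text is the only signal that the data has run out.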
    if ($href){
        # test for a sid in the query fragment
        my $uri = URI->new($href);
        my %q = $uri->query_form;

        # save it if it is there
        push @sids, $q{sid} if exists $q{sid};
    }
I guess this would be the perfect place for the second loop-exit condition: we want to stop processing sids when we find the last one previously processed ($lproc).
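A rough sketch of that exit condition, assuming the page lists sids newest first, so that everything after $lproc has already been processed. The `new_sids` sub and the sample sid values are made up for illustration; in the real script the check would sit next to the `push @sids` line.

```perl
use strict;
use warnings;

# Return the sids that come before $lproc in the list.
# Assumption: sids arrive newest first. If $lproc is undef or is never
# seen, every sid is returned.
sub new_sids {
    my ($lproc, @found) = @_;
    my @sids;
    for my $sid (@found) {
        # string comparison, in case a sid is not purely numeric
        last if defined $lproc && $sid eq $lproc;
        push @sids, $sid;
    }
    return @sids;
}

my @sids = new_sids(1003, 1005, 1004, 1003, 1002);
print "@sids\n";
```

If the site lists sids oldest first instead, the logic would need to skip up to $lproc rather than stop at it.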
This variable needs to be read from the pid list file as well (I'm not sure which delimiter is best, tab or semicolon?). Instead of one number per line as now, the actual DB export will contain two numbers per line (pid and lproc).
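Reading such a file could look roughly like this, assuming a tab delimiter and one "pid, lproc" pair per line. The `read_pid_list` sub and the sample numbers are hypothetical, and the sketch reads from an in-memory string rather than the real export file so that it runs on its own.

```perl
use strict;
use warnings;

# Parse "pid<TAB>lproc" lines into a hash of pid => lproc.
# Blank lines are skipped; a line without a second field leaves lproc undef.
sub read_pid_list {
    my ($fh) = @_;
    my %lproc_for;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;
        my ($pid, $lproc) = split /\t/, $line;
        $lproc_for{$pid} = $lproc;
    }
    return %lproc_for;
}

# Sample data standing in for the DB export.
my $sample = "101\t5001\n102\t5040\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my %lproc_for = read_pid_list($fh);
close $fh;

print "$_ => $lproc_for{$_}\n" for sort keys %lproc_for;
```

Either delimiter works as long as it can never appear inside the values; for a semicolon-separated file only the `split` pattern would change.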
Q: Where do the sids end up?
Going to try it now and will be back with more comments. I really appreciate your help with this!