in reply to Re^8: collect data from web pages and insert into mysql
in thread collect data from web pages and insert into mysql

Adjust the way the get_sids() sub is called
my $lproc = 621557;
my @sids = get_sids($url, $pid, $lproc);
Change the sub
sub get_sids{
    my ($url, $pid, $lproc) = @_;

    my $page = 1;
    my $uri = URI->new($url);
    my ($i, @sids);
    while ($page){
        # build the uri
        $uri->query_form(page => $page, pid => $pid);
        my $uri_string = $uri->as_string;

        # get the content, check for success
        my $content = get($uri_string);
        die qq{LWP get failed for $uri_string\n} unless $content;

        # build the tree
        my $t = HTML::TreeBuilder->new_from_content($content)
            or die qq{new from content failed: $!\n};

        # get a list of all anchor tags
        my @anchors = $t->look_down(_tag => q{a})
            or die qq{no anchors found in $uri_string\n};

        # look at each anchor
        my $more = 1; # flag
        for my $anchor (@anchors){
            # get the href
            my $href = $anchor->attr(q{href});
            if ($href){
                # test for a sid in the query string
                my $href_uri = URI->new($href);
                my %q = $href_uri->query_form;
                my $sid = $q{sid};
                next unless $sid;

                # exit the while loop if it
                # is the last processed sid
                $more--, last if $sid == $lproc;

                # otherwise save it
                push @sids, $sid;
            }
        }
        last unless $more;

        # see if there is another page
        $page = get_next_page($t);

        # avoid accidental indefinite loops
        # hammering the server, adjust to suit
        die if $i++ > 5;
    }
    # send 'em back
    return @sids;
}
Have a look at the URI docs to see what $uri->query_form does. Very useful.
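
As a rough illustration of the two ways query_form gets used in the sub (setting the query when building the page URL, and reading it back when picking the sid out of an href), here is a small standalone sketch. The example.com URLs and the parameter values are made up for the demo; only the URI module is assumed to be installed.

```perl
use strict;
use warnings;
use URI;

# Hypothetical base URL, for illustration only
my $uri = URI->new('http://example.com/list');

# Setting: pass key/value pairs and URI builds the query string
$uri->query_form(page => 2, pid => 1234);
my $built = $uri->as_string;
print "$built\n";

# Getting: called with no arguments it returns the key/value pairs,
# which can be assigned straight to a hash
my %q = URI->new('http://example.com/view?sid=621557&x=1')->query_form;
print "sid is $q{sid}\n";
```

Assigning to a hash is fine here because each key appears once; if a key could repeat, keeping the result as a list would preserve every value.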

Update: corrected the sub

Re^10: collect data from web pages and insert into mysql
by SteinerKD (Acolyte) on Aug 03, 2010 at 15:13 UTC

    OK, I have looked it over and think I understand most of it fairly well now. I adapted it to do the whole list as well as import PID/Lproc for processing.

    There's a bug somewhere, though, making it abort if a PID has 0 SIDs. Instead of moving on to the next PID for processing it simply ends.

    Here's what I have so far (I added some print stuff so I can see it progressing):

      In the main loop
      die qq{no sids found\n} unless @sids;
      does what it says on the tin. :-)

      Perhaps replace that line with something like

      if (not @sids){
          print qq{no sids found in $cpid\n};
          next;
      }
      That way it will print a message and simply move on if none are found.
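
      To see the effect of next in isolation, here is a minimal sketch. fake_get_sids() and the pid/sid numbers are hypothetical stand-ins for the real get_sids() and data, so it runs without touching the server.

```perl
use strict;
use warnings;

# Hypothetical canned data: pid 102 has no sids
my %fake = (101 => [11, 12], 102 => [], 103 => [13]);

# Stand-in for the real get_sids(); returns the canned list for a pid
sub fake_get_sids {
    my ($cpid) = @_;
    return @{ $fake{$cpid} };
}

my @processed;
for my $cpid (101, 102, 103) {
    my @sids = fake_get_sids($cpid);
    if (not @sids) {
        print qq{no sids found in $cpid\n};
        next;    # skip to the next pid instead of dying
    }
    push @processed, $cpid;
    print qq{$cpid: @sids\n};
}
```

      With die in place of the if block, the run would stop at the pid with no sids and the later pids would never be reached.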

      On a side note you could declare

      my $url = q{http://csr.wwiionline.com......etc.};
      before the main loop as it is the same every time.

      Also,

      my $pidlist = <$settings>;
      # Get list of pids and process all
      while ( $pidlist = <$settings> ) {
      reads a line from the file and then throws it away, so the loop starts on the second line of the file. Remove the first declaration and change the while to
      while (my $pidlist = <$settings>){ #...
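
      A quick way to convince yourself of the difference is to run both patterns over the same input. This sketch uses an in-memory filehandle with made-up contents in place of the real settings file.

```perl
use strict;
use warnings;

# In-memory stand-in for the settings file (contents are made up)
my $data = "100 5\n200 6\n300 7\n";

# Original pattern: the read before the loop consumes line one
open my $fh1, '<', \$data or die "open failed: $!";
my $line = <$fh1>;
my @skipping;
while ($line = <$fh1>) {
    chomp $line;
    push @skipping, $line;
}
close $fh1;

# Suggested pattern: declare the variable in the while condition
open my $fh2, '<', \$data or die "open failed: $!";
my @complete;
while (my $l = <$fh2>) {
    chomp $l;
    push @complete, $l;
}
close $fh2;

print scalar(@skipping), " lines vs ", scalar(@complete), " lines\n";
```

      The first loop collects two lines and the second collects all three.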
      I reckon it's shaping up nicely. :-)