in reply to Re^7: collect data from web pages and insert into mysql
in thread collect data from web pages and insert into mysql

Ah! Now I get it, smart thinking! So that solves condition 1 then.

As for condition 2, I don't think fetching everything first and then sorting is the way to go, as people can rack up many hundreds of sorties in quite a short time, which means processing dozens of pages even if the lproc was on page 1.


Re^9: collect data from web pages and insert into mysql
by wfsp (Abbot) on Aug 02, 2010 at 15:25 UTC
    Adjust the way the get_sids() sub is called
    my $lproc = 621557;
    my @sids  = get_sids($url, $pid, $lproc);
    Change the sub
    sub get_sids{
        my ($url, $pid, $lproc) = @_;

        my $page = 1;
        my $uri  = URI->new($url);
        my $i    = 0;
        my @sids;

        while ($page){

            # build the uri
            $uri->query_form(page => $page, pid => $pid);
            my $uri_string = $uri->as_string;

            # get the content, check for success
            my $content = get $uri_string;
            die qq{LWP get failed: $!\n} unless $content;

            # build the tree
            my $t = HTML::TreeBuilder->new_from_content($content)
                or die qq{new from content failed: $!\n};

            # get a list of all anchor tags
            my @anchors = $t->look_down(_tag => q{a})
                or die qq{no anchors found: $!\n};

            # look at each anchor
            my $more = 1; # flag
            for my $anchor (@anchors){

                # get the href
                my $href = $anchor->attr(q{href});
                if ($href){

                    # test for a sid in the query fragment
                    my $uri = URI->new($href);
                    my %q   = $uri->query_form;
                    my $sid = $q{sid};
                    next unless $sid;

                    # exit the while loop if it
                    # is the last processed sid
                    $more--, last if $sid == $lproc;

                    # otherwise save it
                    push @sids, $sid;
                }
            }
            last unless $more;

            # see if there is another page
            $page = get_next_page($t);

            # avoid accidental indefinite loops
            # hammering the server, adjust to suit
            die if $i++ > 5;
        }

        # send 'em back
        return @sids;
    }
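    The sub above calls get_next_page(), which was posted earlier in the thread. Purely as a hypothetical sketch (not the real sub), it might do something like the following, assuming the pager has a "Next" link whose href carries a page parameter:

    sub get_next_page {
        my ($t) = @_;
        for my $anchor ($t->look_down(_tag => q{a})){
            # assumption: the paging control has a link whose text says "Next"
            next unless $anchor->as_text =~ /next/i;
            my $href = $anchor->attr(q{href}) or next;
            my %q    = URI->new($href)->query_form;
            # return the page number from that link; 0 ends the caller's loop
            return $q{page} if $q{page};
        }
        return 0;
    }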
    Have a look at the URI docs to see what $uri->query_form does. Very useful.
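    For instance (a small illustration, not code from the thread), query_form() works both ways: pass key/value pairs to build the query string, or call it in list context to read them back:

    use URI;

    my $uri = URI->new(q{http://example.com/feed});   # placeholder URL
    $uri->query_form(page => 2, pid => 12345);        # set the query string
    print $uri->as_string, qq{\n};   # http://example.com/feed?page=2&pid=12345

    my %q = $uri->query_form;        # read it back as key/value pairs
    print $q{pid}, qq{\n};           # 12345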

    Update: corrected the sub

      OK, have looked it over and I think I understand most of it fairly well now. I've adapted it to do the whole list as well as import PID/Lproc for processing.

      There's a bug somewhere, though, making it abort if a PID has 0 SIDs. Instead of moving on to the next PID for processing it simply ends.

      Here's what I have so far (I added some print stuff so I can see it progressing):

        In the main loop
        die qq{no sids found\n} unless @sids;
        does what it says on the tin. :-)

        Perhaps replace that line with something like

        if (not @sids){
            print qq{no sids found in $cpid\n};
            next;
        }
        That way it will print a message and simply move on if none are found.

        On a side note you could declare

        my $url = q{http://csr.wwiionline.com......etc.};
        before the main loop, as it is the same every time.

        Also,

        my $pidlist = <$settings>;

        # Get list of pids and process all
        while ( $pidlist = <$settings> ) {
        reads a line from the file and then throws it away, so the loop starts on the second line of the file. Remove the first declaration and change the while to
        while (my $pidlist = <$settings>){ #...
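        Putting those pieces together, the top of the main loop might look roughly like this. It is only a sketch, since I haven't seen your full script: the settings file name, the "pid,lproc" line layout and the URL are all assumptions.

        use strict;
        use warnings;
        use URI;
        use LWP::Simple;
        use HTML::TreeBuilder;

        # declared once, outside the loop, as it is the same every time
        my $url = q{http://csr.wwiionline.com/...};    # placeholder

        open my $settings, q{<}, q{settings.txt}       # hypothetical file name
            or die qq{can't open settings: $!\n};

        while (my $pidlist = <$settings>){
            chomp $pidlist;

            # assumed layout: one "pid,lproc" pair per line
            my ($cpid, $lproc) = split /,/, $pidlist;

            my @sids = get_sids($url, $cpid, $lproc);
            if (not @sids){
                print qq{no sids found in $cpid\n};
                next;
            }

            # ...process @sids for this pid...
        }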
        I reckon it's shaping up nicely. :-)