in reply to Re^2: collect data from web pages and insert into mysql
in thread collect data from web pages and insert into mysql

That looks as though you are off to a good start.

I have a couple of observations and some more questions.

my $pid = <PIDLIST>;
will read the first line from the file. To look at each line in turn, loop over the file handle with a while loop.
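For example, a minimal sketch of that idiom (the file name pidlist.txt is just assumed here):

open my $pidlist, '<', 'pidlist.txt' or die "Could not open pidlist.txt: $!";

# the while loop reads one line per iteration until end of file
while (my $pid = <$pidlist>) {
    chomp $pid;              # strip the trailing newline
    print "Got pid: $pid\n"; # do the real work here
}

close $pidlist or die $!;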

The docs for LWP::Simple show how to check if the get is successful. Always best to do that.
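Something along these lines (just a sketch; the URL is a placeholder):

use LWP::Simple qw(get);

my $url = 'http://example.com/some/page';   # placeholder URL

my $content = get($url);

# LWP::Simple's get() returns undef on failure, so test for that
die "get failed for $url\n" unless defined $content;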

First question, what does a line in the pidlist.txt file actually look like? Is there any processing to be done to extract any data you need?

Secondly, could you show a (cut-down) sample of what the downloaded content looks like? It may be that you can extract the data you need on the fly and write the whole lot out in one go rather than save each page to disk and parse it later.

Lastly, it looks as though the site needs a login and uses cookies. I'm curious to know how you managed to download any pages.

Looking good though!

This is some code illustrating the points above.

#! /usr/bin/perl

use strict;
use warnings;
use LWP::Simple qw(get);
use File::Slurp;

# pid = Persona ID, one of a player's 3 identities.
# sid = Sortie ID, identifier for a mission taken by the persona.
# We want to crawl all sortie list pages and collect new sids we
# haven't seen before and then move on to the next persona.

my $pbase   = q{http://csr.wwiionline.com/scripts/services/persona/sorties.jsp};
my $pcnt    = 1;
my $pidfile = q{c:/scr/pidlist.txt};

# Open list of pids
open my $pidlist, q{<}, $pidfile
    or die qq{Could not open $pidfile: $!};

# loop over list of pids one at a time
while (my $pid = <$pidlist>){
    print $pid;
    chomp $pid;

    # what does a line from the pidlist look like?
    # do we need to do any work on it?

    while (1) {
        my $url     = qq{$pbase?page=$pcnt&pid=$pid};
        my $content = get($url);
        die qq{get failed for $url: $!} unless $content;

        # parse $content
        # extract data we need
        # test if there is more work to do

        # Update page number and grab next.
        $pcnt += 1;
    }
}

# Close files
close $pidlist or die $!;

print qq{\nDone!\n};

Re^4: collect data from web pages and insert into mysql
by SteinerKD (Acolyte) on Jul 31, 2010 at 16:44 UTC

    No log/pass needed; you can log in, but it isn't needed just for watching people's stats.

    Currently the pidlist is just one pid per line followed by a linefeed, but as stated in the plan I will also need to import the lastproc sid, so I guess it will soon change to something like this (or however I can get the SQL query to save it):

    pidnumber, sidnumber linefeed

    A raw sortie list page (need to grab the sids)

    What we need here is the number after each "sid=", which will create a new list of pages to be processed.
    The last <TR></TR> block seen here and the ones that follow each contain one sid (well, actually the entire link for it). The layout might vary a bit though, as we might be looking at several padded pages.
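    In other words, something like this minimal regex sketch is what I'm after (just my guess at how, assuming the raw HTML is already in $content and the sids are plain digits):

    # grab every number that follows "sid=" anywhere in the page source
    my @sids = $content =~ /sid=(\d+)/g;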

    The actual data, though, is contained in the sortie details pages, where it's spread all over and which also contain random-length lists (see picture in another answer).

      One step at a time.

      This will get a list of sid numbers from all the pages available.

      #! /usr/bin/perl

      use strict;
      use warnings;
      use Data::Dumper;
      use HTML::TreeBuilder;
      use LWP::Simple;
      use URI;

      my $url = q{http://csr.wwiionline.com/scripts/services/persona/sorties.jsp};
      my $pid = 173384;

      my @sids = get_sids($url, $pid);
      die qq{no sids found\n} unless @sids;
      print Dumper \@sids;

      sub get_sids{
          my ($url, $pid) = @_;

          my $page = 1;
          my $uri  = URI->new($url);
          my ($i, @sids);

          while ($page){

              # build the uri
              $uri->query_form(page => $page, pid => $pid);
              my $uri_string = $uri->as_string;

              # get the content, check for success
              my $content = get $uri->as_string;
              die qq{LWP get failed: $!\n} unless $content;

              # build the tree
              my $t = HTML::TreeBuilder->new_from_content($content)
                  or die qq{new from content failed: $!\n};

              # get a list of all anchor tags
              my @anchors = $t->look_down(_tag => q{a})
                  or die qq{no anchors found: $!\n};

              # look at each anchor
              for my $anchor (@anchors){

                  # get the href
                  my $href = $anchor->attr(q{href});
                  if ($href){

                      # test for a sid in the query fragment
                      my $uri = URI->new($href);
                      my %q   = $uri->query_form;

                      # save it if it is there
                      push @sids, $q{sid} if exists $q{sid};
                  }
              }

              # see if there is another page
              $page = get_next_page($t);

              # avoid accidental indefinite loops hammering the server,
              # adjust to suit
              die if $i++ > 5;
          }

          # send 'em back
          return @sids;
      }

      sub get_next_page{
          my ($t) = @_;

          # we want table 9
          my @tables = $t->look_down(_tag => q{table});
          my $table  = $tables[8];

          # first row
          my @trs = $table->look_down(_tag => q{tr});
          my $tr  = $trs[0];

          # second column
          my @tds = $tr->look_down(_tag => q{td});
          my $td  = $tds[1];

          # get any text
          my $page_number_txt = $td->as_text;

          # and test if it is a page number
          # will be undef otherwise
          my ($page) = $page_number_txt =~ /PAGE (\d) >/;

          return $page;
      }
      Some points to note:

      It uses HTML::TreeBuilder to parse the HTML. I find it easier than using regexes. There are many parsers available and monks have their preferences; I've settled on this one and have got used to it.

      It also uses URI to construct/parse URIs. Could be overkill in this case but if someone else has done all the work I'm happy to take advantage. :-)

      And all those 'q's? They're alternatives to single and double quote marks (there are some others too). You don't have to use them, again it's a preference. I started using them for the very scientific reason that my code highlighter is particularly bad at handling single and double quotes. :-)
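      For instance (variable names made up, just to show the equivalence):

      my $who   = q{World};           # same as 'World'  -- no interpolation
      my $greet = qq{Hello, $who\n};  # same as "Hello, $who\n" -- interpolates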

      If you download it, first see if it compiles. Then see if it runs. If the output is not as expected make a note of what Perl says about the matter and post it here. If all goes fine let us know the next step.

      Fingers crossed.

        I haven't tried it yet but saw some things I'd like to comment on (if I've understood the code correctly; this new one was a bit above my level).

        sub get_next_page{
            my ($t) = @_;

            # we want table 9
            my @tables = $t->look_down(_tag => q{table});
            my $table  = $tables[8];

            # first row
            my @trs = $table->look_down(_tag => q{tr});
            my $tr  = $trs[0];

            # second column
            my @tds = $tr->look_down(_tag => q{td});
            my $td  = $tds[1];

            # get any text
            my $page_number_txt = $td->as_text;

            # and test if it is a page number
            # will be undef otherwise
            my ($page) = $page_number_txt =~ /PAGE (\d) >/;

            return $page;
        }

        If I understand correctly, you load the next page, go to a particular spot in the source code, and look at the page number? This will fail for my scenario, as the server keeps serving whatever page number you request even if it contains no data, so no matter what page number you enter it will give you a valid answer. I just used last if $content =~ /No sorties/; which seems to do the trick.

        if ($href){

            # test for a sid in the query fragment
            my $uri = URI->new($href);
            my %q   = $uri->query_form;

            # save it if it is there
            push @sids, $q{sid} if exists $q{sid};
        }

        Guess this would be the perfect place for the second loop-exit condition: we want to stop processing sids when we find the last one previously processed ($lproc).
        This variable needs to be read from the pid list file as well (not sure what delimiter is best, tab or semicolon?); instead of one number per line as now, the actual DB export will contain two numbers per line (pid and lproc).
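        For example, a minimal sketch assuming a tab-delimited line (the delimiter is still an open choice):

        # each pidlist line assumed to look like: 173384<TAB>536192
        while (my $line = <$pidlist>) {
            chomp $line;
            my ($pid, $lproc) = split /\t/, $line;
            # ... fetch pages for $pid, stop once sid $lproc turns up ...
        }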

        Q: Where do the sids end up?

        Going to try it now and will be back with more comments. I really appreciate your help with this!

Re^4: collect data from web pages and insert into mysql
by SteinerKD (Acolyte) on Jul 31, 2010 at 17:12 UTC
    print qq{\nDone!\n};

    What's with all the q's in the code? I copied this and it doesn't seem to run.

      See perlop.

      A great way to find out why Perl does not "seem to run" your code is to look at the error messages that Perl produces. They are output not by Perl to spite you, but for your information and usually contain enough information to find the corresponding cause. For example, they usually contain a line number.

      If you feel overwhelmed by the error message that Perl gives you, instead of trying to understand and resolve the error yourself, you could tell others the error message you get and ask them for advice. Maybe now is the right time to try that approach?

        The problem was it compiled and ran, but never exited and never did what it was supposed to do. There were no error messages to report and nothing to hint at where it went wrong.

        I'm as new to this community as I am to coding Perl (zero skill, in other words :X), so I apologize if I break any rules, customs or ethics that I am unaware of. I'm grateful for the help given and hope you'll be patient with me if I move too slowly.

Re^4: collect data from web pages and insert into mysql
by SteinerKD (Acolyte) on Aug 02, 2010 at 14:33 UTC

    We've passed this now with your latest post, but I'd like to step back to see what I did, for learning's sake.

    When I copied this it just hung and only loaded one page, I think; it never gave any error messages or exited. I adapted it a bit by using some of your new bits but basically keeping my own loop. This seemed to work pretty well, as it would churn through all pids in the list and get pages until it hit the last one and then move on (after a bug fix).

    #! /usr/bin/perl

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use File::Slurp;

    my $pbase   = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';
    my $pidfile = 'c:/scr/pidlist.txt';
    my $lproc   = '536192';

    # Open list of pids
    open my $pidlist, "<", $pidfile or die "Could not open $pidfile: $!\n";

    # Loop over list of pids one at a time
    while (my $pid = <$pidlist>) {
        chomp $pid;
        print "Current persona processed: $pid\n";

        my $pcnt = 1;

        while (1) {
            my $url     = "$pbase?page=$pcnt&pid=$pid";
            my $content = get($url);
            die "\nGet failed for $url: $!\n" unless $content;
            # my $page = get "$pbase?page=$pcnt&pid=$pid";

            # Exit loop if page is empty
            last if $content =~ /No sorties/;

            # Store grabbed webpage into the file
            append_file("c:/scr/$pid.txt", $content);

            # Exit loop if page contained last processed.
            last if $content =~ /"sid=$lproc"/;

            # Update page number and run loop again.
            print "Page $pcnt\n";
            $pcnt++;
        }
    }

    # Close files
    close $pidlist or die $!;

    print "\nDone!\n";

    The serious bug it had previously was that the page count was defined early in the script, outside the loop, which meant that pages 1-x were processed for pid 1, then pages x onwards for successive users, NOT good! Moving the variable definition inside the loop fixed it.

    When the "No sorties" string was encountered it exited loop properly, but the second condition about finding the $lproc doesn't work, it never triggers even if I set a number I know it will find a few pages down in one pid.

    Is there a particular reason you split the content retrieval into two lines from

    my $page = get "$pbase?page=$pcnt&pid=$pid" ;

    to

    my $url = "$pbase?page=$pcnt&pid=$pid"; my $content = get($url);

    From what I can tell they do exactly the same thing, just using an extra variable.

    Thanks again!