in reply to Re: collect data from web pages and insert into mysql
in thread collect data from web pages and insert into mysql

This is what I have so far. It kinda works, but it lacks finesse and is pretty seriously flawed. Then again, I'm only one day into my Perl adventure, so I think I've done OK so far.

use strict;
use warnings;
use LWP::Simple qw(get);
use File::Slurp;

# pid = Persona ID, one of a player's 3 identities.
# sid = Sortie ID, identifier for a mission taken by the persona.
# We want to crawl all sortie list pages, collect new sid's we
# haven't seen before, and then move on to the next persona.

my $pbase   = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';
my $pcnt    = 1;
my $pidfile = 'c:/scr/pidlist.txt';

# Open the list of pid's and set the first one as the current pid.
open PIDLIST, "<", $pidfile or die "Could not open $pidfile: $!";
my $pid = <PIDLIST>;
chomp $pid;
print $pid;

# Grab and store the sortie list pages for the persona.
while (1) {
    my $page = get "$pbase?page=$pcnt&pid=$pid";

    # Store the grabbed web page in the file.
    append_file( "c:/scr/$pid.txt", $page );

    # Update the page number and grab the next page.
    $pcnt += 1;
}

# Close files
close PIDLIST or die $!;
print "\nDone!\n";

The flaw in this is that the server will quite happily keep giving you empty sortie list pages, so just updating the page count and hoping for a failure to exit doesn't work (it results in a huge file).
I want the loop to exit under either of two conditions: either the string "No more sorties" is found on the page (end of the list), OR a sid string equal to the stored variable for the last one processed is reached. (sids are six-digit strings that I need to collect from the downloaded pages.)

This code uses LWP, but the suggestion was to use Mechanize, so I need to rewrite it to use that instead.
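I haven't actually tried Mechanize yet, but from skimming the WWW::Mechanize docs I think the fetch part would end up looking roughly like this (completely untested on my side, and the pid/page values are just placeholders):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );   # dies by itself on HTTP errors

my $pbase = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';
my ($pid, $pcnt) = (173384, 1);                     # placeholder values

$mech->get("$pbase?page=$pcnt&pid=$pid");
my $page = $mech->content;                          # the HTML, like LWP::Simple's get() returned
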
I also need to redo the load-pid bit so it actually works its way through the list of pids; eventually it will also have to fetch two variables in pairs (the last processed sid in addition to the pid).
I tried using Slurp to open and read the pidlist file, but that didn't work out as planned.
For some reason $pid isn't printed out as expected any more.

Once that's achieved comes the tricky part: collecting the actual sortie pages and extracting the data I need from them.

Any suggestions on good coding practices and habits to pick up are appreciated; I might as well learn to do it right from the start.

Re^3: collect data from web pages and insert into mysql
by wfsp (Abbot) on Jul 31, 2010 at 15:01 UTC
    That looks as though you are off to a good start.

    I have a couple of observations and some more questions.

    my $pid = <PIDLIST>;
    will read only the first line from the file. To look at each line in turn, loop over the file handle with a while loop.

    The docs for LWP::Simple show how to check if the get is successful. Always best to do that.

    First question, what does a line in the pidlist.txt file actually look like? Is there any processing to be done to extract any data you need?

    Secondly, could you show a (cut down) sample of what the downloaded content looks like? It may be that you can extract the data you need on the fly and write the whole lot out in one go, rather than saving each page to disk and parsing it later.

    Lastly, it looks as though the site needs a login and uses cookies. I'm curious to know how you managed to download any pages.

    Looking good though!

    This is some code illustrating the points above.

    #! /usr/bin/perl
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use File::Slurp;

    # pid = Persona ID, one of a player's 3 identities.
    # sid = Sortie ID, identifier for a mission taken by the persona.
    # We want to crawl all sortie list pages, collect new sid's we
    # haven't seen before, and then move on to the next persona.

    my $pbase   = q{http://csr.wwiionline.com/scripts/services/persona/sorties.jsp};
    my $pcnt    = 1;
    my $pidfile = q{c:/scr/pidlist.txt};

    # Open the list of pid's
    open my $pidlist, q{<}, $pidfile or die qq{Could not open $pidfile: $!};

    # loop over the list of pids one at a time
    while (my $pid = <$pidlist>){
        print $pid;
        chomp $pid;

        # what does a line from the pidlist look like?
        # do we need to do any work on it?

        while (1) {
            my $url     = qq{$pbase?page=$pcnt&pid=$pid};
            my $content = get($url);
            die qq{get failed for $url: $!} unless $content;

            # parse $content
            # extract the data we need
            # test if there is more work to do

            # Update the page number and grab the next page.
            $pcnt += 1;
        }
    }

    # Close files
    close $pidlist or die $!;
    print qq{\nDone!\n};

      No login/password needed; you can log on, but it isn't needed for just watching people's stats.

      Currently the pidlist is just one pid per line followed by a linefeed, but as stated in the plan I will also need to import the last processed sid, so I guess it will soon change to something like the following (or however I can get the SQL query to save it):

      pidnumber, sidnumber linefeed
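      If it does end up as a comma-separated pair like that, I'm guessing the reading side would be something along these lines (just a sketch on my part, using the $pidlist handle from your version):

      while (my $line = <$pidlist>) {
          chomp $line;

          # split "pidnumber, sidnumber" into its two parts,
          # allowing optional whitespace after the comma
          my ($pid, $lastsid) = split /,\s*/, $line;

          print "pid=$pid  last processed sid=$lastsid\n";
      }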

      A raw sortie list page (I need to grab the sids)

      What we need here is the number after each "sid=", which will create a new list of pages to be processed.
      The last <TR></TR> block seen here and the ones that follow each contain one sid (well, actually the entire link for it). The layout might vary a bit though, as we might be looking at several padded pages.
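      As a crude first pass I imagine a regex along these lines would pull the numbers out of a saved page, though I have no idea yet how robust that is:

      # $content holds the HTML of one saved sortie list page;
      # grab every number that follows "sid=" in the links
      my @sids = $content =~ /sid=(\d+)/g;   # the sids are six-digit numbers at the moment
      print "$_\n" for @sids;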

      The actual data, though, is contained in the sortie details pages, where it's spread all over and which also contain random-length lists (see picture in another answer).

        One step at a time.

        This will get a list of sid numbers from all the pages available.

        #! /usr/bin/perl
        use strict;
        use warnings;

        use Data::Dumper;
        use HTML::TreeBuilder;
        use LWP::Simple;
        use URI;

        my $url = q{http://csr.wwiionline.com/scripts/services/persona/sorties.jsp};
        my $pid = 173384;

        my @sids = get_sids($url, $pid);
        die qq{no sids found\n} unless @sids;

        print Dumper \@sids;

        sub get_sids{
          my ($url, $pid) = @_;
          my $page = 1;
          my $uri  = URI->new($url);
          my $i    = 0;
          my @sids;

          while ($page){
            # build the uri
            $uri->query_form(page => $page, pid => $pid);
            my $uri_string = $uri->as_string;

            # get the content, check for success
            my $content = get $uri_string;
            die qq{LWP get failed: $!\n} unless $content;

            # build the tree
            my $t = HTML::TreeBuilder->new_from_content($content)
              or die qq{new from content failed: $!\n};

            # get a list of all anchor tags
            my @anchors = $t->look_down(_tag => q{a})
              or die qq{no anchors found: $!\n};

            # look at each anchor
            for my $anchor (@anchors){
              # get the href
              my $href = $anchor->attr(q{href});
              if ($href){
                # test for a sid in the query fragment
                my $uri = URI->new($href);
                my %q   = $uri->query_form;
                # save it if it is there
                push @sids, $q{sid} if exists $q{sid};
              }
            }

            # see if there is another page
            $page = get_next_page($t);

            # avoid accidental indefinite loops
            # hammering the server, adjust to suit
            die if $i++ > 5;
          }

          # send 'em back
          return @sids;
        }

        sub get_next_page{
          my ($t) = @_;

          # we want table 9
          my @tables = $t->look_down(_tag => q{table});
          my $table  = $tables[8];

          # first row
          my @trs = $table->look_down(_tag => q{tr});
          my $tr  = $trs[0];

          # second column
          my @tds = $tr->look_down(_tag => q{td});
          my $td  = $tds[1];

          # get any text
          my $page_number_txt = $td->as_text;

          # and test if it is a page number
          # will be undef otherwise
          my ($page) = $page_number_txt =~ /PAGE (\d) >/;

          return $page;
        }
        Some points to note:

        It uses HTML::TreeBuilder to parse the HTML. I find it easier than using regexes. There are many parsers available and monks have their preferences; I've settled on this one and have got used to it.

        It also uses URI to construct/parse URIs. Could be overkill in this case but if someone else has done all the work I'm happy to take advantage. :-)
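        For example, this is the round trip the code above does, shown on its own:

        use URI;

        my $uri = URI->new(q{http://csr.wwiionline.com/scripts/services/persona/sorties.jsp});

        # build the query string from key/value pairs
        $uri->query_form(page => 1, pid => 173384);
        print $uri->as_string, "\n";
        # http://csr.wwiionline.com/scripts/services/persona/sorties.jsp?page=1&pid=173384

        # and go the other way: pull the pairs back out of a URI
        my %q = $uri->query_form;
        print $q{pid}, "\n";   # 173384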

        And all those 'q's? They're alternatives to single and double quote marks (there are some others too). You don't have to use them, again it's a preference. I started using them for the very scientific reason that my code highlighter is particularly bad at handling single and double quotes. :-)
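        A quick illustration, since they can look odd at first; each pair below produces exactly the same string:

        my $plain1 = 'no interpolation here';     # ordinary single quotes
        my $plain2 = q{no interpolation here};    # same string, q{} delimiters

        my $name    = 'PerlMonks';
        my $interp1 = "Hello, $name\n";           # ordinary double quotes
        my $interp2 = qq{Hello, $name\n};         # same string, qq{} delimiters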

        If you download it, first see if it compiles. Then see if it runs. If the output is not as expected make a note of what Perl says about the matter and post it here. If all goes fine let us know the next step.

        Fingers crossed.

      print qq{\nDone!\n};

      What's with all the q's in the code? I copied this and it doesn't seem to run.

        See perlop.

        A great way to find out why Perl does not "seem to run" your code is to look at the error messages that Perl produces. They are not output by Perl to spite you, but for your information, and they usually contain enough information to find the cause. For example, they usually contain a line number.

        If you feel overwhelmed by the error message that Perl gives you, instead of trying to understand and resolve the error yourself, you could tell others the error message you get and ask them for advice. Maybe now is the right time to try that approach?

      We've moved past this now with your latest post, but I'd like to step back and look at what I did, for learning's sake.

      When I copied this it just hung and I think it only loaded one page; it never gave any error messages or exited. I adapted it a bit by using some of your new bits but basically keeping my own loop. This seemed to work pretty well, as it would churn through all the pids in the list and get pages until it hit the last one, then move on (after a bug fix).

      #! /usr/bin/perl
      use strict;
      use warnings;
      use LWP::Simple qw(get);
      use File::Slurp;

      my $pbase   = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';
      my $pidfile = 'c:/scr/pidlist.txt';
      my $lproc   = '536192';

      # Open the list of pid's
      open my $pidlist, "<", $pidfile or die "Could not open $pidfile: $!\n";

      # Loop over the list of pids one at a time
      while (my $pid = <$pidlist>){
          chomp $pid;
          print "Current persona processed: $pid\n";
          my $pcnt = 1;

          while (1) {
              my $url     = "$pbase?page=$pcnt&pid=$pid";
              my $content = get($url);
              die "\nGet failed for $url: $!\n" unless $content;
              # my $page = get "$pbase?page=$pcnt&pid=$pid";

              # Exit the loop if the page is empty
              last if $content =~ /No sorties/;

              # Store the grabbed web page in the file
              append_file( "c:/scr/$pid.txt", $content );

              # Exit the loop if the page contained the last processed sid.
              last if $content =~ /"sid=$lproc"/;

              # Update the page number and run the loop again.
              print "Page $pcnt\n";
              $pcnt++;
          }
      }

      # Close files
      close $pidlist or die $!;
      print "\nDone!\n";

      The serious bug it had previously was that the page count was defined early in the script, outside the loop, which meant that pages 1 to x were processed for pid 1, and then pages from x onwards were processed for the successive users, NOT good! Moving the variable definition inside the loop fixed it.

      When the "No sorties" string was encountered it exited loop properly, but the second condition about finding the $lproc doesn't work, it never triggers even if I set a number I know it will find a few pages down in one pid.

      Is there a particular reason you split the content retrieval into two lines, from

      my $page = get "$pbase?page=$pcnt&pid=$pid";

      to

      my $url = "$pbase?page=$pcnt&pid=$pid";
      my $content = get($url);

      From what I can tell they do exactly the same thing, just with an extra variable.

      Thanks again!

Re^3: collect data from web pages and insert into mysql
by SteinerKD (Acolyte) on Jul 31, 2010 at 14:17 UTC

    Woohoo, I think I've sorted the loop!

    while (1) {
        my $page = get "$pbase?page=$pcnt&pid=$pid";
        last if $page =~ /No sorties/;

        # Store the grabbed web page in the file
        append_file( "c:/scr/$pid.txt", $page );

        last if $page =~ /"sid=$lproc"/;

        # Update the page number and grab the next page.
        $pcnt++;
    }

    I'm sure the whole thing can be made prettier and more efficient, but now it seems to work as intended, grabbing the first set of data I need.

    Now I need to sort out the bit where it does the same for all the pids in the pidlist, preferably at the same time adding code so it extracts both the pid and the lproc data from the settings file.