in reply to Re: Re: Which loop should I use?
in thread Which loop should I use?
There's nothing wrong with your loop, but the HTML on the first page is different from the rest, so you're just not parsing it correctly. The first page doesn't have a - between the < and 'Next Chatter', which throws off your regex.
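One way to cope with that difference is to make the dash optional in the pattern. This is only a minimal sketch: I'm assuming the link text is literally `<- Next Chatter` on later pages and `< Next Chatter` on the first one (the real markup may use `&lt;` or extra tags), and `has_next_link` is a made-up helper name, not something from the original script.

```perl
use strict;
use warnings;

# Assumed link texts, based only on the description above:
# later pages have a dash after the '<', the first page does not.
# The '-?' makes the dash optional so both forms match.
my $next_re = qr/<\s*-?\s*Next\s+Chatter/;

# Hypothetical helper: true (1) if the text contains the link, else 0.
sub has_next_link {
    my ($html) = @_;
    return $html =~ $next_re ? 1 : 0;
}

for my $text ('<- Next Chatter', '< Next Chatter') {
    printf "%-18s %s\n", $text, has_next_link($text) ? 'matches' : 'no match';
}
```

If the real pages differ in some other way, the same idea applies: find the one character that varies and make it optional rather than writing a second regex.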
Well, I got interested in this problem, so here's a replacement :). I dropped HTML::Tree because I thought it would be nice to have the full text of the messages, and it was much simpler to grab that straight from the HTML. It should be trivial to run the message text back through HTML::Tree to strip the HTML. I had originally tried tackling this using the parse tree, but the code was twice as long and uglier, not to mention it didn't work :(. I created a get function that reads pages from local disk so I didn't have to hit the website whenever I wanted to test.
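If you want the same local-disk trick but with a fallback to the real site, something along these lines might do. It's a sketch under assumptions: `cached_get` and the `$cache_dir` argument are names I made up, the files are named after the page number like in my script, and the network branch relies on `LWP::Simple::get`, which is only loaded when there is no saved copy.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return a page from a local cache directory when a
# saved copy exists, and only fetch from the network otherwise.
sub cached_get {
    my ($uri, $cache_dir) = @_;

    # Same convention as the script: the trailing digits of the URI
    # are the page number, which doubles as the file name.
    my ($number) = $uri =~ /(\d+)$/
        or die "No page number at the end of $uri";
    my $file = File::Spec->catfile($cache_dir, $number);

    if (-e $file) {
        open my $in, '<', $file or die "Couldn't open $file: $!";
        local $/;    # slurp mode: read the whole file at once
        my $html = <$in>;
        close $in or die "Couldn't close $file: $!";
        return $html;
    }

    # Loaded lazily so offline test runs don't need LWP installed.
    require LWP::Simple;
    my $html = LWP::Simple::get($uri);
    die "Couldn't fetch $uri" unless defined $html;

    # Save a copy so the next run stays off the network.
    open my $out, '>', $file or die "Couldn't write $file: $!";
    print {$out} $html;
    close $out or die "Couldn't close $file: $!";
    return $html;
}
```

You'd call it as `cached_get($uri, 'pages')` with whatever directory holds your saved pages; the directory has to exist already.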
And on to the code:
#!/usr/bin/perl
use warnings;
use strict;

my $start = 1;
my $end = 10;

for my $cnt ($start..$end) {
    print "<b>Current page count: $cnt</b><p>\n";
    my $uri = "http://www.allpoetry.com/chat/page=$cnt";
    my $html = get($uri);

    # retrieve the text and split into lines
    my @lines = split /[\r\n]+/, $html;

    # Now get into trouble for parsing HTML by hand
    # This skips through until the first chat message hopefully.
    while (@lines) {
        if ($lines[0] =~ m/^\<a href="javascript:t\('/) {
            last;
        } else {
            shift @lines;
        }
    }

    my @messages;
    while (@lines) {
        my $line = shift @lines;

        # get out after parsing all the messages from history
        # so we don't capture the current chatbox.
        last if $line =~ /^<\/font>/;

        # We use next because actions aren't grabbed properly
        # To handle them this needs to look for the line starting
        # with a <i> and no second <a href='...'>
        # There may be other messages this doesn't handle.
        next unless $line =~ s/^<a\shref="javascript:t\('
                                ([\d\w\s_]+)
                                '\)">\1<\/a>
                                <a\shref='\/poets\/\1'>:<\/a>//x;
        my $user = $1;

        next unless $line =~ s/^(.*?)
                                \((\d+\s+
                                (?:days|hours?|minutes|seconds)
                                \s+ago)\)
                                \s+(?:<br>|<p>)//x;
        push @messages, {user => $user, content => $1, delay => $2};
    }

    foreach (@messages) {
        print sprintf("%15s:\%s (\%s)<br>\n",
                      $_->{user}, $_->{content}, $_->{delay});
    }
}
exit(0);

sub get {
    my $uri = shift;
    $uri =~ /(\d+)$/;
    my $number = $1;
    open my $html, "<", $number or die "Couldn't open $number: $!";
    local $/;
    my $ret = <$html>;
    close $html or die "Couldn't close $number: $!";
    return $ret;
}
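To poke at the two substitutions without any pages on disk, you can wrap them in a function and feed it a line by hand. This is just a sketch: `parse_message` is a hypothetical wrapper, and the sample line is made up to fit the patterns, not copied from the site.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the two substitutions from the script.
# Returns a hash ref, or nothing if the line isn't a normal chat message.
sub parse_message {
    my ($line) = @_;

    # Strip the leading user links; \1 forces the same name in all three spots.
    return unless $line =~ s/^<a\shref="javascript:t\('
                              ([\d\w\s_]+)
                              '\)">\1<\/a>
                              <a\shref='\/poets\/\1'>:<\/a>//x;
    my $user = $1;

    # What's left is the message text followed by "(N units ago)".
    return unless $line =~ s/^(.*?)
                              \((\d+\s+
                              (?:days|hours?|minutes|seconds)
                              \s+ago)\)
                              \s+(?:<br>|<p>)//x;
    return { user => $user, content => $1, delay => $2 };
}

# Made-up sample line, shaped to fit the patterns above.
my $sample = q{<a href="javascript:t('bob')">bob</a>}
           . q{<a href='/poets/bob'>:</a> hello there (5 minutes ago) <br>};

my $msg = parse_message($sample);
printf "%s said '%s' %s\n", $msg->{user}, $msg->{content}, $msg->{delay}
    if $msg;
```

Note that the captured content keeps the spaces around the message text, since the pattern doesn't trim them. Lines that fail either substitution (actions, for instance) come back undefined, which is the same case the loop skips with next.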
Update: Everything that isn't struck out :)
Update2: Added a few linebreaks so one line of code wouldn't wrap