
~~There's nothing wrong with your loop, but the HTML on the first page is different from the rest; you're just not parsing it correctly.~~ The first page doesn't have a - between the < and 'Next Chatter', which throws off your regex.
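One way to cope with that difference is to make the dash optional in the pattern. This is only a minimal sketch: the exact link markup isn't shown in this thread, so the sample strings and the name $next_re are assumptions for illustration.

```perl
use strict;
use warnings;

# Assumed link text: the real page markup is not shown in this thread.
my @samples = (
    "< Next Chatter",     # first page: no dash after the <
    "<- Next Chatter",    # later pages: dash after the <
);

# -? makes the dash optional, so both forms match.
my $next_re = qr/<-?\s*Next Chatter/;

for my $text (@samples) {
    print(($text =~ $next_re ? "match" : "no match"), ": $text\n");
}
```

If the pages encode the < as an entity, the pattern would need to start with that entity instead.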

Well, I got interested in this problem, so here's a replacement :). I dropped HTML::Tree because I wanted the full text of each message, and it was much simpler to grab that straight from the HTML. It should be trivial to run the message text back through HTML::Tree to strip the markup. I originally tried tackling this with the parse tree, but the code was twice as long and uglier, not to mention it didn't work :(. I also wrote a get function that reads the pages from local disk, so I didn't have to hit the website every time I wanted to test.

And on to the code:

#!/usr/bin/perl
use warnings;
use strict;

my $start = 1;
my $end = 10;

for my $cnt ($start..$end) {
    print "<b>Current page count: $cnt</b><p>\n";
    my $uri = "http://www.allpoetry.com/chat/page=$cnt";

    my $html = get($uri);
    
    # retrieve the text and split into lines
    my @lines = split /[\r\n]+/, $html;
    
    # Now get into trouble for parsing HTML by hand
    # This skips through until the first chat message hopefully.
    while (@lines) {
        if ($lines[0] =~ m/^\<a href="javascript:t\('/)
        {
            last;
        }
        else
        {
            shift @lines;
        }
    }
    
    my @messages;
    while (@lines)
    {
        my $line = shift @lines;
        
        # get out after parsing all the messages from history
        # so we don't capture the current chatbox.
        last if $line =~ /^<\/font>/;
        
        # We use next because actions aren't grabbed properly.
        # To handle them this needs to look for the line starting
        # with a <i> and no second <a href='...'>
        # There may be other messages this doesn't handle.
        # Note: \w already covers digits and underscore.
        next unless $line =~ s/^<a\shref="javascript:t\('
                                ([\w\s]+)
                                '\)">\1<\/a>
                                <a\shref='\/poets\/\1'>:<\/a>//x;
        my $user = $1;
        next unless $line =~ s/^(.*?)
                                \((\d+\s+
                                (?:days?|hours?|minutes?|seconds?)
                                \s+ago)\)
                                \s+(?:<br>|<p>)//x;
                                
        push @messages, {user => $user,
                        content => $1,
                        delay => $2};
            
    }
    
    foreach (@messages)
    {
        printf "%15s:%s (%s)<br>\n",
             $_->{user},
             $_->{content},
             $_->{delay};
    }
}

exit(0);

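# Test stand-in for a real fetch: reads the page from a local file
# named after the page number at the end of the URI.
sub get {
    my $uri = shift;
    my ($number) = $uri =~ /(\d+)$/
        or die "No page number at the end of $uri";
    open my $html, "<", $number or die "Couldn't open $number: $!";
    local $/;    # slurp the whole file at once
    my $ret = <$html>;
    close $html or die "Couldn't close $number: $!";
    return $ret;
}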

Update: Everything that isn't struck out :)

Update2: Added a few linebreaks so one line of code wouldn't wrap
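The get function above assumes each page has already been saved to disk under its page number. As a rough sketch of how the two modes could be combined, the version below fetches a page only when the local copy is missing. The name cached_get and the optional $dir argument are made up for illustration, and HTTP::Tiny (core since Perl 5.14) is just one possible fetcher, not what the original script used.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: return the page from a local file named after the
# trailing page number, fetching and saving it first if it is missing.
sub cached_get {
    my ($uri, $dir) = @_;
    $dir = '.' unless defined $dir;
    my ($number) = $uri =~ /(\d+)$/
        or die "No page number at the end of $uri";
    my $file = "$dir/$number";

    unless (-e $file) {
        # Only hit the website when there is no saved copy.
        my $res = HTTP::Tiny->new->get($uri);
        die "Couldn't fetch $uri: $res->{status} $res->{reason}"
            unless $res->{success};
        open my $out, '>', $file or die "Couldn't write $file: $!";
        print {$out} $res->{content};
        close $out or die "Couldn't close $file: $!";
    }

    open my $in, '<', $file or die "Couldn't open $file: $!";
    local $/;    # slurp the whole file at once
    my $html = <$in>;
    close $in or die "Couldn't close $file: $!";
    return $html;
}
```

Swapping get($uri) for cached_get($uri) in the loop would then touch the site at most once per page.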

In reply to Re: Re: Re: Which loop should I use? by tedrek
in thread Which loop should I use? by coldfingertips
