sulfericacid has asked for the wisdom of the Perl Monks concerning the following question:

I want to parse a page for all poem URLs, i.e. this page.
Fading Away by MornieAtlantie Posted 10 hours ago. Last seen 16 minutes ago. Categories: Spiritual.
Fading Away would be a poem link. I want to write a script to scan this page for all of these links, go to the poems and eventually parse the poem page for specific details. So I want to find all the links, go to all the links and then parse every single one of them.

I had a parsing program a while ago, but everyone kept yelling at me to try this or that module, and asking for specific help when everyone suggested different things was very discouraging. What would be the easiest module to work with? If a person had, let's say, 100 poems, do you think the load time would just stop the script from finishing? Any other ideas or suggestions?

Thanks everyone. Remember I'm a moment for this rose it has to die, Now that you've set me free, I'll cease to wonder why,

"Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

sulfericacid

Replies are listed 'Best First'.
Re: Where to start?
by tachyon (Chancellor) on Sep 06, 2003 at 06:23 UTC

    First you need to get the page, so use LWP::UserAgent or LWP::Simple.
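A minimal sketch of the fetch step with LWP::UserAgent. The sub name fetch_page, the agent string and the example URL are placeholders I made up for illustration; only the module itself comes from the advice above.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Fetch a URL and return its decoded body, or die with the HTTP status line.
# fetch_page is a hypothetical helper name, not part of LWP.
sub fetch_page {
    my ($url) = @_;
    my $ua = LWP::UserAgent->new(
        timeout => 30,
        agent   => 'poem-fetcher/0.1',    # made-up agent string
    );
    my $response = $ua->get($url);
    die "GET $url failed: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Example call (placeholder URL, not the real site):
# my $content = fetch_page('http://www.example.com/poet/somebody');
```

LWP::Simple's get() would be shorter still, but it only returns undef on failure, so you lose the status line.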

    Once you have the content you want to extract the links. But there are a lot of links you don't want. Fortunately the site author makes it easy:

    <!-- Main Area -->
    <h1>The links you want are between the Main Area Comment and this comment</h1>
    <!-- other boxes can be loaded by the subroutines? bad at end? hmmph -->

    I would use a regex to extract this chunk of the page.

    my ($chunk) = $content =~ m#<!-- Main Area -->(.*?)<!-- other boxes can be#s;

    These pages appear to be autogenerated out of a DB. The great thing about that is consistency. For example, these regexes do what you want and SHOULD be reliable. YMMV.

    my ($poet) = $chunk =~ m!/poet/([^/ '"]+)!;
    my @poems = $chunk =~ m!(/Poems/\d+)!g;
    @poems = map{ "$site_url$_" } @poems;

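Put together, the chunk extraction and the two regexes can be tried against a small sample. This is only a sketch: the sample markup, the poem numbers and the $site_url value are invented for illustration, and extract_poems is a hypothetical wrapper name. Only the two HTML comments and the /poet/ and /Poems/ URL shapes are taken from the snippets above.

```perl
use strict;
use warnings;

# Invented sample page; the real markup will differ.
my $content = <<'HTML';
<a href="/Poems/1">outside the main area, ignored</a>
<!-- Main Area -->
<a href="/poet/MornieAtlantie/">MornieAtlantie</a>
<a href="/Poems/12345">Fading Away</a>
<a href="/Poems/12346">Another Poem</a>
<!-- other boxes can be loaded by the subroutines? bad at end? hmmph -->
<a href="/Poems/99999">outside the main area, ignored</a>
HTML

my $site_url = 'http://www.example.com';    # placeholder base URL

# Returns the poet name followed by absolute poem URLs,
# or an empty list if the marker comments are not found.
sub extract_poems {
    my ($content, $site_url) = @_;
    my ($chunk) = $content =~ m#<!-- Main Area -->(.*?)<!-- other boxes can be#s;
    return unless defined $chunk;
    my ($poet) = $chunk =~ m!/poet/([^/ '"]+)!;
    my @paths  = $chunk =~ m!(/Poems/\d+)!g;
    return ($poet, map { $site_url . $_ } @paths);
}

my ($poet, @poems) = extract_poems($content, $site_url);
print "$poet\n";
print "$_\n" for @poems;
```

The non-greedy (.*?) together with the /s flag is what keeps the match inside the main area, so the links before and after the two comments are never seen by the later regexes.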
    HTML::LinkExtor is the module solution. There is absolutely nothing wrong with it for this sort of task. However, when markup is autogenerated out of a DB, regexes are reliable and can often be tailored to extract just what you want, instead of extracting all the links with LinkExtor and then having to post-process them.
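For comparison, a rough sketch of the HTML::LinkExtor route, including the post-processing step. The sub name poem_links, the sample markup and the base URL are assumptions for illustration; the /Poems/ filter reuses the URL shape from the regex above.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Return absolute URLs of <a href> links that look like poem pages.
# poem_links is a hypothetical helper name.
sub poem_links {
    my ($html, $base) = @_;
    # No callback: links are collected and read back with ->links.
    # Passing a base URL makes the extracted links absolute.
    my $parser = HTML::LinkExtor->new(undef, $base);
    $parser->parse($html);
    $parser->eof;

    my @urls;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' && defined $attr{href};
        my $url = "$attr{href}";            # stringify the URI object
        push @urls, $url if $url =~ m!/Poems/\d+$!;
    }
    return @urls;
}

# Invented sample markup
my $html = <<'HTML';
<a href="/poet/MornieAtlantie/">MornieAtlantie</a>
<a href="/Poems/12345">Fading Away</a>
<img src="/images/logo.gif">
HTML

print "$_\n" for poem_links($html, 'http://www.example.com/');
```

Note that this sees every link on the page, so the filter has to do the work that the Main Area comments do in the regex version.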

    As for runtime, you sound like you want this under CGI, in which case you need to recognise that you are looking at about 2 seconds a page on average (so roughly 200 seconds for 100 poems), and this is a bit harsh on your target site. Better to run it in the background and store the data in some form of DB; then you can be a good neighbour and not do a DoS on your poets. Of course merlyn has a tutorial on long-running CGIs in his Web Techniques articles.
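One way the background approach could look, as a sketch only. The names harvest, save_pages and load_pages are made up, Storable stands in for "some form of DB", and the fetcher is passed in as a code ref so the loop does not care whether it wraps LWP::UserAgent or something else.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Fetch each URL with a pause between requests; returns { url => body }.
# $fetch is any code ref that takes a URL and returns the page body,
# $delay is the pause in seconds between consecutive requests.
sub harvest {
    my ($urls, $fetch, $delay) = @_;
    my %pages;
    my $first = 1;
    for my $url (@$urls) {
        sleep $delay if $delay && !$first;   # be polite to the target site
        $first = 0;
        my $body = eval { $fetch->($url) };
        if (defined $body) {
            $pages{$url} = $body;
        }
        else {
            warn "skipping $url: " . ($@ || "no content\n");
        }
    }
    return \%pages;
}

# Persist the results so a CGI script can read them without refetching.
sub save_pages { my ($pages, $file) = @_; store($pages, $file); return }
sub load_pages { my ($file) = @_; return retrieve($file) }
```

The idea would be to run harvest from cron or by hand, call save_pages at the end, and have the CGI script only ever call load_pages, so page views never trigger requests to the poetry site.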

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Where to start?
by LordWeber (Monk) on Sep 06, 2003 at 08:37 UTC