This can be done in many ways (as usual), but one way would be to grab the webpage with LWP::Simple and parse it. This will include HTML, so you'd have to filter that out (assuming you just want to extract the text) with, say HTML::Parser or, HTML::Strip.

WWW::Mechanize, a very popular module amongst some monks, can probably help too, although I still have been to lazy to check it out.

Then, of course, you could use programs such as "lynx" to dump the non-HTML page, and parse that. No need for HTML stripping and or LWP-like modules then.

Update: it'd be helpfull if you could tell us a little bit more on your motives for doing this (so we know if you want to get rid of the HTML, if you should go for the "lynx" approach etc.), and what you've tried so far.

--
b10m

All code is usually tested, but rarely trusted.

In reply to Re: text extract by b10m
in thread text extract by shu

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.