Instead of working with the HTML, is it possible to convert a web page to a string, so that the result is roughly the same as if I did the following:
Open the web page
CTRL A
CTRL C
Paste the clipboard contents into an xxxx.txt file
The rest is details. So can we do this job in Perl?

Re^3: Parsing HTML question
by chromatic (Archbishop) on Jun 24, 2008 at 01:08 UTC

    That's what HTML::Strip does. If that module doesn't work for you, you'll have to give us more detail about exactly what kind of a string you want.
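
    A minimal sketch of what that looks like, assuming HTML::Strip is installed from CPAN. The helper name strip_html and the sample markup are made up for illustration; the HTML could just as well be read from a local file:

```perl
use strict;
use warnings;
use HTML::Strip;

# Hypothetical helper: return the text of $html with the tags removed.
sub strip_html {
    my ($html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);
    $hs->eof;    # reset the parser state once the document is complete
    return $text;
}

# Made-up sample input; in practice this could be slurped from a file.
my $sample = '<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>';
print strip_html($sample), "\n";
```

    The exact whitespace in the result depends on the module's options, so it is worth comparing the output with what copy and paste from the browser gives you.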

      Thanks a lot, I will try this module.
Re^3: Parsing HTML question
by CountZero (Bishop) on Jun 24, 2008 at 05:06 UTC
    Now you've got us confused!

    Do you actually know how to get a webpage into a Perl program? If not, I suggest you look into the LWP family of modules, and more specifically LWP::Simple, whose get function can do this:

    use LWP::Simple;
    my $webpage = get("http://www.perlmonks.org");

    You then put $webpage through HTML::Strip to get at the contents stripped of tags.
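
    Put together, the two steps might look roughly like this. It is only a sketch, assuming both LWP::Simple and HTML::Strip are installed; the helper names html_to_text and page_text are invented for the example, and the URL is taken from the command line:

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::Strip;

# Hypothetical helper: strip the tags from an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);
    $hs->eof;
    return $text;
}

# Hypothetical helper: fetch a page and return its text, or undef on failure.
sub page_text {
    my ($url) = @_;
    my $html = get($url);    # LWP::Simple's get returns undef on error
    return unless defined $html;
    return html_to_text($html);
}

# Fetch only when a URL is given on the command line.
my $url = shift @ARGV;
if (defined $url) {
    my $text = page_text($url);
    die "Could not fetch $url\n" unless defined $text;
    print $text, "\n";
}
```

    Saved under a name of your choice, it would be run as: perl yourscript.pl http://www.perlmonks.org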

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Sorry for the confusion. I know LWP. My task is to strip an HTML file, and I may not know its URL. I checked the example with HTML::Strip and the result does not look very good to me.
      Ideally I would like to have something like this:
      http://www.zubrag.com/tools/html-tags-stripper.php
      Try it. It works very well for me. I do not think it is possible to strip that well using HTML::Strip, or am I missing something?
        http://www.zubrag.com/tools/html-tags-stripper.php Try it. It works very well for me

        Our notions of "well" might differ. I tried it, and the first thing I noticed was that it broke all non-ASCII characters on my page.

        Anyway, I don't think anybody can help you unless you describe in what way the output of HTML::Strip isn't fit for your purpose.