in reply to Parsing HTML question

Dropping everything inside tags will be the easy part; getting meaningful words will be the most difficult.

Some algorithms use a list of stopwords: words so frequent that they will poison any database you use for cataloguing or searching the webpage. Typical stopwords are "the", "a", "who", ... Which I find very unfair if you are a fan of The Who!
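
A minimal sketch of stopword filtering, to make the idea concrete. The stopword list and the sub name meaningful_words are made up for illustration; real lists are far longer and usually come from a file or a module.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
my %stopword = map { $_ => 1 } qw(the a an and or of to in is who);

# Return the lowercased alphabetic words of $text that are not stopwords.
sub meaningful_words {
    my ($text) = @_;
    my @words = map { lc($_) } ($text =~ /[[:alpha:]]+/g);
    return grep { !$stopword{$_} } @words;
}

my @kept = meaningful_words("The Who is a band and Doctor Who is a show");
print join(" ", @kept), "\n";    # prints "band doctor show"
```

Note that both "The Who" and the "Who" in "Doctor Who" vanish completely, which is exactly the unfairness mentioned above.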

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re^2: Parsing HTML question
by GrandFather (Saint) on Jun 24, 2008 at 02:18 UTC

    or Doctor Who.


    Perl is environmentally friendly - it saves trees
Re^2: Parsing HTML question
by vit (Friar) on Jun 23, 2008 at 23:57 UTC
    Instead of working with the HTML, is it possible to convert a web page to a string, so that the result is roughly the same as if I
    Open web page
    CTRL A
    CTRL C
    Paste clipboard content to xxxx.txt file
    The rest is details. So can we do this job in Perl?

      That's what HTML::Strip does. If that module doesn't work for you, you'll have to give us more detail about exactly what kind of a string you want.

        Thanks a lot, I will try this module.
      Now you got us confused!

      Do you actually know how to get a webpage into a Perl program? If not, I suggest you look into the LWP family of modules, and more specifically LWP::Simple, whose get function can do this:

      use LWP::Simple;
      my $webpage = get("http://www.perlmonks.org");

      You then put $webpage through HTML::Strip to get at the contents stripped of tags.
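
      The two steps together might look like the sketch below. It assumes LWP::Simple and HTML::Strip are installed from CPAN; the sub name strip_html and the idea of taking the URL from the command line are just choices made for this example.

      ```perl
      use strict;
      use warnings;
      use LWP::Simple qw(get);
      use HTML::Strip;

      # Strip markup from an HTML string and return the remaining text.
      # (strip_html is a hypothetical helper name for this sketch.)
      sub strip_html {
          my ($html) = @_;
          my $hs   = HTML::Strip->new();
          my $text = $hs->parse($html);
          $hs->eof;    # reset the parser state in case the object is reused
          return $text;
      }

      # Fetch and strip only when a URL is given on the command line.
      if (@ARGV) {
          my $url     = $ARGV[0];
          my $webpage = get($url);    # LWP::Simple's get returns undef on failure
          die "Could not fetch $url\n" unless defined $webpage;
          print strip_html($webpage), "\n";
      }
      ```

      Run it as, for example, perl strip.pl http://www.perlmonks.org (the script name is arbitrary). Since strip_html only needs a string, the same sub should work just as well on HTML you read from a local file.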

      CountZero

        Sorry for the confusion. I know LWP. My task is to strip an HTML file, and I may not know its URL. I checked the example with HTML::Strip and its output does not look very good to me.
        Ideally I would like to have something like this:
        http://www.zubrag.com/tools/html-tags-stripper.php
        Try it. It works very well for me. I do not think it is possible to strip that well using HTML::Strip, or am I missing something?