vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I know there are a lot of ways to parse HTML.
However, my goal is to create a list of meaningful words, so that I can build a document based on web page content.
I definitely do not want anything inside <...>.
I do not want to reinvent the wheel, so your suggestions would be appreciated. I am sure quite a few people have done this already.

Replies are listed 'Best First'.
Re: Parsing HTML question
by moritz (Cardinal) on Jun 23, 2008 at 22:08 UTC
    Maybe you want to use HTML::Strip, which removes HTML tags and returns plain text.
    However my goal is to create a list of meaningful words for creating a document based on web page content.

    That's not easy - writing a program that does meaningful analysis of natural language isn't easy at all. Good luck!
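
For what it's worth, the HTML::Strip suggestion above might look roughly like the sketch below. The sample HTML string and the simple split on non-word characters are my own assumptions for illustration; this only gets you a raw word list, not a "meaningful" one.

```perl
use strict;
use warnings;
use HTML::Strip;    # CPAN module, not part of core Perl

# Hypothetical sample input; any HTML string would do.
my $html = '<html><body><h1>Hello</h1><p>Parsing <b>HTML</b> in Perl.</p></body></html>';

my $hs   = HTML::Strip->new();
my $text = $hs->parse($html);    # the text with the tags removed
$hs->eof;                        # reset parser state before reusing $hs

# Naive tokenizer (an assumption): split on runs of non-word characters
# and drop any empty leading field.
my @words = grep { length } split /\W+/, $text;
print join("\n", @words), "\n";
```

Whether a split on `\W+` is good enough depends on the pages involved; hyphenated words and apostrophes, for instance, get broken apart by it.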

      Then there should be a way to represent web page content as we see it rather than as HTML.
      For example, I do this with Quick Test Pro, but it uses IE and runs JavaScript if necessary.
      There should be some possibility to do this in Perl.
        Then there should be a way to represent web page content as we see it rather than as HTML.
        You mean as a screenshot? That won't make your task easier, not at all.
        There should be some possibility to do this in Perl.

        Perl is Turing complete - sure you can do $this in perl. But sometimes it's not the easiest way to do $this.

        For example I do this with Quick Test Pro.

        What is "this"? I didn't know what "Quick Test Pro" is, so I looked it up - seems to be some kind of testing framework. Well, we have such things in perl also, like Test::More and many more test modules. But I'm a bit confused because I don't see a relation to your original question.

        Maybe you should just explain what you want to achieve in the end, not ask for steps on the way that you think are necessary to achieve your goal. See XY Problem.

Re: Parsing HTML question
by CountZero (Bishop) on Jun 23, 2008 at 22:40 UTC
    Dropping everything inside tags will be the easy part, getting meaningful words will be most difficult.

    Some algorithms use a list of stopwords (words that are so frequent they will poison any database you use for cataloguing or searching the webpage). Typical examples are "the", "a", "who", ... which I find very unfair if you are a fan of The Who!
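
To make the stopword idea concrete, here is a minimal sketch. The stopword list is a tiny made-up sample and `meaningful_words` is a hypothetical helper name, not something from a module; a real list would be far longer and probably language-specific.

```perl
use strict;
use warnings;

# Illustrative stopword list only (an assumption, not a standard list).
my %stopword = map { $_ => 1 } qw(the a an who and or of to in is it);

# Hypothetical helper: returns the words of $text that are not stopwords,
# lowercased and in their original order.
sub meaningful_words {
    my ($text) = @_;
    return grep { !$stopword{$_} } map { lc $_ } $text =~ /(\w+)/g;
}

my @words = meaningful_words('The Who played a concert in London');
print "@words\n";    # prints "played concert london"
```

As the example shows, the band name disappears entirely, which is exactly the unfairness mentioned above.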

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      or Doctor Who.


      Perl is environmentally friendly - it saves trees
      Instead of working with HTML, is it possible to convert a web page to a string, so that the result is roughly the same as if I:
      Open the web page
      Press CTRL+A
      Press CTRL+C
      Paste the clipboard content into an xxxx.txt file
      The rest is details. So can we do this job in Perl?

        That's what HTML::Strip does. If that module doesn't work for you, you'll have to give us more detail about exactly what kind of a string you want.

        Now you got us confused!

        Do you actually know how to get a webpage into a Perl program? If not, I suggest you look into the LWP family of modules, and more specifically the LWP::Simple module, whose get function lets you do:

        use LWP::Simple;    # exports get()
        my $webpage = get("http://www.perlmonks.org");

        You then put $webpage through HTML::Strip to get at the contents stripped of tags.
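
Putting the two steps together, a rough sketch of the "open page, copy all, paste into a text file" workflow could look like this. The helper name `html_to_text`, the output file name `page.txt`, and the whitespace clean-up are assumptions of mine; the script only fetches anything when a URL is given on the command line.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);    # CPAN (libwww-perl)
use HTML::Strip;            # CPAN

# Hypothetical helper: turn an HTML string into one line of plain text.
sub html_to_text {
    my ($html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);
    $hs->eof;
    $text =~ s/\s+/ /g;          # collapse runs of whitespace
    $text =~ s/^\s+|\s+$//g;     # trim both ends
    return $text;
}

# Only fetch when a URL is supplied as the first argument.
my ($url) = @ARGV;
if (defined $url) {
    my $html = get($url);        # get() returns undef on failure
    die "Could not fetch $url\n" unless defined $html;

    # 'page.txt' is a placeholder file name.
    open my $out, '>:encoding(UTF-8)', 'page.txt'
        or die "Cannot write page.txt: $!\n";
    print {$out} html_to_text($html), "\n";
    close $out or die "Cannot close page.txt: $!\n";
}
```

You would run it as something like `perl page2text.pl http://www.perlmonks.org` (the script name is made up). Note that this will not execute any JavaScript on the page, so the result can differ from what a browser shows.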

        CountZero
