Parsing HTML question

vit has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Parsing HTML question by moritz (Cardinal) on Jun 23, 2008 at 22:08 UTC
Maybe you want to use HTML::Strip, which removes HTML tags and returns plain text.
"However my goal is to create a list of meaningful words for creating a document based on web page content."
That's not easy - writing a program that does meaningful analysis of natural language isn't easy at all. Good luck!
Re^2: Parsing HTML question by vit (Friar) on Jun 23, 2008 at 22:30 UTC
Then there should be a way to represent a web page content as we see it rather than as HTML. For example I do this with Quick Test Pro. But it uses IE and runs JavaScripts if necessary. There should be some possibility to do this in Perl.
Re^3: Parsing HTML question by moritz (Cardinal) on Jun 23, 2008 at 22:43 UTC
"Then there should be a way to represent a web page content as we see it rather than as HTML."
You mean as a screenshot? That won't make your task easier, not at all.
"There should be some possibility to do this in Perl."
Perl is Turing complete - sure you can do $this in Perl. But sometimes it's not the easiest way to do $this.
"For example I do this with Quick Test Pro."
What is "this"? I didn't know what "Quick Test Pro" is, so I looked it up - it seems to be some kind of testing framework. Well, we have such things in Perl also, like Test::More and many more test modules. But I'm a bit confused because I don't see a relation to your original question.
Maybe you should just explain what you want to achieve in the end, not ask for steps on the way that you think are necessary to achieve your goal. See XY Problem.
Re^4: Parsing HTML question by vit (Friar) on Jun 23, 2008 at 22:59 UTC
Re^5: Parsing HTML question by moritz (Cardinal) on Jun 24, 2008 at 08:03 UTC
Re: Parsing HTML question by CountZero (Bishop) on Jun 23, 2008 at 22:40 UTC
Dropping everything inside tags will be the easy part; getting meaningful words will be the most difficult. Some algorithms use a list of stopwords (words that are so frequent they will poison any database you use for cataloguing or searching the webpage). Typical stopwords are "the", "a", "who", ... Which I find to be very unfair if you are a fan of The Who!
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
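A minimal sketch of the stopword idea described above, using only core Perl. The stopword list and the `meaningful_words` helper are hypothetical and chosen for illustration; a real list would be far longer, and the word-splitting regex here is a rough assumption that only handles ASCII letters and apostrophes.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny stopword list (illustration only).
my %stopword = map { $_ => 1 } qw(the a an and or of to in is who);

# Return the lowercased words of $text that are not in the stopword list.
sub meaningful_words {
    my ($text) = @_;
    my @words = map { lc $_ } $text =~ /([A-Za-z']+)/g;
    return grep { !$stopword{$_} } @words;
}

my @kept = meaningful_words("The Who is a band, and Doctor Who is a show.");
print join(" ", @kept), "\n";    # prints "band doctor show"
```

Note that both "The Who" and the "Who" in "Doctor Who" disappear from the result, which is exactly the unfairness complained about above.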
Re^2: Parsing HTML question by GrandFather (Saint) on Jun 24, 2008 at 02:18 UTC
or Doctor Who.
Perl is environmentally friendly - it saves trees
Re^2: Parsing HTML question by vit (Friar) on Jun 23, 2008 at 23:57 UTC
Instead of working with HTML, is it possible to convert a web page to a string, so that the result is roughly the same as if I did the following:
Open web page
CTRL A
CTRL C
Paste clipboard content to xxxx.txt file
The rest is details. So can we do this job in Perl?
Re^3: Parsing HTML question by chromatic (Archbishop) on Jun 24, 2008 at 01:08 UTC
That's what HTML::Strip does. If that module doesn't work for you, you'll have to give us more detail about exactly what kind of a string you want.
Re^4: Parsing HTML question by vit (Friar) on Jun 24, 2008 at 03:08 UTC
Re^3: Parsing HTML question by CountZero (Bishop) on Jun 24, 2008 at 05:06 UTC
Now you got us confused! Do you actually know how to get a webpage into a Perl program? If not, I suggest you look into the LWP family of modules, and more specifically the LWP::Simple module, which has the `get` function. With it you can do `my $webpage = get("http://www.perlmonks.org");` You then put `$webpage` through HTML::Strip to get at the contents stripped of tags.
CountZero
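The two steps suggested above could be combined roughly as follows. This is a sketch, not a definitive implementation: it assumes the CPAN modules LWP::Simple and HTML::Strip are installed, `html_to_text` is a hypothetical helper name, and the whitespace cleanup is an extra step not mentioned in the thread. It will not run JavaScript, so the result may differ from what a browser copy and paste gives.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::Strip;

# Strip tags from an HTML string and return the remaining text.
sub html_to_text {
    my ($html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);
    $hs->eof;                      # reset the parser between documents
    $text =~ s/\s+/ /g;            # collapse runs of whitespace (our addition)
    $text =~ s/^\s+|\s+$//g;       # trim leading and trailing whitespace
    return $text;
}

# Fetch and print a page only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $html = get($url);          # get() returns undef on failure
    die "Could not fetch $url\n" unless defined $html;
    print html_to_text($html), "\n";
}
```

Saved under a name of your choosing (say `page2text.pl`, a made-up filename), it could be run as `perl page2text.pl http://www.perlmonks.org > xxxx.txt`.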
Re^4: Parsing HTML question by vit (Friar) on Jun 24, 2008 at 19:27 UTC
Re^5: Parsing HTML question by moritz (Cardinal) on Jun 24, 2008 at 19:39 UTC