Re: Parsing HTML question

Dropping everything inside tags will be the easy part; getting meaningful words will be the most difficult.

Some algorithms use a list of stopwords: words so frequent that they will poison any database you use for cataloguing or searching the webpage. Typical stopwords are "the", "a", "who", ... Which I find to be very unfair if you are a fan of The Who!
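A minimal sketch of stopword filtering in plain Perl, to make the idea concrete. The stopword list and the name `meaningful_words` are illustrative assumptions, not part of any module; real stopword lists are much longer and language-specific.

```perl
use strict;
use warnings;

# Illustrative stopword list only (an assumption for this sketch).
my %STOPWORD = map { $_ => 1 } qw(the a an who and or of to in is it);

# Return the lowercased words of $text that are not stopwords.
sub meaningful_words {
    my ($text) = @_;
    my @words = map { lc($_) } ($text =~ /(\w+)/g);
    return grep { !$STOPWORD{$_} } @words;
}

my @words = meaningful_words("The Who is a band that played in Leeds");
print join(' ', @words), "\n";    # prints "band that played leeds"
```

Note that "The Who" disappears entirely, which is exactly the problem described above.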

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Replies are listed 'Best First'.
Re^2: Parsing HTML question by GrandFather (Saint) on Jun 24, 2008 at 02:18 UTC
Or Doctor Who.
Perl is environmentally friendly - it saves trees
Re^2: Parsing HTML question by vit (Friar) on Jun 23, 2008 at 23:57 UTC
Instead of working with HTML, is it possible to convert a web page to a string, so that the result is roughly the same as if I open the web page, press Ctrl+A, then Ctrl+C, and paste the clipboard content into an xxxx.txt file? The rest is details. So can we do this job in Perl?
Re^3: Parsing HTML question by chromatic (Archbishop) on Jun 24, 2008 at 01:08 UTC
That's what HTML::Strip does. If that module doesn't work for you, you'll have to give us more detail about exactly what kind of a string you want.
Re^4: Parsing HTML question by vit (Friar) on Jun 24, 2008 at 03:08 UTC
Thanks a lot, I will try this module.
Re^3: Parsing HTML question by CountZero (Bishop) on Jun 24, 2008 at 05:06 UTC
Now you've got us confused! Do you actually know how to get a webpage into a Perl program? If not, I suggest you look into the LWP family of modules, and more specifically the LWP::Simple module, whose `get` function lets you write `my $webpage = get("http://www.perlmonks.org");`. You then put `$webpage` through HTML::Strip to get at the contents stripped of tags.
CountZero
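The two steps above can be sketched as follows, assuming both LWP::Simple and HTML::Strip are installed from CPAN. The helper name `strip_tags` and the use of a command-line URL are assumptions made for this sketch. The same helper works on HTML read from a local file, since it only needs the markup as a string.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::Strip;

# Remove markup from an HTML string and return the remaining text.
sub strip_tags {
    my ($html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($html);
    $hs->eof;    # reset the parser state before any further use
    return $text;
}

# Fetch and strip only when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $webpage = get($url);    # get() returns undef on failure
    die "Could not fetch $url\n" unless defined $webpage;
    print strip_tags($webpage), "\n";
}
```

How whitespace and entities come out depends on HTML::Strip's options, so the result will not necessarily match a browser's copy-and-paste text exactly.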
Re^4: Parsing HTML question by vit (Friar) on Jun 24, 2008 at 19:27 UTC
Sorry for the confusion. I know LWP. My task is to strip an HTML file, and I may not know its URL. I checked the example with HTML::Strip and it does not look very good to me. Ideally I would like to have something like this: http://www.zubrag.com/tools/html-tags-stripper.php Try it. It works very well for me. I do not think it is possible to strip that well using HTML::Strip, or am I missing something?
Re^5: Parsing HTML question by moritz (Cardinal) on Jun 24, 2008 at 19:39 UTC