PerlMonks
pulling just text from a url
by coldfingertips (Pilgrim)
on Mar 19, 2006 at 21:16 UTC ( [id://537802]=perlquestion )
coldfingertips has asked for the wisdom of the Perl Monks concerning the following question:
I need an accurate way to pull just the readable text from a web page. I was told HTML::TokeParser (or HTML::TokeParser::Simple) would work, but it is bringing back some CSS and JavaScript too, including Google Ad source code. On top of this, there are a lot of li and other tags in the page dump. I suppose I can filter these out with regexes, but there's no way I can account for everything that this module misses. It also misprints some data: my script prints '0Items in cart', for example, even though there IS a space there on the page. Is there an accurate way to do this?
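For illustration only, here is a minimal sketch of one way to approach this with HTML::TokeParser. It is not the script from the question. It assumes the page is already in a string of decoded characters, and the helper name `visible_text` and the sample HTML are made up for the example. The idea is to keep only text tokens, ignore anything between script or style tags, and join the pieces with a space so that text from adjacent elements does not run together.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: returns the visible text of an HTML string.
# Assumes $html holds decoded characters, not raw bytes.
sub visible_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot create parser: $!";

    # Elements whose content is never displayed to the reader.
    my %skip = map { $_ => 1 } qw(script style);
    my $skipping = 0;
    my @chunks;

    while (my $token = $parser->get_token) {
        my $type = $token->[0];
        if ($type eq 'S' && $skip{ $token->[1] }) {
            $skipping++;                      # entering <script> or <style>
        }
        elsif ($type eq 'E' && $skip{ $token->[1] }) {
            $skipping-- if $skipping;         # leaving it again
        }
        elsif ($type eq 'T' && !$skipping) {
            # Text tokens arrive undecoded, so turn &amp; etc. into characters.
            push @chunks, decode_entities($token->[1]);
        }
        # Start/end tags, comments, and declarations are simply dropped.
    }

    # Join with a space so text from neighbouring elements stays separated,
    # then collapse runs of whitespace and trim both ends.
    my $out = join ' ', @chunks;
    $out =~ s/\s+/ /g;
    $out =~ s/^ | $//g;
    return $out;
}

# Made-up sample input resembling the cart example.
my $sample = '<p><b>0</b><span>Items in cart</span></p>'
           . '<script>var x = 1;</script>';
print visible_text($sample), "\n";
```

Joining on a space is a trade-off: it separates text such as '0' and 'Items in cart', but it will also split a word that is broken up by inline tags (for example `<b>H</b>ello`). Adding the space only around block-level tags would be a possible refinement.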