I think this problem is not easy to solve, but maybe you can give me some suggestions.
I think anything like this, which is obviously useful and which someone else must have done before, is almost certainly easy in Perl. You just have to pick the right tool, and anything that dumps entities is definitely the wrong tool.
Try this:
    use warnings;
    use strict;
    use WWW::Mechanize;
    use Encode;

    my $mech = WWW::Mechanize->new(agent => "NotSoForbiddenBot/0.99");
    $mech->get("http://en.wikipedia.org/wiki/German_language");

    # You might have to edit/detect the encode statement to
    # match the document's.
    print encode("UTF-8", $mech->content(format => "text"));
That might lose too much formatting and spacing for you. You can adapt the recipe in Re: Strip HTML tags again to work on the raw HTML instead. It will probably preserve whitespace better (outside of tables, anyway).
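Something along these lines might be a starting point for working on the raw HTML. This is only a rough sketch, not the recipe from the linked node: it assumes HTML::Parser is installed (it normally comes along with WWW::Mechanize), and strip_tags is a made-up helper name. It keeps every text chunk exactly as it appears, so the whitespace between tags survives, and it drops the contents of script and style elements.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not from the original post): returns the text of an
# HTML string with the tags removed and whitespace left as it was.
sub strip_tags {
    my ($html) = @_;
    my $text = "";

    my $parser = HTML::Parser->new(
        api_version => 3,
        # "dtext" hands the callback text with entities already decoded,
        # so &amp; arrives as a plain "&".
        text_h => [ sub { $text .= shift }, "dtext" ],
    );

    # Skip everything inside these elements, including their text.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;

    return $text;
}

# Small inline sample so the sketch runs without a network connection.
my $html = "<p>Fish &amp; <b>chips</b></p>\n"
         . "<script>var x = 1;</script>\n"
         . "<p>Second  line</p>\n";

print strip_tags($html);
```

With the Mechanize example you would presumably pass in $mech->content, which is already decoded to characters, and encode the result on output just as before.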
In reply to Re: ContentExtractor Coding by Your Mother, in thread ContentExtractor Coding by fanticla