
I think this problem is not easy to solve, but maybe you can give me some suggestions.

I think anything like this, which is obviously useful and which someone else must have done before, is almost certainly easy in Perl. You just have to pick the right tool, and anything that dumps entities is definitely the wrong tool.

Try this:

use warnings;
use strict;
use WWW::Mechanize;
use Encode;

my $mech = WWW::Mechanize->new(agent => "NotSoForbiddenBot/0.99");
$mech->get("http://en.wikipedia.org/wiki/German_language");

# content(format => "text") needs HTML::TreeBuilder installed.
# The content comes back as decoded characters, so encode it to
# whatever your output expects (UTF-8 here).
print encode("UTF-8", $mech->content(format => "text"));

That might lose too much formatting and spacing for you. You can instead adapt the recipe in Re: Strip HTML tags again to work on the raw HTML. It will probably preserve whitespace (outside of tables, anyway) better.
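For what working on the raw HTML can look like, here is a minimal sketch. It is not the recipe from the linked node, just one way to do it, assuming HTML::Parser is installed (it is normally pulled in as a WWW::Mechanize dependency). The html_to_text helper and the sample string are made up for illustration. Entities are decoded rather than dumped, and script/style contents are skipped.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not from the linked node): collect the text of an
# HTML string with entities decoded, skipping <script> and <style> bodies.
sub html_to_text {
    my ($html) = @_;
    my $text = "";
    my $parser = HTML::Parser->new(
        api_version => 3,
        # "dtext" hands the callback text with entities already decoded.
        text_h => [ sub { $text .= shift }, "dtext" ],
    );
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;
    return $text;
}

# Made-up sample; in practice you would pass in the raw HTML you fetched.
my $sample = '<p>Stra&szlig;e &amp; <b>Gr&uuml;&szlig;e</b></p>'
           . '<script>var x = 1;</script>';

binmode STDOUT, ":encoding(UTF-8)";
print html_to_text($sample), "\n";
```

Whitespace in the source is passed through untouched, so you keep whatever spacing the HTML had. If you want line breaks at block elements you would have to add a start or end handler yourself.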

In reply to Re: ContentExtractor Coding by Your Mother
in thread ContentExtractor Coding by fanticla
