I'm trying to download some webpages containing non-ASCII characters. The web page declares its encoding as

    <meta http-equiv="Content-Type" content="text/html; charset=windows-1251" />

and in my browser (Firefox 3) it displays very nicely.
Now I have written a little script to fetch one page, as follows:

    use strict;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;    # create the user agent
    my @captions;
    my $response = $ua->get('http://www.1418.ru/chronicles.php?p=100');
    if ($response->is_success) {
        my $file = $response->content;    # raw bytes as sent by the server
        if ($file =~ m/<h3>(.*)<\/h3>/i) {
            my $h3_content = $1;
            push @captions, $h3_content;
        }
    }
    else {
        warn 'ERROR: no HTML ', $response->status_line;
    }

Before one of you starts saying that I should not parse HTML with a regex: I know, and besides, there is only one <h3> tag on the page, so it seemed a bit overkill to break out HTML::Parser or HTML::TreeBuilder.
After I have downloaded all the pages I need, I save @captions into a file. When I open that file (which is actually an HTML file with the proper charset declaration), it no longer shows Cyrillic characters, but funny accented characters instead.
So I thought that I needed to use a Cyrillic encoding as follows:

    open my $fh, '>:encoding(iso-8859-5)', 'c:/data/captions2.txt'
        or die "Cannot open file: $!";
    print $fh join "\n", @captions;

But that gives me a lot of errors such as:

    "\x{00ff}" does not map to iso-8859-5.

(I assumed iso-8859-5 was the ISO name for windows-1251.) The file is full of these "\x{00ff}" character escapes and still does not render correctly.
So I guess that things already go wrong when importing the web-page and that somewhere there the proper encoding gets lost and cannot be restored.
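A minimal sketch of what seems to be happening here, assuming the downloaded bytes were never decoded: Perl then treats each windows-1251 byte as a Latin-1 character, and a character such as U+00FF has no slot in iso-8859-5. The sample byte and variable names below are made up for illustration, and `Encode::FB_PERLQQ` is used only to reproduce the escape without going through a file handle.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: the single windows-1251 byte 0xFF (a Cyrillic letter),
# left undecoded, so Perl sees it as the Latin-1 character U+00FF.
my $undecoded = "\xFF";

# Encode to iso-8859-5 with the FB_PERLQQ fallback, which replaces
# unmappable characters with a \x{HHHH} escape instead of dying.
# Note: with a CHECK value set, encode() may modify its source string.
my $out = encode('iso-8859-5', $undecoded, Encode::FB_PERLQQ());

print "$out\n";
```

The printed escape should look like the one in the error message above, which suggests the problem lies with the undecoded input rather than with the choice of output layer.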
I think my question really boils down to "how to convince LWP::UserAgent to keep the Cyrillic encoding of the webpage?"
Update: LWP::UserAgent indeed did not decode the webpage, and all was solved when I put the proper encoding (windows-1251) in the HTML header. Thanks all!
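For anyone who would rather decode the text explicitly than rely on the header of the output file, here is a small sketch of that alternative (not what I ended up doing). It assumes the page really is windows-1251; the sample bytes stand in for what `$response->content` returns, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: six windows-1251 bytes spelling a Cyrillic word,
# standing in for the undecoded octets from $response->content.
my $raw_bytes = "\xCF\xF0\xE8\xE2\xE5\xF2";

# Turn the octets into Perl characters, naming the source charset explicitly.
my $text = decode('cp1251', $raw_bytes);

# Re-encode for output: UTF-8 for a new file, or cp1251 to get the
# original bytes back unchanged.
my $utf8_bytes   = encode('UTF-8',  $text);
my $cp1251_bytes = encode('cp1251', $text);

printf "characters: %d, utf8 bytes: %d\n", length($text), length($utf8_bytes);
```

HTTP::Response also offers a `decoded_content` method that accepts a `charset` option, which may save the manual `decode` step when the server's headers cannot be trusted.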
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
In reply to Downloading webpages with non-ASCII characters by CountZero