in reply to Re: I don't think it's crypt::ssleay either.
in thread problem with german chars in html post fetch

Holli, thanks for your feedback. I *did* have Crypt::SSLeay installed on my system, although I didn't include "use Crypt::SSLeay" in my script. This must be why I was getting output and you weren't, with the identical script. At any rate, I edited my original script to use Crypt::SSLeay as you suggested.

However, I am still not able to post my request successfully using "DWIM" perl, and am forced to use the function I rolled myself. I revised my script (below) to output the same two html files as before, plus an attempt at encoding with CGI::Enurl, another attempt with URI::Escape's uri_escape, and an attempt with a regex suggested in perlfaq9, none of which worked, unfortunately.

According to the documentation, CGI::Enurl should do what's needed here, but as the script below demonstrates, it fails where my hand-rolled function succeeds.

Any ideas?

use strict;
use HTTP::Request::Common;   # HTTP handling
use LWP::UserAgent;          # HTTP handling
use Crypt::SSLeay;
use CGI::Enurl;
use URI::Escape;
use Encode;
use Data::Dumper;

my $query = 'börse';
my $www;

# Returns a list of the canonical names of the available encodings that are loaded.
# http://cpan.uwinnipeg.ca/htdocs/Encode/Encode.html
# on my system, this outputs:
#$VAR1 = [
#          'ascii',
#          'ascii-ctrl',
#          'iso-8859-1',
#          'null',
#          'utf8'
#        ];
my @list = Encode->encodings();
open F, "> encodingsOutput.txt" or die "Cannot open encodings output.";
print F Dumper(\@list);
close F;

#doesn't work.
#sends $query = 'börse';
$www = google_keyword_suggestions_html_debug('de', 'de', $query);
open F, "> suggestionsOriginal.html" or die "Cannot open.";
print F '$query: ' . "$query\n";
print F $www->content,"\n";
close F;

#doesn't work either.
#sends queryDoesntWorkEnurl: b%F6rse
my $query_enurl = enurl($query);
$www = google_keyword_suggestions_html_debug('de', 'de', $query_enurl);
open F, "> suggestionsEnurl.html" or die "Cannot open.";
print F '$query_enurl:' . "$query_enurl:\n";
print F $www->content,"\n";
close F;

#encoding with uri_escape doesn't work either
#sends b%F6rse (same as en_url)
my $query_uri_escape = uri_escape($query);
$www = google_keyword_suggestions_html_debug('de', 'de', $query_uri_escape);
open F, "> suggestionsUriEscape.html" or die "Cannot open.";
print F '$query_uri_escape: ' . "$query_uri_escape\n";
print F $www->content,"\n";
close F;

#encodes with regex suggested in perlfaq9
#sends ?b%f6rse (same as en_url, except lower case.)
my $query_regexPerlfaq9 = query_regexPerlfaq9($query);
$www = google_keyword_suggestions_html_debug('de', 'de', $query_regexPerlfaq9);
open F, "> suggestionsPerlfaq9Regex.html" or die "Cannot open.";
print F '$query_regexPerlfaq9: ' . "$query_regexPerlfaq9\n";
print F $www->content,"\n";
close F;

# the perlfaq9 regex used by query_regexPerlfaq9() below:
# s/([^\w()'*~!.-])/sprintf '%%%02x', ord $1/eg; # encode

#works -- keyword suggestions are retrieved
#sends $query = 'börse';
#keyword suggestions are retrieved, although the html is kind of warped looking.
my $query_works = germanchars_to_strange_html_chars($query);
$www = google_keyword_suggestions_html_debug('de', 'de', $query_works);
open F, "> suggestionsRolledMyOwn.html" or die "Cannot open.";
print F '$query_works: ' . "$query_works\n";
print F $www->content,"\n";
close F;

# returns $www object containing html for a successful code, or an error code
sub google_keyword_suggestions_html_debug {
    my $language = shift;
    my $country  = shift;
    my $query    = shift; #this could be a list, but leaving it as a single word. maybe change later.
    my $action = POST 'https://adwords.google.com/select/KeywordSandbox',
        [ 'save'        => "save",
          'wizard_name' => "keywordsandbox_wizard",
          'language'    => $language,
          'country'     => $country,
          'keywords'    => $query,
        ];
    my $ua = LWP::UserAgent->new;
    $ua->agent('Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)');
    $ua->timeout(30);
    my $www = $ua->request( $action );
    return $www;
}

# Takes a variable and spits it back out with the proper german characters
sub germanchars_to_strange_html_chars {
    my $var = shift;
    my %table = ( 'ß' => 'ß',
                  'ä' => 'ä',
                  'ö' => 'ö',
                  'Ä' => 'ä',
                  'Ö' => 'ö',
                  'Ü' => 'ü',
                  'ü' => 'ü');
    while (my ($k,$v) = each %table) {
        $var =~ s/$k/$v/g;
    }
    return $var;
}

#based on suggestion in perl faq9
sub query_regexPerlfaq9 {
    my $var = shift;
    $var =~ s/([^\w()'*~!.-])/sprintf '%%%02x', ord $1/eg; # encode
    return $var
}
Update: added test for posting with uri_escape (unfortunately doesn't work either)
Update 2: added test for posting with the URI-encoding regex suggested in perlfaq9 (still doesn't work)
Update 3: added code that dumps the encodings Encode reports as loaded on my system into a text file. Currently this outputs
$VAR1 = [ 'ascii', 'ascii-ctrl', 'iso-8859-1', 'null', 'utf8' ];
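
The short list above is what Encode has loaded so far, which is not necessarily everything it can load on demand. A minimal sketch, assuming a stock Encode installation: passing ':all' to encodings() asks for every available encoding, and find_encoding() checks for one name. The choice of 'cp1252' as the name to look up is my assumption about what "Windows ANSI" means on a Western European system.

```perl
use strict;
use warnings;
use Encode;

# Encodings already loaded (this is what the script above dumps).
my @loaded = Encode->encodings();

# Every encoding Encode can load on demand.
my @all = Encode->encodings(':all');

# find_encoding() returns an object if the name is known, undef otherwise.
my $cp1252 = Encode::find_encoding('cp1252');

printf "loaded: %d, available: %d\n", scalar @loaded, scalar @all;
print "cp1252 is ",
      ($cp1252 ? "available as " . $cp1252->name : "not available"), "\n";
```

If cp1252 shows up as available, decode() and encode() can be used with it even though it is missing from the default dump.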

Replies are listed 'Best First'.
Re: Sending post via enurl doesn't work either (though from documentation seems like it should)
by holli (Abbot) on Jan 10, 2005 at 10:50 UTC
    As you can see in my chart above, sending a query using germanchars_to_strange_html_chars() and enurl() gives the same results. If they differ for you, I don't know why.
      Okay, I think I've figured this out. Holli and I are probably getting different results because his script file is encoded in UTF-8, whereas mine is encoded in Windows ANSI.

      When I converted my script to UTF-8 with EditPad before running it, it worked. Originally Holli had suggested that I convert to "dos mode", which I interpreted as running convert ANSI->OEM in EditPad (since that's what the EditPad help file calls DOS mode). If I had run convert ANSI->UTF-8 instead, I would have had success and saved myself many hours of head scratching. OTOH, at least I'm beginning to get a better understanding of how to troubleshoot encoding issues, and I hope that sharing my experience may help others.

      During the headscratching phase, I painstakingly put together the following chart comparing utf8 and windows ansi.

      symbol   encoding         editpad hex mode display   editpad normal mode displays
      ö        ansi windows     f6                         ö
      ö        utf8             c3b6                       ö
      ö        dos mode (oem)   94
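
The hex column of the chart can be reproduced from Perl without a hex editor. This is only a sketch: the encoding names are the ones Encode uses, and treating EditPad's "DOS mode (OEM)" as cp437 is my assumption (the actual OEM code page depends on the Windows locale).

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00F6 (o-umlaut), written as an escape so the result does not depend
# on how this file itself happens to be saved.
my $char = "\x{F6}";

# Encode the same character three ways and show the resulting bytes as hex.
my %hex;
for my $name (qw(cp1252 UTF-8 cp437)) {
    $hex{$name} = unpack 'H*', encode($name, $char);
    printf "%-7s %s\n", $name, $hex{$name};
}
```

The three values printed should match the f6, c3b6 and 94 entries in the chart.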

      EditPad users (limited time demo version available for download) may appreciate the following info. Windows ANSI is EditPad's default mode. The UTF-8 characters were derived by running editpad->convert->unicode->ansi to utf8. For the DOS-mode characters, I ran convert->ANSI to OEM. The hex values for all of the above were read off in EditPad by switching to hex mode with ctrl-h.

      I conclude that CGI::Enurl does not produce the appropriate POST characters when fed German characters encoded with the Windows default. Or, put more simply, CGI::Enurl is Windows-unfriendly. I wonder if there is a way to contribute to CGI::Enurl and URI::Escape (which works the same way) to make them more Windows-friendly. But I will leave this for another day.
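
One way to keep the script file in Windows ANSI and still send UTF-8 is to convert the string before escaping it. The following is a minimal sketch, not a patch to either module: percent_encode() is a hypothetical stand-in for enurl()/uri_escape(), and the idea that the form wants UTF-8 is an assumption based only on the successful run described above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: percent-encode every byte outside a small safe set.
sub percent_encode {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9_.~-])/sprintf '%%%02X', ord $1/eg;
    return $bytes;
}

# 'boerse' with an o-umlaut as a Windows-1252 byte string:
# the umlaut is the single byte 0xF6.
my $ansi = "b\xF6rse";

# Decode the ANSI bytes to characters, then re-encode them as UTF-8 bytes.
my $utf8 = encode('UTF-8', decode('cp1252', $ansi));

my $from_ansi = percent_encode($ansi);   # b%F6rse
my $from_utf8 = percent_encode($utf8);   # b%C3%B6rse
print "$from_ansi\n$from_utf8\n";
```

The escaping step is identical in both cases; only the bytes handed to it differ, which would fit the observation that the same script behaves differently depending on how the file is saved.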

      thomas.

        Reference for encoding table:

        http://www.microsoft.com/globaldev/reference/sbcs/1252.htm

      Thanks again holli. The facts seem to be that

      enurl('börse') gives

      b%F6rse

      on my system, whereas it gives

      börse

      on your system. I am guessing we have different perl versions, different module versions, or different default encodings.

      perl -MCGI::Enurl -e"print $CGI::Enurl::VERSION" > enUrlVersion.txt
      outputs 1.07 for my enurl version.
      perl -v > perlVoutput.txt
      outputs
      This is perl, v5.8.4 built for MSWin32-x86-multi-thread
      (with 3 registered patches, see perl -V for more detail)
      Copyright 1987-2004, Larry Wall
      Binary build 810 provided by ActiveState Corp. http://www.ActiveState.com
      ActiveState is a division of Sophos.
      Built Jun 1 2004 11:52:21
      I am not sure how to get the default encoding. Truly stumped...
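
One rough way to find out how the script file itself is encoded is to look at the raw bytes Perl sees for a literal umlaut. This is a sketch only: guess_o_umlaut_encoding() is a hypothetical helper, and its lookup table covers just the three byte sequences for ö that appear in the chart above.

```perl
use strict;
use warnings;

# Hypothetical helper: guess how a literal o-umlaut was stored,
# judging only by its raw bytes.
sub guess_o_umlaut_encoding {
    my ($bytes) = @_;
    my %known = (
        'f6'   => 'Windows ANSI (cp1252) or iso-8859-1',
        'c3b6' => 'UTF-8',
        '94'   => 'DOS/OEM (e.g. cp437)',
    );
    my $hex = unpack 'H*', $bytes;
    return $known{$hex} || "unknown ($hex)";
}

# Without "use utf8", Perl treats this literal as the raw bytes in the file,
# so the answer depends on how the editor saved the script.
my $literal = 'ö';
print guess_o_umlaut_encoding($literal), "\n";
```

Running this from two differently saved copies of the same script should print two different answers, which makes the file encoding visible without a hex editor.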

      thomas.

      *******

      UPDATE: Relevant (though so far not helpful) documentation seems to be at: