Re: JSON character encoding
by hippo (Archbishop) on May 29, 2017 at 08:01 UTC
is Youtube using a mix of encodings?
Maybe. One way to tell would be to check the response headers. Did you try that? It might be that LWP::Simple doesn't support that, in which case you could switch to LWP::UserAgent or similar. Another way might be to specify your desired encoding in the request's Accept header and see if you then get the correct encoding in all responses.
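A minimal sketch of that header check with LWP::UserAgent (the URL is a stand-in for the actual YouTube Data API call):

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('https://www.googleapis.com/youtube/v3/search');  # stand-in URL

# The Content-Type header carries the declared charset, if the server
# sends one, e.g. "application/json; charset=UTF-8".
print $res->header('Content-Type') // 'no Content-Type header', "\n";

if ($res->is_success) {
    # decoded_content() decodes the body according to any declared charset,
    # unlike LWP::Simple's get(), which hands back raw bytes.
    my $body = $res->decoded_content;
}
```

If the charset is missing from the header, as in the OP's case, the client has to fall back on the JSON default of UTF-8.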
hippo, I tried checking the response headers and got this: content-type: application/json
Re: JSON character encoding
by Anonymous Monk on May 29, 2017 at 12:55 UTC
You didn't show us the code you used to print out the JSON results. \x{} is Perl syntax, not JSON, so I suspect it was inserted on your end, perhaps by Data::Dumper. Your second example is a curly quote (U+2019) that was UTF-8 encoded and then interpreted as Windows-1252. Regular apostrophes don't have this problem, only curly ones. Also, it would help us understand what's going on if you posted the original JSON, not just what you decoded it to.
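That mis-decoding is easy to reproduce. A small sketch using the core Encode module:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Take a curly apostrophe (U+2019) and UTF-8 encode it to bytes...
my $bytes = encode('UTF-8', "\x{2019}");   # 0xE2 0x80 0x99

# ...then (wrongly) decode those bytes as Windows-1252.
my $mangled = decode('cp1252', $bytes);

# $mangled is now the three characters "â€™" - the classic signature of
# UTF-8 bytes read as Windows-1252. A plain ASCII apostrophe (0x27) is
# one byte in both encodings, so it survives the round trip unscathed.
```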
Hi, thanks for the reply. Here is the JSON response. The strange apostrophe is still there (but only from some YouTube channels; others are normal). It's literally what I get after just these two steps:
my $url = "https://www.googleapis.com/youtube/v3/...";
my $result = get($url);
JSON received:
{ "items": [ { "snippet": { "description": "#BellatorNYC is headed to Madison Square Garden for an epic night of fights! \n\nSubscribe for more Bellator MMA content! http://bit.ly/SubscribeBellatorYT \n\nFollow Bellator MMA\nFacebook: https://www.facebook.com/BellatorMMA\nTwitter: https://twitter.com/BellatorMMA\nInstagram: https://instagram.com/bellatormma/\nSnapchat: BellatorNation\n\nJoin #BellatorNation to gain exclusive access and benefits including fan-fests, ticket presales and much more! http://bellator.com/newsletter\n\nCheck out the Bellator MMA App: http://bellator.spike.com/app\niTunes: http://bit.ly/1tHUGym \nAndroid: http://bit.ly/1OaBqDX\n\nAbout Bellator MMA: \nBellator MMA is about the fighters and the fans. It’s the fighters who put it all on the line each day, leaving every ounce of themselves inside the cage. Our mission is to create, promote and produce the most exciting, competitive and entertaining Mixed Martial Arts competition in the world.\n\nBellator is available to nearly 500 million homes worldwide in over 140 countries. In the United States, Bellator can be seen on Spike, the MMA television leader. Based in Hollywood, California, Bellator is owned by entertainment giant Viacom, home to the world's premier entertainment brands that connect with audiences through compelling content across television, motion picture, online and mobile platforms. Website: http://bellator.spike.com/" } } ] }
You're right, the emoticon was some perl language I inserted somehow. The actual JSON for the emoticon is here:
{ "items": [ { "snippet": { "description": "Happy with everything at the moment 😊" } } ] }
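For what it's worth, if the bytes come back as UTF-8 (as they appear to here), decode_json from the core JSON::PP module expects exactly that, and the only remaining step is telling STDOUT about the encoding. A sketch with an inline sample standing in for the HTTP body:

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);
use Encode qw(encode);

# Emulate the raw UTF-8 bytes that LWP::Simple's get() returns.
my $raw = encode('UTF-8',
    qq({"items":[{"snippet":{"description":"moment \x{1F60A}"}}]}));

# decode_json() takes UTF-8 bytes and returns Perl character strings.
my $data = decode_json($raw);
my $desc = $data->{items}[0]{snippet}{description};

# Without an encoding layer on STDOUT, printing $desc would trigger
# a "Wide character in print" warning.
binmode STDOUT, ':encoding(UTF-8)';
print $desc, "\n";
```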
Yes, both of those are just UTF-8. The trick is to put them in a form that your output device can handle. If you're in a terminal window, it might not be able to display weird Unicode characters like smileys. You might try something like this:
use Encode qw( encode FB_HTMLCREF );
print encode('ascii', $jsonstuff->{items}[0]{snippet}{description}, FB_HTMLCREF), "\n";
You'll get "It&#8217;s" and "moment &#128522;", which are ugly, but should come out right if you put them in an html file.
Re: JSON character encoding
by james28909 (Deacon) on May 29, 2017 at 20:26 UTC
On something I worked on a few weeks back, JSON had some weird output. I downloaded and tried JSON::Parse and it did solve my problems. Maybe give it a try with JSON::Parse's parse_json_safe. Good luck!
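A minimal sketch of that suggestion (parse_json_safe returns undef with a warning instead of throwing an exception on malformed input, if I recall the interface correctly):

```perl
use strict;
use warnings;
use JSON::Parse qw(parse_json_safe);

my $json = '{"items":[{"snippet":{"description":"hello"}}]}';

# parse_json_safe() returns undef (and warns) on malformed input
# rather than dying, so it is convenient for untrusted data.
my $data = parse_json_safe($json);
if (defined $data) {
    print $data->{items}[0]{snippet}{description}, "\n";   # hello
}
```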
Re: JSON character encoding
by locked_user sundialsvc4 (Abbot) on May 30, 2017 at 14:00 UTC
JSON, being a data-exchange format, does not have a concept of “character encodings.” Its purpose is to allow structured data to be correctly transmitted from place to place, particularly in contexts where HTML and the use of “web servers” are assumed. (Secondarily, it was intended to be friendly to JavaScript, so that you can use Global.eval(), as people in fact did in younger and more-innocent days.)
Any time you display or “print” a received data-stream, character encoding does play a part in the process used to create what you see. Most of the time, UTF-8 encoding is assumed, and the software in question watches out for the telltale byte sequences in order to “be helpful” and “do the right thing.” This can get in the way, however, when you are debugging. Sometimes the best thing to do is to dump the relevant parts in hexadecimal:
00000000 50 65 72 6c 20 70 72 6f 67 72 61 6d 0a |Perl program.|
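A dump like the one above takes only a couple of lines of Perl; for example, a tiny helper built on sprintf:

```perl
use strict;
use warnings;

# Print the raw bytes of a string in hex, so you can see exactly
# which encoding you actually received.
sub hexdump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

print hexdump("Perl program\n"), "\n";
# 50 65 72 6c 20 70 72 6f 67 72 61 6d 0a
```

A curly quote received as UTF-8 will show up as e2 80 99; received as Windows-1252 it would be the single byte 92, which settles the encoding question immediately.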
When you subsequently display the data “for real,” if you know that the data is UTF-encoded, you must somehow tell this to the final rendering device. In an HTML data-stream, for instance, this is done by means of a header. The necessity to use things like &#1234; depends entirely on what the device has been told to expect. (Don’t assume that it will “assume.”)
Fine. JSON has a very restricted concept of character encodings.
This statement is true, but it was implied. Therefore, let me clarify my previous statement.
The character encoding of the transferred data must be agreed upon by both the sending and the receiving parties. The specification says that the data can be UTF-encoded, which is to say that the JSON data format has no encoding scheme of its own. Furthermore, the character encoding (or lack thereof) of the transferred data has no bearing on how the data is packaged into syntactically valid, parseable “JSON.” The structural format neither contains nor relies upon Unicode characters; rather, it is agnostic to them. It is equally capable of “sending the necessary bytes to you, correctly,” no matter how you consider the bytes that you received “to have been ‘encoded,’ if at all.”
Thank you for the clarification, “Mom.” That was, indeed, an important point.