G'day Bod,
"I'm processing some JSON files using JSON ..."
If you look down to the "SEE ALSO" section of that documentation, you'll see a series of RFCs: RFC8259 obsoletes RFC7159, which in turn obsoletes RFC4627. I don't know if there's anything newer; in the following, I'm referencing information in RFC8259.
"$data =~ s/.*?\[/\[/; ... seems to be a bit of a fudge!"
As written, I would agree; however, it can be improved. From RFC8259:
    8.1. Character Encoding

    JSON text exchanged between systems that are not part of a closed
    ecosystem MUST be encoded using UTF-8 [RFC3629].

    Previous specifications of JSON have not required the use of UTF-8
    when transmitting JSON text. However, the vast majority of JSON-based
    software implementations have chosen to use the UTF-8 encoding, to the
    extent that it is the only encoding that achieves interoperability.

    Implementations MUST NOT add a byte order mark (U+FEFF) to the
    beginning of a networked-transmitted JSON text. In the interests of
    interoperability, implementations that parse JSON texts MAY ignore the
    presence of a byte order mark rather than treating it as an error.
So, the JSON you're sourcing (with a BOM) is technically invalid; however, it is acceptable to fix that yourself by ignoring (removing) the BOM.
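For the simplest case, here's a minimal sketch of "ignore the BOM, then decode". It assumes the file content has been read as raw, undecoded bytes, and it uses JSON::PP (bundled with Perl since 5.14) rather than whichever JSON module you have installed. The name decode_json_ignoring_bom is made up for illustration; it is not part of any JSON module.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical helper (not part of any JSON module): drop a leading
# UTF-8 encoded BOM from raw bytes, then decode the rest as usual.
sub decode_json_ignoring_bom {
    my ($bytes) = @_;

    # U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
    # \A anchors at the very start, so nothing else is touched.
    $bytes =~ s/\A\xEF\xBB\xBF//;

    # ->utf8 tells the decoder to expect UTF-8 encoded bytes.
    return JSON::PP->new->utf8->decode($bytes);
}
```

If your data has already been decoded to characters (for example, read through an ":encoding(UTF-8)" layer), the BOM is the single character \x{feff} and the substitution would be s/\A\x{feff}// instead.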
2. JSON Grammar ...
Use this grammar specification to formulate your regex for handling BOM removal. Here's some example code; it's primarily intended to show technique, rather than being a specific solution. Enhance, extend, and otherwise adapt to suit your needs. If you're dealing with more than one of these "dodgy" JSON files, consider putting the logic in a module for reuse.
    #!/usr/bin/env perl

    use 5.010;
    use strict;
    use warnings;

    my @json_tests = (
        '', 'crap', '[]', '{}',
        "  []", "\t[]", "\x{feff}[]",
        qq<\x{feff}\t{"k":"v"}>,
    );

    for my $test (@json_tests) {
        _json_chars($test);
        my $clean_json = clean_json($test);
        _json_chars($clean_json);
        say '-' x 40;
    }

    sub clean_json {
        my ($json) = @_;

        return '' unless length $json;

        # $1: optional BOM; $2: optional JSON whitespace, then the start
        # of a JSON value, then everything else.
        # The s modifier lets .* span newlines, so that multi-line JSON
        # is captured whole rather than cut off at the first line.
        state $re = qr{(?xs:
            ^
            (
                (?: \x{feff}| )
            )
            (
                [\x{20}\x{09}\x{0a}\x{0d}]*
                (?: false|null|true|\[|\{|" )
                .*
            )
        )};

        if ($json =~ $re) {
            my ($bom, $text) = ($1, $2);

            if ($bom eq '') {
                say "JSON good as is.";
            }
            else {
                $json = $text;
                say "JSON cleaned -- BOM removed.";
            }
        }
        else {
            say 'Invalid JSON! Nothing cleaned.';
        }

        return $json;
    }

    sub _json_chars {
        my ($json) = @_;

        if (! length $json) {
            say 'Zero-length JSON';
        }
        else {
            say 'JSON chars: ',
                join '-', map sprintf('%x', ord), split //, $json;
        }

        return;
    }
As you can see, I've included a number of tests. Add more to cover your use cases. Here's the output using what's currently there.
    Zero-length JSON
    Zero-length JSON
    ----------------------------------------
    JSON chars: 63-72-61-70
    Invalid JSON! Nothing cleaned.
    JSON chars: 63-72-61-70
    ----------------------------------------
    JSON chars: 5b-5d
    JSON good as is.
    JSON chars: 5b-5d
    ----------------------------------------
    JSON chars: 7b-7d
    JSON good as is.
    JSON chars: 7b-7d
    ----------------------------------------
    JSON chars: 20-20-5b-5d
    JSON good as is.
    JSON chars: 20-20-5b-5d
    ----------------------------------------
    JSON chars: 9-5b-5d
    JSON good as is.
    JSON chars: 9-5b-5d
    ----------------------------------------
    JSON chars: feff-5b-5d
    JSON cleaned -- BOM removed.
    JSON chars: 5b-5d
    ----------------------------------------
    JSON chars: feff-9-7b-22-6b-22-3a-22-76-22-7d
    JSON cleaned -- BOM removed.
    JSON chars: 9-7b-22-6b-22-3a-22-76-22-7d
    ----------------------------------------
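If you do move the logic into a module, you'll probably want a quieter function that returns a result instead of printing. The following is only a sketch of one way to do that, not a finished solution. The name strip_json_bom is hypothetical. It also differs from the example above in two ways, both my own assumptions about what would be useful: it returns undef (rather than the original text) when the text doesn't start like JSON, including for an empty string, and it accepts a leading "-" or digit, because the RFC8259 grammar allows a bare number as a JSON text. Like the example above, it only checks how the text starts; it does not validate the whole document.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper: returns the text minus any leading BOM, or undef
# if what follows the optional BOM does not begin like a JSON value.
sub strip_json_bom {
    my ($json) = @_;

    return undef unless defined $json;

    state $re = qr/
        \A \x{feff}?                        # optional BOM (as a character)
        (
            [\x{20}\x{09}\x{0a}\x{0d}]*     # JSON whitespace
            (?: false | null | true         # literal names
              | \[ | \{ | "                 # array, object, string
              | - | [0-9]                   # number (assumption: wanted here)
            )
            .*                              # the rest; s modifier spans newlines
        )
    /xs;

    return $json =~ $re ? $1 : undef;
}
```

A caller can then tell the cases apart with a plain defined check, and decide for itself whether to warn, die, or carry on.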
— Ken
In reply to Re: Rogue character(s) at start of JSON file
by kcott
in thread Rogue character(s) at start of JSON file
by Bod