in reply to Rogue character(s) at start of JSON file

G'day Bod,

"I'm processing some JSON files using JSON ..."

If you look down to the "SEE ALSO" section of that documentation, you'll see a series of RFCs: RFC8259 obsoletes RFC7159, which in turn obsoletes RFC4627. I don't know if there's anything newer; in the following, I'm referencing information in RFC8259.

"$data =~ s/.*?\[/\[/; ... seems to be a bit of a fudge!"

As written, I would agree; however, it can be improved. From RFC8259:

    8.1. Character Encoding

    JSON text exchanged between systems that are not part of a closed
    ecosystem MUST be encoded using UTF-8 [RFC3629].

    Previous specifications of JSON have not required the use of UTF-8
    when transmitting JSON text. However, the vast majority of JSON-based
    software implementations have chosen to use the UTF-8 encoding, to the
    extent that it is the only encoding that achieves interoperability.

    Implementations MUST NOT add a byte order mark (U+FEFF) to the
    beginning of a networked-transmitted JSON text. In the interests of
    interoperability, implementations that parse JSON texts MAY ignore
    the presence of a byte order mark rather than treating it as an error.

So, the JSON you're sourcing (with a BOM) is technically invalid; however, it is acceptable to fix that yourself by ignoring (removing) the BOM.
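As a minimal sketch of "ignoring" the BOM, assuming the JSON is already in a Perl scalar: what the BOM looks like depends on whether the string has been decoded. In a decoded (character) string it is the single character U+FEFF; in raw UTF-8 octets it is the three bytes EF BB BF. The sub name strip_bom is my own invention for illustration; it doesn't come from any module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not from any module): remove one leading BOM,
# whether the string holds decoded characters or raw UTF-8 octets.
sub strip_bom {
    my ($json) = @_;
    $json =~ s/\A(?:\x{FEFF}|\xEF\xBB\xBF)//;
    return $json;
}

# %vx prints the ordinal of each character in hex, joined by dots.
printf "%vx\n", strip_bom("\x{FEFF}[]");        # 5b.5d
printf "%vx\n", strip_bom("\xEF\xBB\xBF[]");    # 5b.5d
printf "%vx\n", strip_bom("[]");                # 5b.5d
```

This only strips the BOM; it makes no attempt to check that what follows is plausible JSON.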

2. JSON Grammar ...

Use this grammar specification to formulate your regex for handling BOM removal. Here's some example code; it's primarily intended to show technique, rather than being a specific solution. Enhance, extend, and otherwise adapt to suit your needs. If you're dealing with more than one of these "dodgy" JSON files, consider putting the logic in a module for reuse.

    #!/usr/bin/env perl

    use 5.010;
    use strict;
    use warnings;

    my @json_tests = (
        '', 'crap', '[]', '{}', "  []", "\t[]",
        "\x{feff}[]", qq<\x{feff}\t{"k":"v"}>,
    );

    for my $test (@json_tests) {
        _json_chars($test);
        my $clean_json = clean_json($test);
        _json_chars($clean_json);
        say '-' x 40;
    }

    sub clean_json {
        my ($json) = @_;

        return '' unless length $json;

        # $1: optional BOM. $2: optional JSON whitespace, the start of a
        # JSON value, then the rest of the text. The /s flag lets .* span
        # newlines, so multi-line JSON is not truncated at the first line.
        # Numbers are not covered here; extend the alternation if needed.
        state $re = qr{(?xs:
            ^
            ( (?: \x{feff}| ) )
            (
                [\x{20}\x{09}\x{0a}\x{0d}]*
                (?: false|null|true|\[|\{|" )
                .*
            )
        )};

        if ($json =~ $re) {
            my ($bom, $text) = ($1, $2);

            if ($bom eq '') {
                say "JSON good as is.";
            }
            else {
                $json = $text;
                say "JSON cleaned -- BOM removed.";
            }
        }
        else {
            say 'Invalid JSON! Nothing cleaned.';
        }

        return $json;
    }

    sub _json_chars {
        my ($json) = @_;

        if (! length $json) {
            say 'Zero-length JSON';
        }
        else {
            say 'JSON chars: ', join '-', map sprintf('%x', ord), split //, $json;
        }

        return;
    }

As you can see, I've included a number of tests. Add more to cover your use cases. Here's the output using what's currently there.

    Zero-length JSON
    Zero-length JSON
    ----------------------------------------
    JSON chars: 63-72-61-70
    Invalid JSON! Nothing cleaned.
    JSON chars: 63-72-61-70
    ----------------------------------------
    JSON chars: 5b-5d
    JSON good as is.
    JSON chars: 5b-5d
    ----------------------------------------
    JSON chars: 7b-7d
    JSON good as is.
    JSON chars: 7b-7d
    ----------------------------------------
    JSON chars: 20-20-5b-5d
    JSON good as is.
    JSON chars: 20-20-5b-5d
    ----------------------------------------
    JSON chars: 9-5b-5d
    JSON good as is.
    JSON chars: 9-5b-5d
    ----------------------------------------
    JSON chars: feff-5b-5d
    JSON cleaned -- BOM removed.
    JSON chars: 5b-5d
    ----------------------------------------
    JSON chars: feff-9-7b-22-6b-22-3a-22-76-22-7d
    JSON cleaned -- BOM removed.
    JSON chars: 9-7b-22-6b-22-3a-22-76-22-7d
    ----------------------------------------
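To tie this back to actually parsing the data, here's a rough end-to-end sketch. It assumes the file contents were read as raw octets (so the BOM appears as the three bytes EF BB BF rather than as the character U+FEFF), and it uses the core JSON::PP module as a stand-in for whichever JSON module you're using. The sample data is made up for the demonstration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Made-up sample: raw UTF-8 octets as they might come from a file read
# with the :raw layer. The leading EF BB BF is the UTF-8 encoded BOM.
my $bytes = qq<\xEF\xBB\xBF\t{"k":"v"}>;

# Drop the BOM octets before handing the text to the parser.
(my $clean = $bytes) =~ s/\A\xEF\xBB\xBF//;

# decode_json expects UTF-8 encoded octets and returns a Perl structure.
my $data = decode_json($clean);
print "k => $data->{k}\n";    # k => v
```

If you decode the file to characters first, strip \x{feff} instead and use the object interface without the utf8 option, i.e. JSON::PP->new->decode(...).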

— Ken