G'day Bod,

"I'm processing some JSON files using JSON ..."

If you look down to the "SEE ALSO" section of that documentation, you'll see a series of RFCs: RFC8259 obsoletes RFC7159, which in turn obsoletes RFC4627. I don't know if there's anything newer; in the following, I'm referencing information in RFC8259.

"$data =~ s/.*?\[/\[/; ... seems to be a bit of a fudge!"

As written, I would agree; however, it can be improved. From RFC8259:

8.1. Character Encoding

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

So, the JSON you're sourcing (with a BOM) is technically invalid; however, it is acceptable to fix that yourself by ignoring (removing) the BOM.
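For the simplest case, where all you want is to drop the BOM and nothing else, something like the following might do. It's only a sketch: strip_bom is a name I've made up for illustration, and I'm assuming the BOM can reach you in either of two forms, depending on whether the file was read as decoded text (the single character U+FEFF) or as raw UTF-8 octets (the three bytes EF BB BF).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): drop a leading BOM, if any.
# Two forms are handled, depending on how the data was read:
#   - the single character U+FEFF (text already decoded from UTF-8)
#   - the three octets EF BB BF   (raw, undecoded UTF-8 bytes)
sub strip_bom {
    my ($json) = @_;
    $json =~ s/\A(?:\x{FEFF}|\xEF\xBB\xBF)//;
    return $json;
}

my $raw_bytes = "\xEF\xBB\xBF" . '{"k":"v"}';   # as read in binary mode
my $decoded   = "\x{FEFF}" . '[1,2,3]';         # as read via :encoding(UTF-8)

print strip_bom($raw_bytes), "\n";   # BOM octets removed
print strip_bom($decoded),   "\n";   # BOM character removed
```

Anchoring with \A means only a BOM at the very start is touched; a U+FEFF further into the text is left alone.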

2. JSON Grammar ...

Use this grammar specification to formulate your regex for handling BOM removal. Here's some example code; it's primarily intended to show technique, rather than being a specific solution. Enhance, extend, and otherwise adapt to suit your needs. If you're dealing with more than one of these "dodgy" JSON files, consider putting the logic in a module for reuse.

    #!/usr/bin/env perl

    use 5.010;
    use strict;
    use warnings;

    my @json_tests = (
        '', 'crap', '[]', '{}',
        "  []", "\t[]", "\x{feff}[]",
        qq<\x{feff}\t{"k":"v"}>,
    );

    for my $test (@json_tests) {
        _json_chars($test);
        my $clean_json = clean_json($test);
        _json_chars($clean_json);
        say '-' x 40;
    }

    # Remove a leading BOM (U+FEFF) when the rest looks like the start of JSON text.
    sub clean_json {
        my ($json) = @_;

        return '' unless length $json;

        # $1: optional BOM.
        # $2: optional JSON whitespace, the first token of a value, then the rest.
        # The /s flag lets .* span newlines, so multi-line JSON is kept whole.
        # Top-level numbers are not covered here; extend the alternation if needed.
        state $re = qr{(?xs:
            ^
            ( (?: \x{feff}| ) )
            (
                [\x{20}\x{09}\x{0a}\x{0d}]*
                (?: false|null|true|\[|\{|" )
                .*
            )
        )};

        if ($json =~ $re) {
            my ($bom, $text) = ($1, $2);

            if ($bom eq '') {
                say "JSON good as is.";
            }
            else {
                $json = $text;
                say "JSON cleaned -- BOM removed.";
            }
        }
        else {
            say 'Invalid JSON! Nothing cleaned.';
        }

        return $json;
    }

    # Debugging aid: print each character's code point in hex.
    sub _json_chars {
        my ($json) = @_;

        if (! length $json) {
            say 'Zero-length JSON';
        }
        else {
            say 'JSON chars: ', join '-', map sprintf('%x', ord), split //, $json;
        }

        return;
    }

As you can see, I've included a number of tests. Add more to cover your use cases. Here's the output using what's currently there.

    Zero-length JSON
    Zero-length JSON
    ----------------------------------------
    JSON chars: 63-72-61-70
    Invalid JSON! Nothing cleaned.
    JSON chars: 63-72-61-70
    ----------------------------------------
    JSON chars: 5b-5d
    JSON good as is.
    JSON chars: 5b-5d
    ----------------------------------------
    JSON chars: 7b-7d
    JSON good as is.
    JSON chars: 7b-7d
    ----------------------------------------
    JSON chars: 20-20-5b-5d
    JSON good as is.
    JSON chars: 20-20-5b-5d
    ----------------------------------------
    JSON chars: 9-5b-5d
    JSON good as is.
    JSON chars: 9-5b-5d
    ----------------------------------------
    JSON chars: feff-5b-5d
    JSON cleaned -- BOM removed.
    JSON chars: 5b-5d
    ----------------------------------------
    JSON chars: feff-9-7b-22-6b-22-3a-22-76-22-7d
    JSON cleaned -- BOM removed.
    JSON chars: 9-7b-22-6b-22-3a-22-76-22-7d
    ----------------------------------------
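To tie this back to reading an actual file, here's a rough sketch of how the cleaning step might sit in front of the decoder. Treat the details as assumptions: load_json_file is a hypothetical name, I've used JSON::PP only because it ships with Perl (core since 5.14), and the throwaway file exists purely so the example runs on its own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use JSON::PP ();

# Hypothetical helper: read a UTF-8 JSON file, tolerate a leading BOM,
# and return the decoded Perl data structure.
sub load_json_file {
    my ($path) = @_;

    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open '$path': $!";
    my $text = do { local $/; <$fh> };   # slurp the whole file
    close $fh;

    $text = '' unless defined $text;
    $text =~ s/\A\x{FEFF}//;             # the file is decoded, so the BOM is one character

    # $text is a character string, so decode() is used without ->utf8.
    return JSON::PP->new->decode($text);
}

# Demonstration: write a multi-line JSON file that starts with a UTF-8 BOM.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xEF\xBB\xBF", qq<{\n  "k": "v"\n}\n>;
close $out;

my $data = load_json_file($path);
print "k = $data->{k}\n";
```

If you'd rather keep using decode_json, which expects UTF-8 octets, read the file in binary mode and strip the three bytes EF BB BF instead of the single character.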

— Ken


In reply to Re: Rogue character(s) at start of JSON file by kcott
in thread Rogue character(s) at start of JSON file by Bod
