
G'day Bod,

"I'm processing some JSON files using JSON ..."

If you look down to the "SEE ALSO" section of that documentation, you'll see a series of RFCs: RFC8259 obsoletes RFC7159, which in turn obsoletes RFC4627. I don't know if there's anything newer; in the following, I'm referencing information in RFC8259.

"$data =~ s/.*?\[/\[/; ... seems to be a bit of a fudge!"

As written, I would agree; however, it can be improved. From RFC8259:

8.1. Character Encoding

JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8 [RFC3629].

Previous specifications of JSON have not required the use of UTF-8 when transmitting JSON text. However, the vast majority of JSON-based software implementations have chosen to use the UTF-8 encoding, to the extent that it is the only encoding that achieves interoperability.

Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.

So, the JSON you're sourcing (with a BOM) is technically invalid; however, it is acceptable to fix that yourself by ignoring (removing) the BOM.
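
For the simplest case, removing the BOM before decoding might look like the sketch below. It makes a few assumptions that are not from your post: the data has been read as raw octets (in which case the UTF-8 BOM appears as the three bytes EF BB BF rather than the single character U+FEFF), the core JSON::PP module is used for decoding, and the helper name decode_json_ignoring_bom is purely hypothetical.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper: takes raw UTF-8 octets (e.g. read from a file
# in binary mode) and returns the decoded Perl data structure.
sub decode_json_ignoring_bom {
    my ($octets) = @_;

    # In undecoded UTF-8, U+FEFF appears as the bytes EF BB BF.
    $octets =~ s/\A\xEF\xBB\xBF//;

    return decode_json($octets);
}

# Simulate a file's contents: BOM bytes followed by a JSON object.
my $raw  = "\xEF\xBB\xBF" . '{"k":"v"}';
my $data = decode_json_ignoring_bom($raw);

print "k => $data->{k}\n";
```

If the data has already been decoded to characters (for instance, read through an :encoding(UTF-8) layer), the BOM is the single character \x{feff} instead, and the substitution would need to match that.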

2. JSON Grammar ...

Use this grammar specification to formulate your regex for handling BOM removal. Here's some example code; it's primarily intended to show technique, rather than being a specific solution. Enhance, extend, and otherwise adapt to suit your needs. If you're dealing with more than one of these "dodgy" JSON files, consider putting the logic in a module for reuse.

#!/usr/bin/env perl

use 5.010;
use strict;
use warnings;

my @json_tests = (
    '',
    'crap',
    '[]',
    '{}',
    "  []",
    "\t[]",
    "\x{feff}[]",
    qq<\x{feff}\t{"k":"v"}>,
);

for my $test (@json_tests) {
    _json_chars($test);
    my $clean_json = clean_json($test);
    _json_chars($clean_json);
    say '-' x 40;
}

sub clean_json {
    my ($json) = @_;

    return '' unless length $json;

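    # The "s" flag lets "." match newlines, so multi-line JSON is
    # captured whole. "-" and digits cover a bare number, which the
    # RFC8259 grammar also allows as a JSON value.
    state $re = qr{(?xs:
        ^
        (
            (?: \x{feff}| )
        )
        (
            [\x{20}\x{09}\x{0a}\x{0d}]*
            (?: false|null|true|\[|\{|"|-|[0-9] )
            .*
        )
    )};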

    if ($json =~ $re) {
        my ($bom, $text) = ($1, $2);

        if ($bom eq '') {
            say "JSON good as is.";
        }
        else {
            $json = $text;
            say "JSON cleaned -- BOM removed.";
        }
    }
    else {
        say 'Invalid JSON! Nothing cleaned.';
    }

    return $json;
}

sub _json_chars {
    my ($json) = @_;

    if (! length $json) {
        say 'Zero-length JSON';
    }
    else {
        say 'JSON chars: ',
            join '-', map sprintf('%x', ord), split //, $json;
    }

    return;
}

As you can see, I've included a number of tests. Add more to cover your use cases. Here's the output using what's currently there.

Zero-length JSON
Zero-length JSON
----------------------------------------
JSON chars: 63-72-61-70
Invalid JSON! Nothing cleaned.
JSON chars: 63-72-61-70
----------------------------------------
JSON chars: 5b-5d
JSON good as is.
JSON chars: 5b-5d
----------------------------------------
JSON chars: 7b-7d
JSON good as is.
JSON chars: 7b-7d
----------------------------------------
JSON chars: 20-20-5b-5d
JSON good as is.
JSON chars: 20-20-5b-5d
----------------------------------------
JSON chars: 9-5b-5d
JSON good as is.
JSON chars: 9-5b-5d
----------------------------------------
JSON chars: feff-5b-5d
JSON cleaned -- BOM removed.
JSON chars: 5b-5d
----------------------------------------
JSON chars: feff-9-7b-22-6b-22-3a-22-76-22-7d
JSON cleaned -- BOM removed.
JSON chars: 9-7b-22-6b-22-3a-22-76-22-7d
----------------------------------------
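
On the earlier suggestion of putting the logic in a module for reuse, here is one way that could be sketched. The package name My::JSON::Clean and the function strip_bom are hypothetical, and the package is declared inline only so the example runs as a single file; normally it would live in its own .pm file. It only strips the BOM and does not do the validity check shown in clean_json above.

```perl
use 5.014;    # for the "package NAME BLOCK" syntax
use strict;
use warnings;

# Hypothetical module name; in practice this package would live in
# its own file (My/JSON/Clean.pm) and be loaded with "use".
package My::JSON::Clean {

    # Remove a single leading BOM (U+FEFF) from decoded JSON text.
    # In list context, also return a flag saying whether one was found.
    sub strip_bom {
        my ($json) = @_;

        my $had_bom = ($json =~ s/\A\x{feff}//) ? 1 : 0;

        return wantarray ? ($json, $had_bom) : $json;
    }
}

for my $text ("[]", "\x{feff}[]", qq<\x{feff}\t{"k":"v"}>) {
    my ($clean, $had_bom) = My::JSON::Clean::strip_bom($text);

    # Print lengths rather than the text itself, so no wide
    # characters are sent to STDOUT.
    printf "%-12s length %d -> %d\n",
        ($had_bom ? 'BOM removed:' : 'unchanged:'),
        length $text, length $clean;
}
```

Returning the flag in list context keeps the reporting ("JSON good as is" versus "BOM removed") in the caller, so the function itself stays silent.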

— Ken

In reply to Re: Rogue character(s) at start of JSON file by kcott
in thread Rogue character(s) at start of JSON file by Bod
