Re^4: Rogue character(s) at start of JSON file (BOM; dumping references)

> plz use Devel::Peek to find out if it's properly encoded and show us the result here

I've not come across Devel::Peek before, let alone used it - so please bear with me if this is not right...

#!/usr/bin/perl

use CGI::Carp qw(fatalsToBrowser);

use strict;
use warnings;

use Site::Utils;
use JSON;
use Devel::Peek;

print "Content-type: text/plain\n\n";

open my $fh, '<', '../data/publicextract.charity.json' or die "Unable to read Charity JSON File";
my $data = <$fh>;

print "$data\n\n";

open STDERR, ">", 'output.txt' or die $!;

print STDERR "Before\n";
Dump ($data);

$data =~ s/^\x{feff}//;  # Strip off BOM

print STDERR "\n\nAfter\n";
Dump ($data);
exit;

This gives this output...

Before
SV = PV(0x1569cf0) at 0x15877a0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x1c56920 "\357\273\277[{\"date_of_extract\":\"2023-01-16T00:00:00\",\"organisation_number\":1,\"registered_charity_number\":200027,\"linked_charity_number\":1,\"charity_name\":\"POTTERNE MISSION ROOM AND TRUST\",\"charity_type\":null,\"charity_registration_status\":\"Removed\",\"date_of_registration\":\"1962-05-17T00:00:00\",\"date_of_removal\":\"2014-04-16T00:00:00\",\"charity_reporting_status\":null,\"latest_acc_fin_period_start_date\":null,\"latest_acc_fin_period_end_date\":null,\"latest_income\":null,\"latest_expenditure\":null,\"charity_contact_address1\":null,\"charity_contact_address2\":null,\"charity_contact_address3\":null,\"charity_contact_address4\":null,\"charity_contact_address5\":null,\"charity_contact_postcode\":null,\"charity_contact_phone\":null,\"charity_contact_email\":null,\"charity_contact_web\":null,\"charity_company_registration_number\":null,\"charity_insolvent\":false,\"charity_in_administration\":false,\"charity_previously_excepted\":null,\"charity_is_cdf_or_cif\":null,\"charity_is_cio\":null,\"cio_is_dissolved\":null,\"date_cio_dissolution_notice\":null,\"charity_activities\":null,\"charity_gift_aid\":null,\"charity_has_land\":null}\r\n"\0
  CUR = 1082
  LEN = 1122


After
SV = PV(0x1569cf0) at 0x15877a0
  REFCNT = 1
  FLAGS = (PADMY,POK,pPOK)
  PV = 0x1c56920 "\357\273\277[{\"date_of_extract\":\"2023-01-16T00:00:00\",\"organisation_number\":1,\"registered_charity_number\":200027,\"linked_charity_number\":1,\"charity_name\":\"POTTERNE MISSION ROOM AND TRUST\",\"charity_type\":null,\"charity_registration_status\":\"Removed\",\"date_of_registration\":\"1962-05-17T00:00:00\",\"date_of_removal\":\"2014-04-16T00:00:00\",\"charity_reporting_status\":null,\"latest_acc_fin_period_start_date\":null,\"latest_acc_fin_period_end_date\":null,\"latest_income\":null,\"latest_expenditure\":null,\"charity_contact_address1\":null,\"charity_contact_address2\":null,\"charity_contact_address3\":null,\"charity_contact_address4\":null,\"charity_contact_address5\":null,\"charity_contact_postcode\":null,\"charity_contact_phone\":null,\"charity_contact_email\":null,\"charity_contact_web\":null,\"charity_company_registration_number\":null,\"charity_insolvent\":false,\"charity_in_administration\":false,\"charity_previously_excepted\":null,\"charity_is_cdf_or_cif\":null,\"charity_is_cio\":null,\"cio_is_dissolved\":null,\"date_cio_dissolution_notice\":null,\"charity_activities\":null,\"charity_gift_aid\":null,\"charity_has_land\":null}\r\n"\0
  CUR = 1082
  LEN = 1122

Does that help?

UPDATE:

I've realised that because I am reading just the first line of the JSON file, it is malformed as it doesn't have the trailing ']' character. However, I have added $data .= ']'; to manually add it back on. This still doesn't solve the BOM issue at the start of the file but it might complicate testing...
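Rather than reading one line and patching the ']' back on, the whole file can be slurped in one go. What follows is only a minimal sketch under some assumptions: it uses the core JSON::PP module in place of JSON so that it runs on its own, the helper name load_json_file and the demo file name are made up for illustration, and the BOM is matched as its three raw bytes because the file is read without any decoding.

```perl
use strict;
use warnings;
use JSON::PP;   # core module; assumed here in place of JSON so the sketch is self-contained

# Hypothetical helper: read a whole file as raw bytes, drop a leading
# UTF-8 BOM if there is one, and decode the JSON.
sub load_json_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Unable to read $path: $!";
    my $bytes = do { local $/; <$fh> };   # undefined $/ slurps the entire file
    close $fh;
    $bytes =~ s/^\xEF\xBB\xBF//;          # the BOM as three bytes; no-op if absent
    return decode_json($bytes);           # decode_json expects UTF-8 encoded bytes
}

# Demo with a throwaway file (the file name is made up for this example).
my $demo = 'bom_demo.json';
open my $out, '>:raw', $demo or die "Cannot write $demo: $!";
print {$out} qq(\xEF\xBB\xBF[{"charity_name":"A"},\r\n{"charity_name":"B"}]\r\n);
close $out;

my $records = load_json_file($demo);
printf "%d records, first name: %s\n", scalar @$records, $records->[0]{charity_name};
unlink $demo;
```

Because the substitution simply does nothing when no BOM is present, the same helper should be usable on files with and without one.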

Re^5: Rogue character(s) at start of JSON file (BOM; dumping references) by LanX (Saint) on Jan 19, 2023 at 18:12 UTC
> Does that help?

It doesn't seem to be properly UTF-8 encoded and, as you already said, the JSON is malformed because you didn't read it completely.

use v5.12;
use warnings;
use Devel::Peek;
use utf8;

my $str = "\x{FEFF}['what','ever']";
$str =~ s/^\x{feff}//;
Dump($str);

SV = PV(0xe9cea8) at 0x24f1008
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x2565b28 "['what','ever']"\0 [UTF8 "['what','ever']"]
  CUR = 15
  LEN = 24

So you should take care to read the data properly.

Cheers Rolf
(addicted to the 𐍀𐌴𐍂𐌻 Programming Language :) Wikisyntax for the Monastery
Re^6: Rogue character(s) at start of JSON file (BOM; dumping references) by Bod (Parson) on Jan 19, 2023 at 18:26 UTC
But `\357\273\277` at the start of the file is not removed with `$str =~ s/^\x{feff}//;` ...and I cannot be sure that `\357\273\277` will be at the start of all the JSON files. Or is it safe to assume that? I suspect not!
Re^7: Rogue character(s) at start of JSON file (BOM; dumping references) by pryrt (Abbot) on Jan 19, 2023 at 18:49 UTC
> not removed with `$str =~ s/^\x{feff}//;`

Compare how the behavior changes when you use `open my $fh, '<:encoding(UTF-8)', '../data/publicextract.charity.json' or die "Unable to read Charity JSON File";` instead of the `open` line you currently use.

If you want perl to treat the bytes in the file as UTF-8, and thus be able to use `s/^\x{feff}//`, you have to tell perl to read the file as UTF-8¹. If you want perl to continue to read the file as a series of bytes (not using the UTF-8 encoding), then leave your `open` as-is, and have your regex instead search for the three bytes, either in octal with `s/^\357\273\277//` or in hex with `s/^\xEF\xBB\xBF//`.

#!perl
use 5.012; # strict, //
use warnings;
use Devel::Peek;

open my $fo, '>:raw', 'threebytes.bin';
print {$fo} "\xEF\xBB\xBF";
close $fo;

open my $fbytes, '<', 'threebytes.bin';
Dump($_ = <$fbytes>);
printf "length no-encoding: %d bytes\n", length($_);
printf "match no-encoding 3bytes? %s\n", m/^\xEF\xBB\xBF/ ? 'match' : 'nope';
printf "match no-encoding unicode? %s\n", m/^\x{FEFF}/ ? 'match' : 'nope';
close $fbytes;

open my $futf8, '<:encoding(UTF-8)', 'threebytes.bin';
Dump($_ = <$futf8>);
printf "length utf8: %d characters\n", length($_);
printf "match utf8 3bytes? %s\n", m/^\xEF\xBB\xBF/ ? 'match' : 'nope';
printf "match utf8 unicode? %s\n", m/^\x{FEFF}/ ? 'match' : 'nope';
close $futf8;
__END__
SV = PV(0x6ac038) at 0xb3ebe0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0xb4d308 "\357\273\277"\0
  CUR = 3
  LEN = 81
SV = PV(0x6ac038) at 0xb3ebe0
  REFCNT = 1
  FLAGS = (POK,pPOK,UTF8)
  PV = 0x3442c98 "\357\273\277"\0 [UTF8 "\x{feff}"]
  CUR = 3
  LEN = 10
length no-encoding: 3 bytes
match no-encoding 3bytes? match
match no-encoding unicode? nope
length utf8: 1 characters
match utf8 3bytes? nope
match utf8 unicode? match

¹: or, not shown, use `Encode::decode('UTF-8', $octets)` from Encode
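The `Encode::decode` route mentioned in the footnote might look roughly like this. It is a sketch, not a definitive implementation: the helper name decode_without_bom and the sample strings are made up for illustration. Since the substitution is a no-op when the string does not start with U+FEFF, it should be safe to apply to files that may or may not carry a BOM.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn UTF-8 octets into a character string and
# remove a leading BOM (U+FEFF) if one is present.
sub decode_without_bom {
    my ($octets) = @_;
    my $text = decode('UTF-8', $octets);  # octets -> characters
    $text =~ s/^\x{FEFF}//;               # now a single character; no-op if absent
    return $text;
}

# Made-up sample data: the same JSON text with and without a BOM.
my $with_bom    = "\xEF\xBB\xBF" . '["what","ever"]';
my $without_bom = '["what","ever"]';

for my $octets ($with_bom, $without_bom) {
    my $text = decode_without_bom($octets);
    printf "%d octets in, %d characters out: %s\n",
        length($octets), length($text), $text;
}
```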
Re^8: Rogue character(s) at start of JSON file (BOM; dumping references) by LanX (Saint) on Jan 19, 2023 at 19:19 UTC
Re^8: Rogue character(s) at start of JSON file (BOM; dumping references) by Bod (Parson) on Jan 19, 2023 at 19:47 UTC
Re^9: Rogue character(s) at start of JSON file (BOM; dumping references) by LanX (Saint) on Jan 19, 2023 at 19:57 UTC
Re^9: Rogue character(s) at start of JSON file (BOM; dumping references) by pryrt (Abbot) on Jan 19, 2023 at 20:08 UTC
Re^7: Rogue character(s) at start of JSON file (BOM; dumping references) by LanX (Saint) on Jan 19, 2023 at 19:41 UTC
Perl has two ways to represent strings: without the UTF-8 flag, as "octet streams" (i.e. a list of bytes), and with the UTF-8 flag, as "characters" in the internal representation°.

`\x{FEFF}` represents the Unicode character with the code point U+FEFF. Since Devel::Peek shows that the flag is missing, this character can't be found in the octet stream while replacing.

You need to tell Perl how to interpret the data it reads; the fact that it's "bytewise UTF-8" alone doesn't make Perl see it as a list of characters.

The `use utf8;` in my example just told Perl to read the script's source and all embedded literal strings as UTF-8.

See Encode for more.

Cheers Rolf

°) which is almost UTF-8, hence the flag is - for historical reasons - a bit of a misnomer
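The difference between the two representations can be seen without Devel::Peek as well. This is only an illustrative sketch: the variable names are made up, and `utf8::is_utf8` is used as a rough stand-in for the UTF8 flag that `Dump` reports.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xEF\xBB\xBF";            # what a plain '<' open hands back
my $chars  = decode('UTF-8', $octets);  # the same data seen as characters

for my $pair ([octets => $octets], [chars => $chars]) {
    my ($name, $str) = @$pair;
    printf "%-6s length=%d  UTF-8 flag=%s  matches \\x{FEFF}: %s\n",
        $name, length($str),
        utf8::is_utf8($str) ? 'on'  : 'off',
        $str =~ /^\x{FEFF}/ ? 'yes' : 'no';
}

# Going back the other way yields the original three bytes again.
print "round trip ok\n" if encode('UTF-8', $chars) eq $octets;
```

The octet string has length 3 and does not match `\x{FEFF}`, while the decoded string has length 1 and does, which is the same picture the Devel::Peek dumps in this thread give.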