
Re: Geo Package files

by soonix (Canon)
on Mar 03, 2022 at 15:33 UTC


in reply to Geo Package files

If I interpret http://www.geopackage.org/spec120/#gpb_format correctly, you would have to unpack it with a format starting with either "nCCN" or "vCCV".

Replies are listed 'Best First'.
Re^2: Geo Package files
by Bod (Parson) on Mar 03, 2022 at 23:36 UTC

    Thanks soonix - that has got me much further :)

    But - I am not understanding how to map the GeoPackage spec to Perl's unpack. I've read perlpacktut although I cannot say I understand much of it. So I've made a start based on your information and after a lot of pondering I think I understand why C is the second and third parts of the template

    I don't understand why we start with n
    The spec says "byte2 magic = 0x4750" - so it's 2 bytes or 16 bits. Why is it n and not v or even s?

    Spec 2 says it is an "8-bit unsigned integer" so C fits the bill.

    I'm guessing we know 3 is unsigned from looking at the bits laid out in the spec - is that about right?

    Now I get completely lost...
    There is an int32 between 3 and 4 so that's N or V - but how can we tell? I've tried unpacking with those template options and get 879493120 and 27700 respectively. I cannot tell from those values which is right.

    Next we have 4 double[] envelope
    As spec 3 returns 00000101 (5), the envelope consists of 6 values 2: envelope is [minx, maxx, miny, maxy, minz, maxz], 48 bytes so I am using d6 in the template. The returned values don't seem to make sense!

    Next is spec 5, the GeoPackageBinaryHeader - I took an educated guess that this is a, but the result didn't seem right, so I've used N as it seems to be another flag byte.

    Finally we have the geometry which I expect to be text of unknown length. So I tried a100 and A100 but they take me back to the gobbledegook and I cannot work out which one I might need to use.

    This is what I have tried...

    use DBI;
    use DBD::SQLite;
    use Data::Dumper;
    use strict;
    use warnings;

    my $dbh = DBI->connect("dbi:SQLite:uri=file:osopenusrn_202203.gpkg?mode=rwc");
    my $tab = $dbh->prepare("SELECT * FROM openUSRN");
    $tab->execute;
    my $n   = $tab->fetchrow_hashref;
    my $geo = $n->{'geometry'};

    # Template under test: magic, version, flags, SRS ID, envelope, then raw bytes
    my @test = unpack "nCCVd6C100", $geo;
    foreach my $t (@test) {
        print "$t\n";
    }

    Am I heading in the right direction here or am I going down a blind alley with my reasoning behind the mapping from the specs to the unpack template?
    Is there a simpler way to understand this mapping and work out the correct template?

      n versus v is determined by the byte order. In the https://libgeos.org/specifications/wkb/ specification mentioned by swl, there is a section on Byte Order which mentions a flag stating the order used, and http://www.geopackage.org/spec120/#gpb_format says this flag is the lowest bit in the "flags" field (second C in the unpack template), so this determines whether to use big-endian (n and N) or little-endian (v and V). Luckily, this flag is a single byte, so that it itself isn't affected by byte-order ;-)

      Looking further at the flag byte, I see the next three bits tell you how many doubles (probably f format, from Perl 5.10 upwards you can distinguish between big-endian "f>" and little-endian "f<") are in the double[] envelope array at the end of the GeoPackageBinaryHeader.
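The flag-byte reading described above can be sketched roughly as follows. This is only an illustration: decode_flags is a made-up helper name, and the bit positions and envelope sizes are taken from my reading of the GeoPackageBinary header table in the spec. One thing worth noting is that the spec's double is an 8-byte value, which corresponds to pack's d (with "d<" / "d>") rather than f, which is a 4-byte single-precision float.

```perl
use strict;
use warnings;

# Number of doubles in the envelope for each envelope indicator code.
# Codes 5-7 are treated as invalid here (assumption from the spec table).
my @ENVELOPE_DOUBLES = (0, 4, 6, 6, 8);

# decode_flags() is a hypothetical helper, not part of any module.
sub decode_flags {
    my ($flags) = @_;
    my $little   = $flags & 0x01;          # bit 0: byte order (1 = little-endian)
    my $env_code = ($flags >> 1) & 0x07;   # bits 1-3: envelope indicator
    my $empty    = ($flags >> 4) & 0x01;   # bit 4: empty-geometry flag
    die "invalid envelope code $env_code\n" if $env_code > 4;
    return {
        little_endian    => $little,
        envelope_doubles => $ENVELOPE_DOUBLES[$env_code],
        empty            => $empty,
        int32_template   => $little ? 'V'  : 'N',
        double_template  => $little ? 'd<' : 'd>',
    };
}

my $info = decode_flags(0x05);
printf "%s-endian, %d envelope doubles\n",
    ($info->{little_endian} ? 'little' : 'big'), $info->{envelope_doubles};
```

For a flag byte of 5 (00000101) this reports little-endian with 6 envelope doubles, so the rest of the header would be read with V and d<.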

      Sorry I'm not really into this, so I can't dive very deep into it. Depending on what information comes up in the discussion, it might trigger me to go looking deeper, but don't bet on it :-)
        Yes, this format tells us at the beginning how to interpret the following bytes. In my post here, we need to know the value of YY, the critical flag byte. I am not sure exactly what that byte says.
      I don't understand exactly what your binary BLOB (Binary Large OBject) contains. A simple "print $geo" doesn't mean much to me. I would start by making a hex dump of $geo. Something along the lines of: join ' ', unpack '(H2)*', $geo; ... I would put either 10 or 16 bytes per line in the dump. It could be that some option on Data::Dump would work?
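A hex dump with 16 bytes per line might look something like the sketch below. hex_dump is a hypothetical helper name, and the $geo value here is stand-in data so the snippet runs on its own; in the real script it would be the geometry column from the database.

```perl
use strict;
use warnings;

# hex_dump() is a hypothetical helper: offset, then up to $per_line bytes in hex.
sub hex_dump {
    my ($data, $per_line) = @_;
    $per_line ||= 16;
    my @lines;
    for (my $off = 0; $off < length $data; $off += $per_line) {
        my $chunk = substr $data, $off, $per_line;
        push @lines, sprintf '%04x  %s', $off, join ' ', unpack '(H2)*', $chunk;
    }
    return join "\n", @lines;
}

# Stand-in bytes only: "GP", a version byte, a flag byte, then padding.
my $geo = "GP\x00\x05" . ("\x00" x 16);
print hex_dump($geo), "\n";
```

Calling hex_dump(substr($geo, 0, 16)) would restrict the dump to just the header bytes asked for.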

      You should wind up with something like: 47 50 XX YY... 0x47 means "G" and 0x50 means "P". The next byte looks like a version number - probably doesn't mean much to you. The very next flag byte means a lot (emphasis added) and we need to know what that is in order to understand what "double[] envelope;" means.

      My code for dealing with binary headers like this usually has a substr() to select a range of bytes then it applies the appropriate unpack template upon that subset of bytes. I suspect that some relatively straightforward, special purpose code will be able to decode your specific blobs. Decoding this binary geo format in a general sense appears to be a "non-trivial" task. Your code will probably wind up depending upon the flag byte being a particular value - your code being specific to decoding segments with that particular set of flags.
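As a rough sketch of that substr()-then-unpack approach, assuming the header layout from the GeoPackage spec (2-byte magic, version byte, flag byte, 4-byte SRS ID, then the envelope): parse_gpkg_header is a hypothetical name, and the blob is synthetic, built with pack purely so the snippet runs without the database.

```perl
use strict;
use warnings;

# Doubles in the envelope per envelope indicator code (0..4).
my @ENV_DOUBLES = (0, 4, 6, 6, 8);

# parse_gpkg_header() is a hypothetical name; it reads only the header
# and reports where the geometry payload begins.
sub parse_gpkg_header {
    my ($blob) = @_;

    # Fixed part: 2-byte magic, version byte, flag byte.
    my ($magic, $version, $flags) = unpack 'a2 C C', substr($blob, 0, 4);
    die "not a GeoPackage geometry blob\n" unless $magic eq 'GP';

    my $little   = $flags & 0x01;
    my $env_code = ($flags >> 1) & 0x07;
    die "invalid envelope code $env_code\n" if $env_code > 4;

    # The byte order is only known now, so the templates are chosen afterwards.
    my $srs_id = unpack(($little ? 'V' : 'N'), substr($blob, 4, 4));

    my $n        = $ENV_DOUBLES[$env_code];
    my $tmpl     = $little ? "(d<)$n" : "(d>)$n";
    my @envelope = $n ? unpack($tmpl, substr($blob, 8, 8 * $n)) : ();

    return {
        version    => $version,
        srs_id     => $srs_id,
        envelope   => \@envelope,
        wkb_offset => 8 + 8 * $n,
    };
}

# Synthetic little-endian blob with a 4-double envelope (flags = 0x03).
my $blob = pack('a2 C C V (d<)4', 'GP', 0, 0x03, 27700, 1, 2, 3, 4) . 'WKB-PAYLOAD';
my $h = parse_gpkg_header($blob);
print "SRS $h->{srs_id}, envelope @{ $h->{envelope} }, payload at byte $h->{wkb_offset}\n";
```

The returned wkb_offset is where a second substr() would start in order to hand the remaining bytes to whatever decodes the geometry itself.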

      The code to decode the bulk of the BLOB depends critically upon knowing what the flag byte is. Let's see the first 16 bytes of this $geo blob in hex dump format...

      Also don't forget the ord() function. print "Got G!\n" if ( ord(substr($geo,0,1)) == 0x47 ); I think will work.

      I have no idea of how many BLOB's you need to decode or what the performance implications are. I would try to get something functionally working and then worry about performance tweaks later. It could very well be that such tweaks are not necessary.

        Thank you Marshall

        From print join ' ', unpack '(H2)*', $geo; I have much more sensible output to work with. Here are the first 16 hex 'characters'

        47 50 00 05 34 6c 00 00 50 b8 1e c5 2c 75 23 41
        As per the spec, the first two are 0x4750. The version number is 0 which means version 1 which is what I would expect.

        Next comes 05 - or as I said previously, this is 00000101

        The bits correspond to RRXYEEEB - where EEE defines the envelope. So that is 010 = 2 which gives the envelope as being 2: envelope is [minx, maxx, miny, maxy, minz, maxz], 48 bytes. Also from this, B = 1 tells us that it's Little Endian.

        The SRSID I worked out late last evening using 'V' to unpack it. It is 27700 which is the SRSID for OSGB 1936. That passes the sanity check as the data I am decoding comes from the Ordnance Survey.
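The N versus V question can be checked directly against the four SRS bytes from the dump (34 6c 00 00). A small sketch, using only core pack/unpack:

```perl
use strict;
use warnings;

# Bytes 4..7 of the dump: 34 6c 00 00
my $srs_bytes = pack 'H8', '346c0000';

my $big    = unpack 'N', $srs_bytes;   # big-endian interpretation
my $little = unpack 'V', $srs_bytes;   # little-endian interpretation

print "N (big-endian):    $big\n";     # 879493120
print "V (little-endian): $little\n";  # 27700
```

Since the flag byte says little-endian, V is the one to trust, which agrees with 27700. The spec types this field as a signed int32, so "l<" / "l>" may be a stricter fit than V / N if negative IDs can occur; for 27700 the result is the same either way.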

        This is where I got lost...
        The 6 components of the envelope I tried to unpack with 'd6' and got:

        637590.385 642426.601 309577.58 310361.391000001 0 0
        The last two seem reasonable as I am not expecting height data although I would not be surprised if it were included. However, the first and third give a location inland from the east coast (near Norwich if you know your UK geography) and not off the west coast as I would expect for a minimum bounding coordinate.

        If I use '(f<)6' I get even less sensible results:

        -2539.51953125 10.2161064147949 8.48770014272304e-08 10.2253313064575 126443847680 9.18094444274902
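The difference between the two templates can be seen on the first eight envelope bytes from the dump (50 b8 1e c5 2c 75 23 41). As a sketch: "d<" reads them as one 8-byte little-endian double, while "f<" treats them as two separate 4-byte floats, which would explain the odd-looking numbers.

```perl
use strict;
use warnings;

# Bytes 8..15 of the dump: 50 b8 1e c5 2c 75 23 41
my $minx_bytes = pack 'H16', '50b81ec52c752341';

my ($as_double) = unpack 'd<', $minx_bytes;      # one 8-byte double
my @as_floats   = unpack '(f<)2', $minx_bytes;   # two 4-byte floats

printf "d<    : %.3f\n", $as_double;
printf "(f<)2 : %s %s\n", @as_floats;
```

On a little-endian machine plain d6 and "(d<)6" give the same answer; spelling out "d<" just makes the template independent of the machine running it.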

        I have no idea of how many BLOB's you need to decode or what the performance implications are. I would try to get something functionally working and then worry about performance tweaks later. It could very well be that such tweaks are not necessary

        There are 1.4 million BLOBs that need decoding! However, the decode process will be done once every few months (the dataset is updated monthly but doesn't change massively). The decoding process is not time critical. If it needs to run overnight then so be it.

        A useful technique for dumping binary data is to use the "v" flag in sprintf. Eg:

        % perl -E 'say sprintf "%v02X", "foo\x{0}"'
        66.6F.6F.00
        See the section on "vector flag" in sprintf.
        You should wind up with something like: 47 50 XX YY... 0x47 means "G" and 0x50 means "P"

        Just as a general question...
        Why would the file specification require 0x4750 at the start of the file?
        What does it add?

        Is it just there so that any processor of the file can fail quickly if it is passed a file that doesn't start with these two bytes or is there more to it than that?
