in reply to Re^2: Geo Package files
in thread Geo Package files

I don't understand exactly what your binary BLOB (Binary Large OBject) contains. A plain "print $geo" doesn't tell me much. I would start by making a hex dump of $geo, something along the lines of: print join ' ', unpack '(H2)*', $geo; I would put either 10 or 16 bytes per line in the dump. Some option of Data::Dump might also do the job.
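
A minimal sketch of such a dump, with a configurable number of bytes per line. The helper name hex_dump is made up for illustration, and the sample bytes are the 16 that Bod quotes in his reply further down this thread; substitute your own $geo.

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of a binary string, $per_line bytes per row.
sub hex_dump {
    my ($data, $per_line) = @_;
    $per_line ||= 16;
    my @rows;
    for (my $off = 0; $off < length $data; $off += $per_line) {
        my $chunk = substr $data, $off, $per_line;
        # Offset in hex, then each byte as two hex digits.
        push @rows, sprintf('%04x  %s', $off, join(' ', unpack('(H2)*', $chunk)));
    }
    return join "\n", @rows;
}

# Sample data: the first 16 bytes quoted later in this thread.
my $geo = pack 'H*', '47500005346c000050b81ec52c752341';
print hex_dump($geo, 8), "\n";
```

The leading offset column is optional, but it makes it easier to refer to a field by its byte position.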

You should wind up with something like: 47 50 XX YY... 0x47 means "G" and 0x50 means "P". The next byte looks like a version number - probably doesn't mean much to you. The very next flag byte means a lot (emphasis added) and we need to know what that is in order to understand what "double[] envelope;" means.

My code for dealing with binary headers like this usually uses substr() to select a range of bytes and then applies the appropriate unpack template to that subset. I suspect that some relatively straightforward, special-purpose code will be able to decode your specific blobs. Decoding this binary geo format in the general case appears to be a "non-trivial" task. Your code will probably wind up depending upon the flag byte having a particular value, i.e. it will be specific to decoding segments with that particular set of flags.
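
A sketch of that substr-then-unpack pattern, applied to the first four bytes only (two magic bytes, one version byte, one flag byte). The sample bytes are the ones Bod quotes later in this thread; the variable names are just for illustration.

```perl
use strict;
use warnings;

# Sample header bytes (the 16 bytes quoted later in this thread).
my $geo = pack 'H*', '47500005346c000050b81ec52c752341';

# Select a byte range with substr, then apply a template to just that range.
my $magic   = unpack 'a2', substr($geo, 0, 2);   # two raw bytes, expected "GP"
my $version = unpack 'C',  substr($geo, 2, 1);   # unsigned 8-bit integer
my $flags   = unpack 'C',  substr($geo, 3, 1);   # the flag byte

die "Not a GeoPackage blob\n" unless $magic eq 'GP';
printf "magic=%s version=%d flags=0x%02x (%08b)\n",
    $magic, $version, $flags, $flags;
```

Keeping each field as its own substr/unpack pair is more verbose than one long template, but it makes it easy to branch on the flag byte before deciding how to decode the rest.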

The code to decode the bulk of the BLOB depends critically upon knowing what the flag byte is. Let's see the first 16 bytes of this $geo blob in hex dump format...

Also, don't forget the ord() function. I think this will work: print "Got G!\n" if ord(substr($geo,0,1)) == 0x47;

I have no idea how many BLOBs you need to decode or what the performance implications are. I would try to get something functionally working and then worry about performance tweaks later. It could very well be that such tweaks are not necessary.

Re^4: Geo Package files
by Bod (Parson) on Mar 05, 2022 at 00:58 UTC

    Thank you Marshall

    From print join ' ', unpack '(H2)*', $geo; I have much more sensible output to work with. Here are the first 16 bytes in hex:

    47 50 00 05 34 6c 00 00 50 b8 1e c5 2c 75 23 41
    As per the spec, the first two are 0x4750. The version number is 0 which means version 1 which is what I would expect.

    Next comes 05 - or as I said previously, this is 00000101

    The bits correspond to RRXYEEEB - where EEE defines the envelope. So that is 010 = 2 which gives the envelope as being 2: envelope is [minx, maxx, miny, maxy, minz, maxz], 48 bytes. Also from this, B = 1 tells us that it's Little Endian.
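
    The same reading of the flag byte can be sketched with bit masks. Only the B and EEE fields discussed here are decoded. The table of envelope sizes is my reading of the spec and is an assumption apart from code 2 (48 bytes), which is the case seen in this data.

    ```perl
    use strict;
    use warnings;

    my $flags = 0x05;                           # flag byte from the dump above

    my $little_endian = $flags & 0x01;          # B: lowest bit
    my $envelope_code = ($flags >> 1) & 0x07;   # EEE: the next three bits

    # Envelope size in bytes per code (assumed from the spec; only code 2
    # is confirmed by the data in this thread).
    my %envelope_bytes = (0 => 0, 1 => 32, 2 => 48, 3 => 48, 4 => 64);

    printf "little endian: %s, envelope code: %d, envelope bytes: %d\n",
        ($little_endian ? 'yes' : 'no'), $envelope_code,
        $envelope_bytes{$envelope_code};
    ```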

    The SRSID I worked out late last evening using 'V' to unpack it. It is 27700 which is the SRSID for OSGB 1936. That passes the sanity check as the data I am decoding comes from the Ordnance Survey.

    This is where I got lost...
    The 6 components of the envelope I tried to unpack with 'd6' and got:

    637590.385 642426.601 309577.58 310361.391000001 0 0
    The last two seem reasonable as I am not expecting height data although I would not be surprised if it were included. However, the first and third give a location inland from the east coast (near Norwich if you know your UK geography) and not off the west coast as I would expect for a minimum bounding coordinate.

    If I use '(f<)6' I get even less sensible results:

    -2539.51953125 10.2161064147949 8.48770014272304e-08 10.2253313064575 126443847680 9.18094444274902

    I have no idea of how many BLOB's you need to decode or what the performance implications are. I would try to get something functionally working and then worry about performance tweaks later. It could very well be that such tweaks are not necessary

    There are 1.4 million BLOBs that need decoding! However, the decode process will be done once every few months (the dataset is updated monthly but doesn't change massively). The decoding process is not time critical. If it needs to run overnight then so be it.

      Well, it appears to me that something like this will work:
      This geo thing is a complex format in its general case.

      byte 1    = "G"
      byte 2    = "P"
      byte 3    = 0   # means version 1 -> probably doesn't matter
      byte 4    = 5   # flags: little endian,
                      # 6 envelope values of 8 bytes each (64-bit doubles),
                      # 6x8 = 48 bytes
      byte 5-8  =     # srs_id = bytes 34 6C 00 00, little endian,
                      # i.e. 0x00006C34 = 27700
                      # = Unique identifier for each
                      # Spatial Reference System within a GeoPackage
                      # I have no idea what that means?
      byte 9-...      # start of data.. is here...
      Fetching and decoding 1.4 million points is no big deal. Will run in less than a minute. Also please understand that buzzwords like SRSID or OSGB mean nothing to me - I am clueless.
        something like this will work?

        So far so good. And the srs_id passes the sanity check as it is the value I would expect.

        The problem comes when we get to the envelope. We know it is a double[] from the spec and we know it is Little Endian from the flag byte. The trouble I'm having is understanding how to take that information and translate it to a template for unpack. From the list in the documentation, the closest seems to be 'V' but that is a long not a float.

        unpack'ing the envelope with 'V' gives these values:

        minx - 3307124816
        maxx - 1092842796
        miny - 867583392
        maxy - 1092852469
        minz - 1374389536
        maxz - 1091757350
        This makes no sense as min_x has to be numerically less than max_x - it is only a co-ordinate system, albeit one in 3 dimensional irregular spherical space. Likewise with min_z and max_z, this is height above sea-level and the min has to be numerically less than the max.

        So from the data it seems I am using the wrong unpack template of 'nCCV(V)6'.
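
        One way to see what goes wrong with that template: 'V' consumes 4 bytes, while each envelope value is an 8-byte double, so two consecutive 'V' results are just the two halves of one double. A sketch using bytes 9 to 16 from the hex dump; 'd<' is the little-endian double template and needs Perl 5.10 or later. Whether the resulting coordinate is the expected one is a separate question.

        ```perl
        use strict;
        use warnings;

        # Bytes 9-16 of the blob from the hex dump: the first envelope value.
        my $raw = pack 'H*', '50b81ec52c752341';

        my @halves = unpack 'V2', $raw;   # two unsigned 32-bit little-endian integers
        my ($minx) = unpack 'd<', $raw;   # one 64-bit little-endian double

        print "as V2: @halves\n";         # same as the 'minx' and 'maxx' numbers listed above
        printf "as d<: %.3f\n", $minx;
        ```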

        I've tried the two float options - 'f' and 'd' but they produce equally unrealistic values.

        Do you have any advice on how I go about translating from the information I know about the data to the template necessary to unpack it?
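
        A possible mapping from the known facts to a template, offered as a sketch rather than a tested decoder: 'a2' for the two magic bytes, 'C' for the version, 'C' for the flags, 'l<' for a 32-bit little-endian srs_id ('V' gives the same 27700 here), and 'd<6' for six little-endian doubles. The blob below is synthetic, built with pack from the rounded values quoted in this thread, so it is not byte-identical to the real data. A blob with B = 0 would need '>' in place of '<'.

        ```perl
        use strict;
        use warnings;

        my $template = 'a2 C C l< d<6';

        # Synthetic test blob (an assumption for illustration, not real OS data).
        my $geo = pack $template, 'GP', 0, 0x05, 27700,
            637590.385, 642426.601, 309577.58, 310361.391, 0, 0;

        my ($magic, $version, $flags, $srs_id, @envelope) = unpack $template, $geo;
        my ($minx, $maxx, $miny, $maxy, $minz, $maxz) = @envelope;

        print "srs_id: $srs_id\n";
        printf "x: %.3f .. %.3f\n", $minx, $maxx;
        printf "y: %.3f .. %.3f\n", $miny, $maxy;
        printf "header + envelope length: %d bytes\n", length $geo;
        ```

        The geometry itself starts after these bytes and is not touched by this template.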

        buzzwords like SRSID or OSGB mean nothing to me

        Sorry!

        SRS - Spatial Reference System is the co-ordinate system used to identify where a point is on the Earth's surface. Many SRSs exist and none are perfect. In order that the data can be used, we need to know which SRS we are using.

        OSGB - Ordnance Survey of Great Britain is the SRS that we most commonly use here in the UK. Because the UK is a relatively small country, OSGB mostly ignores the curvature of the Earth. OSGB is what we have found as the SRS ID in this GeoPackage so that's a sanity check that we are on the right path so far.

Re^4: Geo Package files
by hv (Prior) on Mar 05, 2022 at 14:21 UTC

    A useful technique for dumping binary data is to use the "v" flag in sprintf. Eg:

    % perl -E 'say sprintf "%v02X", "foo\x{0}"'
    66.6F.6F.00
    See the section on "vector flag" in sprintf.
Fixed starting bytes (was: Re^4: Geo Package files)
by Bod (Parson) on Mar 12, 2022 at 19:16 UTC
    You should wind up with something like: 47 50 XX YY... 0x47 means "G" and 0x50 means "P"

    Just as a general question...
    Why would the file specification require 0x4750 at the start of the file?
    What does it add?

    Is it just there so that any processor of the file can fail quickly if it is passed a file that doesn't start with these two bytes or is there more to it than that?

      This is a common protocol feature. For example a .WAV file starts with the letters RIFF, and the letters WAVE follow at byte offset 8. Using ASCII letters makes it easy to see that you have the right kind of format just by inspection (if you deal with ASCII often). Binary dump views will often display the ASCII as added info when a byte is within the normal "printable" character range. This is also helpful to make sure that you are at a "proper beginning". Decoding the thing requires being certain that you are at a valid byte 0, because all field definitions are a delta from that byte's "address". So this is a cheap (very little bandwidth) "sanity check". This is also often seen with communication links and can assist with resynchronization when some data "goes missing".
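
      A small sketch of both uses, with a made-up byte stream: check the magic at offset 0 and, if it is not there, scan forward for its next occurrence to resynchronize. The stream contents and the helper name find_magic are hypothetical. A real protocol would also need to validate what follows, since the two bytes can occur by chance inside data.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: offset of the next "GP" magic at or after $from, or -1.
      sub find_magic {
          my ($stream, $from) = @_;
          return index($stream, 'GP', $from // 0);
      }

      # Made-up stream: three junk bytes followed by the start of a header.
      my $stream = "\x00\xff\x10" . pack('H*', '47500005346c0000');

      my $ok_at_start = substr($stream, 0, 2) eq 'GP';
      my $offset      = find_magic($stream);

      print $ok_at_start
          ? "valid start\n"
          : "bad start, resync at offset $offset\n";
      ```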