OK. I am a complete novice at parsing formatted binaries.
I have a project where I need to yank out a ton of data from
about 100 Lotus 123 docs and then throw them into a MySQL database.
But, this is not a one time deal. Another department maintains
our list of server names, ip addresses, and other data for our
server farms. They won't move their data over to a searchable
database, so we have tried several ways of pulling the data ourselves.
The thing is, it is an extremely hairy process. The quickest way was
to run a script in each sheet and export the format in csv format.
I had a script that parses the csv formatted output and throw
outputs SQL statements. It works, but unfortunately it is difficult
to maintain the parsing info and they quite often modify the number
sheets internal to the spreadsheet.
I am looking at the file format in a binary editor and see that
I can extract much of the info I need. The basic file format
seems to be:
Lot of formatting data
Sheet label definitions
data
footer
The sheet label definitions are fairly easy to pick out. They are
identifiable by being preceded by
0x00 0x23 0x00 .. .. 0xb0 0x36
not sure what the 4 bytes in the middle there represent. The next
field after the string shown is the number of the label followed by
it's name. That is a variable number of characters. The end of this
section is denoted by:
0x00 0x16 0x00
The data seems to fall under the general file format of:
format line_num 0x00 label_num cell_num value end_marker
I can pretty easily determine the values for text data in that format.
The problem is that I can not read the data entered in binary format.
The text "1234" is easily identifiable. But, the following sequence which represents
the number 1234 is stumping me. The first 12 bytes are the same as for text in that they
represent the data addressing and formatting. The number appears to be the
0x80 0x34 0x01 part of the string:
0x08 0x00 0x00 0x00 0x00 0x01 0x80 0x34 0x01
Invariably, the binary data is preceded by the code 0x00 0x25 0x00,
so it it fairly easy to spot.
I am needing some pointers on how to seek through a file, find that
sequence and read it out. I am trying to reverse engineer their formatting
of binary numbers. I am also needing just a general idea of
how to read a binary file and extract arbitrary strings of somewhat arbitrary
length, but that have clearly defined boundries.
So, if I have a file call test.123 with the following data:
0000:04a0 1f 00 08 00 00 01 01 00 00 00 00 00 23 00 0e 00 ............#...
0000:04b0 b0 36 00 00 54 61 62 20 6c 61 62 65 6c 00 16 00 °6..Tab label...
0000:04c0 0a 00 00 00 00 00 27 74 65 73 74 00 25 00 08 00 ......'test.%...
0000:04d0 00 00 00 01 80 34 01 00 07 01 00 00 04 07 00 00 .....4..........
0000:04e0 06 01 00 00 07 01 00 00 06 01 00 00 04 01 04 00 ................
0000:04f0 00 06 1a 01 00 01 04 00 04 00 00 00 42 06 04 00 ............B...
0000:0500 03 00 00 00 40 06 0c 00 00 00 01 00 00 00 00 00 ....@...........
What would be the best way to read the file to where I can feed
the data into an array or hash?
Thanks.