rijrunner has asked for the wisdom of the Perl Monks concerning the following question:

OK. I am a complete novice at parsing formatted binaries.

I have a project where I need to yank out a ton of data from about 100 Lotus 123 docs and then throw them into a MySQL database. But, this is not a one time deal. Another department maintains our list of server names, ip addresses, and other data for our server farms. They won't move their data over to a searchable database, so we have tried several ways of pulling the data ourselves. The thing is, it is an extremely hairy process. The quickest way was to run a script in each sheet and export the format in csv format.

I had a script that parses the csv formatted output and throw outputs SQL statements. It works, but unfortunately it is difficult to maintain the parsing info and they quite often modify the number sheets internal to the spreadsheet.

I am looking at the file format in a binary editor and see that I can extract much of the info I need. The basic file format seems to be:

Lot of formatting data
Sheet label definitions
data
footer

The sheet label definitions are fairly easy to pick out. They are identifiable by being preceded by

0x00 0x23 0x00 .. .. 0xb0 0x36
not sure what the 4 bytes in the middle there represent. The next field after the string shown is the number of the label followed by it's name. That is a variable number of characters. The end of this section is denoted by:

0x00 0x16 0x00

The data seems to fall under the general file format of:

format line_num 0x00 label_num cell_num value end_marker

I can pretty easily determine the values for text data in that format. The problem is that I can not read the data entered in binary format. The text "1234" is easily identifiable. But, the following sequence which represents the number 1234 is stumping me. The first 12 bytes are the same as for text in that they represent the data addressing and formatting. The number appears to be the 0x80 0x34 0x01 part of the string:


0x08 0x00 0x00 0x00 0x00 0x01 0x80 0x34 0x01

Invariably, the binary data is preceded by the code 0x00 0x25 0x00, so it it fairly easy to spot.

I am needing some pointers on how to seek through a file, find that sequence and read it out. I am trying to reverse engineer their formatting of binary numbers. I am also needing just a general idea of how to read a binary file and extract arbitrary strings of somewhat arbitrary length, but that have clearly defined boundries.

So, if I have a file call test.123 with the following data:
0000:04a0 1f 00 08 00 00 01 01 00 00 00 00 00 23 00 0e 00 ............#...
0000:04b0 b0 36 00 00 54 61 62 20 6c 61 62 65 6c 00 16 00 °6..Tab label...
0000:04c0 0a 00 00 00 00 00 27 74 65 73 74 00 25 00 08 00 ......'test.%...
0000:04d0 00 00 00 01 80 34 01 00 07 01 00 00 04 07 00 00 .....4..........
0000:04e0 06 01 00 00 07 01 00 00 06 01 00 00 04 01 04 00 ................
0000:04f0 00 06 1a 01 00 01 04 00 04 00 00 00 42 06 04 00 ............B...
0000:0500 03 00 00 00 40 06 0c 00 00 00 01 00 00 00 00 00 ....@...........


What would be the best way to read the file to where I can feed the data into an array or hash?

Thanks.

Replies are listed 'Best First'.
Re: Lotus 123 Spreadsheet parsing
by jmcnamara (Monsignor) on Jun 04, 2002 at 11:35 UTC

    You will find Lotus documentation on the 123 file format here

    You can extract and convert the binary information that you need using unpack. Since the binary records have defined lengths it should be easy to iterate through the file record by record looking for salient information.

    Other functions that you might find useful in this task are pack, seek, tell and binmode.

    The file format is actually quite similar to the Excel file format so you may be able to get some ideas from Spreadsheet::ParseExcel.

    --
    John.

      Thank you. That was the pointer I was hoping for.

      Chuck
Re: Lotus 123 Spreadsheet parsing
by Aristotle (Chancellor) on Jun 04, 2002 at 12:06 UTC
    Do you know about pack/unpack? They're Perl's way of converting a binary representation of a number, found in a string, from and to an actual numeric value. In addition, the format string they use offers so many options to specify the form of input data - almost like regexes - that you can parse simplistic binary file formats (without too much variation) very easily.
    Update: Huh - there was no reply visible on this node when I wrote mine..

    Makeshifts last the longest.