Re: Spreadsheet::ParseExcel with embedded PDF cells

The PDF files won't be embedded in the Excel document but rather in the OLE container/document that surrounds the Excel file.

As such, Spreadsheet::ParseExcel isn't of any use in this case. If you want to extract the PDF files you will need to use OLE::Storage_Lite.

The first thing you will need to find out is the PPS (property set) name of the embedded objects. The smplls.pl utility that ships with the OLE::Storage_Lite distribution will show you the file structure and the PPS names. For example:

    perl smplls.pl Book1.xls

    00    1 'Root Entry' (pps 0)                      ROOT 00.01.1900 00:00:00
    01      1 'Workbook' (pps 1)                      FILE       1000 bytes
    02      2 ' SummaryInformation' (pps 2)           FILE       1000 bytes
    03      3 ' DocumentSummaryInformation' (pps 3)   FILE       1000 bytes
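
If you would rather get that listing from your own script, here is a rough sketch that walks the tree returned by getPpsTree(). The helper names visible_name() and list_pps() are made up for this example, and it assumes the PPS names are stored as UTF-16LE. Control characters are printed as octal escapes so that they stand out.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: turn a UTF-16LE PPS name into printable text.
# Control characters are shown as octal escapes such as \5.
sub visible_name {
    my ($name) = @_;
    return join '', map {
          $_ < 0x20 ? sprintf('\\%o', $_)
        : $_ < 0x7f ? chr($_)
        :             sprintf('\\x{%x}', $_)
    } unpack 'v*', $name;
}

# Hypothetical helper: return one indented line per PPS, depth first.
sub list_pps {
    my ($pps, $depth) = @_;
    my %type = (1 => 'DIR', 2 => 'FILE', 5 => 'ROOT');
    my @lines = sprintf "%s'%s' %s", '  ' x $depth,
        visible_name($pps->{Name}), $type{ $pps->{Type} } || '?';
    push @lines, list_pps($_, $depth + 1) for @{ $pps->{Child} || [] };
    return @lines;
}

# Only load the module when a file name is given on the command line.
if (@ARGV) {
    require OLE::Storage_Lite;
    my $root = OLE::Storage_Lite->new($ARGV[0])->getPpsTree()
        or die "$ARGV[0] does not look like an OLE file\n";
    print "$_\n" for list_pps($root, 0);
}
```

Run it as "perl list_pps.pl Book1.xls" (the script name is just a placeholder). The quoted names it prints can be pasted into a double-quoted Perl string as the starting point for a stream name.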

Then you can extract the PPS structures using OLE::Storage_Lite. Here is a sample program that extracts the "Summary Information" from an Excel file to get you started.

    #!/usr/bin/perl

    use strict;
    use warnings;
    use OLE::Storage_Lite;


    my $file        = 'Book1.xls';
    my $stream_name = "\5SummaryInformation";

    # Convert stream name to UTF16.
    $stream_name = pack 'v*', unpack 'C*', $stream_name;

    # Create the OLE reader object.
    my $ole = OLE::Storage_Lite->new($file);

    # Find the required stream in the OLE container.
    my $stream = ($ole->getPpsSearch([$stream_name], 1, 1))[0];

    die "Couldn't find required OLE data in $file. $!\n" unless $stream;

    # Do something with the data.
    my $data = $stream->{Data};

    # Remember to use binmode() on Windows.
    print $data;
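
For binary data such as a PDF you will probably want to write the stream to a file rather than print it. A minimal sketch follows; the write_stream() helper, the output file name and the sample bytes are all placeholders for this example, and in practice the second argument would be the $stream->{Data} value from the program above.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Hypothetical helper: write raw bytes to a file with no newline
# translation and return the size of the file on disk.
sub write_stream {
    my ($path, $data) = @_;
    open my $fh, '>', $path or die "Can't open $path: $!\n";
    binmode $fh;    # Required on Windows, harmless elsewhere.
    print {$fh} $data;
    close $fh or die "Can't close $path: $!\n";
    return -s $path;
}

# Placeholder bytes standing in for real stream data.
my $bytes = write_stream('stream.bin', "%PDF-1.4\r\n\x00\xff");
print "wrote $bytes bytes\n";
```

Without the binmode() call Windows would translate the line endings in the data and corrupt the file.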

Note, if the PPS name appears to start with a space it may actually be a low ordinal character such as "\0", "\1" or as in the case above "\5".
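
To see why the name looks that way, here is a small sketch. It assumes the listing replaces every non-word character with a space before printing, which would explain the leading space shown for the summary streams.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# The name as stored in the OLE container: UTF-16LE.
my $stored = pack 'v*', unpack 'C*', "\5SummaryInformation";

# Back to single bytes (fine for names in the Latin-1 range).
my $ascii = pack 'C*', unpack 'v*', $stored;

# Assumed display step: blank out anything that isn't a word character.
(my $shown = $ascii) =~ s/\W/ /g;

printf "shown as:        '%s'\n", $shown;
printf "first character: ordinal %d\n", ord $ascii;
```

The displayed name starts with a space, but the first character of the real name has ordinal 5, so the name to search for is "\5SummaryInformation".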

--
John.


Re^2: Spreadsheet::ParseExcel with embedded PDF cells by ForgotPasswordAgain (Vicar) on Jan 21, 2009 at 17:20 UTC
Probably nobody reading this now, but... I seem to be unable to associate the PDF files that I (successfully) extracted to the cells they're coming from. Is there any way to do that?
Re^3: Spreadsheet::ParseExcel with embedded PDF cells by jmcnamara (Monsignor) on Jan 22, 2009 at 01:29 UTC
If you send me an example file using the email address in the OLE::Storage_Lite docs I'll have a look at it and see if the cell addresses can be decoded out using ParseExcel. -- John.
Re^2: Spreadsheet::ParseExcel with embedded PDF cells by ForgotPasswordAgain (Vicar) on Jan 11, 2009 at 17:04 UTC
Thanks, that looks very promising, if I can figure out the stream name for where the PDFs are.