in reply to Timeout for parsing corrupted excel files

I parsed a 500 KB xls file, and it normally takes around 8 seconds to complete. So I set an alarm for 5-7 seconds, and it dies successfully with the timeout message.
use strict;
use warnings;
use Spreadsheet::ParseExcel;

my $file = shift;
my $excel;
eval {
    local $SIG{ALRM} = sub { die "timeout\n" };
    alarm(5);
    $excel = Spreadsheet::ParseExcel::Workbook->Parse($file);
    die "Can't parse $file\n" unless defined $excel;
    alarm(0);
};
die $@ if $@;
# continue with $excel object
I can see two things in your code: a) you only warn on a failed open. Instead, you should throw an exception there, because there's no point in continuing if you can't open the file.
open DUMP, ">filename" or die "Can't open: $!\n";
b) you need to check $@ for all possible failures, not just the timeout, to capture other potential errors.
if ($@) {
    # something bad occurred
    if ($@ =~ /timeout/) {    # must match the message passed to die above
        # do something regarding timeout
    }
    else {
        # do something with the other reason,
        # such as failure on open
    }
    # stop here
}
# continue here

Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

Re^2: Strange MS characters are the ones causing trouble at the parsing code
by Andre_br (Pilgrim) on May 09, 2007 at 18:48 UTC
    Hello naikonta

    Thanks a lot for the reply. In fact, I tested the timeout with the code you posted (which is an alternate way to call the Spreadsheet::ParseExcel module), and the timeout worked just perfectly. I then tested my old code with another big excel file and, surprise, it worked too.

    So, as far as preventing big files from parsing goes, the timeout works. But I still can't get the timeout to work with a specific corrupted xls file I have here.

    I don't know what the heck the user invented on this one (God, how I love the users! ..lghs) but, when I save it as '.txt tab delimited', I see many of those black squares in between the text.

    They're not located at the end of the line, so they're not '\n's. I checked the excel file, and guess what they are: they're those big dashes, the ones that Windows converts this '-' into as you type. You know?

    If I try to paste one here, it pastes as '-', but it's in fact something like '--'. I mean, it's a wide dash. (What's the name of it?)

    I've seen this problem happen also with those English quotes, the ones that angle to the right or to the left, depending on whether they are opening or closing quotes.

    Does anyone know how to treat these peculiar MS characters, so that they don't cause these parsing problems in Perl?

    How do I replace them? They are \what?
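    One quick way to find out what they are is to dump the code of every non-ASCII byte in a line and look those codes up in a character table. This is a sketch; the sub name is chosen for illustration only:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Report the code of every non-ASCII byte in a string, so the
# mystery characters can be looked up (e.g. in the Windows-1252 table).
sub dump_weird_chars {
    my $line = shift;
    my @found;
    while ($line =~ /([^\x00-\x7F])/g) {
        push @found, sprintf "0x%02X at position %d", ord($1), pos($line) - 1;
    }
    return @found;
}

# In Windows-1252, an em dash is byte 0x97:
print "$_\n" for dump_weird_chars("foo\x97bar");
# prints: 0x97 at position 3
```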

    Thanks a lot

    André

      I don't know what the heck the user invented on this one (God, how I love the users! ..lghs) but, when I save it as '.txt tab delimited', I see many of those black squares in between the text.
      Is it what you called 'corrupted'? Your description sounds like CRLF (\r\n), which is what is considered the newline sequence in OSes such as Windows. Try cleaning the string with s/\r//g. But this normally doesn't make the process hang. Well, all things considered, MS applications are notorious as 'weird character inventors' at their best.
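      For the record, the 'wide dash' is most likely an em dash, which is byte 0x97 in the Windows-1252 encoding Excel typically uses, and the angled quotes are bytes 0x91-0x94. A sketch that folds this punctuation back to plain ASCII, assuming the text really is raw Windows-1252 bytes (the sub name is illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fold common Windows-1252 "smart" punctuation back to ASCII.
# Assumes the input is raw Windows-1252 bytes, not decoded UTF-8.
sub clean_ms_chars {
    my $s = shift;
    $s =~ s/\r//g;            # strip CR from CRLF line endings
    $s =~ s/[\x91\x92]/'/g;   # curly single quotes -> '
    $s =~ s/[\x93\x94]/"/g;   # curly double quotes -> "
    $s =~ s/[\x96\x97]/-/g;   # en dash and em dash -> -
    $s =~ s/\x85/.../g;       # horizontal ellipsis -> ...
    return $s;
}

print clean_ms_chars("he said \x93hi\x94 \x97 loudly\r\n");
# prints: he said "hi" - loudly
```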

      I once read about how to get rid of this funny stuff that MS applications introduce, but I can't recall it at all. The person(s) who made this module did it by reverse engineering how MS Excel works.


      Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

Re^2: Timeout corrupted excel files: MS characters freeze code
by Andre_br (Pilgrim) on May 09, 2007 at 18:52 UTC
    Just to add this: the fact is that these strange MS characters I mentioned on the reply above are getting in the way of the timeout. They just seem to make the code freeze. And there's no timeout.

    Any ideas?
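    One possible culprit: since Perl 5.8, signals are 'safe' (deferred), so a SIGALRM is only delivered between Perl opcodes, and code stuck inside a long-running low-level call may never see the alarm. perlipc documents using POSIX::sigaction to install an immediate (unsafe) handler instead. A sketch along those lines, where a busy loop stands in for the hanging Parse() call and the sub name is chosen for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(SIGALRM);

# Install an "unsafe" SIGALRM handler via POSIX::sigaction so it fires
# immediately instead of being deferred (see perlipc, "Deferred Signals").
sub parse_with_hard_timeout {
    my ($seconds) = @_;
    my $ok = eval {
        POSIX::sigaction(SIGALRM, POSIX::SigAction->new(sub { die "timeout\n" }))
            or die "sigaction failed: $!\n";
        alarm($seconds);
        1 while 1;   # busy loop standing in for the hanging parse
        alarm(0);
        1;
    };
    alarm(0);        # make sure no alarm stays armed
    return $ok ? "parsed" : $@;
}

print parse_with_hard_timeout(2);
```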

      What is the longest time until you're out of patience? I once migrated many xls files (around 2-3 MB each) into MySQL. Each file took like forever, but the process eventually ended successfully. Once or twice, I opened a file manually only to save (as) it again under the same name. This made the file size smaller and the process went faster. But doing this for 30-40 files wasn't an option at all, and I couldn't find any way to automate it on Linux. So I just enjoyed my time doing something else while waiting for the migration process to do its job, especially after I did prog.pl /path/to/dir/*.xls.

      Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!