Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Timeout for parsing corrupted excel files

by Andre_br (Pilgrim)
on May 09, 2007 at 16:28 UTC ( [id://614438]=perlquestion: print w/replies, xml ) Need Help??

Andre_br has asked for the wisdom of the Perl Monks concerning the following question:

Hello my esteemed code fellows.

I have this script that parses uploaded excel files into tab delimited .txt versions. The problem is that from time to time some user shows up with a corrupted xls file, and the Spreadsheet::ParseExcel gets stuck at the conversion. I'd like to set up a timeout, but the code below isn't working. Can any of you guys give me a hand here?

#die 1; ###################### # File conversion: xls->txt ####################### local $SIG{ALRM} = sub { die("TIMED OUT") }; eval { alarm(10); #set our alarm # sleep 5; # with this sleep, the alarm works. But with the code bel +ow it doesn't... if ( $extension eq "xls") { my $infile = "../thissite/myuploads/$user/$systemname"; my $outfile = "../thissite/myuploads/$user/inventory.txt"; # Requesting extra modules: use Spreadsheet::ParseExcel; use Data::Dumper; # Openning the file dump open (DUMP, ">$outfile") || warn "Error opening file: $!\n"; select DUMP; # Excel parsing my $prev_index = -1; my $prev_row = 0; my $prev_col = 0; my $cell_handler = sub { my $workbook = $_[0]; my $sheet_index = $_[1]; my $row = $_[2]; my $col = $_[3]; my $cell = $_[4]; # Only process the first worksheet if ($sheet_index > 0) { $workbook->ParseAbort(1); return; } # Reset the col counter between rows $prev_col = 0 if $row != $prev_row; # Add tabs between fields and newlines between rows. Also pa +d # any missing rows or columns. # print "\n" for $prev_row +1 .. $row; print "\t" for $prev_col +1 .. $col; # Print the UNformatted value of the cell print $cell->{Val}; # Keep track of where we are $prev_row = $row; $prev_col = $col; }; my $parse_excel = new Spreadsheet::ParseExcel(CellHandler => $ce +ll_handler, NotSetCell => 1); my $workbook = $parse_excel->Parse("$infile"); ############ ############ alarm(0); close DUMP; select STDOUT; } }; if ( $@ =~ /TIMED OUT/ ) { die "CONVERSION TIME EXCEEDED. FILE MAY BE CORRUPT!"; } #die 2;

The 'die 1' indicates me that the code runs just fine exactly untill before this conversion. And the 'die 2' indicates me that it won't pass through it.

I tested the alarm with the 'sleep 5', and it works just fine. But with the excel-parsing code it doesn't.

Any ideas?

I even tried setting the alarm around the line that activates the parsing, but it didn't work either.

local $SIG{ALRM} = sub { die("TIMED OUT") }; eval { alarm(10); #set our alarm + my $workbook = $parse_excel->Parse("$infile"); alarm(0); };

Thanks a lot, my friends

André

Replies are listed 'Best First'.
Re: Timeout for parsing corrupted excel files
by naikonta (Curate) on May 09, 2007 at 17:43 UTC
    I parsed a 500Kb of xls file and normally it takes around 8 seconds to complete. So I set an alarm for 5-7 seconds, and it dies successfully with timeout message.
    use strict; use warnings; use Spreadsheet::ParseExcel; my $file = shift; my $excel; eval { local $SIG{ALRM} = sub { die "timeout\n" }; alarm(5); $excel = Spreadsheet::ParseExcel::Workbook->Parse($file); die "Can't parse $file\n" unless defined $excel; alarm(0); }; die $@ if $@; # contiune with $excel object
    I can see two things in your code: a) warning on failed open. Instead, you should do some exception here because there's no point to continue if you can't open the file.
    open DUMP, ">filename" or die "Can't open: $!\n";
    b) you need to check $@ in all possible cases, not just timeout, to capture other potential errors.
    if ($@) { # something bad occures if ($@ =~ /TIME OUT/) { # do something regarding timeout } else { # do something with other reason, # such as failure on open } # stop here } # continue here

    Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

      Hello naikonta

      Thanks a lot for the reply. In fact, I tested the timeout with the code you posted (wich is an alternate way to call the Spreadsheet::ParseExcel module), and the timeout worked just perfect. I then tested my old code with another big excel file and, surprise, it worked too.

      So, as far as I want to prevent big files parsing, the timeout works. But I still can't have the timeout to work with a specific corrupted xls file I have here.

      I don't know what the heck the user invented on this one (God, how I love the users! ..lghs) but, when I save it as '.txt tab delimited', I see many of those black squares in between the text.

      They're not located on the end of the line, so they're not '\n's. I checked the excel file, and guess what they are: they are those big dashes, the ones that windows converts this one '-' into, as you type. You know?

      If I try to paste it here, it pastes as '-', but they are in fact something like '--'. I mean, it's a wide dash. (what's the name of it?)

      I've seen this problem happening also with those english quotes, the ones that have some angle to the right and to the left, according to if they are opening or closing quotes.

      Does anyone know how to threat these peculiar MS characters, in order they don't cause these parsing problems on Perl?

      How do I replace them? They are \what?

      Thanks a lot

      André

        I don't know what the heck the user invented on this one (God, how I love the users! ..lghs) but, when I save it as '.txt tab delimited', I see many of those black squares in between the text.
        Is it what you called 'corrupted'? Your description sounds like CRLF (\r\n), this is what considered as newline character in OS such as Windows. Try to clean the string with s/\r//g. But this normally doesn't make the process hang. Well, out of all, MS applications are notoriously known as 'weird characters inventors' at their best.

        I once read about how to get rid of these funny stuff MS applications introduce but I can't recall it at all. The person(s) that made this stuff did it by reverse engineering how MS Excel works.


        Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

      Just to add this: the fact is that these strange MS characters I mentioned on the reply above are getting in the way of the timeout. They just seem to make the code freeze. And there's no timeout.

      Any ideas?

        What is the longest time until you're out of patience? I once migrated many xls files (around 2-3 MG each) into mysql. Each file took like forever but the process eventually ended successfully. Once or twice, I did open a file manually only to save (as) it again under the same name. This made the file size smaller dan the process went faster. But doing this for 30-40 files wasn't an option at all, and I couldn't find any way to automate this on Linux. So I just enjoyed my time doing something else while waiting the migration process to do its job, especially after I did prog.pl /path/to/dir/*.xls.

        Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://614438]
Approved by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others browsing the Monastery: (1)
As of 2024-04-26 02:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found