PerlMonks  

Re^2: Strange MS characters are the ones causing trouble at the parsing code

by Andre_br (Pilgrim)
on May 09, 2007 at 18:48 UTC ( [id://614475]=note )


in reply to Re: Timeout for parsing corrupted excel files
in thread Timeout for parsing corrupted excel files

Hello naikonta

Thanks a lot for the reply. In fact, I tested the timeout with the code you posted (which is an alternate way to call the Spreadsheet::ParseExcel module), and the timeout worked perfectly. I then tested my old code with another big Excel file and, surprise, it worked too.

So, as far as preventing the parsing of big files goes, the timeout works. But I still can't get the timeout to work with a specific corrupted xls file I have here.

I don't know what the heck the user invented on this one (God, how I love the users! ..lghs) but, when I save it as '.txt tab delimited', I see many of those black squares in between the text.

They're not located at the end of the line, so they're not '\n's. I checked the Excel file, and guess what they are: those big dashes, the ones that Windows converts a plain '-' into as you type. You know?

If I try to paste it here, it pastes as '-', but they are in fact something like '--'. I mean, it's a wide dash. (what's the name of it?)

I've also seen this problem happen with those English quotes, the ones that slant to the right or to the left depending on whether they are opening or closing quotes.

Does anyone know how to treat these peculiar MS characters so that they don't cause these parsing problems in Perl?

How do I replace them? They are \what?

Thanks a lot

André


Replies are listed 'Best First'.
Re^3: Strange MS characters are the ones causing trouble at the parsing code
by naikonta (Curate) on May 10, 2007 at 02:56 UTC
    I don't know what the heck the user invented on this one (God, how I love the users! ..lghs) but, when I save it as '.txt tab delimited', I see many of those black squares in between the text.
    Is that what you call 'corrupted'? Your description sounds like CRLF (\r\n), which is the newline sequence on operating systems such as Windows. Try cleaning the string with s/\r//g. That normally doesn't make the process hang, though. After all, MS applications are notorious for inventing weird characters.
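
The s/\r//g suggestion above can be sketched as follows, together with a small check that shows which bytes the "black squares" really are. The helper names strip_cr and odd_bytes are made up for this illustration, and the sample line is invented, not taken from the actual file.

```perl
use strict;
use warnings;

# Hypothetical helper: remove carriage returns so CRLF becomes plain LF.
sub strip_cr {
    my ($text) = @_;
    $text =~ s/\r//g;
    return $text;
}

# Hypothetical helper: list, in hex, every character that is not
# printable ASCII, a tab, or a newline.
sub odd_bytes {
    my ($text) = @_;
    return map { sprintf '0x%02X', ord($_) } $text =~ /([^\x20-\x7E\t\n])/g;
}

# Invented sample: one tab-delimited line with a stray byte and a CRLF ending.
my $line = "2006 \x96 2007\tsome text\r\n";
print join(', ', odd_bytes($line)), "\n";
print strip_cr($line);
```

If odd_bytes reports anything other than 0x0D, the squares are not just carriage returns and s/\r//g alone will not remove them.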

    I once read about how to get rid of the funny stuff MS applications introduce, but I can't recall the details. The person(s) who worked it out did so by reverse engineering how MS Excel works.
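
For the wide dashes and slanted quotes asked about in the parent node, one possible approach is sketched below. It is not the reverse-engineered method mentioned above. It assumes the tab-delimited export is encoded as Windows-1252 (an assumption, the thread does not confirm the encoding), decodes it with the core Encode module, and maps the typographic characters to ASCII. The helper name ms_to_ascii and the replacement table are illustrative and not exhaustive.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative map from typographic characters to plain ASCII.
my %ascii_for = (
    "\x{2013}" => '-',      # en dash (byte 0x96 in Windows-1252)
    "\x{2014}" => '--',     # em dash (0x97)
    "\x{2018}" => "'",      # left single quote (0x91)
    "\x{2019}" => "'",      # right single quote (0x92)
    "\x{201C}" => '"',      # left double quote (0x93)
    "\x{201D}" => '"',      # right double quote (0x94)
    "\x{2026}" => '...',    # ellipsis (0x85)
);

# Hypothetical helper: takes raw bytes ASSUMED to be Windows-1252
# and returns text with the characters above replaced by ASCII.
sub ms_to_ascii {
    my ($bytes) = @_;
    my $text = decode('cp1252', $bytes);
    $text =~ s/([\x{2013}\x{2014}\x{2018}\x{2019}\x{201C}\x{201D}\x{2026}])/$ascii_for{$1}/g;
    return $text;
}

# Invented sample line, not taken from the actual file.
print ms_to_ascii("2006 \x96 2007, \x93quoted\x94\n");
```

If the data comes from somewhere else (for example cell values already decoded to Perl characters), the decode step would have to be dropped or changed; only the substitution would stay the same.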


    Open source softwares? Share and enjoy. Make profit from them if you can. Yet, share and enjoy!
