For some spreadsheets this can take almost a minute if they have hundreds of columns and thousands of rows and dozens of sheets. When I have thousands of documents, ...

Corion already has you well covered on your specific API question, so let me instead go off on a slightly tangential algorithmics discussion to maybe get you thinking out of the box (cell?) on this one:

While you may well be able to effect some (possibly significant) linear improvements to such a search, that's all you'll ever be able to do under your current algorithm. Consider that its cost is proportional to cols*rows*sheets*files*avg(length(cell)), which, given your numbers above, conservatively puts the number of comparisons into the hundreds of billions. And "proportional to" hides the rather large constant (or maybe even worse?) overhead buried in $Sheet->Cells($row, $col)->Value: every one of those calls is a separate round-trip through OLE to the Excel process. I am surprised it runs as quickly as it does.
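To put a number on that per-cell overhead: if I recall the Win32::OLE behaviour correctly, asking a multi-cell Range for its Value returns the whole thing as one nested arrayref in a single OLE round-trip, which plain Perl can then scan without touching COM again. A minimal sketch (scan_sheet and its pattern interface are my own inventions, for illustration):

<code>
use strict;
use warnings;
use Win32::OLE;

# Instead of one cross-process call per cell...
#   my $v = $Sheet->Cells($row, $col)->Value;
# ...fetch the whole used range in ONE call, then scan it in plain Perl.
sub scan_sheet {
    my ($sheet, $pattern) = @_;

    # Nested arrayref of row arrayrefs (note: a single-cell used range
    # comes back as a plain scalar instead, which you'd need to special-case).
    my $data = $sheet->UsedRange->{Value};

    my @hits;
    for my $r (0 .. $#$data) {
        my $row = $data->[$r];
        for my $c (0 .. $#$row) {
            my $v = $row->[$c];
            push @hits, [ $r + 1, $c + 1, $v ]   # 1-based, relative to UsedRange
                if defined $v and $v =~ /$pattern/;
        }
    }
    return \@hits;
}
</code>

Even if the bulk fetch itself isn't free, you pay the OLE toll once per sheet instead of once per cell.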

In other words, it would really help to know what sort of larger task this Excel-reading problem fits into. Looking at all of the factors in the complexity, are there any you can optimize away? For example, maybe you're scanning all of the columns but only using a few; that's two orders of magnitude right there, if you can identify those columns before processing the whole file (headers? pattern-match the first few rows?). Or (and this is the part I'm personally struggling to find a rationale for), how are thousands of hundreds-of-columns, dozens-of-sheets Excel files ever the best storage solution? Is the information relational (SQL)? Unfortunately, at this point the cost of translating the Excel files to a more efficient format is probably at least as bad as the search you're already doing, so it's only a win if you (ever!) need to run your algorithm more than once. However, if you can automate a .csv export from whatever program created these files in the first place (assuming you didn't pay for three calendar centuries of data entry), that could be a significant improvement.[1]
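Here's a rough sketch of the header idea, under the same Win32::OLE setup you're already using (the interesting_columns name and the header regex are mine, purely for illustration):

<code>
# Decide which columns matter by pattern-matching the header row once,
# then confine the real scan to those columns only.
sub interesting_columns {
    my ($sheet, $header_re) = @_;

    # First row of the used range: one bulk fetch, not one call per cell.
    my $headers = $sheet->UsedRange->Rows(1)->{Value};

    my @cols;
    for my $c (0 .. $#{ $headers->[0] }) {
        my $h = $headers->[0][$c];
        push @cols, $c + 1 if defined $h and $h =~ /$header_re/;
    }
    return @cols;    # 1-based indices of columns worth scanning
}
</code>

From there, your inner loop only ever touches the columns in @cols instead of all of them.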

As you're already looking to optimize performance, I reckon there's a decent chance you'll need to do this operation more than once. If so, there may be efficiencies you can build in, such as indexing, only re-processing changed files, etc., to eliminate a lot of that penalty. This sort of task might lend itself quite well to parallel processing, even on a single machine. Benchmarking strongly recommended, obviously.
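As a sketch of the "only re-process changed files" idea, sticking to core modules (see the CPAN point below): stash each file's mtime with Storable between runs and skip anything that hasn't changed. The cache filename and process_workbook are placeholders for your own code.

<code>
use strict;
use warnings;
use Storable qw(retrieve nstore);    # core module; no CPAN install required

sub process_workbook { warn "would scan $_[0]\n" }   # stub for your real per-file scan

my $cache_file = 'scan_cache.stor';                  # hypothetical cache location
my $seen = -e $cache_file ? retrieve($cache_file) : {};

for my $xls (glob '*.xls*') {
    my $mtime = (stat $xls)[9];                      # last-modified time
    next if defined $seen->{$xls} and $seen->{$xls} == $mtime;   # unchanged; skip

    process_workbook($xls);
    $seen->{$xls} = $mtime;
}

nstore($seen, $cache_file);                          # remember mtimes for next run
</code>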

You may well have considered many/most/all of these things, but then again, I don't think there's a monk here who has never had the blinders on at some point when trying to finish a project.

And I cannot use any CPAN.

Why is that? You're already using Win32::OLE::Const, which is not core. This good old link might come in handy: Yes, even you can use CPAN

Hope this helps!

___________
[1] .csv is quite possibly cheaper full-stop, but also cheaper in the sense that if your matches are relatively sparse (< ~50%), you can win by processing files line-wise, rather than cell-wise. If the line doesn't match, move on. Otherwise, only then do you need to parse and match individual cells.
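A rough sketch of that line-wise filter (the filename and pattern are placeholders, and the naive comma split stands in for a proper CSV parser, which you'd want for quoted fields):

<code>
use strict;
use warnings;

my $file    = 'export.csv';          # hypothetical export
my $pattern = qr/needle/;            # hypothetical search pattern

open my $fh, '<', $file or die "$file: $!";
while (my $line = <$fh>) {
    next unless $line =~ $pattern;   # cheap whole-line reject; most lines end here

    chomp $line;
    # Naive comma split for illustration only; quoted fields with
    # embedded commas need real CSV parsing.
    my @cells = split /,/, $line, -1;
    for my $i (0 .. $#cells) {
        printf "%s line %d, cell %d: %s\n", $file, $., $i + 1, $cells[$i]
            if $cells[$i] =~ $pattern;
    }
}
close $fh;
</code>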


In reply to Re: Can't Find my way in Excel by rjt
in thread Can't Find my way in Excel by ExReg
