http://qs321.pair.com?node_id=11106315

in reply to Can't Find my way in Excel

For some spreadsheets this can take almost a minute if they have hunreds of columns and thousands of rows and dozens of sheets. When I have thousands of documents, ...

Corion already has you well covered on your specific API question, so let me instead go off on a slightly tangential algorithmics discussion to maybe get you thinking out of the box (cell?) on this one:

While you may well be able to effect some (possibly significant) linear improvements to such a search, that's all you'll ever be able to do under your current algorithm. Consider that your algorithm is proportional to cols*rows*sheets*files*avg(length(cell)), which, given your numbers (above) conservatively puts the number of comparisons into the hundreds of billions. "Proportional to" implies the rather large constant (or maybe even worse?) overhead hidden in \$Sheet->Cells(\$row, \$col)->Value. I am surprised it runs as quickly as it does.

In other words, it would really help to know what sort of larger task this Excel reading problem fits into. Looking at all of the factors in the complexity, are there any you can optimize away? For example, maybe you're scanning all of the columns but only using a few. Two orders of magnitude right there, if you can identify those columns before processing the whole file (headers? Pattern match the first few rows?). Or (and this is the part I'm personally struggling to find a rationale for), how are thousands of hundreds-of-columns, dozens-of-sheets Excel files ever the best storage solution? Is the information relational (SQL)? Unfortunately at this point, the cost of translating the Excel files to a more efficient format is probably at least as bad as what you're trying to do, so it may not be a win unless you (ever!) need to run your algorithm more than once. However, if you can automate a .csv export from whatever program created these files in the first place (assuming you didn't pay for three calendar centuries of data entry), that might be a significant improvement.1

As you're already looking to optimize performance, I reckon there's a decent chance you'll need to do this operation more than once. If so, there may be efficiencies you can build in, such as indexing, only re-processing changed files, etc., to eliminate a lot of that penalty. This sort of task might lend itself quite well to parallel processing, even on a single machine. Benchmarking strongly recommended, obviously.

You may well have considered many/most/all of these things, but then again, I don't think there's a monk here who has never had the blinders on at some point when trying to finish a project.

And I cannot use any CPAN.

Why is that? You're already using Win32::OLE::Const, which is not core. This good old link might come in handy: Yes, even you can use CPAN

Hope this helps!

___________
1. .csv is quite possibly cheaper full-stop, but also cheaper in the sense that if your matches are relatively sparse (< ~50%), you can win by processing files line-wise, rather than cell-wise. If the line doesn't match, move on. Otherwise, only then do you need to parse and match individual cells.

Replies are listed 'Best First'.
Re^2: Can't Find my way in Excel
by ExReg (Priest) on Sep 17, 2019 at 19:08 UTC

Thanks for the thoughts. It is not a case of how data is stored. It is a matter of looking for things in existing Excel spreadsheets. I agree that handling the data in another format would vastly simplify the search process. But I am stuck looking at Excel spreadsheets in .xls or .xlsx format. And no, I absolutely cannot do CPAN.