how to consolidate xls/csv/text into one report?

by tphyahoo (Vicar)
on Apr 28, 2005 at 11:26 UTC

tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

Masterful munging monks!

I have some csv files, some txt files, and some xls files. These are all autogenerated reports from different places, with different fields/formats/etc.

I need to consolidate all these files into one report (which then gets imported into a db, but I'm not worried about the import to db step).

Anybody have a primer/tutorial on how to do this kind of data munging thing?

I've figured out that I'll probably wind up using Text::CSV to suck in the text stuff and Win32::OLE to read in the xls files, but I'm concerned about the pitfalls I'm likely to hit when combining the data, and I'm looking for a tutorial on good tactics and what to watch out for. Open to book recommendations as well.

In my own background research I came across Data Munging with Perl. Is this likely to help me? Thanks!
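
For concreteness, here's the rough shape of what I have in mind for the CSV side (the file and column names are invented):

    use strict;
    use warnings;
    use Text::CSV;

    my $csv = Text::CSV->new( { binary => 1 } )
        or die "Cannot use Text::CSV: " . Text::CSV->error_diag;
    open my $fh, '<', 'report.csv' or die "report.csv: $!";

    my $header = $csv->getline($fh);    # first row holds the column names
    while ( my $row = $csv->getline($fh) ) {
        my %record;
        @record{ @$header } = @$row;    # column name => value
        # ...normalize %record into the common report layout here...
    }
    close $fh;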


Replies are listed 'Best First'.
Re: how to consolidate xls/csv/text into one report?
by jhourcle (Prior) on Apr 28, 2005 at 11:43 UTC

    If you're feeding everything into a database anyway, I'd most likely write separate scripts to pull the information out of each type of file and put it into the database. Once everything's in the database in a consistent format, it's much, much easier to generate your reports.

    (basically, going with the philosophy that lots of little scripts with a limited scope of work are easier to write and debug than one big script that tries to do everything)
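
    For example, one such little loader might look like this (the connection string, table, and file names are all placeholders):

        use strict;
        use warnings;
        use DBI;

        # Hypothetical connection and schema -- adjust to your database.
        my $dbh = DBI->connect( 'dbi:mysql:reports', 'user', 'password',
                                { RaiseError => 1 } );
        my $sth = $dbh->prepare(
            'INSERT INTO report (source, report_date, amount) VALUES (?, ?, ?)'
        );

        # This loader handles one format: a simple tab-separated text file.
        open my $fh, '<', 'dept_a.txt' or die "dept_a.txt: $!";
        while (<$fh>) {
            chomp;
            my ( $date, $amount ) = split /\t/;
            $sth->execute( 'dept_a', $date, $amount );
        }
        close $fh;
        $dbh->disconnect;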

    As for data munging, the book might be a good start. Depending on the number of records involved, I'll typically just start processing them using various patterns that seem to be in the files, writing the successes to one file and the rejects to another. If the successes look good, I'll load them into the database (or whatever the plan was) and look for more patterns in the rejected records.
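
    In code, that loop is just something like this (the pattern is only an illustration; yours will come from eyeballing the data):

        use strict;
        use warnings;

        open my $in,  '<', 'raw.txt'       or die $!;
        open my $ok,  '>', 'successes.txt' or die $!;
        open my $bad, '>', 'rejects.txt'   or die $!;

        while ( my $line = <$in> ) {
            # example pattern: an ISO date followed by an amount
            if ( $line =~ /^(\d{4}-\d\d-\d\d)\s+([\d.]+)/ ) {
                print $ok "$1\t$2\n";
            }
            else {
                print $bad $line;   # study these to find the next pattern
            }
        }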

    From what you describe, the CSV and Excel files should just be a matter of mapping their fields to the database fields (you might have to do some cleaning if there are enumerations or foreign key constraints, though), so the only real problem is likely to be the text files, and that's just a matter of taking small steps, as described above.

    Now, if this is something that's going to repeat itself (e.g., you're going to get an Excel sheet once a week from one department and a CSV from another, and you have to generate reports for your boss as soon as you get them), then you start getting into workflow issues. You work pretty much as I suggested before, but you have the parsing scripts store the failed matches in memory and keep processing them until they're done (or give up, in which case they write out a log you can go through -- so there's still some manual work, but not as much as there might be).
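
    A rough sketch of that multi-pass idea (the patterns and file names are made up):

        use strict;
        use warnings;

        # Two hypothetical line formats, both capturing (date, amount).
        my @patterns = (
            qr/^(\d{4}-\d\d-\d\d)\s+([\d.]+)\s*$/,     # "2005-04-28 12.50"
            qr{^(\d\d/\d\d/\d{4}),\s*([\d.]+)\s*$},    # "04/28/2005, 12.50"
        );

        open my $fh, '<', 'raw.txt' or die $!;
        my @pending = <$fh>;
        close $fh;

        my @parsed;
        for my $pat (@patterns) {
            @pending = grep {
                if (/$pat/) { push @parsed, [ $1, $2 ]; 0 }  # drop matches
                else        { 1 }                            # keep the rest
            } @pending;
        }

        # Whatever is left needs a human (or a new pattern).
        open my $log, '>', 'unparsed.log' or die $!;
        print {$log} @pending;
        close $log;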

Re: how to consolidate xls/csv/text into one report?
by jZed (Prior) on Apr 28, 2005 at 14:25 UTC
    The DBD::AnyData module handles CSV, XML, HTMLtable, Fixed-Width, Weblog, Ini-file, and many other formats. It's also extensible, so if you have a format that isn't provided, you can create a simple parser for it. It was designed for exactly the task you are describing.
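
    A minimal sketch of how that looks (the file and table names are placeholders):

        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect('dbi:AnyData(RaiseError=>1):');

        # Register each file as a table, naming its format.
        $dbh->func( 'dept_a', 'CSV', 'dept_a.csv', 'ad_catalog' );
        $dbh->func( 'dept_b', 'Tab', 'dept_b.txt', 'ad_catalog' );

        my $sth = $dbh->prepare('SELECT * FROM dept_a');
        $sth->execute;
        while ( my @row = $sth->fetchrow_array ) {
            print join( "\t", @row ), "\n";
        }
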
      Thanks for the tip. Hitting the search engines for

      dbd anydata excel

      I came up with modules for reading Excel files, including (among others):

      • Spreadsheet::TieExcel
      • DBD::AnyData - DBI wrapper for CSV and many other formats
      • DBD::Excel - DBD wrapper for Spreadsheet::WriteExcel and Spreadsheet::ParseExcel
      UPDATE: TieExcel seemed promising initially; however, quoting from the documentation:

      "Quirks You can't, for now, write anywhere else than the active workbook, and Excel must be open for all this to work. In the best M$ tradition, this and other quite obvious and necessary features will be available in some future version."

      But despite this I may try it out anyway. I like this easy Excel/Perl hash equivalence; I think it's what would make it easiest for me to figure out how to clean up my spreadsheets.

        If you are going to dump everything into a database anyway, it would seem simpler to treat your input sources as databases too. I suggested DBD::AnyData because it supports multiple formats. It can import from any DBI-accessible data source, including CSV, many different text formats, and Excel (with DBD::Excel). This means that you can use DBI to read all of your data, regardless of its format, and to write your data.

        In a situation such as you are describing, that makes a lot more sense than using one module to read Excel, another to read CSV, a third to read text format foo, a fourth to read text format bar, and a fifth to write the data. DBD::AnyData actually makes use of those other modules (e.g. the excellent Spreadsheet::ParseExcel to read the Excel sheet, or the excellent XML::Twig to read XML files), but it provides a single interface to them and to the other parsing modules, so your script needs only one interface rather than one per format.
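
        For example, with DBD::Excel installed, reading a worksheet is just another DBI query (the file and sheet names below are hypothetical):

            use strict;
            use warnings;
            use DBI;

            # Read a worksheet through the same DBI interface;
            # the table name is the worksheet name.
            my $xls = DBI->connect( 'dbi:Excel:file=report.xls',
                                    undef, undef, { RaiseError => 1 } );
            my $sth = $xls->prepare('SELECT * FROM Sheet1');
            $sth->execute;

            while ( my $row = $sth->fetchrow_hashref ) {
                # insert %$row into the real database here, via
                # dbi:mysql, dbi:Pg, or whatever your target uses
            }
            $xls->disconnect;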
