ETL in Perl

Replies are listed 'Best First'.
Re: ETL in Perl by Corion (Patriarch) on Sep 03, 2010 at 16:35 UTC
If you've enjoyed doing ETL using a graphical dataflow environment, you haven't done large or complex data transformations. In my experience, these data graphs always become too large to print and document. They are also hard to diff against each other. The Perl module coming closest to ETL is Workflow, and it wouldn't be too hard to create the appropriate "transformation" nodes to it. You only need to render it to SVG, and allow the user to modify it. Writing such a node editor shouldn't be too hard either, as there is existing knowledge embodied in Inkscape.	[reply]
Re^2: ETL in Perl by metaperl (Curate) on Sep 03, 2010 at 18:43 UTC
PEQUEL had a lot of work done on it, but it now appears to have fallen to the wayside. The mantra of every experienced web application developer is the same: thou shalt separate business logic from display. Ironically, almost all template engines allow violation of this separation principle, which is the very impetus for HTML template engine development. -- Terence Parr, "Enforcing Strict Model View Separation in Template Engines"	[reply]
Re^2: ETL in Perl by metaperl (Curate) on Sep 03, 2010 at 17:11 UTC
If you've enjoyed doing ETL using a graphical dataflow environment, you haven't done large or complex data transformations. In my experience, these data graphs always become too large to print and document. They I have greatly enjoyed doing ETL using Ab Initio for professional work "large or complex" is qualitative. If things became large or complex, I was able to highlight a section of processes and tuck them into a box, so printing was not an issue. And I found the large graphs to be largely self-documenting. What dataflow tools does your experience encompass? Workflow is interesting but I dont see it offering what I got out of Ab Initio and what I listed as points for graphical dataflow -- parallelism, big sky picture, etc. The mantra of every experienced web application developer is the same: thou shalt separate business logic from display. Ironically, almost all template engines allow violation of this separation principle, which is the very impetus for HTML template engine development. -- Terence Parr, "Enforcing Strict Model View Separation in Template Engines"	[reply]
Re^3: ETL in Perl by Corion (Patriarch) on Sep 03, 2010 at 17:46 UTC
I have worked with Aris Toolset and IBM SPSS PASV Modeler (previously SPSS PASV Modeler, née SPSS Clementine). I don't know what your requirements for documentation are, but I found that these tools hid far too much information within the nodes and did not print it out. And in the end, processes need to be documented in print. "Tucking things into boxes" only means that instead of one huge A2 or A0 sheet, you get to deal with binders of A4 sheets, which is no improvement. If you want a graphical ETL tool, I pointed you to the tools that you need to stick together to create such a thing with Perl as the backend. If you want to play buzzword bingo and tell us that you're now getting paid for playing with ETL tools, maybe you want to visit http://etlmonks.org instead?	[reply]
Re^4: ETL in Perl by jdporter (Paladin) on Sep 07, 2010 at 18:41 UTC
Re: ETL in Perl by erix (Prior) on Sep 04, 2010 at 22:28 UTC
ETL `[...]` typically involves reading large volumes of data from a database `[...]` I don't think that's right: it seems to me that ETL first and foremost means (always has): getting data from outside to inside a database. The TRANSFORM is only necessary if the db (-constraint(s)) demand it, and there, obviously, perl can come in handy. ... and processing it in massively parallel fashion "massively parallel"? That doesn't have much to do with ETL, does it? If the database can slurp data parallel (multicore, or multi-whatever), that's nice, but that doesn't seem very related to any ETL-job. (btw, your first link 'Extract, transform, and load...' links to the node itself; I suppose that was a mistake? You had wikipedia ETL in mind, perhaps? )	[reply] [d/l] [select]
Re^2: ETL in Perl by metaperl (Curate) on Sep 07, 2010 at 13:47 UTC
"massively parallel"? That doesn't have much to do with ETL, does it? If the database can slurp data parallel (multicore, or multi-whatever), that's nice, but that doesn't seem very related to any ETL-job. massively parallel is very important - if you have independant data crunching tasks, the ability to send them off to different heavyweight machines easily without a bunch of fiddling with program source code is a huge book. most of my ETL work was for a bank --- analysing database data and creating summaries of it to go right back into the database, so I extracted from a database, analyszed and loaded it back into the db.. but you are right... sometimes the initial source is not a db. thanks for the link update The mantra of every experienced web application developer is the same: thou shalt separate business logic from display. Ironically, almost all template engines allow violation of this separation principle, which is the very impetus for HTML template engine development. -- Terence Parr, "Enforcing Strict Model View Separation in Template Engines"	[reply]
Re: ETL in Perl by grantm (Parson) on Sep 06, 2010 at 05:42 UTC
Dane Horne (one of our local Perl Mongers) did a talk the other month about an Open Source ETL package he's developing called ETLp. Like ~~metaperl~~Corion, Dan had found that the visual ETL tools really did not scale and became very frustating for non-trivial problems. Anyway, check out his project. Edit: fixed ref to earlier commenter	[reply]
Re^2: ETL in Perl by metaperl (Curate) on Sep 07, 2010 at 13:48 UTC
Actually Corion found they did not scale well... I loved Ab Initio and it seemed to scale well for me... after all it evolved from the parallel computer company Thinking Machines. but thanks for the link.. checking it now The mantra of every experienced web application developer is the same: thou shalt separate business logic from display. Ironically, almost all template engines allow violation of this separation principle, which is the very impetus for HTML template engine development. -- Terence Parr, "Enforcing Strict Model View Separation in Template Engines"	[reply]
Re: ETL in Perl by metaperl (Curate) on Sep 03, 2010 at 18:39 UTC
I found some interesting links on Perl and ETL: DBIx::JCL Being used on a datamart in a fortune 100 company TALEND Allows for components to be written in Java or Perl The mantra of every experienced web application developer is the same: thou shalt separate business logic from display. Ironically, almost all template engines allow violation of this separation principle, which is the very impetus for HTML template engine development. -- Terence Parr, "Enforcing Strict Model View Separation in Template Engines"	[reply]
Re: ETL in Perl by jdporter (Paladin) on Sep 07, 2010 at 18:53 UTC
I disagree with the contention that ETL is the same as the input, transformation and output of data That just describes the T part of ETL. In fact, what I often see is people using "ETL" to refer only to the T part, or to the ET parts, or the TL parts. Extraction is, by definition, the extraction of data from a database (or other data storage) in its "native" model (schema). Load is, by definition, the insertion of data into database (or other storage) in the model defined there. Transformation is, of course, the transformation of the data from the one model to the other. Using Perl for the Extraction and Load parts can make a lot of sense. Perl can write, and run, the necessary SQL scripts, for example. Doing the Transformation in Perl might be easy to write, but I'd agree that it isn't the best tool in a lot of situations, for various reasons, not least of which is the fact that it's slow, relative to some other tools. What is the sound of Windows? Is it not the sound of a wall upon which people have smashed their heads... all the way through?	[reply]
Re: ETL in Perl by sundialsvc4 (Abbot) on Sep 16, 2010 at 19:12 UTC
I recently assisted on some ETL work using Microsoft’s toolsets ... and soon enough wound up piecing together a little bit of Perl code to work with the underlying (XML-based) definition files of that system. (I also wound up writing quite a bit of Visual Basic code ... (yuck!) ... to do a great many things that the pure-visual environment could not “quite” do. As the process became bigger, it also became more complex, and it became increasingly difficult to sit down and feel like you actually understood what was going to happen when you mashed the Start button. I felt very uncomfortable with that. A graph is only comprehensible when it fits on a single uncluttered page. Maybe I am just an old Luddite, but I really do embrace having source code as the basic way of defining to the computer what I want the computer to do. I know how to `diff` such files. I know how to work with them easily in version-control. Having said that ... I, too, would like avoid having to write and to maintain “large amounts of code” by hand in any language; including Perl. I would look for (or build?) some system that allowed me to define the processes, the data relationships and so-on, and which would then just-in-time generate the necessary (Perl, of course...) source-code. Dispatching the resulting definitions for parallel execution is a different problem, and a relatively easy one to handle “generically.” (Where I am parked right now, a rather archaic version of Tivoli Workload Scheduler is performing that task quite well. It smells bad but it works.) The advantage here is that any sort of workload can be dispatched in this way... “ETL” or otherwise. It is very limiting if “ETL works this-way but nothing else does,” or when the system that is doing ETL has no way to balance itself against other work that might be going on upon the same machine at the same time.	[reply]
Re^2: ETL in Perl by runrig (Abbot) on Sep 16, 2010 at 20:24 UTC
Ditto on the ability to diff and version control the source. I've worked with Informatica where you work with GUI tools that on the backend stores your mappings and workflows in a database (that you really don't want to look at, but sometimes do anyway). There is a built in version control that sucks, and you can not easily see the changes you make or changes between versions. You can export your work as XML, and if you sort and filter it just right (which I did with Perl and XSLT), you can diff XML files. You miss the ability to grep for something, instead you have to click and click (and click...) to see the bit you want to see, and there are so many levels of places to override settings, it's sometimes a challenge to figure out why something is behaving the way it is. The one (and maybe only) thing I like about it are that you get logging with little to no effort, and (ok, two things) realtime control and monitoring of the processes (in a pretty and fairly useful GUI).	[reply]
Re: ETL in Perl by Solo (Deacon) on Sep 08, 2010 at 15:10 UTC
There's also Sprog!, but it looks unsupported/dead for some time. --Solo -- You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.	[reply]
Re: ETL in Perl by Corion (Patriarch) on Oct 14, 2010 at 20:40 UTC
There just has been a release of ETLp, which seems to be a framework / application set geared towards creation and monitoring of ETL processes.	[reply]
Re: ETL in Perl by Anonymous Monk on Sep 21, 2010 at 21:52 UTC
I'm a little surprised nobody has mentioned Pentaho Data Integration. I have some interest in working a persistent perl interpreter into their scheme to allow efficient single row parallel processing in something other than JavaScript.	[reply]


Perl Monk, Perl Meditation
	PerlMonks