Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

ETL in Perl

by metaperl (Curate)
on Sep 03, 2010 at 14:40 UTC ( [id://858725]=perlmeditation: print w/replies, xml ) Need Help??

ETL is short for Extract, transform, and load... it typically involves reading large volumes of data from a database and processing it in massively parallel fashion and then writing out the transformed data. In other words, it's the industry term for what davorg termed "The Data Munging Mantra" in his book "Data Munging with Perl":
the input, transformation and output of data
That being said, I've enjoyed ETL using graphical dataflow environments. The major players are Teradata and Ab Initio. But I've yet to see a comparable set of tools in Perl. Sure, you can do ETL in Perl, but doing it in Perl falls short of my experience in graphical dataflow environments for the following reasons:
  1. visual overview of complete data path from input to output
  2. visual feedback on flow of data through various processing and decision elements
  3. easy and implicit parallelism from the graph itself
I certainly created a few elements in my graph which called out to Perl, but the idea of processing large volumes of data with Perl as the base of the system would seem less-than-ideal.

I'm inviting your viewpoint on this issue, especially after looking at this job description:

Candidate should be fully proficient in writing scalable, high volume ETL jobs using SQL and Perl for large volume data warehouses
UPDATE a couple more things that a graphical environment offers out of the box:
  1. Resume execution of graph by setting watch points




The mantra of every experienced web application developer is the same: thou shalt separate business logic from display. Ironically, almost all template engines allow violation of this separation principle, which is the very impetus for HTML template engine development.

-- Terence Parr, "Enforcing Strict Model View Separation in Template Engines"

Replies are listed 'Best First'.
Re: ETL in Perl
by Corion (Patriarch) on Sep 03, 2010 at 16:35 UTC

    If you've enjoyed doing ETL using a graphical dataflow environment, you haven't done large or complex data transformations. In my experience, these data graphs always become too large to print and document. They are also hard to diff against each other.

    The Perl module coming closest to ETL is Workflow, and it wouldn't be too hard to create the appropriate "transformation" nodes to it. You only need to render it to SVG, and allow the user to modify it. Writing such a node editor shouldn't be too hard either, as there is existing knowledge embodied in Inkscape.

      If you've enjoyed doing ETL using a graphical dataflow environment, you haven't done large or complex data transformations. In my experience, these data graphs always become too large to print and document. They
      • I have greatly enjoyed doing ETL using Ab Initio for professional work
      • "large or complex" is qualitative. If things became large or complex, I was able to highlight a section of processes and tuck them into a box, so printing was not an issue. And I found the large graphs to be largely self-documenting.
      • What dataflow tools does your experience encompass?
      Workflow is interesting but I dont see it offering what I got out of Ab Initio and what I listed as points for graphical dataflow -- parallelism, big sky picture, etc.



      The mantra of every experienced web application developer is the same: thou shalt separate business logic from display. Ironically, almost all template engines allow violation of this separation principle, which is the very impetus for HTML template engine development.

      -- Terence Parr, "Enforcing Strict Model View Separation in Template Engines"

        I have worked with Aris Toolset and IBM SPSS PASV Modeler (previously SPSS PASV Modeler, née SPSS Clementine). I don't know what your requirements for documentation are, but I found that these tools hid far too much information within the nodes and did not print it out. And in the end, processes need to be documented in print. "Tucking things into boxes" only means that instead of one huge A2 or A0 sheet, you get to deal with binders of A4 sheets, which is no improvement.

        If you want a graphical ETL tool, I pointed you to the tools that you need to stick together to create such a thing with Perl as the backend. If you want to play buzzword bingo and tell us that you're now getting paid for playing with ETL tools, maybe you want to visit http://etlmonks.org instead?

Re: ETL in Perl
by erix (Prior) on Sep 04, 2010 at 22:28 UTC
    ETL [...] typically involves reading large volumes of data from a database [...]

    I don't think that's right: it seems to me that ETL first and foremost means (always has): getting data from outside to inside a database. The TRANSFORM is only necessary if the db (-constraint(s)) demand it, and there, obviously, perl can come in handy.

    ... and processing it in massively parallel fashion

    "massively parallel"? That doesn't have much to do with ETL, does it? If the database can slurp data parallel (multicore, or multi-whatever), that's nice, but that doesn't seem very related to any ETL-job.

    (btw, your first link 'Extract, transform, and load...' links to the node itself; I suppose that was a mistake? You had wikipedia ETL in mind, perhaps? )

      "massively parallel"? That doesn't have much to do with ETL, does it? If the database can slurp data parallel (multicore, or multi-whatever), that's nice, but that doesn't seem very related to any ETL-job.
      massively parallel is very important - if you have independant data crunching tasks, the ability to send them off to different heavyweight machines easily without a bunch of fiddling with program source code is a huge book.

      most of my ETL work was for a bank --- analysing database data and creating summaries of it to go right back into the database, so I extracted from a database, analyszed and loaded it back into the db.. but you are right... sometimes the initial source is not a db.

      thanks for the link update



      The mantra of every experienced web application developer is the same: thou shalt separate business logic from display. Ironically, almost all template engines allow violation of this separation principle, which is the very impetus for HTML template engine development.

      -- Terence Parr, "Enforcing Strict Model View Separation in Template Engines"

Re: ETL in Perl
by grantm (Parson) on Sep 06, 2010 at 05:42 UTC

    Dane Horne (one of our local Perl Mongers) did a talk the other month about an Open Source ETL package he's developing called ETLp. Like metaperlCorion, Dan had found that the visual ETL tools really did not scale and became very frustating for non-trivial problems. Anyway, check out his project.

    Edit: fixed ref to earlier commenter

Re: ETL in Perl
by metaperl (Curate) on Sep 03, 2010 at 18:39 UTC
Re: ETL in Perl
by jdporter (Paladin) on Sep 07, 2010 at 18:53 UTC

    I disagree with the contention that ETL is the same as

    the input, transformation and output of data
    That just describes the T part of ETL.

    In fact, what I often see is people using "ETL" to refer only to the T part, or to the ET parts, or the TL parts.

    Extraction is, by definition, the extraction of data from a database (or other data storage) in its "native" model (schema). Load is, by definition, the insertion of data into database (or other storage) in the model defined there. Transformation is, of course, the transformation of the data from the one model to the other.

    Using Perl for the Extraction and Load parts can make a lot of sense. Perl can write, and run, the necessary SQL scripts, for example.

    Doing the Transformation in Perl might be easy to write, but I'd agree that it isn't the best tool in a lot of situations, for various reasons, not least of which is the fact that it's slow, relative to some other tools.

    What is the sound of Windows? Is it not the sound of a wall upon which people have smashed their heads... all the way through?
Re: ETL in Perl
by sundialsvc4 (Abbot) on Sep 16, 2010 at 19:12 UTC

    I recently assisted on some ETL work using Microsoft’s toolsets ... and soon enough wound up piecing together a little bit of Perl code to work with the underlying (XML-based) definition files of that system.   (I also wound up writing quite a bit of Visual Basic code ... (yuck!) ... to do a great many things that the pure-visual environment could not “quite” do.

    As the process became bigger, it also became more complex, and it became increasingly difficult to sit down and feel like you actually understood what was going to happen when you mashed the Start button.   I felt very uncomfortable with that.   A graph is only comprehensible when it fits on a single uncluttered page.

    Maybe I am just an old Luddite, but I really do embrace having source code as the basic way of defining to the computer what I want the computer to do.   I know how to diff such files.   I know how to work with them easily in version-control.

    Having said that ... I, too, would like avoid having to write and to maintain “large amounts of code” by hand in any language; including Perl.   I would look for (or build?) some system that allowed me to define the processes, the data relationships and so-on, and which would then just-in-time generate the necessary (Perl, of course...) source-code.

    Dispatching the resulting definitions for parallel execution is a different problem, and a relatively easy one to handle “generically.”   (Where I am parked right now, a rather archaic version of Tivoli Workload Scheduler is performing that task quite well.   It smells bad but it works.)   The advantage here is that any sort of workload can be dispatched in this way... “ETL” or otherwise.   It is very limiting if “ETL works this-way but nothing else does,” or when the system that is doing ETL has no way to balance itself against other work that might be going on upon the same machine at the same time.

      Ditto on the ability to diff and version control the source. I've worked with Informatica where you work with GUI tools that on the backend stores your mappings and workflows in a database (that you really don't want to look at, but sometimes do anyway). There is a built in version control that sucks, and you can not easily see the changes you make or changes between versions. You can export your work as XML, and if you sort and filter it just right (which I did with Perl and XSLT), you can diff XML files. You miss the ability to grep for something, instead you have to click and click (and click...) to see the bit you want to see, and there are so many levels of places to override settings, it's sometimes a challenge to figure out why something is behaving the way it is. The one (and maybe only) thing I like about it are that you get logging with little to no effort, and (ok, two things) realtime control and monitoring of the processes (in a pretty and fairly useful GUI).
Re: ETL in Perl
by Solo (Deacon) on Sep 08, 2010 at 15:10 UTC

    There's also Sprog!, but it looks unsupported/dead for some time.

    --Solo

    --
    You said you wanted to be around when I made a mistake; well, this could be it, sweetheart.
Re: ETL in Perl
by Corion (Patriarch) on Oct 14, 2010 at 20:40 UTC

    There just has been a release of ETLp, which seems to be a framework / application set geared towards creation and monitoring of ETL processes.

Re: ETL in Perl
by Anonymous Monk on Sep 21, 2010 at 21:52 UTC
    I'm a little surprised nobody has mentioned Pentaho Data Integration. I have some interest in working a persistent perl interpreter into their scheme to allow efficient single row parallel processing in something other than JavaScript.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlmeditation [id://858725]
Approved by herveus
Front-paged by herveus
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others cooling their heels in the Monastery: (6)
As of 2024-04-16 09:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found