PerlMonks
Thorny design problem

by tlm (Prior)
on Sep 07, 2005 at 21:14 UTC ( id://490001 )

tlm has asked for the wisdom of the Perl Monks concerning the following question:

(Sorry for the uninformative title; I really couldn't think of anything better.)

Here's yet another design question of the kind I find inordinately difficult.

Suppose I need to fetch and process, on a regular basis, some data for about 100 different "entities", say the Fortune 100 companies. The goal is to generate an output file for each company, with a uniform format for all the companies.

Now, here's the thing. Assume as a given that the data I'm interested in is available only in formats that vary radically from one company to the next. E.g., in some cases I can scrape the data from the company's static website; in other cases it's easier to drive a CGI page with WWW::Mechanize; in other cases I'd fetch flat files via FTP, or access an RDBMS directly, etc. On average I'll need anywhere between 25 and 500 lines of code to gather and process the data for each company.

My first thought was to create 100 modules:

    Crunch::WalMart
    Crunch::ExxonMobil
    Crunch::GeneralMotors
    Crunch::FordMotorCompany
    Crunch::GeneralElectric
    ...
    Crunch::SupervaluInc
    Crunch::CiscoSystemsInc
all implementing the same simple API, say the method do_it(). Each module knows how to fetch the required data and what to do with it. I can then create a subdirectory Companies, containing subdirectories WalMart, ExxonMobil, GeneralMotors, ..., CiscoSystemsInc. The purpose of these directories is both to store the raw input files and the processed data files, and also as a way to list the companies of interest (namely, all those that are mentioned in the Companies subdirectory). With this set-up, I could then have a master update function, to be run periodically, that would look like this:
use File::Spec::Functions 'catdir';

sub do_em {
    my $path_to_companies = shift;
    opendir my $dh, $path_to_companies
        or die "Can't opendir $path_to_companies: $!\n";
    while ( my $company = readdir($dh) ) {
        next if $company =~ /^\./;
        my $dir = catdir( $path_to_companies, $company );
        next unless -d $dir;
        my $module = 'Crunch::' . $company;
        eval "require $module; $module\::do_it( '$dir' ); 1" or die $@;
    }
}
To me, this reeks to high heaven, though I can't quite say why. Perhaps it's an aversion to using eval, or because it's too reminiscent of the newbie-ish tendency to want to use symbolic refs.

An alternative approach would be to create an array of coderefs, one per company:

use File::Spec::Functions 'catdir';

{
    my @do_it = (
        \&do_WalMart,
        \&do_ExxonMobil,
        ...
        \&do_CiscoSystemsInc,
    );

    sub do_em {
        my $path_to_companies = shift;
        $_->( $path_to_companies ) for @do_it;
    }
}

sub do_WalMart {
    my $name = 'WalMart';
    my $dir  = catdir( shift, $name );
    # blah blah blah
}
...but this entails having a huge file full of disparate functions that have little to do with one another.

I can think of a trillion other schemes, but not a single one presents itself as a clear winner somehow. What's your opinion?

the lowliest monk

Re: Thorny design problem
by friedo (Prior) on Sep 07, 2005 at 21:23 UTC
    I don't see any particular reason to use eval. Why not do something like this?

    sub do_em {
        ...
            my $module = 'Crunch::' . $company;
            require $module;
            $module->do_it( $dir );
        }
    }

    Update: Actually that should be:

    my $module = 'Crunch/' . $company . '.pm';

    ...per mpeters' note below.

    There are also some CPAN modules for doing plugins which will scan a directory and load each module fitting some criteria, so you don't even have to do the require.
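Module::Pluggable is one such module; here's a sketch of how it could replace the hand-rolled require loop (the %dir_for mapping of module names to data directories is a detail I've made up):

```perl
package Crunch;
use strict;
use warnings;

# Scans @INC for every module under the Crunch:: namespace and,
# with require => 1, loads each one it finds.
use Module::Pluggable search_path => ['Crunch'], require => 1;

package main;

# Hypothetical map from module name to that company's data directory.
my %dir_for = ( 'Crunch::WalMart' => 'Companies/WalMart' );

for my $module ( Crunch->plugins ) {
    $module->do_it( $dir_for{$module} );
}
```

With this, adding a company means dropping a new Crunch::Foo module into the search path; no central list needs editing.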

      No, this won't actually work. From perldoc:
      But if you try this:
      $class = 'Foo::Bar';
      require $class;       # $class is not a bareword
      # or
      require "Foo::Bar";   # not a bareword because of the ""
      The require function will look for the "Foo::Bar" file in the @INC array and will complain about not finding "Foo::Bar" there. In this case you can do:
      eval "require $class";
      But you could do:
      my $module = 'Crunch/' . $company . '.pm';
      require $module;

      -- More people are killed every year by pigs than by sharks, which shows you how good we are at evaluating risk. -- Bruce Schneier

      Thanks for the cluebricks on eval, and for the keyword "plugin", which hadn't occurred to me. Searching for "plugin" on CPAN turns up about 600 hits, most of which are not general enough (e.g. Siesta::Plugin, PXP::Plugin, etc.), so I have some wading to do. If you remember any specific names, please let me know.

      Update: Thanks to mpeters for reminding me why I was fussing with eval.

      the lowliest monk

        It's actually not so much a 'plugin' architecture as a 'factory' design pattern. Check out Class::Factory.

Re: Thorny design problem
by davidrw (Prior) on Sep 07, 2005 at 21:31 UTC
    Perhaps it's an aversion to using eval,
    You don't need the eval here -- this works:
    # instead of: eval "require $module; $module\::do_it( '$dir' ); 1"
    require "$module";
    $module->do_it( $dir );
    As for the general approach, it looks fine (and avoids repetition). I've seen that pattern (e.g. load everything in a Plugins subdir) used in other modules (can't think of a name offhand but will go poke around and then update)...

    Update: Doh. I am too slow...

    Update: Chemistry::File can autoload the Chemistry::File::* modules

    Update: added the quotes on the require argument.
Re: Thorny design problem
by borisz (Canon) on Sep 07, 2005 at 21:45 UTC
    I have done something very similar for about 10 firms. I chose the module way, where all modules derive from one base that defines the API. I never call the modules directly; they are created/loaded by the base module.
    my $m = Church->new('GeneralMotors') or die; # $m is Church::GeneralMotors
    I'm very happy with this method. Probably you need nearly no code to implement Church::BMW derived from Church::GeneralMotors.
    Boris
      Your factory approach is clean, but it seems like you're just moving tlm's problem from one place to another. How is Church::new implemented?

      Flavio
      perl -ple'$_=reverse' <<<ti.xittelop@oivalf

      Don't fool yourself.
        package Church;

        sub factory_create {
            my ( $class, %p ) = @_;
            my $new_class = ( ref($class) || $class ) . '::' . $p{firm};
            eval "require $new_class" or die $@;
            return $new_class->new(%p);
        }
        Boris
Re: Thorny design problem
by xdg (Monsignor) on Sep 07, 2005 at 21:33 UTC

    How about just having a separate perl script for each company? It seems like the modularized approach isn't really doing all that much except managing a namespace and it adds a lot of complexity for little value that I can see. You could keep all the scripts in a single directory and just execute them all whenever you need to, via cron or via some control routine that just executes all the scripts in a given directory. Each script can contain the location to store raw and processed files.

    That's not that different from your approach using 100 modules, but why use require and eval when do or system would work just as well? Ideally, there would be some reusable bits that you could put into modules to share between these scripts, so that each script is just a small wrapper to drive some common way of gathering data and some shared output functions.
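    A minimal control routine in that spirit might look like this (the directory layout and .pl naming convention are assumptions):

```perl
use strict;
use warnings;

# Run every per-company script found in a directory, each in its own
# process via system(), so one company's failure can't abort the whole run.
sub run_company_scripts {
    my $script_dir = shift;
    opendir my $dh, $script_dir or die "Can't opendir $script_dir: $!";
    my @ran;
    for my $script ( sort grep { /\.pl\z/ } readdir $dh ) {
        my $path = "$script_dir/$script";
        # $^X is the perl binary currently running this control script.
        if ( system( $^X, $path ) == 0 ) {
            push @ran, $script;
        }
        else {
            warn "$path exited with status " . ( $? >> 8 ) . "\n";
        }
    }
    return @ran;
}
```

    Each company script stays free to hard-code its own storage locations, exactly as xdg suggests.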

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

      How about just having a separate perl script for each company? It seems like the modularized approach isn't really doing all that much except managing a namespace and it adds a lot of complexity for little value that I can see.

      I'm glad you brought this up. My main reason for using the module approach was so I could test the various internal functions that support the do_it API function. I find that there's a lot more support available for testing functions if they belong to a loadable module, though, admittedly, this rationale has a tail-wagging-the-dog ring to it.

      the lowliest monk

        You can have it both ways. See brian d foy's article 'How a script becomes a module'. Make it a module for testing purposes and a script otherwise.
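        The modulino idea from that article, sketched for one company (the fetch/process internals here are placeholders of my own):

```perl
package Crunch::WalMart;
use strict;
use warnings;

# "Modulino": run as a script, the file executes itself; loaded as a
# module (e.g. from a test file), it just defines its subs.
__PACKAGE__->run(@ARGV) unless caller();

sub run {
    my ( $class, $dir ) = @_;
    $dir ||= '.';
    my $raw = $class->fetch($dir);
    return $class->process($raw);
}

# Hypothetical internals, included only to make the sketch self-contained.
sub fetch   { my ( $class, $dir ) = @_; "raw data from $dir" }
sub process { my ( $class, $raw ) = @_; uc $raw }

1;
```

        A test file can then `use Crunch::WalMart;` and poke at fetch/process individually, while cron runs the same file directly.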

        -xdg


Re: Thorny design problem (eval++, OO++, ISA--)
by tye (Sage) on Sep 08, 2005 at 03:48 UTC

    I'd use the eval, but only for the require. I'd use $module->doIt() or, more likely, $module->new(...) followed by some other method calls.

    Note that I'd likely not use inheritance to factor out commonality here. I encourage you to try to factor out common parts, but by making modules that your main modules use, not modules that your main modules try to BE (as in @ISA).

    Your situation sounds like a nearly perfect situation for discovering how inheritance binds so tightly and can leave you in a bind. But learning that from reading instead of from personal, painful experience is a sign of intelligence. (:
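    A sketch of the use-not-ISA idea: each per-company module calls into a shared helper module rather than inheriting from it (Crunch::Util, the field names, and the numbers are my own invention):

```perl
package Crunch::Util;
use strict;
use warnings;

# A shared helper that the per-company modules *use* rather than inherit.
sub output_record {
    my ( $company, %fields ) = @_;
    return join "\t", $company, map { $fields{$_} // '' } qw(revenue profit);
}

package Crunch::WalMart;
use strict;
use warnings;

sub do_it {
    my ( $class, $dir ) = @_;
    # ... idiosyncratic fetching/parsing for this company goes here ...
    my %fields = ( revenue => 285_222, profit => 10_267 );  # placeholder numbers
    return Crunch::Util::output_record( 'WalMart', %fields );
}

1;
```

    Swapping the output format later means touching only Crunch::Util, without any of the rigidity an @ISA relationship would impose.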

    - tye        

      I'm reading Head First Design Patterns, which says to favor composition over inheritance. Encapsulate what changes, but there are other ways to do that besides inheritance, like delegation. What you say, Tye, reminds me of what I've read.

Re: Thorny design problem
by 5mi11er (Deacon) on Sep 07, 2005 at 21:27 UTC
    Personally, I think your first approach is probably the best in terms of keeping the details in individual files, and those files in a separate directory.

    As for the calling mechanism, eval will certainly work, as would a system() call... Have you thought about a dispatch table? It probably wouldn't be any cleaner, but it might be.
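    For what it's worth, a dispatch-table sketch (the company names and handler bodies are stubs):

```perl
use strict;
use warnings;

# A dispatch table mapping each company to its handler coderef.
# Real handlers would fetch and crunch data; these just report.
my %do_it = (
    WalMart    => sub { my $dir = shift; "crunched WalMart in $dir" },
    ExxonMobil => sub { my $dir = shift; "crunched ExxonMobil in $dir" },
);

sub do_em {
    my $path = shift;
    my @results;
    for my $company ( sort keys %do_it ) {
        push @results, $do_it{$company}->("$path/$company");
    }
    return @results;
}
```

    The table also doubles as the master list of companies, so the directory scan becomes optional.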

    I have something quite similar where I have lots of detailed expect scripts that go off and do some function to lots of different network equipment. In that case, I'm actually calling those via system(); and yes I know about the CPAN expect module, I just haven't had the tuits to redo all the work already done in expect...

    HTH,

    -Scott

Re: Thorny design problem
by GrandFather (Saint) on Sep 07, 2005 at 21:53 UTC

    Rather than eval or some equivalent, use a Perl script in each company's folder that does the refresh work. Your update script then just iterates over the folders that are currently present and executes the refresh script in each folder.

    To add a new company, create a folder and the refresh script.


    Perl is Huffman encoded by design.
Re: Thorny design problem
by nedals (Deacon) on Sep 07, 2005 at 22:05 UTC

    It would seem best to put all your 'data_gathering' processes in one module. There has to be a lot of commonality in the code for each company.

    In the module you might have the following subs.

    • gather_data_A .. n using WWW::Mechanize, FTP, or access an RDBMS, etc
    • process_data_A .. n
    • save_data_A .. n

    Now you need to tell the module which process combination to use. For this you could use a hash, outside the module, that defines the process for each company. That way you don't need to mess with the module when you add a new company provided, of course, that the required sub exists in the module.

    use Crunch;

    my %process = (
        'WalMart'         => 'A:A:A',
        'ExxonMobil'      => 'B:D:A',
        ...
        'CiscoSystemsInc' => 'C:D:A',
    );

    sub do_em {
        my $path_to_companies = shift;
        opendir my $dh, $path_to_companies
            or die "Can't opendir $path_to_companies: $!\n";
        while ( my $company = readdir($dh) ) {
            next if $company =~ /^\./;
            my $dir = catdir( $path_to_companies, $company );
            next unless -d $dir;
            Crunch::do_it(
                dir     => $dir,
                company => $company,
                process => $process{$company},
            );
        }
    }

      There has to be a lot of commonality in the code for each company.

      No, I tried to be very explicit and emphatic about this point. That's precisely the problem: each company requires mostly idiosyncratic code.

      the lowliest monk

        I'd be very careful about this, because if you look for similarities, there likely will be many. I realise that the idiosyncrasies probably outweigh the similarities, but even encapsulating those few similarities can pay dividends. Of course, they won't be shared among all companies; more likely a significant subset will share one similarity while another subset shares another, etc.

        For example, here's what I have done in a similar (yet completely dissimilar) situation: I have a bunch of tasks to perform, each one almost completely unique, and a task manager object that reorders the tasks (order can be important for me - not likely for you, but the manager could do other things to help, such as figuring out that some companies have been updated recently enough to skip) and then executes them serially.

        The manager object is actually almost completely uninteresting here.

        package TaskManager;

        use Task;

        sub new {
            # create.
            # receive task objects or task names, store them via add_task.
        }

        sub add_task {
            # receive task objects or task names
            my $self = shift;
            push @{ $self->{_tasks} },
                map { ref $_ ? $_ : Task::find_task($_) } @_;
        }

        sub execute_tasks {
            my $self = shift;
            # do we need to do them?
            my @tasks = grep { $_->needed() } @{ $self->{_tasks} };
            for my $t (@tasks) {
                $t->perform_task();
            }
        }

        1;
        For the most part, that's it. I skipped the sorting part, for example. And strict/warnings - they're in the real module. And then, the Task module looks like this:
        package Task;

        sub new {
            # creation of generic object.
        }

        my %task_cache;

        sub find_task {
            my $task = shift;
            $task = shift if $task eq __PACKAGE__;  # can call as Task-> or Task::
            return $task_cache{$task} if $task_cache{$task};
            ( my $mod_name = $task . '.pm' ) =~ s{::}{/}g;
            eval { require $mod_name };
            if ($@) {
                # handle error - return error, die, whatever.
            }
            $task_cache{$task} = $task->new(@_);
        }

        sub needed { 1 }    # default - yes.

        sub perform_task {
            # default... can't.
            my $self = shift;
            die "Forgot to override perform_task in " . ref($self) . "\n";
        }

        1;
        Now I didn't really show you how to reuse stuff. But what I've started here is a framework. A framework upon which you can build further frameworks. For example:
        package Task::WWW;

        use WWW::Mechanize;

        sub perform_task {
            my $self = shift;
            # set up proxy, ...
            my $www = WWW::Mechanize->new();
            $www->get( $self->get_hostname() );
            $self->handle_page($www);
        }

        1;
        Now you just need to derive your HTTP-gathering info from Task::WWW, override two simpler functions, and can rest assured that if your application needs to change proxies for http requests, they all go through this code and can be easily modified. I realise that LWP::Simple probably uses environment variables for this, but it's just an example. I'm sure the more imaginative out there can find another use for this type of abstraction. Perhaps you now need to authenticate to get out of your company intranet. Or something equally silly. Whatever. By abstracting out the common parts, you not only can save time in adding new requirements like that, but you can concentrate on smaller pieces of the overall puzzle at a time in each function in each module.
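        Concretely, a per-company leaf of that framework might look like this (a sketch against the Task::WWW outline above; the URL and method bodies are placeholders):

```perl
package Task::WWW::WalMart;
use strict;
use warnings;
# -norequire because Task::WWW is the sketch above, not an installed module.
use parent -norequire, 'Task::WWW';

# get_hostname and handle_page are the two overrides Task::WWW expects;
# the URL itself is made up.
sub get_hostname { 'http://www.walmart.com/' }

sub handle_page {
    my ( $self, $www ) = @_;
    # $www is the WWW::Mechanize object that Task::WWW::perform_task set up;
    # do this company's idiosyncratic scraping here.
    ...
}

1;
```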

        You'll have well over 100 modules this way, but each one will be simpler.

        (I think I need to formalise this and put it on CPAN...)

Re: Thorny design problem
by Codon (Friar) on Sep 08, 2005 at 19:50 UTC
    This is a difficult issue. It sounds as if you are going to have a handful of data collection schemes for your 100 data sources. This, combined with the desire to create a single, unified output format, leads me to think Hierarchical. You will have some significant overlap in how you access the raw data per data source. All static web sites would have a URL that you access. All FTP sites would have a remote server, login, password, and file path. All RDBMSes will have similar credentials. In all cases, you will want to do the following:
  • fetch_raw_data
  • parse_raw_data
  • write_parsed_data

    This would make me go with something akin to:

    DataCollector
    DataCollector::Mechanized
    DataCollector::Mechanized::WalMart
    DataCollector::Mechanized::GeneralElectric
    DataCollector::RDBMS
    DataCollector::RDBMS::ExxonMobil
    DataCollector::FTP
    DataCollector::FTP::GeneralMotors
    DataCollector::Scrape
    DataCollector::Scrape::FordMotorCompany
    DataCollector::Scrape::CiscoSystemsInc
    ...

    Your driver program would then, unfortunately, need to know all of the DataCollector leaf classes, or devise a method to dynamically load and run them. But for each of these classes, you could call the above-mentioned methods. Those methods would make private method calls on down until you get to the ugly details in the individual implementation classes. These implementation classes would only need to know where to go for data and how to pull the real data from the raw data source. Up one level would be how to talk to the data source type, based on information in the implementation classes. Up at the top level is the detail of how to write out the data.
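    One way the driver could discover the leaf classes dynamically, sketched under the assumption that they live in .pm files under lib/DataCollector/:

```perl
use strict;
use warnings;
use File::Find;

# Walk the lib/ tree, turning each .pm path into a class name, and keep
# only the two-levels-deep leaves (e.g. DataCollector::FTP::GeneralMotors).
my @leaf_classes;
find(
    sub {
        return unless /\.pm\z/;
        ( my $class = $File::Find::name ) =~ s{^lib/}{};
        $class =~ s{\.pm\z}{};
        $class =~ s{/}{::}g;
        push @leaf_classes, $class
            if $class =~ /\ADataCollector::\w+::\w+\z/;
    },
    'lib/DataCollector'
);

# Load each leaf and run it through the shared three-step API.
for my $class (@leaf_classes) {
    ( my $file = "$class.pm" ) =~ s{::}{/}g;
    require $file;
    my $collector = $class->new;
    $collector->fetch_raw_data;
    $collector->parse_raw_data;
    $collector->write_parsed_data;
}
```

    With this, adding a new source is just a matter of dropping a new leaf class into the right subdirectory.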

    I hope this makes sense, isn't too vague, etc. Good luck.

    Ivan Heffner
    Sr. Software Engineer, DAS Lead
    WhitePages.com, Inc.
