http://qs321.pair.com?node_id=1229735

nysus has asked for the wisdom of the Perl Monks concerning the following question:

I want to handle lots of data, mostly from spreadsheets and csv files. I'm currently using Spreadsheet::Read to parse the data, which is great because it can handle different formats.

I want to take things a step further, though, and be able to merge data spread across many different spreadsheets, where the columns may or may not be in the same order, may have different names or no name at all, or may not exist in every sheet. So, for example, the tool might identify a column full of phone numbers and ask me which column of existing data I would like to assign it to, similar to the way I've seen this done through a web interface for uploading contact information. The script would also intelligently dedupe and merge new data with existing data, or at least assist in that process, as well.

Before I write this myself, I'm wondering if there might be a suite of tools that already tackles this problem. Seems like it is a common enough problem.

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar";
$nysus = $PM . ' ' . $MCF;
Click here if you love Perl Monks

Replies are listed 'Best First'.
Re: Module for intelligently analyzing and merging spreadsheet data
by erix (Prior) on Feb 10, 2019 at 21:39 UTC

    PostgreSQL has (as a contrib module) file_fdw (fdw = Foreign Data Wrapper), which reads csv files (read-only; it cannot write to the csv file). It offers the usual postgres SQL interface over the data in those text files.

    Steps are:

    1) Install the file_fdw module ('create extension file_fdw') (once)

    2) Create a server [1] ('create server if not exists csvserver foreign data wrapper file_fdw') (once)

    3) Construct a CREATE FOREIGN TABLE statement [2] that reflects the structure of the csv file. The csv file's header line can often be converted into such a CREATE statement (possibly even adding a data type to each column).

    SQL queries against such a 'Foreign Table' now read the underlying csv file(s) (select * from mytable).

    Advantages: SQL access to csv data.

    Disadvantages: Needs Postgres. Read-only. No indexing possible, so huge csv files can make it slow (although 'copying' into a materialized view [3] built on the foreign table makes indexing possible). Filesystem access for the postgres superuser is necessary.

    [1] Create Server

    [2] Create Foreign Table

    [3] Create Materialized View
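    The three steps above can be sketched in SQL; the file path and column layout below are invented for illustration:

```sql
-- step 1: once per database
CREATE EXTENSION file_fdw;

-- step 2: once; a "server" that simply points at the wrapper
CREATE SERVER IF NOT EXISTS csvserver FOREIGN DATA WRAPPER file_fdw;

-- step 3: once per csv file; columns must mirror the file's layout
CREATE FOREIGN TABLE contacts (
    fname text,
    phone text
) SERVER csvserver
  OPTIONS (filename '/path/to/contacts.csv', format 'csv', header 'true');

-- ordinary SQL now reads the underlying file on every query
SELECT * FROM contacts;
```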

Re: Module for intelligently analyzing and merging spreadsheet data
by kschwab (Vicar) on Feb 10, 2019 at 21:51 UTC
    Might be neat to use a Bayesian classifier to guess which existing column the data should be merged into. Use the column names as labels, and train on the data already in the main sheet. See Algorithm::NaiveBayes for a generic Perl implementation and this node for a short example of how to use it.
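    A rough sketch of that idea with Algorithm::NaiveBayes (column names and sample values are invented): train one instance per existing column, using token counts from its cells as attributes, then predict which column an unlabeled list of values belongs to:

```perl
use strict;
use warnings;
use Algorithm::NaiveBayes;

# The master sheet's columns; each column's values become one
# bag-of-tokens training instance labeled with the column's name.
my %master = (
    phone => ['555-1212', '555-3434'],
    email => ['a@x.com', 'b@y.org'],
);

my $nb = Algorithm::NaiveBayes->new;
for my $col (keys %master) {
    my %attrs;
    $attrs{$_}++ for map { split /\W+/ } @{ $master{$col} };
    $nb->add_instance(attributes => \%attrs, label => $col);
}
$nb->train;

# Guess the label for an unknown column's values.
my %unknown;
$unknown{$_}++ for map { split /\W+/ } ('555-9876');
my $scores = $nb->predict(attributes => \%unknown);  # { phone => ..., email => ... }
my ($best) = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
print "looks like: $best\n";
```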

      Thanks, yeah. I was actually looking at some of the Bayesian modules to see if they could help me determine whether a spreadsheet has a header row. I couldn't quite figure out how to do it, though. And there seem to be two different Bayesian modules, the one you mentioned and this one: AI::NaiveBayes

      I'm not sure which to use or if they do anything different.

      Here is the code I was experimenting with:

      my $classifier = AI::NaiveBayes->train(
          {
              attributes => { phone => 1, 'last name' => 1, 'fname' => 1, mobile => 1 },
              labels     => ['has header']
          },
      );

      # Classify a feature vector
      my $result = $classifier->classify({ fname => 5 });

      # $result is now an AI::NaiveBayes::Classification object
      my @predictors = $result->find_predictors;

      I wasn't sure where to go from there or how to better train it.


        That's a good idea, kschwab. I can explain some things, but I have no experience with either module.

        To train(), you need to pass in a lot of data cases, as a list of hashrefs, each one like the one you already have in your post:

        {
            attributes => { phone => 1, 'last name' => 1, 'fname' => 1, mobile => 1 },
            labels     => ['has header']
        },

        which means that, in this case, predictor "phone" has a weight of 1, "last name" the same, etc. And you, the human, have classified this case as "has header".

        What does a weight mean? Let's say, in your case, it is the number of times the token occurred in that single data case. Each data case will have its own weight for each predictor. A weight can also be other things, or a combination: number of times it occurs, whether it is capitalised, whether it is at the beginning of a sentence, etc.

        And so you continue with your next data case, and so on. Ideally you should represent all labels: "has header" and, I guess, "has no header". All of these go into a single array (of the hashrefs mentioned above) to be passed to train().

        Then it's time to classify some unknown cases. Using the couplet:

        my $result        = $classifier->classify({ phone => 3, fname => 0, ... });
        my $best_category = $result->best_category;

        $best_category will be one of "has header" or "has no header" for the particular data case you classify(). The classifier's $result can also tell you what influence each field/predictor had, via my @predictors = $result->find_predictors; (see AI::NaiveBayes::Classification)

        The trick is to find some predictors that you think differentiate the two labels. For example, one label has far fewer "phone" tokens and the other has a lot. Then a weight for each of the predictors has to be calculated by you, or, naively, just use the number of occurrences in each data case. Just to start. I am not sure whether predictors with zero weight for a particular data case have to be mentioned in train(), or will be inferred and set to zero when at least one data case mentions them and others do not. I think they will be inferred if absent from a particular data case but present in at least one other.

        Forgot to mention that a data case can belong to many labels! That's why labels => [...] takes an arrayref. (Note: data case = data row = a single observation.)

        Code taken from AI::NaiveBayes

        bw, bliako
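        Putting this together with the snippet upthread, a minimal two-label sketch might look like this (all weights and data cases are invented; real training would need many rows per label):

```perl
use strict;
use warnings;
use AI::NaiveBayes;

# Each hashref is one data case (one spreadsheet's first row, say),
# with token counts as weights and a human-assigned label.
my $classifier = AI::NaiveBayes->train(
    { attributes => { phone => 1, 'last name' => 1, fname => 1, mobile => 1 },
      labels     => ['has header'] },
    { attributes => { '555-1212' => 1, smith => 1, john => 1 },
      labels     => ['has no header'] },
);

# Classify an unseen first row.
my $result     = $classifier->classify({ phone => 1, fname => 1 });
my $best       = $result->best_category;       # the winning label
my @predictors = $result->find_predictors;     # which attributes drove the call
print "$best\n";
```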

Re: Module for intelligently analyzing and merging spreadsheet data
by nysus (Parson) on Feb 10, 2019 at 19:30 UTC
Re: Module for intelligently analyzing and merging spreadsheet data
by fishy (Friar) on Feb 12, 2019 at 14:43 UTC
    Hi nysus,

    I would recommend Data::Table.
    Pure Perl, zero dependencies, simple yet powerful.

    Cheers.
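    A small self-contained sketch of Data::Table (the file name and columns are invented): it reads a csv, reports its shape, and renames a column, which is the kind of normalisation needed before merging sheets:

```perl
use strict;
use warnings;
use Data::Table;

# Write a tiny csv so the example runs standalone (data is made up).
open my $fh, '>', 'contacts.csv' or die "write: $!";
print $fh "fname,phone\njohn,555-1212\njane,555-3434\n";
close $fh;

# fromCSV reads the header line by default.
my $t = Data::Table::fromCSV('contacts.csv');
print join(',', $t->header), "\n";   # column names, in file order
print $t->nofRow, " data rows\n";

# Normalise a column name before merging with another sheet.
$t->rename('fname', 'first_name');

unlink 'contacts.csv';
```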