bojinlund has asked for the wisdom of the Perl Monks concerning the following question:


I am working in a small research project with limited resources, where a lot of data (measured and calculated values) needs to be handled. The data comes from scientific experiments and is usually first stored in Excel spreadsheets. One spreadsheet contains data from a few experiments. For each experiment there is typically some basic information (about 50 data items) and a number of time series of measured values (10 series, 100 points in time, and 30 measured values for each time point). There are about 100 old spreadsheets, and a few hundred new ones will be created. The old spreadsheets are similar but not standardised.

Representation of a quantity

A quantity is a property that is measured. Examples: mass, length, time. A unit is a standard quantity against which a quantity is measured. Examples: gram, metre, second, which are units of the above quantities.

This can also be described by:

A = {A} * [A]

where A is the symbol for the quantity, {A} symbolizes the numerical value of A, and [A] represents the corresponding unit (e.g., A = 300 * m = 0.3 * km). {A} is often called the measured value.

Examples of things needed for a quantity are:

Representation of a unit

The unit must be represented in a consistent way. For units from the International System of Units (SI), the exponents of the base units can be used. SI has 7 base units: metre [m], kilogram [kg], second [s], ampere [A], kelvin [K], mole [mol], and candela [cd]. Derived units can be expressed using exponents of the base units (area [m2], speed [m s-1]).
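One way to make the exponent representation concrete is to store a unit as a hash mapping base-unit symbols to exponents; multiplying two quantities then just adds the exponents of their units. This is only a minimal sketch — the hash layout and the function name are illustrative, not from the original post:

```perl
use strict;
use warnings;

# A unit as a hash of base-unit exponents:
# speed = m s^-1, time = s
my %speed = ( m => 1, s => -1 );
my %time  = ( s => 1 );

# Multiplying two quantities adds the exponents of their units.
sub multiply_units {
    my ($a, $b) = @_;
    my %result = %$a;
    $result{$_} = ($result{$_} // 0) + $b->{$_} for keys %$b;
    # Drop base units whose exponent cancelled to zero.
    delete $result{$_} for grep { $result{$_} == 0 } keys %result;
    return \%result;
}

my $distance_unit = multiply_units(\%speed, \%time);
# $distance_unit is now { m => 1 }, i.e. metres
```

The same structure also makes dimension checking cheap: two quantities may be added only if their exponent hashes are identical.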

Goals and restrictions

Primary goals are:

MS Windows systems are used.

Design ideas


Replies are listed 'Best First'.
Re: Strategy for simple data management
by sundialsvc4 (Abbot) on Jan 10, 2014 at 23:25 UTC

    Pragmatically speaking, you will have to approach tasks like this one in several very-distinct “layers” ...

    1. The first step is to get all of the data from any spreadsheet-file into a common data store ... e.g. an SQL database (SQLite file?).   Grab it exactly as-is, and arrange this data-intake script so that you are able to verify (from the database entries) that all of the available spreadsheets have in fact been imported ... when, by whom, and so on.   If you re-import a file that has already been previously imported, all of the preceding data should be cleanly replaced.   After all, the greatest threat to the data-integrity of the entire study is that data is missing, or that it is duplicated.
    2. The next step is standardization:   without altering the original “capture” data, this step converts apples to consistent oranges.   This process, once again, must be entirely reproducible.   It should create new, standardized data-tables from the data-capture originals.   If any of the input data does not conform to whatever validation rules you can come up with, it should be very-clearly flagged as non-conforming.
    3. The final step is ... whatever your analysis needs to be.   This step will rely very heavily upon all of the preceding steps to have delivered a data-set that is both complete and consistent, and/or to have clearly “blown the whistle” if something is wrong ... even if (especially if?) the source of the inconsistency is “the work of an experimenter.”   Always bear in mind that “only the computer itself” can be relied-upon to detect omissions or inconsistencies in a mass of collected data.   The scripts that comprise your pipeline must be not only reliable but error-aware.

    You can certainly use Perl for each of these steps.   (In a Windows environment, yes, Perl does OLE...)   Unfortunately, the exact nature of what needs to be built, and of how to correctly use what has been built, will be completely determined by what you need to do in this project.
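    Step 1 above could be sketched in Perl along these lines, assuming the DBI and DBD::SQLite modules; the database file, table, and column names are illustrative, not prescribed by the reply:

```perl
use strict;
use warnings;
use DBI;

# Connect to (or create) the common SQLite data store.
my $dbh = DBI->connect("dbi:SQLite:dbname=study.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

# Track every import so completeness can be verified later.
$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS imports (
    file        TEXT PRIMARY KEY,
    imported_at TEXT NOT NULL,
    imported_by TEXT NOT NULL
)
SQL

# Re-importing a file cleanly replaces the previous record,
# matching the "cleanly replaced" requirement above.
sub record_import {
    my ($file, $who) = @_;
    $dbh->do("INSERT OR REPLACE INTO imports (file, imported_at, imported_by)
              VALUES (?, datetime('now'), ?)", undef, $file, $who);
}

record_import("experiment_001.xls", "bojinlund");
```

    The actual spreadsheet parsing (e.g. via Spreadsheet::ParseExcel or OLE) would feed rows into further tables keyed to the `imports` entry.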

Re: Strategy for simple data management
by basiliscos (Pilgrim) on Jan 10, 2014 at 18:11 UTC

    A very simple approach: just store your experiment data in JSON format, one file per experiment.

    And in each file just an array of JSON objects, like:

    [ { "value": "..", "unit": "..", ... } ]

    Of course, I suppose that you will not need any data search/aggregation etc., especially if you are going to implement the analysis in pure JS.
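    A fuller per-experiment file along those lines might look like the following — all field names here are hypothetical, chosen only to match the basic-info-plus-time-series shape described in the question:

```json
{
  "experiment": "exp-042",
  "basic_info": { "operator": "...", "date": "..." },
  "series": [
    {
      "name": "temperature",
      "points": [
        { "time": 0,  "value": 293.1, "unit": "K" },
        { "time": 60, "value": 295.4, "unit": "K" }
      ]
    }
  ]
}
```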

Re: Strategy for simple data management
by djerius (Beadle) on Jan 15, 2014 at 15:42 UTC
Re: Strategy for simple data management
by tangent (Vicar) on Jan 17, 2014 at 01:58 UTC
    With regard to how you store your data in a standardised format, I would suggest you consider CSV files.
    • Easy to manipulate with Perl - modules such as Text::CSV handle quoting and embedded separators for you.

    • Easy to use with JavaScript - many JavaScript libraries can read and manipulate CSV files. For example, D3.js can pull in a CSV file and generate an HTML table, or create interactive charts with transitions - I think you will find that library very useful.
    • Easy to import and export to/from spreadsheets.
    • Easy to backup - you can store them on a thumb drive or optical media.
    • Easy to 'debug' - when you encounter problems with your data you can open your file with a text editor and see exactly what data you have.
    • Easy to share via email or in the cloud.
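    The Perl side of this can be sketched with the Text::CSV module; the file name and columns below are just an example, not part of the original reply:

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => "\n" });

# Write a small standardised file: header row plus two data rows.
open my $out, '>', 'experiment.csv' or die $!;
$csv->print($out, $_) for (
    [ 'time', 'value', 'unit' ],
    [ 0,      300,     'm'    ],
    [ 60,     450,     'm'    ],
);
close $out;

# Read it back, skipping the header row.
open my $in, '<', 'experiment.csv' or die $!;
my $header = $csv->getline($in);
while (my $row = $csv->getline($in)) {
    printf "%s %s at t=%s\n", $row->[1], $row->[2], $row->[0];
}
close $in;
```

    The same files open directly in Excel, which covers the import/export bullet above.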