Strategy for simple data management

bojinlund has asked for the wisdom of the Perl Monks concerning the following question:

Background

I am working in a small research project with limited resource, where a lot of data (measured and calculated values) need to be handled. Data comes from scientific experiments and are usually first stored in Excel spreadsheets. One spreadsheet contains data from a few experiments. For each experiment there is typically some basic information (50 data items) and a number of time series of measured values (10 series, 100 points in time and 30 measured values for each time). There exists about 100 old spreadsheets and some hundred new will be created. The old spreadsheets are similar but not standardised.

Representation of a quantity

A quantity is a property that is measured. Example: mass, length, time. A unit is a standard quantity against which a quantity is measured. Example: gram, metre, second; which are units of the above quantities.

This can also be described by:

A = {A} * [A]
[download]

A is the symbol for the quantity, {A} symbolizes the numerical value of A, and [A] represents the corresponding unit. (e.g., A = 300 * m = 0.3 * km). {A} is often called the measured value.

Example of thing need for a quantity are:

Value
Unit
Property measured/calculated (temperature of air)
Measuring/calculation method

Representation of unit

The Unit must be represented in a consistent way. For Units from The International System of Units (SI) can the exponents of the base units be used. SI has 6 base units (meter [m], kilogram [kg], second [s], … ). The derived units can be expressed using exponents of the base units (area [m2], speed [ms-1]).

Goals and restrictions

Primary goals are:

In a consistent way store the data from the experiments
Make it possible for Perl programs to use the stored data
Make it possible for Perl programs to create new sets of stored data
Make the data available to JavaScript programs

MS Window systems are used.

Design ideas

Use one text file to represent data from one experiment. The text files are stored in a directory structure in a normal file system.
For each spreadsheet there is one file associated, containing the additional information needed to create “standardised” representation from a spreadsheet. A Perl scripts is used to create the “standardised” representation.
A temporary database is created for each purpose. (I do not think it is possible to have one on-line database with everything.) A Perl script load the temporary databases from a number of “standardised” files.
A database implementation which can provide accessed to Perl and JavaScript program is selected.
Make it possible from Perl to create new or augment “standardised” files. (Write to the database and then create “standardised” files with the new and updated data.)

Questions

Are there any similar Perl-based system already implemented?
Is there a better design strategy? What should be changed?
What type of database can be used? Is MongoDB possible and suitable?
Suitable format for the “standardised” files.
Perl modules for handling SI units and conversion between such unit.

Comment on Strategy for simple data management Select or Download Code

Replies are listed 'Best First'.
Re: Strategy for simple data management by basiliscos (Pilgrim) on Jan 10, 2014 at 18:11 UTC
Very simple approach: just store your experiment files json fomat, 1 file per experiment. And in the file just an array of json-hash like `[ { value: "..", unit: "..", ...} ];` [download] Of course, I suppose, that you will not need any data search/aggregation etc., esp. if you are going to implement that analysis in pure JS.	[reply] [d/l]
Re: Strategy for simple data management by sundialsvc4 (Abbot) on Jan 10, 2014 at 23:25 UTC
Pragmatically speaking, you will have to approach tasks like this one in several very-distinct “layers” ... The first step is to get all of the data from any spreadsheet-file into a common data store ... e.g. an SQL database (SQLite file?). Grab it exactly as-is, and arrange this data-intake script so that you are able to verify (from the database entries) that all of the available spreadsheets have in fact been imported ... when, by whom, and so on. If you re-import a file that has already been previously imported, all of the preceding data should be cleanly replaced. After all, the greatest threat to the data-integrity of the entire study is that data is missing, or that it is duplicated. The next step is standardization: without altering the original “capture” data, this step converts apples to consistent oranges. This process, once again, must be entirely reproducible. It should create new, standardized data-tables from the data-capture originals. If any of the input data does not conform to whatever validation rules you can come up with, it should be very-clearly flagged as non-conforming. The final step is ... whatever your analysis needs to be. This step will rely very heavily upon all of the preceding steps to have delivered a data-set that is both complete and consistent, and/or to have clearly “blown the whistle” if something is wrong ... even if (especially if?) the source of the inconsistency is “the work of an experimenter.” Always bear in mind that “only the computer itself” can be relied-upon to detect omissions or inconsistencies in a mass of collected data. The scripts that comprise your pipeline must be not ony reliable but error-aware. You can certainly use Perl for each of these steps. (In a Windows environment, yes, Perl does OLE...) Unfortunately, the exact nature of what needs to be built, and of how to correctly use what has been built, will be completely determined by what you need to do in this project.	[reply]
Re: Strategy for simple data management by djerius (Beadle) on Jan 15, 2014 at 15:42 UTC
Physics::Unit can handle your units requirements.	[reply]
Re: Strategy for simple data management by tangent (Parson) on Jan 17, 2014 at 01:58 UTC
With regard to how you store your data in a standardised format I would suggest you consider CSV files. Easy to manipulate with Perl: Text::CSV_XS is fast and robust Text::CSV::Merge allows you to merge files to fill in missing data or update with new data Tie::Array::CSV enables you to access each file as an array of arrays, and use standard array functions on the data DBIx::SQLEngine::Driver::CSV allows you to perform SQL queries. Easy to use with javascript - many javascript libraries can read and manipulate CSV files. For example, D3.js can pull in a CSV file and generate an HTML table, or create interactive charts with transitions and interaction - I think you will find that library very useful. Easy to import and export to/from spreadsheets. Easy to backup - you can store them on a thumb drive or optical media. Easy to 'debug' - when you encounter problems with your data you can open your file with a text editor and see exactly what data you have. Easy to share via email or in the cloud.	[reply]

Back to Seekers of Perl Wisdom