
key value in text format

by pwagyi (Monk)
on Nov 06, 2019 at 03:49 UTC ( [id://11108346] : perlquestion )

pwagyi has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks!

I would like to seek advice from the monks regarding a problem I'm trying to solve. I am developing a Perl application to extract some information from raw binary data and store the extracted data, i.e. persist it somehow.

So I have a choice of using a relational database, a flat text file, or CSV/JSON/XML. The nature of the data is that each record has fixed fields plus optional key-value pairs (which may also expand in the future).

Example record: field1, field2, field3, [ key1=value1, key2=value2, ... ]

There may be multiple records (10K or so) extracted from each binary file. I need quick lookup of a record by key (the key is a composite of some of the fixed fields) across the records generated from all binary files. It's also required to know which binary file (file name) each record comes from.

So what data file format would experienced Monks recommend? Should I use Storable or Berkeley DB? Should I have a flat text file for each binary file and store some header info (like the source binary file, once in the header)?


Replies are listed 'Best First'.
Re: key value in text format
by GrandFather (Saint) on Nov 06, 2019 at 06:13 UTC

    Sounds like a good candidate for DBD::SQLite. Text-based formats aren't great for storing binary data and are pretty bad for "quick lookup". SQLite is lightweight to install and get going. In fact, the SQLite library is bundled with DBD::SQLite, so there is no other installation or configuration!
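A minimal sketch of what that looks like. The table and column names here are made up to match the question's example; treat this as an illustration, not the one true schema:

```perl
use strict;
use warnings;
use DBI;

# In-memory DB for illustration; use a file path (e.g. 'records.db') to persist.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do(q{
    CREATE TABLE record (
        field1      TEXT,
        field2      TEXT,
        field3      TEXT,
        source_file TEXT
    )
});

my $ins = $dbh->prepare(
    'INSERT INTO record (field1, field2, field3, source_file) VALUES (?,?,?,?)');
$ins->execute('a', 'b', 'c', 'dump1.bin');

# Quick lookup by the composite key (field1, field2), including which
# binary file the record came from:
my $row = $dbh->selectrow_hashref(
    'SELECT * FROM record WHERE field1 = ? AND field2 = ?', undef, 'a', 'b');
print "$row->{field3} from $row->{source_file}\n";   # prints "c from dump1.bin"
```

Add an index on (field1, field2) once the row counts grow and the lookups stay fast.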

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond

      Yes, it sounds great. SQLite is quite lightweight too. Just to clarify, the record data is textual (numeric/string) only. But how would a relational DB like SQLite address the potentially sparse key-value pairs?

        I think ideally you will model your (composite) key as separate columns. Then you can query either all key columns or just a subset of them.

        If you want to keep things simple, store the optional key-value pairs at the end as a single string. A slightly better approach, if you also want to query them, is to format and store them as JSON. Then you can query them in the database almost as if they were additional columns. The ideal way is to convert these optional things either into a fixed set of additional columns or into another table that consists of three columns, (row-key, keyname, value). But doing that makes the queries somewhat uglier.
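The JSON approach might look like the sketch below. It assumes a DBD::SQLite build with the JSON1 extension available (recent releases bundle an SQLite with it); the table layout is invented for the example:

```perl
use strict;
use warnings;
use DBI;
use JSON::PP;   # core module

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE record (field1 TEXT, field2 TEXT, extra TEXT)');

# Store the sparse optional pairs as one JSON string per row
my $extra = encode_json({ key1 => 'value1', key2 => 'value2' });
$dbh->do('INSERT INTO record VALUES (?, ?, ?)', undef, 'a', 'b', $extra);

# Query them almost like columns via json_extract (needs JSON1)
my ($v) = $dbh->selectrow_array(
    q{SELECT json_extract(extra, '$.key1') FROM record WHERE field1 = ?},
    undef, 'a');
print "$v\n";   # prints "value1"
```

Rows that lack a given key simply return NULL from json_extract, which is a reasonable fit for sparse data.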

        I second SQLite. I use it a lot, and if your dataset is big(ish), you don't want to be parsing flat files for every lookup.

        How you store this depends on just how many [ key1=value1, key2=value2] there are.
        If there is a fixed list of 'key1','key2' and it's not too many: bite the bullet and add the columns.

        Otherwise, you get to use a link table (this is Database 101; search for 'database normalization').

        each record should be
        [id], field1, field2

        then in a separate table
        [link_id], key1, value1
        [link_id], key2, value2

        You can also have a 3rd table with
        [link_id], source_file_name
        or whatever other metadata you need to keep.

        SQLite very kindly has a magic ROWID column. It won't return it on SELECT *, but will on SELECT ROWID,*.
        Of course you get to do two queries for each lookup: SELECT ROWID,* FROM main_data and then SELECT * FROM extra_data WHERE link_id = ?


        To clarify: if your 'real' data is
        name=bob, age=75, [ hair=balding, glasses=bifocal ]
        name=john, age=20, [ sport=chess ]

        Then you get:
        Primary table:
        1, bob, 75
        2, john, 20

        lookup table:
        1, hair, balding
        1, glasses, bifocal
        2, sport, chess
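The two-table layout and the two-step lookup described above can be sketched with DBD::SQLite like this (table names follow the post; everything else is illustrative):

```perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1 });

$dbh->do('CREATE TABLE main_data  (name TEXT, age INTEGER)');
$dbh->do('CREATE TABLE extra_data (link_id INTEGER, key TEXT, value TEXT)');

# Insert one record plus its sparse extras, linked on the implicit rowid
$dbh->do('INSERT INTO main_data VALUES (?, ?)', undef, 'bob', 75);
my $id = $dbh->last_insert_id(undef, undef, 'main_data', undef);
$dbh->do('INSERT INTO extra_data VALUES (?, ?, ?)', undef, $id, @$_)
    for ['hair', 'balding'], ['glasses', 'bifocal'];

# The two queries per lookup
my ($rowid, $name, $age) = $dbh->selectrow_array(
    'SELECT rowid, * FROM main_data WHERE name = ?', undef, 'bob');
my $extras = $dbh->selectall_arrayref(
    'SELECT key, value FROM extra_data WHERE link_id = ?', undef, $rowid);
print "$name ($age): $_->[0]=$_->[1]\n" for @$extras;
```

An index on extra_data(link_id) keeps the second query cheap as the lookup table grows.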

      I'm not sure whether I should create a new post. Let's say I'm using SQLite: SQLite supposedly cannot be reliably locked on NFS, so multiple processes could not safely manipulate the database without corruption. The application may be running multiple instances to crunch data files and insert into a database file.

      If I create an individual database file for each raw file, then I have to merge them and manage parent/foreign-key constraints. The final database file should reside on NFS so other applications can access (read) it.

        That just raises the DB requirements and puts any text-file-based solution right out of the running. The solution is essentially the same, but use one of the other DBD:: modules to access your database of choice. Once you get beyond simple requirements that can easily be met by SQLite, you will find fairly strong divisions between proponents of different databases.

        The saving grace is that you can do a lot of development against DBD::SQLite then migrate to using the more serious database of your choice when the system needs to hit the NFS.
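One way to make that migration painless is to keep the DSN out of the code. This is a hypothetical setup (the APP_DSN/APP_DB_USER/APP_DB_PASS environment variable names are invented for the example):

```perl
use strict;
use warnings;
use DBI;

# Same code runs against SQLite in development and a server DB in production;
# only the DSN in the environment changes.
my $dsn  = $ENV{APP_DSN}     // 'dbi:SQLite:dbname=:memory:';
my $user = $ENV{APP_DB_USER} // '';
my $pass = $ENV{APP_DB_PASS} // '';
my $dbh  = DBI->connect($dsn, $user, $pass, { RaiseError => 1 });

# Stick to portable SQL and the driver swap is (mostly) painless:
$dbh->do('CREATE TABLE record (field1 TEXT, field2 TEXT)');
$dbh->do('INSERT INTO record VALUES (?, ?)', undef, 'a', 'b');
my ($n) = $dbh->selectrow_array('SELECT COUNT(*) FROM record');
print "$n row(s)\n";   # prints "1 row(s)"
```

Driver-specific SQL (SQLite's ROWID, for instance) is what you'd have to revisit when switching.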
