Reducing application footprint: large text files

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Reducing application footprint: large text files by LanX (Saint) on Feb 28, 2018 at 21:28 UTC
> Can this be packed down into a simple binary file?
Yes, literally! :) Use `pack` and `unpack`.
Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :) Wikisyntax for the Monastery
Re^2: Reducing application footprint: large text files by Anonymous Monk on Feb 28, 2018 at 22:15 UTC
Thanks, I'll look into pack and unpack. I think the workflow would then become: during the build procedure, run the PMs through pack to create a binary version of the file; at runtime, read the binary file via unpack. Is that what a Perl expert would do? Matt.
Re^3: Reducing application footprint: large text files by pryrt (Abbot) on Feb 28, 2018 at 23:11 UTC
Most of the comments I've seen have assumed this is data. But your phrasing of "run the PM's through pack..." implies that this isn't data, per se, but an actual Perl module (a `.pm` that you're accessing via `use Some::Module`). If this is true, I am not sure that rolling your own is the best choice. You might want to clarify this point: is the file you're trying to read, which you called "the PM's", pure data, data in Perl format, data plus other Perl code (such as functions, for loops, etc.), or something else?
I don't know specifically of a CPAN module that allows loading of a compressed module, but it would surprise me if there wasn't one (a quick search for "perl compress module" finds Perl modules that compress something else, not Perl modules that allow you to compress your source code). Or there might be something like Acme::Buffy, which modifies the source code. I just don't know what that module would be... but maybe my phrasing will spark something in a more experienced monk.
I hesitate to recommend Module::Crypt, because it doesn't really do what the name implies: never rely on Module::Crypt to protect your source code from prying eyes; it will not keep it secret! But I mention it nonetheless because I think that maybe the XS output from Module::Crypt would be smaller than your 10MB++ Perl module. I don't know if it would be, but it might be something to try.
Re^4: Reducing application footprint: large text files by salva (Canon) on Mar 01, 2018 at 10:09 UTC
Re^4: Reducing application footprint: large text files by LanX (Saint) on Feb 28, 2018 at 23:30 UTC
Re^4: Reducing application footprint: large text files by Anonymous Monk on Mar 01, 2018 at 00:44 UTC
Re^5: Reducing application footprint: large text files by Marshall (Canon) on Mar 01, 2018 at 01:31 UTC
Re^3: Reducing application footprint: large text files by LanX (Saint) on Feb 28, 2018 at 22:40 UTC
> Is that what a Perl expert would do?
A Perl expert would ask for more details. You can certainly do what you described... but there is probably an even better solution.
Cheers Rolf
Re: Reducing application footprint: large text files by johngg (Canon) on Feb 28, 2018 at 22:53 UTC
Difficult to tell without seeing some more data. For instance, are those fields consistent in number or do they vary? What are the max and min values of the hex numbers? If the fields vary to an extent that would make pack templates impractical, you might want to have a look at the core Storable module, perhaps in conjunction with IO::Compress::Bzip2 and its sibling IO::Uncompress::Bunzip2. Cheers, JohnGG
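A minimal sketch of the Storable-plus-Bzip2 route mentioned above, using only core modules. The contents of `%my_def` are hypothetical stand-ins, since the OP's real data is not shown in this thread, and everything stays in memory here; a real build step would write the compressed bytes to a file.

```perl
use strict;
use warnings;
use Storable qw(nfreeze thaw);
use IO::Compress::Bzip2     qw(bzip2   $Bzip2Error);
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

# Hypothetical stand-in for the OP's generated data.
my %my_def = (
    CTRL   => { width => 64, access => 'rw' },
    STATUS => { width => 64, access => 'ro' },
);

# Build step: serialize in network byte order, then compress.
my $frozen = nfreeze( \%my_def );
bzip2( \$frozen => \my $compressed )
    or die "bzip2 failed: $Bzip2Error";

# Runtime step: decompress, then rebuild the structure.
bunzip2( \$compressed => \my $serialized )
    or die "bunzip2 failed: $Bunzip2Error";
my $restored = thaw($serialized);

print "CTRL is $restored->{CTRL}{width} bits wide\n";
```

Whether this actually shrinks the footprint depends on the data, so it is worth measuring on the real files before committing to it.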
Re^2: Reducing application footprint: large text files by swl (Parson) on Feb 28, 2018 at 23:58 UTC
Sereal is a more recent alternative, and its encoder provides data compression: https://metacpan.org/pod/Sereal::Encoder#compress. A few stats for an earlier version are at https://blog.booking.com/sereal-a-binary-data-serialization-format.html.
Re^3: Reducing application footprint: large text files by Anonymous Monk on Mar 01, 2018 at 01:01 UTC
Very interesting. Sereal deserves some attention too. I'll read through that. Thanks, Matt.
Re^2: Reducing application footprint: large text files by Anonymous Monk on Mar 01, 2018 at 00:58 UTC
There are two data structures that remain the same. The first structure describes bit-fields within a 64-bit register. The second structure describes some meta-attributes about the register. Values range from 0 to 2^64 - 1. Given that these data structures do not vary, it sounds like pack templates might be the way to go. One challenge may be that the first data structure is an array with a varying number of elements, although the layout of each element is always the same. Thanks for pointing out Storable and Bzip2. That is more food for thought along the way. Thanks, Matt.
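A rough sketch of how pack templates can cope with a varying number of elements: each register becomes a length-prefixed record, so the reader never has to guess where the next one starts. The layout of `%my_flds` (register name mapped to a list of `[field name, msb, lsb]`) and the function names `pack_flds` and `unpack_flds` are assumptions made for illustration, not the OP's actual format.

```perl
use strict;
use warnings;

# Hypothetical data: register name => list of [ field name, msb, lsb ].
my %my_flds = (
    CTRL   => [ [ enable => 0, 0 ], [ mode => 3, 1 ], [ reserved => 63, 4 ] ],
    STATUS => [ [ busy => 0, 0 ] ],
);

# Build step: one length-prefixed record per register.
sub pack_flds {
    my ($flds) = @_;
    my $blob = '';
    for my $reg ( sort keys %$flds ) {
        # Each field: 1-byte-length-prefixed name, then msb and lsb as bytes.
        my $body = pack '(C/a* C C)*', map { @$_ } @{ $flds->{$reg} };
        # Each record: length-prefixed register name, then length-prefixed body.
        $blob .= pack 'C/a* n/a*', $reg, $body;
    }
    return $blob;
}

# Runtime step: walk the records and rebuild the hash of arrays.
sub unpack_flds {
    my ($blob) = @_;
    my %flds;
    my $pos = 0;
    while ( $pos < length $blob ) {
        my $template = sprintf '@%d C/a* n/a*', $pos;    # jump to record start
        my ( $reg, $body ) = unpack $template, $blob;
        $pos += 1 + length($reg) + 2 + length($body);    # skip prefixes + data
        my @flat = unpack '(C/a* C C)*', $body;
        my @fields;
        push @fields, [ splice @flat, 0, 3 ] while @flat;
        $flds{$reg} = \@fields;
    }
    return \%flds;
}

my $blob     = pack_flds( \%my_flds );
my $restored = unpack_flds($blob);
printf "%d bytes, CTRL has %d fields\n", length $blob, scalar @{ $restored->{CTRL} };
```

The one-byte and two-byte length prefixes cap names at 255 bytes and record bodies at 65535 bytes; wider prefixes such as `N/a*` would lift those limits at the cost of a few bytes per record.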
Re^3: Reducing application footprint: large text files by BrowserUk (Patriarch) on Mar 01, 2018 at 11:12 UTC
> There are two data structures that remain the same... the first structure describe bit-fields within a 64-bit register. The 2nd structure describes some meta-attributes about the register. min to max will be 0 to 2^64 - 1. So given these data structures are not varying, it is sounding like pack templates might be the way to go. Perhaps there will be a challenge in that the 1st data structure is an array with varying numbers of elements, although the structure will always be the same.
The OP shows two hashes, one of which is a hash of arrays. Above you say "the 1st data structure is an array with varying numbers of elements"? The OP mentions "many 10's of MB of computer generated data files" and shows two small data structures.
My point is that you are not giving us clear information. If you want actual help rather than speculative possibilities, you need to be clearer and more accurate in the specification of the problem. I.e.:
Is this two files, each containing a huge version of one of the OP data structures? Or are there myriad files for each type of data structure? Or myriad files containing the two versions of the OP data structures?
How many MBs? Spread across how many files?
Are the sub data structures fixed or variable in length? Note: if the top-level entity in a file has a variable length, that's easily accommodated; but if the sub structures vary in length, that's harder. I.e. if the hash of arrays contains a variable number of hash elements but the values are fixed-length arrays, that's easily handled; but if the arrays vary in length, that's much harder.
Does the application need to load all of the "10s of MBs" at once for every run, or does it only use a small subset for each run?
So many more questions before I would choose an approach to solving your problem.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity. In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit
Re: Reducing application footprint: large text files by LanX (Saint) on Feb 28, 2018 at 22:52 UTC
Apart from pack and unpack... `Storable` might be another, easier option, but I don't know much about the achievable compression. A third option is to zip the raw data; you could use Archive::Zip for it. The best choice depends on the details... Cheers Rolf
Re^2: Reducing application footprint: large text files by afoken (Chancellor) on Mar 03, 2018 at 15:44 UTC
I would avoid Storable for anything but short-lived temporary data. See also Re^2: DBI fetchall_hashref convert to scalar, Re: Accessing variables in an external hash without eval, Re^2: Storing state of execution. Alexander -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Reducing application footprint: large text files by QM (Parson) on Mar 01, 2018 at 10:59 UTC
These are essentially Perl structures compatible with JSON. A simple idea comes to mind: convert these to JSON, and zip them. (If necessary, you can convert the live structures into JSON and write files.) In the embedded system, you can unzip the data, streaming it into a Perl script, convert from JSON to Perl structures, and populate Perl vars. I haven't looked further than that, but I suspect there is some simple boilerplate to read in a zipped JSON file and get a ref to the structure. The last piece of the puzzle is putting the data into the expected variables (you have shown `%my_flds` and `%my_def`). -QM -- Quantum Mechanics: The dreams stuff is made of
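The boilerplate might look roughly like the sketch below. It swaps in gzip (core IO::Compress::Gzip) for zip to avoid a non-core dependency, and uses core JSON::PP. The contents of `%my_flds` and `%my_def` are hypothetical, as are the top-level keys `my_flds` and `my_def` chosen for the JSON document.

```perl
use strict;
use warnings;
use JSON::PP ();
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical stand-ins for the OP's two structures.
my %my_flds = ( CTRL => [ [ 'enable', 0, 0 ], [ 'mode', 3, 1 ] ] );
my %my_def  = ( CTRL => { width => 64, access => 'rw' } );

# Build step: both structures go into one JSON document, then get gzipped.
my $json = JSON::PP->new->canonical->encode(
    { my_flds => \%my_flds, my_def => \%my_def } );
gzip( \$json => \my $gz ) or die "gzip failed: $GzipError";

# Runtime step: gunzip, decode, and copy into the expected variables.
gunzip( \$gz => \my $text ) or die "gunzip failed: $GunzipError";
my $data     = JSON::PP->new->decode($text);
my %flds_out = %{ $data->{my_flds} };
my %def_out  = %{ $data->{my_def} };

print "CTRL access: $def_out{CTRL}{access}\n";
```

Passing a filename instead of `\my $gz` or `\$gz` makes `gzip` and `gunzip` write to or read from disk, which is probably what the build and runtime steps would do in practice.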
Re: Reducing application footprint: large text files by Anonymous Monk on Feb 28, 2018 at 21:17 UTC
Well, could you put that data into SQLite database tables?
Re^2: Reducing application footprint: large text files by Anonymous Monk on Feb 28, 2018 at 22:17 UTC
SQLite is a nice suggestion. Unfortunately, I don't have that option in this case.
Re: Reducing application footprint: large text files by Anonymous Monk on Feb 28, 2018 at 22:51 UTC
One thing that I would definitely do is to put all of this logic into one, possibly two, Perl modules which are tasked with maintaining the entire storage system. I would also "future-proof" the design by prefixing the file with a file-version identifier so that the file, whatever it turns out to be, is "self-describing" to programming that is in the know ... programming which occurs in exactly one place or set of places.
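A minimal sketch of such a file-version prefix, assuming a 4-byte tag followed by a 16-bit version number and a length-prefixed payload. The tag `REGD`, the constant names, and the functions `wrap_payload` and `unwrap_payload` are all invented for illustration.

```perl
use strict;
use warnings;

use constant FILE_MAGIC     => 'REGD';    # hypothetical 4-byte file tag
use constant FORMAT_VERSION => 1;         # bump when the layout changes

# Writer: tag, version, then the payload with a 32-bit length prefix.
sub wrap_payload {
    my ($payload) = @_;
    return pack 'a4 n N/a*', FILE_MAGIC, FORMAT_VERSION, $payload;
}

# Reader: refuse anything that is not a known tag and version.
sub unwrap_payload {
    my ($blob) = @_;
    die "truncated header\n" if length($blob) < 10;
    my ( $magic, $version, $payload ) = unpack 'a4 n N/a*', $blob;
    die "not a register data file\n" unless $magic eq FILE_MAGIC;
    die "unsupported format version $version\n"
        unless $version == FORMAT_VERSION;
    return $payload;
}

my $file_bytes = wrap_payload('packed register data goes here');
print unwrap_payload($file_bytes), "\n";
```

The payload can be whatever the rest of the design settles on (packed records, compressed JSON, and so on); only this one pair of functions needs to know about the header.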
Re^2: Reducing application footprint: large text files by BrowserUk (Patriarch) on Feb 28, 2018 at 23:19 UTC
@theOP: One thing you definitely want to do is ignore the guy (sundialsvc4 "incognito"™) I'm responding to. See http://perlmonks.com/?node=worst+nodes and scroll to the bottom to see why.
Re^3: Reducing application footprint: large text files by Anonymous Monk on Mar 01, 2018 at 01:08 UTC
I went to that link, scrolled down to the bottom (and middle, and toward the bottom :)), but the reference eludes me, sorry. Matt.
Re^2: Reducing application footprint: large text files by Anonymous Monk on Mar 01, 2018 at 01:04 UTC
The designers of this (not me) are on the same page. This data file is a separate module with a versioned module name. Good to know that this, at least, meets best practice. Thanks, Matt.

