PerlMonks
Re^2: Human-readable serialization formats other than YAML?

by jasonk (Parson)
on Apr 23, 2008 at 13:59 UTC ( [id://682379] )


in reply to Re: Human-readable serialization formats other than YAML?
in thread Human-readable serialization formats other than YAML?

The point of making it human-readable wasn't that it was going to be reviewed by a human at test time (in fact I am using test modules just as you suggest); it was simply to make it easier for me to review the original data in the event that a test fails, so I can more easily determine what went wrong.

True, the data is currently perfectly readable, but I would like to store it in the same file as the processed data associated with it, so everything is easier to manage and the data isn't scattered across multiple files.

In essence, what I'm trying to do is something like this...

### At dev-time, when this processor is known to be working
my $file = shift;
my $data = file( $file )->slurp;                  # Path::Class file()
my $test1 = MyApp::Processor->process( $data );
YAML::DumpFile( $file, $data, $test1 );

### Then, sometime later in a test script...
my @files = <test-files/*.yml>;
plan tests => scalar @files;
for my $file ( @files ) {
    my ( $data, $test1 ) = YAML::LoadFile( $file );
    my $test2 = MyApp::Processor->process( $data );
    eq_or_diff( $test1, $test2, "$file not broken yet!" );   # Test::Differences
}

www.jasonkohles.com
We're not surrounded, we're in a target-rich environment!

Re^3: Human-readable serialization formats other than YAML?
by tachyon-II (Chaplain) on Apr 24, 2008 at 00:11 UTC

    Nothing you have said so far explains the need for YAML/Data::Dump etc. serialisation to me. Your data is already human readable. Storing the original and processed data in the same file is as simple as adding a separator. You don't need any fancy modules to do it.

    while (<DATA>) {
        if (m/<ORIG DATA ABOVE MUNGE BELOW>/) {
            $munge .= $_ while <DATA>;   # slurp the rest
        }
        else {
            $orig .= $_;
        }
    }
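    For completeness, the writing side is just as plain. A minimal sketch, assuming a hypothetical save_pair() helper and the same separator string (anything that cannot occur in the data will do):

    sub save_pair {
        my ( $file, $orig, $munge ) = @_;
        open my $fh, '>', $file or die "Can't write $file: $!";
        print {$fh} $orig;
        print {$fh} "<ORIG DATA ABOVE MUNGE BELOW>\n";
        print {$fh} $munge;
        close $fh or die "Can't close $file: $!";
    }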

    As an added benefit of keeping it simple, you can leverage diff to do the data comparison for your eq_or_diff() routine if a simple eq test fails (a rough sketch follows). I really think you are over-complicating the task by adding a middleware serialisation layer. You are not actually using it to reconstitute a data structure, nor is there any real need, as all you want to do is reformat the old data into the new format so you can process it. Why add useless middleware that only offers the opportunity to introduce bugs for no real gain?
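    For instance, something along these lines; Text::Diff and the Unified style are my choice here, not something you are already using:

    use Text::Diff;

    # Cheap equality test first; only compute a full diff when it fails.
    if ( $munge ne $cur_format ) {
        my $diff = diff( \$munge, \$cur_format, { STYLE => 'Unified' } );
        print "$file\n$diff\n";
    }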

    As I see it you need a base class that has the functions:

    my ( $orig, $munge ) = load_file($file);     # munge may be NULL
    my $data       = parse($orig);               # process current format data only
    my $cur_format = serialise($data);           # output current format
    write_file( $orig, $cur_format );            # write to file with separator
    my $invalid = eq_or_diff( $munge, $cur_format );
    print "$file\n$invalid\n" if $invalid;       # diff output, null if OK

    Each filter class only requires a parse() method to generate whatever data structure you want to work with in your ultimate program.
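    Something like this, say; the package names and the colon-separated input format are purely illustrative:

    package MyFilter::OldFormat;
    use parent -norequire, 'MyFilter::Base';

    sub parse {
        my ( $class, $orig ) = @_;
        my %record;
        for my $line ( split /\n/, $orig ) {
            my ( $key, $value ) = split /\s*:\s*/, $line, 2;
            $record{$key} = $value if defined $value;
        }
        return \%record;    # the standardised internal data structure
    }

    1;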

    You probably already have parse code that works with current data. The serialise method simply writes that data structure back into a string you can save. For current data the result may or may not be byte-identical to the current data format, but the process is valid if a base-class parse on both the original and the munged data serialises to the same result, because then it is round-tripping.

    Essentially what I am saying is: don't use serialisation middleware. Write your own code that takes your data structure (which you need) and serialises it *into the current format* (which you also need, mostly for validation). The filters then become simply a parse method that generates your standard internal data structure. Note that if your internal data structure uses hashes, make sure you apply a sort or an explicit list ordering to the keys during serialisation (see the fragment below). If you don't, it will probably bite you; it has bitten me before, as key return order is not guaranteed and differs between perl versions and operating systems for exactly the same data.
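    A trivial illustration of the key-order point; the key: value output format here is made up:

    # Illustrative serialise(): sorting the keys makes the output
    # byte-for-byte stable across perl versions and platforms.
    sub serialise {
        my ( $class, $data ) = @_;
        my $out = '';
        for my $key ( sort keys %$data ) {
            $out .= "$key: $data->{$key}\n";
        }
        return $out;
    }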

    Doing it this way gives you:

    1. Human readable output
    2. The old data in new data format in the same file
    3. A format that can easily be munged by diff to show the exact differences, probably in the most intuitively understandable format.
    4. No useless middleware bugs to deal with. You will personally own all bugs :-)
    5. A simple one method filter that does the absolute minimal task required - convert old data into a standardised internal representation ready to either work with or write back to file.

      Nothing you have said so far explains the need for YAML/Data::Dump etc. serialisation to me. Your data is already human readable. Storing the original and processed data in the same file is as simple as adding a separator. You don't need any fancy modules to do it.

      I think what you are missing is that the processed data is not text; it's a Perl data structure (actually a whole bunch of objects that subclass Tree::DAG_Node), so it does need some sort of serialization to be stored.
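      For example, a toy stand-in (not my actual Tree::DAG_Node output) showing why a separator line alone doesn't help:

      use YAML qw(Dump);

      # Hypothetical processor output: a nested structure, not flat text
      my $parsed = {
          interfaces => [
              { name => 'eth0', ip => '10.0.0.1' },
              { name => 'eth1', ip => '10.0.0.2' },
          ],
      };
      print Dump($parsed);   # needs serialization before it can share a file with the raw text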

      You are not actually using it to reconstitute a data structure, nor is there any real need, as all you want to do is reformat the old data into the new format so you can process it.

      No, I am using it precisely to reconstitute a data structure. I'm not converting old data into a new format for processing; I'm ensuring that updates and modifications to the processor for newer formats don't break it for old formats (because I still need it to work on those formats as well).

      Basically what's going on is that the data is a tar.gz archive of information collected from a Linux host (configuration files, command output, /proc contents; similar to what you would get from Red Hat's sysreport tool or from VMware's vm-support). What format the data is in depends on which version of the operating system it was collected from. As time goes on, newer versions of Linux and newer versions of our software that runs on the host mean the data in those files may differ from what older versions produced. So I'm collecting this information to make sure that changes made to support newer versions haven't introduced incompatibilities that make the parsers fail on older versions.

      So basically, after building a parser for one of these files and confirming that it produces the right output, I run a script that stores both the original text and the serialized output in a file.

      This file is then used later by the test suite, which loads the original data and the originally serialized output, runs the original data through the current version of the parser for that file, and confirms that, given the same input, the current parser produces the same output as the original did.


      www.jasonkohles.com
      We're not surrounded, we're in a target-rich environment!
