0xbeef has asked for the wisdom of the Perl Monks concerning the following question:
I read a data file into a scalar $data using a variation of fastslurp. I'd like to refer to the actual data record (mostly text, but maybe binary as well in the future) in another variable. Here is a rather simplified representation of the scalar $data:
-----------offset----
| record1 | 0 |
| record2 | 10 |
| record3 | 87 |
---------------------
Seeing that $data already contains the actual data, is there a way to map a new variable $rec1 (or maybe an array) to the actual data in $data if I know the start/end position for each record? i.e. $rec1 should magically map to record1 using only pointers to data in $data. At all costs, it should avoid making a _copy_ of the actual data.
Sorry this strikes me as a laughable question but I just cannot think of the correct method and I need an efficient implementation!
Niel
Re: slurped scalar map
by ikegami (Patriarch) on Jun 20, 2006 at 14:33 UTC
If I understand correctly, tie or captures will be useful. Here's an example of the latter:
sub new_accessor {
    my $start    = $_[0];
    my $end      = $_[1];
    my $data_ref = \$_[2]; # Avoid making a copy.
    return sub {
        return substr($$data_ref, $start, $end - $start);
    };
}
{
    my $data   = ...;
    my $start1 = ...; # Calculate start from map.
    my $end1   = ...; # Calculate end from map.
    my $rec1   = new_accessor($start1, $end1, $data);
    print($rec1->(), "\n");
}
tie would allow you to do the same, but you'd use
print($rec1, "\n");
instead of
print($rec1->(), "\n");
Untested.
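For illustration, a minimal sketch of that tie alternative (the class name RecordView and the sample data here are made up, not from the thread):

```perl
package RecordView;

# TIESCALAR receives start, end, and a reference to the big buffer,
# so no copy of the data is made at tie time.
sub TIESCALAR {
    my ($class, $start, $end, $data_ref) = @_;
    return bless { start => $start, end => $end, data => $data_ref }, $class;
}

# FETCH extracts the record on demand; the copy happens only on read.
sub FETCH {
    my $self = shift;
    return substr( ${ $self->{data} }, $self->{start},
                   $self->{end} - $self->{start} );
}

package main;
use strict;

my $data   = "record1...record2...";
my $start1 = 0;
my $end1   = 7;
tie my $rec1, 'RecordView', $start1, $end1, \$data;
print $rec1, "\n";   # prints "record1"
```

The tied scalar reads like a plain variable, at the cost of a method dispatch per access.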
Thanks, this is spot on for my requirement. ++Regards,
Niel
I should have mentioned that substr makes a copy when $rec1->() is called. That's unavoidable. You can't extract a string from another without making the new string. However, the copy is only done when $rec1->() is executed.
If that's a problem, you could extract small chunks at a time. With the following, only $blk_size chars (100 by default) are duplicated at any given time.
sub new_callback_accessor {
    my $start    = $_[0];
    my $end      = $_[1];
    my $data_ref = \$_[2]; # Avoid making a copy.
    return sub {
        my ($callback, $blk_size) = @_;
        local *_;
        $blk_size = 100 unless defined $blk_size;
        $blk_size = $end - $start unless $blk_size;
        my $ofs = $start;
        my $len = $end - $start;
        while ($len) {
            $blk_size = $len if $blk_size > $len;
            $_ = substr($$data_ref, $ofs, $blk_size);
            $callback->();
            $ofs += $blk_size; # Advance by the chunk just processed.
            $len -= $blk_size;
        }
    };
}
{
    my $data   = ...;
    my $start1 = ...; # Calculate start from map.
    my $end1   = ...; # Calculate end from map.
    my $rec1   = new_callback_accessor($start1, $end1, $data);
    $rec1->(sub { print });
    print("\n");
}
Untested.
Re: slurped scalar map
by Zaxo (Archbishop) on Jun 20, 2006 at 14:46 UTC
If you can rely on the fixed offset of the data in a line, unpack or substr/regex matching will get you the data. It will be easier if you split the file into an array of lines, or else slurp it that way in the first place:
my @lines = <$handle>;
my %record;
for (@lines) {
    next unless /^\| (\w+) \| (\d+)/;
    $record{$1} = $2;
}
That does more than assign one value to one variable named after another piece of the data; it associates all of the record names with their offsets.
You wind up with a more useful and easier-to-manage representation of the data in your file.
If you're stuck with that scalar variable, you can use the same regex globally (adding /m so ^ matches at each embedded line start),
my %record = $data =~ /^\| (\w+) \| (\d+)/mg;
That looks simpler, but it is, IMO, more fragile.
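As for the unpack route mentioned above, here is a small sketch with made-up fixed-width fields (the 10-character widths are illustrative, not from the OP's data):

```perl
use strict;

# Hypothetical fixed-width layout: two 10-character fields.
my $data = "record1   record2   ";

# The A template extracts an ASCII field and strips trailing spaces.
my ($rec1, $rec2) = unpack 'A10 A10', $data;

print "$rec1\n";   # prints "record1"
print "$rec2\n";   # prints "record2"
```

For many records of one fixed size, a repeat count such as 'A10' x $n (or '(A10)*') does the whole buffer in one call.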
To get exactly what you asked for, knowing the offset and length of the field,
my $rec1ref = \substr $data, $offset, $len;
$$rec1ref = $newval;
If length($newval) != $len, the offsets to subsequent data will be disturbed and the data seen in $$rec1ref will be truncated or augmented.
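A quick demonstration of that caveat with a made-up buffer: replacing a 3-byte field with a 5-byte value splices the string and shifts everything after it.

```perl
use strict;

my $data = "aaabbbccc";
my $ref  = \substr $data, 3, 3;   # lvalue ref to the "bbb" field

$$ref = "XXXXX";                  # longer replacement splices the buffer

print "$data\n";   # prints "aaaXXXXXccc" - "ccc" has shifted from offset 6 to 8
```

Any other offsets computed against the original layout are now stale, which is why same-length replacement is the only safe in-place edit.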
Re: slurped scalar map
by BrowserUk (Patriarch) on Jun 20, 2006 at 14:54 UTC
#! perl -slw
use strict;
my $data = <<EOD;
record 1
record 2 is a bit longer
record 3 is just this length
EOD
my $p = 0;
my @refs;
while ( my $o = 1 + index $data, "\n", $p ) {
    push @refs, \substr $data, $p, $o - $p;
    $p = $o;
}
print $$_ for @refs;
which works okay (from 5.8.4 (maybe 5.8.3 I forget) onwards), but don't try assigning to them unless your replacements are exactly the same length as the originals.
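For completeness, a small sketch (sample data made up) of the safe case: a same-length replacement through one of the refs edits the buffer in place without disturbing the other records.

```perl
use strict;

my $data = "abc\ndef\n";
my $p    = 0;
my @refs;

# Same loop as above: one lvalue substr ref per newline-terminated record.
while ( my $o = 1 + index $data, "\n", $p ) {
    push @refs, \substr $data, $p, $o - $p;
    $p = $o;
}

# Same-length replacement: offsets of later records stay valid.
${ $refs[0] } = "ABC\n";

print $data;   # prints "ABC\ndef\n"
```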
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Thanks, this elaborates on your analysis, as johngg mentioned. I do not have fixed record separators, but I do have records of variable length/content, so I rely on offset and size. I intend to use your example, calculating each record's end offset:
#!/usr/bin/perl -slw
use strict;
my $data = <<EOD;
record 1 is 20 bytesrecord 2 is 20 bytesrecord 3 is longer at 30 bytes
EOD
my $p = 0;
my @refs;
# endpos would be calculated based on recsize - this is simplified:
my @endpos = ( 20, 40, 70 );
for (@endpos) {
    push @refs, \substr $data, $p, $_ - $p;
    $p = $_;
}
print '[', $$_, ']' for @refs;
Hope I got that right. Niel
Just be aware that even this method will only save you space where your records are longer than (from memory) 12 characters. And if you have to store another array containing the record lengths, then you have to factor the size of that array into the argument as well, unless you replace each record length with the lvalue ref of the record as you go. The space consumed storing the lengths will depend upon whether the numeric values are loaded and stored as IVs or PVs: around 20 bytes per length for the former and approx. 50 for the latter, in addition to that used to store the lvalue ref.
Also, it only makes sense to build an array of refs if you are going to randomly access each (or some) record more than once. Otherwise, it would be better to simply generate and use an lvalue ref for each record as you need it. The trade-off between replacing the lengths with lvalue refs and generating them on the fly will depend upon the number of records, the frequency with which you re-access them through the life of the program, how the lengths are loaded, etc.
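A sketch of that generate-on-the-fly approach (names and sample data are illustrative): keep only the end offsets and build each lvalue ref on demand, rather than storing one ref per record up front.

```perl
use strict;

my $data   = "record 1 is 20 bytesrecord 2 is 20 bytes";
my @endpos = (20, 40);   # end offset of each record, as in the parent node

# Build an lvalue substr ref for record $i only when it is requested.
sub rec_ref {
    my ($i) = @_;
    my $start = $i ? $endpos[ $i - 1 ] : 0;
    return \substr $data, $start, $endpos[$i] - $start;
}

print ${ rec_ref(1) }, "\n";   # prints "record 2 is 20 bytes"
```

Nothing is stored per record beyond the end offset, at the cost of rebuilding the ref on each access.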
Re: slurped scalar map
by dragonchild (Archbishop) on Jun 20, 2006 at 14:29 UTC
Optimize for correctness, first. Parse that into a hash and get it working. Then, if it's not fast enough (and I highly doubt that will be a problem), then come back and ask a question with a working implementation.
My criteria for good software:
- Does it work?
- Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
I am already past the "working" phase and in the "optimisation" phase. I'm curious about efficiency in terms of best programming practice. The program (too large to post) creates a file consisting of N records, with an index at the end containing key info like fpos markers. (The records consist of the stdout/stderr of several o/s commands and files => 30-50Mb/server for almost 100 servers.) The program currently reads the index first, then processes & reads each record as it requires it while processing the data file. I'm trying to find a faster solution, i.e. performing larger sequential reads upfront. Of course, it may have extra considerations, such as a max. slurp size. This exercise will be worth it (in my mind at least) if I can understand the margin by which <sequential slurp><process><process><process> operations are faster than <slurp 1 record><process><slurp next record><process> ... Hope this makes sense. Niel
The OS already does that for you. When you read from a file, you're not actually reading from the disk itself. You read from a buffer that the disk manager creates for you. So, slurp-process-slurp-process is going to be nearly as fast as (or faster than) slurp-process-process-process.
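If you want to measure that margin rather than guess, a rough sketch using the core Benchmark and File::Temp modules on a throwaway file (the record count and size are made up):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use File::Temp qw(tempfile);

# Build a throwaway file of 1000 fixed-size 100-byte records.
my ($out, $file) = tempfile();
print {$out} 'x' x 100 for 1 .. 1000;
close $out or die $!;

cmpthese( -1, {
    # One big sequential read of the whole file.
    slurp_once => sub {
        open my $in, '<', $file or die $!;
        my $data = do { local $/; <$in> };
    },
    # Read one 100-byte record per <> call.
    per_record => sub {
        open my $in, '<', $file or die $!;
        local $/ = \100;
        while ( my $rec = <$in> ) { }
    },
} );
```

With the file warm in the OS cache, the gap between the two is usually much smaller than the raw read-count difference would suggest, which is dragonchild's point.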
Re: slurped scalar map
by johngg (Canon) on Jun 20, 2006 at 14:48 UTC