PerlMonks
Data Sampler (Extract sample from large text file)

by roboticus (Chancellor)
on Mar 06, 2009 at 21:12 UTC ( [id://748960] )
Category: Utility Scripts
Description:

I frequently find the need to test programs on *real* data. But some of the datasets I have to deal with are rather ... large.

So this program lets me generate much smaller datasets to develop with.

Update: 20090306: I edited the first print statement to remove the trace information.

#!/usr/bin/perl -w
=head1 Simple Data Sampler

This program extracts a set of random lines from the file(s)
specified.


=head1 Usage:

sample_lines.pl [<Option>*] <InFile> <InFile>*
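
For example, to keep roughly 10 lines per thousand, plus 2 extra
lines after each hit (file names here are hypothetical):

    sample_lines.pl -per-thousand 10 -contiguous 2 big.log > sample.log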

=head1 Options:

=over

=item -per-thousand <p>

Controls the number of lines in the sample: try to keep <p> lines
for every thousand seen. I<NOTE:> This value isn't enforced
exactly; it's only used as the per-line probability for deciding
whether to print a line (plus any contiguous lines).

=item -contiguous <c>

Keep <c> lines after each selected line (so each hit always
yields a run of contiguous lines); default=0.
I<NOTE:> See -contig-max

=item -contig-max <cm>

Randomizes the number of contiguous lines to print after each
selected line (see -contiguous): print between <c> and <cm>
lines after each selected line.

=item -minimum-skip <ms>

Minimum number of lines to skip between selected lines.

=back

The options are implemented very simply, as this isn't supposed
to be the "ultimate data sampler", just a simple way to get a
random set of lines from a text file.

=cut

use strict;
use warnings;
use Getopt::Long;

#####
# Handle command-line options
#####

my $contig_min;
my $contig_max;
my $minimum_skip;
my $per_thousand = 6.5;
my $result = GetOptions (
        "contiguous=i" => \$contig_min,
        "contig-max=i" => \$contig_max,
        "minimum-skip=i" => \$minimum_skip,
        "per-thousand=i" => \$per_thousand,
);
if (defined $contig_max) {
        $contig_min = 0 unless defined $contig_min;
        $contig_max = $contig_min if $contig_max < $contig_min;
}
if (defined $contig_min) {
        $contig_max = $contig_min if !defined $contig_max;
}

$per_thousand = $per_thousand / 1000.0;


#####
# Sample the data
#####

while (my $InFile = shift) {
        open my $inf, '<', $InFile or die "Can't open '$InFile': $!\n";
        while (<$inf>) {
                next if $per_thousand < rand;
                print;
                if (defined $contig_min) {
                        # Print between <c> and <cm> following lines,
                        # stopping early if we hit end of file.
                        for (1 .. $contig_min
                                + int(rand($contig_max - $contig_min + 1))) {
                                defined(my $line = <$inf>) or last;
                                print $line;
                        }
                }
                if (defined $minimum_skip) {
                        <$inf> for 1 .. $minimum_skip;
                }
        }
        close $inf or die "Error closing '$InFile': $!\n";
}
Replies are listed 'Best First'.
Re: Data Sampler (Extract sample from large text file)
by baxy77bax (Deacon) on Mar 09, 2009 at 15:19 UTC
    I have the same problem constantly, but it never came to my mind to write something like this; I just cut and paste random pieces of a file into a test file and that is it.

    So big ++ for this script and effort! :)
