Category: | Text Processing |
Author/Contact Info | Jeremy Kahn kahn@cpan.org |
Description: | How often have you wanted to sample every Nth line of a file?
For me, it's nearly every line-based text-processing tool I write. Even when sampling isn't a stated requirement, testing on every 50th line of my test corpus is usually much more informative than using the first 50 lines as test data. I find it very frustrating that there's no Unix power tool a la grep or tail that does this. So per is an addition to the Unix power-tool library: it's like head or tail, except that it takes every Nth line instead of the first or last N lines. Save it as ~/bin/per (or /usr/bin/per) and use it every day, as I do. Windows users can run pl2bat on it and put the result somewhere in their path; my NT box happily uses a variant of this. Usage info is in POD, in the script, but here it is rendered as HTML anyway (I love pod2html):
NAME
per - return one line per N lines

SYNOPSIS
    per [-oOFFSET] -N [files]

    per -90 -o2 file.txt     # every 90th line starting with line 2
    per -o500 -3 file.txt    # every 3rd line starting with line 500
    per -o1 -2 file.txt      # every other line, starting with the first
    per -2 file.txt          # same as above

It can also read from STDIN, for pipelining:

    tail -5000 bigfile.txt | per -100   # show every 100th line for the
                                        # last 5000 in the file

DESCRIPTION
per writes every Nth line, starting with OFFSET, to STDOUT.

OPTIONS
-N
    The integer value N provided (e.g. -50, -2) decides which lines to return: every Nth.
-oOFFSET
    The value OFFSET says how far down in the input to proceed before beginning. The output will begin at line number OFFSET. Default is 1.
[ files ]
    Note that per works on files specified on the command line, or on STDIN if no files are provided. The special input file - indicates that remaining data should be read from STDIN. |
#!perl
use strict;
use warnings;

use constant DEBUG => 0;

my ($divisor, $offset) = handleArgs();

if (DEBUG) {
    warn "offset $offset\n";
    warn "divisor $divisor\n";
}

while (<>) {
    next if $. < $offset;                  # haven't reached the first offset
    next if (($. - $offset) % $divisor);   # not a multiple of N past the offset
    print;
}

# Parse leading -N and -oOFFSET switches from @ARGV, leaving any
# file names in place for the <> operator.
sub handleArgs {
    my ($offset, $divisor);
    while (@ARGV and $ARGV[0] =~ s/^-//) {
        my $arg = shift @ARGV;
        if ($arg =~ s/^o//) {
            if (defined $offset) {
                warn "-o switch found more than once\n";
            }
            $offset = $arg;
        }
        else {
            if ($arg eq '') {
                # arg was a bare '-' (read STDIN): put it back and
                # stop parsing switches
                unshift @ARGV, '-';
                last;
            }
            if (defined $divisor) {
                warn "divisor argument (-N) found more than once\n";
            }
            $divisor = $arg;
        }
    }

    if (not defined $divisor) {
        die "no divisor (-N) defined on commandline!\n";
    }
    if (not defined $offset) {
        $offset = 1;
    }

    if ($divisor <= 0) {
        die "divisor $divisor is <= 0, which makes no sense.\n";
    }
    if ($offset <= 0) {
        die "offset $offset is <= 0, which makes no sense.\n";
    }

    if ($divisor != int($divisor)) {
        warn "divisor $divisor non-integer. truncating\n";
        $divisor = int($divisor);
    }
    if ($offset != int($offset)) {
        warn "offset $offset non-integer. truncating\n";
        $offset = int($offset);
    }
    return ($divisor, $offset);
}

=head1 NAME

per - return one line per N lines

=head1 SYNOPSIS

    per [-oOFFSET] -N [files]

    per -90 -o2 file.txt     # every 90th line starting with line 2
    per -o500 -3 file.txt    # every 3rd line starting with line 500
    per -o1 -2 file.txt      # every other line, starting with the first
    per -2 file.txt          # same as above

It can also read from C<STDIN>, for pipelining:

    tail -5000 bigfile.txt | per -100   # show every 100th line for the
                                        # last 5000 in the file

=head1 DESCRIPTION

C<per> writes every C<N>th line, starting with C<OFFSET>, to C<STDOUT>.

=head1 OPTIONS

=over

=item -N

the integer value C<N> provided (e.g. C<-50>, C<-2>) is used to decide
which lines to return -- every C<N>th.

=item -oOFFSET

the value C<OFFSET> provided says how far down in the input to proceed
before beginning. The output will begin at line number C<OFFSET>.
Default is 1.

=item [ files ]

=back

Note that C<per> works on files specified on the commandline, or on
C<STDIN> if no files are provided. The special input file C<-> indicates
that remaining data should be read from C<STDIN>.

=cut

__END__
|
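To make the selection rule in the main loop concrete, here is a minimal sketch that applies the same test (skip lines before OFFSET, then keep those where the distance from OFFSET is a multiple of N) to plain line numbers instead of real input. The helper names is_selected and selected_lines are hypothetical and are not part of per itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of per): true when 1-based line number
# $line_no would be printed for divisor N and starting OFFSET, using
# the same two checks as per's main loop.
sub is_selected {
    my ($line_no, $divisor, $offset) = @_;
    return 0 if $line_no < $offset;    # before the first selected line
    return (($line_no - $offset) % $divisor) == 0 ? 1 : 0;
}

# Hypothetical helper: the selected line numbers among 1..$total.
sub selected_lines {
    my ($total, $divisor, $offset) = @_;
    return grep { is_selected($_, $divisor, $offset) } 1 .. $total;
}

# Equivalent of `per -o2 -3` on a 12-line input.
print join(',', selected_lines(12, 3, 2)), "\n";    # prints 2,5,8,11
```

Because the test is anchored at OFFSET rather than at line 1, the first line printed is always line OFFSET itself, which is why `per -2` and `per -o1 -2` give the same output.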