per - selects every Nth line

http://qs321.pair.com?node_id=215446

Category:	Text Processing
Author/Contact Info	Jeremy Kahn kahn@cpan.org
Description:	For how many tasks have you wanted to use a sampling of every Nth line of a file?

- selecting a "random" subset before running on all five million lines
- getting a flavor of what's in a line-oriented database
- holding out test data

Well, for me, it's nearly every line-based text-processing tool I write -- if it's not a standard requirement, it's usually much more informative to test on every 50th line of my test corpus than it is to use the first 50 lines for test data. In fact, I find it very frustrating that there's no Unix power tool a la `grep` or `tail` that does this.

So, `per` is an addition to the Unix power-tool library -- it's sort of like `head` or `tail` except that it takes every Nth line instead of the first or last N. Save it as `~/bin/per` (or `/usr/bin/per`) and use it every day, like me. Windows users can run `pl2bat` on this and put it somewhere in your path -- my NT box happily uses a variant of this.

Usage info is in POD, in the script. But here it is in HTML anyway (I love `pod2html`):

NAME

per - return one line per N lines

SYNOPSIS

    per -oOFFSET -N files

    per -90 -o2 file.txt   # every 90th line starting with line 2
    per -o500 -3 file.txt  # every 3rd line starting with line 500
    per -o1 -2 file.txt    # every other line, starting with the first
    per -2 file.txt        # same as above

It can also read from `STDIN`, for pipelining:

    tail -5000 bigfile.txt | per -100  # show every 100th line for the
                                       # last 5000 in the file

DESCRIPTION

`per` writes every `N`th line, starting with `OFFSET`, to `STDOUT`.

OPTIONS

-N
    the integer value `N` provided (e.g. `-50`, `-2`) is used to decide which lines to return -- every `N`th.

-oOFFSET
    the value `OFFSET` provided says how far down in the input to proceed before beginning. The output will begin at line number `OFFSET`. Default is 1.

files

Note that `per` works on files specified on the commandline, or on `STDIN` if no files are provided. The special input file `-` indicates that remaining data should be read from `STDIN`.
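
To make the selection rule concrete, here is a minimal sketch of which line numbers survive for a given `N` and `OFFSET`. The helper name `selected_line_numbers` is hypothetical and is not part of `per`; it only mirrors the test the script applies to each line number.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of per): return the 1-based line numbers
# that would be kept for a given divisor N, OFFSET, and total line count.
sub selected_line_numbers {
    my ($divisor, $offset, $total) = @_;
    die "divisor must be at least 1\n" unless $divisor >= 1;
    die "offset must be at least 1\n"  unless $offset >= 1;

    # A line is kept when its distance from OFFSET is a multiple of N.
    return grep { ($_ - $offset) % $divisor == 0 } $offset .. $total;
}

# Equivalent of `per -o2 -3` on a 10-line input.
print join(',', selected_line_numbers(3, 2, 10)), "\n";   # prints 2,5,8
```

An `OFFSET` beyond the end of the input simply yields nothing, which matches what the script would print in that case.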
#!perl
use strict;
use warnings;

use constant DEBUG => 0;

my ($divisor, $offset) = handleArgs();

if (DEBUG) {
    warn "offset $offset\n";
    warn "divisor $divisor\n";
}

while (<>) {
    next if $. < $offset;                 # haven't reached the first offset
    next if (($. - $offset) % $divisor);  # not an Nth line counted from OFFSET
    print;
}

# Parse leading -N and -oOFFSET switches from @ARGV; whatever remains is
# left in @ARGV as the list of input files for <>.
sub handleArgs {
    my ($offset, $divisor);
    while (@ARGV and $ARGV[0] =~ s/^-//) {
        my $arg = shift @ARGV;
        if ($arg =~ s/^o//) {
            if (defined $offset) {
                warn "-o switch found more than once\n";
            }
            $offset = $arg;
        }
        else {
            if ($arg eq '') {
                # arg was a bare '-': put it back so <> reads STDIN,
                # and stop looking for switches
                unshift @ARGV, '-';
                last;
            }
            if (defined $divisor) {
                warn "divisor argument (-N) found more than once\n";
            }
            $divisor = $arg;
        }
    }

    if (not defined $divisor) {
        die "no divisor (-N) defined on commandline!\n";
    }
    if (not defined $offset) {
        $offset = 1;
    }

    # Truncate before the range checks, so that a value such as 0.5
    # cannot slip through as a zero divisor.
    if ($divisor != int($divisor)) {
        warn "divisor $divisor non-integer. truncating\n";
        $divisor = int($divisor);
    }
    if ($offset != int($offset)) {
        warn "offset $offset non-integer. truncating\n";
        $offset = int($offset);
    }

    if ($divisor <= 0) {
        die "divisor $divisor is <= 0, which makes no sense.\n";
    }
    if ($offset <= 0) {
        die "offset $offset is <= 0, which makes no sense.\n";
    }

    return ($divisor, $offset);
}

=head1 NAME

per - return one line per N lines

=head1 SYNOPSIS

  per [-oOFFSET] -N [files]

  per -90 -o2 file.txt   # every 90th line starting with line 2
  per -o500 -3 file.txt  # every 3rd line starting with line 500
  per -o1 -2 file.txt    # every other line, starting with the first
  per -2 file.txt        # same as above

It can also read from C<STDIN>, for pipelining:

  tail -5000 bigfile.txt | per -100  # show every 100th line for the
                                     # last 5000 in the file

=head1 DESCRIPTION

C<per> writes every C<N>th line, starting with C<OFFSET>, to C<STDOUT>.

=head1 OPTIONS

=over

=item -N

the integer value C<N> provided (e.g. C<-50>, C<-2>) is used to decide
which lines to return -- every C<N>th.

=item -oOFFSET

the value C<OFFSET> provided says how far down in the input to proceed
before beginning. The output will begin at line number C<OFFSET>.
Default is 1.

=item [ files ]

=back

Note that C<per> works on files specified on the commandline, or on
C<STDIN> if no files are provided. The special input file C<->
indicates that remaining data should be read from C<STDIN>.

=cut

__END__
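
One consequence of the `while (<>)` loop is worth spelling out. Perl does not reset `$.` between the files named on the command line unless `ARGV` is closed explicitly, so with several files the line count runs on as if they were one concatenated stream. The sketch below illustrates the difference. Both helper names (`write_temp_file`, `sample_files`) and the per-file reset option are hypothetical and are not part of `per` as posted; the `close ARGV if eof` idiom is the one described in Perl's `eof` documentation.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write the given lines to a temporary file
# and return its name (the file is removed when the program exits).
sub write_temp_file {
    my @lines = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} "$_\n" for @lines;
    close $fh or die "cannot close $name: $!\n";
    return $name;
}

# Hypothetical helper: apply the script's selection test to @files.
# When $reset_per_file is true, $. restarts at 1 for every file.
sub sample_files {
    my ($divisor, $offset, $reset_per_file, @files) = @_;
    my @kept;
    local @ARGV = @files;          # <> reads from whatever is in @ARGV
    while (my $line = <>) {
        next if $. < $offset;
        next if ($. - $offset) % $divisor;
        chomp $line;
        push @kept, $line;
    }
    continue {
        # Closing ARGV at end of each file is what resets $.
        close ARGV if $reset_per_file and eof;
    }
    return @kept;
}
```

With two four-line files and the equivalent of `-3`, the running count keeps lines 1, 4 and 7 of the combined input (the first and fourth lines of the first file, then the third line of the second), whereas the per-file reset keeps the first and fourth lines of each file.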
