"Very basic question ..."
Unfortunately, the question itself is too basic.
You have omitted information which, if provided, would have resulted in a better answer for you.
Your input appears to be a tab-separated CSV file.
Three things suggest this:
-
You refer to columns, not fields (CSV files have columns).
I've added a record to your posted data to demonstrate the difference (see below).
-
In each record, the second elements are aligned (with tabs?).
-
You have a header record (which is common for CSV files).
You've said nothing about the encoding of your data.
I've used "UTF-8" for both input and output; you may need something else.
Your data seems very simplistic.
Is what you posted truly representative of your real data?
I added an extra record to your posted input:
$ cat test_in.csv
id name
123 john
34 john
567 john
11 peter
899 peter
87 helen
961 Anonymous Monk
Read as plain text with no particular format (which is all that a webpage rendering can show you),
that last record has three whitespace-separated fields; read as tab-separated CSV, it has only two columns,
just like all of the other records.
Here's the CSV format revealed ('^I' represents a tab; '$' represents a newline):
$ cat -vet test_in.csv
id^Iname$
123^Ijohn$
34^Ijohn$
567^Ijohn$
11^Ipeter$
899^Ipeter$
87^Ihelen$
961^IAnonymous Monk$
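To make the field-versus-column difference concrete, here is a minimal sketch that splits the extra record two ways. The helper names split_on_whitespace() and split_on_tab() are hypothetical, chosen only for this illustration, and a bare split is used purely to show the counts; it is not a substitute for a real CSV parser.

```perl
use strict;
use warnings;

# The extra record from test_in.csv, with a literal tab between the columns.
my $record = "961\tAnonymous Monk";

# Hypothetical helpers, named only for this illustration.
sub split_on_whitespace { my ($line) = @_; return split ' ', $line }
sub split_on_tab        { my ($line) = @_; return split /\t/, $line }

my @fields  = split_on_whitespace($record);   # '961', 'Anonymous', 'Monk'
my @columns = split_on_tab($record);          # '961', 'Anonymous Monk'

printf "whitespace: %d fields\n", scalar @fields;
printf "tab:        %d columns\n", scalar @columns;
```

This should report three fields for the whitespace split and two columns for the tab split.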
Parsing CSV files has many gotchas.
Don't try writing your own code to deal with all of these:
Text::CSV has already done so; its use is highly recommended.
Note that if you, or your users, have Text::CSV_XS installed,
it will run faster (without requiring any change to the "use Text::CSV;" statement).
The code for performing the filtering is fairly straightforward.
Here are a few notes:
-
autodie — let Perl deal with I/O exception handling:
it won't get it wrong; it won't forget to do it; it's a tedious task that I'd prefer not to have to do myself.
-
constant — I like to have named array indices.
Possibly overkill in such a tiny script; although, I still think "$row->[NAME]" is immediately clear,
while "$row->[1]" may take a moment's thought.
[Aside:
Just this week, working with some legacy code, I came across this sort of thing: "$aref->[25]".
I was not happy about having to go back several screenfuls and start counting;
then check for changes to that count (e.g. via unshift()).]
-
%seen — that's a standard name and the way I've used it is a standard idiom.
You'll see it in lots of code and documentation.
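Note that the postfix increment is important: it returns the count from before the increment (false the first time a name is seen).
A prefix increment returns the new count, which is always true, so the idiom will not work with it.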
-
Anonymous block — files are only open for the time they are needed.
Perl will automatically close them at the end of the block: another thing I don't need to concern myself with.
Note: the automatic closing described only works with lexical filehandles.
-
$fh_in & $fh_out — lexical filehandles:
always prefer these over package variables, such as IN & OUT.
They only exist in their scope (the anonymous block in this case) and can't interfere with, or be interfered with by,
code elsewhere in the program.
-
open ‐ always use the 3-argument form, as I have here.
You can't specify the encoding layer in the 1- or 2-argument forms.
There are other benefits: the documentation for open has details.
#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

use constant NAME => 1;

my $infile = 'test_in.csv';
my $outfile = 'test_out.csv';

use Text::CSV;

my %seen;

{
    my $csv = Text::CSV::->new({
        binary => 1, sep_char => "\t", quote_char => undef,
    });

    open my $fh_in, '<:encoding(UTF-8)', $infile;
    open my $fh_out, '>:encoding(UTF-8)', $outfile;

    (undef) = scalar <$fh_in>; # skip & discard header record

    while (my $row = $csv->getline($fh_in)) {
        $csv->say($fh_out, $row) unless $seen{$row->[NAME]}++;
    }
}
Running that gives:
$ cat test_out.csv
123 john
11 peter
87 helen
961 Anonymous Monk
Revealing CSV format:
$ cat -vet test_out.csv
123^Ijohn$
11^Ipeter$
87^Ihelen$
961^IAnonymous Monk$
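As a footnote to the %seen note earlier, here is a minimal sketch of why the postfix increment matters. It uses a plain list of the names from the sample data rather than CSV rows, and the sub names first_occurrences() and with_prefix_increment() are hypothetical, invented for this illustration.

```perl
use strict;
use warnings;

my @names = qw(john john john peter peter helen);

# Hypothetical helper: keep only the first occurrence of each value.
sub first_occurrences {
    my %seen;
    # $seen{$_}++ returns the old count (0, i.e. false, the first time),
    # so the negation lets only the first occurrence through.
    my @kept = grep { !$seen{$_}++ } @_;
    return @kept;
}

# Same shape with a prefix increment: ++$seen{$_} returns the new count,
# which is always at least 1 (true), so the negation rejects everything.
sub with_prefix_increment {
    my %seen;
    my @kept = grep { !++$seen{$_} } @_;
    return @kept;
}

my @postfix = first_occurrences(@names);
my @prefix  = with_prefix_increment(@names);

print "postfix kept: @postfix\n";
print "prefix kept:  ", scalar(@prefix), " item(s)\n";
```

The postfix version keeps john, peter and helen; the prefix version keeps nothing, because every test sees a count of 1 or more.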