Re: Very basic question while reading a file line by line

by kcott (Archbishop)
on Dec 10, 2022 at 05:48 UTC


in reply to Very basic question while reading a file line by line

"Very basic question ..."

Unfortunately, the question itself is too basic. You have omitted information which, if provided, would have resulted in a better answer for you.

Your input appears to be a tab-separated CSV file. Three things suggest this:

  • You refer to columns, not fields (CSV files have columns). I've added a record to your posted data to demonstrate the difference (see below).
  • In each record, the second elements are aligned (with tabs?).
  • You have a header record (which is common for CSV files).

You've said nothing about the encoding of your data. I've used "UTF-8" for both input and output; you may need something else.

Your data seems very simplistic. Is what you posted truly representative of your real data?

I added an extra record to your posted input:

$ cat test_in.csv
id      name
123     john
34      john
567     john
11      peter
899     peter
87      helen
961     Anonymous Monk

Read as plain whitespace-separated text, which is all a webpage rendering lets you see, that last record has three fields; read as tab-separated CSV, it has only two columns, just like all of the other records. Here's the CSV format revealed ('^I' represents a tab; '$' represents a newline):

$ cat -vet test_in.csv
id^Iname$
123^Ijohn$
34^Ijohn$
567^Ijohn$
11^Ipeter$
899^Ipeter$
87^Ihelen$
961^IAnonymous Monk$
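
The two readings of that last record can be reproduced with nothing more than split. This is only a minimal sketch: the helper names count_whitespace_fields and count_tab_columns are hypothetical, and the record is typed in by hand with an explicit "\t" rather than read from the file.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper: count fields when splitting on any run of whitespace.
sub count_whitespace_fields {
    my ($record) = @_;
    my @fields = split ' ', $record;
    return scalar @fields;
}

# Hypothetical helper: count columns when splitting on tabs only.
sub count_tab_columns {
    my ($record) = @_;
    my @columns = split /\t/, $record;
    return scalar @columns;
}

# The added record, with the tab written explicitly.
my $record = "961\tAnonymous Monk";

printf "whitespace-separated: %d fields\n", count_whitespace_fields($record);
printf "tab-separated:        %d columns\n", count_tab_columns($record);
```

The first count is 3 and the second is 2, which is why it matters whether the separator is a tab or just "some whitespace".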

Parsing CSV files has many gotchas. Don't try writing your own code to deal with all of these: Text::CSV has already done so; its use is highly recommended. Note that if you, or your users, have Text::CSV_XS installed, it will run faster (without requiring any change to the "use Text::CSV;" statement).
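
To give one example of such a gotcha, here is a sketch of what a hand-rolled split does with a quoted field. The comma-separated record below is made up for illustration (it is not from your data), and naive_fields is a hypothetical helper.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper: the kind of parsing people write by hand.
sub naive_fields {
    my ($line) = @_;
    return split /,/, $line;
}

# Made-up record with two columns; the second is quoted because
# it contains the separator.
my $record = '961,"Monk, Anonymous"';

my @fields = naive_fields($record);

printf "naive split found %d fields\n", scalar @fields;
print "  [$_]\n" for @fields;
```

The naive split reports three fields, with the quotes still attached to the second and third, although the record has two columns. Quoting is only one of the cases a proper CSV parser handles for you.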

The code for performing the filtering is fairly straightforward. Here are a few notes:

  • autodie: let Perl deal with I/O exception handling. It won't get it wrong; it won't forget to do it; it's a tedious task that I'd prefer not to have to do myself.
  • constant: I like to have named array indices. Possibly overkill in such a tiny script; however, I still think "$row->[NAME]" is immediately clear, while "$row->[1]" may take a moment's thought.

    [Aside: Just this week, working with some legacy code, I came across this sort of thing: "$aref->[25]". I was not happy about having to go back several screenfuls and start counting; then check for changes to that count (e.g. via unshift()).]

  • %seen: that's a standard name, and the way I've used it is a standard idiom. You'll see it in lots of code and documentation. Note that the postfix increment is important: it returns the count from before the increment, which is false only the first time a name is seen, so only the first record for each name is printed. The idiom will not work with a prefix increment.
  • Anonymous block: files are only open for the time they are needed. Perl will automatically close them at the end of the block: another thing I don't need to concern myself with. Note: the automatic closing described only works with lexical filehandles.
  • $fh_in & $fh_out: these are lexical filehandles. Always prefer them over package variables, such as IN & OUT. They only exist in their scope (the anonymous block in this case) and can't interfere with, or be interfered with by, code elsewhere in the program.
  • open: always use the 3-argument form, as I have here. You can't use the encoding in 1- or 2-argument forms. There are other benefits: the documentation has details.

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;

use constant NAME => 1;

my $infile = 'test_in.csv';
my $outfile = 'test_out.csv';

use Text::CSV;

my %seen;

{
    my $csv = Text::CSV::->new({
        binary => 1,
        sep_char => "\t",
        quote_char => undef,
    });

    open my $fh_in, '<:encoding(UTF-8)', $infile;
    open my $fh_out, '>:encoding(UTF-8)', $outfile;

    (undef) = scalar <$fh_in>; # skip & discard header record

    while (my $row = $csv->getline($fh_in)) {
        $csv->say($fh_out, $row) unless $seen{$row->[NAME]}++;
    }
}

Running that gives:

$ cat test_out.csv
123     john
11      peter
87      helen
961     Anonymous Monk

Revealing CSV format:

$ cat -vet test_out.csv
123^Ijohn$
11^Ipeter$
87^Ihelen$
961^IAnonymous Monk$
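
As a footnote to the %seen note, here is a small sketch of why the postfix increment matters. It uses grep on a plain list instead of CSV rows, and the two sub names are hypothetical, chosen only for this illustration.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper: the usual idiom. The postfix ++ returns the
# count from BEFORE the increment, so the test passes only the first
# time each value is seen.
sub uniq_with_postfix {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Hypothetical helper: the same code with a prefix ++. The value
# returned is already at least 1, so the test never passes.
sub uniq_with_prefix {
    my %seen;
    return grep { !++$seen{$_} } @_;
}

my @names = qw(john john john peter peter helen);

my @postfix = uniq_with_postfix(@names);
my @prefix  = uniq_with_prefix(@names);

print "postfix kept: @postfix\n";
print "prefix kept:  @prefix\n";
```

The postfix version keeps john, peter and helen in their original order; the prefix version keeps nothing at all.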

— Ken
