
Very basic question while reading a file line by line

by Anonymous Monk
on Dec 10, 2022 at 01:17 UTC ( [id://11148698]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks!
I come to you with something that might be super trivial, but I am not sure I know the answer to it:
If you have a file with lines like this:

    id   name
    123  john
    34   john
    567  john
    11   peter
    899  peter
    87   helen

and you want to keep the first (and only the first) record for each of the names that appear in the second column, i.e.:

    123  john
    11   peter
    87   helen

how does it work using a while loop and reading line by line?

Replies are listed 'Best First'.
Re: Very basic question while reading a file line by line
by GrandFather (Saint) on Dec 10, 2022 at 04:21 UTC

    Your sample data smells like CSV. If that is the case you really should use Text::CSV to read the file.

    That aside, I'd be much happier to see some code that you have attempted to use and a description of how it fails than a simple cap in hand request for us to do your homework for you.

    Oh, and Anonymous Monk doesn't get to consider itself a fellow monk in my book. To gain that qualification you should join the monastery.

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Very basic question while reading a file line by line
by atcroft (Abbot) on Dec 10, 2022 at 01:45 UTC

    Q: How would you do it manually?
    A: You would recall which you have already seen, and only output if it were not in that list.

    Q: How to do that in code?
    A: One way would be to put each name into a hash, and only output a line if its name is not already present. In my example code below (which takes its data from an array instead of a file, but the logic within the while loop is the same as when processing a file), I split the line into two (2) parts based on m/\s+/ (one or more whitespace characters, which could be spaces, tabs, etc.). I then check if the name exists as a key in the hash (%seen); if not, I output the line. After the check, I increment the value of the hash element with the name as the key.

    Output:

    $ ./11148698-00.pl
    id   name
    123  john
    11   peter
    87   helen

    Code:

    Hope that helps.
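
    The approach described in this reply can be sketched as follows. This is an illustration under stated assumptions, not the poster's original script: the helper name print_first_per_name is hypothetical, and the data is read through an in-memory filehandle so the example is self-contained (a real script would open a file instead).

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (the name is not from the thread): copy to $out each
# line of $in whose second field has not been seen before.
sub print_first_per_name {
    my ($in, $out) = @_;
    my %seen;
    while (my $line = <$in>) {
        chomp $line;
        # Split into at most two parts on runs of whitespace.
        my (undef, $name) = split m/\s+/, $line, 2;
        next unless defined $name;    # skip blank or one-field lines
        # Output the line only if this name is not yet a key in %seen.
        print {$out} "$line\n" unless exists $seen{$name};
        # After the check, record that the name has been seen.
        $seen{$name}++;
    }
    return;
}

# Sample data from the question, held in a string (an assumption made
# only to keep the sketch self-contained).
my $data = <<'END';
id name
123 john
34 john
567 john
11 peter
899 peter
87 helen
END

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print_first_per_name($fh, \*STDOUT);
close $fh;
```

    Because the header line is treated like any other record, "id name" is printed too: "name" is simply the first value seen in the second column. Read and discard one line before the loop if the header should be dropped.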

Re: Very basic question while reading a file line by line
by kcott (Archbishop) on Dec 10, 2022 at 05:48 UTC
    "Very basic question ..."

    Unfortunately, the question itself is too basic. You have omitted information which, if provided, would have resulted in a better answer for you.

    Your input appears to be a tab-separated CSV file. Three things suggest this:

    • You refer to columns, not fields (CSV files have columns). I've added a record to your posted data to demonstrate the difference (see below).
    • In each record, the second elements are aligned (with tabs?).
    • You have a header record (which is common for CSV files).

    You've said nothing about the encoding of your data. I've used "UTF-8" for both input and output; you may need something else.

    Your data seems very simplistic. Is what you posted truly representative of your real data?

    I added an extra record to your posted input:

    $ cat test_in.csv
    id   name
    123  john
    34   john
    567  john
    11   peter
    899  peter
    87   helen
    961  Anonymous Monk

    In a normal file, with no special format defined, and to the extent that it's represented in a webpage, that last record has three fields; however, if a CSV format is specified, that last record has only two columns, just like all of the other records. Here's the CSV format revealed ('^I' represents a tab; '$' represents a newline):

    $ cat -vet test_in.csv
    id^Iname$
    123^Ijohn$
    34^Ijohn$
    567^Ijohn$
    11^Ipeter$
    899^Ipeter$
    87^Ihelen$
    961^IAnonymous Monk$

    Parsing CSV files has many gotchas. Don't try writing your own code to deal with all of these: Text::CSV has already done so; its use is highly recommended. Note that if you, or your users, have Text::CSV_XS installed, it will run faster (without requiring any change to the "use Text::CSV;" statement).

    The code for performing the filtering is fairly straightforward. Here are a few notes:

    • autodie — let Perl deal with I/O exception handling: it won't get it wrong; it won't forget to do it; it's a tedious task that I'd prefer not to have to do myself.
    • constant — I like to have named array indices. Possibly overkill in such a tiny script; although, I still think "$row->[NAME]" is immediately clear, while "$row->[1]" may take a moment's thought.

      [Aside: Just this week, working with some legacy code, I came across this sort of thing: "$aref->[25]". I was not happy about having to go back several screenfuls and start counting; then check for changes to that count (e.g. via unshift()).]

    • %seen — that's a standard name and the way I've used it is a standard idiom. You'll see it in lots of code and documentation. Note that the postfix increment is important; the idiom will not work with a prefix increment.
    • Anonymous block — files are only open for the time they are needed. Perl will automatically close them at the end of the block: another thing I don't need to concern myself with. Note: the automatic closing described only works with lexical filehandles.
    • $fh_in & $fh_out — lexical filehandles: always prefer these over package variables, such as IN & OUT. They only exist in their scope (the anonymous block in this case) and can't interfere with, or be interfered by, code elsewhere in the program.
    • open ‐ always use the 3-argument form, as I have here. You can't specify the encoding in the 1- or 2-argument forms. There are other benefits: the documentation has details.

    #!/usr/bin/env perl

    use strict;
    use warnings;
    use autodie;

    use constant NAME => 1;

    my $infile = 'test_in.csv';
    my $outfile = 'test_out.csv';

    use Text::CSV;

    my %seen;

    {
        my $csv = Text::CSV::->new({
            binary => 1, sep_char => "\t", quote_char => undef,
        });

        open my $fh_in, '<:encoding(UTF-8)', $infile;
        open my $fh_out, '>:encoding(UTF-8)', $outfile;

        (undef) = scalar <$fh_in>; # skip & discard header record

        while (my $row = $csv->getline($fh_in)) {
            $csv->say($fh_out, $row) unless $seen{$row->[NAME]}++;
        }
    }

    Running that gives:

    $ cat test_out.csv
    123  john
    11   peter
    87   helen
    961  Anonymous Monk

    Revealing CSV format:

    $ cat -vet test_out.csv
    123^Ijohn$
    11^Ipeter$
    87^Ihelen$
    961^IAnonymous Monk$

    — Ken
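
    The note above about the postfix increment in the %seen idiom can be checked in isolation. The following is a minimal sketch using only core Perl; the helper names keep_with_postfix and keep_with_prefix are hypothetical and exist only for this comparison.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helpers (names are not from the thread).

# Postfix: $seen{$_}++ returns the OLD value (undef, i.e. false, the first
# time a name appears), so the first occurrence of each name is kept.
sub keep_with_postfix {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Prefix: ++$seen{$_} returns the NEW value (1 or more, always true),
# so the negated test is always false and every element is rejected.
sub keep_with_prefix {
    my %seen;
    return grep { !++$seen{$_} } @_;
}

my @names = qw(john john john peter peter helen);
my @post  = keep_with_postfix(@names);
my @pre   = keep_with_prefix(@names);

print "postfix kept: @post\n";                     # postfix kept: john peter helen
print "prefix kept ", scalar(@pre), " names\n";    # prefix kept 0 names
```

    The same reasoning applies to the "unless $seen{...}++" form used in the script: the test sees the count as it was before the current record was tallied.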

Re: Very basic question while reading a file line by line
by tybalt89 (Monsignor) on Dec 10, 2022 at 08:54 UTC

    Why use a while loop and read line by line when there are other ways?

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11148698
    use warnings;

    use List::AllUtils qw( uniq_by );

    open my $fh, '<', \<<END;
    id   name
    123  john
    34   john
    567  john
    11   peter
    899  peter
    87   helen
    END

    <$fh>; # skip first line
    print uniq_by { (split)[1] } <$fh>;

Re: Very basic question while reading a file line by line
by Marshall (Canon) on Dec 10, 2022 at 09:01 UTC
    another solution

    use strict;
    use warnings;

    <DATA>; #throw away first line
    my %names;
    while (<DATA>)
    {
        my ($name) = (split ' ',$_)[1];
        print unless $names{$name}++;
    }

    =PRINTS
    123  john
    11   peter
    87   helen
    =cut

    __DATA__
    id   name
    123  john
    34   john
    567  john
    11   peter
    899  peter
    87   helen

Node Type: perlquestion [id://11148698]
Approved by atcroft