Re: output unique lines only
by swkronenfeld (Hermit) on Dec 06, 2005 at 16:42 UTC
No need for Perl, unless you're doing something more complicated. Type this from your *IX command line.
cut -d" " -f1 FileName | sort | uniq
|
I'd go for a shell pipe as well, and it would be close to your suggestion. Except that I wouldn't use the final pipe, but use sort -u instead. But that's just a minor difference. I won't be handing out 'useless use of uniq' awards.
|
Nearly every use of "sort | uniq" can be replaced with "sort -u".
|
The only reason I can think of to use sort and uniq in combination instead of "sort -u" is to skip specific columns when looking for unique instances.
Example:
...
RH_MEa0001bG06_5 710 14 16 Invalid starting position (14)
RH_MEa0001bG06_4 710 125 12 GGGGGACACCTTCTCTCTCT...
RH_MEa0001bG06_6 710 125 12 GGGGGACACCTTCTCTCTCT...
...
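Sending a file containing this output through " | sort -k2 | uniq -f1" compares each line with the one before it, ignoring the column you want to skip (column 1 in this case), and keeps only the first line of each run of matches. Note that uniq only compares adjacent lines, so the sort has to key on the same columns uniq compares; a plain "sort" orders by column 1 and would leave all three lines in the output. With the sort keyed on column 2 onward, you get: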
...
RH_MEa0001bG06_4 710 125 12 GGGGGACACCTTCTCTCTCT...
RH_MEa0001bG06_5 710 14 16 Invalid starting position (14)
...
Re: output unique lines only
by tirwhan (Abbot) on Dec 06, 2005 at 16:33 UTC
You should try to make a little bit of effort to arrive at a solution on your own, at least say "This is what I've tried but it doesn't work and I don't know why".
Your task can be solved by reading the file in a loop, using split on each line and then putting the first returned element into a hash as a key (for example $hash{$element}=1). After you have read the whole file you can open another file for writing and do:
for my $name(keys %hash) {
print $filehandle "$name\n";
}
Try to solve it with that information and do come back and ask if you have problems.
Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. -- Brian W. Kernighan
Re: output unique lines only
by davorg (Chancellor) on Dec 06, 2005 at 16:36 UTC
What parts are you having trouble with?
- Use "open" to open the file
- Use "<...>" to read from the file
- Use "split" to break each line into its parts
- Use a hash to store the filenames
- Only print a filename if it doesn't already exist in the hash
Update: I deliberately didn't give any code as I don't like to help people who show no sign of putting any effort in for themselves. It seems that others don't agree with that policy.
--
<http://dave.org.uk>
"The first rule of Perl club is you do not talk about
Perl club." -- Chip Salzenberg
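The steps listed above could be sketched roughly as follows. This is only one possible reading of them: the helper name unique_first_fields and the sample data are made up for illustration, and it assumes the filename is the first tab-separated field on each line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first tab-separated field of each line,
# keeping only the first occurrence and preserving input order.
sub unique_first_fields {
    my ($in) = @_;
    my (%seen, @unique);
    while (my $line = <$in>) {
        chomp $line;
        next unless length $line;          # skip blank lines
        my ($name) = split /\t/, $line;    # first field only
        next if exists $seen{$name};       # already recorded
        $seen{$name} = 1;
        push @unique, $name;
    }
    return @unique;
}

# Demo on an in-memory file; the sample data is invented.
my $data = "filename1\t10\nfilename2\t20\nfilename1\t30\nfilename4\t40\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print "$_\n" for unique_first_fields($fh);
close $fh;
```

Checking the hash before storing the key is what preserves the order in which filenames first appear, which a plain "keys" loop does not guarantee.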
Re: output unique lines only
by chibiryuu (Beadle) on Dec 06, 2005 at 16:36 UTC
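my %seen;
while (<>) {
    chomp;        # drop the newline so lines without a tab compare equal too
    s/\t.*//s;    # keep only the text before the first tab
    $seen{$_}++ or print "$_\n";    # print a name the first time it is seen
}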
Re: output unique lines only
by blazar (Canon) on Dec 06, 2005 at 16:39 UTC
$ perl -lne 's/\t.*//; print if !$saw{$_}++' input_file > output_file
Re: output unique lines only
by EdwardG (Vicar) on Dec 06, 2005 at 16:44 UTC
# uniqfiles.pl
use strict; # helps prevent silly mistakes
use warnings; # helpful when writing code
my %uniq_fnames; # has to be declared, since "use strict" is in effect
while (<>) { # Reads from STDIN (or from files named on the command line)
    if (/^(\w+)\t/) { # If the line starts with one or more 'word' characters followed by a tab...
        my $filename = $1; # ...assume we've got a filename captured
        $uniq_fnames{$filename} = 1; # ...and add it to our hash.
    }
}
print $_,"\n" for keys %uniq_fnames; # prints to STDOUT, can be piped to a file
Then you could use this as follows:
perl uniqfiles.pl < my_non_unique_list_of_files > my_unique_list_of_files
Re: output unique lines only
by cormanaz (Deacon) on Dec 06, 2005 at 19:05 UTC
This is easy to do with a hash. Open the file, read in one line at a time and use the split function to put the first element in each line (i.e. the filename) into a variable like $fn. If your hash is called %uniquefiles you then set the value for $fn to some arbitrary value, like
$uniquefiles{$fn} = 1;
If your loop comes across the same filename again, it will simply set the same value for the same filename, in effect eliminating the dupes. When you're all done %uniquefiles will only contain the unique filenames, which you can print like so:
foreach my $k (keys %uniquefiles) {
print OUT "$k\n";
}
If you're just learning Perl, make sure you learn about hashes. They're a very powerful feature.
Steve
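As a small illustration of why assigning the same key twice removes duplicates, here is a sketch of the approach described above. The helper name unique_sorted and the sample lines are hypothetical, and sorting the keys is an addition of this sketch, since "keys" returns hash entries in no particular order.

```perl
use strict;
use warnings;

# Hypothetical helper: store the first tab-separated field of every line
# as a hash key, then return the keys in sorted order.
sub unique_sorted {
    my (@lines) = @_;
    my %uniquefiles;
    for my $line (@lines) {
        chomp(my $copy = $line);       # work on a copy, minus the newline
        next unless length $copy;      # skip blank lines
        my ($fn) = split /\t/, $copy;  # first field is the filename
        $uniquefiles{$fn} = 1;         # storing an existing key changes nothing
    }
    return sort keys %uniquefiles;
}

# Invented sample lines: "b" appears twice but is printed once.
print "$_\n" for unique_sorted("b\t1\n", "a\t2\n", "b\t3\n");
```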
|
Thanks everyone for their tips/suggestions. I've decided to approach this using a hashtable.
I came up with the following script but it doesn't seem to be working correctly.
#!/usr/bin/perl -w
$filelist = "/home/exp/acctlist.txt";
open(FILEDUPS, $filelist) || die ("Cannot open $filelist");
open($output, '>', '/home/exp/output.txt') || die ("Cannot open file");
while ($line = <FILEDUPS>) {
chomp $line;
($filename, undef, undef, undef, undef) = split /\t/, $line;
}
$uniquefiles{$filename} = 1;
foreach $k (keys %uniquefiles) {
print $output "$k\n";
}
It currently only outputs one line.
For example, if my file contains
filename1
filename2
filename1
filename4
Then it outputs the first line only:
filename1
Whereas it should output:
filename1
filename2
filename4
I've spent a long time trying to debug this, but I'm not sure where I'm going wrong.
Thanks.
|
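Hi,
I think you should put the $uniquefiles{$filename} = 1; assignment inside the while loop. As written it runs only once, after the loop has finished, so only a single filename ever ends up in the hash.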
-kulls
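For reference, a minimal sketch of the script with that change applied might look like this. The sub name write_unique is hypothetical, the demo uses in-memory filehandles rather than the /home/exp/ paths so it can be run anywhere, and the "sort" is an addition of this sketch because "keys" gives no guaranteed order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the loop from the script above, with the
# hash assignment moved inside the while loop.
sub write_unique {
    my ($in, $out) = @_;
    my %uniquefiles;
    while (my $line = <$in>) {
        chomp $line;
        next unless length $line;            # skip blank lines
        my ($filename) = split /\t/, $line;  # first field only
        $uniquefiles{$filename} = 1;         # now runs once per line
    }
    # keys returns entries in no particular order, so sort for stable output
    print {$out} "$_\n" for sort keys %uniquefiles;
    return scalar keys %uniquefiles;
}

# Demo with in-memory filehandles and the sample input from the question.
my $input  = "filename1\nfilename2\nfilename1\nfilename4\n";
my $result = '';
open my $in,  '<', \$input  or die "Cannot open input: $!";
open my $out, '>', \$result or die "Cannot open output: $!";
write_unique($in, $out);
close $out;
print $result;
```

With "use strict" in place, the variables have to be declared with "my", which also makes it easier to see that $filename only has a meaningful value inside the loop.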