Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?



by coolda (Novice)

on Sep 28, 2014 at 03:38 UTC (id://1102252)


coolda has asked for the wisdom of the Perl Monks concerning the following question:

As the title indicates, I have thousands of files. Each file follows the same format, for example:

Gene exp1 exp2 exp3 exp4 ...
1
2
3
4
5
6

I want to take only the third column from every file and put those columns into one file so I can compare them. The code I'm working on now is far too long. Is there any way I can make this simpler? Any insights, tips, or advice will be appreciated. I've been working on this for a week, and I'm still struggling.


Replies are listed 'Best First'.

Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by davido (Cardinal) on Sep 28, 2014 at 05:58 UTC

Before we could provide a useful answer, we would need to see sample input, the output the code should produce for that input, the code you tried, and a description of the comparison the code is supposed to make. Perhaps you could follow up in this thread with that additional information.

Dave

Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by Athanasius (Archbishop) on Sep 28, 2014 at 06:13 UTC

Hello coolda,

Here is one approach, using Tie::File to make it easier to repeatedly append to each line of the output file:

Read more... (3 kB)
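The collapsed code is not reproduced here. As a rough sketch of the Tie::File idea only (not Athanasius's actual code; the file names, sample data, and the choice of column are made up for illustration), the output file can be tied to an array so that each input file appends one value to every output line:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Tie::File;

# Build two tiny tab-separated input files (hypothetical sample data).
my $dir = tempdir(CLEANUP => 1);
my @inputs;
for my $name (qw(A B)) {
    my $path = "$dir/$name.txt";
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} join("\t", 'Gene', 'exp1', 'exp2'), "\n";
    print {$fh} join("\t", $_, "$name-$_-exp1", "$name-$_-exp2"), "\n" for 1 .. 3;
    close $fh or die "Cannot close $path: $!";
    push @inputs, $path;
}

# Tie the output file to an array: each element is one line of the file.
my $out_path = "$dir/combined.tsv";
tie my @out, 'Tie::File', $out_path or die "Cannot tie $out_path: $!";

for my $path (@inputs) {
    open my $in, '<', $path or die "Cannot read $path: $!";
    while (my $line = <$in>) {
        chomp $line;
        my ($gene, $value) = (split /\t/, $line)[0, 2];   # first and third columns
        my $row = $. - 1;                                  # $. is the 1-based line number
        $out[$row] = $gene unless defined $out[$row];      # first file starts each line
        $out[$row] .= "\t$value";                          # every file appends its value
    }
    close $in;
}

my @result = @out;   # copy the lines before releasing the file
untie @out;
print "$_\n" for @result;
```

This assumes every file lists the same genes in the same order, since lines are matched by position rather than by gene name.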

Another approach you should consider is to store the hundreds-of-files’ worth of data in a database, and then extract whatever you need via SQL.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by frozenwithjoy (Priest) on Sep 28, 2014 at 09:37 UTC

If you have a *nix system, what about just using a simple BASH approach?

for FILE in *.tsv; do
    # quote "$FILE" so names containing spaces still work
    cut -f3 "$FILE" > "$FILE.temp"    # use -d option if not tab-delimited
done

paste *.temp > final.tsv
rm *.temp

This code puts the third column of each file into a temp file and then pastes them all together into a final file.

Based on one of your other posts, I suspect that you might want to also have the first column of one of the files in the final file. Also, it seems reasonable to label each of the columns with the file name, at least. The code below should accomplish both of these objectives.

for FILE in *.tsv; do
    echo "$FILE" > "$FILE.temp"        # first line: the file name as a column label
    cut -f3 "$FILE" >> "$FILE.temp"    # then the third column
done

echo Gene_IDs > gene-ids.tsv
cut -f1 one-of-the-files.tsv >> gene-ids.tsv

paste gene-ids.tsv *.temp > final.tsv
rm *.temp


Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by thomas895 (Deacon) on Sep 28, 2014 at 05:10 UTC

Are you looking for split?
Perhaps something like the following:

my @columns = split /\s+/, $current_line;

-Thomas

"Excuse me for butting in, but I'm interrupt-driven..."

Re^2: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?

by Laurent_R (Canon) on Sep 28, 2014 at 08:42 UTC

my $third_col = (split /\s+/, $current_line)[2];
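One detail worth checking with either snippet, shown here on a made-up line: if a line starts with whitespace, split /\s+/ returns an empty first field, which shifts every index by one. The special pattern ' ' skips leading whitespace instead.

```perl
use strict;
use warnings;

# A hypothetical data line that happens to start with spaces.
my $current_line = "  3    224  11   11   11\n";

# /\s+/ yields an empty leading field, so index 2 is really the second column.
my $with_pattern = (split /\s+/, $current_line)[2];

# ' ' ignores leading whitespace, so index 2 is the third column.
my $with_space = (split ' ', $current_line)[2];

print "split /\\s+/ : $with_pattern\n";   # prints 224
print "split ' '   : $with_space\n";      # prints 11
```

For strictly tab-delimited files, splitting on /\t/ after a chomp is the safer choice, because it keeps empty fields in place.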

Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by GrandFather (Saint) on Sep 28, 2014 at 20:08 UTC

There is a theme developing in your questions and the answer to all of them is: database! Maybe the most useful thing you can do at this point is take a step back and tell us what you are trying to achieve with these hundreds of files, because that will influence what your database looks like and how it can be efficiently created from your files. You should also tell us if the file generation process is ongoing and whether you want to generate the output file once or, if more than once, what changes each time you generate the output.

Perl is the programming world's equivalent of English

Re^2: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?

by coolda (Novice) on Sep 28, 2014 at 22:32 UTC

The file generation process is not ongoing. I have a fixed number of files. Ultimately I'm trying to create two files, one with a table consisting of only male data and the other of only female data, in the format I described in an earlier post. I am trying to compare the male and female expression levels of each gene (which is column 1 in the table I described) and see which genes have a higher expression level in females. So my initial goal right now is to make a male file that consists of gene names (first column) and the expression levels of each male (the rest of the columns).
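A minimal sketch of that kind of table, assuming tab-separated files that each start with one header line; the sample files, their names, and the chosen column index are invented for illustration. Values are collected per gene name, and each output column is labelled with the file it came from:

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Temp qw(tempdir);

# Build two tiny tab-separated input files (hypothetical sample data).
my $dir = tempdir(CLEANUP => 1);
my @files;
for my $name (qw(A B)) {
    my $path = "$dir/$name.txt";
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} join("\t", 'Gene', 'exp1', 'exp2'), "\n";
    print {$out} join("\t", $_, "$name-$_-exp1", "$name-$_-exp2"), "\n" for 1 .. 3;
    close $out or die "Cannot close $path: $!";
    push @files, $path;
}

my $column = 2;            # zero-based index of the column to extract (assumption)
my @header = ('Gene');     # first output column holds the gene names
my (@genes, %values);      # gene order as first seen, and the values per gene

for my $path (@files) {
    push @header, basename($path);            # label the column with the file name
    open my $in, '<', $path or die "Cannot read $path: $!";
    my $skipped_header = <$in>;               # discard this file's header line
    while (my $line = <$in>) {
        chomp $line;
        my @fields = split /\t/, $line;
        push @genes, $fields[0] unless exists $values{ $fields[0] };
        push @{ $values{ $fields[0] } }, $fields[$column];
    }
    close $in;
}

# One header row, then one row per gene with a value from every file.
my @table = (join "\t", @header);
for my $gene (@genes) {
    push @table, join "\t", $gene, @{ $values{$gene} };
}
print "$_\n" for @table;
```

Running it once over the male files and once over the female files would give the two tables. If a gene is missing from some file, its row simply comes out shorter, so a real script would need to decide how to pad such gaps.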

Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by CountZero (Bishop) on Sep 28, 2014 at 19:31 UTC

use Modern::Perl qw/2014/;
use File::Find::Iterator;

# Visit every file under the directory whose name matches GENES<number>.csv
my $find =
  File::Find::Iterator->create( dir => ['d:/Perl/scripts'], filter => \&find );
open my $FH_OUT, '>', './results.CSV' or die "Could not open results file - $!";
while ( my $file = $find->next ) {
    open my $FH_IN, '<', $file or die "Could not open $file - $!";
    while ( my $line = <$FH_IN> ) {
        chomp $line;    # keep a trailing newline out of the last field
        # Write the first and third comma-separated columns
        say $FH_OUT join ', ', ( split /,/, $line )[ 0, 2 ];
    }
}

sub find { /GENES\d+\.csv/; }

CountZero

"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics

Re^2: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?

by frozenwithjoy (Priest) on Sep 28, 2014 at 19:40 UTC

think

Re^3: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?

by coolda (Novice) on Sep 28, 2014 at 19:54 UTC

Gene exp1 exp2 exp3 exp4 ...
1    1050 2020 100  100
2    100  100  100  100
3    224  11   11   11
4    11   15   555  444
5    22   51   55   555
6    55   55   55   555
...

Gene file1 exp4 file2 exp4 file3 exp4 file4 exp4....
1       100         200      155         144
2        22         55       222         444
3
4
5
6
.
.

Re^4: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?

by frozenwithjoy (Priest) on Sep 28, 2014 at 19:59 UTC

Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by Anonymous Monk on Sep 28, 2014 at 22:30 UTC

Create 26 sample files, each with a Gene column, 10 expression columns, and 2000 rows:

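use strict;
use warnings;
use feature 'say';

mkdir 'tmp' unless -d 'tmp';    # the sample files are written into ./tmp

for my $letter ('A'..'Z') {
    my $file = "tmp/$letter.txt";
    open my $fh, '>', $file or die "No open > $file: $!";

    # Header line: Gene, exp1 .. exp10
    say $fh join "\t", 'Gene', map "exp$_", 1..10;

    for my $i (1..2000) {
        say $fh join "\t", $i, map "$letter-$i-exp$_", 1..10;
    }
    close $fh or die "No close $file: $!";
}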

First lines of A.txt:

Gene    exp1    exp2    exp3    exp4    exp5    exp6    exp7    exp8    exp9    exp10
1    A-1-exp1    A-1-exp2    A-1-exp3    A-1-exp4    A-1-exp5    A-1-exp6    A-1-exp7    A-1-exp8    A-1-exp9    A-1-exp10
2    A-2-exp1    A-2-exp2    A-2-exp3    A-2-exp4    A-2-exp5    A-2-exp6    A-2-exp7    A-2-exp8    A-2-exp9    A-2-exp10
...

Append exp4 column from each file to end of lines:

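use strict;
use warnings;
use feature 'say';

@ARGV = <tmp/*.txt>;    # <> below reads every file named in @ARGV, in this order

my %row;

while (<>) {
    # Field 0 is the gene ID; field 4 is the exp4 column
    my ($gene, $exp4) = (split /\t/)[0,4];
    $row{$gene} .= "\t$exp4";
}

delete $row{Gene};    # drop the row accumulated from the header lines

say "$_$row{$_}" for sort {$a <=> $b} keys %row;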

First lines of output:

1    A-1-exp4    B-1-exp4    C-1-exp4    D-1-exp4    E-1-exp4    F-1-exp4    G-1-exp4    H-1-exp4    I-1-exp4    J-1-exp4    K-1-exp4    L-1-exp4    M-1-exp4    N-1-exp4    O-1-exp4    P-1-exp4    Q-1-exp4    R-1-exp4    S-1-exp4    T-1-exp4    U-1-exp4    V-1-exp4    W-1-exp4    X-1-exp4    Y-1-exp4    Z-1-exp4
2    A-2-exp4    B-2-exp4    C-2-exp4    D-2-exp4    E-2-exp4    F-2-exp4    G-2-exp4    H-2-exp4    I-2-exp4    J-2-exp4    K-2-exp4    L-2-exp4    M-2-exp4    N-2-exp4    O-2-exp4    P-2-exp4    Q-2-exp4    R-2-exp4    S-2-exp4    T-2-exp4    U-2-exp4    V-2-exp4    W-2-exp4    X-2-exp4    Y-2-exp4    Z-2-exp4
...


Re^2: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?

by coolda (Novice) on Sep 29, 2014 at 01:10 UTC

This is great. May I ask what @ARGV and <> do in the second piece of code? I googled it and learned that the empty diamond reads the files named in @ARGV. So if you just set @ARGV = <*.txt>, does it read every .txt file saved in that directory, in order? If I want to skip the first line of every file, what should I do? I tried many things but nothing worked. I usually use <$fh>; to read the first line, and I also tried next if $. < 2, but neither worked. Is there any way to skip the header (the first line) when using while(<>){}?

Re^3: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?

by Lotus1 (Vicar) on Sep 29, 2014 at 03:47 UTC

The Anonymous Monk deleted the header row in the code provided above.

delete $row{Gene};

That seems like the easiest way to do it. To do what you are asking here, you can use eof. Also refer to "Variables related to filehandles" in perlvar.
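A minimal sketch of the eof approach, using two made-up sample files: with <>, the line counter $. keeps counting across files unless ARGV is closed at the end of each one, which is why next if $. < 2 on its own only skips the header of the first file.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Build two tiny tab-separated sample files (hypothetical data).
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(A B)) {
    my $path = "$dir/$name.txt";
    open my $fh, '>', $path or die "Cannot write $path: $!";
    print {$fh} join("\t", 'Gene', 'exp1', 'exp2'), "\n";
    print {$fh} join("\t", $_, "$name-$_-exp1", "$name-$_-exp2"), "\n" for 1 .. 3;
    close $fh or die "Cannot close $path: $!";
}

@ARGV = map { "$dir/$_.txt" } qw(A B);   # the files that <> will read, in order

my @third_column;
while (my $line = <>) {
    chomp $line;
    next if $. == 1;                      # first line of the current file: the header
    push @third_column, (split /\t/, $line)[2];
}
continue {
    close ARGV if eof;                    # end of this file: closing ARGV resets $.
}

print "$_\n" for @third_column;
```

The continue block also runs after next, so the reset still happens for a file that contains nothing but a header line.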

Re^4: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?

by coolda (Novice) on Sep 29, 2014 at 16:50 UTC

Re^4: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?

by coolda (Novice) on Sep 29, 2014 at 19:30 UTC

Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by frozenwithjoy (Priest) on Sep 28, 2014 at 20:25 UTC

Another option is to take advantage of CPAN. Check out merge_datasets.pl from Bio::ToolBox.

Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by frozenwithjoy (Priest) on Sep 28, 2014 at 09:38 UTC
