PerlMonks  

Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?

by coolda (Novice)
on Sep 28, 2014 at 03:38 UTC ( #1102252=perlquestion )

coolda has asked for the wisdom of the Perl Monks concerning the following question:

As the title indicates, I have thousands of files. Each file follows the same format, for example:

Gene exp1 exp2 exp3 exp4 ...
1
2
3
4
5
6

I want to take only the third column from every file and put the columns into one file so I can compare them. The code I'm working on now is far too long. Is there any way I can make this simpler? Any insights, tips, or advice will be appreciated. I've been working on this for a week, and I am still struggling.


Replies are listed 'Best First'.
Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by davido (Cardinal) on Sep 28, 2014 at 05:58 UTC

    Sample input, sample output that the code should produce given the sample input, the code you tried, and a description of the comparison the code is supposed to do: these are all things we would need to know before we could provide a useful answer. Perhaps you could follow up in this thread with additional information that would help us to help you.


    Dave

Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by Athanasius (Bishop) on Sep 28, 2014 at 06:13 UTC

    Hello coolda,

    Here is one approach, using Tie::File to make it easier to repeatedly append to each line of the output file:
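A minimal sketch of the Tie::File idea, written for illustration rather than taken from the reply. It assumes tab-delimited input files that all have the same number of rows in the same order; the subroutine name `append_column` and the file names in the usage note are hypothetical.

```perl
use strict;
use warnings;
use Tie::File;

# Append one column (0-based index $col) from each tab-delimited input
# file to the matching line of $out_file. Names here are illustrative.
sub append_column {
    my ($out_file, $col, @in_files) = @_;

    # Each element of @out is one line of the output file.
    tie my @out, 'Tie::File', $out_file
        or die "Cannot tie '$out_file': $!";

    for my $file (@in_files) {
        open my $in, '<', $file or die "Cannot open '$file': $!";
        my $i = 0;                              # 0-based row number
        while (my $line = <$in>) {
            chomp $line;
            my $value = (split /\t/, $line)[$col];
            $value = '' unless defined $value;  # short line: empty cell
            # Extend an existing output row, or start a new one.
            $out[$i] = $i < @out ? "$out[$i]\t$value" : $value;
            $i++;
        }
        close $in;
    }
    untie @out;
    return;
}
```

For example, `append_column('combined.tsv', 0, $files[0])` followed by `append_column('combined.tsv', 2, @files)` would write the gene column once and then the third column of every file. Because every appended value lengthens an existing line, this may be slow for thousands of files; collecting the rows in a hash first is a common alternative.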

    Another approach you should consider is to store the hundreds-of-files’ worth of data in a database, and then extract whatever you need via SQL.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by frozenwithjoy (Priest) on Sep 28, 2014 at 09:37 UTC

    If you have a *nix system, what about just using a simple BASH approach?

    for FILE in *.tsv; do
        cut -f3 "$FILE" > "$FILE.temp" # use -d option if not tab-delimited
    done
    paste *.temp > final.tsv
    rm *.temp

    This code puts the third column of each file into a temp file and then pastes them all together into a final file.

    Based on one of your other posts, I suspect that you might want to also have the first column of one of the files in the final file. Also, it seems reasonable to label each of the columns with the file name, at least. The code below should accomplish both of these objectives.

    for FILE in *.tsv; do
        echo "$FILE" > "$FILE.temp"
        cut -f3 "$FILE" >> "$FILE.temp"
    done
    echo Gene_IDs > gene-ids.tsv
    cut -f1 one-of-the-files.tsv >> gene-ids.tsv
    paste gene-ids.tsv *.temp > final.tsv
    rm *.temp

Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by thomas895 (Deacon) on Sep 28, 2014 at 05:10 UTC

    Are you looking for split?
    Perhaps something like the following:

    my @columns = split /\s+/, $current_line;
    -Thomas
    "Excuse me for butting in, but I'm interrupt-driven..."
      Or possibly, to make the code shorter:
      my $third_col = (split /\s+/, $current_line)[2];
      But that does not solve the issue raised by other monks: the original post lacks too many details for us to suggest an improvement to code that we have not seen.
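To show how the split idiom fits into a loop over a single file, here is a small sketch. The subroutine name `third_column` is made up for illustration, and the input is assumed to be whitespace-separated with at least three fields per line.

```perl
use strict;
use warnings;

# Return the third whitespace-separated field of every line in a file.
# third_column() is an illustrative name, not code from this thread.
sub third_column {
    my ($file) = @_;
    open my $fh, '<', $file or die "Cannot open '$file': $!";
    my @values;
    while (my $current_line = <$fh>) {
        chomp $current_line;
        # A list slice picks out field index 2 without a temporary array.
        push @values, (split /\s+/, $current_line)[2];
    }
    close $fh;
    return @values;
}
```

If lines may begin with whitespace, `split ' ', $current_line` is the safer form, because `split /\s+/` then produces an empty leading field and shifts every index by one.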
Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by GrandFather (Saint) on Sep 28, 2014 at 20:08 UTC

    There is a theme developing in your questions and the answer to all of them is: database! Maybe the most useful thing you can do at this point is take a step back and tell us what you are trying to achieve with these hundreds of files, because that will influence what your database looks like and how it can be efficiently created from your files. You should also tell us whether the file generation process is ongoing and whether you want to generate the output file once or, if more than once, what changes each time you generate the output.

    Perl is the programming world's equivalent of English
      The file generation process is not ongoing; I have a fixed number of files. Ultimately I'm trying to create two files, one with a table consisting of only male data and the other of only female data, in the format I described in an earlier post. I am trying to compare the male and female expression levels of each gene (column 1 in the table I described) and see which genes have a higher expression level in females. So my initial goal right now is to make a male file that consists of gene names (first column) and the expression levels of each male (the rest of the columns).
Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by CountZero (Bishop) on Sep 28, 2014 at 19:31 UTC
    It needs a few lines of Perl code only:
    use Modern::Perl qw/2014/;
    use File::Find::Iterator;

    my $find = File::Find::Iterator->create(
        dir    => ['d:/Perl/scripts'],
        filter => +\&find
    );
    open my $FH_OUT, '>', './results.CSV' or die "Could not open results file - $!";
    while ( my $file = $find->next ) {
        open my $FH_IN, '<', $file or die "Could not open $file - $!";
        say $FH_OUT join ', ', ( split /,/ )[ 0, 2 ] while (<$FH_IN>);
    }

    sub find {
        /GENES\d+\.csv/;
    }

    I tested it with 1000 files of 1000 lines of 10 fields each: Extracting the first and third column and saving them in the results file took 47 seconds on my ASUS tablet with a 1.33 GHz Intel ATOM Z3740 (4 core) processor. I call that very efficient.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
      This results in a file with 2 columns and 1,000,000 rows, right? I'm not entirely sure, but I think that OP wants the final file to have 1000 + 1 columns and 1000 rows. Maybe...
        Sorry for not writing a detailed description of what I am working on. I have thousands of files in tab-delimited format saved in one folder. Each file is a 2000 by 10 table. What I want to do right now is create a new file with the data I want. The format of each file is as follows:
        Gene  exp1  exp2  exp3  exp4 ...
        1     1050  2020  100   100
        2     100   100   100   100
        3     224   11    11    11
        4     11    15    555   444
        5     22    51    55    555
        6     55    55    55    555
        ...
        From the first file I read, I want to extract two columns, for example the 'Gene' and 'exp4' columns, and put them in a new file. From each of the other files, I want to extract only the 'exp4' column and add it to the right of the two columns I extracted from the first file. So the final format would look like:
        Gene  file1 exp4  file2 exp4  file3 exp4  file4 exp4 ....
        1     100         200         155         144
        2     22          55          222         444
        3
        4
        5
        6
        .
        .
        So the result will be a table of 2000 rows by thousands of columns (one per file). I am a beginner in programming, and especially in Perl. Please help.
Re: Is there any efficient way i can take out a specific column from hundreds of files and put it in one file?
by Anonymous Monk on Sep 28, 2014 at 22:30 UTC

    Create 26 sample files, 10 columns, 2000 rows:

    use feature 'say';    # say() is not enabled by default

    for my $letter ('A'..'Z') {
        my $file = "tmp/$letter.txt";
        open my $fh, '>', $file or die "No open > $file: $!";
        say $fh join "\t", 'Gene', map "exp$_", 1..10;
        for my $i (1..2000) {
            say $fh join "\t", $i, map "$letter-$i-exp$_", 1..10;
        }
    }

    First lines of A.txt:

    Gene exp1 exp2 exp3 exp4 exp5 exp6 exp7 exp8 exp9 exp10
    1 A-1-exp1 A-1-exp2 A-1-exp3 A-1-exp4 A-1-exp5 A-1-exp6 A-1-exp7 A-1-exp8 A-1-exp9 A-1-exp10
    2 A-2-exp1 A-2-exp2 A-2-exp3 A-2-exp4 A-2-exp5 A-2-exp6 A-2-exp7 A-2-exp8 A-2-exp9 A-2-exp10
    ...

    Append exp4 column from each file to end of lines:

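    use feature 'say';    # say() is not enabled by default

    @ARGV = <tmp/*.txt>;    # <> below reads these files in order
    my %row;
    while (<>) {
        my ($gene, $exp4) = (split /\t/)[0,4];
        $row{$gene} .= "\t$exp4";
    }
    delete $row{Gene};      # drop the accumulated header row
    say "$_$row{$_}" for sort {$a <=> $b} keys %row;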

    First lines of output:

    1 A-1-exp4 B-1-exp4 C-1-exp4 D-1-exp4 E-1-exp4 F-1-exp4 G-1-exp4 H-1-exp4 I-1-exp4 J-1-exp4 K-1-exp4 L-1-exp4 M-1-exp4 N-1-exp4 O-1-exp4 P-1-exp4 Q-1-exp4 R-1-exp4 S-1-exp4 T-1-exp4 U-1-exp4 V-1-exp4 W-1-exp4 X-1-exp4 Y-1-exp4 Z-1-exp4
    2 A-2-exp4 B-2-exp4 C-2-exp4 D-2-exp4 E-2-exp4 F-2-exp4 G-2-exp4 H-2-exp4 I-2-exp4 J-2-exp4 K-2-exp4 L-2-exp4 M-2-exp4 N-2-exp4 O-2-exp4 P-2-exp4 Q-2-exp4 R-2-exp4 S-2-exp4 T-2-exp4 U-2-exp4 V-2-exp4 W-2-exp4 X-2-exp4 Y-2-exp4 Z-2-exp4
    ...
      This is great. May I ask what @ARGV and <> do in the second piece of code? I googled it and learned that the empty diamond reads the files named in @ARGV. So if you just set @ARGV = <*.txt>, does it read every .txt file saved in that directory, in order? And if I want to skip the first line of every file, what should I do? I tried many things but none of them worked: I usually use <$fh>; to read the first line, and I also tried next if $. < 2, but neither worked. Is there any way to skip the header (the first line) when using while(<>){}?

        The Anonymous Monk deleted the header row in the code provided above.

        delete $row{Gene};

        That seems like the easiest way to do it. To do what you are asking here you can use eof. Also refer to Variables related to filehandles
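A sketch of the eof approach mentioned above, assuming every file starts with exactly one header line. With `while (<>)`, `$.` keeps counting across files unless ARGV is closed explicitly, which is why `next if $. < 2` on its own only skips the first line of the first file. The subroutine name `collect_column` and its parameters are illustrative.

```perl
use strict;
use warnings;

# Collect one column (0-based index $col) per gene across several
# tab-delimited files, skipping the first line of each file.
sub collect_column {
    my ($col, @files) = @_;
    return {} unless @files;   # an empty @ARGV would make <> read STDIN
    local @ARGV = @files;      # <> reads the files named in @ARGV, in order
    my %row;
    while (my $line = <>) {
        next if $. == 1;       # header line of the current file
        chomp $line;
        my @fields = split /\t/, $line;
        $row{ $fields[0] } .= "\t$fields[$col]";
    }
    continue {
        # Runs after every iteration, including those ended by next.
        close ARGV if eof;     # end of current file: resets $. to 0
    }
    return \%row;
}
```

Calling `collect_column(4, glob 'tmp/*.txt')` would gather the exp4 column for each gene, leaving no `Gene` key to delete afterwards.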

