 
PerlMonks  

Split a large text file by columns

by tc (Novice)
on Apr 20, 2017 at 21:51 UTC [id://1188487]

tc has asked for the wisdom of the Perl Monks concerning the following question:

I have a question regarding Perl that I have been working on for the past week. I have a text file with 200 columns and the same number of rows per column. The first column contains the names of each row. I would like to split the file every 3 columns and export each chunk to a new file. The caveat is that the first column (the names) needs to be merged into each file. Is it possible to do this in Perl?

I can split the file using awk in bash, but I'm unable to merge the first column into each new file. I know Perl has a split function, but that splits rows, not columns. I have tried a few Perl scripts, but they depend on specific column numbers, so as my ideas have dried up, I turn to you, Monks.

Replies are listed 'Best First'.
Re: Split a large text file by columns
by Marshall (Canon) on Apr 20, 2017 at 23:35 UTC
    It would be helpful if you could post a small subset of some actual data. I guess you have something like this?:
    row_nameA:col2:col3:col4:col5:col6...col200
    row_nameB:col2:col3:col4:col5:col6...col200
    It is completely unclear what separates the columns (above, ':'). This detail does matter. I suggest you give us, say, the first 7 columns x 3 rows and put that data within <code></code> tags. Then show us your "best go" at this problem so far in Perl.

    Can you give more info about the size of this input file? How many rows? I suspect that the entire input file will fit comfortably in memory and that generating the 70 or so output files can proceed in a straightforward way. There are a number of techniques for doing this. Your question is still too general to get a concrete answer other than "heck yes, Perl can do it!". Oh, also mention whether performance is of any concern at all. I don't expect that to be an issue here, because most of the time will be taken by I/O in generating the plethora of output files.

    I guess there is the additional question, at least for my own curiosity: why are you doing this? Your application just seems odd enough (200 cols, 3 cols per file, about 70 output files), that perhaps there is a better way to do whatever it is that you are trying to do. This might be what is known as an X-Y problem.

    Update:
    Below is some code for one way to do this; there are other ways.

    The code reads the input file and makes a 2-D array of the data. I presume that this amount of data will "fit" into memory without problems. If the line format is complex, then perhaps a CSV module will be needed to parse each line? At each iteration of generating a new file, column 1 (the name) is reused and then the next leftmost 3 columns of data are consumed (the @data array "shrinks"). The loop ends when only column 1 remains of the original data.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $number_of_cols_per_file = 3;
    my @data;    # this is a 2-D array

    while (my $line = <DATA>) {
        chomp $line;
        my @cols = split (':', $line);
        push @data, \@cols;
    }

    my $file_num = 1;
    while (@{$data[0]} > 1)    # any columns after the name column left?
    {
        # generate the next file
        # This print would change to a "file open" statement for
        # file_num, n...
        print "File Number = ", $file_num++, "\n";
        foreach my $row_ref (@data) {
            my $row_name  = $row_ref->[0];
            my @data_cols = splice (@$row_ref, 1, $number_of_cols_per_file);
            print join(":", $row_name, @data_cols), "\n";
        }
    }

    =pod

    Prints:
    File Number = 1
    row_nameA:col2a:col3a:col4a
    row_nameB:col2b:col3b:col4b
    File Number = 2
    row_nameA:col5a:col6a
    row_nameB:col5b:col6b

    =cut

    __DATA__
    row_nameA:col2a:col3a:col4a:col5a:col6a
    row_nameB:col2b:col3b:col4b:col5b:col6b

      Thank you, Marshall. I appreciate your help.

      The file columns are tab-separated. A small sample of the data file is below:

      <GSOR>  vnir_1  vnir_2  vnir_3  vnir_4  vnir_5  vnir_6  vnir_6
      310015  0.37042 0.36909 0.36886 0.36698 0.36615 0.36449 0.36404
      310100  0.25889 0.25773 0.2569  0.25563 0.25565 0.25511 0.25508
      310134  0.26163 0.26149 0.26059 0.26034 0.2604  0.2598  0.26085
      310167  0.23168 0.23031 0.23045 0.22822 0.2267  0.22575 0.22453
      310196  0.26995 0.26902 0.2685  0.26689 0.26624 0.2647  0.26461

      Hi Marshall, thank you, sir, for your knowledge and help. The code that you wrote does the job, and it works great. I spent a week trying to write code that does this. I have a lot to learn. Your time and knowledge are most appreciated.

      I will try to provide as much detail as I can about the file and why I am trying to do this. The file is a set of wavelength measurements collected from about 200 individual plants. The instrument measures a response for each wavelength between 300 nm and 1000 nm at different intervals. Usually, the file contains between 100 and 200 columns, depending on the settings, and 200 rows (one per plant ID). There are about 20 files in total when measurements are completed. I need to separate each column and keep the first column (row names) with each column. I will then analyze each of these files for genetic information for research.

      Memory is no issue, since I can run it on a server. I agree my application is odd for what Perl is typically used for, but I started teaching myself Perl since I am working with large files and need an efficient way to process them. I have been able to write a couple of scripts that have made a few tedious tasks efficient and less mistake-prone. That's the gist of it.

      Thank you again, Monks.

        Here is a way to do what you described (with the help of the Data::Table and Path::Tiny CPAN modules).

        I'm assuming that there was a typo in the data you provided, and I changed the name of the last column to vnir_7.

        I put the following tab-delimited data into a file called data.tsv:

        <GSOR>  vnir_1  vnir_2  vnir_3  vnir_4  vnir_5  vnir_6  vnir_7
        310015  0.37042 0.36909 0.36886 0.36698 0.36615 0.36449 0.36404
        310100  0.25889 0.25773 0.2569  0.25563 0.25565 0.25511 0.25508
        310134  0.26163 0.26149 0.26059 0.26034 0.2604  0.2598  0.26085
        310167  0.23168 0.23031 0.23045 0.22822 0.2267  0.22575 0.22453
        310196  0.26995 0.26902 0.2685  0.26689 0.26624 0.2647  0.26461
        This script processes the data and creates the files:

        #!/usr/bin/env perl
        use strict;
        use warnings;
        use Data::Table;
        use Path::Tiny;

        # Load the tsv file with a header
        my $dt = Data::Table::fromTSV('data.tsv', 1);

        # Get a Data::Table that contains only the first column
        my $names_dt = $dt->subTable(undef, ['<GSOR>']);

        my $n_col        = $dt->nofCol;
        my @column_names = $dt->header;

        for (my $i = 1; $i <= $n_col - 1; ++$i) {
            my $col_name = $column_names[$i];
            my $col_dt   = $dt->subTable(undef, [$col_name]);

            my $new_dt = $names_dt->clone();
            $new_dt->colMerge($col_dt);

            my $file_name = "file_$i.tsv";
            my $fh = path($file_name)->openw_utf8;
            print {$fh} $new_dt->tsv;
            $fh->close;
        }

        exit;
Re: Split a large text file by columns
by ww (Archbishop) on Apr 20, 2017 at 22:13 UTC

    Simple text? (You're not giving us much to work on.)

    ... or maybe csv, tsv or fixed-width-fields?
    If so, one of the csv modules could help.
Re: Split a large text file by columns
by Anonymous Monk on Apr 20, 2017 at 21:58 UTC

    Is it possible to do this in perl?

    Yes, absolutely

    I can split the file using awk in bash, but I'm unable to merge the first column into each new file. I know Perl has a split function, but that splits rows, not columns. I have tried a few Perl scripts, but they depend on specific column numbers, so as my ideas have dried up, I turn to you, Monks.

    Please link the scripts you tried

    Any such program is going to be "dependent on specific column numbers" in one way or another
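    For instance, here is a minimal sketch of such a one-pass splitter where the chunk size is the only parameter. It assumes tab-separated input with the row name in column 1; the sample file data.tsv and the output names chunk_1.tsv, chunk_2.tsv, ... are made up for illustration:

    ```shell
    # Hypothetical sample input: tab-separated, row names in column 1.
    printf '<GSOR>\tv1\tv2\tv3\tv4\n310015\t0.1\t0.2\t0.3\t0.4\n' > data.tsv

    # Peel off 3 data columns at a time, prepending column 1 to each chunk file.
    perl -F'\t' -lane '
        BEGIN { $n = 3 }                        # columns per output file
        for ($i = 1; $i < @F; $i += $n) {
            $k = int(($i - 1) / $n) + 1;        # chunk (file) number
            open($fh{$k}, ">", "chunk_$k.tsv") unless $fh{$k};
            $last = $i + $n - 1 > $#F ? $#F : $i + $n - 1;
            print { $fh{$k} } join("\t", $F[0], @F[$i .. $last]);
        }
    ' data.tsv

    cat chunk_1.tsv
    cat chunk_2.tsv
    ```

    The same idea scales to the 200-column case; only $n and the input filename change, not any hard-coded column positions.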

      My apologies for the previous submission; I hit the create button by mistake. It is formatted better now.

      Thank you, Anonymous Monk. The column numbers vary by file. Below is a Perl script where I first split the file using awk in bash and then try to merge each resulting file with a file containing the row names. There is also another script that splits each row on tabs and places the first element of the array in a scalar variable. Then I tried adding the first element and the remaining elements to a hash.

      The first script:

      #!/usr/bin/perl -w
      use strict;
      use warnings;
      use diagnostics;
      use Getopt::Std;

      # read options
      our ($opt_i);
      getopts('i:');
      if (!$opt_i) {
          print STDERR "\nInput file name (-i) required\n\n\n";
      }

      # open the file or die
      open INFILE, "<", $opt_i or die "No such input file $opt_i";

      while (<INFILE>) {
          chomp;
          # split the row on tabs
          my @fh = split ('\t', $_);
          #print "@fh";
          my $geno = shift @fh;

          # collect the split files from the current directory
          #opendir(DIR, ".");
          #my @files = readdir(DIR);
          #closedir(DIR);
          my @files = glob("*.txt");

          # process each file, prepending the row name to each line
          foreach my $file (@files) {
              open FILE, "<", $file or die "Cannot open $file: $!";
              while (my $comb = <FILE>) {
                  print "$geno\t$comb";
              }
              close FILE;
          }
      }
      close INFILE;

      The second script, which splits the file:

      #!/usr/bin/perl -w
      use strict;
      use warnings;
      use diagnostics;
      use Getopt::Std;

      # read options
      our ($opt_i);
      getopts('i:');
      if (!$opt_i) {
          print STDERR "\nInput file name (-i) required\n\n\n";
      }

      # open the file or die
      open INFILE, "<", $opt_i or die "No such input file $opt_i";

      my %file;
      while (<INFILE>) {
          # remove the newline character
          chomp;
          # split the row into fields on tabs
          my @fh = split ('\t', $_);
          # place the first column (row name) in $geno
          my $geno = shift @fh;
          # join the remaining columns and store them under the row name
          my $remain = join "_", @fh;
          push @{ $file{$geno} }, $remain;
      }
      close INFILE;

      # print the row names
      print "$_\n" for sort keys %file;

        I see what you're trying to do, almost

        How do you decide what filename gets what columns?

        Based on your sample data, here is the header row split up the way you're wanting, going into three files:

        <GSOR> vnir_1 vnir_2 vnir_3
        <GSOR> vnir_4 vnir_5 vnir_6
        <GSOR> vnir_7 vnir_8 vnir_9

        This is the first data row split up and ready to end up in three different files:

        310015 0.37042 0.36909 0.36886
        310015 0.36698 0.36615 0.36449
        310015 0.36404

        What is the filename for the first 3 columns? Second 3 columns? Last "3" columns? And this is repeated for every row in the original data?

Node Type: perlquestion [id://1188487]
Approved by ww