SilverWol has asked for the wisdom of the Perl Monks concerning the following question:

Hello perlmonks! I need some advice. I have this file:
Rank Gene Symbol Definition Clusters Enriched Clusters Interactors Drugs Fold Change Pvalue

1 IL1B interleukin 1 beta 11 10 1 21 1.6227 0.0112

2 PSMD6 proteasome 26S subunit, non-ATPase 6 7 7 10 0 0.6027 0.0300

and I want to write another file containing only the gene names (Gene Symbol).
My code:
#!usr\bin\perl -w
open HUBFILE,"1048_undefined.tsv";
@hub=();
while(my $line = <HUBFILE>){
    if($line=~m/\d \t (\w+) \t \.+/g){
        push(@hub,$1);
    }
}
close HUBFILE;
open OUT,">hubs.txt";
print OUT "HUB:$hub[0]\n";
close OUT;
I'm new to programming and trying to learn Perl for biology.

Replies are listed 'Best First'.
Re: Got some problem with read write file
by choroba (Cardinal) on Sep 03, 2016 at 20:36 UTC
    m/\d \t (\w+) \t \.+/g

    Your input doesn't seem to contain a space between the first digit and the tab following it; similarly, there's no space after the tab, etc. Maybe you wanted to use the /x modifier, too, which ignores unescaped whitespace in the pattern?

    That still wouldn't work, though, as \.+ means "at least one dot", but there's no dot after the tab. Maybe you wanted to use .+ , which would make the regex work, but it's useless: you don't need it at all. Also, there's no point in using the /g modifier, as you only match once (there's an if , not a while ), so just say

    if ($line =~ /\d \t (\w+) \t/x) {
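The effect of the /x modifier can be checked on a couple of rows. This is a minimal sketch, assuming the real file is tab-separated like the hypothetical sample rows below; the variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical sample rows, tab-separated as the real file is assumed to be.
my @lines = (
    "1\tIL1B\tinterleukin 1 beta\t11\t10\t1\t21\t1.6227\t0.0112\n",
    "2\tPSMD6\tproteasome 26S subunit, non-ATPase 6\t7\t7\t10\t0\t0.6027\t0.0300\n",
);

my @genes;
for my $line (@lines) {
    # Without /x the spaces in the pattern are literal, so the input must contain them.
    my $literal = $line =~ /\d \t (\w+) \t/ ? 'match' : 'no match';

    # With /x the spaces in the pattern are ignored: digit, tab, word, tab.
    push @genes, $1 if $line =~ /\d \t (\w+) \t/x;

    print "without /x: $literal\n";
}
print "with /x: @genes\n";
```

On these rows only the /x version captures anything, because the input has no spaces around its tabs.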
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      Thank you choroba for your precise answer, I changed my code to this:
      #!usr\bin\perl -w
      open HUBFILE,"1048_undefined.tsv";
      @hub=();
      while(my $line = <HUBFILE>){
          while ($line =~ /\d \t (\w+) \t/x) {
              push(@hub,$1);
          }
      }
      close HUBFILE;
      $L=@hub;
      open OUT,">hubs.txt";
      for($i=0;$i<$L;$i++){
          print OUT "HUB:$hub[$i]\n";
      }
      close OUT;
      and I guess something is wrong with the second "while"
        Why is the second while there? Do you want to find several occurrences on the same line? Also, you probably don't want to print all the genes found so far after finding a gene; you want to print them once all of them have been found:
        #!/usr/bin/perl
        use warnings;
        use strict;

        open my $HUBFILE, '<', '1048_undefined.tsv' or die $!;
        my @hubs;
        while (my $line = <$HUBFILE>) {
            push @hubs, $1 if $line =~ /\d \t (\w+) \t/x;
        }
        close $HUBFILE;

        open my $OUT, '>', 'hubs.txt' or die $!;
        for my $hub (@hubs) {
            print {$OUT} "HUB:$hub\n";
        }
        close $OUT;

        Notice I modified some other parts of the code, too: I switched to 3-argument open with lexical filehandles, foreach style loop instead of the C-style one, etc.
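The difference between if and while around a match can be sketched as follows. The input line is hypothetical and deliberately holds two digit-tab-word-tab groups, which the real file is not assumed to contain.

```perl
use strict;
use warnings;

# Hypothetical line with two digit-tab-word-tab groups, for illustration only.
my $line = "1\tIL1B\t2\tPSMD6\tend\n";

# if: a single match attempt, which is all a one-gene-per-line file needs.
my @once;
push @once, $1 if $line =~ /\d \t (\w+) \t/x;

# while with /g: each iteration resumes where the previous match ended.
# Dropping /g here would restart at the beginning every time and loop forever.
my @all;
while ($line =~ /\d \t (\w+) \t/gx) {
    push @all, $1;
}

print "if:    @once\n";
print "while: @all\n";
```

A while loop around a match that succeeds and has no /g never terminates, which is the likely reason the second version of the script appeared to hang.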

Re: Got some problem with read write file
by CountZero (Bishop) on Sep 04, 2016 at 09:27 UTC
    Or even shorter, using split to extract the name of the gene:
    use Modern::Perl qw/2015/;
    use autodie;
    open my $IN, '<', 'genes.tsv';
    open my $OUT, '>', 'genesymbol.txt';
    say $OUT ( split "\t" )[1] while (<$IN>);

    split "\t" works very well for this very simple TSV file. However, for any TSV or CSV file that is even slightly more complicated, Text::CSV is highly recommended!
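For comparison, here is a minimal sketch of the same extraction with Text::CSV. It assumes the module is installed from CPAN and reads from an in-memory string instead of a file. The quoted field containing a tab is a hypothetical case, added to show what a plain split would cut in two.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, assumed to be installed

# In-memory stand-in for the TSV file; the quoted field holding a tab is
# a hypothetical case that a plain split "\t" would cut in two.
my $data = "Rank\tGene Symbol\tDefinition\n"
         . "1\tIL1B\t\"interleukin 1\tbeta\"\n"
         . "2\tPSMD6\tproteasome 26S subunit\n";
open my $in, '<', \$data or die $!;

my $csv = Text::CSV->new({ sep_char => "\t", binary => 1, auto_diag => 1 });

my $header = $csv->getline($in);    # read and discard the header row
my @symbols;
while (my $row = $csv->getline($in)) {
    push @symbols, $row->[1];       # second column holds the gene symbol
}
close $in;

print "$_\n" for @symbols;
```

Unlike the one-liner, this version also skips the header row, so the column title does not end up in the output.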


    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Got some problem with read write file
by Laurent_R (Canon) on Sep 04, 2016 at 08:42 UTC
    Hi SilverWol,

    choroba has corrected your errors and solved your issue, but I would suggest a slightly different approach based on the fact that the @hub array isn't really necessary: you can print out the data as soon as you've isolated the gene that needs to be printed. Something like this (untested):

    #!/usr/bin/perl
    use warnings;
    use strict;

    open my $HUBFILE, '<', '1048_undefined.tsv' or die $!;
    open my $OUT, '>', 'hubs.txt' or die $!;
    while (my $line = <$HUBFILE>) {
        # Print each gene as soon as it is matched; lines without a match are skipped.
        # Note: no comma after the filehandle, or print would write to STDOUT instead.
        print {$OUT} "$1\n" if $line =~ /\d \t (\w+) \t/x;
    }
    close $_ for ($HUBFILE, $OUT);
    The code is slightly shorter and probably slightly faster, but the main advantage is that this will work even with a huge input file, while the solution with an intermediate array might fail if the array grows too big for your available memory.

    Update: corrected the name of the addressee at the top of this post (wrong copy and paste, I guess). Thanks to choroba for pointing out the error.

      Thank you! I still have a problem with the code choroba wrote and am trying to find the cause: on my PC it prints only the first gene name...
        I think that choroba's code should work fine (although I haven't tested it). Please show the code that you're using, there must be a slight mistake somewhere.
Re: Got some problem with read write file
by GotToBTru (Prior) on Sep 08, 2016 at 12:41 UTC

    Can you show what the output should look like, and what you're getting? I think that might help.

    Also, if you put your example input inside code tags, we could see what's what. Your two data lines don't have the same number of columns, and I suspect that one or more of the column titles is more than one word.
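One way to check the column layout is to count the tab-separated fields on each line. This is a minimal sketch on hypothetical rows, with the second data row deliberately missing a column. For the real file, the lines would be read from 1048_undefined.tsv instead.

```perl
use strict;
use warnings;

# Hypothetical rows: the last one is missing a column on purpose.
my @lines = (
    "Rank\tGene Symbol\tDefinition\tPvalue\n",
    "1\tIL1B\tinterleukin 1 beta\t0.0112\n",
    "2\tPSMD6\t0.0300\n",
);

my @counts;
for my $line (@lines) {
    chomp $line;
    # Limit -1 keeps trailing empty fields so they are counted too.
    my @fields = split /\t/, $line, -1;
    push @counts, scalar @fields;
    # Show each field in brackets so multi-word titles are visible.
    printf "%d fields: %s\n", scalar @fields, join ' ', map { "[$_]" } @fields;
}
```

Lines whose field count differs from the header's are the ones worth inspecting by hand.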

    But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)