Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

substr is behaving differently with small vs large strings

by wadunn (Initiate)
on Jul 07, 2004 at 20:39 UTC ( [id://372538]=perlquestion: print w/replies, xml ) Need Help??

wadunn has asked for the wisdom of the Perl Monks concerning the following question:

Hello all.

I am still a pretty green Perl user, but have a few functional programs under my belt. I am using Perl to crunch DNA sequences for my lab. We have been manually looking for a group of genes in the African malaria mosquito and want to send off the coordinates that we discovered to the powers that be. I thought that I would write a program that took a giant sequence of DNA (called a contig) and extracted the genes from it based on the coordinates that we found in order to double check each set of locations before we sent them off.

For those non biologists, the search file is basically a large text file and I want my program to go so many letters from the start, take a given number of letters after that and place them in a variable (think of this as a word). Then go further downstream and pick another word, and then one more. Then I concatenate the words into a sentence (the final gene). I was not anticipating a difficult program.

I thought I would use ‘substr’ because its seems tailor made for this. When I tried it on a test file (660 letters) using three calls to substr and placing the three ‘words’ into separate variables, everything behaved beautifully.

Problems arose when I went to the actual search file (2,082,241 letters). With the exact same script my extracted substrings were being pulled from the sequence AHEAD of the target sequence as confirmed by my manual location of the coordinates that the program was given. Am I missing something about the behavior of substr that it would act differently on a VERY large string?

I know that gigantic strings are memory blackholes but at my current level of expertise I don’t know how to avoid loading the whole file. I don’t care how slow it runs, I can run the thing overnight (eventually will be unleashed on about 150 genes); it just needs to work. I know that your job is not to look over idiot programmers’ code but I included mine incase it helps you see what I am trying to do or spot any mistake I may have over looked.

Thank you so much and I will be praying to the perl gods until your comments come.

Augustine

#!/usr/bin/perl -w use warnings; use strict; use diagnostics; $/ = "//"; my $input = 'CUTR;AAABtest;plus;185-190;438-440;576-579'; #two sets of "instructions" I can interchange in $input for testing #'CUTR25;AAAB01008851;plus;764935-764946;765050-765289;765372-76 +5659' #'CUTR;AAABtest;plus;185-190;438-440;576-579' my @input = split (/;/, $input); my $name = $input[0]; my $contig = $input[1]; my $ori = $input[2]; my @E1 = split (/-/, $input[3]); my @E2 = split (/-/, $input[4]); my @E3 = split (/-/, $input[5]); print "$name\n$contig\n$ori\n@E1\n@E2\n@E3\n\n"; #get and edit contig so it can be searched and used my $contigfile = "$contig.txt"; open(CONTIG, $contigfile) or die (">Could not open $contigfile!\n$!" +); my $rawseq = <CONTIG>; my $goodseq = editcontig($rawseq); #clear giant variable to free some memory???? $rawseq = ''; #Construct exon 1 and save to $EXON1 my $offset1 = $E1[0]-1; my $length1 = $E1[1]-$offset1; my $EXON1 = ''; $EXON1 = substr($goodseq, $offset1, $length1); #Construct exon 2 my $offset2 = $E2[0]-1; my $length2 = $E2[1]-$offset2; my $EXON2 = ''; $EXON2 = substr($goodseq, $offset2, $length2); #Construct exon3 my $offset3 = $E3[0]-1; my $length3 = $E3[1]-$offset3; my $EXON3 = ''; $EXON3 = substr($goodseq, $offset3, $length3); print "$EXON1\n\n$EXON2\n\n$EXON3"; exit; ########### SUBS ############ sub editcontig { my ($rawseq) = @_; $rawseq =~ s/\s//g; $rawseq =~ s/0//g; $rawseq =~ s/1//g; $rawseq =~ s/2//g; $rawseq =~ s/3//g; $rawseq =~ s/4//g; $rawseq =~ s/5//g; $rawseq =~ s/6//g; $rawseq =~ s/7//g; $rawseq =~ s/8//g; $rawseq =~ s/9//g; return $rawseq; }
This is my test data that works:
01 aagaagaagf agaagaagag agagaacttg gaaacagaat tgtagaacag atttatagac 61 agagcaaagt acaattccgt ttcagacaag taacaagagt gaaatagata taagactcag 121 agaaagtaaa aagcgtcaag taaggtacag tgaggagaaa gtaatgagtg tgtatgttca 181 atcgBROOKS tacccttgcg tatctcctcc ttcctcagcc accaacagtt catactcctg 241 cgcagcgtat cgatcccagt cagtaagcta aaaggtcaaa cacatgttat tagatggcat 301 ttcgatagaa tgtcatcgtg ctattcttac atcgtcgttt tcgcgctgtg caaatggatc 361 acgctgctcg tgctccatat acttctccag attgaagaac gtgtcgaaga aaatgggcgt 421 cattttgcat cgctttaWAS ggaccaacgt tacacagcca ggagttgctg gttttatcat 481 atcaagcatc tggaagcaaa cgagtactcc ttagcatatc gaaccaactc catctaccac 541 tccccaaaca aacacatact tgacacaggc agtctHEREa gggtagggtt tcgataccga 601 gcgattccat gcgatgctgc tgctcctcgt agaagtactc cagctcgtac atcgagagga

Replies are listed 'Best First'.
Re: substr is behaving differently with small vs large strings
by BrowserUk (Patriarch) on Jul 07, 2004 at 21:22 UTC

    Your code (and substr) works fine. I just generated a 2.6 MB test file, put START and END at the appropriate places and the results are exactly what you want.

    That leaves a couple fo possibilities.

    1. Either your method of checking is wrong.
    2. The large file is not quite in the format your expecting.

    BTW. You could replace your editcontig() sub with

    my $contigfile = "$contig.txt"; open(CONTIG, $contigfile) or die (">Could not open $contigfile!\n$!"); my $goodseq = <CONTIG>; $goodseq =~ tr[0-9\n ][]d; #clear giant variable to free some memory???? # $rawseq = '';

    This will reduce the memory required by your program from 35 MB to around 6MB.

    It will also speed it up slightly, but if that offends you, you can always add a sleep 10; :)


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algoritm, algorithm on the code side." - tachyon
Re: substr is behaving differently with small vs large strings
by ccn (Vicar) on Jul 07, 2004 at 21:02 UTC
    It seems that your code works correctly. May be something wrong with data? Anyway I can suggest to improve you code, sub editcontig can be eliminated:
    my $rawseq = <CONTIG>; # my $goodseq = editcontig($rawseq); $rawseq =~ tr/ \t\n0-9//d; # now use $rawseq instead of $goodseq
Re: substr is behaving differently with small vs large strings
by qq (Hermit) on Jul 07, 2004 at 21:37 UTC

    I don't think substr is your problem:

    #!/usr/bin/perl my $pad = 100_000_000; my $s = 'a' x $pad; $s .= 'qq was here'; $s .= 'b' x $pad; print substr( $s, $pad, 10 ), "\n"; # qq was here

    If you save the return from editcontig to a file, and find the character position that should be the start of the substr, is it actually off?

    Its more likely that editconfig is stripping out more or less than it should. Perhaps there is some other character in your longer file? BTW, you can condense editconfig to something like:

    sub editcontig { my $rawseq = shift; $rawseq =~ s/\s+//g; $rawseq =~ s/[0-9]//g; return $rawseq; }

    and you could validate the output by adding:

    if ( $goodseq =~ /([^actg]+)/ ) { die 'found unexpected character $1"; }

    qq (who knows nothing about genes)

Re: substr is behaving differently with small vs large strings
by Belgarion (Chaplain) on Jul 07, 2004 at 21:03 UTC

    Since you're opening a file anyway, you might want to consider using the seek and read methods on your filehandle. This will allow you to position the read pointer anywhere you like within the file and read in exactly the number of bytes you require. Also, you will not need to read the entire file into memory before doing the string manipulation.

    Update: The above suggestion would still work, but you would need to take into account the formatted nature of your input file. To use the seek method correctly, you would need to take into account the spaces between segments and the leading numbers prefixing each string (and likely the newline which terminates each string.) Using seek would still work, but depending on the format of the full input file, it might not be feasible.

    Update II: (Thinking some more...) I have a question regarding the input file and the offset. Is the offset from the textual beginning of the file, or from the beginning of the DNA sequence? It would appear that it's from the beginning of the DNA sequence, but I'm not going to assume here.

      The offset appears to make sense in the trimmed file. That is, excluding the spaces between the columns and the line numbers. Which suggests that Augustine might aviod slurping the file, by reading one line at a time (by leaving $/ unchanged) and windowing every input sequence (exon) (a sequence of lines which covers the exon).

      You could also use `seek' and `sysread' (as suggested by Belgarion, but you'd have to work out an algorithm (quite simple I would guess) which converts the address within the trimmed file to the address within the formatted file. Something like... $address_in_file = $address_in_trimmed_file + $number_of_characters_for_line_numbering_for_first_line + ($address_in_trimmed_file / 10) + ($address_in_trimmed_file / 60) * ($one_character_for_new_line + $number_of_characters_for_line_numbering). Actually, I think that is a correct and yet unabusive way to solve this (every 10 characters you get a space, every 60 characters you get a new line and a line number).

      On another note, though, I have experience with slurping large files, and can say that perl handles it pretty good. For example, I could semi-parse an 8Mb Xml file into an internal tree within seconds. I'm afraid I cannot say how substring handles large offset/length parameters, I thought it would work as expected.
      Belgarion,

      Thanks for your input. As I said I am pretty a green Perl man. I am not familiar with how to decide how many bytes converts into a given number of charaters so I would just assume stay away from that.

      As for the input file, the textual begining of the file and the start of the DNA should be the same due to the editcontig sub or which ever of the new methods you guys suggested eating the numbers and the white space.

      Thanks again, for your time I can tell this is where i need to ask questions from now on. The other boards i frequent are not 1/8 as helpful as you guys.

      -Augustine

Re: substr is behaving differently with small vs large strings
by TrekNoid (Pilgrim) on Jul 07, 2004 at 21:08 UTC
    Looks to me like the problem is that your editcontig function changes the length of the string...

    $rawseq =~ s/9//g;
    reduces the length of the string by 1 character... Maybe you should substitute a space or something?

    Trek

Re: substr is behaving differently with small vs large strings
by husker (Chaplain) on Jul 07, 2004 at 21:16 UTC
    I really like these bioinformatics problems. :) They look deceptively simple, complicated only by the sheer size of the dataset. You are not the only "gene hunter" here at the Monastery.

    As you are a green Perl hacker, I recommend two things:

    1) Hang out here a lot. Share in the glow of our collective wisdom. Do a Super Search here in the Monastery for "DNA" (or even a short sequence like "agtc"!)and see what others who have gone before you have learned.

    2) Get yourself a good Perl bioinformatics book. A quick search on Amazon yielded about a dozen or so. I have no idea which are worthwhile and which are junk, but I'm sure you are in a better position to judge. And, if in doubt, the O'Reilly books are usually a safe bet.

Re: substr is behaving differently with small vs large strings
by wadunn (Initiate) on Jul 08, 2004 at 15:07 UTC
    THANK YOU ALL!

    I am happy to say that the problem I had been having is now fixed. I am, however, embarrassed by what was causing the problem. I was careless when my boss asked me to find a region of the contig for her by hand and actually entered notes in the sequence file its self. Being that the file is 551 pages of pure text, I did not remember/notice the additions when I looked at the result of the editcontig() sub to look for problems.

    Thanks again to all of you for your help and reminding me to verify anything that I edit before taking for granted that all went as planned.

    I have been converted, evangelized even to a believer in the PerlMonk society and will most assuredly be bugging you nice people in the future!

    one last thank you and I will end... THANK YOU!

    -Augustine

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://372538]
Approved by Paladin
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others having an uproarious good time at the Monastery: (5)
As of 2024-03-29 12:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found