Re^2: modification of the script to consume less memory with higher speed

by Anonymous Monk
on Jul 30, 2016 at 05:00 UTC


in reply to Re: modification of the script to consume less memory with higher speed
in thread modification of the script to consume less memory with higher speed

I have multiple FASTQ files in the following format, giving the reads and the number of times each read occurs in a file, separated by a tab:
data1.txt:

@NS500278 AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC + =CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt
@NS500278 CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt
@NS500278 TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt

data2.txt:

@NS500278 AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC + AAAAA#EEEEEEEEEEEEEEEE6EEEEEAEEEAE/AEEEEEEEAE<EEEEA</AE<EE 1 :data2.txt
@NS500278 CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC + AAAAA#E/<EEEEEEEEEEAEEEEEEEEA/EAAEEEEEEEEEEEE/EEEE/A6<E<EEE 2 :data2.txt
@NS500278 TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG + AAAAA#EEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEAE6AE<EAEEAE 2 :data2.txt
I want to sum the occurrences of a read across all the files when the second line (the sequence) of each read matches, i.e. the output for the above two files should look like this:
@NS500278 AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC + =CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt 1 :data2.txt count:2
@NS500278 CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt 2 :data2.txt count:5
@NS500278 TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt 2 :data2.txt count:4
My code works with files up to 10GB, but if the files exceed this size it hangs. I want my script to handle files of any size. Any help will be appreciated.

Replies are listed 'Best First'.
Re^3: modification of the script to consume less memory with higher speed
by BrowserUk (Patriarch) on Jul 30, 2016 at 05:13 UTC
    My code works with files up to 10GB, but if the files exceed this size it hangs.

    Is that 10GB all the files together, or just one of the files?

    How much memory does your machine have?


      Yes, that 10GB is all the files put together. I want my script to consume less memory, because we need to deal with large data files.
Re^3: modification of the script to consume less memory with higher speed
by ablanke (Monsignor) on Jul 30, 2016 at 13:24 UTC
    Hi,

    since scalability is one of your top priorities, consider using a key-value database such as Redis, MemcacheDB, or similar.
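
    For example, here is a minimal sketch using the Redis CPAN module. It assumes a Redis server running on the default localhost:6379 and the record layout from your sample (sequence as the second whitespace-separated field, count as the second-to-last); the tallies then live in the database server, which can even run on another machine, instead of in the Perl process:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Redis;    # CPAN module; assumes a server on localhost:6379

        my $redis = Redis->new;

        # Tally counts keyed on the sequence.
        while ( my $line = <> ) {
            my @f = split ' ', $line;
            my ( $seq, $count ) = ( $f[1], $f[-2] );    # field layout assumed
            $redis->incrby( "count:$seq", $count );
        }

        # Report the totals.
        for my $key ( $redis->keys('count:*') ) {
            my ($seq) = $key =~ /^count:(.+)/;
            print "$seq count:", $redis->get($key), "\n";
        }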

Re^3: modification of the script to consume less memory with higher speed
by Anonymous Monk on Jul 30, 2016 at 05:50 UTC

    You appear to keep the first record that is seen, in full, while subsequent matching records are only tallied by their count. Is that right?

    Now, the remaining question is: do you want the output records to keep the order in which they are processed, or is it acceptable if they appear in random order?

    If any output order will do, then the simplest way to process your job is to divide it up into parts. For example, you can dump the records into temporary files according to the first few letters of the key. Let's say the intermediate files are TACA.tmp, CATT.tmp, AGAT.tmp, etc. After that, process each temporary file individually, appending the output to the final result. Questions?
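
    A minimal sketch of that partitioning step, assuming one record per line with the sequence as the second whitespace-separated field (adjust the split if your real layout differs):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Usage: perl partition.pl data1.txt data2.txt ...
        # Appends each record to a bucket file named after the first
        # four letters of its sequence, e.g. AGAT.tmp, CATT.tmp.
        my %fh;    # one output handle per bucket (at most 5**4 over ACGTN)

        while ( my $line = <> ) {
            my ( $id, $seq ) = split ' ', $line;    # sequence is field 2
            my $bucket = substr( $seq, 0, 4 ) . '.tmp';
            unless ( $fh{$bucket} ) {
                open $fh{$bucket}, '>>', $bucket or die "open $bucket: $!";
            }
            print { $fh{$bucket} } $line;
        }
        close $_ for values %fh;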

      I am sorry, but I am unable to follow your suggestion, as I am a beginner in Perl. It would be helpful if you could explain it with an example or a modification of my script, if possible. Output records in random order are acceptable, but the complete second line must match across all files, with the count given accordingly.

        PerlMonks is not a code-writing service.

        The script you have is fine as it is (if it works); what you need is another script that first divides the job so that it becomes manageable. Keys with differing beginnings can never match, so partitioning the records by their first few letters effectively breaks the one big job into many smaller ones.
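
        To make the second stage concrete, here is a sketch that tallies one bucket file at a time; the hash then only ever holds keys sharing a single prefix, so memory use is bounded by the largest bucket rather than the whole data set. The field layout is assumed from your sample, so adjust as needed:

            #!/usr/bin/perl
            use strict;
            use warnings;

            # Usage (run once per bucket): perl tally.pl AGAT.tmp >> result.txt
            my ( %first, %sources, %total );

            while ( my $line = <> ) {
                chomp $line;
                my @f   = split ' ', $line;
                my $seq = $f[1];                        # sequence is field 2
                my ( $count, $file ) = @f[ -2, -1 ];    # trailing "N :fileX.txt"
                $first{$seq} //= join ' ', @f[ 0 .. 3 ];  # keep first record seen
                $sources{$seq} .= " $count $file";
                $total{$seq}   += $count;
            }

            print "$first{$_}$sources{$_} count:$total{$_}\n" for keys %total;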

        What part of the suggestion are you struggling with? For starters, you could try to work out a script that reads the records and dumps them on the screen, together with a note saying "this record must go in that file"; a sketch follows below. The problem you have is a good learning opportunity, as it can readily be broken down into smaller sub-tasks that a beginner can handle.
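
        That first sub-task might look like this (nothing is written yet; the script only announces where each record would go, with the same assumed field layout as above):

            #!/usr/bin/perl
            use strict;
            use warnings;

            while ( my $line = <> ) {
                my ( $id, $seq ) = split ' ', $line;    # sequence is field 2
                my $bucket = substr( $seq, 0, 4 ) . '.tmp';
                print "this record must go in $bucket: $line";
            }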
