Re^2: modification of the script to consume less memory with higher speed

by Anonymous Monk
on Jul 30, 2016 at 05:00 UTC


in reply to Re: modification of the script to consume less memory with higher speed
in thread modification of the script to consume less memory with higher speed

I have multiple FASTQ files in the following format, giving the reads and the number of times each read occurs in a file, separated by a tab:
data1.txt:

@NS500278 AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC + =CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt
@NS500278 CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt
@NS500278 TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt

data2.txt:

@NS500278 AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC + AAAAA#EEEEEEEEEEEEEEEE6EEEEEAEEEAE/AEEEEEEEAE<EEEEA</AE<EE 1 :data2.txt
@NS500278 CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC + AAAAA#E/<EEEEEEEEEEAEEEEEEEEA/EAAEEEEEEEEEEEE/EEEE/A6<E<EEE 2 :data2.txt
@NS500278 TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG + AAAAA#EEEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEE/EEEAE6AE<EAEEAE 2 :data2.txt
I want to sum the occurrences of a read across all the files when the second line (the sequence) of each read matches, i.e. the output for the above two files should look like this:
@NS500278 AGATCNGAAGAGCACACGTCTGAACTCCAGTCACAACGTGATATCTCGTATGCCGTCTTC + =CCGGGCGGG1GGJJCGJJCJJJCJJGGGJJGJGJJJCG8JGJJJJ1JGG8=JGCJGG$G 1 :data1.txt 1 :data2.txt count:2
@NS500278 CATTGNACCAAATGTAATCAGCTTTTTTCGTCGTCATTTTTCTTCCTTTTGCGCTCAGGC + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJ>JJJGGG8$CGJJGGCJ8JJ 3 :data1.txt 2 :data2.txt count:5
@NS500278 TACAGNGAGCAAACTGAAATGAAAAAGAAATTAATCAGCGGACTGTTTCTGATGTTATGG + CCCGGGGGGGGGGJGJJJJJJJJJJJJJGJG$JJJJ$GGJJJJJGGG8$CGJJGGCJ8JJ 2 :data1.txt 2 :data2.txt count:4
My code works with files up to 10GB, but if the files exceed this size it hangs. I want my script to handle files of any size. Any help will be appreciated.

Replies are listed 'Best First'.
Re^3: modification of the script to consume less memory with higher speed
by BrowserUk (Patriarch) on Jul 30, 2016 at 05:13 UTC
    My code works with files up to 10GB, but if the files exceed this size it hangs.

    Is that 10GB all the files together, or just one of the files?

    How much memory does your machine have?


      Yes, that 10GB is all the files put together. I want my script to consume less memory, because we need to deal with large data files.
Re^3: modification of the script to consume less memory with higher speed
by ablanke (Monsignor) on Jul 30, 2016 at 13:24 UTC
    Hi,

    since scalability is one of your top priorities, consider using a key-value database such as Redis, MemcacheDB, or similar.
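
    For example, here is a minimal sketch using the Redis CPAN module. It assumes a Redis server running on the default localhost:6379 and the record layout from your sample (sequence as the second whitespace-separated field, count as the second-to-last); the tallies then live in the database server, which can even run on another machine, instead of in the Perl process:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Redis;    # CPAN module; assumes a server on localhost:6379

        my $redis = Redis->new;

        # Tally counts keyed on the sequence.
        while ( my $line = <> ) {
            my @f = split ' ', $line;
            my ( $seq, $count ) = ( $f[1], $f[-2] );    # field layout assumed
            $redis->incrby( "count:$seq", $count );
        }

        # Report the totals.
        for my $key ( $redis->keys('count:*') ) {
            my ($seq) = $key =~ /^count:(.+)/;
            print "$seq count:", $redis->get($key), "\n";
        }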

Re^3: modification of the script to consume less memory with higher speed
by Anonymous Monk on Jul 30, 2016 at 05:50 UTC

    You appear to keep the first record that is seen, in full, while subsequent matching records are only tallied by their count. Is that right?

    Now, the remaining question is: do you want the output records to keep the order in which they are processed, or is it acceptable if they appear in random order?

    If any output order will do, then the simplest way to process your job is to divide it up into parts. For example, you can dump the records into temporary files according to the first few letters of the key. Let's say the intermediate files are TACA.tmp, CATT.tmp, AGAT.tmp, etc. After that, process each temporary file individually, appending the output to the final result. Questions?
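
    A minimal sketch of that partitioning step, assuming one record per line with the sequence as the second whitespace-separated field (adjust the split if your real layout differs):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Usage: perl partition.pl data1.txt data2.txt ...
        # Appends each record to a bucket file named after the first
        # four letters of its sequence, e.g. AGAT.tmp, CATT.tmp.
        my %fh;    # one output handle per bucket (at most 5**4 over ACGTN)

        while ( my $line = <> ) {
            my ( $id, $seq ) = split ' ', $line;    # sequence is field 2
            my $bucket = substr( $seq, 0, 4 ) . '.tmp';
            unless ( $fh{$bucket} ) {
                open $fh{$bucket}, '>>', $bucket or die "open $bucket: $!";
            }
            print { $fh{$bucket} } $line;
        }
        close $_ for values %fh;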

      I am sorry, but I am unable to follow your suggestion, as I am a beginner in Perl. It would be helpful if you could explain it with an example or a modification of my script, if possible. Output records in random order are acceptable, but the complete second line must match across all files, with the count given accordingly.

        PerlMonks is not a code-writing service.

        The script you have is fine as it is (if it works); what you need is another script that first divides the job so that it becomes manageable. Keys with differing beginnings can never match, so partitioning the records by their first few letters effectively breaks the one big job into many smaller ones.
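
        To make the second stage concrete, here is a sketch that tallies one bucket file at a time; the hash then only ever holds keys sharing a single prefix, so memory use is bounded by the largest bucket rather than the whole data set. The field layout is assumed from your sample, so adjust as needed:

            #!/usr/bin/perl
            use strict;
            use warnings;

            # Usage (run once per bucket): perl tally.pl AGAT.tmp >> result.txt
            my ( %first, %sources, %total );

            while ( my $line = <> ) {
                chomp $line;
                my @f   = split ' ', $line;
                my $seq = $f[1];                        # sequence is field 2
                my ( $count, $file ) = @f[ -2, -1 ];    # trailing "N :fileX.txt"
                $first{$seq} //= join ' ', @f[ 0 .. 3 ];  # keep first record seen
                $sources{$seq} .= " $count $file";
                $total{$seq}   += $count;
            }

            print "$first{$_}$sources{$_} count:$total{$_}\n" for keys %total;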

        What part of the suggestion are you struggling with? For starters, you could try to work out a script that reads the records and dumps them on the screen, together with a note saying "this record must go in that file"; a sketch follows below. The problem you have is a good learning opportunity, as it can readily be broken down into smaller sub-tasks that a beginner can handle.
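
        That first sub-task might look like this (nothing is written yet; the script only announces where each record would go, with the same assumed field layout as above):

            #!/usr/bin/perl
            use strict;
            use warnings;

            while ( my $line = <> ) {
                my ( $id, $seq ) = split ' ', $line;    # sequence is field 2
                my $bucket = substr( $seq, 0, 4 ) . '.tmp';
                print "this record must go in $bucket: $line";
            }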
