Is this an ASCII file, or does it use a multi-byte character encoding? A PC that is "too slow" is not a likely cause; some other issue is afoot here, possibly a Unicode one. Can you cut this down to a simple pair of examples, a) this works and b) this doesn't work, without huge files? The actual code would also be VERY useful. | [reply] |
I don't know much about file formats, but the input file I am using is a FASTA file, which stores DNA sequences. I am a beginner and doing this as a grad school project, so this is pretty much the actual code and there isn't much else to it. The regular expression is fine: it gives the desired results when I use it on a test file with a few lines, but it doesn't work on larger files.
To give more context on the actual problem: the 10 random characters are random barcodes flanked by a specific sequence (the abc and def in my example code). Once I get the 5 characters (i.e. DNA bases) before and after this fragment, I will use them to figure out which gene the random barcode was inserted into. In this way I will have each gene associated with a unique barcode.
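For anyone following along, here is a minimal sketch of the extraction described above, written in Python rather than the original code (which is not shown in this post). The flanks `abc` and `def` are the placeholders from the description, and the names `read_fasta`, `extract_barcodes` and `PATTERN` are made up for illustration. It also assumes the bases are upper-case A, C, G, T or N. One thing it guards against, as an assumption about why a large file might behave differently from a small test file: FASTA files often wrap a sequence over many lines, so a match that straddles a line break is missed unless the lines of each record are joined first.

```python
import re

# Hypothetical pattern: 5 bases, flank "abc", a 10-base barcode, flank "def", 5 bases.
# "abc" and "def" are placeholders for the real flanking sequences.
PATTERN = re.compile(r"([ACGTN]{5})abc([ACGTN]{10})def([ACGTN]{5})")


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        else:
            parts.append(line)
    # Emit the final record (lines before any header are ignored).
    if header is not None:
        yield header, "".join(parts)


def extract_barcodes(sequence):
    """Return (before, barcode, after) for each non-overlapping match."""
    return [m.groups() for m in PATTERN.finditer(sequence)]
```

Usage would look something like `for header, seq in read_fasta(open("reads.fasta")): print(header, extract_barcodes(seq))`, where `reads.fasta` is a stand-in file name.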
| [reply] |
Try it on a 100 KB file, just to see whether it is simply taking too long.
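One way to get such a file is to copy the first 100 KB of the big one. This is only a sketch in Python, and the function name `write_head` and the file names in the usage note are invented for illustration. It trims the copy back to the last complete line so the test file does not end in the middle of a sequence line.

```python
def write_head(src_path, dst_path, max_bytes=100 * 1024):
    """Copy roughly the first max_bytes of src_path to dst_path.

    Returns the number of bytes written.
    """
    with open(src_path, "rb") as src:
        chunk = src.read(max_bytes)
    # If the read stopped at the size limit, it may have cut a line in half,
    # so trim back to the last newline. A shorter file is copied unchanged.
    if len(chunk) == max_bytes:
        cut = chunk.rfind(b"\n")
        if cut != -1:
            chunk = chunk[:cut + 1]
    with open(dst_path, "wb") as dst:
        dst.write(chunk)
    return len(chunk)
```

For example, `write_head("big.fasta", "small.fasta")` followed by running the script on `small.fasta` and timing it should show whether the run time grows with file size or whether something else is going wrong.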
| [reply] |