PerlMonks  

Re^2: Regular expressions across multiple lines

by abcd (Novice)
on Apr 24, 2016 at 17:05 UTC ( [id://1161367]=note )


in reply to Re: Regular expressions across multiple lines
in thread Regular expressions across multiple lines

I tried it on a small file with only a few lines and it worked perfectly. With the large file it just does nothing. I tried it with a 10 MB file and it still doesn't work. I also tried writing the chomped text out to a .txt file. When I open that file in a text editor it shows weird overlapping text (like some sort of graphical problem). The only thing I can think of is that my PC is too slow and the process hangs or something. But if this is the only way to do it, I will try it on my work PC.

Replies are listed 'Best First'.
Re^3: Regular expressions across multiple lines
by Marshall (Canon) on Apr 24, 2016 at 17:11 UTC
    Is this a plain ASCII file, or does it use a multi-byte character encoding? A PC that is "too slow" is unlikely to be the cause; something else is going on here, possibly a Unicode issue. Can you reduce this to a simple pair of examples, (a) one that works and (b) one that doesn't, without huge files? Posting the actual code would also be VERY useful.
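A rough way to answer the encoding question is to look at the raw bytes at the start of the file. The following is only a sketch: the sub name `byte_census` and the 64 KB sample size are made up for illustration, and the counts are hints, not a verdict on what is wrong with the file.

```perl
use strict;
use warnings;

# Count suspicious byte classes in a chunk of raw data.
# byte_census is a hypothetical helper name, not part of any module.
sub byte_census {
    my ($bytes) = @_;
    my %count = (
        cr        => ($bytes =~ tr/\r//),         # carriage returns (CRLF or CR-only line endings)
        nul       => ($bytes =~ tr/\0//),         # NUL bytes, common in UTF-16 text
        non_ascii => ($bytes =~ tr/\x80-\xFF//),  # bytes outside 7-bit ASCII
    );
    # A byte order mark at the very start suggests UTF-8 or UTF-16.
    $count{bom} = $bytes =~ /\A(?:\xEF\xBB\xBF|\xFF\xFE|\xFE\xFF)/ ? 1 : 0;
    return \%count;
}

# Usage: perl census.pl input.fasta
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    local $/ = \65536;            # read a fixed-size 64 KB sample of raw bytes
    my $chunk = <$fh> // '';
    close $fh;
    my $c = byte_census($chunk);
    printf "%-10s %d\n", $_, $c->{$_} for sort keys %$c;
}
```

If everything except `cr` is zero, the file is probably plain ASCII with Windows-style line endings; a non-zero `nul` or `bom` count would point toward a Unicode encoding instead.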
      I don't know much about file formats, but the input file I am using is a FASTA file, which stores DNA sequences. I am a beginner and am doing this as a grad school project, so this is pretty much the actual code and there isn't much else to it. The regular expression is fine, as it gives the desired results when I use it on a test file with a few lines, but it doesn't work on larger files.

      To give more context on the actual problem: the 10 random characters are random barcodes flanked by a specific sequence (the abc and def in my example code). Once I get the 5 characters (i.e. DNA bases) before and after this fragment, I will use them to figure out which gene the random barcode inserted into. In this way I will have each gene associated with a unique barcode.
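For readers following along, the task described here can be sketched roughly as below. This is not the poster's actual code: the sub name `find_barcodes` is hypothetical, `abc` and `def` are the placeholder flanks from the description, and the sketch assumes the whole sequence fits in memory once line breaks are removed.

```perl
use strict;
use warnings;

# Return [left5, barcode, right5] for every barcode found in a sequence.
# $left and $right are the fixed flanking sequences (placeholders here).
sub find_barcodes {
    my ($seq, $left, $right) = @_;
    $seq =~ s/\s+//g;    # drop newlines, carriage returns and other whitespace
    my @hits;
    # 5 bases, left flank, 10-base barcode, right flank, 5 bases
    while ($seq =~ /(.{5})\Q$left\E(.{10})\Q$right\E(.{5})/g) {
        push @hits, [$1, $2, $3];
    }
    return @hits;
}

# A barcode that is split across two lines in the input
my @hits = find_barcodes("TTTTTabcAC\nGTACGTACdefGGGGG\n", 'abc', 'def');
print join("\t", @$_), "\n" for @hits;
```

Removing all whitespace with `s/\s+//g`, instead of relying on chomp() alone, also strips any stray carriage returns. Note that `/g` resumes after the previous match, so two barcodes whose 5-base flanks overlap would not both be reported by this sketch.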
        I looked at the FASTA format and it is ASCII; however, there could be some other issue with the program that generated this file. Can you open the original file in a text editor, e.g. WordPad, and check whether the characters are displayed properly? chomp() should not affect this. "I see bizarre characters in the text editor" sounds like a big clue to me that the format is wrong, and perhaps your small example works because it is plain ASCII?

        Update: there are a bunch of modules for working with the FASTA format used in bioinformatics. Search CPAN for "FASTA". But this sounds easy enough to figure out without a module.
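As an illustration of the "without a module" route, here is a minimal sketch that reads a FASTA file one record at a time by changing the input record separator. The sub name `read_fasta` is hypothetical, and the sketch assumes every record starts with `>` at the beginning of a line.

```perl
use strict;
use warnings;

# Read FASTA records from a filehandle; returns a list of [header, sequence].
sub read_fasta {
    my ($fh) = @_;
    local $/ = "\n>";          # each read now returns one whole record
    my @records;
    while (my $rec = <$fh>) {
        chomp $rec;            # removes the trailing "\n>" separator, if present
        $rec =~ s/\A>//;       # the first record still carries its leading ">"
        $rec =~ tr/\r//d;      # tolerate Windows-style line endings
        my ($header, @lines) = split /\n/, $rec;
        next unless defined $header;
        push @records, [$header, join('', @lines)];
    }
    return @records;
}

# Demonstration on an in-memory file
my $demo = ">seq1 test\nACGT\nTTGA\n>seq2\nGGCC\n";
open my $fh, '<', \$demo or die "Cannot open in-memory file: $!";
print "$_->[0]: $_->[1]\n" for read_fasta($fh);
close $fh;
```

For a very large file it would be better to process each record inside the loop instead of collecting them all in an array, so that only one sequence is held in memory at a time.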

Re^3: Regular expressions across multiple lines
by Anonymous Monk on Apr 24, 2016 at 18:00 UTC

    Try it on a 100 KB file, just to see whether it is simply taking too long.

Re^3: Regular expressions across multiple lines
by Anonymous Monk on Apr 24, 2016 at 18:19 UTC

    At what length of file does it stop working?
