http://qs321.pair.com?node_id=1161365

abcd has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I am a beginner trying to use a regular expression to search for a pattern. The problem is that the pattern is split across multiple lines. For example, if I am searching for 'abc' in a file containing the following:


abcdefab
cdefa
bcdef

I should get 3 matches.

The first thing I tried was to write a while loop that reads the file line by line, chomps each line, and appends everything to a single variable, which I then search. This works well on a small sample file, but for some reason it doesn't work with my actual text file, which is several hundred MB, so maybe it is not the most efficient way of doing this.

This is my code:

    while ($line = <inputfile>) { chomp $line; $string = $string . $line; }

A second option, according to Google, is to use the /m or /s modifiers. But the problem is that I don't know where the text will be broken by a newline, so should I put a . after every character in my regular expression? My actual expression is pretty long, so I don't know whether that would be the best way to do it.

My regular expression searches for a keyword and captures the 5 characters before and after it; the keyword also contains a random 10-character sequence in the middle:

    $string =~ /(?<=(.....))abc(.{10})def(?=(.....))/g
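For reference, this is what I expect the pattern to do when everything is on a single line (the sample string here is just made up for illustration):

    my $string = "XXXXXabc0123456789defYYYYYzz";
    while ( $string =~ /(?<=(.....))abc(.{10})def(?=(.....))/g ) {
        print "before=$1 middle=$2 after=$3\n";   # before=XXXXX middle=0123456789 after=YYYYY
    }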

I am assuming that there is something obvious I am missing here so would appreciate any help.

Thanks

Replies are listed 'Best First'.
Re: Regular expressions across multiple lines
by AnomalousMonk (Archbishop) on Apr 24, 2016 at 19:51 UTC
    I tried to output the chomped text to a txt file. When I open that text file in a text editor it shows weird overlapping text (like some sort of graphical problem).   [from this; emphasis added]

    afoken has already covered the cross-system line-end mismatch and chomp problems pretty well. Here's a further example:

        c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
        "dd $/;
        ;;
        my $s = qq{abc\r\n};
        chomp $s;
        dd $s;
        ;;
        my @lines = (qq{Overlapping\r\n}, qq{Strange\r\n}, qq{Text\r\n});
        dd \@lines;
        ;;
        chomp @lines;
        dd \@lines;
        ;;
        my $line = join '', @lines;
        dd $line;
        ;;
        print qq{$line};
        "
        "\n"
        "abc\r"
        ["Overlapping\r\n", "Strange\r\n", "Text\r\n"]
        ["Overlapping\r", "Strange\r", "Text\r"]
        "Overlapping\rStrange\rText\r"
        Textngeping

    The other thing that occurs to me is that your original file may have "invisible" whitespace characters other than the  \r \n line-enders: spaces, tabs, etc. If these exist at the end of a line, they will cause a problem. Try something like (untested):
        open my $filehandle, '<', ... or die "...: $!";
        ...
        my $content = do { local $/;  <$filehandle>; };
        $content =~ tr{\x20\t\f\r\n}{}d;

    In general, processing a file of a few hundred megabytes entirely held in memory should not be a big problem, depending on what you're doing, and your basic approach (insofar as I can understand what it is) looks OK to me.

    And, of course, the golden rule: Know Your Data!

    Update: BTW:  my $content = do { local $/;  <$filehandle>; }; is the "file slurp" idiom; it reads the entire file into a single string.


    Give a man a fish:  <%-{-{-{-<

Re: Regular expressions across multiple lines
by graff (Chancellor) on Apr 24, 2016 at 21:29 UTC
    I think your regex could be simpler. Apart from that, given what you've said about the task and the data, I'd take a stack approach to handling the input -- something like this:
        #!/usr/bin/perl
        use strict;
        use warnings;

        my @stack;
        my $buffer;
        my $regex = qr/^.*?(.{5}abc(?:.{10})def.{5})/;
        my $target_length = 26;   # number of characters needed for a match

        while (<DATA>) {
            chomp;
            push @stack, $_;
            $buffer = join( "", @stack );
            if ( $buffer =~ s/$regex// ) {
                my $target = $1;
                while ( $target ) {
                    print "Found /$target/ after reading $. lines\n";
                    $target = ( $buffer =~ s/$regex// ) ? $1 : undef;
                }
                @stack = ( $buffer );
            }
            else {
                while ( length( $buffer ) >= $target_length + length( $stack[0] )) {
                    shift @stack;
                    $buffer = join( "", @stack );
                    warn sprintf( "No match at line %d; stack is %d lines, %d chars\n",
                                  $., scalar @stack, length( $buffer ));
                }
            }
        }
        __DATA__
        sample data with five matches...
        foo bar 5CHRSabc_TEN_CHRS1def5chrs bax qax
        moo gar 5
        Chrsab
        c_Ten_Chrs2
        de
        f5Chrsnax zax
        5cHrSabc_TeN_
        ChRs3def5
        chrs
        etc. and so on and so on and so on ad nausem
        fivecabc0123456789defmtch4 and then another FIVECabc9876543210defMTCH5
        and then nothing useful after that forever more
        up to the end
    That will concatenate lines onto a stack, removing line terminators as it goes. As soon as there's a match, it is reported to STDOUT, and the stack is reset to start where the match ended.

    If there's no match for an extended stretch of data, the initial line is shifted off the stack so long as the overall length of the remaining lines is enough to hold a match. (I put messages to STDERR to report this, just to see it work.)

Re: Regular expressions across multiple lines
by Discipulus (Canon) on Apr 24, 2016 at 17:32 UTC
    Welcome to the Monastery abcd

    The first assignment seems easy; why allow for any character with a dot after every character, when all you need to allow for is a newline?

    perl -E "say q(found ), $count=()=qq(abcdefab\ncdefa\nbcdef) =~ /a\n?b +\n?c/gm, q( [abc] occurences)" found 3 [abc] occurences
    On the other hand, the description you gave of your code does not make much sense to me (and you are probably missing use strict; and use warnings;).

    while ($line=<inputfile>){chomp $line; $string=$string.$line;}

    In fact, what I understand is that you are accumulating every new line into $string and attempting the match on every intermediate string: for a 100-line file you are actually examining 5050 lines' worth of text. This can be a problem.

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Thanks, I will try this, but I am new to programming so I am not sure I understand your code. What I was doing in my code was simply to chomp every line and append it to the end of a string, so that I get a single string containing everything without any newlines, which I then search.
        Please make this crystal clear: "but for some reason doesnt work with my actual txt file which is several hundred mb". I am presuming that "slow" (maybe many, even tens of minutes) is NOT the issue?
        You're welcome, even if I'm not sure I understand your issue.

        Basically, a\n?b\n?c means: match an a, followed perhaps (?) by a newline \n, then a b, followed perhaps (?) by a newline \n, and then a c.

        The m regex modifier (probably unneeded in my example) stands for multiline, and the g one means global, i.e. all occurrences are returned.

        The $count=()=$string=~/pattern/g idiom is used to count the occurrences of pattern in $string. In fact, $string=~/pattern/g in list context returns a list of all matches; the empty list () provides that list context, and its scalar value (i.e. the number of elements) is then assigned to the scalar $count.
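        Spelled out step by step, the same counting idiom looks like this (a small illustrative sketch using the example data from above):

            my $string  = "abcdefab\ncdefa\nbcdef";
            my @matches = $string =~ /a\n?b\n?c/g;   # list context: one element per match
            my $count   = @matches;                  # scalar context: number of elements
            print "found $count occurrences\n";      # prints: found 3 occurrences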

        For brevity I put your example data into a double-quoted string using the qq operator: qq(abcdefab\ncdefa\nbcdef).

        The rest is just printing.

        If you want to slurp a file into a string, you can play with $/, aka the input record separator; see perlvar and How do I read an entire file into a string?
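        For instance, a minimal slurp sketch (the file name is made up):

            my $content = do {
                local $/;                                    # undefine the input record separator
                open my $fh, '<', 'input.txt' or die "open: $!";
                <$fh>;                                       # now reads the whole file at once
            };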

        L*

        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

        Assuming you're really only looking for a short string, and given the size of your file, I would be tempted to concatenate each new line with only the last few non-space characters from the previous line(s), and do the comparison on every loop iteration, as in the sketch below.
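        A rough, untested sketch of that idea; the file name and the 26-character maximum match span (5+3+10+3+5, from the OP's pattern) are assumptions:

            use strict;
            use warnings;

            my $window = '';
            my $max    = 26;   # longest span a match can cover: 5 + 3 + 10 + 3 + 5

            open my $fh, '<', 'input.txt' or die "open: $!";   # placeholder file name
            while ( my $line = <$fh> ) {
                $line =~ tr/\t\r\n //d;          # drop whitespace, including the line ending
                $window .= $line;
                while ( $window =~ /(?<=(.....))abc(.{10})def(?=(.....))/g ) {
                    print "$1, $2, $3\n";
                }
                # carry over just under one full match length, so a match that was
                # already reported can never fit entirely in the retained tail
                $window = substr( $window, -($max - 1) ) if length($window) >= $max;
            }
            close $fh;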

Re: Regular expressions across multiple lines
by marioroy (Prior) on Apr 24, 2016 at 21:45 UTC

    Hello abcd,

    The following demonstration uses the BioUtil::Seq module from CPAN. It is well suited to this use case.

        use strict;
        use warnings;

        use BioUtil::Seq;
        use constant { HDR => 0, SEQ => 1 };

        # From the documentation:
        #
        # FastaReader returns an anonymous subroutine, when called, returns
        # a fasta record which is a reference of an array containing the fasta
        # header and sequence. By default, spaces and \r?\n are trimmed from
        # the sequence.
        #
        my $next_seq = FastaReader("input_file.fasta");

        while ( my $fa = $next_seq->() ) {
            # print ">$fa->[HDR]\n$fa->[SEQ]\n";
            my $name = ( split(/ /, $fa->[HDR], 2) )[0];

            while ( $fa->[SEQ] =~ /(?<=(.....))abc(.{10})def(?=(.....))/g ) {
                print "$name: $1, $2, $3\n";
            }
        }

    Regards, Mario.

      Update: Changed chunk_size to '2M'.

      Update: Added full example.

      Update: Added missing tr line to trim white space.

      For the spirit of Perl and Bioinformaticians at large, the following does the same thing by utilizing the record separator option in MCE. The "\n>" is a special case which anchors ">" at the start of the line. Workers receive records beginning with ">" and ending in "\n".

      The following demonstration is fast for both small and large sequences. A chunk_size greater than 8192 means to read at least that many bytes; Perl will then read on until the record separator. A worker may receive one or several records depending on the size of the record(s).

      use strict;
      use warnings;

      use MCE::Flow;
      use MCE::Shared;

      mce_open my $out_fh, '>', \*STDOUT or die "open error: $!\n";

      mce_flow {
          max_workers => 4, chunk_size => '2m',
          input_data => "input_file.fasta", RS => "\n>",
      },
      sub {
          my ( $mce, $chunk_ref, $chunk_id ) = @_;
          my ( $name, $output );

          for ( @{ $chunk_ref } ) {
              /^>(\w+)/; $name = $1;
              tr/\t\r\n //d;   # trim white space

              while ( $_ =~ /(?<=(.....))CCCC(.{10})AGA(?=(.....))/g ) {
                  $output .= "$name: $1, $2, $3\n";
              }
          }

          print $out_fh $output if length($output);
      };

      The following demonstration was created mainly as a template for extracting the seq_id, seq_desc, and sequence separately, and for doing so with low memory consumption. Basically, the whole header line is trimmed from the record, leaving just the sequence in $_ without Perl making an extra copy.

      use strict;
      use warnings;

      use MCE::Flow;
      use MCE::Shared;

      mce_open my $out_fh, '>', \*STDOUT or die "open error: $!\n";

      mce_flow {
          max_workers => 4, chunk_size => '2m',
          input_data => "input_file.fasta", RS => "\n>",
      },
      sub {
          my ( $mce, $chunk_ref, $chunk_id ) = @_;
          my ( $pos, $hdr, $seq_id, $seq_desc, $output );

          for ( @{ $chunk_ref } ) {
              $pos = index($_, "\n") + 1;
              $hdr = substr($_, 0, $pos - 1);

              # skip the first record, e.g. comment at the top of the file
              next if ( $chunk_id == 1 && substr($hdr, 0, 1) ne '>' );

              # extract seq_id and seq_desc
              $hdr =~ /^>(\w+)\s*([^\r\n]*)/;
              $seq_id = $1, $seq_desc = $2;

              # $_ becomes sequence, without making an extra copy
              substr($_, 0, $pos, '');

              # trim any white space in sequence
              tr/\t\r\n //d;

              # for printing ">header\nsequence\n", uncomment the next 3 lines
              # ( length $seq_desc )
              #     ? print ">$seq_id $seq_desc\n$_\n"
              #     : print ">$seq_id\n$_\n";

              # loop through match patterns
              while ( /(?<=(.....))CCCC(.{10})AGA(?=(.....))/g ) {
                  $output .= "$seq_id: $1, $2, $3\n";
              }
          }

          print $out_fh $output if length($output);
      };

      Regards, Mario.

      The following is a parallel demonstration when extra performance is desired for very large sequences. Otherwise, the serial demonstration is faster.

      use strict;
      use warnings;

      use BioUtil::Seq;
      use constant { HDR => 0, SEQ => 1 };

      use MCE::Flow;
      use MCE::Shared;

      mce_open my $out_fh, '>', \*STDOUT or die "open error: $!\n";

      # From the documentation:
      #
      # FastaReader returns an anonymous subroutine, when called, returns
      # a fasta record which is a reference of an array containing the fasta
      # header and sequence. By default, spaces and \r?\n are trimmed from
      # the sequence.
      #
      mce_flow {
          max_workers => 4, chunk_size => 1,
          input_data => FastaReader("input_file.fasta")
      },
      sub {
          my ( $mce, $chunk_ref, $chunk_id ) = @_;
          my $fa = $chunk_ref->[0];
          # my $fa = $_;   # same thing for chunk_size => 1
          # therefore, the 2 lines above may be omitted

          # print ">$fa->[HDR]\n$fa->[SEQ]\n";
          my $name = ( split(/ /, $fa->[HDR], 2) )[0];
          my $output;

          while ( $fa->[SEQ] =~ /(?<=(.....))abc(.{10})def(?=(.....))/g ) {
              $output .= "$name: $1, $2, $3\n";
          }

          print $out_fh $output if length($output);
      };

      Regards, Mario.

Re: Regular expressions across multiple lines
by Anonymous Monk on Apr 24, 2016 at 16:54 UTC

    Why doesn't it work with your large file? Does it work with a file half the size of your large file? One tenth the size?

      I tried it on a small file with only a few lines and it worked perfectly. With the large file it just does nothing; I tried it with a 10 MB file and it still doesn't work. I tried to output the chomped text to a txt file. When I open that text file in a text editor it shows weird overlapping text (like some sort of graphical problem). The only thing I can think of is that my PC is too slow and the process hangs or something. But if this is the only way to do it I will try it on my work PC.
        Is this an ASCII file, or are there multi-byte character encodings in it? A "too slow" PC is not likely; some other issue is afoot here. Could it be a Unicode issue? Can you hack this down into a simple "a) this works, b) this doesn't work" example without huge files? The actual code would also be VERY useful.

        Try it on a 100 KB file, just to see if it is simply taking too long.

        At what length of file does it stop working?

Re: Regular expressions across multiple lines
by Anonymous Monk on Apr 24, 2016 at 17:26 UTC

    Is your file a Windows file with both \r and \n in it, or a Linux file with just \n?

    Try slurping the whole file and tr/// out newlines.

    tr/\r\n//d for my $string = do { local $/; <inputfile> };
      chomp() is multi-platform (not exactly true). It will delete <CR><NL> and <NL>, even on Windows; these line endings, even if mixed, will not matter. BTW, to normalize line endings to the current platform: while(<>){chomp; print "$_\n";} works on Unix or Windows.

      Updated: In my testing and actual experience, Perl programs appear to do well under either Unix or Windows, even with mixed line endings on either platform; Perl itself also apparently doesn't have any problems. I have run the above while() code many times on several Unix platforms with great success. On my Win XP machine there is an issue with old-style Mac endings (which use just <CR>); however, I never work with files like that, so I hadn't seen this before and had to write a special test case using binmode to make such a file.

        chomp() is multi-platform. It will delete <CR><NL> and <NL>, even on Windows. These line endings even if mixed will not matter.

        Well, it may look so, but what really happens is different. See chomp:

        This safer version of chop removes any trailing string that corresponds to the current value of $/ (also known as $INPUT_RECORD_SEPARATOR in the English module).

        Note: Not a single word about the CR or LF control characters, the CR-LF pair, or NL (newline).

        The input record separator $/ is documented; it defaults to an abstract "newline" character:

        The input record separator, newline by default. This influences Perl's idea of what a "line" is. [...] See also Newlines in perlport.

        Now, "newlines". Perl has inherited them from C, by using two modes for accessing files, text mode and binary mode. In text mode, the systems native line ending, whatever that may be, is translated from or to a logical newline, also known as "\n". In binary mode, file content is not modified during read or write. C has been defined in a way that the logical newline is identical with the native line ending on unix, LF. So, there is no difference between text mode and binary mode ON unix.

        Quoting Newlines in perlport:

        In most operating systems, lines in files are terminated by newlines. Just what is used as a newline may vary from OS to OS. Unix traditionally uses \012, one type of DOSish I/O uses \015\012, Mac OS uses \015, and z/OS uses \025.

        Perl uses \n to represent the "logical" newline, where what is logical may depend on the platform in use. In MacPerl, \n always means \015. On EBCDIC platforms, \n could be \025 or \045. In DOSish perls, \n usually means \012, but when accessing a file in "text" mode, perl uses the :crlf layer that translates it to (or from) \015\012, depending on whether you're reading or writing. Unix does the same thing on ttys in canonical mode. \015\012 is commonly referred to as CRLF.

        What happens here is that Perl has reasonable defaults for text handling: it opens files (including STDIN, STDOUT, STDERR) in text mode by default, $/ defaults to a single logical newline ("\n"), and native newline characters are translated on input, so chomp just removes that "\n", on any platform.

        When reading text files using a non-native line ending, things will usually go wrong:

        /tmp/demo>file *.txt
        linux-file.txt:   ASCII text
        mac-file.txt:     ASCII text, with CR line terminators
        windows-file.txt: ASCII text, with CRLF line terminators
        /tmp/demo>perl -MData::Dumper -E '$Data::Dumper::Useqq=1; for $fn (@ARGV) { open $f,"<",$fn or die; @lines=<$f>; chomp @lines; say "$fn:"; say Dumper(\@lines); }' *.txt
        linux-file.txt:
        $VAR1 = [
                  "A simple file generated",
                  "on Linux with Unix",
                  "line endings."
                ];
        mac-file.txt:
        $VAR1 = [
                  "A simple file generated\ron Windows with Old Mac\rline endings.\r"
                ];
        windows-file.txt:
        $VAR1 = [
                  "A simple file generated\r",
                  "on Windows with Windows\r",
                  "line endings.\r"
                ];
        /tmp/demo>

        Of course, it depends on the system you are using:

        H:\tmp\demo>perl -MWin32::autoglob -MData::Dumper -E "$Data::Dumper::Useqq=1; for $fn (@ARGV) { open $f,'<',$fn or die; @lines=<$f>; chomp @lines; say qq<$fn:>; say Dumper(\@lines); }" *.txt
        linux-file.txt:
        $VAR1 = [
                  "A simple file generated",
                  "on Linux with Unix",
                  "line endings."
                ];
        mac-file.txt:
        $VAR1 = [
                  "A simple file generated\ron Windows with Old Mac\rline endings.\r"
                ];
        windows-file.txt:
        $VAR1 = [
                  "A simple file generated",
                  "on Windows with Windows",
                  "line endings."
                ];
        H:\tmp\demo>

        So, chomp is NOT cross-platform. It can handle input from native text files on all platforms out of the box. But if you have to work with ASCII files with mixed line endings (CR, LF, CR-LF, LF-CR), chomp can't work reliably. This is not chomp's fault, nor is it perl's fault.
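        If you really have to cope with such mixed endings, a regex is more robust than chomp. A minimal sketch (the file name is made up); note that with $/ still set to "\n", a CR-only file arrives as one big "line":

            open my $fh, '<', 'mixed.txt' or die "open: $!";
            while ( my $line = <$fh> ) {
                $line =~ s/[\r\n]+\z//;   # strip CR, LF, CR-LF, or LF-CR alike
                print "got: $line\n";
            }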

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

        Reading a 100 MB file a line at a time can be slow.

Re: Regular expressions across multiple lines
by Anonymous Monk on Apr 24, 2016 at 20:58 UTC
        #!/usr/bin/perl
        use strict;
        use warnings;

        tr/\r\n//d for my $string = do { local $/; <> };   # slurp STDIN

        printf "leader %s middle %s trailer %s\n", $1, $2, $3
            while $string =~ /(?<=(.....))CCCC(.{10})AGA(?=(.....))/g;

      On thinking about it, replace

      tr/\r\n//d

      with

      tr/ACGT//cd

      and keep exactly what you want.

      perl scans through a 420 MB test case in 1.17 seconds. Cool.
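      A quick illustration of the complement-delete on a made-up string (note that it also strips lowercase bases):

          my $string = "ACGT\r\nacgt CCGG\nTTAA";
          ( my $clean = $string ) =~ tr/ACGT//cd;   # delete everything NOT in A, C, G, T
          print "$clean\n";                         # prints: ACGTCCGGTTAA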

Re: Regular expressions across multiple lines
by Anonymous Monk on Apr 24, 2016 at 17:13 UTC

    Do the 5 preceding chars, the 5 trailing chars, and the 10-char middle part also have newlines in them? Are those newlines counted as part of the 5 or 10?

      Yes, the newlines could be anywhere, but they are not counted in the 5 or 10 characters I am trying to capture. I am working on DNA sequences, so I only want the results to include ACGTs.
Re: Regular expressions across multiple lines
by Anonymous Monk on Apr 24, 2016 at 21:36 UTC

    How long are the actual strings that are represented by 'abc' and 'def' in your problem description? What are the real strings?