http://qs321.pair.com?node_id=1107332

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all! I need to check whether the file named on the command line contains any symbols other than a, c, t or g. Here is what I wrote:

#!/usr/bin/perl -w
use strict;

my $DNA;
chomp ( $DNA = $ARGV[0] );
open ( INPUT_FILE, "$DNA" ) or die;
if ( <INPUT_FILE> =~ m/[^actgACTG]/ ) {
    print "File contains something besides actg sequence.\n";
}
else {
    print "good!\n";
}

The problem is that this code matches non actg symbols in files that contain only actg letters. Help please!

Replies are listed 'Best First'.
Re: =~ matches non-existent symbols
by Athanasius (Archbishop) on Nov 16, 2014 at 07:46 UTC

    This line:

    if ( <INPUT_FILE> =~ m/[^actgACTG]/ ) {

    calls <> (i.e., readline) only once (in scalar context), therefore only the first line of the file is read in and tested. To test the whole file, you need a loop. For example (untested):

    my $ok = 1;
    while (<INPUT_FILE>) {
        chomp;
        if (/[^actg]/i) {
            print "File contains something besides actg sequence.\n";
            $ok = 0;
            last;
        }
    }
    print "good!\n" if $ok;

    And yes, the chomp is necessary, otherwise each line (except perhaps the last) will contain a newline character and so fail the regex test.

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

      Hi! Yeah, I'm not using any loops because the files that I have to check consist of long single lines. Thanks anyway! Do you know what causes my code to match files that contain only actg?
        It doesn't just contain actg; it also contains a newline. You need to chomp your input.
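The effect of the trailing newline can be seen in isolation with a minimal sketch. The sub name has_non_actg is made up for illustration, and the string literal stands in for what a single read from the file would return:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string holds anything other than a, c, t, g.
sub has_non_actg {
    my ($seq) = @_;
    return $seq =~ /[^actg]/i ? 1 : 0;
}

my $line = "actgACTG\n";    # stand-in for a line read from a one-line file
print has_non_actg($line) ? "bad\n" : "good\n";    # the newline counts as "something else"

chomp $line;                # strip the trailing newline
print has_non_actg($line) ? "bad\n" : "good\n";    # now only letters remain
```

The first print reports "bad" and the second "good", even though the visible letters are identical in both cases.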
Re: =~ matches non-existent symbols
by rnewsham (Curate) on Nov 16, 2014 at 08:58 UTC

    Another way could be to use Tie::File and grep.

    use strict;
    use warnings;
    use Tie::File;

    die "File does not exist" unless -f $ARGV[0];
    tie my @file, 'Tie::File', $ARGV[0] or die "Could not tie file";
    if ( grep !/^[actg]+$/i, @file ) {
        print "BAD\n";
    }
    else {
        print "OK\n";
    }

    Update: This is inefficient; see the reply by ikegami below.

      What a waste. This will slow down the program by so much and it'll use up so much more memory than needed. You could simply use
      use strict;
      use warnings;

      my $bad = 0;
      while (<>) {
          if (!/^[actg]+$/i) {
              ++$bad;
              last;
          }
      }
      print $bad ? "BAD\n" : "OK\n";

        You are correct, I had not considered how inefficient that method is. Thanks for pointing it out.

Re: =~ matches non-existent symbols
by graff (Chancellor) on Nov 17, 2014 at 02:12 UTC
    Here's how I would modify the OP script:
    #!/usr/bin/perl
    use strict;
    use warnings;

    $/ = undef;    # slurp-mode for input, just in case
    while ( <> ) {    # reads stdin or all file names in ARGV
        s/\s+//g;            # remove whitespace
        tr/ACGTacgt//d;      # remove all acgt
        if ( length() ) {    # anything left?
            print "$ARGV bad content: $_\n";
        }
        else {
            print "$ARGV all clean!\n";
        }
    }
    Using while (<>) is good (even with slurp-mode input) because that way you can pipe data from any other process as input to the script, or you can put one or more file names on the command line (e.g. "*.txt").

    When you read multiple input files in one run, putting "$ARGV" in the print statements tells you which files are good or bad.

      Thanks, graff! But what if I need to do further manipulations with the data from the file later in the same program?
        what if I need to do further manipulations with the data from the file later in the same program?

        Presumably, the manipulation will depend on whether the file content is "good" or "bad" - in either case, just save a copy of $_ to some other variable after white-space removal but before removing "acgt"; then pass that copy to whatever function you write to do the manipulation (either good or bad).

        This will be for a second step. Right now, you are saying that your file contains only /ACGT/i but that your validation procedure fails. Many of us think that it is likely that your file contains at least one line feed or carriage return character or a combination of both. The important thing right now is to find out what are the hidden characters that lead your validation subroutine to fail. Once you know that, you can modify your original program or your regex to take the findings into account.

      That considers acegt a valid input.
        I saved my script as posted to "/tmp/j.pl", and ran it as follows:
        echo acegt | /tmp/j.pl
        The output was:
        - bad content: e
        Did you find some other way to run it that yields different results?
Re: =~ matches non-existent symbols
by Anonymous Monk on Nov 16, 2014 at 07:05 UTC
    Also, is chomp necessary here?
      Yes.

      The file does contain non-[actg] characters: carriage returns and/or line feeds (depending on the operating system where the file was created). You need to either explicitly allow those characters or use chomp to remove them.
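One caveat is worth illustrating: with the default input record separator ("\n"), chomp removes only the line feed, so a line from a CRLF file still ends in a carriage return. The sketch below assumes that default; strip_eol is a hypothetical helper name, not something from this thread:

```perl
use strict;
use warnings;

# Hypothetical helper: remove one trailing line ending, whether LF or CRLF.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

my $windows_line = "actg\r\n";    # stand-in for a line from a CRLF file

my $chomped = $windows_line;
chomp $chomped;                   # default $/ is "\n", so the "\r" stays behind
print length($chomped), "\n";                      # 5: "actg" plus "\r"
print length( strip_eol($windows_line) ), "\n";    # 4: just "actg"
```

If the data may come from more than one operating system, stripping both characters explicitly is the safer choice.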

        Hi, the only chomp there that I can see is:
        chomp ( $DNA = $ARGV[0] );
        $ARGV[0] is not the file.

        Amusing...

      No, chomping @ARGV is not necessary. The shell already splits command-line arguments on whitespace and removes that whitespace (note that Athanasius talks about chomping lines from the file, not from @ARGV). Also, you're using the two-argument form of open. The Perl documentation says:
      The filename passed to the one- and two-argument forms of open() will have leading and trailing whitespace deleted and normal redirection characters honored. This property, known as "magic open", can often be used to good effect.
      You normally don't actually want Perl to be overly 'magical' with some redirection characters that might end up in filenames somehow; nor do you want it to chomp filenames automatically (if there is whitespace in @ARGV, the user made it so, and he probably knows why arguments must contain whitespace). It is best to use three-argument form of open:
      open my $INPUT_FILE, '<', $DNA or die;
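As a slightly fuller sketch of the same advice, the three-argument open can be wrapped together with a lexical filehandle and an error message that includes $!. The sub name read_first_line is hypothetical, chosen only for this example:

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line of a file, opened with the
# three-argument form of open and a lexical filehandle.
sub read_first_line {
    my ($path) = @_;
    open my $fh, '<', $path or die "can't open '$path': $!";
    my $line = <$fh>;    # undef if the file is empty
    close $fh;
    return $line;
}

# Usage: print the first line of the file named on the command line, if any.
if (@ARGV) {
    my $first = read_first_line( $ARGV[0] );
    print $first if defined $first;
}
```

Because the mode is a separate argument, the file name is used exactly as given, including any leading or trailing whitespace and any characters such as > or |.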
Re: =~ matches non-existent symbols
by davies (Prior) on Nov 16, 2014 at 19:35 UTC

    A wild guess. Could this be OS dependent? Losedows uses CRLF to denote the end of a line, while *u*x (I believe) uses only LF. Thus a file created in Losedows and parsed on Linux might have a CR character that wasn't expected. I don't know if something comparable might happen at the end of a file. Try cutting your file down (half the size each time would be normal) until you get the bit that causes the problem. Then print out the problem bit in delimiters like angle brackets. You might also print out the length of that string. This may help you see what characters are really there, not just what characters you can see.

    Regards,

    John Davies

      Try cutting your file down (half the size each time would be normal) until you get the bit that causes the problem. Then print out the problem bit in delimiters like angle brackets. You might also print out the length of that string. This may help you see what characters are really there, not just what characters you can see.
      OMG, dear OP, don't do that! Use Perl instead, really.
      $ echo -n $'atcg\r\nhello\r\n' > ATCG_FILE # this is our test file
      $ perl -mcharnames -e 'my $s = join "", <>; printf "%s: %d\n", charnames::viacode(ord $1), pos($s) while $s =~ m/([^atcg])/ig' ATCG_FILE
      CARRIAGE RETURN: 5
      LINE FEED: 6
      LATIN SMALL LETTER H: 7
      LATIN SMALL LETTER E: 8
      LATIN SMALL LETTER L: 9
      LATIN SMALL LETTER L: 10
      LATIN SMALL LETTER O: 11
      CARRIAGE RETURN: 12
      LINE FEED: 13
Re: =~ matches non-existent symbols
by Anonymous Monk on Nov 16, 2014 at 17:29 UTC
    So, what causes this code to match text in files that contain ONLY actg?
      Check your file with
      perl -mcharnames -e '<> =~ m/([^atcg])/i; print charnames::viacode(ord $1), "\n"' ATCG_FILE
      I created a file in Vim, typed 'atcgATCG' there, saved it, ran the one-liner and got:
      LINE FEED
      Many editors add a newline to the end of a file when they save it.
Re: =~ matches non-existent symbols
by Anonymous Monk on Nov 17, 2014 at 20:57 UTC

    Okay, so, I slightly corrected my original code, and now it works fine.

    #!/usr/bin/perl -w
    use strict;

    open (INPUT_FILE, "$ARGV[0]") || die "can't open file: $!";
    if ( <INPUT_FILE> =~ m/[^actgACTG\s]/ ) {
        print "File contains something besides actg sequence.\n";
    }
    else {
        print "good!\n";
    }
    close INPUT_FILE;

    Are there any potential problems with this code?

      Apart from what Laurent_R mentioned, this latest version won't tell you anything about what sort of unexpected stuff is showing up in the data (my version will do that). Maybe that's not important to you in this particular process, but when I have to work with defective or unreliable input, I find that it's very helpful to be able to see what's wrong with the data.

      BTW, in case my last reply wasn't clear, here's what I was talking about:

      #!/usr/bin/perl
      use strict;
      use warnings;

      $/ = undef;    # slurp-mode for input, just in case
      while ( <> ) {    # reads stdin or all file names in ARGV
          s/\s+//g;            # remove whitespace
          my $content = $_;    # keep a working copy
          tr/ACGTacgt//d;      # remove all acgt
          if ( length() ) {    # anything left?
              print "$ARGV bad content: $_\n";
              do_something_with_bad_data( $ARGV, $content );
          }
          else {
              print "$ARGV all clean!\n";
              do_something_with_good_data( $ARGV, $content );
          }
      }

      sub do_something_with_bad_data {
          my ( $filename, $data ) = @_;
          # . . . fix it? report it to someone?
      }

      sub do_something_with_good_data {
          my ( $filename, $data ) = @_;
          # . . . whatever you want to do
      }
      It looks like it will work, but only insofar as you have just one long line in your file. If your file ever has more than one line, you're in trouble. I would use a loop or some other mechanism to make sure it will still work fine the day I get two or more lines. Below, I localized $/ (the input record separator) so that the whole file will be slurped into the scalar.

      As a side note, there are some commonly agreed best practices in the Perl community. Among them:

      • Use the use warnings; pragma rather than the -w flag
      • Use lexical filehandles rather than bareword filehandles
      • Use the three-argument syntax for the open function
      Putting all this together, this is a possible (untested) rewrite of your script:
      #!/usr/bin/perl
      use strict;
      use warnings;

      my $infile = shift;
      open my $INPUT_FILE, "<", $infile or die "can't open $infile: $!";
      local $/;    # the whole file will be slurped, even if it has several lines
      my $dna = <$INPUT_FILE>;
      if ( $dna =~ m/[^actg\s]/i ) {
          print "File contains something besides actg sequence.\n";
      }
      else {
          print "good!\n";
      }
      close $INPUT_FILE;

      if ( <INPUT_FILE> =~ m/[^actgACTG\s]/ ) {
          ...

      Are there any potential problems with this code?

      The character class \s includes ' ' (space, 0x20) and, IIRC, \t \n \r \f and other whitespace characters. Your test allows the string read from the file to contain any number of these characters, in any combination and at any position. Please see perlrecharclass.
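To make the difference concrete, here is a small sketch comparing the \s-tolerant test with a stricter one that strips a single trailing line ending and then insists on letters only. Both sub names are hypothetical, and the sample strings are made up for illustration:

```perl
use strict;
use warnings;

# Lenient check, as in the revised script: whitespace is accepted anywhere.
sub lenient_ok {
    my ($seq) = @_;
    return $seq !~ /[^actg\s]/i ? 1 : 0;
}

# Stricter check: drop one trailing line ending, then require only a, c, t, g.
sub strict_ok {
    my ($seq) = @_;
    $seq =~ s/\r?\n\z//;
    return $seq =~ /\A[actg]+\z/i ? 1 : 0;
}

for my $seq ( "actg\n", "ac gt\n", "ac\tgt\n" ) {
    ( my $shown = $seq ) =~ s/\s/_/g;    # make whitespace visible in the report
    printf "%-8s lenient=%d strict=%d\n", $shown, lenient_ok($seq), strict_ok($seq);
}
```

The lenient check accepts all three strings, while the strict check accepts only the first. Note that the strict version, as written here, also rejects an empty sequence.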

      I must say that I don't understand your desperate, last-ditch efforts to avoid the use of chomp, for it seems very likely that the line you're reading from your file is newline-terminated (whatever a newline happens to be in your OS). Here's how I might handle the file-read-and-validate portion of your program (untested):

      use warnings;
      use strict;

      die "no filename given" unless @ARGV;
      my $filename = $ARGV[0];

      open my $fh_input, '<', $filename or die "opening '$filename': $!";
      my @lines = <$fh_input>;
      die "no lines read from '$filename': $!" unless @lines;
      close $fh_input or die "closing '$filename': $!";

      chomp @lines;
      die "more than one line in '$filename'" unless @lines == 1;

      my $line = $lines[0];
      die "'$filename' contains something other than ACTG sequence"
          if $line =~ m{ [^actgACTG] }xms;

      my $result = do_something_with($line);
      print "result is: '$result'";

      exit;

      sub do_something_with { ... }