=~ matches non-existent symbols

by Anonymous Monk on Nov 16, 2014 at 16:58 UTC

Hi! Yeah, I'm not using any loops because the files that I have to check consist of long single lines. Thanks anyway! Do you know what causes my code to match files that contain only actg?

by ikegami (Patriarch) on Nov 18, 2014 at 17:33 UTC

It doesn't just contain actg; it also contains a newline. You need to chomp your input.
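
A minimal sketch of that point, using an in-memory filehandle as a stand-in for the real file. The sample data and the helper name is_dna are made up for illustration and are not from the original code:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string has no character outside a, c, t, g.
sub is_dna {
    my ($seq) = @_;
    return $seq !~ /[^actg]/i;
}

# Read one line from an in-memory "file" whose only line ends with a newline.
my $data = "actgactg\n";
open my $fh, '<', \$data or die "can't open in-memory file: $!";
my $line = <$fh>;
close $fh;

# The line still carries its "\n", which is not in [actg].
print is_dna($line) ? "OK\n" : "BAD\n";   # prints BAD

chomp $line;                              # strip the line terminator
print is_dna($line) ? "OK\n" : "BAD\n";   # prints OK
```

The file content is the same in both checks; only the trailing newline changes the result.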

Re: =~ matches non-existent symbols
by rnewsham (Curate) on Nov 16, 2014 at 08:58 UTC

Another way could be to use Tie::File and grep.

use strict;
use warnings;

use Tie::File;

die "File does not exist" unless -f $ARGV[0];

tie my @file, 'Tie::File', $ARGV[0] or die "Could not tie file";

if ( grep !/^[actg]+$/i, @file )
{
        print "BAD\n";
}
else
{
        print "OK\n";
}

Update: Inefficient; see the update below by ikegami.

by ikegami (Patriarch) on Nov 18, 2014 at 17:38 UTC

use strict;
use warnings;

my $bad = 0;
while (<>) {
   if (!/^[actg]+$/i) {
      ++$bad;
      last;
   }
}

print $bad ? "BAD\n" : "OK\n";

by rnewsham (Curate) on Nov 18, 2014 at 22:34 UTC

You are correct; I had not considered how inefficient that method is. Thanks for pointing it out.

Re: =~ matches non-existent symbols
by graff (Chancellor) on Nov 17, 2014 at 02:12 UTC

#!/usr/bin/perl

use strict;
use warnings;

$/ = undef;     # slurp-mode for input, just in case
while ( <> ) {  # reads stdin or all file names in ARGV
    s/\s+//g;   # remove whitespace
    tr/ACGTacgt//d; # remove all acgt

    if ( length() ) { # anything left?
        print "$ARGV bad content: $_\n";
    } else {
        print "$ARGV all clean!\n";
    }
}

while (<>)

When you read multiple input files in one run, putting "$ARGV" in the print statements tells you which files are good or bad.

by Anonymous Monk on Nov 17, 2014 at 04:20 UTC

Thanks, graff! But what if I need to do further manipulations with the data from the file later in the same program?

by graff (Chancellor) on Nov 17, 2014 at 10:18 UTC

what if I need to do further manipulations with the data from the file later in the same program?

Presumably, the manipulation will depend on whether the file content is "good" or "bad" - in either case, just save a copy of $_ to some other variable after white-space removal but before removing "acgt"; then pass that copy to whatever function you write to do the manipulation (either good or bad).

by Laurent_R (Canon) on Nov 17, 2014 at 07:20 UTC

/ACGT/i

by ikegami (Patriarch) on Nov 19, 2014 at 15:31 UTC

~~That considers acegt a valid input.~~

by graff (Chancellor) on Nov 20, 2014 at 05:25 UTC

echo acegt | /tmp/j.pl

- bad content: e

Re^4: =~ matches non-existent symbols

by ikegami (Patriarch) on Nov 21, 2014 at 19:42 UTC

Re: =~ matches non-existent symbols
by Anonymous Monk on Nov 16, 2014 at 07:05 UTC

Also, is chomp necessary here?

by dsheroh (Monsignor) on Nov 16, 2014 at 10:28 UTC

The file does contain non-[actg] characters: carriage returns and/or line feeds (depending on the operating system where the file was created). You need to either explicitly allow those characters or use chomp to remove them.
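
As a rough illustration of the second option: on a Unix-like system, chomp with the default $/ removes only "\n", so a file written with CRLF line endings keeps its "\r". One way to cover both cases is to strip any trailing CR/LF explicitly. The helper name strip_eol and the sample strings are assumptions for this sketch:

```perl
use strict;
use warnings;

# Hypothetical helper: remove any trailing run of CR and LF characters.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/[\r\n]+\z//;
    return $line;
}

# Unix ending, Windows ending, and no line ending at all.
for my $raw ("actg\n", "actg\r\n", "actg") {
    my $seq = strip_eol($raw);
    print length($seq), " ", ($seq =~ /\A[actg]+\z/i ? "OK" : "BAD"), "\n";
}
```

All three samples come out as four characters and pass the check.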

by Anonymous Monk on Nov 16, 2014 at 10:35 UTC

chomp ( $DNA = $ARGV[0] );

$ARGV[0]

not

Amusing...

by Anonymous Monk on Nov 16, 2014 at 10:32 UTC

The filename passed to the one- and two-argument forms of open() will have leading and trailing whitespace deleted and normal redirection characters honored. This property, known as "magic open", can often be used to good effect.

open my $INPUT_FILE, '<', $DNA or die;

Re: =~ matches non-existent symbols
by davies (Prior) on Nov 16, 2014 at 19:35 UTC

A wild guess. Could this be OS dependent? Losedows uses CRLF to denote the end of a line, while *u*x (I believe) uses only LF. Thus a file created in Losedows and parsed on Linux might have a CR character that wasn't expected. I don't know if something comparable might happen at the end of a file. Try cutting your file down (half the size each time would be normal) until you get the bit that causes the problem. Then print out the problem bit in delimiters like angle brackets. You might also print out the length of that string. This may help you see what characters are really there, not just what characters you can see.
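
The "print it in delimiters, with its length" idea could look something like the sketch below. The helper name show_hidden and the [0x..] notation are choices made for this example only:

```perl
use strict;
use warnings;

# Hypothetical helper: wrap the string in angle brackets, replace every
# character outside printable ASCII with a [0xNN] marker, and report the
# length of the original string.
sub show_hidden {
    my ($s) = @_;
    (my $visible = $s) =~ s/([^\x20-\x7e])/sprintf '[0x%02X]', ord $1/ge;
    return sprintf "<%s> length %d", $visible, length $s;
}

print show_hidden("actg\r\n"), "\n";   # prints <actg[0x0D][0x0A]> length 6
```

A string that looks like four letters on screen turns out to be six characters long, with the carriage return and line feed made visible.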

Regards,

John Davies

by Anonymous Monk on Nov 16, 2014 at 20:45 UTC

Try cutting your file down (half the size each time would be normal) until you get the bit that causes the problem. Then print out the problem bit in delimiters like angle brackets. You might also print out the length of that string. This may help you see what characters are really there, not just what characters you can see.

$ echo -n $'atcg\r\nhello\r\n' > ATCG_FILE # this is our test file

$ perl -mcharnames -e 'my $s = join "", <>; printf "%s: %d\n", charnames::viacode(ord $1), pos($s) while $s =~ m/([^atcg])/ig' ATCG_FILE

CARRIAGE RETURN: 5
LINE FEED: 6
LATIN SMALL LETTER H: 7
LATIN SMALL LETTER E: 8
LATIN SMALL LETTER L: 9
LATIN SMALL LETTER L: 10
LATIN SMALL LETTER O: 11
CARRIAGE RETURN: 12
LINE FEED: 13

Re: =~ matches non-existent symbols
by Anonymous Monk on Nov 16, 2014 at 17:29 UTC

So, what causes this code to match text in files that contain ONLY actg?

by Anonymous Monk on Nov 16, 2014 at 17:40 UTC

perl -mcharnames -e '<> =~ m/([^atcg])/i; print charnames::viacode(ord $1), "\n"' ATCG_FILE

LINE FEED

Re: =~ matches non-existent symbols
by Anonymous Monk on Nov 17, 2014 at 20:57 UTC

Okay, so, I slightly corrected my original code, and now it works fine.

#!/usr/bin/perl -w

use strict;

open (INPUT_FILE, "$ARGV[0]") || die "can't open file: $!";
         
if ( <INPUT_FILE> =~ m/[^actgACTG\s]/ ) {
    print "File contains something besides actg sequence.\n";
} else {
    print "good!\n";
}            
      
close INPUT_FILE;

Are there any potential problems with this code?

by graff (Chancellor) on Nov 17, 2014 at 22:30 UTC

Laurent_R

BTW, in case my last reply wasn't clear, here's what I was talking about:

#!/usr/bin/perl

use strict;
use warnings;

$/ = undef;     # slurp-mode for input, just in case
while ( <> ) {  # reads stdin or all file names in ARGV
    s/\s+//g;   # remove whitespace
    my $content = $_;  # keep a working copy
    tr/ACGTacgt//d;    # remove all acgt

    if ( length() ) { # anything left?
        print "$ARGV bad content: $_\n";
        do_something_with_bad_data( $ARGV, $content );
    } else {
        print "$ARGV all clean!\n";
        do_something_with_good_data( $ARGV, $content );
    }
}

sub do_something_with_bad_data
{
    my ( $filename, $data ) = @_;
# . . . fix it?  report it to someone?
}

sub do_something_with_good_data
{
    my ( $filename, $data ) = @_;
# . . . whatever you want to do
}

by Laurent_R (Canon) on Nov 17, 2014 at 22:02 UTC

local

$/

As a side note, there are some commonly agreed best practices in the Perl community. Among them:

Use the use warnings; pragma rather than the -w flag
Use lexical filehandles rather than bareword filehandles
Use the three-argument syntax for the open function

#!/usr/bin/perl

use strict;
use warnings;

my $infile = shift;

open my $INPUT_FILE, "<", $infile or die "can't open $infile: $!";

local $/; # the whole file will be slurped, even if it has several lines
my $dna = <$INPUT_FILE>;         
if ( $dna  =~ m/[^actg\s]/i ) {
    print "File contains something besides actg sequence.\n";
} else {
    print "good!\n";
}            
close $INPUT_FILE;

by AnomalousMonk (Archbishop) on Nov 17, 2014 at 22:27 UTC

if ( <INPUT_FILE> =~ m/[^actgACTG\s]/ ) {
...

Are there any potential problems with this code?

The character class \s includes ' ' (space, 0x20) and IIRC \t \n \r \f other whitespace characters. Your test allows the string read from the file to have any number of any combination of these characters. Please see perlrecharclass.
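
To make the effect of \s concrete, here is a small comparison. passes_loose uses the pattern from the question; passes_strict is just one possible tighter alternative (letters only, plus at most one trailing newline) and is an assumption of this sketch, as are both helper names:

```perl
use strict;
use warnings;

# The pattern from the question: reject only characters that are
# neither a/c/t/g nor whitespace.
sub passes_loose {
    my ($s) = @_;
    return $s !~ m/[^actgACTG\s]/;
}

# One possible stricter check: a/c/t/g only, with an optional single
# trailing newline.
sub passes_strict {
    my ($s) = @_;
    return $s =~ m/\A[actg]+\n?\z/i;
}

for my $sample ("actg\n", "ac tg\t\n\n", " \n") {
    printf "loose=%s strict=%s\n",
        (passes_loose($sample)  ? "pass" : "fail"),
        (passes_strict($sample) ? "pass" : "fail");
}
```

The loose test accepts all three samples, including the one that is nothing but whitespace; the strict test accepts only the first.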

I must say that I don't understand your desperate, last-ditch efforts to avoid the use of chomp, for it seems very likely that the line you're reading from your file is newline-terminated (whatever a newline happens to be in your OS). Here's how I might handle the file-read-and-validate portion of your program (untested):

use warnings;
use strict;

die "no filename given" unless @ARGV;
my $filename = $ARGV[0];

open my $fh_input, '<', $filename or die "opening '$filename': $!";
my @lines = <$fh_input>;
die "no lines read from '$filename': $!" unless @lines;
close $fh_input or die "closing '$filename': $!";

chomp @lines;
die "more than one line in '$filename'" unless @lines == 1;
my $line = $lines[0];
die "'$filename' contains something other than ACTG sequence"
    if $line =~ m{ [^actgACTG] }xms;

my $result = do_something_with($line);
print "result is: '$result'\n";

exit;

sub do_something_with { ... }