Re: =~ matches non-existent symbols
by Athanasius (Archbishop) on Nov 16, 2014 at 07:46 UTC
|
if ( <INPUT_FILE> =~ m/[^actgACTG]/ ) {
calls <> (i.e., readline) only once, in scalar context, so only the first line of the file is read and tested. To test the whole file, you need a loop. For example (untested):
my $ok = 1;
while (<INPUT_FILE>)
{
chomp;
if (/[^actg]/i)
{
print "File contains something besides actg sequence.\n";
$ok = 0;
last;
}
}
print "good!\n" if $ok;
And yes, the chomp is necessary; otherwise each line (except perhaps the last) will contain a newline character and so fail the regex test.
Hope that helps,
| [reply] [d/l] [select] |
|
Hi! Yeah, I'm not using any loops because the files that I have to check consist of long single lines. Thanks anyway!
Do you know what causes my code to match files that contain only actg?
| [reply] |
|
It doesn't just contain actg; it also contains a newline. You need to chomp your input.
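A minimal sketch of that effect, assuming the file holds a single line that ends in a newline. The sub name has_non_dna is made up for illustration and is not part of the OP's script:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $seq contains anything other than a/c/t/g.
sub has_non_dna {
    my ($seq) = @_;
    return $seq =~ /[^actg]/i ? 1 : 0;
}

my $line = "actgACTG\n";    # what <INPUT_FILE> returns for a one-line file
print has_non_dna($line) ? "bad\n" : "good\n";    # prints "bad": the newline counts

chomp $line;                # drop the trailing newline
print has_non_dna($line) ? "bad\n" : "good\n";    # prints "good"
```

The regex itself is unchanged between the two calls; only the trailing newline differs.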
| [reply] |
Re: =~ matches non-existent symbols
by rnewsham (Curate) on Nov 16, 2014 at 08:58 UTC
|
use strict;
use warnings;
use Tie::File;
die "File does not exist" unless -f $ARGV[0];
tie my @file, 'Tie::File', $ARGV[0] or die "Could not tie file";
if ( grep !/^[actg]+$/i, @file )
{
print "BAD\n";
}
else
{
print "OK\n";
}
Update: Inefficient; see the reply below by ikegami.
| [reply] [d/l] |
|
What a waste. This will slow down the program by so much and it'll use up so much more memory than needed. You could simply use
use strict;
use warnings;
my $bad = 0;
while (<>) {
if (!/^[actg]+$/i) {
++$bad;
last;
}
}
print $bad ? "BAD\n" : "OK\n";
| [reply] [d/l] |
Re: =~ matches non-existent symbols
by graff (Chancellor) on Nov 17, 2014 at 02:12 UTC
|
Here's how I would modify the OP script:
#!/usr/bin/perl
use strict;
use warnings;
$/ = undef; # slurp-mode for input, just in case
while ( <> ) { # reads stdin or all file names in ARGV
s/\s+//g; # remove whitespace
tr/ACGTacgt//d; # remove all acgt
if ( length() ) { # anything left?
print "$ARGV bad content: $_\n";
} else {
print "$ARGV all clean!\n";
}
}
Using while (<>) is good (even with slurp-mode input) because that way you can pipe data from any other process as input to the script, or you can put one or more file names on the command line (e.g. "*.txt").
When you read multiple input files in one run, putting "$ARGV" in the print statements tells you which files are good or bad. | [reply] [d/l] [select] |
|
Thanks, graff! But what if I need to do further manipulations with the data from the file later in the same program?
| [reply] |
|
what if I need to do further manipulations with the data from the file later in the same program?
Presumably, the manipulation will depend on whether the file content is "good" or "bad" - in either case, just save a copy of $_ to some other variable after white-space removal but before removing "acgt"; then pass that copy to whatever function you write to do the manipulation (either good or bad).
| [reply] |
|
This will be for a second step. Right now, you are saying that your file contains only /ACGT/i but that your validation procedure fails. Many of us think that it is likely that your file contains at least one line feed or carriage return character or a combination of both. The important thing right now is to find out what are the hidden characters that lead your validation subroutine to fail. Once you know that, you can modify your original program or your regex to take the findings into account.
| [reply] [d/l] |
|
That considers acegt a valid input.
| [reply] [d/l] |
|
I saved my script as posted to "/tmp/j.pl", and ran it as follows:
echo acegt | /tmp/j.pl
The output was:
- bad content: e
Did you find some other way to run it that yields different results? | [reply] [d/l] [select] |
|
Re: =~ matches non-existent symbols
by Anonymous Monk on Nov 16, 2014 at 07:05 UTC
|
Also, is chomp necessary here? | [reply] |
|
Hi, the only chomp there that I can see is:
chomp ( $DNA = $ARGV[0] );
$ARGV[0] is not the file.
Amusing... | [reply] [d/l] |
|
No, chomping @ARGV is not necessary. The shell already splits command-line arguments by whitespace, and removes said whitespace (note that Athanasius talks about chomping lines from the file, not from @ARGV).
Also, you're using the two-argument form of open. Perl's documentation says:
The filename passed to the one- and two-argument forms of open() will have leading and trailing whitespace deleted and normal redirection characters honored. This property, known as "magic open", can often be used to good effect.
You normally don't want Perl to be 'magical' about redirection characters that might somehow end up in filenames, nor do you want it to strip whitespace from filenames automatically (if there is whitespace in @ARGV, the user put it there, presumably for a reason). It is best to use the three-argument form of open:
open my $INPUT_FILE, '<', $DNA or die;
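For illustration, a minimal sketch of reading and chomping the first line through a three-argument open. The sub name read_first_line is hypothetical, and the error message is just one reasonable choice:

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line of $path without its newline,
# or undef if the file is empty.
sub read_first_line {
    my ($path) = @_;
    # Three-argument open: the mode is a separate argument, so $path is
    # taken literally (no whitespace stripping, no redirection magic).
    open my $fh, '<', $path or die "can't open '$path': $!";
    my $line = <$fh>;
    close $fh;
    chomp $line if defined $line;
    return $line;
}

if (@ARGV) {
    my $dna = read_first_line( $ARGV[0] );
    print defined $dna ? "read: <$dna>\n" : "file is empty\n";
}
```

Printing the line between angle brackets makes any leftover invisible characters easier to spot.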
| [reply] [d/l] |
Re: =~ matches non-existent symbols
by davies (Prior) on Nov 16, 2014 at 19:35 UTC
|
A wild guess. Could this be OS dependent? Losedows uses CRLF to denote the end of a line, while *u*x (I believe) uses only LF. Thus a file created in Losedows and parsed on Linux might have a CR character that wasn't expected. I don't know if something comparable might happen at the end of a file. Try cutting your file down (half the size each time would be normal) until you get the bit that causes the problem. Then print out the problem bit in delimiters like angle brackets. You might also print out the length of that string. This may help you see what characters are really there, not just what characters you can see.
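A rough sketch of the CRLF case, assuming the line is read on a system where no :crlf layer translates line endings (as on Linux). The helper strip_eol is a made-up name:

```perl
use strict;
use warnings;

# Hypothetical helper: strip one trailing line ending, whether LF or CRLF.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

my $raw = "actg\r\n";    # a line as saved with CRLF line endings

my $chomped = $raw;
chomp $chomped;          # with the default $/ only "\n" is removed
printf "after chomp:     length %d\n", length $chomped;           # 5, "\r" remains
printf "after strip_eol: length %d\n", length strip_eol($raw);    # 4
```

Printing the length, as suggested above, is what gives the stray carriage return away.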
Regards,
John Davies
| [reply] |
|
Try cutting your file down (half the size each time would be normal) until you get the bit that causes the problem. Then print out the problem bit in delimiters like angle brackets. You might also print out the length of that string. This may help you see what characters are really there, not just what characters you can see.
OMG, dear OP, don't do that! Use Perl instead, really.
$ echo -n $'atcg\r\nhello\r\n' > ATCG_FILE # this is our test file
$ perl -mcharnames -e 'my $s = join "", <>; printf "%s: %d\n", charnames::viacode(ord $1), pos($s) while $s =~ m/([^atcg])/ig' ATCG_FILE
CARRIAGE RETURN: 5
LINE FEED: 6
LATIN SMALL LETTER H: 7
LATIN SMALL LETTER E: 8
LATIN SMALL LETTER L: 9
LATIN SMALL LETTER L: 10
LATIN SMALL LETTER O: 11
CARRIAGE RETURN: 12
LINE FEED: 13
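The same idea as a small script rather than a one-liner, as a rough sketch. The sub name offending_chars is hypothetical; charnames::viacode is the same function the one-liner uses, with a U+XXXX fallback in case a code point has no name:

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical helper: list every character that is not a/c/t/g,
# as [ Unicode name, 1-based position ] pairs.
sub offending_chars {
    my ($s) = @_;
    my @found;
    while ( $s =~ /([^atcg])/ig ) {
        my $code = ord $1;
        my $name = charnames::viacode($code);
        $name = sprintf 'U+%04X', $code unless defined $name;
        push @found, [ $name, pos($s) ];
    }
    return @found;
}

if (@ARGV) {
    local $/;            # slurp mode
    my $data = <>;       # whole contents of the first file named in @ARGV
    printf "%s: %d\n", @$_ for offending_chars( defined $data ? $data : '' );
}
```

Run with a file name as its argument, it prints one name and position per offending character, in the same format as the one-liner.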
| [reply] [d/l] |
Re: =~ matches non-existent symbols
by Anonymous Monk on Nov 16, 2014 at 17:29 UTC
|
So, what causes this code to match text in files that contain ONLY actg? | [reply] |
|
perl -mcharnames -e '<> =~ m/([^atcg])/i; print charnames::viacode(ord $1), "\n"' ATCG_FILE
I created one file in Vim, typed 'atcgATCG' there, saved it, ran the one-liner and got:
LINE FEED
Many editors add a newline to the end of a file when they save it. | [reply] [d/l] [select] |
Re: =~ matches non-existent symbols
by Anonymous Monk on Nov 17, 2014 at 20:57 UTC
|
Okay, so, I slightly corrected my original code, and now it works fine.
#!/usr/bin/perl -w
use strict;
open (INPUT_FILE, "$ARGV[0]") || die "can't open file: $!";
if ( <INPUT_FILE> =~ m/[^actgACTG\s]/ ) {
print "File contains something besides actg sequence.\n";
} else {
print "good!\n";
}
close INPUT_FILE;
Are there any potential problems with this code? | [reply] [d/l] |
|
#!/usr/bin/perl
use strict;
use warnings;
$/ = undef; # slurp-mode for input, just in case
while ( <> ) { # reads stdin or all file names in ARGV
s/\s+//g; # remove whitespace
my $content = $_; # keep a working copy
tr/ACGTacgt//d; # remove all acgt
if ( length() ) { # anything left?
print "$ARGV bad content: $_\n";
do_something_with_bad_data( $ARGV, $content );
} else {
print "$ARGV all clean!\n";
do_something_with_good_data( $ARGV, $content );
}
}
sub do_something_with_bad_data
{
my ( $filename, $data ) = @_;
# . . . fix it? report it to someone?
}
sub do_something_with_good_data
{
my ( $filename, $data ) = @_;
# . . . whatever you want to do
}
| [reply] [d/l] |
|
#!/usr/bin/perl
use strict;
use warnings;
my $infile = shift;
open my $INPUT_FILE, "<", $infile or die "can't open $infile: $!";
local $/; # the whole file will be slurped, even if it has several lines
my $dna = <$INPUT_FILE>;
if ( $dna =~ m/[^actg\s]/i ) {
print "File contains something besides actg sequence.\n";
} else {
print "good!\n";
}
close $INPUT_FILE;
| [reply] [d/l] [select] |
|
if ( <INPUT_FILE> =~ m/[^actgACTG\s]/ ) {
...
Are there any potential problems with this code?
The character class \s includes ' ' (space, 0x20) and, IIRC, \t, \n, \r, \f and other whitespace characters. Your test allows the string read from the file to have any number of these characters, in any combination and at any position. Please see perlrecharclass.
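A small sketch of what that permissiveness means in practice. passes_loose mirrors the OP's test; passes_strict is only one guess at a stricter rule, and both sub names are invented for this example:

```perl
use strict;
use warnings;

# The OP's test: whitespace is tolerated anywhere in the string.
sub passes_loose {
    my ($s) = @_;
    return $s !~ /[^actg\s]/i ? 1 : 0;
}

# Stricter variant (an assumption about what is wanted): one unbroken run of
# a/c/t/g, optionally followed by a CR and/or LF at the very end.
sub passes_strict {
    my ($s) = @_;
    return $s =~ /\A[actg]+\r?\n?\z/i ? 1 : 0;
}

for my $s ( "actg\n", "ac tg\n", "ac\n\ntg\n" ) {
    ( my $shown = $s ) =~ s/\n/\\n/g;    # make newlines visible in the report
    printf "%-12s loose=%d strict=%d\n", $shown, passes_loose($s), passes_strict($s);
}
```

All three strings pass the loose test, but only the first passes the strict one.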
I must say that I don't understand your desperate, last-ditch efforts to avoid the use of chomp, for it seems very likely that the line you're reading from your file is newline-terminated (whatever a newline happens to be in your OS). Here's how I might handle the file-read-and-validate portion of your program (untested):
use warnings;
use strict;
die "no filename given" unless @ARGV;
my $filename = $ARGV[0];
open my $fh_input, '<', $filename or die "opening '$filename': $!";
my @lines = <$fh_input>;
die "no lines read from '$filename': $!" unless @lines;
close $fh_input or die "closing '$filename': $!";
chomp @lines;
die "more than one line in '$filename'" unless @lines == 1;
my $line = $lines[0];
die "'$filename' contains something other than ACTG sequence"
if $line =~ m{ [^actgACTG] }xms;
my $result = do_something_with($line);
print "result is: '$result'";
exit;
sub do_something_with { ... }
| [reply] [d/l] [select] |