about regular expression

agustina_s has asked for the wisdom of the Perl Monks concerning the following question:

Hi perlmonks, I really need help with a regex. I have a program that opens an input file and creates an output file. The input and output files look like this:

INPUT

DATE     13-JUN-2000
COMMERCIAL SUPPLIERS
SEQUENCE             
                     /exon="49-333"
                     /intron="1-48;334-385"
                     
//
DATE     13-JUN-2000
COMMERCIAL SUPPLIERS
SEQUENCE            
                     /exon=""
                     /intron="1-29"
//

OUTPUT                    
DBACC   D000001
DATE     "13-JUN 2002"
Exon    {Translation%49-133}
Intron    {Translation%1-48}
Intron    {Translation%334-385}

DBACC   D000002
DATE    "13-JUN 2002"
Exon    {Translation -}
Intron    {Translation%1-29}

I have a problem with printing the exon and intron values. Each can hold zero or more elements separated by ;. I know that for zero or more elements a regex uses *, but in this case I'm quite confused about how to print all the $1, $2 elements.

My code partly looks like:

#!/usr/local/bin/perl -w
# A program that accept an input file: Scorpion database from Gen Bank
# and will output the database in BioWare format

my $file1="$ARGV[0]";          #var to save the input database
my $result=">".$ARGV[1];
my $counter=1;
my $no='D000001';

open(INFO1,$file1) or die "Can't open $file1.\n";   #open file1
open(OUT,$result) or die "Can't open $result.\n";

#foreach line in the files
foreach(<INFO1>)
{
    if(/^DATE\s*(.*)-(.*)-(.*)/){
        print OUT "DBACC\t $no\n";
        print OUT "Date\t $1-$2-$3\n";
        $no++;
        }
    elsif(/\s*\/exon="(\d*-\d*)*"\n/){
        print OUT "Exon\t \{Translation\%$1\}\n";
        }
    elsif(/\s*\/intron="(\d*-\d*)*"\n/){
        print OUT "Intron\t \{Translation\%$1\}\n";
        }    
    else{
    print OUT "line $counter\n";
    }
    $counter++;
}
close(INFO1);
close(OUT);

Also, is " considered a metacharacter? I mean, if we want to search for a { in a string we must use \{. If we want to search for a ", do I have to write \"? It always gives me some error.

Thanks so much...


Replies are listed 'Best First'.
Re: about regular expression by ryan (Pilgrim) on Feb 02, 2002 at 11:30 UTC
You can grab all the possible values for intron and exon with your regex and then split them up. Consider replacing your intron/exon elsif blocks with this:

#new intron elsif block
elsif(/\s+\/intron="(.+)"\n/) {
    foreach $item (split('\;',$1)) {
        print OUT "Intron\t $item\n";
    }
}

I replaced all the *s with +s; from my understanding this is more efficient, but I'm no regex guru :) The regex puts everything between the "double quotes" in $1. This will print out, based on your input data:

Intron    1-48
Intron    334-385

Now that they are separated, you can do whatever you want with them.

Ryan
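As a rough sketch of that capture-then-split idea, and of why the original `(\d*-\d*)*` leaves only one value in $1: a capture group that is repeated by a quantifier keeps just its last repetition. The helper names `ranges_in` and `last_repetition` are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: return every range found in a line such as
#   /intron="1-48;334-385"
sub ranges_in {
    my ($line) = @_;
    # Capture the whole quoted list once, then split it on ';'.
    return () unless $line =~ /="([^"]*)"/;
    return split /;/, $1;
}

# Hypothetical helper: a repeated capture group keeps only its LAST
# repetition, so $1 holds a single range here.
sub last_repetition {
    my ($line) = @_;
    return $line =~ /="(?:(\d+-\d+);?)+"/ ? $1 : undef;
}

my $line = '                     /intron="1-48;334-385"';
print "last repetition only: ", last_repetition($line), "\n";
print "Intron\t{Translation%$_}\n" for ranges_in($line);
```

With the sample line, the repeated group yields only 334-385, while the split version yields both ranges. An empty list (`/exon=""`) gives no ranges at all, so the caller has to decide what to print in that case.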
Re: Re: about regular expression by particle (Vicar) on Feb 02, 2002 at 16:36 UTC
be careful, ryan. `.+` matches one or more characters. `.*` matches zero or more characters. agustina_s specified in her dataset that there might be an empty list. a `.+` would fail to match in that case. also, you don't need to escape semi-colon (;).

~Particle
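A small sketch of both points, using the empty `/exon=""` line from the sample input. The variable names are only illustrative.

```perl
use strict;
use warnings;

my $empty = '                     /exon=""';

# .+ needs at least one character between the quotes, so this fails.
my $plus_matches = $empty =~ /\/exon="(.+)"/ ? 1 : 0;

# .* also accepts zero characters, so this succeeds with an empty $1.
my $star_matches = $empty =~ /\/exon="(.*)"/ ? 1 : 0;

print "with .+ : ", ($plus_matches ? "match" : "no match"), "\n";
print "with .* : ", ($star_matches ? "match" : "no match"), "\n";

# ';' has no special meaning in a regex, so both patterns split the same way.
my @plain   = split /;/,  '1-48;334-385';
my @escaped = split /\;/, '1-48;334-385';
print "escaping ';' changes nothing\n" if "@plain" eq "@escaped";
```

Escaping the semicolon is harmless here; it is simply unnecessary.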
Re: Re: Re: about regular expression by ryan (Pilgrim) on Feb 03, 2002 at 04:50 UTC
Yep, point taken. If, as your later post does, a blank set of inputs is meant to output, for example, 'Intron' with nothing after it, then mine fails. Mine just prints nothing if there is no data for the input line. I didn't know which way is correct, because I lost some of the example code due to some lovely DB errors this site keeps throwing me.

"also, you don't need to escape semi-colon (;)."

Ahh the wonders of being an incompetent novice. I'd say it doesn't hurt, but no doubt you'll give me an example of when it can :)
Re: about regular expression by particle (Vicar) on Feb 02, 2002 at 16:22 UTC
this should do exactly what you want. i cleaned up your code a little, and added my comments with ##. you can look up info in perldoc on the items in parentheses. in particular, you may want to look at shift, FileHandle, perlre, split, and while.

some nodes you might want to read are: "while or foreach?", "Opening files", "Use strict warnings and diagnostics or die", and "Death to Dot Star!". best of luck in the future!

#!/usr/local/bin/perl -w
use strict;    ## use strict, use strict, use strict!!!
$|++;          ## enable line buffering to STDOUT
use FileHandle;

# A program that accept an input file: Scorpion database from Gen Bank
# and will output the database in BioWare format

## used descriptive variable names
## used shift operator to process arguments (shift) and die with usage
my $infile  = shift || die "usage: $0 infile outfile\n";
my $outfile = shift || die "usage: $0 infile outfile\n";
my $item_count=1;
my $item='D000001';

my $IN  = new FileHandle;
my $OUT = new FileHandle;

## check status of open and print $! for descriptive error message
open($IN, "< " . $infile) or die "Can't open $infile. $!";
open($OUT, "> " . $outfile) or die "Can't open $outfile. $!";

while(<$IN>) {
    ## remove trailing newline
    chomp;

    ## skip blank lines
    next if( /^\s*$/ );

    ## print newline if end of record
    if( m{^//$} ) {
        print $OUT "\n";
        next;
    }

    ## expects date format like 1or2-three-four characters (perlre)
    if( /^DATE\s+(..?)-(...)-(....)$/ ) {    ## very fast regex
        print $OUT "DBACC\t", $item++, "\n";
        print $OUT "DATE\t\"$1-$2 $3\"\n";
    }
    ## non-greedy match between double quotes (perlre)
    elsif( /^\s*\/exon="(.*?)"$/ ) {
        ## handle null case
        print $OUT "Exon\t{Translation -}\n" unless $1;

        ## separate the matched string and process each (split)
        for(split ';', $1) {
            print $OUT "Exon\t{Translation\%", $_ ,"}\n";
        }
    }
    ## non-greedy match between double quotes (perlre)
    elsif( /^\s*\/intron="(.*?)"$/ ) {
        ## handle null case
        print $OUT "Intron\t{Translation -}\n" unless $1;

        ## separate the matched string and process each
        for(split ';', $1) {
            print $OUT "Intron\t{Translation\%", $_ ,"}\n";
        }
    }
}

## check status of close and print $! for descriptive error message
close($IN) or die "Can't close $infile. $!";
close($OUT) or die "Can't close $outfile. $!";

~Particle
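The `$item++` in that code relies on Perl's "magical" string increment, which steps 'D000001' to 'D000002' rather than treating it as a number. Here is a minimal sketch of the DATE handling in isolation; `date_lines` is a hypothetical helper name, and the stricter date pattern is an assumption based on the sample input only.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one DATE line into the two output lines and
# advance the accession number held in the referenced scalar.
sub date_lines {
    my ($line, $acc_ref) = @_;
    return () unless $line =~ /^DATE\s+(\d{1,2})-([A-Z]{3})-(\d{4})\s*$/;
    my @out = ("DBACC\t$$acc_ref", qq{DATE\t"$1-$2 $3"});
    $$acc_ref++;    # magical string increment: 'D000001' becomes 'D000002'
    return @out;
}

my $acc = 'D000001';
print "$_\n" for date_lines('DATE     13-JUN-2000', \$acc);
print "next accession: $acc\n";
```

The increment also carries across digits, so 'D000009' becomes 'D000010', which is what keeps the accession numbers zero-padded without any sprintf.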
Re: about regular expression by trs80 (Priest) on Feb 02, 2002 at 15:44 UTC
Your exon and intron regex was limiting the results, if I understand your question correctly. I reworked the code to:

foreach(<INFO1>) {
    # got rid of the '.*' which is too greedy
    # I made the date matches specific to the
    # example input you provided in your post
    # you may need to adjust for more options
    # in the matches depending on your data
    # consistency
    if(/^DATE\s+(\d{2})-(\w{3})-(\d{4})/){
        print OUT "DBACC\t $no\n";
        print OUT "Date\t $1-$2-$3\n";
        $no++;
    # made one conditional that gets both
    # exon and intron. Used a [] (character
    # class match) instead of the \d*-\d*
    # The + after it allows for 1 or more
    # of a 0-9 , ';' or '-'
    } elsif(/\s+\/(intr|ex)on="([\d;-]+)"\n/) {
        # added a split on ';' in case you want
        # or need to do something with each one
        # separated by a ';'
        my @values = split(/;/,$2);
        foreach (@values) {
            # needed to uppercase the matched prefix
            # based on your example output since
            # the match was on the lowercase prefix
            print OUT ucfirst($1) . "on\t \{Translation\%$_\}\n";
        }
        # if you don't need to do the split just do this
        # print OUT ucfirst($1) . "on\t \{Translation\%$2\}\n";
    } else {
        print OUT "line $counter\n";
    }
    $counter++;
}

There are several good nodes on regex in the tutorial section. See the gotcha one in particular.
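A sketch of the combined exon/intron match as a standalone function, assuming the empty list should print as `{Translation -}` as in the question's sample output. `feature_lines` is a hypothetical name. The last two checks touch on the original question about `"`: a double quote is an ordinary character in a pattern, so escaping it is optional.

```perl
use strict;
use warnings;

# Hypothetical helper: build the output lines for one exon/intron line.
sub feature_lines {
    my ($line) = @_;
    # The hyphen goes last in the class so it is clearly a literal '-'.
    # '*' rather than '+' lets the empty list "" match as well.
    return () unless $line =~ /\/(intr|ex)on="([\d;-]*)"/;
    my ($label, $list) = (ucfirst($1) . 'on', $2);
    return ("$label\t{Translation -}") if $list eq '';
    my @out;
    push @out, "$label\t{Translation%$_}" for split /;/, $list;
    return @out;
}

print "$_\n" for feature_lines('   /intron="1-48;334-385"');
print "$_\n" for feature_lines('   /exon=""');

# A double quote needs no backslash in a pattern: both forms match.
my $plain_quote   = 'x="1"' =~ /="1"/   ? 1 : 0;
my $escaped_quote = 'x="1"' =~ /=\"1\"/ ? 1 : 0;
print "plain: $plain_quote, escaped: $escaped_quote\n";
```

If escaping `"` seemed necessary in the original script, the likely cause is the surrounding Perl string or shell quoting rather than the regex itself, though that is a guess without the exact error message.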

