PerlMonks  

about regular expression

by agustina_s (Sexton)
on Feb 02, 2002 at 10:49 UTC (#142906=perlquestion)

agustina_s has asked for the wisdom of the Perl Monks concerning the following question:

Hi perlmonks, I really need help with a regex. I have a program that opens an input file and creates an output file. The input and output files look like this:

INPUT
DATE 13-JUN-2000
COMMERCIAL SUPPLIERS SEQUENCE
/exon="49-333"
/intron="1-48;334-385"
//
DATE 13-JUN-2000
COMMERCIAL SUPPLIERS SEQUENCE
/exon=""
/intron="1-29"
//

OUTPUT
DBACC D000001
DATE "13-JUN 2002"
Exon {Translation%49-133}
Intron {Translation%1-48}
Intron {Translation%334-385}
DBACC D000002
DATE "13-JUN 2002"
Exon {Translation -}
Intron {Translation%1-29}

I have a problem printing the exon and intron values. Each can hold zero or more elements separated by ;. I know that for zero or more elements in a regex we use *, but in this case I'm quite confused about how to print all the $1, $2 elements.

My code partly looks like:

#!/usr/local/bin/perl -w
# A program that accepts an input file: Scorpion database from Gen Bank
# and will output the database in BioWare format
my $file1="$ARGV[0]"; #var to save the input database
my $result=">".$ARGV[1];
my $counter=1;
my $no='D000001';
open(INFO1,$file1) or die "Can't open $file1.\n"; #open file1
open(OUT,$result) or die "Can't open $result.\n";
#foreach line in the files
foreach(<INFO1>)
{
    if(/^DATE\s*(.*)-(.*)-(.*)/){
        print OUT "DBACC\t $no\n";
        print OUT "Date\t $1-$2-$3\n";
        $no++;
    }
    elsif(/\s*\/exon="(\d*-\d*)*"\n/){
        print OUT "Exon\t \{Translation\%$1\}\n";
    }
    elsif(/\s*\/intron="(\d*-\d*)*"\n/){
        print OUT "Intron\t \{Translation\%$1\}\n";
    }
    else{
        print OUT "line $counter\n";
    }
    $counter++;
}
close(INFO1);
close(OUT);

Also, is " considered a metacharacter? I mean, if we want to search for a { in a string we must use \{. If we want to search for a ", do I have to write \"? It always gives me some error.

Thanks so much...
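
On the double-quote question: as far as Perl's regex syntax goes, " is not a metacharacter, so it can be written unescaped inside /.../. A minimal sketch follows, assuming a sample line modelled on the input above (the variable names are made up for illustration). Only the forward slash needs a backslash there, because it is the delimiter of the match operator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line, modelled on the input shown in the question.
my $line = qq{   /intron="1-48;334-385"\n};

# The double quotes are written unescaped. Only the forward slash is
# escaped, because it delimits the /.../ match operator.
my ($plain) = $line =~ /\/intron="(.*)"/;

# Escaping the quotes is harmless and matches the same text.
my ($escaped) = $line =~ /\/intron=\"(.*)\"/;

# With a different delimiter, m{...}, the slash needs no escape either.
my ($braces) = $line =~ m{/intron="(.*)"};

print "plain:   $plain\n";
print "escaped: $escaped\n";
print "braces:  $braces\n";
```

All three patterns capture the same text between the quotes, so an error seen when adding or removing \" probably has another cause.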

Replies are listed 'Best First'.
Re: about regular expression
by ryan (Pilgrim) on Feb 02, 2002 at 11:30 UTC
    You can grab all the possible values for intron and exon with your regex and then split them up.

    Consider replacing your intron/exon elsif blocks with this:
    #new intron elsif block
    elsif(/\s+\/intron="(.+)"\n/) {
        foreach $item (split('\;',$1)) {
            print OUT "Intron\t $item\n";
        }
    }

    I replaced all the *s with +s; from my understanding this is more efficient, but I'm no regex guru :) The regex puts everything between the double quotes in $1.

    This will print out, based on your input data:
    Intron 1-48
    Intron 334-385
    Now that they are separated, you can do whatever you want with them.
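
A short sketch of why capture-then-split helps here, assuming a sample line modelled on the question's input (variable names are hypothetical). A capture group placed under a quantifier is overwritten on each repetition, so only the last range is left in $1. Capturing the whole quoted value once and splitting it keeps every element.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line, modelled on the question's input.
my $line = qq{ /intron="1-48;334-385"\n};

# A capture group under a quantifier keeps only its final repetition.
my $last;
if ($line =~ /\/intron="(?:(\d+-\d+);?)*"/) {
    $last = $1;
}
print "repeated group kept: $last\n";

# Capture everything between the quotes once, then split on ';'.
my @ranges;
if ($line =~ /\/intron="([^"]*)"/) {
    @ranges = split /;/, $1;
}
print "split gave: @ranges\n";
```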

    Ryan
      be careful, ryan.

      .+ matches one or more characters.
      .* matches zero or more characters.

      agustina_s specified in her dataset that there might be an empty list in the dataset. the second .+ would break in that case.

      also, you don't need to escape semi-colon (;).
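
Both points can be checked with a few lines. This is only a sketch, using a hypothetical empty exon line like the one in the sample data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical line with an empty list, as in the second sample record.
my $empty = '/exon=""';

# .+ needs at least one character between the quotes, so it fails here.
my $plus_matches = ($empty =~ /\/exon="(.+)"/) ? 1 : 0;

# .* accepts zero characters, so the empty list still matches.
my $star_matches = ($empty =~ /\/exon="(.*)"/) ? 1 : 0;

print "plus matches: $plus_matches, star matches: $star_matches\n";

# ';' has no special meaning in a pattern, so both splits behave the same.
my @unescaped = split ';',  '1-48;334-385';
my @escaped   = split '\;', '1-48;334-385';

print "unescaped: @unescaped\n";
print "escaped:   @escaped\n";
```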

      ~Particle

        Yep, point taken. If, as in your later post, a blank set of inputs is meant to output for example 'Intron' with nothing after it, then mine fails.

        Mine just prints nothing if there is no data for the input line. I didn't know which way is correct, because I lost some of the example code due to some lovely DB errors this site keeps throwing me.

        also, you don't need to escape semi-colon (;).

        Ahh the wonders of being an incompetent novice, I'd say it doesn't hurt, but no doubt you'll give me an example of when it can :)
Re: about regular expression
by particle (Vicar) on Feb 02, 2002 at 16:22 UTC
    this should do exactly what you want. i cleaned up your code a little, and added my comments with ##. you can look up info in perldoc on the items in parentheses. in particular, you may want to look at shift, FileHandle, perlre, split, and while.

    some nodes you might want to read are:
    while or foreach?
    Opening files
    Use strict warnings and diagnostics or die
    Death to Dot Star!

    best of luck in the future!

    #!/usr/local/bin/perl -w
    use strict; ## use strict, use strict, use strict!!!
    $|++; ## turn on autoflush for STDOUT
    use FileHandle;

    # A program that accepts an input file: Scorpion database from Gen Bank
    # and will output the database in BioWare format

    ## used descriptive variable names
    ## used shift operator to process arguments (shift) and die with usage
    my $infile = shift || die "usage: $0 infile outfile\n";
    my $outfile = shift || die "usage: $0 infile outfile\n";
    my $item_count=1;
    my $item='D000001';
    my $IN = new FileHandle;
    my $OUT = new FileHandle;

    ## check status of open and print $! for descriptive error message
    open($IN, "< " . $infile) or die "Can't open $infile. $!";
    open($OUT, "> " . $outfile) or die "Can't open $outfile. $!";

    while(<$IN>) {
        ## remove trailing newline
        chomp;

        ## skip blank lines
        next if( /^\s*$/ );

        ## print newline if end of record
        if( m{^//$} ) { print $OUT "\n"; next; }

        ## expects date format like 1or2-three-four characters (perlre)
        if( /^DATE\s+(..?)-(...)-(....)$/ ) { ## very fast regex
            print $OUT "DBACC\t", $item++, "\n";
            print $OUT "DATE\t\"$1-$2 $3\"\n";
        }
        ## non-greedy match between double quotes (perlre)
        elsif( /^\s*\/exon="(.*?)"$/ ) {
            ## handle null case
            print $OUT "Exon\t{Translation -}\n" unless length $1;
            ## separate the matched string and process each (split)
            for(split ';', $1) {
                print $OUT "Exon\t{Translation\%", $_ ,"}\n";
            }
        }
        ## non-greedy match between double quotes (perlre)
        elsif( /^\s*\/intron="(.*?)"$/ ) {
            ## handle null case
            print $OUT "Intron\t{Translation -}\n" unless length $1;
            ## separate the matched string and process each
            for(split ';', $1) {
                print $OUT "Intron\t{Translation\%", $_ ,"}\n";
            }
        }
    }

    ## check status of close and print $! for descriptive error message
    close($IN) or die "Can't close $infile. $!";
    close($OUT) or die "Can't close $outfile. $!";
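
The `$item++` in the listing relies on Perl's string increment: a value made of letters followed by digits, and used only as a string, is incremented as a string with carry, so the prefix and the zero padding survive. A minimal sketch, with made-up variable names:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# String increment keeps the letter prefix and the zero padding.
my $item = 'D000001';
my $printed = $item++;    # post-increment returns the old value first
print "printed: $printed, next: $item\n";

# The carry propagates through the digits like ordinary counting.
my $nine = 'D000009';
$nine++;
print "after nine: $nine\n";
```

Because the post-increment returns the old value, the first record is labelled D000001 and the counter is ready for the next one.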

    ~Particle

Re: about regular expression
by trs80 (Priest) on Feb 02, 2002 at 15:44 UTC
    Your exon and intron regex was limiting the results, if I understand your question correctly. I reworked the code to:

    foreach(<INFO1>) {
        # got rid of the '*' which is too greedy
        # I made the date matches specific to the
        # example input you provided in your post
        # you may need to adjust for more options
        # in the matches depending on your data
        # consistency
        if(/^DATE\s+(\d{2})-(\w{3})-(\d{4})/){
            print OUT "DBACC\t $no\n";
            print OUT "Date\t $1-$2-$3\n";
            $no++;
        # made one conditional that gets both
        # exon and intron. Used a [] (character
        # class match) instead of the \d*-
        # The + after it allows for 1 or more
        # of a 0-9 , ';' or '-'
        } elsif(/\s+\/(intr|ex)on="([\d;-]+)"\n/) {
            # added a split on ';' in case you want
            # or need to do something with each one
            # separated by a ';'
            my @values = split(/;/,$2);
            foreach (@values) {
                # needed to uppercase the matched prefix
                # based on your example output since
                # the match was on the lowercase prefix
                print OUT ucfirst($1) . "on\t \{Translation\%$_\}\n";
            }
            # if you don't need to do the split just do this
            # print OUT ucfirst($1) . "on\t \{Translation\%$2\}\n";
        } else {
            print OUT "line $counter\n";
        }
        $counter++;
    }
    There are several good nodes on regex in the tutorial section. See the gotcha one in particular.
