PerlMonks  

This regular expression has me stumped

by tsk1979 (Scribe)
on May 01, 2008 at 08:19 UTC

tsk1979 has asked for the wisdom of the Perl Monks concerning the following question:

I have been doing Perl for quite some time now, and have been able to write some pretty advanced regular expressions, but now a relatively simple (I hope) problem has me stumped. Many times while running testcases we diff a gold file against the generated results file. Both files have lines which contain

file /user/name/some/path/to/filename@@ dumped: replaced /user/name/blah/blah/filename

Sometimes we have this

@@@@user/some/file/filename.sdc: dumped

Now obviously the diff will fail whenever the run path changes. So I was trying this s:/.*/\(\w\+\)/\1/g equivalent and the like, but if you have 2 paths dumped on the same line, it will eat up the text in between too because of its greedy nature. What I really want is a bulletproof regexp which will change every instance of /home/user/blah/blah...../filename to filename. This seems so simple yet so difficult because sometimes the filename is followed by whitespace, sometimes by @, sometimes by :, or sometimes by any other special character (but never a word or a number).

Replies are listed 'Best First'.
Re: This regular expression has me stumped
by tachyon-II (Chaplain) on May 01, 2008 at 08:42 UTC

    Not every problem is best solved with a big fat regex.

    while (my $line = <FILE>) {
        my @files = map { m!/([\w\.\-]+)\W*$!; $1 } grep { m!/! } split ' ', $line;
        # blah
    }

    The logic goes: split on whitespace, ignore all tokens that don't have a file path separator / with grep, then get the last bit after the / up to the end or optional \W* using map. The character class [\w\.\-] should match most filenames. Normally I would use [^/] but this is problematic in this case. Should work on your data as described.

      I was hung up on regexp because I want this via a command line perl -nei.bak..... I checked my log files: /blah/blah/blah/filename can be followed by whitespace, a "@", a ",", or a ":". I have searched for Perl non-greedy matching and I suspect
      /.*?[@:,\s+]/
      will actually match the whole /blah/blah/blah/filename.ext. The problem here is how to retain the filename.

        You almost never want .* A negated character class is generally better. For example m!/([^/]+)$! will grab the last bit of the filepath reliably, but the regex posted above in the map should DWIM.

        You could certainly code the example above as a one liner but it seems a waste of time to me. You can make a reusable 4 line script in less time than it will take fiddling. You can put options like -p -F -n on the shebang. As a one liner it would be like:

        perl -F -ane 'print map{"$_\n"} map{ } grep { } @F' <file>

        where the map and grep blocks are as above.
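A minimal sketch of the ".* versus negated character class" advice, for illustration only. The sub names and the sample line are made up here, and \S (that is, [^\s]) stands in as the negated class for the multi-path case:

```perl
use strict;
use warnings;

# Greedy .* runs from the first slash on the line to the last one,
# so the text between two paths is swallowed as well.
sub strip_greedy {
    my ($s) = @_;
    $s =~ s{/.*/}{}g;
    return $s;
}

# \S cannot cross whitespace, so each path is trimmed on its own.
sub strip_class {
    my ($s) = @_;
    $s =~ s{/\S*/}{}g;
    return $s;
}

# The negated class [^/] anchored at the end grabs the last path component.
sub last_component {
    my ($path) = @_;
    my ($name) = $path =~ m!/([^/]+)$!;
    return $name;
}

my $line = 'replaced /user/a/file1 with /user/b/file2';
print strip_greedy($line), "\n";
print strip_class($line), "\n";
print last_component('/blah/blah/blah/filename.ext'), "\n";
```

The first call loses "file1 with", which is the greedy behaviour the original question complains about; the second keeps both filenames.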

        Picking up on the theme you were following, I got this to work. I haven't thought a lot about corner cases, performance or reusability, so GrandFather's and tachyon-II's solutions are probably better.

        update: apparently I'm just confused on this matter && added comment on second s/// with no effect: I didn't like doing the substitution twice just to get the end-of-line anchor to work. Perhaps some wiser monks can explain that to me. update: That was before I added chomp, so never mind . . .

        #!/usr/bin/perl -W
        $\="\n";
        use strict;
        use warnings;

        while (<DATA>) {
            chomp;
            print $_;
            s/\/(?:[^\@:,\s+]*\/)(.*?)[\@:,\s+]*/\/new\/path\/$1/g;
            #s/\/(?:[^\@:,\s+]*\/)(.*?)[\@:,\s+]*$/\/new\/path\/$1/g;
            print $_;
            print '';
        }

        # produces:
        # C:\chas_sandbox>
        # 683879resp.pl
        # file /user/name/some/path/to/filename@@ dumped: replaced /user/name/blah/blah/filename
        # file /new/path/filename@@ dumped: replaced /new/path/filename
        #
        # @@@@user/some/file/filename.sdc: dumped
        # @@@@user/new/path/filename.sdc: dumped

        __DATA__
        file /user/name/some/path/to/filename@@ dumped: replaced /user/name/blah/blah/filename
        @@@@user/some/file/filename.sdc: dumped


        #my sig used to say 'I humbly seek wisdom. '. Now it says:
        use strict;
        use warnings;
        I humbly seek wisdom.

      No need for the "; $1" in map { m!/([\w\.\-]+)\W*$!; $1 }

      since a match in list context returns the captured substrings and the block of a map is in list context.
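To illustrate the list-context point, here is a rough sketch of the split/grep/map pipeline with the "; $1" dropped. The sub name filenames_in is hypothetical; the regex is the one from tachyon-II's reply:

```perl
use strict;
use warnings;

# Hypothetical helper: return the trailing filename of every
# whitespace-separated token that contains a slash.
sub filenames_in {
    my ($line) = @_;
    return map  { m!/([\w\.\-]+)\W*$! }   # list context: the capture, or nothing
           grep { m!/! }                  # keep only tokens with a path separator
           split ' ', $line;
}

my $line = 'file /user/name/some/path/to/filename@@ dumped: replaced /user/name/blah/blah/filename';
print join(',', filenames_in($line)), "\n";
```

One side effect worth noting: a token that contains a slash but fails the match (for example one that ends in a slash) contributes an empty list, so it is silently dropped rather than producing a stale $1.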

        Good point. It was a rather off the cuff untested solution....

Re: This regular expression has me stumped
by GrandFather (Saint) on May 01, 2008 at 09:31 UTC

    For this task a little looking around helps, as does knowing what not to find. Oh, and taking care of loose ends helps too. Consider:

    use strict;
    use warnings;

    my @tests = (
        "First: /home/user/blah/filename and /home/user/blah/filename2 end",
        "/home/user/blah/filename,/home/user/blah/filename2",
        "/home/user/blah/filename; /home/user/blah/filename2",
        "/home/user/blah/filename\@10:30 /home/user/blah/filename2",
    );

    for my $str (@tests) {
        $str =~ s!(?:^|/)[^\s,@;]*(?<=/)([^\s,@;]+?)(?=[\s,@;]|$)!$1!g;
        print "$str\n";
    }

    Prints:

    First: filename and filename2 end
    filename,filename2
    filename; filename2
    filename@10:30 filename2

    Perl is environmentally friendly - it saves trees

      Hmm, your solution looks like it's working! Great. Now the big problem: I cannot make head or tail of the regexp :( Could you explain a little bit about what exactly happened up there? It made a whooshing sound and flew right by :)

        :-D

        Ok, let's take it a little at a time:

        s! you know, although it's possible you didn't know you can use pretty much any character for the expression delimiters.

        (?:^|/) matches (without capturing) either the start of the string or a /.

        [^\s,@;]* matches as many characters that aren't in the set of terminal characters as can be found.

        (?<=/) looks back and asserts the last character matched was /.

        ([^\s,@;]+?) matches and captures as few non-terminal characters as it can and still find a match. That's the filename that you want.

        (?=[\s,@;]|$) looks ahead and asserts that the next character is a terminal character or the end of the string.

        !$1!g you are probably completely familiar with - replace all the matched stuff with the captured string and do it for every match that can be found.

        So with a little head scratching the introductory line of my initial reply might make a bit more sense along with the regex. For further study consult perlretut, perlre and perlreref.
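The walkthrough above can also be written into the pattern itself with the /x modifier. This is only a sketch: the sub name strip_dirs is made up, and the regex is the one from the reply, re-spaced and commented:

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution explained above.
sub strip_dirs {
    my ($str) = @_;
    $str =~ s{
        (?:^|/)          # start of string or a slash (not captured)
        [^\s,@;]*        # greedy run of non-terminal characters
        (?<=/)           # the previous character must be a slash
        ([^\s,@;]+?)     # capture as few non-terminal characters as possible
        (?=[\s,@;]|$)    # next: a terminal character or end of string
    }{$1}gx;
    return $str;
}

print strip_dirs('First: /home/user/blah/filename and /home/user/blah/filename2 end'), "\n";
print strip_dirs('/home/user/blah/filename,/home/user/blah/filename2'), "\n";
```

Keeping sigils out of the /x comments matters, because the pattern is still subject to variable interpolation before the comments are stripped.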


        Perl is environmentally friendly - it saves trees
        To supplement GrandFather's excellent explanation, here is the output generated by YAPE::Regex::Explain.
        use warnings;
        use strict;
        use YAPE::Regex::Explain;

        my $re = 's!(?:^|/)[^\s,@;]*(?<=/)([^\s,@;]+?)(?=[\s,@;]|$)!$1!g';
        my $parser = YAPE::Regex::Explain->new($re);
        print $parser->explain;

      I found a corner case.... :) How about ../filename or ../some/path/filename or ../../some/path/filename

        Another one: /some/silly/path/here/../../another/silly/path/filename

      this can work right ? /fjsdklf/fjsldkfs/fsjdklf-fs-0-fsf/../fjskfjs/../../../fsfkslf/filename
      ../../../../filename ../hello dofghello/two/forut/../filename2
      Will this work ../../../../jfsdfjskdlfjs/../fjsklf/fjksfjskflsd/filename
      I will do replacement for ../filename
      this can work right /fjsdklf/fjsldkfs/fsjdklf-fs-0-fsf/../fjskfjs/../../../fsfkslf/filename
      I will think of even/more/silly/../../harder/cases/../analysis/filename and do it ../twice as well as put/some/path/and/make/it/thrice

      We always assume that the whole path starts with /. But the path can be some/path/to/filename also! In that case this will definitely fail. I am scratching my head as to what kind of check to put in for that. Helllp!! :)

        Solved!
        use strict;
        use warnings;

        foreach my $file (@ARGV) {
            # Three-argument open with a lexical filehandle
            open(my $in, '<', $file) or die "Cannot open input file $file: $!\n";
            while (<$in>) {
                s!(?:^|\w*/|\.\./)[^\s,@;:]*(?<=/)([^\s,@;:]+?)(?=[\s,@;:]|$)!$1!g;
                # s!\.\.!!g;
                print $_;
            }
            close $in;
        }
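A quick way to try the final substitution against the corner cases raised in the thread, as a sketch. The sub name basename_paths is hypothetical; the regex is copied from the script above and the sample lines are taken from earlier posts:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the "Solved!" script.
sub basename_paths {
    my ($line) = @_;
    $line =~ s!(?:^|\w*/|\.\./)[^\s,@;:]*(?<=/)([^\s,@;:]+?)(?=[\s,@;:]|$)!$1!g;
    return $line;
}

print basename_paths($_), "\n" for (
    'some/path/to/filename',                                    # no leading slash
    'see ../../some/path/filename: dumped',                     # relative, ".." first
    '/some/silly/path/here/../../another/silly/path/filename',  # ".." in the middle
    '@@@@user/some/file/filename.sdc: dumped',                  # from the question
);
```

Each sample line should come out with only the filename left in place of the path, with the surrounding text untouched.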
