PerlMonks  

My regex works, but I want to make sure it's not blind luck

by SergioQ (Beadle)
on Dec 21, 2020 at 23:25 UTC ( [id://11125588]=perlquestion )

SergioQ has asked for the wisdom of the Perl Monks concerning the following question:

My regex works, but I want to make sure it's not blind luck, because I am still too green.

In my code I have a full link to an image, and the image is guaranteed to end with the picture extension. So I would like the period and the extension.

Using RegEx101 I came up with this, which works, but I would like to explain why I think it works, so that if I got it wrong and am just being lucky, someone in the know can tell me.

^.*(\..*)$

This starts at the beginning and everything is matched, and I extract the last period and extension because \. somehow stops the greedy wildcards preceding it, and the $ anchor grabs everything till the end?

Replies are listed 'Best First'.
Re: My regex works, but I want to make sure it's not blind luck
by GrandFather (Saint) on Dec 21, 2020 at 23:45 UTC

    Put it in a test script. Try it against edge cases (unusual test strings that might work when they shouldn't, or might not work when they should). For example:

    use strict;
    use warnings;

    my @tests = (
        ".",
        "",
        "A sentence.",
        ".gitignore",
        "word.doc",
        "a.dotted.name",
    );

    print /^.*(\..*)$/ ? "Matched '$1' in " : "Failed", " '$_'\n" for @tests;

    Prints:

    Matched '.' in '.'
    Failed ''
    Matched '.' in 'A sentence.'
    Matched '.gitignore' in '.gitignore'
    Matched '.doc' in 'word.doc'
    Matched '.name' in 'a.dotted.name'

    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
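
The same edge cases can also be written as assertions, so that a later change to the pattern fails loudly instead of needing a visual check. This is only a sketch: the helper name `extension` and the expected values (taken from the output printed above) are assumptions for illustration, not part of the original reply.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical expectation table: input => expected capture (undef = no match).
my @cases = (
    [ '.',             '.'          ],
    [ '',              undef        ],
    [ 'A sentence.',   '.'          ],
    [ '.gitignore',    '.gitignore' ],
    [ 'word.doc',      '.doc'       ],
    [ 'a.dotted.name', '.name'      ],
);

# Hypothetical helper wrapping the pattern from the question.
sub extension {
    my ($str) = @_;
    return $str =~ /^.*(\..*)$/ ? $1 : undef;
}

for my $case (@cases) {
    my ($input, $want) = @$case;
    is( extension($input), $want, "input '$input'" );
}

done_testing();
```

Adding a new edge case is then a one-line change to `@cases`.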
Re: My regex works, but I want to make sure it's not blind luck
by kcott (Archbishop) on Dec 22, 2020 at 02:58 UTC

    G'day SergioQ,

    Your approach to learning and fully understanding what you are doing is very good; and in response to that, you're getting good advice.

    However, in this particular case, I don't believe a regex is the right tool for the job. It would be far more efficient to use Perl's built-in string handling functions.

    $ perl -E '
        my @images = qw{a.png b.gif c.svg d.jpg};
        for (@images) {
            say substr $_, rindex $_, ".";
        }
    '
    .png
    .gif
    .svg
    .jpg

    See the documentation for those: substr and rindex.
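
One edge case is worth guarding against if the "guaranteed to end with the picture extension" assumption ever breaks: `rindex` returns -1 when the string contains no dot, and `substr` with an offset of -1 returns the last character. A minimal sketch of a guarded version follows; the helper name `extension_of` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the last dot and everything after it,
# or an empty string when the name contains no dot at all.
sub extension_of {
    my ($name) = @_;
    my $pos = rindex $name, '.';
    return $pos < 0 ? '' : substr $name, $pos;
}

for my $name (qw{a.png d.blah.jpg README}) {
    printf "%-12s => '%s'\n", $name, extension_of($name);
}
```

Without the `$pos < 0` check, the dot-less name `README` would yield `E` rather than an empty string.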

    Note: if you provide some representative, sample data, I may have additional, or even different, advice.

    — Ken

      I guess if we are going to beat this thing to death, split() could also be used:
      use strict;
      use warnings;

      my @images = qw{a.png b.gif c.svg d.blah.jpg ..};

      foreach (@images) {
          my $after_last_dot = (split (/\./,$_))[-1];
          $after_last_dot //= '';
          print ".",$after_last_dot,"\n";
      }
      __END__
      .png
      .gif
      .svg
      .jpg
      .    <=might want something else here?

      I believe that the substr/rindex approach will be by far the fastest, since these are very simple functions. The regex will be slower, but in my opinion it is much easier to understand, and I would prefer it for that reason. For most of my work, the speed difference would not be of any significance whatsoever. There are of course always exceptions if you do something enough times! I suppose split() would wind up somewhere in between performance-wise, although without benchmarking I can't be sure. It could actually be slower than the first regex method, because more things are pushed onto the output list.

      Anyway, in the spirit of "more than one way to do it", see the split() solution above. I did add code to handle the "undefined" case. The //= operator is a cool thing.
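
The speed question can be settled by measurement rather than guesswork. The sketch below uses the core Benchmark module; the sample file name and the three labels are made up, and the numbers will depend on the Perl build and the input, so no results are shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $name = 'some/long/path/to/picture.file.jpg';   # made-up sample input

# Three candidate ways of getting the dot plus extension.
my %candidates = (
    regex  => sub { my ($ext) = $name =~ /(\.[^.]*)$/; $ext },
    rindex => sub { substr $name, rindex $name, '.' },
    split  => sub { '.' . ( split /\./, $name )[-1] },
);

# Run each candidate for about one CPU second and print a comparison table.
cmpthese( -1, \%candidates );
```

All three candidates return `.jpg` for this input, so the comparison is like for like.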

Re: My regex works, but I want to make sure it's not blind luck
by jcb (Parson) on Dec 22, 2020 at 04:29 UTC

    As kcott mentioned, this might be an example of a problem better solved without regex, but since you have indicated a desire to learn, I will explain:

    The pattern qr/^.*(\..*)$/ will always backtrack but happens to work (and will continue to work) because of the (documented and stable) order in which the regex engine considers possible matches. This pattern starts at the beginning of the buffer (and only there) /^/, then matches the entire string /.*/, then fails to match the /\./, so the regex engine begins backtracking, removing characters from the greedy quantifier's match until a /\./ matches or the beginning of the string is reached (the match fails if this occurs). After the regex engine has found a position where /\./ can match, the second /.*/ is applied, matching the suffix after the /\./ and completing the capture group. The final /$/ is an assertion that the capture group extends to the end of the string, but for single-line input it is not actually needed because the regex engine will consider that match first and it will succeed. Including the final /$/ is good practice, however, since it clearly indicates the intent to later programmers.
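
The effect of that matching order can be seen by comparing the greedy pattern with a non-greedy variant and with a variant that drops the final anchor. This is a small illustration on a made-up input, not a trace of the engine itself.

```perl
use strict;
use warnings;

my $str = 'a.dotted.name';

# Greedy: the first .* takes as much as it can, then gives characters back
# until \. can match, so the captured dot is the last one.
my ($greedy) = $str =~ /^.*(\..*)$/;

# Non-greedy: .*? takes as little as it can, so the captured dot is the first one.
my ($lazy) = $str =~ /^.*?(\..*)$/;

# Same greedy pattern without the trailing $ anchor.
my ($no_anchor) = $str =~ /^.*(\..*)/;

print "greedy:    $greedy\n";      # .name
print "lazy:      $lazy\n";        # .dotted.name
print "no anchor: $no_anchor\n";   # .name
```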

    Depending on your input and details of the regex engine implementation, you may be able to get better performance by changing the pattern to qr/(\.[^.]*)$/, which removes the anchor at the start of the string. This pattern first searches for a dot /\./, then matches any (possibly empty) sequence of characters other than dot /[^.]*/ (dot is not special in character classes), then asserts that the end of the string was reached before the match returns success. Again, the engine can backtrack if the string contains multiple dots, but backtracking is avoided in the simple case of input containing only one dot: the regex engine will find the dot, reach the end of the string scanning only non-dot characters, and return success. If the input contains multiple dots, the regex engine will find the first dot /\./, then match characters until the next dot /[^.]*/, then fail the end-of-line assertion /$/ and backtrack to searching for the next dot, which (depending on the regex optimizer) it may already have.
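
A quick way to check what this alternative pattern captures is to run it over a few inputs and print where the capture starts. The inputs below are made up for illustration.

```perl
use strict;
use warnings;

my $ext_re = qr/(\.[^.]*)$/;

for my $str ('word.doc', 'a.dotted.name', 'A sentence.', 'no dots here') {
    if ($str =~ $ext_re) {
        # $-[1] is the offset at which capture group 1 starts.
        printf "'%s': captured '%s' at offset %d\n", $str, $1, $-[1];
    }
    else {
        printf "'%s': no match\n", $str;
    }
}
```

Because the character class uses `*`, a string ending in a dot still matches and captures just the dot.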

    There is another issue here, not always covered in RegEx101: the * quantifier matches zero-or-more, while the + quantifier matches one-or-more, and there are also {N,M} quantifiers for general number ranges. Early regular expression engines did not support the + or {N,M} quantifiers, but Perl's regex engine has both. In an {N,M} quantifier, the upper limit can be omitted for "at least N" and the comma can also be omitted for "exactly N" matches. The {N,M} form is most general: * is equivalent to {0,}, + is equivalent to {1,}, and ? is equivalent to {0,1}, but the shorter single-character forms should always be preferred in hand-written patterns for readability.
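
The equivalences between the short and the brace forms can be checked directly. The test string and the pairing of patterns below are chosen only for illustration.

```perl
use strict;
use warnings;

my $str = 'aaa';

# Each pair should capture the same text from $str.
my @pairs = (
    [ qr/^(a*)/, qr/^(a{0,})/  ],
    [ qr/^(a+)/, qr/^(a{1,})/  ],
    [ qr/^(a?)/, qr/^(a{0,1})/ ],
);

for my $pair (@pairs) {
    my ($short, $long) = @$pair;
    my ($got_short) = $str =~ $short;
    my ($got_long)  = $str =~ $long;
    print "$short => '$got_short', $long => '$got_long'\n";
}

# {N} with no comma means "exactly N".
print "exactly two: $1\n" if $str =~ /^(a{2})/;
```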

    For all the details, see perlre and "Regexp Quote-Like Operators" in perlop.

      If you want a . . . (well, I almost said "fun" but that's probably not the right word :) an interesting rabbit hole to go down check out Nondeterministic finite automaton and Deterministic finite automaton to get a better picture of what's going on under the hood in a regex engine. And/or find an undergraduate course in automata theory to audit. If you use the Regexp::Debugger that's already been mentioned above it'll let you watch and you can almost picture the state machine hopping along from node to node until it gets to a valid terminal state.

      The cake is a lie.

Re: My regex works, but I want to make sure it's not blind luck
by eyepopslikeamosquito (Archbishop) on Dec 22, 2020 at 22:03 UTC

    I pulled a face the instant I caught sight of your ^.*(\..*)$ regex: the leading ^.* looks pretty pointless, the trailing (\..*) overly generic. For some background on where I'm coming from see the classic old node from 2000: Death to Dot Star! by Ovid.

    Given you say "the image is guaranteed to end with the picture extension" I would write it something like: /\.([^.]+)$/ or /\.([a-zA-Z0-9]+)$/ or /\.(\w+)$/ or some such, depending on your requirements, the point being to be more precise than the dreaded "dot star". To illustrate, using GrandFather's example test program:

    use strict;
    use warnings;

    my @tests = (
        ".",
        "",
        "A sentence.",
        ".gitignore",
        "word.doc",
        "a.dotted.name",
    );

    print /\.([^.]+)$/ ? "Matched '$1' in " : "Failed", " '$_'\n" for @tests;

    produces:

    Failed '.'
    Failed ''
    Failed 'A sentence.'
    Matched 'gitignore' in '.gitignore'
    Matched 'doc' in 'word.doc'
    Matched 'name' in 'a.dotted.name'
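
The three suggested patterns agree on ordinary names and differ only on unusual ones. The sketch below runs all three side by side; the inputs and the labels are made up to show where they part ways.

```perl
use strict;
use warnings;

# Made-up inputs chosen to show where the three patterns disagree.
my @inputs = ('photo.jpg', 'photo.my_ext', 'photo.jp g');

my %patterns = (
    'not-a-dot'    => qr/\.([^.]+)$/,
    'alphanumeric' => qr/\.([a-zA-Z0-9]+)$/,
    'word chars'   => qr/\.(\w+)$/,
);

for my $input (@inputs) {
    for my $label (sort keys %patterns) {
        my ($ext) = $input =~ $patterns{$label};
        printf "%-12s %-14s %s\n", $input, $label,
            defined $ext ? "'$ext'" : 'no match';
    }
}
```

The underscore is accepted by `\w` but not by `[a-zA-Z0-9]`, and the space is accepted only by `[^.]`, which is why the choice depends on which extensions should count as valid.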

    If you give us a lot more specific examples of strings that should match and ones that shouldn't, we can offer a more precise regex.

    See also: Rosetta code

    Update: For an alternative to regexes, using instead mostly standard Perl facilities, such as glob, opendir, readdir, File::Glob, File::Basename, File::Copy, File::Spec, Path::Tiny and Path::Class::Dir, see:

Re: My regex works, but I want to make sure it's not blind luck
by jwkrahn (Abbot) on Dec 21, 2020 at 23:51 UTC

    At a minimum that will match either "." or ".\n".

    In other words that will match any string of non-newlines with a period in it that may or may not end in a newline.
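
The trailing-newline behaviour comes from `$`, which matches at the end of the string or just before a final newline. If that is not wanted, `\z` matches only at the very end. A short sketch with a made-up input:

```perl
use strict;
use warnings;

my $with_newline = "photo.jpg\n";   # e.g. a line read from a file without chomp

# $ also matches just before a trailing newline, so this still succeeds.
print "dollar matched, captured '$1'\n" if $with_newline =~ /^.*(\..*)$/;

# \z matches only at the very end of the string, so this fails here
# because . does not match the newline.
print "\\z matched\n" if $with_newline =~ /^.*(\..*)\z/;
```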

Re: My regex works, but I want to make sure it's not blind luck
by BillKSmith (Monsignor) on Dec 22, 2020 at 00:03 UTC
    Your explanation is reasonably good. The problem with your regex is that it can make false matches. Have you considered Regexp::Common::net?
    Bill

      The problem with your regex is that it can make false matches.

      Is there a way to tell regex to work from right to left?

        Have you a test case that fails? Why do you think "right to left" will fix the failing test case?

        Note that you can reverse a string with reverse then match against the reversed string. But that's only going to help if you can identify the problem you are trying to fix, and reversing the string somehow fixes that problem.

        Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
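
For completeness, the reverse-then-match idea looks roughly like this. It is a sketch of the workaround only, with a made-up URL, and not a recommendation over the simpler patterns elsewhere in this thread.

```perl
use strict;
use warnings;

my $url = 'http://example.com/pics/a.dotted.name.png';   # made-up input

# Reverse the string, match non-dot characters up to the first dot
# (which is the last dot of the original), then reverse the capture back.
my $reversed = reverse $url;
my ($rev_ext) = $reversed =~ /^([^.]*\.)/;
my $ext = defined $rev_ext ? scalar reverse($rev_ext) : '';

print "$ext\n";
```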
        Is there a way to tell regex to work from right to left?

        No. One of the basic principles of the regex engine is that it works from left to right (GrandFather's suggestion of reversing the string is a workaround/hack, though I personally have never seen anyone actually do this in production). Another basic principle is that the engine will stop at the first successful match, which sometimes leads to confusion when, for example, people expect .* to match more than "" (though in your example in the root node you're using the ^ $ anchors to help with that). Combine this with the idea of backtracking (Update: which of course does work from right to left, but too much backtracking can be very inefficient) and hopefully this will lead to a better understanding :-) I very much recommend a read of perlretut, and if you want to see your regex in action, then install Regexp::Debugger and run e.g. perl -MRegexp::Debugger -e '"foo.bar" =~ /^.*(\..*)$/'
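
The "first successful match" principle is easiest to see with a quantifier that is allowed to match nothing. The input below is made up for illustration.

```perl
use strict;
use warnings;

my $str = 'abc123';

# \d* may match zero digits, and offset 0 is tried first, so the match is empty.
$str =~ /(\d*)/;
printf "\\d* captured '%s' at offset %d\n", $1, $-[0];

# \d+ needs at least one digit, so the engine moves right until it finds one.
$str =~ /(\d+)/;
printf "\\d+ captured '%s' at offset %d\n", $1, $-[0];
```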

        Not exactly right to left, but this code looks for a line that ends with a period, followed by at least one character that is not a period before the end of the line or string. The period and characters that follow it are captured. If the match fails, $suffix is set to a null string. You can specify what you want in relation to the end of the string, but this is not "backwards" or right to left - this is the "rightmost" pattern that matches.

        use strict;
        use warnings;

        foreach my $test ('..', 'file.txt', 'blah.abc.txt') {
            my ($suffix) = $test =~ /(\.[^.]+)$/;
            $suffix //= '';    # suffix is null string if no match
            print "test=$test suffix=$suffix\n";
        }
        __END__
        test=.. suffix=
        test=file.txt suffix=.txt
        test=blah.abc.txt suffix=.txt

        There are many modules like: File::Basename.
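
As a sketch of the module route, `fileparse` from the core File::Basename module accepts a suffix pattern and returns the matched suffix as its third value. The paths below are made up, and the suffix pattern is one possible choice rather than the only one.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

for my $path ('/pics/file.txt', '/pics/blah.abc.txt', '/pics/README') {
    # The second argument is a suffix pattern; fileparse returns the
    # base name, the directory part, and whatever suffix matched.
    my ($base, $dir, $suffix) = fileparse($path, qr/\.[^.]*/);
    print "path=$path suffix=$suffix\n";
}
```

When no suffix matches, as with `README`, the third value is an empty string.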

        Update: I guess probably: my ($suffix) = $test =~ /(\.\w+)$/;
        Word characters are A-Za-z0-9_. Space and control chars are not allowed.

Node Type: perlquestion [id://11125588]
Approved by GrandFather
Front-paged by kcott