PerlMonks  

Re: Regex to detect file name

by kcott (Archbishop)
on Jul 06, 2018 at 12:12 UTC ( [id://1218041]=note )


in reply to Regex to detect file name

G'day lirc201,

Welcome to the Monastery.

You've missed some information which could be important. Is there a minimum number of characters? Filenames can't start with [_.-], but can they end with all, some or none of those?

Just this week, I implemented something along these lines for production code. The requirements were: names could be just one character long; the start and end characters (the same character for one-character names) must match [A-Za-z0-9]; the middle characters for names with three or more characters must match [A-Za-z0-9_.-]. The regex for this:

qr{\A[A-Za-z0-9](?:[A-Za-z0-9_.-]*?[A-Za-z0-9]|)\z}

Note that, in a bracketed character class, '.' is not special and '-' is only special when between two characters to form a range: as you can see, you don't actually need to escape any characters.
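As a quick check of that point, here is a minimal sketch that tests single characters against the unescaped class. The sample characters are hypothetical, chosen only to show that '.' does not act as a wildcard and that the trailing '-' does not form a range.

```perl
use strict;
use warnings;

# Character class copied from the regex above, with nothing escaped.
my $class = qr{\A[A-Za-z0-9_.-]\z};

# Hypothetical sample characters: three that are in the class and
# three that would only match if '.' or '-' were treated specially.
my %result;
for my $char ('.', '-', '_', 'x', '+', '/') {
    $result{$char} = $char =~ $class ? 'match' : 'no match';
}

printf "'%s': %s\n", $_, $result{$_} for sort keys %result;
```

If '.' were a wildcard inside the brackets, '+' and '/' would match too; they don't.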

Here's a limited test:

$ perl -E '
    my @x = (qw{A AA AAA _ __ ___ -A A- A.A A. .A A-A A_A}, "A\n", "A\tA");
    my $re = qr{\A[A-Za-z0-9](?:[A-Za-z0-9_.-]*?[A-Za-z0-9]|)\z};
    say "|$_| is ", /$re/ ? "OK" : "BAD" for @x
'
|A| is OK
|AA| is OK
|AAA| is OK
|_| is BAD
|__| is BAD
|___| is BAD
|-A| is BAD
|A-| is BAD
|A.A| is OK
|A.| is BAD
|.A| is BAD
|A-A| is OK
|A_A| is OK
|A
| is BAD
|A	A| is BAD

Modify that to suit your own filename specifications. Add some more tests which should probably include digits and lowercase letters.
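One way to add those tests is a small Test::More script along these lines. This is only a sketch: the names in @good and @bad are hypothetical samples, not part of any real specification, so swap in names that reflect your own rules.

```perl
use strict;
use warnings;
use Test::More;

# Same regex as above.
my $re = qr{\A[A-Za-z0-9](?:[A-Za-z0-9_.-]*?[A-Za-z0-9]|)\z};

# Hypothetical sample names, including digits and lowercase letters.
my @good = qw{a 7 a1 report.txt my-file_2.tar.gz 2018-07-06.log x..y};
my @bad  = ('', qw{.hidden _tmp -opt trailing. trailing- trailing_}, 'two words');

like   $_, $re, "accepts '$_'" for @good;
unlike $_, $re, "rejects '$_'" for @bad;

# A trailing newline must be rejected: \z only matches at the very end.
unlike "name\n", $re, 'rejects a trailing newline';

done_testing();
```

Note that 'x..y' passes: the regex as written allows consecutive punctuation in the middle. If your specification forbids that, the middle part needs tightening.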

— Ken

Replies are listed 'Best First'.
Re^2: Regex to detect file name
by AnomalousMonk (Archbishop) on Jul 06, 2018 at 16:01 UTC

    Use of POSIX character classes (see perlre, perlrecharclass) and /x can make regexes easier on the eye:
        my $re = qr{ \A [[:alnum:]] (?: [[:alnum:]_.-]* [[:alnum:]])? \z }xms;
    is equivalent.


    Give a man a fish:  <%-{-{-{-<

      is equivalent

      No, it really isn't:

      use strict;
      use warnings;
      use Test::More tests => 2;

      my $in = "\N{LATIN SMALL LETTER C WITH CEDILLA}";
      like ($in, qr{\A[A-Za-z0-9](?:[A-Za-z0-9_.-]*?[A-Za-z0-9]|)\z}, 'kcott');
      like ($in, qr{ \A [[:alnum:]] (?: [[:alnum:]_.-]* [[:alnum:]])? \z }xms, 'AnomalousMonk');

        Hmmm... Good point. Well, I think in a case like this, I'd still like to try to take advantage of some degree of factoring:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "use charnames ':full';
         ;;
         use Test::More tests => 2;
         ;;
         my $in = qq{\N{LATIN SMALL LETTER C WITH CEDILLA}};
         ;;
         like ($in, qr{\A[A-Za-z0-9](?:[A-Za-z0-9_.-]*?[A-Za-z0-9]|)\z}, 'kcott');
         ;;
         my $alpha      = qr{ [A-Za-z0-9] }xms;
         my $alpha_plus = qr{ $alpha | [_.-] }xms;
         like ($in, qr{ \A $alpha (?: $alpha_plus* $alpha)? \z }xms, 'AnomalousMonk');
         "
        1..2
        not ok 1 - kcott
        #   Failed test 'kcott'
        #   at -e line 1.
        #                   'ô'
        #     doesn't match '(?^:\A[A-Za-z0-9](?:[A-Za-z0-9_.-]*?[A-Za-z0-9]|)\z)'
        not ok 2 - AnomalousMonk
        #   Failed test 'AnomalousMonk'
        #   at -e line 1.
        #                   'ô'
        #     doesn't match '(?^msx: \A (?^msx: [A-Za-z0-9] ) (?: (?^msx: (?^msx: [A-Za-z0-9] ) | [_.-] )* (?^msx: [A-Za-z0-9] ))? \z )'
        # Looks like you failed 2 tests of 2.
        (Both the /a modifier and the (?a) embedded modifier seem to work with the original
            qr{ \A [[:alnum:]] (?: [[:alnum:]_.-]* [[:alnum:]])? \z }xms
        regex to suppress extended Unicode matching, but I don't fully understand the interaction of this and related flags with POSIX character classes. And it's one more modifier to remember!)
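A minimal sketch of that /a behaviour, assuming Perl 5.14 or later (where /a was introduced). Under /a, the POSIX classes, like \d, \s and \w, match only ASCII characters. The test character is GREEK CAPITAL LETTER ALPHA, picked because its code point is above 255 and so is always matched with Unicode rules; the variable names are just for illustration.

```perl
use strict;
use warnings;

# GREEK CAPITAL LETTER ALPHA (U+0391): looks like 'A' but is not ASCII.
my $alpha = "\x{391}";

# The same POSIX-class regex, without and with the /a modifier.
my $unicode = qr{ \A [[:alnum:]] (?: [[:alnum:]_.-]* [[:alnum:]])? \z }xms;
my $ascii   = qr{ \A [[:alnum:]] (?: [[:alnum:]_.-]* [[:alnum:]])? \z }xmsa;

my %matched = (
    'alpha without /a' => ($alpha =~ $unicode ? 1 : 0),
    'alpha with /a'    => ($alpha =~ $ascii   ? 1 : 0),
    'ASCII A with /a'  => ('A'    =~ $ascii   ? 1 : 0),
);

print "$_: $matched{$_}\n" for sort keys %matched;
```

With /a the Greek letter is rejected while plain 'A' still matches, which is the restriction the explicit [A-Za-z0-9] class gives without any extra modifier.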


        Give a man a fish:  <%-{-{-{-<

      G'day AnomalousMonk,

      With regard to the POSIX character class, ++hippo has already pointed out the problem with that. You can certainly be forgiven for that because the documentation appears to be wrong. From "perlrecharclass: POSIX Character Classes":

      Perl recognizes the following POSIX character classes:

      ...

      2. alnum Any alphanumeric character ("[A-Za-z0-9]").

      I rarely use the POSIX classes and wasn't aware of that discrepancy. Anyway, while possibly "easier on the eye", that's likely to result in a fair amount of frustration for someone attempting to perform debugging and assuming the documentation is correct.

      The problem could be further exacerbated when input characters may not appear to be ones that should be failing. While hippo's example using "LATIN SMALL LETTER C WITH CEDILLA" (ç) was fairly obvious, the glyphs for some characters (depending on the font) may be identical or so similar that it's difficult to tell them apart. Consider "LATIN CAPITAL LETTER A" (A) and "GREEK CAPITAL LETTER ALPHA" (Α):

      $ perl -C -E '
          use utf8;
          say "$_ (", ord $_, "): ", /\A[A-Za-z0-9]\z/ ? "✓" : "✗"
              for qw{A Α}
      '
      A (65): ✓
      Α (913): ✗
      
      $ perl -C -E '
          use utf8;
          say "$_ (", ord $_, "): ", /\A[[:alnum:]]\z/ ? "✓" : "✗"
              for qw{A Α}
      '
      A (65): ✓
      Α (913): ✓
      

      As far as the 'x' modifier goes, I don't disagree that it can improve readability; however, where it's felt necessary to use it — either because the regex is particularly complex or it's code that junior developers will need to deal with — spreading the regex across multiple lines and including comments might be even better:

      my $re = qr{
          \A                      # Assert start of string
          [A-Za-z0-9]             # Must start with one of these
          (?:                     # Followed by either
              [A-Za-z0-9_.-]*?    # Zero or more of these
              [A-Za-z0-9]         # But ending with one of these
          |                       # OR
                                  # Nothing
          )
          \z                      # Assert end of string
      }x;

      And, with 5.26 or later, perhaps even clearer as:

      my $re = qr{
          \A                          # Assert start of string
          [A-Z a-z 0-9]               # Must start with one of these
          (?:                         # Followed by either
              [A-Z a-z 0-9 _ . -]*?   # Zero or more of these
              [A-Z a-z 0-9]           # But ending with one of these
          |                           # OR
                                      # Nothing
          )
          \z                          # Assert end of string
      }xx;

      We've already had exhaustive discussions about the 'm' and 's' modifiers. Use them if you want to follow PBP suggestions but understand that they do absolutely nothing here: there are no '^' or '$' assertions that 'm' might affect; there's no '.' (outside a bracketed character class) that 's' might affect.
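To see that in practice, this sketch runs the same regex with and without /ms over a few inputs and counts any disagreements. The inputs are hypothetical; they were picked to include embedded and trailing newlines, where 'm' and 's' would show up if they mattered.

```perl
use strict;
use warnings;

# Identical patterns; only the modifiers differ.
my $plain = qr{\A[A-Za-z0-9](?:[A-Za-z0-9_.-]*?[A-Za-z0-9]|)\z};
my $ms    = qr{\A[A-Za-z0-9](?:[A-Za-z0-9_.-]*?[A-Za-z0-9]|)\z}ms;

# Hypothetical inputs, including newlines and the empty string.
my @inputs = ('A', 'A.A', 'A.', "A\n", "A\nB", '');

my $differences = 0;
for my $input (@inputs) {
    my $without = $input =~ $plain ? 1 : 0;
    my $with    = $input =~ $ms    ? 1 : 0;
    $differences++ if $without != $with;
}

print "inputs where /ms changed the result: $differences\n";
```

This prints a count of 0: \A and \z are not affected by 'm', and the only '.' in the pattern sits inside a bracketed class.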

      — Ken
