PerlMonks  

Defining Characters in Word Boundary?

by iaw4 (Monk)
on Jan 19, 2011 at 22:44 UTC ( [id://883217]=perlquestion )

iaw4 has asked for the wisdom of the Perl Monks concerning the following question:

Dear perl monks---

Is it possible to define the characters that '\b' treats as word characters? I am processing LaTeX code, where a macro name is a backslash followed by letters from [a-zA-Z]. I would like to write

   \\$keyword\b

where $keyword may hold, say, "Chi". The problem is that if my LaTeX code says "\\Chi_2", Perl thinks that the '_' is a word character, so \b does not match after "Chi". Same problem for "\\sqrt22".

Alternatively, is there a way to have a "zero-width" match, like a boundary? It would be tedious to have to write \\$keyword([^a-zA-Z]) and then have to substitute back $1 (because I do not want it eaten).

advice appreciated. /iaw

Replies are listed 'Best First'.
Re: Defining Characters in Word Boundary?
by ikegami (Patriarch) on Jan 19, 2011 at 22:47 UTC

    /\b/ is equivalent to /(?<=\w)(?!\w)|(?<!\w)(?=\w)/. Feel free to replace \w with a character class.
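
    A minimal sketch of that substitution, assuming macro names contain only the ASCII letters [a-zA-Z] as the question states. The names $letter_boundary and has_macro are hypothetical, chosen here for illustration.

```perl
use strict;
use warnings;

# Assumption: macro names consist only of ASCII letters.
# $letter_boundary (hypothetical name) is \b with \w swapped for [a-zA-Z].
my $letter_boundary = qr/(?<=[a-zA-Z])(?![a-zA-Z])|(?<![a-zA-Z])(?=[a-zA-Z])/;

# Returns 1 if $text contains the macro \$keyword ending at a letter boundary.
sub has_macro {
    my ($text, $keyword) = @_;
    return $text =~ /\\$keyword$letter_boundary/ ? 1 : 0;
}

for my $text ('\\Chi_2', '\\sqrt22', '\\Chic') {
    for my $keyword ('Chi', 'sqrt') {
        printf "%-8s %-5s %s\n", $text, $keyword,
            has_macro($text, $keyword) ? 'match' : 'no match';
    }
}
```

    With this boundary, "Chi" is found in "\Chi_2" and "sqrt" in "\sqrt22", while "Chi" is still not found in "\Chic".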

    It would be tedious to have to write \\$keyword([^a-zA-Z]) and then have to substitute back $1 (because I do not want it eaten).

    Don't eat it if you don't want to add it back. Equivalent without eating:

    \\$keyword(?=[^a-zA-Z])

    But you surely meant

    \\$keyword(?![a-zA-Z])

    In general, it's easier to extract the keyword, then check if it's the one you want.

    \\([a-zA-Z]+)
      In general, it's easier to extract the keyword, then check if it's the one you want.

      I agree wholeheartedly. Since the LaTeX name constraint is exact and well-understood (the characters 'a' through 'z' and the characters 'A' through 'Z'), you simply need to match just those characters. Explicitly matching the right-hand boundary isn't necessary.

      Thanks. This is what I needed to learn. I did not know the extended regex sequences in the Camel book (i.e., the (?...) sequences), chapter 5, table 5.6. Is there a meaningful difference between (?![a-z]) and (?=[^a-z])? Is the former recommended? /iaw
        Compare
        'ab' =~ /a(?!a)/
        'a' =~ /a(?!a)/
        and
        'ab' =~ /a(?=[^a])/
        'a' =~ /a(?=[^a])/
        is there a meaningful difference between (?![a-z]) and (?=[^a-z])? is the former recommended?

        Yes, they're different regular expression patterns that match different things. (?![a-z]) asserts "not followed by any of the characters from 'a' through 'z', which includes not being followed by any character." (?=[^a-z]) asserts "followed by a single character that is not any of the characters from 'a' through 'z'." The former is a negative assertion; the latter is a positive assertion.

        In your case, (?![a-z]) is what you would want to use.
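
        The difference shows up when the macro is the last thing in the string. A small sketch, using [a-zA-Z] to cover both cases of letter; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = '\\Chi';   # the macro is the last thing in the string

# Negative lookahead: succeeds because no letter follows (nothing follows at all).
my $negative = $text =~ /\\Chi(?![a-zA-Z])/ ? 1 : 0;

# Positive lookahead: fails because it needs one more character, a non-letter.
my $positive = $text =~ /\\Chi(?=[^a-zA-Z])/ ? 1 : 0;

print "negative lookahead: $negative\n";   # prints 1
print "positive lookahead: $positive\n";   # prints 0
```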

        [PerlMonks posting tip: Enclose Perl code in <code></code> tags, even code within paragraphs.]


Re: Defining Characters in Word Boundary?
by Jim (Curate) on Jan 20, 2011 at 01:16 UTC
Re: Defining Characters in Word Boundary?
by luis.roca (Deacon) on Jan 19, 2011 at 22:59 UTC
    is it possible to define the characters that '\b' matches? I am processing latex code, and their macro character space is [a-zA-Z]. I would like to write

    \\$keyword\b

    Unless I'm understanding your intentions wrong, that's the purpose of \b.

      Example: m/\bChi_2\b/

    I don't think the underscore will cause you problems within the defined \b \b but I'm sure I'll be corrected shortly if I'm wrong. :)

      UPDATE: 1.20.2011 1:30PM
    • Through help in the chatterbox and "Mastering Regular Expressions" pg. 89, I learned that \w has included _ since Perl 2. So /\bChi_2\b/ will not match.

    • "...the adversities born of well-placed thoughts should be considered mercies rather than misfortunes." — Don Quixote
      I believe he's saying that "_2" isn't part of the macro, so $keyword = 'Chi'; '...\\Chi_2...' =~ /\\$keyword\b/ should match.

        Ugh! — Apologies to the OP. I misread that.


      I'm not sure why you added the update. I don't see what it adds, and it's not true. /\bChi_2\b/ will match plenty of strings.

      'Chi_2' =~ /\bChi_2\b/ # Match
      '!Chi_2!' =~ /\bChi_2\b/ # Match

      Maybe you had a specific string in mind, but I don't see how this relates to the OP. He would not use Chi_2 in the regex pattern.

      In a world where an identifier matches /^\w+\z/, you might do something like

      ($_ = '\\Chi+3' ) =~ s/\\Ch\b/$ch/g; # Won't replace
      ($_ = '\\Chi+3' ) =~ s/\\Chi\b/$chi/g; # Will replace
      ($_ = '\\Chi_2+3') =~ s/\\Chi\b/$chi/g; # Won't replace

      But what if identifiers match /^[a-zA-Z]+\z/? You'd want the following behaviour:

      ($_ = '\\Chi+3' ) =~ s/\\Ch???/$ch/g; # Won't replace
      ($_ = '\\Chi+3' ) =~ s/\\Chi???/$chi/g; # Will replace
      ($_ = '\\Chi_2+3') =~ s/\\Chi???/$chi/g; # Will replace

      That's the OP's question.
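
      One possible way to fill in the ??? placeholder is the negative lookahead (?![a-zA-Z]) suggested earlier in the thread. The sketch below assumes that choice; replace_macro is a hypothetical helper, and 'X' is a made-up replacement value, since the thread never says what $chi holds.

```perl
use strict;
use warnings;

# Hypothetical replacement text; the thread does not say what $chi holds.
my $chi = 'X';

# Replace \$name only when no letter follows it.
sub replace_macro {
    my ($text, $name, $replacement) = @_;
    $text =~ s/\\$name(?![a-zA-Z])/$replacement/g;
    return $text;
}

print replace_macro('\\Chi+3',   'Ch',  $chi), "\n";  # \Chi+3  (won't replace)
print replace_macro('\\Chi+3',   'Chi', $chi), "\n";  # X+3     (will replace)
print replace_macro('\\Chi_2+3', 'Chi', $chi), "\n";  # X_2+3   (will replace)
```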

      As I've already mentioned, I recommend extracting the identifier, then checking if it's one of interest. This can be as simple as the following:

      s/\\([a-zA-Z]+)/ exists($vars{$1}) ? $vars{$1} : "\\$1" /eg

      The technique scales well, and it avoids the problem of matching something you've previously replaced.
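
      A runnable sketch of the extract-then-check approach. The contents of %vars and the name expand_macros are assumptions made for illustration; only the substitution itself comes from the thread.

```perl
use strict;
use warnings;

# Hypothetical lookup table of macros to replace; not from the thread.
my %vars = (Chi => 'X', sqrt => 'ROOT');

sub expand_macros {
    my ($text) = @_;
    # Grab the whole macro name, then decide whether it is one we know.
    # Unknown macros are put back unchanged, backslash included.
    $text =~ s/\\([a-zA-Z]+)/ exists($vars{$1}) ? $vars{$1} : "\\$1" /eg;
    return $text;
}

print expand_macros('\\Chi_2 + \\sqrt22 + \\Chic'), "\n";
# prints: X_2 + ROOT22 + \Chic
```

      Because each macro name is consumed whole in a single pass, "\Chic" is never mistaken for "\Chi", and replacement text is never rescanned.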

        "Maybe you had a specific string in mind, but I don't see how this relates to the OP. He would not use Chi_2 in the regex pattern."

        I did have a very similar string in mind.

        "In a world where an identifier matches /^\w+\z/, you might do something like"
           ($_ = '\\Chi_2+3') =~ s/\\Chi\b/$chi/g; # Won't replace

        I understand my update isn't contributing to the OP's original question. I'm not trying to distract from his post or the thread, simply attempting to correct what I said regarding the underscore having no effect on the RegEx's success (again the one I had in mind).

        In my original reply I was referring to matching 'Chi' within 'Chi_2' using \b. I previously said that I didn't think the underscore would be a problem. However after some help in the CB from erix and Tanktalus it was shown that an underscore would interfere with this particular match:

         say (("Chi_2" =~ /\bChi\b/) ? "match" : "no match");
         returns: "no match"

          * Thanks again to Tanktalus for this control structure.

        Again, apologies for any confusion caused.

