
Re: using substitution and pattern matching

by steves (Curate)
on Dec 18, 2004 at 15:14 UTC ( #415865=note )

in reply to using substitution and pattern matching

You need to read about capturing matches into the $1, $2, etc. variables. Basically, put the part of the pattern you want to keep in parentheses. Each parenthesized group, numbered left to right by its opening parenthesis, is available as $1, $2, and so on in the replacement string:

    use strict;
    my $test = "this is 'inquotes' o'leary";
    $test =~ s/ '(\S*)' / "$1" /g;
    print "$test\n";

Note that relying on whitespace to find just the "words" will likely fall apart in some cases, for example when a quoted word starts or ends the string or is followed by punctuation. There are better ways of doing that. Again, as part of your study, check out zero-width assertions, such as \b, which matches at a word boundary without consuming a character. Also check out lookahead assertions. In some cases, it can be as simple as replacing something like your \S non-whitespace match with a character class that excludes quotes.
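As a rough sketch of those suggestions combined (not necessarily what the original poster needs), the version below swaps the literal spaces for lookaround assertions and \S for a character class that excludes quotes. The sub name quote_words is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper name; not from the original post.
sub quote_words {
    my ($text) = @_;
    # (?<!\w)  opening quote must not follow a word character (skips o'leary)
    # [^']*    quoted text may contain anything except another single quote
    # (?!\w)   closing quote must not be followed by a word character
    $text =~ s/(?<!\w)'([^']*)'(?!\w)/"$1"/g;
    return $text;
}

print quote_words("this is 'inquotes' o'leary"), "\n";
print quote_words("'start' and 'end'"), "\n";
```

Because the assertions are zero-width, quoted words at the very start or end of the string are handled without any surrounding space. It still does not cope with apostrophes inside the quoted text.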

Replies are listed 'Best First'.
Re^2: using substitution and pattern matching
by Anonymous Monk on Dec 18, 2004 at 16:20 UTC
    I knew there was a simple answer

      It's a simple start, at least.

      Usage of non-alphabetic marks in text (in English, at least) will always pose some boundary cases that are really hard or basically impossible to treat with a straightforward, procedural algorithm (and on top of that, people who create text tend to make mistakes or ignore "rules" of style).

      For the current task, there's the problem of the possessive apostrophe without a following "s" (because the word ends in "s") -- and sometimes, punctuation will follow a close-quote (even though style manuals say it shouldn't). Here's a worst case for you:

      'You've got to talk to Miles' brother', she said.

      Easy for humans, hard for programs. There is a regex that will treat this one correctly, provided a space precedes the opening quote:

      s/ '(.*)'(\W)/ "$1"$2/; # note the greedy use of ".*"
      but it will screw up on some other case that would need a non-greedy match, like:

      When he said 'kiss the sky,' I heard 'kiss this guy.'

      You just have to make a guess what sort of mistake will happen less often (and hope your data isn't really this bad).
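To make the trade-off concrete, here is a small sketch that runs a greedy and a non-greedy variant over both sample sentences. The sub names greedy and lazy are invented for this example, the non-greedy variant adds /g as an assumption, and each string is given one space at both ends so the leading space and trailing \W in the patterns have something to match.

```perl
use strict;
use warnings;

# Both sample sentences from the thread, with one space added at each end.
my $miles = " 'You've got to talk to Miles' brother', she said. ";
my $sky   = " When he said 'kiss the sky,' I heard 'kiss this guy.' ";

# Greedy: .* runs to the LAST quote that is followed by a non-word character.
sub greedy {
    my ($text) = @_;
    $text =~ s/ '(.*)'(\W)/ "$1"$2/;
    return $text;
}

# Non-greedy: .*? stops at the FIRST quote followed by a non-word character.
sub lazy {
    my ($text) = @_;
    $text =~ s/ '(.*?)'(\W)/ "$1"$2/g;
    return $text;
}

print "greedy:", greedy($miles), "\n";   # closes the quote after "brother"
print "lazy:  ", lazy($miles),   "\n";   # closes too early, after "Miles"
print "greedy:", greedy($sky),   "\n";   # merges the two quotations into one
print "lazy:  ", lazy($sky),     "\n";   # converts each quotation separately
```

Each variant gets one sentence right and the other wrong, which is the guess described above.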

      One other hint: for stuff like this, where initial and final positions in the string might make things more complicated, it's okay to "cheat" a little: add a space or some other "safe" character at the beginning and end of the string before working on the quotes, so that the edge cases can be treated just like the non-edge cases. You can take the edge padding off when you're done.
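A minimal sketch of that padding trick, assuming a space is a safe pad character for your data. The sub name requote is hypothetical, and the pattern inside it is just one possible choice.

```perl
use strict;
use warnings;

# Hypothetical helper: pad the string, substitute, then strip the padding.
sub requote {
    my ($text) = @_;
    $text = " $text ";                      # add one "safe" space at each end
    $text =~ s/ '([^']*)'(\W)/ "$1"$2/g;    # edge quotes now match like any other
    $text =~ s/^ //;                        # remove the leading pad
    $text =~ s/ \z//;                       # remove the trailing pad
    return $text;
}

print requote("'start' middle 'end'"), "\n";
print requote("this is 'inquotes' o'leary"), "\n";
```

Without the padding, the same pattern would miss both 'start' (no space before it) and 'end' (no character after it).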
