great!
I knew there was a simple answer
thanks
It's a simple start, at least.
Usage of non-alphabetic marks in text (in English, at least) will always pose some boundary cases that are really hard or basically impossible to treat with a straightforward, procedural algorithm (and on top of that, people who create text tend to make mistakes or ignore "rules" of style).
For the current task, there's the problem of the possessive apostrophe without a following "s" (because the word ends in "s") -- and sometimes, punctuation will follow a close-quote (even though style manuals say it shouldn't). Here's a worst case for you:
'You've got to talk to Miles' brother', she said.
Easy for humans, hard for programs. There is a regex that will treat this one correctly:
s/ '(.*)'(\W)/ "$1"$2/; # note the greedy use of ".*"
but it will screw up on some other case that would need a non-greedy match, like:
When he said 'kiss the sky,' I heard 'kiss this guy.'
You just have to guess which sort of mistake will happen less often (and hope your data isn't really this bad).
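To see the trade-off side by side, here is a rough sketch that runs both forms over the two example sentences. The sub names are made up for illustration, and the non-greedy variant with /g is my assumption about what "a non-greedy match" would look like, not something tested on real data. Each sample carries one leading and one trailing space, because the pattern expects a space before the opening quote and a non-word character after the closing one.

```perl
use strict;
use warnings;

# Greedy form from the post: a single match that stretches to the LAST
# quote followed by a non-word character.
sub requote_greedy {
    my ($text) = @_;
    $text =~ s/ '(.*)'(\W)/ "$1"$2/;
    return $text;
}

# Non-greedy variant (an assumption, not from the post): each match stops
# at the FIRST such quote, and /g repeats it for every quoted span.
sub requote_nongreedy {
    my ($text) = @_;
    $text =~ s/ '(.*?)'(\W)/ "$1"$2/g;
    return $text;
}

# Samples are padded with one space at each end so the pattern can match
# quotes sitting at the very start or end of the text.
my $miles = q{ 'You've got to talk to Miles' brother', she said. };
my $sky   = q{ When he said 'kiss the sky,' I heard 'kiss this guy.' };

for my $s ($miles, $sky) {
    print "greedy:    [", requote_greedy($s),    "]\n";
    print "nongreedy: [", requote_nongreedy($s), "]\n";
}
```

The greedy form gets the Miles sentence right but swallows both quotations in the second sentence into one; the non-greedy form handles the second sentence but closes the quote after "Miles" in the first.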
One other hint: for stuff like this, where initial and final positions in the string might make things more complicated, it's okay to "cheat" a little: add a space or some other "safe" character at the beginning and end of the string before working on the quotes, so that the edge cases can be treated just like the non-edge cases. You can take the edge padding off when you're done.
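A minimal sketch of that padding trick, assuming a single space is a "safe" character for your data. The sub name requote_padded is hypothetical; the substitution inside it is the greedy one from above.

```perl
use strict;
use warnings;

# Hypothetical helper: pad, substitute, unpad.
sub requote_padded {
    my ($text) = @_;
    $text = " $text ";                  # add the "safe" edge characters
    $text =~ s/ '(.*)'(\W)/ "$1"$2/;    # greedy pattern from the post
    $text =~ s/^ //;                    # take the padding off again
    $text =~ s/ $//;
    return $text;
}

my $quote = q{'kiss the sky'};

# Without padding there is no space before the opening quote and nothing
# after the closing one, so the pattern cannot match and the text is unchanged.
(my $unpadded = $quote) =~ s/ '(.*)'(\W)/ "$1"$2/;

print "$unpadded\n";                    # still single-quoted
print requote_padded($quote), "\n";     # now double-quoted
```

Only one space is stripped from each end afterwards, so any whitespace the original text already had at its edges is left alone.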