PerlMonks  

regex: how to negate a set of character ranges?

by kettle (Beadle)
on Apr 29, 2007 at 17:20 UTC ( [id://612649]=perlquestion )

kettle has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have a specific set of character ranges which I would like to preserve, and I would like to eliminate all other characters. Normally something like:

$_ =~ s/[^a-z]//g;
would get rid of anything other than the lower-case alphabet. However, this doesn't seem to work with my set of character ranges, and I don't know why. I tried the code below and a couple of other variations, but I can't get the inversion to work. I can successfully delete all characters in the ranges; I just can't invert the set properly, and I can't explicitly list the ranges I want to delete. Any help will be greatly appreciated!
my $shiftjis = q{
    [\x30-\x39]
  | [\x41-\x59]
  | [\x61-\x7A]
  | [\x82-\x83][\x3F-\xFE]
  | [\x88][\x9E-\xFE]
  | [\x89-\xE9][\x3F-\xFE]
  | [\xEA][\x3F-\x9F]
};
while(<STDIN>){
    chomp();
    s/[^${shiftjis}]/ogx;
    print $_."\n";
}

Note: the above doesn't break, it just doesn't do what I'd hoped.

Replies are listed 'Best First'.
Re: regex: how to negate a set of character ranges?
by Joost (Canon) on Apr 29, 2007 at 17:40 UTC
    Hmm... I don't think you can nest character classes. Your final regex looks something like:
    /[^[\x30-\x39]|[\x41-\x59]|[\x61-\x7a]| ... #etc
    Which means you're using the literal [, - | and ] characters as part of a bigger character class.

    Also, it looks like you're trying to match raw byte sequences instead of characters. I have zero experience with shift-jis so I have no clue what characters (if any) you're trying to match, but as a wild stab, I would assume it's a lot easier to use Encode's decode() function to translate the shift-jis bytes into true (utf-8) characters and then match on characters (since you then can match multi-byte codepoints directly).
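
A minimal sketch of that suggestion, assuming the input really is Shift-JIS and that the characters worth keeping are ASCII alphanumerics, kana and kanji. The helper name keep_wanted and the choice of scripts are illustrative assumptions, not the original poster's list.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode Shift-JIS bytes, then filter on characters.
# The set of scripts kept here is an assumption made for illustration.
sub keep_wanted {
    my ($sjis_bytes) = @_;
    my $text = decode('shiftjis', $sjis_bytes);   # bytes -> characters
    # A single negated class now works, because each element is a character.
    $text =~ s/[^0-9A-Za-z\p{Hiragana}\p{Katakana}\p{Han}]//g;
    return $text;
}

# Input: "A", a two-byte ideographic space (0x81 0x40),
# katakana A (0x83 0x41), and "!".
my $cleaned = keep_wanted("A\x81\x40\x83\x41!");
```

Once the data is decoded, the negation is an ordinary [^...] class and no byte alignment has to be tracked.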

      Thanks for the reply!
      "Which means you're using the literal [, - | and ] characters as part of a bigger character class."

      Yeah, I sort of figured this out, but hadn't figured a way around it.

      You're right, I am trying to match the byte sequences. For somewhat annoying reasons I have to first run a parser over the shiftjis text, then convert it to eucjp, run a utility on that (which only accepts eucjp input) and then output the final product in utf8. I could do what you say, and then convert back to eucjp, but I'm processing a very large amount of data and need to do it in as timely a manner as possible. I'm also just a little bit worried that perhaps there are a couple of shiftjis characters that don't translate properly into utf8 (I read about this issue somewhere...). Finally, I would just like to know whether this is possible, and if so, how I can accomplish it.
        Well, the problem with using regexes for raw variable-width encodings is that you can't match characters with character ranges anymore, since ranges on raw data match only bytes. That means normal character ranges will match invalid data, and inverted ranges will exclude data that's possibly valid.

        You might have to give up on using combined character ranges altogether if you want to process the encoded data directly, and inverting ranges will be especially annoying. I mean, you could possibly match like this /([\x00-\x40][\x56-\x90]|[\x50-\x60][\x56-\x90])*/ (numbers made up), but you can't (easily) invert that match. Also, keep in mind that your regexes might shift (eh) off their alignment since shift-jis has 1 and multi-byte characters - meaning [\x00-\x40] might match both the first and/or later byte(s) of any character.
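
The alignment problem can be seen with a single character. This is only an illustration: the byte pair 0x83 0x41 is katakana "A" in Shift-JIS, and its second byte happens to be the ASCII letter "A".

```perl
use strict;
use warnings;

# Shift-JIS bytes for katakana "A" (0x83 0x41).
# The trail byte 0x41 is also the ASCII letter "A".
my $katakana_a = "\x83\x41";

# A single-byte range meant for ASCII letters matches inside the two-byte character.
my $hits_trail_byte = $katakana_a =~ /[\x41-\x59]/ ? 1 : 0;

# Deleting "every byte that is not in the range" strips the lead byte
# and leaves a stray ASCII "A" that was never in the text.
(my $mangled = $katakana_a) =~ s/[^\x41-\x59]//g;

print "trail byte matched: $hits_trail_byte\n";
print "after negated delete: $mangled\n";
```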

        I think it's still likely that using the internal perl multi-byte encoding (i.e. utf-8) will be a lot easier, but it depends on what you're trying to do exactly.

Re: regex: how to negate a set of character ranges?
by dynamo (Chaplain) on Apr 29, 2007 at 17:36 UTC
    As The Camel tells us, in chapter 5, verse 4 (Character Classes), you may combine ranges (note Table 5.8, it has several ranges combined in the lower portion). So - you can write the first few lines of $shiftjis as:
    [\x30-\x39\x41-\x59\x61-\x7A]
    instead of:
    [\x30-\x39] | [\x41-\x59] | [\x61-\x7A]
    This would allow you to have one huge character class instead of multiple. I don't have any source text with exotic characters, so this is not tested. But I see other problems. You have a missing second slash in the substitution regex. Another issue is that you have tried to use nesting within character classes, which doesn't work. So, assuming that you've done the above and combined the classes, in the fix I'm also removing the outer square brackets. Instead of:
    s/[${shiftjis}]/ogx;
    Try:
    s/${shiftjis}//ogx;
    If that doesn't help, please post a short section of your source material to help test other solutions.
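
As a small sketch of the combined form, here are the three single-byte ranges merged and negated. It covers only the single-byte part of the list; the two-byte sequences cannot be folded into one class this way. The sample string is made up.

```perl
use strict;
use warnings;

# The three single-byte ranges from the question, merged into one negated class.
# Note that \x41-\x59 is "A" through "Y"; "Z" is \x5A, so it gets deleted too.
my $not_single_byte = qr/[^\x30-\x39\x41-\x59\x61-\x7A]/;

(my $line = "Hello, World 42 Zed!") =~ s/$not_single_byte//g;
print "$line\n";
```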
      Thanks for the speedy reply! I've kept both the single and multibyte ranges on separate lines just to help me keep track of what they actually represent.

      However, I do not think it is possible to combine the multibyte characters, which means that I can't quite combine everything.

      The missing slash was a typo.

      Also, I should point out that,
      s/[${shiftjis}]//ogx; (or s/${shiftjis}//ogx; or s/$shiftjis//ogx;) will work as expected.

      What doesn't work as expected is:
      s/[^${shiftjis}]//ogx;

      Unfortunately I'm now at home and do not have access to the text. However, I think that the problem is that I don't know this little corner of the regex syntax...
        Is this still not working as expected when you combine even just a couple of the ranges? I don't think that using multiple ranges and the alternation operator (|) is doing what you want once it's expanded inside the character class brackets. Unless performance is a really big problem, if you can't combine the classes for whatever reason, or don't want to, try storing each class string in an array and looping over it, running the substitution once per class on the source text. It'll get the job done. Good luck!
Re: regex: how to negate a set of character ranges?
by Sidhekin (Priest) on Apr 29, 2007 at 17:59 UTC

    <clippy>It looks like you are trying to build a negated character class. Would you like some help?</clippy>

    You have the syntax for including several ranges within a character class wrong. Simple rule: Within a character class, [, |, and whitespace (even with /x) are literal, and ] terminates the class (except in first position). So keep those square braces out until constructing the class, and keep the pipes and whitespace out, period.
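
A short sketch of that rule in action, using a cut-down version of the pattern from the question. The sample strings are invented for illustration.

```perl
use strict;
use warnings;

# Inside a class, "|" and whitespace are ordinary members, even under /x.
my @literal_hits = grep { /[a | b]/x } ('a', 'b', '|', ' ', 'c');

# Interpolating bracketed ranges into an outer class, as the question does.
# The pattern text becomes "[^ [\x30-\x39] | [\x41-\x59] ]" under /x.
# The class ends at the FIRST "]", so it is really "[^ [0-9]": anything that
# is not a space, a "[" or a digit. What follows is a normal alternation.
my $ranges = q{ [\x30-\x39] | [\x41-\x59] };
(my $text = "ab [12] XY") =~ s/[^${ranges}]//gx;

print scalar(@literal_hits), " strings matched the first class\n";
print "after substitution: '$text'\n";
```

The letters vanish and the space, "[" and digits survive, which is not what the ranges were meant to express.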

    Ah, but then, clippy may have been wrong? What you are trying to do is combine (and then negate) multi-character sequences, right?

    Oh my. Combining is easy. What you have even works, though I'd personally use qr//x instead:

    my $shiftjis = qr{
        [\x30-\x39]
      | [\x41-\x59]
      | [\x61-\x7A]
      | [\x82-\x83][\x3F-\xFE]
      | [\x88][\x9E-\xFE]
      | [\x89-\xE9][\x3F-\xFE]
      | [\xEA][\x3F-\x9F]
    }x;

    Negating is another subject altogether, since there is more than one set of semantics for such a negation. I'm not entirely sure which makes more sense here ... if any! Here's one example/guess though:

    while(<STDIN>){
        chomp();
        # Oops, nope -- variable-length lookbehind:
        # s/(?!$shiftjis).(?<!$shiftjis)//ogx;
        # This runs, but doesn't do the job:
        # s/(?!$shiftjis).//ogx;
        # This should work:
        s/(?!$shiftjis).
          (?<![\x30-\x39\x41-\x59\x61-\x7A])
          (?<![\x82-\x83][\x3F-\xFE]
             |[\x88][\x9E-\xFE]
             |[\x89-\xE9][\x3F-\xFE]
             |[\xEA][\x3F-\x9F])
         //ogx;
        print $_."\n";
    }

    But do you have to read STDIN as bytes? If you could read it as characters (in whatever encoding this is; see Encode), you'd be spared this mess.
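
If the data has to stay as bytes, another way to pin down the semantics is to walk the string one Shift-JIS character at a time and keep only whole characters that the whitelist accepts. This is a sketch, not a tested solution for real data: the helper name keep_wanted_bytes is hypothetical, and the lead-byte and trail-byte ranges are an assumption about the usual Shift-JIS layout rather than something stated in the thread.

```perl
use strict;
use warnings;

# The whitelist from the thread, compiled once with qr//x.
my $wanted = qr{
      [\x30-\x39] | [\x41-\x59] | [\x61-\x7A]
    | [\x82-\x83][\x3F-\xFE]
    | [\x88][\x9E-\xFE]
    | [\x89-\xE9][\x3F-\xFE]
    | [\xEA][\x3F-\x9F]
}x;

# One character: a lead byte plus a trail byte, or else any single byte.
# These ranges are assumed, not taken from the thread.
my $sjis_char = qr{ [\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC] | [\x00-\xFF] }x;

# Hypothetical helper: keep only whole characters that match the whitelist.
sub keep_wanted_bytes {
    my ($bytes) = @_;
    my $out = '';
    # \G anchors each match where the previous one ended, so the scan
    # never starts in the middle of a two-byte character.
    while ($bytes =~ /\G($sjis_char)/gc) {
        my $char = $1;
        $out .= $char if $char =~ /\A(?:$wanted)\z/;
    }
    return $out;
}

# "a", "!", katakana A (0x83 0x41), a two-byte character outside the
# whitelist (0x81 0x41), "Z", "9".
my $filtered = keep_wanted_bytes("a!\x83\x41\x81\x41Z" . "9");
```

Because 0x81 0x41 is consumed as one unit, its trail byte is not mistaken for an ASCII "A".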

    print "Just another Perl ${\(trickster and hacker)},"
    The Sidhekin proves Sidhe did it!

Re: regex: how to negate a set of character ranges?
by ikegami (Patriarch) on Apr 29, 2007 at 18:02 UTC
    I think you'll have a much easier time if you decode the Shift-JIS bytes to characters, work with characters, then encode the characters into Shift-JIS bytes when needed. These functions are found in Encode in Perl 5.8.
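
A rough sketch of that round trip, adapted to the pipeline described earlier in the thread (Shift-JIS in, EUC-JP out). The helper name and the filter are placeholders; passing Encode::FB_CROAK is one way to learn about bytes or characters that do not convert, rather than having them silently substituted.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: Shift-JIS bytes in, filtered EUC-JP bytes out.
sub sjis_to_filtered_eucjp {
    my ($sjis_bytes) = @_;

    # FB_CROAK makes decode() die on malformed input instead of substituting.
    my $text = decode('shiftjis', $sjis_bytes, Encode::FB_CROAK);

    # Placeholder filter (an assumption): keep ASCII alphanumerics,
    # kana and kanji, and drop everything else.
    $text =~ s/[^0-9A-Za-z\p{Hiragana}\p{Katakana}\p{Han}]//g;

    # FB_CROAK here dies on characters that EUC-JP cannot represent.
    return encode('euc-jp', $text, Encode::FB_CROAK);
}

# "A", katakana A (0x83 0x41 in Shift-JIS), "!"
my $eucjp = sjis_to_filtered_eucjp("A\x83\x41!");
```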
Re: regex: how to negate a set of character ranges?
by kettle (Beadle) on Apr 30, 2007 at 10:01 UTC
    Thanks for all the help on this one everyone. I ended up using Encode to decode the shiftjis text, and used an exclusion list made up of characters I don't want to see in the final output, and finally compiled this list into a traditional:
    my $string = "NON_DELIMITED_CHAR_LIST"; my $regex = qr/ s/[^a-z]//g;

    type of structure. This worked as expected, but still seems a little bit more messy than it ought to need to be. I'd still prefer not to have the big long list of characters. Anyway, thanks!
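
One possible shape for such an exclusion list, as a sketch only. The characters below are made up for illustration and are not the list actually used; quotemeta keeps class metacharacters such as "]", "^" and "-" from being misread.

```perl
use strict;
use warnings;

# Hypothetical exclusion list: characters to strip from already-decoded text.
my @unwanted = ('*', ']', '^', '-', "\x{3000}", "\x{30FB}");

# Escape each character, then compile the list into a single class.
my $class    = join '', map { quotemeta } @unwanted;
my $unwanted = qr/[$class]/;

(my $text = "a*b]c\x{3000}d-e") =~ s/$unwanted//g;
```

Keeping the list in an array and compiling it once avoids hand-editing one long class string.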
