PerlMonks  

Disable Regex

by SavannahLion (Pilgrim)
on Aug 26, 2009 at 05:38 UTC ( [id://791255] : perlquestion )

SavannahLion has asked for the wisdom of the Perl Monks concerning the following question:

How do I disable or prevent Perl from processing special regex characters such as \, * and ? in a quoted string so I can process them as regular characters?

I'm stuck on this one. I'm pulling data from a couple of sources and amongst the many chunks of information I'm sucking up, the darn thing kept choking. Come to find out there are a couple of "illegal" characters that are causing problems for me. For example, some characters like \, * and ? (amongst others) are triggering problems elsewhere within the code. Unfortunately, I can't just send these characters on their merry way since some of this data is being converted into files on an NTFS partition.

In other words, I'm slurping down data from files (roughly 30GB worth of data). The data within is digested and new NTFS-palatable files are created. Perl did exactly what I wanted it to do with the exception of the file naming convention. To my surprise, Linux happily wrote illegal Windows characters to the NTFS partition, causing Windows to balk when trying to do anything with them. That is, Linux (and Perl) created files such as:
C:\random\location\ab\delta.txt
and
C:\random\location\ab*delta.txt
(Technically, Linux wrote to /mnt/c/random/location/. Windows sees it as C:\random\location\).


Where C:\random\location\ is the full path and ab\delta.txt or ab*delta.txt is the file name. So, thinking I was smart, I just did a s/// for all the illegal characters and just replaced them all with _. That worked until I got to ab\delta.txt and ab*delta.txt where both would be renamed to ab_delta.txt, one overwriting the other. OK, so I tried to be a little smarter and tried to use iteration creating ab_delta-1.txt and ab_delta-2.txt but if Perl dies for any reason, I get a bunch of files ending in -3 -4 -5 and no idea what was what.

OK..... Looking back at my internal file structure, I finally decided to do a bit of substitution, making \, * and ? into [bs], [a], [q]. Yaaay! It started working. Until I ran into files that needed to be ab\delta.txt and ab\\delta.txt. I was getting ab[bs]delta.txt for both. DOH!!

So here's what I've come up with so far (with all the extraneous crusty stuff removed):

    my $test    = 'illegal\characters*example?';
    my @illegal = qw(\ * ?);
    my @legal   = qw(bs a q);
    my $c       = 0;
    foreach my $val (@illegal) {
        $test =~ s/\Q$val\E/[$legal[$c]]/g;
        $c++;
    }
    print $test."\n";

Nice, it produces what I want: illegal[bs]characters[a]example[q].
However, if I modify $test to equal 'illegal\\characters\*example?' I get illegal[bs]characters[bs][a]example[q] when it should read illegal[bs][bs]characters[bs][a]example[q] (note the missing [bs] after illegal). I've been trying all sorts of ideas to suppress the \\ being escaped, but I'm stuck.

Please, enlighten me and direct me to the proper solution.

Replies are listed 'Best First'.
Re: Disable Regex
by roboticus (Chancellor) on Aug 26, 2009 at 10:04 UTC
    SavannahLion:

    Two things. First, you can save yourself a lot of trouble with Windows paths if you use the '/' instead of '\' characters in your paths. It will work just fine in Windows, and you don't have to worry about the annoying backslashes. (Note: there's a case where it doesn't work reliably, described later.)

    Secondly, you'll want to read perlre to learn more about regular expressions and the various tips and tricks.

    In your question, you're having two (or more) issues: First, in a regular expression, some characters have special meaning. To use them as characters without their special meaning, you need to escape them by preceding them with a '\' character. The * and ? characters are the ones you're having trouble with in this case. So s/foo\*\?/bar?*/ allows you to convert the "foo*?" part of a string to "bar?*". Note: The specialness of the ? and * characters only applies to the regular expression (first) part, not the replacement part, which is why I didn't escape them in the right hand chunk.
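A minimal sketch of the escaping described above. The sample string is made up for illustration. It runs the hand-escaped substitution from the paragraph, then does the same thing with quotemeta (the function behind \Q...\E), which backslash-escapes every non-word character so you don't have to do it by hand.

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen only to exercise the pattern.
my $text = 'xfoo*?y';

# Hand-escaped: \* and \? match a literal asterisk and question mark.
(my $by_hand = $text) =~ s/foo\*\?/bar?*/;

# quotemeta() escapes every non-word character, so the literal
# string 'foo*?' can be used in a pattern without editing it.
my $literal = 'foo*?';
my $pattern = quotemeta $literal;              # foo\*\?
(my $by_quotemeta = $text) =~ s/$pattern/bar?*/;

print "$by_hand\n";         # xbar?*y
print "$by_quotemeta\n";    # xbar?*y
```

Both forms give the same result; \Q$val\E in the original code is the inline spelling of the quotemeta call.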

    The second issue is that in double-quoted strings and regular expressions, the '\' character has a special meaning: Specifically, it's saying "do something different with the following character(s)". In a regular expression, it's saying "treat this as a normal character instead of a regular expression feature". Since the '\' has special meaning, to get one in your double-quoted string or regular expression, you need to escape it by prefixing it with a '\' character. (If you want two '\' characters in a row, you'll have to escape each one, so you'll wind up with four in a row...).
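To make the backslash counting concrete, here is a small sketch (the variable names are only illustrative). It shows that each pair of backslashes in a string literal becomes one character in the string, and that the same holds for single-quoted literals, which matters for the test strings in the question.

```perl
use strict;
use warnings;

# Each pair of backslashes in the source collapses to one character,
# in double-quoted and single-quoted literals alike.
my $one_dq = "a\\b";       # holds a\b   (3 characters)
my $two_dq = "a\\\\b";     # holds a\\b  (4 characters)
my $two_sq = 'a\\\\b';     # holds a\\b  (4 characters)

printf "%s has %d characters\n", $_, length($_)
    for $one_dq, $two_dq, $two_sq;

# Inside a regex, \\ matches one literal backslash.
my $count = () = $two_dq =~ /\\/g;
printf "%s contains %d backslashes\n", $two_dq, $count;
```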

    I said "or more", right? What I mean here is that even when you have your string set up with a valid Windows filename, you may still run into problems, depending on how you use it. If you pass the filename into a command shell after that, then you may have another problem: The command shell has its own rules and special characters that you have to fight. (So I avoid calling command shells in my perl scripts.) The rules are different for differing command shells (bash, zsh, etc. in the *NIX world, CMD in the Windows world). So once you have your filename encoded, you still need to understand what the shell is going to do to your filename, so you can further mangle it if necessary. (This is what I meant earlier when I said that there's a case where using '/' instead of '\' won't work reliably: In the CMD shell, it wants to use '/' as an indicator for a command-line switch, so if you pass your filenames to a CMD shell, you'll have to wrestle with the '\' rules...).
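One way to sidestep the shell's rules entirely is to start the other program with a list of arguments instead of a single command string. The sketch below assumes a Unix-like platform where list-form pipe open is available; the file name is hypothetical, and $^X (the running perl) is only a stand-in for whatever command you would really call.

```perl
use strict;
use warnings;

# Hypothetical file name containing characters that shells treat specially.
my $name = 'ab*delta?.txt';

# List-form pipe open: the child receives $name as a single argument,
# with no shell in between to expand * or ?.
open(my $child, '-|', $^X, '-e', 'print $ARGV[0]', $name)
    or die "Cannot start child process: $!";
my $seen = do { local $/; <$child> };
close $child or die "Child process failed: $?";

print "child received: $seen\n";
```

The same idea applies to system: passing a list rather than one string keeps the arguments away from the shell.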

    ...roboticus
Re: Disable Regex
by james2vegas (Chaplain) on Aug 26, 2009 at 05:59 UTC
    Does this happen with real data? The reason
        my $test    = 'illegal\\characters*example?';
        my @illegal = qw(\ * ?);
        my @legal   = qw(bs a q);
        my $c       = 0;
        foreach my $val (@illegal) {
            $test =~ s/\Q$val\E/[$legal[$c]]/g;
            $c++;
        }
        print $test."\n";
    returns illegal[bs]characters[a]example[q] is that the two \s are actually one \ acting as an escape followed by the \ it escapes. From Quote and Quote-like Operators:
    A single-quoted, literal string. A backslash represents a backslash unless followed by the delimiter or another backslash, in which case the delimiter or backslash is interpolated.


    To have two \s in your '-delimited string use '\\\\' (two escaped backslashes)
        my $test    = 'illegal\\\\characters*example?';
        my @illegal = qw(\ * ?);
        my @legal   = qw(bs a q);
        my $c       = 0;
        foreach my $val (@illegal) {
            $test =~ s/\Q$val\E/[$legal[$c]]/g;
            $c++;
        }
        print $test."\n";
    If you are building up your string from component parts, it shouldn't be a concern.
      To have two \s in your '-delimited string use '\\\\' (two escaped backslashes)
      Or else a single escaped backslash and then a single un-escaped backslash:
          >perl -wMstrict -le
          "my $s1 = 'a\b';
           my $s2 = 'c\\d';
           my $s3 = 'e\\\f';
           my $s4 = 'g\\\\h';
           print $s1, $s2, $s3, $s4;
          "
          a\bc\de\\fg\\h

      Sadly, no. :( I'm not building up the string in question. That was just a sample of a typical (atypical?) string I'm grabbing. Well... I lied. I am building up the path (in the above example, that would be C:\random\location), it's the names I'm not building up. Really, they literally come to me as ab\delta (I add the .txt) and ab\delta is literally going to be the name of the file or directory in the string I'm constructing. While scanning over the logs, I did find several botched names that come up as something like ab\\delta

      So if I have two strings, one as ab\delta and the other as ab\\delta, I must know how many real \s (or * or ? or whatever) there are so I can add in the appropriate number of replacements.
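If the goal is just to know how many of each character a name really contains, tr/// can count without changing anything. This is only a sketch: count_specials is a hypothetical helper, and the names are typed as literals here, so each real backslash has to be written as \\.

```perl
use strict;
use warnings;

# Hypothetical helper: count the real \, * and ? characters in a name.
sub count_specials {
    my ($name) = @_;
    return (
        # tr/// with an empty replacement list only counts matches.
        backslash => ($name =~ tr/\\//),
        star      => ($name =~ tr/*//),
        question  => ($name =~ tr/?//),
    );
}

# 'ab\\delta' holds one backslash, 'ab\\\\delta' holds two.
for my $name ('ab\\delta', 'ab\\\\delta', 'ab*de?ta') {
    my %n = count_specials($name);
    printf "%s: %d backslash(es), %d star(s), %d question mark(s)\n",
        $name, $n{backslash}, $n{star}, $n{question};
}
```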

        If your string is coming from an external source (read from a file, read from the command line, STDIN, a database field) you shouldn't need to worry about escaping \ * ?, for example:


        qt.pl:
            #!/usr/bin/perl
            use strict;
            use warnings;

            my $string  = $ARGV[0];
            my $test    = $string;
            my @illegal = qw(\ * ?);
            my @legal   = qw(bs a q);
            my $c       = 0;
            foreach my $val (@illegal) {
                $test =~ s/\Q$val\E/[$legal[$c]]/g;
                $c++;
            }
            print $test."\n";

            $ perl ./qt.pl 'a\\b\\c\d\\\\e'
            a[bs][bs]b[bs][bs]c[bs]d[bs][bs][bs][bs]e
            $ perl ./qt.pl 'foo*bar\eleven?three'
            foo[a]bar[bs]eleven[q]three
            $

        The reason you needed to escape \ in your test is that you were typing the string as a literal in Perl source, and Perl processes backslash escapes in string literals. This won't happen to a string that has already been assigned from external data.
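Here is a small sketch of the same point using a filehandle instead of the command line. The in-memory file is only a stand-in for a real input file; the literal that fills it is the one place where backslashes have to be doubled.

```perl
use strict;
use warnings;

# Stand-in for external data: an in-memory "file" holding the nine
# characters ab\\delta plus a newline. Real input would come from a
# file, STDIN, or a database field.
my $data = 'ab\\\\delta' . "\n";

open(my $in, '<', \$data) or die "Cannot open in-memory file: $!";
chomp(my $line = <$in>);
close $in;

# Nothing re-interprets the backslashes on the way in.
my $backslashes = ($line =~ tr/\\//);
printf "read %s (%d characters, %d backslashes)\n",
    $line, length($line), $backslashes;
```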
Re: Disable Regex
by jwkrahn (Abbot) on Aug 26, 2009 at 08:33 UTC
    I've been trying all sorts of ideas to suppress the \\ being escaped, but I'm stuck.
        my $test = <<'TEXT';
        illegal\\characters\*example?
        TEXT
        my @illegal = qw( \ * ? );
        my @legal   = qw( bs a q );
        my $c       = 0;
        for my $val ( @illegal ) {
            $test =~ s/\Q$val\E/[$legal[$c]]/g;
            $c++;
        }
        print "$test\n";

    This produces:

    illegal[bs][bs]characters[bs][a]example[q]
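For what it's worth, the same mapping can be done in a single pass with a lookup hash, and then undone again. This is a sketch under assumptions, not anything from the posts above: encode_name and decode_name are hypothetical names, and the extra 'lb' token for '[' is my own addition so that a name which already contains the text [bs] can't be mistaken for an encoded backslash.

```perl
use strict;
use warnings;

# Token table following the thread's scheme. The 'lb' entry for '['
# is an assumption added here, not part of the original posts.
my %token_for = ( '\\' => 'bs', '*' => 'a', '?' => 'q', '[' => 'lb' );
my %char_for  = reverse %token_for;

sub encode_name {
    my ($name) = @_;
    # One pass: every special character is replaced exactly once,
    # so repeated characters give repeated tokens.
    $name =~ s/([\\*?\[])/[$token_for{$1}]/g;
    return $name;
}

sub decode_name {
    my ($name) = @_;
    $name =~ s/\[(bs|a|q|lb)\]/$char_for{$1}/g;
    return $name;
}

# Holds illegal\\characters\*example? (literal backslashes doubled).
my $raw     = 'illegal\\\\characters\\*example?';
my $encoded = encode_name($raw);
print "$encoded\n";    # illegal[bs][bs]characters[bs][a]example[q]
print decode_name($encoded) eq $raw ? "round trip ok\n" : "round trip FAILED\n";
```

Because the substitution never rescans text it has already replaced, ab\delta and ab\\delta end up as different names, and the original can be recovered later.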