PerlMonks  

Is there a way to make these two regex lines cleaner?

by bartender1382 (Beadle)
on Apr 16, 2022 at 18:31 UTC ( [id://11143004]=perlquestion )

bartender1382 has asked for the wisdom of the Perl Monks concerning the following question:

Am currently using the following two regex lines:

    $name =~ s/[^a-zA-Z0-9\-\"\'\ \.\?\!"]//g;
    $name =~ s/^\s+|\s+$//g;

It's not really a big deal, but for the sake of cleaner coding, is there a way to clean the two lines up?

P.S. I'm just trying to remove strange characters, i.e. ï»¿

Replies are listed 'Best First'.
Re: Is there a way to make these two regex lines cleaner?
by haukex (Archbishop) on Apr 16, 2022 at 18:46 UTC

    You could replace the first line with a tr///cd, that should be a bit faster. The second line is the usual way to trim whitespace from a string in Perl so it's fine the way it is.

    However, "ï»¿" is the byte order mark (BOM) as it appears when a UTF-8 encoded file is opened with the wrong encoding. So instead of that first regex, you probably want to open the file with open my $fh, '<:raw:encoding(UTF-8)', $filename or die "$filename: $!"; and then do a $line =~ s/\A\N{U+FEFF}//; on the first line of the file. This has the major advantage that any other UTF-8 encoded characters in the file will be decoded correctly, meaning you won't get "strange characters" but the correct Unicode characters (assuming no other encoding issues), and this really is the correct way to solve this issue. If you then still want to turn the text into ASCII-only, see e.g. Text::Unidecode.
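A minimal sketch of the approach described above, for readers who want to try it. The temporary file and its contents are made up for the demonstration; only the open mode and the BOM-stripping substitution come from the reply.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small UTF-8 file that starts with a BOM (bytes EF BB BF).
# The file and its contents are hypothetical, just so the example runs.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
binmode $tmp_fh;    # write raw bytes, no layers
print {$tmp_fh} "\xEF\xBB\xBFname,city\nZo\xC3\xAB,K\xC3\xB6ln\n";
close $tmp_fh or die "$filename: $!";

# Decode as UTF-8 while reading, and check open for errors.
open my $fh, '<:raw:encoding(UTF-8)', $filename
    or die "$filename: $!";
my @lines = <$fh>;
close $fh;
chomp @lines;

# Once decoded, the BOM is the single character U+FEFF; drop it from line 1.
$lines[0] =~ s/\A\N{U+FEFF}//;

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @lines;
```

With the file decoded this way, the accented characters in the second line arrive as single Unicode characters rather than as pairs of "strange" bytes.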

    Updated: A few edits for clarification. Also: If you have further issues with encoding, I have some brief advice on what to post to get the best answers here.

      Awesome catch!

      I am using the Spreadsheet::Read module, which uses the command:

      my $book  = ReadData ("$upload_dir/$filename");

      It will allow me to open the buffer, read it myself, then hand off the buffer to the ReadData command.

      Sadly that's failing, see below, and will have to debug more.

      Again, awesome catch! Glad I included the garbage.

      open my $fh, '<:raw:encoding(UTF-8)', "$upload_dir/$filename";
      read $fh, my $string, -s $fh;
      close $fh;
      my $book = ReadData ($string);

        I am using the Spreadsheet::Read module.

        That's an important piece of information missing from the root node! I am guessing that your files are CSV files? Because opening any other file type (XLS, XLSX, etc.) with an '<:raw:encoding(UTF-8)' will likely corrupt those files, and ReadData($filename) should be preferred there. And for CSV files, Spreadsheet::Read uses Text::CSV or Text::CSV_XS under the hood, both of which have a detect_bom option when used directly - unfortunately I currently don't see a way to get Spreadsheet::Read to apply that option, so unless Tux has any hints, you could use one of those two CSV modules directly.

        In any case, you may want to check your $filename to see if it's a CSV file first, before handing it off to the processing code appropriate for the file type.

        Update: Regarding read $fh, my $string, -s $fh;, the idiomatic way to slurp a file in Perl is my $string = do { local $/; <$fh> }; (see $/). Other minor edits. And you need to check your open for errors, see "open" Best Practices.
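The slurp idiom and the error-checked open mentioned above can be sketched together as follows. The sample file is hypothetical and exists only so the snippet is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file, created only so the example runs.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
print {$tmp_fh} "line one\nline two\n";
close $tmp_fh or die "$filename: $!";

# Check open for errors instead of silently continuing with a bad handle.
open my $fh, '<', $filename or die "$filename: $!";

# Localizing $/ sets it to undef inside the block, so <$fh> reads the
# whole file in one go; the old value is restored when the block exits.
my $string = do { local $/; <$fh> };
close $fh;

print length($string), " characters read\n";
```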

        You could also use File::BOM to open the file and then pass the file handle to Spreadsheet::Read.

        # untested
        use File::BOM qw( :all );
        use Spreadsheet::Read;

        open_bom(my $fh, $file, ':utf8');
        my $book = ReadData ($fh, parser => "csv");
Re: Is there a way to make these two regex lines cleaner?
by hv (Prior) on Apr 16, 2022 at 21:03 UTC

    For the first line: most characters do not need escaping in a character class, probably only '-' and ']'.

    For the second line: coming soon to a perl near you:

    % perl -wle 'use builtin qw{trim}; $s=" foo bar "; $s=trim($s); print +"<$s>"'
    Built-in function 'builtin::trim' is experimental at -e line 1.
    <foo bar>
    %
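On a perl without builtin::trim, one possible stopgap is to wrap the original poster's second regex in a small helper. The name trim_ws is made up for this sketch and is not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a trimmed copy and leaves its argument untouched.
sub trim_ws {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;    # same substitution as in the original question
    return $s;
}

my $s = trim_ws("  foo bar \n");
print "<$s>\n";    # prints <foo bar>
```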
Re: Is there a way to make these two regex lines cleaner?
by kcott (Archbishop) on Apr 17, 2022 at 13:59 UTC

    G'day bartender1382,

    hv wrote: "... most characters do not need escaping in a character class, probably only '-' and ']'.".

    See "perlrecharclass: Special Characters Inside a Bracketed Character Class" for a discussion of this.

    You have duplicated double-quote in the class ('\"' and later '"') so you can lose one of those. Also note that '-' is special because it indicates a range; however, when it's the last character in the class, there is no range; so no special meaning and no escape required.

    haukex suggested transliteration "should be a bit faster". In my experience, it is a lot faster. See "Search and replace or tr" in "Perl Performance and Optimization Techniques". If your line is just by itself, the improvement is unlikely to be noticeable; however, if it occurs in a loop, or a frequently called routine, it could make a big difference: run your own Benchmark to determine this.
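A rough sketch of how such a benchmark might look, using the core Benchmark module. The input string and the sub names are made up for illustration; timings will vary by machine and perl version, so no numbers are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up input: a string mixing allowed and disallowed characters.
my $input = q{*a%Z(5)-["'] <.?!>} x 100;

# Each sub works on a fresh copy so both do the same amount of work.
sub via_subst { my $s = $input; $s =~ s/[^a-zA-Z0-9"' .?!-]//g; return $s }
sub via_trans { my $s = $input; $s =~ tr/a-zA-Z0-9"' .?!-//cd;  return $s }

# Sanity check: both approaches must produce the same result.
die "results differ" unless via_subst() eq via_trans();

# Run each for at least one CPU second and print a comparison table.
cmpthese(-1, {
    subst => \&via_subst,
    trans => \&via_trans,
});
```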
    [In case you didn't know, y/// and tr/// are synonymous.]

    Here's a script that shows the various points I made:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my $str = q{*a%Z(5)-["'] <.?!>};
    my $fmt = "%12s : %s\n";

    my $op_sub = $str;
    $op_sub =~ s/[^a-zA-Z0-9\-\"\'\ \.\?\!"]//g;
    printf $fmt, 'OP', $op_sub;

    my $no_dup_quote = $str;
    $no_dup_quote =~ s/[^a-zA-Z0-9\-\"\'\ \.\?\!]//g;
    printf $fmt, 'NO_DUP_QUOTE', $no_dup_quote;

    my $less_esc = $str;
    $less_esc =~ s/[^a-zA-Z0-9\-"' .?!]//g;
    printf $fmt, 'LESS_ESC', $less_esc;

    my $no_esc = $str;
    $no_esc =~ s/[^a-zA-Z0-9"' .?!-]//g;
    printf $fmt, 'NO_ESC', $no_esc;

    my $trans = $str;
    $trans =~ y/a-zA-Z0-9"' .?!-//cd;
    printf $fmt, 'TRANS', $trans;

    Output:

              OP : aZ5-"' .?!
    NO_DUP_QUOTE : aZ5-"' .?!
        LESS_ESC : aZ5-"' .?!
          NO_ESC : aZ5-"' .?!
           TRANS : aZ5-"' .?!

    So, if you choose substitution: s/[^a-zA-Z0-9"' .?!-]//g
    if transliteration (would be my choice): y/a-zA-Z0-9"' .?!-//cd

    — Ken

      So, if you choose substitution: s/[^a-zA-Z0-9"' .?!-]//g

      if transliteration (would be my choice): y/a-zA-Z0-9"' .?!-//cd

      I'd still be inclined to escape the '-' in both cases: someone is all too likely to come along in a couple of years needing to add one more character to the allowed list, and the natural tendency would be to add it to the end.
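To make that maintenance hazard concrete, here is a small sketch of what could happen if someone later appends '+' after an unescaped trailing '-'. The test string is made up for the illustration.

```perl
use strict;
use warnings;

my $input = 'a#b+c$';

# Unescaped '-' followed by a newly appended '+': "!-+" is now a range
# covering every character from '!' to '+', which includes '#' and '$'.
(my $accidental_range = $input) =~ s/[^a-zA-Z0-9"' .?!-+]//g;

# Escaped '-': appending '+' only adds a literal '+' to the allowed set.
(my $escaped = $input) =~ s/[^a-zA-Z0-9"' .?!\-+]//g;

print "accidental range: $accidental_range\n";    # a#b+c$
print "escaped hyphen:   $escaped\n";             # ab+c
```

The first class raises no warning, because '!' sorts before '+' and the range is valid, so the mistake could easily go unnoticed.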

        A hyphen placed at the beginning of a character class or tr/// search/replace list is also interpreted literally:

        Win8 Strawberry 5.8.9.5 (32) Sun 04/17/2022 18:43:19
        C:\@Work\Perl\monks
        >perl
        use strict;
        use warnings;

        my $s = '123-abc-456';
        $s =~ tr/-a-z//cd;
        print "'$s' \n";

        $s = '123-xyz-456';
        $s =~ s/[^-a-z]//g;
        print "'$s' \n";
        ^Z
        '-abc-'
        '-xyz-'

        But one can argue that one is as likely to place new stuff at the start as at the end, so escaping remains wise. :)


        Give a man a fish:  <%-{-{-{-<

        I've been putting '-' at the end for a very long time (probably decades) and have never encountered the scenario you describe; however, I'm not averse to a bit of defensive programming. :-)

        Update: The remainder of what I originally wrote is just wrong: I'll put it down to an idiotic brain fart. I've stricken it and, because it was quite long, removed it to a spoiler.

        — Ken

Node Type: perlquestion [id://11143004]
Approved by haukex
Front-paged by kcott