Is there a way to make these two regex lines cleaner?

bartender1382 has asked for the wisdom of the Perl Monks concerning the following question:

Re: Is there a way to make these two regex lines cleaner? by haukex (Archbishop) on Apr 16, 2022 at 18:46 UTC
You could replace the first line with a `tr///cd`, which should be a bit faster. The second line is the usual way to trim whitespace from a string in Perl, so it's fine the way it is.

However, "ï»¿" is what the byte order mark looks like when the file is encoded in UTF-8 but was opened with the incorrect encoding. So instead of that first regex, you probably want to open the file with `open my $fh, '<:raw:encoding(UTF-8)', $filename or die "$filename: $!";`, and then do a `$line =~ s/\A\N{U+FEFF}//;` on the first line of the file. This has the major advantage that any other UTF-8 encoded characters in the file will be decoded correctly - meaning you won't get "strange characters", you'll get the correct Unicode characters, assuming no other encoding issues - and this really is the correct way to solve this issue. If you then still want to turn the text into ASCII-only, see e.g. Text::Unidecode.

Updated: A few edits for clarification. Also: If you have further issues with encoding, I have some brief advice on what to post to get the best answers here.
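A minimal sketch of the approach described above, using only core modules. The temporary file, its contents, and the helper name `read_lines_without_bom` are made up for illustration and are not from the original question.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small UTF-8 file that starts with a BOM (bytes EF BB BF).
# File name and contents are assumptions made for this example.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':raw';
print {$tmp_fh} "\xEF\xBB\xBFname,city\nZo\xC3\xAB,K\xC3\xB6ln\n";
close $tmp_fh or die "$filename: $!";

# Hypothetical helper: read all lines as decoded text and drop a
# leading BOM (U+FEFF) from the first line if one is present.
sub read_lines_without_bom {
    my ($path) = @_;
    open my $fh, '<:raw:encoding(UTF-8)', $path or die "$path: $!";
    my @lines = <$fh>;
    close $fh or die "$path: $!";
    $lines[0] =~ s/\A\N{U+FEFF}// if @lines;
    chomp @lines;
    return @lines;
}

my @lines = read_lines_without_bom($filename);
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @lines;
```

Because the handle decodes UTF-8, the accented characters in the second line arrive as single Unicode characters rather than as pairs of "strange" bytes.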
Re^2: Is there a way to make these two regex lines cleaner? by bartender1382 (Beadle) on Apr 16, 2022 at 19:17 UTC
Awesome catch! I am using the Spreadsheet::Read module, which uses the command `my $book = ReadData ("$upload_dir/$filename");`. It will allow me to open the file, read it into a buffer myself, then hand off the buffer to the ReadData command. Sadly that's failing (see below), and I will have to debug more. Again, awesome catch! Glad I included the garbage.

    open my $fh, '<:raw:encoding(UTF-8)', "$upload_dir/$filename";
    read $fh, my $string, -s $fh;
    close $fh;
    my $book = ReadData ($string);
Re^3: Is there a way to make these two regex lines cleaner? by haukex (Archbishop) on Apr 16, 2022 at 19:30 UTC
"I am using the Spreadsheet::Read module."

That's an important piece of information missing from the root node! I am guessing that your files are CSV files? Because opening any other file type (XLS, XLSX, etc.) with an `'<:raw:encoding(UTF-8)'` will likely corrupt those files, and `ReadData($filename)` should be preferred there. And for CSV files, Spreadsheet::Read uses Text::CSV or Text::CSV_XS under the hood, both of which have a `detect_bom` option when used directly - unfortunately I currently don't see a way to get Spreadsheet::Read to apply that option, so unless Tux has any hints, you could use one of those two CSV modules directly. In any case, you may want to check your `$filename` to see if it's a CSV file first, before handing it off to the processing code appropriate for the file type.

Update: Regarding `read $fh, my $string, -s $fh;`, the idiomatic way to slurp a file in Perl is `my $string = do { local $/; <$fh> };` (see $/). Other minor edits. And you need to check your open for errors, see "open" Best Practices.
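The slurp idiom, the checked `open`, and the "is it a CSV file?" test could be combined roughly as follows. This is a sketch under assumptions: the helper names `slurp_utf8` and `looks_like_csv` are hypothetical, the extension check is only one possible way to detect CSV files, and the temporary file stands in for the uploaded file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: slurp a whole file as decoded text and
# die with the file name if it cannot be opened or closed.
sub slurp_utf8 {
    my ($path) = @_;
    open my $fh, '<:raw:encoding(UTF-8)', $path or die "$path: $!";
    my $content = do { local $/; <$fh> };   # undefined $/ reads to end of file
    close $fh or die "$path: $!";
    return $content;
}

# Hypothetical helper: decide by file extension whether a file is CSV.
sub looks_like_csv {
    my ($path) = @_;
    return $path =~ /\.csv\z/i ? 1 : 0;
}

# Stand-in for the uploaded file (assumption for this example).
my ($tmp_fh, $filename) = tempfile(SUFFIX => '.csv', UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "a,b\n1,2\n";
close $tmp_fh or die "$filename: $!";

if (looks_like_csv($filename)) {
    my $string = slurp_utf8($filename);
    print length($string), " characters read\n";
}
```

Only files that pass the CSV check go through the decoding layer here; other spreadsheet formats would be handed to `ReadData($filename)` untouched, as suggested above.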
Re^3: Is there a way to make these two regex lines cleaner? by swl (Parson) on Apr 16, 2022 at 23:59 UTC
You could also use File::BOM to open the file and then pass the file handle to Spreadsheet::Read.

    # untested
    use File::BOM qw( :all );
    use Spreadsheet::Read;

    open_bom(my $fh, $file, ':utf8');
    my $book = ReadData ($fh, parser => "csv");
Re: Is there a way to make these two regex lines cleaner? by hv (Prior) on Apr 16, 2022 at 21:03 UTC
For the first line: most characters do not need escaping in a character class, probably only '-' and ']'.

For the second line: coming soon to a perl near you:

    % perl -wle 'use builtin qw{trim}; $s=" foo bar "; $s=trim($s); print +"<$s>"'
    Built-in function 'builtin::trim' is experimental at -e line 1.
    <foo bar>
    %
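On a perl without `builtin::trim`, the usual regex can be wrapped in a small sub. This is a sketch only; the name `trim_ws` is made up, and it is not claimed to match `builtin::trim` in every edge case.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the string with leading and
# trailing whitespace removed; interior whitespace is left alone.
sub trim_ws {
    my ($s) = @_;
    $s =~ s/\A\s+|\s+\z//g;
    return $s;
}

print '<', trim_ws("  foo bar \n"), ">\n";   # prints <foo bar>
```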
Re: Is there a way to make these two regex lines cleaner? by kcott (Archbishop) on Apr 17, 2022 at 13:59 UTC
G'day bartender1382,

hv wrote: "... most characters do not need escaping in a character class, probably only '-' and ']'.". See "perlrecharclass: Special Characters Inside a Bracketed Character Class" for a discussion of this.

You have a duplicated double-quote in the class ('`\"`' and later '`"`') so you can lose one of those. Also note that '`-`' is special because it indicates a range; however, when it's the last character in the class, there is no range; so no special meaning and no escape required.

haukex suggested transliteration "should be a bit faster". In my experience, it is a lot faster. See "Search and replace or tr" in "Perl Performance and Optimization Techniques". If your line is just by itself, the improvement is unlikely to be noticeable; however, if it occurs in a loop, or a frequently called routine, it could make a big difference: run your own Benchmark to determine this. [In case you didn't know, `y///` and `tr///` are synonymous.]

Here's a script that shows the various points I made:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my $str = q{*a%Z(5)-["'] <.?!>};
    my $fmt = "%12s : %s\n";

    my $op_sub = $str;
    $op_sub =~ s/[^a-zA-Z0-9\-\"\'\ \.\?\!"]//g;
    printf $fmt, 'OP', $op_sub;

    my $no_dup_quote = $str;
    $no_dup_quote =~ s/[^a-zA-Z0-9\-\"\'\ \.\?\!]//g;
    printf $fmt, 'NO_DUP_QUOTE', $no_dup_quote;

    my $less_esc = $str;
    $less_esc =~ s/[^a-zA-Z0-9\-"' .?!]//g;
    printf $fmt, 'LESS_ESC', $less_esc;

    my $no_esc = $str;
    $no_esc =~ s/[^a-zA-Z0-9"' .?!-]//g;
    printf $fmt, 'NO_ESC', $no_esc;

    my $trans = $str;
    $trans =~ y/a-zA-Z0-9"' .?!-//cd;
    printf $fmt, 'TRANS', $trans;

Output:

              OP : aZ5-"' .?!
    NO_DUP_QUOTE : aZ5-"' .?!
        LESS_ESC : aZ5-"' .?!
          NO_ESC : aZ5-"' .?!
           TRANS : aZ5-"' .?!

So, if you choose substitution:

    s/[^a-zA-Z0-9"' .?!-]//g

if transliteration (would be my choice):

    y/a-zA-Z0-9"' .?!-//cd

- Ken
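The "run your own Benchmark" advice could be followed with something like the sketch below, using the core Benchmark module. The sub names and the repeated test string are assumptions for this example, and the timings will differ between machines and perl versions, so no figures are shown.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Test data: the sample string from above, repeated to give each
# call a bit more work (the repeat count is arbitrary).
my $str = q{*a%Z(5)-["'] <.?!>} x 100;

# Hypothetical wrappers around the two candidate statements.
sub clean_subst { my ($s) = @_; $s =~ s/[^a-zA-Z0-9"' .?!-]//g; return $s }
sub clean_trans { my ($s) = @_; $s =~ tr/a-zA-Z0-9"' .?!-//cd;  return $s }

# Sanity check: both approaches must produce the same result.
die "results differ" unless clean_subst($str) eq clean_trans($str);

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    subst => sub { clean_subst($str) },
    trans => sub { clean_trans($str) },
});
```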
Re^2: Is there a way to make these two regex lines cleaner? by hv (Prior) on Apr 17, 2022 at 17:01 UTC
"So, if you choose substitution: `s/[^a-zA-Z0-9"' .?!-]//g` if transliteration (would be my choice): `y/a-zA-Z0-9"' .?!-//cd`"

I'd still be inclined to escape the '-' in both cases: someone is all too likely to come along in a couple of years needing to add one more character to the allowed list, and the natural tendency would be to add it to the end.
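To illustrate the maintenance risk described above, here is a small sketch with a made-up allowed list (lowercase letters, '.', '-') to which a later edit appends '_'. The input string and variable names are assumptions for this example.

```perl
use strict;
use warnings;

my $input = 'ab<C>9_.-';

# Unescaped trailing hyphen, then '_' appended at the end: ".-_" is now
# a range from '.' (0x2E) to '_' (0x5F). It covers digits, uppercase
# letters, '<' and '>', while the hyphen itself (0x2D) is no longer allowed.
(my $accidental_range = $input) =~ s/[^a-z.-_]//g;

# Escaped hyphen: '_' is simply one more literal character.
(my $escaped = $input) =~ s/[^a-z.\-_]//g;

print "accidental range: $accidental_range\n";
print "escaped hyphen:   $escaped\n";
```

The first line printed is `ab<C>9_.` and the second is `ab_.-`: the unescaped version lets unwanted characters through and silently drops the hyphen it was meant to keep.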
Re^3: Is there a way to make these two regex lines cleaner? by AnomalousMonk (Archbishop) on Apr 17, 2022 at 22:55 UTC
A hyphen placed at the beginning of a character class or `tr///` search/replace list is also interpreted literally:

    Win8 Strawberry 5.8.9.5 (32) Sun 04/17/2022 18:43:19
    C:\@Work\Perl\monks
    >perl
    use strict;
    use warnings;

    my $s = '123-abc-456';
    $s =~ tr/-a-z//cd;
    print "'$s' \n";

    $s = '123-xyz-456';
    $s =~ s/[^-a-z]//g;
    print "'$s' \n";
    ^Z
    '-abc-'
    '-xyz-'

But one can argue that one is as likely to place new stuff at the start as at the end, so escaping remains wise. :)

Give a man a fish: `<%-{-{-{-<`
Re^3: Is there a way to make these two regex lines cleaner? by kcott (Archbishop) on Apr 18, 2022 at 04:40 UTC
I've been putting '`-`' at the end for a very long time (probably decades) and have never encountered the scenario you describe; however, I'm not averse to a bit of defensive programming. :-)

Update: The remainder of what I originally wrote is just wrong: I'll put it down to an idiotic brain fart. I've stricken it and, because it was quite long, removed it to a spoiler.

- Ken

