Find what characters never appear

Narveson has asked for the wisdom of the Perl Monks concerning the following question:

Re: Find what characters never appear by almut (Canon) on Sep 04, 2009 at 21:34 UTC
For every character in the file, set (or increment) `$seen[ord($ch)]`. When you're through the file, the unset elements of the array `@seen` (indices 0..255) are the bytes that didn't occur...
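A minimal sketch of the lookup-table idea above, assuming the file is read as raw bytes in fixed-size blocks and that "printable" means codes 33..126. The sub name `unseen_chars`, the block size, and the sample string are made up for illustration and are not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the printable ASCII characters (codes 33..126) that never
# occur in the given file handle. The file is read in blocks, so it
# never has to fit in memory all at once.
sub unseen_chars {
    my ($fh) = @_;
    my @seen;                                   # indexed by byte value
    binmode $fh;                                # treat input as raw bytes
    while (read($fh, my $buf, 65536)) {
        $seen[$_] = 1 for unpack 'C*', $buf;    # mark every byte value present
    }
    return map { chr $_ } grep { !$seen[$_] } 33 .. 126;
}

# Demo on an in-memory file (hypothetical sample data).
open my $fh, '<', \"hello, world|~" or die $!;
print join(' ', unseen_chars($fh)), "\n";
```

This still touches every byte of the input, so it answers the question but is not expected to be faster than the hash-based histogram.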
Re^2: Find what characters never appear by Narveson (Chaplain) on Sep 04, 2009 at 22:20 UTC
If we've seen `$chr` once, can we somehow avoid repeating the assignment to `$seen[ord($chr)]` during the rest of the read? Can we avoid even testing `$seen[ord($chr)]`? I'd like to make a regex that matches any of our dwindling array of unseen characters, and update this regex every time I update `$seen`. Has anybody done this?
Re^3: Find what characters never appear by kennethk (Abbot) on Sep 04, 2009 at 23:21 UTC
If you want to avoid potential issues w/ regex metacharacters, you can use a set of hash keys to track what's been seen and rebuild the regex once for each character:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %char_hash = ();
    $char_hash{ chr($_) } = undef foreach (33 .. 127);
    my $chars = join "", keys %char_hash;
    my $regex = "([\Q$chars\E])";

    while (<DATA>) {
        while (/$regex/g) {
            delete $char_hash{$1};
            $chars = join "", keys %char_hash;
            $regex = "([\Q$chars\E])";
        }
    }

    my @good_array = keys %char_hash;
    print @good_array;

    __DATA__
    !"#$%&'()*+,-./01234567
    89:;<=>?@ABCDE
    FGHIJKLMOPQRSTUVWXYZ[\]^_`abcdefghijklmnop
    qrstuvwxyz{\|}~

though I feel like there must be a simpler way of implementing this approach.
Re^4: Find what characters never appear by Narveson (Chaplain) on Sep 05, 2009 at 13:35 UTC
Re^3: Find what characters never appear by almut (Canon) on Sep 04, 2009 at 23:01 UTC
Maybe something like this (demo with reduced charset):

    #!/usr/bin/perl
    my $s = "fccccaaaaeaaaddaaaaabbcccaaacaaabbaaaa";
    my $set = "[abcdefg]";
    while ($s =~ /($set)/g) {
        my $ch = $1;
        $set =~ s/$ch//;    # remove $ch from search set
        printf "found %s at %d -> regex now: %s\n", $ch, pos($s), $set;
    }
    __END__
    found f at 1 -> regex now: [abcdeg]
    found c at 2 -> regex now: [abdeg]
    found a at 6 -> regex now: [bdeg]
    found e at 10 -> regex now: [bdg]
    found d at 14 -> regex now: [bg]
    found b at 21 -> regex now: [g]

Update: kennethk noted that you would run into complications with regex metacharacters with this simple approach (when using the full ASCII set), which is of course correct...
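One way the shrinking character class might be combined with `quotemeta`, so that candidates such as `]`, `^`, `-` and `\` are safe inside the brackets. This is only a sketch under assumptions: the sub name `never_seen`, the candidate list, and the sample input are hypothetical, and it is not the exact code anyone in the thread ran.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the members of @candidates that never occur in the input.
# The character class is rebuilt every time a new character is found,
# so the regex engine skips over characters that were already seen.
sub never_seen {
    my ($fh, @candidates) = @_;
    my %unseen = map { $_ => 1 } @candidates;
    return () unless %unseen;                   # an empty class would not compile
    my $class = join '', map { quotemeta } keys %unseen;
    LINE: while (my $line = <$fh>) {
        while ($line =~ /([$class])/g) {        # resumes at pos($line)
            delete $unseen{$1};
            last LINE unless %unseen;           # nothing left to look for
            $class = join '', map { quotemeta } keys %unseen;
        }
    }
    return sort keys %unseen;
}

# Demo with metacharacters among the candidates (hypothetical data).
open my $fh, '<', \"a-b]c^\n\\d\n" or die $!;
print join(' ', never_seen($fh, 'a', 'e', '-', ']', '^', '\\', '|')), "\n";
# prints: e |
```

The early exit once every candidate has been seen is an addition of this sketch; when at least one character really is absent, the whole file still has to be scanned.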
Re: Find what characters never appear by kennethk (Abbot) on Sep 04, 2009 at 21:40 UTC
How about something like:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %char_hash = ();
    $char_hash{ chr($_) } = 0 foreach (33 .. 127);

    while (<DATA>) {
        map $char_hash{$_}++, split //;
    }

    my @good_array = grep { $char_hash{$_} == 0 } keys %char_hash;
    print @good_array;

    __DATA__
    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{\|}~

where you may want to change the bounds on the first foreach loop.
Re^2: Find what characters never appear by Narveson (Chaplain) on Sep 04, 2009 at 22:29 UTC
Thanks, this has been running against my big file for the past half hour. I think it will work, but it won't finish for another half hour or so, and meanwhile my ride is coming to take me home.
Re: Find what characters never appear by MidLifeXis (Monsignor) on Sep 04, 2009 at 21:57 UTC
Is the available character that you find guaranteed to never be used? This sounds like a use case for one of the CSV modules.
Re^2: Find what characters never appear by Narveson (Chaplain) on Sep 04, 2009 at 22:13 UTC
I agree.
Re^3: Find what characters never appear by AnomalousMonk (Archbishop) on Sep 04, 2009 at 23:12 UTC
I second your agreement, especially since the 11th corollary of Finagle's Fifth Law of Dynamic Disappointment states that "any file that is guaranteed not to contain a given character at the current moment is guaranteed to contain that character at some future moment, and probably sooner rather than later".
Re: Find what characters never appear by ikegami (Patriarch) on Sep 05, 2009 at 05:46 UTC
Just use Text::CSV_XS, and it'll quote the fields it needs to quote.
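A short sketch of what that looks like, assuming Text::CSV_XS is installed from CPAN and that its default comma separator and double-quote quoting are acceptable. The sample row is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

# binary => 1 permits embedded newlines and non-ASCII bytes in fields.
my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot create CSV object: " . Text::CSV_XS->error_diag;

# Hypothetical record: the second field contains the separator and the
# third contains the quote character, so both need quoting.
my @row = ('plain', 'has,comma', 'has "quotes"');

$csv->combine(@row) or die "combine failed";
my $line = $csv->string;
print "$line\n";    # prints: plain,"has,comma","has ""quotes"""

# Round trip: parse the line back into the original fields.
$csv->parse($line) or die "parse failed";
my @back = $csv->fields;
```

With quoting handled by the module, the delimiter no longer has to be a character that is absent from the data.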
Re: Find what characters never appear by kwaping (Priest) on Sep 04, 2009 at 22:22 UTC
I don't know if this will work for you, but in the past when I've had similar challenges, I've used a string of characters as my separator pattern instead of a single character. For example, `#\|+\|#`. The odds of a pattern occurring naturally are less than those of a single character.
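A sketch of the multi-character separator idea, assuming the separator is the literal string `#|+|#` (one reading of the pattern above) and that it is matched literally via `\Q...\E`. The helper names `join_fields` and `split_fields` are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $SEP = '#|+|#';    # literal separator string (an assumption)

# Join fields with the separator.
sub join_fields { return join $SEP, @_ }

# Split a record on the literal separator; \Q...\E disables the regex
# meaning of | and +, and the -1 limit keeps trailing empty fields.
sub split_fields { return split /\Q$SEP\E/, $_[0], -1 }

# Fields may contain single '|' or '#' characters without being split.
my $record = join_fields('a|b', 'c#d', '');
my @fields = split_fields($record);
print scalar(@fields), " fields\n";    # prints: 3 fields
```

This lowers the odds of a collision but does not remove them: a field that happens to contain the full separator string would still be split.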
Re: Find what characters never appear by bv (Friar) on Sep 04, 2009 at 21:44 UTC
Just a chance this might work:

    my %chars;
    @chars{0 .. 127} = undef;
    while (<>) {
        for (split //) {
            delete $chars{ord $_};
        }
    }
    print "Chars never seen:\n";
    $, = "\n";
    print keys %chars;

You'd have to do some logic to filter out unprintables and to show the actual characters, rather than their decimal ASCII value.
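The filtering step mentioned above could look roughly like this, assuming "printable" means codes 33..126 and that the result should be the characters themselves rather than their codes. The sub name `printable_unseen` and the demo data are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the printable ASCII characters (codes 33..126) absent from the input.
sub printable_unseen {
    my ($fh) = @_;
    my %chars;
    @chars{ 33 .. 126 } = ();                   # keys are character codes
    while (my $line = <$fh>) {
        # Hash-slice delete: drop the code of every character on this line.
        delete @chars{ map { ord $_ } split //, $line };
        last unless %chars;                     # every candidate has been seen
    }
    return map { chr $_ } sort { $a <=> $b } keys %chars;
}

# Demo: every printable character except 'N' and '~' (hypothetical data).
my $data = join '', grep { $_ ne 'N' && $_ ne '~' } map { chr $_ } 33 .. 126;
open my $fh, '<', \$data or die $!;
print join(' ', printable_unseen($fh)), "\n";   # prints: N ~
```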
Re^2: Find what characters never appear by Narveson (Chaplain) on Sep 04, 2009 at 22:08 UTC
This ought to work. I appreciate the elegance of using the hash keys to keep track of a set, without ever updating any hash values (since after all it's only the keys that we need). The solution I actually ran was kennethk's above, which took over an hour. I suspect this one would take about as long, because even though it doesn't update any counts, it still reads every single character in the file. Thanks, your solution is running against my file right now. Launched it right about 22:00 GMT. I estimate it will finish in just under an hour.
Re: Find what characters never appear by graff (Chancellor) on Sep 05, 2009 at 05:21 UTC
You said: "can I find a printable ASCII character that never appears in our big file?"

If it happens to be true that the data values to be separated are all entirely in the ASCII range, why not just use a single-byte non-ASCII character as the delimiter -- e.g. 0xA0 "non-breaking-space", or if you want it to be visible, 0xA1 "inverted exclamation mark" or 0xB0 "degree mark" or ... (there are several nice candidates). The situations where you might encounter a non-8-bit-clean process or channel are virtually non-existent these days.

For that matter, is it really entirely mandatory that the character be printable? It seems hard to imagine that making it "look nice" should be an important factor for a 2GB file. Who's going to be looking at it?
Re: Find what characters never appear by Narveson (Chaplain) on Sep 06, 2009 at 14:32 UTC
Final Report

Thanks for the responses, which fell into three groups:

- Build a histogram.
- Match a dynamically updated character class.
- Consider doing something else instead.

The histogram is a classic recipe. When I ran kennethk's implementation against my big file, I added a printout showing all the character counts as well as the unused characters I'd been looking for. Although pipe occurred 43 times and tilde occurred once, there were in fact three printable ASCII characters that were never used. The job ended up taking 79 minutes.

Having heard that hash lookups are expensive, I was attracted by almut's suggestion to put the histogram in an array instead of a hash. That modification ran in 77 minutes. Either the hash mechanism isn't that expensive after all, or a hash whose keys are single ASCII characters somehow achieves the same performance as an array.

The way to do this job fast is to quit looking at characters that have already been seen. I ran kennethk's correction (using `quotemeta`) to almut's illustration of how to dynamically generate a character class from a list, and it took only a couple of minutes (I didn't bother to put it in a harness to get an exact timing).

Thanks, finally, to all who pointed out that the solution to this puzzle has no business value. What I didn't mention was that we're writing a file to be read by Microsoft SQL Server Integration Services (SSIS). So one of the CSV formats is probably the way to go.

My own preference had been to just use `pack` and generate a fixed-width file, but our SSIS developers think reading fixed-width data is too much trouble. I'm planning to spend the rest of the weekend Googling for ways in which SSIS might learn to read a configuration spec and unpack fixed-width data as easily as I know Perl can.
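For readers unfamiliar with the fixed-width approach, here is a small sketch of `pack` and `unpack` with the `A` (space-padded text) format. The three-column layout and the sample values are invented for illustration and say nothing about the actual file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical layout: three text columns of 10, 8 and 5 bytes.
# "A" pads with spaces on pack and strips trailing spaces on unpack.
my $TEMPLATE = 'A10 A8 A5';

my $record = pack $TEMPLATE, 'alpha', 'beta', '00042';
my @fields = unpack $TEMPLATE, $record;

printf "[%s] is %d bytes\n", $record, length $record;
print join('|', @fields), "\n";    # prints: alpha|beta|00042
```

No delimiter is needed because field boundaries come from the template, but `A` silently truncates any value longer than its column, so widths have to be chosen with the data in mind.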

