Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Find what characters never appear

by Narveson (Chaplain)
on Sep 04, 2009 at 21:14 UTC ( [id://793599]=perlquestion: print w/replies, xml ) Need Help??

Narveson has asked for the wisdom of the Perl Monks concerning the following question:

My team was debating the idea of writing a large file using pipe characters to delimit the fields, and I pointed out that our data includes pipe characters. So the tilde was suggested, but I found some of those too.

So the obvious question arises, can I find a printable ASCII character that never appears in our big file?

I'd like to do this with just a single pass through the file. For locating a pipe character, all I had to say was

while (<>) {last if /\|/} printf "| seen at index %d in %s\n", index($_, '|'), $_;

Update: Should have mentioned that I'm going to run any executable solutions against a 2 GB file. So efficiency may be an issue.

Later: Thanks for the insights! I applaud almut's choice of an indicator array @seen, updated with $seen[ord $chr]++, which should be more efficient than the hash version $seen{$chr}++.

I think I will try kennethk's solution using \Q, both as written and then with the indicator array, as soon as my daughters are in bed and I can connect to my server again.

(Just a minute, sweetie, Daddy will be right with you ...)

It's true the indicator array doesn't offer a builtin keys function to list the seen items, but a grep should do just fine.

Replies are listed 'Best First'.
Re: Find what characters never appear
by almut (Canon) on Sep 04, 2009 at 21:34 UTC

    For every character in the file, set (or increment) $seen[ord($ch)]. When you're through the file, the unset elements of the array @seen (indices 0..255) are the bytes that didn't occur...

      If we've seen $chr once, can we somehow avoid repeating the assignment to $seen[ord($chr)] during the rest of the read?

      Can we avoid even testing $seen[ord($chr)]?

      I'd like to make a regex that matches any of our dwindling array of unseen characters, and update this regex every time I update $seen. Has anybody done this?

        If you want to avoid potential issues w/ regex metacharacters, you can use a set of hash keys to track what's been seen and rebuild the regex once for each character:

        #!/usr/bin/perl use strict; use warnings; my %char_hash = (); $char_hash{ chr($_) } = undef foreach (33 .. 127); my $chars = join "", keys %char_hash; my $regex = "([\Q$chars\E])"; while (<DATA>) { while (/$regex/g) { delete $char_hash{$1}; $chars = join "", keys %char_hash; $regex = "([\Q$chars\E])"; } } my @good_array = keys %char_hash; print @good_array; __DATA__ !"#$%&'()*+,-./01234567 89:;<=>?@ABCDE FGHIJKLMOPQRSTUVWXYZ[\]^_`abcdefghijklmnop qrstuvwxyz{|}~

        though I feel like there must be a simpler way of implementing this approach.

        Maybe something like this  (demo with reduced charset):

        #!/usr/bin/perl my $s = "fccccaaaaeaaaddaaaaabbcccaaacaaabbaaaa"; my $set = "[abcdefg]"; while ($s =~ /($set)/g) { my $ch = $1; $set =~ s/$ch//; # remove $ch from search set printf "found %s at %d -> regex now: %s\n", $ch, pos($s), $set; } __END__ found f at 1 -> regex now: [abcdeg] found c at 2 -> regex now: [abdeg] found a at 6 -> regex now: [bdeg] found e at 10 -> regex now: [bdg] found d at 14 -> regex now: [bg] found b at 21 -> regex now: [g]

        Update: kennethk noted that you would run into complications with regex metacharacters with this simple approach (when using the full ASCII set) — which is of course correct...

Re: Find what characters never appear
by kennethk (Abbot) on Sep 04, 2009 at 21:40 UTC
    How about something like:

    #!/usr/bin/perl use strict; use warnings; my %char_hash = (); $char_hash{ chr($_) } = 0 foreach (33 .. 127); while (<DATA>) { map $char_hash{$_}++, split //; } my @good_array = grep{ $char_hash{$_} == 0 } keys %char_hash; print @good_array; __DATA__ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMOPQRSTUVWXYZ[\]^_`abcdefg +hijklmnopqrstuvwxyz{|}~

    where you may want to change the bounds on the first foreach loop.

      Thanks, this has been running against my big file for the past half hour. I think it will work, but it won't finish for another half hour or so, and meanwhile my ride is coming to take me home.

Re: Find what characters never appear
by MidLifeXis (Monsignor) on Sep 04, 2009 at 21:57 UTC

    Is the available character that you find guaranteed to never be used? This sounds like a use case for one of the CSV modules.

    --MidLifeXis

    Please consider supporting my wife as she walks in the 2009 Alzheimer's Walk.

      I agree.

        I second your agreement, especially since the 11th corollary of Finagle's Fifth Law of Dynamic Disappointment states that "any file that is guaranteed not to contain a given character at the current moment is guaranteed to contain that character at some future moment, and probably sooner rather than later".
Re: Find what characters never appear
by ikegami (Patriarch) on Sep 05, 2009 at 05:46 UTC
    Just use Text::CSV_XS, and it'll quote the fields it needs to quote.
Re: Find what characters never appear
by kwaping (Priest) on Sep 04, 2009 at 22:22 UTC
    I don't know if this will work for you, but in the past when I've had similar challenges, I've used a string of characters as my separator pattern instead of a single character. For example, #|+|#. The odds of a pattern occuring naturally are less than those of a single character.

    ---
    It's all fine and dandy until someone has to look at the code.
Re: Find what characters never appear
by bv (Friar) on Sep 04, 2009 at 21:44 UTC

    Just a chance this might work:

    my %chars; @chars{0 .. 127}=undef; while(<>) { for (split //) { delete $chars{ord $_} ; } } print "Chars never seen:\n"; $,="\n"; print keys %chars;

    You'd have to do some logic to filter out unprintables and to show the actual characters, rather than their decimal ASCII value

    print pack("A25",pack("V*",map{1919242272+$_}(34481450,-49737472,6228,0,-285028276,6979,-1380265972)))

      This ought to work. I appreciate the elegance of using the hash keys to keep track of a set, without ever updating any hash values (since after all it's only the keys that we need).

      The solution I actually ran was kennethk's above, which took over an hour. I suspect this one would take about as long, because even though it doesn't update any counts, it still reads every single character in the file.

      Thanks, your solution is running against my file right now.

      Launched it right about 22:00 GMT. I estimate it will finish in just under an hour.

Re: Find what characters never appear
by graff (Chancellor) on Sep 05, 2009 at 05:21 UTC
    You said:

    can I find a printable ASCII character that never appears in our big file?

    If it happens to be true that the data values to be separated are all entirely in the ASCII range, why not just use a single-byte non-ASCII character as the delimiter -- e.g. 0xA0 "non-breaking-space", or if you want it to be visible, 0xA1 "inverted exclamation mark" or 0xB0 "degree mark" or ... (there are several nice candidates).

    The situations where you might encounter a non-8-bit-clean process or channel are virtually non-existent these days.

    For that matter, is it really entirely mandatory that the character be printable? It seems hard to imagine that making it "look nice" should be an important factor for a 2GB file. Who's going to be looking at it?

Re: Find what characters never appear
by Narveson (Chaplain) on Sep 06, 2009 at 14:32 UTC

    Final Report

    Thanks for the responses, which fell into three groups:

    • Build a histogram.
    • Match a dynamically updated character class.
    • Consider doing something else instead.

    The histogram is a classic recipe. When I ran kennethk's implementation against my big file, I added a printout showing all the character counts as well as the unused characters I'd been looking for. Although pipe occurred 43 times and tilde occurred once, there were in fact three printable ASCII characters that were never used.

    The job ended up taking 79 minutes. Having heard that hash lookups are expensive, I was attracted by almut's suggestion to put the histogram in an array instead of a hash. That modification ran in 77 minutes.

    Either the hash mechanism isn't that expensive after all, or a hash whose keys are single ASCII characters somehow achieves the same performance as an array.

    The way to do this job fast is to quit looking at characters that have already been seen. I ran kennethk's correction (using quotemeta) to almut's illustration of how to dynamically generate a character class from a list, and it took only a couple of minutes (I didn't bother to put it in a harness to get an exact timing).

    Thanks, finally, to all who pointed out that the solution to this puzzle has no business value. What I didn't mention was that we're writing a file to be read by Microsoft SQL Server Integration Services (SSIS). So one of the CSV formats is probably the way to go. My own preference had been to just use pack and generate a fixed-width file, but our SSIS developers think reading fixed-width data is too much trouble. I'm planning to spend the rest of the weekend Googling for ways in which SSIS might learn to read a configuration spec and unpack fixed-width data as easily as I know Perl can.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://793599]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (5)
As of 2024-04-18 03:56 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found