Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Final Report

Thanks for the responses, which fell into three groups:

  • Build a histogram.
  • Match a dynamically updated character class.
  • Consider doing something else instead.

The histogram is a classic recipe. When I ran kennethk's implementation against my big file, I added a printout showing all the character counts as well as the unused characters I'd been looking for. Although pipe occurred 43 times and tilde occurred once, there were in fact three printable ASCII characters that were never used.

The job ended up taking 79 minutes. Having heard that hash lookups are expensive, I was attracted by almut's suggestion to put the histogram in an array instead of a hash. That modification ran in 77 minutes.

Either the hash mechanism isn't that expensive after all, or a hash whose keys are single ASCII characters somehow achieves the same performance as an array.

The way to do this job fast is to quit looking at characters that have already been seen. I ran kennethk's correction (using quotemeta) to almut's illustration of how to dynamically generate a character class from a list, and it took only a couple of minutes (I didn't bother to put it in a harness to get an exact timing).

Thanks, finally, to all who pointed out that the solution to this puzzle has no business value. What I didn't mention was that we're writing a file to be read by Microsoft SQL Server Integration Services (SSIS). So one of the CSV formats is probably the way to go. My own preference had been to just use pack and generate a fixed-width file, but our SSIS developers think reading fixed-width data is too much trouble. I'm planning to spend the rest of the weekend Googling for ways in which SSIS might learn to read a configuration spec and unpack fixed-width data as easily as I know Perl can.


In reply to Re: Find what characters never appear by Narveson
in thread Find what characters never appear by Narveson

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others chanting in the Monastery: (3)
As of 2024-04-19 02:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found