Hash entries starting with a given string.

by SwiftEagle (Initiate)
on Jul 27, 2008 at 23:18 UTC

SwiftEagle has asked for the wisdom of the Perl Monks concerning the following question:

My current code involves a hash which counts occurrences of groups of characters.

So, for example:

%freq = ( aaa => 5, aab => 2, aac => 8, aad => 7, ... etc. );

However, rather than being hard coded the letter groups are generated by my script, and could potentially be any ASCII character, rather than just alphabetical.

What I need to do is find all the keys starting with a given string (such as 'aa' in the above example). So far the only approach I've come up with is sorting the result of keys() and then running through the relevant slice of the array, which could take a very long time (potentially 2^40 entries if I'm unlucky). It just seems there has to be a better way to do this.

Thanks in advance.

Replies are listed 'Best First'.
Re: Hash entries starting with a given string.
by shmem (Chancellor) on Jul 27, 2008 at 23:43 UTC

    DB_File's DB_BTREE provides partial match on keys of an ordered binary tree. Maybe you can stuff your hash into that. I used that in IP prefixes: match IP against "best" network.
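
    For illustration, a minimal sketch of that kind of partial-key lookup with DB_File's DB_BTREE (the file name and the 'aa' prefix here are just placeholders):

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie the counts to an on-disk B-tree so the keys stay sorted.
    my %freq;
    my $db = tie %freq, 'DB_File', 'counts.btree', O_RDWR|O_CREAT, 0666, $DB_BTREE
        or die "Cannot tie counts.btree: $!";

    my $prefix = 'aa';
    my( $key, $value ) = ( $prefix, 0 );

    # R_CURSOR positions the cursor at the first key >= $prefix;
    # walk forward with R_NEXT while the key still starts with the prefix.
    for( my $status = $db->seq( $key, $value, R_CURSOR );
         $status == 0 && index( $key, $prefix ) == 0;
         $status = $db->seq( $key, $value, R_NEXT ) )
    {
        print "$key => $value\n";
    }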

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Hash entries starting with a given string.
by BrowserUk (Patriarch) on Jul 28, 2008 at 03:21 UTC

    This works and is very simple to understand. As for efficiency: with the small numbers I've tried it on (<1e6 keys), creating the tree takes a few minutes and traversing it a few seconds.

    The basic idea is that since you will have to store your data externally somewhere, why not use the filesystem? Take the up-to-5-char strings (I've assumed that 2^40 is based on 256^5) and split them into characters. Convert the characters to their 8-bit ASCII values to avoid problems with restricted characters in pathnames, and then create a file at the resultant path to hold the counts.

    Eg. string 12345 becomes path yourbase/49/50/51/52/53.count.

    This file is opened/read/incremented/rewritten (or created and set to 1 the first time).

    Obtaining the counts for all paths below a certain prefix just becomes a recursive directory walk (though I've implemented it iteratively below), opening the files and reading the counts.

    It takes options -N=nnn for the number of random increments to perform and -PRE=ccc for the prefix to traverse.

    Set -N=1 to just traverse with different prefixes without creating rafts of new entries.

    The code:

    #! perl -slw
    use strict;

    our $N   ||= 1e3;
    our $PRE ||= '12';

    my $base = 'c:/test/700432/';

    sub incPath {
        my @path = unpack 'C*', $_[ 0 ];
        my $full = $base;
        mkdir $full .= $_ . '/' for @path[ 0 .. $#path -1 ];
        $full .= $path[ -1 ] . '.count';

        if( -e $full ) {
            open my $fh, '+<', $full or die "$! : $full";
            my $count = <$fh>;
            seek $fh, 0, 0;
            print $fh ++$count;
            close $fh;
        }
        else {
            open my $fh, '>', $full or die "$! : $full";
            print $fh 1;
            close $fh;
        }
    }

    sub traversePrefix {
        my( $path, $code ) = @_;
        my @path   = unpack 'C*', $path;
        my $prefix = $base . join '/', @path;
        return unless -e $prefix;

        my @dirs = $prefix;
        for my $dir ( @dirs ) {
            for my $file ( glob $dir . '*' ) {
                push( @dirs, $file . '/' ), next if -d $file;

                open my $fh, '<', $file or die "$! : $file";
                chomp( my $count = scalar <$fh> );

                ( $file ) = $file =~ m[^$base(.+).count$];
                my $key = pack 'C*', split '/', $file;
                $code->( $key, $count );
            }
        }
    }

    sub rndStr{ join'', @_[ map{ rand @_ } 1 .. shift ] }

    ## Generate $N keys (paths) and incr their counts
    for ( 1 .. $N ) {
        printf "\r$_\t";
        my $key = rndStr 1+int( rand 5 ), map chr, 0..255;
        incPath( $key );
    }

    ## Traverse starting from $PRE and print out the keys and counts
    traversePrefix $PRE, sub { print "@_"; };

    There are various ways this could be sped up. Primary amongst them would be to accumulate counts in a memory structure until that structure reached some preset size, then update the filesystem, discard the structure and start over.

    The biggest problem with that would be determining the current size of the memory structure. Devel::Size will do it, but it carries a fairly hefty time penalty doing so. (If there were an 'I know my structure is non-self-referencing, so don't bother checking' option, it could be sped up.)

    Another alternative would be to just accumulate a number of increments in memory before flushing to disk. That should have a dramatic benefit on performance without needing too much tuning. Hm. I might have a go at that later.
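
    A rough sketch of that batching idea (meant to slot into the script above and reuse its $base and path layout; the threshold and sub names are made up):

    my %pending;                   # in-memory increments awaiting a flush
    my $FLUSH_AT = 100_000;        # arbitrary threshold: buffered increments, not bytes

    sub bufferedInc {
        my( $key ) = @_;
        $pending{ $key }++;
        flushPending() if keys( %pending ) >= $FLUSH_AT;
    }

    sub flushPending {
        # One open/read/write per distinct key per flush, instead of one per increment.
        while( my( $key, $n ) = each %pending ) {
            my @path = unpack 'C*', $key;
            my $full = $base;
            mkdir $full .= $_ . '/' for @path[ 0 .. $#path - 1 ];
            $full .= $path[ -1 ] . '.count';

            my $count = 0;
            if( -e $full ) {
                open my $fh, '<', $full or die "$! : $full";
                chomp( $count = <$fh> );
                close $fh;
            }
            open my $fh, '>', $full or die "$! : $full";
            print $fh $count + $n;
            close $fh;
        }
        %pending = ();
    }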


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Hash entries starting with a given string.
by derby (Abbot) on Jul 28, 2008 at 00:22 UTC

    Well ... if your hash wasn't 2^40 entries (yikes!) ... grep would be the answer.

    #!/usr/bin/perl

    use strict;
    use warnings;

    my %freq = (
        aaa => 5, aab => 2, aac => 8, aad => 7,
        faa => 5, fab => 2, fac => 8, fad => 7,
        baa => 5, bab => 2, bac => 8, bad => 7,
        caa => 5, cab => 2, cac => 8, cad => 7,
    );

    my @keys = grep( /^aa/, keys %freq );

    foreach my $key ( @keys ) {
        print $key, " => ", $freq{$key}, "\n";
    }

    -derby
Re: Hash entries starting with a given string.
by moritz (Cardinal) on Jul 28, 2008 at 06:33 UTC
    One possible solution is to build your hash as a tree (that is a nested hash) in the first place:
    my %freq = (
        aa => { a => 5, b => 2, c => 8, ... },
        ab => { ... },
    );

    You can then get all two letter prefixes with a simple hash access. If you want to access one-letter prefixes as well, nest beginning from the first level instead.
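
    For example, a rough sketch of nesting one level per character from the first level (the sub names are made up):

    my %trie;

    # Record one occurrence of $str, one nesting level per character.
    sub count_str {
        my( $str ) = @_;
        my $node = \%trie;
        $node = $node->{ $_ } ||= {} for split //, $str;
        $node->{ count }++;    # multi-char key, cannot collide with single-char keys
    }

    # Return ( string => count ) pairs for every stored string starting with $prefix.
    sub under_prefix {
        my( $prefix ) = @_;
        my $node = \%trie;
        for my $ch ( split //, $prefix ) {
            $node = $node->{ $ch } or return;
        }
        my %found;
        my @todo = ( [ $prefix, $node ] );
        while( my $item = shift @todo ) {
            my( $str, $n ) = @$item;
            $found{ $str } = $n->{ count } if exists $n->{ count };
            push @todo, [ $str . $_, $n->{ $_ } ] for grep { length == 1 } keys %$n;
        }
        return %found;
    }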

Re: Hash entries starting with a given string.
by swampyankee (Parson) on Jul 28, 2008 at 02:11 UTC

    2^40 is about 10^12 entries, so if you're unlucky, you may be moving into the territory of very large databases (see, for example, this site).

    It may help if you give a description of what sort of problem this is intended to solve. Somebody here may have a better algorithm. Of course, if you've already got the optimal solution, you may want to write it up in a paper to present at one of the upcoming VLDB conferences.


    Information about American English usage here and here. Floating point issues? Please read this before posting. — emc

Re: Hash entries starting with a given string.
by TedPride (Priest) on Jul 28, 2008 at 05:17 UTC
    Why are you doing this? You can theoretically create a sorted list in n lg n time, or create additional hashes for every searchable substring length, but with 2^40 entries - or even a fraction of that - you won't have enough memory to hold the original hash, never mind any additional data structures.
Re: Hash entries starting with a given string.
by dHarry (Abbot) on Jul 28, 2008 at 07:34 UTC

    Recently we ended up with a matrix of 4 TB (calibration information for calibrating infrared images). In theory it would have fit on the server, but doing matrix operations on a matrix of that size is asking for trouble. So we had to retrace our steps and give it another thought, and we could reduce the size dramatically. It turned out that only a fraction of the data was really what we needed.

    In your case: 2^40 = 1099511627776.

    So this would result in about 10^12 hash entries. I assume you need to do lots of searches in the data. (Why else the hash?)

    Now I don’t know how much Perl internally uses for representing a hash. But let’s take a small value like 10 bytes for each hash entry (key value pair). (It’s probably more!) Then this already results in approximately 10 TB (10 bytes * 2^40 entries is roughly 11 * 10^12 bytes). Clearly a Perl hash is not the way forward.

    BrowserUk suggested using the file system. I have implemented something similar in the past. The size was “only” one TB and it worked fine. But there was a difference; the tree structure was very simple in my case, only a few levels deep, and the files stored were relatively big (TIFF images). Storing many small files in a deeper tree.... I don’t know.

    The best approach to me seems to be the VLDB approach suggested by swampyankee, because that’s basically what you have/want: a VERY LARGE DB.

    It's probably a good idea to rethink your problem.

      I don’t know how much Perl internally uses for representing a hash. But let’s take a small value like 10 bytes for each hash entry (key value pair).

      For a 5-byte key and a simple integer value, nearly 48 bytes per pair:

      c:\test>p1
      use Devel::Size qw[ total_size size ];;
      $h{ pack 'C*', map int rand 256, 1 .. 5 }=1 for 1 .. 1e6;;

      print scalar keys %h;;
      1000000

      print total_size( \%h ) / 1e6;;
      47.194364
      Then this already results in approximately 10 TB’s (2^40 / 1000^4). Clearly a Perl hash is not the way forward.

      There is room for ambiguity in interpreting the OP's post. He said: "(potentially 2^40 entries if I'm unlucky)".

      I'm assuming that by "ASCII" he's talking about 8-bit chars. And when he suggested 2^40 (256^5, i.e. 5 (8-bit) byte keys) was possible, I'm guessing that this is probably a (very) sparsely populated range.

      BrowserUk suggested using the file system. I have implemented something similar in the past. The size was “only” one TB and it worked fine. But there was a difference; the tree structure was very simple in my case, only a few levels deep, and the files stored were relatively big (TIFF images). Storing many small files in a deeper tree.... I don’t know.

      Well. If he ever comes back and responds to the suggestion, then I'd suggest a better mechanism that uses 3 levels of directories and a single 256KB file (256*256*4 bytes) to represent the final 2 dimensions. Reducing the depth of the directory structure from 4 to 3 reduces the number of directories involved from (max) 4 billion to 16 million, and the (max) number of files involved from 1 trillion to 4 billion.

      In addition, instead of each of those files using 4KB to represent a single number, the combined files will use just 4 bytes per count. And if you then make those combined files sparse, whole 16K ranges (e.g. ...aa through ...ay) will be represented by just 4 bytes if they are not used.

      Assuming 32-bit counts and a 10% loading factor, the minimum 4-terabyte storage requirement (fully populated, 32-bit counts) might be reduced to 125 GB. And if the possibilities for duplicates allow 2-byte or 1-byte counts, cut that in half or half again.
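
      A rough sketch of that packed-file layout (the paths, the helper name and the big-endian 'N' counts are my assumptions, not anything posted in the thread):

      use strict;
      use warnings;
      use Fcntl qw( O_RDWR O_CREAT SEEK_SET );

      my $base = 'c:/test/700432b/';

      sub incPacked {
          my @c = unpack 'C*', $_[ 0 ];              # assumes exactly 5 bytes per key
          my $dir = $base;
          mkdir $dir .= "$_/" for @c[ 0 .. 2 ];      # first 3 chars -> 3 directory levels
          my $file   = $dir . 'counts';              # one 256*256*4-byte file per directory
          my $offset = ( $c[ 3 ] * 256 + $c[ 4 ] ) * 4;

          sysopen my $fh, $file, O_RDWR | O_CREAT or die "$! : $file";
          binmode $fh;

          my $buf = '';
          sysseek $fh, $offset, SEEK_SET;
          sysread $fh, $buf, 4;
          $buf = "\0" x 4 if length $buf < 4;        # hole or past EOF: count is zero

          sysseek $fh, $offset, SEEK_SET;
          syswrite $fh, pack( 'N', unpack( 'N', $buf ) + 1 ), 4;
          close $fh;
      }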

      But the truth is, I think this question is probably similar to the guy a year or two ago that wanted to generate all the possible 1024x768 24-bit images and then scan through them looking for the ones that showed starlets in interesting poses.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Interesting solution and food for thought. We need to know more about the problem (as usual).

Re: Hash entries starting with a given string.
by SwiftEagle (Initiate) on Jul 28, 2008 at 11:35 UTC

    Thinking about the problem in the light of day (rather than stupid-o'clock in the morning), I'm guessing I may need to work on a very different implementation.

    The actual purpose of the script is of linguistic interest. One of the ways of generating text that seems like English, without actually being English, is to generate strings of characters where each next character is selected based on the probability of it appearing in text, given the previous n characters. The code I'm using to store probabilities effectively reads through the text and increments the hash entry for each substring of n characters.

    So, looking at it in the harsh light of morning, and thinking about what I am actually saying (2^40, which is indeed referring to a 5 character string, is a very large number), I'm probably going to have to find a new way to deal with this...
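
    As a rough sketch of that counting scheme (the names and the context length are illustrative): keying a hash of hashes on the n-character context makes "all keys with this prefix" a single hash access, much as moritz suggested above.

    use strict;
    use warnings;

    my $n = 4;          # context length; each counted substring is $n + 1 chars
    my %next;           # $next{ $context }{ $next_char } = count

    sub count_text {
        my( $text ) = @_;
        for my $i ( 0 .. length( $text ) - $n - 1 ) {
            my $ctx = substr $text, $i, $n;
            my $ch  = substr $text, $i + $n, 1;
            $next{ $ctx }{ $ch }++;
        }
    }

    # Weighted pick of the next character given the last $n characters.
    sub pick_next {
        my( $ctx ) = @_;
        my $dist  = $next{ $ctx } or return;
        my $total = 0;
        $total += $_ for values %$dist;
        my $r = rand $total;
        for my $ch ( keys %$dist ) {
            return $ch if ( $r -= $dist->{ $ch } ) < 0;
        }
        return;
    }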

      If, as it would seem, the definition of this problem looks at the string as a concatenation of interesting values ... such as prefixes and suffixes of an English word ... then perhaps what you really need here is “a hash of hashes.”

      In other words, if your problem looks at the string “prefoobar” as “pre + foobar,” then the first-level hash (containing “pre”) refers to a subsequent hash that contains “foobar.”

      If, on the other hand, it looks at the string as a string of n characters where the-important-thing is “what is the probability of character x occurring right-here,” perhaps what you need is a nested set of hashes per-character.

      At each position in the tree, you would store (for each character):

      1. The probability of “this character”
      2. A reference to a hash of possible “next characters” each one similarly (recursively...) described.

      In an application like this, your choice of “appropriate data-structure” will be paramount. Make very sure that your chosen data-structure works hard for you. Your data structure should not merely be “a passive-something that other parts of the program can conveniently (and repetitively...) search.” It must be a tool that the program can leverage to efficiently produce the solution.

      “Strange” or “awkward” as such-a-structure may feel as you are writing it ... it's the runtime-performance that really matters here. Given the right data-representation (and Perl Is No Slouch!), performance should be “impressive.” (If it isn't, look harder.) Perl is an industrial-strength power tool that eats requirements like this one for lunch...
