PerlMonks  

Maybe it hasn't been written because the code is fairly trivial, but the details are highly application specific?

Something like the following would seem to be what you need to do the bucketing:

use strict;
use warnings;

# Map each lowercased first character to a list of [start, end] byte offsets.
my %buckets;
my $lineStart = tell DATA;

while (<DATA>) {
    chomp;
    # Blank lines are skipped without advancing $lineStart, so the next
    # entry's range also covers the blank line in front of it.
    next unless length;
    push @{$buckets{lc substr $_, 0, 1}}, [$lineStart, tell DATA];
    $lineStart = tell DATA;
}

# Report the offset pairs recorded for each bucket.
for my $key (sort keys %buckets) {
    my @pairs = map {"@$_"} @{$buckets{$key}};
    print "$key: ", (join ', ', @pairs), "\n";
}

__DATA__
Ok, let's say file A has a series of strings, one per line. Let's say that file
B has a series of strings, one per line.

The goal is, for each line in A, to return the best match from B using a
subroutine named fuzzy_match, a function that takes two strings and returns a
float from 0 to 1.

Now, let's assume that file B is enormous, making the prospect of applying
fuzzy_match to each member infeasible. But let's also assume that the first
character of each member of B will always be the best result from fuzzy_match
for A. This means that instead of looking through all of B, you simply need to
retrieve all records from B which start with the same first letter as the
current record in A.

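Prints (each pair is the start and end byte offset of one line, as reported by tell DATA, so the numbers count from the top of the script file):

b: 458 499
c: 822 900, 1053 1074
f: 651 670, 746 822, 900 979
n: 670 746
o: 378 458
r: 979 1053
s: 573 651
t: 499 573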
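
For what it's worth, here is a rough sketch of how offset pairs like these might then be used for the lookup: seek to each range stored under the query's first letter, read just that record, and score it. The names build_index and best_match, the sample data, and the body of fuzzy_match are all made up for illustration. The thread only says that fuzzy_match takes two strings and returns a float from 0 to 1, so the prefix-based scoring below is a placeholder, not the real function.

```perl
use strict;
use warnings;

# Hypothetical stand-in for fuzzy_match: the real scoring function is not
# shown in the thread, so this one scores by shared leading characters.
sub fuzzy_match {
    my ($left, $right) = @_;
    my $longest = length($left) > length($right) ? length($left) : length($right);
    return 0 unless $longest;
    my $same = 0;
    $same++ while $same < length($left)
        && $same < length($right)
        && lc(substr($left, $same, 1)) eq lc(substr($right, $same, 1));
    return $same / $longest;
}

# Map each lowercased first character to [start, end] byte ranges in the file.
sub build_index {
    my ($fh) = @_;
    my %buckets;
    seek $fh, 0, 0;
    my $start = tell $fh;
    while (my $line = <$fh>) {
        my $end = tell $fh;
        chomp $line;
        # Blank lines get no entry.
        push @{ $buckets{ lc(substr($line, 0, 1)) } }, [$start, $end]
            if length $line;
        $start = $end;
    }
    return \%buckets;
}

# Score only the records in the bucket matching the query's first character.
sub best_match {
    my ($fh, $buckets, $query) = @_;
    my ($best, $bestScore) = (undef, -1);
    for my $range (@{ $buckets->{ lc(substr($query, 0, 1)) } || [] }) {
        my ($start, $end) = @$range;
        seek $fh, $start, 0;
        read $fh, my $record, $end - $start;
        chomp $record;
        my $score = fuzzy_match($query, $record);
        ($best, $bestScore) = ($record, $score) if $score > $bestScore;
    }
    return $best;    # undef when the bucket is empty
}

# Made-up sample standing in for file B, opened as an in-memory file.
my $fileB = "banana\nbandana\ncherry\nband\n";
open my $fh, '<', \$fileB or die "Cannot open in-memory file: $!";
my $index = build_index($fh);
print best_match($fh, $index, 'bandanna'), "\n";
```

With a real file B you would open the file itself instead of the in-memory string, and for a file too large to index in memory the bucket ranges could be written to disk once and reloaded. Whether that pays off depends on how often B changes.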

Perl is environmentally friendly - it saves trees

In reply to Re: optimizing a linear search by Indexed or bucketed hashing by GrandFather
in thread optimizing a linear search by Indexed or bucketed hashing by princepawn
