Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

Re: Creating an index on a string-collection

by samtregar (Abbot)
on Feb 08, 2009 at 21:35 UTC ( [id://742284]=note: print w/replies, xml ) Need Help??


in reply to Creating an index on a string-collection

A few seconds is a long time to search a set of 1 million strings on a modern machine, so I don't think you'll need to try very hard. If your searches always have a fixed start or a fixed end you could use Tree::Trie. Here's an example that handles 'foo*' and '*foo' searches:

use Tree::Trie; # read in all of dict - it's around 400K words open(DICT, "<", "/usr/share/dict/words"); my $trie = new Tree::Trie; my $trie_r = new Tree::Trie; while (<DICT>) { chomp; next if /[^-\w\d]/; $trie->add($_); $trie_r->add(join('',reverse split('', $_))); } close DICT; use Time::HiRes qw(time); my (@match, $start); foreach my $lookup (@ARGV) { my @args; my $t; my $start = time; if ($lookup =~ /^([^\*]+)\*$/) { @match = $trie->lookup($1); } elsif ($lookup =~ /^\*([^\*]+)$/) { @match = map { join('',reverse split('', $_)) } $trie_r->lookup(join('',reverse split('', $1))); } print $lookup, sprintf(" (%0.4fs) : ", time - $start), join(',', @match), "\n"; }

On my system the searches are very fast once the database is loaded.

$ perl trie.pl 'oophoro*' '*otomy' '*botomy' 'sam*' oophoro* (0.0011s) : oophoroepilepsy,oophoropexy,oophoron,oophorocyste +ctomy,oophorocele,oophororrhaphy,oophoromalacia,oophoroma,oophoromani +a,oophorosalpingectomy,oophorostomy,oophorotomy *otomy (0.0209s) : sclerotomy,iridosclerotomy,viscerotomy,merotomy,ure +terotomy,sphincterotomy,enterotomy,laparoenterotomy,gastroenterotomy, +celioenterotomy,hernioenterotomy,uterotomy,abdomino-uterotomy,laparo- +uterotomy,hysterotomy,colpohysterotomy,laparocolpohysterotomy,abdomin +ohysterotomy,laparohysterotomy,gastrohysterotomy,celiohysterotomy,lap +arotomy,splenolaparotomy,hysterolaparotomy,blepharotomy,Caesarotomy,h +ydrotomy,angiohydrotomy,androtomy,chondrotomy,synchondrotomy,thyrotom +y,pleurotomy,neurotomy,aponeurotomy,commissurotomy,necrotomy,sacrotom +y,microtomy,nephrotomy,lithonephrotomy,laparonephrotomy,urethrotomy,o +stearthrotomy,arthrotomy,osteoarthrotomy,cerebrotomy,corotomy,aeropor +otomy,oophorotomy,metrotomy,elytrotomy,laparoelytrotomy,gastroelytrot +omy,celioelytrotomy,ventrotomy,antrotomy,sequestrotomy,gastrotomy,lap +arogastrotomy,gastrogastrotomy,celiogastrotomy,colopexotomy,loxotomy, +tendotomy,pericardotomy,chordotomy,cleidotomy,orchidotomy,iridotomy,c +litoridotomy,thyroidotomy,mastoidotomy,ichthyotomy,embryotomy,myotomy +,tenomyotomy,fibromyotomy,ophthalmomyotomy,tenontomyotomy,leukotomy,l +ymphotomy,nymphotomy,synechotomy,polychotomy,bronchotomy,thoracobronc +hotomy,orchotomy,trichotomy,dichotomy,subdichotomy,choledochotomy,duo +denocholedochotomy,canthotomy,lithotomy,cholelithotomy,pyelolithotomy +,ureterolithotomy,nephrolithotomy,choledocholithotomy,pelviolithotomy +,ornithotomy,coccygotomy,laryngotomy,tracheolaryngotomy,pharyngotomy, +salpingotomy,laparosalpingotomy,celiosalpingotomy,myringotomy,syringo +tomy,dacryocystosyringotomy,esophagotomy,vagotomy,herniotomy,cranioto +my,arteriotomy,salpingo-ovariotomy,ovariotomy,parovariotomy,oariotomy +,pelviotomy,cardiotomy,pericardiotomy,symphysiotomy,episiotomy,celiot +omy,coeliotomy,ciliotomy,fasciotomy,kiotomy,orchiotomy,rachiotomy,bra +chiotomy,angiotomy,lymphangiotomy,biotomy,pubiotomy,herpetotomy,oment +otomy,tenontotomy,odontotomy,pancreatotomy,meatotomy,hepatotomy,lapar +ohepatotomy,keratotomy,dermatotomy,stomatotomy,prostatotomy,aortotomy +,cystotomy,cholecystotomy,laparocholecystotomy,epicystotomy,dacryocys +totomy,hypocystotomy,lithocystotomy,uterocystotomy,laparocystotomy,re +ctocystotomy,proctocystotomy,prostatocystotomy,mastotomy,histotomy,co +stotomy,periostotomy,phytotomy,septotomy,rectotomy,proctotomy,autotom +y,orbitotomy,ototomy,ileotomy,laparoileotomy,peritoneotomy,perineotom +y,peotomy,tracheotomy,cricotracheotomy,laryngotracheotomy,stereotomy, +thyreotomy,cricothyreotomy,edeotomy,symphyseotomy,osteotomy,hebeosteo +tomy,periosteotomy,splenotomy,laparosplenotomy,hymenotomy,rumenotomy, +adenotomy,duodenotomy,gastroduodenotomy,tenotomy,myotenotomy,myoenoto +my,jejunotomy,splanchnotomy,tympanotomy,neuroanotomy,vaginotomy,turbi +notomy,mediastinotomy,pogonotomy,pneumonotomy,cionotomy,kionotomy,val +votomy,proctovalvotomy,ophthalmotomy,pneumotomy,thalamotomy,mammotomy +,desmotomy,syndesmotomy,plasmotomy,myomotomy,laparomyomotomy,celiomyo +motomy,entomotomy,symphysotomy,vasotomy,tarsotomy,cirsotomy,glossotom +y,pyelotomy,celotomy,kelotomy,helotomy,trachelotomy,laparotrachelotom +y,cystotrachelotomy,cephalotomy,encephalotomy,omphalotomy,amygdalotom +y,staphylotomy,ankylotomy,xylotomy,spondylotomy,condylotomy,bdellotom +y,fallotomy,tonsillotomy,cyclotomy,aplotomy,vesiculotomy,ossiculotomy +,valvulotomy,uvulotomy,capsulotomy,typhlotomy,alveolotomy,colotomy,il +eocolotomy,laparocolotomy,gastrocolotomy,lumbocolotomy,cholecystocolo +tomy,cheilotomy,chilotomy,pulpotomy,colpotomy,laparocolpotomy,gastroc +olpotomy,celiocolpotomy,hippotomy,anthropotomy,cecotomy,caecotomy,onc +otomy,leucotomy,thoracotomy,scotomy,phrenicotomy,cricotomy,varicotomy +,clavicotomy,vesicotomy,hepaticotomy,scleroticotomy,phlebotomy,ophtha +lmophlebotomy,hepatophlebotomy,arteriophlebotomy,flebotomy,hebotomy,g +astrotubotomy,strabotomy,lobotomy,rhizotomy,zootomy *botomy (0.0008s) : phlebotomy,ophthalmophlebotomy,hepatophlebotomy,ar +teriophlebotomy,flebotomy,hebotomy,gastrotubotomy,strabotomy,lobotomy sam* (0.0052s) : sam,saman,samarra,samara,samaras,samarskite,samaroid, +samarium,samariums,samaria,samariform,samaritan,samaritans,samadh,sam +adhi,samaj,sam-sodden,samuel,samurai,samurais,samuin,samh,samhita,sam +khya,samkara,samgha,samfoo,samiel,samiels,samiresite,samiri,samian,sa +mizdat,samisen,samisens,samite,samites,samiti,samely,samel,sameliness +,same-colored,same-minded,same-seeming,same-sounding,same-sized,same- +featured,same,samenesses,sameness,samen,samech,samechs,samek,samekh,s +amekhs,sameks,samesome,samnani,samnite,samvat,sammy,sammel,sammer,sam +mier,samskara,samshu,samshus,samshoo,samsara,samsaras,samson,samsonia +n,samsonite,samlet,samlets,sample,samplery,sampler,samplers,samplemen +,sampleman,sampled,samples,sampling,samplings,samp,samphire,samphires +,sampaloc,sampan,sampans,sampaguita,samps,sampi,sambel,sambul,sambuni +grin,sambuca,sambucas,sambur,samburs,sambuke,sambukes,sambuk,sambhur, +sambhurs,sambhar,sambhars,sambhogakaya,sambal,sambaed,samba,sambar,sa +mbars,sambaqui,sambaquis,sambas,sambaing,sambo,sambouk,sambouse,sambo +s,samoyed,samory,samohu,samoa,samoan,samoans,samogon,samogonka,samova +r,samovars,samosa,samosas,samosatenian,samothere,samothracian

You'd need to do more work if you wanted fast "foo*bar" searches or other syntax. Also worth noting is that it uses quite a lot of memory, but hey, that's Perl for you!

Thanks for the question - that was fun!

-sam

Replies are listed 'Best First'.
Re^2: Creating an index on a string-collection
by AnomalousMonk (Archbishop) on Feb 09, 2009 at 02:14 UTC
    I can understand that oophoroepilepsy matches against oophoro* (with the '*' wildcard at the end), but how does otomyces match against *otomy (with the '*' wildcard at the beginning)?

    Shouldn't '*otomy' (and '*botomy', for that matter) match against something like 'phlebotomy'?

    What is the significance of the position of the wildcard character in the search string?

      Hey, reverse() on a scalar doesn't reverse it! I wonder why I thought it did. Fixed code will go in the original node in a moment. I'm sure there are other bugs though - this was just example code!

      -sam

        It does in  scalar context:
        >perl -wMstrict -le "my $s = 'foobar'; print reverse $s, 'zot'; print scalar reverse $s, 'zot'; " zotfoobar tozraboof

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://742284]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others rifling through the Monastery: (5)
As of 2024-04-16 08:08 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found