PerlMonks  

Re^2: Create a dictionary from wikipedia

by linuxkid (Sexton)
on Jul 31, 2012 at 19:53 UTC ( #984653=note )


in reply to Re: Create a dictionary from wikipedia
in thread Create a dictionary from wikipedia

One way to get rid of the markup is to split the text on the markup characters with a regex. Try this:

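use strict;
use warnings;

my $page = "my ##Media Wiki [text|here]";
my %wordcount;

# Split on runs of whitespace and markup/punctuation characters.
# A character class avoids the empty matches of \s* and the unescaped . and $.
my @words = split /[\s#\[\]|\@\$!.,]+/, $page;

# Count only fields that contain at least one word character.
foreach my $word (@words) {
    $wordcount{$word}++ if $word =~ /\w/;
}

foreach my $word (keys %wordcount) {
    print "$word\t$wordcount{$word}\n";
}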
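If you would rather strip the markup first and keep the visible text (for example the label of a wiki link), something along these lines might work. This is only a rough sketch: `strip_markup` is a name I made up, the sample text is invented, and it handles just a few common MediaWiki constructs (links, bold/italic quotes, headings), not templates, tables, or HTML.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): reduce a few common
# MediaWiki constructs to plain text before splitting into words.
sub strip_markup {
    my ($text) = @_;
    $text =~ s/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/$1/g;   # [[target|label]] -> label, [[target]] -> target
    $text =~ s/'{2,}//g;                                # drop ''italic'' and '''bold''' quote runs
    $text =~ s/^=+\s*(.*?)\s*=+\s*$/$1/mg;              # == Heading == -> Heading
    return $text;
}

# Invented sample input, not real Wikipedia content.
my $page = "== Intro ==\nThe [[Perl|perl]] language is '''fun''' and [[regex]] friendly.";

# Count lowercased words after the markup has been removed.
my %wordcount;
$wordcount{lc $_}++ for grep { /\w/ } split /\W+/, strip_markup($page);

print "$_\t$wordcount{$_}\n" for sort keys %wordcount;
```

Splitting on `\W+` after stripping is simpler than listing every markup character, but it also breaks words at apostrophes and hyphens, so adjust it to taste.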
I hope this helps.

--linuxkid


imrunningoutofideas.co.cc
