PerlMonks  

Re: Create a dictionary from wikipedia

by davido (Cardinal)
on Jul 31, 2012 at 16:57 UTC


in reply to Create a dictionary from wikipedia

perlfaq6: How can I print out a word frequency or line frequency summary?

You've really got two programming problems. One is the business logic: how to do word counts (or whatever stats you want to generate from a body of text). The other is how to scrape Wikipedia. If it were just some third-tier website you were scraping, I would expect you to have to deal with a separation of concerns: you would find one module that helps with the business logic and another that helps with the scraping (plus something to help with the parsing). But this is Wikipedia, and it's possible that something already exists that can scrape Wikipedia more effectively "out of the box." There may even be something that can handle your language statistics. You have to search.
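For the business-logic half, here is a minimal sketch of a word-frequency count in the spirit of the perlfaq6 entry linked above. It is not the FAQ's code; the sub names and the definition of a "word" (letters, with optional inner apostrophes or hyphens) are my own assumptions and would need tuning for real Wikipedia text.

```perl
use strict;
use warnings;

# Count how often each word occurs in a string.
# ASSUMPTION: a "word" is a run of letters, optionally joined by inner
# apostrophes or hyphens. Words are lowercased before counting.
sub word_frequencies {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++
        while $text =~ /\b([[:alpha:]]+(?:['-][[:alpha:]]+)*)\b/g;
    return \%count;
}

# Print the most frequent words first; ties are broken alphabetically.
sub print_summary {
    my ($count) = @_;
    my @words = sort { $count->{$b} <=> $count->{$a} || $a cmp $b }
                keys %$count;
    printf "%5d %s\n", $count->{$_}, $_ for @words;
}

print_summary( word_frequencies("The cat saw the other cat. Don't panic.") );
```

Returning a hash reference keeps the counting separate from the reporting, so the same counts can feed a dictionary file, a top-N list, or whatever other stats you end up wanting.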

Type "Wikipedia" into the search box. Try it now: Wikipedia. There you'll find all sorts of CPAN solutions that mention Wikipedia. Browse through them, find one that seems to suit your needs, and then incorporate it into your project. If you're lucky, you'll find something where you just write a wrapper around it and all the functionality you need is provided.

More likely, you find something that gets you part-way there, and the rest is what we call programming.

When I did a quick search I was rather impressed with Text::Corpus::Summaries::Wikipedia. But this might be a case where you're better off using WWW::Scraper::Wikipedia::ISO3166 or WWW::Wikipedia (more general solutions), and then either coming up with your own business logic or letting another CPAN solution take over where the Wikipedia modules leave off.
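To illustrate the "general module plus your own business logic" route, here is a rough sketch built on WWW::Wikipedia. It uses that module's `new`, `search`, and `text` methods; the subs `fetch_wikipedia_text` and `analyze_article`, and the idea of passing the fetcher and the analysis in as code refs, are hypothetical glue of my own. WWW::Wikipedia is not a core module, so it is loaded lazily and needs to be installed from CPAN before the real fetch will work.

```perl
use strict;
use warnings;

# Fetch the plain text of one article through WWW::Wikipedia.
# The module is loaded only when this sub is called, so the rest of the
# code can be exercised offline with a substitute fetcher.
sub fetch_wikipedia_text {
    my ($title) = @_;
    require WWW::Wikipedia;
    my $wiki  = WWW::Wikipedia->new( language => 'en' );
    my $entry = $wiki->search($title);
    return defined $entry ? $entry->text : undef;
}

# Hypothetical glue: the scraping and the business logic stay separate.
#   $analyze - code ref that receives the article text
#   $fetch   - optional code ref that returns text for a title
sub analyze_article {
    my ( $title, $analyze, $fetch ) = @_;
    $fetch ||= \&fetch_wikipedia_text;
    my $text = $fetch->($title);
    return unless defined $text;    # nothing fetched, nothing to analyze
    return $analyze->($text);
}

# Example call (needs network access and WWW::Wikipedia installed):
#   my $words = analyze_article( 'Perl',
#       sub { my @w = split ' ', shift; scalar @w } );
```

Because the fetcher is just a code ref, you can swap in a different Wikipedia module, or a canned string while testing, without touching the analysis code.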


Dave
