
perlfaq6: How can I print out a word frequency or line frequency summary?

You've really got two programming problems. One is the business logic: how to do word counts (or whatever stats you want to generate from a body of text). The other is how to scrape Wikipedia. If you were scraping some third-tier website, I would expect you to deal with a separation of concerns: you would find one module that helps with the business logic and another that helps with the scraping (plus something to help with the parsing). But this is Wikipedia, and it's possible that something already exists that can scrape Wikipedia more effectively "out of the box." There may even be something that can handle your language statistics. You have to search.

Type "Wikipedia" into the CPAN search box. Try it now: Wikipedia. There you find all sorts of CPAN solutions that mention Wikipedia. You browse through them. You find one that seems to suit your needs. And then you incorporate it into your project. If you're lucky, you find something you can just write a wrapper around, and all the functionality you need is provided.

More likely, you find something that gets you part-way there, and the rest is what we call programming.

When I did a quick search I was fairly impressed with Text::Corpus::Summaries::Wikipedia. But this might be a case where you're better off using WWW::Scraper::Wikipedia::ISO3166 or WWW::Wikipedia (more general solutions), and then either coming up with your own business logic or letting another CPAN solution take over where the Wikipedia modules leave off.
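
If you do end up writing the business logic yourself, the word-count half is small. Here is a minimal sketch of that part only, assuming the article text has already been fetched into a plain string by whichever Wikipedia module you settle on. The sub name word_frequencies and the sample text are made up for illustration, and the definition of a "word" (a run of \w characters, optionally joined by apostrophes, case-folded) is an assumption you will probably want to adjust for your language.

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each word occurs in a string.
# Returns a hash reference mapping lowercased word => count.
sub word_frequencies {
    my ($text) = @_;
    my %count;

    # A "word" here is a run of \w characters (letters, digits, underscore),
    # optionally joined by internal apostrophes, as in "don't".
    $count{ lc $1 }++ while $text =~ /(\w+(?:'\w+)*)/g;

    return \%count;
}

# Stand-in for text pulled from a Wikipedia article.
my $text = "The cat saw the other cat. The end.";
my $freq = word_frequencies($text);

# Most frequent first; ties broken alphabetically.
for my $word ( sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq ) {
    printf "%-10s %d\n", $word, $freq->{$word};
}
```

Keeping the counting in its own sub means the scraping side can be swapped out later without touching the statistics.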


In reply to Re: Create a dictionary from wikipedia by davido
in thread Create a dictionary from wikipedia by vit
