Create a dictionary from wikipedia

by vit (Friar)
on Jul 31, 2012 at 14:52 UTC ( [id://984603] )

vit has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I want to rephrase what I asked in the thread "Wikipedia content to text converter".
I want to create a dictionary from Wikipedia: a bag of words (a hash of word frequencies) built from the meaningful content of the articles.

Replies are listed 'Best First'.
Re: Create a dictionary from wikipedia
by marto (Cardinal) on Jul 31, 2012 at 15:29 UTC

    This isn't even a question, it's a statement. Did you investigate the reply to Wikipedia content to text converter? You've been here for years so the concepts outlined here should be familiar to you. Try doing some research and come back with an actual question or problem which you need help with.

Re: Create a dictionary from wikipedia
by davido (Cardinal) on Jul 31, 2012 at 16:57 UTC

    perlfaq6: How can I print out a word frequency or line frequency summary?
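    That FAQ answer boils down to roughly this (a minimal sketch reading from STDIN; the word pattern and the sort order are choices to adapt):

        #!/usr/bin/env perl
        use strict;
        use warnings;

        # Count each "word" (a run of word characters, allowing ' and -) on STDIN.
        my %seen;
        while (my $line = <STDIN>) {
            $seen{ lc $1 }++ while $line =~ /(\w[\w'-]*)/g;
        }

        # Report, most frequent first.
        for my $word (sort { $seen{$b} <=> $seen{$a} } keys %seen) {
            printf "%5d %s\n", $seen{$word}, $word;
        }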

    You've really got two programming problems. One is the business logic: How to do word counts (or whatever stats you want to generate given a body of text). The next is how to scrape Wikipedia. If it were just some 3rd tier website you were scraping I would expect that you would have to deal with a separation of concerns; you would find a module that helps with the business logic, and another that helps with the scraping (plus something to help with the parsing). But this is Wikipedia, and it's possible that there is something already in existence that can scrape Wikipedia more effectively "out of the box." There may even be something that can handle your language statistics. You have to search.

    Type "Wikipedia" into the search box. Try it now: Wikipedia. There you find all sorts of CPAN solutions that mention Wikipedia. You browse through them. You find one that seems to suit your needs. And then you incorporate it into your project. If you're lucky you find something where you just write a wrapper around it and all the functionality you need it provided.

    More likely, you find something that gets you part-way there, and the rest is what we call programming.

    When I did a quick search I was sort of impressed with Text::Corpus::Summaries::Wikipedia. But this might be a case where you're better off using WWW::Scraper::Wikipedia::ISO3166, or WWW::Wikipedia (more general solutions), and then coming up with your own business logic, or letting another CPAN solution take over where the Wikipedia modules leave off.
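    For what it's worth, the WWW::Wikipedia route looks roughly like this, going by that module's synopsis ('Perl' is just an example query, and the returned text still carries some wiki markup):

        use strict;
        use warnings;
        use WWW::Wikipedia;

        my $wiki  = WWW::Wikipedia->new();
        my $entry = $wiki->search('Perl');    # returns undef when nothing is found
        print $entry->text() if $entry;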


    Dave

Re: Create a dictionary from wikipedia
by Old_Gray_Bear (Bishop) on Jul 31, 2012 at 15:29 UTC
    So what code do you have, and where are you having problems?

    If I were going to approach the problem, I'd define "Meaningful Content" in terms of its characteristics (it appears in a <title> or <subtitle>, it uses a particular CSS strophe, it has an XPATH that looks like ___, etc).

    Then, just do it.

    ----
    I Go Back to Sleep, Now.

    OGB

Re: Create a dictionary from wikipedia
by cavac (Parson) on Jul 31, 2012 at 18:06 UTC

    Wikipedia ... meaningful content

    Uhm, I'm not sure that's a problem that can actually be solved using Perl... scnr.

    Back on topic, since you have the "raw" articles, you have to do multiple things. First, you have to remove all the markup. That alone does not seem trivial, since the MediaWiki format is a big mess to begin with. It's actually messy enough that more and more editors quit and the MediaWiki developers don't seem to be able to come up with a working visual editor.

    A quick and dirty solution for this would be to try one of the MediaWiki-to-HTML converters like Text::Markup::Mediawiki and then scrape the text using something like HTML::Extract.
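    I haven't tried those two together, but the general shape, using Text::MediawikiFormat (the parser behind Text::Markup's mediawiki support) and a deliberately naive tag-stripping regex standing in for HTML::Extract, would be something like:

        use strict;
        use warnings;
        use Text::MediawikiFormat;

        my $wikitext = "'''Perl''' is a [[programming language]].\n";

        # Step 1: wiki markup -> HTML.
        my $html = Text::MediawikiFormat::format($wikitext);

        # Step 2: crudely drop the tags; a real HTML parser would be safer.
        (my $text = $html) =~ s/<[^>]*>//g;

        print $text;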

    Then you can split the resulting text on whitespace and, for each word, increment its counter in a hash.

    Since words are sometimes used as double words (like "flying dutchman"), you might want to count those as well and see where it leads you. For this, consider [id://774421]; a rough sketch follows.
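    A sketch of that double-word counting (a sliding window of two over the word list):

        use strict;
        use warnings;

        my @words = split /\s+/, "the flying dutchman saw the flying dutchman";

        my (%single, %pair);
        for my $i (0 .. $#words) {
            $single{ $words[$i] }++;
            $pair{ "$words[$i] $words[$i + 1]" }++ if $i < $#words;
        }

        print "$pair{$_}\t$_\n" for sort { $pair{$b} <=> $pair{$a} } keys %pair;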

    "I know what i'm doing! Look, what could possibly go wrong? All i have to pull this lever like so, and then press this button here like ArghhhhhaaAaAAAaaagraaaAAaa!!!"
      First, you have to remove all the markup. That alone does not seem trivial, since the MediaWiki format is a big mess to begin with.
      All I need is to parse the text out of an XML dump of the articles (enwiki-latest-pages-articles.xml) to create a clean dictionary with good term statistics. I kind of hoped there exists a module which retrieves the pure text from the content. Once I have that, creating the dictionary is a one-liner.
      Yes, I already found that the MediaWiki parser does not do it, but at least it gracefully reads a multi-gigabyte file. I think I probably need to apply some filtering to what the MediaWiki parser gives me: say, retrieve only rows without special characters, hoping that those contain only pure text. So, something like that:
      $pages = Parse::MediaWikiDump::Pages->new("xml file");
      while (defined($page = $pages->next)) {
          $text = $page->text;
          ## process text, which is quite messy
      }
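      Something like this is what I have in mind for the filtering (just a sketch; the character class is a guess at what counts as "special", and note that text() returns a reference to the article text in Parse::MediaWikiDump):

          use strict;
          use warnings;
          use Parse::MediaWikiDump;

          my %dict;
          my $pages = Parse::MediaWikiDump::Pages->new("enwiki-latest-pages-articles.xml");
          while (defined(my $page = $pages->next)) {
              my $text = ${ $page->text };    # text() returns a string reference
              for my $line (split /\n/, $text) {
                  # Keep only lines free of wiki/HTML-ish characters, hoping they are prose.
                  next if $line =~ /[\[\]{}|=<>&]/;
                  $dict{ lc $_ }++ for $line =~ /(\w[\w'-]*)/g;
              }
          }
          print "$dict{$_} $_\n" for sort keys %dict;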

        It seems to me that the sticking point is going to be deciding what qualifies as "pure text." Getting the content out of the XML is fairly trivial: just walk recursively through the XML file after loading it into some XML parsing module, and grab the values of the "content" keys. (I only downloaded about 0.2% of the file as a sample, but that appears to be consistent.) The simple bit of code below does that, counts the "words" in a dictionary hash, and outputs the sorted results. However, since it splits the text on whitespace, the resulting words contain a lot of punctuation, including wiki formatting. So you'll have to parse that out, and also deal with other issues: Unicode and HTML encoded characters, embedded HTML tags, "wide characters," and more.

        #!/usr/bin/env perl
        use Modern::Perl;
        use XML::Simple;
        use Data::Dumper;

        my $xml = XML::Simple->new();
        my $in  = $xml->XMLin('wiki.xml');
        my %dict;

        walk($in);
        for (sort { $a cmp $b } keys %dict) {
            say "$dict{$_} $_";
        }

        sub walk {
            my $h = shift;
            for my $k (keys %$h) {
                if ($k eq 'content') {
                    add_to_dict($h->{$k});
                }
                elsif (ref($h->{$k}) eq 'HASH') {
                    walk($h->{$k});
                }
            }
        }

        sub add_to_dict {
            my $text = shift;
            for my $w (split /\s+/, $text) {
                $dict{$w}++;
            }
        }
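        A first cleanup pass for those issues might look like this, using HTML::Entities for the encoded characters (the tag and punctuation stripping here is deliberately blunt):

            use strict;
            use warnings;
            use HTML::Entities qw(decode_entities);

            # Normalize one raw whitespace-split token before counting it.
            sub clean_word {
                my $w = decode_entities(shift);    # &amp; -> &, &quot; -> ", etc.
                $w =~ s/<[^>]*>//g;                # drop embedded HTML tags
                $w =~ s/^\W+|\W+$//g;              # trim surrounding punctuation and wiki noise
                return lc $w;
            }

            print clean_word(q{&quot;Perl,&quot;}), "\n";    # prints: perl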

        Aaron B.
        Available for small or large Perl jobs; see my home node.

      There are ways to remove the markup using regexes. Try this:

      $page = "my ##Media Wiki [text|here]"; %wordcount; @words = split /(\s*|#|\[|\||\]|@|$|!|.|,)/ $page; foreach $word (@words) { $wordcount{$word}++ if $word =~ /\w/; } foreach $word (keys %wordcount) { print "$word\t$wordcount{$word}\n"; }
      I hope this helps.

      --linuxkid


      imrunningoutofideas.co.cc
Re: Create a dictionary from wikipedia
by Anonymous Monk on Jul 31, 2012 at 21:40 UTC
    Also, please bear in mind that Wikipedia content is licensed material. You cannot do "just any ol' thing you like" with it, specifically including making "your own encyclopedia or dictionary." It might be free in terms of money, but it's not free of strings attached.

      That's not quite right. It's licensed under the Creative Commons Attribution-ShareAlike License, which allows fairly broad reuse, including making your own encyclopedia or dictionary, as long as the new resource is released under a compatible open license.

      You are free to:
      • Read and Print our articles and other media free of charge.
      • Share and Reuse our articles and other media under free and open licenses.
      • Contribute To and Edit our various sites or Projects.
