Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

Re^2: Create a dictionary from wikipedia

by vit (Friar)
on Aug 01, 2012 at 22:09 UTC ( #984908=note: print w/replies, xml ) Need Help??

in reply to Re: Create a dictionary from wikipedia
in thread Create a dictionary from wikipedia

First, you have to remove all the markup. That alone does not seem trivial, since the MediaWiki format is a big mess to begin with.
All I need is to parse the text from an xml dump of the articles enwiki-latest-pages-articles.xml to create a clean dictionary with good statistics of terms. I kind of hoped there exists a module which retrieve the pure text from the content. Once I have it, creating a dict. is a one line code.
Yes, I already found that MediaWiki parser does not do it, but at least gracefully a reads multi-giga file. I think I need probably to apply some filtering. Say retrieve only rows without special characters hoping that those have only pure text or so from what MediaWiki parser gives me. So something like that:
$pages = Parse::MediaWikiDump::Pages->new("xml file"); while(defined($page = $pages->next)) { $text = $page->text; ## process text, which is quite messy }

Replies are listed 'Best First'.
Re^3: Create a dictionary from wikipedia
by aaron_baugher (Curate) on Aug 02, 2012 at 01:25 UTC

    It seems to me that the sticking point is going to be deciding what qualifies as "pure text." Getting the content out of the XML is fairly trivial: just walk recursively through the XML file after loading it into some XML parsing module, and grab the values of the "content" keys. (I only downloaded about 0.2% of the file as a sample, but that appears to be consistent.) The simple bit of code below does that, counts the "words" in a dictionary hash, and outputs the sorted results. However, since it splits the text on whitespace, the resulting words contain a lot of punctuation, including wiki formatting. So you'll have to parse that out, and also deal with other issues: Unicode and HTML encoded characters, embedded HTML tags, "wide characters," and more.

    #!/usr/bin/env perl use Modern::Perl; use XML::Simple; use Data::Dumper; my $xml = XML::Simple->new(); my $in = $xml->XMLin('wiki.xml'); my %dict; walk($in); for (sort {$a cmp $b} keys %dict){ say "$dict{$_} $_"; } sub walk { my $h = shift; for my $k (keys %$h){ if($k eq 'content'){ add_to_dict($h->{$k}); } elsif( ref($h->{$k}) eq 'HASH' ){ walk($h->{$k}); } } } sub add_to_dict { my $text = shift; for my $w (split /\s+/, $text){ $dict{$w}++; } }

    Aaron B.
    Available for small or large Perl jobs; see my home node.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://984908]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (3)
As of 2023-09-30 13:24 GMT
Find Nodes?
    Voting Booth?

    No recent polls found