
Re^3: Create a dictionary from wikipedia

by aaron_baugher (Curate)
on Aug 02, 2012 at 01:25 UTC ( #984925=note )

in reply to Re^2: Create a dictionary from wikipedia
in thread Create a dictionary from wikipedia

It seems to me that the sticking point is going to be deciding what qualifies as "pure text." Getting the content out of the XML is fairly trivial: load the file with an XML parsing module, walk recursively through the resulting data structure, and grab the values of the "content" keys. (I only downloaded about 0.2% of the file as a sample, but the structure appears to be consistent.) The simple bit of code below does that, counts the "words" in a dictionary hash, and outputs the results sorted by word. However, since it splits the text on whitespace, the resulting words contain a lot of punctuation, including wiki formatting. So you'll have to strip that out, and also deal with other issues: Unicode and HTML-encoded characters, embedded HTML tags, "wide characters," and more.

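#!/usr/bin/env perl
use Modern::Perl;
use XML::Simple;
use Data::Dumper;

# Load the whole XML file into a nested Perl data structure.
my $xml = XML::Simple->new();
my $in  = $xml->XMLin('wiki.xml');

my %dict;    # word => number of occurrences
walk($in);

# Print each word with its count, sorted alphabetically by word.
for (sort { $a cmp $b } keys %dict) {
    say "$dict{$_} $_";
}

# Recursively descend through nested hashes, collecting the text
# stored under every "content" key.
sub walk {
    my $h = shift;
    for my $k (keys %$h) {
        if ($k eq 'content') {
            add_to_dict($h->{$k});
        }
        elsif (ref($h->{$k}) eq 'HASH') {
            walk($h->{$k});
        }
    }
}

# Split the text on whitespace and count each resulting "word".
sub add_to_dict {
    my $text = shift;
    for my $w (split /\s+/, $text) {
        $dict{$w}++;
    }
}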
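
As a rough starting point for the cleanup mentioned above, here is a minimal sketch of a word extractor that could stand in for the whitespace split. The name clean_words and the small %entity table are my own inventions for illustration, not part of the code above. It assumes the text has already been decoded into Perl characters by the XML parser, and it only handles simple cases: non-nested {{templates}}, [[link|label]] markup, plain HTML tags, and a handful of entities. Real wiki markup is much messier than this.

```perl
use strict;
use warnings;

# Avoid "Wide character" warnings when printing non-ASCII words.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical, deliberately tiny entity table; a real job would
# probably want a full entity-decoding module instead.
my %entity = (amp => '&', lt => '<', gt => '>', quot => '"', nbsp => ' ');

# Hypothetical helper: turn raw wiki text into a list of lowercase words.
sub clean_words {
    my $text = shift;
    return () unless defined $text;

    # Drop embedded HTML tags such as <b> or </ref>.
    $text =~ s/<[^>]*>/ /g;

    # Decode a few named entities, then decimal and hex numeric ones.
    $text =~ s/&(amp|lt|gt|quot|nbsp);/$entity{$1}/g;
    $text =~ s/&#(\d+);/chr($1)/ge;
    $text =~ s/&#x([0-9a-fA-F]+);/chr(hex $1)/ge;

    # Remove simple (non-nested) {{templates}}.
    $text =~ s/\{\{[^{}]*\}\}/ /g;

    # Reduce [[target|label]] to "label" and [[target]] to "target".
    $text =~ s/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/$1/g;

    # A word is a run of letters, optionally with internal apostrophes,
    # so surrounding punctuation and '''bold''' quotes fall away.
    return map { lc } $text =~ /(\p{L}+(?:'\p{L}+)*)/g;
}
```

To try it, the loop in add_to_dict could iterate over clean_words($text) instead of split /\s+/, $text. Whether things like digits, hyphenated words, or single letters should count as words is exactly the "pure text" decision, so the final regex is the part to adjust.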

Aaron B.
Available for small or large Perl jobs; see my home node.
