Re^3: Create a dictionary from wikipedia

by aaron_baugher (Curate)
on Aug 02, 2012 at 01:25 UTC

in reply to Re^2: Create a dictionary from wikipedia
in thread Create a dictionary from wikipedia

It seems to me that the sticking point is going to be deciding what qualifies as "pure text." Getting the content out of the XML is fairly trivial: load the file with an XML parsing module, walk recursively through the resulting data structure, and grab the values of the "content" keys. (I only downloaded about 0.2% of the file as a sample, but that appears to be consistent.) The simple bit of code below does that, counts the "words" in a dictionary hash, and outputs the sorted results. However, since it splits the text on whitespace, the resulting words contain a lot of punctuation, including wiki formatting. So you'll have to parse that out, and also deal with other issues: Unicode and HTML-encoded characters, embedded HTML tags, "wide characters," and more.

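#!/usr/bin/env perl
use Modern::Perl;
use XML::Simple;
use Data::Dumper;

# Parse the whole dump into a nested Perl data structure.
my $xml = XML::Simple->new();
my $in  = $xml->XMLin('wiki.xml');

my %dict;    # word => number of occurrences
walk($in);

# Print each word with its count, sorted alphabetically by word.
for (sort { $a cmp $b } keys %dict) {
    say "$dict{$_} $_";
}

# Recursively descend through nested hashes, counting every "content" value.
sub walk {
    my $h = shift;
    for my $k (keys %$h) {
        if ($k eq 'content') {
            add_to_dict($h->{$k});
        }
        elsif (ref($h->{$k}) eq 'HASH') {
            walk($h->{$k});
        }
    }
}

# Split the text on whitespace and count each resulting "word".
sub add_to_dict {
    my $text = shift;
    for my $w (split /\s+/, $text) {
        $dict{$w}++;
    }
}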
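To give an idea of what "parsing that out" might look like, here is a rough sketch of a cleanup step that could replace the plain whitespace split. The name clean_words is made up for this example, and the rules are only a guess at the most common cases (HTML comments and tags, {{templates}}, [[links]], quote and heading markup, a handful of entities); real wiki markup has many more. It also assumes the text has already been decoded to Perl characters, so that \p{L} matches letters rather than bytes.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): reduce a chunk of
# wiki text to a list of lowercase words. Handles only a few common cases.
sub clean_words {
    my $text = shift;

    $text =~ s/<!--.*?-->/ /gs;          # HTML comments
    $text =~ s/<[^>]*>/ /g;              # embedded HTML tags such as <br/>

    # Templates like {{cite ...}}; loop so nested ones go innermost first.
    1 while $text =~ s/\{\{[^{}]*\}\}/ /g;

    # [[target|label]] becomes "label"; [[target]] becomes "target".
    $text =~ s/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/$1/g;

    # A few HTML entities, decoded by hand (assumption: these are the
    # ones that matter; the list is far from complete).
    my %entity = (amp => '&', lt => '<', gt => '>', quot => '"', nbsp => ' ');
    $text =~ s/&(amp|lt|gt|quot|nbsp);/$entity{$1}/g;

    $text =~ s/'{2,}//g;                 # ''italic'' and '''bold''' markers
    $text =~ s/={2,}//g;                 # == heading == markers

    # Keep runs of letters, allowing apostrophes inside a word (don't).
    my $lower = lc $text;
    my @words = $lower =~ /\p{L}+(?:'\p{L}+)*/g;
    return @words;
}

my $sample = "'''Perl''' is a [[programming language|language]] by {{who|Larry Wall}}.";
print join(' ', clean_words($sample)), "\n";    # prints: perl is a language by
```

In the script above, add_to_dict could then loop over clean_words($text) instead of split /\s+/, $text. For the entities, HTML::Entities from CPAN (decode_entities) would be a more complete choice than the hand-rolled table.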

