PerlMonks  

Re^3: downloading a russian dictionary and getting matches with the arbitrary underpattern, a utility for crosswords

by aitap (Curate)
on Dec 13, 2020 at 11:47 UTC


in reply to Re^2: downloading a russian dictionary and getting matches with the arbitrary underpattern, a utility for crosswords
in thread downloading a russian dictionary and getting matches with the arbitrary underpattern, a utility for crosswords

never come back with a word list
The "dictionaries on the Net" page links to the new frequency-based dictionary, which offers a tab-separated version containing 52138 rows. The Tower of Babel project also offers some dictionaries in self-extracting 7-zip archives containing some kind of binary format. (dBase? I could see CP-866-encoded text in some of those files.) Let me know if you need help with translating any of that.

Re^4: downloading a russian dictionary and getting matches with the arbitrary underpattern, a utility for crosswords
by Aldebaran (Curate) on Dec 15, 2020 at 00:19 UTC
    Let me know if you need help with translating any of that.

    I feel like we're getting somewhere now. I can see something I can use, but it's column one of a database. How do we get the words out of there and into a garden variety perl array without creating a sea of mojibake?

      To read freqrnc2011.csv into a Perl data structure, one could use Text::CSV_XS, which is smart enough to auto-decode UTF-8 bytes into Perl wide characters by default:

      use Text::CSV_XS 'csv';
      my @words = map { $_->{Lemma} } @{
          csv in => "freqrnc2011.csv", headers => "auto", sep_char => "\t"
      };
      Make sure to set an :encoding(...) PerlIO layer on your STDOUT when you work with (and print) wide characters.
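For what it's worth, here is a minimal core-modules-only sketch of the same idea, in case Text::CSV_XS is not installed. It assumes the file is UTF-8, tab-separated, has a header row with a Lemma column, and contains no quoted or multi-line fields. The helper names lemmas_from_tsv and crossword_matches, the sample file words_sample.tsv, and the use of _ for an unknown letter are all made up for illustration.

```perl
use strict;
use warnings;
use utf8;                              # this source file contains Cyrillic literals

# Encode wide characters as UTF-8 on output instead of getting
# "Wide character in print" warnings.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: read one named column from a UTF-8, tab-separated file.
# Assumes a header row and no quoted or multi-line fields.
sub lemmas_from_tsv {
    my ($file, $column) = @_;
    open my $fh, '<:encoding(UTF-8)', $file or die "Cannot open $file: $!";
    my $header = <$fh>;
    die "Empty file: $file" unless defined $header;
    $header =~ s/\R\z//;
    $header =~ s/^\x{FEFF}//;          # drop a byte-order mark if present
    my @names = split /\t/, $header, -1;
    my ($idx) = grep { $names[$_] eq $column } 0 .. $#names;
    die "No column '$column' in $file" unless defined $idx;
    my @words;
    while (my $line = <$fh>) {
        $line =~ s/\R\z//;
        my @fields = split /\t/, $line, -1;
        push @words, $fields[$idx]
            if defined $fields[$idx] && length $fields[$idx];
    }
    close $fh;
    return @words;
}

# Hypothetical helper: '_' in the pattern stands for exactly one unknown letter.
sub crossword_matches {
    my ($pattern, @words) = @_;
    my $re = join '', map { $_ eq '_' ? '.' : quotemeta($_) } split //, $pattern;
    return grep { /\A$re\z/i } @words;
}

# Tiny made-up sample so the sketch runs without the real dictionary.
my $sample = 'words_sample.tsv';
open my $out, '>:encoding(UTF-8)', $sample or die "Cannot write $sample: $!";
print {$out} "Lemma\tPoS\n", "кот\ts\n", "кит\ts\n", "дом\ts\n";
close $out;

my @words = lemmas_from_tsv($sample, 'Lemma');
my @hits  = crossword_matches('к_т', @words);
print "$_\n" for @hits;
unlink $sample;
```

With the sample data this prints кот and кит. Decoding happens once at the input layer and encoding once at the output layer, so everything in between works on characters rather than bytes, which is what keeps the mojibake away.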

      I have to admit that so far I haven't been able to read the .var files (which seem to contain the actual words mixed with binary data when read as CP-866) from the latter source without using Starling for DOS from the same website. We may have to contact the original author's son about the file format if you are interested in dictionaries from there.
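If you only want to peek at what text is in such a file, a "strings"-style heuristic might be enough. The sketch below decodes raw bytes as CP-866 with the core Encode module and keeps runs of Cyrillic letters. It is not a parser for the real .var layout, which I don't know; the helper name cyrillic_runs, the minimum run length of 3, and the sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: decode raw bytes as CP-866 and return every run of
# at least $min_len Cyrillic letters. This is only a heuristic; binary data
# that happens to decode to Cyrillic letters will show up as false positives.
sub cyrillic_runs {
    my ($bytes, $min_len) = @_;
    $min_len //= 3;
    my $text = decode('cp866', $bytes);   # CP-866 defines all 256 byte values
    return $text =~ /(\p{Cyrillic}{$min_len,})/g;
}

# Made-up blob: two CP-866 words surrounded by binary junk.
my $blob = "\x00\x01\xAA\xAE\xE2\x00\xFF\x02\xA4\xAE\xAC\x03";
my @runs = cyrillic_runs($blob);
print "$_\n" for @runs;
```

For a real file you would read it with a :raw layer (open my $fh, '<:raw', $file) and pass the slurped bytes to the helper.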

      perlunitut

      $ perldoc path::tinY | ack -in slurp
      $guts = $file->slurp;
      $guts = $file->slurp_utf8;
