Re: Parsing Gutenberg Catalog Index

by kvale (Monsignor)
on Aug 30, 2004 at 05:30 UTC ( [id://386851]=note )


in reply to Parsing Gutenberg Catalog Index

For the other readers, here is a snippet of the index:
    Anna Karenina, by Lev Nikolaevica Tolstoi                                13214
     [Language: Dutch]
    Night Before Christmas & Other Popular Stories For Children, by Various  13213
    The Wild Olive, by Basil King                                            13212
    The Pearl, by Sophie Jewett                                              13211
    El Comendador Mendoza, by Juan Valera                                    13210
     [Subtitle: Obras Completas Tomo VII]
     [Language: Spanish]
It seems that there is a good bit of structure here. Each new entry starts on a new line. The title and author are separated by /, by/. The ID is at the end of the first line of the entry. Combining these, a first stab at a regexp would be
    if ( $line =~ /^(\w.*?), by (.*?)\s+(\d+)$/ ) {
        $title  = $1;    # text before ", by "
        $author = $2;    # text between ", by " and the trailing ID
        $id     = $3;
    }
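To also pick up the bracketed lines that follow some entries, the regexp can be wrapped in a small loop. This is only a minimal sketch: the sub name parse_index and the hash keys are made up for illustration, and it assumes that every entry line ends in a bare numeric ID and that each following line of the form [Key: value] belongs to the entry above it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse catalog index text into a list of entries (hash refs).
# Assumption: an entry line ends in a numeric ID, and any later line of
# the form "[Key: value]" is an attribute of the most recent entry.
sub parse_index {
    my ($text) = @_;
    my @entries;
    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^(\w.*?), by (.*?)\s+(\d+)$/ ) {
            push @entries, { title => $1, author => $2, id => $3 };
        }
        elsif ( @entries and $line =~ /^\s*\[(\w+):\s*(.*?)\]\s*$/ ) {
            # Store e.g. "Language: Dutch" under the key "language".
            $entries[-1]{ lc($1) } = $2;
        }
    }
    return @entries;
}

my $sample = <<'END';
Anna Karenina, by Lev Nikolaevica Tolstoi                                13214
 [Language: Dutch]
The Wild Olive, by Basil King                                            13212
El Comendador Mendoza, by Juan Valera                                    13210
 [Subtitle: Obras Completas Tomo VII]
 [Language: Spanish]
END

for my $e ( parse_index($sample) ) {
    my $language = defined $e->{language} ? $e->{language} : 'n/a';
    print join( "\t", $e->{id}, $e->{title}, $e->{author}, $language ), "\n";
}
```

Lines that match neither pattern are silently skipped, so entries without ", by " (or with a suffix after the ID) would need their own rule.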

-Mark

Replies are listed 'Best First'.
Re^2: Parsing Gutenberg Catalog Index
by Anonymous Monk on Aug 30, 2004 at 06:37 UTC
    Wouldn't it make more sense to use their XML-formatted catalog? ... http://gutenberg.net/browse/rdf/catalog.rdf.bz2

      I would second that, given that the RDF looks like:

      <rdf:Description rdf:ID="etext13218">
        <dc:publisher>&pg;</dc:publisher>
        <dc:title rdf:parseType="Literal">Don Orsino</dc:title>
        <dc:creator>Crawford, F. Marion (Francis Marion) (1854-1909)</dc:creator>
        <dc:language>en</dc:language>
        <dc:created>2004-08-19</dc:created>
        <dc:rights rdf:resource="&lic;" />
      </rdf:Description>

      Then all you need is something trivial like this to create a file ready for a MySQL 'load data local infile .....'

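      #!/usr/bin/perl
      use strict;
      use warnings;

      # Read one blank-line-separated record at a time.
      local $/ = "\n\n";
      open my $rdf, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
      while (<$rdf>) {
          # Skip records that lack an etext ID, a title, or a creator.
          next unless m/<rdf:Description rdf:ID="etext(\d+)"/;
          my $id = $1;
          next unless m/<dc:title[^>]+>([^<\n]+)</;
          my $title = $1;
          next unless m/<dc:creator>([^<\n]+)</;
          my $author = $1;
          # Collapse runs of whitespace before printing tab-separated fields.
          $title  =~ s/\s+/ /g;
          $author =~ s/\s+/ /g;
          print "$id\t$title\t$author\n";
      }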

      cheers

      tachyon

Re^2: Parsing Gutenberg Catalog Index
by lidden (Curate) on Aug 30, 2004 at 12:08 UTC
    Looking a little closer, and also at older index files, I found this:
    *****A "C" Following a Project Gutenberg eBook Number Indicates Copyright****
    *****A "*" Following a Project Gutenberg eBook Number Indicates Reserved ****
    [snip]
    The Life of John Ruskin, by W. G. Collingwood                            13076
    A Hero and a Great Man, by Francis Kruckvich                             13075C
     [Illustrator: Fritz]
    Punch, Vol. 100, February 7, 1891, Ed. by Sir Francis Burnand            13074
    [snip]
    Feb 1995 Moon and Sixpence by Somerset Maugham [Maugham #1][moonaxxx.xxx]   222
    Feb 1995 The Return of Sherlock Holmes [Magazine Edition]  [rholmxxb.xxx]   221B
    Feb 1995 The Secret Sharer, by Joseph Conrad [Conrad #2]   [ssharxxx.xxx]   220
    I did not find the meaning of the 'B', though.
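Given these suffixes, a pattern that expects the ID to be digits only will miss entries such as 13075C. One possible way to cope, sketched here with a hypothetical helper named split_id, is to separate the number from an optional trailing flag. It assumes the flag is a single capital letter or a "*"; only "C" and "*" are explained by the index header, and anything else (such as "B") is passed through without interpretation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split an ID field such as "13075C" into its number and an optional flag.
# Assumption: the flag is one capital letter or "*". Returns an empty
# list if the field does not look like an ID at all.
sub split_id {
    my ($field) = @_;
    return unless $field =~ /^(\d+)([A-Z*]?)$/;
    return ( $1, $2 );
}

for my $field (qw(13076 13075C 221B)) {
    my ( $id, $flag ) = split_id($field);
    print "$id\t", ( $flag eq '' ? '-' : $flag ), "\n";
}
```

In a line-matching regexp the same idea amounts to replacing the final (\d+)$ with (\d+)([A-Z*]?)$.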

      221B - as in 221B Baker Street. It could mean something else, but I think it's somebody's attempt at humor.

      Wonder how this project turned out?

      But God demonstrates His own love toward us, in that while we were yet sinners, Christ died for us. Romans 5:8 (NASB)
