Re^2: Ignoring not well-formed (invalid token) errors

by bitingduck (Chaplain)
on Jan 19, 2015 at 08:15 UTC (#1113711)


in reply to Re: Ignoring not well-formed (invalid token) errors
in thread Ignoring not well-formed (invalid token) errors

If the job is just to find a simple pattern, it might be doable in a reasonable amount of time. There are extractors for the Open Directory Project and Wikipedia dumps, both of which run to many GB, that process them very quickly, even on relatively old machines. Some 10 years ago I was pulling all of the music content out of ODP in a few minutes at most, on a Mac laptop that was reasonably current at the time. I don't recall how long it took to pull all the music topics out of Wikipedia, but I think it was quite reasonable.
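
To illustrate the idea, here is a minimal sketch in Python of that kind of extractor. It is not the ODP or Wikipedia tool mentioned above, just an assumption about how such a scan could work: read the dump one line at a time and apply a regex, with no XML parser involved, so a not well-formed token elsewhere in the file never stops the run. The `<Title>` pattern, the sample data, and the function names are all hypothetical, and the approach only works when the thing you want sits on a single line.

```python
import io
import re

# Hypothetical pattern: the text of <Title> elements that fit on one line.
TITLE_RE = re.compile(r"<Title>(.*?)</Title>")


def extract_matches(lines, pattern=TITLE_RE):
    """Yield the first capture group of every match, scanning line by line."""
    for line in lines:
        for match in pattern.finditer(line):
            yield match.group(1)


def extract_from_file(path, pattern=TITLE_RE):
    """Stream a file from disk; only one line is held in memory at a time."""
    # errors="replace" keeps undecodable bytes from aborting the scan.
    with open(path, encoding="utf-8", errors="replace") as handle:
        yield from extract_matches(handle, pattern)


# Small in-memory stand-in for a dump, including one malformed line.
sample = io.StringIO(
    '<Topic id="Top/Arts/Music">\n'
    "<Title>Music</Title>\n"
    "<Bad & unescaped <token>\n"
    "<Title>Jazz</Title>\n"
)
titles = list(extract_matches(sample))
print(titles)
```

For a real dump you would call `extract_from_file` with the path to the file. Because nothing is parsed as XML, the malformed line is simply skipped as a non-match rather than raising an error.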
