http://qs321.pair.com?node_id=1113711


in reply to Re: Ignoring not well-formed (invalid token) errors
in thread Ignoring not well-formed (invalid token) errors

If it's looking for a simple pattern it might be doable in a reasonable amount of time. There are extractors for the Open Directory Project and Wikipedia dumps, both of which are in the many GB range, that can process very quickly, even on relatively old machines. I was pulling all of the music content out of ODP in less than a few minutes some 10 years ago on a mac laptop that was reasonably current then, and I don't recall how long it took to pull all the music topics out of Wikipedia, but I think it was quite reasonable.