Re: Ignoring not well-formed (invalid token) errors

by Laurent_R (Canon)
on Jan 19, 2015 at 07:24 UTC

in reply to Ignoring not well-formed (invalid token) errors

If the error always shows the same pattern, maybe you could preprocess the file to remove the offending line(s). I know that the idea of preprocessing 13 GB is not very attractive, but sometimes you have to bite the bullet.
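To give an idea of what such a preprocessing pass could look like, here is a minimal sketch in Python (the same loop is just as short in Perl). It streams the file line by line, so memory use stays flat no matter how large the input is. The pattern is only an assumption for illustration: it matches control characters that XML 1.0 does not allow, which is one common cause of "invalid token" errors, and would need to be replaced by whatever actually breaks the parser in your data. The names `BAD_LINE`, `filter_lines` and `preprocess` are made up for this example.

```python
import re

# Hypothetical pattern: control characters that XML 1.0 forbids
# (everything below 0x20 except tab, newline and carriage return).
# Swap in the pattern that matches the offending lines in your data.
BAD_LINE = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def filter_lines(src, dst, bad=BAD_LINE):
    """Copy src to dst one line at a time, skipping lines that match `bad`.

    Both arguments are binary file-like objects. Returns the number of
    lines that were dropped.
    """
    dropped = 0
    for line in src:
        if bad.search(line):
            dropped += 1
        else:
            dst.write(line)
    return dropped


def preprocess(in_path, out_path):
    """Write a cleaned copy of in_path to out_path; return the drop count."""
    # Binary mode avoids decode errors on badly encoded input.
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        return filter_lines(src, dst)
```

Calling `preprocess("dump.xml", "dump.clean.xml")` (file names hypothetical) leaves the original untouched and returns how many lines were removed, which is a useful sanity check before feeding the cleaned copy to the parser. Note that dropping whole lines is only safe if the offending lines do not carry tags that the rest of the document needs to stay well-formed.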

Re^2: Ignoring not well-formed (invalid token) errors
by bitingduck (Chaplain) on Jan 19, 2015 at 08:15 UTC

    If the preprocessing only has to look for a simple pattern, it might be doable in a reasonable amount of time. There are extractors for the Open Directory Project and Wikipedia dumps, both of which run to many GB, that process them very quickly, even on relatively old machines. Some 10 years ago I was pulling all of the music content out of ODP in under a few minutes on a Mac laptop that was reasonably current then. I don't recall how long it took to pull all the music topics out of Wikipedia, but I think it was quite reasonable.

