Re: Planning a new CPAN module for WARC support (DSLIP: IdpOp)

Re^2: Planning a new CPAN module for WARC support (DSLIP: IdpOp)
by jcb (Parson) on Aug 10, 2019 at 03:33 UTC

That is what I meant by Archive:: seeming to fit at first glance: I had the same idea; after all, "Web ARChive" is literally the name of the format. I could argue that Archive::Web would be an appropriate root, but then you have the problem that WARC is not the only format for storing Web documents, merely the one favored by the Internet Archive and a few national libraries. Argument by weighty authority is still argument by authority. :-(

While you are correct that conventions can be ignored, I would prefer to reserve Archive::WARC for a (future) simpler file-ish interface. There are ways to treat a WARC file much like a ZIP or ZOO archive.
That is good, thank you.
Is that a lack of information or just not having had time to look yet? (In other words, is more information needed or only patience?)
WARC::Fields is a fairly simple ordered in-memory key-value store and unlikely to need subclasses. Overloading the dereference operators would make the tied array/hash interfaces nearly transparent, which seems nice to me.

This would make $record->fields->{'WARC-Type'} or $record->fields->{'WARC-Target-URI'} shorthand for $record->field('WARC-Type') or $record->field('WARC-Target-URI'), since the field method on a WARC::Record is delegated to the embedded WARC::Fields object. (The quotes are required: the hyphen prevents Perl from auto-quoting the key as a bareword.)

That is not very useful, but the real reason for overloading hash dereference to use the tied hash interface is to make keys %{$record->fields} valid and exactly what it looks like. Why roll my own iterator API when Perl already has one?
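
To make the idea concrete, here is a minimal sketch of hash-dereference overloading backed by a tied hash. My::Fields and My::Fields::TiedHash are hypothetical stand-ins, not the planned WARC::Fields code, and the array-ref object layout is an assumption chosen so that overloading %{} cannot recurse into the object's own storage.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WARC::Fields: an ordered key-value store.
# The object is a blessed ARRAY ref, so overloading %{} does not
# interfere with access to the object's own data.
package My::Fields;

use overload '%{}' => \&_as_hash, fallback => 1;

sub new {
    my ($class, @pairs) = @_;
    my $self = bless [], $class;            # list of [name, value] pairs
    while (my ($name, $value) = splice @pairs, 0, 2) {
        push @$self, [$name, $value];
    }
    return $self;
}

# Look up the first value stored under $name.
sub field {
    my ($self, $name) = @_;
    for my $pair (@$self) {
        return $pair->[1] if $pair->[0] eq $name;
    }
    return undef;
}

# Field names in insertion order.
sub names { my ($self) = @_; return map { $_->[0] } @$self }

# Called whenever the object is dereferenced as a hash.
sub _as_hash {
    my ($self) = @_;
    tie my %view, 'My::Fields::TiedHash', $self;
    return \%view;
}

# Read-only tied hash that forwards to the My::Fields object.
package My::Fields::TiedHash;

sub TIEHASH  { my ($class, $fields) = @_; return bless { fields => $fields, pos => 0 }, $class }
sub FETCH    { $_[0]{fields}->field($_[1]) }
sub EXISTS   { defined $_[0]{fields}->field($_[1]) }
sub FIRSTKEY { my ($self) = @_; $self->{pos} = 0; return $self->NEXTKEY }
sub NEXTKEY  { my ($self) = @_; my @n = $self->{fields}->names; return $n[ $self->{pos}++ ] }

package main;

my $fields = My::Fields->new(
    'WARC-Type'       => 'response',
    'WARC-Target-URI' => 'http://example.com/',
);

print "$_\n" for keys %$fields;           # Perl's own iterator API
print $fields->{'WARC-Type'}, "\n";       # same as $fields->field('WARC-Type')
```

This sketch builds a fresh tied view on every dereference, which is simple but not free; a real implementation might cache the view.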

On a side note, I realized that this question mentions the wrong package. Oops, fixed. (It had been part of WARC::Record originally before I decided to follow the same split as HTTP::Message and HTTP::Headers. I had been keeping a list of questions, and updating that fell through the cracks somehow. Oops!)
Overloading mostly provides convenience, like being able to use sort on an array of WARC::Record without having to specify a comparison. The overload would probably be to a compareTo or compare_to method anyway. An overload to a method should work with subclasses, although I would expect an overload to a coderef to cause problems unless subclasses also use overload to override it. If I understand the overload documentation correctly, the overhead of overloaded operators is tiny for packages that do not use them and is really the cost of supporting overloading at all.

That is a fairly good argument against using overloads on WARC::Record, except that, without overloads, none of the overloadable operators make sense on a WARC::Record. There is ==, but that is object identity and exactly the most obvious candidate for overloading to make WARC::Record objects compare equal iff they refer to the same physical record even if they were obtained from two different indexes and therefore have been constructed separately and have different memory addresses.
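
A rough sketch of what overloading to a method could look like follows. My::Record, its volume/offset fields, and the file names are all hypothetical; the assumption is that a record's physical location (volume plus byte offset) is what identifies it. Because the overload names a method rather than a coderef, a subclass can override compare_to and the operators follow.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

package My::Record;   # hypothetical stand-in for WARC::Record

# Overload by method NAME so the call goes through normal method
# lookup; == is autogenerated from <=> by the overload machinery.
use overload
    '<=>'    => 'compare_to',
    'cmp'    => 'compare_to',
    fallback => 1;

sub new {
    my ($class, %args) = @_;
    return bless { volume => $args{volume}, offset => $args{offset} }, $class;
}

# Order records by volume name, then by offset within the volume.
sub compare_to {
    my ($self, $other, $swapped) = @_;
    my $result = $self->{volume} cmp $other->{volume}
              || $self->{offset} <=> $other->{offset};
    return $swapped ? -$result : $result;
}

package main;

# The same physical record, constructed separately as if found
# through two different indexes.
my $from_index_one = My::Record->new(volume => 'crawl-0001.warc.gz', offset => 4096);
my $from_index_two = My::Record->new(volume => 'crawl-0001.warc.gz', offset => 4096);
my $later_record   = My::Record->new(volume => 'crawl-0002.warc.gz', offset => 0);

print "different objects\n"
    if Scalar::Util::refaddr($from_index_one) != Scalar::Util::refaddr($from_index_two);
print "same record\n" if $from_index_one == $from_index_two;

# An explicit block is used here; whether a bare sort picks up the
# cmp overload is worth checking against the Perl version in use.
my @sorted = sort { $a <=> $b } ($later_record, $from_index_one);
print $sorted[0]{volume}, "\n";
```
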
The purpose of WARC segmentation is to store payloads that are too large for a single WARC file. (The format has no inherent limit, but the specification recommends a policy of limiting WARC files to 1 GB each.) We run into this problem inside the READ or READLINE method implementing the tied file handle returned from open_payload on a WARC::Record object. Reading a payload from a WARC collection should be transparent, so the WARC library must recombine segments here.

Also, due to limitations of the WARC format, there is no previous method: its implementation would require starting at the first record in the WARC file and repeatedly following next, a nasty performance surprise for the unwary. Better to let the module user do that if they really need it. At least that way, they should know it will be very slow.
So I must ask the related question: How should WARC::Collection expose information about the volumes in the collection? Collections can be large enough that the indexes must be primarily stored on disk. Common Crawl, as an example, is ~~double-digit TB~~ hundreds of TB — ~~tens~~ hundreds of thousands of 1GB WARC files storing ~~many~~ billions of records per crawl. Then again, simply returning an array should work here — ~~ten~~ two hundred thousand WARC::Volume objects should fit in a few hundred MB or so of RAM. Is array memory overhead still significantly smaller than hash memory overhead? I will have to carefully think about expected live object counts when choosing internal representations.

Or should this be another tied array interface, where the list of WARC files is drawn from an index as needed? That can only work if the collection object is only using one index, but I think requiring a merged index for collections too large for even a complete list of WARC files to fit in RAM is reasonable.
This is less of a problem for reading WARC files — the open_payload method provides a tied file handle that reads the payload from a WARC record; the real problem is supplying the data when writing a WARC file, especially in a way that is compatible with future support for transparently saving LWP exchanges to WARC files. Are temporary files really the only practical option here? (I suspect probably so.)

Temporary file space can be bounded even if payload size is not: segments can be recorded as they arrive.
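
As a sketch of that last point: the helper below splits an incoming stream into segments of at most a fixed size and hands each one off as soon as it is full, so the staging area never holds more than one segment. make_segmenter and its callback interface are hypothetical, and an in-memory buffer stands in for the temporary file to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a pair of closures. "write" accepts data
# chunks of any size; "close" flushes the final partial segment.
# $emit is called as $emit->($segment_number, $data) for each segment.
sub make_segmenter {
    my ($limit, $emit) = @_;
    my $buffer = '';     # stands in for a bounded temporary file
    my $number = 0;      # segments emitted so far

    my $flush = sub { $emit->(++$number, $buffer); $buffer = '' };

    return {
        write => sub {
            my ($chunk) = @_;
            while (length $chunk) {
                my $room = $limit - length $buffer;
                # Move up to $room bytes from the chunk into the buffer.
                $buffer .= substr($chunk, 0, $room, '');
                $flush->() if length($buffer) == $limit;
            }
        },
        close => sub {
            $flush->() if length $buffer;
            return $number;
        },
    };
}

my @segments;
my $segmenter = make_segmenter(4, sub { my ($n, $data) = @_; push @segments, "$n:$data" });

$segmenter->{write}->('abcdef');
$segmenter->{write}->('ghi');
my $count = $segmenter->{close}->();

print "$_\n" for @segments;
```

With a limit of 4 bytes, the nine input bytes arrive as two full segments and one short final segment; a real writer would record each as a WARC segment record instead of pushing onto an array.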

Edited 2019-08-10 by jcb: Correct size of Common Crawl datasets and redo math. The conclusion seems to remain valid due to a previous math error.
