After a long sojourn in the wilderness, I have returned to the Monastery with part of an API in hand and several questions for my fellow monks.
Nothing is too trivial here: this is intended for CPAN, and bikeshedding a public API now is the best way to keep backwards compatibility from becoming unpleasant later.
The modules are not ready for CPAN yet, mostly due to the still-lingering namespace question. Nor has any significant code been written yet, since I prefer to have a solid idea of the API before getting too involved in implementation. The rest of this node is a copy of the current documentation draft as formatted with pod2html: (internal links are probably broken, sorry)
WARC - Web ARChive support for Perl
use WARC;
$collection = assemble WARC::Collection (@indexes);
$record = $collection->search(url => $url, time => $when);
$volume = mount WARC::Volume ($filename);
$record = $volume->first_record;
$next_record = $record->next;
$record = $volume->record_at($offset);
# $record is a WARC::Record object
The WARC module is a convenience module for loading basic WARC support.
After loading this module, the WARC::Volume and WARC::Collection
classes are available.
- WARC::Collection
-
A WARC::Collection object represents a set of indexed WARC files.
- WARC::Volume
-
A WARC::Volume object represents a single WARC file.
- WARC::Record
-
Each record in a WARC volume is analogous to an HTTP::Message, with
headers specific to the WARC format.
- WARC::Record::Payload
- WARC::Record::Segment
- WARC::Fields
-
A WARC::Fields object represents the set of headers in a WARC record,
analogous to the use of HTTP::Headers with HTTP::Message. The
HTTP::Headers class is not reused because it has protocol-specific
knowledge of a set of valid headers and a standard ordering. WARC headers
come from a different set and order is preserved.
-
The key-value format used in WARC headers has its own MIME type
``application/warc-fields'' and is also usable as the contents of a
``warcinfo'' record and elsewhere. The WARC::Fields class also provides
support for objects of this type.
- WARC::Index
-
WARC::Index is the base class for WARC index formats and also holds a
registry of loaded index formats for convenience when assembling
WARC::Collection objects.
- WARC::Index::CDX
-
Access module for the common CDX WARC index format.
- WARC::Index::SDBM
-
Planned ``fast index'' format using ``SDBM_File'' to index multiple CDX indexes
for fast lookup by URL/timestamp pairs. SDBM is planned because it is included
with Perl, and its 1008 byte record limit should be only a minor problem that
can be worked around by storing URL prefixes and splitting records.
- WARC::Index::SQLite
-
Another planned ``fast index'' format using DBI and DBD::SQLite. This module
avoids the limitations of SDBM, but depends on modules from CPAN.
- WARC::Volume::Builder
-
The WARC::Volume::Builder class provides a means to write new WARC files.
- WARC::Index::CDX::Builder
- WARC::Index::SDBM::Builder
- WARC::Index::SQLite::Builder
-
The WARC::Index::*::Builder classes provide tools for building indexes
either incrementally while writing the corresponding WARC file or
after-the-fact by scanning an existing WARC file.
-
The build constructor that WARC::Index provides uses one of these
classes for the actual work.
Support for WARC record segmentation is planned but not yet implemented.
Handling segmented WARC records requires using the WARC::Collection
interface to find the next segment in a different WARC file. The
WARC::Volume interface is only usable for access within one WARC file.
The older ARC format is not yet supported, nor are other archival formats
directly supported. Interfaces for ``WARC-alike'' handlers are planned as
WARC::Alike::*. Metadata normally present in WARC volumes may not be
available from other formats.
Formats planned for eventual inclusion include MAFF described at
http://maf.mozdev.org/maff-specification.html and the MHTML format
defined in RFC 2557.
Jacob Bachmeyer, <jcb@cpan.org>
Information about the WARC format at http://bibnum.bnf.fr/WARC/.
An overview of the WARC format at
https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.
# TODO: add relevant RFCs.
The POD pages for the modules mentioned in the overview lists.
Copyright (C) 2019 by Jacob Bachmeyer
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.
WARC::Builder - Web ARChive construction support for Perl
use WARC::Builder;
$warcinfo_data = new WARC::Fields (software => 'MyWebCrawler/1.2.3 ...',
format => 'WARC File Format 1.0',
# other fields omitted ...
);
$warcinfo = new WARC::Record (type => 'warcinfo',
content => $warcinfo_data);
# for a small-scale crawl
$build = new WARC::Builder (warcinfo => $warcinfo,
filename => $warcfilename);
# for a large-scale crawl
$index1 = build WARC::Index::CDX (into => $indexprefix.'.cdx');
$index2 = build WARC::Index::SDBM (into => $indexprefix.'.sdbm');
$build = new WARC::Builder (warcinfo => $warcinfo,
filename_template =>
$warcprefix.'-%s-%05d-'.$hostname.'.warc.gz',
index => [$index1, $index2]);
# for each collected object
$build->append(@records); # or ...
$build->append($record1, $record2, ... );
The WARC::Builder class is the high-level interface for writing WARC
archives. It is a very simple interface, because, at this level, WARC is a
very simple format: a simple sequence of WARC records, which
WARC::Builder accepts as WARC::Record objects to append to the
in-progress WARC file.
WARC file size limits are handled automatically if configured.
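As a rough illustration of how the filename_template used in the synopsis might expand, here is a minimal sketch in plain Perl. The helper names warc_filename and wants_gzip, the crawl prefix, and the host name are hypothetical and not part of the proposed API; only the sprintf expansion, the 14-digit UTC timestamp, and the file name pattern come from the documentation.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: expand a template with the named variables, in order.
sub warc_filename {
    my ($template, $names, %vars) = @_;
    my $filename = sprintf $template, map { $vars{$_} } @$names;
    # The documentation requires the result to look like a WARC file name.
    $filename =~ m/\.warc(?:\.gz)?$/
        or die "file name '$filename' does not end in .warc or .warc.gz\n";
    return $filename;
}

# Hypothetical helper: a final ".gz" selects per-record gzip compression.
sub wants_gzip { return $_[0] =~ m/\.gz$/ ? 1 : 0 }

# The timestamp is fixed when the builder is constructed and is always UTC.
my $constructed = 1546300800;    # 2019-01-01 00:00:00 UTC, for a stable example
my %vars = (
    timestamp => strftime('%Y%m%d%H%M%S', gmtime $constructed),
    serial    => 0,
);

my $template = 'mycrawl-%s-%05d-crawler01.warc.gz';
print warc_filename($template, [qw/timestamp serial/], %vars), "\n";
# prints "mycrawl-20190101000000-00000-crawler01.warc.gz"
```

Passing the variable names as a list keeps the template free to use the values in a different order, which is presumably why filename_template_vars exists as a separate option.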
- $build = new WARC::Builder (key => value, ...)
-
Construct a WARC::Builder object. The following keys are supported:
- index => [$index]
- index => [$index1, $index2, ...]
-
If set, must be a reference to an array of index builder objects.
Each newly-added WARC::Record will be presented to all index builder
objects in this list.
- filename => $warcfilename
-
If set, create a single WARC file with the given file name. The file name
must match m/\.warc(?:\.gz)?$/. The presence of a final ``.gz'' indicates
that the WARC file should be written with per-record gzip compression.
-
This option is mutually exclusive with the filename_template option.
-
Using this option inhibits starting a new WARC file and causes the
max_file_size option to be ignored. A warning is emitted in this case.
- filename_template => $warcprefix.'-%s-%05d-'.$hostname.'.warc.gz'
-
Establish an sprintf format string to construct file names. The file name
produced by the template string must match m/\.warc(?:\.gz)?$/. The
presence of a final ``.gz'' indicates that the WARC file should be written
with per-record gzip compression.
-
The filename_template option gives the format string, while
filename_template_vars gives an array reference of named parameters to
be used with the format.
-
If constructing file names in accordance with the IIPC WARC implementation
guidelines, this string should be of the form
'PREFIX-%s-%05d-HOSTNAME.warc.gz' where PREFIX is any chosen prefix to name
the crawl and HOSTNAME is the name or other identifier for the machine
writing the file.
-
This option is mutually exclusive with the filename option.
- filename_template_vars => [qw/timestamp serial/]
-
Provide the list of parameters to the sprintf call used to produce a WARC
filename from the filename_template option.
-
The available variables are:
- serial
-
A number, incremented each time adding a record causes a new WARC file to
be started.
- timestamp
-
A 14-digit timestamp in the YYYYmmddHHMMSS format recommended in the IIPC
WARC implementation guidelines. The timestamp is always in UTC. The time
used is the time at which the WARC::Builder object was constructed and
is constant between WARC files. This should be substituted as a string.
Default [qw/timestamp serial/] in accordance with IIPC guidelines.
- first_serial => $count
-
The initial value of the serial filename variable for this object.
Default 0.
- max_file_size => $size
-
Maximum size of a WARC file. A new WARC file is started if appending a
record would cause the current file to exceed this length.
-
The limit can be specified as an exact number of bytes, or a number
followed by a size suffix m/[KMG]i?/. The ``K'', ``M'', and ``G'' suffixes
indicate base-10 multiples (10**(3*n)), while the ``Ki'', ``Mi'', and ``Gi''
suffixes indicate base-2 multiples (2**(10*n)) widely used in computing.
-
Default ``1G'' == 1_000_000_000.
- warcinfo => $warcinfo_record
-
A WARC::Record object of type ``warcinfo'' that will be written at the
start of each WARC file. This record will be cloned and written with a
distinct ``WARC-Record-ID'' as the first record in each WARC file, including
the first. As a consequence, it does not require a ``WARC-Record-ID'' header
and any ``WARC-Record-ID'' given is silently ignored.
-
Each clone of this record will also have the ``WARC-Filename'' header added.
-
Each clone of this record will also have the ``WARC-Date'' header set to the
time at which the WARC::Builder object was constructed.
- warcversion => 'WARC/1.0'
-
Set the version of the WARC format to be written. This string is the first
line of each WARC record. It must begin with the prefix 'WARC/' and should
be the version from the WARC specification that the crawler follows.
-
Default ``WARC/1.0''.
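The size suffixes accepted by max_file_size can be sketched as follows. This is only an illustration: the function name parse_size is hypothetical, and the sketch assumes the number before the suffix is a non-negative integer, which the documentation does not state.

```perl
use strict;
use warnings;

# Exponent step for each suffix letter: K => n=1, M => n=2, G => n=3.
my %step = (K => 1, M => 2, G => 3);

# Hypothetical helper: convert "1G", "512Mi", or "1000" to a byte count.
sub parse_size {
    my ($spec) = @_;
    $spec =~ m/^([0-9]+)(?:([KMG])(i?))?$/
        or die "invalid size specification '$spec'\n";
    my ($number, $unit, $binary) = ($1, $2, $3);
    return $number + 0 unless defined $unit;    # plain byte count
    my $n = $step{$unit};
    # "Ki"/"Mi"/"Gi" are base-2 multiples; "K"/"M"/"G" are base-10.
    return $number * ($binary ? 2**(10 * $n) : 10**(3 * $n));
}

printf "%s => %d\n", $_, parse_size($_) for qw(1000 1G 1Gi 512Mi);
```

With these rules the documented default of ``1G'' comes out as 1_000_000_000 bytes, while ``1Gi'' would be 1_073_741_824 bytes.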
- $build->append( $record1, ... )
-
Add any number of WARC::Record objects to the growing WARC file. If
WARC file size limits are configured, and a record would cause the current
WARC file to exceed the configured size limits, a new WARC file is opened
automatically.
-
All records passed to a single append call are added to the same WARC
file. If a new WARC file is to be started, it will be started before
any records are written.
-
All records passed to a single append call are considered ``concurrent'':
every record after the first in that call will have a ``WARC-Concurrent-To''
header added referencing the first record. This is a convenience feature for
simpler crawlers and is inhibited entirely if any record already has a
``WARC-Concurrent-To'' header when append is called.
-
If a WARC::Record passed to this method lacks a ``WARC-Record-ID'' header,
a warning will be emitted using carp(), a UUID will be generated, and a
record ID of the form ``urn:uuid:UUID'' will be assigned. If the record
object is read-only, this method will croak() instead.
-
If a WARC::Record passed to this method lacks any of the ``WARC-Date'',
``WARC-Type'', or ``Content-Length'' headers, this method will croak().
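The ``WARC-Concurrent-To'' convenience described for append can be sketched with plain hashes standing in for WARC::Record objects. The function name link_concurrent and the hash representation are assumptions made for illustration only; the sketch follows the reading that a single pre-existing ``WARC-Concurrent-To'' header inhibits the feature for the whole call.

```perl
use strict;
use warnings;

# Hypothetical stand-in: each record is a hash ref of header name => value.
sub link_concurrent {
    my @records = @_;
    # Inhibited entirely if any record already carries the header.
    return @records if grep { exists $_->{'WARC-Concurrent-To'} } @records;
    my ($first, @rest) = @records;
    # Every later record in the same call points back at the first one.
    $_->{'WARC-Concurrent-To'} = $first->{'WARC-Record-ID'} for @rest;
    return @records;
}

my @batch = link_concurrent(
    { 'WARC-Type' => 'response', 'WARC-Record-ID' => '<urn:uuid:aaaa>' },
    { 'WARC-Type' => 'request',  'WARC-Record-ID' => '<urn:uuid:bbbb>' },
    { 'WARC-Type' => 'metadata', 'WARC-Record-ID' => '<urn:uuid:cccc>' },
);
print "$_->{'WARC-Type'}: ", $_->{'WARC-Concurrent-To'} // '(none)', "\n"
    for @batch;
```

The first record never receives the header itself, so a call with a single record leaves that record unchanged.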
Jacob Bachmeyer, <jcb@cpan.org>
WARC, the WARC::Record manpage
International Internet Preservation Consortium (IIPC) WARC implementation
guidelines. https://netpreserve.org/resources/WARC_Guidelines_v1.pdf
...
WARC::Collection - Interface to a group of WARC files
use WARC::Collection;
$collection = assemble WARC::Collection ($index_1, $index_2, ...);
$collection = assemble WARC::Collection from => ($index_1, ...);
$record = $collection->search(url => $url, time => $when);
The WARC::Collection class is the primary means by which user code is
expected to use the WARC library. This class uses indexes to efficiently
search for records in one or more WARC files.
- $collection = assemble WARC::Collection ($index_1, $index_2, ...);
- $collection = assemble WARC::Collection from => ($index_1, ...);
-
Assemble a collection of WARC files from one index or multiple indexes,
specified either as objects derived from WARC::Index or filenames.
-
While multiple indexes can be used in a collection, note that searching a
collection requires individually searching every index in the collection.
- $record = $collection->search( ... )
- @records = $collection->search( ... )
-
Search the index for records matching the parameters and return the best
match in scalar context or a list of all matches in list context. The
returned values are WARC::Record objects.
-
The parameters are specified as key => value pairs and each narrows the
search, sorts the results, or both, indicated in the following list with
``[N ]'', ``[ S]'', or ``[NS]'', respectively.
-
The keys supported are:
- [N ] url
-
An exact match for a URL.
- [NS] url_prefix
-
A prefix match for a URL. Prefers records with shorter URLs.
- [ S] time
-
Prefer records collected nearer to the requested time.
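To make the narrowing and sorting keys concrete, here is a minimal sketch over an in-memory list. The function search_sketch and the hash-based entries are hypothetical stand-ins for a real index. The documentation does not say how the two sort keys combine, so ranking URL length ahead of time distance is purely an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an index: hash refs with url and time (epoch).
sub search_sketch {
    my ($entries, %want) = @_;
    my @found = @$entries;

    # [N ] url: exact match narrows the result set.
    @found = grep { $_->{url} eq $want{url} } @found if defined $want{url};

    # [NS] url_prefix: narrows here, and prefers shorter URLs in the sort.
    if (defined $want{url_prefix}) {
        my $prefix = $want{url_prefix};
        @found = grep { index($_->{url}, $prefix) == 0 } @found;
    }

    # [ S] time: prefer records nearer the requested time.
    @found = sort {
        (defined $want{url_prefix}
            ? length($a->{url}) <=> length($b->{url}) : 0)
        ||
        (defined $want{time}
            ? abs($a->{time} - $want{time}) <=> abs($b->{time} - $want{time})
            : 0)
    } @found;

    # Best match in scalar context, all matches in list context.
    return wantarray ? @found : $found[0];
}

my @index = (
    { url => 'http://example.com/',  time => 100 },
    { url => 'http://example.com/',  time => 500 },
    { url => 'http://example.com/a', time => 300 },
);
my $best = search_sketch(\@index, url => 'http://example.com/', time => 450);
print "best match collected at $best->{time}\n";
```

The wantarray check is what allows one method to serve both the ``$record = ...'' and ``@records = ...'' calling forms shown above.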
...
WARC::Date - datestamp objects for WARC library
use WARC::Date;
$datestamp = WARC::Date->now(); # construct from current time
$datestamp = WARC::Date->from_epoch(time); # likewise
# construct from string
$datestamp = parse WARC::Date ($text); # full-featured
$datestamp = WARC::Date->from_text($string); # standard format only
$time = $datestamp->as_epoch; # as seconds since epoch
$text = $datestamp->as_string; # as "YYYY-MM-DDThh:mm:ssZ"
WARC::Date objects encapsulate the details of the required format for
timestamps in WARC headers.
- $datestamp = WARC::Date->now
-
Construct a WARC::Date object representing the current time.
- $datestamp = WARC::Date->from_epoch( $timestamp )
-
Construct a WARC::Date object representing the time indicated by an
epoch timestamp.
- $datestamp = WARC::Date->from_text( $string )
-
Construct a WARC::Date object representing the time indicated by a
string in the same format returned by the as_string method.
- $datestamp = parse WARC::Date ($text)
-
Construct a WARC::Date object from a textual representation. If
HTTP::Date is installed, this method accepts any input acceptable to
HTTP::Date::str2time. Otherwise, it is equivalent to the
from_text method.
- $datestamp->as_string
-
Return a string in the format specified by [W3C-NOTE-datetime] restricted
to 14 digits and UTC time zone, which is
``YYYY-MM-DDThh:mm:ssZ''.
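The conversion between epoch time and the ``YYYY-MM-DDThh:mm:ssZ'' form can be sketched with core modules only. The function names below are hypothetical and simply mirror what as_string and from_text are described as doing; the real class may be implemented quite differently.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Time::Local qw(timegm);

# Epoch seconds to the 14-digit UTC form used in WARC headers.
sub epoch_to_warc_date {
    my ($epoch) = @_;
    return strftime('%Y-%m-%dT%H:%M:%SZ', gmtime $epoch);
}

# Strict parse of the same form back to epoch seconds.
sub warc_date_to_epoch {
    my ($text) = @_;
    my ($year, $mon, $day, $hour, $min, $sec) =
        $text =~ m/^(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/
        or die "not in YYYY-MM-DDThh:mm:ssZ format: '$text'\n";
    # timegm() interprets its arguments as UTC; months are zero-based.
    return timegm($sec, $min, $hour, $day, $mon - 1, $year);
}

print epoch_to_warc_date(0), "\n";                       # prints "1970-01-01T00:00:00Z"
print warc_date_to_epoch('2019-01-01T00:00:00Z'), "\n";  # prints "1546300800"
```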
WARC::Date objects use epoch time internally and are therefore limited
by the range of Perl's integers.
Jacob Bachmeyer, <jcb@cpan.org>
WARC, the HTTP::Date manpage
[W3C-NOTE-datetime] ``Date and Time Formats''
http://www.w3.org/TR/NOTE-datetime.
...
WARC::Fields - WARC record headers and application/warc-fields
require WARC::Fields;
$f = new WARC::Fields;
$f = $record->fields; # get WARC record headers
$f->field('WARC-Type' => 'metadata'); # set
$f->field('WARC-Type'); # get
$f->remove_field('WARC-Type'); # delete
tie @field_names, ref $f, $f; # bind ordered list of field names
tie %fields, ref $f, $f; # bind hash of field names => values
The WARC::Fields class encapsulates information in the
``application/warc-fields'' format used for WARC record headers. This is a
simple key-value format closely analogous to HTTP headers; however, the
differences are significant enough that the HTTP::Headers class cannot
be reliably reused for WARC fields.
Instances of this class are usually created as member variables of the
WARC::Record class, but can also be returned as the content of WARC
records with Content-Type ``application/warc-fields''.
Instances of WARC::Fields retrieved from WARC files are read-only and
will croak() if any attempt is made to change their contents.
This class strives to faithfully represent the contents of a WARC file,
although the field names are defined to be case-insensitive.
Most WARC headers may only appear once and with a single value in valid
WARC records, with the notable exception of the WARC-Concurrent-To header.
WARC::Fields neither attempts to enforce nor relies upon this
constraint. Headers that appear multiple times are considered to have
multiple values, that is, the value associated with the header name will be
an array reference. Similarly, the name of a recurring header is
repeated in the tied array interface. When iterating a tied hash, all
values of a recurring header are collected and returned with the first
occurrence of its key.
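The treatment of recurring headers can be illustrated with a short sketch that groups values under the first occurrence of each name. The function collect_fields is hypothetical, and grouping names case-insensitively while reporting the first spelling seen is an assumption drawn from the statement that field names are case-insensitive.

```perl
use strict;
use warnings;

# Takes a flat list (name, value, name, value, ...) in file order and
# returns [ name, value ] pairs, where value is an array ref if the
# header occurred more than once.
sub collect_fields {
    my @pairs = @_;
    my (@order, %values, %spelling);
    while (my ($name, $value) = splice(@pairs, 0, 2)) {
        my $key = lc $name;
        push @order, $key unless exists $values{$key};
        $spelling{$key} //= $name;    # remember the first spelling seen
        push @{ $values{$key} }, $value;
    }
    return map {
        my $v = $values{$_};
        [ $spelling{$_}, @$v > 1 ? $v : $v->[0] ]
    } @order;
}

my @fields = collect_fields(
    'WARC-Type'          => 'response',
    'WARC-Concurrent-To' => '<urn:uuid:aaaa>',
    'WARC-Date'          => '2019-01-01T00:00:00Z',
    'WARC-Concurrent-To' => '<urn:uuid:bbbb>',
);
for my $field (@fields) {
    my ($name, $value) = @$field;
    print "$name: ", (ref $value ? join(', ', @$value) : $value), "\n";
}
```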
As with HTTP::Headers, the '_' character is converted to '-' in field
names unless the first character of the name is ':', which cannot itself
appear in a field name. Unlike HTTP::Headers, the leading ':' is
stripped off immediately and the name stored otherwise exactly as given.
The method and tied hash interfaces allow this convenience feature. The
field names exposed via the tied array interface are reported exactly as
they appear in the WARC file.
Strictly, ``X-Crazy-Header'' and ``X_Crazy_Header'' are two different
headers that the above convenience mechanism conflates. The solution is
simple: if (and only if) a header field already exists with the exact
name given, it is used, otherwise y/_/-/ occurs and the name is rechecked
for another exact match. If no match is found, case is folded and a third
check performed. If a match is found, the existing header is updated,
otherwise a new header is created with character case as given.
The WARC standard specifically states that field names are
case-insensitive; accordingly, ``X-Crazy-Header'' and ``X-CRAZY-HeAdEr'' are
considered the same header for the method and tied hash interfaces. They
will appear exactly as given in the tied array interface, however.
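The three-step name lookup described above might look roughly like the following. This is a sketch under assumptions: the functions find_field and set_field and the array-of-pairs storage are hypothetical, a leading ':' is taken to skip only the underscore translation (case folding still applies), and a newly created name is stored after the underscore translation unless ':' was given.

```perl
use strict;
use warnings;

# $entries is an array ref of [ name, value ] pairs in file order.
# Returns the index of the matching entry, or undef if there is none.
sub find_field {
    my ($entries, $name) = @_;
    my $literal = ($name =~ s/^://);    # leading ':' means "name as given"
    my @candidates = ($name);
    unless ($literal) {
        (my $dashed = $name) =~ tr/_/-/;
        push @candidates, $dashed if $dashed ne $name;
    }
    # Checks 1 and 2: exact match, first as given, then with '_' => '-'.
    for my $candidate (@candidates) {
        for my $i (0 .. $#$entries) {
            return $i if $entries->[$i][0] eq $candidate;
        }
    }
    # Check 3: case-insensitive match.
    my $folded = lc $candidates[-1];
    for my $i (0 .. $#$entries) {
        return $i if lc($entries->[$i][0]) eq $folded;
    }
    return undef;
}

# Update an existing header, or create one with the case as given.
sub set_field {
    my ($entries, $name, $value) = @_;
    my $i = find_field($entries, $name);
    if (defined $i) { $entries->[$i][1] = $value; return }
    my $stored = $name;
    $stored =~ tr/_/-/ unless $stored =~ s/^://;
    push @$entries, [ $stored, $value ];
    return;
}

my @headers = (
    [ 'X_Crazy_Header' => 'underscores' ],
    [ 'X-Crazy-Header' => 'dashes' ],
    [ 'WARC-Type'      => 'metadata' ],
);
print find_field(\@headers, 'X_Crazy_Header'), "\n";    # prints "0"
print find_field(\@headers, 'warc_type'), "\n";         # prints "2"
```

Trying the exact spelling first is what keeps ``X_Crazy_Header'' reachable even when ``X-Crazy-Header'' is also present.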
- $f = WARC::Fields->new
-
Construct a new WARC::Fields object. Initial contents can be passed as
key-value pairs to this constructor and will be added in the given order.
- $f->clone
-
Copy a WARC::Fields object. A copy of a read-only object is writable.
- $f->field( $name )
- $f->field( $name => $value )
- $f->field( $n1 => $v1, $n2 => $v2, ... )
-
Get or set the value of one or more fields. The field name is not case
sensitive, but WARC::Fields will preserve its case if a new entry is
created.
- $f = WARC::Fields->parse( $text )
- $f = WARC::Fields->parse_from( $fh )
-
Construct a new WARC::Fields object, reading initial contents from the
provided text string or filehandle.
-
If either parse method encounters a field name with a leading ':', which
implies an empty name and is not allowed, the leading ':' is silently
dropped from the line and parsing retried. If the line is not valid after
this change, the parse method croaks.
- $f->as_string
-
Return the contents as a formatted WARC header or application/warc-fields
block.
- $f->set_readonly
-
Mark a WARC::Fields object read-only. All methods that modify the
object will croak() if called on a read-only object.
The order of field names can be fully controlled by tying an array to a
WARC::Fields object and manipulating the array using ordinary Perl
operations. Removing a name from the array effectively removes the field
from the object, but the value for that name is still remembered, allowing
names to be moved about without loss of data.
WARC::Fields will croak() if an attempt is made to set a field name with
a leading ':' using the tied array interface.
The contents of a WARC::Fields object can be easily examined by tying a
hash to the object. Reading or setting a hash key is equivalent to the
field method, but the tied hash will iterate keys and values in the
order in which each key first appears in the internal list.
...
WARC::Index - base class for WARC index classes
use WARC::Index::CDX; # or ...
use WARC::Index::SDBM;
# or some other WARC::Index::* implementation
$index = attach WARC::Index::CDX (...); # or ...
$index = attach WARC::Index::SDBM (...);
$record = $index->search(url => $url, time => $when);
@results = $index->search(url => $url, time => $when);
build WARC::Index::CDX (...); # or ...
build WARC::Index::SDBM (...);
WARC::Index is an abstract base class for indexes on WARC files and
WARC-alike files. This class establishes the expected interface and
provides a simple interface for building indexes.
- $index = attach WARC::Index::* (...)
-
Construct an index object using the indicated technology and whatever
parameters the index implementation needs.
-
Typically, indexes are file-based and the single parameter is the name of an
index file, which in turn contains the names of the indexed WARC files.
- $record = $index->search( ... )
- @records = $index->search( ... )
-
Search an index for records matching parameters. The WARC::Collection
class uses this method to search each index in a collection.
- build WARC::Index::* (into => $dest, from => ...)
- build WARC::Index::* (from => [...], into => $dest)
-
Unlike the methods above, which each index implementation must supply, the
build method is provided by the WARC::Index base class itself. It works by
loading the corresponding index builder class and either driving the
indexing process or simply returning the newly-constructed builder object.
-
The build method itself handles the from key for specifying the files
to index. The from key can be given an array reference, after which
more key => value pairs may follow, or can simply use the rest of the
argument list as its value.
-
If the from key is given, the build method will read the indicated
files, construct an index, and return nothing. If the from key is not
given, the build method will construct and return an index builder.
-
All index builders accept at least the into key for specifying where to
store the index. See the documentation for WARC::Index::*::Builder for
more information.
The WARC::Index package also maintains a registry of loaded index
support. The register function adds the calling package to the list.
- WARC::Index::register( filename => $filename_re )
-
Add the calling package to an internal list of available index handlers.
The calling package must be a subclass of WARC::Index or this function
will croak().
-
The filename key indicates that the calling package expects to handle
index files with names matching the provided regex.
- WARC::Index::find_handler( $filename )
-
Return the registered handler for $filename or undef if none match.
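A registry of this kind could be sketched as below. The Sketch::Index package names are hypothetical stand-ins so that the example runs without the real modules, and returning the first registered handler when several patterns match is an assumption, since the documentation does not specify precedence.

```perl
use strict;
use warnings;

package Sketch::Index;
use Carp;

my @handlers;    # [ qr/pattern/, 'Package::Name' ] in registration order

# Add the calling package to the list of available index handlers.
sub register {
    my %args = @_;
    my $package = caller;
    croak "$package is not a subclass of " . __PACKAGE__
        unless $package->isa(__PACKAGE__);
    croak "a filename pattern is required" unless defined $args{filename};
    push @handlers, [ $args{filename}, $package ];
    return;
}

# Return the registered handler for $filename, or undef if none match.
sub find_handler {
    my ($filename) = @_;
    for my $handler (@handlers) {
        return $handler->[1] if $filename =~ $handler->[0];
    }
    return undef;
}

package Sketch::Index::CDX;
our @ISA = ('Sketch::Index');
Sketch::Index::register(filename => qr/\.cdx$/);

package Sketch::Index::SDBM;
our @ISA = ('Sketch::Index');
Sketch::Index::register(filename => qr/\.sdbm$/);

package main;

print Sketch::Index::find_handler('crawl.cdx'), "\n";    # prints "Sketch::Index::CDX"
```

Using caller means an index module registers itself with one line at load time, which fits the stated goal of letting a collection be assembled from bare index file names.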
...
WARC::Record - one record from a WARC file
use WARC; # or ...
use WARC::Volume; # or ...
use WARC::Collection;
# WARC::Record objects are returned from ->record_at and ->search methods
# Construct a record, as when preparing a WARC file
$warcinfo = new WARC::Record (type => 'warcinfo');
...
WARC::Record objects come in two flavors with a common interface.
Records read from WARC files are read-only and have meaningful return
values from the methods listed in ``Methods on records from WARC files''.
Records constructed in memory can be updated and those same methods all
return undef.
- $record->fields
-
Get the internal WARC::Fields object that contains WARC record headers.
- $record->field( $name )
-
Get the value of the WARC header named $name from the internal
WARC::Fields object.
These methods all return undef if called on a WARC::Record object that
does not represent a record in a WARC file.
- $record->protocol
-
Return the format and version tag for this record. For WARC 1.0, this
method returns 'WARC/1.0'.
- $record->volume
-
Return the WARC::Volume object representing the file in which this
record is located.
- $record->offset
-
Return the file offset at which this record can be found.
- $record->next
-
Return the next WARC::Record in the WARC file that contains this record.
- $record->replay
-
Return a protocol-specific object representing the record contents.
-
This method returns undef if the library does not recognize the protocol
message stored in the record.
-
A record with Content-Type ``application/http'' with an appropriate ``msgtype''
parameter produces an HTTP::Request or HTTP::Response object. An
unknown ``msgtype'' on ``application/http'' produces a generic
HTTP::Message. The returned object may be a subclass to support
deferred loading of entity bodies.
- $record->open_payload
-
Return a tied filehandle that reads the WARC record payload.
-
The WARC record payload is defined as the decoded content of the protocol
response or other resource stored in the record. This method returns undef
if called on a WARC record that has no payload or whose content the library
does not recognize.
- $record = new WARC::Record (key => value, ...)
-
Construct a fresh WARC record, suitable for use with WARC::Builder.
...
WARC::Volume - Web ARChive file access for Perl
use WARC::Volume;
$volume = mount WARC::Volume ($filename);
$record = $volume->first_record;
$record = $volume->record_at($offset);
$record = $volume->search(url => $url, time => $when);
WARC::Volume ...
- $volume = mount WARC::Volume ($filename)
-
Construct a WARC::Volume object. The parameter is the name of an
existing WARC file. An exception is raised if the first record does not
have a valid WARC header.
- $volume->first_record
-
Construct and return a WARC::Record object representing the first WARC
record in $volume. This should be a ``warcinfo'' record, but it is not
required to be so.
- $volume->record_at( $offset )
-
Construct and return a WARC::Record object representing the WARC record
beginning at $offset within $volume. An exception is raised if an
appropriate magic number is not found at $offset.
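The magic number check mentioned for record_at might be sketched like this. Which byte sequences count as ``appropriate'' is an assumption here: the sketch accepts the ``WARC/'' version prefix of an uncompressed record or the two-byte gzip signature of a compressed one. The function name is hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Return true if a WARC record appears to begin at $offset in $fh.
sub looks_like_record_start {
    my ($fh, $offset) = @_;
    seek($fh, $offset, SEEK_SET) or return 0;
    my $got = read($fh, my $buffer, 5);
    return 0 unless defined $got && $got >= 2;
    return 1 if $buffer =~ m/^WARC\//;                 # uncompressed record
    return 1 if substr($buffer, 0, 2) eq "\x1f\x8b";   # gzip member
    return 0;
}

# Two minimal uncompressed records held in memory for the example.
my $record = "WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\n\r\n\r\n\r\n";
my $data   = $record . $record;
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
binmode $fh;

for my $offset (0, 3, length $record) {
    printf "offset %d: %s\n", $offset,
        looks_like_record_start($fh, $offset) ? 'record' : 'not a record';
}
```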
Jacob Bachmeyer, <jcb@cpan.org>
...
Copyright (C) 2019 by Jacob Bachmeyer
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.