jcb has asked for the wisdom of the Perl Monks concerning the following question:
After a long sojourn in the wilderness, I have returned to the Monastery with part of an API in hand and several questions for my fellow monks.
Archive:: seems to fit at first glance, but this module has a radically different interface from most of the modules in that namespace because WARC files store subtly different information. (Most archives store "files"; WARC files can store files, but are designed to store HTTP request/response exchanges.) An Archive::WARC interface could be reasonable, but it would provide a special "view" of a WARC file that omits most details. (Recognizing this was a major step in designing this API -- and took me a few years to do!)
HTTP:: could be a possibility, but does not really fit because WARC files can also store information from other protocols. (The WARC spec envisions storing DNS records "as observed" as an example.)
LWP:: fits the eventual goal of providing transparent access to WARC files as a sort of "local Wayback" but is probably better reserved for the interface modules that *actually* implement that "local Wayback" than the generic support for accessing and building WARC files. (The baseline WARC distribution uses the HTTP::* classes, but has no other dependencies on LWP and no dependencies in the LWP:: namespace.)
The WARC::Collection and WARC::Volume modules provide read-only access to existing (collections of)? WARC files. The constructors have been given names to reflect this: WARC::Volume->mount and WARC::Collection->assemble.
The use of "open" for a WARC::Volume constructor was considered, but cannot be used in the indirect object syntax that I prefer for a constructor due to a parse conflict with the "open" builtin that perl resolves by raising a parse error instead of looking for a class method.
("open WARC::File ($name)" would have been ideal, but looks too much like a typo using the "open" builtin.)
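A minimal sketch of the two call styles, for readers who have not met the indirect object syntax. Only the class and constructor names come from the draft; the body of `mount` here is a hypothetical stand-in that just remembers the file name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the proposed class; the real constructor
# would open and validate the WARC file.
package WARC::Volume;

sub mount {
    my ($class, $filename) = @_;
    return bless { filename => $filename }, $class;
}

package main;

my $filename = 'example.warc.gz';

# Indirect object syntax, as used throughout the draft synopsis.
my $v1 = mount WARC::Volume ($filename);

# The equivalent arrow form, which never competes with a builtin name.
my $v2 = WARC::Volume->mount($filename);

print ref($v1), " ", $v1->{filename}, "\n";   # WARC::Volume example.warc.gz
```

Both forms call the same class method, so the choice of constructor name only matters for callers who prefer the first style.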
The current API envisions some means of retrieving the content of a WARC record as a file handle or string and another means of getting a reconstructed protocol response object. (An HTTP::Response in the usual case, but possibly something else.)
Options also include whether or not to actually retrieve the request chain or to just synthesize a request from the information in the "response" record. (There is no point in reading several WARC records for a long redirect chain if the user only cares about the URL and the server's final response.) This is a significant concern because the common CDX index format only indexes response records.
Or is this asking for trouble?
This would be most useful to coalesce duplicate records from multiple indexes. Logically, two record objects that refer to the same physical record should compare as equal.
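One way this could look, as a sketch only: overload the comparison operators so that two objects are equal when they name the same volume and offset. The package name `My::RecordSketch` and the `volume` and `offset` keys are assumptions made for illustration, not part of the draft API.

```perl
use strict;
use warnings;

package My::RecordSketch;   # hypothetical; not the proposed WARC::Record

use Scalar::Util qw(blessed);
use overload
    '==' => \&_same_record,
    'eq' => \&_same_record,
    fallback => 1;

sub new {
    my ($class, %args) = @_;
    return bless { volume => $args{volume}, offset => $args{offset} }, $class;
}

# Two objects are the same record if they point at the same place
# in the same file, no matter which index produced them.
sub _same_record {
    my ($x, $y) = @_;
    return 0 unless blessed($y) && $y->isa(__PACKAGE__);
    return ($x->{volume} eq $y->{volume}
            && $x->{offset} == $y->{offset}) ? 1 : 0;
}

package main;

my $from_cdx  = My::RecordSketch->new(volume => 'crawl-00001.warc.gz', offset => 4096);
my $from_sdbm = My::RecordSketch->new(volume => 'crawl-00001.warc.gz', offset => 4096);
my $other     = My::RecordSketch->new(volume => 'crawl-00001.warc.gz', offset => 8192);

print "same record\n"      if $from_cdx == $from_sdbm;
print "different record\n" unless $from_cdx == $other;
```

Comparing volumes by name assumes each volume has one canonical name; two paths to the same file would need normalizing first.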
WARC file names are normally systematic: we can probably guess the next WARC filename in "normal" cases, but there will always be edge cases where we have no idea.
How far should I go in trying to make this Just Work? When the "It Just Works" logic fails, is it better to return an undefined value or raise an exception? And should we ensure that all segments are available when first opening a segmented payload or defer failure to when we actually "run out of road"?
This would mean that $record->next on the last record in a file returns the first record in the next file.
Related: Should WARC::Collection expose information about the set of volumes in a collection? If so, how?
Simply keeping the contents in memory is not always an option -- WARC segmentation permits payloads of unlimited size.
Nothing is too trivial here: this is intended for CPAN and bikeshedding public APIs is the best way to avoid backwards compatibility becoming unpleasant later.
The modules are not ready for CPAN yet, mostly due to the still-lingering namespace question. Nor has any significant code been written yet, since I prefer to have a solid idea of the API before getting too involved in implementation. The rest of this node is a copy of the current documentation draft as formatted with pod2html: (internal links are probably broken, sorry)
WARC - Web ARChive support for Perl
use WARC;
$collection = assemble WARC::Collection (@indexes);
$record = $collection->search(url => $url, time => $when);
$volume = mount WARC::Volume ($filename);
$record = $volume->first_record;
$next_record = $record->next;
$record = $volume->record_at($offset);
# $record is a WARC::Record object
The WARC module is a convenience module for loading basic WARC support. After loading this module, the WARC::Volume and WARC::Collection classes are available.
The key-value format used in WARC headers has its own MIME type ``application/warc-fields'' and is also usable as the contents of a ``warcinfo'' record and elsewhere. The WARC::Fields class also provides support for objects of this type.
The build constructor that WARC::Index provides uses one of these classes for the actual work.
Support for WARC record segmentation is planned but not yet implemented.
Handling segmented WARC records requires using the WARC::Collection interface to find the next segment in a different WARC file. The WARC::Volume interface is only usable for access within one WARC file.
The older ARC format is not yet supported, nor are other archival formats directly supported. Interfaces for ``WARC-alike'' handlers are planned as WARC::Alike::*. Metadata normally present in WARC volumes may not be available from other formats.
Formats planned for eventual inclusion include MAFF described at http://maf.mozdev.org/maff-specification.html and the MHTML format defined in RFC 2557.
Jacob Bachmeyer, <jcb@cpan.org>
Information about the WARC format at http://bibnum.bnf.fr/WARC/.
An overview of the WARC format at https://www.loc.gov/preservation/digital/formats/fdd/fdd000236.shtml.
# TODO: add relevant RFCs.
The POD pages for the modules mentioned in the overview lists.
Copyright (C) 2019 by Jacob Bachmeyer
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
WARC::Builder - Web ARChive construction support for Perl
use WARC::Builder;
$warcinfo_data = new WARC::Fields (software => 'MyWebCrawler/1.2.3 ...',
                                   format => 'WARC File Format 1.0',
                                   # other fields omitted ...
                                  );
$warcinfo = new WARC::Record (type => 'warcinfo',
                              content => $warcinfo_data);
# for a small-scale crawl
$build = new WARC::Builder (warcinfo => $warcinfo,
                            filename => $warcfilename);
# for a large-scale crawl
$index1 = build WARC::Index::CDX (into => $indexprefix.'.cdx');
$index2 = build WARC::Index::SDBM (into => $indexprefix.'.sdbm');
$build = new WARC::Builder (warcinfo => $warcinfo,
                            filename_template =>
                              $warcprefix.'-%s-%05d-'.$hostname.'.warc.gz',
                            index => [$index1, $index2]);
# for each collected object
$build->append(@records);
# or ...
$build->append($record1, $record2, ... );
The WARC::Builder class is the high-level interface for writing WARC archives. It is a very simple interface, because, at this level, WARC is a very simple format: a simple sequence of WARC records, which WARC::Builder accepts as WARC::Record objects to append to the in-progress WARC file.
WARC file size limits are handled automatically if configured.
This option is mutually exclusive with the filename_template option.
Using this option inhibits starting a new WARC file and causes the max_file_size option to be ignored. A warning is emitted in this case.
The filename_template option gives the format string, while filename_template_vars gives an array reference of named parameters to be used with the format.
If constructing file names in accordance with the IIPC WARC implementation guidelines, this string should be of the form 'PREFIX-%s-%05d-HOSTNAME.warc.gz' where PREFIX is any chosen prefix to name the crawl and HOSTNAME is the name or other identifier for the machine writing the file.
This option is mutually exclusive with the filename option.
The available variables are:
Default [qw/timestamp serial/] in accordance with IIPC guidelines.
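As a rough illustration of how the two options could combine, the sketch below feeds the named variables to sprintf in the order given by filename_template_vars. The helper name `expand_template` and the 14-digit UTC timestamp layout are assumptions; the draft does not fix either.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: look up each named variable in order and hand
# the resulting list to sprintf along with the template.
sub expand_template {
    my ($template, $var_names, $values) = @_;
    return sprintf $template, map { $values->{$_} } @$var_names;
}

my $template = 'PREFIX-%s-%05d-HOSTNAME.warc.gz';
my %values = (
    # Assumed timestamp layout: YYYYMMDDhhmmss in UTC (epoch 0 used here
    # so that the output is reproducible).
    timestamp => strftime('%Y%m%d%H%M%S', gmtime(0)),
    serial    => 7,
);

print expand_template($template, [qw/timestamp serial/], \%values), "\n";
# PREFIX-19700101000000-00007-HOSTNAME.warc.gz
```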
The limit can be specified as an exact number of bytes, or a number followed by a size suffix m/[KMG]i?/. The ``K'', ``M'', and ``G'' suffixes indicate base-10 multiples (10**(3*n)), while the ``Ki'', ``Mi'', and ``Gi'' suffixes indicate base-2 multiples (2**(10*n)) widely used in computing.
Default ``1G'' == 1_000_000_000.
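The suffix rules above could be parsed along these lines. This is a sketch, not the module's code: the function name `parse_size` is hypothetical, and dying on malformed input is an assumption, since the draft does not say how bad values are reported.

```perl
use strict;
use warnings;

# Hypothetical parser for the size limit: a byte count, optionally
# followed by K, M or G (powers of 1000) or Ki, Mi or Gi (powers of 1024).
sub parse_size {
    my ($spec) = @_;
    my ($number, $prefix, $binary) = $spec =~ m/^(\d+)(?:([KMG])(i?))?$/
        or die "invalid size: $spec";
    return $number unless defined $prefix;
    my $n = { K => 1, M => 2, G => 3 }->{$prefix};
    return $number * ($binary ? 2**(10*$n) : 10**(3*$n));
}

print parse_size('1G'), "\n";      # 1000000000
print parse_size('1Gi'), "\n";     # 1073741824
print parse_size('512Ki'), "\n";   # 524288
```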
Each clone of this record will also have the ``WARC-Filename'' header added.
Each clone of this record will also have the ``WARC-Date'' header set to the time at which the WARC::Builder object was constructed.
Default ``WARC/1.0''.
All records passed to a single append call are added to the same WARC file. If a new WARC file is to be started, it will be started before any records are written.
All records passed to a single append call are considered ``concurrent'' and every record after the first in that call will have a ``WARC-Concurrent-To'' header added referencing the first record, if it does not already have a ``WARC-Concurrent-To'' header. This is a convenience feature for simpler crawlers and is inhibited if any record already has a ``WARC-Concurrent-To'' header when append is called.
If a WARC::Record passed to this method lacks a ``WARC-Record-ID'' header, a warning will be emitted using carp(), a UUID will be generated, and a record ID of the form ``urn:uuid:UUID'' will be assigned. If the record object is read-only, this method will croak() instead.
If a WARC::Record passed to this method lacks any of the ``WARC-Date'', ``WARC-Type'', or ``Content-Length'' headers, this method will croak().
WARC, the WARC::Record manpage
International Internet Preservation Consortium (IIPC) WARC implementation guidelines. https://netpreserve.org/resources/WARC_Guidelines_v1.pdf
...
WARC::Collection - Interface to a group of WARC files
use WARC::Collection;
$collection = assemble WARC::Collection ($index_1, $index_2, ...);
$collection = assemble WARC::Collection from => ($index_1, ...);
$record = $collection->search(url => $url, time => $when);
The WARC::Collection class is the primary means by which user code is expected to use the WARC library. This class uses indexes to efficiently search for records in one or more WARC files.
While multiple indexes can be used in a collection, note that searching a collection requires individually searching every index in the collection.
The parameters are specified as key => value pairs and each narrows the search, sorts the results, or both, indicated in the following list with ``[N ]'', ``[ S]'', or ``[NS]'', respectively.
The keys supported are:
...
WARC::Date - datestamp objects for WARC library
use WARC::Date;
$datestamp = WARC::Date->now();                 # construct from current time
$datestamp = WARC::Date->from_epoch(time);      # likewise
# construct from string
$datestamp = parse WARC::Date ($text);          # full-featured
$datestamp = WARC::Date->from_text($string);    # standard format only
$time = $datestamp->as_epoch;                   # as seconds since epoch
$text = $datestamp->as_string;                  # as "YYYY-MM-DDThh:mm:ssZ"
WARC::Date objects encapsulate the details of the required format for timestamps in WARC headers.
WARC::Date objects use epoch time internally and are therefore limited by the range of Perl's integers.
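The conversion between epoch time and the standard string format can be sketched with core modules alone. The two function names below are hypothetical and stand in for what as_string and from_text might do internally; only the "YYYY-MM-DDThh:mm:ssZ" layout comes from the draft.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Time::Local qw(timegm);

# Epoch seconds to "YYYY-MM-DDThh:mm:ssZ", always in UTC.
sub epoch_to_warc_date {
    my ($epoch) = @_;
    return strftime('%Y-%m-%dT%H:%M:%SZ', gmtime($epoch));
}

# The reverse direction, accepting the standard format only.
sub warc_date_to_epoch {
    my ($text) = @_;
    my ($Y, $M, $D, $h, $m, $s) =
        $text =~ m/^(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z$/
        or die "not a WARC timestamp: $text";
    return timegm($s, $m, $h, $D, $M - 1, $Y);
}

print epoch_to_warc_date(0), "\n";                        # 1970-01-01T00:00:00Z
print warc_date_to_epoch('1970-01-01T00:00:00Z'), "\n";   # 0
```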
WARC, the HTTP::Date manpage
[W3C-NOTE-datetime] ``Date and Time Formats'' http://www.w3.org/TR/NOTE-datetime.
...
WARC::Fields - WARC record headers and application/warc-fields
require WARC::Fields;
$f = new WARC::Fields;
$f = $record->fields; # get WARC record headers
$f->field('WARC-Type' => 'metadata'); # set
$f->field('WARC-Type'); # get
$f->remove_field('WARC-Type'); # delete
tie @field_names, ref $f, $f; # bind ordered list of field names
tie %fields, ref $f, $f; # bind hash of field names => values
The WARC::Fields class encapsulates information in the ``application/warc-fields'' format used for WARC record headers. This is a simple key-value format closely analogous to HTTP headers; however, the differences are significant enough that the HTTP::Headers class cannot be reliably reused for WARC fields.
Instances of this class are usually created as member variables of the WARC::Record class, but can also be returned as the content of WARC records with Content-Type ``application/warc-fields''.
Instances of WARC::Fields retrieved from WARC files are read-only and will croak() if any attempt is made to change their contents.
This class strives to faithfully represent the contents of a WARC file, although the field names are defined to be case-insensitive.
Most WARC headers may only appear once and with a single value in valid WARC records, with the notable exception of the WARC-Concurrent-To header. WARC::Fields neither attempts to enforce nor relies upon this constraint. Headers that appear multiple times are considered to have multiple values, that is, the value associated with the header name will be an array reference. Similarly, the name of a recurring header is repeated in the tied array interface. When iterating a tied hash, all values of a recurring header are collected and returned with the first occurrence of its key.
As with HTTP::Headers, the '_' character is converted to '-' in field names unless the first character of the name is ':', which cannot itself appear in a field name. Unlike HTTP::Headers, the leading ':' is stripped off immediately and the name stored otherwise exactly as given. The method and tied hash interfaces allow this convenience feature. The field names exposed via the tied array interface are reported exactly as they appear in the WARC file.
Strictly, ``X-Crazy-Header'' and ``X_Crazy_Header'' are two different headers that the above convenience mechanism conflates. The solution is simple: if (and only if) a header field already exists with the exact name given, it is used, otherwise y/_/-/ occurs and the name is rechecked for another exact match. If no match is found, case is folded and a third check performed. If a match is found, the existing header is updated, otherwise a new header is created with character case as given.
The WARC standard specifically states that field names are case-insensitive; accordingly, ``X-Crazy-Header'' and ``X-CRAZY-HeAdEr'' are considered the same header for the method and tied hash interfaces. They will appear exactly as given in the tied array interface, however.
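The three-step name lookup described above can be sketched as follows. The function name `resolve_field_name` is hypothetical, the list of stored names is passed in explicitly for the sake of the example, and the leading-':' escape is left out. Returning the translated name for a new header is one reading of "character case as given".

```perl
use strict;
use warnings;

# Hypothetical helper: given the names already stored and the name the
# caller used, return the stored name to update or the name to create.
sub resolve_field_name {
    my ($existing, $name) = @_;

    # 1. A header with exactly this name wins.
    for my $have (@$existing) { return $have if $have eq $name }

    # 2. Translate '_' to '-' and look for an exact match again.
    (my $dashed = $name) =~ tr/_/-/;
    for my $have (@$existing) { return $have if $have eq $dashed }

    # 3. Fold case and look a third time.
    for my $have (@$existing) { return $have if lc($have) eq lc($dashed) }

    # No match: a new header is created, keeping the caller's case.
    return $dashed;
}

my @stored = ('WARC-Type', 'X_Crazy_Header');
print resolve_field_name(\@stored, 'X_Crazy_Header'), "\n";   # X_Crazy_Header
print resolve_field_name(\@stored, 'WARC_Type'), "\n";        # WARC-Type
print resolve_field_name(\@stored, 'warc_type'), "\n";        # WARC-Type
print resolve_field_name(\@stored, 'X_New_Header'), "\n";     # X-New-Header
```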
If either parse method encounters a field name with a leading ':', which implies an empty name and is not allowed, the leading ':' is silently dropped from the line and parsing retried. If the line is not valid after this change, the parse method croaks.
The order of field names can be fully controlled by tying an array to a WARC::Fields object and manipulating the array using ordinary Perl operations. Removing a name from the array effectively removes the field from the object, but the value for that name is still remembered, allowing names to be moved about without loss of data.
WARC::Fields will croak() if an attempt is made to set a field name with a leading ':' using the tied array interface.
The contents of a WARC::Fields object can be easily examined by tying a hash to the object. Reading or setting a hash key is equivalent to the field method, but the tied hash will iterate keys and values in the order in which each key first appears in the internal list.
...
WARC::Index - base class for WARC index classes
use WARC::Index::CDX;
# or ...
use WARC::Index::SDBM;
# or some other WARC::Index::* implementation
$index = attach WARC::Index::CDX (...);
# or ...
$index = attach WARC::Index::SDBM (...);
$record = $index->search(url => $url, time => $when);
@results = $index->search(url => $url, time => $when);
build WARC::Index::CDX (...);
# or ...
build WARC::Index::SDBM (...);
WARC::Index is an abstract base class for indexes on WARC files and WARC-alike files. This class establishes the expected interface and provides a simple interface for building indexes.
Typically, indexes are file-based and a single parameter is the name of an index file which in turn contains the names of the indexed WARC files.
The build method itself handles the from key for specifying the files to index. The from key can be given an array reference, after which more key => value pairs may follow, or can simply use the rest of the argument list as its value.
If the from key is given, the build method will read the indicated files, construct an index, and return nothing. If the from key is not given, the build method will construct and return an index builder.
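The two accepted shapes for the from key could be separated roughly like this. The function name `split_from_args` and its return convention are assumptions for illustration; the draft only describes the calling convention.

```perl
use strict;
use warnings;

# Hypothetical argument splitter: returns a hash ref of ordinary options
# and an array ref of the files named by the "from" key.
sub split_from_args {
    my @args = @_;
    my (%opts, @from);
    while (@args) {
        my $key = shift @args;
        if ($key eq 'from') {
            if (ref($args[0]) eq 'ARRAY') {
                # Array reference: more key => value pairs may follow.
                @from = @{ shift @args };
            } else {
                # Otherwise "from" takes the rest of the argument list.
                @from = splice @args;
            }
        } else {
            $opts{$key} = shift @args;
        }
    }
    return (\%opts, \@from);
}

my ($opts, $from) = split_from_args(into => 'crawl.cdx',
                                    from => 'a.warc.gz', 'b.warc.gz');
print "$opts->{into}: @$from\n";   # crawl.cdx: a.warc.gz b.warc.gz
```

An empty from list could then select the second behavior described above, returning an index builder instead of indexing immediately.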
All index builders accept at least the into key for specifying where to store the index. See the documentation for WARC::Index::*::Builder for more information.
The WARC::Index package also maintains a registry of loaded index support. The register function adds the calling package to the list.
The filename key indicates that the calling package expects to handle index files with names matching the provided regex.
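A registry of this kind might be kept as a simple list of package and pattern pairs. Everything below is a sketch under assumptions: the package names are placeholders, and `find_handler` is a hypothetical lookup that the draft does not describe.

```perl
use strict;
use warnings;

package My::IndexRegistry;   # hypothetical stand-in for WARC::Index

my @registry;

# Record the calling package together with the options it supplied.
sub register {
    my (%opts) = @_;
    push @registry, { handler => scalar(caller), %opts };
    return;
}

# Hypothetical lookup: first registered package whose regex matches.
sub find_handler {
    my ($filename) = @_;
    for my $entry (@registry) {
        return $entry->{handler} if $filename =~ $entry->{filename};
    }
    return;
}

package My::Index::CDX;
My::IndexRegistry::register(filename => qr/\.cdx$/);

package My::Index::SDBM;
My::IndexRegistry::register(filename => qr/\.sdbm$/);

package main;

print My::IndexRegistry::find_handler('crawl.cdx'), "\n";    # My::Index::CDX
print My::IndexRegistry::find_handler('crawl.sdbm'), "\n";   # My::Index::SDBM
```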
...
WARC::Record - one record from a WARC file
use WARC;
# or ...
use WARC::Volume;
# or ...
use WARC::Collection;
# WARC::Record objects are returned from ->record_at and ->search methods
# Construct a record, as when preparing a WARC file
$warcinfo = new WARC::Record (type => 'warcinfo');
...
WARC::Record objects come in two flavors with a common interface. Records read from WARC files are read-only and have meaningful return values from the methods listed in ``Methods on records from WARC files''. Records constructed in memory can be updated and those same methods all return undef.
These methods all return undef if called on a WARC::Record object that does not represent a record in a WARC file.
This method returns undef if the library does not recognize the protocol message stored in the record.
A record with Content-Type ``application/http'' with an appropriate ``msgtype'' parameter produces an HTTP::Request or HTTP::Response object. An unknown ``msgtype'' on ``application/http'' produces a generic HTTP::Message. The returned object may be a subclass to support deferred loading of entity bodies.
The WARC record payload is defined as the decoded content of the protocol response or other resource stored in the record. This method returns undef if called on a WARC record that has no payload or whose content we do not recognize.
...
WARC::Volume - Web ARChive file access for Perl
use WARC::Volume;
$volume = mount WARC::Volume ($filename);
$record = $volume->first_record;
$record = $volume->record_at($offset);
$record = $volume->search(url => $url, time => $when);
WARC::Volume ...
...
Edited 2019-08-09 by jcb: Demote headings and elide boilerplate to make draft documentation easier to read. Also clarify first question.
Edited 2019-08-09 by jcb: Oops: the only class that has tied array/hash interfaces is WARC::Fields, not WARC::Record.