
After a long sojourn in the wilderness, I have returned to the Monastery with part of an API in hand and several questions for my fellow monks.

Is there a better namespace for this than the top-level? If so, where?
Archive:: seems to fit at first glance, but this module has a radically different interface from most of the modules in that namespace because WARC files store subtly different information. (Most archives store "files"; WARC files can store files, but are designed to store HTTP request/response exchanges.) An Archive::WARC interface could be reasonable, but it would provide a special "view" of a WARC file that omits most details. (Recognizing this was a major step in designing this API -- and took me a few years to do!)

HTTP:: could be a possibility, but does not really fit because WARC files can also store information from other protocols. (The WARC spec envisions storing DNS records "as observed" as an example.)

LWP:: fits the eventual goal of providing transparent access to WARC files as a sort of "local Wayback" but is probably better reserved for the interface modules that *actually* implement that "local Wayback" than the generic support for accessing and building WARC files. (The baseline WARC distribution uses the HTTP::* classes, but has no other dependencies on LWP and no dependencies in the LWP:: namespace.)
Any problems with the use of "meaningful" constructors?
The WARC::Collection and WARC::Volume modules provide read-only access to existing (collections of)? WARC files. The constructors have been given names to reflect this: WARC::Volume->mount and WARC::Collection->assemble.

The use of "open" for a WARC::Volume constructor was considered, but cannot be used in the indirect object syntax that I prefer for a constructor due to a parse conflict with the "open" builtin that perl resolves by raising a parse error instead of looking for a class method.

("open WARC::File ($name)" would have been ideal, but looks too much like a typo using the "open" builtin.)
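To make the constructor question concrete, here is a minimal sketch of a "meaningful" constructor called both ways. Demo::Volume is a hypothetical stand-in, not the draft WARC::Volume; only the constructor name matters here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WARC::Volume; it only records the file name.
package Demo::Volume;

sub mount {
    my ($class, $filename) = @_;
    return bless { filename => $filename }, $class;
}

sub filename { $_[0]{filename} }

package main;

# Arrow syntax: always unambiguous.
my $v1 = Demo::Volume->mount('crawl-00001.warc.gz');

# Indirect object syntax: parsed as a class method call because "mount"
# is not a builtin and Demo::Volume is already known as a package.
my $v2 = mount Demo::Volume ('crawl-00002.warc.gz');

print $v1->filename, "\n", $v2->filename, "\n";
```

One consideration for users: the indirect object syntax is switched off by "use v5.36" and later feature bundles, so the arrow form probably needs to stay fully supported in any case.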
How best to provide options on the "replay" method of WARC::Record?
The current API envisions some means of retrieving the content of a WARC record as a file handle or string and another means of getting a reconstructed protocol response object. (An HTTP::Response in the usual case, but possibly something else.)

Options also include whether or not to actually retrieve the request chain or to just synthesize a request from the information in the "response" record. (There is no point in reading several WARC records for a long redirect chain if the user only cares about the URL and the server's final response.) This is a significant concern because the common CDX index format only indexes response records.
Should the tied hash and tied array interfaces for ~~WARC::Record~~ WARC::Fields be automatically invoked using overloaded dereference operators?
Or is this asking for trouble?
Is overloading the == (or <=>) operator on WARC::Record to use file:offset tuples as good an idea as it seems?
This would be most useful to coalesce duplicate records from multiple indexes. Logically, two record objects that refer to the same physical record should compare as equal.
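A rough sketch of what I have in mind, using a hypothetical Demo::Record that holds only a volume name and an offset (the real WARC::Record would hold a WARC::Volume object, so comparing volumes by name is an assumption made for illustration):

```perl
use strict;
use warnings;

package Demo::Record;

# Compare records by (volume, offset); == is defined in terms of <=>.
use overload
    '<=>'    => \&_compare,
    '=='     => sub { _compare(@_) == 0 },
    fallback => 1;

sub new {
    my ($class, %args) = @_;
    return bless { volume => $args{volume}, offset => $args{offset} }, $class;
}

sub _compare {
    my ($x, $y, $swapped) = @_;
    my $result = $x->{volume} cmp $y->{volume}
              || $x->{offset} <=> $y->{offset};
    return $swapped ? -$result : $result;
}

package main;

# Results as they might arrive from two overlapping indexes.
my @hits = (
    Demo::Record->new(volume => 'a.warc', offset => 0),
    Demo::Record->new(volume => 'a.warc', offset => 512),
    Demo::Record->new(volume => 'a.warc', offset => 0),    # duplicate
);

# Sort by location, then drop neighbours that compare equal.
my @unique;
for my $hit (sort { $a <=> $b } @hits) {
    push @unique, $hit unless @unique && $unique[-1] == $hit;
}

print scalar(@unique), " unique records\n";
```

Two distinct objects that point at the same physical record compare equal here, which is the behaviour the question is about.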
What to do with a segmented record if we lack index information to find the next segment?
WARC file names are normally systematic: we can probably guess the next WARC filename in "normal" cases, but there will always be edge cases where we have no idea.

How far should I go in trying to make this Just Work? When the "It Just Works" logic fails, is it better to return an undefined value or raise an exception? And should we ensure that all segments are available when first opening a segmented payload or defer failure to when we actually "run out of road"?
Should the WARC::Collection class have a concept of "next volume"?
This would mean that $record->next on the last record in a file returns the first record in the next file.

Related: Should WARC::Collection expose information about the set of volumes in a collection? If so, how?
Any advice on attaching contents to WARC records?
Simply keeping the contents in memory is not always an option -- WARC segmentation permits payloads of unlimited size.

Nothing is too trivial here: this is intended for CPAN and bikeshedding public APIs is the best way to avoid backwards compatibility becoming unpleasant later.

The modules are not ready for CPAN yet, mostly due to the still-lingering namespace question. Nor has any significant code been written yet, since I prefer to have a solid idea of the API before getting too involved in implementation. The rest of this node is a copy of the current documentation draft as formatted with pod2html: (internal links are probably broken, sorry)

NAME

WARC - Web ARChive support for Perl

SYNOPSIS

  use WARC;

  $collection = assemble WARC::Collection (@indexes);

  $record = $collection->search(url => $url, time => $when);

  $volume = mount WARC::Volume ($filename);

  $record = $volume->first_record;
  $next_record = $record->next;

  $record = $volume->record_at($offset);

  # $record is a WARC::Record object

DESCRIPTION

The WARC module is a convenience module for loading basic WARC support. After loading this module, the WARC::Volume and WARC::Collection classes are available.

Overview of the WARC reader support modules

WARC::Collection: A WARC::Collection object represents a set of indexed WARC files.
WARC::Volume: A WARC::Volume object represents a single WARC file.
WARC::Record: Each record in a WARC volume is analogous to an HTTP::Message, with headers specific to the WARC format.
WARC::Record::Payload
WARC::Record::Segment
WARC::Fields: A WARC::Fields object represents the set of headers in a WARC record, analogous to the use of HTTP::Headers with HTTP::Message. The HTTP::Headers class is not reused because it has protocol-specific knowledge of a set of valid headers and a standard ordering. WARC headers come from a different set and order is preserved. The key-value format used in WARC headers has its own MIME type ``application/warc-fields'' and is also usable as the contents of a ``warcinfo'' record and elsewhere. The WARC::Fields class also provides support for objects of this type.
WARC::Index: WARC::Index is the base class for WARC index formats and also holds a registry of loaded index formats for convenience when assembling WARC::Collection objects.
WARC::Index::CDX: Access module for the common CDX WARC index format.
WARC::Index::SDBM: Planned ``fast index'' format using ``SDBM_File'' to index multiple CDX indexes for fast lookup by URL/timestamp pairs. Planned because sdbm is included with Perl, and the 1008 byte record limit should be only a minor problem if URL prefixes are stored and records are split.
WARC::Index::SQLite: Another planned ``fast index'' format using DBI and DBD::SQLite. This module avoids the limitations of SDBM, but depends on modules from CPAN.

Overview of the WARC writer support modules

WARC::Volume::Builder: The WARC::Volume::Builder class provides a means to write new WARC files.
WARC::Index::CDX::Builder
WARC::Index::SDBM::Builder
WARC::Index::SQLite::Builder: The WARC::Index::*::Builder classes provide tools for building indexes either incrementally while writing the corresponding WARC file or after-the-fact by scanning an existing WARC file. The build constructor that WARC::Index provides uses one of these classes for the actual work.

CAVEATS

Support for WARC record segmentation is planned but not yet implemented.

Handling segmented WARC records requires using the WARC::Collection interface to find the next segment in a different WARC file. The WARC::Volume interface is only usable for access within one WARC file.

The older ARC format is not yet supported, nor are other archival formats directly supported. Interfaces for ``WARC-alike'' handlers are planned as WARC::Alike::*. Metadata normally present in WARC volumes may not be available from other formats.

Formats planned for eventual inclusion include MAFF described at http://maf.mozdev.org/maff-specification.html and the MHTML format defined in RFC 2557.

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

NAME

WARC::Builder - Web ARChive construction support for Perl

SYNOPSIS

  use WARC::Builder;

  $warcinfo_data = new WARC::Fields (software => 'MyWebCrawler/1.2.3 ...',
                                     format => 'WARC File Format 1.0',
                                     # other fields omitted ...
                                     );

  $warcinfo = new WARC::Record (type => 'warcinfo',
                                content => $warcinfo_data);

  # for a small-scale crawl
  $build = new WARC::Builder (warcinfo => $warcinfo,
                              filename => $warcfilename);

  # for a large-scale crawl
  $index1 = build WARC::Index::CDX (into => $indexprefix.'.cdx');
  $index2 = build WARC::Index::SDBM (into => $indexprefix.'.sdbm');
  $build = new WARC::Builder (warcinfo => $warcinfo,
                              filename_template =>
                                $warcprefix.'-%s-%05d-'.$hostname.'.warc.gz',
                              index => [$index1, $index2]);

  # for each collected object
  $build->append(@records);     # or ...
  $build->append($record1, $record2, ... );

DESCRIPTION

The WARC::Builder class is the high-level interface for writing WARC archives. It is a very simple interface, because, at this level, WARC is a very simple format: a simple sequence of WARC records, which WARC::Builder accepts as WARC::Record objects to append to the in-progress WARC file.

WARC file size limits are handled automatically if configured.

Methods

$build = new WARC::Builder (key => value, ...): Construct a WARC::Builder object. The following keys are supported:
$build->append( $record1, ... ): Add any number of WARC::Record objects to the growing WARC file. If WARC file size limits are configured, and a record would cause the current WARC file to exceed the configured size limits, a new WARC file is opened automatically. All records passed to a single append call are added to the same WARC file. If a new WARC file is to be started, it will be started before any records are written. All records passed to a single append call are considered ``concurrent'' and all subsequent records will have a ``WARC-Concurrent-To'' header added referencing the first record, if they do not already have a ``WARC-Concurrent-To'' header. This is a convenience feature for simpler crawlers and is inhibited if any record already has a ``WARC-Concurrent-To'' header when append is called. If a WARC::Record passed to this method lacks a ``WARC-Record-ID'' header, a warning will be emitted using carp(), a UUID will be generated, and a record ID of the form ``urn:uuid:UUID'' will be assigned. If the record object is read-only, this method will croak() instead. If a WARC::Record passed to this method lacks any of the ``WARC-Date'', ``WARC-Type'', or ``Content-Length'' headers, this method will croak().
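
The ``WARC-Concurrent-To'' convenience rule described for append could be sketched roughly as follows. This is only an illustration on plain hashes of header fields; link_concurrent is a hypothetical helper and the record IDs are made up, neither is part of the draft API.

```perl
use strict;
use warnings;

# Hypothetical helper: each record is a plain hash of header fields.
sub link_concurrent {
    my @records = @_;

    # The feature is inhibited if any record already carries the header.
    return @records if grep { exists $_->{'WARC-Concurrent-To'} } @records;

    # Otherwise every record after the first refers back to the first.
    my ($first, @rest) = @records;
    $_->{'WARC-Concurrent-To'} = $first->{'WARC-Record-ID'} for @rest;
    return @records;
}

my @batch = link_concurrent(
    { 'WARC-Type' => 'response', 'WARC-Record-ID' => 'urn:uuid:example-1' },
    { 'WARC-Type' => 'request',  'WARC-Record-ID' => 'urn:uuid:example-2' },
    { 'WARC-Type' => 'metadata', 'WARC-Record-ID' => 'urn:uuid:example-3' },
);

print "$_->{'WARC-Type'}: ", $_->{'WARC-Concurrent-To'} // '(none)', "\n"
    for @batch;
```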

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

NAME

WARC::Collection - Interface to a group of WARC files

SYNOPSIS

  use WARC::Collection;

  $collection = assemble WARC::Collection ($index_1, $index_2, ...);
  $collection = assemble WARC::Collection from => ($index_1, ...);

  $record = $collection->search(url => $url, time => $when);

DESCRIPTION

The WARC::Collection class is the primary means by which user code is expected to use the WARC library. This class uses indexes to efficiently search for records in one or more WARC files.

Methods

$collection = assemble WARC::Collection ($index_1, $index_2, ...)
$collection = assemble WARC::Collection from => ($index_1, ...): Assemble a collection of WARC files from one index or multiple indexes, specified either as objects derived from WARC::Index or filenames. While multiple indexes can be used in a collection, note that searching a collection requires individually searching every index in the collection.
$record = $collection->search( ... )
@records = $collection->search( ... ): Search the index for records matching the parameters and return the best match in scalar context or a list of all matches in list context. The returned values are WARC::Record objects. The parameters are specified as key => value pairs and each narrows the search, sorts the results, or both, indicated in the following list with ``[N ]'', ``[ S]'', or ``[NS]'', respectively. The keys supported are:

...

NAME

WARC::Date - datestamp objects for WARC library

SYNOPSIS

  use WARC::Date;

  $datestamp = WARC::Date->now();               # construct from current time
  $datestamp = WARC::Date->from_epoch(time);    # likewise

  # construct from string
  $datestamp = parse WARC::Date ($text);        # full-featured
  $datestamp = WARC::Date->from_text($string);  # standard format only

  $time = $datestamp->as_epoch;         # as seconds since epoch
  $text = $datestamp->as_string;        # as "YYYY-MM-DDThh:mm:ssZ"

DESCRIPTION

WARC::Date objects encapsulate the details of the required format for timestamps in WARC headers.

Methods

$datestamp = WARC::Date->now: Construct a WARC::Date object representing the current time.
$datestamp = WARC::Date->from_epoch( $timestamp ): Construct a WARC::Date object representing the time indicated by an epoch timestamp.
$datestamp = WARC::Date->from_text( $string ): Construct a WARC::Date object representing the time indicated by a string in the same format returned by the as_string method.
$datestamp = parse WARC::Date ($text): Construct a WARC::Date object from a textual representation. If HTTP::Date is installed, this method accepts any input acceptable to HTTP::Date::str2time. Otherwise, this method is equivalent to the from_text method.
$datestamp->as_string: Return a string in the format specified by [W3C-NOTE-datetime] restricted to 14 digits and UTC time zone, which is ``YYYY-MM-DDThh:mm:ssZ''.
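
For illustration, the standard format and the epoch conversion can be sketched with core modules alone. The two function names below are hypothetical and say nothing about how WARC::Date will actually be implemented.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Time::Local qw(timegm);

# Epoch seconds -> "YYYY-MM-DDThh:mm:ssZ" (always UTC).
sub epoch_to_warc_date {
    my ($epoch) = @_;
    return strftime('%Y-%m-%dT%H:%M:%SZ', gmtime($epoch));
}

# "YYYY-MM-DDThh:mm:ssZ" -> epoch seconds; dies on any other format.
sub warc_date_to_epoch {
    my ($text) = @_;
    my ($year, $mon, $mday, $hour, $min, $sec) =
        $text =~ /\A(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z\z/
        or die "not a WARC datestamp: $text\n";
    return timegm($sec, $min, $hour, $mday, $mon - 1, $year);
}

print epoch_to_warc_date(0), "\n";
print warc_date_to_epoch('2019-08-09T00:00:00Z'), "\n";
```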

CAVEATS

WARC::Date objects use epoch time internally and are therefore limited by the range of Perl's integers.

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

NAME

WARC::Fields - WARC record headers and application/warc-fields

SYNOPSIS

  require WARC::Fields;

  $f = new WARC::Fields;
  $f = $record->fields;                 # get WARC record headers

  $f->field('WARC-Type' => 'metadata'); # set
  $f->field('WARC-Type');               # get
  $f->remove_field('WARC-Type');        # delete

  tie @field_names, ref $f, $f;         # bind ordered list of field names

  tie %fields, ref $f, $f;              # bind hash of field names => values

DESCRIPTION

The WARC::Fields class encapsulates information in the ``application/warc-fields'' format used for WARC record headers. This is a simple key-value format closely analogous to HTTP headers; however, the differences are significant enough that the HTTP::Headers class cannot be reliably reused for WARC fields.

Instances of this class are usually created as member variables of the WARC::Record class, but can also be returned as the content of WARC records with Content-Type ``application/warc-fields''.

Instances of WARC::Fields retrieved from WARC files are read-only and will croak() if any attempt is made to change their contents.

This class strives to faithfully represent the contents of a WARC file, although the field names are defined to be case-insensitive.

Most WARC headers may only appear once and with a single value in valid WARC records, with the notable exception of the WARC-Concurrent-To header. WARC::Fields neither attempts to enforce nor relies upon this constraint. Headers that appear multiple times are considered to have multiple values, that is, the value associated with the header name will be an array reference. Similarly, the name of a recurring header is repeated in the tied array interface. When iterating a tied hash, all values of a recurring header are collected and returned with the first occurrence of its key.

As with HTTP::Headers, the '_' character is converted to '-' in field names unless the first character of the name is ':', which cannot itself appear in a field name. Unlike HTTP::Headers, the leading ':' is stripped off immediately and the name stored otherwise exactly as given. The method and tied hash interfaces allow this convenience feature. The field names exposed via the tied array interface are reported exactly as they appear in the WARC file.

Strictly, ``X-Crazy-Header'' and ``X_Crazy_Header'' are two different headers that the above convenience mechanism conflates. The solution is simple: if (and only if) a header field already exists with the exact name given, it is used, otherwise y/_/-/ occurs and the name is rechecked for another exact match. If no match is found, case is folded and a third check performed. If a match is found, the existing header is updated, otherwise a new header is created with character case as given.

The WARC standard specifically states that field names are case-insensitive; accordingly, ``X-Crazy-Header'' and ``X-CRAZY-HeAdEr'' are considered the same header for the method and tied hash interfaces. They will appear exactly as given in the tied array interface, however.
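
The three-step name lookup described above could be sketched as follows. resolve_field_name is a hypothetical helper, and the handling of a leading ':' (skip the y/_/-/ step but still fold case) is my reading of the rules rather than settled behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: map a requested name onto the existing field names.
# Returns the existing name to update, or the name a new field would get.
sub resolve_field_name {
    my ($given, @existing) = @_;

    # A leading ':' is stripped and asks for the name to be taken literally.
    my $literal = ($given =~ s/\A://);

    # 1. An exact match always wins.
    return $given if grep { $_ eq $given } @existing;

    # 2. Translate '_' to '-' (unless literal) and look for an exact match.
    my $name = $given;
    unless ($literal) {
        $name =~ tr/_/-/;
        return $name if grep { $_ eq $name } @existing;
    }

    # 3. Fold case and check once more.
    my ($match) = grep { lc $_ eq lc $name } @existing;
    return defined $match ? $match : $name;
}

print resolve_field_name('X_Crazy_Header', 'X-Crazy-Header'), "\n";
print resolve_field_name('x-crazy-header', 'X-Crazy-Header'), "\n";
print resolve_field_name(':X_Crazy_Header', 'X-Crazy-Header'), "\n";
```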

Methods

$f = WARC::Fields->new: Construct a new WARC::Fields object. Initial contents can be passed as key-value pairs to this constructor and will be added in the given order.
$f->clone: Copy a WARC::Fields object. A copy of a read-only object is writable.
$f->field( $name )
$f->field( $name => $value )
$f->field( $n1 => $v1, $n2 => $v2, ... ): Get or set the value of one or more fields. The field name is not case sensitive, but WARC::Fields will preserve its case if a new entry is created.
$f = WARC::Fields->parse( $text )
$f = WARC::Fields->parse_from( $fh ): Construct a new WARC::Fields object, reading initial contents from the provided text string or filehandle. If either parse method encounters a field name with a leading ':', which implies an empty name and is not allowed, the leading ':' is silently dropped from the line and parsing is retried. If the line is not valid after this change, the parse method croaks.
$f->as_string: Return the contents as a formatted WARC header or application/warc-fields block.
$f->set_readonly: Mark a WARC::Fields object read-only. All methods that modify the object will croak() if called on a read-only object.

Tied Array Access

The order of field names can be fully controlled by tying an array to a WARC::Fields object and manipulating the array using ordinary Perl operations. Removing a name from the array effectively removes the field from the object, but the value for that name is still remembered, allowing names to be moved about without loss of data.

WARC::Fields will croak() if an attempt is made to set a field name with a leading ':' using the tied array interface.

Tied Hash Access

The contents of a WARC::Fields object can be easily examined by tying a hash to the object. Reading or setting a hash key is equivalent to the field method, but the tied hash will iterate keys and values in the order in which each key first appears in the internal list.
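
As a rough illustration of the iteration order, here is a minimal tied hash that remembers the order in which keys first appear. Demo::OrderedFields is a hypothetical stand-alone class: it ignores case folding, repeated headers, and read-only objects, and unlike the draft it is not tied to an existing WARC::Fields object.

```perl
use strict;
use warnings;

package Demo::OrderedFields;

sub TIEHASH {
    my ($class, @pairs) = @_;
    my $self = bless { order => [], value => {}, iter => 0 }, $class;
    while (my ($key, $value) = splice @pairs, 0, 2) {
        $self->STORE($key, $value);
    }
    return $self;
}

sub STORE {
    my ($self, $key, $value) = @_;
    # Remember a key's position only the first time it appears.
    push @{ $self->{order} }, $key unless exists $self->{value}{$key};
    $self->{value}{$key} = $value;
}

sub FETCH  { $_[0]{value}{ $_[1] } }
sub EXISTS { exists $_[0]{value}{ $_[1] } }

sub DELETE {
    my ($self, $key) = @_;
    @{ $self->{order} } = grep { $_ ne $key } @{ $self->{order} };
    return delete $self->{value}{$key};
}

# Iteration walks the remembered order instead of Perl's hash order.
sub FIRSTKEY { my ($self) = @_; $self->{iter} = 0; return $self->NEXTKEY }
sub NEXTKEY  { my ($self) = @_; return $self->{order}[ $self->{iter}++ ] }

package main;

tie my %fields, 'Demo::OrderedFields';
$fields{'WARC-Type'}      = 'metadata';
$fields{'WARC-Date'}      = '2019-08-09T00:00:00Z';
$fields{'Content-Length'} = 0;
$fields{'WARC-Type'}      = 'resource';    # update keeps the original position

print join(', ', keys %fields), "\n";
```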

...

NAME

WARC::Index - base class for WARC index classes

SYNOPSIS

  use WARC::Index::CDX; # or ...
  use WARC::Index::SDBM;
  # or some other WARC::Index::* implementation

  $index = attach WARC::Index::CDX (...);       # or ...
  $index = attach WARC::Index::SDBM (...);

  $record = $index->search(url => $url, time => $when);
  @results = $index->search(url => $url, time => $when);

  build WARC::Index::CDX (...); # or ...
  build WARC::Index::SDBM (...);

DESCRIPTION

WARC::Index is an abstract base class for indexes on WARC files and WARC-alike files. This class establishes the expected interface and provides a simple interface for building indexes.

Methods

$index = attach WARC::Index::* (...): Construct an index object using the indicated technology and whatever parameters the index implementation needs. Typically, indexes are file-based and a single parameter is the name of an index file which in turn contains the names of the indexed WARC files.
$record = $index->search( ... )
@records = $index->search( ... ): Search an index for records matching parameters. The WARC::Collection class uses this method to search each index in a collection.
build WARC::Index::* (into => $dest, from => ...)
build WARC::Index::* (from => [...], into => $dest): The WARC::Index base class does provide this method, however. The build method works by loading the corresponding index builder class and driving the process or simply returning the newly-constructed object. The build method itself handles the from key for specifying the files to index. The from key can be given an array reference, after which more key => value pairs may follow, or can simply use the rest of the argument list as its value. If the from key is given, the build method will read the indicated files, construct an index, and return nothing. If the from key is not given, the build method will construct and return an index builder. All index builders accept at least the into key for specifying where to store the index. See the documentation for WARC::Index::*::Builder for more information.
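
The two accepted forms of the from key could be parsed along these lines. parse_build_args is a hypothetical helper written only to illustrate the argument convention; it does not load builders or read any files.

```perl
use strict;
use warnings;

# Hypothetical helper: split build() arguments into options and input files.
sub parse_build_args {
    my @args = @_;
    my (%opt, @from);
    while (@args) {
        my $key = shift @args;
        if ($key eq 'from') {
            if (ref($args[0]) eq 'ARRAY') {
                # from => [...] : more key => value pairs may follow
                @from = @{ shift @args };
            } else {
                # from => LIST : consumes the rest of the argument list
                @from = splice @args;
            }
        } else {
            $opt{$key} = shift @args;
        }
    }
    return (\%opt, \@from);
}

my ($opt1, $from1) = parse_build_args(from => ['a.warc', 'b.warc'], into => 'x.cdx');
my ($opt2, $from2) = parse_build_args(into => 'x.cdx', from => 'a.warc', 'b.warc');

print "into $opt1->{into} from @$from1\n";
print "into $opt2->{into} from @$from2\n";
```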

Index system registration

The WARC::Index package also maintains a registry of loaded index support. The register function adds the calling package to the list.

WARC::Index::register( filename => $filename_re ): Add the calling package to an internal list of available index handlers. The calling package must be a subclass of WARC::Index or this function will croak(). The filename key indicates that the calling package expects to handle index files with names matching the provided regex.
WARC::Index::find_handler( $filename ): Return the registered handler for $filename or undef if none match.
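
A minimal sketch of how such a registry might work, using hypothetical Demo::Index packages in place of the real classes. Returning the first matching handler in registration order is an assumption; the draft does not say how overlapping patterns are resolved.

```perl
use strict;
use warnings;

package Demo::Index;
use Carp;

my @handlers;    # list of [ filename regex, package name ]

sub register {
    my %opt = @_;
    my $package = caller;    # the package that called register()
    croak "$package is not a subclass of " . __PACKAGE__
        unless $package->isa(__PACKAGE__);
    push @handlers, [ $opt{filename}, $package ];
    return;
}

sub find_handler {
    my ($filename) = @_;
    for my $handler (@handlers) {
        return $handler->[1] if $filename =~ $handler->[0];
    }
    return;    # undef in scalar context: no handler matches
}

package Demo::Index::CDX;
our @ISA = ('Demo::Index');
Demo::Index::register(filename => qr/\.cdx\z/);

package main;

my $handler = Demo::Index::find_handler('crawl-2019.cdx');
print "handler: $handler\n";
```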

...

NAME

WARC::Record - one record from a WARC file

SYNOPSIS

  use WARC;             # or ...
  use WARC::Volume;     # or ...
  use WARC::Collection;

  # WARC::Record objects are returned from ->record_at and ->search methods

  # Construct a record, as when preparing a WARC file
  $warcinfo = new WARC::Record (type => 'warcinfo');

...

DESCRIPTION

WARC::Record objects come in two flavors with a common interface. Records read from WARC files are read-only and have meaningful return values from the methods listed in ``Methods on records from WARC files''. Records constructed in memory can be updated and those same methods all return undef.

Common Methods

$record->fields: Get the internal WARC::Fields object that contains WARC record headers.
$record->field( $name ): Get the value of the WARC header named $name from the internal WARC::Fields object.

Methods on records from WARC files

These methods all return undef if called on a WARC::Record object that does not represent a record in a WARC file.

$record->protocol: Return the format and version tag for this record. For WARC 1.0, this method returns 'WARC/1.0'.
$record->volume: Return the WARC::Volume object representing the file in which this record is located.
$record->offset: Return the file offset at which this record can be found.
$record->next: Return the next WARC::Record in the WARC file that contains this record.
$record->replay: Return a protocol-specific object representing the record contents. This method returns undef if the library does not recognize the protocol message stored in the record. A record with Content-Type ``application/http'' with an appropriate ``msgtype'' parameter produces an HTTP::Request or HTTP::Response object. An unknown ``msgtype'' on ``application/http'' produces a generic HTTP::Message. The returned object may be a subclass to support deferred loading of entity bodies.
$record->open_payload: Return a tied filehandle that reads the WARC record payload. The WARC record payload is defined as the decoded content of the protocol response or other resource stored in the record. This method returns undef if called on a WARC record that has no payload or has content that the library does not recognize.

Methods on fresh WARC records

$record = new WARC::Record (key => value, ...): Construct a fresh WARC record, suitable for use with WARC::Builder.

...

NAME

WARC::Volume - Web ARChive file access for Perl

SYNOPSIS

  use WARC::Volume;

  $volume = mount WARC::Volume ($filename);

  $record = $volume->first_record;

  $record = $volume->record_at($offset);

  $record = $volume->search(url => $url, time => $when);

DESCRIPTION

WARC::Volume ...

Methods

$volume = mount WARC::Volume ($filename): Construct a WARC::Volume object. The parameter is the name of an existing WARC file. An exception is raised if the first record does not have a valid WARC header.
$volume->first_record: Construct and return a WARC::Record object representing the first WARC record in $volume. This should be a ``warcinfo'' record, but it is not required to be so.
$volume->record_at( $offset ): Construct and return a WARC::Record object representing the WARC record beginning at $offset within $volume. An exception is raised if an appropriate magic number is not found at $offset.

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

...

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Edited 2019-08-09 by jcb: Demote headings and elide boilerplate to make draft documentation easier to read. Also clarify first question.

Edited 2019-08-09 by jcb: Oops: the only class that has tied array/hash interfaces is WARC::Fields, not WARC::Record.

In reply to Planning a new CPAN module for WARC support (DSLIP: IdpOp) by jcb
