Planning a new CPAN module for WARC support (DSLIP: IdpOp)

jcb has asked for the wisdom of the Perl Monks concerning the following question:

After a long sojourn in the wilderness, I have returned to the Monastery with part of an API in hand and several questions for my fellow monks.

Is there a better namespace for this than the top-level? If so, where?
Archive:: seems to fit at first glance, but this module has a radically different interface from most of the modules in that namespace because WARC files store subtly different information. (Most archives store "files"; WARC files can store files, but are designed to store HTTP request/response exchanges.) An Archive::WARC interface could be reasonable, but it would provide a special "view" of a WARC file that omits most details. (Recognizing this was a major step in designing this API -- and took me a few years to do!)

HTTP:: could be a possibility, but does not really fit because WARC files can also store information from other protocols. (The WARC spec envisions storing DNS records "as observed" as an example.)

LWP:: fits the eventual goal of providing transparent access to WARC files as a sort of "local Wayback" but is probably better reserved for the interface modules that *actually* implement that "local Wayback" than the generic support for accessing and building WARC files. (The baseline WARC distribution uses the HTTP::* classes, but has no other dependencies on LWP and no dependencies in the LWP:: namespace.)
Any problems with the use of "meaningful" constructors?
The WARC::Collection and WARC::Volume modules provide read-only access to existing (collections of)? WARC files. The constructors have been given names to reflect this: WARC::Volume->mount and WARC::Collection->assemble.

The use of "open" for a WARC::Volume constructor was considered, but cannot be used in the indirect object syntax that I prefer for a constructor due to a parse conflict with the "open" builtin that perl resolves by raising a parse error instead of looking for a class method.

("open WARC::File ($name)" would have been ideal, but looks too much like a typo using the "open" builtin.)
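For comparison, here is a minimal sketch of the two call styles. `My::Volume` is a hypothetical stand-in, not the planned WARC::Volume, and the sketch assumes a Perl where the `indirect` feature is still enabled (the default unless `use v5.36` or `no feature 'indirect'` is in effect).

```perl
use strict;
use warnings;

# Hypothetical stand-in for a class with a "meaningful" constructor.
package My::Volume {
    sub mount {
        my ($class, $filename) = @_;
        return bless { filename => $filename }, $class;
    }
    sub filename { $_[0]{filename} }
}

package main;

# Indirect object syntax, as used in the draft SYNOPSIS sections.
my $v1 = mount My::Volume ('crawl-00001.warc.gz');

# The equivalent arrow form, which cannot be mistaken for a builtin.
my $v2 = My::Volume->mount('crawl-00001.warc.gz');

print $v1->filename, "\n";
```

Both calls reach the same class method, so a constructor name that clashes with a builtin in the indirect form is still callable through the arrow form.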
How best to provide options on the "replay" method of WARC::Record?
The current API envisions some means of retrieving the content of a WARC record as a file handle or string and another means of getting a reconstructed protocol response object. (An HTTP::Response in the usual case, but possibly something else.)

Options also include whether or not to actually retrieve the request chain or to just synthesize a request from the information in the "response" record. (There is no point in reading several WARC records for a long redirect chain if the user only cares about the URL and the server's final response.) This is a significant concern because the common CDX index format only indexes response records.
Should the tied hash and tied array interfaces for ~~WARC::Record~~ WARC::Fields be automatically invoked using overloaded dereference operators?
Or is this asking for trouble?
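To make the question concrete, here is a minimal sketch of an overloaded hash dereference that hands back a tied view. `My::Fields` and `My::Fields::TiedHash` are hypothetical stand-ins for WARC::Fields; the object is a blessed array so that `%{}` can be overloaded without recursing into the object's own storage, and the view is read-only for brevity.

```perl
use strict;
use warnings;

# Hypothetical tie class, kept separate from the object class.
package My::Fields::TiedHash {
    sub TIEHASH  { my ($class, $fields) = @_; return bless { f => $fields, i => 0 }, $class }
    sub FETCH    { my ($self, $name) = @_; return $self->{f}->field($name) }
    sub EXISTS   { my ($self, $name) = @_; return defined $self->{f}->field($name) }
    sub FIRSTKEY { my ($self) = @_; $self->{i} = 0; return $self->NEXTKEY }
    sub NEXTKEY  { my ($self) = @_; return ($self->{f}->names)[ $self->{i}++ ] }
}

# Hypothetical stand-in for WARC::Fields: a blessed array of [name, value] pairs.
package My::Fields {
    use overload
        '%{}'    => sub { my ($self) = @_; tie my %view, 'My::Fields::TiedHash', $self; return \%view },
        fallback => 1;

    sub new {
        my ($class, @pairs) = @_;
        my @list;
        push @list, [ splice @pairs, 0, 2 ] while @pairs;
        return bless \@list, $class;
    }

    # Field names in stored order.
    sub names { my ($self) = @_; return map { $_->[0] } @$self }

    # Case-insensitive lookup of a single value.
    sub field {
        my ($self, $name) = @_;
        for my $pair (@$self) {
            return $pair->[1] if lc($pair->[0]) eq lc($name);
        }
        return undef;
    }
}

package main;

my $f = My::Fields->new('WARC-Type' => 'metadata', 'WARC-Date' => '2019-08-09T00:00:00Z');
print $f->{'warc-type'}, "\n";
print join(', ', keys %$f), "\n";
```

One caveat of this sketch: every dereference builds a fresh tied view, so a `while (each %$f)` loop would restart on each iteration. A real implementation would probably cache the view.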
Is overloading the == (or <=>) operator on WARC::Record to use file:offset tuples as good an idea as it seems?
This would be most useful to coalesce duplicate records from multiple indexes. Logically, two record objects that refer to the same physical record should compare as equal.
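A minimal sketch of that comparison, assuming each record knows its volume's file name and its offset. `My::Record` and its `compare_to` method are hypothetical names; the overload is bound to a method name so that subclasses could override it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WARC::Record, identified by a file:offset tuple.
package My::Record {
    use overload
        '<=>'    => 'compare_to',   # '==' and '!=' are autogenerated from this
        fallback => 1;

    sub new {
        my ($class, %args) = @_;
        return bless { volume => $args{volume}, offset => $args{offset} }, $class;
    }

    # Order by volume file name first, then by offset within the volume.
    sub compare_to {
        my ($x, $y, $swapped) = @_;
        my $order = $x->{volume} cmp $y->{volume} || $x->{offset} <=> $y->{offset};
        return $swapped ? -$order : $order;
    }
}

package main;

# Two objects built separately, as if returned from two different indexes.
my $from_cdx  = My::Record->new(volume => 'a.warc.gz', offset => 1024);
my $from_sdbm = My::Record->new(volume => 'a.warc.gz', offset => 1024);
my $later     = My::Record->new(volume => 'a.warc.gz', offset => 4096);

print "same physical record\n" if $from_cdx == $from_sdbm;
my @sorted = sort { $a <=> $b } $later, $from_cdx;
```

The same ordering is available without overloading by calling `$x->compare_to($y)` directly, so the operator can be added later without changing the method.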
What to do with a segmented record if we lack index information to find the next segment?
WARC file names are normally systematic: we can probably guess the next WARC filename in "normal" cases, but there will always be edge cases where we have no idea.

How far should I go in trying to make this Just Work? When the "It Just Works" logic fails, is it better to return an undefined value or raise an exception? And should we ensure that all segments are available when first opening a segmented payload or defer failure to when we actually "run out of road"?
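As an illustration of the "defer failure" option only, here is a minimal sketch of a tied read handle that joins segments and raises an exception when the next segment cannot be found. `My::SegmentedHandle` and the per-segment opener callbacks are assumptions made for the example, not part of the draft API, and only scalar-context `readline` is handled.

```perl
use strict;
use warnings;

package My::SegmentedHandle {
    use Carp;

    # @openers: one code ref per segment, each returning a read handle,
    # or undef when that segment cannot be located.
    sub TIEHANDLE {
        my ($class, @openers) = @_;
        return bless { openers => [@openers], fh => undef }, $class;
    }

    # Move to the next segment; returns false once all segments are consumed.
    sub _advance {
        my ($self) = @_;
        my $opener = shift @{ $self->{openers} } or return 0;
        $self->{fh} = $opener->() or croak 'next segment is not available';
        return 1;
    }

    # Return one line, joining pieces of a line that spans a segment boundary.
    sub READLINE {
        my ($self) = @_;
        my $line;
        until (defined $line && $line =~ /\n\z/) {
            if (!$self->{fh} || eof($self->{fh})) {
                last unless $self->_advance;
                next;
            }
            $line = ($line // '') . readline($self->{fh});
        }
        return $line;
    }
}

package main;

# Build an opener over an in-memory string, standing in for a WARC segment.
sub segment {
    my ($text) = @_;
    return sub { open my $fh, '<', \$text or return; return $fh };
}

tie *PAYLOAD, 'My::SegmentedHandle',
    segment("line one\nline t"), segment("wo\nline three\n");

my @lines;
while (defined(my $line = <PAYLOAD>)) { push @lines, $line }
print @lines;
```

With this shape, a missing segment surfaces as an exception at the read that needs it. Checking every opener up front at open time would be the eager alternative.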
Should the WARC::Collection class have a concept of "next volume"?
This would mean that $record->next on the last record in a file returns the first record in the next file.

Related: Should WARC::Collection expose information about the set of volumes in a collection? If so, how?
Any advice on attaching contents to WARC records?
Simply keeping the contents in memory is not always an option -- WARC segmentation permits payloads of unlimited size.

Nothing is too trivial here: this is intended for CPAN and bikeshedding public APIs is the best way to avoid backwards compatibility becoming unpleasant later.

The modules are not ready for CPAN yet, mostly due to the still-lingering namespace question. Nor has any significant code been written yet, since I prefer to have a solid idea of the API before getting too involved in implementation. The rest of this node is a copy of the current documentation draft as formatted with pod2html: (internal links are probably broken, sorry)

NAME

WARC - Web ARChive support for Perl

SYNOPSIS

  use WARC;

  $collection = assemble WARC::Collection (@indexes);

  $record = $collection->search(url => $url, time => $when);

  $volume = mount WARC::Volume ($filename);

  $record = $volume->first_record;
  $next_record = $record->next;

  $record = $volume->record_at($offset);

  # $record is a WARC::Record object

DESCRIPTION

The WARC module is a convenience module for loading basic WARC support. After loading this module, the WARC::Volume and WARC::Collection classes are available.

Overview of the WARC reader support modules

WARC::Collection: A WARC::Collection object represents a set of indexed WARC files.
WARC::Volume: A WARC::Volume object represents a single WARC file.
WARC::Record: Each record in a WARC volume is analogous to an HTTP::Message, with headers specific to the WARC format.
WARC::Record::Payload
WARC::Record::Segment
WARC::Fields: A WARC::Fields object represents the set of headers in a WARC record, analogous to the use of HTTP::Headers with HTTP::Message. The HTTP::Headers class is not reused because it has protocol-specific knowledge of a set of valid headers and a standard ordering. WARC headers come from a different set and order is preserved. The key-value format used in WARC headers has its own MIME type ``application/warc-fields'' and is also usable as the contents of a ``warcinfo'' record and elsewhere. The WARC::Fields class also provides support for objects of this type.
WARC::Index: WARC::Index is the base class for WARC index formats and also holds a registry of loaded index formats for convenience when assembling WARC::Collection objects.
WARC::Index::CDX: Access module for the common CDX WARC index format.
WARC::Index::SDBM: Planned ``fast index'' format using ``SDBM_File'' to index multiple CDX indexes for fast lookup by URL/timestamp pairs. Planned because sdbm is included with Perl, and the 1008-byte record limit should be only a minor problem if URL prefixes are stored and records are split.
WARC::Index::SQLite: Another planned ``fast index'' format using DBI and DBD::SQLite. This module avoids the limitations of SDBM, but depends on modules from CPAN.

Overview of the WARC writer support modules

WARC::Volume::Builder: The WARC::Volume::Builder class provides a means to write new WARC files.
WARC::Index::CDX::Builder
WARC::Index::SDBM::Builder
WARC::Index::SQLite::Builder: The WARC::Index::*::Builder classes provide tools for building indexes either incrementally while writing the corresponding WARC file or after the fact by scanning an existing WARC file. The build constructor that WARC::Index provides uses one of these classes for the actual work.

CAVEATS

Support for WARC record segmentation is planned but not yet implemented.

Handling segmented WARC records requires using the WARC::Collection interface to find the next segment in a different WARC file. The WARC::Volume interface is only usable for access within one WARC file.

The older ARC format is not yet supported, nor are other archival formats directly supported. Interfaces for ``WARC-alike'' handlers are planned as WARC::Alike::*. Metadata normally present in WARC volumes may not be available from other formats.

Formats planned for eventual inclusion include MAFF described at http://maf.mozdev.org/maff-specification.html and the MHTML format defined in RFC 2557.

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

NAME

WARC::Builder - Web ARChive construction support for Perl

SYNOPSIS

  use WARC::Builder;

  $warcinfo_data = new WARC::Fields (software => 'MyWebCrawler/1.2.3 ...',
                                     format => 'WARC File Format 1.0',
                                     # other fields omitted ...
                                     );

  $warcinfo = new WARC::Record (type => 'warcinfo',
                                content => $warcinfo_data);

  # for a small-scale crawl
  $build = new WARC::Builder (warcinfo => $warcinfo,
                              filename => $warcfilename);

  # for a large-scale crawl
  $index1 = build WARC::Index::CDX (into => $indexprefix.'.cdx');
  $index2 = build WARC::Index::SDBM (into => $indexprefix.'.sdbm');
  $build = new WARC::Builder (warcinfo => $warcinfo,
                              filename_template =>
                                $warcprefix.'-%s-%05d-'.$hostname.'.warc.gz',
                              index => [$index1, $index2]);

  # for each collected object
  $build->append(@records);     # or ...
  $build->append($record1, $record2, ... );

DESCRIPTION

The WARC::Builder class is the high-level interface for writing WARC archives. It is a very simple interface, because, at this level, WARC is a very simple format: a simple sequence of WARC records, which WARC::Builder accepts as WARC::Record objects to append to the in-progress WARC file.

WARC file size limits are handled automatically if configured.

Methods

$build = new WARC::Builder (key => value, ...): Construct a WARC::Builder object. The following keys are supported:
$build->append( $record1, ... ): Add any number of WARC::Record objects to the growing WARC file. If WARC file size limits are configured, and a record would cause the current WARC file to exceed the configured size limits, a new WARC file is opened automatically. All records passed to a single append call are added to the same WARC file. If a new WARC file is to be started, it will be started before any records are written. All records passed to a single append call are considered ``concurrent'' and all subsequent records will have a ``WARC-Concurrent-To'' header added referencing the first record, if they do not already have a ``WARC-Concurrent-To'' header. This is a convenience feature for simpler crawlers and is inhibited if any record already has a ``WARC-Concurrent-To'' header when append is called. If a WARC::Record passed to this method lacks a ``WARC-Record-ID'' header, a warning will be emitted using carp(), a UUID will be generated, and a record ID of the form ``urn:uuid:UUID'' will be assigned. If the record object is read-only, this method will croak() instead. If a WARC::Record passed to this method lacks any of the ``WARC-Date'', ``WARC-Type'', or ``Content-Length'' headers, this method will croak().
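The header checks and the ``WARC-Concurrent-To'' convenience described for append can be sketched as follows. This is only an illustration: records are modelled as plain hash refs because the real WARC::Record interface is not yet defined, `prepare_batch` is a hypothetical helper, and record IDs are assumed to be present already (UUID generation is left out).

```perl
use strict;
use warnings;
use Carp;

# Records are plain hash refs of header name => value here; this shape
# is an assumption made for the sketch.
sub prepare_batch {
    my (@records) = @_;

    # Headers whose absence is described as fatal in the draft.
    for my $record (@records) {
        for my $required (qw(WARC-Date WARC-Type Content-Length)) {
            croak "record lacks $required" unless defined $record->{$required};
        }
    }

    # The convenience is inhibited if any record already carries the header.
    return @records if grep { exists $_->{'WARC-Concurrent-To'} } @records;

    # Otherwise every record after the first references the first one.
    my ($first, @rest) = @records;
    $_->{'WARC-Concurrent-To'} = $first->{'WARC-Record-ID'} for @rest;
    return @records;
}

my @batch = prepare_batch(
    {   'WARC-Record-ID' => 'urn:uuid:00000000-0000-0000-0000-000000000001',
        'WARC-Date'      => '2019-08-09T00:00:00Z',
        'WARC-Type'      => 'response',
        'Content-Length' => 1234,
    },
    {   'WARC-Record-ID' => 'urn:uuid:00000000-0000-0000-0000-000000000002',
        'WARC-Date'      => '2019-08-09T00:00:00Z',
        'WARC-Type'      => 'request',
        'Content-Length' => 321,
    },
);
print $batch[1]{'WARC-Concurrent-To'}, "\n";
```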

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

NAME

WARC::Collection - Interface to a group of WARC files

SYNOPSIS

  use WARC::Collection;

  $collection = assemble WARC::Collection ($index_1, $index_2, ...);
  $collection = assemble WARC::Collection from => ($index_1, ...);

  $record = $collection->search(url => $url, time => $when);

DESCRIPTION

The WARC::Collection class is the primary means by which user code is expected to use the WARC library. This class uses indexes to efficiently search for records in one or more WARC files.

Methods

$collection = assemble WARC::Collection ($index_1, $index_2, ...)
$collection = assemble WARC::Collection from => ($index_1, ...): Assemble a collection of WARC files from one index or multiple indexes, specified either as objects derived from WARC::Index or filenames. While multiple indexes can be used in a collection, note that searching a collection requires individually searching every index in the collection.
$record = $collection->search( ... )
@records = $collection->search( ... ): Search the index for records matching the parameters and return the best match in scalar context or a list of all matches in list context. The returned values are WARC::Record objects. The parameters are specified as key => value pairs and each narrows the search, sorts the results, or both, indicated in the following list with ``[N ]'', ``[ S]'', or ``[NS]'', respectively. The keys supported are:

...

NAME

WARC::Date - datestamp objects for WARC library

SYNOPSIS

  use WARC::Date;

  $datestamp = WARC::Date->now();               # construct from current time
  $datestamp = WARC::Date->from_epoch(time);    # likewise

  # construct from string
  $datestamp = parse WARC::Date ($text);        # full-featured
  $datestamp = WARC::Date->from_text($string);  # standard format only

  $time = $datestamp->as_epoch;         # as seconds since epoch
  $text = $datestamp->as_string;        # as "YYYY-MM-DDThh:mm:ssZ"

DESCRIPTION

WARC::Date objects encapsulate the details of the required format for timestamps in WARC headers.

Methods

$datestamp = WARC::Date->now: Construct a WARC::Date object representing the current time.
$datestamp = WARC::Date->from_epoch( $timestamp ): Construct a WARC::Date object representing the time indicated by an epoch timestamp.
$datestamp = WARC::Date->from_text( $string ): Construct a WARC::Date object representing the time indicated by a string in the same format returned by the as_string method.
$datestamp = parse WARC::Date ($text): Construct a WARC::Date object from a textual representation. If HTTP::Date is installed, this method accepts any input acceptable to HTTP::Date::str2time. Otherwise, it is equivalent to the from_text method.
$datestamp->as_string: Return a string in the format specified by [W3C-NOTE-datetime] restricted to 14 digits and UTC time zone, which is ``YYYY-MM-DDThh:mm:ssZ''.
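The conversion between epoch time and the ``YYYY-MM-DDThh:mm:ssZ'' form can be sketched with core modules only. The function names below are hypothetical and are not the WARC::Date methods themselves; the parser accepts the standard format only, as described for from_text.

```perl
use strict;
use warnings;
use POSIX qw(strftime);
use Time::Local qw(timegm);

# Format an epoch timestamp as a UTC datestamp.
sub epoch_to_warc_date {
    my ($epoch) = @_;
    return strftime('%Y-%m-%dT%H:%M:%SZ', gmtime($epoch));
}

# Parse the standard format only; returns undef on anything else.
sub warc_date_to_epoch {
    my ($text) = @_;
    my ($year, $mon, $day, $hour, $min, $sec) =
        $text =~ /\A(\d{4})-(\d\d)-(\d\d)T(\d\d):(\d\d):(\d\d)Z\z/
        or return;
    return timegm($sec, $min, $hour, $day, $mon - 1, $year);
}

print epoch_to_warc_date(1_000_000_000), "\n";
```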

CAVEATS

WARC::Date objects use epoch time internally and are therefore limited by the range of Perl's integers.

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

NAME

WARC::Fields - WARC record headers and application/warc-fields

SYNOPSIS

  require WARC::Fields;

  $f = new WARC::Fields;
  $f = $record->fields;                 # get WARC record headers

  $f->field('WARC-Type' => 'metadata'); # set
  $f->field('WARC-Type');               # get
  $f->remove_field('WARC-Type');        # delete

  tie @field_names, ref $f, $f;         # bind ordered list of field names

  tie %fields, ref $f, $f;              # bind hash of field names => values

DESCRIPTION

The WARC::Fields class encapsulates information in the ``application/warc-fields'' format used for WARC record headers. This is a simple key-value format closely analogous to HTTP headers; however, the differences are significant enough that the HTTP::Headers class cannot be reliably reused for WARC fields.

Instances of this class are usually created as member variables of the WARC::Record class, but can also be returned as the content of WARC records with Content-Type ``application/warc-fields''.

Instances of WARC::Fields retrieved from WARC files are read-only and will croak() if any attempt is made to change their contents.

This class strives to faithfully represent the contents of a WARC file, although the field names are defined to be case-insensitive.

Most WARC headers may only appear once and with a single value in valid WARC records, with the notable exception of the WARC-Concurrent-To header. WARC::Fields neither attempts to enforce nor relies upon this constraint. Headers that appear multiple times are considered to have multiple values, that is, the value associated with the header name will be an array reference. Similarly, the name of a recurring header is repeated in the tied array interface. When iterating a tied hash, all values of a recurring header are collected and returned with the first occurrence of its key.

As with HTTP::Headers, the '_' character is converted to '-' in field names unless the first character of the name is ':', which cannot itself appear in a field name. Unlike HTTP::Headers, the leading ':' is stripped off immediately and the name stored otherwise exactly as given. The method and tied hash interfaces allow this convenience feature. The field names exposed via the tied array interface are reported exactly as they appear in the WARC file.

Strictly, ``X-Crazy-Header'' and ``X_Crazy_Header'' are two different headers that the above convenience mechanism conflates. The solution is simple: if (and only if) a header field already exists with the exact name given, it is used; otherwise, y/_/-/ is applied and the name is rechecked for another exact match. If no match is found, case is folded and a third check is performed. If a match is found, the existing header is updated; otherwise, a new header is created with character case as given.

The WARC standard specifically states that field names are case-insensitive. Accordingly, ``X-Crazy-Header'' and ``X-CRAZY-HeAdEr'' are considered the same header for the method and tied hash interfaces. They will appear exactly as given in the tied array interface, however.
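The three-step name lookup described above can be sketched as a standalone function. `resolve_field_name` is a hypothetical helper, and two details are assumptions: that a name given with a leading ':' still takes part in the case-folded check, and that a new field is stored under the name after '_' conversion.

```perl
use strict;
use warnings;

# $existing: array ref of field names exactly as stored.
# Returns the stored name to update, or the name under which a new
# field would be created.
sub resolve_field_name {
    my ($existing, $given) = @_;

    # A leading ':' asks for the name to be taken literally; it is stripped.
    my $literal = $given =~ s/\A://;

    # 1. Exact match.
    for my $name (@$existing) { return $name if $name eq $given }

    # 2. Convert '_' to '-' and look for another exact match.
    unless ($literal) {
        $given =~ tr/_/-/;
        for my $name (@$existing) { return $name if $name eq $given }
    }

    # 3. Fold case and check a third time.
    for my $name (@$existing) { return $name if lc($name) eq lc($given) }

    # No match: a new field is created with character case as given.
    return $given;
}

print resolve_field_name(['X-Crazy-Header'], 'x_crazy_header'), "\n";
```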

Methods

$f = WARC::Fields->new: Construct a new WARC::Fields object. Initial contents can be passed as key-value pairs to this constructor and will be added in the given order.
$f->clone: Copy a WARC::Fields object. A copy of a read-only object is writable.
$f->field( $name )
$f->field( $name => $value )
$f->field( $n1 => $v1, $n2 => $v2, ... ): Get or set the value of one or more fields. The field name is not case sensitive, but WARC::Fields will preserve its case if a new entry is created.
$f = WARC::Fields->parse( $text )
$f = WARC::Fields->parse_from( $fh ): Construct a new WARC::Fields object, reading initial contents from the provided text string or filehandle. If either parse method encounters a field name with a leading ':', which implies an empty name and is not allowed, the leading ':' is silently dropped from the line and parsing retried. If the line is not valid after this change, the parse method croaks.
$f->as_string: Return the contents as a formatted WARC header or application/warc-fields block.
$f->set_readonly: Mark a WARC::Fields object read-only. All methods that modify the object will croak() if called on a read-only object.

Tied Array Access

The order of field names can be fully controlled by tying an array to a WARC::Fields object and manipulating the array using ordinary Perl operations. Removing a name from the array effectively removes the field from the object, but the value for that name is still remembered, allowing names to be moved about without loss of data.

WARC::Fields will croak() if an attempt is made to set a field name with a leading ':' using the tied array interface.

Tied Hash Access

The contents of a WARC::Fields object can be easily examined by tying a hash to the object. Reading or setting a hash key is equivalent to the field method, but the tied hash will iterate keys and values in the order in which each key first appears in the internal list.

...

NAME

WARC::Index - base class for WARC index classes

SYNOPSIS

  use WARC::Index::CDX; # or ...
  use WARC::Index::SDBM;
  # or some other WARC::Index::* implementation

  $index = attach WARC::Index::CDX (...);       # or ...
  $index = attach WARC::Index::SDBM (...);

  $record = $index->search(url => $url, time => $when);
  @results = $index->search(url => $url, time => $when);

  build WARC::Index::CDX (...); # or ...
  build WARC::Index::SDBM (...);

DESCRIPTION

WARC::Index is an abstract base class for indexes on WARC files and WARC-alike files. This class establishes the expected interface and provides a simple interface for building indexes.

Methods

$index = attach WARC::Index::* (...): Construct an index object using the indicated technology and whatever parameters the index implementation needs. Typically, indexes are file-based, and the single parameter is the name of an index file, which in turn contains the names of the indexed WARC files.
$record = $index->search( ... )
@records = $index->search( ... ): Search an index for records matching parameters. The WARC::Collection class uses this method to search each index in a collection.
build WARC::Index::* (into => $dest, from => ...)
build WARC::Index::* (from => [...], into => $dest): The WARC::Index base class does provide this method, however. The build method works by loading the corresponding index builder class and driving the process or simply returning the newly-constructed object. The build method itself handles the from key for specifying the files to index. The from key can be given an array reference, after which more key => value pairs may follow, or can simply use the rest of the argument list as its value. If the from key is given, the build method will read the indicated files, construct an index, and return nothing. If the from key is not given, the build method will construct and return an index builder. All index builders accept at least the into key for specifying where to store the index. See the documentation for WARC::Index::*::Builder for more information.
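The two accepted shapes of the from key can be illustrated with a small argument parser. `parse_build_args` is a hypothetical helper written for this sketch; how the real build method stores its options is not specified in the draft.

```perl
use strict;
use warnings;

# Split a build-style argument list into ordinary options and the list
# of files named by "from".
sub parse_build_args {
    my @args = @_;
    my (%options, @from);
    while (@args) {
        my $key = shift @args;
        if ($key ne 'from') {
            $options{$key} = shift @args;
        }
        elsif (ref($args[0]) eq 'ARRAY') {
            # Array reference: more key => value pairs may follow.
            push @from, @{ shift @args };
        }
        else {
            # Bare list: "from" takes the rest of the argument list.
            push @from, splice @args;
        }
    }
    return (\%options, \@from);
}

my ($options, $from) =
    parse_build_args(from => ['a.warc.gz', 'b.warc.gz'], into => 'crawl.cdx');
print "indexing @$from into $options->{into}\n";
```

The bare-list form only works when from is the last key given, which matches the two call shapes shown in the method list.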

Index system registration

The WARC::Index package also maintains a registry of loaded index support. The register function adds the calling package to the list.

WARC::Index::register( filename => $filename_re ): Add the calling package to an internal list of available index handlers. The calling package must be a subclass of WARC::Index or this function will croak(). The filename key indicates that the calling package expects to handle index files with names matching the provided regex.
WARC::Index::find_handler( $filename ): Return the registered handler for $filename or undef if none match.
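A registry of this kind can be sketched in a few lines. The `My::Index` and `My::Index::CDX` packages below are hypothetical stand-ins, and first-match-wins lookup order is an assumption; the draft does not say how overlapping patterns are resolved.

```perl
use strict;
use warnings;

package My::Index {
    use Carp;

    my @handlers;    # list of [ $filename_re, $package ]

    # Called as a plain function; the caller's package is what gets registered.
    sub register {
        my (%args) = @_;
        my $package = caller;
        croak "$package is not a subclass of " . __PACKAGE__
            unless $package->isa(__PACKAGE__);
        push @handlers, [ $args{filename}, $package ];
        return;
    }

    # Return the first registered package whose pattern matches, or undef.
    sub find_handler {
        my ($filename) = @_;
        for my $entry (@handlers) {
            return $entry->[1] if $filename =~ $entry->[0];
        }
        return undef;
    }
}

package My::Index::CDX {
    our @ISA = ('My::Index');
    My::Index::register(filename => qr/\.cdx\z/);
}

package main;

print My::Index::find_handler('crawl-2019.cdx'), "\n";
```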

...

NAME

WARC::Record - one record from a WARC file

SYNOPSIS

  use WARC;             # or ...
  use WARC::Volume;     # or ...
  use WARC::Collection;

  # WARC::Record objects are returned from ->record_at and ->search methods

  # Construct a record, as when preparing a WARC file
  $warcinfo = new WARC::Record (type => 'warcinfo');

...

DESCRIPTION

WARC::Record objects come in two flavors with a common interface. Records read from WARC files are read-only and have meaningful return values from the methods listed in ``Methods on records from WARC files''. Records constructed in memory can be updated and those same methods all return undef.

Common Methods

$record->fields: Get the internal WARC::Fields object that contains WARC record headers.
$record->field( $name ): Get the value of the WARC header named $name from the internal WARC::Fields object.

Methods on records from WARC files

These methods all return undef if called on a WARC::Record object that does not represent a record in a WARC file.

$record->protocol: Return the format and version tag for this record. For WARC 1.0, this method returns 'WARC/1.0'.
$record->volume: Return the WARC::Volume object representing the file in which this record is located.
$record->offset: Return the file offset at which this record can be found.
$record->next: Return the next WARC::Record in the WARC file that contains this record.
$record->replay: Return a protocol-specific object representing the record contents. This method returns undef if the library does not recognize the protocol message stored in the record. A record with Content-Type ``application/http'' with an appropriate ``msgtype'' parameter produces an HTTP::Request or HTTP::Response object. An unknown ``msgtype'' on ``application/http'' produces a generic HTTP::Message. The returned object may be a subclass to support deferred loading of entity bodies.
$record->open_payload: Return a tied filehandle that reads the WARC record payload. The WARC record payload is defined as the decoded content of the protocol response or other resource stored in the record. This method returns undef if called on a WARC record that has no payload or content that we do not recognize.

Methods on fresh WARC records

$record = new WARC::Record (key => value, ...): Construct a fresh WARC record, suitable for use with WARC::Builder.

...

NAME

WARC::Volume - Web ARChive file access for Perl

SYNOPSIS

  use WARC::Volume;

  $volume = mount WARC::Volume ($filename);

  $record = $volume->first_record;

  $record = $volume->record_at($offset);

  $record = $volume->search(url => $url, time => $when);

DESCRIPTION

WARC::Volume ...

Methods

$volume = mount WARC::Volume ($filename): Construct a WARC::Volume object. The parameter is the name of an existing WARC file. An exception is raised if the first record does not have a valid WARC header.
$volume->first_record: Construct and return a WARC::Record object representing the first WARC record in $volume. This should be a ``warcinfo'' record, but it is not required to be so.
$volume->record_at( $offset ): Construct and return a WARC::Record object representing the WARC record beginning at $offset within $volume. An exception is raised if an appropriate magic number is not found at $offset.

AUTHOR

Jacob Bachmeyer, <jcb@cpan.org>

...

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Edited 2019-08-09 by jcb: Demote headings and elide boilerplate to make draft documentation easier to read. Also clarify first question.

Edited 2019-08-09 by jcb: Oops: the only class that has tied array/hash interfaces is WARC::Fields, not WARC::Record.

Re: Planning a new CPAN module for WARC support (DSLIP: IdpOp) by shmem (Chancellor) on Aug 09, 2019 at 23:51 UTC
It is a sad (or joyous?) fact that namespaces aren't related but by convention. Your module under the Archive:: Namespace doesn't have to follow the conventions of the other modules under this namespace. If that were so, a transparent Archive.pm would be in sight. Then, your Module says `WARC - Web ARChive support for Perl` so it is definitely an Archive type of module.

No.

can't answer yet.

You might do that, and my guess is that it is not asking for trouble, since overload occurs just in that package. But I remember having trouble with overloading and subclassing.

No, at first glance. What benefit does overloading provide you over calling a function with arguments? Overloading is useful to extend something (Math::BigInt) but has its overloading price.

You should probably leave that to code using the module.

Methods `qw(next previous)` and done. Also... yes. See previous point ;-)

Hash::Util::FieldHash perhaps? an object which knows about its size and limit?

Just an opinion of some monk.

perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re^2: Planning a new CPAN module for WARC support (DSLIP: IdpOp) by jcb (Parson) on Aug 10, 2019 at 03:33 UTC
That is what I meant by `Archive::` seeming to fit at first glance: I had the same idea; after all, "Web ARChive" is literally the name of the format. I could argue that `Archive::Web` would be an appropriate root, but then you have the problem that WARC is not the only format for storing Web documents, merely the one favored by the Internet Archive and a few national libraries. Argument by weighty authority is still argument by authority. :-( While you are correct that conventions can be ignored, I would prefer to reserve `Archive::WARC` for a (future) simpler file-ish interface. There are ways to treat a WARC file much like a ZIP or ZOO archive.

That is good, thank you.

Is that a lack of information or just not having had time to look yet? (In other words, is more information needed or only patience?)

`WARC::Fields` is a fairly simple ordered in-memory key-value store and unlikely to need subclasses. Overloading the dereference operators would make the tied array/hash interfaces nearly transparent, which seems nice to me. This would make `$record->fields->{'WARC-Type'}` or `$record->fields->{'WARC-Target-URI'}` shorthand for `$record->field('WARC-Type')` or `$record->field('WARC-Target-URI')`, since the `field` method on a `WARC::Record` is passed to the embedded `WARC::Fields` object. That is not very useful, but the real reason for overloading hash dereference to use the tied hash interface is to make `keys %{$record->fields}` valid and exactly what it looks like. Why roll my own iterator API when Perl already has one? On a side note, I realized that this question mentions the wrong package. Oops, fixed. (It had been part of `WARC::Record` originally before I decided to follow the same split as `HTTP::Message` and `HTTP::Headers`. I had been keeping a list of questions, and updating that fell through the cracks somehow. Oops!)

Overloading provides convenience mostly, like being able to use `sort` on an array of `WARC::Record` without having to specify a comparison. The overload would probably be to a `compareTo` or `compare_to` method anyway. An overload to a method should work with subclasses, although I would expect an overload to a coderef to cause problems unless subclasses also `use overload` to override it. If I understand the overload documentation correctly, the overhead of overloaded operators is tiny for packages that do not use them and is really the cost of supporting overloading at all. That is a fairly good argument against using overloads on `WARC::Record`, except that, without overloads, none of the overloadable operators make sense on a `WARC::Record`. There is `==`, but that is object identity and exactly the most obvious candidate for overloading to make `WARC::Record` objects compare equal iff they refer to the same physical record even if they were obtained from two different indexes and therefore have been constructed separately and have different memory addresses.

The purpose of WARC segmentation is to store payloads that are too large for a single WARC file. (The format has no inherent limit, but the specification recommends a policy of limiting WARC files to 1G each.) We run into this problem inside the `READ` or `READLINE` method implementing the tied file handle returned from `open_payload` on a `WARC::Record` object. Reading a payload from a WARC collection should be transparent, so the WARC library must recombine segments here.

Also, due to limitations of the WARC format, there is no `previous` method: its implementation would require starting at the first record in the WARC file and repeatedly following `next`, a nasty performance surprise for the unwary. Better to let the module user do that if they really need it. At least that way, they should know it will be very slow.

So I must ask the related question: How should `WARC::Collection` expose information about the volumes in the collection? Collections can be large enough that the indexes must be primarily stored on disk. Common Crawl, as an example, is ~~double-digit TB~~ hundreds of TB: ~~tens~~ hundreds of thousands of 1GB WARC files storing ~~many~~ billions of records per crawl. Then again, simply returning an array should work here: ~~ten~~ two hundred thousand `WARC::Volume` objects should fit in a few hundred MB or so of RAM. Is array memory overhead still significantly smaller than hash memory overhead? I will have to carefully think about expected live object counts when choosing internal representations. Or should this be another tied array interface, where the list of WARC files is drawn from an index as needed? That can only work if the collection object is only using one index, but I think requiring a merged index for collections too large for even a complete list of WARC files to fit in RAM is reasonable.

This is less of a problem for reading WARC files, since the `open_payload` method provides a tied file handle that reads the payload from a WARC record; the real problem is supplying the data when writing a WARC file, especially in a way that is compatible with future support for transparently saving `LWP` exchanges to WARC files. Are temporary files really the only practical option here? (I suspect probably so.) Temporary file space can be bounded even if payload size is not: segments can be recorded as they arrive.

Edited 2019-08-10 by jcb: Correct size of Common Crawl datasets and redo math. The conclusion seems to remain valid due to a previous math error.
Re: Planning a new CPAN module for WARC support (DSLIP: IdpOp) by haukex (Archbishop) on Aug 10, 2019 at 12:09 UTC
Just a couple of my opinions:

I'm not aware of any requirement placed on the `Archive::` namespace for all modules there to have a similar API or to only work on certain archives. At the moment it feels to me like the most natural place for such a module.

No, I don't see any issues with using constructor names different from `new`, in fact this might make the code more readable later on. Just make sure to pick names that really do describe what the constructor is doing, and don't overload it too much - feel free to add more than one constructor with different names if that fits better.

I would say key/value pairs (hash), as in `$record->replay( foo => "bar" )` - that is IMO one of the most flexible ways of doing it.

If you mean that `$object->method` as well as `$object->[...]` and `$object->{...}` should work, then yes, overloaded array/hash dereferencing that returns a tied array/hash does work (Update: I've done this myself before, but my classes for the two tie classes are different from the object's class!). Just keep in mind that you wouldn't be able to use that API for anything else then.

Before you overload an operator, I'd suggest providing a method to do the operation. An overloaded operator can always be added later. (Similarly for the above point.)

I'm not sure, but I would suggest providing both a low-level API that doesn't try to do anything fancy, so users can choose to use that for precise control of what happens, and optionally a higher-level API that tries to do the "right" thing (what that means will also be a question of experience with the module).

I would say "why not?", but not meant rhetorically - I probably don't know all the issues involved with doing this?

I don't know enough about WARC to give a good answer here...
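The "method first, operator later" advice could look something like this for the record comparison from the original post. It is a minimal sketch: `My::Record`, its `volume` and `offset` fields, and the ordering rule are assumptions invented for illustration, and only the method name `compareTo` comes from the thread. The `fallback => 1` is also my addition, so that operators without an overload keep perl's normal behavior.

```perl
use strict;
use warnings;
use Scalar::Util 'refaddr';

package My::Record;    # hypothetical stand-in for WARC::Record

# Step 2, addable in a later release: route <=> (and, through
# autogeneration, ==, != and <) to the method that already exists.
use overload '<=>' => 'compareTo', fallback => 1;

sub new {
    my ($class, %args) = @_;
    return bless { volume => $args{volume}, offset => $args{offset} }, $class;
}

# Step 1: an ordinary method. Both arguments are assumed to be records;
# "same physical record" is assumed to mean same volume and same offset.
sub compareTo {
    my ($self, $other, $swapped) = @_;
    my $order = ($self->{volume} cmp $other->{volume})
             || ($self->{offset} <=> $other->{offset});
    return $swapped ? -$order : $order;
}

package My::Record::Response;    # empty subclass; inherits the overload
our @ISA = ('My::Record');

package main;

my $via_index_a = My::Record->new(volume => 'a.warc.gz', offset => 1024);
my $via_index_b = My::Record::Response->new(volume => 'a.warc.gz', offset => 1024);
my $later       = My::Record->new(volume => 'a.warc.gz', offset => 4096);

print "method:   same record\n" if $via_index_a->compareTo($via_index_b) == 0;
print "operator: same record\n" if $via_index_a == $via_index_b;
print "but distinct objects\n"  if refaddr($via_index_a) != refaddr($via_index_b);
```

Because the overload names a method rather than a coderef, a subclass can override `compareTo` without repeating `use overload`.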
Re^2: Planning a new CPAN module for WARC support (DSLIP: IdpOp) by jcb (Parson) on Aug 10, 2019 at 23:42 UTC
The main reason that I find this reasoning unconvincing is that `Archive::WARC::` felt like the most natural place for this to me for a long time, too. While there may not be a rule that requires this in some bureaucratic sense, the Principle of Least Surprise suggests (at least to me) that modules in the same namespace should share, in principle, similar interfaces. While the method names are often different, all of the modules I have looked at in `Archive::` map some kind of string-like filename to an archive member. While this is conceptually possible for a subset of WARC records, I want this library to provide complete support for WARC files, and think that the simpler read interface should eventually go into an `Archive::WARC` package that is a front-end to this library.

While I mentioned `Archive::Web::` as a possibility in an earlier reply, I have since realized that I cannot actually use that: people will be searching for "WARC" so the name needs to include it.

Another reason to put this at top-level is that the WARC format is actually a generic container, not unlike YAML or JSON or MIME. The plan for a `WARC::Alike::` hierarchy to put WARC-like interfaces on other related formats also suggests to me that this library is looking more like a type of framework than a simple archive access tool.

Describing what the constructors do is the main reason for not using `new`. The `WARC::Volume`, `WARC::Index::{CDX,SDBM,...}`, and `WARC::Collection` classes all work only for reading existing data. (The `WARC::Index->build` class method inherited by index implementations constructs an index builder, planned as `build WARC::Index::CDX (...)` returning a `WARC::Index::CDX::Builder` object if not given the `from` option. Or should it always return the index builder, even if it "took care" of indexing some volumes for you?) So, in the current draft, volumes are mounted, indexes are attached, and collections are assembled.
So, `$record->replay` to read whatever most closely matches the actual record (and probably `croak()` if we do not have a class for it), `$record->replay( as => 'http' )` to read an `HTTP::Response` (possibly translated a la `LWP` from some other protocol, probably also `croak()`ing if we cannot do it), `$record->replay( as => 'http', with => 'request' )` to actually read the HTTP request rather than synthesizing a stub, and `$record->replay( as => 'http', with => 'chain' )` to fetch an entire HTTP redirect chain along with the final request/response pair? And feel free to bikeshed the values for the `with` option, if anyone reading has any ideas. The concern I had was about having one method do too much, but logically `replay` is a single operation, even if it dispatches to `_replay_as_` methods to handle protocol translations.

Considering that `WARC::Fields` is a simple in-memory ordered key-value store with a few convenience semantics, I do not expect that to be a problem, although your comment suggests that the array `FETCH` should perhaps return an object that stringifies to the key name, but also has an "offset" field indicating which of multiple occurrences of the same key this item represents. The idea is that the array interface should provide the "field name" column from an "application/warc-fields" document. (The WARC record headers also have their own MIME type.) Here is a sample, extracted from a WARC file I have around (actually that I made in order to have some "real-world" data for developing this):

`software: Wget/1.16 (linux-gnu)`
`format: WARC File Format 1.0`
`conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf`
`robots: classic`

That is from the "warcinfo" record that Wget wrote. For this example, the tied array would contain: `qw/software format conformsTo robots/` or objects that stringify to those values. Although if `FETCH` returns an object, it could also include the value for that line as well. Hmmmm...
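To make the "objects that stringify to the key name" idea concrete, here is a rough sketch using the warcinfo sample above (with the wrapped `conformsTo` URL rejoined, which is my reading of the line-wrap marker). Every package name is a made-up stand-in, the parser is deliberately simplified (one `name: value` per line, no continuation lines), and the tie class is kept separate from the object's class as suggested earlier in the thread.

```perl
use strict;
use warnings;

package My::FieldKey;    # hypothetical: one "name: value" line
# Stringifies to the field name; also carries occurrence number and value.
use overload '""' => sub { $_[0]{name} }, fallback => 1;
sub offset { $_[0]{offset} }
sub value  { $_[0]{value} }

package My::Fields::TiedArray;    # hypothetical tie class, distinct from My::Fields
sub TIEARRAY  { my ($class, $rows) = @_; return bless { rows => $rows }, $class }
sub FETCHSIZE { scalar @{ $_[0]{rows} } }
sub FETCH     { $_[0]{rows}[ $_[1] ] }    # a My::FieldKey, or undef past the end

package My::Fields;    # hypothetical stand-in for WARC::Fields
# The object is a blessed scalar reference, so overloading @{} cannot
# interfere with the class's own access to its data.
use overload
    '@{}'    => sub { my @a; tie @a, 'My::Fields::TiedArray', ${ $_[0] }; \@a },
    fallback => 1;

sub parse {
    my ($class, $text) = @_;
    my (@rows, %seen);
    for my $line (split /\r?\n/, $text) {
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        push @rows, bless({ name   => $name,
                            value  => $value,
                            offset => $seen{$name}++ }, 'My::FieldKey');
    }
    my $data = \@rows;
    return bless \$data, $class;
}

package main;

my $body = join("\r\n",
    'software: Wget/1.16 (linux-gnu)',
    'format: WARC File Format 1.0',
    'conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf',
    'robots: classic',
) . "\r\n";

my $fields = My::Fields->parse($body);
print "@$fields\n";                    # software format conformsTo robots
print $fields->[1]->value, "\n";       # WARC File Format 1.0
```

This sketch ties a fresh array on every dereference, which is simple but wasteful; caching the tied array would need care to avoid a reference cycle.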
I realized fairly quickly that the tie classes need to be different, and that the tied objects need to be different as well. (I recall something about self-tying causing segmentation faults in several versions of perl, but I do not have an exact citation for that at hand.) As I currently understand, while the various access methods need to be in subclasses, the `TIEHASH` and `TIEARRAY` methods are responsible for blessing the references that they return and can put them into any class desired, so tying a hash to `WARC::Fields` can invoke `WARC::Fields->TIEHASH` which returns a `WARC::Fields::TiedHash` object. The tied object class name will be a string constant, to allow the "empty subclass" test to pass, since a subclass can always override `TIEHASH`, call `SUPER::TIEHASH`, and then re-`bless` the returned object.

The overloaded array/hash dereference on `WARC::Fields` is a convenience for `tie`, which would remain documented. (I think the underlying tied object would actually be a scalar reference to the `WARC::Fields` object or its data.) The overloaded `<=>` on `WARC::Record`, meanwhile, would probably be `use overload '<=>' => 'compareTo';` with the use of camelCase in the method name as a hint that there is something special about that method: it is not directly called by perl, but it is called implicitly. That said, the main reason to overload `<=>` on `WARC::Record` is to redefine `==` to return true iff both objects refer to the same physical record, even if they are distinct objects. This is "value semantics" if I understand the term correctly.

The `WARC::Record` generally is that low-level API. The `open_payload` method returns a tied filehandle which is a higher-level API that reads the stored entity in a record or possibly multiple records if segmentation is used. (I would expect an `Archive::WARC::open_member_file` call to eventually map to `open_payload` somehow.)
This suggests an `open_content` method that returns a tied filehandle that reads from the body of a (single) WARC record without performing decoding. Now that I think about it, that could be very useful for implementing the `open_payload` method. Thanks for pointing me in this direction.

The most significant issue I see is "which volume should be 'next'?", since a collection can use multiple indexes that may partially overlap and that are presumably from multiple (possibly simultaneous) crawls. How can a total order be imposed on the WARC volumes in the least surprising way, or is this not possible in general? Remember that reading indexes into memory may not be possible and even just a list of WARC volumes may be too large to hold in RAM. While physical hardware with "that kind of disk space" probably has "that kind of RAM" too, thanks to networks and cloud computing, we may be on an instance that has access to that much data, even mapped into the local filesystem, but definitely does not have "that much" RAM. I am thinking about Common Crawl here. While I personally do not have much use for that at this time, I do want this library to scale well enough for those who do have those uses.

This comes back to WARC being a generic format, and one of the goals when developing WARC was to allow dumping network traffic (at a certain layer) directly into the growing archive. This is why WARC stores HTTP messages as records with Content-Type "application/http" and entities with transfer encodings intact.

I have an eventual goal to be able to use WARC on a small scale as a type of persistent cache, nearly transparently integrating into `LWP`. This library is the first step: routines for handling the on-disk format. Later steps include interfaces that allow `LWP::UserAgent` to transparently return items from a WARC collection when appropriate, or even to (transparently) use only a WARC collection, which could be useful for testing.
Long term ideal goals include coordinating with the `LWP` maintainers to add hooks that enable an `LWP`/`WARC` interface to record the exact bytes sent and received over the socket. But first, I need to implement reliable access to and construction of WARC files. All the rest builds on this layer.
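As an illustration of that lowest layer, the `open_content` idea from earlier in this reply might be built on a tied filehandle that exposes only a byte range of an underlying file. This is a sketch under stated assumptions: `My::BoundedHandle` is a hypothetical name, the start offset and length are assumed to be already known from the record header, the underlying handle is assumed seekable and in byte mode, and only scalar-context `<$h>` is handled. The toy "file" below merely imitates the shape of a WARC record.

```perl
use strict;
use warnings;

# Hypothetical class: a read-only window of $length bytes starting at byte
# $start of an already-open, seekable handle. Nothing is decoded.
package My::BoundedHandle;

sub TIEHANDLE {
    my ($class, $fh, $start, $length) = @_;
    return bless { fh => $fh, pos => $start, end => $start + $length }, $class;
}

sub EOF { $_[0]{pos} >= $_[0]{end} }

# read($h, $buf, $len, $offset) arrives here; $_[1] aliases the caller's $buf.
sub READ {
    my ($self, undef, $len, $offset) = @_;
    my $left = $self->{end} - $self->{pos};
    $len = $left if $len > $left;          # never read past the record body
    return 0 if $len <= 0;
    seek($self->{fh}, $self->{pos}, 0) or return undef;
    my $got = read($self->{fh}, $_[1], $len, $offset // 0);
    $self->{pos} += $got if $got;
    return $got;
}

# Scalar-context <$h> only; list context is left out of this sketch.
sub READLINE {
    my ($self) = @_;
    return undef if $self->EOF;
    seek($self->{fh}, $self->{pos}, 0) or return undef;
    my $line = readline($self->{fh});
    return undef unless defined $line;
    my $left = $self->{end} - $self->{pos};
    $line = substr($line, 0, $left) if length($line) > $left;
    $self->{pos} += length $line;
    return $line;
}

package main;
use Symbol 'gensym';

# A toy stand-in for a WARC file: header, blank line, 12-byte body, trailer.
my $file  = "WARC/1.0\r\nContent-Length: 12\r\n\r\nhello\nworld\n\r\n\r\n";
my $start = index($file, "\r\n\r\n") + 4;

open my $raw, '<', \$file or die "cannot open in-memory file: $!";
my $content = gensym;
tie *$content, 'My::BoundedHandle', $raw, $start, 12;

my $first = <$content>;            # first line of the body
read $content, my $rest, 100;      # asks for 100 bytes, gets only what is left
print $first, $rest;
```

An `open_payload` built on top of this could then chain several such windows, one per segment, and apply decoding as it reads.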
Re: Planning a new CPAN module for WARC support (DSLIP: IdpOp) by stevieb (Canon) on Aug 09, 2019 at 21:23 UTC
Hey jcb, this is a great presentation here, but to be honest, I feel that it's a bit overwhelming. It might be easier to digest for our busy Monks if you could put the code into a repository of some sort (Github/Bitbucket etc), then ask your questions in a shorter, more direct and concise post, referring to the code in the external location where necessary. Not trying to dissuade you here... I've definitely asked for code review numerous times here over the years. I'm just making a suggestion from experience that may get more eyes on what you're trying to achieve/ask.

-stevieb
Re^2: Planning a new CPAN module for WARC support (DSLIP: IdpOp) by jcb (Parson) on Aug 09, 2019 at 22:58 UTC
If not for my very first question, I would have uploaded a "preview release" to CPAN already. If the answer to that first question is (as I suspect) "No, put it at top-level", then I can start making early releases to CPAN. I expect that "0.0.0 alpha N" is a reasonable version number for "no code yet". :-)

And there seems to be a small misunderstanding: I am really asking for an API design review. There is effectively no code written yet because I am hoping for monks more experienced than myself to say either "Yes, that API is sound and will be a good addition to CPAN." or "You will have problems here, here, and here. Have you considered ...?" before I put too much effort into writing code that will make problems later.