The main reason that I find this reasoning unconvincing is that Archive::WARC:: felt like the most natural place for this to me for a long time, too.
While there may not be a rule that requires this in some bureaucratic sense, the Principle of Least Surprise suggests (at least to me) that modules in the same namespace should share broadly similar interfaces. While the method names are often different, all of the modules I have looked at in Archive:: map some kind of string-like filename to an archive member. Conceptually, that is possible for a subset of WARC records, but I want this library to provide complete support for WARC files, and I think that the simpler read interface should eventually go into an Archive::WARC package that is a front-end to this library.
While I mentioned Archive::Web:: as a possibility in an earlier reply, I have since realized that I cannot actually use that: people will be searching for "WARC" so the name needs to include it.
Another reason to put this at top-level is that the WARC format is actually a generic container, not unlike YAML or JSON or MIME. The plan for a WARC::Alike:: hierarchy to put WARC-like interfaces on other related formats also suggests to me that this library is looking more like a type of framework than a simple archive access tool.
Describing what the constructors do is the main reason for not using new. The WARC::Volume, WARC::Index::{CDX,SDBM,...}, and WARC::Collection classes all work only for reading existing data. (The WARC::Index->build class method inherited by index implementations constructs an index builder, planned as WARC::Index::CDX->build(...) returning a WARC::Index::CDX::Builder object if not given the from option. Or should it always return the index builder, even if it "took care" of indexing some volumes for you?) So, in the current draft, volumes are mounted, indexes are attached, and collections are assembled.
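A quick sketch of how that naming reads in use; the constructor names are from the draft above, but the file names and argument lists are placeholders, not a settled API:

    use WARC::Volume;
    use WARC::Index::CDX;
    use WARC::Collection;

    # Placeholder file names; only mount/attach/assemble/build are from the draft.
    my $volume     = WARC::Volume->mount('crawl-00001.warc.gz');
    my $index      = WARC::Index::CDX->attach('crawl-00001.cdx');
    my $collection = WARC::Collection->assemble($index);

    # The builder case: without a "from" option, build() hands back a
    # WARC::Index::CDX::Builder object to be driven incrementally.
    my $builder = WARC::Index::CDX->build('crawl-00001.cdx');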
So, something like this?

- $record->replay to read whatever most closely matches the actual record (and probably croak() if we do not have a class for it);
- $record->replay( as => 'http' ) to read an HTTP::Response (possibly translated a la LWP from some other protocol, probably also croak()ing if we cannot do it);
- $record->replay( as => 'http', with => 'request' ) to actually read the HTTP request rather than synthesizing a stub;
- $record->replay( as => 'http', with => 'chain' ) to fetch an entire HTTP redirect chain along with the final request/response pair.
And feel free to bikeshed the values for the with option, if anyone reading has any ideas.
The concern I had was about having one method do too much, but logically replay is a single operation, even if it dispatches to _replay_as_* methods to handle protocol translations.
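A minimal sketch of that dispatch, assuming the _replay_as_* naming from above; the _default_replay_format helper is my invention, not part of the design:

    package WARC::Record;
    use Carp ();

    sub replay {
        my ($self, %opts) = @_;
        # Use the requested format, or whatever best matches the record.
        my $as = $opts{as} // $self->_default_replay_format;  # helper is assumed
        my $handler = $self->can("_replay_as_$as")
            or Carp::croak("cannot replay this record as '$as'");
        return $self->$handler(%opts);  # e.g. _replay_as_http(with => 'chain')
    }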
Considering that WARC::Fields is a simple in-memory ordered key-value store with a few convenience semantics, I do not expect that to be a problem. Your comment does suggest, though, that the array FETCH should perhaps return an object that stringifies to the key name but also has an "offset" field indicating which of multiple occurrences of the same key this item represents. The idea is that the array interface should provide the "field name" column from an "application/warc-fields" document. (The WARC record headers also have their own MIME type.)
Here is a sample, extracted from a WARC file I have around (actually, one that I made in order to have some "real-world" data for developing this):
    software: Wget/1.16 (linux-gnu)
    format: WARC File Format 1.0
    conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
    robots: classic
That is from the "warcinfo" record that Wget wrote. For this example, the tied array would contain qw/software format conformsTo robots/ or objects that stringify to those names. Although if FETCH returns an object, it could carry the value from that line as well. Hmmmm...
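Something like this, perhaps; the class names and internal layout here are guesses for illustration only:

    package WARC::Fields::TiedArray::Item;
    use overload '""' => sub { $_[0]->{name} }, fallback => 1;
    sub name   { $_[0]->{name} }     # field name, e.g. 'conformsTo'
    sub offset { $_[0]->{offset} }   # which occurrence of a repeated key this is
    sub value  { $_[0]->{value} }    # the value from that line, if we keep it

    package WARC::Fields::TiedArray;
    sub FETCH {
        my ($self, $index) = @_;
        my ($name, $offset, $value) = @{ $self->{entries}[$index] };
        return bless { name => $name, offset => $offset, value => $value },
                     'WARC::Fields::TiedArray::Item';
    }

That way "$fields[2]" still interpolates as plain "conformsTo", but the returned object can also say which conformsTo line it was and what value that line held.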
I realized fairly quickly that the tie classes need to be different, and that the tied objects need to be different as well. (I recall something about self-tying causing segmentation faults in several versions of perl, but I do not have an exact citation for that at hand.) As I currently understand it, while the various access methods need to be in subclasses, the TIEHASH and TIEARRAY methods are responsible for blessing the references that they return and can put them into any class desired, so tying a hash to WARC::Fields can invoke WARC::Fields->TIEHASH, which returns a WARC::Fields::TiedHash object. The tied object class name will be a string constant, to allow the "empty subclass" test to pass, since a subclass can always override TIEHASH, call SUPER::TIEHASH, and then re-bless the returned object.
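In code, the scheme I have in mind looks roughly like this; the constant name and constructor arguments are illustrative:

    package WARC::Fields;

    # The tied-object class name is a string constant, so an empty
    # subclass inherits a working TIEHASH unchanged.
    use constant TIED_HASH_CLASS => 'WARC::Fields::TiedHash';

    sub TIEHASH {
        my ($class, $fields) = @_;
        return bless \$fields, TIED_HASH_CLASS;  # scalar ref to the object
    }

    # A subclass that wants a different tied-object class re-blesses:
    package My::Fields;
    our @ISA = ('WARC::Fields');
    sub TIEHASH {
        my $class = shift;
        return bless $class->SUPER::TIEHASH(@_), 'My::Fields::TiedHash';
    }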
The overloaded array/hash dereference on WARC::Fields is a convenience for tie, which would remain documented. (I think the underlying tied object would actually be a scalar reference to the WARC::Fields object or its data.) The overloaded <=> on WARC::Record would probably be use overload '<=>' => 'compareTo'; with the camelCase method name as a hint that there is something special about that method: it is not called directly, but implicitly through the overloaded operator.
That said, the main reason to overload <=> on WARC::Record is to redefine == to return true iff both objects refer to the same physical record, even if they are distinct objects. This is "value semantics" if I understand the term correctly.
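Roughly like this; the ordering key (volume, then offset within the volume) and those two accessors are assumptions on my part:

    package WARC::Record;
    use overload '<=>' => 'compareTo', fallback => 1;

    # With fallback, perl derives == from <=>, so two distinct objects
    # naming the same physical record compare equal.
    sub compareTo {
        my ($self, $other, $swapped) = @_;
        my $cmp = $self->volume cmp $other->volume   # assumed accessors:
               || $self->offset <=> $other->offset;  #   volume and offset
        return $swapped ? -$cmp : $cmp;
    }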
WARC::Record generally is that low-level API. The open_payload method returns a tied filehandle, which is a higher-level API that reads the stored entity in a record, or possibly multiple records if segmentation is used. (I would expect an Archive::WARC::open_member_file call to eventually map to open_payload somehow.)
This suggests an open_content method that returns a tied filehandle that reads from the body of a (single) WARC record without performing decoding. Now that I think about it, that could be very useful for implementing the open_payload method. Thanks for pointing me in this direction.
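In use, the two would differ like this; only the two method names come from the discussion above, the rest is a sketch:

    # Raw record body: transfer encodings and segmentation left as stored.
    my $content = $record->open_content;
    print while defined($_ = <$content>);

    # Decoded entity: encodings removed, continuation records re-joined.
    my $payload = $record->open_payload;
    my $body = do { local $/; <$payload> };  # slurp the whole payload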
The most significant issue I see is "which volume should be 'next'?": a collection can use multiple indexes that may partially overlap and that presumably come from multiple (possibly simultaneous) crawls. How do we impose a total order on the WARC volumes that is least surprising, or is that not possible in general?
Remember that reading indexes into memory may not be possible and even just a list of WARC volumes may be too large to hold in RAM. While physical hardware with "that kind of disk space" probably has "that kind of RAM" too, thanks to networks and cloud computing, we may be on an instance that has access to that much data, even mapped into the local filesystem, but definitely does not have "that much" RAM. I am thinking about Common Crawl here. While I personally do not have much use for that at this time, I do want this library to scale well enough for those who do have those uses.
This comes back to WARC being a generic format, and one of the goals when developing WARC was to allow dumping network traffic (at a certain layer) directly into the growing archive. This is why WARC stores HTTP messages as records with Content-Type "application/http" and entities with transfer encodings intact.
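For illustration, a response record might start like this (abbreviated; several mandatory WARC headers such as WARC-Date and WARC-Record-ID are omitted, and the lengths are made up), with the HTTP message stored verbatim as the record block, chunking and all:

    WARC/1.0
    WARC-Type: response
    WARC-Target-URI: http://example.com/
    Content-Type: application/http;msgtype=response
    Content-Length: 1234

    HTTP/1.1 200 OK
    Transfer-Encoding: chunked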
I have an eventual goal to be able to use WARC on a small scale as a type of persistent cache, nearly transparently integrating into LWP. This library is the first step: routines for handling the on-disk format. Later steps include interfaces that allow LWP::UserAgent to transparently return items from a WARC collection when appropriate, or even to (transparently) use only a WARC collection, which could be useful for testing. Long term ideal goals include coordinating with the LWP maintainers to add hooks that enable an LWP/WARC interface to record the exact bytes sent and received over the socket. But first, I need to implement reliable access to and construction of WARC files. All the rest builds on this layer.