The main reason that I find this reasoning unconvincing is that Archive::WARC:: felt like the most natural place for this to me for a long time, too.
While there may not be a rule that requires this in some bureaucratic sense, the Principle of Least Surprise suggests (at least to me) that modules in the same namespace should share broadly similar interfaces. While the method names are often different, all of the modules I have looked at in Archive:: map some kind of string-like filename to an archive member. Conceptually, that is possible for a subset of WARC records, but I want this library to provide complete support for WARC files, and I think that the simpler read interface should eventually go into an Archive::WARC package that is a front-end to this library.
While I mentioned Archive::Web:: as a possibility in an earlier reply, I have since realized that I cannot actually use that: people will be searching for "WARC" so the name needs to include it.
Another reason to put this at top-level is that the WARC format is actually a generic container, not unlike YAML or JSON or MIME. The plan for a WARC::Alike:: hierarchy to put WARC-like interfaces on other related formats also suggests to me that this library is looking more like a type of framework than a simple archive access tool.
Describing what the constructors do is the main reason for not using new. The WARC::Volume, WARC::Index::{CDX,SDBM,...}, and WARC::Collection classes all work only for reading existing data. (The WARC::Index->build class method inherited by index implementations constructs an index builder, planned as WARC::Index::CDX->build(...) returning a WARC::Index::CDX::Builder object if not given the from option. Or should it always return the index builder, even if it "took care" of indexing some volumes for you?) So, in the current draft, volumes are mounted, indexes are attached, and collections are assembled.
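A quick sketch of how that naming reads in use; the constructor names are from the draft above, but the file names and argument lists are placeholders, not a settled API:

    use WARC::Volume;
    use WARC::Index::CDX;
    use WARC::Collection;

    # Placeholder file names; only mount/attach/assemble/build are from the draft.
    my $volume     = WARC::Volume->mount('crawl-00001.warc.gz');
    my $index      = WARC::Index::CDX->attach('crawl-00001.cdx');
    my $collection = WARC::Collection->assemble($index);

    # The builder case: without a "from" option, build() hands back a
    # WARC::Index::CDX::Builder object to be driven incrementally.
    my $builder = WARC::Index::CDX->build('crawl-00001.cdx');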
So, something like this?

- $record->replay to read whatever most closely matches the actual record (and probably croak() if we do not have a class for it);
- $record->replay( as => 'http' ) to read an HTTP::Response (possibly translated a la LWP from some other protocol, probably also croak()ing if we cannot do it);
- $record->replay( as => 'http', with => 'request' ) to actually read the HTTP request rather than synthesizing a stub;
- $record->replay( as => 'http', with => 'chain' ) to fetch an entire HTTP redirect chain along with the final request/response pair.
And feel free to bikeshed the values for the with option, if anyone reading has any ideas.
The concern I had was about having one method do too much, but logically replay is a single operation, even if it dispatches to _replay_as_* methods to handle protocol translations.
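A minimal sketch of that dispatch, assuming the _replay_as_* naming from above; the _default_replay_format helper is my invention, not part of the design:

    package WARC::Record;
    use Carp ();

    sub replay {
        my ($self, %opts) = @_;
        # Use the requested format, or whatever best matches the record.
        my $as = $opts{as} // $self->_default_replay_format;  # helper is assumed
        my $handler = $self->can("_replay_as_$as")
            or Carp::croak("cannot replay this record as '$as'");
        return $self->$handler(%opts);  # e.g. _replay_as_http(with => 'chain')
    }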
Considering that WARC::Fields is a simple in-memory ordered key-value store with a few convenience semantics, I do not expect that to be a problem. Your comment does suggest, though, that the array FETCH should perhaps return an object that stringifies to the key name but also has an "offset" field indicating which of multiple occurrences of the same key this item represents. The idea is that the array interface should provide the "field name" column from an "application/warc-fields" document. (The WARC record headers also have their own MIME type.)
Here is a sample, extracted from a WARC file I have around (actually, one that I made in order to have some "real-world" data for developing this):
    software: Wget/1.16 (linux-gnu)
    format: WARC File Format 1.0
    conformsTo: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf
    robots: classic
That is from the "warcinfo" record that Wget wrote. For this example, the tied array would contain qw/software format conformsTo robots/ or objects that stringify to those names. Although if FETCH returns an object, it could carry the value from that line as well. Hmmmm...
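Something like this, perhaps; the class names and internal layout here are guesses for illustration only:

    package WARC::Fields::TiedArray::Item;
    use overload '""' => sub { $_[0]->{name} }, fallback => 1;
    sub name   { $_[0]->{name} }     # field name, e.g. 'conformsTo'
    sub offset { $_[0]->{offset} }   # which occurrence of a repeated key this is
    sub value  { $_[0]->{value} }    # the value from that line, if we keep it

    package WARC::Fields::TiedArray;
    sub FETCH {
        my ($self, $index) = @_;
        my ($name, $offset, $value) = @{ $self->{entries}[$index] };
        return bless { name => $name, offset => $offset, value => $value },
                     'WARC::Fields::TiedArray::Item';
    }

That way "$fields[2]" still interpolates as plain "conformsTo", but the returned object can also say which conformsTo line it was and what value that line held.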
I realized fairly quickly that the tie classes need to be different, and that the tied objects need to be different as well. (I recall something about self-tying causing segmentation faults in several versions of perl, but I do not have an exact citation for that at hand.) As I currently understand it, while the various access methods need to be in subclasses, the TIEHASH and TIEARRAY methods are responsible for blessing the references that they return and can put them into any class desired, so tying a hash to WARC::Fields can invoke WARC::Fields->TIEHASH, which returns a WARC::Fields::TiedHash object. The tied object class name will be a string constant, to allow the "empty subclass" test to pass, since a subclass can always override TIEHASH, call SUPER::TIEHASH, and then re-bless the returned object.
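In code, the scheme I have in mind looks roughly like this; the constant name and constructor arguments are illustrative:

    package WARC::Fields;

    # The tied-object class name is a string constant, so an empty
    # subclass inherits a working TIEHASH unchanged.
    use constant TIED_HASH_CLASS => 'WARC::Fields::TiedHash';

    sub TIEHASH {
        my ($class, $fields) = @_;
        return bless \$fields, TIED_HASH_CLASS;  # scalar ref to the object
    }

    # A subclass that wants a different tied-object class re-blesses:
    package My::Fields;
    our @ISA = ('WARC::Fields');
    sub TIEHASH {
        my $class = shift;
        return bless $class->SUPER::TIEHASH(@_), 'My::Fields::TiedHash';
    }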
The overloaded array/hash dereference on WARC::Fields is a convenience for tie, which would remain documented. (I think the underlying tied object would actually be a scalar reference to the WARC::Fields object or its data.) The overloaded <=> on WARC::Record would probably be use overload '<=>' => 'compareTo'; with the camelCase method name as a hint that there is something special about that method: it is not called directly, but implicitly through the overloaded operator.
That said, the main reason to overload <=> on WARC::Record is to redefine == to return true iff both objects refer to the same physical record, even if they are distinct objects. This is "value semantics" if I understand the term correctly.
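Roughly like this; the ordering key (volume, then offset within the volume) and those two accessors are assumptions on my part:

    package WARC::Record;
    use overload '<=>' => 'compareTo', fallback => 1;

    # With fallback, perl derives == from <=>, so two distinct objects
    # naming the same physical record compare equal.
    sub compareTo {
        my ($self, $other, $swapped) = @_;
        my $cmp = $self->volume cmp $other->volume   # assumed accessors:
               || $self->offset <=> $other->offset;  #   volume and offset
        return $swapped ? -$cmp : $cmp;
    }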
WARC::Record generally is that low-level API. The open_payload method returns a tied filehandle, which is a higher-level API that reads the stored entity in a record, or possibly multiple records if segmentation is used. (I would expect an Archive::WARC::open_member_file call to eventually map to open_payload somehow.)
This suggests an open_content method that returns a tied filehandle that reads from the body of a (single) WARC record without performing decoding. Now that I think about it, that could be very useful for implementing the open_payload method. Thanks for pointing me in this direction.
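In use, the two would differ like this; only the two method names come from the discussion above, the rest is a sketch:

    # Raw record body: transfer encodings and segmentation left as stored.
    my $content = $record->open_content;
    print while defined($_ = <$content>);

    # Decoded entity: encodings removed, continuation records re-joined.
    my $payload = $record->open_payload;
    my $body = do { local $/; <$payload> };  # slurp the whole payload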
The most significant issue I see is "which volume should be 'next'?": a collection can use multiple indexes that may partially overlap and that presumably come from multiple (possibly simultaneous) crawls. How do we impose a total order on the WARC volumes that is least surprising, or is that not possible in general?
Remember that reading indexes into memory may not be possible and even just a list of WARC volumes may be too large to hold in RAM. While physical hardware with "that kind of disk space" probably has "that kind of RAM" too, thanks to networks and cloud computing, we may be on an instance that has access to that much data, even mapped into the local filesystem, but definitely does not have "that much" RAM. I am thinking about Common Crawl here. While I personally do not have much use for that at this time, I do want this library to scale well enough for those who do have those uses.
This comes back to WARC being a generic format, and one of the goals when developing WARC was to allow dumping network traffic (at a certain layer) directly into the growing archive. This is why WARC stores HTTP messages as records with Content-Type "application/http" and entities with transfer encodings intact.
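For illustration, a response record might start like this (abbreviated; several mandatory WARC headers such as WARC-Date and WARC-Record-ID are omitted, and the lengths are made up), with the HTTP message stored verbatim as the record block, chunking and all:

    WARC/1.0
    WARC-Type: response
    WARC-Target-URI: http://example.com/
    Content-Type: application/http;msgtype=response
    Content-Length: 1234

    HTTP/1.1 200 OK
    Transfer-Encoding: chunked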
I have an eventual goal to be able to use WARC on a small scale as a type of persistent cache, nearly transparently integrating into LWP. This library is the first step: routines for handling the on-disk format. Later steps include interfaces that allow LWP::UserAgent to transparently return items from a WARC collection when appropriate, or even to (transparently) use only a WARC collection, which could be useful for testing. Long term ideal goals include coordinating with the LWP maintainers to add hooks that enable an LWP/WARC interface to record the exact bytes sent and received over the socket. But first, I need to implement reliable access to and construction of WARC files. All the rest builds on this layer.