PerlMonks  

Project Metadata Model

by Xiong (Hermit)
on Jan 28, 2012 at 21:16 UTC ( [id://950529]=perlmeditation )

My head is sore from knocking up against a number of project development issues that I now think are related. Some have chided me for my obsessions, but if these issues are trivial, why aren't there good solutions? When I hear many shouts of "Oh, that's done" and each wagging finger points to a different shortest distance between two points, I'm in doubt. And there seems to be no greater disagreement than that over overall project management, structure, development, and deployment.

Apart from a few contrarians, everyone agrees that it's wise to use strict, wise to use no hard tabs in code, unwise to write spaghetti code. When dealing with databases some use should be made of DBI; when parsing command line arguments, Getopt::Long is still the favorite; when testing, Test::More and Test::Harness are standard. Why then does the picture get so blurry when the camera zooms out?

This promises to be a big project itself with many smaller projects attached to it. So I'm not interested in any solution bound to a limited number of platforms or languages. I run Perl on Linux so that will be the reference implementation. Others will have different needs. Therefore the good solution will be platform and language agnostic.

What Metadata?

I've been avoiding the word metadata; too heavily overloaded. I see people using it when they mean description or index. It's also a favorite playground for CS theorists and data professionals. But I must admit that many of my outstanding issues seem to tie into the same thing; which properly is project metadata.

Rather than attempt an abstract definition I'll give examples:

  • Test scripts can be chosen from, and organized into, a suite on many criteria; any of which might be considered metadata:
    • ideal execution sequence
    • if the script should ship with the project
    • optional or mandatory
    • original author and any later contributors
    • general type of test: api, unit, etc.
    • target module
    • unit or feature tested
    • pass/fail history
    • testing approach used and test modules required
  • Time to start a new project. Template placeholders must be replaced by literal values. If at creation I say CLIENT => 'Citizen Canine' then that's metadata about the project as a whole.
  • At release, installation, and in between when the dist is hosted, browsed, and indexed; Build.PL (or your poison of choice) and much of what it creates contains essentially nothing but metadata.
  • What are all those dirs doing in your project tree? Are they there to permit you to have multiple files with the same name? Why not put them all together? Sometimes the filesystem hierarchy is the simplest way to tell tools where to look. So prove looks for test scripts in t/ and not in lib/. Is not "I am a test script" an attribute of a file? What exactly does it mean that a script is found in bin/? "I am a user executable." I see all kinds of metadata crammed into filepaths like Devel-Comments-v1.1.4/t/dc-and-ok-for-vanilla/42-simple-for.t
  • Project-specific config files may contain all manner of things, some of which I wouldn't call project metadata, some I might. But every project needs to be able to find its config files; more precisely, needs to write and read them. Where a project maintains its data and in what format is metadata.
  • During development there's a continuous need to create new files and insert them into the current project. The developer should not have to tell the project management tool anything twice; it should already know. While we're on that topic, note that some elements, such as author name and email, chosen license, and boilerplate style, are metadata about the developer, perhaps even the developer team.
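
To make the first example above concrete, here is a minimal sketch of choosing and ordering a test suite from per-script metadata. It is written in Python purely for illustration, since the model is meant to be language agnostic. The field names (sequence, ships, mandatory, kind) and the script paths are hypothetical; nothing here is taken from an existing tool.

```python
# Hypothetical per-script metadata; the field names are illustrative assumptions.
TEST_METADATA = [
    {"path": "t/10-api.t", "sequence": 10, "ships": True, "mandatory": True, "kind": "api"},
    {"path": "t/20-unit.t", "sequence": 20, "ships": True, "mandatory": False, "kind": "unit"},
    {"path": "xt/05-author.t", "sequence": 5, "ships": False, "mandatory": False, "kind": "unit"},
]


def select_suite(metadata, **criteria):
    """Return the paths of scripts whose metadata matches every criterion,
    ordered by their ideal execution sequence."""
    chosen = [
        entry for entry in metadata
        if all(entry.get(key) == value for key, value in criteria.items())
    ]
    return [entry["path"] for entry in sorted(chosen, key=lambda e: e["sequence"])]
```

Calling select_suite(TEST_METADATA, ships=True) would pick only the scripts that ship with the project; adding kind="unit" narrows the suite further without touching the filesystem layout.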

Use Case

Given a single, consistent project metadata store, I begin a project by writing some of that metadata first. From this, stub files and a skeleton tree are generated, including a number of working, failing tests. I might write a first approximation of the API and so produce, at one stroke, skeleton routines, API tests, and POD synopsis. Any project description I write is mirrored between POD and README (with appropriate changes in formatting).

As I edit the project code, I'm also generating metadata. When I use a module, that module is added to Build.PL, POD, and README. Ideally the metadata for each such used module is read to flesh out the project code and metadata. Minimum required version can be found by asking when a given feature was implemented... and when it started to work.

During the tight edit-test, edit-code loop all results are stored. In case of a regression I consult this metadata to find out which feature stopped working and when. Some integration with version control is helpful here; I can say on which branch the feature works.
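
A rough sketch of that lookup, in Python for illustration only: given stored results in chronological order, find the point where a feature went from passing to failing. The record layout (revision, feature, passed) and the revision labels are assumptions made up for this example.

```python
# Hypothetical stored results, oldest first: (revision, feature, passed).
HISTORY = [
    ("r1", "parse", True),
    ("r2", "parse", True),
    ("r3", "parse", False),
    ("r4", "parse", False),
]


def first_regression(history, feature):
    """Return (last_good, first_bad) revisions for the most recent
    pass-to-fail transition of a feature, or None if there is none."""
    last_good = None
    transition = None
    for revision, name, passed in history:
        if name != feature:
            continue
        if passed:
            last_good = revision
        elif last_good is not None and (transition is None or transition[0] != last_good):
            # First failure seen after the latest passing revision.
            transition = (last_good, revision)
    return transition
```

With the sample data, the answer for "parse" is the pair ("r2", "r3"): it last worked at r2 and first failed at r3. Tying the revision labels to version control branches or commits is left open here.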

Some say the project's test suite is the only reliable documentation; but it's not always easy to read. My project is better documented because, as I create tests and implement functions to make them pass, I also create documentation from the generated metadata. It's less work and more reliable than writing docs by hand.

Those pesky config files are created first in the project dir. Perhaps I write a few alternatives. Each is tagged to identify its purpose: foo.conf is for novice users of my project; bar.conf is an example power user config file. Later, at run time, my user consults the metadata to select. He also decides to create a few alternates of his own, which he tags and files away.

I host my project on GitHub; project metadata shows other developers what I'm doing... and what I think I'm doing and will do. If I pull a contributor's patch then his metadata travels with it. Documentation is updated; every place where his name should be mentioned, it is. Far down the road it will be possible to say who did what.

When I release, the metadata needed to instruct the tarballing is already available. This same metadata goes into the tarball so CPAN has a head start on indexing.

During installation, including install testing, and user run time; project metadata stands ready to inform the user what went wrong where and makes bug report submission a snap. I pull bug reports (together with their metadata) and see quickly where the issue may lie. If not supplied, I can generate a new failing test from the metadata store.

Finally, I can use the correspondences present in the metadata in reverse to deconstruct project elements into generic boilerplate templates... for use in the next project.
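
As a naive illustration of using those correspondences in both directions, the Python sketch below fills a template from project metadata and then turns the finished text back into a template. The {{NAME}} placeholder syntax is an assumption invented for this example, and plain string replacement is only a stand-in for whatever a real tool would do.

```python
def fill(template, values):
    """Replace each {{NAME}} placeholder with its literal value."""
    for name, literal in values.items():
        template = template.replace("{{" + name + "}}", literal)
    return template


def unfill(text, values):
    """Reverse direction: replace literal values with their placeholders.
    Longer literals go first so a literal contained in another is not clobbered."""
    for name, literal in sorted(values.items(), key=lambda kv: len(kv[1]), reverse=True):
        if literal:  # an empty literal cannot be located in the text
            text = text.replace(literal, "{{" + name + "}}")
    return text
```

With values = {"CLIENT": "Citizen Canine"}, fill("Prepared for {{CLIENT}}.", values) gives "Prepared for Citizen Canine." and unfill recovers the template. The reverse step is only as good as the metadata: a literal that also occurs by coincidence elsewhere in the text would be replaced too.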

Meta Model

A simple project metadata model needs to be written. This is not it, yet; this is just a stab at it.

I imagine a filesystem in which every file and dir has an associated metadata file. Every application, every tool, would know exactly where to go look for metadata about a given file, subdir, project, user, machine, or team.

The blatantly obvious way to do this would be to store a dot-file (hidden file on any OS) in the same dir as the subject; but this might grow ugly quickly -- dot-files and dot-folders are too numerous as it is. Ideally, there would be a less kludgy method of designating metafiles -- a completely new flag. Of course, one might hope that as the standard gained adoption there would be fewer of these dot-files.

Perhaps a reasonable, portable compromise (not requiring every OS to conform to a new filesystem) would be to store all metafiles within a dir in a single subdir:

    projects/
        .meta/
            self
        train-set/
            .meta/
                self
            lib/
                .meta/
                    self
                Train/
                    .meta/
                        self
                        Set.pm
                    Set.pm
            t/
                .meta/
                    self
                bad/
                    .meta/
                        self
                        fubar.t
                    fubar.t
                good/
                    .meta/
                        self
                        bar.t
                        foo.t
                    bar.t
                    foo.t

Note that there's no requirement for every metafile to exist; merely the locations where they may be found are defined. The rules are simple: For any file ./foo, its metafile may be found at ./.meta/foo -- if it exists. Every dir bar/'s metafile can be found at bar/.meta/self. Additional metadata may be found by going up the path all the way to /.meta.

Strictly, the metafile that's the flip side of foo/bar/ should be found in foo/.meta/bar; and that's viable until you get to /. I elect bar/.meta/self both to eliminate the special root case and on grounds that when tarballed a dir should somehow carry its metadata along with it.
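
The lookup rules can be sketched in a few lines. This is Python for illustration only, and it assumes POSIX-style paths; the function names are hypothetical. It also assumes that "going up the path all the way to /.meta" means consulting each ancestor dir's .meta/self in turn.

```python
from pathlib import PurePosixPath


def metafile_for_file(path):
    """For a file ./foo, the metafile lives at ./.meta/foo."""
    p = PurePosixPath(path)
    return p.parent / ".meta" / p.name


def metafile_for_dir(path):
    """For a dir bar/, the metafile lives at bar/.meta/self."""
    return PurePosixPath(path) / ".meta" / "self"


def candidate_metafiles(path):
    """List every location where metadata for a file may be found,
    most specific first, walking up to the root's .meta/self."""
    p = PurePosixPath(path)
    candidates = [metafile_for_file(p)]
    directory = p.parent
    while True:
        candidates.append(metafile_for_dir(directory))
        if directory == directory.parent:  # reached "/" (or "." for relative paths)
            break
        directory = directory.parent
    return candidates
```

None of these functions touches the disk; they only compute where a metafile would be, which matches the rule that no metafile is required to exist.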

There is a weakness here I won't attempt to obscure: some rogue application or crazy user might create a regular file path/self, which would require a path/.meta/self metafile conflicting with the file intended to store metadata about path/ itself. There are obvious workarounds. A truly clean solution requires tighter integration with the OS (with every OS) and although I sincerely desire such a thing, that amounts to a One Ring fantasy.

Turf Wars

When a single metafile is writable by many different tools, chaos might ensue but for three simple rules:

  • Any tool may read all metadata.
  • Any tool may write 'private' metadata to its own section of any metafile.
  • Any tool may write to the 'public' section.

So a build tool may blithely search the entire metafile of a project looking for some PROJECT_NAME but if it decides to assign a private value to that attribute it must namespace it as build:PROJECT_NAME.
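
One possible reading of these rules, sketched in Python with a plain dict standing in for a parsed metafile. The section layout (one top-level section per tool plus a section named public) and the function names are assumptions; so is the lookup order, which prefers a tool's private value and falls back to the public one.

```python
def write_private(metafile, tool, key, value):
    """A tool writes only to its own section, giving keys like build:PROJECT_NAME."""
    metafile.setdefault(tool, {})[key] = value


def write_public(metafile, key, value):
    """Any tool may write to the shared 'public' section."""
    metafile.setdefault("public", {})[key] = value


def read(metafile, key, tool=None):
    """Any tool may read everything: prefer the tool's private value,
    then fall back to the public one."""
    if tool is not None and key in metafile.get(tool, {}):
        return metafile[tool][key]
    return metafile.get("public", {}).get(key)
```

Under this layout a tool literally named public would collide with the shared section, so a real specification would have to reserve that name.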

No metadata will be 'locked' or 'secret'. These are best taken care of by platform-specific system permissions and other tools.

Metafile Format

YAML is human-readable and implemented across a wide range of platforms and languages; it's popular and well understood; it supports recursive data structures and explicit typing. XML has the advantage of extremely formal schemas but it's cumbersome. JSON shares with straight Perl code the vulnerability of being directly executable by an interpreter (JavaScript's, when a careless consumer evals it instead of parsing it).

YAML's typing is explicit but it's not enforced. Rx provides schemata for YAML and this has a decent Perl interface. This may be a sufficient combination.

Another Way To Do It

An obvious alternative is one or more databases; perhaps SQLite. As databases go, it's lightweight.

My feeling is that this is not the right way to go; I could be swayed. My thought is that to offer a metadata lingua franca, an unsophisticated marketplace accessible to all who choose to participate; the lowest possible barrier to entry is best.

Implementation

I have now spent the bulk of my Perl time just trying to upgrade my "workbench" to what I consider a usable standard; and I'm not there yet. I realize that others have simply bitten the bullet; they cobble together solutions out of existing, inadequate tools; do many tasks manually; and keep a lot of metadata in their heads. Personally, I just can't do that.

I envision one unified interface and many small tools to assist other, standard tools to plug into a project metadata system. At minimum, an interface will be provided to allow writing metadata correctly and reading it on demand. Cascading will be taken care of internally so when multiple values of the same element are available, a tool can demand the entire set or the most specific value.
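
The cascading behaviour might look something like the following Python sketch. It assumes the metadata layers have already been loaded into dicts and ordered from most specific (the file's own metafile) to least specific (the root); the function names and sample keys are hypothetical.

```python
def cascade(layers, key):
    """Collect every value of key from metadata layers ordered most
    specific first (file, then its dir, then parent dirs, and so on)."""
    return [layer[key] for layer in layers if key in layer]


def most_specific(layers, key, default=None):
    """Return only the most specific value, or default if none is set."""
    values = cascade(layers, key)
    return values[0] if values else default
```

A tool that wants every author ever recorded along the path asks for cascade(layers, "author"); one that only needs the license in force for a given file asks for most_specific(layers, "license").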

My interest (and my competency) begins and, perhaps, ends with Perl on Linux; but project metadata should be open to all languages and all platforms. I plan to write a standard specification and, in Perl, a reference implementation. Gradually, I'll tie in my favorite tools.

I'm well aware that I'll need to make significant progress before anyone else shows much interest. Please know that all comments will be taken very seriously.

What's in a Name?

Some say nothing and if you're in that camp, feel free to skip. For me, a good name is everything and standard interchangeable project metadata model doesn't cut it. So if this sounds like an interesting concept then please, by all means, go ahead and try for a better name. You're welcome to slip me suggestions privately or anonymously.

Summary

The bulk of my efforts over the past few years have run aground on what seem to me to be issues of metadata. Now I believe I will not be content to move ahead with any project until I have a reliable method for interchanging metadata among the various stages of a project's life.

Thanks

  • moritz for shoving me out of the CB on this topic.
  • bigcheese for a few concise words at the right time.

Changes

Suggestions for improvement are welcome and will be incorporated.

2012-01-28:
- new
I'm not the guy you kill, I'm the guy you buy. —Michael Clayton

Replies are listed 'Best First'.
Re: Project Metadata Model
by BrowserUk (Patriarch) on Jan 29, 2012 at 11:07 UTC

    Whilst not completely oblivious to the problem you are seeking to address, I think there are (at least) two problems with what you are proposing:

    1. You would be adding Yet Another Layer to packages that already have too many layers.

      And none of the existing layers would 'go away' in the process. Backward compatibility, not to mention vested interests, mean that all the existing layers will need to stay in place. No one is going to give up their 'meta.yml' file or equivalent.

      In the end, you will have just added to the burden.

    2. Your all encompassing vision will result in a rat's nest overburden of data in which the actual required information will be entirely obscured.

      As an example of what I mean, this is a 'simple' .vcproj file. Don't just glance at the size of it, take a few moments to peruse the elements it contains, whilst bearing in mind the question -- what does this contribute to the actual building of the application?

      In my assessment, about 80% to 90% of the content of that file is only there to support the metadata of the metafile format itself. Think about that a moment. Take this tiny snippet:

      <xs:sequence>
        <xs:element name="Tool" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="Name" type="xs:string" use="required" />
            <!-- NOTE: all other attributes are properties of that particular tool object. -->
            <!-- any unrecognized attribute will be ignored. -->
            <xs:anyAttribute processContents="skip" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>

      What does that actually mean? Anyone? What does it actually contribute to the build maintenance of the project? (My conclusion is: Nada!)

      I recently downloaded a VC project that contains 11 source files. I don't use the MSVC IDE, and so I wanted to work out how to build them from the command line and delved into the 400+ line .vcproj file. I got completely lost. So then I tried to build it from the command line, and it came down to:

      cl /MT *.c

      That was all that was required. It built the entire thing right through to the executable, which then ran perfectly. Sure, I added a few extra options when I wrote a makefile, but still I ended up with a 10 line makefile.

      What do those 400+ lines buy me?

    This kind of thing always reminds me of watching a Michelin starred chef prepare a menu on TV. They stand there screaming orders at a dozen sous chefs, all running around performing gastronomic feats of valour. Eventually, a plate of food is delivered to the pass; the top man wipes away an imagined fingerprint and declares it fit for consumption, to a huge round of applause.

    Meanwhile, my wife has cooked and served a fine meal for four, washing up as she went, whilst also watching and commenting on what the great man was doing: "Did you notice? He didn't taste a damn thing!"

    When the process becomes more important than the product, seriously, it is time to re-evaluate your priorities.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Re: Project Metadata Model
by moritz (Cardinal) on Jan 29, 2012 at 04:14 UTC

    I've heard it said, and believe it to be true, that the accuracy of meta data decreases with the distance between the storage of the data and meta data.

    For example if I edit a file, I type its name (or a part thereof, followed by the tab key) in the command line, and I am basically forced to notice a divergence between file name and the purpose of a file, should such a divergence occur. But if the purpose of a file was written down in a meta data file, I simply wouldn't notice unless I explicitly cared to remember updating the meta data.

    That's why I find that storing meta data in so many different places (file names, comments and code in the file, plus some extra in Build.PL/Makefile.PL) isn't actually such a bad idea. It might be a bad idea for the author of a tool that needs the meta data, but for the author or maintainer of the distribution I believe it's the only viable way. After all, I want to program, not spend much time keeping my meta data in sync.

    (I also believe that this is the deeper reason for the hatred that many windows users and developers feel towards the "registry", a huge central place for meta data that is in many ways too disconnected from the data it describes).

    That said, I don't believe we have found the optimal sweet spot for meta data yet. Dist::Zilla and similar tools are an exploration in the opposite direction to what you propose: they try to derive meta data (for example dependencies) from the code as much as possible, allowing the author to focus even more on the code itself. I've never quite followed that path, though I'm not sure why. Maybe because it feels like giving up control. (Yes, dzil is flexible enough to let you decide which parameters you want to control yourself, and which it derives itself. But it requires learning about yet another complex system, and somehow I haven't yet felt the need to do so).

    Given a single, consistent project metadata store, I begin a project by writing some of that metadata first.

    I know this is just a use case, but it does sound pretty much like a top down approach. My projects usually work quite differently: They start as some throwaway .pl file, and if I happen to run it more often I copy it to ~/bin/, and if I expand it and think it might become beneficial to others, I start to extract most subroutines into a module and then CPANize it. That's also the reason why I don't care too much for Module::Starter and the like: they assume I start from scratch, but I don't. Adding the boilerplate usually seems like less work than starting with the boilerplate and adding my code to it.

    I don't know what the summary of my somewhat disconnected ramblings is; maybe it is that I find the idea intriguing, but that I don't think it will work for me. I have the feeling that it will violate the motto "don't repeat yourself"; the proposed meta data scheme seems to encourage repeating information that is already there in some way.

      I agree absolutely that the accuracy of metadata decreases with distance from its subject. I'll go further: The usefulness of metadata decreases with distance from its subject. That's why I favor one file : one metafile. It's the smallest practical distance. I think of each metafile as the flip side of the subject file. You have a document, you edit the document; you want to make notes about the document, you flip it over and scribble there. I am opposed to the fat bloated trapdoor spider sucking goo from everywhere approach.

      Synchronization is exactly my issue. moritz doesn't want to keep metadata in sync but that's exactly what we're doing manually when we copy from one place to another, with or without some format translation. My projects tend to fall apart as little pockets of metadata desynchronize. When I do release, then later I find getting back into an old project a staggering task. I simply can't remember all the informal relationships and unwritten rules.

      Developers generally seem to dislike writing tests and documentation. Some will feel that writing yet more metadata only increases the burden of not-code. But my object is to streamline all of the not-code tasks as well as some of the coding by providing a means of interchange.

      Multiple points of entry and exit mean that you are not required or expected to master yet another language and grand interface. Rather, the benefit comes incrementally. You continue to work as you always have. Perhaps you pull a feature branch from a contributor. You now have the opportunity to accept updates to your project's README, POD, and test suite. You can dismiss the offers or accept them and go straight to the updated elements and re-tailor them. Of course if your contributor is using PMM then he may already have accepted these offers and delivered a complete, all files patch instead of a vague bug report. Which means less work for you.

      If your projects don't have a single boilerplate start then you won't want that kind of tool. You may be more interested in a tool that tracks metadata as it accumulates and assists you later on when you want to throw on another Jenga block.

      You may not want any of this. Picasso could paint with a toothbrush. I need more structure.

      I'm not the guy you kill, I'm the guy you buy. —Michael Clayton
Re: Project Metadata Model
by tobyink (Canon) on Jan 29, 2012 at 07:31 UTC

    Does everyone really agree that it's "wise to use no hard tabs in code"? There are quite a few comments at the bottom of that page favouring tabs. See also Why tabs are clearly superior and Why I love having tabs in source code.

    I use tabs. I'd compare tabs versus spaces to semantic HTML versus presentational HTML. By using tabs, you can view my source in a text editor with tab stops set to two spaces, four spaces, eight spaces or whatever, and it still looks sensible - I don't enforce my favourite indentation level on you.

    Anyway, enough of that, that's not the main point of your post...

    Project metadata: what I've been doing lately is to have a directory called meta inside the root directory of the project. This contains zero or more RDF files with project metadata - changelogs, credits, links to the bug tracker, repository, etc.

    This can be all in one file, or split up arbitrarily - they are all combined into one in-memory model when they're processed. Currently I tend to use one file for a changelog, one for general project metadata, and a third one for keeping track of dependencies.

    My Makefile.PL then assembles this into META.yml and Changes files, figures out the project's licence and creates a LICENSE file too. It does all that at the author side when making a distribution - thus the libraries for metadata management don't need to be installed at the end user's side.

    The code for doing all this is on CPAN (of course):

    Here's an example of a project that uses it: repo and distributed code on CPAN. Notice the size of Makefile.PL? 42 bytes. No metadata there - it's all in the meta directory.

    update: here's another (bigger) project using the same metadata system.

      I don't find this any better than the equivalent Makefile.PL:
      # This file provides instructions for packaging.

      @prefix : <http://purl.org/NET/cpan-uri/terms#> .

      <http://purl.org/NET/cpan-uri/dist/all-your-base-are-belong-to/project>
          :perl_version_from _:main ;
          :version_from _:main ;
          :readme_from _:main ;
          :test_requires "Test::More 0.61", "File::Basename", "File::Spec", "Data::Dumper" ;
          :requires "parent", "version 0.77".

      _:main <http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#fileName> "lib/all/your/base/are/belong/to.pm" .
      Worse, apparently your packaging script just bundles a bunch of RDF-parsing modules, then calls them from Makefile.PL, so an ordinary CPAN author has to learn your nonstandard RDF system to contribute to your modules. Perl already has at least 3 widely-used formats to describe a module: ExtUtils::MakeMaker, Module::Build, and Module::Install; people should think hard before adding yet another.

        Except that Makefile.PL is an executable file, and if you want the data it contains, you need to run it. Run it and hope that it doesn't hose your system.

        Module::Package (which I use) is just a wrapper for Module::Install - which itself is mostly a wrapper for ExtUtils::MakeMaker.

        Perhaps one of the existing formats can be extended to cover the use case?

        I'm not the guy you kill, I'm the guy you buy. —Michael Clayton
      What is DOAP, some kind of XML thing?

        DOAP is a vocabulary for describing a software project using RDF.

        No XML here.

Re: Project Metadata Model
by eyepopslikeamosquito (Archbishop) on Jan 28, 2012 at 21:50 UTC
    I too have long had the feeling that metadata could be applied more effectively in many different areas.

    I floated some vague Perl testing metadata ideas in Perl CPAN test metadata -- though it didn't generate sufficient interest to provoke me into pursuing it further.

    Defining "standard" names for metadata seems helpful yet requires collaboration and cooperation from others. It's annoying when different folks use different names for essentially the same metadata. You might have a go at proposing a standard naming scheme and standard names for the project metadata you find useful.

      That's an interesting node.++

      I'm not the guy you kill, I'm the guy you buy. —Michael Clayton
Re: Project Metadata Model
by Anonymous Monk on Jan 29, 2012 at 04:47 UTC

    I like it

    You should check out these : Config::Model, config-edit#-ui, CPAN::Meta::Spec, CPAN::Meta::History, http://module-build.sourceforge.net/

    I think your idea fits in the /CPAN::Meta::/ namespace, but I also think it needs a Config::Model model

    You should compare your ideas to the evolution of the cpan META spec :) I don't have enough knowledge of each to even start a comparison :)

    I like the Config::Model idea because, if your app needs a config file, you simply create a model, and you get a config editor for free, both CLI and GUI, and a config file in any of ini/shell/perl/yaml....

Re: Project Metadata Model
by deMize (Monk) on Jan 31, 2012 at 15:25 UTC
    We (as with many other organizations) are fine-tuning our metadata. The following site lists the more general, but commonly used, metadata keys/fields (15 of them): Dublin Core. Just something to keep in mind as you pursue your model.


    Demize

      And Dublin Core is an RDF vocabulary, which brings us back to 950556.

      Thanks, deMize; that kind of concrete suggestion really helps. In fact Dublin Core may be the subject of this Meditation and I don't yet realize it.

      I'm not the guy you kill, I'm the guy you buy. —Michael Clayton
