Metadata Generation/Translation
A key motivating use case for SAM's metadata generation/translation
capabilities is allowing two existing, independent DAV-based applications to
interact. The assumption is that SAM will be used to 'match' the applications'
data models, without requiring any changes to the applications themselves.
Supporting the general case is an intractable problem. However, there are a
variety of important classes of application integration whose solution
is simpler and of significant value in creating scientific computing
environments, portals, collaboratories, etc. Two of the most important classes
are applications that work on orthogonal aspects of data and applications that
maintain 'data-flow' relationships between them. The first case might include
pedigree, annotation/notebook, accounting, classification, search, and other
applications, while the latter case might include a computational code and
associated post-processors and analysis applications.
In both cases, the use pattern is for applications to append additional
metadata and/or create derived data sets, treating information stored by other
applications as input/read-only. The requirement for integration then becomes a
requirement to support the creation of metadata in the schema of one application
given relevant information either in metadata in a different schema (created by a
different application) or in the body of the data itself.
This is the initial functionality SAM will create - the ability to generate
new properties from existing properties and/or information in the data/document
itself [MMS_3.1,MMS_3.2].
Such derived properties will be read-only to start, until we have a
better idea what the write semantics should be (write to the new name and break
the translation link, modify the original property, allow user configuration of
which action occurs, etc.). Since properties are text/XML, a language such as XSLT
provides significant functionality for deriving new properties from existing ones.
Similarly, XSLT would allow derivation of new properties from XML-encoded data/
documents. To support creation of new properties from ASCII/binary data,
the Binary Format Description (BFD) language can be used to automatically
encode data as XML (based on a BFD description of the data's format), at which
point XSLT can again be used. For both of these languages, the description of how
new properties should be generated is encoded in an XML-based document and can
be executed by a generic processing engine. This is an important advantage for use
in a middleware system where executing user-submitted binary code would be a
security risk.
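As a rough illustration of why a declarative description plus a generic engine avoids executing user-submitted code, the following Python sketch decodes a binary record from a field list and wraps the values in XML. The field list is a hypothetical stand-in and is not BFD syntax, and the `derive_property` helper only stands in for the XSLT step, since the Python standard library has no XSLT processor. All names here (`HEADER_FORMAT`, `encode_as_xml`, `derive_property`, `atom-count`) are assumptions made for the example.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical declarative description of a binary record layout.
# This is NOT BFD syntax; it only stands in for "a format description
# expressed as data rather than as executable code".
HEADER_FORMAT = [
    ("version", "H"),  # unsigned 16-bit integer
    ("n_atoms", "I"),  # unsigned 32-bit integer
    ("energy", "d"),   # 64-bit float
]


def encode_as_xml(data: bytes, fields, byte_order: str = "<") -> ET.Element:
    """Generic engine: decode `data` according to `fields` and wrap it in XML."""
    layout = byte_order + "".join(code for _, code in fields)
    values = struct.unpack_from(layout, data)
    root = ET.Element("record")
    for (name, _), value in zip(fields, values):
        ET.SubElement(root, name).text = str(value)
    return root


def derive_property(record: ET.Element, source: str, target: str) -> ET.Element:
    """Stand-in for the XSLT step: copy one decoded field into a new property."""
    prop = ET.Element(target)
    prop.text = record.findtext(source)
    return prop


# Build a sample record, encode it as XML, then derive a property from it.
raw = struct.pack("<HId", 2, 1500, -76.25)
record = encode_as_xml(raw, HEADER_FORMAT)
atom_count = derive_property(record, "n_atoms", "atom-count")
print(ET.tostring(atom_count, encoding="unicode"))
```

The point of the sketch is that the uploaded artifact is only data (the field list), so the middleware runs its own trusted engine over it rather than code supplied by the user.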
Thus, we plan to incorporate XSLT and BFD based
mechanisms as the initial metadata generation functionality in SAM. Users with
appropriate permissions will be able to upload XSLT and BFD documents to SAM and
specify that they be applied over a specified scope. Initially, metadata generation
will be applied whenever the specified source property exists, or when the MIME type
of the data matches that specified in the generator. We will investigate the utility
of limiting generators based on user identity, sub-trees of the URL space managed
by the SAM instance, etc. [MMS_3.4] Since
there are a variety of use cases in which metadata generation would require a
complex calculation or the use of an external look-up table, we also plan to define a
generic metadata generation interface. This interface, which we will use to
incorporate XSLT and BFD-based mechanisms, could eventually be used to add support
for running users' Java code (in an appropriate sandbox), or to call out to external
metadata generation web services [MMS_3.5].
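One possible shape for such a generic interface is sketched below in Python. It is only an illustration under stated assumptions: the class and function names (`Resource`, `MetadataGenerator`, `RenameGenerator`, `ByteCountGenerator`, `derived_properties`) and the property names are hypothetical, and the matching rule simply mirrors the initial policy described above (the source property exists, or the MIME type matches).

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class Resource:
    """Hypothetical view of a DAV resource."""
    url: str
    mime_type: str
    properties: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


class MetadataGenerator(ABC):
    """Hypothetical generic generator interface."""

    def __init__(self, source_property: Optional[str] = None,
                 mime_type: Optional[str] = None):
        self.source_property = source_property
        self.mime_type = mime_type

    def applies_to(self, resource: Resource) -> bool:
        # Apply when the source property exists or the MIME type matches.
        if (self.source_property is not None
                and self.source_property in resource.properties):
            return True
        return self.mime_type is not None and self.mime_type == resource.mime_type

    @abstractmethod
    def generate(self, resource: Resource) -> Dict[str, str]:
        """Return the derived properties for `resource`."""


class RenameGenerator(MetadataGenerator):
    """Expose an existing property under the name another schema expects."""

    def __init__(self, source_property: str, target_property: str):
        super().__init__(source_property=source_property)
        self.target_property = target_property

    def generate(self, resource: Resource) -> Dict[str, str]:
        return {self.target_property: resource.properties[self.source_property]}


class ByteCountGenerator(MetadataGenerator):
    """Derive a property from the body of the data itself."""

    def generate(self, resource: Resource) -> Dict[str, str]:
        return {"byte-count": str(len(resource.body))}


def derived_properties(resource: Resource,
                       generators: Iterable[MetadataGenerator]) -> Dict[str, str]:
    """Collect derived properties without modifying the stored ones."""
    derived: Dict[str, str] = {}
    for generator in generators:
        if generator.applies_to(resource):
            derived.update(generator.generate(resource))
    return derived


resource = Resource("/runs/42/output.txt", "text/plain", {"author": "alice"}, b"hello")
generators = [
    RenameGenerator("author", "dc:creator"),
    ByteCountGenerator(mime_type="text/plain"),
    RenameGenerator("title", "dc:title"),  # skipped: no "title" property
]
print(derived_properties(resource, generators))
```

In this sketch, XSLT- and BFD-based generators, sandboxed user code, or a web service client would each be another subclass, and the derived properties are returned separately from the stored ones, consistent with treating them as read-only.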
To assist users in developing and debugging metadata generation documents, we
will create a separate web/servlet-based test harness that will allow users to go
to a webpage, upload a generator script and data through a form, and see the
resulting new properties in a response page.
To support searching (via DASL) on generated properties [MMS_1.3],
we will attempt to implement generated properties in such a way that search
functionality will be unaware of the properties' origins. Since DASL support
is not yet fully implemented in Slide, we may instead implement this
functionality by translating queries according to the registered generators,
depending on which approach best fits the DASL-related aspects of
Slide's architecture.
Data Translation
Data translation is a related issue and the general case is similarly complex.
Limiting the scope to addressing the needs of data-flow use cases again produces
a tractable initial problem. Although data translation - generating new data
from existing data and metadata - is essentially the same process as metadata
generation, the fact that DAV only associates one 'document' with a URL (versus
an unlimited number of properties) requires a different mechanism to return
translated data to the user. One option would be to consider translated data to
be new metadata associated with the original data (supported by the proposed
metadata generation mechanism). Such an approach raises several concerns related
to the size and 'independence' of a translated data set. Given that data is
anticipated to be much (up to a factor of 10^6 or more) larger than
metadata, exposing translated data via metadata properties could severely impact
the performance of applications that naively request all properties. Further, as
the value of a DAV property, translated data would need to be encoded/wrapped
in XML, could not have properties of its own, and would not be exposed through
its own URL. These concerns suggest the need for a separate approach that would
treat translated data uniquely, in such a way that it would be created as, or
could easily be copied to, a new, separate DAV document. Such an approach should
preserve the ability to discover and access translations via the DAV protocol.
Our current plan is to define an "available translations" property that will
list the translated formats available along with URLs to request those formats.
(These links would conform to the relationship syntax standard defined in the
semantic services layer [MMS_5.2].)
Requesting one of these URLs would dynamically invoke
the translator. The URL would be read-only, but could be copied to another
location (using the DAV COPY method). Properties related to the pedigree of the translated
data, i.e. a link to the original data set, would be created.
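To make the plan above more concrete, the following Python sketch builds one possible "available translations" property value and the pedigree properties for a translated copy. The element names, the `translate-to` query parameter, and the `derived-from` property are all hypothetical, and the sketch does not attempt to follow the relationship syntax of the semantic services layer, which is not specified here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable
from urllib.parse import urlencode


def available_translations_property(resource_url: str,
                                    formats: Iterable[str]) -> ET.Element:
    """List each translated format with a URL that would invoke the translator."""
    prop = ET.Element("available-translations")
    for mime_type in formats:
        entry = ET.SubElement(prop, "translation", {"type": mime_type})
        # Hypothetical URL scheme: the target format is a query parameter.
        entry.text = f"{resource_url}?{urlencode({'translate-to': mime_type})}"
    return prop


def pedigree_properties(source_url: str) -> Dict[str, str]:
    """Properties recorded on a translated copy, linking it to its source."""
    return {"derived-from": source_url}


prop = available_translations_property("/data/scan.jpg", ["image/gif", "image/png"])
links = [entry.text for entry in prop.findall("translation")]
print(links)
print(pedigree_properties("/data/scan.jpg"))
```

A client that understands only DAV could read this property like any other, follow one of the listed URLs to obtain the translated data, and then copy that URL to a new location to obtain an independent document with its own properties.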
This approach may be modified, or alternatives may be created, depending on
user feedback. One complementary approach that has already been discussed is to
specify one or more translations that would be applied during the initial upload of data, e.g.
converting jpg images to gif images, most likely storing both the original uploaded
data and the translated copy (versus never storing the original).
Initially, translations will be limited to the XSLT and BFD mechanisms described above.
However, it will be a priority to quickly move to supporting a web services
approach to allow use of existing translators (in Fortran, etc.).
Registering Metadata Generators and Data Translators
The concepts of generation and translation are not supported within the DAV
protocol. Hence applications written to the DAV protocol will have no knowledge
of generation/translation services or how to use them. While it is possible, and
indeed desirable, to make the products of generators and translators available
to DAV applications, there is no such constraint on actions such as the registration
of specific generators and translators. We envision use cases in which
generators/translators are 'manually' created and registered for use with
'legacy' DAV applications as well as cases in which applications/agents may
wish to create and register such tools themselves. Thus, we plan to create a
registration servlet that can be invoked manually using a web browser or
programmatically [MMS_3.3,MMS_4.2].
We expect the details of these interfaces to evolve rapidly as new
capabilities are developed, but it should be possible to provide backwards
compatibility as needed.