Scientific Annotations Middleware Notebook

Page: SAM Implementation Path

Initial implementation Plans by Jim Myers - 28 Feb 2002 18:36:50 GMT
Description: Send comments to Jim.Myers@pnl.gov

SAM Implementation Overview

  1. Distributed Authoring and Versioning (DAV)

    The DAV protocol is an extension to HTTP that adds 'write' capabilities to the web. It also adds support for associating arbitrary text/XML metadata with a URL through discoverable and independently accessible 'properties'. In combination with the DAV Searching and Locating (DASL) extension, DAV supports the basic interactions envisioned for SAM's Metadata Management Services layer [MMS_1.1, MMS_1.2, MMS_1.3]. As noted in later sections, DAV's power and flexibility also make it a good choice as a starting point for creating advanced SAM services. These characteristics, coupled with the tremendous support for DAV within the commercial and open source communities, make DAV a strong candidate for use as SAM's client-side data/metadata management interface.
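
    To make the notion of DAV 'properties' concrete, the following is a minimal sketch of the XML bodies a client would send with the PROPPATCH (set a property) and PROPFIND (read named properties) methods. The element names in the DAV: namespace come from the DAV specification; the application namespace, the property name, and the helper function names are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for application-defined properties (not from the SAM plan).
APP_NS = "http://example.org/sam/props"

# Serialize the DAV: namespace with the conventional "D" prefix.
ET.register_namespace("D", "DAV:")


def build_proppatch(properties):
    """Return a PROPPATCH request body that sets each name/value pair as a property."""
    root = ET.Element("{DAV:}propertyupdate")
    set_el = ET.SubElement(root, "{DAV:}set")
    prop_el = ET.SubElement(set_el, "{DAV:}prop")
    for name, value in properties.items():
        ET.SubElement(prop_el, f"{{{APP_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


def build_propfind(names):
    """Return a PROPFIND request body that asks only for the named properties."""
    root = ET.Element("{DAV:}propfind")
    prop_el = ET.SubElement(root, "{DAV:}prop")
    for name in names:
        ET.SubElement(prop_el, f"{{{APP_NS}}}{name}")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_proppatch({"instrument": "NMR-500"}))
    print(build_propfind(["instrument"]))
```

    The sketch only builds the request bodies; sending them is left to any HTTP client that allows the PROPPATCH and PROPFIND method names.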

    SAM will use DAV as its primary client-side data/metadata management interface.

  2. The Slide DAV Server

    Slide is an open source content management system being developed as part of the Apache/Jakarta project. Slide fully supports DAV as an access protocol via a DAV servlet (thereby satisfying MMS_1.1, MMS_1.2, MMS_1.3 as noted above). Slide currently supports basic username/password/access-control-list security. Work is currently being done within the Slide community to add support for DASL and the DAV Versioning protocol. Initial investigations of Slide's object-oriented design (Java) suggest that it will be straightforward to extend Slide to implement additional SAM functionality. Slide includes client-side Java classes for making DAV requests and displaying the contents of a DAV store. It also supports a data store interface that can be used to federate multiple data sources, including file systems and relational and XML databases [MMS_2.1, MMS_2.2]. Implementations exist for several commercial and open source databases. Slide exposes additional functionality (e.g. for server administration/configuration) through additional servlets, consistent with SAM's architectural goals [MMS_1.4, SAM_3.1, SAM_3.2, SAM_3.3, SAM_3.4].

    SAM will use Slide as a base component for its MMS layer.

  3. Security

    SAM will initially inherit Slide's username/password-based authentication. To support the Grid community while continuing to support communities with other security implementations [SAM_1.1], the SAM team will coordinate with other projects to design and implement a general security interface/component that treats identities and access control information as ‘black box’ entities that are only interpreted within a replaceable security component. Username/password and GSI implementations will be pursued and the SAM team will replace Slide’s existing mechanism with this new configurable mechanism [SAM_1.2, SAM_1.3].

    The team is currently investigating the capabilities of the Java Authentication and Authorization Service (JAAS) and estimating the effort required to implement such an interface in Slide.
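
    As a rough illustration of the 'black box' idea, the sketch below shows a request handler that passes credentials, identities, and access control data to a replaceable component without ever interpreting them itself. It is written in Python for brevity and is not based on JAAS or on Slide's code; every class, method, and parameter name is an assumption, and the plain-text password comparison is for illustration only.

```python
from abc import ABC, abstractmethod


class SecurityProvider(ABC):
    """Hypothetical replaceable security component."""

    @abstractmethod
    def authenticate(self, credentials):
        """Return an opaque identity, or None if the credentials are rejected."""

    @abstractmethod
    def is_allowed(self, identity, acl, action):
        """Decide whether the identity may perform the action under the ACL."""


class PasswordProvider(SecurityProvider):
    """Username/password implementation (illustrative only, no hashing)."""

    def __init__(self, users):
        self._users = dict(users)

    def authenticate(self, credentials):
        user, password = credentials
        return user if self._users.get(user) == password else None

    def is_allowed(self, identity, acl, action):
        return action in acl.get(identity, ())


def handle_request(provider, credentials, acl, action):
    """Server-side check that never looks inside credentials, identity, or ACL."""
    identity = provider.authenticate(credentials)
    return identity is not None and provider.is_allowed(identity, acl, action)


if __name__ == "__main__":
    provider = PasswordProvider({"jim": "secret"})
    acl = {"jim": {"read", "write"}}
    print(handle_request(provider, ("jim", "secret"), acl, "write"))
```

    Under this arrangement a GSI-based provider could replace PasswordProvider without any change to handle_request, which is the property the proposed interface is meant to have.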

    Note: It will be assumed that all SAM servers in use by a community will use the same security implementation. The conversion of security credentials between SAM instances is out-of-scope.

    Note: The development of access control capabilities more sophisticated than access control lists (e.g. Akenti, CAS) will be dependent upon external interest and cooperation on development.

  4. Events

    SAM is envisioned as an active component capable of reporting its internal events and responding to events from external sources. In coordination with the CMCS project, the SAM project will implement initial capabilities using the Java Message Service (JMS) API [SAM_2.1]. Capabilities will be prototyped using the open source OpenJMS product. Additional/alternate mechanisms such as JAXM and Grid Events will be considered based on user interest.

    Implementation of event capabilities will be prioritized based on the anticipated needs of collaborating projects and other SAM service layers. Specifically, we anticipate publishing events related to data/metadata creation/modification/deletion, followed by publication of SAM activity logging and configuration information. The types of events being published will be reviewed and updated periodically [SAM_2.2]. During development, hooks will be built to allow configuration of the list of event types to publish [SAM_2.3]. Since users will already have the option to subscribe only to event types of interest, development of the configuration mechanism will not be an initial priority (configuration of which events to publish is important primarily for scaling as repositories grow and become increasingly active). The SAM team will investigate requirements for SAM to be able to subscribe and respond to external events (from applications/agents, data stores, and/or other SAM repositories) [SAM_2.4]. Implementation of this capability will have a relatively low priority unless/until requested by users.
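
    The relationship between the two filters described above (a server-side list of event types to publish, and per-subscriber selection of event types) can be sketched as follows. This is an in-process stand-in written in Python, not JMS or OpenJMS code; the class name, method names, and event type strings are assumptions.

```python
from collections import defaultdict


class EventPublisher:
    """Hypothetical publisher with a configurable list of event types to publish."""

    def __init__(self, enabled_types):
        # Server-side configuration hook: only these event types are ever published.
        self.enabled_types = set(enabled_types)
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, callback):
        # Subscriber-side filter: each callback is registered for one event type.
        self._subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        """Deliver the event to its subscribers; return False if the type is disabled."""
        if event_type not in self.enabled_types:
            return False
        for callback in self._subscribers[event_type]:
            callback(event_type, payload)
        return True


if __name__ == "__main__":
    publisher = EventPublisher({"resource.created", "resource.modified"})
    publisher.subscribe("resource.created", lambda t, p: print(t, p))
    publisher.publish("resource.created", {"url": "/data/run1"})
    publisher.publish("resource.deleted", {"url": "/data/run1"})  # disabled, not delivered
```

    Because subscribers already receive only the types they ask for, the enabled_types setting changes what the server spends effort on rather than what any one subscriber sees, which is why it matters mainly for scaling.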

  5. Metadata Management Services

    1. Basic Data/Metadata Management

      "Metadata" is an imprecise term. Within SAM, metadata is defined to be anything expressed as DAV properties. This definition does not significantly constrain the types of metadata that can be managed by SAM, but it does leverage DAV's strengths to simplify implementing metadata-related functionality. By this definition, DAV servers already support storage and retrieval of arbitrary individual pieces of key/value formatted metadata, where values may be arbitrary XML [MMS_1.1]. DAV Servers also support discovery of metadata keys that are defined on a resource or collection of resources. Servers that support the DAV Searching and Locating (DASL) protocol extension allow simple SQL-like querying based on metadata values [MMS_1.2].

    2. Metadata Generation/Translation

      A key motivating use case for SAM's metadata generation/translation capabilities is allowing two existing, independent DAV-based applications to interact. The assumption is that SAM will be used to 'match' the applications' data models, without requiring any changes to the applications themselves.

      Supporting the general case is an intractable problem. However, there are a variety of important classes of application integration whose solution is simpler and of significant value in creating scientific computing environments, portals, collaboratories, etc. Two of the most important classes are applications that work on orthogonal aspects of data and applications that maintain 'data-flow' relationships between them. The first case might include pedigree, annotation/notebook, accounting, classification, search, and other applications, while the latter case might include a computational code and associated post-processors and analysis applications.

      In both cases, the use pattern is for applications to append additional metadata and/or create derived data sets, treating information stored by other applications as input/read-only. The requirement for integration then becomes a requirement to support the creation of metadata in the schema of one application given relevant information in metadata in a different schema (created by a different application) or in the body of the data itself.

      This is the initial functionality SAM will create - the ability to generate new properties from existing properties and/or information in the data/document itself [MMS_3.1, MMS_3.2]. Such derived properties will be read-only to start, until we have a better idea what the write semantics should be (write to the new name and break the translation link, modify the original property, allow user configuration of which action occurs, etc.). Since properties are text/XML, a language such as XSLT provides significant functionality for deriving new properties from existing ones. Similarly, XSLT would allow derivation of new properties from XML-encoded data/documents. To support creation of new properties from ASCII/binary data, the Binary Format Description (BFD) language can be used to automatically encode data as XML (based on a BFD description of the data's format), at which point XSLT can again be used. For both of these languages, the description of how new properties should be generated is encoded in an XML-based document and can be executed by a generic processing engine. This is an important advantage for use in a middleware system where executing user-submitted binary code would be a security risk.
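
      The idea of a declarative format description driving a generic engine can be illustrated with a small sketch. The code below is not BFD and does not use BFD syntax; it substitutes a hypothetical list of (field name, type code) pairs, interpreted with Python's standard struct module, to show how binary data might be encoded as XML without running any user-supplied code. All field names are assumptions.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical format description: (field name, struct type code) pairs.
# "I" = unsigned 32-bit integer, "f" = 32-bit float, "H" = unsigned 16-bit integer.
HEADER_FORMAT = [("run_id", "I"), ("temperature", "f"), ("samples", "H")]


def binary_to_xml(data, fields, root_tag="record"):
    """Decode little-endian binary data per the description and return an XML element."""
    fmt = "<" + "".join(code for _, code in fields)
    values = struct.unpack_from(fmt, data)
    root = ET.Element(root_tag)
    for (name, _), value in zip(fields, values):
        ET.SubElement(root, name).text = str(value)
    return root


if __name__ == "__main__":
    raw = struct.pack("<IfH", 42, 1.5, 7)
    print(ET.tostring(binary_to_xml(raw, HEADER_FORMAT), encoding="unicode"))
```

      Once the data is in XML form, a stylesheet-driven step could derive properties from it in the same way as from any other XML document.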

      Thus, we plan to incorporate XSLT and BFD based mechanisms as the initial metadata generation functionality in SAM. Users with appropriate permissions will be able to upload XSLT and BFD documents to SAM and specify that they be applied over a specified scope. Initially, metadata generation will be applied whenever the specified source property exists, or when the mime-type of the data matches that specified in the generator. We will investigate the utility of limiting generators based on user identity, sub-trees of the URL space managed by the SAM instance, etc. [MMS_3.4]. Since there are a variety of use cases in which metadata generation would require a complex calculation or the use of an external look-up table, we also plan to define a generic metadata generation interface. This interface, which we will use to incorporate XSLT and BFD-based mechanisms, could eventually be used to add support for running users' Java code (in an appropriate sandbox), or to call out to external metadata generation web services [MMS_3.5].
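
      One possible shape for the generic generation interface and its initial trigger rule (source property present, or mime-type match) is sketched below in Python. The Generator fields, the generate_properties function, and the property names are assumptions for illustration; the derive callable stands in for an XSLT, BFD, or web-service based mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Generator:
    """Hypothetical registered generator that produces one derived property."""

    target_property: str
    derive: Callable[[Dict[str, str], bytes], str]
    source_property: Optional[str] = None
    mime_type: Optional[str] = None

    def applies_to(self, properties, mime_type):
        # Trigger when the source property exists, or when the mime-type matches.
        if self.source_property is not None and self.source_property in properties:
            return True
        return self.mime_type is not None and self.mime_type == mime_type


def generate_properties(generators, properties, mime_type, body=b""):
    """Return the derived properties; the stored properties are left untouched."""
    derived = {}
    for gen in generators:
        if gen.applies_to(properties, mime_type):
            derived[gen.target_property] = gen.derive(properties, body)
    return derived


if __name__ == "__main__":
    generators = [
        Generator("notebook:author", lambda p, b: p["dc:creator"], source_property="dc:creator"),
        Generator("sam:length", lambda p, b: str(len(b)), mime_type="text/plain"),
    ]
    print(generate_properties(generators, {"dc:creator": "Jim Myers"}, "text/plain", b"abc"))
```

      Returning the derived values separately from the stored properties matches the plan to treat generated properties as read-only at first.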

      To assist users in developing and debugging metadata generation documents, we will create a separate web/servlet-based test harness that will allow users to go to a web page, upload a generator script and data through a form, and see the resulting new properties in a response page.

      To support searching (via DASL) on generated properties [MMS_1.3], we will attempt to implement generated properties in such a way that search functionality will be unaware of the properties' origins. Since DASL support is not yet fully implemented in Slide, we may instead decide to implement this functionality by translating the queries according to registered generators, depending on which approach best fits the DASL-related aspects of Slide's architecture.
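
      From the client's side, a search on a generated property should look the same as a search on a stored one. The sketch below builds a DASL-style query body of the kind a client might send; the element names follow the DAV:basicsearch grammar as later standardized in RFC 5323, and Slide's eventual DASL support may differ. The scope URL, the property namespace, and the helper function name are assumptions.

```python
import xml.etree.ElementTree as ET

D = "{DAV:}"
ET.register_namespace("D", "DAV:")

# Hypothetical generated property, written in {namespace}name form.
AUTHOR = "{http://example.org/notebook}author"


def build_basicsearch(scope_href, select_props, where_prop, literal):
    """Build a basicsearch body: select props from scope where where_prop == literal."""
    root = ET.Element(D + "searchrequest")
    search = ET.SubElement(root, D + "basicsearch")
    select = ET.SubElement(ET.SubElement(search, D + "select"), D + "prop")
    for prop in select_props:
        ET.SubElement(select, prop)
    scope = ET.SubElement(ET.SubElement(search, D + "from"), D + "scope")
    ET.SubElement(scope, D + "href").text = scope_href
    ET.SubElement(scope, D + "depth").text = "infinity"
    eq = ET.SubElement(ET.SubElement(search, D + "where"), D + "eq")
    ET.SubElement(ET.SubElement(eq, D + "prop"), where_prop)
    ET.SubElement(eq, D + "literal").text = literal
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_basicsearch("/data/", [D + "displayname", AUTHOR], AUTHOR, "Jim Myers"))
```

      Nothing in the query indicates whether the author property is stored or generated; under the query-translation alternative, the server would rewrite the where clause in terms of the generator's source property before executing it.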

    3. Data Translation

      Data translation is a related issue and the general case is similarly complex. Limiting the scope to addressing the needs of data-flow use cases again produces a tractable initial problem. Although data translation - generating new data from existing data and metadata - is essentially the same process as metadata generation, the fact that DAV only associates one 'document' with a URL (versus an unlimited number of properties) requires a different mechanism to return translated data to the user. One option would be to consider translated data to be new metadata associated with the original data (supported by the proposed metadata generation mechanism). Such an approach raises several concerns related to the size and 'independence' of a translated data set. Given that data is anticipated to be much (up to a factor of 10^6 or more) larger than metadata, exposing translated data via metadata properties could severely impact the performance of applications that naively request all properties. Further, as the value of a DAV property, translated data would need to be encoded/wrapped in XML, could not have properties of its own, and would not be exposed through its own URL. These concerns suggest the need for a separate approach that would treat translated data uniquely, in such a way that it would be created as, or could easily be copied to, a new, separate DAV document. Such an approach should preserve the ability to discover and access translations via the DAV protocol.

      Our current plan is to define an "available translations" property that will list the translated formats available along with URLs to request those formats. (These links would conform to the relationship syntax standard defined in the semantic services layer [MMS_5.2].) Requesting one of these URLs would dynamically invoke the translator. The URL would be read-only, but could be copied to another location (using the DAV COPY method). Properties related to the pedigree of the translated data, i.e. a link to the original data set, would be created.
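
      As a sketch of what such a property value might contain, the code below builds an XML fragment that pairs each available format with a URL for requesting it. The relationship syntax from the semantic services layer is not defined here, so the element names, the namespace, and the "translate-to" query parameter are all placeholders invented for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Hypothetical namespace for SAM-defined properties.
SAM_NS = "http://example.org/sam"


def available_translations(resource_url, formats):
    """Return a property element listing each format and a URL that requests it."""
    prop = ET.Element(f"{{{SAM_NS}}}available-translations")
    for mime in formats:
        entry = ET.SubElement(prop, f"{{{SAM_NS}}}translation")
        ET.SubElement(entry, f"{{{SAM_NS}}}format").text = mime
        # Placeholder URL scheme: the real link syntax is not specified in the plan.
        href = f"{resource_url}?translate-to={quote(mime, safe='')}"
        ET.SubElement(entry, "{DAV:}href").text = href
    return prop


if __name__ == "__main__":
    element = available_translations("/data/scan1.jpg", ["image/gif", "image/png"])
    print(ET.tostring(element, encoding="unicode"))
```

      A client that reads this property through PROPFIND can fetch a listed URL with an ordinary GET, or copy it to a new location to obtain an independent DAV document with its own properties.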

      This approach may be modified, or alternatives may be created, depending on user feedback. One complementary approach that has already been discussed is to specify one or more translations that would be applied during the initial upload of data, e.g. converting jpg images to gif images, most likely storing both the original uploaded data and the translated copy (versus never storing the original).

      Initially, translations will be limited to the XSLT and BFD mechanisms described above. However, it will be a priority to quickly move to supporting a web services approach to allow use of existing translators (in Fortran, etc.).

    4. Registering Metadata Generators and Data Translators

      The concepts of generation and translation are not supported within the DAV protocol. Hence applications written to the DAV protocol will have no knowledge of generation/translation services or how to use them. While it is possible, and indeed desirable, to make the products of generators and translators available to DAV applications, there is no such constraint on actions such as the registration of specific generators and translators. We envision use cases in which generators/translators are 'manually' created and registered for use with 'legacy' DAV applications as well as cases in which applications/agents may wish to create and register such tools themselves. Thus, we plan to create a registration servlet that can be invoked manually using a web browser or programmatically [MMS_3.3, MMS_4.2]. We expect the details of these interfaces to evolve rapidly as new capabilities are developed, but it should be possible to provide backwards compatibility as needed.
