
Co-Principal Investigators:
Note: This technical overview represents the initial project concept. New technologies, new requirements from partners, and evolution of our thinking will all result in changes between this document and the product of the SAM project. Hence you are strongly encouraged to read the project reports and plans on this site and to contact the PIs before planning to use SAM components in your work.
We propose the creation of a Scientific Annotation Middleware (SAM) system that will provide the significant advances in research documentation and data pedigree tracking required for effective management and coordination of the complex, collaborative, cross-disciplinary, compute-intensive research enabled through the Scientific Discovery through Advanced Computing (SciDAC) initiative. The proposed system presents researchers, applications, problem-solving environments (PSE), and software agents with a layered set of components and services that provide successively more specialized capabilities for the creation and management of metadata, the definition of semantic relationships between data objects, and the development of electronic research records. Researchers will access the system through a notebook interface, available via desktop computers and mobile devices, as well as through SAM components embedded in other software systems. SAM will support manual and programmatic queries across entries generated by these multiple sources. This research, performed by the same team that created the very successful DOE2000 electronic notebook project, will lead to fundamentally new and more complete, effective, and efficient ways to document the scientific work performed in SciDAC and across DOE.
The practice of science is critically dependent on complete and accurate documentation of experimental processes and results. Scientific records enable replication of results in the evaluation of scientific hypotheses, allow researchers to share results and avoid duplication of work, and provide a means to establish credit and accountability for scientific discoveries. Before computerization, paper notebooks were the primary means of maintaining scientific records. However, with the advent of computing and the growing use of large-scale simulations, the complexity of experiments, data size and dimensionality, and the overall number of experiments performed have increased significantly, thus stretching traditional annotation methods to their limits. As these trends continue and as experiments and teams become more distributed and cross-disciplinary, the processes used to document research also must improve dramatically. Richer, more detailed, and more searchable annotations will be required. Metadata generated by the myriad tools used within a project will have to be integrated to provide a complete picture of the research being performed.
Researchers have explored a variety of directions in the development of electronic notebooks (EN) over the past two decades. Initially, ENs focused on text- and hypertext-based annotation to support individual researchers [1][2][3]. Technological advances, particularly the rise of the Internet and the World Wide Web, led to development of electronic notebooks that had clear advantages over traditional paper notebooks, particularly in terms of support for multimedia annotations and collaborative use [4][5][6]. Investigations of pen-and-voice input and support for lightweight input devices (e.g., personal digital assistants) lessened the ease-of-use advantages of paper. Digital signature technologies, aided by the increasingly recognized advantages of digital storage, have eroded the advantages that paper notebooks traditionally held as a legal record. Other efforts sought to expand the advantages of ENs by bending the notebook analogy. Automatic annotation capabilities demonstrated the possibility of eliminating manual transcription of information available from instrument control software, problem-solving environments, and laboratory information management systems (LIMS)[7]. Workflow capabilities showed that electronic notebooks could become an active component in the data analysis process [8][9] while the use of domain schema to provide sophisticated search capabilities moved notebooks into the field of knowledge management [10]. Other notebooks have explored the use of forms and page templates to provide scaffolding for repetitive procedures. Today, electronic notebooks are in productive use in many communities. Custom solutions have been built and deployed in industry, and commercial notebooks that target structured environments, such as analytical chemistry and pharmaceutical laboratories, have begun to emerge [11][12]. The electronic notebooks developed by the authors of this proposal as part of the U.S. Department of Energy’s DOE2000 project have been embraced widely within collaboratories as well as by many application groups, including chemistry, accelerator beam lines, and climate research. Efforts in industry and government have documented the need to recognize electronic notebooks as primary, legally defensible records at the enterprise scale [13][14][15].
Electronic Notebooks are still rapidly evolving, and many issues must be resolved to make them a preferred choice for industrial, governmental, and academic research. At the user interface level, questions remain about how close the analogy with paper notebooks should be, how to represent editing capabilities, how to integrate electronic notebooks with the overall scientific workflow and records processes, how to leverage mobile devices for input and display, etc. At a deeper level, current notebooks are limited in their ability to show non-chronological relationships between entries, to support complex searches, and to interact with other producers, curators, and consumers of annotations such as autonomous feature-detection agents, digital libraries, and data pedigree mechanisms. These metadata-based data management issues involve many non-EN applications, but their solution is critical in producing a comprehensive and cohesive scientific record. Finally, issues of long-term data and digital signature management and enterprise-scale management of notebooks have yet to be addressed beyond documentation of the requirements.
A new generation of scientific annotation middleware (SAM) is needed to address these issues and to shift from a paradigm in which electronic notebooks are standalone tools to one in which they provide researchers a view into a comprehensive scientific annotation, documentation, knowledge management, and records system. Such middleware will need to interact with applications, problem-solving environments, and software agents as well as with humans. In addition to a user interface, the system must include components and services that allow access from mobile devices and provide seamless, two-way interaction with applications and agents, as well as tools that support the generation and translation of annotation metadata. Such a system must provide natural, unobtrusive means for creating annotations, allow the discovery of patterns across the integrated annotation metadata, and enable researchers to navigate semantic relationships between data items. It must provide the flexibility and power needed by researchers while providing the accountability necessary for legal defensibility. The ability to view, query, and extend the entire corpus of metadata related to a project through a cohesive middleware system will greatly enhance the ability of researchers to understand the context of their data, and it will lay the groundwork for increased automation of scientific knowledge discovery.
The scenarios described below explore some of the possibilities our concept for SAM enables.
A researcher using a biology problem-solving environment (PSE) would like to use an electronic notebook. SAM is configured to map the project/subproject structure of the PSE to the chapter/page hierarchy of the notebook. Appropriate data viewers are registered with the SAM notebook services layer, and the user immediately has a populated notebook that is automatically synchronized with his PSE. He is able to add comments, draw images, and relate data in different projects from a familiar notebook-style interface. During the next revision of the PSE, developers decide to use SAM notebook components to embed the notebook page display and annotation capabilities into the PSE graphical user interface (GUI). Remote project members working on engineering aspects of the project continue to use the notebook interface, but the local researcher no longer distinguishes the notebook as a separate tool.
A materials researcher decides to automate her group’s data collection and analysis process. She configures SAM to link to her existing Extensible Markup Language (XML) data store and redirects the data acquisition software running on the beam line to send its data to SAM. Using XML, she defines a standard set of metadata that should be extracted from the data files. She also defines a mapping from the existing XML data format to the legacy binary format needed by an analysis tool and registers both with SAM. She then writes a small wrapper for her legacy analysis tool using SAM metadata management components to automatically run the tool when new data becomes available and submits its results as new entries semantically linked to the original (e.g., “elemental analysis of,” “calculated unit cell”). She registers a translator that converts the output geometry to Chemical Markup Language (CML) and uses a freeware CML renderer as the viewer within the group notebook. Her colleagues view the sample analysis results through a notebook and add their own analyses and conclusions to the record.
In the above case, two years after the system is delivered, the researcher starts collaborating with an engineer looking for new materials that might make it possible to modify existing designs to reduce costs or increase performance. After being granted permission to access the data, the engineer uses the SAM query interface to discover that his collaborator’s group has used the term “final structure” to identify the best value for the molecular geometry of the sample. He runs a query to identify all the samples for which a “final structure” exists and that satisfy his other criteria. He writes a short script using the translation capabilities of SAM to generate input decks for his property prediction codes. Through a similar process, he also uses SAM to parse the output of his code and to annotate the original data with the new properties. Finally, he uses SAM to make this new data store queryable as part of the store used by his design software.
A climate researcher develops a feature detection agent that can automatically categorize specific weather patterns (e.g., El Niño). Users of an existing climate PSE would like to perform additional analyses on data sets that show the feature. The researcher runs the agent over existing data and tags it with new metadata. Users of the PSE discover that the SAM-based dataset selection tool in their PSE dynamically adds the new metadata to the list of search terms available, and they immediately begin analyzing the causes and effects of the new feature on other climate variables.
To add a data pedigree mechanism to their software, developers deploy SAM services that connect to their existing database. Responding to events in their workflow, they use SAM interfaces to annotate their data, noting the version of software used in processing the data and specifying the relationship between the various input and output files. To display the pedigree information, they invoke a SAM relationship browser that allows users to browse the various pedigrees involving specific files. Optionally, they can configure the browser to allow non-pedigree relationships to be seen in the display as well. Assuming the development environment supports mechanisms for registering new tools and making its workflow events available, no programming is needed by the users.
We propose to research the advanced annotation needs of SciDAC applications and, based on our findings, to develop the components of a scientific annotation middleware system. The work will be performed in close cooperation with domain-oriented application projects (e.g., climate, chemistry, etc.), and will incorporate relevant security, distributed-computing, data-management, information-analysis, and other technologies from the SciDAC, Grid [16], and related activities. The SAM project team will serve as a SciDAC resource for addressing scientific annotation and records issues. In addition, we will pursue opportunities for deploying SAM as part of the architecture for collaboratory pilot projects. Feedback from the extensive user base of DOE2000 electronic notebooks will guide the initial design of the electronic notebook interface for SAM and will contribute design requirements for application programming interfaces (API). The project will provide lifecycle support for SAM, including a regular schedule of incremental releases of SAM components, services and documentation. Formal monitoring and feedback activities, targeting both end users and developers using SAM services, will be performed to understand how the system is being used and to provide guidance for further developments.
Figure 1 shows a high-level view of the architecture envisioned for the project. SAM will rely on external data and metadata storage mechanisms. Core middleware capabilities are provided through three sets of services: Metadata Management Services (MMS), Semantic Services (SS), and Notebook Services (NS).
APIs to each service will allow agents, applications, PSEs, and portals to interact with SAM at a variety of levels. Standard graphical components will be developed to simplify processes such as gathering information from a user to develop a query and returning the results to an application. The GUI for the electronic notebook will be developed using these components and the service APIs. APIs that allow other software to interact with and control the notebook GUI also will be developed. Administrative tools will be created to simplify management of multiple notebooks, configuration of metadata services, and other needed features.

Figure 1. Scientific Annotation Middleware components, interfaces, and services
The SAM design assumes an external data store. This open data strategy ensures that projects will be able to leverage advances in data storage and data grid technologies and that SAM will be able to focus on providing value-added middleware functionality. SAM places two requirements on data stores: 1) the store must provide, as one option, access through a standard metadata-aware protocol, and 2) it must accept arbitrary metadata rather than limiting it to a predefined schema. The Distributed Authoring and Versioning (DAV)[17] protocol is an emerging standard that satisfies these design goals. We anticipate that these requirements also will be satisfied by protocols being developed within the GridForum Remote Data Access Working Group [18], and we will ensure that SAM conforms to these protocols.
DAV is an IETF standard extension to the Web’s Hypertext Transfer Protocol (HTTP)[19] that supports arbitrary Extensible Markup Language (XML)[20] encoded metadata. DAV was originally designed to support collaborative authoring, but DAV “documents” are not restricted to text-oriented formats and should be considered analogous to files or binary large objects. Extensions to DAV, such as DAV Searching and Locating (DASL), Advanced Collections, and Versioning, which are currently in development, promise additional relevant capabilities [21][22][23]. DAV is quickly gaining popularity in the Web industry. Before the end of 1999, the Apache Software Foundation, IBM, and Microsoft had already deployed DAV servers as extensions to Web servers. Client support has been offered by the Microsoft Office 2000 suite and by Java, C++, and Python toolkits. Database vendors such as Oracle also will support DAV in future releases.
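As an illustration of how arbitrary XML-encoded metadata can be attached to a DAV resource, the following minimal sketch builds the body of a DAV PROPPATCH request using the Python standard library. The `sam` namespace URI and the property names are hypothetical placeholders, not names defined by the SAM project, and the sketch only constructs the request body; it does not contact a server.

```python
import xml.etree.ElementTree as ET
from typing import Dict

DAV_NS = "DAV:"
# Hypothetical namespace for project-specific properties (an assumption).
SAM_NS = "http://example.org/sam/ns#"


def build_proppatch(properties: Dict[str, str]) -> bytes:
    """Return a PROPPATCH request body that sets the given properties."""
    ET.register_namespace("D", DAV_NS)
    ET.register_namespace("sam", SAM_NS)
    root = ET.Element(f"{{{DAV_NS}}}propertyupdate")
    set_el = ET.SubElement(root, f"{{{DAV_NS}}}set")
    prop_el = ET.SubElement(set_el, f"{{{DAV_NS}}}prop")
    for name, value in properties.items():
        # Each property becomes one namespaced child element of <D:prop>.
        ET.SubElement(prop_el, f"{{{SAM_NS}}}{name}").text = value
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    print(build_proppatch({"instrument": "beamline-7"}).decode("utf-8"))
```

Because the properties live in their own XML namespace, a store that accepts arbitrary properties can hold them without any predefined schema, which is the second requirement SAM places on data stores.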
Such broad acceptance of DAV makes it likely that its interfaces will become available for most commodity-technology-based scientific data stores, thus making SAM widely applicable. Data Grid implementations, such as the Storage Resource Broker (SRB) and Extensible Metadata Catalog (EMCAT), provide similar wrapper capabilities for large archives.
A third assumption, that metadata will be XML-encoded, is implicit in our design. It also is consistent with DAV and the directions being taken within the Grid data management community (e.g., within the NPACI Data-Intensive Computing Environment projects). XML provides rich capabilities for schema description (XML Schema), translation (XSLT), avoidance of name collisions (XML Namespaces), and representation of relationships (XLink, XML-encoded RDF). The emergence of scientific domain languages defined in XML, together with generic XML parsing tools, provides additional leverage for including XML-coded metadata in our design.
DAV and Data Grids provide basic capabilities for storing and retrieving data and metadata. They also provide capabilities for querying metadata to find relevant data sets. These capabilities provide significant power for curated collections where metadata schemas are well defined and access is from within a single, well-coordinated community. We propose additional capabilities that support use of SAM in evolving communities and across communities. These additional capabilities will be used to address issues related to supporting queries in multiple or evolving schema; providing federated views of multiple, independently managed data stores; providing data in community-accepted formats; and generating new metadata to support new types of usage.
XML provides powerful mechanisms such as the Extensible Style Sheet Language Transformations (XSLT) that allow mapping from one XML schema to another. We propose a mechanism to register such mappings with the Metadata Management Services (MMS) and to invoke them during metadata storage, retrieval, and query operations. We will investigate possibilities for selecting mappings based on user or community preferences.
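The registration mechanism described above might look like the following minimal sketch. Real mappings would most likely be XSLT style sheets; here a plain Python function that renames keys stands in for one. `MappingRegistry`, `rename_keys`, and the schema names are hypothetical and do not correspond to any actual SAM interface.

```python
from typing import Callable, Dict, Tuple

Metadata = Dict[str, str]
Mapping = Callable[[Metadata], Metadata]


class MappingRegistry:
    """Holds schema-to-schema mappings keyed by (source, target) schema name."""

    def __init__(self) -> None:
        self._mappings: Dict[Tuple[str, str], Mapping] = {}

    def register(self, source: str, target: str, mapping: Mapping) -> None:
        self._mappings[(source, target)] = mapping

    def translate(self, metadata: Metadata, source: str, target: str) -> Metadata:
        if source == target:
            return dict(metadata)
        try:
            mapping = self._mappings[(source, target)]
        except KeyError:
            raise LookupError(
                f"no mapping registered from {source!r} to {target!r}"
            ) from None
        return mapping(metadata)


def rename_keys(table: Dict[str, str]) -> Mapping:
    """Build a mapping that renames keys and drops keys absent from the table."""
    def apply(metadata: Metadata) -> Metadata:
        return {table[k]: v for k, v in metadata.items() if k in table}
    return apply


if __name__ == "__main__":
    registry = MappingRegistry()
    registry.register("beamline-v1", "group-notebook",
                      rename_keys({"samp": "sample_id", "T": "temperature_K"}))
    print(registry.translate({"samp": "A-17", "T": "295"},
                             "beamline-v1", "group-notebook"))
```

Selecting a mapping by user or community preference could then amount to choosing the target schema name before calling `translate`.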
Metadata translation mechanisms will be a key component in providing a collective view of multiple underlying data stores. We will develop mechanisms to register underlying data stores with the MMS and associate the schema translations necessary to access them. We anticipate uses where researchers from one domain (e.g., biology) wish to base a search on a few relevant properties in a data store that targets another domain (e.g., chemistry). We will implement mechanisms to translate incoming queries, forward them to the underlying data stores, and gather the query results.
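The translate, forward, and gather steps can be sketched as follows. This is a simplified illustration under stated assumptions: each store is an in-memory list of records, queries are exact-match key/value pairs, and every name (`Store`, `key_map`, `federated_query`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Record = Dict[str, str]


@dataclass
class Store:
    name: str
    key_map: Dict[str, str]  # common key -> store-local key
    records: List[Record] = field(default_factory=list)

    def query(self, local_query: Record) -> List[Record]:
        """Exact-match query expressed in this store's own keys."""
        return [r for r in self.records
                if all(r.get(k) == v for k, v in local_query.items())]


def federated_query(stores: List[Store], query: Record) -> List[Tuple[str, Record]]:
    results = []
    for store in stores:
        if not all(k in store.key_map for k in query):
            continue  # this store has no translation for some query key
        # Translate the incoming query into the store's schema and forward it.
        local_query = {store.key_map[k]: v for k, v in query.items()}
        reverse = {local: common for common, local in store.key_map.items()}
        for record in store.query(local_query):
            # Translate results back, keeping only keys the common schema knows.
            translated = {reverse[k]: v for k, v in record.items() if k in reverse}
            results.append((store.name, translated))
    return results


if __name__ == "__main__":
    chem = Store("chem", {"formula": "chem_formula", "mass": "mol_weight"},
                 [{"chem_formula": "H2O", "mol_weight": "18.02"}])
    bio = Store("bio", {"formula": "ligand_formula"},
                [{"ligand_formula": "H2O", "target": "aquaporin"}])
    print(federated_query([chem, bio], {"formula": "H2O"}))
```

A store that cannot translate a query key is skipped rather than queried with a partial query, so that its results are never broader than what the user asked for.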
We anticipate that cases will arise in which new metadata would be considered private (e.g., preliminary results in a notebook or new intellectual property), different types of data might need to be stored in different systems (e.g., based on size), and an existing community might not be willing to host metadata, within their own data store, that could be useful to another community. To support these scenarios, we will investigate various ways of associating metadata in one data store with an object in another store. We will extend our federation capabilities to allow queries to be made across such metadata layers.
We also will develop a means to dynamically register metadata generators with the MMS. This enhancement will have two primary advantages: 1) it will allow metadata to be extracted from data objects to supplement metadata reported directly by applications, and 2) it will allow legacy applications to use file-like semantics to store data while allowing newer applications to take advantage of metadata. We plan to leverage existing efforts in this area, such as the effort that focused on development of the PNNL Scientific Data Management system [24]. Exploration of the use of XML-based languages to describe the extraction of metadata from binary files will continue with the goal of providing a rapid and generic mechanism for developing metadata generators. This is similar to work begun in a PNNL internal project that is focusing on the use of the Extensible Scientific Interchange Language (XSIL). Lightweight mechanisms to define dynamically derived metadata (i.e., properties that can be calculated from existing metadata) will also be investigated.
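A metadata generator registry of the kind described above could be sketched as follows. The binary layout (a 4-byte tag `SPEC`, a little-endian unsigned 32-bit point count, and a 64-bit float wavelength) is an invented example format, and `register_generator` and `generate_metadata` are hypothetical names rather than SAM functions.

```python
import struct
from typing import Callable, Dict

Generator = Callable[[bytes], Dict[str, object]]

_generators: Dict[str, Generator] = {}


def register_generator(data_type: str, generator: Generator) -> None:
    """Associate a metadata generator with a data type name."""
    _generators[data_type] = generator


def generate_metadata(data_type: str, payload: bytes) -> Dict[str, object]:
    """Run the registered generator, or return no metadata if none exists."""
    generator = _generators.get(data_type)
    return generator(payload) if generator else {}


# Invented legacy header: tag, number of points, wavelength (no padding).
HEADER = struct.Struct("<4sId")


def spectrum_generator(payload: bytes) -> Dict[str, object]:
    magic, n_points, wavelength = HEADER.unpack_from(payload)
    if magic != b"SPEC":
        raise ValueError("not a spectrum file")
    return {"n_points": n_points, "wavelength_nm": wavelength}


if __name__ == "__main__":
    register_generator("spectrum", spectrum_generator)
    data = HEADER.pack(b"SPEC", 1024, 154.0) + b"\x00" * 8
    print(generate_metadata("spectrum", data))
```

A legacy application would simply write the file; the store would call `generate_metadata` on submission so that newer applications can query the extracted properties.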
In discussions of the SAM concept with potential SciDAC pilot Collaboratory projects, scenarios in which local, private metadata undergoes validation and then migration to a community server were postulated. Similarly, a need for special-purpose indexes to enhance performance for common queries has been discussed. To address these cases, and others that might arise, we plan to work with the Collaboratory pilot projects to define mechanisms that will allow the MMS to be extended to support their needs and also allow the higher level services of SAM to make use of the optimized query capabilities of the underlying data stores.
The Metadata Management Services provide capabilities for standard metadata-based queries (i.e., matching keys and values through simple logic to locate relevant data). However, richer metadata models are possible that will support more advanced queries and more powerful navigation of relationships between data objects. Semantic Services (SS) is a feature of SAM that will provide a way to describe metadata in a larger context and in a way that allows both human and programmatic reasoning about metadata entries. The development of Semantic Services will leverage the languages and components being created to support the next-generation “Semantic Web”[25], specifically those related to the Resource Description Framework (RDF) [26] and Ontology Inference Layer (OIL). Both RDF and OIL can be written using XML syntax, thus making it possible for the Semantic Services to leverage the Metadata Management Services component. RDF provides a standard mechanism for describing relationships between objects (e.g., “file X” is “the Fourier transform” of “file Y”). Thus, it provides a common way of representing data pedigree, annotation, and workflow relationships as well as scientific relationships such as the linkages between a gene and related genes in other organisms, protein(s) encoded, physiological effects, etc. RDF also allows relationships to be the objects of other relationships, allowing expression of a researcher’s belief about previously identified relationships. Therefore, it becomes possible to encode comments about correctness or importance within notebooks or as part of a community review system. OIL, through its base standard and extensions, provides additional power for representing domain ontologies (i.e., the concepts and the relationships between them that describe a domain) and supporting reasoning in agent systems [27].
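The idea that a relationship can itself be the object of another relationship can be illustrated with a small in-memory triple store. This is a simplified sketch: it nests one tuple inside another rather than using RDF's actual reification vocabulary, and the class, identifiers, and predicate names are all hypothetical.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: object
    predicate: str
    obj: object


class TripleStore:
    def __init__(self) -> None:
        self._triples = set()

    def add(self, subject: object, predicate: str, obj: object) -> Triple:
        triple = Triple(subject, predicate, obj)
        self._triples.add(triple)
        return triple

    def match(self, subject=None, predicate=None, obj=None) -> List[Triple]:
        """Return triples matching every field that is not None."""
        return [t for t in self._triples
                if (subject is None or t.subject == subject)
                and (predicate is None or t.predicate == predicate)
                and (obj is None or t.obj == obj)]


if __name__ == "__main__":
    store = TripleStore()
    # A data pedigree statement: file X is the Fourier transform of file Y.
    claim = store.add("file:X", "fourier_transform_of", "file:Y")
    # A statement about that statement: a researcher's view of its correctness.
    store.add("user:alice", "disputes", claim)
    for t in store.match(predicate="disputes"):
        print(t.subject, t.predicate, t.obj)
```

The same structure would serve for comments on importance in a notebook or for entries in a community review system, since both are statements whose object is an earlier statement.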
The Semantic Service layer is designed to allow discovery of semantic information and provide a way of working with the information through generic tools (e.g., within Electronic Notebooks). Semantic Services will include a discovery mechanism that will allow researchers and their applications to identify the relationships used within a repository. Semantic relationships will be stored using the underlying MMS, and we will work to extend the MMS query language to directly support relational and ontology-based queries. We also will explore mechanisms for registering standard relationship names and ontologies and for providing ontological guidance to researchers and developers. User-level components that provide access to SS are described in the following sections.
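Relationship discovery and a simple relational query can be sketched over plain (subject, predicate, object) tuples. The predicate `derived_from` and the file names are assumptions made for the example; the traversal is an ordinary breadth-first search, not a SAM query language.

```python
from collections import deque
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def relationship_names(triples: Iterable[Triple]) -> List[str]:
    """Discover which relationship names are in use in a repository."""
    return sorted({predicate for _, predicate, _ in triples})


def ancestors(triples: Iterable[Triple], start: str,
              predicate: str = "derived_from") -> Set[str]:
    """Follow one relationship transitively, breadth first, from a start object."""
    parents = {}
    for subject, pred, obj in triples:
        if pred == predicate:
            parents.setdefault(subject, set()).add(obj)
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def common_ancestors(triples: List[Triple], a: str, b: str) -> Set[str]:
    """Objects from which both a and b were ultimately derived."""
    return ancestors(triples, a) & ancestors(triples, b)


if __name__ == "__main__":
    repo = [
        ("plot.png", "derived_from", "fit.dat"),
        ("fit.dat", "derived_from", "raw.dat"),
        ("table.csv", "derived_from", "raw.dat"),
        ("raw.dat", "acquired_on", "beamline-7"),
    ]
    print(relationship_names(repo))
    print(common_ancestors(repo, "plot.png", "table.csv"))
```

A query of this shape is what an agent would need when looking for a shared root cause behind anomalies in several derived data objects.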
Semantic Services are not intended to provide duplicate support for the kinds of relationships now encoded within community databases. Rather, they are designed to allow those relationships to be dynamically discovered by other communities and to allow the generation and evolution of new relationships that crosscut disciplinary boundaries. This includes supporting the development of PSE capabilities for data pedigree and notebook representation as well as the development of relationships that specify scientific connections between data produced in different disciplines or sub-disciplines. Since Semantic Services allow arbitrary relationships to be defined and provide immediate capabilities for searching and browsing these relationships, they can be used as an exploratory tool. By design, they will allow fine-grained mixing of existing (partial) ontologies, thus allowing cross-disciplinary queries within a single data store or across an MMS-federated store. As at the MMS level, we will investigate mechanisms to register maps between relationship schemas, allowing for customized query translations and evolutionary standardization of schema. We will investigate ways of instrumenting Semantic Services to automatically discover similarities and conflicts in relationships defined by multiple communities and to dynamically generate ontologies that capture the nature of cross-disciplinary interactions [28].
While providing an architecture that will allow independent development and evolution of metadata types/naming conventions, the SAM team will work closely with other SciDAC developers to move toward standard representations for software-generated metadata such as data pedigrees (e.g., experiment parameters, system description, input files, version of software/algorithms used), summary information (e.g., low-resolution subsets, identified features), and relationships to other data (e.g., part of a project or parameter study). The SAM team also will work with SciDAC end users to define a set of basic, discipline-independent annotation types and semantic relationships that are necessary to represent project plans, hypotheses and conclusions, ideas for follow-on experiments, meeting notes, etc.
Electronic notebooks require a variety of data and metadata management services, ranging from mechanisms to add and query annotations to functionality related to records management – collections, digital signatures and timestamps, pagination, annotation display mechanisms, etc. In the SAM architecture, notebook services will leverage the semantic and metadata management services. Author names, time stamps, data types, and other notebook-specific metadata will be defined in a notebook metadata schema and manipulated through standard MMS mechanisms. The MMS metadata generator registration capability can be used to automatically invoke an external time-stamping service during the notebook submission process. Similarly, the concept of chapter-page and page-note relationships can be encoded using the SS capabilities. Thus, many of the basic Notebook Services capabilities will be implemented through the definition of schema and development of plug-ins for the MMS and SS layers. This architecture has significant advantages for supporting generalized data-mining tools. As an illustration, agents from a collaborative problem-solving environment that need to collect all the information on the use of a particular substance during a large, multiyear project can use SS capabilities to discover and traverse the collections of notebooks that are part of the project.
Additional mechanisms will be needed at the Notebook Services (NS) layer to support notebook user interface configuration and true records management. Mechanisms will be needed for the registration and discovery of components for creating, editing, and displaying various data types, analogous to the editor and viewer interfaces developed in the DOE2000 notebook project. In SAM, these mechanisms will be extended to support selection of components based on device capabilities and user preferences. NS mechanisms will also be needed for tracking notebooks through their lifecycle, implementing digital signature and witnessing policies, maintaining a signed audit log of notebook configuration, long-term archiving, migrating data and signatures, and for other features required to conform with DOE record-keeping regulations. Depending on the needs of a particular community, electronic notebooks might need unalterable serial numbers, retention schedules based on how they were used, etc. In the notebook services layer, we will strive to provide ways of customizing for different record-keeping policies while implementing default policies suitable for DOE projects.
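One way to make an audit log tamper-evident is to chain each entry to the previous one and authenticate the chain. The sketch below is illustrative only: it uses an HMAC with a shared secret key as a stand-in for a true digital signature (an HMAC alone does not provide non-repudiation), it takes the timestamp from the caller in place of an external time-stamping service, and `AuditLog` is a hypothetical name.

```python
import hashlib
import hmac
import json
from typing import Dict, List


class AuditLog:
    """Append-only log in which each entry authenticates its predecessor."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self.entries: List[Dict] = []

    def _mac(self, event: Dict, timestamp: str, prev: str) -> str:
        payload = json.dumps(
            {"event": event, "timestamp": timestamp, "prev": prev},
            sort_keys=True,
        )
        return hmac.new(self._key, payload.encode("utf-8"),
                        hashlib.sha256).hexdigest()

    def append(self, event: Dict, timestamp: str) -> Dict:
        prev = self.entries[-1]["mac"] if self.entries else ""
        entry = {"event": event, "timestamp": timestamp, "prev": prev,
                 "mac": self._mac(event, timestamp, prev)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain; any edited, removed, or reordered entry fails."""
        prev = ""
        for entry in self.entries:
            expected = self._mac(entry["event"], entry["timestamp"], prev)
            if entry["prev"] != prev or not hmac.compare_digest(expected, entry["mac"]):
                return False
            prev = entry["mac"]
        return True


if __name__ == "__main__":
    log = AuditLog(key=b"demo-key-not-for-production")
    log.append({"action": "create_notebook", "id": "nb-1"}, "2001-07-01T12:00:00Z")
    log.append({"action": "add_witness", "id": "nb-1"}, "2001-07-02T09:30:00Z")
    print(log.verify())
```

Different record-keeping policies could be supported by swapping the authentication step, for example replacing the HMAC with a signature from whatever service a community requires.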
In creating these services, the SAM team will draw on a variety of existing work. Notebook records management requirements have been documented in industry [29] and by government agencies [15][30]. An XML standard for representing digital signatures of documents and parts of documents is nearing completion [31], and commercial time-stamping and notarization services are becoming available. Discussions between these service providers and the DOE2000 electronic notebook project teams already have led to changes in their interfaces to support the high-volume usage expected compared with other applications such as time-stamping reports and contracts. SAM development efforts will also be guided by the proof-of-concept implementation of digital signatures and time-stamping capabilities with electronic notebooks that was implemented by the DOE2000 project. Development of the NS feature also may be able to leverage work being done to allow customization of portals to in turn enable customization of notebook page displays based on device and user preferences.
Although the intent of the NS feature of SAM is to support the development of electronic notebooks, we will investigate possibilities for exposing specific capabilities to applications in a separable manner. For example, an application that requires a legally defensible audit log might be able to independently access the NS signing, time-stamping, and audit capabilities. Similarly, applications and agents might wish to discover a component capable of rendering a specific type of data for purposes other than creating a full notebook interface.
As noted above, SAM services will be made accessible through broad APIs that expose the full capabilities at each level. While third-party systems can access these APIs directly, we anticipate the development of components that support common use scenarios. Thus an agent system that needs to spider through data pedigree information to discover possible common root causes for anomalous results present in multiple data objects may need to use the SS API directly, whereas an application wishing to provide a query interface for locating data within SAM might incorporate a standard search component. We will work with SciDAC pilot projects to define and prioritize components for development. The following list presents a preliminary set of component concepts that have been developed in response to use cases that have arisen during initial discussion with the developers of other SciDAC proposals, and in internal discussions of notebook interface requirements:
Programming examples detailing the use of SAM components will be developed to guide efforts within other SciDAC pilots to incorporate them into domain applications and PSEs.
We also plan to build a prototype SAM-based notebook using these components. This notebook will provide significant advantages over DOE2000 and other current notebook systems along many dimensions including ease-of-use, search capabilities, legal defensibility, and representational power. It will be designed as both a practical solution for DOE researchers needing an electronic records solution and as a research test bed for advanced notebook concepts.
The SAM notebook interface will also provide enhanced usability within PSEs and portal environments. Specifically, it will integrate with system-wide event services so that applications and agents can be aware of a researcher’s activity within the notebook and can trigger actions within it, such as causing the notebook interface to flip to a specific page. Together with the SAM components and direct service APIs, the notebook will provide an extremely powerful and flexible research documentation sub-system for SciDAC researchers.
Several aspects of the SAM architecture span individual services and components. Specifically, security, notification, and administration capabilities are needed throughout SAM. Security capabilities must address issues of authentication, authorization, encryption, and non-repudiation. We plan to implement security capabilities in SAM through standard component-and-service interfaces that hide details of security processes, such as acquiring the user’s credentials for authentication and signing, and the underlying security service implementation. This capability will allow SAM to adopt the implementations used in a community or PSE. Thus, SAM may use a simple username/password authentication in a small academic group setting, or it could provide much stronger authentication via the Grid Security Infrastructure (GSI) or other advanced implementations within a SciDAC pilot project.
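The encapsulation described above can be sketched as an abstract interface with interchangeable implementations. The sketch shows only the simple username/password case; a GSI-backed implementation would subclass the same interface and is not shown. `Authenticator`, `PasswordAuthenticator`, and `open_session` are hypothetical names, and the iteration count is an arbitrary example value.

```python
import hashlib
import hmac
import os
from abc import ABC, abstractmethod
from typing import Dict, Optional, Tuple


class Authenticator(ABC):
    """Interface that hides how credentials are checked."""

    @abstractmethod
    def authenticate(self, credentials: Dict[str, str]) -> Optional[str]:
        """Return the authenticated user id, or None on failure."""


class PasswordAuthenticator(Authenticator):
    """Simple salted-hash password check for a small group setting."""

    def __init__(self) -> None:
        self._users: Dict[str, Tuple[bytes, bytes]] = {}

    @staticmethod
    def _hash(password: str, salt: bytes) -> bytes:
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

    def add_user(self, username: str, password: str) -> None:
        salt = os.urandom(16)
        self._users[username] = (salt, self._hash(password, salt))

    def authenticate(self, credentials: Dict[str, str]) -> Optional[str]:
        username = credentials.get("username", "")
        record = self._users.get(username)
        if record is None:
            return None
        salt, expected = record
        actual = self._hash(credentials.get("password", ""), salt)
        # Constant-time comparison avoids leaking how many bytes matched.
        return username if hmac.compare_digest(actual, expected) else None


def open_session(auth: Authenticator, credentials: Dict[str, str]) -> Dict[str, str]:
    """Service-side code depends only on the interface, not the mechanism."""
    user = auth.authenticate(credentials)
    if user is None:
        raise PermissionError("authentication failed")
    return {"user": user}


if __name__ == "__main__":
    auth = PasswordAuthenticator()
    auth.add_user("alice", "correct horse")
    print(open_session(auth, {"username": "alice", "password": "correct horse"}))
```

Because `open_session` never inspects the credentials itself, a community could substitute a stronger mechanism without changing the services built on top of it.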
SAM presents some interesting security challenges, specifically in the area of authorization. Since individual metadata properties can be considered to be intellectual property, object/file-level access controls may not be sufficient. Further, queries that do not directly return restricted metadata may still expose its existence through the set of objects returned. The level of complexity that must be supported in real scientific communities is not well known. We will investigate general solutions through partnerships with security middleware providers and pilot users.
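The leakage problem can be made concrete with a small sketch of property-level access control. One possible approach, shown here purely as an illustration, is to evaluate each query against the requesting user's own view of the metadata, so that a restricted property can neither be returned nor influence which objects are returned. The data layout and function names are assumptions, not a proposed SAM design.

```python
from typing import Dict, List, Optional, Set, Tuple

# Each property maps to (value, readers); readers=None means public.
Record = Dict[str, Tuple[str, Optional[Set[str]]]]


def visible(record: Record, user: str) -> Dict[str, str]:
    """Return only the properties this user is allowed to read."""
    return {key: value for key, (value, readers) in record.items()
            if readers is None or user in readers}


def query(records: Dict[str, Record], user: str,
          criteria: Dict[str, str]) -> List[Tuple[str, Dict[str, str]]]:
    """Match criteria against the user's view, never against hidden values."""
    results = []
    for name, record in records.items():
        view = visible(record, user)
        if all(view.get(key) == value for key, value in criteria.items()):
            results.append((name, view))
    return results


if __name__ == "__main__":
    records: Dict[str, Record] = {
        "sample-1": {"formula": ("SiO2", None),
                     "final_structure": ("P3_221", {"alice"})},
        "sample-2": {"formula": ("SiO2", None)},
    }
    print(query(records, "bob", {"final_structure": "P3_221"}))
    print(query(records, "alice", {"final_structure": "P3_221"}))
```

A naive implementation that filtered on the stored values first and removed restricted properties afterward would return sample-1 to the unauthorized user, revealing that the restricted property exists and matches.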
Such an evolutionary approach will also be used to determine which SAM events should be reported externally and which events SAM should be able to respond to directly. Initial discussions have suggested that events related to the management (creation, modification, deletion, etc.) of objects and associated metadata will be important within PSE environments. Similarly, exposure of events related to user actions (login, queries, etc.) and system configuration will allow integration with system-level logging and auditing capabilities. We plan to work closely with potential SAM users to define motivating use scenarios and to support the appropriate event types. We will investigate commercial and Grid global event service mechanisms, and as with security, we will attempt to encapsulate any implementation dependencies.
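A minimal publish/subscribe sketch shows how object-management events could be reported externally and how an external component could ask the notebook to act. The event names (`object.modified`, `notebook.show_page`) and the `EventBus` class are hypothetical; a real deployment would wrap whatever commercial or Grid event service a community selects.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[Dict[str, object]], None]


class EventBus:
    """In-process stand-in for an external event service."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, **payload: object) -> int:
        """Deliver the payload to each subscriber; return how many were called."""
        handlers = list(self._handlers.get(event_type, ()))
        for handler in handlers:
            handler(dict(payload))
        return len(handlers)


if __name__ == "__main__":
    bus = EventBus()
    audit_trail: List[Dict[str, object]] = []

    # A logging component records every modification event it hears about.
    bus.subscribe("object.modified", audit_trail.append)
    # The notebook GUI reacts to requests from applications or agents.
    bus.subscribe("notebook.show_page",
                  lambda event: print("flipping to page", event["page"]))

    bus.publish("object.modified", path="/proj/raw.dat", user="alice")
    bus.publish("notebook.show_page", page=12)
    print(audit_trail)
```

Keeping the bus behind a small interface like this is one way to encapsulate the implementation dependency, in the same manner proposed for security.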
Encapsulation also will be used in SAM to separate configuration interfaces from those used directly for managing metadata, semantic relationships, and notebooks. As noted previously in the description of MMS, we will strive to make the SAM interface for creating, querying, and retrieving data and metadata as close as possible to the interface of the underlying data store. Interfaces for configuring MMS translations and metadata generation capabilities will be separated. This, together with default Web configuration pages for SAM services, could make it possible to set up an MMS service and then run non-SAM-aware applications against it, thus gaining the benefits of SAM query translations and metadata generation capabilities, without any programming. Similarly, we hope to separate configuration and administration capabilities at the SS and NS levels to minimize the functionality an application or agent must implement to use SAM. As for the MMS, we anticipate building Web configuration and administration pages as the default access mechanism. These pages could be used on a standalone basis, or from within a portal. Automatic control of configuration and administration functions could easily be built on this foundation by third parties wishing to embed this functionality in agents or applications.