
We are creating a Scientific Annotation Middleware (SAM) system that will provide researchers and developers with the capabilities necessary to manage the complexity resulting from the collaborative, cross-disciplinary, compute-intensive research enabled through the SciDAC initiative. SAM will include components and services that enable researchers, applications, problem-solving environments (PSEs), and software agents to create metadata and annotations about data objects and the semantic relationships between them. Human access to the middleware will be through a researcher’s notebook interface available via desktop computers and PDA devices. SAM will also support manual and programmatic queries across entries generated by these multiple sources. This research, performed by the same team that created the very successful DOE2000 electronic notebook project, will lead to fundamentally new and more complete, effective, and efficient ways to document the scientific work performed in SciDAC and across DOE.
The practice of science is critically dependent on complete and accurate documentation of experimental processes and results. Scientific records enable repeatability in the evaluation of scientific hypotheses, allow researchers to share results and avoid duplication of work, and provide a means to establish credit and accountability for scientific discoveries. Before computerization, paper notebooks were the primary scientific record. However, with the advent of computing and large-scale simulation, experiment complexity, data size and dimensionality, and the overall number of experiments performed have exploded, stretching traditional annotation methods to their limits. As these trends continue, and as experiments and teams themselves become more distributed and cross-disciplinary, the research process must become self-documenting. Richer, more detailed, more searchable annotations will be required. Metadata generated by the myriad tools used within a project will have to be integrated to provide a complete picture of the scientific research being performed.
Electronic notebooks, such as those developed as part of the DOE2000 program, have advanced the practice of scientific documentation by allowing multimedia input, links to data files, digital signatures, simple searching, and long-distance collaboration. They have been widely embraced by existing collaboratory pilots as well as by many application groups including chemistry, accelerator beam lines, and climate research. However, electronic notebooks are limited in their ability to show non-chronological relationships between entries, to support complex searches, and to interact with other producers, curators, and consumers of annotations such as autonomous feature-detection agents, digital libraries, and data pedigree mechanisms.
A new generation of scientific annotation middleware is needed that can interact with applications, problem-solving environments, and software agents as well as with humans. In addition to a user interface, the system must include components and services that allow access from mobile devices and provide seamless, two-way interaction with applications and agents, as well as tools that support the generation and translation of annotation metadata. Such a system must provide natural, unobtrusive means for creating annotations, allow the discovery of patterns across the integrated annotation metadata, and enable researchers to navigate semantic relationships between data items. The capability to view, query, and extend the entire corpus of metadata relating to a project through a cohesive middleware system will greatly enhance researchers’ ability to understand the context of their data and will lay the groundwork for increased automation of scientific knowledge discovery.
We propose to research the advanced annotation needs of SciDAC applications and develop the components of a scientific annotation middleware system. The work will be performed in close cooperation with domain-oriented application projects (e.g., climate, chemistry), and will incorporate relevant security, distributed computing, data management, information analysis, and other technologies from the SciDAC, Grid, and related activities. Opportunities for deploying SAM as part of the architecture for collaboratory pilot projects will also be pursued. Feedback from the extensive user base of the DOE2000 electronic notebooks will guide the initial design of the electronic notebook interface for SAM and contribute design requirements for programmatic interfaces. The project will provide lifecycle support for SAM, including a regular schedule of incremental releases of SAM components and documentation. Formal monitoring and feedback activities, targeting both end users and developers using SAM services, will be performed to understand how the system is being used and to provide guidance for further development.
Figure 1 shows a high-level view of the architecture envisioned for the project. Core middleware capabilities are provided through three sets of services.
SAM will rely on external data and metadata storage mechanisms. Application programming interfaces (APIs) to each service will allow agents, applications, PSEs, and portals to interact with SAM at a variety of levels.

The SAM team will work closely with other SciDAC developers to move toward standard representations for software-generated metadata such as data pedigrees (experiment parameters, system description, input files, version of software/algorithms used), summary information (low-resolution subsets, identified features), and relationships to other data (e.g. part of a project or parameter study). The SAM team will work with SciDAC end users to define and develop a set of basic annotation types and semantic relationships necessary to represent project plans, hypotheses and conclusions, ideas for follow-on experiments, meeting notes, etc. At both levels, SAM will be architected such that these standards can be evolved and customized by specific groups.
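As a rough illustration of the kinds of metadata listed above, the following Python sketch models a data pedigree and an annotation that carries it. It is not a SAM schema: the class names, field names, and example values are all hypothetical, chosen only to mirror the categories named in the text (experiment parameters, system description, input files, software version, and relationships to other data).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class DataPedigree:
    """Hypothetical pedigree record; field names are illustrative only."""
    experiment_parameters: Dict[str, str] = field(default_factory=dict)
    system_description: str = ""
    input_files: List[str] = field(default_factory=list)
    software_version: str = ""


@dataclass
class Annotation:
    """Hypothetical annotation attached to a single data object."""
    data_object_id: str
    annotation_type: str  # e.g. "pedigree", "hypothesis", "meeting-notes"
    pedigree: Optional[DataPedigree] = None
    # Each relationship is a (relationship name, target object id) pair.
    relationships: List[Tuple[str, str]] = field(default_factory=list)


# Example values are invented for illustration.
record = Annotation(
    data_object_id="run-0042-output",
    annotation_type="pedigree",
    pedigree=DataPedigree(
        experiment_parameters={"resolution": "coarse"},
        system_description="example cluster",
        input_files=["initial_conditions.dat"],
        software_version="1.0",
    ),
    relationships=[("part-of", "parameter-study-7")],
)
```

Keeping relationships as named pairs, rather than fixed fields, is one way to leave room for the group-specific customization described above.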
SAM services will be made accessible through broad APIs that expose the full capabilities at each level. While third-party systems can access these APIs directly, we anticipate the development of components that support common use scenarios. Thus an agent system that needs to spider through data pedigree information to discover possible common root causes for anomalous results present in multiple data objects may need to use the SS API directly, whereas an application wishing to provide a query interface for locating data within SAM might incorporate a standard search component. We will work with SciDAC pilot projects to define and prioritize components for development.
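The agent scenario above can be sketched as a graph search. The example below assumes, purely for illustration, that pedigree information is available as a mapping from each data object to the objects it was derived from; the function names and the sample data are hypothetical and do not represent the SS API. The sketch collects every upstream ancestor of each anomalous object and intersects the results to find candidate shared root causes.

```python
from typing import Dict, Iterable, List, Set


def ancestors(pedigree: Dict[str, List[str]], obj: str) -> Set[str]:
    """Return every object reachable from obj by following input links."""
    seen: Set[str] = set()
    stack = list(pedigree.get(obj, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(pedigree.get(current, []))
    return seen


def common_ancestors(pedigree: Dict[str, List[str]],
                     anomalous: Iterable[str]) -> Set[str]:
    """Return the upstream objects shared by all anomalous data objects."""
    sets = [ancestors(pedigree, obj) for obj in anomalous]
    return set.intersection(*sets) if sets else set()


# Invented sample data: each key lists the objects it was derived from.
pedigree = {
    "result_a": ["mesh_v2", "params_1"],
    "result_b": ["mesh_v2", "params_2"],
    "mesh_v2": ["raw_geometry"],
}

shared = common_ancestors(pedigree, ["result_a", "result_b"])
print(sorted(shared))  # prints ['mesh_v2', 'raw_geometry']
```

A real agent would presumably fetch pedigree links lazily through the service API rather than hold the whole graph in memory, but the traversal logic would be similar.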
Standard graphical components will be developed to simplify processes such as gathering information from a user to develop a query and returning the results to an application. An electronic notebook graphical user interface (GUI) will be developed using the service APIs. APIs that allow other software to interact with and control the notebook GUI will also be developed. Administrative tools will be created to simplify management of multiple notebooks, configuration of metadata services, etc.
Additional technical information, project deliverables, management plan, and current status are available from the SAM Project Home Page.