The Binary Format Description (BFD) language is an XML dialect based on the eXtensible Scientific Interchange Language (XSIL) that supports the executable documentation of 'arbitrary' binary and ASCII data sets. Applying a BFD template to a set of files produces XML output containing the original data in an XML-tagged format that can be interpreted by other programs or subjected to further processing (e.g., using XSLT).
The problems involved in sharing scientific data files are myriad. They range from low-level issues of programming language and operating system differences in number formats (e.g. “big-endian” versus “little-endian”) and the ordering of elements for multi-dimensional arrays to higher level issues involving optional elements (format variations), assumed units (e.g. “feet” versus “meters”), multi-file data sets, and general lack of documentation. These types of problems are clearly not unique to science, but the dynamics of scientific research have limited the success of solutions such as standard data formats and self-documenting files; any up-front effort to make data accessible to potential future users incurs an opportunity cost measurable in terms of research time.
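To make the byte-order issue concrete, the following minimal Java sketch interprets the same four bytes as a 32-bit integer under both conventions. The class and method names (`EndianDemo`, `readInt`) are illustrative only and are not part of BFD.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: shows why a reader must know the byte order of a file.
class EndianDemo {
    /** Interprets the first four bytes of raw as a 32-bit integer in the given byte order. */
    static int readInt(byte[] raw, ByteOrder order) {
        return ByteBuffer.wrap(raw).order(order).getInt();
    }

    public static void main(String[] args) {
        // The value 2001 (0x000007D1) as written by a big-endian system.
        byte[] raw = {0x00, 0x00, 0x07, (byte) 0xD1};
        System.out.println("big-endian:    " + readInt(raw, ByteOrder.BIG_ENDIAN));
        System.out.println("little-endian: " + readInt(raw, ByteOrder.LITTLE_ENDIAN));
    }
}
```

Read with the wrong byte order, the same bytes yield an unrelated (here negative) number with no error raised, which is why this property has to be recorded in a format description rather than guessed.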
The growing interest in and infrastructure for scientific data mining and informatics approaches that gather and analyze data from across a wide range of techniques and disciplines further complicates the issue, as data may be of interest in multiple communities with conflicting interests. The work described here attempts to address both the technical and domain-dynamics issues involved in sharing scientific data.
The core concepts involved are a language for describing file formats and a universal engine that reads binary (or ASCII) files and their formats to produce transformed output. While the concepts themselves are not new, the implementation in terms of a science-oriented XML format language, portable Java engine, and web-accessible services that allow format descriptions to be registered and automatically applied to data files offers unique advantages. Specifically, the ability to decouple the acts of producing data, operationally documenting its format, defining required outputs, and actually generating the desired metadata, data subsets, and/or translated data provides a very powerful mechanism to shift data sharing towards an efficient on-demand / just-in-time model.
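The idea of a description-driven engine can be sketched in a few lines of Java. This is not the BFD engine: the `Field` class, the `describe` method, the two supported types, and the assumption of big-endian input are all hypothetical simplifications, intended only to show how a format description, rather than hand-written parsing code, can drive the reading of a binary stream and the generation of XML-tagged output.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a description-driven reader; not the actual BFD engine.
class FormatDrivenReader {
    /** One entry in a (hypothetical) format description: a field name and its type. */
    static final class Field {
        final String name;
        final String type;
        Field(String name, String type) { this.name = name; this.type = type; }
    }

    /** Reads each described field in order (big-endian) and returns XSIL-style Param elements. */
    static String describe(List<Field> format, byte[] data) {
        StringBuilder xml = new StringBuilder("<XSIL>\n");
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            for (Field f : format) {
                String value;
                switch (f.type) {
                    case "int":
                        value = Integer.toString(in.readInt());
                        break;
                    case "double":
                        value = Double.toString(in.readDouble());
                        break;
                    default:
                        throw new IllegalArgumentException("Unsupported type: " + f.type);
                }
                // Field names are assumed to be XML-safe in this sketch.
                xml.append("  <Param Name=\"").append(f.name)
                   .append("\" Type=\"").append(f.type).append("\">")
                   .append(value).append("</Param>\n");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return xml.append("</XSIL>").toString();
    }

    public static void main(String[] args) {
        List<Field> format = Arrays.asList(
            new Field("month", "int"), new Field("day", "int"), new Field("year", "int"));
        // Three big-endian 32-bit integers standing in for a binary data file.
        byte[] data = ByteBuffer.allocate(12).putInt(10).putInt(12).putInt(2001).array();
        System.out.println(describe(format, data));
    }
}
```

Because the description is data rather than code, the same reader can be reused for any file whose layout can be expressed as a list of fields. That separation is what allows format descriptions to be written, registered, and applied independently of the programs that produced the data.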
The basic framework of the BFD language comes from XSIL and is described in the XSIL Documentation. The BFD language adds two elements and several attributes to the XSIL specification. The elements support an 'if' control structure (XBFDif) and dynamic references to previously read values (i.e., variables) using XPath expressions (XBFDvalue-of). The new attributes provide mechanisms to describe additional file constructs (e.g., fixed-length string buffers) and to reference external files via generic 'stream numbers' instead of specific file names or URLs. More documentation on the BFD extensions, and on the areas in which BFD resolves ambiguities in the XSIL specification, is in development.
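The XPath expressions used by XBFDif and XBFDvalue-of can be tried out with the standard `javax.xml.xpath` API. The sketch below evaluates two such expressions against a small XSIL fragment holding already-read values. It illustrates the XPath semantics only: how the BFD engine evaluates these expressions internally is not described here, and the class and method names (`XPathLookupDemo`, `number`, `test`) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Illustrative only: evaluates XPath expressions against previously read values.
class XPathLookupDemo {
    /** Values that have already been read, in XSIL form. */
    static final String PARTIAL_OUTPUT =
        "<XSIL>"
      + "<Param Name=\"numColumns\" Type=\"int\">3</Param>"
      + "<Param Name=\"flag\" Type=\"int\">9</Param>"
      + "</XSIL>";

    /** Evaluates expr as a number, as a value-of style lookup for a dimension would. */
    static double number(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Double) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Evaluates expr as a boolean, as an 'if' style test would. */
    static boolean test(String xml, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Boolean) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(number(PARTIAL_OUTPUT, "/XSIL/Param[@Name='numColumns']"));
        System.out.println(test(PARTIAL_OUTPUT, "/XSIL/Param[@Name='flag'] = 9"));
        System.out.println(test(PARTIAL_OUTPUT, "/XSIL/Param[@Name='flag'] = 5"));
    }
}
```

Because a value read earlier in the file is addressed by its position in the XML output, a later array dimension or conditional block can depend on it without any BFD-specific variable syntax.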
The link below to the BFD Test Form can be used to run the BFD engine on arbitrary combinations of data file and BFD template. (The form was developed as part of the SAM project and is also designed to perform a second XSLT transformation and, finally, to extract metadata to be associated with data stored using SAM. These parts of the form can be ignored if you just wish to try BFD.) A simple example is given here:
BFD Template:

    <?xml version="1.0"?>
    <!DOCTYPE XSIL SYSTEM "bfd.dtd">
    <XSIL>
      <Param Name="month" Type="int"/>
      <Param Name="day" Type="int"/>
      <Param Name="year" Type="int"/>
      <Param Name="numColumns" Type="int"/>
      <Param Name="flag" Type="int"/>
      <XBFDif test="/XSIL/Param[@Name='flag'] = 5">
        <Array Name="frequencyData" Type="double">
          <Dim>
            <XBFDvalue-of select="/XSIL/Param[@Name='numColumns']"/>
          </Dim>
          <Dim>6</Dim>
        </Array>
      </XBFDif>
      <XBFDif test="/XSIL/Param[@Name='flag'] = 9">
        <Array Name="timeData" Type="double">
          <Dim>4</Dim>
          <Dim>
            <XBFDvalue-of select="/XSIL/Param[@Name='numColumns']"/>
          </Dim>
        </Array>
      </XBFDif>
      <Stream Encoding="Binary" Type="Remote" XBFDStreamnumber="1">
      </Stream>
    </XSIL>

Product of running BFD Template on data file:

    <?xml version="1.0"?>
    <!DOCTYPE XSIL SYSTEM "bfd.dtd">
    <XSIL>
      <Param Name="month" Type="int">10</Param>
      <Param Name="day" Type="int">12</Param>
      <Param Name="year" Type="int">2001</Param>
      <Param Name="numColumns" Type="int">3</Param>
      <Param Name="flag" Type="int">9</Param>
      <Array Name="energyData" Type="double">
        <Dim>3</Dim>
        <Dim>5</Dim>
        <Stream Delimiter=",">
          8.5,9.6,10.7,11.8,1.9,2.0,3.1,4.2
          34.1,56.2,68.3,80.4,45.7,49.2,72.7
        </Stream>
      </Array>
    </XSIL>
An interesting consequence of having an XML file format description is that human-readable documentation can be produced from it using a standard XSLT script. The image below shows the result of using a BFD-to-HTML documentation script on an ASCII file format description.
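As a rough illustration of the approach, the Java sketch below applies a very small XSLT stylesheet to a BFD-style template and produces an HTML list of the fields it declares. The stylesheet is a hypothetical stand-in written for this example, not the project's BFD-to-HTML script, and the template omits the DOCTYPE declaration so that no `bfd.dtd` file is needed to run it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative only: generates simple HTML documentation from a format description.
class TemplateDocDemo {
    /** A cut-down template; the DOCTYPE is omitted so no DTD has to be resolved. */
    static final String TEMPLATE =
        "<XSIL>"
      + "<Param Name=\"month\" Type=\"int\"/>"
      + "<Param Name=\"day\" Type=\"int\"/>"
      + "<Param Name=\"year\" Type=\"int\"/>"
      + "</XSIL>";

    /** Hypothetical stylesheet: one list item per Param, showing its name and type. */
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\" indent=\"no\"/>"
      + "<xsl:template match=\"/XSIL\">"
      + "<ul><xsl:apply-templates select=\"Param\"/></ul>"
      + "</xsl:template>"
      + "<xsl:template match=\"Param\">"
      + "<li><xsl:value-of select=\"@Name\"/> (<xsl:value-of select=\"@Type\"/>)</li>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML and returns the serialized result. */
    static String toHtml(String xml, String xslt) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml(TEMPLATE, STYLESHEET));
    }
}
```

The same description that drives the reading of a file therefore also serves as the source for its documentation, so the two are less likely to drift apart.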
BFD was developed under internal PNNL funding in 2000 and is currently being maintained/extended within the DOE-funded Scientific Annotation Middleware project. We anticipate an open-source release of the BFD engine. The BFD Java engine (bfd.jar) is available in the download section below.