Molecule File Format Problems

1. Specificity

Most molecular file formats were designed with one particular application in mind. For example, the PDB format is designed for storing protein data typically obtained from X-ray crystallography studies. Hence it has special fields for temperature factor and occupancy, but it lacks the ability to express bond orders, and residue connectivity is implied by the labels rather than stated explicitly. Similarly, most formats lack the ability to express certain types of information while offering fields for very specific information of limited use.

2. Unused (optional) fields occupy memory

Some formats offer optional fields to express information that is not always present or required. However, in traditional atom-record-based formats these fields still occupy memory and disk space for every atom, whether or not they hold a value.
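As a rough illustration of this overhead, the sketch below compares a fixed atom record that always reserves room for optional fields with the space needed for coordinates alone. The record layout is hypothetical and chosen only for the arithmetic; it does not correspond to any real format.

```python
import struct

# Hypothetical fixed-layout atom record (not any real format):
# x, y, z as float32, followed by three optional float32 fields
# (occupancy, temperature factor, partial charge).
FULL_RECORD = struct.Struct("<3f3f")
COORDS_ONLY = struct.Struct("<3f")

def record_file_size(n_atoms: int) -> int:
    """Bytes used when every record reserves the optional fields."""
    return n_atoms * FULL_RECORD.size

def needed_size(n_atoms: int) -> int:
    """Bytes actually needed when only coordinates are known."""
    return n_atoms * COORDS_ONLY.size

n_atoms = 10_000
wasted = record_file_size(n_atoms) - needed_size(n_atoms)
print(wasted)  # 120000 bytes reserved for fields that carry no data
```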

3. The version problem

Extensions in new versions of a format render earlier programs obsolete. If new fields are added, data files have to be converted and all software has to be upgraded. Version compatibility within the same format becomes a nightmare.

4. Information loss at conversions between various formats

Because of the specificity described above, converting from one format to another loses the information that is specific to the source format and has no counterpart in the target format. Furthermore, information specific to the target format has to be generated, computed, estimated, often guesstimated, or simply assigned a meaningless dummy value. Therefore, converting back and forth will hardly ever produce a file equivalent to the original.
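The round-trip problem can be sketched with two toy in-memory "formats". Both are hypothetical stand-ins rather than real parsers: format A stores bond orders, while format B stores only which atoms are bonded.

```python
# Hypothetical stand-ins for two formats (not real file parsers).
# Format A bonds: (atom_i, atom_j, bond_order); format B bonds: (atom_i, atom_j).

def a_to_b(mol_a):
    # B has no place for bond orders, so they are dropped.
    return {"atoms": list(mol_a["atoms"]),
            "bonds": [(i, j) for (i, j, _order) in mol_a["bonds"]]}

def b_to_a(mol_b):
    # A requires a bond order; with nothing to go on, assign a dummy value of 1.
    return {"atoms": list(mol_b["atoms"]),
            "bonds": [(i, j, 1) for (i, j) in mol_b["bonds"]]}

formaldehyde = {"atoms": ["C", "O", "H", "H"],
                "bonds": [(0, 1, 2), (0, 2, 1), (0, 3, 1)]}

round_trip = b_to_a(a_to_b(formaldehyde))
print(round_trip == formaldehyde)  # False: the C=O double bond came back as single
```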

5. Deviations from the standard to add information lead to incompatibility

Companies and software packages often try to add information to file formats that do not support it, placing it in comments or hacking it into otherwise unused fields. This leads to incompatibility problems among software packages and companies.

Why can't we make a Super-Format?

One possible solution would be to come up with a super-format that contains every field type present in any existing format, plus anything else we can think of that might become useful in the future. Such a format would have the following problems:

  • Huge size (all rarely used fields present for every atom and bond)
  • Inefficient record handling: accessing specific data (e.g. the coordinates of atoms) requires the software to page through (skip) large amounts of other data, causing very bad cache performance.
  • Further extension problem: it is inevitable that new fields will need to be added later as new technologies emerge. The extension would face all the incompatibility problems mentioned above.

Solution: Tagged Molecular Format (TMF)

Data structure

  • The same idea as the TIFF graphics format: information is packaged by type/kind, and a tag explains its format, its size, and what it is for. The header lists the tags that are present in the particular file, with a pointer to each for fast access.
  • Vertical data organization: imagine the traditional atom records placed in a matrix, each row representing an atom and the columns representing the various fields. TMF stores the same data in vertical columns instead of horizontal atom records, i.e. the data of one particular field for all atoms/bonds are packed into one information block with a descriptor tag.
  • All data fields are optional: the vertical organization allows each data field to be optional - the header lists what tags are present in the actual file.
  • Field dimensionality and type described in tags: coordinate data may be 2D or 3D, and charge data can be integer or floating point (representing partial charges). To accommodate all these variations, the tag describes the elementary data type and the dimensionality as well as the total number of units of data.
  • Multiple data fields are allowed, e.g. conformers can be stored by multiple 3D data tags without repeating all other information units.
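The ideas above can be made concrete with a small sketch of a tagged, column-oriented container. The actual TMF byte layout is not given here, so everything in the block is an assumption made for illustration: the magic bytes, the tag ids, the directory entry layout, and the helper names `pack` and `read_tag` are all hypothetical.

```python
import struct

# Everything below is a hypothetical layout, not the real TMF specification.
MAGIC = b"TMF?"
HEADER = struct.Struct("<4sI")    # magic, number of tags
ENTRY = struct.Struct("<HcBII")   # tag id, element type, dimension, count, offset

TAG_ELEMENT, TAG_XYZ, TAG_CHARGE = 1, 2, 3   # hypothetical tag ids

def pack(blocks):
    """Serialize a list of (tag_id, type_char, dim, flat_values) column blocks."""
    data_start = HEADER.size + ENTRY.size * len(blocks)
    directory, payload = b"", b""
    for tag, type_char, dim, values in blocks:
        count = len(values) // dim            # number of atoms/bonds described
        offset = data_start + len(payload)    # pointer for direct access
        directory += ENTRY.pack(tag, type_char.encode(), dim, count, offset)
        payload += struct.pack(f"<{len(values)}{type_char}", *values)
    return HEADER.pack(MAGIC, len(blocks)) + directory + payload

def read_tag(buf, wanted):
    """Return every block carrying the wanted tag; all other tags are skipped."""
    magic, n_tags = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("not a container written by pack()")
    found = []
    for k in range(n_tags):
        tag, type_char, dim, count, offset = ENTRY.unpack_from(
            buf, HEADER.size + k * ENTRY.size)
        if tag != wanted:
            continue
        flat = struct.unpack_from(f"<{count * dim}{type_char.decode()}", buf, offset)
        found.append([flat[i:i + dim] for i in range(0, len(flat), dim)])
    return found

# Three atoms, two conformers stored as two 3D coordinate blocks,
# with elements and charges stored only once.
buf = pack([
    (TAG_ELEMENT, "i", 1, [8, 1, 1]),
    (TAG_XYZ, "f", 3, [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]),
    (TAG_XYZ, "f", 3, [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0]),
    (TAG_CHARGE, "f", 1, [-0.5, 0.25, 0.25]),
])
conformers = read_tag(buf, TAG_XYZ)
print(len(conformers))     # 2
print(read_tag(buf, 99))   # []  (an absent tag costs no storage)
```

A reader written before `TAG_CHARGE` existed would call `read_tag` only for the tags it knows and would never touch the charge block, which is the behavior the tagged design relies on.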

Advantages over traditional records

  • Very wide range of applications: any kind of atom or bond data can be stored under its own tag.
  • Backward and forward compatibility: old software will look for the specific tags it needs and ignore the rest, hence it will still work when new tag types are introduced.
  • Naturally extensible: new tag types can be introduced without affecting the validity of existing programs and data files, so no conversion or upgrade is necessary.
  • Storage Efficiency: no storage space wasted for optional data that is not actually present in the given file.
  • Performance gain from cache locality: the same type of information is stored in a contiguous chunk of memory, so processing a given information type (e.g. 3D coordinates when applying a transformation) performs better because more of the relevant data fits in each cache line.
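The access pattern behind the cache-locality point can be sketched as follows. Python does not expose cache behavior, so this only shows that a transformation over a column block reads and writes the coordinate data and nothing else. The flat x, y, z ordering and the function name `translate` are assumptions made for the example.

```python
from array import array

def translate(coords: array, dx: float, dy: float, dz: float) -> None:
    """Shift a flat x, y, z, x, y, z, ... coordinate block in place."""
    for i in range(0, len(coords), 3):
        coords[i] += dx
        coords[i + 1] += dy
        coords[i + 2] += dz

# One contiguous block per field, as in the vertical organization.
xyz = array("f", [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0])
charges = array("f", [-0.5, 0.25, 0.25])   # separate block, never touched below

translate(xyz, 1.0, 2.0, 3.0)
print(list(xyz[:3]))   # [1.0, 2.0, 3.0]
```

In a record-based layout the same loop would have to step over the charge (and every other field) of each atom to reach the next set of coordinates.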




Copyright © 2011 SimBioSys Inc., All rights reserved.