Molecule File Format Problems
1. Specificity
Most molecular file formats were designed with one particular
application in mind. For example, the PDB format was designed to store
protein data, typically obtained from X-ray crystallography studies.
Hence it has special fields for temperature factor and occupancy but
cannot express bond orders, and residue connectivity is implied by the
labels rather than stated explicitly. Similarly, most formats cannot
express certain types of information while offering fields for very
specific information of limited use.
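To make the PDB example concrete, here is a minimal sketch that builds and
parses one fixed-column ATOM record. The helper names (make_atom_line,
parse_atom_line) and the sample values are made up for illustration; the
column positions follow the published PDB ATOM record layout.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z, occ, temp):
    """Build a 66-column PDB ATOM record (helper name is hypothetical)."""
    return (f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{temp:6.2f}")


def parse_atom_line(line):
    """Slice the fixed columns; note there is no column for bond orders."""
    return {
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),    # crystallography-specific field
        "temp_factor": float(line[60:66]),  # crystallography-specific field
    }


line = make_atom_line(1, " CA ", "ALA", "A", 1, 11.104, 6.134, -6.504, 1.0, 20.0)
atom = parse_atom_line(line)
```

Every ATOM record carries occupancy and temperature factor, whether or not
the application needs them, while bond orders have no place in the record.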
2. Unused (optional) fields occupy memory
Some formats offer optional fields to express information that is not
always present or required. However, these fields still occupy memory
and disk space in traditional atom-record based formats.
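A rough way to see the cost, assuming a hypothetical binary atom record
with three coordinates followed by three optional values (partial charge,
occupancy, temperature factor) that are always stored:

```python
import struct

# Hypothetical record layouts, chosen only for illustration.
FULL_RECORD = struct.Struct("<3f3f")  # x, y, z + three optional floats
COORDS_ONLY = struct.Struct("<3f")    # x, y, z


def record_file_size(n_atoms):
    """Bytes used when every atom carries every field, used or not."""
    return n_atoms * FULL_RECORD.size


def coords_only_size(n_atoms):
    """Bytes used when only the coordinates are stored."""
    return n_atoms * COORDS_ONLY.size


n_atoms = 1000
wasted = record_file_size(n_atoms) - coords_only_size(n_atoms)
print(f"{wasted} bytes spent on fields that may hold no data")
```

In this made-up layout, half of each record is reserved for optional
values regardless of whether the file actually contains them.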
3. The version problem
Extensions in new versions of a format render earlier programs
obsolete. When new fields are added, data files have to be converted
and all software has to be upgraded. Version compatibility within the
same format becomes a nightmare.
4. Information loss in conversions between formats
Due to the specificity described above, converting from one format to
another generally loses some information, namely whatever was specific
to the source format and has no place in the target format.
Furthermore, information specific to the target format has to be
generated, computed, estimated, often guesstimated, or simply assigned
a meaningless dummy value. Therefore, converting back and forth will
hardly ever produce a file equivalent to the original.
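The round-trip problem can be sketched with two made-up formats, A and B,
that share some fields but not others. The field lists and the convert
helper are assumptions for illustration, not real converters.

```python
DUMMY = 0.0  # placeholder written when the source has no value for a field

# Hypothetical field sets for two record formats.
FORMAT_A = ("element", "x", "y", "z", "occupancy", "temp_factor")
FORMAT_B = ("element", "x", "y", "z", "partial_charge")


def convert(atom, target_fields):
    """Keep fields the target knows; fill everything else with a dummy."""
    return {name: atom.get(name, DUMMY) for name in target_fields}


original = {"element": "C", "x": 1.0, "y": 2.0, "z": 3.0,
            "occupancy": 0.5, "temp_factor": 17.2}

as_b = convert(original, FORMAT_B)      # occupancy, temp_factor dropped
round_trip = convert(as_b, FORMAT_A)    # both come back as dummy values

print(round_trip == original)
```

The shared fields survive, but the values specific to format A are gone
after the detour through format B, and format B received a partial charge
that nobody computed.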
5. Deviations from the standard to add information lead to incompatibility
Companies and software packages often try to add information to file
formats that do not support it, placing it in comments or hacking it
into other, unused fields. This leads to incompatibility problems
among software packages and companies.
Why can't we make a Super-Format?
One possible solution would be to come up with a super-format that
contains all field types present in any existing format, plus anything
else we can think of that might become useful in the future. Such a
format would have the following problems:
- Huge size (all rarely used fields present for every atom and bond).
- Inefficient record handling: accessing specific data (e.g. the
  coordinates of atoms) requires the software to page through (skip)
  large amounts of other data, causing very bad cache performance.
- Further extension problem: it is inevitable that new fields will need
  to be added later as new technologies emerge. The extension would
  face all the incompatibility problems mentioned above.
Solution: Tagged Molecular Format (TMF)
Data structure
- Same idea as the TIFF graphics format: information is packaged by
  type/kind, and a tag explains its format, its size, and what it is
  for. The header of the format lists the tags that are present in the
  particular file (with a pointer to each for fast access).
- Vertical data organization: imagine the traditional atom records
  placed in a matrix, each row representing an atom and the columns
  representing the various fields. TMF stores the same data by vertical
  columns instead of horizontal atom records, i.e. the data of one
  particular field for all atoms/bonds is packed into one information
  block with a descriptor tag.
- All data fields are optional: the vertical organization allows each
  data field to be optional, and the header lists which tags are
  present in the actual file.
- Field dimensionality and type are described in tags: coordinate data
  may be 2D or 3D, and charge data can be integer or floating point
  (representing partial charges). To accommodate all these variations,
  the tag describes the elementary data type and the dimensionality as
  well as the total number of units of data.
- Multiple data fields of the same kind are allowed, e.g. conformers
  can be stored as multiple 3D coordinate tags without repeating all
  the other information units.
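The byte layout of TMF is not given here, so the following is only a
sketch of the idea under invented assumptions: the magic string "TMF0",
the tag ids, and the field widths of the header and directory entries are
all hypothetical. It shows a tag directory with type, dimensionality,
count, and offset, followed by one packed column per tag.

```python
import struct

HEADER = struct.Struct("<4sI")   # magic, number of tags (assumed layout)
ENTRY = struct.Struct("<HcBII")  # tag id, element type, dimension, count, offset

TAG_COORDS = 1  # hypothetical tag ids
TAG_CHARGE = 2


def pack_tmf(blocks):
    """blocks: list of (tag_id, type_code, dim, flat_values) -> bytes."""
    dir_size = HEADER.size + ENTRY.size * len(blocks)
    directory, payload = [], b""
    for tag_id, type_code, dim, values in blocks:
        offset = dir_size + len(payload)  # pointer to this column's data
        directory.append(
            ENTRY.pack(tag_id, type_code.encode(), dim, len(values) // dim, offset))
        payload += struct.pack(f"<{len(values)}{type_code}", *values)
    return HEADER.pack(b"TMF0", len(blocks)) + b"".join(directory) + payload


def read_directory(buf):
    """Return {tag_id: [entry, ...]}; a tag may occur more than once."""
    _magic, n_tags = HEADER.unpack_from(buf, 0)
    entries = {}
    for i in range(n_tags):
        tag_id, type_code, dim, count, offset = ENTRY.unpack_from(
            buf, HEADER.size + i * ENTRY.size)
        entries.setdefault(tag_id, []).append((type_code.decode(), dim, count, offset))
    return entries


def read_block(buf, entry):
    """Decode one column into a list of dim-sized tuples."""
    type_code, dim, count, offset = entry
    flat = struct.unpack_from(f"<{count * dim}{type_code}", buf, offset)
    return [flat[i:i + dim] for i in range(0, len(flat), dim)]


# Two atoms, two conformers (two 3D coordinate tags), integer charges.
buf = pack_tmf([
    (TAG_COORDS, "f", 3, [0.0, 0.0, 0.0, 1.5, 0.0, 0.0]),
    (TAG_COORDS, "f", 3, [0.0, 0.0, 0.0, 0.0, 1.5, 0.0]),
    (TAG_CHARGE, "i", 1, [0, -1]),
])
entries = read_directory(buf)
conformers = [read_block(buf, e) for e in entries[TAG_COORDS]]
charges = read_block(buf, entries[TAG_CHARGE][0])
```

A reader interested only in charges jumps straight to that block via the
directory offset and never touches the coordinate data.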
Advantages over traditional records
- Very wide range of applications.
- Backward and forward compatibility: old software looks for the
  specific tags it needs and ignores the rest, hence it will still work
  when new tag types are introduced.
- Naturally extensible: new tag types can be introduced without
  affecting the validity of existing programs and data files, so no
  conversion or upgrade is necessary.
- Storage efficiency: no storage space is wasted on optional data that
  is not actually present in the given file.
- Performance gain from cache locality: the same type of information is
  stored in a contiguous chunk of memory, so processing a given
  information type (e.g. 3D coordinates when applying a transformation)
  performs better because more of the relevant data fits in each cache
  line.
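The compatibility claim can be illustrated with an in-memory stand-in for
a tagged file, where each tag name maps to one column of data. The tag
names and the old_reader function are hypothetical, chosen only for this
sketch.

```python
# In-memory stand-in for a tagged file: tag name -> one column of data.
old_file = {
    "element": ["C", "O"],
    "coords3d": [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)],
}
# A newer writer adds a tag that did not exist when old_reader was written.
new_file = dict(old_file, hypothetical_new_tag=[0.1, 0.9])


def old_reader(tagged):
    """Reads only the tags it needs and ignores everything else."""
    needed = ("element", "coords3d")
    missing = [tag for tag in needed if tag not in tagged]
    if missing:
        raise KeyError(f"required tags missing: {missing}")
    return list(zip(tagged["element"], tagged["coords3d"]))


atoms_from_old = old_reader(old_file)
atoms_from_new = old_reader(new_file)  # unknown tag is simply skipped
```

The old reader returns the same atoms from both files, and a file that
omits an optional tag costs nothing for it. Python dictionaries say
nothing about cache behaviour; the sketch covers only the tag lookup.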