
General Frequently Asked Questions about PDBx/mmCIF

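  1. The PDB-format data distributed by the PDB for each entry follow the specification given in Contents Guide version 3.30 (November 21, 2012). The PDB format specification will not be modified or extended to accommodate new content.
  2. Very large structures (more than 62 chains or more than 99999 atoms) cannot be fully represented in the PDB format, but the PDB provides the data for each such structure as a single PDBx/mmCIF file. For these entries the PDB also provides a bundle file: a TAR archive of PDB-format files that describe the entry as far as the format allows.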
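The size limits quoted above can be written as a simple check. This is only a sketch: the function and constant names are hypothetical, and the thresholds are the two figures stated in the text (62 chains, 99999 atoms).

```python
# Limits quoted in the text above for the legacy PDB format.
MAX_PDB_CHAINS = 62
MAX_PDB_ATOMS = 99999


def fits_legacy_pdb(n_chains: int, n_atoms: int) -> bool:
    """Return True if a structure of this size fits in one PDB-format file."""
    return n_chains <= MAX_PDB_CHAINS and n_atoms <= MAX_PDB_ATOMS


if __name__ == "__main__":
    # Illustrative sizes only; they do not refer to specific PDB entries.
    print(fits_legacy_pdb(4, 5000))
    print(fits_legacy_pdb(70, 150000))
```

A structure that fails this check is still a single PDBx/mmCIF file; only its PDB-format representation has to be split into a bundle.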
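  1. PDBx/mmCIF became the standard PDB format in 2014.
  2. All PDB deposition processing, at every wwPDB site, is carried out in PDBx/mmCIF format.
  3. PDBx/mmCIF is organized into "categories", each holding its information as a table of keyword-value pairs.
  4. The relationships between PDBx/mmCIF categories are explicitly defined.
  5. The PDBx/mmCIF format places no limit on the number of atoms, residues, or chains. Everything can be represented as a single PDB entry (in other words, the format never forces an entry to be split!).
  6. Every data item in a PDBx/mmCIF file is strictly defined by the PDBx Exchange Dictionary. The contents of the dictionary are fully accessible to software.
  7. All data present in a PDB-format file are also present in the corresponding PDBx/mmCIF file.
  8. Every monomer and ligand that appears in any PDB entry is defined in the PDB Chemical Component Dictionary, which is itself written in PDBx/mmCIF format.
  9. PDBx/mmCIF is supported by structure visualization software such as Jmol, Chimera, and OpenRasMol, and by structure determination systems such as CCP4 and Phenix.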
  1. The format is based on a context-free grammar. PDBx/mmCIF has a simple grammar. Data are presented in either key-value or tabular form. It is much easier to parse than the record-oriented PDB format. Say good-bye to "exception" handling when reading old-style PDB flat files!
  2. There are no column width limitations.
  3. All relationships between common data items (e.g. atom and residue identifiers) are explicitly documented within the PDBx Exchange Dictionary. This permits software applications to evaluate and validate referential integrity with any PDB entry.
  4. The mmCIF/PDBx Exchange Dictionary provides metadata (e.g. data types, allowed ranges, controlled vocabularies) which can be used to generate a validating mmCIF parser or a database loader.
  5. Parsing tools are available in most popular languages (e.g. C/C++, Java, Python, Perl, FORTRAN) and toolkits (e.g. BioJava and BioPython).
  6. Mapping information between the residue sequences of the experimental sample and the model coordinates is included within each entry.
  7. PDB Chemical reference data are maintained and distributed in PDBx/mmCIF format.
Plans for more PDB-friendly PDBx/mmCIF ATOM records
  • All records on a single text line
  • Columns presented in standard column order.
  • Tabular presentation with leading record names (e.g. ATOM, CELL, REFINE)
  • Method independent features in left-most columns (e.g. identifiers & coordinates)
  • Method specific features in the right-most columns (e.g. ADPs, NMR order/disorder parameters)
  • Continue to support PDB nomenclature semantics (e.g. PDB style chains, residue numbering, and insertion codes)

The following examples show ATOM records in the current PDB format and in the proposed stylized PDBx/mmCIF format. In the PDBx/mmCIF example, the column order places the chain, residue, and atom nomenclature items in the left-most columns. Data items that depend on the experimental method (e.g. occupancy, B-value) are placed in columns to the right. All of the items of an atom record in the PDBx/mmCIF example are placed on a single text line and are white-space delimited.

Example of Record-oriented PDB Format ATOM Records
ATOM      1  N   GLN A  39      24.690 -27.754  24.275  1.00 60.76           N
ATOM      2  CA  GLN A  39      23.581 -26.768  24.416  1.00 60.98           C
ATOM      3  C   GLN A  39      23.990 -25.379  23.905  1.00 59.98           C
ATOM      4  O   GLN A  39      25.070 -25.209  23.330  1.00 60.25           O
ATOM      5  CB  GLN A  39      23.136 -26.685  25.878  1.00 60.69           C
ATOM      6  N   VAL A  40      23.115 -24.395  24.122  1.00 59.58           N
ATOM      7  CA  VAL A  40      23.342 -23.010  23.690  1.00 57.26           C
ATOM      8  C   VAL A  40      24.000 -22.152  24.778  1.00 56.00           C
ATOM      9  O   VAL A  40      23.992 -20.920  24.692  1.00 55.53           O
ATOM     10  CB  VAL A  40      22.015 -22.337  23.275  1.00 57.32           C
Example of PDBx/mmCIF ATOM Records (atom_site category)
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.auth_atom_id
_atom_site.type_symbol
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.pdbx_PDB_model_num
_atom_site.occupancy
_atom_site.pdbx_auth_alt_id
_atom_site.B_iso_or_equiv
ATOM  1 N  N GLN A 39 24.690 -27.754 24.275 1 1.000 . 60.760
ATOM  2 CA C GLN A 39 23.581 -26.768 24.416 1 1.000 . 60.980
ATOM  3 C  C GLN A 39 23.990 -25.379 23.905 1 1.000 . 59.980
ATOM  4 O  O GLN A 39 25.070 -25.209 23.330 1 1.000 . 60.250
ATOM  5 CB C GLN A 39 23.136 -26.685 25.878 1 1.000 . 60.690
ATOM  6 N  N VAL A 40 23.115 -24.395 24.122 1 1.000 . 59.580
ATOM  7 CA C VAL A 40 23.342 -23.010 23.690 1 1.000 . 57.260
ATOM  8 C  C VAL A 40 24.000 -22.152 24.778 1 1.000 . 56.000
ATOM  9 O  O VAL A 40 23.992 -20.920 24.692 1 1.000 . 55.530
ATOM 10 CB C VAL A 40 22.015 -22.337 23.275 1 1.000 . 57.320
ATOM 11 N  N ALA A 41 24.560 -22.804 25.797 1 1.000 . 54.570
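To make the contrast between the two layouts concrete, here is a minimal sketch that reads the same atom (serial 2, CA of GLN 39) in both styles. The function names are hypothetical. The PDB reader slices fixed column ranges, assuming the standard column positions of the PDB ATOM record, while the PDBx/mmCIF reader just splits on white space and pairs each value with its column name.

```python
# The same atom in both styles, taken from the examples above.
PDB_LINE = "ATOM      2  CA  GLN A  39      23.581 -26.768  24.416  1.00 60.98           C"

CIF_NAMES = [
    "group_PDB", "id", "auth_atom_id", "type_symbol", "auth_comp_id",
    "auth_asym_id", "auth_seq_id", "Cartn_x", "Cartn_y", "Cartn_z",
    "pdbx_PDB_model_num", "occupancy", "pdbx_auth_alt_id", "B_iso_or_equiv",
]
CIF_ROW = "ATOM 2 CA C GLN A 39 23.581 -26.768 24.416 1 1.000 . 60.980"


def parse_pdb_atom(line):
    """Record-oriented style: every field lives at a fixed column range."""
    return {
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def parse_cif_row(names, row):
    """Tabular style: split on white space, then label by column name.

    Assumes no value in the row is quoted with embedded white space.
    """
    return dict(zip(names, row.split()))


if __name__ == "__main__":
    print(parse_pdb_atom(PDB_LINE))
    print(parse_cif_row(CIF_NAMES, CIF_ROW))
```

The fixed-column reader breaks as soon as a value outgrows its column range, which is the limitation the white-space delimited layout avoids.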

PDB entries in PDBx/mmCIF format are stored on the ftp sites of the wwPDB partners at one of the locations:

Entries containing very large structures in PDBx/mmCIF format are currently stored separately at one of the locations:

The PDBx/mmCIF format files are named following the convention <PDB_4-LETTER-ID_CODE>.cif.gz (e.g. 1abc.cif.gz). Experimental data files containing X-ray structure factors are only distributed in PDBx/mmCIF format and are named following an older PDB naming convention r<PDB_ID_CODE>sf.ent.gz (e.g. r1abcsf.ent.gz).
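The naming convention can be sketched as two small helpers. The function names are hypothetical, and lower-casing the ID is an assumption based on the lowercase examples above.

```python
def coordinate_file_name(pdb_id: str) -> str:
    """File name of the PDBx/mmCIF coordinate file for a 4-letter PDB ID."""
    return f"{pdb_id.lower()}.cif.gz"


def structure_factor_file_name(pdb_id: str) -> str:
    """File name of the structure factor file (older r<ID>sf.ent.gz convention)."""
    return f"r{pdb_id.lower()}sf.ent.gz"


if __name__ == "__main__":
    print(coordinate_file_name("1abc"))
    print(structure_factor_file_name("1abc"))
```

Both kinds of file are gzip-compressed text, so once downloaded they can be opened with `gzip.open(name, "rt")` from the Python standard library.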

A complete description of the download options for PDB data files is maintained here by the wwPDB. A description of the special handling of PDB entries containing very large structures is available here.

The PDBx/mmCIF format has a simple appearance with only a few syntax elements. All of the syntax elements used in PDBx data files are shown in the following snippet describing a polymer sequence.

The essential syntax features include:

  • All data items are identified by name and begin with the underscore character, e.g. _entity_poly.entity_id.
  • Data item names can be decomposed into a category name and an attribute name separated by a period: _category.attribute.
  • Data categories are presented in two styles: key-value and tabular. In the example, categories entity_name_com and entity_poly both use the key-value style and the entity_poly_seq category uses the tabular style. In the tabular style, the data item names corresponding to the table columns follow a reserved loop_ token and are in turn followed by rows of white-space delimited data values.
  • Any character data value may be quoted using encapsulating single or double quotes; however, character values containing internal whitespace (e.g. the value of _entity_name_com.name) must be quoted. Character values that extend over multiple lines are quoted using leading and trailing semi-colons positioned at the first character position of the records surrounding the multi-line character value (e.g. _entity_poly.pdbx_seq_one_letter_code).
  • Lines beginning with the hash symbol # are comments.

Look here for a more complete description of PDBx/mmCIF data file and dictionary syntax.

#  <-- a comment line 
_entity_name_com.entity_id  1
_entity_name_com.name       "Pantoate--beta-alanine ligase, Pantoate-activating enzyme"
 
_entity_poly.entity_id                      1 
_entity_poly.type                           'polypeptide(L)' 
_entity_poly.nstd_linkage                   no 
_entity_poly.nstd_monomer                   no 
_entity_poly.pdbx_seq_one_letter_code       
;AMAIPAFHPGELNVYSAPGDVADVSRALRLTGRRVMLVPTMGALHEGHLALVRAAKRVPGSVVVVSIFVNPMQFGAGGDL
DAYPRTPDDDLAQLRAEGVEIAFTPTTAAMYPDGLRTTVQPGPLAAELEGGPRPTHFAGVLTVVLKLLQIVRPDRVFFGE
KDYQQLVLIRQLVADFNLDVAVVGVPTVREADGLAMSSRNRYLDPAQRAAAVALSAALTAAAHAATAGAQAALDAARAVL
DAAPGVAVDYLELRDIGLGPMPLNGSGRLLVAARLGTTRLLDNIAIEIGTFAGTDRPDGYR
;

# 
loop_
_entity_poly_seq.entity_id 
_entity_poly_seq.num 
_entity_poly_seq.mon_id 
_entity_poly_seq.hetero 
1 1   ALA n 
1 2   MET n 
1 3   ALA n 
1 4   ILE n 
1 5   PRO n 
1 6   ALA n 
1 7   PHE n 
# ....  abbreviated ....
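The syntax elements listed above are few enough that a toy reader fits in a short script. The following is a sketch only, not a validating parser: the names `tokenize` and `parse` are hypothetical, it ignores data block headers and save frames, and it assumes well-formed input. The sample is a shortened copy of the snippet above (the sequence is cut to 40 residues).

```python
import re

# A quoted value ends at a quote followed by white space or end of line.
TOKEN = re.compile(r"""'(.*?)'(?=\s|$)|"(.*?)"(?=\s|$)|(\S+)""")

SAMPLE = """\
# a comment line
_entity_name_com.entity_id  1
_entity_name_com.name       "Pantoate--beta-alanine ligase, Pantoate-activating enzyme"

_entity_poly.entity_id      1
_entity_poly.type           'polypeptide(L)'
_entity_poly.pdbx_seq_one_letter_code
;AMAIPAFHPGELNVYSAPGD
VADVSRALRLTGRRVMLVPT
;
#
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
_entity_poly_seq.hetero
1 1   ALA n
1 2   MET n
1 3   ALA n
"""


def tokenize(text):
    """Yield (kind, value) pairs; kind is 'name', 'loop', or 'value'."""
    lines = iter(text.splitlines())
    for line in lines:
        if line.startswith(";"):
            # Multi-line value: collect lines up to the closing ';' line.
            parts = [line[1:]]
            for line in lines:
                if line.startswith(";"):
                    break
                parts.append(line)
            yield "value", "\n".join(parts).strip()
            continue
        for match in TOKEN.finditer(line):
            single, double, bare = match.groups()
            if bare is None:
                yield "value", single if single is not None else double
            elif bare.startswith("#"):
                break  # the rest of the line is a comment
            elif bare.startswith("_"):
                yield "name", bare
            elif bare.lower() == "loop_":
                yield "loop", bare
            elif bare.lower().startswith("data_"):
                continue  # data block header, ignored in this sketch
            else:
                yield "value", bare


def parse(text):
    """Return {category: [row_dict, ...]} for key-value and tabular data."""
    tokens = list(tokenize(text))
    tables = {}
    i = 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == "loop":
            i += 1
            names, values = [], []
            while i < len(tokens) and tokens[i][0] == "name":
                names.append(tokens[i][1])
                i += 1
            while i < len(tokens) and tokens[i][0] == "value":
                values.append(tokens[i][1])
                i += 1
            category = names[0][1:].split(".", 1)[0]
            attrs = [name.split(".", 1)[1] for name in names]
            width = len(attrs)
            rows = [dict(zip(attrs, values[j:j + width]))
                    for j in range(0, len(values), width)]
            tables.setdefault(category, []).extend(rows)
        elif kind == "name":
            # Key-value style: one name followed by exactly one value.
            category, attr = value[1:].split(".", 1)
            tables.setdefault(category, [{}])[0][attr] = tokens[i + 1][1]
            i += 2
        else:
            i += 1
    return tables


if __name__ == "__main__":
    for category, rows in parse(SAMPLE).items():
        print(category, rows)
```

Both presentation styles end up in the same shape, a list of rows per category, which reflects the point made above that key-value and tabular data differ only in layout. For real work, one of the existing parsing toolkits is the better choice.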
The PDBx/mmCIF data files produced by the wwPDB conform to both the CIF 1.0 and 1.1 syntax specifications. The current syntax specification for CIF 1.1 is maintained at the IUCr CIF site.

Yes, the atom coordinate records in the PDBx/mmCIF data distributed by the wwPDB are stored on individual lines, each beginning with either 'ATOM' or 'HETATM'. The elements of each coordinate record are white-space delimited. For example, PDBx/mmCIF coordinate records in PDB entries all have the following regular layout.

loop_
_atom_site.group_PDB 
_atom_site.id 
_atom_site.type_symbol 
_atom_site.label_atom_id 
_atom_site.label_alt_id 
_atom_site.label_comp_id 
_atom_site.label_asym_id 
_atom_site.label_entity_id 
_atom_site.label_seq_id 
_atom_site.pdbx_PDB_ins_code 
_atom_site.Cartn_x 
_atom_site.Cartn_y 
_atom_site.Cartn_z 
_atom_site.occupancy 
_atom_site.B_iso_or_equiv 
_atom_site.Cartn_x_esd 
_atom_site.Cartn_y_esd 
_atom_site.Cartn_z_esd 
_atom_site.occupancy_esd 
_atom_site.B_iso_or_equiv_esd 
_atom_site.pdbx_formal_charge 
_atom_site.auth_seq_id 
_atom_site.auth_comp_id 
_atom_site.auth_asym_id 
_atom_site.auth_atom_id 
_atom_site.pdbx_PDB_model_num 
ATOM   1    N  N   . VAL A 1 1   ? 6.204   16.869  4.854   1.00 49.05 ? ? ? ? ? ? 1   VAL A N   1 
ATOM   2    C  CA  . VAL A 1 1   ? 6.913   17.759  4.607   1.00 43.14 ? ? ? ? ? ? 1   VAL A CA  1 
ATOM   3    C  C   . VAL A 1 1   ? 8.504   17.378  4.797   1.00 24.80 ? ? ? ? ? ? 1   VAL A C   1 
ATOM   4    O  O   . VAL A 1 1   ? 8.805   17.011  5.943   1.00 37.68 ? ? ? ? ? ? 1   VAL A O   1 
ATOM   5    C  CB  . VAL A 1 1   ? 6.369   19.044  5.810   1.00 72.12 ? ? ? ? ? ? 1   VAL A CB  1 
ATOM   6    C  CG1 . VAL A 1 1   ? 7.009   20.127  5.418   1.00 61.79 ? ? ? ? ? ? 1   VAL A CG1 1 
ATOM   7    C  CG2 . VAL A 1 1   ? 5.246   18.533  5.681   1.00 80.12 ? ? ? ? ? ? 1   VAL A CG2 1 
ATOM   8    N  N   . LEU A 1 2   ? 9.096   18.040  3.857   1.00 26.44 ? ? ? ? ? ? 2   LEU A N   1 
ATOM   9    C  CA  . LEU A 1 2   ? 10.600  17.889  4.283   1.00 26.32 ? ? ? ? ? ? 2   LEU A CA  1 
ATOM   10   C  C   . LEU A 1 2   ? 11.265  19.184  5.297   1.00 32.96 ? ? ? ? ? ? 2   LEU A C   1 
ATOM   11   O  O   . LEU A 1 2   ? 10.813  20.177  4.647   1.00 31.90 ? ? ? ? ? ? 2   LEU A O   1 
ATOM   12   C  CB  . LEU A 1 2   ? 11.099  18.007  2.815   1.00 29.23 ? ? ? ? ? ? 2   LEU A CB  1 
ATOM   13   C  CG  . LEU A 1 2   ? 11.322  16.956  1.934   1.00 37.71 ? ? ? ? ? ? 2   LEU A CG  1 
ATOM   14   C  CD1 . LEU A 1 2   ? 11.468  15.596  2.337   1.00 39.10 ? ? ? ? ? ? 2   LEU A CD1 1 
ATOM   15   C  CD2 . LEU A 1 2   ? 11.423  17.268  0.300   1.00 37.47 ? ? ? ? ? ? 2   LEU A CD2 1 

The following command will extract the PDB atom record name, atom name, residue name, chain identifier, residue number, Cartesian X, Y, and Z coordinates from the above snippet of PDBx/mmCIF coordinate data for PDB entry 4HHB.

    grep '^ATOM' 4HHB.cif | awk '{print $1, $25, $23, $24, $22, $11, $12, $13}'
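The awk command addresses columns by position, so it silently returns the wrong fields if a file orders its atom_site items differently. As a sketch of an alternative, the following reads the column names from the loop header and selects fields by name. The function name `atom_records` is hypothetical, the sample is the header plus the first two rows of the snippet above, and the reader assumes that no atom_site value contains embedded white space.

```python
SAMPLE = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.Cartn_x_esd
_atom_site.Cartn_y_esd
_atom_site.Cartn_z_esd
_atom_site.occupancy_esd
_atom_site.B_iso_or_equiv_esd
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
ATOM   1    N  N   . VAL A 1 1   ? 6.204   16.869  4.854   1.00 49.05 ? ? ? ? ? ? 1   VAL A N   1
ATOM   2    C  CA  . VAL A 1 1   ? 6.913   17.759  4.607   1.00 43.14 ? ? ? ? ? ? 1   VAL A CA  1
"""

# Same fields, in the same order, as the awk command above.
FIELDS = ["group_PDB", "auth_atom_id", "auth_comp_id", "auth_asym_id",
          "auth_seq_id", "Cartn_x", "Cartn_y", "Cartn_z"]


def atom_records(lines):
    """Yield one dict per ATOM/HETATM row, keyed by atom_site attribute name."""
    names = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("_atom_site."):
            # Header line: remember the attribute name for this column.
            names.append(stripped.split(".", 1)[1])
        elif names and stripped.startswith(("ATOM", "HETATM")):
            yield dict(zip(names, stripped.split()))


if __name__ == "__main__":
    for record in atom_records(SAMPLE.splitlines()):
        print(" ".join(record[field] for field in FIELDS))
```

The generator accepts any iterable of lines, so it can also be fed an open file object, for example one returned by `gzip.open` in text mode for a downloaded `.cif.gz` file.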
            
Coordinate data are recorded in the PDBx/mmCIF ATOM_SITE data category. This brief tutorial describes the PDBx/mmCIF representation of coordinate data and its relationship to the coordinate data items of the PDB format.
This brief tutorial describes the PDBx/mmCIF representation of polymer and non-polymer molecular entities.
The PDBx/mmCIF data categories used in the Chemical Component Dictionary are collected in the CHEM_COMP_DICTIONARY category group.
The PDBx/mmCIF data categories and items used in the Biologically Interesting molecule Reference Dictionary (BIRD) are collected in the BIRD_DICTIONARY category group.