• Show last search results

Advanced search

  • PubChem Compounds
  • RCSB Protein Data Bank
  • Crystallography Open Database
  • Information
  • Structural formula image
  • 3D model image
  • Chemical data

Information card

Spectroscopy.

  • 3D model source
  • Substructure
  • Superstructure

Representation

  • Ball and Stick
  • van der Waals Spheres

Crystallography

  • Load unit cell
  • Load 2×2×2 supercell
  • Load 1×3×3 supercell

Show bio assembly

Chain representation.

  • Cylinder and plate
  • B-factor tube
  • C-alpha trace
  • Chain color scheme
  • Secondary structure

High Quality

Calculations.

  • MEP surface lucent
  • MEP surface opaque
  • Bond dipoles
  • Overall dipole
  • Energy minimization

Measurement

Formula
Molecular weight
Hydrogen bond donors
Hydrogen bond acceptors

Percent composition

Update (August 2024): We created a brand new app at molview.com , featuring a new sketcher and a new viewer. Let us know what you think!

By closing this popup and using MolView, you agree to the Terms of Use .

We need your support to create more cool stuff! Donate

MolView is an intuitive web-application to make science and education more awesome! MolView is mainly intended as web-based data visualization platform. You can use MolView to search through different scientific databases including compound databases, protein databases and spectral databases, and view records from these databases as interactive visualizations using WebGL and HTML5 technologies. This web application is built on top of the JavaScript libraries and online services listed below. The Virtual Model Kit has been a source of inspiration for the birth of this project.

  • Ketcher : Chemical 2D data reader/writer
  • GLmol v0.47 : primary 3D render engine
  • JSmol : 3D render engine
  • ChemDoodle Web Components v6.0.1 : 3D render engine and spectrum display
  • NCI/CADD Chemical Identifier Resolver
  • RCSB Protein Data Bank (~100.000 macromolecules)
  • The PubChem Project (~51 million compounds)
  • Crystallography Open Database (~300.000 crystals)
  • NIST Chemistry WebBook (~30.000 spectra)
  • NMR Database

MolView v2.4  •  Terms of Use  •  Copyright © 2014-2023 Herman Bergwerf MolView Blog  •  YouTube Channel

Click one of the subjects below to learn more. You can also watch some videos on YouTube to get started.

MolView consists of two main parts, a structural formula editor and a 3D model viewer. The structural formula editor is surround by three toolbars which contain the tools you can use in the editor. Once you’ve drawn a molecule, you can click the 2D to 3D button to convert the molecule into a 3D model which is then displayed in the viewer. Below is a list of all sketch tools.

Top toolbar

Top toolbar

  • Trash: clear the entire canvas
  • Eraser: erase atoms, bonds or the current selection
  • Undo/redo: undo or redo your recent changes
  • Drag: move the entire molecule (you can already use the left mouse button for this)
  • Rectangle select: select atoms and bonds using a rectangular selection area
  • Lasso select: select atoms and bonds by drawing a freehand selection area
  • Color mode: display atoms and bonds using colors
  • Full mode: displays all C and H atoms instead of skeletal display
  • Center: centers the whole molecule
  • Clean: cleans the structural formula using an external service
  • 2D to 3D: converts the structural formula into a 3D model

Left toolbar

Left toolbar

  • Bonds: pick one of the bond types (single, double, triple, up, down) and add or modify bonds
  • Fragments: pick one of the fragments (benzene, cyclopropane, etc.) and add fragments
  • Chain: create a chain of carbon atoms
  • Charge: increment (+) or decrement (-) the charge of atoms

Right toolbar

Right toolbar

In this toolbar you can select from a number of elements, you can also pick an element from the periodic table using the last button. You can use the element to create new atoms or modify existing atoms.

Search bar

You can load molecules from large databases like PubChem and RCSB using the search form located on the left side of the menu-bar. Just type what you are looking for and a list of available molecules will appear.

You can also click on the dropdown button next to the search field to select a specific database. This will perform a more extensive search on the selected database.

The Tools menu contains several utility functions which are listed below.

You can embed a specific compound, macromolecule or crystal using the provided URL or HTML code. Note that the linked structure is the one which is currently displayed in the model window. You can also copy the URL from the address bar in order to link to the current structure.

Export options:

  • Structural formula image: sketcher snapshot (PNG with alpha channel)
  • 3D model image: model snapshot (PNG, alpha channel in Glmol and ChemDoodle)
  • MOL file: exports a MDL Molfile from the 3D model (common molecules)
  • PDB file: exports a Protein Data Bank file from the 3D model (macromolecules)
  • CIF file: exports a Crystallographic Information File from the 3D model (crystal structures)

This collects and displays information about the structural formula.

This shows a new layer where you can view molecular spectra of the current structural formula (loaded from the Sketcher) More details are covered in the Spectroscopy chapter.

3D model resource

This redirects you to the web-page for the current 3D model on the website of its source database (except when the model is resolved using the Chemical Identifier Resolver)

These functions allow you to perform some advanced searches through the PubChem database using the structural formula from the sketcher.

  • Similarity search: search for compounds with a similar structural formula
  • Substructure search: search for compounds with the current structure as subset
  • Superstructure search: search for compounds with the current structure as superset

You can open the Spectroscopy view via Tools > Spectroscopy . You can view three kinds of molecular spectra.

  • Mass spectrum
  • IR spectrum
  • H1-NMR prediction

Export data

You can also export different kinds of data from the currently selected spectrum.

  • PNG image: snapshot from interactive spectrum
  • JCAMP file: JCAMP-DX file of the current spectrum

The Model menu contains some general functions for the 3D model.

This function sets the model position, zoom and rotation back to default.

You can choose from a list of different molecule representations including; ball and stick, stick, van der Waals spheres, wireframe and lines. Macromolecules are automatically drawn using ribbons.

You can switch between a black, gray or white background. The default background is black (exported images from GLmol or ChemDoodle have a transparent background)

You can choose from three different render engines: GLmol , Jmol and ChemDoodle . GLmol is used as default render engine. GLmol and ChemDoodle are based on WebGL, a browser technology to support 3D graphics. If WebGL is not available in your browser, Jmol will be used for all rendering.

MolView automatically switches to:

  • Jmol if you execute functions from the Jmol menu
  • GLmol if you load macromolecules (due to significant higher performance)
  • ChemDoodle if you load a crystal structure (GLmol cannot render crystal structures)

You might want to switch back to GLmol when you do no longer need Jmol or ChemDoodle since GLmol has a better performance.

Note that macromolecules are drawn slightly different in each engine. ChemDoodle provides the finest display. You should, however, avoid using ChemDoodle for very large macromolecules.

Model transformation

You can rotate, pan and zoom the 3D model. Use the right button for rotation, the middle button for translation (except for ChemDoodle) and the scrollwheel for zooming. On touch devices, you can rotate the model with one finger and scale the model using two fingers.

You can load an array of crystal cells (2x2x2 or 1x3x3) or a single unit cell when viewing crystal structures.

Fog and clipping

When you are viewing large structures, like proteins, it can be useful to hide a certain part using fog or a clipping plane. GLmol offers a few options to do this.

  • Fog: you can move the fog forward by dragging the mouse up while holding CTRL + SHIFT (drag in the opposite direction to move the fog backward)
  • Clipping plane: you can move a frontal clipping plane into the structure by dragging the mouse to the left while holding CTRL + SHIFT (drag in the opposite direction to move the clipping plane back)

The Protein menu offers a number of protein display settings including different color schemes and different chain representations.

When loading a protein structure, MolView shows the asymmetric unit by default. This function allows you to view the full biological unit instead.

You can choose from four different chain representations. You can also view the full chain structure by enabling the Bonds option.

  • Ribbon: draws ribbon diagram (default representation)
  • Cylinder and plate: solid cylinders for α-helices and solid plates for β-sheets
  • B-factor tube: tube with B-factor as thickness (thermal motion)
  • C-alpha trace: lines between central carbon atom in amino-acids (very fast rendering)

Chain coloring

You can choose from six chain color schemes.

  • Secondary structures: different colors for α-helices, β-sheets, etc.
  • Spectrum: color spectrum (rainbow)
  • Chain: each chains gets a different color
  • Residue: all amino-acid residues are colored differently
  • Polarity: colors polar amino-acids red and non polar amino-acids white
  • B-factor: blue for low B-factor and red for high B-factor (if provided)

The Jmol menu offers some awesome Jmol-only functions and calculations.

Clears all executed calculations and measurements.

Enables High Quality rendering in Jmol (enabled by default on fast devices) When turned off, anti-aliasing is disabled and the model is drawn using lines while transforming it.

You can perform the following Jmol calculations in Jmol:

  • MEP surface lucent/opaque: calculates and projects molecular electrostatic potential on a translucent or opaque van der Waals surface
  • Charge: calculates and projects atomic charge as text label and white to atom color gradient
  • Bond dipoles: calculates and draws individual bond dipoles
  • Overall dipole: calculates and draws net bond dipole
  • Energy minimization: executes an interactive MMFF94 energy minimization (note that this function only executes a maximum of 100 minimization steps at a time)

You can measure distance, angle and torsion using Jmol. You can activate and deactivate one of these measurement types via the Jmol menu.

  • Distance distance between two atoms in nm
  • Angle angle between two bonds in degrees
  • Torsion torsion between four atoms in degrees

Note that in some cases, the resolved 3D model is only an approach of the real molecule, this means you have to execute an Energy minimization in order to do reliable measurements.

You can use the HTML code below to embed the current 3D model in your website.

Periodic Table

  • Corpus ID: 259298651

Uni-Mol: A Universal 3D Molecular Representation Learning Framework

  • G. Zhou , Zhifeng Gao , +5 authors Guolin Ke
  • Published in International Conference on… 2023
  • Computer Science, Chemistry

Figures and Tables from this paper

figure 1

182 Citations

3d-mol: a novel contrastive learning framework for molecular property prediction with 3d information, mol-ae: auto-encoder based molecular representation learning with 3d cloze test objective, gram matrix: an efficient representation of molecular conformation and learning objective for molecular pretraining, automated 3d pre-training for molecular property prediction, molecule joint auto-encoding: trajectory pretraining with 2d and 3d diffusion, conformational space profile enhances generic molecular representation learning, multi-modal representation learning for molecular property prediction: sequence, graph, geometry, tensorvae: a simple and efficient generative model for conditional molecular conformation generation.

  • Highly Influenced
  • 11 Excerpts

3D graph contrastive learning for molecular property prediction

Molbind: multimodal alignment of language, molecules, and proteins, 111 references, 3d infomax improves gnns for molecular property prediction, geometry-enhanced molecular representation learning for property prediction.

  • Highly Influential

An effective self-supervised framework for learning expressive molecular global representations to drug discovery

Geom: energy-annotated molecular conformations for property prediction and molecular generation, pre-training molecular graph representation with 3d geometry, self-supervised graph transformer on large-scale molecular data, learning neural generative dynamics for molecular conformation generation, an end-to-end framework for molecular conformation generation via bilevel programming, hamnet: conformation-guided molecular representation with hamiltonian neural networks, equibind: geometric deep learning for drug binding structure prediction, related papers.

Showing 1 through 3 of 0 Related Papers

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • View all journals
  • Explore content
  • About the journal
  • Publish with us
  • Sign up for alerts
  • Review Article
  • Published: 15 December 2021

Geometric deep learning on molecular representations

  • Kenneth Atz   ORCID: orcid.org/0000-0002-2628-1619 1   na1 ,
  • Francesca Grisoni   ORCID: orcid.org/0000-0001-8552-6615 1 , 2   na1 &
  • Gisbert Schneider   ORCID: orcid.org/0000-0001-6706-1084 1 , 3  

Nature Machine Intelligence volume  3 ,  pages 1023–1032 ( 2021 ) Cite this article

  • Cheminformatics
  • Computational models
  • Computational science

Geometric deep learning (GDL) is based on neural network architectures that incorporate and process symmetry information. GDL bears promise for molecular modelling applications that rely on molecular representations with different symmetry properties and levels of abstraction. This Review provides a structured and harmonized overview of molecular GDL, highlighting its applications in drug discovery, chemical synthesis prediction and quantum chemistry. It contains an introduction to the principles of GDL, as well as relevant molecular representations, such as molecular graphs, grids, surfaces and strings, and their respective properties. The current challenges for GDL in the molecular sciences are discussed, and a forecast of future opportunities is attempted.

This is a preview of subscription content, access via your institution

Access options

Access Nature and 54 other Nature Portfolio journals

Get Nature+, our best-value online-access subscription

24,99 € / 30 days

cancel any time

Subscribe to this journal

Receive 12 digital issues and online access to articles

111,21 € per year

only 9,27 € per issue

Buy this article

  • Purchase on SpringerLink
  • Instant access to full article PDF

Prices may be subject to local taxes which are calculated during checkout

representation 3d molecule

Similar content being viewed by others

representation 3d molecule

A universal framework for accurate and efficient geometric deep learning of molecular systems

representation 3d molecule

Enhancing geometric representations for molecules with equivariant vector-scalar interactive message passing

representation 3d molecule

A geometric deep learning approach to predict binding conformations of bioactive molecules

LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521 , 436–444 (2015).

Article   Google Scholar  

Schmidhuber, J. Deep learning in neural networks: an overview. Neur. Netw. 61 , 85–117 (2015).

Gawehn, E., Hiss, J. A. & Schneider, G. Deep learning in drug discovery. Mol. Inform. 35 , 3–14 (2016).

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. In International Conference on Machine Learning Vol. 34, 1263–1272 (2017).

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596 , 583–589 (2021).

Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems Vol. 30, 5998–6008 (2017).

Hinton, G. et al. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Proces. Mag. 29 , 82–97 (2012).

Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems Vol. 25, 1097–1105 (2012).

Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A. & Vandergheynst, P. Geometric deep learning: going beyond euclidean data. IEEE Signal Process. Mag. 34 , 18–42 (2017).

Todeschini, R. & Consonni, V. Molecular Descriptors for Chemoinformatics Vols I–II (John Wiley & Sons, 2009).

Townshend, R. J. et al. Geometric deep learning of RNA structure. Science 373 , 1047–1051 (2021).

Gainza, P. et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Methods 17 , 184–192 (2020).

Segler, M. H., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Centr. Sci. 4 , 120–131 (2018).

Bronstein, M. M., Bruna, J., Cohen, T. & Veličković, P. Geometric deep learning: grids, groups, graphs, geodesics, and gauges. Preprint at https://arxiv.org/abs/2104.13478 (2021).

Mumford, D., Fogarty, J. & Kirwan, F. Geometric Invariant Theory V ol. 34 (Springer Science & Business Media, 1994).

Cohen, T. S. & Welling, M. Group equivariant convolutional networks. In International Conference on Machine Learning Vol. 33, 2990–2999 (2016).

Kondor, R. & Trivedi, S. On the generalization of equivariance and convolution in neural networks to the action of compact groups. In International Conference on Machine Learning Vol. 35, 2747–2755 (2018).

Moriguchi, I., Hirono, S., Liu, Q., Nakagome, I. & Matsushita, Y. Simple method of calculating octanol/water partition coefficient. Chem. Pharmaceut. Bull. 40 , 127–130 (1992).

Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2 , 303–314 (1989).

Article   MathSciNet   MATH   Google Scholar  

Hoffmann, R. & Laszlo, P. Representation in chemistry. Angew. Chem. Int. Ed. Engl. 30 , 1–16 (1991).

Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations Vol 5. (2017).

Duvenaud, D. et al. Convolutional networks on graphs for learning molecular fingerprints. In International Conference on Neural Information Processing Systems , Vol. 28, 2224–2232 (2015).

Monti, F. et al. Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proc. IEEE Conference on Computer Vision and Pattern Recognition 5115–5124 (2017).

Battaglia, P. et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems Vol. 29, 4502–4510 (2016).

Battaglia, P. W. et al. Relational inductive biases, deep learning, and graph networks. Preprint at https://arxiv.org/abs/1806.01261 (2018).

Zhou, J. et al. Graph neural networks: a review of methods and applications. AI Open 1 , 57–81 (2020).

Schütt, K. T., Arbabzadah, F., Chmiela, S., Müller, K. R. & Tkatchenko, A. Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 8 , 13890 (2017).

Klicpera, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. In International Conference on Learning Representations Vol. 8 (2020).

Feinberg, E. N. et al. PotentialNet for molecular property prediction. ACS Central Science 4 , 1520–1530 (2018).

Torng, W. & Altman, R. B. Graph convolutional neural networks for predicting drug-target interactions. J. Chem. Inf. Model. 59 , 4131–4149 (2019).

Stokes, J. M. et al. A deep learning approach to antibiotic discovery. Cell 180 , 688–702 (2020).

Somnath, V. R., Bunne, C., Coley, C. W., Krause, A. & Barzilay, R. Learning graph models for retrosynthesis prediction. In Advances in Neural Information Processing Systems Vol. 34 (2021).

Coley, C. W. et al. A graph-convolutional neural network model for the prediction of chemical reactivity. Chem. Sci. 10 , 370–377 (2019).

Li, J., Cai, D. & He, X. Learning graph-level representation for drug discovery. Preprint at https://arxiv.org/abs/1709.03741 (2017).

Liu, K. et al. Chemi-Net: a molecular graph convolutional network for accurate drug property prediction. Int. J. Mol. Sci. 20 , 3389 (2019).

Unke, O. T. & Meuwly, M. PhysNet: a neural network for predicting energies, forces, dipole moments, and partial charges. J. Chem. Theory Comput. 15 , 3678–3693 (2019).

Schütt, K. T., Unke, O. T. & Gastegger, M. Equivariant message passing for the prediction of tensorial properties and molecular spectra. Preprint at https://arxiv.org/abs/2102.03150 (2021).

Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A. & Müller, K.-R. SchNet–a deep learning architecture for molecules and materials. J. Chem. Phys. 148 , 241722 (2018).

Schütt, K., Gastegger, M., Tkatchenko, A., Müller, K.-R. & Maurer, R. J. Unifying machine learning and quantum chemistry with a deep neural network for molecular wavefunctions. Nat. Commun. 10 , 5024 (2019).

Bogojeski, M., Vogt-Maranto, L., Tuckerman, M. E., Müller, K.-R. & Burke, K. Quantum chemical accuracy from density functional approximations via machine learning. Nat. Commun. 11 , 5223 (2020).

Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59 , 3370–3388 (2019).

Axelrod, S. & Gomez-Bombarelli, R. Molecular machine learning with conformer ensembles. Preprint at https://arxiv.org/abs/2012.08452 (2020).

Jiménez-Luna, J., Grisoni, F. & Schneider, G. Drug discovery with explainable artificial intelligence. Nat. Mach. Intell. 2 , 573–584 (2020).

Jiménez-Luna, J., Skalic, M., Weskamp, N. & Schneider, G. Coloring molecules with explainable artificial intelligence for preclinical relevance assessment. J. Chem. Inf. Model. 61 , 1083–1094 (2021).

Schnake, T. et al. Xai for graphs: explaining graph neural network predictions by identifying relevant walks. Preprint at https://arxiv.org/abs/2006.03589 (2020).

Sun, M., Xing, J., Wang, H., Chen, B. & Zhou, J. MoCL: data-driven molecular fingerprint via knowledge-aware contrastive learning from molecular graph. In Proc. 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining 3585–3594 (Association for Computing Machinery, 2021); https://doi.org/10.1145/3447548.3467186

Coley, C. W., Barzilay, R., Green, W. H., Jaakkola, T. S. & Jensen, K. F. Convolutional embedding of attributed molecular graphs for physical property prediction. J. Chem. Inf. Model. 57 , 1757–1772 (2017).

Li, Y., Vinyals, O., Dyer, C., Pascanu, R. & Battaglia, P. Learning deep generative models of graphs. Preprint at https://arxiv.org/abs/1803.03324 (2018).

Simonovsky, M. & Komodakis, N. GraphVAE: towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks Vol. 27, 412–422 (2018).

De Cao, N. & Kipf, T. MolGAN: An implicit generative model for small molecular graphs. Preprint at https://arxiv.org/abs/1805.11973 (2018).

Zhou, Z., Kearnes, S., Li, L., Zare, R. N. & Riley, P. Optimization of molecules via deep reinforcement learning. Sci. Rep. 9 , 10752 (2019).

Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In International Conference on Machine Learning Vol. 35, 2323–2332 (2018).

You, J., Liu, B., Ying, Z., Pande, V. & Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems Vol. 31, 6410–6421 (2018).

Jin, W., Barzilay, R. & Jaakkola, T. Multi-objective molecule generation using interpretable substructures. In International Conference on Machine Learning Vol. 37, 4849–4859 (2020).

Lei, T., Jin, W., Barzilay, R. & Jaakkola, T. Deriving neural architectures from sequence and graph kernels. In International Conference on Machine Learning Vol. 34, 2024-2033 (2017).

Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? Preprint at https://arxiv.org/abs/1810.00826 (2018).

Chen, Z., Chen, L., Villar, S. & Bruna, J. Can graph neural networks count substructures? In Advances in Neural Information Processing Systems Vol. 33, 10383–10395 (2020).

Bouritsas, G., Frasca, F., Zafeiriou, S. & Bronstein, M. M. Improving graph neural network expressivity via subgraph isomorphism counting. Preprint at https://arxiv.org/abs/2006.09252 (2020).

Bodnar, C. et al. Weisfeiler and Lehman go topological: message passing simplicial networks. In International Conference on Learning Representations: Workshop on Geometrical and Topological Representation Learning (2021).

Thomas, N. et al. Tensor field networks: rotation-and translation-equivariant neural networks for 3D point clouds. Preprint at https://arxiv.org/abs/1802.08219 (2018).

Satorras, V. G., Hoogeboom, E. & Welling, M. E (n) equivariant graph neural networks. Preprint at https://arxiv.org/abs/2102.09844 (2021).

Anderson, B., Hy, T. S. & Kondor, R. Cormorant: covariant molecular neural networks. In Advances in Neural Information Processing Systems Vol. 32, 14537–14546 (2019).

Miller, B. K., Geiger, M., Smidt, T. E. & Noé, F. Relevance of rotationally equivariant convolutions for predicting molecular properties. Preprint at https://arxiv.org/abs/2008.08461 (2020).

Fuchs, F., Worrall, D., Fischer, V. & Welling, M. SE(3)-transformers: 3D roto-translation equivariant attention networks. In Advances in Neural Information Processing Systems Vol. 33 (2020).

Unke, O. T. et al. SpookyNet: Learning force fields with electronic degrees of freedom and nonlocal effects. Preprint at https://arxiv.org/abs/2105.00304 (2021).

Batzner, S. et al. SE(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Preprint at https://arxiv.org/abs/2101.03164 (2021).

Unke, O. T. et al. SE(3)-equivariant prediction of molecular wavefunctions and electronic densities. Preprint at https://arxiv.org/abs/ 2 106.02347 (2021).

Hermann, J., Schätzle, Z. & Noé, F. Deep-neural-network solution of the electronic Schrödinger equation. Nat. Chem. 12 , 891–897 (2020).

Pfau, D., Spencer, J. S., Matthews, A. G. & Foulkes, W. M. C. Ab initio solution of the many-electron Schrödinger equation with deep neural networks. Phys. Rev. Res. 2 , 033429 (2020).

Choo, K., Mezzacapo, A. & Carleo, G. Fermionic neural-network states for ab-initio electronic structure. Nat. Commun. 11 , 2368 (2020).

Rajan, K., Zielesny, A. & Steinbeck, C. Decimer: towards deep learning for chemical image recognition. J. Cheminform. 12 , 65 (2020).

Cramer, R. D., Patterson, D. E. & Bunce, J. D. Comparative molecular field analysis (comfa). 1. effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc. 110 , 5959–5967 (1988).

Klebe, G. in 3D QSAR in Drug Design (eds. Kubinyi, H. et al.) 87–104 (Springer, 1998).

Jiménez, J., Skalic, M., Martinez-Rosell, G. & De Fabritiis, G. K DEEP : protein–ligand absolute binding affinity prediction via 3d-convolutional neural networks. J. Chem. Inf. Model. 58 , 287–296 (2018).

Ragoza, M., Hochuli, J., Idrobo, E., Sunseri, J. & Koes, D. R. Protein–ligand scoring with convolutional neural networks. J. Chem. Inf. Model. 57 , 942–957 (2017).

Jiménez, J., Doerr, S., Martinez-Rosell, G., Rose, A. S. & De Fabritiis, G. DeepSite: protein-binding site predictor using 3d-convolutional neural networks. Bioinformatics 33 , 3036–3042 (2017).

Ahmed, E. et al. A survey on deep learning advances on different 3d data representations. Preprint at https://arxiv.org/abs/1808.01462 (2018).

Pfaff, T., Fortunato, M., Sanchez-Gonzalez, A. & Battaglia, P. Learning mesh-based simulation with graph networks. In International Conference on Learning Representations Vol. 8 (2020).

Liu, Q. et al. OctSurf: efficient hierarchical voxel-based molecular surface representation for protein-ligand affinity prediction. J. Molec. Graph. Model. 105 , 107865 (2021).

Mylonas, S. K., Axenopoulos, A. & Daras, P. DeepSurf: a surface-based deep learning approach for the prediction of ligand binding sites on proteins. Preprint at https://arxiv.org/abs/2002.05643 (2020).

Wiswesser, W. J. Historic development of chemical notations. J. Chem. Inf. Comput. Sci. 25 , 258–263 (1985).

Wiswesser, W. J. The Wiswesser line formula notation. Chem. Eng. News Arch. 30 , 3523–3526 (1952).

Heller, S., McNaught, A., Stein, S., Tchekhovskoi, D. & Pletnev, I. InChI: the worldwide chemical structure identifier standard. J. Cheminform. 5 , 7 (2013).

Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28 , 31–36 (1988).

Öztürk, H., Özgür, A., Schwaller, P., Laino, T. & Ozkirimli, E. Exploring chemical space using natural language processing methodologies for drug discovery. Drug Discov. Today 25 , 689–705 (2020).

Cadeddu, A., Wylie, E. K., Jurczak, J., Wampler-Doty, M. & Grzybowski, B. A. Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angew. Chem. Int. Ed. 53 , 8108–8112 (2014).

Schwaller, P., Gaudin, T., Lanyi, D., Bekas, C. & Laino, T. Found in translation: predicting outcomes of complex organic chemistry reactions using neural sequence-to-sequence models. Chem. Sci. 9 , 6091–6098 (2018).

O’Boyle, N. M. Towards a universal SMILES representation: a standard method to generate canonical SMILES based on the InChI. J. Cheminform. 4 , 22 (2012).

Weininger, D., Weininger, A. & Weininger, J. L. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci. 29 , 97–101 (1989).

Article   MATH   Google Scholar  

Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Centr. Sci. 4 , 268–276 (2018).

O’Boyle, N. & Dalke, A. DeepSmiles: An adaptation of smiles for use in machine-learning of chemical structures. Preprint at https://doi.org/10.26434/chemrxiv.7097960.v1 (2018).

Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): a 100% robust molecular string representation. Mach. Learn. 1 , 045024 (2020).

Google Scholar  

Skinnider, M. A., Stacey, R. G., Wishart, D. S. & Foster, L. J. Chemical language models enable navigation in sparsely populated chemical space. Nat. Mach. Intell. 3 , 759–770 (2021).

Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning Internal Representations by Error Propagation (eds. Rumelhart, D. E. & McClelland, J. L.) 318–362 (MIT Press, 1985).

Hochreiter, S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int. J. Uncert. Fuzz. Knowl. Based Syst. 6 , 107–116 (1998).

Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neur. Comput. 9 , 1735–1780 (1997).

Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at https://arxiv.org/abs/1412.3555 (2014).

Yuan, W. et al. Chemical space mimicry for drug discovery. J. Chem. Inf. Model. 57 , 875–882 (2017).

Gupta, A. et al. Generative recurrent networks for de novo drug design. Mol. Inform. 37 , 1700111 (2018).

Merk, D., Friedrich, L., Grisoni, F. & Schneider, G. De novo design of bioactive small molecules by artificial intelligence. Mol. Inform. 37 , 1700153 (2018).

Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminform. 9 , 48 (2017).

Popova, M., Isayev, O. & Tropsha, A. Deep reinforcement learning for de novo drug design. Sci. Adv. 4 , eaap7885 (2018).

Grisoni, F. et al. Combining generative artificial intelligence and on-chip synthesis for de novo drug design. Sci. Adv. 7 , eabg3338 (2021).

Arús-Pous, J. et al. Randomized SMILES strings improve the quality of molecular generative models. J. Cheminform. 11 , 71 (2019).

Bjerrum, E. J. & Threlfall, R. Molecular generation with recurrent neural networks (RNNs). Preprint at https://arxiv.org/abs/1705.04612 (2017).

Grisoni, F., Moret, M., Lingwood, R. & Schneider, G. Bidirectional molecule generation with recurrent neural networks. J. Chem. Inf. Model. 60 , 1175–1183 (2020).

Nagarajan, D. et al. Computational antimicrobial peptide design and evaluation against multidrug-resistant clinical isolates of bacteria. J. Biol. Chem. 293 , 3492–3509 (2018).

Grisoni, F. et al. Designing anticancer peptides by constructive machine learning. ChemMedChem 13 , 1300–1302 (2018).

Das, P. et al. Accelerated antimicrobial discovery via deep generative models and molecular dynamics simulations. Nat. Biomed. Eng. 5 , 613–623 (2021).

Zheng, S., Li, Y., Chen, S., Xu, J. & Yang, Y. Predicting drug–protein interaction using quasi-visual question answering system. Nat. Mach. Intell. 2 , 134–140 (2020).

Wang, X. et al. Optimizing pharmacokinetic property prediction based on integrated datasets and a deep learning approach. J. Chem. Inf. Model. 60 , 4603–4613 (2020).

Senior, A. W. et al. Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins 87 , 1141–1148 (2019).

Tsai, S.-T., Kuo, E.-J. & Tiwary, P. Learning molecular dynamics with simple language model built upon long short-term memory neural network. Nat. Commun. 11 , 5155 (2020).

Gomez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Centr. Sci. 4 , 268–276 (2018).

Lin, X., Quan, Z., Wang, Z.-J., Huang, H. & Zeng, X. A novel molecular representation with BiGRU neural networks for learning atom. Brief. Bioinform. 21 , 2099–2111 (2019).

Preuer, K., Renz, P., Unterthiner, T., Hochreiter, S. & Klambauer, G. Fréchet ChemNet distance: a metric for generative models for molecules in drug discovery. J. Chem. Inf. Model. 58 , 1736–1741 (2018).

Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Centr. Sci. 5 , 1572–1583 (2019).

Schwaller, P. et al. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chemical Science 11 , 3316–3325 (2020).

Pesciullesi, G., Schwaller, P., Laino, T. & Reymond, J.-L. Transfer learning enables the molecular transformer to predict regio-and stereoselective reactions on carbohydrates. Nat. Commun. 11 , 4874 (2020).

Kreutter, D., Schwaller, P. & Reymond, J.-L. Predicting enzymatic reactions with a molecular transformer. Chem. Sci. 12 , 8648–8659 (2021).

Schwaller, P., Vaucher, A. C., Laino, T. & Reymond, J.-L. Prediction of chemical reaction yields using deep learning. Mach. Learn. 2 , 015016 (2021).

Schwaller, P. et al. Mapping the space of chemical reactions using attention-based neural networks. Nat. Mach. Intell. 3 , 144–152 (2021).

Morris, P., St. Clair, R., Hahn, W. E. & Barenholtz, E. Predicting binding from screening assays with transformer network embeddings. J. Chem. Inf. Model. 60 , 4191–4199 (2020).

Rong, Y. et al. Self-supervised graph transformer on large-scale molecular data. In Advances in Neural Information Processing Systems Vol. 33, 12559–12571 (2020).

Grechishnikova, D. Transformer neural network for protein-specific de novo drug generation as a machine translation problem. Sci. Rep. 11 , 321 (2021).

Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373 , 871–876 (2021).

Méndez-Lucio, O., Baillif, B., Clevert, D.-A., Rouquié, D. & Wichard, J. De novo generation of hit-like molecules from gene expression signatures using artificial intelligence. Nat. Commun. 11 , 10 (2020).

Griffiths, R.-R. & Hernández-Lobato, J. M. Constrained Bayesian optimization for automatic chemical design using variational autoencoders. Chem. Sci. 11 , 577–586 (2020).

Hirohara, M., Saito, Y., Koda, Y., Sato, K. & Sakakibara, Y. Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinform. 19 , 83–94 (2018).

Kimber, T. B., Engelke, S., Tetko, I. V., Bruno, E. & Godin, G. Synergy effect between convolutional neural networks and the multiplicity of SMILES for improvement of molecular prediction. Preprint at https://arxiv.org/abs/1812.04439 (2018).

Zheng, S., Yan, X., Yang, Y. & Xu, J. Identifying structure–property relationships through smiles syntax analysis with self-attention mechanism. J. Chem. Inf. Model. 59 , 914–923 (2019).

ElAbd, H. et al. Amino acid encoding for deep learning applications. BMC Bioinform. 21 , 12 (2020).

Satorras, V. G., Hoogeboom, E., Fuchs, F. B., Posner, I. & Welling, M. E(n) equivariant normalizing flows for molecule generation in 3d. In Advances in Neural Information Processing Systems Vol. 33 (2021).

Gebauer, N. W., Gastegger, M., Hessmann, S. S., Müller, K.-R. & Schütt, K. T. Inverse design of 3d molecular structures with conditional generative neural networks. Preprint at https://arxiv.org/abs/2109.04824 (2021).

Fujita, T. & Winkler, D. A. Understanding the roles of the “two QSARs”. J. Chem. Inf. Model. 56 , 269–274 (2016).

Hu, W. et al. Open Graph Benchmark: datasets for machine learning on graphs. In Advances in Neural Information Processing Systems Vol. 33, 22118–22133 (2020).

Wu, Z. et al. MoleculeNet: a benchmark for molecular machine learning. Chem. Sci. 9 , 513–530 (2018).

Polykovskiy, D. et al. Molecular sets (MOSES): a benchmarking platform for molecular generation models. Front. Pharmacol. 11 , 565644 (2020).

Brown, N., Fiscato, M., Segler, M. H. & Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59 , 1096–1108 (2019).

von Lilienfeld, O. A., Müller, K.-R. & Tkatchenko, A. Exploring chemical compound space with quantum-based machine learning. Nat. Rev. Chem. 12 , 945–951 (2020).

Unke, O. T. et al. Machine learning force fields. Chem. Rev. 121 , 10142–10186 (2021).

Isert, C., Atz, K., Jiménez-Luna, J. & Schneider, G. QMugs: quantum mechanical properties of drug-like molecules. Preprint at https://arxiv.org/abs/2107.00367 (2021).

Ramakrishnan, R., Dral, P. O., Rupp, M. & Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1 , 140022 (2014).

Jin, W., Coley, C., Barzilay, R. & Jaakkola, T. Predicting organic reaction outcomes with weisfeiler-lehman network. In Advances in Neural Information Processing Systems Vol. 30, 2607–2616 (2017).

LeCun, Y. et al. in The Handbook of Brain Theory and Neural Networks (e d. Arbib, M. A.) 255–258 (MIT Press, 1995).

Sutton, R. S. & Barto, A. G. Reinforcement Learning: An Introduction (MIT Press, 2018).

Pan, S. J. & Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22 , 1345–1359 (2009).

Nguyen, L. A., He, H. & Pham-Huy, C. Chiral drugs: an overview. Int. J. Biomed. Sci. 2 , 85 (2006).

Valsecchi, C., Grisoni, F., Motta, S., Bonati, L. & Ballabio, D. Nura: a curated dataset of nuclear receptor modulators. Toxicol. Appl. Pharmacol. 407 , 115244 (2020).

Bemis, G. W. & Murcko, M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 39 , 2887–2893 (1996).

Download references

Acknowledgements

This research was supported by the Swiss National Science Foundation (SNSF, grant no. 205321_182176) and the ETH RETHINK initiative.

Author information

These authors contributed equally: Kenneth Atz, Francesca Grisoni.

Authors and Affiliations

Department of Chemistry and Applied Biosciences, RETHINK, ETH Zurich, Zurich, Switzerland

Kenneth Atz, Francesca Grisoni & Gisbert Schneider

Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, the Netherlands

Francesca Grisoni

ETH Singapore SEC Ltd, Singapore, Singapore

Gisbert Schneider

You can also search for this author in PubMed   Google Scholar

Corresponding authors

Correspondence to Francesca Grisoni or Gisbert Schneider .

Ethics declarations

Competing interests.

G.S. declares a potential financial conflict of interest as co-founder of inSili.com LLC, Zurich, and in his role as scientific consultant to the pharmaceutical industry.

Peer review information

Nature Machine Intelligence thanks Jonathan Hirst and Oliver Wieder for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article.

Atz, K., Grisoni, F. & Schneider, G. Geometric deep learning on molecular representations. Nat Mach Intell 3 , 1023–1032 (2021). https://doi.org/10.1038/s42256-021-00418-8

Download citation

Received : 23 July 2021

Accepted : 26 October 2021

Published : 15 December 2021

Issue Date : December 2021

DOI : https://doi.org/10.1038/s42256-021-00418-8

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Quick links

  • Explore articles by subject
  • Guide to authors
  • Editorial policies

Sign up for the Nature Briefing: AI and Robotics newsletter — what matters in AI and robotics research, free to your inbox weekly.

representation 3d molecule

Learning Multi-view Molecular Representations with Structured and Unstructured Knowledge

New citation alert added.

This alert has been successfully added and will be sent to:

You will be notified whenever a record that you have chosen has been cited.

To manage your alert preferences, click on the button below.

New Citation Alert!

Please log in to your account

Information & Contributors

Bibliometrics & citations, view options, supplemental material, index terms.

Applied computing

Life and medical sciences

Bioinformatics

Computing methodologies

Artificial intelligence

Knowledge representation and reasoning

Natural language processing

Recommendations

Prediction of potential repurposed drugs against sars-cov-2 based on text mining and molecular docking analysis.

Many research papers focusing the relationships between drugs and coronavirus-related protein interactions are indexed in the PubMed database. These studies provide a reference for predicting potential repurposed drugs against SARS-CoV-2. Drug ...

An investigation of molecular dynamics simulation and molecular docking: Interaction of citrus flavonoids and bovine β-lactoglobulin in focus

Citrus flavonoids are natural compounds with important health benefits. The study of their interaction with a transport protein, such as bovine @b-lactoglobulin (BLG), at the atomic level could be a valuable factor to control their transport to ...

KG-MTL: Knowledge Graph Enhanced Multi-Task Learning for Molecular Interaction

Molecular interaction prediction is essential in various applications including drug discovery and material science. The problem becomes quite challenging when the interaction is represented by unmapped relationships in molecular networks, namely ...

Information

Published in.

cover image ACM Conferences

  • General Chairs:

Author Picture

Northeastern University, USA

Author Picture

CENTAI / Eurecat, Italy

  • SIGMOD: ACM Special Interest Group on Management of Data
  • SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data

Association for Computing Machinery

New York, NY, United States

Publication History

Check for updates, author tags.

  • knowledge graphs
  • multi-view molecular representation learning
  • text mining
  • Research-article

Funding Sources

  • Pharmolix Inc.
  • The National Key R&D Program of China

Acceptance Rates

Contributors, other metrics, bibliometrics, article metrics.

  • 0 Total Citations
  • 15 Total Downloads
  • Downloads (Last 12 months) 15
  • Downloads (Last 6 weeks) 15

View options

View or Download as a PDF file.

View online with eReader .

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Share this publication link.

Copying failed.

Share on social media

Affiliations, export citations.

  • Please download or close your previous search result export first before starting a new bulk export. Preview is not available. By clicking download, a status dialog will open to start the export process. The process may take a few minutes but once it finishes a file will be downloadable from your browser. You may continue to browse the DL while the export process is in progress. Download
  • Download citation
  • Copy citation

We are preparing your search results for download ...

We will inform you here when the file is ready.

Your file of search results citations is now ready.

Your search export query has expired. Please try again.

3D-Mol: A Novel Contrastive Learning Framework for Molecular Property Prediction with 3D Information

  • Original Article
  • Published: 21 June 2024
  • Volume 27 , article number  71 , ( 2024 )

Cite this article

representation 3d molecule

  • Taojie Kuang 1 , 2 ,
  • Yiming Ren 1 &
  • Zhixiang Ren 1  

Molecular property prediction, crucial for early drug candidate screening and optimization, has seen advancements with deep learning-based methods. While deep learning-based methods have advanced considerably, they often fall short in fully leveraging 3D spatial information. Specifically, current molecular encoding techniques tend to inadequately extract spatial information, leading to ambiguous representations where a single one might represent multiple distinct molecules. Moreover, existing molecular modeling methods focus predominantly on the most stable 3D conformations, neglecting other viable conformations present in reality. To address these issues, we propose 3D-Mol, a novel approach designed for more accurate spatial structure representation. It deconstructs molecules into three hierarchical graphs to better extract geometric information. Additionally, 3D-Mol leverages contrastive learning for pretraining on 20 million unlabeled data, treating their conformations with identical topological structures as weighted positive pairs and contrasting ones as negatives, based on the similarity of their 3D conformation descriptors and fingerprints. We compare 3D-Mol with various state-of-the-art baselines on 7 benchmarks and demonstrate our outstanding performance.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save.

  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime

Price includes VAT (Russian Federation)

Instant access to the full article PDF.

Rent this article via DeepDyve

Institutional subscriptions

representation 3d molecule

Similar content being viewed by others

representation 3d molecule

QuanDB: a quantum chemical property database towards enhancing 3D molecular representation learning

representation 3d molecule

MolBench: A Benchmark of AI Models for Molecular Property Prediction

representation 3d molecule

Deep contrastive learning of molecular conformation for efficient property prediction

Explore related subjects.

  • Artificial Intelligence

Data availability

The unlabeled dataset ZINC20 and PubChem, used in pretraining stage, can be accessed at https://zinc20.docking.org/tranches/home/ and https://pubchem.ncbi.nlm.nih.gov/docs/downloads . The downstream benchmarks can be downloaded from MoleculeNet ( https://moleculenet.org/datasets-1 ). It is available for non-commercial use.

Code availability

The software can be accessed at https://github.com/AI-HPC-Research-Team/3D-Mol .

Goh GB, Hodas NO, Siegel C, Vishnu A (2017) SMILES2Vec: an interpretable general-purpose deep neural network for predicting chemical properties https://doi.org/10.48550/ARXIV.1712.02034

Huang K, Fu T, Glass LM, Zitnik M, Xiao C, Sun J (2020) Deeppurpose: a deep learning library for drug-target interaction prediction. Bioinformatics 36(22–23):5545–5547

Google Scholar  

Chithrananda S, Grand G, Ramsundar B (2020) ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. https://doi.org/10.48550/ARXIV.2010.09885

Weininger D (1988) Smiles, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inform Comput Sci 28(1):31–36

Article   Google Scholar  

Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. Proc Mach Learn Res 70:1263–1272

Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M et al (2019) Analyzing learned molecular representations for property prediction. J Chem Inform Modeling 59(8):3370–3388

Hu W, Liu B, Gomes J, Zitnik M, Liang P, Pande V, Leskovec J (2019) Strategies for Pre-training Graph Neural Networks. https://doi.org/10.48550/ARXIV.1905.12265

Liu S, Demirel MF, Liang Y (2019) N-gram graph: simple unsupervised representation for graphs, with applications to molecules. Adv Neural Inform Process Syst 32:19

Xiong Z, Wang D, Liu X, Zhong F, Wan X, Li X, Li Z, Luo X, Chen K, Jiang H et al (2019) Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J Med Chem 63(16):8749–8760

Wang Y, Wang J, Cao Z, Barati Farimani A (2022) Molecular contrastive learning of representations via graph neural networks. Nature Mach Intell 4(3):279–287. https://doi.org/10.1038/s42256-022-00447-x

Rong Y, Bian Y, Xu T, Xie W, WEI Y, Huang W, Huang J (2020) Self-supervised graph transformer on large-scale molecular data. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 12559–12571. Curran Associates, Inc., ???. https://proceedings.neurips.cc/paper_files/paper/2020/file/94aef38441efa3380a3bed3faf1f9d5d-Paper.pdf

Schütt K, Kindermans P-J, Sauceda Felix HE, Chmiela S, Tkatchenko A, Müller K-R (2017) Schnet: A continuous-filter convolutional neural network for modeling quantum interactions. Advances in neural information processing systems 30

Gasteiger J, Groß J, Günnemann S (2020) Directional message passing for molecular graphs. arXiv preprint arXiv:2003.03123

Shui Z, Karypis G (2020) Heterogeneous molecular graph neural networks for predicting molecule properties. In: 2020 IEEE International Conference on Data Mining (ICDM), pp. 492–500. IEEE

Danel T, Spurek P, Tabor J, Śmieja M, Struski Ł, Słowik A, Maziarka Ł (2020) Spatial graph convolutional networks. In: Neural Information Processing: 27th International Conference, ICONIP 2020, Bangkok, Thailand, November 18–22, 2020, Proceedings, Part V, pp. 668–675. Springer

Fang X, Liu L, Lei J, He D, Zhang S, Zhou J, Wang F, Wu H, Wang H (2022) Geometry-enhanced molecular representation learning for property prediction. Nature Mach Intell 4(2):127–134

Zhou G, Gao Z, Ding Q, Zheng H, Xu H, Wei Z, Zhang L, Ke G (2023) Uni-mol: a universal 3d molecular representation learning framework

Zhang Z, Xu M, Jamasb A, Chenthamarakshan V, Lozano A, Das P, Tang J (2022) Protein representation learning by geometric structure pretraining. arXiv preprint arXiv:2203.06125

Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V (2018) Moleculenet: a benchmark for molecular machine learning. Chem Sci 9(2):513–530

Cereto-Massagué A, Ojeda MJ, Valls C, Mulero M, Garcia-Vallvé S, Pujadas G (2015) Molecular fingerprint similarity search in virtual screening. Methods 71:58–63

Coley CW, Barzilay R, Green WH, Jaakkola TS, Jensen KF (2017) Convolutional embedding of attributed molecular graphs for physical property prediction. J Chem Inform Model 57(8):1757–1772

Rogers D, Hahn M (2010) Extended-connectivity fingerprints. J Chem Inform Modeling 50(5):742–754

Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of mdl keys for use in drug discovery. J Chem Inform Comput Sci 42(6):1273–1280

Wang S, Guo Y, Wang Y, Sun H, Huang J (2019) Smiles-bert: large scale unsupervised pre-training for molecular property prediction. Computat Biol Health Inform 4:429–436

Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. https://doi.org/10.48550/ARXIV.1810.04805

Wang J, Cao D, Tang C, Xu L, He Q, Yang B, Chen X, Sun H, Hou T (2021) Deepatomiccharge: a new graph convolutional network-based architecture for accurate prediction of atomic charges. Brief Bioinform 22(3):183

Li X-S, Liu X, Lu L, Hua X-S, Chi Y, Xia K (2022) Multiphysical graph neural network (mp-gnn) for COVID-19 drug design. Brief Bioinform 23(4):231

Lu C, Liu Q, Wang C, Huang Z, Lin P, He L (2019) Molecular property prediction: a multilevel quantum interactions modeling perspective. Proc Conf Artif Intell 33:1052–1060

Qiao Z, Welborn M, Anandkumar A, Manby FR, Miller TF (2020) Orbnet: deep learning for quantum chemistry using symmetry-adapted atomic-orbital features. J Chem Phys 153(12):686

Li Z, Jiang M, Wang S, Zhang S (2022) Deep learning methods for molecular representation and property prediction. Drug Discov Today 27:103373

Stepniewska-Dziubinska MM, Zielenkiewicz P, Siedlecki P (2018) Development and evaluation of a deep learning model for protein-ligand binding affinity prediction. Bioinformatics 34(21):3666–3674

Sunseri J, Koes DR (2020) Libmolgrid: graphics processing unit accelerated molecular gridding for deep learning applications. J Chem Inform Modeling 60(3):1079–1084

Liu Q, Wang P-S, Zhu C, Gaines BB, Zhu T, Bi J, Song M (2021) Octsurf: efficient hierarchical voxel-based molecular surface representation for protein-ligand affinity prediction. J Mol Graph Modelling 105:107865

Floridi L, Chiriatti M (2020) Gpt-3: its nature, scope, limits, and consequences. Minds Mach 30:681–694

Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692

Honda S, Shi S, Ueda HR (2019) Smiles transformer: Pre-trained molecular fingerprint for low data drug discovery. arXiv preprint arXiv:1911.04738

You Y, Chen T, Sui Y, Chen T, Wang Z, Shen Y (2020) Graph contrastive learning with augmentations. Adv Neural Inform Process Syst 33:5812–5823

Sun M, Xing J, Wang H, Chen B, Zhou J (2021) Mocl: Data-driven molecular fingerprint via knowledge-aware contrastive learning from molecular graph. In: proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pp. 3585–3594

Li P, Wang J, Qiao Y, Chen H, Yu Y, Yao X, Gao P, Xie G, Song S (2021) An effective self-supervised framework for learning expressive molecular global representations to drug discovery. Brief Bioinform 22(6):109

Wang Y, Magar R, Liang C, Barati Farimani A (2022) Improving molecular contrastive learning via faulty negative mitigation and decomposed fragment contrast. J Chem Inform Modeling 62(11):2713–2725

Sun Q, Li J, Peng H, Wu J, Ning Y, Yu PS, He L (2021) Sugar: Subgraph neural network with reinforcement pooling and self-supervised mutual information mechanism. In: proceedings of the web conference 2021, pp. 2081–2091

Ji Z, Shi R, Lu J, Li F, Yang Y (2022) Relmole: molecular representation learning based on two-level graph similarities. J Chem Inform Modeling 62(22):5361–5372

Cho H, Choi IS (2019) Enhanced deep-learning prediction of molecular properties via augmentation of bond topology. Chem Med Chem 14(17):1604–1609

Liu S, Wang H, Liu W, Lasenby J, Guo H, Tang J (2021) Pre-training molecular graph representation with 3d geometry. arXiv preprint arXiv:2110.07728

Landrum G, et al (2013) Rdkit: A software suite for cheminformatics, computational chemistry, and predictive modeling. Greg Landrum 8

Irwin JJ, Tang KG, Young J, Dandarchuluun C, Wong BR, Khurelbaatar M, Moroz YS, Mayfield J, Sayle RA (2020) Zinc20-a free ultralarge-scale chemical database for ligand discovery. J Chem Inform Modeling 60(12):6065–6073

Wang Y, Xiao J, Suzek TO, Zhang J, Wang J, Bryant SH (2009) Pubchem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res 37(2):623–633

Stärk H, Beaini D, Corso G, Tossou P, Dallago C, Günnemann S, Liò P (2022) 3d infomax improves gnns for molecular property prediction. In: international conference on machine learning, pp. 20479–20502. PMLR

Download references

Acknowledgements

The research was supported by the Peng Cheng Cloud-Brain.

This work is supported by Peng Cheng Laboratory and by the Major Key Project of PCL PCL2021A13.

Author information

Authors and affiliations.

Peng Cheng Laboratory, Shenzhen, 518000, Guangdong Province, China

Taojie Kuang, Yiming Ren & Zhixiang Ren

School of Future Technology, South China University of Technology, Guangzhou, 510000, Guangdong Province, China

Taojie Kuang

You can also search for this author in PubMed   Google Scholar

Contributions

Taojie Kuang: Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing - original draft, Writing - review & editing. Yiming Ren: Validation, Writing - review & editing. Zhixiang Ren: Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Resources, Supervision, Writing - original draft, Writing - review & editing.

Corresponding author

Correspondence to Zhixiang Ren .

Ethics declarations

Conflict of interest.

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: 3D conformation descriptor and fingerprint

1.1 a.1 fingerprint.

In our study, we integrate molecular fingerprints, particularly Morgan fingerprints, to calculate weights for negative pairs in our model. These fingerprints, which provide a compact numerical representation of molecular structures, are crucial for computational chemistry tasks. The Morgan fingerprint method iteratively updates each atom’s representation based on its chemical surroundings, resulting in a detailed binary vector of the molecule. By evaluating the similarity between Morgan fingerprints, we derive a precise weighting mechanism for negative pairs, enhancing our model’s ability to detect and differentiate molecular structures. This methodology not only improves our model’s accuracy in molecular interaction analysis but also adds to its overall predictive capabilities.

1.2 A.2 3D conformation descriptor

Molecular 3D conformation descriptors are computational tools used to represent the three-dimensional arrangement of atoms within a molecule, capturing critical aspects of its spatial geometry. These descriptors are crucial in understanding how molecular shape influences chemical and biological properties, and they play a significant role in fields like drug design and materials science. The 3D-Morse descriptor, specifically, is a type of 3D molecular descriptor that quantifies the molecular structure using electron diffraction patterns, offering a unique approach to encapsulating the spatial distribution of atoms. It provides a detailed and nuanced representation of molecular conformation, making it highly valuable in computational chemistry and cheminformatics. In our research, we employ 3D-Morse descriptors to measure the similarity of molecular 3D conformations, enabling us to compare and analyze molecular structures effectively and identify potential similarities in their biological or chemical behavior. This application of 3D-Morse descriptors is instrumental in fields such as drug discovery, where understanding molecular similarities can lead to the identification of new therapeutic compounds or the prediction of their activities.

Appendix B: The contribution of pretraining method

In this section, we discuss the contributions of contrastive learning and supervised pretraining methods to our pretraining approach. We pretrained our model using three approaches: contrastive Learning only, supervised pretraining only, and complete pretraining method. We compared their performance on 7 benchmark datasets. As the Table  4 shown, the contributions of both contrastive learning and supervised pretraining were less significant than the complete method. These findings emphasize that while both contrastive learning and supervised pretraining contribute positively to the model’s performance, their combination is crucial for achieving optimal results.

Appendix C: Finetuning details

During finetuning for each downstream task, we randomly search the hyper-parameters to find the best performing setting on the validation set and report the results on the test set. Table  5 lists the combinations of different hyper-parameters.

Appendix D: Environment

\(\bullet \) Architect: X 86 64

\(\bullet \) Number of CPUs: 96

\(\bullet \) Model: Intel(R) Xeon(R) Platinum 8268 CPU @ 2.90GHz

\(\bullet \) Type: Tesla V100-SXM2-32GB

\(\bullet \) Count: 8

\(\bullet \) Driver Version: 450.80.02

\(\bullet \) CUDA Version: 11.7

Software Environment :

\(\bullet \) Operating System: Ubuntu 20.04.6 LTS

\(\bullet \) Python Version: 3.10.9

\(\bullet \) Paddle Version: 2.4.2

\(\bullet \) PGL Version: 2.2.5

\(\bullet \) RDKit Version: 2023.3.2

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Kuang, T., Ren, Y. & Ren, Z. 3D-Mol: A Novel Contrastive Learning Framework for Molecular Property Prediction with 3D Information. Pattern Anal Applic 27 , 71 (2024). https://doi.org/10.1007/s10044-024-01287-8

Download citation

Received : 01 April 2024

Accepted : 18 June 2024

Published : 21 June 2024

DOI : https://doi.org/10.1007/s10044-024-01287-8

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Molecular property prediction
  • Contrastive learning
  • Graph neural network
  • Molecular modeling
  • Find a journal
  • Publish with us
  • Track your research

This week: the arXiv Accessibility Forum

Help | Advanced Search

Computer Science > Machine Learning

Title: unified 2d and 3d pre-training of molecular representations.

Abstract: Molecular representation learning has attracted much attention recently. A molecule can be viewed as a 2D graph with nodes/atoms connected by edges/bonds, and can also be represented by a 3D conformation with 3-dimensional coordinates of all atoms. We note that most previous work handles 2D and 3D information separately, while jointly leveraging these two sources may foster a more informative representation. In this work, we explore this appealing idea and propose a new representation learning method based on a unified 2D and 3D pre-training. Atom coordinates and interatomic distances are encoded and then fused with atomic representations through graph neural networks. The model is pre-trained on three tasks: reconstruction of masked atoms and coordinates, 3D conformation generation conditioned on 2D graph, and 2D graph generation conditioned on 3D conformation. We evaluate our method on 11 downstream molecular property prediction tasks: 7 with 2D information only and 4 with both 2D and 3D information. Our method achieves state-of-the-art results on 10 tasks, and the average improvement on 2D-only tasks is 8.3%. Our method also achieves significant improvement on two 3D conformation generation tasks.
Comments: KDD-2022
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: [cs.LG]
  (or [cs.LG] for this version)
  Focus to learn more arXiv-issued DOI via DataCite

Submission history

Access paper:.

  • Other Formats

References & Citations

  • Google Scholar
  • Semantic Scholar

BibTeX formatted citation

BibSonomy logo

Bibliographic and Citation Tools

Code, data and media associated with this article, recommenders and search tools.

  • Institution

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs .

www.pymol.org

View 3d molecular structures.

  • VIEW 3D Molecular Structures
  • RENDER Figures Artistically
  • ANIMATE Molecules Dynamically
  • EXPORT PyMOL Geometry
  • PRESENT 3D Data with AxPyMOL

Compare Incentive PyMOL with other versions

After years of development and testing in the open-source community, PyMOL has established itself as a leading software package for customization of 3-D biomolecular images, with more than 600 settings and 20 representations to provide users with precise and powerful control.  

PyMOL can interpret over 30 different file formats from PDB files to multi-SDF files to volumetric electron density maps. PyMOL's straightforward graphical user interface allows first-time and expert users alike to create stunning 3-D images from their favorite file formats. Images and movies can then be saved in a cross-platform Session file, ensuring that every object position, atom color, molecule representation, molecular state, frame, and movie can be viewed by colleagues exactly as intended. 

Image Representations

representation 3d molecule

Using PyMOL, data can be represented in nearly 20 different ways. Spheres provides a CPK-like view, surface and mesh provide more volumetric views, lines and sticks put the emphasis on bond connectivity, and ribbon and cartoon are popular representations for identifying secondary structure and topology. PyMOL's quick demo, accessible through the built-in Wizard menu, gets users started with all of the standard representations. 

  • Open access
  • Published: 17 September 2020

Molecular representations in AI-driven drug discovery: a review and practical guide

  • Laurianne David   ORCID: orcid.org/0000-0002-6455-1958 1 ,
  • Amol Thakkar   ORCID: orcid.org/0000-0003-0403-4067 1 , 2 ,
  • Rocío Mercado   ORCID: orcid.org/0000-0002-6170-6088 1 &
  • Ola Engkvist   ORCID: orcid.org/0000-0003-4970-6461 1  

Journal of Cheminformatics volume  12 , Article number:  56 ( 2020 ) Cite this article

58k Accesses

243 Citations

74 Altmetric

Metrics details

The technological advances of the past century, marked by the computer revolution and the advent of high-throughput screening technologies in drug discovery, opened the path to the computational analysis and visualization of bioactive molecules. For this purpose, it became necessary to represent molecules in a syntax that would be readable by computers and understandable by scientists of various fields. A large number of chemical representations have been developed over the years, their numerosity being due to the fast development of computers and the complexity of producing a representation that encompasses all structural and chemical characteristics. We present here some of the most popular electronic molecular and macromolecular representations used in drug discovery, many of which are based on graph representations. Furthermore, we describe applications of these representations in AI-driven drug discovery. Our aim is to provide a brief guide on structural representations that are essential to the practice of AI in drug discovery. This review serves as a guide for researchers who have little experience with the handling of chemical representations and plan to work on applications at the interface of these fields.

representation 3d molecule

Introduction

The representation of molecules has been of interest to scientists since the nineteenth century [ 1 , 2 ]. Traditionally, molecules are represented as structure diagrams with bonds and atoms, and this is likely the representation most people think of when they think of molecules. However, other representations are required for the computational processing of chemical structures in cheminformatics. Here, we define a chemical representation as any encoding of a chemical compound; linear representations are referred to as notations.

Over the years, scientists have developed many notations depicting various properties of a compound. A classic notation is the empirical formula, a non-standard form of Hill notation. Seemingly simple at first glance, the empirical formula of alanine, C 3 H 7 NO 2 , exemplifies the complexity of building a notation. Indeed, while information about the atoms is available, it is not possible to know how the atoms are linked from the molecular formula which, moreover, does not encode information related to the molecular geometry. As such, the molecular formula above can be associated to alanine as well as to sarcosine and lactamide. Variants of empirical formulas emphasizing any functional groups also exist but are loosely defined. In these group-centric representations, elements are grouped in the formula as they would be in the molecule so as to highlight any functional groups present e.g. CH 3 CH(NH 2 )COOH to represent alanine.

The advent of computers led to the development of a wide variety of machine-readable chemical representations. Computers allowed for the rapid digital storage and querying of compounds and their structures, swift modifications of digital information, and greater physical storage efficiency. Algorithms were implemented to visualize compounds as 2D depictions [ 3 , 4 ] and the computational visualization of compounds in 3D space was popularized with the development of specialized programs [ 5 , 6 , 7 ].

Many precursors to computer-readable notations were introduced between 1947 and 1964 and were dedicated to small organic molecules [ 2 , 8 ]. At the time, memory efficiency was an important factor impacting the development of chemical notations. Popular representations used nowadays, however, were largely developed in and since the 70′s to represent small molecules [ 9 , 10 , 11 ], macromolecules [ 12 , 13 , 14 , 15 , 16 , 17 ] and chemical reactions [ 18 , 19 , 20 , 21 ].

In this review, we focus on chemical representations in cheminformatics and drug discovery. We first introduce the concept of a molecular graph, which is the most common machine-readable representation, and we give a brief overview of the main notations which paved the way for the current cheminformatics notations. We then focus on the representations that are used nowadays in the field of applying artificial intelligence (AI) to cheminformatics and drug discovery. Finally, we provide examples of AI-related applications using the chemical representations discussed in this review. This review is intended to provide an overview of basic cheminformatics knowledge to practising cheminformaticians, students in chemistry, cheminformatics, bioinformatics, and computer science, and anyone interested to learn more about molecular representations in drug discovery. While the coverage of representations introduced herein is not intended to be exhaustive, we emphasise that the representation used to solve a problem is always dependent on the task. Thus, the coverage is limited to areas where there is active research in applying machine learning (ML) and AI to cheminformatics and drug discovery. For readers interested in further reading on these topics we recommend references [ 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 ], in addition to the references cited in each section.

Graph representations for small molecules

Introduction to the molecular graph representation.

In order to understand the chemical representations presented in this review, it is important that the reader first has a solid understanding of molecular graphs, as most molecular representations discussed in this work are built on the molecular graph representation. However, there is a distinction between the notations and file formats built using molecular graphs (e.g. SMILES strings, Molfiles), and the abstract mathematical structure/data structure of a graph itself. The latter is introduced here.

The idea behind the molecular graph representation lies in mapping the atoms and bonds that make up a molecule into sets of nodes and edges . Intuitively, one could imagine treating the atoms in a molecule as nodes and the bonds as edges , although there is no reason one could not consider other mappings. In typical graph representations, the nodes are represented using circles or spheres, and the edges using lines. In the case of molecular graphs, the nodes are instead often represented using letters indicating the atom type (as on the periodic table), or simply using points where the bonds meet (for carbon atoms).

A molecular graph representation is formally a 2D object that can be used to represent 3D information (e.g. atomic coordinates, bond angles, chirality). However, any spatial relationships between the nodes must be encoded as node and/or edge attributes, as nodes in a graph (the mathematical object) do not formally have spatial positions, only pairwise relationships. There are of course limitations to this representation, which are discussed in a later section. The 2D and 3D representations of graphs can easily be visualized by many software packages, including ChemDraw [ 30 ], Mercury [ 31 ], Avogadro [ 32 ], VESTA [ 33 ], PyMOL [ 34 ], and VMD [ 35 ] (the latter 5 are suitable for small- and macro-molecules, and either free or open-source).

Mathematical definition of a graph

A graph Footnote 1 is formally defined as a tuple G  = ( V , E ) of a set of nodes V and a set of edges E , where each edge e ∊ E connects pairs of nodes in V . In a molecular graph, V is intuitively the set of all atoms in a molecule, and E is the set of all bonds linking the atoms, although this does not have to be the case. Molecular graphs are generally undirected, meaning that the pairs in E are unordered. [ 36 ].

To map a graph from an abstract mathematical concept to a concrete representation that can be handled on a computer, one needs to map the sets of nodes and edges to linear data structures; a common way to do this is using data structures such as matrices or arrays. Linear data structures are necessary in order to specify the connectivity of the nodes. To do so, an artificial node-ordering must first be calculated for encoding a molecule using arrays, even though V and E are formally sets and the order of elements in sets is irrelevant. The information to be mapped can include (1) how the atoms in the molecule are connected, (2) the identity of the atoms, and (3) the identity of the bonds.

How the atoms are connected is commonly represented in the form of an adjacency matrix A ; given that a ij is an element of A , a ij  = 1 means that there exists a bond between nodes v i and v j in molecular graph G , whereas a ij  = 0 means that there exists no bond between them (Fig.  1 b). The adjacency matrix is also sometimes referred to as the connectivity matrix . Note that the adjacency matrix does not necessarily specify what type of bond is connecting each pair of nodes.

figure 1

Example graph representation for acetic acid. a Graph representation of acetic acid with nodes numbered from one to four. b Example adjacency matrix, A , for an acetic acid graph with the corresponding node ordering on the left. c Example node features matrix, X , which one-hot encodes a few selected properties. d Example edge features matrix, E , where each edge feature vector is a one-hot encoding of single, double, or triple bonds. “Implicit Hs” stands for the number of implicit hydrogens on a given node

The identity of the atoms can be represented in the form of a node features matrix X (Fig.  1 c). Each row of X corresponds to a node v i (i.e. an atom in the molecule) in G ; this row is also referred to as the node feature vector x i for that atom. The length of x i corresponds to the number of atom features one has chosen to encode (e.g. a one-hot Footnote 2 encoding of atom type and formal charge).

The identity of the bonds can be represented in the form of an edge features matrix E (Fig.  1 d). Each row of E corresponds to an edge e ij  = ( v i , v j ) in G , and is referred to as the edge feature vector e ij for that edge. The length of e ij corresponds to the number of edge features one has chosen to encode (e.g. a one-hot encoding of possible bond types {single, double, triple, aromatic}).

Although common in AI applications, we would like to point out that it is not necessary to one-hot encode the various node and edge features. For example, the node features matrix shown in Fig.  1 c could instead have only 3 columns using integers to represent the same three properties (atom type, formal charge, and number of implicit Hs).

Graph traversal algorithms

Although, formally, graphs are non-linear data structures made up of sets of nodes and edges, in practice, matrix representations of graphs are node order dependent. The node order used in a matrix representation is determined by a graph traversal algorithm (Fig.  2 ). Depending on the application, it can be desirable to consistently generate the exact same representation for the same molecule. Reliably generating the same representation for a molecule is dependent on getting the same node order every time. To this end, one can use methods such as a depth-first or breadth-first search to generate graph matrix representations. The graph traversal algorithm needs to include a consistent way to break ties when a node branches off and must therefore consistently select the same branch traversal order. In fact, the way in which different software packages break ties in traversing a graph is often what sets them apart. However, if consistency is not important (and, indeed, for some deep learning applications, one might want noisier data), a random search can be used.

figure 2

Graph traversal algorithms. Three widespread graph traversal algorithms are illustrated above for an example branched graph. The numbers correspond to the order in which the nodes are explored, starting at node 1. a A depth-first search first explores each “branch” of a graph to the fullest extent, then goes back and explores branches at the last branched node, until all branches have been explored. b A breadth-first search first explores all nearest neighbours of a node, and then the nearest neighbours of the nearest neighbours, and so on, until the whole graph has been explored. c  A random search explores nodes in the graph in an arbitrary order, regardless of how they are connected

There are many ways to represent a graph

The matrix representations discussed above are not the only way to represent graphs, as there are multiple ways to represent the same information. For example, as has been already mentioned, depending on what graph traversal algorithm is used, the order of the rows in the adjacency matrix (or atom/bond block) will be different.

Furthermore, when working with molecular graphs, there is not one correct way to represent any molecule, and the representation chosen must be appropriate for the task.

Advantages of molecular graph representations

Graphs are formally 2D data structures with no spatial relationships between elements; nonetheless, 3D information (and information that is the “result” of a 3D structure e.g. stereochemistry) can be encoded into a graph representation. One natural place to put this data is in the node features matrix, X , for node information (such as if a chiral node is R or S), or in the edge features matrix, E , for edge information (such as the length of a bond).

The fact that one can naturally encode 3D information in a graph representation gives graphs many advantages over various linear notations, although some linear notations (such as SYBYL Line Notation) can also encode atomic 3D information. Additionally, the fact that all molecular subgraphs (i.e. subsets of G ) are interpretable can confer a particular advantage to graph notations over certain string notations, where, for example, substrings of a SMILES string (which we describe in the next section), do not necessarily correspond to a valid graph. In other words, all subgraphs are interpretable whereas all substrings are not. Nonetheless, there are also disadvantages to working directly with the molecular graph representation for many applications.

Limitations of molecular graph representations

Breakdown of graph model.

There are many types of molecules which cannot be described by the graph model. This is any structure containing any form of delocalized bonds, such as coordination compounds, as well as any molecule containing any of the following: polycentric bonds, ionic bonds, or metal–metal bonds. For example, organometallic compounds such as metallocenes or metal carbonyl complexes are difficult to describe using molecular graphs because their bonding scheme cannot be explained by valence bond theory. In other words, it would be difficult to describe the bonds using only pairwise relationships between atoms.

Solutions to the handling of multi-valent bonds have been introduced via the use of hypergraphs ; in a hypergraph, edges are sets of at least two atoms ( hyperedges ) instead of tuples of atoms [ 37 ]. However, the use of hypergraphs is not further discussed here as they are not currently widespread in the field.

For molecules where the arrangement of atoms is constantly changing in 3D space, the graph representation might not be meaningful, especially if pairwise bonds are breaking and forming or if the structure is frequently rearranging. That is, for applications where one is limited to using a single static representation for a molecule that is in fact rearranging on the timescale of the problem (e.g. tautomers), then a single molecular graph representation would not be appropriate and could even be detrimental to solving the problem.

Challenges of working directly with the graph representation

Another difficulty of working directly with graph representations is that they are not compact (both memory-wise and literally). To represent a molecular graph one would need, for example: an image, a tuple of matrices, lists, or tables; all these representations are generally more difficult to search through than a more compact linear representation (compare this to a string encoding a structure ID). They also become more and more cumbersome the bigger the graphs get, and their memory requirement would increase with the square of the number of nodes, at least.

This is not a problem with linear notations, which build upon the graph framework to create more compact and memory-efficient representations for molecules [ 38 ]. Linear notations have the advantage that they can be, for example, entries in a table, as well as easily searchable (for identity search, not substructure search), when a matrix representation is not convenient.

Connection tables and the MDL file formats

Below we discuss two formats closely related to the molecular graph representation: connection tables and the MDL (now BIOVIA) file format.

Connection tables

Whilst graphs underlie the representation of molecules, the matrices by which they are described are not a compact representation, and scale as the square of the number of atoms. The connection table (Ctab) [ 39 ] is composed of six parts: (1) Counts line , (2) Atom block , (3) Bond block , (4) Atom list block , (5) Stext block , and (6) Properties block . Readers are referred to the referenced material for a detailed description of each component. The counts line is always the first line, and as such gives an overview of the structure by specifying the number of atoms, bonds, and atom lists, as well as the presence or absence of chirality. The version (V2000 or V3000) is also specified on this line. The atom block describes the identities of the atoms as a list with arbitrary index values, as well as the atomic symbols, mass differences, charge, stereochemistry, and associated hydrogens. Note that it is often practical to treat any hydrogens in a molecule as implicit —that is, not storing hydrogens as atom objects and instead implicitly defining the hydrogen count using a valence model. Treating hydrogens as properties of the heavy atoms rather than as explicit nodes significantly reduces the size of the atom and bond block, making the format more compact. These can be recalculated based on valence rules if required; in such cases, valency information must be given explicitly in the atom block. The bond block describes the connectivity of the atoms as well as the identity of the bonds connecting them. The atoms may be fully or partially connected by bonds, thereby supporting the description of fragments and unconnected atoms. The bond block is composed of the atom indices and bond types; the bond order is also provided as an additional column. There is no requirement for the bond block of the connection table to be ordered in a particular way. The two blocks are combined to form the core of the Ctab. Similarly to a matrix representation, the Ctab is extensible, meaning that lists describing supported properties may be added to the properties block. Notably, any entries associated with charges, radicals, or isotopes supersede those in the atom block, if present. As a result of backwards compatibility to previous versions and the prevalence programs utilising them, connection tables have become one of the standard formats for handling chemical structural information and underlie the widely used Molfile formats. It should be noted that connection tables are in themselves not a file format but are the core building block around which CT files are built.

The Molfile format

The Molfile format family developed by MDL are collectively known as CTfiles (chemical table files) as they use connection tables to describe molecular structures. In addition, CTfiles are highly extensible and as such have formed a series of file formats that have been widely used for chemical information transfer. The series is shown in Fig.  3 , which shows how the connection table is wrapped within the Molfile format, which can be subsequently wrapped into a structure/data (SD) file, containing both structural information and additional property data for any number of molecules.

figure 3

The MDL family of file formats are collectively known as CTfiles (chemical table files) as they are built upon connection tables (Ctab), shown at the top of the figure. The connection table is split into an atom and bond block, describing the atoms and their corresponding connectivity. The Ctab is built upon to form the Molfile for the description of single molecules, RGfile for handling queries, SDfile for structure and associated data, RXNfile for the description of single reactions, RDfile for either a series of molecules/reactions and their associated data, and the XDfile for the transfer of structure or reaction data based on the XML format

Similarly, the RXNfile contains the description of single reactions and the RDfile enables storage of reactions or molecules as well as their associated data. RGfiles on the other hand have been designed for handling queries and the XDfile is an XML based format for the transfer of structures or reactions along with their associated metadata. Further details about each of the files and their structure can be found in the MDL documentation and various textbooks introducing the field of cheminformatics [ 40 ].

Linear notations for small molecules

Matrix representations require a large amount of disk space and are not well adapted for basic cheminformatic analysis (i.e. generation of list of compounds, online query of compounds). As a result, molecules nowadays are often represented as strings of characters encoding the Ctab and that can be interpreted by systematic sets of rules. For example, using implicit hydrogen, representing d -alanine using a Molfile takes 612 bytes, while using linear notations such as SMILES or InChI, which are described in this section, takes 15 and 59 bytes, respectively. As mentioned above, linear notations have the advantage of being compact and easy to manipulate (e.g. to use as command-line option or copy in an Excel spreadsheet). The main linear notations introduced in this section are exemplified in Table  1 .

The IUPAC quest for a universal notation

Over time, the way scientists name molecules has varied following the capabilities and needs of the scientific community. In the ages of alchemy, compounds and elements were named based on their properties; for example, aqua fortis and sweet oil of vitriol referred to nitric acid and diethyl ether, respectively. In the nineteenth century, the need for a systematic nomenclature of organic chemistry grew stronger, and a terminology was developed by the International Union of Pure and Applied Chemistry (IUPAC) [ 41 ]. This terminology is described at length in the IUPAC Color Books [ 42 ] and is universally used in the literature, patents, and government legislation. Nonetheless, this nomenclature is not ideal for cheminformatics applications and, in 1949, the IUPAC requested an international standard for electronic chemical notations requiring 11 desirable properties or “desiderata” [ 28 ]: simplicity of use, ease of printing and typewriting, conciseness, recognizability, ability to generate a unique chemical nomenclature, compatibility with the accepted practices of inorganic chemical nomenclature, uniqueness, generation of an unambiguous and useful enumeration pattern, ease of manipulation by machine methods, exhibition of associations, and ability to deal with partial indeterminates.

According to the IUPAC formalism set in 1964 [ 28 ], notations can be classified as being unique (i.e. one notation for a given compound), non-unique (i.e. more than one notation for a given compound), ambiguous (i.e. the notation will regenerate more than one compound), or/and unambiguous (i.e. the notation will regenerate only the original compound). This formalism is used to describe notations in this section.

Although seven notations Footnote 3 were proposed to IUPAC as potential standards, only two retained the interest of the committee: the Dyson cyphering [ 43 ] and the Wiswesser Line Notation (WLN) [ 44 ]. Descriptions and related references for the remaining five notations can be found in two notable publications [ 2 , 45 ] which detail many chemical notations introduced until 1984, and in a report by Alan Gelberg [ 8 ]. After many revisions, Dyson’s notation, originally developed in 1947, was adopted in 1961 as an international notation by IUPAC. The Dyson cyphering was not very popular among the scientific community as it could not be handled on standard typewriters or ordinary punched-card machines and contained many arbitrary rules. The most used notation by the community was WLN, which was created in 1949 and did not present the drawbacks of the Dyson cyphering. For a detailed comparison between WLN and Dyson cyphering, we refer the readers to the Survey of Chemical Notations [ 28 ]. We do not provide further details on WLN and IUPAC-Dyson as the notations have fallen into desuetude; however, we feel that the competition between both notations illustrate the technological and technical aspects that were under consideration for the selection of a universal notation.

The advent of contemporary notations

Simplified Molecular Input Line Entry System

WLN requires an extensive knowledge and understanding of the notation’s rule. A more intuitive notation, the Simplified Molecular Input Line Entry System (SMILES), was developed in 1988 by Weininger et al. [ 9 ] and has been the most popular line notation ever since. SMILES notation system was then incorporated into the Daylight Chemical Information Systems [ 48 ] toolkit; the company is still currently maintaining it. The SMILES representation, non-unique and unambiguous, is obtained by assigning a number to each atom in the molecule and then traversing the molecular graph using that order; in the case of RDKit [ 49 ], the graph traversal algorithm used is depth-first search.

There can be multiple atom numberings for a given molecule, leading to different SMILES. SMILES can thus be enumerated for data augmentation [ 50 ]. The ensemble of SMILES representing one molecule can be referred to as enumerated or randomized SMILES and are obtained by, for each molecule, randomly selecting an initial node for graph traversal while keeping the same graph traversal algorithm, thus leading to different atom orderings [ 51 ]. For clarity, we emphasize that randomized SMILES do not use a random search to generate representations, they rely on a depth-first search. To avoid conflicting SMILES representations for the same molecule, a unique SMILES can be designated, and several canonicalization methods exist to this end [ 38 , 52 , 53 ]. A schematic illustrating the difference between two SMILES variants is shown in Fig.  4 .

figure 4

Canonical ( a ) and randomized ( b ) SMILES representations of aspirin. Randomized SMILES correspond to the various representations of a molecule obtained by randomly selecting the starting node in the graph traversal algorithm, thus changing the order of the nodes traversed in the molecular graph (still using depth-first search). Numbers represent the order of graph traversal, where 1 is the initial node (user defined). Considering a as being the canonical representation of aspirin, b shows a different ordering of the atoms of the molecule. The final SMILES is one possible SMILES among all the randomized SMILES which can be generated. Green arrows indicate how the molecular graph is traversed. Both SMILES strings shown represent the same molecule but, as the atom numberings are different, the generated SMILES strings are, too. The original figure can be found in [ 47 ]

Initially, SMILES did not encode for stereochemistry. A specification, referred to as isomeric SMILES, was introduced later on and is now the default SMILES in many software. SMILES can thus encode isomeric specifications, configurations around double bonds (Z or E), and configurations around tetrahedral centres as well as many other types of chiral centres which are rarely supported (e.g. allene-like, octahedral). Nonetheless, a problematic set of structures to describe using SMILES notation is those which cannot be easily described using molecular graphs (see “ Limitations of molecular graph representations ” section), such as organometallic compounds and ionic salts.

Generally, if the total sum of bond orders is not equal to one of the standard valences for a given atom in a molecule, this is addressed in the corresponding SMILES notation using square brackets. When the molecules involved in bonding are also aromatic, lowercase tokens may be used, though this becomes problematic with some cheminformatics software which do not allow “extra” bonds for aromatic atoms [ 54 ]. ChemAxon Extended SMILES (CXSMILES) can overcome some of these issues by storing special features [ 55 ]. These are stored after the SMILES string separated by a space or tab and can be ignored when parsing SMILES if necessary. In addition, several fields can be stored for any given SMILES string. One such feature is fragment grouping, which specifies which components are grouped together using a list of fragment indexes; this aids in the grouping of ions and salts. Additional specifications of ligand order and coordinate bonds aid in the description of organometallic compounds and are supported by CXSMILES. The atom-to-atom coordinate bonds are represented by single bonds in the SMILES but corrected by the additional information CXSMILES provides in the extension.

The OpenSMILES [ 56 ] specification was developed in 2007 to provide a SMILES standard form and to clarify some interpretations of corner cases present in the Daylight’s SMILES system. A major issue with Daylight’s SMILES is that its canonicalization algorithm is proprietary, and as such implementations vary between companies and research teams. A novel open source method to generate canonical SMILES was developed in 2012 [ 38 ]. Such SMILES are generated using the canonical label provided by the InChI representation [ 10 ], which is described later in this section. Using such “universal” SMILES seeks to facilitate the comparison between chemical models used by different toolkits.

SMILES Arbitrary Target Specification (SMARTS)

There exists an extension of SMILES developed for substructure searching, named SMILES Arbitrary Target Specification (SMARTS) [ 57 ]. In SMILES there exist two types of symbols to signify atoms and bonds which describe the underlying connectivity of a given molecular graph. However, in SMARTS, the available symbols allow for a more general specification of the molecular graph. This can be likened to the use of regular expressions in computer science. Classical SMARTS can describe an ensemble of molecules that differ at one atom or bond position. It is also possible to include logical operators such as “OR” and “NOT”. Contrary to SMILES, SMARTS can specify different isotopes or bonds types (aromatic or aliphatic). Detailed information about an atom environment can be given using Recursive SMARTS (e.g. ortho, meta, or para substitution patterns in arenes). All SMILES can be valid SMARTS, however the reverse is not true, and decoding a SMILES as a SMARTS will generally not yield the same decoded pattern.

International chemistry identifier

The best example of open-source canonical notation is the InChI (International Chemistry Identifier) representation [ 10 ], which was introduced in 2006 by NIST, under the auspices of the IUPAC, as a standard and freely available formula representation. InChI are composed of multiple layers, such as the Main , Charge , Stereochemical , and Isotopic layers, to name a few, which are themselves constituted of sublayers. For example, the Main layer is composed of the Chemical formula , Atom connections , and Hydrogen atoms sublayers (Fig.  5 ).

figure 5

InChI notation of aspirin. Red letters are the standard beginning of the notation. The following 1 corresponds to the InChI version number, and S states that the notation is a standard InChI. Slashes (blue) are delimiters

A hashed version of the InChI, the InChIKey, is used for open-web searching and library searching [ 58 ]. The first block of an InChIKey represents the molecular skeleton, and the second block encodes for isomerism. InChIKeys are designed to be unique representations of their corresponding parent InChI representations. However, an InChIKey can sometimes map to more than one InChI, the situation being referred to as an InChIKey collision [ 59 ]. Unlike SMILES, InChIs are not guaranteed to be decodable back to the molecular graphs from which they originate, and SMILES have the advantage of being more human-readable. For a detailed overview of the applications of InChIs and the underlying algorithm, readers are referred to the works of Heller [ 10 ] and Warr [ 60 ].

Using chemical descriptors to represent molecules

The representations presented above are atom-based, meaning it is possible to rebuild the molecule based on the representation. There exist, however, other types of notations which, rather than encoding the exact structure of a compound, instead encode the physicochemical, structural, topological, and/or electronic properties of a compound. These are referred to as molecular descriptors [ 11 ] among which two main classes are structural keys and the hashed fingerprints. Descriptors are unique and ambiguous notations widely used in cheminformatics, and their complete descriptions would require a review of their own. A non-exhaustive list can be found in association with the Dragon software [ 61 ], which can calculate 4885 descriptors.

Structural keys

Structural keys are bit strings, encoding for the absence (using a 0) and the presence (using a 1) of a specific chemical group. To provide a general understanding of the structural keys concept, we present here a few widely used keys.

MACCS Keys The first set of keys is referred to as MACCS (Molecular ACCess System) keys or MDL keyset [ 62 , 63 ] and is frequently used for similarity searching. In MACCS keys, each bit indicates the presence or absence of a particular structural fragment. Many variants of the MACCS keys exist [ 64 ], with the most commonly used being 166 and 960-bits long. thus encoding for the presence or absence of 166 and 960 structural fragments. It should be noted that there are many software implementations of the 166-bit MACCS keys, thus one should be cautious as one substructure will not be assigned to the same bit from one software to the other.

CATS For application in scaffold hopping, a topological pharmacophore descriptor, Chemically Advanced Template Search (CATS) [ 65 ], was developed. It can encode for six potential pharmacophore points: H-bond donor/acceptor, positively/negatively charged, aromaticity, and lipophilicity.

Hashed fingerprints

Chemical fingerprints are vectors which contain indexed (ordered) elements encoding for physicochemical or structural properties. Hashed fingerprints differ from other descriptors by the fact that each feature is generated from the molecule itself, while in keys, patterns are pre-defined. Their lengths can be set prior to their generation and a hash function assigns molecular patterns to (non-unique) bits, hence the name. Topological or path-based fingerprints are represented by Daylight fingerprints, which usually consist of 512, 1024, or 2048 bits. The Daylight fingerprint encodes for every connectivity pathway within a molecule up to a given length. Circular fingerprints are representations of chemical structures by atom neighbourhoods and have been widely applied in Quantitative Structure–Activity Relationship (QSAR) analysis. A widely used class of circular fingerprints is ECFP (Extended Connectivity Fingerprints) [ 66 ], based on the Morgan algorithm. In ECFPs, heavy (i.e. non-hydrogen) atoms are encoded into multiple circular layers up to a given diameter.

Whether fingerprints can be called a chemical notation per say is debatable and comes down to a matter of opinion between experts. Regardless, chemical fingerprints are widely used in cheminformatics and drug discovery as they provide a quick and direct mapping from a graph to a vector representation that can be used as input to numerical models, such as QSAR models. It should be noted that fingerprints are flexible representations and can also encode physicochemical properties as integers (e.g. the hydrogen count) and floats (e.g. molecular weight).

Representations for chemical reactions

Harnessing reaction data for drug discovery.

Chemical reactions represent the interconversion of one set of molecules into another related set, under a set of specified conditions. A vast body of reaction data has been amassed to date, with approximately 127 million reactions recorded from 1840 to the present day according to the Chemical Abstracts Service (CAS) [ 67 ].

In recent years, there has been a resurgence of interest in the development of models for the prediction of outcomes of chemical reactions, synthetic routes, and analysis of reaction networks, to name a few application areas. For a more comprehensive coverage of the representations of chemical reactions in databases and computer-aided synthesis design, we refer the reader to a review by Warr and the bibliography therein [ 68 ]. In addition, for a more comprehensive coverage of applications within autonomous discovery, we refer the reader to an extensive review by Jensen et al. [ 69 ].

Many of the representations described in the previous sections natively allow for, can be extended, or have analogous representations for describing chemical reactions. As the description of reactions is that of a set of molecules, limitations in each of the previously described representations are inherited in the description of chemical reactions. A reaction is often represented graphically with the reactants written to the left of a reaction arrow , and a set of resulting products written to the right of this arrow. The conditions under which the transformation occurs are written above or below the arrow, including information such as reagents, catalysts, solvents, temperature, and so forth. The graphical illustrations of reaction schemes often found in publications are, however, not easily machine-readable. Therefore, there exist a series of reaction data exchange formats that enable reactions to be represented in a machine-readable format. There is no inherent requirement for one format or another, as this is dependent on the application, toolkit, or software package used. Commonly used formats include the RXN and RD files described in an earlier section.

Reaction SMILES and SMIRKS

The SMILES format used for describing molecules has been extended to so-called Reaction SMILES by Daylight Chemical Information Systems. Each molecule in the reactants, agents, and products is represented by a SMILES string, and disconnected structures are separated by a period; this includes the individual molecules, ions and ligands, which are listed in an arbitrary manner. Reactants, agents, and products are separated by either the ‘>’ or ‘ ≫ ’ symbol (the latter used when agents are not given). Atom-mappings (i.e. mappings of atoms in the reactants to their equivalent atoms in the products) can be stored in Reaction SMILES as a non-negative integer following the character ‘:’ within an atom expression. Atom mappings do not apply to agents. Furthermore, the storage of additional textual information such as the reaction centre (i.e. the atom and bonds that change during a transformation) or reaction conditions is not supported. Nonetheless, formats such as the RXN and RD file formats, especially the latter, can store this additional metadata, as can other file formats or databases.

SMIRKS belong to the same family as SMILES and SMARTS. Where SMARTS describe molecular patterns or substructures generically, SMIRKS patterns can be used to define generic reaction transformations. They can be used to describe the reaction centre, to enumerate virtual libraries, and to form the knowledge base for reaction and retrosynthetic prediction systems. If one considers that a reaction is a set of atoms and bonds that change during a reaction and the reactant or substrate upon which that change occurs, then SMIRKS must encode the same set of atoms and bonds that change during the reaction, and the site at which that change occurs in the substrate as specified by a SMARTS pattern. The SMARTS pattern is used to specify both the site at which the atom and bond changes occur, and to capture any indirect effects that may influence the reaction. The atomic expressions must be defined such that (a) for any part of a molecule that is to be considered in a generic transformation for which the bonding does not change, SMARTS are to be used, and (b) in cases where bonds change, SMILES are to be used. In this sense, SMIRKS is a hybrid approach between SMILES and SMARTS. There are some rules that must be followed in order to ensure that SMIRKS patterns can be applied. The two sides of the transformation, the reactant(s) and product(s), must contain the same number of mapped atoms, and they must correspond on either side of the reaction. Additionally, any explicit hydrogens must appear explicitly on either side of the reaction and have corresponding atom mapping numbers. SMIRKS are converted into a reaction graph for their subsequent use. The reaction SMILES and corresponding SMIRKS are shown in Fig.  6 .

figure 6

A selection of representations for a simple esterification reaction. The atom mapped reaction is shown in the top left as a structural diagram. The atom maps are consistent between reactant and product as shown. The atom maps in the SMIRKS do not correspond to the atom maps in the full reaction. Rather, they are used to keep track of the atoms within the SMIRKS. The condensed reaction graph and corresponding signature was generated using CGRtools [ 73 ]

An extension of the InChI, RInChI [ 18 , 70 ], was developed between 2008 and 2018 and introduced a unique, order invariant identifier for reactions. It was developed in response to the growing size of reaction data to aid reproducibility, to consider more information than just the participating molecules, and to provide enough information such that practically identical reactions would be represented the same way. RInChI grammar, however, is relatively more complicated than that of Reaction SMILES.

RInChIs use InChIs to describe each molecule. Where InChIs cannot be generated for a molecule, the RInChI tracks the number of “structureless” entities that are present in each of the reactants, agents, and products. In addition to specifying each molecule and reaction role, the RInChI must include information about equilibrium, unbalanced, or multi-step reactions. The RInChI employs a layering system, whereby each layer can describe a different aspect of the chemical reaction. Solvents and catalysts may be accounted for in a similar manner as in Reaction SMILES; however, RInChIs additionally allow for the direction of the reaction to be described. This is particularly useful, as different labs may conduct the same reaction under slightly different conditions, potentially reaching different conclusions about the direction of the reaction. The RInChI generated in this case would be the same, except for the direction flag. This aids in the identification of reactions that are in practical terms identical.

A proposed further extension to RInChI, ProcAuxInfo, enables the storage of metadata relating to yields, temperature, concentration, and other reaction conditions [ 71 ]. RInChI offers an alternative to Reaction SMILES that enables the identification of duplicate reactions, as the order in which molecules are listed in Reaction SMILES is arbitrary. Hashing the RInChI to yield the RInChI key provides a powerful tool for efficiently indexing and searching reaction data [ 18 , 71 ]. However, there is no SMARTS or SMIRKS equivalent for RInChI, limiting its use in substructure searching and in encoding generic chemical transformations. The RInChI and corresponding keys are shown in Fig.  6 .

Condensed graph of reaction (CGR)

Varnek and co-workers have developed the CGR approach [ 19 ], whereby molecular structures are encoded in a matrix containing the occurrence of fragments of a given type. The CGR is a superposition of the reactant and product molecules, and additionally defines what atoms and bonds have changed as well as their properties. This builds on the description of organic reactions using imaginary transition states as described by Fujita [ 72 ]. In analogy to SMIRKS, the CGR can be used to describe a reaction transformation. An example CGR is shown in Fig.  6 .

With the renewed interest in chemical reactions within cheminformatics in recent years, Varnek and co-workers have developed an open source toolkit enabling the wider use of CGR [ 73 ].

Bond electron matrices (BE-matrix)

To exemplify the representation of reactions as matrices, the bond-electron matrix developed by Dugundji and Ugi was previously employed for reaction classification and has also been used as an inspiration for the representation used in programs such as the Elaboration of Reactions for Organic Synthesis (EROS) [ 74 ], and the Workbench for the Organisation of Data for Chemical Applications (WODCA). The BE-matrix is an N by N matrix, where N is the number of atoms in a molecule, and the diagonal entries specify the number of free valence electrons. The off-diagonals specify the bond orders between atoms as found in the bond matrix. The reaction is represented by an “R-matrix” which corresponds to bond changes or changes of non-bonded valence electrons. Positive values indicate bond formation, whereas negative values indicate bond breakage. Adding the “R-matrix” to the BE-matrix of a reactant gives the BE-matrix of a product. The “R-matrix” is therefore an alternate method for representation of the reaction centre [ 20 ]. The BE-matrix illustrates the concept of adding additional information into the matrix representation.

Hierarchical organization of reactions through attribute and education (HORACE)

HORACE [ 21 ] employs a machine learning algorithm for the classification of chemical reactions and is mentioned here because of the hierarchical description of chemical reactions that it uses. It was developed to describe specific reaction instances as well as abstractions of reaction types. Three levels of abstraction are employed. The lowest level describes the partial order of atom types, which gives an explicit hierarchy at the atom level by specifying the degree of similarity between atoms. Following the atom level description is the structural level description. This uses a list of functional groups as structural features by which to characterize individual molecules. The structural characterization is then used to specify which molecules correspond to atoms in the reaction centre. The highest level of abstraction specifies physiochemical properties, which describe the function of the corresponding structure. The hierarchy therefore enables a richer description of a chemical reaction than a purely structural one (as with SMILES).

InfoChem CLASSIFY

The approaches used by Saller and co-workers to represent reactions underlies [ 75 ] and has inspired many of the approaches used for rule-based synthesis planning [ 76 , 77 ]. The first step is to identify and extract the reaction centre, defined as a set of atoms that have changed their number of implicit hydrogens, valency, number of π-electrons, atomic charges, or if at least one connecting bond belongs to the reaction centre. Bonds are defined as belonging to the reaction centre if they are made or broken. In order to identify such changing atoms and bonds, a mapping is used to identify equivalent atoms in the reactants and products.

Regardless of the representation used, a key problem in the representation of chemical reactions is the identification of the reaction centre. One approach to reaction centre detection and atom mapping is finding the maximum common substructure (MCS) between reactant and product molecules. The determination of the MCS is an NP-complete problem, meaning that the solution is non-deterministic in polynomial time. Several reviews discuss these approaches and have been referenced for the interested reader [ 78 , 79 , 80 ].

Having identified the reaction centre, atom hash codes are calculated for all atoms belonging to the reaction centre using a modified Morgan algorithm [ 53 ]. The hash codes include the following atom properties: atom type, valence state, total number of bonded hydrogen atoms (implicit and explicit), number of π-electrons, aromaticity, and formal charges as per the reaction centre definition. The hash codes generated for each atom in the reaction centre are summed for all reactants and one product of a reaction to provide a unique representation of the reaction centre.

The description of the reaction centre can be extended to include the neighbouring chemical environment depending on the level of specificity required (Fig.  7 ). The reaction centre alone corresponds to a “broad” or more general description of the reaction, whereas inclusion of alpha atoms (atoms adjacent to those in the reaction centre) corresponds to a “medium” description of the reaction centre. Expanding the description to include the next set of adjacent atoms “narrows” the description of the reaction owing to increased specificity. The generated hash codes have been used in reaction classification and the approaches for reaction centre extraction utilized in a variety of synthetic planning tools.

figure 7

Atomic environments included in the description of the reaction centre. The reaction centre is used in calculations of atom hash codes for varying degrees of specificity

Reaction fingerprints

Reaction fingerprints are vector representations of reactions. They specifically represent the structural changes taking place in the reaction centre. This information is captured by constructing fingerprints, such as the ECFP variant described previously, and taking the difference between the product and reactant vectors, optionally considering the agent. Schneider et al. [ 81 ] have used the difference fingerprint with the atom-pair variant to build a machine learning system for a 50-class reaction classification model. A similar approach to the computation of reaction vectors was described by Patel et al. [ 82 ] and has been used in de novo design and classification approaches [ 83 ]. The reaction fingerprint highlights an alternate approach to reaction centre detection and representation; however, it cannot be easily converted to a reaction graph. Lastly, the handling of stereochemistry has not been mentioned but is an active area of research [ 84 ].

Representations for macromolecules

While there have been many advances in the representation of small molecules, in comparison very few studies [ 85 ] have addressed the representation of macromolecules, which are polymeric structures. In this section, we present representations made for biopolymers and bio-oligomers, like proteins and oligosaccharides, and synthetic polymers. The process of representing macromolecules can be hindered by the fact that, while many polymers are monodisperse (i.e. constituting monomers have the same chain length), others can be polydisperse, such that their stochastic nature results in an undefined chain length. Examples of macromolecules and their notations are shown on Fig.  8 .

figure 8

Example of linear notations for different types of macromolecules. Cyclosporin is an immunosuppressant medication and natural product. Lactose is a disaccharide used in the food industry. Insulin is a peptide hormone which regulates the metabolism of carbohydrates, fats, and protein. pHEMA or poly(2-hydroxyethyl methacrylate) is a polymer that forms hydrogel in water. Copolymers of pHEMA are used to make contact lenses

This section places itself at the interface between bioinformatics and cheminformatics. While in cheminformatics small molecules are described on the atomic-level, in bioinformatics, polymers, such as a proteins or polynucleotide molecules, are more commonly defined using their nucleotide and amino acid sequences. Representations which combine atomic and sequence information are presented here.

Amino acid-based structures

The building blocks of peptides and proteins are amino acids (AAs), which are each made up of an amine group, a carboxyl group, and a side chain specific to each AA. AAs are commonly represented by a one-letter symbol, which commonly imply the L configuration for chiral AAs, or a three-letter abbreviation [ 86 ]. A limitation of the one-letter symbol is that while the Latin alphabet is large enough to describe the 20 AAs which appear in known genetic code, there are far more naturally occurring AAs.

Peptides are chains of 2 to 50 AAs linked by peptide bonds. They can be antibiotics, immunosuppressants, or antitumor agents. This broad range of biological activity sparked the interest of the community.

A method named CHUCKLES [ 12 ] was developed in 1994 to infer SMILES of polymers from their sequences and vice versa. In cheminformatics, this method is particularly useful in inferring the SMILES from the peptide sequence, which is referred to as Forward Translation (FT). In FT, monomers sequences and SMILES are stored in a lookup table, with the SMILES excluding any atoms which would be involved in monomer bonding. For linear structures, SMILES corresponding to each residue are concatenated. In branched and cyclized structures, monomer indices are mapped to the SMILES, thus encoding for structures such as disulfide bridges. CHUCKLES is applicable to oligomeric structures and is used in BIOPEP-UWM [ 87 ]. An extension of CHUCKLES, CHORTLES, was designed to handle oligomeric mixtures. Two notations are well known for their ability to describe a broad range of macromolecules: the Hierarchical Editing Language for macromolecules (HELM) [ 14 , 88 ] and the Self-Contained Sequence Representation (SCSR) [ 89 ]. Both representations were developed concurrently, the first one relying on SMILES and the second one on the v3000 Molfile format. SCSR was developed by BIOVIA which provides automated interconversion between HELM and SCSR.

In the following lines, we provide further details on HELM, which was developed by Pfizer under the auspices of the Pistoia Alliance. The objective of the project was to design a system representing combinations of component structure types (e.g. peptides, antibodies, chemical modifiers). An example of HELM notation is shown in Fig.  9 . Initially, HELM was limited to well-defined structures; however, HELM2 overcame this limitation and can describe polymer mixtures and free-form annotations. HELM represents monomers in a SMILES-like format, simple polymers using a simplified version of CHUCKLES and complex polymers using graphs. Its structure hierarchy follows the granularity of the structures: Complex Polymer, Simple Polymer, Monomer, and Atom. HELM is implemented in many pharmaceutical companies [ 90 ], in public databases (in 2016, ChEMBL21 contained 20,000 peptides annotated with HELM [ 91 ]), as well as in various packages and software such as RDKit (limited to peptides), ChemDraw, the Biomolecule Toolkit, ChemAxon, and Sugar&Splice, which can all encode for peptides, DNA, and RNA.

figure 9

Graph and HELM representation of a biphalin analog. Amino acids are coloured coded as followed: blue, green, red, and pink for tyrosine (Y), alanine (A), glycine (G), and phenylalanine (F), respectively)

In comparison with purely atomic-based notations such as SMILES, biocheminformatics representations can facilitate the development of modified drug peptides. For example, the substitution of natural L-AA with D-AA can improve the oral bioavailability of a peptide [ 92 ]. Such modifications would be intuitive with HELM, which provides readability at the polymer level, whereas SMILES provides descriptions on the atomic level. While these methods constitute a step forward to a better understanding and unification of cheminformatics and bioinformatics, errors in the translation of peptide notation from biological into chemical language have been detected and practical solutions proposed [ 93 ].

Proteins are polypeptides made up of 50 or more amino acids. They are generally biological targets; however, therapeutic protein drugs have been engineered [ 94 ]. The largest repository of 3D structures of proteins and nucleic acids is the Protein Data Bank (PDB) [ 17 ], which contains more than 150,000 structures. PDB entries contain the atomic coordinates of every atom in a protein structure as well as solvent molecules (if applicable). Each atom is identified by a sequential number, a specific atom name, the name and number of its corresponding residue, a one-letter code specifying the chain, its spatial coordinates ( x , y , and z ), and an occupancy and temperature factor. Furthermore, any notations mentioned in the Peptide section can be used for protein representations. In 2008, the Protein Line Notation (PLN) [ 16 ] was created by Biochemfusion and is implemented in PubChem. Pseudo-atoms were used to represent a simplified version of a residue structure, which enabled a lossless conversion between chemistry and sequence formats.

Key macromolecules

Most drugs are small organic molecules. However, drugs can also be macromolecular in nature, such as glycans (also referred to as carbohydrates) [ 95 ] or synthetic polymers [ 96 ].

Oligosaccharides and polysaccharides are glycans containing more than 3 and 20 monosaccharides (the smallest sugar unit), respectively. In drug discovery, glycans are interesting as receptors, small molecule glycomimetics, therapeutic glycopeptides, and vaccines. Glycan databases are used by carbohydrate researchers and structures are generally recorded using monosaccharide-based notations [ 97 , 98 , 99 , 100 ]. These representations do not allow for the analysis of the interactions between glycans and proteins using docking techniques, which require atom-based representations. Converter tools have been developed [ 101 , 102 ] that translate these notations to atom-based representations. With the aim of creating a linear and unique notation for glycan data, compatible with the usage of the semantic web, the Web3 Unique Representation of Carbohydrate Structures (WURCS) [ 15 ] was developed and combined bioinformatics and cheminformatics features. The newest version of WURCS [ 103 ], used by the International Glycan Structure Repository GlyTouCan [ 104 ], encodes the following features: the main carbon backbone of a monosaccharide residue, the backbone modifications (i.e. atoms belonging to a monosaccharide which are not part of the backbone), and the linkage information between the backbone and a modification. The notation provides explicit anomeric information and can handle ambiguous monosaccharide structures (e.g. unknown ring closure or anomeric information). Currently, WURCS is implemented in many databases but unsupported by most cheminformatics software.

Independent representations have been developed to address specific challenges. Pillong and Schneider [ 105 ] published a representation of monosaccharides based on pharmacophoric properties. Bojar et al. [ 106 ] developed a language model based on natural language processing (NLP) providing information on glycans connectivity and composition.

Polymeric drugs

In the context of drug discovery, polymers are primarily used as drug deliverers. Nonetheless, some polymers have been used as active ingredients. Recently, the BigSMILES [ 107 ] syntax was introduced to encode for homopolymers, random- and block co-polymers, and molecules with different degrees of complexity in connectivity, such as linear polymers, ring polymers, and branched polymers. The stochastic unit of these polymers is identified by a pair of curly brackets. Repeated units are delimited by a comma and listed inside these brackets. Although BigSMILES are currently not canonical, a canonicalization scheme is under development. No application of this notation is available to this date; however, the development of polymeric drugs is expected to flourish [ 96 ], and ML models could be applied to aid in related studies.

Graphical representations for molecules and macromolecules

The representations presented in the previous sections are designed for the storage and the cheminformatics analysis of compounds. In this section, representations which are made for direct visualization of compounds or/and their physicochemical properties are introduced.

2D depictions

Molecules as raster or vector images are very often represented by their skeletal structures, which are referred to as 2D depictions (Fig.  10 a). Many difficulties can be encountered when generating 2D depictions related to the layout (e.g. orientation, overlap) and rendering (e.g. font, abbreviations, atom labels alignment) of the image. In 2008, the IUPAC issued recommendations for the standard display (typography, orientation of structure, etc.) of 2D depictions [ 108 ]. Such obstacles are overcome by a range of algorithms; however, as of now, none of them can perfectly display every chemical structure. This was exemplified in 2008 in a comparative study of 2 proprietary toolkits (Cactvs [ 109 ], used by PubChem, and Molinspiration [ 110 ]), and 3 open-source toolkits: RDKit, OASA [ 111 ], and CDK [ 112 ]. In 2017, improvements were done in CDK to depict stereochemistry more accurately and to solve atomic overlap [ 112 ]; on the latter point, the algorithm went from a heuristic approach to a refinement process. For more up-to-date details and examples of 2D depiction algorithms and their limitations, we refer the readers to a 2016 presentation by John Mayfield [ 113 ].

figure 10

Examples of various molecules drawn using different display types. b – d Generated with Avogadro [ 32 ]. a Skeletal structure of the Fe-porphyrin subunit of haem B. b Ribbon diagram of haemoglobin. c Space-filling model of the Fe-porphyrin subunit of haem B. d Ball-and-stick model of the Fe-porphyrin subunit of haem B. Note the different orientations. e 2D visualization of protein–ligand interactions (PDB code: 2HPS). Reprinted with permission from [ 115 ]. Copyright 2020 American Chemical Society. f 3D visualization of protein–ligand interactions (PDB code: 6KYA)

Apart from the 2D depictions of the structures themselves, molecules can be depicted in various ways for reactions and interaction studies (Fig.  10 e). In the latter, the aim is to investigate the environment or the behaviour of a molecule rather than its structure. A specific 2D depiction worth mentioning is the Markush structure, especially useful in patents, which depicts a specific series of compounds. A Markush structure possesses a fixed core with one or several variable parts which can be described by -R groups, bonds, atoms, etc. For macromolecules, different types of depictions are needed as the visualization often focuses on the polymer or peptide structures rather than the atomic structure. Associated to HELM, the Pfizer Macromolecule Editor (PME) was developed to visualize polymer structures and calculate molecular properties. A notable nomenclature for the depiction of glycans is provided by the Consortium for Functional Glycomics [ 114 ].

3D depictions

Before the advent of computers, various molecular models were developed to visualize and manipulate molecules in 3D and were built by assembling balls and sticks made of materials such as plastics or metals. Nowadays, while physical molecular modelling kits are still used in educational environment to represent basic structures, visualization software has become a tool of choice for 3D graphical displays of molecules (examples of visualization software have been provided in the subsection Introduction to the molecular graph representation ). The software Avogadro, PyMOL, and VMD all offer the popular representations ball - and - stick , cartoon , and van der Waals (vdW), as well as many independent representations. Each representation is useful for the visualization of specific properties, be it the structure coloured by atom types (Fig.  10 d), the secondary biological structure (Fig.  10 b), or the space-filling vdW spheres (Fig.  10 c). The vdW spheres help visualize the surface on which the molecule can build interactions; using this depiction, interactions between proteins and ligands can be visualized in 3D (Fig.  10 f). 3D depictions are especially useful in docking and mechanistic studies, while 2D depictions are standard in structure–activity investigations.

AI applications within drug discovery using molecular representations

Most of the representations we have discussed above have seen widespread use within the fields of drug discovery and artificial intelligence. If there are any molecular representations which the reader feels were not discussed, it is because we, the authors, were not aware of the widespread use of that representation within our specialized fields of research. The omission of some representations may have been intentional, or it may have been due to the fast rate of change and developments in this field as the availability of useful datasets for drug discovery applications grows. Lastly, several concepts that were historically used may see a resurgence as they are adapted to suit current methods.

Despite this being a review of molecular representations, many of these representations are themselves used in representation learning applications within deep learning. Representation learning is the idea of learning an internal representation (e.g. a vector) for a given object (e.g. a molecule) and then using that internal representation for a predictive task. These internal representations are learned, meaning models can be trained to create them using classic techniques such as backpropagation in neural networks. With representation learning tasks, it is key to first identify a suitable input representation of a molecule that contains as much of the desired/necessary information to solve a problem as possible. Of the applications described below, any using deep neural network (DNN) architectures are essentially carrying out representation learning tasks, whereas classical ML methods such as random forests (RFs) and support vector machines do not operate by learning internal representations.

With the development of graph neural networks, a wave of recent work in drug discovery has focused on using the molecular graph representation directly for both property prediction and de novo design. As such, the molecular graph representation can be used for various applications within AI, and there is a large body of work discussing its use for molecular property prediction [ 26 , 116 , 117 , 118 , 119 ], and, more recently, molecular graph generation [ 120 , 121 , 122 , 123 , 124 , 125 ] and synthesis prediction [ 126 ]. In most cases this is done through graph representation learning, by which a graph embedding is obtained from the full graph representation using a graph network [ 127 , 128 ]; the learned graph embedding can be used as input to a property prediction model, such as a RF or DNN, in the same way a classic molecular fingerprint [ 66 , 129 , 130 ] is used. Until recently, more compact linear notations such as SMILES strings were favoured for many ML applications involving molecules, in part due to the larger memory requirement of molecular graph representations; this is, however, slowly changing. For two excellent reviews of deep learning applications in chemistry and drug discovery, we recommend [ 26 ] and [ 131 ]. For a good review on molecular generative models using AI, we recommend [ 132 ].

Another popular use of molecular graphs both within and outside drug discovery lies in part outside AI, where graphs are used as input for atomistic simulations (e.g. molecular dynamics) where atomic coordinates and periodic boundary conditions are used as the starting point for program-specific file formats which contain not only all atom coordinates, but also detailed bond information (e.g. bond length, dihedral angles, torsional angles) necessary to calculate the energy of a given molecular configuration using force fields. As such, the molecular graph representation has widespread use in molecular dynamics applications within drug discovery, such as docking, protein folding, and free energy perturbation calculations. These applications have been assisted by recent developments in AI [ 133 , 134 ].

Popular applications of linear notations such as SMILES and molecular fingerprints are in molecular property prediction and QSAR. SMARTS patterns have been used to define substructures with the aim of selecting or eliminating associated compounds [ 135 , 136 , 137 ]. Additionally, the use of string representations such as SMILES has seen a lot of unexpected success in the field of de novo molecular design using tools from NLP. Data augmentation can be done for many applications using randomized SMILES [ 51 , 138 ]. String representations have also seen success in property prediction using the learned latent space representations obtained using autoencoder frameworks [ 139 ]. As mentioned above, many of the aforementioned neural network models work by learning a vector representation for molecules in the training set, and using that learned representation to predict properties [ 116 , 118 , 140 ]; this is analogous to the older use of hashed fingerprint representations for molecular property prediction using traditional ML approaches, hence the term learned fingerprints .

Common applications of the reaction representations are in retrosynthesis and reaction prediction. This is an important field of research, as the synthesizability of proposed compounds is key to computational drug design and having suitable retrosynthesis tools would allow scientists to “close the loop” of AI-driven drug discovery. Many of these applications are also discussed in [ 26 , 69 ].

Representations for macromolecules

Popular applications of the macromolecular representations introduced in this work are in protein structure prediction, as having an accurate picture of a protein and the role it plays in a given disease can help scientists to develop molecules for the right target. Pillong and Schneider [ 141 ] successfully applied their pseudo-receptor model in a virtual screening study aiming to identify aminoglycoside scaffolds with antibacterial potential. The interactions between glycans and proteins have been investigated [ 142 , 143 ] using ML. An important field of investigation linked to glycans is the prediction of glycosylation sites. Many tools were developed to infer such predictions and have been applied recently in the pipeline for the prediction of oncology drug targets [ 144 ] and the characterisation of the novel coronavirus (2019-nCoV) [ 145 ].

Graphical representations for molecules and macromolecules

We previously showed how the process of visualizing molecules has become faster, more practical, and more enjoyable thanks to better computational tools. This process is still an important field of research for which virtual reality and 3D printing techniques have been developed. Moreover, as the need for harvesting the large amounts of published data grows, the demand for methods for easily mining structures from papers and patent data is also growing. Optical Character Recognition (OCR) systems, relying on a variety of ML and probabilistic pattern recognition techniques, were created to translate 2D depictions of chemical structures to standard chemical representations [ 146 , 147 , 148 ]. Nonetheless, the development of OCR systems can be hindered by the images’ resolutions, the computational interpretations of chemical abbreviations, and the nature of the image representation, which can be embedded in text, in figures containing multiple structures, or in reaction pathways, and can be represented as either a skeletal formula or a Markush structure.

At this point, it might become clear to the reader that many applications within drug discovery require multiple representations to be used simultaneously to solve a problem. For example, in protein structure prediction, one might start with the protein sequence, create a rudimentary 3D model of the structure, and then use advanced molecular dynamics methods to understand how the protein folds and what the final configuration/structure for the protein might be. The coordinates of the optimized protein structure (e.g. a PDB file) might then go on to be used in docking calculation, etc. Technical aspects may factor into the choice of representation(s) a researcher might make, such as the complexity of the method(s) for generating the representation(s), and if they are openly accessible.

It is interesting to note that some representations have held the test of time better than others. This can be partly explained by the evolution of computer technologies, which have improved in terms of storage capabilities, processor qualities, and parallel programming capabilities. Standard representations such as IUPAC-Dyson and WLN were sensible during their times and were made to be manipulated by humans, but difficult to work with on a computer. Computationally simpler representations are now frequently used. Furthermore, detailed representations which require greater computational time to compute (compared to molecular string representations) can nowadays be used; this is the case for hashed fingerprints. Another possible explanation for the endurance of certain molecular representation is that they are more human readable than others, and thus have been better received by the cheminformatics community. Lastly, another reason why some notations persist and others do not is that within different fields and subfields (e.g. cheminformatics, bioinformatics, or AI), different notations are often preferred for either historical or continuity reasons within groups.

Conclusions

Molecules are complex structures and their representations must account not only for a wide variety of properties, such as stereochemistry and valence, but also for the different nature of these small molecules and macromolecules. The rise of cheminformatics and bioinformatics has led to a faster and more efficient drug discovery process as well as to a better understanding of molecular behaviour. In this review, we presented various popular notations and representations for small molecules, polymers, and proteins and their most common uses related to AI within computational drug discovery. We hope that this review will benefit practising cheminformaticians, students, and anyone else interested to learn more about the underlying molecular representations in cheminformatics that can be used in AI-driven drug discovery applications.

Availability of data and materials

Not applicable.

Note that throughout this section, using the typical convention, bold italicized symbols are used to represent matrices and vectors, where an uppercase symbol specifically denotes a matrix (e.g. X ) and a lowercase symbol specifically denotes a vector (e.g. x i ). Furthermore, uppercase symbols that are italicized but not bolded are used to represent sets (e.g. V ), whereas lowercase symbols that are italicized but not bolded are used to represent either a) items from a set (e.g. v i ) or b) elements of a matrix (e.g. a ij ), depending on context.

One-hot encoding is a widely used technique in AI to convert categorical data into numerical data using binary vectors, where a 1 indicates the presence of a quality and a 0 indicates the absence. See Fig.  1 c; the rows of the node features matrix are examples of one-hot encodings.

List of scientists who proposed seven notation to IUPAC: M. Gordon, C. E. Kendall and W. H. T. Davson, W. Gruber, J. A. Silk, E. S. Cockburn, G. M. Dyson, G. K. Zipf and W. J. Wiswesser.

Abbreviations

2-dimensional

3-dimensional

Amino acids

  • Artificial intelligence

Chemically Advanced Template Search

ChemAxon Extended SMILES

Extended Connectivity Fingerprints

Forward Translation

Hierarchical Editing Language for Macromolecules

International Chemistry Identifier

International Union of Pure and Applied Chemistry

Molecular Design Limited, Inc

Machine-learning

Natural Language Processing

Optical character recognition

Protein Line Notation

Quantitative structure activity relationship

Reaction-Data Files

Reaction InChI

SMILES Arbitrary Target Specification

SMILES ReaKtion Specification

Wiswesser Line Notation

Web Unique Representation of Carbohydrate Structures

Lawlor B (2016) The chemical structure association trust. Chem Int. 38(2):12–15

Article   CAS   Google Scholar  

Wiswesser WJ (1968) 107 years of line-formula notations (1861–968). J Chem Doc. 8(3):146–150

Zhou P, Shang Z. 2D molecular graphics: a flattened world of chemistry and biology

Clark AM, Labute P, Santavy M (2006) 2D structure depiction. J Chem Inf Model 46(3):1107–1123

Article   CAS   PubMed   Google Scholar  

RasMol and OpenRasMol. http://www.openrasmol.org/ . Accessed 27 Apr 2020.

Francoeur E (2002) Cyrus Levinthal, the Kluge and the origins of interactive molecular graphics. Endeavour 26(4):127–131

Feldmann RJ, Heller SR, Bacon CRT (1972) An interactive, versatile, three-dimensional display, manipulation and plotting system for biomedical research. J Chem Doc. 12(4):234–237

Gelberg A. Chemical notations. In: Encyclopedia of library and information science. 1970. p. 510–28

Weininger D (1988) SMILES, a Chemical Language And Information System: 1: introduction to methodology and encoding rules. J Chem Inf Comput Sci 28(1):31–36

Heller SR, McNaught A, Pletnev I, Stein S, Tchekhovskoi D (2015) InChI, the IUPAC International Chemical Identifier. J Cheminform. 7(1):23

Article   PubMed   PubMed Central   CAS   Google Scholar  

Cereto-Massagué A, Ojeda MJ, Valls C, Mulero M, Garcia-Vallvé S, Pujadas G (2015) Molecular fingerprint similarity search in virtual screening. Methods 71:58–63

Article   PubMed   CAS   Google Scholar  

Siani MA, Weininger D, Blaney JM (1994) CHUCKLES: a method for representing and searching peptide and peptoid sequences on both monomer and atomic levels. J Chem Inf Comput Sci 34(3):588–593

Siani MA, Weininger D, James CA, Blaney JM (1995) CHORTLES: a method for representing oligomeric and template-based mixtures. J Chem Inf Comput Sci 35:1026–1033

Zhang T, Li H, Xi H, Stanton RV, Rotstein SH (2012) HELM: a hierarchical notation language for complex biomolecule structure representation. J Chem Inf Model 52(10):2796–2806

Tanaka K, Aoki-Kinoshita KF, Kotera M, Sawaki H, Tsuchiya S, Fujita N et al (2014) WURCS: the Web3 Unique Representation Of Carbohydrate Structures. J Chem Inf Model 54(6):1558–1566

Jensen JH, Hoeg-Jensen T, Padkjær SB (2008) Building a biochemformatics database. J Chem Inf Model 48(12):2404–2413

Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H et al (2000) The protein data bank. Nucleic Acids Res 28(1):235–242

Article   CAS   PubMed   PubMed Central   Google Scholar  

Grethe G, Blanke G, Kraut H, Goodman JM (2018) International chemical identifier for reactions (RInChI). J Cheminform. 10(1):22

Varnek A, Fourches D, Hoonakker F, Solovev VP (2005) Substructural fragments: an universal language to encode reactions, molecular and supramolecular structures. J Comput Aided Mol Des. 19(910):693–703

Dugundji J, Ugi I. An algebraic model of constitutional chemistry as a basis for chemical computer programs. In: Computers in chemistry. Springer; 2006. p. 19–64

Rose JR, Gasteiger J (1994) HORACE: an automatic system for the hierarchical classification of chemical reactions. J Chem Inf Comput Sci 34(1):74–90

Ertl P (2010) Molecular structure input on the web. J Cheminform. 2(1):1–9

Guha R, Wiggins GD, Wild DJ, Baik MH, Pierce ME, Fox GC (2011) Improving usability and accessibility of cheminformatics tools for chemists through cyberinfrastructure and education. Silico Biol. 11(12):41–60

CAS   Google Scholar  

Varnek A, Baskin II (2011) Chemoinformatics as a theoretical chemistry discipline. Mol Inform. 30(1):20–32

Vazquez M, Krallinger M, Leitner F, Valencia A (2011) Text mining for drugs and chemical compounds: methods, tools and applications. Mol Inform. 30(6–7):506–519

Mater AC, Coote ML (2019) Deep learning in chemistry. J Chem Inf Model 59:2545–2559

Warr WA (2011) Representation of chemical structures. Wiley Interdiscip Rev Comput Mol Sci. 1(4):557–579

National Academy of Sciences UNRC. In: Survey of chemical notations systems. 1964. p. 1–467

Leach AR, Gillet VJ (2007) An introduction to chemoinformatics. Springer, Netherlands, pp 1–255

Book   Google Scholar  

ChemDraw. PerkinElmer Informatics.

MacRae CF, Sovago I, Cottrell SJ, Galek PTA, McCabe P, Pidcock E et al (2020) Mercury 40: from visualization to analysis, design and prediction. J Appl Crystallogr. 53(Pt 1):226–235

Marcus DH, Donald EC, David CL, Tim EZ, Vandermeersch GRH (2012) Avogadro: an advanced semantic chemical editor, visualization, and analysis platform. J Cheminform. 4:17

Momma K, Izumi F (2011) VESTA 3 for three-dimensional visualization of crystal, volumetric and morphology data. J Appl Crystallogr 44(6):1272–1276

Delano WL. PyMOL: An Open-Source Molecular Graphics Tool. https://www.ccp4.ac.uk/newsletters/newsletter40/11_pymol.pdf . Accessed May 27 2020.

Humphrey W, Dalke A, Schulten K (1996) VMD: visual molecular dynamics. J Mol Graph 14(1):33–38

Kay E, Bondy JA, Murty USR. Graph Theory with Applications. Vol. 28, Operational Research Quarterly (1970-1977). 1977. p. 237

Dietz A (1995) Yet another representation of molecular structure. J Chem Inf Comput Sci 35(5):787–802

O’Boyle NM (2012) Towards a Universal SMILES representation - A standard method to generate canonical SMILES based on the InChI. J Cheminform. 4:9

Dalby A, Nourse JG, Hounshell WD, Gushurst AKI, Grier DL, Leland BA et al (1992) Description of several chemical structure file formats used by computer programs developed at molecular design limited. J Chem Inf Comput Sci 32(3):244–255

Engel T, Gasteiger J (2018) Chemoinformatics: basic concepts and methods. Wiley, New York

Leigh GJ, Favre HA, Metanomski WV. Principles of chemical nomenclature: a guide to IUPAC recommendations. Blackwell Science Ltd, editor. European Journal of Medicinal Chemistry. The Royal Society of Chemistry; 1998

Color Books - IUPAC | International Union of Pure and Applied Chemistry. https://iupac.org/what-we-do/books/color-books/ . Accessed 15 Dec 2019

Dyson GM, Lynch MF, Morgan HL (1968) A modified IUPAC-Dyson notation system for chemical structures. Inf Storage Retr 4(1):27–83

Wiswesser WJ (1982) How the WLN began in 1949 and how it might be in 1999. J Chem Inf Comput Sci 22(2):88–93

Wiswesser WJ (1985) Historic development of chemical notations. J Chem Inf Comput Sci 25(3):258–263

Wiswesser WJ (1955) Molecular structure and taste simulation. Va J Sci. 6:16–21

David L, Arús-Pous J, Karlsson J, Engkvist O, Bjerrum EJ, Kogej T et al (2019) Applications of deep-learning in exploiting large-scale and heterogeneous compound data in industrial pharmaceutical research, vol 10. Frontiers in Pharmacology, Frontiers Media SA, New York

Google Scholar  

Daylight. https://www.daylight.com/ . Accessed 23 Apr 2020

RDKit, Open-Source Cheminformatics. http://www.rdkit.org

Bjerrum E, Sattarov B (2018) Improving chemical autoencoder latent space and molecular de novo generation diversity with heteroencoders. Biomolecules. 8(4):131

Article   PubMed Central   CAS   Google Scholar  

Bjerrum EJ. SMILES Enumeration as Data Augmentation for Neural Network Modeling of Molecules. arXiv Prepr. 2017

Schneider N, Sayle RA, Landrum GA (2015) Get your atoms in order-an open-source implementation of a novel and robust molecular canonicalization algorithm. J Chem Inf Model 55(10):2111–2120

Morgan HL (1965) The generation of a unique machine description for chemical structures—a technique developed at chemical abstracts service. J Chem Doc. 5(2):107–113

Quirós M, Gražulis S, Girdzijauskaitė S, Merkys A, Vaitkus A (2018) Using SMILES strings for the description of chemical connectivity in the Crystallography Open Database. J Cheminform 10(1):1–17

ChemAxon Extended SMILES and SMARTS - CXSMILES and CXSMARTS - Documentation. https://docs.chemaxon.com/display/docs/ChemAxon_Extended_SMILES_and_SMARTS_-_CXSMILES_and_CXSMARTS.html#src-1806633_ChemAxonExtendedSMILESandSMARTS-CXSMILESandCXSMARTS-Fragmentgrouping . Accessed 8 Apr 2020

OpenSMILES Home Page. http://opensmiles.org/ . Accessed 23 Apr 2020

Daylight Theory: SMARTS - A Language for Describing Molecular Patterns. https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html . Accessed 15 Nov 2020

Southan C (2013) InChI in the wild: an assessment of InChIKey searching in Google. J Cheminform. 5(1):10

Pletnev I, Erin A, McNaught A, Blinov K, Tchekhovskoi D, Heller S (2012) InChIKey collision resistance: an experimental testing. J Cheminform. 4:12

Warr WA (2015) Many InChIs and quite some feat. J Comput Aided Mol Des 29(8):681–694

Kode-Chemoinformatics. https://chm.kode-solutions.net/products_dragon.php . Accessed 23 Apr 2020

Dalke A. MACCS key 44. http://www.dalkescientific.com/writings/diary/archive/2014/10/17/maccs_key_44.html . Accessed 28 Mar 2020

MDL Information Systems I. MACCS keys

Durant JL, Leland BA, Henry DR, Nourse JG (2002) Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci 42(6):1273–1280

Schneider G, Neidhart W, Giller T, Schmid G (1999) “Scaffold-Hopping” by topological pharmacophore search: a contribution to virtual screening. Angew Chemie Int Ed. 38(19):2894–2896

Rogers D, Hahn M (2010) Extended-Connectivity Fingerprints. J Chem Inf Model 50(5):742–754

CAS Content | CAS. https://www.cas.org/about/cas-content . Accessed 8 Apr 2020

Warr WA (2014) A short review of chemical reaction database systems, computer-aided synthesis design, reaction prediction and synthetic feasibility. Mol Inform. 33(6–7):469–476

Jensen KF, Coley CW, Eyke NS (2019) Autonomous discovery in the chemical sciences part I: Progress. Angew Chemie Int Ed

Grethe G, Goodman JM, Allen CH (2013) International chemical identifier for reactions (RInChI). J Cheminform. 5(1):45

Jacob PM, Lan T, Goodman JM, Lapkin AA (2017) A possible extension to the RInChI as a means of providing machine readable process data. J Cheminform. 9:1

Fujita S (1986) Description of organic reactions based on imaginary transition structures. 1. introduction of new concepts. J Chem Inf Comput Sci. 26(4):205–212

Nugmanov RI, Mukhametgaleev RN, Akhmetshin T, Gimadiev TR, Afonina VA, Madzhidov TI et al (2019) CGRtools: Python library for molecule, reaction, and condensed graph of reaction processing. J Chem Inf Model 59(6):2516–2521

Gasteiger J, Jochum C (2006) EROS A computer program for generating sequences of reactions. In: Organic Compunds. Springer, pp 93–126

Kraut H, Eiblmaier J, Grethe G, Löw P, Matuszczyk H, Saller H (2013) Algorithm for reaction classification. J Chem Inf Model 53(11):2884–2895

Bøgevig A, Federsel HJ, Huerta F, Hutchings MG, Kraut H, Langer T et al (2015) Route design in the 21st century: the IC SYNTH software tool as an idea generator for synthesis prediction. Org Process Res Dev 19(2):357–368

Segler MHS, Preuss M, Waller MP (2018) Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555(7698):604–610

Chen WL, Chen DZ, Taylor KT (2013) Automatic reaction mapping and reaction center detection. Wiley Interdiscip Rev Comput Mol Sci 3(6):560–593

Ehrlich H, Rarey M (2011) Maximum common subgraph isomorphism algorithms and their applications in molecular science: a review. WIREs Comput Mol Sci 1(1):68–79

Raymond JW, Willett P (2002) Maximum common subgraph isomorphism algorithms for the matching of chemical structures. J Comput Aided Mol Design 16:521–533

Schneider N, Lowe DM, Sayle RA, Landrum GA (2015) Development of a novel fingerprint for chemical reactions and its application to large-scale reaction classification and similarity. J Chem Inf Model 55(1):39–53

Patel H, Bodkin MJ, Chen B, Gillet VJ (2009) Knowledge-based approach to de novo design using reaction vectors. J Chem Inf Model 49(5):1163–1184

Ghiandoni GM, Bodkin MJ, Chen B, Hristozov D, Wallace JEA, Webster J et al (2019) Development and application of a data-driven reaction classification model: comparison of an electronic lab notebook and medicinal chemistry literature. J Chem Inf Model 59(10):4167–4187

Coley CW, Green WH, Jensen KF (2019) RDChiral: an RDKit wrapper for handling stereochemistry in retrosynthetic template extraction and application. J Chem Inf Model 59(6):2529–2537

Peerless JS, Milliken NJB, Oweida TJ, Manning MD, Yingling YG (2019) Soft matter informatics: current progress and challenges. Adv Theory Simulations. 2(1):1800129

Nomenclature and symbolism for amino acids and peptides (1984) Pure Appl Chem 56(5):595–624

Article   Google Scholar  

Minkiewicz P, Iwaniak A, Darewicz M (2019) BIOPEP-UWM database of bioactive peptides: current opportunities. Int J Mol Sci. 20:23

Milton J, Zhang T, Bellamy C, Swayze E, Hart C, Weisser M et al (2017) HELM Software for Biopolymers. J Chem Inf Model 57(6):1233–1239

Chen WL, Leland BA, Durant JL, Grier DL, Christie BD, Nourse JG et al (2011) Self-contained sequence representation: bridging the gap between bioinformatics and cheminformatics. J Chem Inf Model 51(9):2186–2208

HELM - Pistoia Alliance. https://www.pistoiaalliance.org/projects/current-projects/helm/ . Accessed 23 Apr 2020

Knispel R, Büki E, Hornyák G, Mihala N, Tomin A, Keresztes G, et al. Informatics tools leveraging the open HELM standard for managing and exploring databases of chemically modified complex biomolecules. https://chemaxon.com/app/uploads/2016/04/biotoolkit_2016-04_102_A4.pdf . Accessed 27 May 2020

Bruno BJ, Miller GD, Lim CS (2013) Basics and recent advances in peptide and protein drug delivery. Ther Deliv. 4(11):1443–1467

Minkiewicz P, Iwaniak A, Darewicz M (2017) Annotation of peptide structures using SMILES and other chemical codes-practical solutions. Molecules 22(2075):1–17

Sauna ZE, Lagassé HAD, Alexaki A, Simhadri VL, Katagiri NH, Jankowski W et al (2017) Recent advances in (therapeutic protein) drug development. F1000 Research. 6:F1000

Valverde P, Ardá A, Reichardt NC, Jiménez-Barbero J, Gimeno A (2019) Glycans in drug discovery. Medchemcomm. 10(10):1678–1691

Connor EF, Lees I, Maclean D (2017) Polymers as drugs-Advances in therapeutic applications of polymer binding agents. J Polym Sci Part A: Polym Chem 55(18):3146–3157

Bohne-Lang A, Lang E, Förster T, Von der Lieth CW (2001) LINUCS: LInear notation for unique description of carbohydrate sequences. Carbohydr Res 336(1):1–11

Herget S, Ranzinger R, Maass K, Lieth CW (2008) GlycoCT-a unifying sequence format for carbohydrates. Carbohydr Res. 343(12):2162–2171

Ranzinger R, Kochut KJ, Miller JA, Eavenson M, Lütteke T, York WS (2017) GLYDE-II: the GLYcan data exchange format. Perspect Sci 11:24–30

Toukach PV, Egorova KS (2020) New features of carbohydrate structure database notation (CSDB Linear), as compared to other carbohydrate notations. J Chem Inf Model 60(3):1276–1289

Tsuchiya S, Yamada I, Aoki-Kinoshita KF (2019) GlycanFormatConverter: a conversion tool for translating the complexities of glycans. Bioinformatics 35(14):2434–2440

Chernyshov IY, Toukach PV (2018) REStLESS: automated translation of glycan sequences from residue-based notation to SMILES and atomic coordinates. Bioinformatics 34(15):2679–2681

Matsubara M, Aoki-Kinoshita KF, Aoki NP, Yamada I, Narimatsu H (2017) WURCS 2.0 update to encapsulate ambiguous carbohydrate structures. J Chem Inf Model. 57(4):632–637

Tiemeyer M, Aoki K, Paulson J, Cummings RD, York WS, Karlsson NG et al (2017) GlyTouCan: an accessible glycan structure repository. Glycobiology 27(10):915–919

Pillong M, Schneider G (2012) Representing carbohydrates by pseudoreceptor models for virtual screening in drug discovery. pp 131–46

Bojar D, Camacho DM, Collins JJ (2020) Using Natural Language Processing to Learn the Grammar of Glycans. bioRxiv

Lin TS, Coley CW, Mochigase H, Beech HK, Wang W, Wang Z et al (2019) BigSMILES: a structurally-based line notation for describing macromolecules. ACS Cent Sci. 5(9):1523–1531

Brecher J (2008) Graphical representation standards for chemical structure diagrams: (IUPAC Recommendations 2008). Pure Appl Chem 80(2):277–410

Xemistry Chemoinformatics. https://www.xemistry.com/ . Accessed 10 Jun 2020

Molinspiration Cheminformatics. https://www.molinspiration.com/ . Accessed 10 Jun 2020

OASA. http://bkchem.zirael.org/oasa_en.html . Accessed 10 Jun 2020

Willighagen EL, Mayfield JW, Alvarsson J, Berg A, Carlsson L, Jeliazkova N et al (2017) The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J Cheminform. 9(1):1–19

Mayfield J (2016) Higher quality chemical depictions: lessons learned and advice

The Consortium for Functional Glycomics. http://www.functionalglycomics.org/static/consortium/consortium.shtml . Accessed 27 May 2020

Stierand K, Rarey M (2010) Drawing the PDB: protein-ligand complexes in two dimensions. ACS Med Chem Lett. 1(9):540–545

Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural Message Passing for Quantum Chemistry. arXiv Prepr

Withnall M, Lindelöf E, Engkvist O, Chen H (2019) Building attention and edge convolution neural networks for bioactivity and physical-chemical property prediction building attention and edge convolution neural networks for. p 2

Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H et al (2019) Analyzing learned molecular representations for property prediction. J Chem Inf Model 59(8):3370–3388

Coley CW, Barzilay R, Green WH, Jaakkola TS, Jensen KF (2017) Convolutional embedding of attributed molecular graphs for physical property prediction. J Chem Inf Model 57(8):1757–1772

Li Y, Vinyals O, Dyer C, Pascanu R, Battaglia P (2018) Learning Deep Generative Models of Graphs. arXiv Prepr

Li Y, Zhang L, Liu Z (2018) Multi-objective de novo drug design with conditional graph generative model. J Cheminform. 10(1):1–24

Jin W, Barzilay R, Jaakkola T (2018) Junction Tree Variational Autoencoder for Molecular Graph Generation. arXiv Prepr

Popova M, Shvets M, Oliva J, Isayev O (2019) MolecularRNN: Generating realistic molecular graphs with optimized properties. arXiv Prepr

Jin W, Barzilay R, Jaakkola T (2019) Multi-Resolution Autoregressive Graph-to-Graph Translation for Molecules. chemArXiv. p 8266745

Jin W, Yang K, Barzilay R, Jaakkola T (2018) Learning multimodal graph-to-graph translation for molecular optimization. arXiv Prepr. pp 1–14

Coley CW, Jin W, Rogers L, Jamison TF, Jaakkola TS, Green WH, et al (2018) A graph-convolutional neural network model for the prediction of chemical reactivity

Xu K, Hu W, Leskovec J, Jegelka S (2019) How powerful are graph neural networks? pp 1–16

Battaglia PW, Hamrick JB, Bapst V, Sanchez-Gonzalez A, Zambaldi V, Malinowski M, et al. Relational inductive biases, deep learning, and graph networks. 2018;1–40

Hassan M, Brown RD, Varma-OBrien S, Rogers D (2006) Cheminformatics analysis and learning in a data pipelining environment. Mol Divers. 10(3):283–299

Todeschini R, Consonni V (2007) Methods and principles in medicinal chemistry. pp 438–438

Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T (2018) The rise of deep learning in drug discovery. Drug Discov Today. 23(6):1241–1250

Article   PubMed   Google Scholar  

Sanchez-Lengeling B, Aspuru-Guzik A (2018) Inverse molecular design using machine learning: generative models for matter engineering. Science. 361(6400):360–365

Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T et al (2020) Improved protein structure prediction using potentials from deep learning. Nature 577(7792):706–710

Liu B, He H, Luo H, Zhang T, Jiang J (2019) Artificial intelligence and big data facilitated targeted drug discovery. Stroke Vasc Neurol. 4:290

SureChEMBL: Non MedChem-Friendly SMARTS. https://www.surechembl.org/knowledgebase/169485-non-medchem-friendly-smarts . Accessed 5 Dec 2019

Sushko I, Salmina E, Potemkin VA, Poda G, Tetko IV (2012) ToxAlerts: a web server of structural alerts for toxic chemicals and compounds with potential adverse reactions. J Chem Inf Model 52(8):2310–2316

Baell JB, Holloway GA (2010) New substructure filters for removal of pan assay interference compounds (PAINS) from screening libraries and for their exclusion in bioassays. J Med Chem 53(7):2719–2740

Arús-Pous J, Johansson SV, Prykhodko O, Bjerrum EJ, Tyrchan C, Reymond JL et al (2019) Randomized SMILES strings improve the quality of molecular generative models. J Cheminform. 11:1

Kadurin A, Aliper A, Kazennov A, Mamoshina P, Vanhaelen Q, Khrabrov K et al (2017) The cornucopia of meaningful leads: applying deep adversarial autoencoders for new molecule development in oncology. Oncotarget. 8(7):10883–10890

Duvenaud D, Maclaurin D, Aguilera-Iparraguirre J, Gómez-Bombarelli R, Hirzel T, Aspuru-Guzik A, et al. Convolutional networks on graphs for learning molecular fingerprints. In: Advances in neural information processing systems. 2015. pp 2224–32

Urbanek DA, Proschak E, Tanrikulu Y, Becker S, Karas M, Schneider G (2011) Scaffold-hopping from aminoglycosides to small synthetic inhibitors of bacterial protein biosynthesis using a pseudoreceptor model. Medchemcomm. 2(3):181–184

Nassif H, Al-Ali H, Khuri S, Keirouz W (2009) Prediction of protein-glucose binding sites using support vector machines. Proteins Struct Funct Bioinforma. 77(1):121–132

Pai PP, Mondal S (2016) MOWGLI: prediction of protein–MannOse interacting residues With ensemble classifiers usinG evoLutionary Information. J Biomol Struct Dyn 34(10):2069–2083

Dezso Z, Ceccarelli M (2020) Machine learning prediction of oncology drug targets based on protein and network properties. BMC Bioinf. 21:1

Kumar S, Maurya VK, Prasad AK, Bhatt MLB, Saxena SK (2020) Structural, glycosylation and antigenic variation between 2019 novel coronavirus (2019-nCoV) and SARS coronavirus (SARS-CoV). VirusDisease. 31(1):13–21

Article   PubMed   PubMed Central   Google Scholar  

Nguyen A, Huang YC, Tremouilhac P, Jung N, Bräse S (2019) ChemScanner: extraction and re-use(ability) of chemical information from common scientific documents containing ChemDraw files. J Cheminform. 11(1):1–9

Frasconi P, Gabbrielli F, Lippi M, Marinai S (2014) Markov logic networks for optical chemical structure recognition. J Chem Inf Model 54:37

Staker J, Marshall K, Abel R, McQuaw CM (2019) Molecular structure extraction from documents using deep learning. J Chem Inf Model 59(3):1017–1029

Picture - 5Y6N Zika virus helicase in complex with ADP. http://www.rcsb.org/3d-view/5Y6N . Accessed 8 Jan 2020

Picture - Lemon. https://pixabay.com/sv/vectors/citron-citrus-mat-frukt-orange-148119/ . Accessed 8 Jan 2020

Picture - Orange. https://pixabay.com/sv/vectors/apelsiner-frukt-saftiga-citrus-42394/ . Accessed 8 Jan 2020

Picture - Pills. https://pixabay.com/fr/photos/thermomètre-maux-de-tête-la-douleur-1539191/ . Accessed 30 Dec 2019

Picture - Rose Graphic Flower. https://pixabay.com/vectors/rose-graphic-flower-deco-398576/ . Accessed 31 Dec 2019

Picture - Red contact lens. https://unsplash.com/photos/R5CX8XDQLV0 . Accessed 14 Jul 2020

Picture - Insulin. https://www.flickr.com/photos/102642344@N02/10083633053/ . Accessed 26 Dec 2019

Picture - Cyclosporin A. https://pubchem.ncbi.nlm.nih.gov/compound/Cyclosporin-A#section=2D-Structure . Accessed 6 Dec 2019

Picture - Milk Bottle. https://pixabay.com/vectors/milk-bottle-glass-dairy-breakfast-2012800/ . Accessed 26 Dec 2019

Creative Commons—Attribution 3.0 Unported—CC BY 3.0. https://creativecommons.org/licenses/by/3.0/ . Accessed 5 Dec 2019

Download references

Acknowledgements

L.D. and A.T. would like to thank the European Union’s Horizon 2020 research and innovation program. R.M. would like to thank the AstraZeneca Postdoc Program. Third-party materials used in figures [ 149 , 150 , 151 , 152 , 153 , 154 , 155 , 156 , 157 ] are licensed under CC BY [ 158 ]. The authors would like to thank the editor and the reviewers, two anonymous and Dr. Noel O’Boyle, for their extensive comments, suggestions, and patience throughout the many rounds of review.

L.D. and A.T. have received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Sklodowska Curie grant agreement No 676434, “Big Data in Chemistry” (“BIGCHEM”, http://bigchem.eu ). The article reflects only the authors view and neither the European Commission nor the Research Executive Agency (REA) are responsible for any use that may be made of the information it contains.

Author information

Authors and affiliations.

Hit Discovery, Discovery Sciences, BioPharmaceuticals R&D, Astrazeneca Gothenburg, Sweden

Laurianne David, Amol Thakkar, Rocío Mercado & Ola Engkvist

Department of Chemistry and Biochemistry, University of Bern, Bern, Switzerland

Amol Thakkar

You can also search for this author in PubMed   Google Scholar

Contributions

RM and AT wrote the section entitled “ Graph representations for small molecules ”. All the authors contributed to the sections “ Notations for small molecules ” and “ AI applications within drug - discovery using molecular representations ”. The section “ Representations for chemical reactions ” was written by AT; LD wrote the sections “ Notations for macromolecules ” and “ Graphical representations for molecules and macromolecules ”. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Laurianne David .

Ethics declarations

Competing interests.

All the authors were employed by AstraZeneca and declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article.

David, L., Thakkar, A., Mercado, R. et al. Molecular representations in AI-driven drug discovery: a review and practical guide. J Cheminform 12 , 56 (2020). https://doi.org/10.1186/s13321-020-00460-5

Download citation

Received : 11 January 2020

Accepted : 05 September 2020

Published : 17 September 2020

DOI : https://doi.org/10.1186/s13321-020-00460-5

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Molecular representation
  • Cheminformatics
  • Drug discovery
  • Small molecules
  • Macromolecules
  • Linear notation
  • Molecular graphs
  • Reaction prediction

Journal of Cheminformatics

ISSN: 1758-2946

representation 3d molecule

virtual Chemistry 3D
  • AXnEm types and pi systems: N
  • AXnEm types and pi systems: O
  • 1,2-dichloroethane
  • Cyclohexane
  • Substituted Cyclohexane
  • 3-methylpentane
  • 3-methylhexane
  • Thalidomide
  • R/S Nomenclature
  • Diastereisomers
  • Atomic Orbitals
  • Hybrid Orbitals
  • d metal complexes: Crystal Field Theory
  • M(σ donor-L) 6
  • M(π acceptor-L) 6
  • Mo-Fischer carbene complex
  • MOs and Electrostatic Potential
  • C 2 H 6 eclipsed
  • C 2 H 6 staggered
  • Ru 4 H 4 (CO) 12
  • Point Groups
  • Methylene CH 2 R 2
  • Methyl CH 3 R
  • 1,3,5-triazin
  • trans -C 2 H 2 Cl 2
  • cis -C 2 H 2 Cl 2
  • n -butylamine
  • Propanoic Acid
  • Benzaldehyde
  • Benzonitrile
  • 1H NMR predictor
  • 13C NMR predictor
  • Simple Reaction
  • Hexadiene + Acrolein
  • Butadiene + SO2
  • Spinel (MgAl 2 O 4 )
  • Cobalt oxide (Co 3 O 4 )
  • Playground and (hkl) planes
  • Au: Cl-protected
  • Nanocrystals
  • Macromolecules
  • 2D Sketching
  • Thematic Index
  • Bibliography
Interactive 3D animations and structures, with supporting information for some important topics covered during an undergraduate chemistry degree... and beyond

Yes, new subjects and examples are regularly added to this site.

Please Formulate it (kindly) to the author


application, and its variant, developped for websites.

You can play on this first page by searching molecules on the PubChem or RSCB servers.

By clicking on the appropriate parts of the scheme and rotating the structures in 3D, you can explore the geometry of a molecule.
use the scroll wheel to enlarge and shrink the models; shift + double-click on any atom and drag to translate along (x,y)
right clicking the mouse (Control-click MacOS) over the JSmol screen will give additional options and features.

To navigate your way to the required page, simply use the side-bar on the left.
According to the page, additional buttons are available and perform the labelled features.

WebAnalytics

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • PMC11333583

Data-driven quantum chemical property prediction leveraging 3D conformations with Uni-Mol+

1 DP Technology, Beijing, China

Zhifeng Gao

2 Peking University, Beijing, China

Linfeng Zhang

Associated data.

The datasets used in this study are all publicly available. PCQM4MV2 dataset which is to predict HOMO-LUMO GAP on small molecules is available at https://ogb.stanford.edu/docs/lsc/pcqm4mv2/#dataset and the OC20 dataset which is to conduct energy prediction on catalyst system is available at https://github.com/Open-Catalyst-Project/ocp/blob/main/DATASET.md .  Source data are provided with this paper.

The source code of this study is publicly available on GitHub( https://github.com/deepmodeling/Uni-Mol/ ) and zenodo (10.5281/zenodo.12670462) to allow replication of the results.

Quantum chemical (QC) property prediction is crucial for computational materials and drug design, but relies on expensive electronic structure calculations like density functional theory (DFT). Recent deep learning methods accelerate this process using 1D SMILES or 2D graphs as inputs but struggle to achieve high accuracy as most QC properties depend on refined 3D molecular equilibrium conformations. We introduce Uni-Mol+, a deep learning approach that leverages 3D conformations for accurate QC property prediction. Uni-Mol+ first generates a raw 3D conformation using RDKit then iteratively refines it towards DFT equilibrium conformation using neural networks, which is finally used to predict the QC properties. To effectively learn this conformation update process, we introduce a two-track Transformer model backbone and a novel training approach. Our benchmarking results demonstrate that the proposed Uni-Mol+ significantly improves the accuracy of QC property prediction in various datasets.

Quantum chemical (QC) property prediction is crucial in computational chemistry. Here, the authors introduce Uni-Mol+, a deep model that uses iterative updates of 3D molecular conformations to improves the accuracy of QC property prediction.

Introduction

The application of computational methods has become a widely employed strategy in the development of new materials and drugs. A crucial aspect of this approach involves the calculation of quantum chemical (QC) properties of molecular structures 1 . These quantitative properties are highly dependent on the refined equilibrium conformations of molecules.

In the field of materials and drug design, researchers primarily focus on the quantitative properties of equilibrium conformations. The process to achieve this generally involves two key steps, both of which depend on electronic structure methods such as density functional theory (DFT) 2 . The initial step entails performing conformation optimization, also known as energy minimization, on the molecular structure to determine the equilibrium conformation. Subsequently, the quantum chemical (QC) properties of this equilibrium conformation are computed. However, the combined process of conformation optimization and property calculation using DFT can be extremely time-consuming and computationally expensive, potentially requiring several hours to evaluate the properties of just a single molecule. This constraint hinders the applicability of DFT in large-scale data screening endeavors. Consequently, it is of paramount importance to develop alternative methods that maintain the requisite accuracy while reducing computational costs.

Recent studies have demonstrated the potential of using deep learning to accelerate QC property calculations 3 – 5 . This approach involves training a deep neural network model to predict the property using molecular inputs, thereby circumventing the need for computationally-intensive DFT calculations. Prior research has mainly utilized 1D SMILES 6 – 8 sequences or 2D molecular graphs 4 , 9 – 13 as molecular inputs due to their easy obtainability. However, predicting QC properties from 1D SMILES and 2D molecular graphs can be ineffective since most QC properties are highly related to the refined 3D equilibrium conformations.

To address this challenge, we propose a method called Uni-Mol+ in this paper, illustrated in Fig.  1 a. In contrast to previous approaches that directly predict QC properties from 1D/2D data, Uni-Mol+ takes advantage of the 3D conformation of the molecule as input, in accordance with physical principles. Uni-Mol+ first generates a raw 3D conformation from 1D/2D data using cheap methods, such as RDKit 14 . As the raw conformation is inaccurate, Uni-Mol+ then iteratively updates it towards the DFT equilibrium conformation using neural networks and predicts QC properties from the learned conformation. To obtain accurate equilibrium conformation predictions, we use large-scale datasets (e.g., PCQM4MV2 benchmark) to build up millions of pairs of RDKit-generated raw conformation and high-quality DFT equilibrium conformation and learn the update process from this supervised information. With a carefully designed model backbone and training strategy, Uni-Mol+ shows superior performance in various benchmarks.

An external file that holds a picture, illustration, etc.
Object name is 41467_2024_51321_Fig1_HTML.jpg

a In contrast to prior methods that directly predict QC properties from 1D/2D data, Uni-Mol+ uses a different approach. It first generates raw 3D conformation from 1D/2D data using cheap tools like RDKit, and then iteratively updates it towards the DFT equilibrium conformation. Finally, it predicts QC properties using the learned conformation. The abbreviation HOMO-LUMO gap represents the Highest Occupied Molecular Orbital— Lowest Unoccupied Molecular Orbital gap. b The Uni-Mol+ backbone consists of L blocks, each of which maintains two tracks of representations—atom and pair, initialized by atom features and 2D graph/3D conformation, respectively. These representations communicate with each other at every block. Based on this backbone model, Uni-Mol+ iteratively updates the raw conformation (i.e., 3D coordinates of atoms) towards the DFT equilibrium conformation for R iterations. The abbreviation FFN represents the Feed-Forward Neural network and QC property represents Quantum Chemical property. c A linear noisy interpolation between raw conformation and DFT conformation is used to generate a pseudo trajectory, effectively augmenting the input conformations. Uni-Mol+ uses a mixture of Bernoulli distribution and Uniform distribution to sample the noise interpolation weight q during training. The symbol q represents the interpolation weight between raw conformation and DFT conformation.

Our main contributions can be summarized as follows:

  • We develop a novel paradigm for QC property prediction by leveraging the conformation optimization from RDKit-generated conformation to DFT equilibrium conformation.
  • We create a new training strategy for 3D conformation optimization by generating a pseudo trajectory and a sampling strategy from it, based on a mixture of Bernoulli distribution and Uniform distribution.
  • The entire framework of Uni-Mol+ holds significant empirical value, as it achieves markedly better performance than all previous works on two widely recognized benchmarks, PCQM4MV2 15 and Open Catalyst 2020 (OC20) 16 .

In this section, we initially present a concise overview of the Uni-Mol+ framework, followed by comprehensive benchmarking using two well-recognized public datasets: PCQM4MV2 15 and OC20 16 . These datasets enable the assessment of Uni-Mol+ ’s performance in small organic molecules and catalyst systems. Following this, we perform an ablation study to investigate the impact of various model components and training strategies on the overall performance. Lastly, we present a visual analysis to effectively demonstrate the conformation update process within Uni-Mol+. The complete model configuration can be found in the Supplementary Section  2 .

Uni-Mol+ overview

As illustrated in Fig.  1 a, for any molecule, Uni-Mol+ first obtains a raw 3D conformation generated by cheap methods, such as template-based methods from RDKit and OpenBabel. It then learns the target conformation, i.e., the equilibrium conformation optimized by DFT, by an iterative update process from the raw conformation. In the final step, the QC properties are predicted based on the learned conformation. To achieve this goal, we introduce a new model backbone and a novel training strategy for updating conformation and predicting QC properties.

The Uni-Mol+ ’s model backbone is a two-track transformer, consisting of an atom representation track and a pair representation track, as shown in Fig.  1 b. In comparison to the transformer backbone used in the prior study Uni-Mol 17 , two significant updates have been implemented. i) The pair representation is enhanced by an outer product of the atom representation (referred to OuterProduct) for atom-to-pair communication, and a triangular operator (referred to TriangularUpdate) to bolster the 3D geometric information. These two operators are proven effective in AlphaFold2 18 . ii) An iterative process is employed to continuously update the 3D coordinates towards the equilibrium conformation. We use R to denote the number of rounds for conformation optimization.

For the learning of the conformation update process, we introduce a novel training strategy as shown in Fig.  1 c. We sample conformations from the trajectory between the RDKit-generated raw conformation and the DFT equilibrium conformation, and use the sampled conformation as input to predict the equilibrium conformation. It is crucial to note that the actual trajectory is often unknown in many datasets; therefore, we utilize a pseudo trajectory that presumes a linear process between two conformations. Furthermore, we devise a sampling strategy for obtaining conformations from the pseudo trajectory to serve as the model’s input during training. This strategy uses a mixture of Bernoulli distribution and Uniform distribution. The Bernoulli distribution addresses (1) the distributional shift between training and inference and (2) enhances the learning of an accurate mapping from the equilibrium conformation to the QC properties. Meanwhile, the Uniform distribution generates additional intermediate states to serve as model inputs, effectively augmenting the input conformations. The details of Uni-Mol+ can be found in Sec. 4.

Benchmark on small molecule (PCQM4MV2)

The PCQM4Mv2 dataset, derived from the OGB Large-Scale Challenge 15 , is designed to facilitate the development and evaluation of machine learning models for predicting QC properties of molecules, specifically the target property known as the HOMO-LUMO gap. This property represents the difference between the energies of the highest occupied molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO). The dataset, consisting of approximately 4 million molecules represented by SMILES notations, offers HOMO-LUMO gap labels for the training and validation sets; however, the labels for the test set remain undisclosed. Furthermore, the training set encompasses the DFT equilibrium conformation, which is not included in the validation and test sets. The benchmark’s goal is to utilize SMILES notation, without the DFT equilibrium conformation, to predict the HOMO-LUMO gap during the inference process.

Based on SMILES, we generate 8 initial conformations for each molecule by RDKit, at a per-molecule cost of about 0.01 seconds. Specifically, we use ETKDG 19 method to generate 3D conformations. Subsequent optimization of these conformations is achieved through the MMFF94 20 force field. In molecules where the generation of a 3D conformation is unsuccessful, we default to producing a 2D conformation with a flat z -axis using RDKit’s AllChem.Compute2DCoords function instead. During training, we randomly sample 1 conformation as input at each epoch, while during inference, we use the average HOMO-LUMO gap prediction based on 8 conformations.

We incorporate previous submissions to the PCQM4MV2 leaderboard as baselines. In addition to the default 12-layer model, we evaluate the performance of Uni-Mol+ with two variants consisting of 6 and 18 layers, respectively. This aims to explore how model performance changes when varying the model parameter sizes.

The results are summarized in Table  1 , and our observations are as follows: (1) Uni-Mol+ surpasses the previous SOTA by a margin of 0.0079 on validation data on single-model performance, a relative improvement of 11.4% . (2) All three variants of Uni-Mol+ demonstrate substantial performance improvements over previous baselines. (3) The 6-layer Uni-Mol+, despite having considerably fewer model parameters, outperforms all prior baselines. (4) Increasing the layers from 6 to 12 results in a significant accuracy enhancement, surpassing all baselines by a considerable margin. (5) The 18-layer Uni-Mol+ exhibits the highest performance, outperforming all baselines by a remarkable margin. These findings underscore the effectiveness of Uni-Mol+. (6) The performance of a single 18-layer Uni-Mol+ model on the leaderboard (test-dev set) is noteworthy, particularly as it surpasses previous state-of-the-art methods without employing an ensemble or additional techniques. In contrast, the previous state-of-the-art GPS++ relied on a 112-model ensemble and included the validation set for training.

The benchmark results on PCQM4MV2

Model# param.# layersValid MAE ( )Leaderboard MAE ( )
MLP-Fingerprint 16.1M-0.17350.1760
GCN 2.0M-0.13790.1398
GIN 3.8M-0.11950.1218
GINE- , , 13.2M-0.1167-
GCN- , 4.9M-0.11530.1152
GIN- , 6.7M-0.10830.1084
DeeperGCN- , 25.5M120.1021-
GraphGPS 6.2M50.0938-
TokenGT 48.5M120.09100.0919
GRPE 46.2M120.0890-
EGT 89.3M240.08690.0872
GRPE 46.2M180.08670.0876
Graphormer , 47.1M120.0864-
GraphGPS 19.4M100.0858-
GraphGPS 13.8M160.08520.0862
GEM-2 32.1M120.07930.0806
GPS++ 44.3M160.07780.0720
Transformer-M 47.1M120.0787-
69M180.07720.0782
Uni-Mol+27.7M60.0714   ± 6e − 5-
52.4M120.0696   ±  5e − 50.0708
77M18   ±     

1 The leaderboard was accessed on October 15, 2023, the date of this paper’s submission.

2 GPS++’s leaderboard submission consists of a 112-model ensemble and utilizes the validation data for training.

We highlight the best results in bold. Source data are provided as a Source Data file.

Benchmark on catalyst system (OC20)

The Open Catalyst 2020 (OC20) dataset 16 is specifically designed to promote the development of machine-learning models for catalyst discovery and optimization. OC20 encompasses three tasks: Structure to Energy and Force (S2EF), Initial Structure to Relaxed Structure (IS2RS), and Initial Structure to Relaxed Energy (IS2RE). In this paper, we focus on the IS2RE task, as it aligns well with the objectives of the proposed methodology. The goal of the IS2RE task is to predict the relaxed energy based on the initial conformation. It comprises approximately 460K training data points. While DFT equilibrium conformations are provided for training, they are not permitted for use during inference. Moreover, in contrast to the PCQM4MV2 dataset, the initial conformation is already supplied in the OC20 IS2RE task, eliminating the need to generate the initial input conformation by ourselves.

We present a performance comparison of various models on the OC20 IS2RE validation and test set, as illustrated in Table  2 . The table displays the Mean Absolute Error (MAE) for energy in electron volts (eV) and the percentage of Energies Within a Threshold (EwT) for each model. As evident from the tables, our proposed Uni-Mol+ significantly outperforms all previous baselines in terms of both MAE and EwT. For example, in the test set, Uni-Mol+ exceeds the previous SOTA in Average MAE and Average EwT by margins of 0.0366 ( 8.8% relative improvement) and 1.73 ( 26.6% relative improvement), respectively. This demonstrates the exceptional performance of Uni-Mol+. Notably, our method attains the lowest MAEs across all categories, including In-Domain (ID), Out-of-Domain Adsorption (OOD Ads.), Out-of-Domain Catalysis (OOD Cat.), Out-of-Domain Both (OOD Both), and Average (AVG.). Furthermore, in terms of EwT, Uni-Mol+ consistently achieves the highest values in all categories. These findings underscore the robustness of our method in handling both in-domain and out-of-domain data. In conclusion, the results emphasize the efficacy of our approach in capturing intricate interactions in material systems and its potential for extensive applicability in various computational material science tasks.

The benchmark results on OC20 IS2RE task

Results on validation set
Energy MAE (eV) EwT (%)
ModelIDOOD Ads.OOD Cat.OOD BothAVG.IDOOD Ads.OOD Cat.OOD BothAVG.
SchNet 0.64650.70740.64750.66260.66602.962.223.032.382.65
DimeNet++ 0.56360.71270.56120.64920.62174.252.484.402.563.42
GemNet-T 0.55610.73420.56590.69640.63824.512.244.372.383.38
SphereNet 0.56320.66820.55900.61900.60244.562.704.592.703.64
Graphormer-3D 0.43290.58500.44410.52990.4980-----
GNS 0.540.650.550.590.5825-----
GNS+NN 0.470.510.480.460.4800-----
EquiFormer 0.42220.54200.42310.47540.46577.233.777.134.105.56
EquiFormer+NN 0.41560.49760.41650.43440.44107.474.647.194.846.04
DRFormer 0.41870.48630.43210.43320.44258.395.428.125.446.84
Uni-Mol+
± 0.0007± 0.0049± 0.0001± 0.0037± 0.0036
Results on test set
SchNet 0.6390.7340.6620.7040.68482.962.332.942.212.61
DimeNet++ 0.5620.7250.5760.6610.6314.252.074.12.413.21
SphereNet 0.5630.7030.5710.6380.61884.472.294.092.413.32
Graphormer-3D 0.39760.57190.41660.50290.47228.973.458.183.796.1
GNS+NN 0.42190.56780.43660.46510.47289.124.258.014.646.5
EquiFormer 0.50370.68810.52130.63010.58585.142.414.672.693.73
EquiFormer+NN 0.41710.54790.42480.47410.46607.713.707.154.075.66
DRFormer 0.38650.54350.40600.46770.45099.184.018.394.336.48
Uni-Mol+

NN refers to Noisy Nodes 43 . We highlight the best results in bold. Source data are provided as a Source Data file.

Ablation study

In this subsection, we present a comprehensive ablation study for Uni-Mol+. To fully comprehend the configurations discussed herein, we recommend referring to the “Methods” section and the model specifications detailed in Supplementary Section  2 . We conduct the ablation study on the PCQM4Mv2 dataset, employing the default 12-layer Uni-Mol+ configuration. The findings are summarized in Table  3 , where No. 1 is the default setting, and No.2–7 focus on the examination of the model backbone, and No. 8 to No. 17 focus on the examination of the training strategies. A detailed analysis follows in the subsequent paragraphs.

Ablation study for model backbone and for sampling strategies for q , on PCQM4MV2

No.TriangularUpdateOuterProductPair Repr. Valid MAE ( )
1 10.10.80.10.0696
Ablation study on model backbone
2 10.10.80.10.0704
3 10.10.80.10.0710
410.10.80.10.0709
5 00.10.80.10.0715
6 20.10.80.1
700.10.80.10.0738
Ablation study on training strategy
8 11.0--0.0771
9 1-1.0-0.1122
10 1--1.00.0724
11 10.10.9-0.0697
12 1-0.90.10.0753
13 10.10.70.20.0698
14 10.20.70.10.0703
15 10.10.60.30.0702
16 10.20.60.20.0706
17 10.30.60.10.0714
18 1Noisy Nodes0.0760
191Noisy Nodes0.0798

R refers to the number of rounds of conformation updates. w 1.0 refers to the sample probability of RDKit conformation, w 0.0 refers to the sample probability of target conformation with noise and w u refers to the sa probability of intermediate conformation. We highlight the best results in bold. We use underlines to indicate the results under the standard settings. Source data are provided as a Source Data file.

As detailed in Sec. 4 and Supplementary Section  1 , Uni-Mol+ introduces two novel components, OuterProduct and TriangularUpdate, and iteratively updates the 3D coordinates. An examination of the results (No. 1–7) in Table  3 provides insights into the implications of these modifications.

(1) We first examine the necessity of the new components in the model backbone. Upon examining the first three settings (No. 1 to 3), it becomes evident that both TriangularUpdate and OuterProduct significantly contribute to the model’s performance. A comparison between No. 3 and No. 4 reveals that utilizing pair representation exclusively, without incorporating OuterProduct or TriangularUpdate, does not enhance performance. This result is expected because the pair representation is not communicated with the atom representation (without OuterProduct) and is simply updated by FFN, resulting in a performance that is almost the same as not using pair representation, as there are merely more parameters. However, the proposed OuterProduct and TriangularUpdate can better utilize the pair representation, leading to an overall performance improvement (No.1 and No.2). This makes the pair representation an essential component in the backbone of our approach, even if its standalone effectiveness might appear limited.

(2) We then examine the performance brought by iterative coordinate updates. A comparison of No. 1 with No. 5 and No. 6 leads to the conclusion that omitting the iterative update (No. 5) yields suboptimal results. Note that even without the iterative refinement of 3D conformation (R = 0), Uni-Mol+ ’s score of 0.0715 (No. 5) significantly surpasses the previous SOTA GPS++ (0.0778). However, performing one additional iteration proves highly effective (No. 1), whereas further increasing the number of iterations offers marginal improvements (No. 6).

(3) Lastly, we check the result using the same model backbone as previous work. In particular, when the model retains the same structure as the one employed in previous works 3 , 5 , 17 and excludes the iterative update (No. 7), its performance is the least favorable. Nonetheless, even with this substandard performance, the model surpasses all prior baselines, thereby highlighting the efficacy of the proposed training strategy. It is important to note that No.7 employs the proposed training strategy as outlined in Sec. 4.2. Although No.7 does not explicitly use conformation optimization (R = 0), the model is still trained to predict the target conformation. Consequently, the Atom Repr. and Pair Repr. of the last layer inherently contain the information required to predict the target conformation. Hence, even without explicitly conformation optimization (R=0), the result of No.7 still supports our primary contribution, namely the accurate prediction of QC properties by leveraging an auxiliary task of conformation optimization.

The training strategy primarily concentrates on sampling q (interpolation weight, details in Sec. 4) to obtain input conformations during training. Formally, q is sampled from a mixture of Bernoulli and Uniform distributions, denoted as w 1.0 I { 1.0 } ( q ) + w 0.0 I { 0.0 } ( q ) + w u I [ a , b ] ( q ) , where I { c } ( q ) is an indicator function that equals 1 if q  =  c and 0 otherwise, and I [ a , b ] ( q ) is an indicator function that equals 1 if a  ≤  q  ≤  b and 0 otherwise. The weights w 1.0 , w 0.0 , and w u must be non-negative and add up to 1, i.e., w 1.0  +  w 0.0  +  w u  = 1. In this notation, the default sampling strategy employed in Uni-Mol+ can be represented as ( w 1.0  = 0.1,  w 0.0  = 0.8,  w u  = 0.1, [ a ,  b ] = [0.4, 0.6]). We investigate additional settings for the ablation study, and the results are summarized in Table  3 (No. 8 to No. 17). Except for No. 10 and 12, which use [ a ,  b ] = [0.0, 1.0], all other settings use [ a ,  b ] = [0.4, 0.6]. From these results, we make the following observations:

(1) Comparing No. 8, 9, and 10, we find that sampling from only one type of conformation is not effective. For No. 8, it lacks data augmentation and cannot learn an accurate mapping from equilibrium conformation to QC property. For No. 9, it experiences a distributional shift between training and inference. Although No. 10 is better, it has a low probability of sampling 0.0 and 1.0, resulting in suboptimal performance.

(2) By comparing No. 8, 9, and 11, we can deduce that sampling from the mixture of RDKit and target conformations yields a satisfactory result (Valid MAE with 0.0697). However, if only sampling from target and intermediate conformations (No. 12), the result is unsatisfactory (Valid MAE with 0.0753). This result indicates that sampling from w 1.0 is necessary, as it reduces the distributional shift between training and inference.

(3) The default strategy that samples from three types of conformations (No. 1) exhibits the best performance.

(4) Altering the weights of the mixture distribution (No. 13–17) does not result in better performance over the default strategy. Furthermore, we notice that with a decreased w 0.0 , the performance worsens. This suggests that the default weighting scheme is appropriate for this task.

(5) Upon comparing the results of No.18 and No.1, it’s clear that the performance of Noisy Nodes (No.18, Valid MAE with 0.0760) is significantly lower than that of Uni-Mol+ (No.1, Valid MAE with 0.0696). This large performance gap (0.0760 vs. 0.0696) highlights the superior efficacy of the proposed training strategy, as opposed to the one employed previously.

(6) A comparison between No.19 and No.18 shows that the model structure employed in previous works 3 , 5 , 17 yields worse results than using Uni-Mol+ ’s backbone when using Noisy Nodes strategy. This finding lends additional support to the superiority of Uni-Mol+ ’s backbone over the model architectures previously proposed.

In conclusion, the ablation study demonstrates the effectiveness of the default sampling strategy employed in Uni-Mol+, emphasizing the importance of utilizing a mixture of different conformations to achieve superior performance.

Visualized analysis of conformation learning

In addition to QC property prediction, Uni-Mol+ can also predict equilibrium conformations. Although this study primarily focuses on QC property prediction and the previous experimental results have clearly demonstrated the effectiveness of the proposed Uni-Mol+, visualized results can help to better understand how Uni-Mol+ works. Therefore, we also provide two additional analyzes for the conformation learning of Uni-Mol+ in the PCQM4MV2 dataset.

The First analysis evaluates the predicted conformations. Since the DFT conformations of the validation set (and test set) are not provided by the PCQM4MV2 dataset, we generated DFT conformations ourselves, using the same settings as the PCQM4MV2 source data 21 . As shown in Fig.  2 , Uni-Mol+ can effectively predict equilibrium conformations. Moreover, as the number of update iterations increases, the RMSD is smaller, further demonstrating the effectiveness of the proposed iterative coordinate update. We provide the conformation files used in Fig.  2 in Supplementary Data  1 .

An external file that holds a picture, illustration, etc.
Object name is 41467_2024_51321_Fig2_HTML.jpg

Comparison of RDKit-generated conformation and predicted conformations from first ( R  = 0) and second ( R  = 1) iterations, superimposed onto the target DFT conformation. Corresponding RMSDs are provided, demonstrating Uni-Mol+ 's effectiveness in predicting accurate DFT equilibrium conformations. The abbreviations RMSD represents Root Mean Square Deviation. The conformations are provided in the Supplementary Data  1 .

The second analysis aims to show that Uni-Mol+ can predict conformations with lower energies, which approaches equilibrium conformations. To demonstrate this, we selected 100 data points and calculated the energies of their initial and predicted conformations and that between their initial conformations and the DFT conformations. Here the DFT conformations is Computed by ourself using the B3LYP functional and 6-31G* basis set, consistent with the settings used in the PCQM4MV2 dataset. As shown in Fig.  3 , Uni-Mol+ can predict the conformations with lower energies. Moreover, the energy difference distribution between the initial and predicted conformations closely aligns with that between the initial and equilibrium conformations. This similarity demonstrates Uni-Mol+ ’s effectiveness in predicting equilibrium conformations accurately. We provide the conformation files used in Fig.  3 in Supplementary Data  1 .

An external file that holds a picture, illustration, etc.
Object name is 41467_2024_51321_Fig3_HTML.jpg

We selected 100 data points and used DFT to calculate the following values: ( a ) the delta energies between their initial and Uni-Mol+ 's predicted conformations; b the delta energies between their initial conformations and the DFT conformations, where the DFT conformations are calculated by ourselves using DFT tool. Cross-marks indicate data points with increased energies, while circle-marks denote those with decreased energies. This visualization demonstrates that Uni-Mol+ effectively predicts conformations with lower energies. The conformations are provided in the Supplementary Data  1 .

The aforementioned results provide additional evidence of the effectiveness of the proposed Uni-Mol+, as it can indeed predict conformations with lower energy and iteratively approach the target DFT conformations.

Previous studies have primarily relied on 1D/2D information, such as SMILES or molecular graphs, for making predictions 6 – 9 . Recently, numerous investigations 4 , 9 – 13 have employed Transformer models for graph tasks, resulting in significant advancements. Given the importance of 3D information in predicting quantum chemistry (QC) properties, several recent studies have incorporated 3D data into their approaches.

Some research has utilized 3D structural information and maximised mutual information between 2D and 3D molecular to augment 2D representations during training 3 , 22 – 24 . However, these studies only implicitly embed 3D information into 2D representations, with 2D data utilized exclusively during inference. We represent these models as x 2D  → ( x 3D ,  y ), where x 2D represents the 2D molecular graph input, x 3D represents the 3D conformation input and y denotes a QC property. A crucial shortcoming of these approaches is that they don’t explicitly learn a mapping from the 3D equilibrium conformation x 3D to y while y is highly correlated with x 3D . Some models, like Transformer-M 3 , attempt to learn both x 2D  →  y and x 3D  →  y . However, during inference, these models rely solely on x 2D , which compromises the prediction performance. Uni-Mol+, on the other hand, employs a strategy x 3D ′ → … → x 3D → y . This process starts with a raw 3D conformation x 3D ′ , iteratively refines it towards x 3D , and then predicts y . By explicitly learning a mapping from 3D conformation to QC properties, Uni-Mol+ proves to be more effective than previous models.

A few recent works have focused on property prediction using 3D conformations as input. For example, Uni-Mol 17 employs the 3D conformation generated by RDKit as input. Uni-Mol is a pre-training method centred on designing pre-text tasks for molecular data, while Uni-Mol+ is a supervised learning approach aimed at predicting QC properties from raw conformations, aided by equilibrium conformation during training. Graphormer-3D 5 utilizes the initial 3D conformation provided by the OC20 dataset 16 to predict energy at equilibrium. However, it focuses on directly learning the mapping from input to target conformations without considering a training strategy specifically tailored for conformation optimization, as done in our work. The Noisy Nodes approach 25 takes corrupted DFT conformations as inputs and aims to predict the uncorrupted ones. When an initial 3D conformation is provided, as in the OC20 dataset, Noisy Nodes generates an interpolated conformation between the initial and target conformations during training, which is similar to the uniform sampling of q in our study. In comparison to Noisy Nodes, our training strategy also incorporates a Bernoulli distribution, which has proven advantageous in addressing distributional shifts and improving QC property predictions. Moreover, both Graphformer-3D and Noisy Nodes necessitate the use of initial conformations provided by the dataset. In contrast, our study is not constrained by this requirement, as it can employ RDKit to generate initial conformations. Several studies 26 – 29 concentrate on designing new model backbones with rotation and translation equivalence or invariance in 3D space. In contrast, our work emphasizes a novel paradigm for QC property prediction, rather than developing a new model backbone.

Conformation optimization is a critical challenge in computational chemistry. Density Functional Theory (DFT) is the most prevalent method for this task, offering high accuracy but at considerable computational expense. Several deep learning-based potential energy models, such as Deep Potential 30 , have been proposed to tackle this issue by using neural networks to replace costly potential calculations in DFT, thereby enhancing efficiency. However, deep potential models still necessitate dozens or even hundreds of iterative steps to optimize the conformation based on predicted potentials. In contrast, our approach, Uni-Mol+, requires only a few optimization rounds and can optimize conformations end-to-end, whereas deep potential models cannot.

Although other studies 17 , 31 also optimize RDKit-generated conformations towards DFT conformations, they primarily focus on benchmarking conformation rather than predicting QC property. These works simply employ existing model backbones and learn the mapping between raw and equilibrium conformations. In contrast, Uni-Mol+ adopts a novel training strategy to effectively learn conformation optimization. However, it is important to note that conformation optimization serves merely as an auxiliary task; the primary objective of Uni-Mol+ is to predict QC properties.

The research most closely related to ours is EMPNN 32 , which utilizes a 2D molecular graph as input for predicting the 3D equilibrium conformation. However, EMPNN learns to map a 2D graph to a 3D equilibrium conformation, which differs from our model that optimizes from an RDKit-generated conformation. Moreover, EMPNN requires an additional model, such as SchNet 26 , to predict quantum chemistry (QC) properties using the 3D conformation generated by EMPNN as input.

In summary, our study presents a novel method capable of accurately predicting QC properties through an auxiliary task of conformation optimization. This approach has the potential to enhance the efficiency of high-throughput screening and facilitate the design of innovative materials and molecules in future research.

Model backbone

The designed model backbone can predict the equilibrium conformation and QC property simultaneously, denoted as ( y , r ^ ) = f ( X , E , r ; θ ) . The model takes three inputs, (i) atom features ( X ∈ R n × d f , where n is the number of atoms and d f is atom feature dimension), (ii) edge features ( E ∈ R n × n × d e , where d e is the edge feature dimension), and (iii) 3D coordinates of atoms ( r ∈ R n × 3 ). θ is the set of learnable parameters. And the model predicts a quantum property y and updated 3D coordinates r ^ ∈ R n × 3 .

As illustrated in Fig.  1 b, the L -block model maintains two distinct representation tracks: atom representation and pair representation. The atom representation is denoted as x ∈ R n × d x , where d x represents the dimension of the atom representation. Similarly, the pair representation is denoted as p ∈ R n × n × d p , where d p signifies the dimension of the pair representation. The model comprises L blocks, with x ( l ) and p ( l ) representing the output representations of the l -th block. Within each block, the atom representation is initially updated through self-attention, incorporating an attention bias derived from the pair representation, followed by an update via a feed-forward network (FFN). Concurrently, the pair representation undergoes a series of updates, beginning with an outer product of the atom representation (referred to OuterProduct), followed by triangular multiplication (referred to TriangularUpdate) as implemented in AlphaFold2 18 , and finally, an update using a FFN. This backbone, in comparison to the one used in Uni-Mol 17 , enhances the pair representation through two key improvements: (i) employing an outer product for effective atom-to-pair communication, and (ii) utilizing a triangular operator to bolster the 3D geometric information. Next we will introduce each module in detail.

Positional encoding

Similar to previous works 4 , 17 , we use pair-wise encoding to encode the 3D spatial and 2D graph positional information. Specifically, for 3D spatial information, we utilize the Gaussian kernel for encoding, as done in previous studies 5 , 17 . The encoded 3D spatial positional encoding is denoted by ψ 3D .

In addition to the 3D positional encodings, we also incorporate graph positional encodings similar to those used in Graphormer. This includes the shortest-path encoding, represented by ψ i , j SP = Embedding ( sp i j ) where sp i j is the shortest path between atoms ( i ,  j ) in the molecular graph. Additionally, instead of the time-consuming multi-hop edge encoding method used in Graphormer, we utilize a more efficient one-hop bond encoding, denoted by ψ Bond = ∑ i = 1 d e Embedding ( E i ) , where E i is the i -th edge feature. Combined above, the positional encoding is denoted as ψ  =  ψ 3D  +  ψ SP  +  ψ Bond . And the pair representation p is initialized by ψ , i.e., p (0)  =  ψ .

Update of atom representation

The atom representation x (0) is initialized by the embeddings of atom features, the same as Graphormer. At l -th block, x ( l ) is sequentially updated as follow:

The SelfAttentionPairBias function is denoted as:

where d h is the head dimension, W Q ( l , h ) , W K ( l , h ) , W V ( l , h ) ∈ R d x × d h , W B ( l , h ) ∈ R d p × 1 . FFN is a feed-forward network with one hidden layer. For simplicity, layer normalizations are omitted. Compared to the standard Transformer layer, the only difference here is the usage of attention bias term B ( l ,  h ) to incorporate p ( l −1) from the pair representation track.

Update of pair representation

The pair representation p (0) is initialized by the positional encoding ψ . The update process of pair representation begins with an outer product of x ( l ) , followed by a O ( n 3 ) triangular multiplication, and is then concluded with an FFN layer. Formally, at l -th block, p ( l ) is sequentially updated as follow:

The OuterProduct is used for atom-to-pair communication, denoted as :

where W O1 ( l ) , W O2 ( l ) ∈ R d x × d o , d o is the hidden dimension of OuterProduct, and W O3 ( l ) ∈ R d o 2 × d p , o  = [ o i , j ]. Please note that a ,  b ,  o are temporary variables in the OuterProduct function. TriangularUpdate is used to enhance pair representation further, denoted as:

where W T1 ( l ) , W T2 ( l ) , W T3 ( l ) , W T4 ( l ) ∈ R d p × d t , W T5 ( l ) ∈ R d p × d p , W T6 ( l ) ∈ R d t × d p , o  = [ o i , j ], and d t is the hidden dimension of TriangularUpdate. a ,  b ,  o are temporary variables. The TriangularUpdate is inspired by the Evoformer in AlphaFold2 18 . The difference is that AlphaFold2 uses two modules, “outgoing” ( o i , j  = ∑ k a i , k  ⊙  b j , k ) and “incoming” ( o i , j  = ∑ k a k , i  ⊙  b k , j ) respectively. In Uni-Mol+, we merge the two modules into one to save the computational cost.

Conformation optimization

The conformation optimization process in many practical applications, such as Molecular Dynamics, is iterative. This approach is also employed in the Uni-Mol+. The number of conformation update iterations denoted as R , is a hyperparameter. We use superscripts on r to distinguish the 3D positions of atoms in different iterations. for example, at the i -th iteration, the update can be denoted as ( y ,  r ( i ) ) =  f ( X ,  E ,  r ( i −1) ;  θ ). It is noteworthy that parameters θ are shared across all iterations. Moreover, please note that the iterative update in Uni-Mol+ involves only a few rounds, such as 1 or 2, instead of dozens or hundreds of steps in Molecular Dynamics.

3D position prediction head

Regarding the 3D position prediction head within Uni-Mol+, we have adopted the 3D prediction head proposed in Graphormer-3D 5 , as cited in our manuscript. The architecture takes atom representation x L , pair representation p L , and initial coordinates c as inputs. An attention mechanism is initially employed and then the attention weights is multiplied point-wisely with the pairwise delta coordinates derived from the initial coordinates. Similar to SelfAttentionPairBias, the attention mechanism is denoted as:

where d h is the head dimension, W Q h , W K h , W V h ∈ R d x × d h , W B h ∈ R d p × 1 . A h is the attention weights, Δ c i j is the delta coordinate between c i and c j where the superscript 0, 1 and 2 represent the X axis, Y axis and Z axis respectively. Then the position prediction head predicts coordinate updates using three linear projections of the attention head values onto the three axes, which is denoted as:

where Δ c ′ is the predicted coordinate updates and c ′ is the predicted coordinates.

As described in the above formula, the coordinate prediction head used in our study does not inherently enforce strict equivariance. This challenge can be addressed through one of two strategies: (1) Strict equivariance of the model can be achieved by sharing the parameters across the three linear layers in Eq. ( 7 )-denoted as linear 1 , linear 2 , and linear 3 -and concurrently eliminating the bias terms within these layers; (2) the model’s robustness to spatial transformations can be enhanced by incorporating random rotations into the input coordinates as a form of data augmentation. During our experimental phase, both techniques were rigorously tested. The latter approach-data augmentation via random rotations yielded better accuracy in quantum chemistry property predictions and was thus selected for our model architecture. In this case, empirical evidence suggests that with a sufficiently large training dataset, such as the PCQM4MV2 dataset, the model naturally tends towards an equivariant state. Specifically, our observations indicate that the parameters of the three linear layers tend to converge to the same, and the bias terms asymptotically approach zero, with the discrepancies being marginal (on the order of 1e − 4).

Training strategy

In DFT conformation optimization or Molecular Dynamics simulations, a conformation is optimized step-by-step, resulting in a trajectory from a raw conformation to the equilibrium conformation in Euclidean space. However, saving such a trajectory can be expensive, and publicly available datasets usually provide the equilibrium conformations only. Providing a trajectory would be beneficial as intermediate states can be used as data augmentation to guide the model’s training. Inspired by this, we propose a novel training approach, which generates a pseudo trajectory first, samples a conformation from it, and uses the sampled conformation as input to predict the equilibrium conformation. This approach allows us to better exploit the information in the molecular data, which we found can greatly improve the model’s performance. Specifically, we assume that the trajectory from a raw conformation r   init to a target equilibrium conformation r   tgt is a linear process. We generate an intermediate conformation along this trajectory via noisy interpolation, i.e.,

where scalar q ranges from 0 to 1, the Gaussian noise c ∈ R n × 3 has a mean of 0 and standard deviation υ (a hyper-parameter). Taking r (0) as input, Uni-Mol+ learns to update towards the target equilibrium conformation r   tgt . During inference, q is set to 1.0 by default. However, during training, simply sampling q from a uniform distribution ([0.0, 1.0]) may cause (1) a distributional shift between training and inference, due to the infrequent sampling of q  = 1.0 (RDKit-generated conformation), and (2) an inability to learn an accurate mapping from the equilibrium conformation to the QC properties, as q  = 0.0 (target conformation) is also not sampled often. Therefore, we employ a mixture of Bernoulli and Uniform distributions to flexibly assign higher sample probabilities to q  = 1.0 and q  = 0.0, while also sampling from interpolations. The above process is illustrated in Fig.  1 c in  Supplementary .

The model takes r (0) as input and generates r ( R ) after R iterations. Then, the model uses r ( R ) as input and predicts the QC properties. L1 loss is applied to the QC property regression and the 3D coordinate prediction. All loss calculations are performed solely on the final conformer at the last iteration.

Model configuration

Similar to both Graphormer 4 and Transformer-M 3 , Uni-Mol+ comprises 12 layers with an atom representation dimension of d x  = 768 and a pair representation dimension of d p  = 256. The hidden dimension of FFN in the atom representation track is set to 768, while that of the pair representation track is set to 256. Additionally, the hidden dimension in the OuterProduct is d o  = 32, and the hidden dimension in the TriangularUpdate is d t  = 32 as well. The number of conformation optimization iterations R is set to 1, indicating that the model iterates twice in total (once for conformation optimization and once for quantum chemistry property prediction). For the training strategy, we specified a standard deviation of υ  = 0.2 for random noise and employed a particular sampling method for q . Specifically, q was set to 0.0 with probability 0.8, set to 1.0 with probability 0.1, and uniformly sampled from [0.4, 0.6] with probability 0.1. With this setting, the number of parameters of Uni-Mol+ is about 52.4M.

Setting for PCQM4MV2

We used the AdamW optimizer with a learning rate of 2 e  − 4, a batch size of 1024, ( β 1 ,  β 2 ) set to (0.9, 0.999), and gradient clipping set to 5.0 during training, which lasted for 1.5 million steps, with 150K warmup steps. Additionally, an exponential moving average (EMA) with a decay rate of 0.999 was utilized. The training took approximately 5 days, utilizing 8 NVIDIA A100 GPUs. The inference on the 147k test-dev set took approximately 7 minutes, utilizing 8 NVIDIA V100 GPUs.

Setting for OC20

We use the default 12-layer Uni-Mol+ setting for OC20 experiments. The model configuration deviates slightly from the settings employed in PCQM4MV2. Firstly, since OC20 lacks graph information, graph-related features are excluded from the model. Secondly, due to the greater number of atoms present in OC20 compared to PCQM4MV2, the model capacity is marginally reduced for efficiency reasons. In particular, the pair representation dimension d p is set to 128, while the hidden dimensions in the OuterProduct and TriangularUpdate are set to d o   = 16 and d t  = 16, respectively. Third, the periodic boundary condition needs to be considered; we adopt the solution proposed in 5 , which pre-expands the neighbor cells and then applies a radius cutoff to reduce the number of atoms. The AdamW optimizer was employed during the training process, which lasted for 1.5 million steps, including 150K warmup steps. The optimizer was configured with a learning rate of 2 e  − 4, a batch size of 64, ( β 1 ,  β 2 ) values of (0.9, 0.999), and a gradient clipping parameter of 5.0. The training process spanned approximately 7 days and made use of 16 NVIDIA A100 GPUs.

Supplementary information

Source data, acknowledgements.

We thank Bohang Zhang, Siyuan Liu and Hang Zheng for their helpful suggestion and discussion. Di He was supported by National Key R&D Program of China (2022ZD0160300) and National Science Foundation of China (NSFC62376007).

Author contributions

S.L. and G.K. designed the model, conducted the experiments, and wrote the paper. Z.G. and D.H. assisted in writing and designing experiments. G.K. and L.Z. secured funding. All have commented on and edited the manuscript.

Peer review

Peer review information.

Nature Communications thanks Yaochen Xie, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Data availability

Code availability, competing interests.

The authors declare no competing interests.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The online version contains supplementary material available at 10.1038/s41467-024-51321-w.

IMAGES

  1. 3d model molecule ch4 Royalty Free Vector Image

    representation 3d molecule

  2. Abstract molecule 3d 438076 Vector Art at Vecteezy

    representation 3d molecule

  3. 3d molecule carbon dioxide Royalty Free Vector Image

    representation 3d molecule

  4. 3D model atoms molecules

    representation 3d molecule

  5. 3D chemical molecules 438199 Vector Art at Vecteezy

    representation 3d molecule

  6. 3D Illustration Molecule Structure. Scientific Medical Background with

    representation 3d molecule

VIDEO

  1. 3D Molecules Edit&Test

  2. Huckel Molecular Orbital of Ethylene molecule/pi bond energy / M.Sc. Sem 1 / HNGU Patan

  3. General Organic Chemistry (GOC) 3D- Representation of a molecule

  4. Molecule visualizer using 3D laplacian matrix Matlab

  5. Lecture on Representation of 3D molecule by Prof. Sharmila Pandey

  6. 21. Fischer Projection Formula: Stereochemistry-1A

COMMENTS

  1. MolView

    Drawing structural formulas. MolView consists of two main parts, a structural formula editor and a 3D model viewer. The structural formula editor is surround by three toolbars which contain the tools you can use in the editor. Once you've drawn a molecule, you can click the 2D to 3D button to convert the molecule into a 3D model which is then ...

  2. Geometry-enhanced molecular representation learning for property

    This section introduces the details of our proposed geometry-enhanced molecular representation learning method (GEM), which includes two parts: a novel geometry-based GNN and various geometry ...

  3. [PDF] Uni-Mol: A Universal 3D Molecular Representation Learning

    Uni-Mol is a universal MRL framework that significantly enlarges the representation ability and application scope of MRL schemes, and achieves superior performance in 3D spatial tasks, including protein-ligand binding pose prediction, molecular conformation generation, etc. Molecular representation learning (MRL) has gained tremendous attention due to its critical role in learning from limited ...

  4. Generation of 3D molecules in pockets via a language model

    A new FSMILES molecule representation that incorporates both local and global coordinates is introduced, enabling the generation of 3D molecules with reasonable 3D conformations and two ...

  5. Deep learning methods for molecular representation and property

    The 3D molecular graph records the 3D locations of each atom, and the 3D molecular grid is a special 3D image in which the voxels in the grid indicate different elements or attributes of molecular conformation through different methods. In this review, we highlight DL models using for molecular representation.

  6. Geometric deep learning on molecular representations

    Molecular systems (and 3D representations thereof) can be considered as objects in Euclidean space. In such a space, one can apply several symmetry operations (transformations) that are performed ...

  7. Learning Multi-view Molecular Representations with Structured and

    Uni-Mol: a universal 3D molecular representation learning framework. (2023). Google Scholar [81] Jinhua Zhu, Yingce Xia, Lijun Wu, Shufang Xie, Tao Qin, Wengang Zhou, Houqiang Li, and Tie-Yan Liu. 2022. Unified 2d and 3d pre-training of molecular representations. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data ...

  8. Pre-training Molecular Graph Representation with 3D Geometry

    Molecular graph representation learning is a fundamental problem in modern drug and material discovery. Molecular graphs are typically modeled by their 2D topological structures, but it has been recently discovered that 3D geometric information plays a more vital role in predicting molecular functionalities. However, the lack of 3D information in real-world scenarios has significantly impeded ...

  9. Uni-Mol: A Universal 3D Molecular Representation Learning Framework

    Molecular representation learning (MRL) has gained tremendous attention due to its critical role in learning from limited supervised data for applications like drug design. In most MRL methods, molecules are treated as 1D sequential tokens or 2D topology graphs, limiting their ability to incorporate 3D information for downstream tasks and, in particular, making it almost impossible for 3D ...

  10. Unified 2D and 3D Pre-Training of Molecular Representations

    2.2 3D Molecular Representation Encoding 3D spatial structure into molecular representation is im-portant to determine molecular property. Anderson et al. [2], Lu et al. [24], Schütt et al. [31] take the atomic distance into consid-eration and design a set of novel architecture to deal with atomic positions.

  11. 3D-Mol: A Novel Contrastive Learning Framework for Molecular ...

    To address these issues, we propose a novel framework, 3D-Mol, for molecular representation and property prediction. We employ three graphs to hierarchically represent the atom-bond, bond-angle, and dihedral information of molecule, integrating information from these hierarchies through a message-passing strategy to obtain a comprehensive ...

  12. MolNet‐3D: Deep Learning of Molecular Representations and Properties

    The model can learn an invariant representation without the need for the transformation of atom coordinates into interatomic distances, thus preserving the intrinsic 3D topography information of molecules. ... This work may provide new insight into the construction of molecular ML models from 3D topography recognition perspectives. Conflict of ...

  13. Unified 2D and 3D Pre-Training of Molecular Representations

    Molecular representation learning has attracted much attention recently. A molecule can be viewed as a 2D graph with nodes/atoms connected by edges/bonds, and can also be represented by a 3D conformation with 3-dimensional coordinates of all atoms. We note that most previous work handles 2D and 3D information separately, while jointly leveraging these two sources may foster a more informative ...

  14. PDF Uni-Mol: A Universal 3D Molecular Representation Learning ...

    Molecular representation learning (MRL) has gained tremendous attention due to its critical role in learning from limited supervised data for applications like ... • Uni-Mol, to our best knowledge, is the first pure 3D molecular pretraining framework, which contains a SE(3) Transformer-based backbone model to directly handle 3D positions, 2 ...

  15. Deep generative models for 3D molecular structure

    a: Each molecule 3D representation is first embedded in an intermediate representation such as atomic density or atom type, bond type and pairwise distance matrices, that will be encoded in a latent space. The latent space can be sampled from a standard normal distribution or from the original molecule encoding, to be decoded back to the ...

  16. View 3D Molecular Structures

    Using PyMOL, data can be represented in nearly 20 different ways. Spheres provides a CPK-like view, surface and mesh provide more volumetric views, lines and sticks put the emphasis on bond connectivity, and ribbon and cartoon are popular representations for identifying secondary structure and topology. PyMOL's quick demo, accessible through the built-in Wizard menu, gets users started with ...

  17. Molecular representations in AI-driven drug discovery: a review and

    A molecular graph representation is formally a 2D object that can be used to represent 3D information (e.g. atomic coordinates, bond angles, chirality). However, any spatial relationships between the nodes must be encoded as node and/or edge attributes, as nodes in a graph (the mathematical object) do not formally have spatial positions, only ...

  18. virtual Chemistry 3D

    You can play on this first page by searching molecules on the PubChem or RSCB servers. By clicking on the appropriate parts of the scheme and rotating the structures in 3D, you can explore the geometry of a molecule. Mouse gestures: use the scroll wheel to enlarge and shrink the models; shift + double-click on any atom and drag to translate ...

  19. PDF A UNIVERSAL 3D MOLECULAR REPRESENTATION LEARNING FRAMEWORK

    Uni-Mol can learn a better 3D representation than o. her baselines. 3) Uni-Mol fails to beat SOTA on the SIDER dataset. After investigation, we find that Uni-Mol fails to generate 3D conformation. for many molecules (like natural products and peptides) in SIDER. Therefore, due to the missing 3D i.

  20. ‪VSEPR‬

    ‪VSEPR‬ - PhET Interactive Simulations

  21. Data-driven quantum chemical property prediction leveraging 3D

    Some research has utilized 3D structural information and maximised mutual information between 2D and 3D molecular to augment 2D representations during training 3, 22 - 24. However, these studies only implicitly embed 3D information into 2D representations, with 2D data utilized exclusively during inference.

  22. For parts 3a-3d, draw out a complete and correct

    For parts 3a-3d, draw out a complete and correct skeletal structure representation of each molecule described below and answer the additional questions where indicated.3a. A molecule with the molecular formula C10H18ClNO such that it only has a secondary amide and tertiary alkyl halide functional group. ( 2 points)3b.