Molecular

Main Import

import jaqpotpy.descriptors.molecular

Main Reference

The code and descriptors are based on the open-source DeepChem library: https://github.com/deepchem/deepchem

MACCSKeysFingerprint

CLASS: MACCSKeysFingerprint()

Class for computing chemical fingerprints for molecules using MACCS (Molecular ACCess System) keys. For each molecule it generates a 167-bit fingerprint. For a list of molecules in SMILES format, use .featurize(list) to obtain the fingerprints.
[1] https://github.com/rdkit/rdkit/blob/master/rdkit/Chem/MACCSkeys.py
[2] Durant, Joseph L., et al. "Reoptimization of MDL keys for use in drug discovery." Journal of chemical information and computer sciences 42.6 (2002): 1273-1280.

Example:
from jaqpotpy.descriptors.molecular import MACCSKeysFingerprint

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
featurizer = MACCSKeysFingerprint()
# Create numpy array of features
features = featurizer.featurize(smiles)
# Create dataframe of features
features_df = featurizer.featurize_dataframe(smiles)

RDKitDescriptors

CLASS: RDKitDescriptors(use_fragment=True, ipc_avg=True)

This class computes a list of chemical descriptors, such as molecular weight, number of valence electrons, and maximum and minimum partial charge, using the RDKit library. For each molecule it calculates a total of 208 different features. For a list of molecules in SMILES format, use .featurize(list) to obtain the features. The RDKit package must be installed.
[1] https://www.rdkit.org/

Parameters

  • use_fragment(bool,optional) : If True, fragment binary descriptors are included.
  • ipc_avg(bool,optional) : If True, the IPC descriptor is calculated with RDKit's avg=True option.
Example
from jaqpotpy.descriptors.molecular import RDKitDescriptors

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
featurizer = RDKitDescriptors()
features = featurizer.featurize(smiles)

PubChemFingerprint

CLASS: PubChemFingerprint()

Class for computing structural keys for molecules using PubChem. For each molecule it calculates an 881-bit fingerprint. It is mainly used for similarity searching. For a list of molecules in SMILES format, use .featurize(list) to obtain the features. PubChemPy uses a REST API to get the fingerprint, so internet access is required.
[1] https://web.cse.ohio-state.edu/~zhang.10631/bak/drugreposition/list_fingerprints.pdf

Example
from jaqpotpy.descriptors.molecular import PubChemFingerprint

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
featurizer = PubChemFingerprint()
features = featurizer.featurize(smiles)

MordredDescriptors

CLASS: MordredDescriptors(ignore_3D = True)

Class for computing chemical descriptors for molecules using Mordred Descriptors. For each molecule it calculates a total of 1613 features. For a list of molecules in SMILES format use .featurize(list) to obtain the features. This class requires Mordred to be installed.
[1] http://mordred-descriptor.github.io/documentation/master/descriptors.html

Parameters

  • ignore_3D(bool,optional) : If True, descriptors that need 3D information are excluded from the calculation.
Example
from jaqpotpy.descriptors.molecular import MordredDescriptors

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
featurizer = MordredDescriptors()
features = featurizer.featurize(smiles)

TopologicalFingerprint

CLASS: TopologicalFingerprint(radius=2,size=2048,chiral=False,bonds=True,features=False,sparse=False,smiles=False)

Class for computing Extended Connectivity Circular Fingerprints (ECFP), also known as circular (Morgan) fingerprints. These fingerprints compute a bag-of-words style representation of a molecule by breaking it into local neighborhoods and hashing them into a bit vector of the specified size. They are designed specifically for structure-activity modelling. With the default size, each molecule yields a 2048-bit fingerprint. For a list of molecules in SMILES format, use .featurize(list) to obtain the features. RDKit must be installed.
[1] https://pubmed.ncbi.nlm.nih.gov/20426451/

Parameters

  • radius(int,optional): Specify fingerprint radius.
  • size(int,optional): Length of features.
  • chiral(bool,optional): If chirality is used for fingerprint generation.
  • bonds(bool,optional): If bonds are used for fingerprint generation.
  • features(bool,optional): If feature information is used instead of atom information.
  • sparse(bool,optional): If True, returns a dict with the sparse fingerprint.
  • smiles(bool,optional): If True, calculates SMILES strings for fragment IDs.
Example
from jaqpotpy.descriptors.molecular import TopologicalFingerprint

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
featurizer = TopologicalFingerprint()
features = featurizer.featurize(smiles)
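
The hashing step described above can be illustrated without RDKit. The sketch below is a toy and not the Morgan algorithm: it assumes the local-neighborhood identifiers are already available as strings (the fragments list and the fold_to_bits helper are made up for illustration) and only shows how arbitrary identifiers are folded into a bit vector of length size.

```python
import zlib


def fold_to_bits(fragments, size=2048):
    """Hash each fragment identifier to a position and set that bit."""
    bits = [0] * size
    for frag in fragments:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        position = zlib.crc32(frag.encode("utf-8")) % size
        bits[position] = 1
    return bits


# Hypothetical fragment identifiers, not produced by RDKit
fragments = ["c:c", "c:c:c", "c-O", "O-H"]
fingerprint = fold_to_bits(fragments, size=2048)
print(len(fingerprint), sum(fingerprint))
```

Two different fragments can hash to the same position, which is why a larger size generally reduces collisions.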

SmilesToImage

CLASS: SmilesToImage(img_size=80, res=0.5, max_len=250, img_spec="std")

Class for converting a SMILES string to an image tensor. The default image size is 80 x 80. Two image modes are currently supported: std and engd. std is the grayscale specification, with atomic numbers as pixel values for atom positions and a constant value of 2 for bond positions. engd is a 4-channel specification, which uses atom properties such as hybridization, valency, and charges in addition to atomic number. Bond type is also used for the bonds. The coordinates of all atoms are computed, and lines are drawn between atoms to indicate bonds. For the respective channels, the atom and bond positions are set to the property values as described in the paper below.
[1] https://dl.acm.org/doi/10.1145/3219819.3219838

Parameters

  • img_size(int,optional): Size of output tensor.
  • res(float,optional): Resolution of each pixel in Angstrom.
  • max_len(int,optional): Maximum length of input SMILES string.
  • img_spec(str,optional): Channel organization of the image tensor, ("std","engd").
Example
from jaqpotpy.descriptors.molecular import SmilesToImage

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
featurizer = SmilesToImage()
images = featurizer.featurize(smiles)

SmilesToSeq

CLASS: SmilesToSeq(char_to_idx, max_len=250,pad_len)

Class for converting a SMILES string into a sequence. Use .featurize to convert. SMILES strings shorter than the specified maximum length (max_len) are padded with the PAD token, while those longer than the maximum length are skipped. Following the paper, there is also the option to add extra padding (pad_len) on both sides of the string after length normalization. Using a character-to-index mapping (char_to_idx), the SMILES characters are turned into indices, and the resulting sequence of indices serves as the input for an embedding layer.
[1] https://dl.acm.org/doi/10.1145/3219819.3219838

Parameters

  • char_to_idx(dict): Dictionary containing character to index mapping for unique characters. Must be dict{str: int}.
  • max_len(int,optional): Maximum length of the SMILES string.
  • pad_len(int,optional): Amount of padding to add on either side of the SMILES seq.
Example
from jaqpotpy.descriptors.molecular import SmilesToSeq

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
featurizer = SmilesToSeq(char_to_idx=....)
sequence = featurizer.featurize(smiles)
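
Because char_to_idx has to be supplied by the caller, the plain-Python sketch below shows one way such a mapping could be built and what the padding described above does to a string. The token names <pad> and <unk>, both helper functions, and the exact padding layout are assumptions for illustration; check the library source for the tokens it actually expects.

```python
PAD, UNK = "<pad>", "<unk>"  # assumed token names, not confirmed by the docs


def build_char_to_idx(smiles_list):
    """Map every character seen in the SMILES strings to an integer index."""
    chars = sorted({ch for smi in smiles_list for ch in smi})
    vocab = chars + [PAD, UNK]
    return {ch: i for i, ch in enumerate(vocab)}


def to_sequence(smi, char_to_idx, max_len, pad_len):
    """Pad to max_len, add pad_len tokens on both sides, convert to indices."""
    if len(smi) > max_len:
        return None  # strings longer than max_len are skipped
    tokens = list(smi) + [PAD] * (max_len - len(smi))
    tokens = [PAD] * pad_len + tokens + [PAD] * pad_len
    return [char_to_idx.get(tok, char_to_idx[UNK]) for tok in tokens]


smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
char_to_idx = build_char_to_idx(smiles)
sequence = to_sequence(smiles[0], char_to_idx, max_len=20, pad_len=2)
print(len(sequence))
```

This sketch works at the level of single characters, so two-letter element symbols such as Cl or Br would be split into two tokens.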

OneHotSequence

CLASS: OneHotSequence(charset = ZINC_CHARSET, max_length = 100)

Class for encoding any arbitrary string or molecule as a one-hot array. This featurizer encodes the characters within any given string as a one-hot array. It also works with RDKit molecules: it can convert RDKit molecules to SMILES strings and then one-hot encode the characters in said strings. Use .featurize to convert.

Parameters

  • charset(list,optional) : List of string where each string is length 1 and unique. Default is ZINC Charset.
  • max_length(int,optional) : Max length of each string.
Example
from jaqpotpy.descriptors.molecular import OneHotSequence

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
featurizer = OneHotSequence()
one_hot = featurizer.featurize(smiles)
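
The encoding itself can be sketched in plain Python. This is an illustration under stated assumptions and not the library's implementation: the one_hot_encode helper and the small charset are hypothetical, and the extra last column used for padding and unknown characters is an assumed convention.

```python
def one_hot_encode(text, charset, max_length):
    """Return max_length rows, each a one-hot list of len(charset) + 1 entries."""
    if len(text) > max_length:
        raise ValueError("string is longer than max_length")
    index = {ch: i for i, ch in enumerate(charset)}
    unknown = len(charset)  # extra column for padding/unknown (assumption)
    rows = []
    for pos in range(max_length):
        row = [0] * (len(charset) + 1)
        ch = text[pos] if pos < len(text) else None  # None marks padding
        row[index.get(ch, unknown)] = 1
        rows.append(row)
    return rows


# Small hypothetical charset, not the ZINC charset used by default
charset = ["c", "C", "O", "1", "(", ")", "="]
encoded = one_hot_encode("c1cc(O)ccc1", charset, max_length=15)
print(len(encoded), len(encoded[0]))
```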

MolGraphConvFeaturizer

CLASS: MolGraphConvFeaturizer(use_edges=False, use_chirality=False, use_partial_charge=False)

Class for featurizing molecules for graph convolution. It calculates node (atom) features, an adjacency matrix, and edge (bond) features. The default feature representations are based on the paper https://arxiv.org/abs/1603.00856. Each node has 30 features and each edge has 11. Use the .featurize method to obtain the features.
[1] Kearnes, Steven, et al. "Molecular graph convolutions: moving beyond fingerprints." Journal of computer-aided molecular design 30.8 (2016):595-608

Parameters

  • use_edges(bool,optional): If edge features are calculated.
  • use_chirality(bool,optional): If chirality is used in features.
  • use_partial_charge(bool,optional): If partial charge data are used.
Example
from jaqpotpy.descriptors.molecular import MolGraphConvFeaturizer

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
graphconv_featurizer = MolGraphConvFeaturizer(use_edges=True)
graphconv = graphconv_featurizer.featurize(smiles)

# Numpy arrays of features
graphconv[0].node_features
graphconv[0].edge_features
graphconv[0].edge_index # Graph connectivity (adjacency information)
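
The connectivity stored in edge_index can be expanded into a dense adjacency matrix. The sketch below assumes edge_index is laid out as two rows of source and target atom indices, which is the layout DeepChem's GraphData uses; the helper name and the example graph are hypothetical.

```python
def edge_index_to_adjacency(edge_index, num_nodes):
    """Build a dense 0/1 adjacency matrix from a 2 x num_edges edge index."""
    sources, targets = edge_index
    adjacency = [[0] * num_nodes for _ in range(num_nodes)]
    for src, dst in zip(sources, targets):
        adjacency[src][dst] = 1
    return adjacency


# Hypothetical 3-atom chain 0-1-2, each bond stored in both directions
edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]
adjacency = edge_index_to_adjacency(edge_index, num_nodes=3)
for row in adjacency:
    print(row)
```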

PagtnMolGraphFeaturizer

CLASS: PagtnMolGraphFeaturizer(max_length=5)

Class for featurizing molecules for graph convolution. It calculates node (atom) features, an adjacency matrix, and edge (bond) features. The default feature representations are based on the PAGTN model https://arxiv.org/abs/1905.12712. Each node has 94 features and each edge has 42. Use the .featurize method to obtain the features.
[1] Chen, Barzilay, Jaakkola "Path-Augmented Graph Transformer Network" 10.26434/chemrxiv.8214422.

Parameters

  • max_length: Maximum distance up to which shortest paths are considered. Paths shorter than max_length are padded and longer paths are truncated. Defaults to 5.
Example
from jaqpotpy.descriptors.molecular import PagtnMolGraphFeaturizer

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
pagtn_featurizer = PagtnMolGraphFeaturizer()
pagtn = pagtn_featurizer.featurize(smiles)

pagtn[0].node_features
pagtn[0].edge_features
pagtn[0].edge_index

TorchMolGraphConvFeaturizer

CLASS: TorchMolGraphConvFeaturizer(use_edges=False, use_chirality=False, use_partial_charge=False)

Class for featurizing molecules for graph convolution based on pytorch-geometric. It calculates node (atom) features, an adjacency matrix, and edge (bond) features. The default feature representations are based on https://arxiv.org/abs/1603.00856. Each node has 30 features and each edge has 11. Use the .featurize method to obtain the features, which are returned as torch.Tensor objects. Requires numpy version <1.21; it does not work with newer versions.

Parameters

  • use_edges(bool,optional): If edge features are calculated.
  • use_chirality(bool,optional): If chirality is used in features.
  • use_partial_charge(bool,optional): If partial charge data are used.
Example
from jaqpotpy.descriptors.molecular import TorchMolGraphConvFeaturizer

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
torch_featurizer = TorchMolGraphConvFeaturizer(use_edges=True)
torch_features = torch_featurizer.featurize(smiles)

torch_features[0][0][1] # Adjacency Matrix
torch_features[0][1][1] # Node Features
torch_features[0][2][1] # Edge Features

AttentiveFPFeaturizer

CLASS: AttentiveFPFeaturizer(use_loops=False)

Class for featurizing molecules for Attentive FP graph neural networks, based on the paper https://pubmed.ncbi.nlm.nih.gov/31408336/. It calculates node (atom) features, an adjacency matrix, and edge (bond) features. Each node has 39 features and each edge has 10. Use the .featurize method to obtain the features.
[1] Xiong Z, Wang D, Liu X, Zhong F, Wan X, Li X, Li Z, Luo X, Chen K, Jiang H, Zheng M. Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism. J Med Chem. 2020 Aug 27;63(16):8749-8760. doi: 10.1021/acs.jmedchem.9b00959. Epub 2019 Aug 27. PMID: 31408336.

Parameters

  • use_loops(bool,optional): If self loops are added.
Example
from jaqpotpy.descriptors.molecular import AttentiveFPFeaturizer

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
att_featurizer = AttentiveFPFeaturizer()
att_features = att_featurizer.featurize(smiles)

att_features[0].edge_index
att_features[0].node_features
att_features[0].edge_features

CoulombMatrix

CLASS: CoulombMatrix(max_atoms, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1,seed=None)

Class to calculate Coulomb matrices for molecules. For N atoms the Coulomb matrix is N x N, based on the method from the paper https://proceedings.neurips.cc/paper_files/paper/2012/hash/115f89503138416a242f40fb7d7f338e-Abstract.html. This class requires RDKit to be installed. Use the .featurize method to calculate the matrix.
[1] Montavon, Grégoire, et al. "Learning invariant representations of molecules for atomization energy prediction." Advances in neural information processing systems. 2012.

Parameters

  • max_atoms(int): Maximum number of atoms to featurize (dimensions of the matrix).
  • remove_hydrogens(bool,optional): If True, hydrogens are removed before processing.
  • randomize(bool,optional): If True, the matrices are randomized.
  • upper_tri(bool,optional): If True, only the upper triangle of the Coulomb matrix is generated.
  • n_samples(int,optional): Number of samples to draw if randomize=True.
  • seed(int,optional): Random seed for deterministic results.
Example
from jaqpotpy.descriptors.molecular import CoulombMatrix

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
cm_featurizer = CoulombMatrix(max_atoms=20)

cm_features = cm_featurizer.featurize(smiles)

# Coulomb Matrix
cm_features[0]
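
For reference, the standard Coulomb matrix definition sets the diagonal to 0.5 * Z_i^2.4 and each off-diagonal entry to Z_i * Z_j / |R_i - R_j|, where Z is the nuclear charge and R the atom position. The plain-Python sketch below applies that definition to hand-written coordinates. It is not the library's implementation: the helper name, the H2 geometry, the choice of units, and the zero padding up to max_atoms are assumptions for illustration.

```python
import math


def coulomb_matrix(charges, coords, max_atoms):
    """Zero-padded max_atoms x max_atoms Coulomb matrix as nested lists."""
    n_atoms = len(charges)
    if n_atoms > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    matrix = [[0.0] * max_atoms for _ in range(max_atoms)]
    for i in range(n_atoms):
        for j in range(n_atoms):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Hypothetical H2 geometry: two hydrogens 1.4 length units apart
charges = [1, 1]
coords = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)]
cm = coulomb_matrix(charges, coords, max_atoms=3)
for row in cm:
    print(row)
```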

CoulombMatrixEig

CLASS: CoulombMatrixEig(max_atoms, remove_hydrogens=False, randomize=False, upper_tri=False, n_samples=1,seed=None)

Class to calculate the eigenvalues of Coulomb matrices for molecules. Use the .featurize method to calculate the eigenvalues.
[1] Montavon, Grégoire, et al. "Learning invariant representations of molecules for atomization energy prediction." Advances in neural information processing systems. 2012.

Parameters

  • max_atoms(int): Maximum number of atoms to featurize (dimensions of the matrix).
  • remove_hydrogens(bool,optional): If True, hydrogens are removed before processing.
  • randomize(bool,optional): If True, the matrices are randomized.
  • upper_tri(bool,optional): If True, only the upper triangle of the Coulomb matrix is generated.
  • n_samples(int,optional): Number of samples to draw if randomize=True.
  • seed(int,optional): Random seed for deterministic results.
Example
from jaqpotpy.descriptors.molecular import CoulombMatrixEig

smiles = ['c1cc(O)ccc1', 'O=Cc1ccc(O)c(OC)c1']
eig_featurizer = CoulombMatrixEig(max_atoms=20)

eig_features = eig_featurizer.featurize(smiles)

# Eigenvalues of the Coulomb matrix
eig_features[0]

MolGanFeaturizer

CLASS: MolGanFeaturizer(max_atom_count=9, kekulize=True, sanitize=False, bond_labels=None, atom_labels=None)

Class to featurize molecules for de novo molecular generation based on https://arxiv.org/abs/1805.11973. Use the .featurize method to get the features.

Parameters

  • max_atom_count: Maximum number of atoms for adjacency matrix creation.
  • kekulize: If True, molecules are kekulized.
  • sanitize: If True, molecules are sanitized.
  • bond_labels: List of bond types for adjacency matrix creation.
  • atom_labels: List of atomic numbers used for generation of node features.
Example
from jaqpotpy.descriptors.molecular import MolGanFeaturizer
from rdkit import Chem

smiles = 'c1cc(O)ccc1'
mol_object = Chem.MolFromSmiles('CCC')
# Flat list mixing a SMILES string and an RDKit Mol object
molecules = [smiles, mol_object]
gan_featurizer = MolGanFeaturizer()

gan_features = gan_featurizer.featurize(molecules)

gan_features[0].node_features
gan_features[0].adjacency_matrix