Package protkit
Protkit
Protkit is an open source Python library that can be used for a variety of tasks in computational biology and bioinformatics, focusing on structural bioinformatics, protein engineering and machine learning.
It is designed to support the broad community of computational biologists, bioinformaticians and machine learning researchers in academia, industry and government labs.
Protkit can be used for a variety of computational biology tasks across the computational biology pipeline, such as:
- Reading and writing data from popular structure file formats, such as PDB, PQR, MMTF, mmCIF; and sequence file formats, such as FASTA.
- Downloading data from popular databases of protein structures, such as the PDB RCSB, UniProt and SAbDab.
- Data structures for representing proteins, protein complexes, chains, residues, atoms and sequences. These data structures provide capabilities to extract data in both hierarchical and linear formats. It is extensible and easy to add new properties to the data structure. It has a rich set of methods for extracting and filtering data from the data structure.
- Detecting and fixing anomalies in protein structures, such as missing atoms, missing residues, detecting sequence gaps, detecting atomic clashes, removing hetero residues or water molecules, and removing alternate conformations.
- Calculating properties of proteins, such as hydrophobicity, charge, surface areas, secondary structures, dihedral angles, interface residues and more.
- Geometric operations on proteins, such as aligning and superimposing structures.
- Metrics for comparing proteins, such as RMSD and Sequence Similarity.
- Featurization of proteins and their properties enabling preparation of datasets for machine learning applications.
- Performing and enabling a large variety of computational tasks on proteins, such as protein folding, protein docking, protein-protein binding affinity prediction, humanisation of antibodies, prediction of developability characteristics etc. Care is taken that the various tools are interoperable and can be used together in a seamless manner.
Protkit is an open source library that is free to use and modify. We welcome contributions from the community.
Installation
Installation from PyPI
protkit
requires Python 3.6 or higher.
It can be installed using pip
:
pip install protkit
A number of dependencies will be installed automatically, such as numpy
, joblib
, requests
and others.
See Protkit on PyPI for more details.
Cloning the Repository
You can also clone the repository and install it from source:
git clone https://github.com/silicogenesis/protkit.git
You can install the project requirements using pip
:
pip install -r requirements.txt
Quick Start Example
Protkit is designed to be intuitive and easy to use. An extensive set of examples can be found in the Quick Start Guide.
Here is a simple example to get you started. It illustrates how powerful computation can be done with Protkit in just a few lines of code.
In the example, we download a PDB file from the RCSB, extract the A and B chains and do some cleanup like removing hetero atoms and fixing disordered atoms. We then compute dihedral angles and surface areas for the protein and save it to a file. We then load the protein from the file and print the surface area and a note that we added to the protein.
from protkit.download import Download
from protkit.file_io import PDBIO, ProtIO
from protkit.properties import DihedralAngles, SurfaceArea
# Download a PDB file from the RCSB PDB database and save it to a file.
Download.download_pdb_file_from_rcsb("1ahw", "1ahw.pdb")
# Load a PDB file into a Protein object.
protein = PDBIO.load("1ahw.pdb")[0]
# Print the number of chains in the protein.
print(protein.num_chains)
# Keep only the A and B chains
protein.keep_chains(["A", "B"])
print(protein.get_chain('A').sequence)
# Do a bit of cleanup, by removing any hetero atoms and fixing disordered atoms.
protein.remove_hetero_residues()
protein.fix_disordered_atoms()
# Compute dihedral angles for the protein, and assign them as extended attributes to residues.
DihedralAngles.dihedral_angles_of_protein(protein, assign_attribute=True)
print(protein.get_chain('A').get_residue(1).get_attribute('dihedral_angles')['PHI'])
# Compute surface areas for the protein. Surface areas are automatically computed and assigned
# at the residue, chain and protein level.
SurfaceArea.surface_area_of_protein(protein, assign_attribute=True)
print(protein.get_attribute('surface_area'))
# Save the protein to a protkit (.prot) file. All attributes, such as the
# computed dihedral angles and surface areas, will be saved as well and
# is available for later retrieval!
protein.set_attribute("note", "Experimenting with Protkit")
ProtIO.save(protein, "1ahw.prot")
protein2 = ProtIO.load("1ahw.prot")[0]
print(protein2.get_attribute('surface_area'))
print(protein2.get_attribute('note'))
Please consult the Quick Start Guide for more examples.
Expand source code
"""
# Protkit
Protkit is an open source Python library that can be used for a variety of tasks in computational biology
and bioinformatics, focusing on structural bioinformatics, protein engineering and machine learning.
It is designed to support the broad community of computational biologists,
bioinformaticians and machine learning researchers in academia, industry
and government labs.
Protkit can be used for a variety of computational biology tasks across the computational biology pipeline, such as:
- **Reading and writing data** from popular structure file formats, such as
PDB, PQR, MMTF, mmCIF; and sequence file formats, such as FASTA.
- **Downloading** data from popular databases of protein structures, such as the PDB RCSB, UniProt and SAbDab.
- **Data structures** for representing proteins, protein complexes, chains,
residues, atoms and sequences. These data structures provide capabilities to extract data
in both hierarchical and linear formats. It is extensible and easy to add
new properties to the data structure. It has a rich set of methods for extracting
and filtering data from the data structure.
- **Detecting and fixing anomalies** in protein structures, such as missing atoms,
missing residues, detecting sequence gaps, detecting atomic clashes, removing
hetero residues or water molecules, and removing alternate conformations.
- **Calculating properties** of proteins, such as hydrophobicity, charge, surface areas,
secondary structures, dihedral angles, interface residues and more.
- **Geometric operations** on proteins, such as aligning and superimposing
structures.
- **Metrics** for comparing proteins, such as RMSD and Sequence Similarity.
- **Featurization** of proteins and their properties enabling preparation of datasets
for **machine learning** applications.
- Performing and enabling a large variety of **computational tasks** on proteins,
such as protein folding, protein docking, protein-protein binding affinity prediction,
humanisation of antibodies, prediction of developability characteristics etc. Care is taken
that the various tools are interoperable and can be used together in a seamless manner.
Protkit is an open source library that is free to use and modify. We welcome
contributions from the community.
---
## Installation
### Installation from PyPI
`protkit` requires Python 3.6 or higher. It can be installed using `pip`:
```bash
pip install protkit
```
A number of dependencies will be installed automatically, such as `numpy`, `joblib`, `requests` and others.
See [Protkit](https://pypi.org/project/protkit/) on PyPI for more details.
### Cloning the Repository
You can also clone the repository and install it from source:
```bash
git clone https://github.com/silicogenesis/protkit.git
```
You can install the project requirements using `pip`:
```bash
pip install -r requirements.txt
```
---
## Quick Start Example
Protkit is designed to be intuitive and easy to use. An extensive set of examples can be found in the [Quick Start Guide](QUICK_START_GUIDE.md).
Here is a simple example to get you started. It illustrates how powerful computation can be done with Protkit in just a few lines of code.
In the example, we download a PDB file from the RCSB, extract the A and B chains and do some cleanup like removing hetero atoms and fixing disordered atoms. We then compute dihedral angles and surface areas for the protein and save it to a file. We then load the protein from the file and print the surface area and a note that we added to the protein.
```python
from protkit.download import Download
from protkit.file_io import PDBIO, ProtIO
from protkit.properties import DihedralAngles, SurfaceArea
# Download a PDB file from the RCSB PDB database and save it to a file.
Download.download_pdb_file_from_rcsb("1ahw", "1ahw.pdb")
# Load a PDB file into a Protein object.
protein = PDBIO.load("1ahw.pdb")[0]
# Print the number of chains in the protein.
print(protein.num_chains)
# Keep only the A and B chains
protein.keep_chains(["A", "B"])
print(protein.get_chain('A').sequence)
# Do a bit of cleanup, by removing any hetero atoms and fixing disordered atoms.
protein.remove_hetero_residues()
protein.fix_disordered_atoms()
# Compute dihedral angles for the protein, and assign them as extended attributes to residues.
DihedralAngles.dihedral_angles_of_protein(protein, assign_attribute=True)
print(protein.get_chain('A').get_residue(1).get_attribute('dihedral_angles')['PHI'])
# Compute surface areas for the protein. Surface areas are automatically computed and assigned
# at the residue, chain and protein level.
SurfaceArea.surface_area_of_protein(protein, assign_attribute=True)
print(protein.get_attribute('surface_area'))
# Save the protein to a protkit (.prot) file. All attributes, such as the
# computed dihedral angles and surface areas, will be saved as well and
# is available for later retrieval!
protein.set_attribute("note", "Experimenting with Protkit")
ProtIO.save(protein, "1ahw.prot")
protein2 = ProtIO.load("1ahw.prot")[0]
print(protein2.get_attribute('surface_area'))
print(protein2.get_attribute('note'))
```
Please consult the [Quick Start Guide](QUICK_START_GUIDE.md) for more examples.
---
"""
Sub-modules
protkit.core
-
This module contains the core classes and functions for the protkit package.
protkit.download
-
Package
protkit.download
contains classes to download biological data from the internet … protkit.file_io
-
Package
protkit.file_io
contains classes to read and write data from and to files containing biological data … protkit.geometry
-
Package
protkit.geometry
contains classes to apply geometric operations on proteins. protkit.metrics
-
Package
protkit.metrics
contains classes to perform various evaluations on biological data … protkit.ml
-
Package
protkit.ml
contains classes to prepare data for machine learning applications and to represent machine learning models in computational biology … protkit.properties
-
Package
protkit.properties
contains classes to calculate properties of proteins … protkit.seq
-
Package
protkit.seq
contains classes to represent sequences in computational biology … protkit.structure
-
Package
protkit.structure
contains classes to represent structural data in computational biology … protkit.tasks
-
protkit.tasks
is a package for defining the abstract base classes for tasks. protkit.tools
-
protkit.tools
is a package that contains classes that perform specific tasks on proteins. These include tasks such as docking, folding, affinity prediction, …