Requirements
Make sure you have the following Python packages installed:
- TensorFlow
- NumPy
- scikit-learn
- Matplotlib
- WebLogo
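A quick way to confirm that the packages above are available is to look up their modules before running any script. The sketch below is not part of the repository; the import names are assumptions, since a package's install name and import name can differ (scikit-learn is imported as `sklearn`, and recent WebLogo 3 releases are imported as `weblogo`).

```python
from importlib.util import find_spec

# Display name -> import name. The import names are assumptions and may need
# adjusting for the versions you have installed.
REQUIRED_MODULES = {
    "TensorFlow": "tensorflow",
    "NumPy": "numpy",
    "scikit-learn": "sklearn",
    "Matplotlib": "matplotlib",
    "WebLogo": "weblogo",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the display names of packages whose module cannot be found."""
    return [name for name, module in modules.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

Any package reported as missing can then be installed with your usual package manager.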
Getting started
Synthetic data
Run the script create_biological_data.py to generate a set of synthetic sequences. In the script itself you can change settings such as the length of the sequences, the number of motifs and the ratio between binding and non-binding samples. A new directory will be created in the 'Synthetic data' folder, containing .txt files for the training and test sets. It also contains a summary of the settings used to generate the dataset in 'info.txt' and a human-readable version of the sequence pairs in 'readable_pairs.csv'. The TFRecords, i.e. the sequences in binary format, are stored in the 'Records' folder. The entries in the training and test sets refer to these files.
Biological data
In Supp-C.txt and Supp-D.txt you will find the positive (interacting) and negative (non-interacting) samples, respectively. These are the files provided by Pan et al. [1]. Run the script create_biological_data.py to convert these .txt files into separate TFRecords, which the model uses as input. Two folders should appear in the biological data folder: one named Fasta, containing human-readable versions of the separate sequences, and one named Records, containing the binary TFRecords. Additionally, the training and test sets are created as .txt files. These files contain entries of the format 'seq_id_1 seq_id_2 label', where the sequence ids correspond to TFRecord files in the Records directory.
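To illustrate the 'seq_id_1 seq_id_2 label' entry format, here is a minimal parsing sketch. It is not the repository's own loader: it assumes the three fields are separated by whitespace and that the label is an integer, and the names `Pair` and `parse_pairs` are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Pair(NamedTuple):
    seq_id_1: str
    seq_id_2: str
    label: int


def parse_pairs(lines: Iterable[str]) -> List[Pair]:
    """Parse lines of the form 'seq_id_1 seq_id_2 label' into Pair tuples.

    Assumes whitespace-separated fields and an integer label.
    Blank lines are skipped; malformed lines raise ValueError.
    """
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip empty lines
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 fields, got {len(fields)}"
            )
        seq_id_1, seq_id_2, label = fields
        pairs.append(Pair(seq_id_1, seq_id_2, int(label)))
    return pairs
```

Because the function accepts any iterable of strings, an open file handle for a training or test set file can be passed to it directly.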
Running the network
After creating the TFRecords, training set and test set, you can run 'train.py'. If you want to change the hyperparameter settings, edit them in the main() function before running the script. After training, a folder is created in the 'Results' directory, containing the following:
- The weights of the network
- The TensorFlow model of the network, which can be used to load the model again at a later time
- The convolution filters plotted as WebLogos
- A .txt file containing performance metrics (AUC-ROC, accuracy, specificity, precision)
- A .txt file containing the settings used to train the model
- A prediction heatmap, which indicates which filters are relevant for classifying the samples in the test set
- ROC plot
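For reference, the performance metrics listed above can be computed from true labels and predicted scores as in the sketch below. This is not the code used by 'train.py'; it is a plain-Python illustration that assumes binary labels (1 = interacting, 0 = non-interacting) and a decision threshold of 0.5, both of which are assumptions. The function names are hypothetical. scikit-learn offers equivalent functions such as `roc_auc_score`, `accuracy_score` and `precision_score`.

```python
def auc_roc(y_true, y_score):
    """AUC-ROC as the probability that a positive outscores a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(y_true, y_score, threshold=0.5):
    """Return accuracy, specificity, precision and AUC-ROC as a dict."""
    # The threshold of 0.5 is an assumption, not a setting from the repository.
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(y_true),
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "precision": tp / (tp + fp) if (tp + fp) else nan,
        "auc_roc": auc_roc(y_true, y_score),
    }


if __name__ == "__main__":
    # Toy labels and scores, made up purely for illustration.
    print(classification_metrics([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Specificity is the fraction of non-interacting pairs that are correctly rejected, while precision is the fraction of predicted interactions that are correct.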
References
[1] Pan, X. Y., Zhang, Y. N., and Shen, H. B. (2010). Large-scale prediction of human protein-protein interactions from amino acid sequence based on latent topic features. Journal of Proteome Research, 9(10):4992–5001.