Computational Modeling and Analysis to Predict Intracellular Parasite Epitope Characteristics Using Random Forest Technique.

Background
In a new approach, computational methods are used to design and evaluate the vaccine. The aim of the current study was to develop a computational tool to predict epitope candidate vaccines to be tested in experimental models.


Methods
This study was conducted in the School of Allied Medical Sciences, and Center for Research and Training in Skin Diseases and Leprosy, Tehran University of Medical Sciences, Tehran, Iran in 2018. The random forest which is a classifier method was used to design computer-based tool to predict immunogenic peptides. Data was used to check the collected information from the IEDB, UniProt, and AAindex database. Overall, 1,264 collected data were used and divided into three parts; 70% of the data was used to train, 15% to validate and 15% to test the model. Five-fold cross-validation was used to find optimal hyper parameters of the model. Common performance metrics were used to evaluate the developed model.


Results
Twenty seven features were identified as more important using RF predictor model and were used to predict the class of peptides. The RF model improves the performance of predictor model in comparison with the other predictor models (AUC±SE: 0.925±0.029). Using the developed RF model helps to identify the most likely epitopes for further experimental studies.


Conclusion
The current developed random forest model is able to more accurately predict the immunogenic peptides of intracellular parasites.


Introduction
Historically, the most effective public-health prevention against infectious disease is vaccination. Development of an effective vaccine against any disease is a major breakthrough to control the disease. There have been tremendous efforts to de-velop vaccines against infectious and non-infectious diseases, but yet no vaccine is available against many infectious diseases. The development of a new vaccine from theory to practice is a complex process task. Preclinical studies to de-velop a vaccine are a long process, time-consuming, needs enough funds and infrastructure, which are not available in the regions of the world, which suffer from the infectious diseases the most. Emerging modern technology and computational models in biomedicine have provided new horizons for discovering, and designing vaccines. Using in-silico approach, the designed epitopes might be used and tested experimentally in the preclinical setting. Nowadays, using in-silico approach has been advanced rapidly and assists in different aspects of biomedical sciences. In in-silico approach, vaccine logically is designed using computational algorithms and evaluate using computer simulation (1)(2)(3). Using in-silico approach is a shortcut method to identify novel immunogenic peptides for the development of a vaccine prior to in vitro and in vivo evaluation. Several computational methods, including binding motifs (BM), quantitative matrices (QM), machine learning algorithms (ML), evolutionary algorithms, linear programming and hybrid methods are usually used to predict the class of peptides (4)(5)(6). The computational methods mostly distinguish the peptides based on amino acid properties. Among the computational methods, ML is more commonly used to identify the class of peptides and design epitope-based vaccines for the prevention and/or possibly treatment of infectious and noninfectious diseases (7). Some of the common supervised ML algorithms for pattern recognition include support vector machine (SVM), neural networks (NN), naïve Bayes, decision tree (DT), random forest (RF), and hybrid methods (8,9). Among the above-mentioned methods, RF is the more popular ML approach, due to the fact that it is easy to understand, handy to use, interpretation and robustness. In RF algorithm, various decision trees with a high diversity between the individual trees were generated in the forest. Every one of the created decision trees independently predicts the class of the peptides. The diversities of the trees are controlled using bootstrap replacement sampling from the training dataset and a subset of the features is randomly selected. Then, the final decision is made based on the majority of the votes of the aggregated predicted trees (10)(11)(12)(13)(14)(15).
The aim of the current study was to develop computational tools based on ensemble random forest machine learning model to facilitate Th1 epitopes identification to be used as the vaccine candidate for intracellular parasites.

Materials and Methods
The methods used in the current study for the data collection, peptide properties extraction, data processing, and the development of RF model are as follows:

Data resources
The sequences of the proteins were retrieved from UniProtKB/Swiss-Prot database http://www.uniprot.org/ (16,17). T cell epitopes were retrieved from Immune Epitope Database (IEDB) http://www.iedb.org/ (18). Access to both databases are free. The date of the data retrieval is Apr 18, 2017.

Data preparation
From 6,223 MHC class II T cell epitopes retrieved from IEDB database, 3,200 epitopes with a length of 9-to 21-mer were selected from 524 antigens previously showed to be immunogenic and as such were marked as positive assays epitopes. Gibbs sampler method (19,20) was used to align 9-mer core-binding motif and stored as epitope dataset class. To select non-epitope peptides, the proteins which contain epitopes with define sequences were retrieved from UniProtKB/Swiss-Prot database, after removing the epitopes, the remaining sequences were scanned using windows size of 9mer to extract non-epitope peptides. The non-redundant extracted peptides were stored as the non-epitope dataset class. Two stored datasets were used to train, validate and test the RF model.

Peptide descriptor extraction
The properties used to develop the model are peptide AA composition (AAC) and AA physicochemical properties (AAPP). The AAC for each peptide was calculated with the following equation where k is one of each 20 AA: The AAPP used to identify the class of peptide are as follows: The distribution of residue AA in each position, aliphatic index (21), hydropathy scale (22), polarity scale (23), isoelectric point (PI) (24), net charge, number of bulky AA (Leu, Ile, Phe, Try, Tyr, Val), number of less bulky AA (Ala, Arg, Asp, Asn, Cys, Glu, Gln, Gly, His, Lys, Met, Pro, Ser, Thr) (25), chemical characteristic of the peptides (aromatic, aliphatic, sulfur, hydroxyl, and amide), and the number of potential side-chain hydrogen bonds (donor, acceptor, both, and Non) (26).

Model development
The MATLAB ver. 2014 software was used to develop the RF model, RF is an ensemble-learning approach usually used for classification and regression. RF combines various classifications DT is used to produce a more accurate classification. Bootstrap aggregation algorithm was used to create the ensemble DT classifiers. Each classifier independently predicts the class of peptides and the majority vote on the DT classifiers defines class of peptides in RF model. In this study, Gini's Diversity Index (GDI) was used to measure the node impurity, and feature with the highest GDI was selected as the split feature in the node. The performance of RF algorithm depends on the tuning of a number of hyper parameters. The optimal hyper parameters were distinguished using assign multiple values to develop a suitable model. The 5-fold cross-validation was used to evaluate and tune the hyper parameters. The values assigned to each parameter are as follows: The maximum number of random ensemble trees (n-Tree) in RF model is set to 2,000. The number of predictors used to split the appropriate node (m-try) was set to 9 (square root of features number in the dataset). The minimum size of the leaf node (node-size) was set to 2. The maximum growth depth (tree-Depth) for each RT was set to 100.

Performance evaluation
The collected data set was randomly divided into three parts; 70% of the data was used to train; 15% of the data was used to validate, and the rest 15% of the data was used to test the model. The performance of the model was calculated by accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), error rate, and area under the ROC (AUC) (27)(28)(29).

Statistical analysis
The Cohen's kappa statistics was used to quantify degree of agreement and assess reliability of the model. The Pearson's chi-square, McNemar's, Wald, and Z test were used to analysis of data (30,31). The probability values less than 0.05 were considered statistically significant. The statistical analyses were performed using SPSS 16.0 (SPSS Inc., Chicago, IL, USA).

Results
From the 3,200 epitopes, 1,264 non-redundant 9mer core-binding motif and 1,264 similar nonepitope peptides were used to train, validate, and test the model. The feature selection is an important step to develop RF model. In this step, the impurity features were included and the noisy and redundant features were excluded to improve the performance. Figure 1 shows the feature importance score distribution with the positive score of the peptide properties. From 59 features, twenty-seven features are positive score and fourteen are detected as having the greatest effect to discriminate the class of peptides (criteria greater than 40% was considered as cut off). The AA residue at position 1 is the highest rank feature to identify the class of peptide with a 99% score. The next feature is AA residue at position 9 with 91% importance score. The number of alanine and glycine are 82%, and 70% importance, respectively. The AA residue at position 6 is 64% importance score. The number of bulky AA and glutamic acid in the peptide is 53%, and 48% importance, respectively. The PI index is a 46% importance score. The number of bulky less and aromatic AA in the peptide is 45%, and 42% importance, respectively. The numbers of isoleucine, phenylalanine, and valine AA in the peptide was 40% importance. The other features in the model are less important, with percentages of 34% to 5% (Fig. 1).

Fig. 1: Features importance plot
The accuracy of the RF model to train, validate, and test the dataset are 96.7%, 95.8%, and 91.6%, respectively. Therefore, only at maximum 8.4% of the data was incorrectly classified. The minimum sensitivity and specificity for the 3 datasets are 92.6% and 90.5%, respectively, which means the RF model correctly detects at least 92.6% of the epitopes and 90.5% of non-epitopes. The minimum PPV and NPV for each of the 3 datasets are 90.7% and 92.5%, respectively, that means the RF model categorized the epitopes correctly at least in 90.7% of the epitopes in this class and categorized correctly at least 92.5% of nonepitope class in this class (Table 1).   Table 2 shows the description of the decision rules (DRs) extracted to classify the peptides, ordered by the rule accuracy. The developed RF model obtained 6 rules with accuracy from 88% to 97%.
The DRs 1, 4, 5, and 6 identified the non-epitope class with rule accuracy of 97%, 93%, 90%, and 88% respectively. The DRs 2 and 3 identified the epitope class with rule accuracy of 97% and 95%, Along with AA type in positions 1 and 9 at DR4, the number of bulky less AA greater than 6 and the number of alanine AA greater than 1 is indicative of a non-epitope. Along with AA type in position 1 and position 9 at DR5, the number of bulky less AA greater than 5 is indicative that the peptide is a non-epitope. Along with AA type in position 1 and 9 at DR2, the AA type at position 6 including (A, C, H, R, V) and PI of peptide between 3.7 to 6.5 is an indicative that the peptide belongs to non-epitope class. Along with AA type in position 1 and 9 at DR3, the number of aromatic AA greater than 0 is an indication that the peptide belongs to epitope class.

Discussion
The in-silico approach is a proper strategy to develop a novel epitope-based vaccine. In epitopebased vaccine design, identification of immunogenic peptide is the first and critical step. Using, computational approaches in vaccinology assist The EpiTOP and MHCPred tools use quantitative structure-activity relationship method (QSAR) to detect mathematical meaningful relationships between the peptide physicochemical properties, molecular structure and biological activities. The average of AUC for the EpiTop is 0.79 with a range of 0.72 to 0.89. MHCPred using a partial least squares multivariate statistical method to predict binder peptides to MHC molecules with overall accuracy of 0.79 (32,33). ProPred, uses quantitative affinity matrix (QAM) method to identify protein-protein interactions (PPIs), The average of AUC for ProPred is 0.76 with a range of 0.66 to 0.89 (34). TEPITOPE uses position-specific scoring matrix algorithms (PSSM) to score the conserved regions of the proteins. The average of AUC for TEPITOPE is 0.73 with a range of 0.67 to 0.77 (35). MHC2Pred, SVRMHC, and SVMHC are SVMbase methods with different kernel functions (linear, polynomial and RBF) to predict the class of the peptides. MHC2Pred uses matrix optimization technique (MOT) to detect 9-mer core-binding motif and predict promiscuous MHC class II binding core with overall accuracy of about 0.79 (33,36). SVRMHC uses quantitative SVM regression method to predict peptide-MHC binding affinities with an average of AUC=0.786 and a range of 0.74 to 0.83 (37). SVMHC predicts MHC-binding peptides with an average of AUC=0.76 and a range of 0.66 to 0.86 (38,39). RANKPEP uses PSSM algorithms to score the conserved regions of the protein for both MHC class I and II molecules with an average of AUC=0.78 and a range of 0.54 to 0.93 (40). NetMHCII and NetMHCIIpan are networkbased (NN) ensemble methods. These methods estimate the optimal peptide binding-core motif and neuron weighted connection. NetMHCII uses a set of individual networks for each MHC class, and NetMHCIIpan uses a single public NN model to predict epitope. The average of AUC for NetMHCII is 0.79, and a range of 0.71 to 0.85, and NetMHCIIpan is 0.858 and a range of 0.75 to 0.96 (41,42).
The range of AUC tools mentioned above is (0.73-0.86). The performance of RF developed model is at least 0.95 for the test dataset. The comparison AUC of mentioned tools and RF developed model showed that the performance of RF models is 11% to 30% higher than the 10 mentioned models. Moreover, the kappa coefficient indicated that there is as strong agreement between the 190 pairs of the test dataset. All of these indices showed that the developed model has a proper performance to predict the class of peptides. The experimental studies on epitope of human showed that the epitopes contain hydrophobic, aliphatic or aromatic AA at positions 1, 4, 6, and 9 (43,44). The hydrophobic AA is the priority at position 1 and 9 (45)(46)(47). Six extracted decision rules in RF models for discriminate to class of peptide. Based on the results of this study, the developed RF model is highly efficient in the prediction of parasite MHC class II T cell epitopes.

Conclusion
The random forest algorithm is a flexible, robustness and accurate statistical approach. This method is able to handle unbalanced datasets; many input features without variable deletion, estimates important scores for each feature without any required assumption and restriction in the traditional statistical methods. These advantages make it the most common method for classification of peptides. In the current study, an RF model was developed based on biochemical peptide properties to identify the class of peptides exist in a given protein. The performance measures of RF developed model improve in comparison with the common T-cell epitopes prediction tools. Accordingly, using the RF model facilitates selection of most likely immunogenic epitopes for further complementary experimental studies.

Ethical considerations
Ethical issues (Including plagiarism, informed consent, misconduct, data fabrication and/or falsification, double publication and/or submission, redundancy, etc.) have been completely observed by the authors.