Article by Disease on “Haematology”
Abstract: Supervised learning algorithms are used to classify human protein-protein interaction information and to consolidate existing data. These algorithms require a feature vector to build a prediction model, and feature vectors can be constructed from various kinds of input data. A feature vector that suits the classification algorithm yields a more predictive model and more accurate predictions, even from low-dimensional vectors. To investigate suitable combinations of feature sets and algorithms, three feature vectors (AA Frequency, AA Graphical Parameter, and AA Triplex, where AA denotes amino acid) were constructed solely from the primary structure of human red blood cell proteins and then applied to five classification methods. The support vector machine (SVM) algorithm produced the highest accuracy, 84.65%, with the AA Graphical Parameter feature set, and reached 80.65% with the AA Triplex feature set. Random forest (RF) achieved a high average accuracy of 83.69% across the three feature sets. Of the two Bayesian classifiers, tree-augmented naive Bayes (TAN) outperformed naive Bayes (NB) with all three feature sets. The artificial neural network (ANN) classifier showed the lowest average accuracy, 76%; however, its performance was comparable with that of TAN when the AA Triplex feature set was used, with an accuracy of 77.90%. These results indicate that selecting an appropriate feature set for a classification task yields higher accuracy, with the added advantage of using low-dimensional feature vectors constructed from simpler data.
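As an illustration of how a feature vector can be derived from primary structure alone, the following minimal Python sketch computes an amino acid frequency vector. The abstract does not define the AA Frequency feature set precisely, so the 20-dimensional relative-frequency encoding is an assumption, as are the function names aa_frequency and pair_features and the choice to represent a protein pair by concatenating the two vectors.

```python
from collections import Counter
from typing import List

# The 20 standard amino acids in one-letter code (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_frequency(sequence: str) -> List[float]:
    """Return the relative frequency of each standard amino acid.

    Hypothetical helper: residues outside the 20-letter alphabet
    (for example X or gap characters) are ignored.
    """
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # No recognizable residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def pair_features(seq_a: str, seq_b: str) -> List[float]:
    """Assumed pair encoding: concatenate the two 20-dimensional vectors."""
    return aa_frequency(seq_a) + aa_frequency(seq_b)
```

A vector produced this way has 40 dimensions per protein pair and could be passed to any of the classifiers named in the abstract; whether the authors encoded pairs in this manner is not stated.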