Reconhecimento de padrões proteômicos e genômicos por aprendizagem de máquinas para o diagnóstico médico. / Employ machine learning to unveil encrypted molecular patterns within proteomic and genomic profiles to assist in personalized medical diagnosis.

AUTHOR(S)
PUBLICATION DATE

2005

ABSTRACT

Motivation: Employ machine learning to unveil hidden molecular patterns within proteomic and genomic profiles to assist in personalized medical diagnosis.

Results and conclusions:

1. Proteomic profile studies: The serum proteomic profiles of patients with Hodgkin's disease (HD), a rare type of lymphoma, were compared with those of control subjects (CS) to search for differentially expressed protein patterns. Initially, a 1D gel analysis of serum proteins revealed two over-expressed proteins (~26 and 18 kDa) in HD patients (p < 0.01). To search further for discriminatory patterns, serum mass spectra from 30 CS and 30 HD patients were obtained by electrospray ionization (ESI) mass spectrometry. A support vector machine (SVM) correctly classified all spectra as either CS or HD under leave-one-out cross-validation. Subsequently, a new algorithm named maximum divergence analysis (MDA) was employed to track biomarkers in the multi-charged spectral data. Two differentially expressed peaks were enough to correctly classify 97% of all subjects. To our knowledge, this was the first time an SVM was applied to ESI multi-charged spectra for medical diagnosis. A new approach to multi-class problems, called the ellipsoid clustering machine (ECM), was then used to define a CS domain in feature space. This method is advantageous for heterogeneous sets because it efficiently defines a pattern, is able to generalize, and is applicable to multi-class problems. All CS and HD patients were correctly classified by the ECM model under leave-one-out cross-validation. The elliptical boundaries could serve as a geometrical definition of a Hodgkin's disease-free (control) serum standard. It is hoped that, by adding new biomarkers, the model could be used to diagnose various types of cancer.

2. Genetic profile analysis: This study demonstrates an improved hypertension risk evaluation method that combines a patient's renin-angiotensin-aldosterone system (RAAS) genomic profile with pertinent clinical data. The most relevant clinical features are chosen by querying a pre-computed database of feature subsets for the given genetic profile. Disease risk is evaluated by classifying the patient's data with a support vector machine model and then measuring the Euclidean distance to the decision hyperplane. To create this database, a new hybrid feature selection / ranking method was used to generate feature subsets from data acquired from Brazilian hypertension patients. Applying feature selection to RAAS haplotypes ascertained their association with hypertension and elucidated distinct polymorphism patterns for different ethnic groups.

3. Distributed computing for future studies: Grid computing should be employed to carry out faster feature selection and classification studies. Most distributed computing / grid solutions have complex installation procedures that require specialist support, or are limited to certain operating systems. In this work, we demonstrate Squid, a new multi-platform, open-source program designed to remain simple while offering high-end computing power for large-scale applications. Squid also has an efficient fault tolerance and crash recovery system that protects against data loss: it re-routes jobs upon node failure and can recover even if the master node fails.
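
As an illustration only, the two evaluation ideas mentioned in items 1 and 2 (leave-one-out cross-validation, and scoring a subject by the signed Euclidean distance to a decision hyperplane) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the toy data, and `fit_centroid_hyperplane` are hypothetical, and the centroid-bisecting hyperplane is a simple stand-in for a trained SVM, used here only to keep the example dependency-free.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Hyperplane = Tuple[List[float], float]  # (w, b) describing w.x + b = 0


def signed_distance(w: Vector, b: float, x: Vector) -> float:
    """Signed Euclidean distance from x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0.0:
        raise ValueError("w must be a non-zero vector")
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def fit_centroid_hyperplane(X: List[Vector], y: List[int]) -> Hyperplane:
    """Hypothetical stand-in for SVM training (NOT an SVM): returns the
    hyperplane that perpendicularly bisects the segment joining the two
    class centroids. Labels are assumed to be +1 and -1."""
    dim = len(X[0])
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == -1]
    mu_pos = [sum(x[j] for x in pos) / len(pos) for j in range(dim)]
    mu_neg = [sum(x[j] for x in neg) / len(neg) for j in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    # b places the hyperplane at the midpoint between the two centroids.
    b = -sum(wi * (p + n) / 2.0 for wi, p, n in zip(w, mu_pos, mu_neg))
    return w, b


def leave_one_out_accuracy(
    X: List[Vector],
    y: List[int],
    fit: Callable[[List[Vector], List[int]], Hyperplane],
) -> float:
    """Hold out each sample once, fit on the rest, and test on the held-out one."""
    correct = 0
    for i in range(len(X)):
        w, b = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        predicted = 1 if signed_distance(w, b, X[i]) >= 0 else -1
        if predicted == y[i]:
            correct += 1
    return correct / len(X)


# Toy, made-up feature vectors (not data from the study).
TOY_X: List[Vector] = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                       [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
TOY_Y: List[int] = [-1, -1, -1, 1, 1, 1]

if __name__ == "__main__":
    print("LOO accuracy:", leave_one_out_accuracy(TOY_X, TOY_Y, fit_centroid_hyperplane))
    w, b = fit_centroid_hyperplane(TOY_X, TOY_Y)
    # A larger positive distance means the subject lies deeper in the +1 region.
    print("risk score:", signed_distance(w, b, [4.0, 4.5]))
```

With a real linear SVM, `w` and `b` would come from the trained model instead of the centroid rule; the distance computation itself is unchanged.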

SUBJECT(S)

distributed computing; genetic profile analysis; molecular biology; proteomic profile; proteomic profile studies
