Mapping biological sequences or tracking tweets: computer scientists find new, unique approaches to data analysis
By Dr. Frank-Michael Schleif
Classification models are central to many data analysis tasks, such as assigning patient data to diseases or images to categories. If the training data for these models are vectorial and fully labeled, effective classifiers are readily available. In the modern world, however, data are generated much faster than we can analyze them, so labeling becomes very costly and labels are often partially missing.
Furthermore, real-world data such as biological sequences, tweets or text documents are non-vectorial and are compared by novel non-metric similarity measures, for example the edit distance or the compression distance. The data are then given only by non-metric pairwise proximities, and almost no methods are at hand. As the complexity of the data increases, the interpretability of the model becomes more important in order to communicate results and model behavior effectively, or simply to ease the search for errors.
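To make the idea of data "given only by pairwise proximities" concrete, the following minimal sketch builds a dissimilarity matrix from short sequences. It assumes the plain Levenshtein form of the edit distance; the function names `edit_distance` and `dissimilarity_matrix` and the toy sequences are illustrative and not taken from the paper.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def dissimilarity_matrix(items: Sequence[str]) -> List[List[int]]:
    """Pairwise edit distances; the classifier would see only this matrix."""
    n = len(items)
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = edit_distance(items[i], items[j])
    return D


# Hypothetical toy sequences, for illustration only.
sequences = ["ACGT", "ACGA", "TTGT"]
D = dissimilarity_matrix(sequences)
```

Once `D` is computed, the original sequences are no longer needed by a proximity-based learner, which is why such methods apply equally to tweets, documents or protein sequences.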
Prototype-based methods provide intuitive and simple model access: they represent their decision in terms of typical representatives of the data, which can be directly inspected by domain experts in the field. We extend a prototype-based approach to address the modeling of non-metric proximities with partially missing labels. This is achieved by linking the theory of conformal prediction to prototype learning for non-metric proximities. Conformal prediction provides a confidence measure of the classification, which can be used to identify secure regions of unlabeled data. These regions can then be included at high confidence in the training process by assigning the most likely labeling. Similarly, we can find insecure regions of labeled data, which are used to adapt the model complexity.
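The confidence measure described above can be sketched as follows. This is a simplified illustration rather than the published algorithm: it assumes that prototypes are individual data points (identified by their index in the proximity matrix), that the nonconformity score is the ratio of the distance to the nearest same-label prototype over the distance to the nearest other-label prototype, and that confidence is one minus the second-largest p-value. All function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Prototype = Tuple[int, str]  # (index of a data point, class label)


def nonconformity(d_row: Sequence[float], prototypes: List[Prototype],
                  label: str) -> float:
    """How atypical a point looks under `label`.
    d_row[k] is the dissimilarity between the point and data item k."""
    same = min(d_row[k] for k, lab in prototypes if lab == label)
    other = min(d_row[k] for k, lab in prototypes if lab != label)
    return same / (other + 1e-12)  # small constant avoids division by zero


def p_values(d_row: Sequence[float], prototypes: List[Prototype],
             calib_scores: Sequence[float],
             labels: Sequence[str]) -> Dict[str, float]:
    """Conformal p-value per candidate label: share of calibration scores
    that are at least as nonconforming as the test point."""
    out = {}
    for y in labels:
        a = nonconformity(d_row, prototypes, y)
        at_least = sum(1 for s in calib_scores if s >= a)
        out[y] = (at_least + 1) / (len(calib_scores) + 1)
    return out


def predict_with_confidence(pvals: Dict[str, float]) -> Tuple[str, float, float]:
    """Return (label, confidence, credibility)."""
    ranked = sorted(pvals.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_p = ranked[0]
    second_p = ranked[1][1]
    return best_label, 1.0 - second_p, best_p


# Toy data on a line; only the pairwise distances are used.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
D = [[abs(p - q) for q in points] for p in points]
prototypes = [(1, "a"), (4, "b")]
calib = [(0, "a"), (2, "a"), (3, "b"), (5, "b")]  # labeled points
calib_scores = [nonconformity(D[i], prototypes, y) for i, y in calib]

# An unlabeled point at 1.5, described only by its distances to the data.
d_new = [abs(1.5 - q) for q in points]
label, confidence, credibility = predict_with_confidence(
    p_values(d_new, prototypes, calib_scores, ["a", "b"]))
```

In a semi-supervised loop, unlabeled points whose confidence exceeds a chosen threshold would be added to the training set with their predicted label, while labeled points with low confidence would indicate regions where an additional prototype may be needed.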
The new approach can directly deal with arbitrary symmetrized proximity matrices, offers intuitive classification by sparse prototypes and adapts the model complexity, leading to an efficient classifier for weakly labeled data represented by non-metric (and metric) proximities. Potential applications of this method are extensive and cover any classification problem where data are compared by a symmetric or symmetrized proximity function. This work is particularly beneficial for the classification of protein sequences in bioinformatics, the detection of shapes in image processing and robotics, or the classification of textual documents in information retrieval, where effective domain-specific similarity measures are available.
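As a small illustration of what "symmetrized" can mean in practice, the sketch below averages a proximity matrix with its transpose. Averaging is an assumption made here for illustration; the article does not state which symmetrization is used, and the name `symmetrize` is hypothetical.

```python
from typing import List, Sequence


def symmetrize(P: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return S with S[i][j] = (P[i][j] + P[j][i]) / 2 for a square matrix P."""
    n = len(P)
    if any(len(row) != n for row in P):
        raise ValueError("proximity matrix must be square")
    return [[0.5 * (P[i][j] + P[j][i]) for j in range(n)] for i in range(n)]


# Hypothetical asymmetric proximities between three items.
P = [[0, 2, 4],
     [4, 0, 1],
     [6, 3, 0]]
S = symmetrize(P)
```

After this step `S[i][j]` equals `S[j][i]` for every pair, so the matrix satisfies the symmetry requirement named above.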
Xibin Zhu, Frank-Michael Schleif and Barbara Hammer were awarded College Best Publication November 2014 for ‘Adaptive conformal semi-supervised vector quantization for dissimilarity data’.
The paper is available online here.
Image caption: Simulated banana shape data illustrating semi-supervised learning for very few labeled points:
(a) Initial training
(b) 10th iteration
(c) Final model