Analysis of Structured Data in Bioinformatics (Analyse strukturierter Daten in der Bioinformatik)
| Course type | Lecture (2 SWS) + exercise session (1 SWS) |
|---|---|
| ECTS | 4.0 |
| Instructors | Stefan Kramer (lecture), Ullrich Rückert (exercise session) |
| Time | Wednesday, 10:15–11:45 (lecture); Wednesday, 12:00–13:00 (exercise session) |
| Schedule | Weekly from 17.10.2007 to 10.02.2008 |
| Room | Seminar room MI 01.06.011 |
| Language of instruction | German |
| Materials (registration required) |
|---|
| Lecture slides |
| Exercise sheets |
| Bibliography |
| Tool: Aleph |
| Tool: ACE |
This course covers the mining of structured data, which is of increasing importance in bioinformatics because most biological data is not kept in databases consisting of a single, flat table. Instead, we frequently deal with databases of structured and linked objects: the “objects” in bioinformatics databases often have a rich internal structure and are connected to one another by relations. Examples include databases of proteins, small molecules, metabolic and regulatory networks, and text.
We will present techniques for both descriptive and predictive data mining in this context. In descriptive mining, we typically look for local patterns that characterize the data. In predictive mining, we look for models that can be used to make predictions for new, unseen cases. Data mining algorithms can also be distinguished by the type of data they operate on: itemsets, strings, sequences, trees, graphs, relational databases, logic, etc.
The course is structured along these two dimensions (predictive/descriptive, type of data). The first part is devoted to descriptive mining, predominantly in databases of itemsets, strings, sequences, trees and graphs. The second part deals with predictive mining, predominantly in relational databases.
No prior knowledge of machine learning or data mining is required to follow this course. The course is partly based on the book “Relational Data Mining”, edited by Sašo Džeroski and Nada Lavrač (Springer-Verlag, 2001), and on material from our ISMB-2004 tutorial “Advanced Data Mining for Bioinformatics”.
Descriptive Mining I
- Itemsets and pattern mining
- String mining
  - MolFea
  - Version space trees
  - Index structures for string mining
  - Inexact string mining
  - Application: mining DNA and protein sequences
- Episode rules
  - Parallel, serial, general episodes
  - Minimal occurrences
- Tree mining
  - TreeMiner
  - FREQT and Unot
  - PathJoin
  - FreeTreeMiner
  - Application: XML
- Graph mining
  - AGM
  - gSpan
  - Application: small molecules and (Q)SAR
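To give a flavour of the pattern mining topics listed above, the following minimal sketch enumerates frequent itemsets level by level (an Apriori-style search). It is an illustration only and not part of the course materials: the function name `frequent_itemsets`, the toy annotation data and the `min_support` threshold are assumptions made for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets contained in at least
    `min_support` transactions (level-wise, Apriori-style search)."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions that contain every item of `itemset`.
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = sorted({item for t in transactions for item in t})
    level = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            level[frozenset([item])] = s

    result = {}
    k = 1
    while level:
        result.update(level)
        k += 1
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        next_level = {}
        for cand in candidates:
            # Anti-monotonicity: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in level for sub in combinations(cand, k - 1)):
                s = support(cand)
                if s >= min_support:
                    next_level[cand] = s
        level = next_level
    return result


# Hypothetical toy data: each transaction is a set of protein annotations.
transactions = [
    {"kinase", "membrane", "ATP-binding"},
    {"kinase", "ATP-binding"},
    {"membrane", "transport"},
    {"kinase", "membrane", "ATP-binding", "transport"},
]
patterns = frequent_itemsets(transactions, min_support=3)
```

With a support threshold of 3, the only frequent pair in this toy data is `{"kinase", "ATP-binding"}`. The pruning step relies on the fact that a superset can never be more frequent than any of its subsets.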
Logic Programming and Database Basics
- Resolution
- Structured terms and lists
- Theoretical foundations
- SQL, Datalog and deductive databases
- Conjunctive queries
- Query containment and theta-subsumption
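As a rough illustration of the last topic above, the sketch below tests theta-subsumption between two conjunctions of function-free atoms by backtracking over candidate substitutions. It is a minimal example under stated assumptions, not the course's reference implementation: atoms are encoded as tuples `(predicate, arg1, ...)`, terms starting with an uppercase letter are treated as variables, the terms of the second clause are treated as fixed, and the helper names `is_variable`, `match_atom` and `theta_subsumes` are hypothetical.

```python
def is_variable(term):
    """Convention assumed here: variables start with an uppercase letter."""
    return term[0].isupper()


def match_atom(pattern, target, theta):
    """Try to extend substitution `theta` so that `pattern` under theta
    equals `target`. Return the extended substitution, or None on failure."""
    if pattern[0] != target[0] or len(pattern) != len(target):
        return None
    theta = dict(theta)  # copy, so that backtracking leaves the caller's dict intact
    for p, t in zip(pattern[1:], target[1:]):
        if is_variable(p):
            # Bind the variable, or check consistency with an earlier binding.
            if theta.setdefault(p, t) != t:
                return None
        elif p != t:
            return None
    return theta


def theta_subsumes(c, d, theta=None):
    """True if some substitution maps every atom of `c` onto an atom of `d`."""
    theta = {} if theta is None else theta
    if not c:
        return True
    first, rest = c[0], c[1:]
    for target in d:
        extended = match_atom(first, target, theta)
        if extended is not None and theta_subsumes(rest, d, extended):
            return True
    return False


# Hypothetical example clauses over a `bond` relation.
general = [("bond", "X", "Y"), ("bond", "Y", "Z")]
specific = [("bond", "a", "b"), ("bond", "b", "c"), ("atom", "a")]

forward = theta_subsumes(general, specific)    # X=a, Y=b, Z=c works
backward = theta_subsumes(specific, general)   # constants cannot be mapped
collapsed = theta_subsumes(general, [("bond", "a", "a")])  # X=Y=Z=a works
```

The third call shows that the substitution need not be injective: a two-step path pattern also subsumes a single self-loop. Deciding theta-subsumption is NP-complete in general, so this exhaustive search is only suitable for small clauses.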
Descriptive Mining II
- First-order association rule mining
- Declarative language bias
- Application: predicting protein functional class
Predictive Mining
- Representations
  - Propositional
  - Multi-instance
  - Multi-relational
- Aggregation-based propositionalization (RELAGGS)
  - Application: KDD Cup 2001 data
- Introduction to predictive mining
- Relational rule learning
  - FOIL
- Relational decision tree learning
  - Tilde, S-CART
- Relational instance-based learning
  - Set distances
  - RIBL system
  - Application: classifying protein fingerprints
- Statistical relational learning
  - Type-1 and type-2 semantics
  - Knowledge-based model construction
  - Probabilistic relational models
- Graph kernels
  - Introduction and preliminaries
  - Walk kernels, cyclic pattern kernel, optimal assignment kernel
- Link mining
  - Introduction
  - Data representation
  - Overview of tasks
  - Link-based object classification
  - Link prediction
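As a small illustration of the link prediction task named above, the following sketch scores every unlinked pair of nodes by the Jaccard coefficient of their neighbourhoods. This is one simple baseline chosen for the example and is not necessarily the method presented in the lecture; the function name `jaccard_link_scores` and the toy edge list are assumptions.

```python
from itertools import combinations


def jaccard_link_scores(edges):
    """Score each unlinked node pair by the Jaccard overlap of its neighbourhoods."""
    # Build an undirected adjacency map from the edge list.
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    scores = {}
    for u, v in combinations(sorted(neighbours), 2):
        if v in neighbours[u]:
            continue  # pair is already linked, nothing to predict
        common = neighbours[u] & neighbours[v]
        union = neighbours[u] | neighbours[v]
        scores[(u, v)] = len(common) / len(union)
    return scores


# Hypothetical toy network, e.g. interactions between five proteins.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
scores = jaccard_link_scores(edges)
best_pair = max(scores, key=scores.get)
```

In this toy network the highest-scoring candidate link is between `A` and `D`, which share two of the three nodes in their combined neighbourhood.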
