This is the course website for BMTRY 790, Machine Learning and Data Mining.
Course Description
The course aims to provide broad exposure to machine learning and data mining techniques, concentrating mainly on statistical learning and its applications in the biomedical domain.
Textbooks
The required book: Bishop, CM (2006). http://research.microsoft.com/~cmbishop/PRML/ Pattern Recognition and Machine Learning. Springer
Other Reference books
MacKay, D (2005). http://www.inference.phy.cam.ac.uk/mackay/itprnn/book.html Information Theory, Inference, and Learning Algorithms. Cambridge University Press
Duda, RO, Hart, PE, and Stork, DG (2001). http://rii.ricoh.com/~stork/DHS.html Pattern Classification. Wiley-Interscience
Lecture notes
Jan 8th Introduction
Jan 10th & 15th, Linear Classifier
March 4th, EM Algorithm
Slides on Multivariate Information Bottleneck by Slonim.
Lecture Notes in TeX
Homework Assignment
- Homework assignment 1: Derive the equation specified in Question 4.13 of the text. Due Jan 29th before class
Program assignment 1: Implement a naive Bayes classifier to classify the classProjectData.xml data. The data contain text documents from three classes. The documents are MEDLINE abstracts that have been processed by removing stop words, stemming, and mapping the remaining words to integer indices. The assignment is due on Tues, Feb 12th. Hand in: a) The code in the language of your choice; Matlab, R, and Python are preferred. b) The results of 10-fold cross-validation, reporting performance as the percentage of correct predictions (accuracy).
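The following is a minimal Python sketch of the workflow described above: a multinomial naive Bayes classifier with add-one smoothing, evaluated by 10-fold cross-validation. It assumes each document is already a list of integer word indices with one class label per document; the layout of classProjectData.xml is not described here, so the XML parsing step is omitted and a synthetic three-class corpus (`make_toy_corpus`, a hypothetical helper) stands in for the real data. The multinomial event model is one common choice, not necessarily the one expected in class.

```python
import math
import random
from collections import Counter


def train_nb(docs, labels, vocab_size, alpha=1.0):
    """Fit a multinomial naive Bayes model with add-alpha smoothing."""
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(w for d in class_docs for w in d)
        total = sum(counts.values()) + alpha * vocab_size
        log_lik[c] = [math.log((counts.get(w, 0) + alpha) / total)
                      for w in range(vocab_size)]
    return log_prior, log_lik


def predict_nb(model, doc):
    """Return the class with the highest log posterior for one document."""
    log_prior, log_lik = model
    return max(log_prior,
               key=lambda c: log_prior[c] + sum(log_lik[c][w] for w in doc))


def cross_validate(docs, labels, vocab_size, k=10, seed=0):
    """Return k-fold cross-validation accuracy as a fraction in [0, 1]."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = train_nb([docs[i] for i in train],
                         [labels[i] for i in train], vocab_size)
        correct += sum(predict_nb(model, docs[i]) == labels[i] for i in fold)
    return correct / len(docs)


def make_toy_corpus(n_per_class=30, vocab_size=30, doc_len=20, seed=1):
    """Synthetic stand-in data: class c mostly uses words 10c..10c+9."""
    rng = random.Random(seed)
    docs, labels = [], []
    for c in range(3):
        block = list(range(10 * c, 10 * c + 10))
        for _ in range(n_per_class):
            docs.append([rng.choice(block) if rng.random() < 0.8
                         else rng.randrange(vocab_size)
                         for _ in range(doc_len)])
            labels.append(c)
    return docs, labels


if __name__ == "__main__":
    toy_docs, toy_labels = make_toy_corpus()
    acc = cross_validate(toy_docs, toy_labels, vocab_size=30)
    print(f"10-fold accuracy: {100 * acc:.1f}%")
```

Working in log space avoids numerical underflow when documents are long, and the smoothing constant keeps unseen words from producing zero probabilities.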
- Program Assignment 2 (out Feb 5th, due on Feb 21st electronically). Use the SVMlight program to classify the MEDLINE text documents provided in the previous assignment. Turn in a report containing: 1) The cross-validation accuracy. 2) An essay that connects the main material covered in the course so far: what a linear classification machine is; how the "linear" characteristic is reflected in both naive Bayes and SVM; what the difference is between the discriminative and generative approaches to classification; and how that difference is reflected in NB and SVM.
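SVMlight reads training examples as one line per example in the form `<target> <feature>:<value> ...`, with feature numbers starting at 1 and listed in increasing order, and with targets of +1 or -1 for classification. The sketch below converts an indexed document into such a line. It assumes 0-based word indices (hence the shift by one), uses raw term counts as feature values, and handles the three classes by building one binary problem per class (one-vs-rest); all of these are illustrative choices, not requirements of the assignment.

```python
from collections import Counter


def to_svmlight_line(doc, label, positive_class):
    """Format one document as an SVMlight example line.

    doc: list of 0-based integer word indices (assumed input layout).
    label: the document's class; positive_class: the class treated as +1.
    """
    target = "+1" if label == positive_class else "-1"
    counts = Counter(doc)
    # SVMlight expects 1-based feature numbers in increasing order.
    feats = " ".join(f"{w + 1}:{counts[w]}" for w in sorted(counts))
    return f"{target} {feats}"


def write_one_vs_rest(docs, labels, positive_class, path):
    """Write a binary training file for one class against the rest."""
    with open(path, "w") as fh:
        for doc, label in zip(docs, labels):
            fh.write(to_svmlight_line(doc, label, positive_class) + "\n")
```

Files written this way can be passed to SVMlight's `svm_learn` and `svm_classify` programs, once per cross-validation fold and per class.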
Homework 4 (out Feb 14th, due Feb 26th). Download the http://genie.sis.pitt.edu/ Genie program and load the Homework4.xdsl network into it. Instantiate the conditional probability tables associated with each node and perform the following experiments: 1) Set x2 as evidence and list all the nodes that are independent of x1 given x2. Cite the rule you used to derive your prediction, and verify the prediction using Genie. 2) Set x1 and x7 as evidence nodes and list all variables that become independent of x3; list the rules you used and verify with Genie. 3) Identify the nodes that constitute the Markov blanket of x3.
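As a reminder of the definition used in part 3, the Markov blanket of a node in a Bayesian network consists of its parents, its children, and the other parents of its children. The sketch below computes it for a directed acyclic graph given as a parent list. The graph in the usage example is hypothetical and is not the Homework4.xdsl network.

```python
def markov_blanket(parents, node):
    """Return the Markov blanket of `node` in a DAG.

    parents: dict mapping every node to the list of its parents.
    The blanket is parents + children + the children's other parents.
    """
    children = [n for n, ps in parents.items() if node in ps]
    blanket = set(parents.get(node, []))
    blanket.update(children)
    for child in children:
        blanket.update(parents[child])  # co-parents of each child
    blanket.discard(node)
    return blanket


if __name__ == "__main__":
    # Hypothetical network: a -> c <- b, c -> d <- e, d -> f
    toy = {"a": [], "b": [], "c": ["a", "b"],
           "d": ["c", "e"], "e": [], "f": ["d"]}
    print(sorted(markov_blanket(toy, "c")))
```

In the toy graph, node e enters the blanket of c only because both are parents of d, which is the "explaining away" case that is easy to overlook when reading the blanket off a diagram.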
Homework 5 (out March 5th, due March 20th). Bayesian network inference: calculate the marginal probability distribution given observed evidence. We use a hidden Markov model (HMM) as a special case of a Bayesian network for performing such inference. The parameters of an HMM are defined and used to generate 30 sequences; see the genHiddenMarkovSeq.m Matlab code for the parameters and for the code that generates the sequence data. The model is a mock HMM for transmembrane proteins with two latent states: hydrophilic (non-transmembrane) and hydrophobic (transmembrane). The observations are amino acids, which are indexed and represented as numeric values in the HMMSeq4HW.xml XML data file (also available in R format as the HMMSeq4HW.RData R data file). Extract the parameters (the transition and emission matrices) from the Matlab code for use in your own code, and read the amino acid sequences from the XML data file. The requirement is to write a program that infers the marginal distribution of the state variable at each position of each sequence, using the belief propagation algorithm discussed in class. The program should return a 2xL matrix for each of the 30 sequences, where L is the length of the given sequence. In addition, return a path that connects the most probable state for each amino acid, and compare it with the true path given in the data file. Report the percentage of matching positions as the measure of goodness of fit. In the next homework, part of this program will be reused to perform unsupervised learning of the HMM parameters using the EM algorithm. The R data file can be loaded using the following R command:
> data <- dget("HMMSeq4HW.RData")
This returns a list-structure representation of the original XML file.
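On a chain-structured model such as an HMM, belief propagation reduces to the forward-backward recursions. The sketch below, in Python, computes the posterior state marginals as a K x L matrix (2 x L for two states), then extracts the per-position most probable state and the percentage of matches with a reference path. The toy parameters in the usage example (two symbols, symmetric matrices) are made up for illustration; the real transition and emission matrices come from genHiddenMarkovSeq.m, and the variant presented in class may differ in detail.

```python
def forward_backward(obs, init, trans, emit):
    """Posterior marginals P(z_t = k | all observations), as a K x L list.

    obs: list of symbol indices; init[k] = P(z_1 = k);
    trans[j][k] = P(z_t = k | z_{t-1} = j); emit[k][s] = P(x_t = s | z_t = k).
    Messages are rescaled at every step to avoid numerical underflow.
    """
    K, L = len(init), len(obs)
    alpha = [[0.0] * K for _ in range(L)]
    beta = [[1.0] * K for _ in range(L)]
    scale = [0.0] * L
    for t in range(L):  # forward pass
        for k in range(K):
            prior = init[k] if t == 0 else sum(
                alpha[t - 1][j] * trans[j][k] for j in range(K))
            alpha[t][k] = prior * emit[k][obs[t]]
        scale[t] = sum(alpha[t])
        alpha[t] = [a / scale[t] for a in alpha[t]]
    for t in range(L - 2, -1, -1):  # backward pass
        for j in range(K):
            beta[t][j] = sum(trans[j][k] * emit[k][obs[t + 1]] * beta[t + 1][k]
                             for k in range(K)) / scale[t + 1]
    return [[alpha[t][k] * beta[t][k] for t in range(L)] for k in range(K)]


def most_probable_states(gamma):
    """Pick the state with the largest marginal at each position."""
    K, L = len(gamma), len(gamma[0])
    return [max(range(K), key=lambda k: gamma[k][t]) for t in range(L)]


def percent_match(predicted, truth):
    """Percentage of positions where the two paths agree."""
    return 100.0 * sum(p == q for p, q in zip(predicted, truth)) / len(truth)


if __name__ == "__main__":
    # Hypothetical toy parameters, not those of the homework model.
    init = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.1, 0.9]]
    emit = [[0.8, 0.2], [0.2, 0.8]]
    gamma = forward_backward([0, 0, 0, 1, 1, 1], init, trans, emit)
    print(most_probable_states(gamma))
```

Note that connecting the per-position most probable states, as the assignment asks, is not the same as the single most probable joint path found by the Viterbi algorithm; the two can differ.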
Homework 6. EM algorithm for HMM (out March 20th, due March 27th). Extend the HMM implementation from HW5 so that it can perform unsupervised learning of the HMM parameters (the transition and emission matrices) from the observed sequences provided in IonChannles.fasta. The data are in FASTA format, which is commonly used in bioinformatics. In this format, each sequence consists of two lines: the first line (the header) starts with ">" followed by a description of the sequence, which can be empty; the second line is the biological sequence, either a nucleic acid or an amino acid sequence. The data consist of tens of sequences known to be ion channels, which can be described by two states: hydrophilic and hydrophobic regions. Return the following for the homework: 1) The code. 2) The most probable path for each of the first 2 sequences. Compare your prediction to the following results: learned emission matrix, pick .
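A small Python sketch of reading the FASTA format described above is given here. It accepts the two-line layout and also tolerates sequences wrapped over several lines. The `encode` helper and its alphabetical ordering of the 20 one-letter amino acid codes are assumptions made for illustration; the indexing used in the course data files may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed index order, for illustration


def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []  # description may be empty
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def encode(sequence):
    """Map an amino acid string to integer indices.

    Raises KeyError on letters outside AMINO_ACIDS (for example "X").
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    return [index[aa] for aa in sequence.upper()]


if __name__ == "__main__":
    demo = ">seq1 toy example\nACD\n>\nWY\n"
    for description, seq in parse_fasta(demo):
        print(repr(description), encode(seq))
```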
April 3rd: HMM-xlu.R is a pseudo-program that I might use for HW6, written in R syntax. The code is not tested; it only serves to show a framework for approaching the problem. Here is the tested HMMEM.txt code for EM learning, by Beth.
HW 7 (out April 9th, due April 24th). Learn a latent Dirichlet allocation (LDA) model using the Gibbs sampling algorithm discussed in class. Use the indexed text data from the naive Bayes classifier assignment as training data. Train the model with 5 topics. Return the code and the 20 most frequent words for each of the 5 topics.
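One way to approach this is collapsed Gibbs sampling in the style of Griffiths and Steyvers (2004), listed under Reading materials: each token's topic is resampled with probability proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta), where the counts exclude the token itself. The Python sketch below assumes documents are lists of 0-based word indices; the hyperparameter values and iteration count are arbitrary placeholders, and the sampler presented in class may differ.

```python
import random


def lda_gibbs(docs, vocab_size, n_topics=5, alpha=0.1, beta=0.01,
              n_iter=200, seed=0):
    """Collapsed Gibbs sampler for LDA; returns (topic_word, doc_topic) counts."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # n_dk
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # n_kw
    topic_total = [0] * n_topics                              # n_k
    assign = []
    for d, doc in enumerate(docs):  # random initial topic assignments
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the current token from all counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional over topics for this token.
                weights = [(doc_topic[d][j] + alpha)
                           * (topic_word[j][w] + beta)
                           / (topic_total[j] + vocab_size * beta)
                           for j in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, doc_topic


def top_words(topic_word, n=20):
    """Indices of the n most frequent words in each topic."""
    return [sorted(range(len(row)), key=lambda w: -row[w])[:n]
            for row in topic_word]
```

The returned word indices still need to be mapped back to word stems using the vocabulary that accompanies the indexed data.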
Useful links
http://matrixcookbook.com/ Matrix Cookbook, with all sorts of tricks for dealing with matrices.
http://www.roble.info/basicST/stat/html/Bayes-1.html Bayesian Probability Theory: an introduction and reference
- attachment:SVM
http://www.stanford.edu/~boyd/ Stephen Boyd and http://www.ee.ucla.edu/~vandenbe/ Lieven Vandenberghe's free textbook on http://www.stanford.edu/~boyd/cvxbook/ Convex Optimization
http://genie.sis.pitt.edu/ Genie & SMILE, a program and a programming library for Bayesian networks.
Reading materials
A nice paper on http://www.kamalnigam.com/papers/multinomial-aaaiws98.pdf naive Bayes for text classification. Please read it before the class on the 17th.
A. Y. Ng and M. I. Jordan. (2002) http://www.cs.berkeley.edu/~jordan/papers/ng-jordan-nips01.ps On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In T. Dietterich, S. Becker and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems (NIPS) 14
A http://www.kernel-methods.net/tutorials/KMtalk.pdf tutorial on kernel-based learning by John Shawe-Taylor and Nello Cristianini. They also have a http://www.kernel-methods.net/ book on the subject.
Tipping, ME (2001) http://jmlr.csail.mit.edu/papers/volume1/tipping01a/tipping01a.pdf Sparse Bayesian Learning and the Relevance Vector Machine. Journal of Machine Learning Research, 1(Jun):211-244
A review on http://www.cs.berkeley.edu/~jordan/papers/statsci.ps Graphical Models by Dr. Michael Jordan
http://jmlr.csail.mit.edu/papers/volume3/blei03a/blei03a.pdf Latent Dirichlet allocation. D. M. Blei, A. Y. Ng, and M. I. Jordan. Journal of Machine Learning Research, 3, 993-1022, 2003. http://www.cs.berkeley.edu/~blei/lda-c C code.
http://www.cs.berkeley.edu/~jordan/papers/mlintro.ps An introduction to MCMC for machine learning. C. Andrieu, N. de Freitas, A. Doucet and M. I. Jordan. Machine Learning, 50, 5-43, 2003.
Griffiths, T., & Steyvers, M. (2004). Finding Scientific Topics. Proceedings of the National Academy of Sciences, 101 (suppl. 1), 5228-5235. http://psiexp.ss.uci.edu/research/papers/sciencetopics.pdf pdf
Steyvers, M. & Griffiths, T. (in press). Probabilistic topic models. In T. Landauer, D. McNamara, S. Dennis, and W. Kintsch (eds), Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf pdf
N Slonim, N Friedman, and N Tishby (2006). http://www.mitpressjournals.org/doi/pdf/10.1162/neco.2006.18.8.1739?cookieSet=1 Multivariate Information Bottleneck. Neural Computation, 18, pp. 1739-1789.
Robert E. Schapire. msri.pdf The boosting approach to machine learning: An overview.
hierarchiesMixtureOfExpert.pdf Hierarchical mixtures of experts and the EM algorithm. M. I. Jordan and R. A. Jacobs. Neural Computation, 6, 181-214, 1994.
Denver Dash and Gregory Cooper (2004) http://jmlr.csail.mit.edu/papers/volume5/dash04a/dash04a.pdf Model Averaging for Prediction with Discrete Bayesian Networks. Journal of Machine Learning Research 5:1177–1203
