Oregon State University



Event Details

PhD Final Oral Examination – Behrouz Behmardi

Wednesday, November 28, 2012 9:00 AM - 11:00 AM

A Probabilistic Framework and Algorithms for Modeling and Analyzing Multi-Instance Data
Multi-instance data, in which each object (e.g., a document) is a collection of instances (e.g., words), are widespread in machine learning, signal processing, computer vision, bioinformatics, music, and the social sciences. Probabilistic models such as latent Dirichlet allocation (LDA), probabilistic latent semantic indexing (pLSI), and discrete component analysis (DCA) have been developed for modeling and analyzing multi-instance data. These approaches summarize the data by inferring latent structure shared among all the objects, and each provides both a generative model for multi-instance data and an explicit low-dimensional representation of the latent structure. However, these approaches have several shortcomings. For example, model selection is performed through cross-validation, which does not guarantee correct structure recovery. Although hierarchical Bayesian processes offer a mechanism for model selection, their inference is problematic: computationally complex for sampling approaches, and inaccurate for variational approaches.
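As a concrete illustration of the multi-instance setting, each document can be reduced to a bag-of-words count vector, and the matrix of such vectors is the typical input to models like LDA, pLSI, and DCA. A minimal stdlib-only sketch (the example documents and vocabulary are invented for illustration):

```python
from collections import Counter

# Each object (document) is a bag of instances (words).
docs = [
    ["topic", "model", "topic", "latent"],
    ["convex", "model", "rank"],
]

# Shared vocabulary across all objects.
vocab = sorted({w for d in docs for w in d})

# Document-term count matrix: one row per object, one column per word.
counts = [[Counter(d)[w] for w in vocab] for d in docs]

print(vocab)   # ['convex', 'latent', 'model', 'rank', 'topic']
print(counts)  # [[0, 1, 1, 0, 2], [1, 0, 1, 1, 0]]
```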

This dissertation presents a unified convex framework for probabilistic modeling of multi-instance data. The proposed framework has three main aspects. First, joint regularization is incorporated into multiple density estimation to simultaneously learn the structure of the distribution space and infer each individual distribution. Second, a novel confidence-constraints framework enables a tuning-free approach to controlling the amount of regularization required for joint multiple density estimation, with theoretical guarantees on correct structure recovery. Third, we formulate the problem as a convex program and propose efficient optimization algorithms to solve it. This overcomes the issues of Bayesian inference, namely (i) the computational complexity associated with sampling methods and (ii) the approximation error associated with the variational Bayes approach.
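A standard building block in convex approaches to structure (rank) learning is the nuclear norm, whose proximal operator is singular value thresholding. The sketch below shows only this generic operation with NumPy, not the dissertation's specific formulation; the matrix sizes, noise level, and threshold are illustrative assumptions:

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the proximal operator of tau * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
# Rank-2 ground truth plus small noise.
L = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))
M = L + 0.01 * rng.standard_normal((20, 15))

# Thresholding shrinks the noise singular values to zero,
# recovering the underlying rank.
X = svt(M, tau=1.0)
print(np.linalg.matrix_rank(X, tol=1e-6))
```

With the noise singular values well below the threshold, the recovered matrix has the ground-truth rank of 2.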

This work addresses the distinct challenges of the discrete and continuous domains. In the discrete domain, we propose confidence-constrained rank minimization (CRM) to recover the exact number of topics in topic models, with theoretical guarantees on the recovery probability and the mean squared error of the estimate. We provide a computationally efficient optimization algorithm to extend the applicability of the proposed framework to large real-world datasets. In the continuous domain, we propose a maximum entropy (MaxEnt) framework for multi-instance datasets: bags of instances are represented as distributions using the principle of MaxEnt, and we learn basis functions that span the space of distributions for jointly regularized density estimation. The basis functions are analogous to topics in a topic model.
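The MaxEnt principle can be sketched on a toy problem: among all distributions satisfying a moment constraint, pick the one with maximum entropy, which has exponential-family form. The support, constraint, and step size below are invented for illustration (the dissertation works with continuous bags of instances and learned basis functions), using only the standard library:

```python
import math

# Toy discrete support and moment constraint (illustrative values).
support = [0, 1, 2, 3]
target_mean = 2.0

# The MaxEnt distribution subject to E[x] = target_mean has the form
# p(x) ∝ exp(lam * x); fit lam by dual gradient ascent on the constraint.
lam = 0.0
for _ in range(2000):
    weights = [math.exp(lam * x) for x in support]
    Z = sum(weights)
    mean = sum(x * w for x, w in zip(support, weights)) / Z
    lam += 0.1 * (target_mean - mean)

# Final normalized distribution and its mean.
weights = [math.exp(lam * x) for x in support]
Z = sum(weights)
p = [w / Z for w in weights]
mean = sum(x * pi for x, pi in zip(support, p))
print([round(pi, 3) for pi in p], round(mean, 3))
```

After convergence the fitted distribution matches the moment constraint, so the printed mean equals the target of 2.0.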

We validate the effectiveness of the proposed framework in the discrete and continuous domains through an extensive set of experiments on synthetic datasets as well as real-world image and text datasets, and compare the results with state-of-the-art algorithms.

Major Advisor: Raviv Raich
Committee: Thinh Nguyen
Committee: Weng-Keen Wong
Committee: Yanming Di
GCR: Kagan Tumer 

Kelley Engineering Center
Nicole Thompson
1 541 737 5556
Nicole.Thompson at oregonstate.edu
School of Electrical Engineering and Computer Science