Stochastic & statistical modeling of biological processes


Statistical models of disease phenotypes
Sayan Mukherjee, Phillip Febbo, Mauro Maggioni, Jen-Tsan Chi, and Anil Potti.

Inference of scientific hypotheses or models from high-dimensional data sources is fundamental to much of modern scientific research. My main motivation in studying this problem has been to develop and apply mathematical and computational formalisms to model variation in biological systems. This is an interdisciplinary endeavor and I have had great success in discovering and developing relations between disparate theoretical frameworks -- computational, mathematical, and statistical.

One example is the development of connections between probability, algebraic topology, and geometry to model complicated dependencies in high-dimensional data. The other point of interdisciplinary contact is with applications: building richer and more complete models in cancer biology, systems biology, and population genetics. The modeling approaches being developed should have a significant impact on novel, personalized treatment strategies for patients.

Statistics on spaces of phylogenetic trees
Ezra Miller, Megan Owen, and Scott Provan

In evolutionary biology and systematics, gene trees are reconstructed directly from DNA sequences, while phylogenies ("species trees") represent evolutionarily identifiable speciation events. Abundant data indicate that, because of stochastic effects either in the determination of gene trees or in the genetic histories themselves (for example, deep coalescence or horizontal gene transfer), a single experiment usually outputs multiple gene trees distributed around some fixed phylogeny.

In work with Megan Owen (SAMSI and NC State, Math) and Scott Provan (UNC, Operations Research), we are testing this hypothesis rigorously by developing algorithms for statistical analyses, including computation of statistical means ("centroids") in spaces of phylogenetic trees. These spaces are composed of convex polyhedra -- they are not manifolds, even topologically -- but nonetheless a law of large numbers holds. On the biological side, we will need to find (or construct) and analyze real phylogenetic data sets; this will involve connections with the biology department.
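To make the notion of a statistical mean ("centroid") in such a space concrete, here is a minimal sketch in Python. It is not the algorithm developed in this project. It assumes the smallest non-manifold example: three half-lines glued at a common origin, which is the form tree space takes for three leaves in the Billera-Holmes-Vogtmann construction (one ray per tree topology, with the coordinate along the ray being the internal edge length). The mean is taken to be the Fréchet mean, the point minimizing the sum of squared distances to the samples, and it is approximated by an inductive-mean iteration in the style of Sturm, here with the samples visited in cyclic order. All function names are hypothetical.

```python
from typing import List, Tuple

# A point is (leg index, distance from the origin along that leg).
# Assumption: the space is three rays glued at the origin.
Point = Tuple[int, float]


def distance(p: Point, q: Point) -> float:
    """Geodesic distance: along the leg, or through the origin."""
    (lp, rp), (lq, rq) = p, q
    return abs(rp - rq) if lp == lq else rp + rq


def geodesic_point(p: Point, q: Point, t: float) -> Point:
    """Point a fraction t of the way along the geodesic from p to q."""
    (lp, rp), (lq, rq) = p, q
    if lp == lq:
        return (lp, rp + t * (rq - rp))
    s = t * (rp + rq)          # arc length travelled from p
    if s <= rp:
        return (lp, rp - s)    # still on p's leg, moving toward the origin
    return (lq, s - rp)        # passed the origin, now on q's leg


def frechet_mean_sturm(points: List[Point], n_iter: int = 3000) -> Point:
    """Approximate the Frechet mean by an inductive-mean iteration.

    At step k the current estimate moves a fraction 1/(k+1) of the way
    toward the next sample, visiting the samples in cyclic order.
    """
    mu = points[0]
    for k in range(1, n_iter + 1):
        target = points[k % len(points)]
        mu = geodesic_point(mu, target, 1.0 / (k + 1))
    return mu


def frechet_variance(mu: Point, points: List[Point]) -> float:
    """Average squared distance from mu to the samples."""
    return sum(distance(mu, p) ** 2 for p in points) / len(points)


if __name__ == "__main__":
    # One sample tree per topology, each with internal edge length 1.
    samples = [(0, 1.0), (1, 1.0), (2, 1.0)]
    mean = frechet_mean_sturm(samples)
    print("approximate mean:", mean)
    print("variance at mean:", frechet_variance(mean, samples))
```

With one sample on each leg, the estimate settles at the origin, the point where the three rays meet. With all samples on one leg, the iteration reduces to an ordinary running average. In larger tree spaces the geodesics themselves are harder to compute, so `geodesic_point` would need to be replaced by a genuine tree-space geodesic routine.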

This project has another motivation: in medical imaging, the trees represent branching arteries in brains, and the goal is to detect clustering. This is a special case of analysis of object data, where the general desire is to fit analogues of linear subspaces to data sets whose points represent objects with internal structure, such as curves or trees.