Minh Tang
Statistics
Associate Professor
Statistical Network Analysis
SAS Hall 5236
919.515.1923 minh_tang@ncsu.edu
Bio
Minh Tang is an associate professor in the Department of Statistics at North Carolina State University. He earned a Ph.D. in Computer Science from Indiana University Bloomington. Before joining NC State, he held research and teaching positions at Johns Hopkins University. His research focuses on statistical pattern recognition, dimensionality reduction and graph-based statistical inference. In particular, he develops methods that help researchers analyze complex networks and large datasets. As a result, his work advances statistical learning and data science. He has published widely in leading journals in statistics and machine learning. In addition, he has received research support from the National Science Foundation, DARPA and Microsoft Research. He also teaches courses in probability, statistical inference, data science and graph analytics while mentoring graduate students and early-career researchers.
Education
Ph.D. Computer Science Indiana University, Bloomington 2010
M.S. Computer Science University of Wisconsin-Milwaukee 2004
B.S. Computer Science Assumption University 2001
Area(s) of Expertise
Minh Tang specializes in statistical pattern recognition, dimensionality reduction, and statistical inference on graphs. He develops methods that identify patterns in complex datasets and improve data analysis. In addition, he creates techniques that simplify high-dimensional data while preserving important information. He also studies graph-structured data to uncover relationships and support reliable statistical conclusions. As a result, his work helps researchers better understand and analyze large, complex networks.
Publications
- An omnibus embedding of multiple random graphs and implications for multiscale network inference, Electronic Journal of Statistics (2026)
- Out-of-Sample Embedding with Proximity Data: Projection Versus Restricted Reconstruction, Journal of Computational and Graphical Statistics (2026)
- Perturbation Analysis of Randomized SVD and its Applications to Statistics, Journal of the American Statistical Association (2026)
- Chain-Linked Multiple Matrix Integration via Embedding Alignment, Journal of the American Statistical Association (2025)
- Eigenvector fluctuations and limit results for random graphs with infinite rank kernels, arXiv (Cornell University) (2025)
- Novel network trimming for robust vertex nomination in contaminated networks, Electronic Journal of Statistics (2025)
- Chain-linked Multiple Matrix Integration via Embedding Alignment, arXiv (Cornell University) (2024)
- Regression for matrix-valued data via Kronecker products factorization, arXiv (Cornell University) (2024)
- A Theoretical Analysis of DeepWalk and Node2vec for Exact Recovery of Community Structures in Stochastic Blockmodels, IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)
- Hypothesis testing for equality of latent positions in random graphs, Bernoulli (2023)
Grants
Accurate statistical inference on large, complex networks is a vitally important, interdisciplinary research area that has witnessed exponential growth over the last several years, ranging from the construction of a plethora of random graph models themselves to a host of approaches for inference of graph model parameters. Nevertheless, many graph estimation techniques are somewhat ad hoc: maximum likelihood estimates for certain exponential random graph models, for instance, or spectral methods for combinatorial graph analysis. But a mere regression coefficient here, a parametric estimate there, a clustering here, and an upper bound there do not constitute a unified, parsimonious approach to random graph inference. Thus the synthesis of disparate models and methods into a more comprehensive and familiar paradigm for graph inference is both necessary and welcome. This proposal addresses the need for such a foundational approach to graph inference. We focus on the development of a unified spectral framework for mathematical statistics on graphs, itself inspired by cornerstones of classical Euclidean inference. In particular, for random graphs with independent edges, we use low-rank approximation of their adjacency matrices to build estimates of underlying model parameters. We then systematically address the graph-inferential analogues of the central tenets of Euclidean inference: consistency of estimators; asymptotic normality or appropriate limit distributions of estimators; asymptotic relative efficiency and optimality; one-, two- and multi-sample graph hypothesis testing; and robustness.
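As a rough illustration of the spectral estimation step described above, the following Python sketch handles only the simplest case: a rank-one model with independent edges. Everything specific here is an assumption made for illustration and is not taken from the proposal. The edge probability for vertices i and j is assumed to be the product of two latent values, the leading eigenpair is found by plain power iteration rather than a full eigendecomposition, and the function names, graph size, and latent values are hypothetical.

```python
import math
import random


def sample_adjacency(latent, rng):
    """Sample a symmetric adjacency matrix with zero diagonal and independent
    edges, where edge (i, j) appears with probability latent[i] * latent[j]
    (an assumed rank-one model)."""
    n = len(latent)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < latent[i] * latent[j]:
                adj[i][j] = adj[j][i] = 1
    return adj


def multiply(adj, vec):
    """Matrix-vector product for a dense list-of-lists matrix."""
    return [sum(a * v for a, v in zip(row, vec)) for row in adj]


def leading_eigenpair(adj, iterations=50):
    """Approximate the leading eigenvalue and unit eigenvector by power iteration.

    Starting from a positive vector keeps every iterate nonnegative, which
    fixes the sign of the eigenvector for a nonnegative matrix."""
    n = len(adj)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        nxt = multiply(adj, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient of the final unit vector.
    value = sum(a * v for a, v in zip(multiply(adj, vec), vec))
    return value, vec


def rank_one_embedding(adj):
    """Rank-one estimate of the latent values: sqrt(eigenvalue) * eigenvector."""
    value, vec = leading_eigenpair(adj)
    scale = math.sqrt(value)
    return [scale * v for v in vec]


# Hypothetical example: two groups of 100 vertices with different latent values.
rng = random.Random(0)
true_latent = [0.3] * 100 + [0.8] * 100
adjacency = sample_adjacency(true_latent, rng)
estimated_latent = rank_one_embedding(adjacency)
mean_abs_error = sum(
    abs(e - t) for e, t in zip(estimated_latent, true_latent)
) / len(true_latent)
```

In this setup the scaled leading eigenvector serves as the estimate of the latent values. A higher-rank model would keep several eigenpairs instead of one.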
This proposal aims to develop methodologies for automated inference in high-dimensional and complex data. The proposal is part of the D3M (Data Driven Discovery of Models) program, in which we have just raw data as input and we need to discover primitives (simple yet robust and agile procedures that can be easily combined to form sophisticated frameworks and methodologies) and generate models for presentation to domain experts for feedback and selection, all without a data scientist's assistance. As an example of the applicability of such a framework, consider our experience with linear regression, where there is a well-understood pipeline to take multivariate linear regression data and automatically generate plots and diagnostics that assist the non-expert user. For this proposal we will consider datasets such as (a) multivariate time series together with event-of-interest time points (t1, t2, ..., tn), (b) multispectral imagery together with event-of-interest locations (x1, x2, ..., xn), and (c) a relational network together with event-of-interest nodes (v1, v2, ..., vn). We will first develop methodologies to automatically discover primitives for these types of data. We will then develop methodologies to automatically compose these discovered primitives into a collection of models for performing subsequent inference. The final result is a discoverable archive of data modeling primitives, procedures for automatic selection of primitives, and frameworks for composition of primitives into complex modeling pipelines.