Bibliographic Information
Title: Machine learning : a probabilistic perspective
Author: Murphy, Kevin P.
Publisher: MIT Press
Pub. Year: 2012
Subjects: Machine learning; Probabilities
Call Number: Q325.5 .M87 2012
Table of Contents
Cover Page (1)
Half Title Page (2)
Title Page (4)
Copyright Page (5)
Dedication (6)
Contents (8)
Preface (28)
1 Introduction (32)
1.1 Machine learning: what and why? (32)
1.1.1 Types of machine learning (33)
1.2 Supervised learning (34)
1.2.1 Classification (34)
1.2.2 Regression (39)
1.3 Unsupervised learning (40)
1.3.1 Discovering clusters (41)
1.3.2 Discovering latent factors (42)
1.3.3 Discovering graph structure (44)
1.3.4 Matrix completion (45)
1.4 Some basic concepts in machine learning (47)
1.4.1 Parametric vs non-parametric models (47)
1.4.2 A simple non-parametric classifier: K-nearest neighbors (47)
1.4.3 The curse of dimensionality (49)
1.4.4 Parametric models for classification and regression (50)
1.4.5 Linear regression (50)
1.4.6 Logistic regression (52)
1.4.7 Overfitting (53)
1.4.8 Model selection (53)
1.4.9 No free lunch theorem (55)
2 Probability (58)
2.1 Introduction (58)
2.2 A brief review of probability theory (59)
2.2.1 Discrete random variables (59)
2.2.2 Fundamental rules (59)
2.2.3 Bayes rule (60)
2.2.4 Independence and conditional independence (61)
2.2.5 Continuous random variables (63)
2.2.6 Quantiles (64)
2.2.7 Mean and variance (64)
2.3 Some common discrete distributions (65)
2.3.1 The binomial and Bernoulli distributions (65)
2.3.2 The multinomial and multinoulli distributions (66)
2.3.3 The Poisson distribution (68)
2.3.4 The empirical distribution (68)
2.4 Some common continuous distributions (69)
2.4.1 Gaussian (normal) distribution (69)
2.4.2 Degenerate pdf (70)
2.4.3 The Laplace distribution (72)
2.4.4 The gamma distribution (72)
2.4.5 The beta distribution (73)
2.4.6 Pareto distribution (74)
2.5 Joint probability distributions (75)
2.5.1 Covariance and correlation (75)
2.5.2 The multivariate Gaussian (77)
2.5.3 Multivariate Student t distribution (77)
2.5.4 Dirichlet distribution (78)
2.6 Transformations of random variables (80)
2.6.1 Linear transformations (80)
2.6.2 General transformations (81)
2.6.3 Central limit theorem (82)
2.7 Monte Carlo approximation (83)
2.7.1 Example: change of variables, the MC way (84)
2.7.2 Example: estimating π by Monte Carlo integration (85)
2.7.3 Accuracy of Monte Carlo approximation (85)
2.8 Information theory (87)
2.8.1 Entropy (87)
2.8.2 KL divergence (88)
2.8.3 Mutual information (90)
3 Generative Models for Discrete Data (96)
3.1 Introduction (96)
3.2 Bayesian concept learning (96)
3.2.1 Likelihood (98)
3.2.2 Prior (98)
3.2.3 Posterior (99)
3.2.4 Posterior predictive distribution (102)
3.2.5 A more complex prior (103)
3.3 The beta-binomial model (103)
3.3.1 Likelihood (104)
3.3.2 Prior (105)
3.3.3 Posterior (106)
3.3.4 Posterior predictive distribution (108)
3.4 The Dirichlet-multinomial model (109)
3.4.1 Likelihood (110)
3.4.2 Prior (110)
3.4.3 Posterior (110)
3.4.4 Posterior predictive (112)
3.5 Naive Bayes classifiers (113)
3.5.1 Model fitting (114)
3.5.2 Using the model for prediction (116)
3.5.3 The log-sum-exp trick (117)
3.5.4 Feature selection using mutual information (117)
3.5.5 Classifying documents using bag of words (118)
4 Gaussian Models (128)
4.1 Introduction (128)
4.1.1 Notation (128)
4.1.2 Basics (128)
4.1.3 MLE for an MVN (130)
4.1.4 Maximum entropy derivation of the Gaussian * (132)
4.2 Gaussian discriminant analysis (132)
4.2.1 Quadratic discriminant analysis (QDA) (133)
4.2.2 Linear discriminant analysis (LDA) (134)
4.2.3 Two-class LDA (135)
4.2.4 MLE for discriminant analysis (137)
4.2.5 Strategies for preventing overfitting (137)
4.2.6 Regularized LDA * (138)
4.2.7 Diagonal LDA (139)
4.2.8 Nearest shrunken centroids classifier * (140)
4.3 Inference in jointly Gaussian distributions (141)
4.3.1 Statement of the result (142)
4.3.2 Examples (142)
4.3.3 Information form (146)
4.3.4 Proof of the result * (147)
4.4 Linear Gaussian systems (150)
4.4.1 Statement of the result (150)
4.4.2 Examples (151)
4.4.3 Proof of the result * (155)
4.5 Digression: The Wishart distribution * (156)
4.5.1 Inverse Wishart distribution (157)
4.5.2 Visualizing the Wishart distribution * (158)
4.6 Inferring the parameters of an MVN (158)
4.6.1 Posterior distribution of μ (159)
4.6.2 Posterior distribution of Σ * (159)
4.6.3 Posterior distribution of μ and Σ * (163)
4.6.4 Sensor fusion with unknown precisions * (169)
5 Bayesian Statistics (180)
5.1 Introduction (180)
5.2 Summarizing posterior distributions (180)
5.2.1 MAP estimation (180)
5.2.2 Credible intervals (183)
5.2.3 Inference for a difference in proportions (185)
5.3 Bayesian model selection (186)
5.3.1 Bayesian Occam’s razor (187)
5.3.2 Computing the marginal likelihood (evidence) (189)
5.3.3 Bayes factors (194)
5.3.4 Jeffreys-Lindley paradox * (195)
5.4 Priors (196)
5.4.1 Uninformative priors (196)
5.4.2 Jeffreys priors * (197)
5.4.3 Robust priors (199)
5.4.4 Mixtures of conjugate priors (199)
5.5 Hierarchical Bayes (202)
5.5.1 Example: modeling related cancer rates (202)
5.6 Empirical Bayes (203)
5.6.1 Example: beta-binomial model (204)
5.6.2 Example: Gaussian-Gaussian model (204)
5.7 Bayesian decision theory (207)
5.7.1 Bayes estimators for common loss functions (208)
5.7.2 The false positive vs false negative tradeoff (211)
5.7.3 Other topics * (215)
6 Frequentist Statistics (222)
6.1 Introduction (222)
6.2 Sampling distribution of an estimator (222)
6.2.1 Bootstrap (223)
6.2.2 Large sample theory for the MLE * (224)
6.3 Frequentist decision theory (225)
6.3.1 Bayes risk (226)
6.3.2 Minimax risk (227)
6.3.3 Admissible estimators (228)
6.4 Desirable properties of estimators (231)
6.4.1 Consistent estimators (231)
6.4.2 Unbiased estimators (231)
6.4.3 Minimum variance estimators (232)
6.4.4 The bias-variance tradeoff (233)
6.5 Empirical risk minimization (235)
6.5.1 Regularized risk minimization (236)
6.5.2 Structural risk minimization (237)
6.5.3 Estimating the risk using cross validation (237)
6.5.4 Upper bounding the risk using statistical learning theory * (240)
6.5.5 Surrogate loss functions (241)
6.6 Pathologies of frequentist statistics * (242)
6.6.1 Counter-intuitive behavior of confidence intervals (243)
6.6.2 p-values considered harmful (244)
6.6.3 The likelihood principle (245)
6.6.4 Why isn’t everyone a Bayesian? (246)
7 Linear Regression (248)
7.1 Introduction (248)
7.2 Model specification (248)
7.3 Maximum likelihood estimation (least squares) (248)
7.3.1 Derivation of the MLE (250)
7.3.2 Geometric interpretation (251)
7.3.3 Convexity (252)
7.4 Robust linear regression * (254)
7.5 Ridge regression (256)
7.5.1 Basic idea (256)
7.5.2 Numerically stable computation * (258)
7.5.3 Connection with PCA * (259)
7.5.4 Regularization effects of big data (261)
7.6 Bayesian linear regression (262)
7.6.1 Computing the posterior (263)
7.6.2 Computing the posterior predictive (264)
7.6.3 Bayesian inference when σ² is unknown * (265)
7.6.4 EB for linear regression (evidence procedure) (265)
8 Logistic Regression (276)
8.1 Introduction (276)
8.2 Model specification (276)
8.3 Model fitting (276)
8.3.1 MLE (277)
8.3.2 Steepest descent (278)
8.3.3 Newton’s method (280)
8.3.4 Iteratively reweighted least squares (IRLS) (281)
8.3.5 Quasi-Newton (variable metric) methods (282)
8.3.6 ℓ2 regularization (283)
8.3.7 Multi-class logistic regression (283)
8.4 Bayesian logistic regression (285)
8.4.1 Laplace approximation (286)
8.4.2 Derivation of the BIC (286)
8.4.3 Gaussian approximation for logistic regression (287)
8.4.4 Approximating the posterior predictive (287)
8.4.5 Residual analysis (outlier detection) * (291)
8.5 Online learning and stochastic optimization (292)
8.5.1 Online learning and regret minimization (293)
8.5.2 Stochastic optimization and risk minimization (293)
8.5.3 The LMS algorithm (295)
8.5.4 The perceptron algorithm (296)
8.5.5 A Bayesian view (297)
8.6 Generative vs discriminative classifiers (298)
8.6.1 Pros and cons of each approach (299)
8.6.2 Dealing with missing data (300)
8.6.3 Fisher’s linear discriminant analysis (FLDA) * (302)
9 Generalized Linear Models and the Exponential Family (312)
9.1 Introduction (312)
9.2 The exponential family (312)
9.2.1 Definition (313)
9.2.2 Examples (313)
9.2.3 Log partition function (315)
9.2.4 MLE for the exponential family (317)
9.2.5 Bayes for the exponential family * (318)
9.2.6 Maximum entropy derivation of the exponential family * (320)
9.3 Generalized linear models (GLMs) (321)
9.3.1 Basics (321)
9.3.2 ML and MAP estimation (323)
9.3.3 Bayesian inference (324)
9.4 Probit regression (324)
9.4.1 ML/MAP estimation using gradient-based optimization (325)
9.4.2 Latent variable interpretation (325)
9.4.3 Ordinal probit regression * (326)
9.4.4 Multinomial probit models * (326)
9.5 Multi-task learning (327)
9.5.1 Hierarchical Bayes for multi-task learning (327)
9.5.2 Application to personalized email spam filtering (327)
9.5.3 Application to domain adaptation (328)
9.5.4 Other kinds of prior (328)
9.6 Generalized linear mixed models * (329)
9.6.1 Example: semi-parametric GLMMs for medical data (329)
9.6.2 Computational issues (331)
9.7 Learning to rank * (331)
9.7.1 The pointwise approach (332)
9.7.2 The pairwise approach (332)
9.7.3 The listwise approach (333)
9.7.4 Loss functions for ranking (334)
10 Directed Graphical Models (Bayes Nets) (338)
10.1 Introduction (338)
10.1.1 Chain rule (338)
10.1.2 Conditional independence (339)
10.1.3 Graphical models (339)
10.1.4 Graph terminology (340)
10.1.5 Directed graphical models (341)
10.2 Examples (342)
10.2.1 Naive Bayes classifiers (342)
10.2.2 Markov and hidden Markov models (343)
10.2.3 Medical diagnosis (344)
10.2.4 Genetic linkage analysis * (346)
10.2.5 Directed Gaussian graphical models * (349)
10.3 Inference (350)
10.4 Learning (351)
10.4.1 Plate notation (351)
10.4.2 Learning from complete data (353)
10.4.3 Learning with missing and/or latent variables (354)
10.5 Conditional independence properties of DGMs (355)
10.5.1 d-separation and the Bayes Ball algorithm (global Markov properties) (355)
10.5.2 Other Markov properties of DGMs (358)
10.5.3 Markov blanket and full conditionals (358)
10.6 Influence (decision) diagrams * (359)
11 Mixture Models and the EM Algorithm (368)
11.1 Latent variable models (368)
11.2 Mixture models (368)
11.2.1 Mixtures of Gaussians (370)
11.2.2 Mixture of multinoullis (371)
11.2.3 Using mixture models for clustering (371)
11.2.4 Mixtures of experts (373)
11.3 Parameter estimation for mixture models (376)
11.3.1 Unidentifiability (377)
11.3.2 Computing a MAP estimate is non-convex (378)
11.4 The EM algorithm (379)
11.4.1 Basic idea (380)
11.4.2 EM for GMMs (381)
11.4.3 EM for mixture of experts (388)
11.4.4 EM for DGMs with hidden variables (389)
11.4.5 EM for the Student distribution * (390)
11.4.6 EM for probit regression * (393)
11.4.7 Theoretical basis for EM * (394)
11.4.8 Online EM (396)
11.4.9 Other EM variants * (398)
11.5 Model selection for latent variable models (401)
11.5.1 Model selection for probabilistic models (401)
11.5.2 Model selection for non-probabilistic methods (401)
11.6 Fitting models with missing data (403)
11.6.1 EM for the MLE of an MVN with missing data (404)
12 Latent Linear Models (412)
12.1 Factor analysis (412)
12.1.1 FA is a low rank parameterization of an MVN (412)
12.1.2 Inference of the latent factors (413)
12.1.3 Unidentifiability (414)
12.1.4 Mixtures of factor analysers (416)
12.1.5 EM for factor analysis models (417)
12.1.6 Fitting FA models with missing data (418)
12.2 Principal components analysis (PCA) (418)
12.2.1 Classical PCA: statement of the theorem (418)
12.2.2 Proof * (420)
12.2.3 Singular value decomposition (SVD) (423)
12.2.4 Probabilistic PCA (426)
12.2.5 EM algorithm for PCA (427)
12.3 Choosing the number of latent dimensions (429)
12.3.1 Model selection for FA/PPCA (429)
12.3.2 Model selection for PCA (430)
12.4 PCA for categorical data (433)
12.5 PCA for paired and multi-view data (435)
12.5.1 Supervised PCA (latent factor regression) (436)
12.5.2 Partial least squares (437)
12.5.3 Canonical correlation analysis (438)
12.6 Independent Component Analysis (ICA) (438)
12.6.1 Maximum likelihood estimation (441)
12.6.2 The FastICA algorithm (442)
12.6.3 Using EM (445)
12.6.4 Other estimation principles * (446)
13 Sparse Linear Models (452)
13.1 Introduction (452)
13.2 Bayesian variable selection (453)
13.2.1 The spike and slab model (455)
13.2.2 From the Bernoulli-Gaussian model to ℓ0 regularization (456)
13.2.3 Algorithms (457)
13.3 ℓ1 regularization: basics (460)
13.3.1 Why does ℓ1 regularization yield sparse solutions? (461)
13.3.2 Optimality conditions for lasso (462)
13.3.3 Comparison of least squares, lasso, ridge and subset selection (466)
13.3.4 Regularization path (467)
13.3.5 Model selection (470)
13.3.6 Bayesian inference for linear models with Laplace priors (471)
13.4 ℓ1 regularization: algorithms (472)
13.4.1 Coordinate descent (472)
13.4.2 LARS and other homotopy methods (472)
13.4.3 Proximal and gradient projection methods (473)
13.4.4 EM for lasso (478)
13.5 ℓ1 regularization: extensions (480)
13.5.1 Group Lasso (480)
13.5.2 Fused lasso (485)
13.5.3 Elastic net (ridge and lasso combined) (486)
13.6 Non-convex regularizers (488)
13.6.1 Bridge regression (489)
13.6.2 Hierarchical adaptive lasso (489)
13.6.3 Other hierarchical priors (493)
13.7 Automatic relevance determination (ARD)/sparse Bayesian learning (SBL) (494)
13.7.1 ARD for linear regression (494)
13.7.2 Whence sparsity? (496)
13.7.3 Connection to MAP estimation (496)
13.7.4 Algorithms for ARD * (497)
13.7.5 ARD for logistic regression (499)
13.8 Sparse coding * (499)
13.8.1 Learning a sparse coding dictionary (500)
13.8.2 Results of dictionary learning from image patches (501)
13.8.3 Compressed sensing (503)
13.8.4 Image inpainting and denoising (503)
14 Kernels (510)
14.1 Introduction (510)
14.2 Kernel functions (510)
14.2.1 RBF kernels (511)
14.2.2 Kernels for comparing documents (511)
14.2.3 Mercer (positive definite) kernels (512)
14.2.4 Linear kernels (513)
14.2.5 Matern kernels (513)
14.2.6 String kernels (514)
14.2.7 Pyramid match kernels (515)
14.2.8 Kernels derived from probabilistic generative models (516)
14.3 Using kernels inside GLMs (517)
14.3.1 Kernel machines (517)
14.3.2 L1VMs, RVMs, and other sparse vector machines (518)
14.4 The kernel trick (519)
14.4.1 Kernelized nearest neighbor classification (520)
14.4.2 Kernelized K-medoids clustering (520)
14.4.3 Kernelized ridge regression (523)
14.4.4 Kernel PCA (524)
14.5 Support vector machines (SVMs) (527)
14.5.1 SVMs for regression (528)
14.5.2 SVMs for classification (529)
14.5.3 Choosing C (535)
14.5.4 Summary of key points (535)
14.5.5 A probabilistic interpretation of SVMs (536)
14.6 Comparison of discriminative kernel methods (536)
14.7 Kernels for building generative models (538)
14.7.1 Smoothing kernels (538)
14.7.2 Kernel density estimation (KDE) (539)
14.7.3 From KDE to KNN (540)
14.7.4 Kernel regression (541)
14.7.5 Locally weighted regression (543)
15 Gaussian Processes (546)
15.1 Introduction (546)
15.2 GPs for regression (547)
15.2.1 Predictions using noise-free observations (548)
15.2.2 Predictions using noisy observations (549)
15.2.3 Effect of the kernel parameters (550)
15.2.4 Estimating the kernel parameters (552)
15.2.5 Computational and numerical issues * (555)
15.2.6 Semi-parametric GPs * (555)
15.3 GPs meet GLMs (556)
15.3.1 Binary classification (556)
15.3.2 Multi-class classification (559)
15.3.3 GPs for Poisson regression (562)
15.4 Connection with other methods (563)
15.4.1 Linear models compared to GPs (563)
15.4.2 Linear smoothers compared to GPs (564)
15.4.3 SVMs compared to GPs (565)
15.4.4 L1VM and RVMs compared to GPs (565)
15.4.5 Neural networks compared to GPs (566)
15.4.6 Smoothing splines compared to GPs * (567)
15.4.7 RKHS methods compared to GPs * (569)
15.5 GP latent variable model (571)
15.6 Approximation methods for large datasets (573)
16 Adaptive Basis Function Models (574)
16.1 Introduction (574)
16.2 Classification and regression trees (CART) (575)
16.2.1 Basics (575)
16.2.2 Growing a tree (576)
16.2.3 Pruning a tree (580)
16.2.4 Pros and cons of trees (581)
16.2.5 Random forests (581)
16.2.6 CART compared to hierarchical mixture of experts * (582)
16.3 Generalized additive models (583)
16.3.1 Backfitting (583)
16.3.2 Computational efficiency (584)
16.3.3 Multivariate adaptive regression splines (MARS) (584)
16.4 Boosting (585)
16.4.1 Forward stagewise additive modeling (586)
16.4.2 L2boosting (588)
16.4.3 AdaBoost (589)
16.4.4 LogitBoost (590)
16.4.5 Boosting as functional gradient descent (591)
16.4.6 Sparse boosting (592)
16.4.7 Multivariate adaptive regression trees (MART) (593)
16.4.8 Why does boosting work so well? (593)
16.4.9 A Bayesian view (594)
16.5 Feedforward neural networks (multilayer perceptrons) (594)
16.5.1 Convolutional neural networks (595)
16.5.2 Other kinds of neural networks (599)
16.5.3 A brief history of the field (599)
16.5.4 The backpropagation algorithm (600)
16.5.5 Identifiability (603)
16.5.6 Regularization (603)
16.5.7 Bayesian inference (607)
16.6 Ensemble learning (611)
16.6.1 Stacking (611)
16.6.2 Error-correcting output codes (612)
16.6.3 Ensemble learning is not equivalent to Bayes model averaging (612)
16.7 Experimental comparison (613)
16.7.1 Low-dimensional features (613)
16.7.2 High-dimensional features (614)
16.8 Interpreting black-box models (616)
17 Markov and Hidden Markov Models (620)
17.1 Introduction (620)
17.2 Markov models (620)
17.2.1 Transition matrix (620)
17.2.2 Application: Language modeling (622)
17.2.3 Stationary distribution of a Markov chain * (627)
17.2.4 Application: Google’s PageRank algorithm for web page ranking (631)
17.3 Hidden Markov models (634)
17.3.1 Applications of HMMs (635)
17.4 Inference in HMMs (637)
17.4.1 Types of inference problems for temporal models (637)
17.4.2 The forwards algorithm (640)
17.4.3 The forwards-backwards algorithm (641)
17.4.4 The Viterbi algorithm (643)
17.4.5 Forwards filtering, backwards sampling (647)
17.5 Learning for HMMs (648)
17.5.1 Training with fully observed data (648)
17.5.2 EM for HMMs (the Baum-Welch algorithm) (649)
17.5.3 Bayesian methods for “fitting” HMMs * (651)
17.5.4 Discriminative training (651)
17.5.5 Model selection (652)
17.6 Generalizations of HMMs (652)
17.6.1 Variable duration (semi-Markov) HMMs (653)
17.6.2 Hierarchical HMMs (655)
17.6.3 Input-output HMMs (656)
17.6.4 Auto-regressive and buried HMMs (657)
17.6.5 Factorial HMM (658)
17.6.6 Coupled HMM and the influence model (659)
17.6.7 Dynamic Bayesian networks (DBNs) (659)
18 State Space Models (662)
18.1 Introduction (662)
18.2 Applications of SSMs (663)
18.2.1 SSMs for object tracking (663)
18.2.2 Robotic SLAM (664)
18.2.3 Online parameter learning using recursive least squares (667)
18.2.4 SSM for time series forecasting * (668)
18.3 Inference in LG-SSM (671)
18.3.1 The Kalman filtering algorithm (671)
18.3.2 The Kalman smoothing algorithm (674)
18.4 Learning for LG-SSM (677)
18.4.1 Identifiability and numerical stability (677)
18.4.2 Training with fully observed data (678)
18.4.3 EM for LG-SSM (678)
18.4.4 Subspace methods (678)
18.4.5 Bayesian methods for “fitting” LG-SSMs (678)
18.5 Approximate online inference for non-linear, non-Gaussian SSMs (678)
18.5.1 Extended Kalman filter (EKF) (679)
18.5.2 Unscented Kalman filter (UKF) (681)
18.5.3 Assumed density filtering (ADF) (683)
18.6 Hybrid discrete/continuous SSMs (686)
18.6.1 Inference (687)
18.6.2 Application: data association and multi-target tracking (689)
18.6.3 Application: fault diagnosis (690)
18.6.4 Application: econometric forecasting (691)
19 Undirected Graphical Models (Markov Random Fields) (692)
19.1 Introduction (692)
19.2 Conditional independence properties of UGMs (692)
19.2.1 Key properties (692)
19.2.2 An undirected alternative to d-separation (694)
19.2.3 Comparing directed and undirected graphical models (695)
19.3 Parameterization of MRFs (696)
19.3.1 The Hammersley-Clifford theorem (696)
19.3.2 Representing potential functions (698)
19.4 Examples of MRFs (699)
19.4.1 Ising model (699)
19.4.2 Hopfield networks (700)
19.4.3 Potts model (702)
19.4.4 Gaussian MRFs (703)
19.4.5 Markov logic networks * (705)
19.5 Learning (707)
19.5.1 Training maxent models using gradient methods (707)
19.5.2 Training partially observed maxent models (708)
19.5.3 Approximate methods for computing the MLEs of MRFs (709)
19.5.4 Pseudo likelihood (709)
19.5.5 Stochastic maximum likelihood (710)
19.5.6 Feature induction for maxent models * (711)
19.5.7 Iterative proportional fitting (IPF) * (712)
19.6 Conditional random fields (CRFs) (715)
19.6.1 Chain-structured CRFs, MEMMs and the label-bias problem (715)
19.6.2 Applications of CRFs (717)
19.6.3 CRF training (723)
19.7 Structural SVMs (724)
19.7.1 SSVMs: a probabilistic view (724)
19.7.2 SSVMs: a non-probabilistic view (726)
19.7.3 Cutting plane methods for fitting SSVMs (729)
19.7.4 Online algorithms for fitting SSVMs (731)
19.7.5 Latent structural SVMs (732)
20 Exact Inference for Graphical Models (738)
20.1 Introduction (738)
20.2 Belief propagation for trees (738)
20.2.1 Serial protocol (738)
20.2.2 Parallel protocol (740)
20.2.3 Gaussian BP * (741)
20.2.4 Other BP variants * (743)
20.3 The variable elimination algorithm (745)
20.3.1 The generalized distributive law * (748)
20.3.2 Computational complexity of VE (748)
20.3.3 A weakness of VE (751)
20.4 The junction tree algorithm * (751)
20.4.1 Creating a junction tree (751)
20.4.2 Message passing on a junction tree (753)
20.4.3 Computational complexity of JTA (756)
20.4.4 JTA generalizations * (757)
20.5 Computational intractability of exact inference in the worst case (757)
20.5.1 Approximate inference (758)
21 Variational Inference (762)
21.1 Introduction (762)
21.2 Variational inference (763)
21.2.1 Alternative interpretations of the variational objective (764)
21.2.2 Forward or reverse KL? * (764)
21.3 The mean field method (766)
21.3.1 Derivation of the mean field update equations (767)
21.3.2 Example: mean field for the Ising model (768)
21.4 Structured mean field * (770)
21.4.1 Example: factorial HMM (771)
21.5 Variational Bayes (773)
21.5.1 Example: VB for a univariate Gaussian (773)
21.5.2 Example: VB for linear regression (777)
21.6 Variational Bayes EM (780)
21.6.1 Example: VBEM for mixtures of Gaussians * (781)
21.7 Variational message passing and VIBES (787)
21.8 Local variational bounds * (787)
21.8.1 Motivating applications (787)
21.8.2 Bohning’s quadratic bound to the log-sum-exp function (789)
21.8.3 Bounds for the sigmoid function (791)
21.8.4 Other bounds and approximations to the log-sum-exp function (793)
21.8.5 Variational inference based on upper bounds (794)
22 More Variational Inference (798)
22.1 Introduction (798)
22.2 Loopy belief propagation: algorithmic issues (798)
22.2.1 A brief history (798)
22.2.2 LBP on pairwise models (799)
22.2.3 LBP on a factor graph (800)
22.2.4 Convergence (802)
22.2.5 Accuracy of LBP (805)
22.2.6 Other speedup tricks for LBP * (806)
22.3 Loopy belief propagation: theoretical issues * (807)
22.3.1 UGMs represented in exponential family form (807)
22.3.2 The marginal polytope (808)
22.3.3 Exact inference as a variational optimization problem (809)
22.3.4 Mean field as a variational optimization problem (810)
22.3.5 LBP as a variational optimization problem (810)
22.3.6 Loopy BP vs mean field (814)
22.4 Extensions of belief propagation * (814)
22.4.1 Generalized belief propagation (814)
22.4.2 Convex belief propagation (816)
22.5 Expectation propagation (818)
22.5.1 EP as a variational inference problem (819)
22.5.2 Optimizing the EP objective using moment matching (820)
22.5.3 EP for the clutter problem (822)
22.5.4 LBP is a special case of EP (823)
22.5.5 Ranking players using TrueSkill (824)
22.5.6 Other applications of EP (830)
22.6 MAP state estimation (830)
22.6.1 Linear programming relaxation (830)
22.6.2 Max-product belief propagation (831)
22.6.3 Graphcuts (832)
22.6.4 Experimental comparison of graphcuts and BP (835)
22.6.5 Dual decomposition (837)
23 Monte Carlo Inference (846)
23.1 Introduction (846)
23.2 Sampling from standard distributions (846)
23.2.1 Using the cdf (846)
23.2.2 Sampling from a Gaussian (Box-Muller method) (848)
23.3 Rejection sampling (848)
23.3.1 Basic idea (848)
23.3.2 Example (849)
23.3.3 Application to Bayesian statistics (850)
23.3.4 Adaptive rejection sampling (850)
23.3.5 Rejection sampling in high dimensions (851)
23.4 Importance sampling (851)
23.4.1 Basic idea (851)
23.4.2 Handling unnormalized distributions (852)
23.4.3 Importance sampling for a DGM: likelihood weighting (853)
23.4.4 Sampling importance resampling (SIR) (853)
23.5 Particle filtering (854)
23.5.1 Sequential importance sampling (855)
23.5.2 The degeneracy problem (856)
23.5.3 The resampling step (856)
23.5.4 The proposal distribution (858)
23.5.5 Application: robot localization (859)
23.5.6 Application: visual object tracking (859)
23.5.7 Application: time series forecasting (862)
23.6 Rao-Blackwellised particle filtering (RBPF) (862)
23.6.1 RBPF for switching LG-SSMs (862)
23.6.2 Application: tracking a maneuvering target (863)
23.6.3 Application: Fast SLAM (865)
24 Markov Chain Monte Carlo (MCMC) Inference (868)
24.1 Introduction (868)
24.2 Gibbs sampling (869)
24.2.1 Basic idea (869)
24.2.2 Example: Gibbs sampling for the Ising model (869)
24.2.3 Example: Gibbs sampling for inferring the parameters of a GMM (871)
24.2.4 Collapsed Gibbs sampling * (872)
24.2.5 Gibbs sampling for hierarchical GLMs (875)
24.2.6 BUGS and JAGS (877)
24.2.7 The Imputation Posterior (IP) algorithm (878)
24.2.8 Blocking Gibbs sampling (878)
24.3 Metropolis Hastings algorithm (879)
24.3.1 Basic idea (879)
24.3.2 Gibbs sampling is a special case of MH (880)
24.3.3 Proposal distributions (881)
24.3.4 Adaptive MCMC (884)
24.3.5 Initialization and mode hopping (885)
24.3.6 Why MH works * (885)
24.3.7 Reversible jump (trans-dimensional) MCMC * (886)
24.4 Speed and accuracy of MCMC (887)
24.4.1 The burn-in phase (887)
24.4.2 Mixing rates of Markov chains * (888)
24.4.3 Practical convergence diagnostics (889)
24.4.4 Accuracy of MCMC (891)
24.4.5 How many chains? (893)
24.5 Auxiliary variable MCMC * (894)
24.5.1 Auxiliary variable sampling for logistic regression (894)
24.5.2 Slice sampling (895)
24.5.3 Swendsen Wang (897)
24.5.4 Hybrid/Hamiltonian MCMC * (899)
24.6 Annealing methods (899)
24.6.1 Simulated annealing (900)
24.6.2 Annealed importance sampling (902)
24.6.3 Parallel tempering (902)
24.7 Approximating the marginal likelihood (903)
24.7.1 The candidate method (903)
24.7.2 Harmonic mean estimate (903)
24.7.3 Annealed importance sampling (904)
25 Clustering (906)
25.1 Introduction (906)
25.1.1 Measuring (dis)similarity (906)
25.1.2 Evaluating the output of clustering methods * (907)
25.2 Dirichlet process mixture models (910)
25.2.1 From finite to infinite mixture models (910)
25.2.2 The Dirichlet process (913)
25.2.3 Applying Dirichlet processes to mixture modeling (916)
25.2.4 Fitting a DP mixture model (917)
25.3 Affinity propagation (918)
25.4 Spectral clustering (921)
25.4.1 Graph Laplacian (922)
25.4.2 Normalized graph Laplacian (923)
25.4.3 Example (924)
25.5 Hierarchical clustering (924)
25.5.1 Agglomerative clustering (926)
25.5.2 Divisive clustering (929)
25.5.3 Choosing the number of clusters (930)
25.5.4 Bayesian hierarchical clustering (930)
25.6 Clustering datapoints and features (932)
25.6.1 Biclustering (934)
25.6.2 Multi-view clustering (934)
26 Graphical Model Structure Learning (938)
26.1 Introduction (938)
26.2 Structure learning for knowledge discovery (939)
26.2.1 Relevance networks (939)
26.2.2 Dependency networks (940)
26.3 Learning tree structures (941)
26.3.1 Directed or undirected tree? (942)
26.3.2 Chow-Liu algorithm for finding the ML tree structure (943)
26.3.3 Finding the MAP forest (943)
26.3.4 Mixtures of trees (945)
26.4 Learning DAG structures (945)
26.4.1 Markov equivalence (945)
26.4.2 Exact structural inference (947)
26.4.3 Scaling up to larger graphs (951)
26.5 Learning DAG structure with latent variables (953)
26.5.1 Approximating the marginal likelihood when we have missing data (953)
26.5.2 Structural EM (956)
26.5.3 Discovering hidden variables (957)
26.5.4 Case study: Google’s Rephil (959)
26.5.5 Structural equation models * (960)
26.6 Learning causal DAGs (962)
26.6.1 Causal interpretation of DAGs (962)
26.6.2 Using causal DAGs to resolve Simpson’s paradox (964)
26.6.3 Learning causal DAG structures (966)
26.7 Learning undirected Gaussian graphical models (969)
26.7.1 MLE for a GGM (969)
26.7.2 Graphical lasso (970)
26.7.3 Bayesian inference for GGM structure * (972)
26.7.4 Handling non-Gaussian data using copulas * (973)
26.8 Learning undirected discrete graphical models (973)
26.8.1 Graphical lasso for MRFs/CRFs (973)
26.8.2 Thin junction trees (975)
27 Latent Variable Models for Discrete Data (976)
27.1 Introduction (976)
27.2 Distributed state LVMs for discrete data (977)
27.2.1 Mixture models (977)
27.2.2 Exponential family PCA (978)
27.2.3 LDA and mPCA (979)
27.2.4 GaP model and non-negative matrix factorization (980)
27.3 Latent Dirichlet allocation (LDA) (981)
27.3.1 Basics (981)
27.3.2 Unsupervised discovery of topics (984)
27.3.3 Quantitatively evaluating LDA as a language model (984)
27.3.4 Fitting using (collapsed) Gibbs sampling (986)
27.3.5 Example (987)
27.3.6 Fitting using batch variational inference (988)
27.3.7 Fitting using online variational inference (990)
27.3.8 Determining the number of topics (991)
27.4 Extensions of LDA (992)
27.4.1 Correlated topic model (992)
27.4.2 Dynamic topic model (993)
27.4.3 LDA-HMM (994)
27.4.4 Supervised LDA (998)
27.5 LVMs for graph-structured data (1001)
27.5.1 Stochastic block model (1002)
27.5.2 Mixed membership stochastic block model (1004)
27.5.3 Relational topic model (1005)
27.6 LVMs for relational data (1006)
27.6.1 Infinite relational model (1007)
27.6.2 Probabilistic matrix factorization for collaborative filtering (1010)
27.7 Restricted Boltzmann machines (RBMs) (1014)
27.7.1 Varieties of RBMs (1016)
27.7.2 Learning RBMs (1018)
27.7.3 Applications of RBMs (1022)
28 Deep Learning (1026)
28.1 Introduction (1026)
28.2 Deep generative models (1026)
28.2.1 Deep directed networks (1027)
28.2.2 Deep Boltzmann machines (1027)
28.2.3 Deep belief networks (1028)
28.2.4 Greedy layer-wise learning of DBNs (1029)
28.3 Deep neural networks (1030)
28.3.1 Deep multi-layer perceptrons (1030)
28.3.2 Deep auto-encoders (1031)
28.3.3 Stacked denoising auto-encoders (1032)
28.4 Applications of deep networks (1032)
28.4.1 Handwritten digit classification using DBNs (1032)
28.4.2 Data visualization and feature discovery using deep auto-encoders (1033)
28.4.3 Information retrieval using deep auto-encoders (semantic hashing) (1034)
28.4.4 Learning audio features using 1d convolutional DBNs (1035)
28.4.5 Learning image features using 2d convolutional DBNs (1036)
28.5 Discussion (1036)
Notation (1040)
Bibliography (1046)
Index to Code (1078)
Index to Keywords (1081)