Central Library of Sharif University of Technology
 
Machine learning : a probabilistic perspective
Murphy, Kevin P.

Bibliographic Information

Title :   Machine learning : a probabilistic perspective
Author :   Murphy, Kevin P.
Publisher :   MIT Press
Pub. Year :   2012
Subjects :   Machine learning; Probabilities
Call Number :   Q325.5 .M87 2012

Table of Contents

  • Cover Page (1)
  • Half Title Page (2)
  • Title Page (4)
  • Copyright Page (5)
  • Dedication (6)
  • Contents (8)
  • Preface (28)
  • 1 Introduction (32)
    • 1.1 Machine learning: what and why? (32)
      • 1.1.1 Types of machine learning (33)
    • 1.2 Supervised learning (34)
      • 1.2.1 Classification (34)
      • 1.2.2 Regression (39)
    • 1.3 Unsupervised learning (40)
      • 1.3.1 Discovering clusters (41)
      • 1.3.2 Discovering latent factors (42)
      • 1.3.3 Discovering graph structure (44)
      • 1.3.4 Matrix completion (45)
    • 1.4 Some basic concepts in machine learning (47)
      • 1.4.1 Parametric vs non-parametric models (47)
      • 1.4.2 A simple non-parametric classifier: K-nearest neighbors (47)
      • 1.4.3 The curse of dimensionality (49)
      • 1.4.4 Parametric models for classification and regression (50)
      • 1.4.5 Linear regression (50)
      • 1.4.6 Logistic regression (52)
      • 1.4.7 Overfitting (53)
      • 1.4.8 Model selection (53)
      • 1.4.9 No free lunch theorem (55)
  • 2 Probability (58)
    • 2.1 Introduction (58)
    • 2.2 A brief review of probability theory (59)
      • 2.2.1 Discrete random variables (59)
      • 2.2.2 Fundamental rules (59)
      • 2.2.3 Bayes rule (60)
      • 2.2.4 Independence and conditional independence (61)
      • 2.2.5 Continuous random variables (63)
      • 2.2.6 Quantiles (64)
      • 2.2.7 Mean and variance (64)
    • 2.3 Some common discrete distributions (65)
      • 2.3.1 The binomial and Bernoulli distributions (65)
      • 2.3.2 The multinomial and multinoulli distributions (66)
      • 2.3.3 The Poisson distribution (68)
      • 2.3.4 The empirical distribution (68)
    • 2.4 Some common continuous distributions (69)
      • 2.4.1 Gaussian (normal) distribution (69)
      • 2.4.2 Degenerate pdf (70)
      • 2.4.3 The Laplace distribution (72)
      • 2.4.4 The gamma distribution (72)
      • 2.4.5 The beta distribution (73)
      • 2.4.6 Pareto distribution (74)
    • 2.5 Joint probability distributions (75)
      • 2.5.1 Covariance and correlation (75)
      • 2.5.2 The multivariate Gaussian (77)
      • 2.5.3 Multivariate Student t distribution (77)
      • 2.5.4 Dirichlet distribution (78)
    • 2.6 Transformations of random variables (80)
      • 2.6.1 Linear transformations (80)
      • 2.6.2 General transformations (81)
      • 2.6.3 Central limit theorem (82)
    • 2.7 Monte Carlo approximation (83)
      • 2.7.1 Example: change of variables, the MC way (84)
      • 2.7.2 Example: estimating π by Monte Carlo integration (85)
      • 2.7.3 Accuracy of Monte Carlo approximation (85)
    • 2.8 Information theory (87)
      • 2.8.1 Entropy (87)
      • 2.8.2 KL divergence (88)
      • 2.8.3 Mutual information (90)
  • 3 Generative Models for Discrete Data (96)
    • 3.1 Introduction (96)
    • 3.2 Bayesian concept learning (96)
      • 3.2.1 Likelihood (98)
      • 3.2.2 Prior (98)
      • 3.2.3 Posterior (99)
      • 3.2.4 Posterior predictive distribution (102)
      • 3.2.5 A more complex prior (103)
    • 3.3 The beta-binomial model (103)
      • 3.3.1 Likelihood (104)
      • 3.3.2 Prior (105)
      • 3.3.3 Posterior (106)
      • 3.3.4 Posterior predictive distribution (108)
    • 3.4 The Dirichlet-multinomial model (109)
      • 3.4.1 Likelihood (110)
      • 3.4.2 Prior (110)
      • 3.4.3 Posterior (110)
      • 3.4.4 Posterior predictive (112)
    • 3.5 Naive Bayes classifiers (113)
      • 3.5.1 Model fitting (114)
      • 3.5.2 Using the model for prediction (116)
      • 3.5.3 The log-sum-exp trick (117)
      • 3.5.4 Feature selection using mutual information (117)
      • 3.5.5 Classifying documents using bag of words (118)
  • 4 Gaussian Models (128)
    • 4.1 Introduction (128)
      • 4.1.1 Notation (128)
      • 4.1.2 Basics (128)
      • 4.1.3 MLE for an MVN (130)
      • 4.1.4 Maximum entropy derivation of the Gaussian * (132)
    • 4.2 Gaussian discriminant analysis (132)
      • 4.2.1 Quadratic discriminant analysis (QDA) (133)
      • 4.2.2 Linear discriminant analysis (LDA) (134)
      • 4.2.3 Two-class LDA (135)
      • 4.2.4 MLE for discriminant analysis (137)
      • 4.2.5 Strategies for preventing overfitting (137)
      • 4.2.6 Regularized LDA * (138)
      • 4.2.7 Diagonal LDA (139)
      • 4.2.8 Nearest shrunken centroids classifier * (140)
    • 4.3 Inference in jointly Gaussian distributions (141)
      • 4.3.1 Statement of the result (142)
      • 4.3.2 Examples (142)
      • 4.3.3 Information form (146)
      • 4.3.4 Proof of the result * (147)
    • 4.4 Linear Gaussian systems (150)
      • 4.4.1 Statement of the result (150)
      • 4.4.2 Examples (151)
      • 4.4.3 Proof of the result * (155)
    • 4.5 Digression: The Wishart distribution * (156)
      • 4.5.1 Inverse Wishart distribution (157)
      • 4.5.2 Visualizing the Wishart distribution * (158)
    • 4.6 Inferring the parameters of an MVN (158)
      • 4.6.1 Posterior distribution of μ (159)
      • 4.6.2 Posterior distribution of Σ * (159)
      • 4.6.3 Posterior distribution of μ and Σ * (163)
      • 4.6.4 Sensor fusion with unknown precisions * (169)
  • 5 Bayesian Statistics (180)
    • 5.1 Introduction (180)
    • 5.2 Summarizing posterior distributions (180)
      • 5.2.1 MAP estimation (180)
      • 5.2.2 Credible intervals (183)
      • 5.2.3 Inference for a difference in proportions (185)
    • 5.3 Bayesian model selection (186)
      • 5.3.1 Bayesian Occam’s razor (187)
      • 5.3.2 Computing the marginal likelihood (evidence) (189)
      • 5.3.3 Bayes factors (194)
      • 5.3.4 Jeffreys-Lindley paradox * (195)
    • 5.4 Priors (196)
      • 5.4.1 Uninformative priors (196)
      • 5.4.2 Jeffreys priors * (197)
      • 5.4.3 Robust priors (199)
      • 5.4.4 Mixtures of conjugate priors (199)
    • 5.5 Hierarchical Bayes (202)
      • 5.5.1 Example: modeling related cancer rates (202)
    • 5.6 Empirical Bayes (203)
      • 5.6.1 Example: beta-binomial model (204)
      • 5.6.2 Example: Gaussian-Gaussian model (204)
    • 5.7 Bayesian decision theory (207)
      • 5.7.1 Bayes estimators for common loss functions (208)
      • 5.7.2 The false positive vs false negative tradeoff (211)
      • 5.7.3 Other topics * (215)
  • 6 Frequentist Statistics (222)
    • 6.1 Introduction (222)
    • 6.2 Sampling distribution of an estimator (222)
      • 6.2.1 Bootstrap (223)
      • 6.2.2 Large sample theory for the MLE * (224)
    • 6.3 Frequentist decision theory (225)
      • 6.3.1 Bayes risk (226)
      • 6.3.2 Minimax risk (227)
      • 6.3.3 Admissible estimators (228)
    • 6.4 Desirable properties of estimators (231)
      • 6.4.1 Consistent estimators (231)
      • 6.4.2 Unbiased estimators (231)
      • 6.4.3 Minimum variance estimators (232)
      • 6.4.4 The bias-variance tradeoff (233)
    • 6.5 Empirical risk minimization (235)
      • 6.5.1 Regularized risk minimization (236)
      • 6.5.2 Structural risk minimization (237)
      • 6.5.3 Estimating the risk using cross validation (237)
      • 6.5.4 Upper bounding the risk using statistical learning theory * (240)
      • 6.5.5 Surrogate loss functions (241)
    • 6.6 Pathologies of frequentist statistics * (242)
      • 6.6.1 Counter-intuitive behavior of confidence intervals (243)
      • 6.6.2 p-values considered harmful (244)
      • 6.6.3 The likelihood principle (245)
      • 6.6.4 Why isn’t everyone a Bayesian? (246)
  • 7 Linear Regression (248)
    • 7.1 Introduction (248)
    • 7.2 Model specification (248)
    • 7.3 Maximum likelihood estimation (least squares) (248)
      • 7.3.1 Derivation of the MLE (250)
      • 7.3.2 Geometric interpretation (251)
      • 7.3.3 Convexity (252)
    • 7.4 Robust linear regression * (254)
    • 7.5 Ridge regression (256)
      • 7.5.1 Basic idea (256)
      • 7.5.2 Numerically stable computation * (258)
      • 7.5.3 Connection with PCA * (259)
      • 7.5.4 Regularization effects of big data (261)
    • 7.6 Bayesian linear regression (262)
      • 7.6.1 Computing the posterior (263)
      • 7.6.2 Computing the posterior predictive (264)
      • 7.6.3 Bayesian inference when σ² is unknown * (265)
      • 7.6.4 EB for linear regression (evidence procedure) (265)
  • 8 Logistic Regression (276)
    • 8.1 Introduction (276)
    • 8.2 Model specification (276)
    • 8.3 Model fitting (276)
      • 8.3.1 MLE (277)
      • 8.3.2 Steepest descent (278)
      • 8.3.3 Newton’s method (280)
      • 8.3.4 Iteratively reweighted least squares (IRLS) (281)
      • 8.3.5 Quasi-Newton (variable metric) methods (282)
      • 8.3.6 ℓ2 regularization (283)
      • 8.3.7 Multi-class logistic regression (283)
    • 8.4 Bayesian logistic regression (285)
      • 8.4.1 Laplace approximation (286)
      • 8.4.2 Derivation of the BIC (286)
      • 8.4.3 Gaussian approximation for logistic regression (287)
      • 8.4.4 Approximating the posterior predictive (287)
      • 8.4.5 Residual analysis (outlier detection) * (291)
    • 8.5 Online learning and stochastic optimization (292)
      • 8.5.1 Online learning and regret minimization (293)
      • 8.5.2 Stochastic optimization and risk minimization (293)
      • 8.5.3 The LMS algorithm (295)
      • 8.5.4 The perceptron algorithm (296)
      • 8.5.5 A Bayesian view (297)
    • 8.6 Generative vs discriminative classifiers (298)
      • 8.6.1 Pros and cons of each approach (299)
      • 8.6.2 Dealing with missing data (300)
      • 8.6.3 Fisher’s linear discriminant analysis (FLDA) * (302)
  • 9 Generalized Linear Models and the Exponential Family (312)
    • 9.1 Introduction (312)
    • 9.2 The exponential family (312)
      • 9.2.1 Definition (313)
      • 9.2.2 Examples (313)
      • 9.2.3 Log partition function (315)
      • 9.2.4 MLE for the exponential family (317)
      • 9.2.5 Bayes for the exponential family * (318)
      • 9.2.6 Maximum entropy derivation of the exponential family * (320)
    • 9.3 Generalized linear models (GLMs) (321)
      • 9.3.1 Basics (321)
      • 9.3.2 ML and MAP estimation (323)
      • 9.3.3 Bayesian inference (324)
    • 9.4 Probit regression (324)
      • 9.4.1 ML/MAP estimation using gradient-based optimization (325)
      • 9.4.2 Latent variable interpretation (325)
      • 9.4.3 Ordinal probit regression * (326)
      • 9.4.4 Multinomial probit models * (326)
    • 9.5 Multi-task learning (327)
      • 9.5.1 Hierarchical Bayes for multi-task learning (327)
      • 9.5.2 Application to personalized email spam filtering (327)
      • 9.5.3 Application to domain adaptation (328)
      • 9.5.4 Other kinds of prior (328)
    • 9.6 Generalized linear mixed models * (329)
      • 9.6.1 Example: semi-parametric GLMMs for medical data (329)
      • 9.6.2 Computational issues (331)
    • 9.7 Learning to rank * (331)
      • 9.7.1 The pointwise approach (332)
      • 9.7.2 The pairwise approach (332)
      • 9.7.3 The listwise approach (333)
      • 9.7.4 Loss functions for ranking (334)
  • 10 Directed Graphical Models (Bayes Nets) (338)
    • 10.1 Introduction (338)
      • 10.1.1 Chain rule (338)
      • 10.1.2 Conditional independence (339)
      • 10.1.3 Graphical models (339)
      • 10.1.4 Graph terminology (340)
      • 10.1.5 Directed graphical models (341)
    • 10.2 Examples (342)
      • 10.2.1 Naive Bayes classifiers (342)
      • 10.2.2 Markov and hidden Markov models (343)
      • 10.2.3 Medical diagnosis (344)
      • 10.2.4 Genetic linkage analysis * (346)
      • 10.2.5 Directed Gaussian graphical models * (349)
    • 10.3 Inference (350)
    • 10.4 Learning (351)
      • 10.4.1 Plate notation (351)
      • 10.4.2 Learning from complete data (353)
      • 10.4.3 Learning with missing and/or latent variables (354)
    • 10.5 Conditional independence properties of DGMs (355)
      • 10.5.1 d-separation and the Bayes Ball algorithm (global Markov properties) (355)
      • 10.5.2 Other Markov properties of DGMs (358)
      • 10.5.3 Markov blanket and full conditionals (358)
    • 10.6 Influence (decision) diagrams * (359)
  • 11 Mixture Models and the EM Algorithm (368)
    • 11.1 Latent variable models (368)
    • 11.2 Mixture models (368)
      • 11.2.1 Mixtures of Gaussians (370)
      • 11.2.2 Mixture of multinoullis (371)
      • 11.2.3 Using mixture models for clustering (371)
      • 11.2.4 Mixtures of experts (373)
    • 11.3 Parameter estimation for mixture models (376)
      • 11.3.1 Unidentifiability (377)
      • 11.3.2 Computing a MAP estimate is non-convex (378)
    • 11.4 The EM algorithm (379)
      • 11.4.1 Basic idea (380)
      • 11.4.2 EM for GMMs (381)
      • 11.4.3 EM for mixture of experts (388)
      • 11.4.4 EM for DGMs with hidden variables (389)
      • 11.4.5 EM for the Student distribution * (390)
      • 11.4.6 EM for probit regression * (393)
      • 11.4.7 Theoretical basis for EM * (394)
      • 11.4.8 Online EM (396)
      • 11.4.9 Other EM variants * (398)
    • 11.5 Model selection for latent variable models (401)
      • 11.5.1 Model selection for probabilistic models (401)
      • 11.5.2 Model selection for non-probabilistic methods (401)
    • 11.6 Fitting models with missing data (403)
      • 11.6.1 EM for the MLE of an MVN with missing data (404)
  • 12 Latent Linear Models (412)
    • 12.1 Factor analysis (412)
      • 12.1.1 FA is a low rank parameterization of an MVN (412)
      • 12.1.2 Inference of the latent factors (413)
      • 12.1.3 Unidentifiability (414)
      • 12.1.4 Mixtures of factor analysers (416)
      • 12.1.5 EM for factor analysis models (417)
      • 12.1.6 Fitting FA models with missing data (418)
    • 12.2 Principal components analysis (PCA) (418)
      • 12.2.1 Classical PCA: statement of the theorem (418)
      • 12.2.2 Proof * (420)
      • 12.2.3 Singular value decomposition (SVD) (423)
      • 12.2.4 Probabilistic PCA (426)
      • 12.2.5 EM algorithm for PCA (427)
    • 12.3 Choosing the number of latent dimensions (429)
      • 12.3.1 Model selection for FA/PPCA (429)
      • 12.3.2 Model selection for PCA (430)
    • 12.4 PCA for categorical data (433)
    • 12.5 PCA for paired and multi-view data (435)
      • 12.5.1 Supervised PCA (latent factor regression) (436)
      • 12.5.2 Partial least squares (437)
      • 12.5.3 Canonical correlation analysis (438)
    • 12.6 Independent Component Analysis (ICA) (438)
      • 12.6.1 Maximum likelihood estimation (441)
      • 12.6.2 The FastICA algorithm (442)
      • 12.6.3 Using EM (445)
      • 12.6.4 Other estimation principles * (446)
  • 13 Sparse Linear Models (452)
    • 13.1 Introduction (452)
    • 13.2 Bayesian variable selection (453)
      • 13.2.1 The spike and slab model (455)
      • 13.2.2 From the Bernoulli-Gaussian model to ℓ0 regularization (456)
      • 13.2.3 Algorithms (457)
    • 13.3 ℓ1 regularization: basics (460)
      • 13.3.1 Why does ℓ1 regularization yield sparse solutions? (461)
      • 13.3.2 Optimality conditions for lasso (462)
      • 13.3.3 Comparison of least squares, lasso, ridge and subset selection (466)
      • 13.3.4 Regularization path (467)
      • 13.3.5 Model selection (470)
      • 13.3.6 Bayesian inference for linear models with Laplace priors (471)
    • 13.4 ℓ1 regularization: algorithms (472)
      • 13.4.1 Coordinate descent (472)
      • 13.4.2 LARS and other homotopy methods (472)
      • 13.4.3 Proximal and gradient projection methods (473)
      • 13.4.4 EM for lasso (478)
    • 13.5 ℓ1 regularization: extensions (480)
      • 13.5.1 Group Lasso (480)
      • 13.5.2 Fused lasso (485)
      • 13.5.3 Elastic net (ridge and lasso combined) (486)
    • 13.6 Non-convex regularizers (488)
      • 13.6.1 Bridge regression (489)
      • 13.6.2 Hierarchical adaptive lasso (489)
      • 13.6.3 Other hierarchical priors (493)
    • 13.7 Automatic relevance determination (ARD)/sparse Bayesian learning (SBL) (494)
      • 13.7.1 ARD for linear regression (494)
      • 13.7.2 Whence sparsity? (496)
      • 13.7.3 Connection to MAP estimation (496)
      • 13.7.4 Algorithms for ARD * (497)
      • 13.7.5 ARD for logistic regression (499)
    • 13.8 Sparse coding * (499)
      • 13.8.1 Learning a sparse coding dictionary (500)
      • 13.8.2 Results of dictionary learning from image patches (501)
      • 13.8.3 Compressed sensing (503)
      • 13.8.4 Image inpainting and denoising (503)
  • 14 Kernels (510)
    • 14.1 Introduction (510)
    • 14.2 Kernel functions (510)
      • 14.2.1 RBF kernels (511)
      • 14.2.2 Kernels for comparing documents (511)
      • 14.2.3 Mercer (positive definite) kernels (512)
      • 14.2.4 Linear kernels (513)
      • 14.2.5 Matern kernels (513)
      • 14.2.6 String kernels (514)
      • 14.2.7 Pyramid match kernels (515)
      • 14.2.8 Kernels derived from probabilistic generative models (516)
    • 14.3 Using kernels inside GLMs (517)
      • 14.3.1 Kernel machines (517)
      • 14.3.2 L1VMs, RVMs, and other sparse vector machines (518)
    • 14.4 The kernel trick (519)
      • 14.4.1 Kernelized nearest neighbor classification (520)
      • 14.4.2 Kernelized K-medoids clustering (520)
      • 14.4.3 Kernelized ridge regression (523)
      • 14.4.4 Kernel PCA (524)
    • 14.5 Support vector machines (SVMs) (527)
      • 14.5.1 SVMs for regression (528)
      • 14.5.2 SVMs for classification (529)
      • 14.5.3 Choosing C (535)
      • 14.5.4 Summary of key points (535)
      • 14.5.5 A probabilistic interpretation of SVMs (536)
    • 14.6 Comparison of discriminative kernel methods (536)
    • 14.7 Kernels for building generative models (538)
      • 14.7.1 Smoothing kernels (538)
      • 14.7.2 Kernel density estimation (KDE) (539)
      • 14.7.3 From KDE to KNN (540)
      • 14.7.4 Kernel regression (541)
      • 14.7.5 Locally weighted regression (543)
  • 15 Gaussian Processes (546)
    • 15.1 Introduction (546)
    • 15.2 GPs for regression (547)
      • 15.2.1 Predictions using noise-free observations (548)
      • 15.2.2 Predictions using noisy observations (549)
      • 15.2.3 Effect of the kernel parameters (550)
      • 15.2.4 Estimating the kernel parameters (552)
      • 15.2.5 Computational and numerical issues * (555)
      • 15.2.6 Semi-parametric GPs * (555)
    • 15.3 GPs meet GLMs (556)
      • 15.3.1 Binary classification (556)
      • 15.3.2 Multi-class classification (559)
      • 15.3.3 GPs for Poisson regression (562)
    • 15.4 Connection with other methods (563)
      • 15.4.1 Linear models compared to GPs (563)
      • 15.4.2 Linear smoothers compared to GPs (564)
      • 15.4.3 SVMs compared to GPs (565)
      • 15.4.4 L1VM and RVMs compared to GPs (565)
      • 15.4.5 Neural networks compared to GPs (566)
      • 15.4.6 Smoothing splines compared to GPs * (567)
      • 15.4.7 RKHS methods compared to GPs * (569)
    • 15.5 GP latent variable model (571)
    • 15.6 Approximation methods for large datasets (573)
  • 16 Adaptive Basis Function Models (574)
    • 16.1 Introduction (574)
    • 16.2 Classification and regression trees (CART) (575)
      • 16.2.1 Basics (575)
      • 16.2.2 Growing a tree (576)
      • 16.2.3 Pruning a tree (580)
      • 16.2.4 Pros and cons of trees (581)
      • 16.2.5 Random forests (581)
      • 16.2.6 CART compared to hierarchical mixture of experts * (582)
    • 16.3 Generalized additive models (583)
      • 16.3.1 Backfitting (583)
      • 16.3.2 Computational efficiency (584)
      • 16.3.3 Multivariate adaptive regression splines (MARS) (584)
    • 16.4 Boosting (585)
      • 16.4.1 Forward stagewise additive modeling (586)
      • 16.4.2 L2boosting (588)
      • 16.4.3 AdaBoost (589)
      • 16.4.4 LogitBoost (590)
      • 16.4.5 Boosting as functional gradient descent (591)
      • 16.4.6 Sparse boosting (592)
      • 16.4.7 Multivariate adaptive regression trees (MART) (593)
      • 16.4.8 Why does boosting work so well? (593)
      • 16.4.9 A Bayesian view (594)
    • 16.5 Feedforward neural networks (multilayer perceptrons) (594)
      • 16.5.1 Convolutional neural networks (595)
      • 16.5.2 Other kinds of neural networks (599)
      • 16.5.3 A brief history of the field (599)
      • 16.5.4 The backpropagation algorithm (600)
      • 16.5.5 Identifiability (603)
      • 16.5.6 Regularization (603)
      • 16.5.7 Bayesian inference (607)
    • 16.6 Ensemble learning (611)
      • 16.6.1 Stacking (611)
      • 16.6.2 Error-correcting output codes (612)
      • 16.6.3 Ensemble learning is not equivalent to Bayes model averaging (612)
    • 16.7 Experimental comparison (613)
      • 16.7.1 Low-dimensional features (613)
      • 16.7.2 High-dimensional features (614)
    • 16.8 Interpreting black-box models (616)
  • 17 Markov and Hidden Markov Models (620)
    • 17.1 Introduction (620)
    • 17.2 Markov models (620)
      • 17.2.1 Transition matrix (620)
      • 17.2.2 Application: Language modeling (622)
      • 17.2.3 Stationary distribution of a Markov chain * (627)
      • 17.2.4 Application: Google’s PageRank algorithm for web page ranking (631)
    • 17.3 Hidden Markov models (634)
      • 17.3.1 Applications of HMMs (635)
    • 17.4 Inference in HMMs (637)
      • 17.4.1 Types of inference problems for temporal models (637)
      • 17.4.2 The forwards algorithm (640)
      • 17.4.3 The forwards-backwards algorithm (641)
      • 17.4.4 The Viterbi algorithm (643)
      • 17.4.5 Forwards filtering, backwards sampling (647)
    • 17.5 Learning for HMMs (648)
      • 17.5.1 Training with fully observed data (648)
      • 17.5.2 EM for HMMs (the Baum-Welch algorithm) (649)
      • 17.5.3 Bayesian methods for “fitting” HMMs * (651)
      • 17.5.4 Discriminative training (651)
      • 17.5.5 Model selection (652)
    • 17.6 Generalizations of HMMs (652)
      • 17.6.1 Variable duration (semi-Markov) HMMs (653)
      • 17.6.2 Hierarchical HMMs (655)
      • 17.6.3 Input-output HMMs (656)
      • 17.6.4 Auto-regressive and buried HMMs (657)
      • 17.6.5 Factorial HMM (658)
      • 17.6.6 Coupled HMM and the influence model (659)
      • 17.6.7 Dynamic Bayesian networks (DBNs) (659)
  • 18 State Space Models (662)
    • 18.1 Introduction (662)
    • 18.2 Applications of SSMs (663)
      • 18.2.1 SSMs for object tracking (663)
      • 18.2.2 Robotic SLAM (664)
      • 18.2.3 Online parameter learning using recursive least squares (667)
      • 18.2.4 SSM for time series forecasting * (668)
    • 18.3 Inference in LG-SSM (671)
      • 18.3.1 The Kalman filtering algorithm (671)
      • 18.3.2 The Kalman smoothing algorithm (674)
    • 18.4 Learning for LG-SSM (677)
      • 18.4.1 Identifiability and numerical stability (677)
      • 18.4.2 Training with fully observed data (678)
      • 18.4.3 EM for LG-SSM (678)
      • 18.4.4 Subspace methods (678)
      • 18.4.5 Bayesian methods for “fitting” LG-SSMs (678)
    • 18.5 Approximate online inference for non-linear, non-Gaussian SSMs (678)
      • 18.5.1 Extended Kalman filter (EKF) (679)
      • 18.5.2 Unscented Kalman filter (UKF) (681)
      • 18.5.3 Assumed density filtering (ADF) (683)
    • 18.6 Hybrid discrete/continuous SSMs (686)
      • 18.6.1 Inference (687)
      • 18.6.2 Application: data association and multi-target tracking (689)
      • 18.6.3 Application: fault diagnosis (690)
      • 18.6.4 Application: econometric forecasting (691)
  • 19 Undirected Graphical Models (Markov Random Fields) (692)
    • 19.1 Introduction (692)
    • 19.2 Conditional independence properties of UGMs (692)
      • 19.2.1 Key properties (692)
      • 19.2.2 An undirected alternative to d-separation (694)
      • 19.2.3 Comparing directed and undirected graphical models (695)
    • 19.3 Parameterization of MRFs (696)
      • 19.3.1 The Hammersley-Clifford theorem (696)
      • 19.3.2 Representing potential functions (698)
    • 19.4 Examples of MRFs (699)
      • 19.4.1 Ising model (699)
      • 19.4.2 Hopfield networks (700)
      • 19.4.3 Potts model (702)
      • 19.4.4 Gaussian MRFs (703)
      • 19.4.5 Markov logic networks * (705)
    • 19.5 Learning (707)
      • 19.5.1 Training maxent models using gradient methods (707)
      • 19.5.2 Training partially observed maxent models (708)
      • 19.5.3 Approximate methods for computing the MLEs of MRFs (709)
      • 19.5.4 Pseudo likelihood (709)
      • 19.5.5 Stochastic maximum likelihood (710)
      • 19.5.6 Feature induction for maxent models * (711)
      • 19.5.7 Iterative proportional fitting (IPF) * (712)
    • 19.6 Conditional random fields (CRFs) (715)
      • 19.6.1 Chain-structured CRFs, MEMMs and the label-bias problem (715)
      • 19.6.2 Applications of CRFs (717)
      • 19.6.3 CRF training (723)
    • 19.7 Structural SVMs (724)
      • 19.7.1 SSVMs: a probabilistic view (724)
      • 19.7.2 SSVMs: a non-probabilistic view (726)
      • 19.7.3 Cutting plane methods for fitting SSVMs (729)
      • 19.7.4 Online algorithms for fitting SSVMs (731)
      • 19.7.5 Latent structural SVMs (732)
  • 20 Exact Inference for Graphical Models (738)
    • 20.1 Introduction (738)
    • 20.2 Belief propagation for trees (738)
      • 20.2.1 Serial protocol (738)
      • 20.2.2 Parallel protocol (740)
      • 20.2.3 Gaussian BP * (741)
      • 20.2.4 Other BP variants * (743)
    • 20.3 The variable elimination algorithm (745)
      • 20.3.1 The generalized distributive law * (748)
      • 20.3.2 Computational complexity of VE (748)
      • 20.3.3 A weakness of VE (751)
    • 20.4 The junction tree algorithm * (751)
      • 20.4.1 Creating a junction tree (751)
      • 20.4.2 Message passing on a junction tree (753)
      • 20.4.3 Computational complexity of JTA (756)
      • 20.4.4 JTA generalizations * (757)
    • 20.5 Computational intractability of exact inference in the worst case (757)
      • 20.5.1 Approximate inference (758)
  • 21 Variational Inference (762)
    • 21.1 Introduction (762)
    • 21.2 Variational inference (763)
      • 21.2.1 Alternative interpretations of the variational objective (764)
      • 21.2.2 Forward or reverse KL? * (764)
    • 21.3 The mean field method (766)
      • 21.3.1 Derivation of the mean field update equations (767)
      • 21.3.2 Example: mean field for the Ising model (768)
    • 21.4 Structured mean field * (770)
      • 21.4.1 Example: factorial HMM (771)
    • 21.5 Variational Bayes (773)
      • 21.5.1 Example: VB for a univariate Gaussian (773)
      • 21.5.2 Example: VB for linear regression (777)
    • 21.6 Variational Bayes EM (780)
      • 21.6.1 Example: VBEM for mixtures of Gaussians * (781)
    • 21.7 Variational message passing and VIBES (787)
    • 21.8 Local variational bounds * (787)
      • 21.8.1 Motivating applications (787)
      • 21.8.2 Bohning’s quadratic bound to the log-sum-exp function (789)
      • 21.8.3 Bounds for the sigmoid function (791)
      • 21.8.4 Other bounds and approximations to the log-sum-exp function (793)
      • 21.8.5 Variational inference based on upper bounds (794)
  • 22 More Variational Inference (798)
    • 22.1 Introduction (798)
    • 22.2 Loopy belief propagation: algorithmic issues (798)
      • 22.2.1 A brief history (798)
      • 22.2.2 LBP on pairwise models (799)
      • 22.2.3 LBP on a factor graph (800)
      • 22.2.4 Convergence (802)
      • 22.2.5 Accuracy of LBP (805)
      • 22.2.6 Other speedup tricks for LBP * (806)
    • 22.3 Loopy belief propagation: theoretical issues * (807)
      • 22.3.1 UGMs represented in exponential family form (807)
      • 22.3.2 The marginal polytope (808)
      • 22.3.3 Exact inference as a variational optimization problem (809)
      • 22.3.4 Mean field as a variational optimization problem (810)
      • 22.3.5 LBP as a variational optimization problem (810)
      • 22.3.6 Loopy BP vs mean field (814)
    • 22.4 Extensions of belief propagation * (814)
      • 22.4.1 Generalized belief propagation (814)
      • 22.4.2 Convex belief propagation (816)
    • 22.5 Expectation propagation (818)
      • 22.5.1 EP as a variational inference problem (819)
      • 22.5.2 Optimizing the EP objective using moment matching (820)
      • 22.5.3 EP for the clutter problem (822)
      • 22.5.4 LBP is a special case of EP (823)
      • 22.5.5 Ranking players using TrueSkill (824)
      • 22.5.6 Other applications of EP (830)
    • 22.6 MAP state estimation (830)
      • 22.6.1 Linear programming relaxation (830)
      • 22.6.2 Max-product belief propagation (831)
      • 22.6.3 Graphcuts (832)
      • 22.6.4 Experimental comparison of graphcuts and BP (835)
      • 22.6.5 Dual decomposition (837)
  • 23 Monte Carlo Inference (846)
    • 23.1 Introduction (846)
    • 23.2 Sampling from standard distributions (846)
      • 23.2.1 Using the cdf (846)
      • 23.2.2 Sampling from a Gaussian (Box-Muller method) (848)
    • 23.3 Rejection sampling (848)
      • 23.3.1 Basic idea (848)
      • 23.3.2 Example (849)
      • 23.3.3 Application to Bayesian statistics (850)
      • 23.3.4 Adaptive rejection sampling (850)
      • 23.3.5 Rejection sampling in high dimensions (851)
    • 23.4 Importance sampling (851)
      • 23.4.1 Basic idea (851)
      • 23.4.2 Handling unnormalized distributions (852)
      • 23.4.3 Importance sampling for a DGM: likelihood weighting (853)
      • 23.4.4 Sampling importance resampling (SIR) (853)
    • 23.5 Particle filtering (854)
      • 23.5.1 Sequential importance sampling (855)
      • 23.5.2 The degeneracy problem (856)
      • 23.5.3 The resampling step (856)
      • 23.5.4 The proposal distribution (858)
      • 23.5.5 Application: robot localization (859)
      • 23.5.6 Application: visual object tracking (859)
      • 23.5.7 Application: time series forecasting (862)
    • 23.6 Rao-Blackwellised particle filtering (RBPF) (862)
      • 23.6.1 RBPF for switching LG-SSMs (862)
      • 23.6.2 Application: tracking a maneuvering target (863)
      • 23.6.3 Application: Fast SLAM (865)
  • 24 Markov Chain Monte Carlo (MCMC) Inference (868)
    • 24.1 Introduction (868)
    • 24.2 Gibbs sampling (869)
      • 24.2.1 Basic idea (869)
      • 24.2.2 Example: Gibbs sampling for the Ising model (869)
      • 24.2.3 Example: Gibbs sampling for inferring the parameters of a GMM (871)
      • 24.2.4 Collapsed Gibbs sampling * (872)
      • 24.2.5 Gibbs sampling for hierarchical GLMs (875)
      • 24.2.6 BUGS and JAGS (877)
      • 24.2.7 The Imputation Posterior (IP) algorithm (878)
      • 24.2.8 Blocking Gibbs sampling (878)
    • 24.3 Metropolis Hastings algorithm (879)
      • 24.3.1 Basic idea (879)
      • 24.3.2 Gibbs sampling is a special case of MH (880)
      • 24.3.3 Proposal distributions (881)
      • 24.3.4 Adaptive MCMC (884)
      • 24.3.5 Initialization and mode hopping (885)
      • 24.3.6 Why MH works * (885)
      • 24.3.7 Reversible jump (trans-dimensional) MCMC * (886)
    • 24.4 Speed and accuracy of MCMC (887)
      • 24.4.1 The burn-in phase (887)
      • 24.4.2 Mixing rates of Markov chains * (888)
      • 24.4.3 Practical convergence diagnostic (889)
      • 24.4.4 Accuracy of MCMC (891)
      • 24.4.5 How many chains? (893)
    • 24.5 Auxiliary variable MCMC * (894)
      • 24.5.1 Auxiliary variable sampling for logistic regression (894)
      • 24.5.2 Slice sampling (895)
      • 24.5.3 Swendsen Wang (897)
      • 24.5.4 Hybrid/Hamiltonian MCMC * (899)
    • 24.6 Annealing methods (899)
      • 24.6.1 Simulated annealing (900)
      • 24.6.2 Annealed importance sampling (902)
      • 24.6.3 Parallel tempering (902)
    • 24.7 Approximating the marginal likelihood (903)
      • 24.7.1 The candidate method (903)
      • 24.7.2 Harmonic mean estimate (903)
      • 24.7.3 Annealed importance sampling (904)
  • 25 Clustering (906)
    • 25.1 Introduction (906)
      • 25.1.1 Measuring (dis)similarity (906)
      • 25.1.2 Evaluating the output of clustering methods * (907)
    • 25.2 Dirichlet process mixture models (910)
      • 25.2.1 From finite to infinite mixture models (910)
      • 25.2.2 The Dirichlet process (913)
      • 25.2.3 Applying Dirichlet processes to mixture modeling (916)
      • 25.2.4 Fitting a DP mixture model (917)
    • 25.3 Affinity propagation (918)
    • 25.4 Spectral clustering (921)
      • 25.4.1 Graph Laplacian (922)
      • 25.4.2 Normalized graph Laplacian (923)
      • 25.4.3 Example (924)
    • 25.5 Hierarchical clustering (924)
      • 25.5.1 Agglomerative clustering (926)
      • 25.5.2 Divisive clustering (929)
      • 25.5.3 Choosing the number of clusters (930)
      • 25.5.4 Bayesian hierarchical clustering (930)
    • 25.6 Clustering datapoints and features (932)
      • 25.6.1 Biclustering (934)
      • 25.6.2 Multi-view clustering (934)
  • 26 Graphical Model Structure Learning (938)
    • 26.1 Introduction (938)
    • 26.2 Structure learning for knowledge discovery (939)
      • 26.2.1 Relevance networks (939)
      • 26.2.2 Dependency networks (940)
    • 26.3 Learning tree structures (941)
      • 26.3.1 Directed or undirected tree? (942)
      • 26.3.2 Chow-Liu algorithm for finding the ML tree structure (943)
      • 26.3.3 Finding the MAP forest (943)
      • 26.3.4 Mixtures of trees (945)
    • 26.4 Learning DAG structures (945)
      • 26.4.1 Markov equivalence (945)
      • 26.4.2 Exact structural inference (947)
      • 26.4.3 Scaling up to larger graphs (951)
    • 26.5 Learning DAG structure with latent variables (953)
      • 26.5.1 Approximating the marginal likelihood when we have missing data (953)
      • 26.5.2 Structural EM (956)
      • 26.5.3 Discovering hidden variables (957)
      • 26.5.4 Case study: Google’s Rephil (959)
      • 26.5.5 Structural equation models * (960)
    • 26.6 Learning causal DAGs (962)
      • 26.6.1 Causal interpretation of DAGs (962)
      • 26.6.2 Using causal DAGs to resolve Simpson’s paradox (964)
      • 26.6.3 Learning causal DAG structures (966)
    • 26.7 Learning undirected Gaussian graphical models (969)
      • 26.7.1 MLE for a GGM (969)
      • 26.7.2 Graphical lasso (970)
      • 26.7.3 Bayesian inference for GGM structure * (972)
      • 26.7.4 Handling non-Gaussian data using copulas * (973)
    • 26.8 Learning undirected discrete graphical models (973)
      • 26.8.1 Graphical lasso for MRFs/CRFs (973)
      • 26.8.2 Thin junction trees (975)
  • 27 Latent Variable Models for Discrete Data (976)
    • 27.1 Introduction (976)
    • 27.2 Distributed state LVMs for discrete data (977)
      • 27.2.1 Mixture models (977)
      • 27.2.2 Exponential family PCA (978)
      • 27.2.3 LDA and mPCA (979)
      • 27.2.4 GaP model and non-negative matrix factorization (980)
    • 27.3 Latent Dirichlet allocation (LDA) (981)
      • 27.3.1 Basics (981)
      • 27.3.2 Unsupervised discovery of topics (984)
      • 27.3.3 Quantitatively evaluating LDA as a language model (984)
      • 27.3.4 Fitting using (collapsed) Gibbs sampling (986)
      • 27.3.5 Example (987)
      • 27.3.6 Fitting using batch variational inference (988)
      • 27.3.7 Fitting using online variational inference (990)
      • 27.3.8 Determining the number of topics (991)
    • 27.4 Extensions of LDA (992)
      • 27.4.1 Correlated topic model (992)
      • 27.4.2 Dynamic topic model (993)
      • 27.4.3 LDA-HMM (994)
      • 27.4.4 Supervised LDA (998)
    • 27.5 LVMs for graph-structured data (1001)
      • 27.5.1 Stochastic block model (1002)
      • 27.5.2 Mixed membership stochastic block model (1004)
      • 27.5.3 Relational topic model (1005)
    • 27.6 LVMs for relational data (1006)
      • 27.6.1 Infinite relational model (1007)
      • 27.6.2 Probabilistic matrix factorization for collaborative filtering (1010)
    • 27.7 Restricted Boltzmann machines (RBMs) (1014)
      • 27.7.1 Varieties of RBMs (1016)
      • 27.7.2 Learning RBMs (1018)
      • 27.7.3 Applications of RBMs (1022)
  • 28 Deep Learning (1026)
    • 28.1 Introduction (1026)
    • 28.2 Deep generative models (1026)
      • 28.2.1 Deep directed networks (1027)
      • 28.2.2 Deep Boltzmann machines (1027)
      • 28.2.3 Deep belief networks (1028)
      • 28.2.4 Greedy layer-wise learning of DBNs (1029)
    • 28.3 Deep neural networks (1030)
      • 28.3.1 Deep multi-layer perceptrons (1030)
      • 28.3.2 Deep auto-encoders (1031)
      • 28.3.3 Stacked denoising auto-encoders (1032)
    • 28.4 Applications of deep networks (1032)
      • 28.4.1 Handwritten digit classification using DBNs (1032)
      • 28.4.2 Data visualization and feature discovery using deep auto-encoders (1033)
      • 28.4.3 Information retrieval using deep auto-encoders (semantic hashing) (1034)
      • 28.4.4 Learning audio features using 1d convolutional DBNs (1035)
      • 28.4.5 Learning image features using 2d convolutional DBNs (1036)
    • 28.5 Discussion (1036)
  • Notation (1040)
  • Bibliography (1046)
  • Index to Code (1078)
  • Index to Keywords (1081)