20.15 结论-深度学习-万书网

为了让模型理解基于给定训练数据表示的大千世界，训练具有隐藏单元的生成模型是一种有力方法。通过学习模型和表示，生成模型可以解答x输入变量之间关系的许多推断问题，并且可以在不同层对h求期望来提供表示x的许多不同方式。生成模型可以为AI系统提供它们所要理解的、各种不同概念的框架，让它们有能力在面对不确定性的情况下推理这些概念。我们希望读者能够找到增强这些方法的新途径，并继续探究智能和学习背后原理的旅程。

————————————————————

(1)  术语“mcRBM”根据字母M-C-R-B-M发音；“mc”不是“McDonald's”中的“Mc”的发音。

(2)  这个版本的Gaussian-Bernoulli  RBM能量函数假定图像数据的每个像素具有零均值。考虑非零像素均值时，可以简单地将像素偏移添加到模型中。

(3)  该论文将模型描述为“深度信念网络”，但因为它可以被描述为纯无向模型（具有易处理逐层均匀场不动点更新），所以它最适合深度玻尔兹曼机的定义。

参考文献

Abadi，M.，Agarwal，A.，Barham，P.，Brevdo，E.，Chen，Z.，Citro，C.，Corrado，G.  S.，Davis，A.，Dean，J.，Devin，M.，Ghemawat，S.，Goodfellow，I.，Harp，A.，Irving，G.，Isard，M.，Jia，Y.，Jozefowicz，R.，Kaiser，L.，Kudlur，M.，Levenberg，J.，Mané，D.，Monga，R.，Moore，S.，Murray，D.，Olah，C.，Schuster，M.，Shlens，J.，Steiner，B.，Sutskever，I.，Talwar，K.，Tucker，P.，Vanhoucke，V.，Vasudevan，V.，Viégas，F.，Vinyals，O.，Warden，P.，Wattenberg，M.，Wicke，M.，Yu，Y.，and  Zheng，X.（2015）.  TensorFlow:Large-scale  machine  learning  on  heterogeneous  systems.  Software  available  from  tensorflow.org.

Ackley，D.  H.，Hinton，G.  E.，and  Sejnowski，T.  J.（1985）.  A  learning  algorithm  for  Boltzmann  machines.  Cognitive  Science，9，147–169.

Alain，G.  and  Bengio，Y.（2013）.  What  regularized  auto-encoders  learn  from  the  data  generating  distribution.  In  ICLR'2013，arXiv:1211.4246.

Alain，G.，Bengio，Y.，Yao，L.，Éric  Thibodeau-Laufer，Yosinski，J.，and  Vincent，P.（2015）.  GSNs:  Generative  stochastic  networks.  arXiv:1503.05571.

Anderson，E.（1935）.  The  Irises  of  the  Gaspé  Peninsula.  Bulletin  of  the  American  Iris  Society，59，2–5.

Ba，J.，Mnih，V.，and  Kavukcuoglu，K.（2014）.  Multiple  object  recognition  with  visual  attention.  arXiv:1412.7755.

Bachman，P.  and  Precup，D.（2015）.  Variational  generative  stochastic  networks  with  collaborative  shaping.  In  Proceedings  of  the  32nd  International  Conference  on  Machine  Learning，ICML  2015，Lille，France，6-11  July  2015，pages  1964–1972.

Bacon，P.-L.，Bengio，E.，Pineau，J.，and  Precup，D.（2015）.  Conditional  computation  in  neural  networks  using  a  decision-theoretic  approach.  In  2nd  Multidisciplinary  Conference  on  Rein-forcement  Learning  and  Decision  Making（RLDM  2015）.

Bagnell，J.  A.  and  Bradley，D.  M.（2009）.  Differentiable  sparse  coding.  In  NIPS'2009，pages  113–120.

Bahdanau，D.，Cho，K.，and  Bengio，Y.（2015）.  Neural  machine  translation  by  jointly  learning  to  align  and  translate.  In  ICLR'2015，arXiv:1409.0473.

Bahl，L.  R.，Brown，P.，de  Souza，P.  V.，and  Mercer，R.  L.（1987）.  Speech  recognition  with  continuous-parameter  hidden  Markov  models.  Computer，Speech  and  Language，2，219–234.

Baldi，P.  and  Hornik，K.（1989）.  Neural  networks  and  principal  component  analysis:Learning  from  examples  without  local  minima.  Neural  Networks，2，53–58.

Baldi，P.，Brunak，S.，Frasconi，P.，Soda，G.，and  Pollastri，G.（1999）.  Exploiting  the  past  and  the  future  in  protein  secondary  structure  prediction.  Bioinformatics，15（11），937–946.

Baldi，P.，Sadowski，P.，and  Whiteson，D.（2014）.  Searching  for  exotic  particles  in  high-energy  physics  with  deep  learning.  Nature  communications，5.

Ballard，D.  H.，Hinton，G.  E.，and  Sejnowski，T.  J.（1983）.  Parallel  vision  computation.  Nature.

Barlow，H.  B.（1989）.  Unsupervised  learning.  Neural  Computation，1，295–311.

Barron，A.  E.（1993）.  Universal  approximation  bounds  for  superpositions  of  a  sigmoidal  function.  IEEE  Trans.  on  Information  Theory，39，930–945.

Bartholomew，D.  J.（1987）.  Latent  variable  models  and  factor  analysis.  Oxford  University  Press.

Basilevsky，A.（1994）.  Statistical  Factor  Analysis  and  Related  Methods:Theory  and  Applications.  Wiley.

Bastien，F.，Lamblin，P.，Pascanu，R.，Bergstra，J.，Goodfellow，I.，Bergeron，A.，Bouchard，N.，Warde-Farley，D.，and  Bengio，Y.（2012a）.  Theano:new  features  and  speed  improvements.  Submited  to  the  Deep  Learning  and  Unsupervised  Feature  Learning  NIPS  2012  Workshop，http://iro.umontreal.ca/lisa/publications2/index.php/publications/show/551.

Bastien，F.，Lamblin，P.，Pascanu，R.，Bergstra，J.，Goodfellow，I.  J.，Bergeron，A.，Bouchard，N.，and  Bengio，Y.（2012b）.  Theano:new  features  and  speed  improvements.  Deep  Learning  and  Unsupervised  Feature  Learning  NIPS  2012  Workshop.

Basu，S.  and  Christensen，J.（2013）.  Teaching  classification  boundaries  to  humans.  In  AAAI'2013.

Baxter，J.（1995）.  Learning  internal  representations.  In  Proceedings  of  the  8th  International  Conference  on  Computational  Learning  Theory（COLT'95），pages  311–320，Santa  Cruz，California.  ACM  Press.

Bayer，J.  and  Osendorfer，C.（2014）.  Learning  stochastic  recurrent  networks.  ArXiv  e-prints.

Becker，S.  and  Hinton，G.（1992）.  A  self-organizing  neural  network  that  discovers  surfaces  in  random-dot  stereograms.  Nature，355，161–163.

Behnke，S.（2001）.  Learning  iterative  image  reconstruction  in  the  neural  abstraction  pyramid.  Int.  J.  Computational  Intelligence  and  Applications，1（4），427–438.

Beiu，V.，Quintana，J.  M.，and  Avedillo，M.  J.（2003）.  VLSI  implementations  of  threshold  logic-a  comprehensive  survey.  Neural  Networks，IEEE  Transactions  on，14（5），1217–1243.

Belkin，M.  and  Niyogi，P.（2002）.  Laplacian  eigenmaps  and  spectral  techniques  for  embedding  and  clustering.  In  T.  Dietterich，S.  Becker，and  Z.  Ghahramani，editors，Advances  in  Neural  Information  Processing  Systems  14（NIPS'01），Cambridge，MA.  MIT  Press.

Belkin，M.  and  Niyogi，P.（2003a）.  Laplacian  eigenmaps  for  dimensionality  reduction  and  data  representation.  Neural  Computation，15（6），1373–1396.

Belkin，M.  and  Niyogi，P.（2003b）.  Using  manifold  structure  for  partially  labeled  classification.  In  S.  Becker，S.  Thrun，and  K.  Obermayer，editors，Advances  in  Neural  Information  Processing  Systems  15（NIPS'02），Cambridge，MA.  MIT  Press.

Bengio，E.，Bacon，P.-L.，Pineau，J.，and  Precup，D.（2015a）.  Conditional  computation  in  neural  networks  for  faster  models.  arXiv:1511.06297.

Bengio，S.  and  Bengio，Y.（2000a）.  Taking  on  the  curse  of  dimensionality  in  joint  distributions  using  neural  networks.  IEEE  Transactions  on  Neural  Networks，special  issue  on  Data  Mining  and  Knowledge  Discovery，11（3），550–557.

Bengio，S.，Vinyals，O.，Jaitly，N.，and  Shazeer，N.（2015b）.  Scheduled  sampling  for  sequence  prediction  with  recurrent  neural  networks.  Technical  report，arXiv:1506.03099.

Bengio，Y.（1991）.  Artificial  Neural  Networks  and  their  Application  to  Sequence  Recognition.  Ph.D.  thesis，McGill  University，（Computer  Science），Montreal，Canada.

Bengio，Y.（2000）.  Gradient-based  optimization  of  hyperparameters.  Neural  Computation，12（8），1889–1900.

Bengio，Y.（2002）.  New  distributed  probabilistic  language  models.  Technical  Report  1215，Dept.  IRO，Université  de  Montréal.

Bengio，Y.（2009）.  Learning  deep  architectures  for  AI.  Now  Publishers.

Bengio，Y.（2013）.  Deep  learning  of  representations:  looking  forward.  In  Statistical  Language  and  Speech  Processing，volume  7978  of  Lecture  Notes  in  Computer  Science，pages  1–37.  Springer，also  in  arXiv  at  http://arxiv.org/abs/1305.0445.

Bengio，Y.（2015）.  Early  inference  in  energy-based  models  approximates  back-propagation.  Technical  Report  arXiv:1510.02777，Universite  de  Montreal.

Bengio，Y.  and  Bengio，S.（2000b）.  Modeling  high-dimensional  discrete  data  with  multi-layer  neural  networks.  In  NIPS  12，pages  400–406.  MIT  Press.

Bengio，Y.  and  Delalleau，O.（2009）.  Justifying  and  generalizing  contrastive  divergence.  Neural  Computation，21（6），1601–1621.

Bengio，Y.  and  Grandvalet，Y.（2004）.  No  unbiased  estimator  of  the  variance  of  k-fold  cross-validation.  In  JML（1），pages  1089–1105.

Bengio，Y.  and  LeCun，Y.（2007a）.  Scaling  learning  algorithms  towards  AI.  In  Large  Scale  Kernel  Machines.

Bengio，Y.  and  LeCun，Y.（2007b）.  Scaling  learning  algorithms  towards  AI.  In  L.  Bottou，O.  Chapelle，D.  DeCoste，and  J.  Weston，editors，Large  Scale  Kernel  Machines.  MIT  Press.

Bengio，Y.  and  Monperrus，M.（2005）.  Non-local  manifold  tangent  learning.  In  L.  Saul，Y.  Weiss，and  L.  Bottou，editors，Advances  in  Neural  Information  Processing  Systems  17（NIPS'04），pages  129–136.  MIT  Press.

Bengio，Y.  and  Sénécal，J.-S.（2003）.  Quick  training  of  probabilistic  neural  nets  by  importance  sampling.  In  Proceedings  of  AISTATS  2003.

Bengio，Y.  and  Sénécal，J.-S.（2008）.  Adaptive  importance  sampling  to  accelerate  training  of  a  neural  probabilistic  language  model.  IEEE  Trans.  Neural  Networks，19（4），713–722.

Bengio，Y.，De  Mori，R.，Flammia，G.，and  Kompe，R.（1991）.  Phonetically  motivated  acoustic  parameters  for  continuous  speech  recognition  using  artificial  neural  networks.  In  Proceedings  of  EuroSpeech'91.

Bengio，Y.，De  Mori，R.，Flammia，G.，and  Kompe，R.（1992）.  Neural  network-Gaussian  mix-ture  hybrid  for  speech  recognition  or  density  estimation.  In  NIPS  4，pages  175–182.  Morgan  Kaufmann.

Bengio，Y.，Frasconi，P.，and  Simard，P.（1993）.  The  problem  of  learning  long-term  dependencies  in  recurrent  networks.  In  IEEE  International  Conference  on  Neural  Networks，pages  1183–1195，San  Francisco.  IEEE  Press.（invited  paper）.

Bengio，Y.，Simard，P.，and  Frasconi，P.（1994a）.  Learning  long-term  dependencies  with  gradient  descent  is  difficult.  IEEE  Tr.  Neural  Nets.

Bengio，Y.，Simard，P.，and  Frasconi，P.（1994b）.  Learning  long-term  dependencies  with  gradient  descent  is  difficult.  IEEE  Transactions  on  Neural  Networks，5（2），157–166.

Bengio，Y.，Simard，P.，and  Frasconi，P.（1994c）.  Learning  long-term  dependencies  with  gradient  descent  is  difficult.  IEEE  Transactions  on  Neural  Networks，5（2），157–166.

Bengio，Y.，Latendresse，S.，and  Dugas，C.（1999）.  Gradient-based  learning  of  hyper-parameters.  In  Learning  Conference.

Bengio，Y.，Ducharme，R.，and  Vincent，P.（2001a）.  A  neural  probabilistic  language  model.  In  T.  Leen，T.  Dietterich，and  V.  Tresp，editors，Advances  in  Neural  Information  Processing  Systems  13（NIPS'00），pages  933–938.  MIT  Press.

Bengio，Y.，Ducharme，R.，and  Vincent，P.（2001b）.  A  neural  probabilistic  language  model.  In  T.  K.  Leen，T.  G.  Dietterich，and  V.  Tresp，editors，NIPS'2000，pages  932–938.  MIT  Press.

Bengio，Y.，Ducharme，R.，Vincent，P.，and  Jauvin，C.（2003）.  A  neural  probabilistic  language  model.  JMLR，3，1137–1155.

Bengio，Y.，Delalleau，O.，and  Le  Roux，N.（2006a）.  The  curse  of  highly  variable  functions  for  local  kernel  machines.  In  NIPS'2005.

Bengio，Y.，Larochelle，H.，and  Vincent，P.（2006b）.  Non-local  manifold  Parzen  windows.  In  NIPS'2005.  MIT  Press.

Bengio，Y.，Lamblin，P.，Popovici，D.，and  Larochelle，H.（2007a）.  Greedy  layer-wise  training  of  deep  networks.  In  NIPS'2006.

Bengio，Y.，Lamblin，P.，Popovici，D.，and  Larochelle，H.（2007b）.  Greedy  layer-wise  training  of  deep  networks.  In  B.  Schölkopf，J.  Platt，and  T.  Hoffman，editors，Advances  in  Neural  Information  Processing  Systems  19（NIPS'06），pages  153–160.  MIT  Press.

Bengio，Y.，Lamblin，P.，Popovici，D.，and  Larochelle，H.（2007c）.  Greedy  layer-wise  training  of  deep  networks.  In  Adv.  Neural  Inf.  Proc.  Sys.  19，pages  153–160.

Bengio，Y.，Lamblin，P.，Popovici，D.，and  Larochelle，H.（2007d）.  Greedy  layer-wise  training  of  deep  networks.  In  NIPS  19，pages  153–160.  MIT  Press.

Bengio，Y.，Louradour，J.，Collobert，R.，and  Weston，J.（2009）.  Curriculum  learning.  In  ICML'09.  ACM.

Bengio，Y.，Mesnil，G.，Dauphin，Y.，and  Rifai，S.（2013a）.  Better  mixing  via  deep  representa-tions.  In  ICML'2013.

Bengio，Y.，Léonard，N.，and  Courville，A.（2013b）.  Estimating  or  propagating  gradients  through  stochastic  neurons  for  conditional  computation.  arXiv:1308.3432.

Bengio，Y.，Yao，L.，Alain，G.，and  Vincent，P.（2013c）.  Generalized  denoising  auto-encoders  as  generative  models.  In  NIPS'2013.

Bengio，Y.，Courville，A.，and  Vincent，P.（2013d）.  Representation  learning:  A  review  and  new  perspectives.  Pattern  Analysis  and  Machine  Intelligence，IEEE  Transactions  on，35（8），1798–1828.

Bengio，Y.，Thibodeau-Laufer，E.，Alain，G.，and  Yosinski，J.（2014）.  Deep  generative  stochastic  networks  trainable  by  backprop.  In  ICML'2014.

Bennett，C.（1976）.  Efficient  estimation  of  free  energy  differences  from  Monte  Carlo  data.  Journal  of  Computational  Physics，22（2），245–268.

Bennett，J.  and  Lanning，S.（2007）.  The  Netflix  prize.

Berger，A.  L.，Della  Pietra，V.  J.，and  Della  Pietra，S.  A.（1996）.  A  maximum  entropy  approach  to  natural  language  processing.  Computational  Linguistics，22，39–71.

Berglund，M.  and  Raiko，T.（2013）.  Stochastic  gradient  estimate  variance  in  contrastive  diver-gence  and  persistent  contrastive  divergence.  CoRR，abs/1312.6002.

Bergstra，J.（2011）.  Incorporating  Complex  Cells  into  Neural  Networks  for  Pattern  Classification.  Ph.D.  thesis，Université  de  Montréal.

Bergstra，J.  and  Bengio，Y.（2009）.  Slow，decorrelated  features  for  pretraining  complex  cell-like  networks.  In  NIPS  22，pages  99–107.  MIT  Press.

Bergstra，J.  and  Bengio，Y.（2011）.  Random  search  for  hyper-parameter  optimization.  The  Learning  Workshop，Fort  Lauderdale，Florida.

Bergstra，J.  and  Bengio，Y.（2012）.  Random  search  for  hyper-parameter  optimization.  J.  Machine  Learning  Res.，13，281–305.

Bergstra，J.，Breuleux，O.，Bastien，F.，Lamblin，P.，Pascanu，R.，Desjardins，G.，Turian，J.，Warde-Farley，D.，and  Bengio，Y.（2010a）.  Theano:  a  CPU  and  GPU  math  expression  compiler.  In  Proceedings  of  the  Python  for  Scientific  Computing  Conference（SciPy）.  Oral  Presentation.

Bergstra，J.，Breuleux，O.，Bastien，F.，Lamblin，P.，Pascanu，R.，Desjardins，G.，Turian，J.，Warde-Farley，D.，and  Bengio，Y.（2010b）.  Theano:  a  CPU  and  GPU  math  expression  com-piler.  In  Proc.  SciPy.

Bergstra，J.，Breuleux，O.，Bastien，F.，Lamblin，P.，Pascanu，R.，Desjardins，G.，Turian，J.，Warde-Farley，D.，and  Bengio，Y.（2010c）.  Theano:a  CPU  and  GPU  math  expression  compiler.  In  Proceedings  of  the  Python  for  Scientific  Computing  Conference（SciPy）.

Bergstra，J.，Bardenet，R.，Bengio，Y.，and  Kégl，B.（2011）.  Algorithms  for  hyper-parameter  optimization.  In  NIPS'2011.

Berkes，P.  and  Wiskott，L.（2005）.  Slow  feature  analysis  yields  a  rich  repertoire  of  complex  cell  properties.  Journal  of  Vision，5（6），579–602.

Bertsekas，D.  P.  and  Tsitsiklis，J.（1996）.  Neuro-Dynamic  Programming.  Athena  Scientific.

Besag，J.（1975）.  Statistical  analysis  of  non-lattice  data.  The  Statistician，24（3），179–195.

Bishop，C.  M.（1994）.  Mixture  density  networks.

Bishop，C.  M.（1995a）.  Regularization  and  complexity  control  in  feed-forward  networks.  In  Proceedings  International  Conference  on  Artificial  Neural  Networks  ICANN'95，volume  1，page  141–148.

Bishop，C.  M.（1995b）.  Training  with  noise  is  equivalent  to  Tikhonov  regularization.  Neural  Computation，7（1），108–116.

Bishop，C.  M.（2006）.  Pattern  Recognition  and  Machine  Learning.  Springer.

Blum，A.  L.  and  Rivest，R.  L.（1992）.  Training  a  3-node  neural  network  is  NP-complete.

Blumer，A.，Ehrenfeucht，A.，Haussler，D.，and  Warmuth，M.  K.（1989）.  Learnability  and  the  Vapnik–Chervonenkis  dimension.  Journal  of  the  ACM，36（4），865–929.

Bonnet，G.（1964）.  Transformations  des  signaux  aléatoires  à  travers  les  systèmes  non  linéaires  sans  mémoire.  Annales  des  Télécommunications，19（9–10），203–220.

Bordes，A.，Weston，J.，Collobert，R.，and  Bengio，Y.（2011）.  Learning  structured  embeddings  of  knowledge  bases.  In  AAAI  2011.

Bordes，A.，Glorot，X.，Weston，J.，and  Bengio，Y.（2012）.  Joint  learning  of  words  and  meaning  representations  for  open-text  semantic  parsing.  AISTATS'2012.

Bordes，A.，Glorot，X.，Weston，J.，and  Bengio，Y.（2013a）.  A  semantic  matching  energy  func-tion  for  learning  with  multi-relational  data.  Machine  Learning:  Special  Issue  on  Learning  Semantics.

Bordes，A.，Usunier，N.，Garcia-Duran，A.，Weston，J.，and  Yakhnenko，O.（2013b）.  Translating  embeddings  for  modeling  multi-relational  data.  In  C.  Burges，L.  Bottou，M.  Welling，Z.  Ghahramani，and  K.  Weinberger，editors，Advances  in  Neural  Information  Processing  Systems  26，pages  2787–2795.  Curran  Associates，Inc.

Bornschein，J.  and  Bengio，Y.（2015）.  Reweighted  wake-sleep.  In  ICLR'2015，arXiv:1406.2751.

Bornschein，J.，Shabanian，S.，Fischer，A.，and  Bengio，Y.（2015）.  Training  bidirectional  Helmholtz  machines.  Technical  report，arXiv:1506.03877.

Boser，B.  E.，Guyon，I.  M.，and  Vapnik，V.  N.（1992）.  A  training  algorithm  for  optimal  margin  classifiers.  In  COLT  '92:  Proceedings  of  thefifth  annual  workshop  on  Computational  learning  theory，pages  144–152，New  York，NY，USA.  ACM.

Bottou，L.（1998）.  Online  algorithms  and  stochastic  approximations.  In  D.  Saad，editor，Online  Learning  in  Neural  Networks.  Cambridge  University  Press，Cambridge，UK.

Bottou，L.（2011）.  From  machine  learning  to  machine  reasoning.  Technical  report，arXiv.1102.1808.

Bottou，L.（2015）.  Multilayer  neural  networks.  Deep  Learning  Summer  School.

Bottou，L.  and  Bousquet，O.（2008a）.  The  tradeoffs  of  large  scale  learning.  In  J.  Platt，D.  Koller，Y.  Singer，and  S.  Roweis，editors，Advances  in  Neural  Information  Processing  Systems  20（NIPS'07），volume  20.  MIT  Press，Cambridge，MA.

Bottou，L.  and  Bousquet，O.（2008b）.  The  tradeoffs  of  large  scale  learning.  In  NIPS'2008.

Boulanger-Lewandowski，N.，Bengio，Y.，and  Vincent，P.（2012）.  Modeling  temporal  dependen-cies  in  high-dimensional  sequences:  Application  to  polyphonic  music  generation  and  transcrip-tion.  In  ICML'12.

Boureau，Y.，Ponce，J.，and  LeCun，Y.（2010）.  A  theoretical  analysis  of  feature  pooling  in  vision  algorithms.  In  Proc.  International  Conference  on  Machine  learning（ICML'10）.

Boureau，Y.，Le  Roux，N.，Bach，F.，Ponce，J.，and  LeCun，Y.（2011）.  Ask  the  locals:  multi-way  local  pooling  for  image  recognition.  In  Proc.  International  Conference  on  Computer  Vision（ICCV'11）.  IEEE.

Bourlard，H.  and  Kamp，Y.（1988）.  Auto-association  by  multilayer  perceptrons  and  singular  value  decomposition.  Biological  Cybernetics，59，291–294.

Bourlard，H.  and  Wellekens，C.（1989）.  Speech  pattern  discrimination  and  multi-layered  percep-trons.  Computer  Speech  and  Language，3，1–19.

Boyd，S.  and  Vandenberghe，L.（2004）.  Convex  Optimization.  Cambridge  University  Press，New  York，NY，USA.

Brady，M.  L.，Raghavan，R.，and  Slawny，J.（1989）.  Back-propagation  fails  to  separate  where  perceptrons  succeed.  IEEE  Transactions  on  Circuits  and  Systems，36（5），665–674.

Brakel，P.，Stroobandt，D.，and  Schrauwen，B.（2013）.  Training  energy-based  models  for  time-series  imputation.  Journal  of  Machine  Learning  Research，14，2771–2797.

Brand，M.（2003a）.  Charting  a  manifold.  In  S.  Becker，S.  Thrun，and  K.  Obermayer，editors，Advances  in  Neural  Information  Processing  Systems  15（NIPS'02），pages  961–968.  MIT  Press.

Brand，M.（2003b）.  Charting  a  manifold.  In  NIPS'2002，pages  961–968.  MIT  Press.

Breiman，L.（1994）.  Bagging  predictors.  Machine  Learning，24（2），123–140.

Breiman，L.，Friedman，J.  H.，Olshen，R.  A.，and  Stone，C.  J.（1984）.  Classification  and  Regression  Trees.  Wadsworth  International  Group，Belmont，CA.

Bridle，J.  S.（1990）.  Alphanets:  a  recurrent  ‘neural’  network  architecture  with  a  hidden  Markov  model  interpretation.  Speech  Communication，9（1），83–92.

Briggman，K.，Denk，W.，Seung，S.，Helmstaedter，M.  N.，and  Turaga，S.  C.（2009）.  Maximin  affinity  learning  of  image  segmentation.  In  NIPS'2009，pages  1865–1873.

Brown，P.  F.，Cocke，J.，Pietra，S.  A.  D.，Pietra，V.  J.  D.，Jelinek，F.，Lafferty，J.  D.，Mercer，R.  L.，and  Roossin，P.  S.（1990）.  A  statistical  approach  to  machine  translation.  Computational  linguistics，16（2），79–85.

Brown，P.  F.，Pietra，V.  J.  D.，DeSouza，P.  V.，Lai，J.  C.，and  Mercer，R.  L.（1992）.  Class-based  n-gram  models  of  natural  language.  Computational  Linguistics，18，467–479.

Bryson，A.  and  Ho，Y.（1969）.  Applied  optimal  control:  optimization，estimation，and  control.  Blaisdell  Pub.  Co.

Bryson，Jr.，A.  E.  and  Denham，W.  F.（1961）.  A  steepest-ascent  method  for  solving  optimum  programming  problems.  Technical  Report  BR-1303，Raytheon  Company，Missle  and  Space  Division.

Buciluǎ，C.，Caruana，R.，and  Niculescu-Mizil，A.（2006）.  Model  compression.  In  Proceedings  of  the  12th  ACM  SIGKDD  international  conference  on  Knowledge  discovery  and  data  mining，pages  535–541.  ACM.

Burda，Y.，Grosse，R.，and  Salakhutdinov，R.（2015）.  Importance  weighted  autoencoders.  arXiv  preprint  arXiv:1509.00519.

Cai，M.，Shi，Y.，and  Liu，J.（2013）.  Deep  maxout  neural  networks  for  speech  recognition.  In  Automatic  Speech  Recognition  and  Understanding（ASRU），2013  IEEE  Workshop  on，pages  291–296.  IEEE.

Carreira-Perpiñan，M.  A.  and  Hinton，G.  E.（2005）.  On  contrastive  divergence  learning.  In  AISTATS'2005，pages  33–40.

Caruana，R.（1993）.  Multitask  connectionist  learning.  In  Proceedings  of  the  1993  Connectionist  Models  Summer  School，pages  372–379.

Cauchy，A.（1847）.  Méthode  générale  pour  la  résolution  de  systèmes  d'équations  simultanées.  In  Compte  rendu  des  séances  de  l'académie  des  sciences，pages  536–538.

Cayton，L.（2005）.  Algorithms  for  manifold  learning.  Technical  Report  CS2008-0923，UCSD.

Chandola，V.，Banerjee，A.，and  Kumar，V.（2009）.  Anomaly  detection:  A  survey.  ACM  computing  surveys（CSUR），41（3），15.

Chapelle，O.，Weston，J.，and  Schölkopf，B.（2003）.  Cluster  kernels  for  semi-supervised  learning.  In  S.  Becker，S.  Thrun，and  K.  Obermayer，editors，Advances  in  Neural  Information  Processing  Systems  15（NIPS'02），pages  585–592，Cambridge，MA.  MIT  Press.

Chapelle，O.，Schölkopf，B.，and  Zien，A.，editors（2006）.  Semi-Supervised  Learning.  MIT  Press，Cambridge，MA.

Chellapilla，K.，Puri，S.，and  Simard，P.（2006）.  High  Performance  Convolutional  Neural  Net-works  for  Document  Processing.  In  Guy  Lorette，editor，Tenth  International  Workshop  on  Frontiers  in  Handwriting  Recognition，La  Baule（France）.  Université  de  Rennes  1，Suvisoft.  http://suvisoft.

Chen，B.，Ting，J.-A.，Marlin，B.  M.，and  de  Freitas，N.（2010）.  Deep  learning  of  invariant  spatio-temporal  features  from  video.  NIPS*2010  Deep  Learning  and  Unsupervised  Feature  Learning  Workshop.

Chen，S.  F.  and  Goodman，J.  T.（1999）.  An  empirical  study  of  smoothing  techniques  for  language  modeling.  Computer，Speech  and  Language，13（4），359–393.

Chen，T.，Du，Z.，Sun，N.，Wang，J.，Wu，C.，Chen，Y.，and  Temam，O.（2014a）.  DianNao:  A  small-footprint  high-throughput  accelerator  for  ubiquitous  machine-learning.  In  Proceedings  of  the  19th  international  conference  on  Architectural  support  for  programming  languages  and  operating  systems，pages  269–284.  ACM.

Chen，T.，Li，M.，Li，Y.，Lin，M.，Wang，N.，Wang，M.，Xiao，T.，Xu，B.，Zhang，C.，and  Zhang，Z.（2015）.  MXNet:  A  flexible  and  efficient  machine  learning  library  for  heterogeneous  distributed  systems.  arXiv  preprint  arXiv:1512.01274.

Chen，Y.，Luo，T.，Liu，S.，Zhang，S.，He，L.，Wang，J.，Li，L.，Chen，T.，Xu，Z.，Sun，N.，et  al.（2014b）.  DaDianNao:  A  machine-learning  supercomputer.  In  Microarchitecture（MICRO），2014  47th  Annual  IEEE/ACM  International  Symposium  on，pages  609–622.  IEEE.

Chilimbi，T.，Suzue，Y.，Apacible，J.，and  Kalyanaraman，K.（2014）.  Project  Adam:  Building  an  efficient  and  scalable  deep  learning  training  system.  In  11th  USENIX  Symposium  on  Operating  Systems  Design  and  Implementation（OSDI'14）.

Cho，K.，Raiko，T.，and  Ilin，A.（2010a）.  Parallel  tempering  is  efficient  for  learning  restricted  Boltzmann  machines.  In  Proceedings  of  the  International  Joint  Conference  on  Neural  Networks（IJCNN  2010），Barcelona，Spain.

Cho，K.，Raiko，T.，and  Ilin，A.（2010b）.  Parallel  tempering  is  efficient  for  learning  restricted  Boltzmann  machines.  In  IJCNN'2010.

Cho，K.，Raiko，T.，and  Ilin，A.（2011）.  Enhanced  gradient  and  adaptive  learning  rate  for  training  restricted  Boltzmann  machines.  In  ICML'2011，pages  105–112.

Cho，K.，Van  Merriënboer，B.，Gülçehre，Ç.，Bahdanau，D.，Bougares，F.，Schwenk，H.，and  Bengio，Y.（2014a）.  Learning  phrase  representations  using  RNN  encoder–decoder  for  statistical  machine  translation.  In  Proceedings  of  the  2014  Conference  on  Empirical  Methods  in  Natural  Language  Processing（EMNLP），pages  1724–1734.  Association  for  Computational  Linguistics.

Cho，K.，van  Merriënboer，B.，Gulcehre，C.，Bougares，F.，Schwenk，H.，and  Bengio，Y.（2014b）.  Learning  phrase  representations  using  RNN  encoder-decoder  for  statistical  machine  translation.  In  Proceedings  of  the  Empiricial  Methods  in  Natural  Language  Processing（EMNLP  2014）.

Cho，K.，Van  Merriënboer，B.，Bahdanau，D.，and  Bengio，Y.（2014c）.  On  the  properties  of  neural  machine  translation:  Encoder-decoder  approaches.  ArXiv  e-prints，abs/1409.1259.

Choromanska，A.，Henaff，M.，Mathieu，M.，Arous，G.  B.，and  LeCun，Y.（2014）.  The  loss  surface  of  multilayer  networks.

Chorowski，J.，Bahdanau，D.，Cho，K.，and  Bengio，Y.（2014）.  End-to-end  continuous  speech  recognition  using  attention-based  recurrent  NN:  First  results.  arXiv:1412.1602.

Christianson，B.（1992）.  Automatic  Hessians  by  reverse  accumulation.  IMA  Journal  of  Numerical  Analysis，12（2），135–150.

Chrupala，G.，Kadar，A.，and  Alishahi，A.（2015）.  Learning  language  through  pictures.  arXiv  1506.03694.

Chung，J.，Gulcehre，C.，Cho，K.，and  Bengio，Y.（2014）.  Empirical  evaluation  of  gated  recurrent  neural  networks  on  sequence  modeling.  NIPS'2014  Deep  Learning  workshop，arXiv  1412.3555.

Chung，J.，Gülçehre，Ç.，Cho，K.，and  Bengio，Y.（2015a）.  Gated  feedback  recurrent  neural  networks.  In  ICML'15.

Chung，J.，Kastner，K.，Dinh，L.，Goel，K.，Courville，A.，and  Bengio，Y.（2015b）.  A  recurrent  latent  variable  model  for  sequential  data.  In  NIPS'2015.

Ciresan，D.，Meier，U.，Masci，J.，and  Schmidhuber，J.（2012）.  Multi-column  deep  neural  network  for  traffic  sign  classification.  Neural  Networks，32，333–338.

Ciresan，D.  C.，Meier，U.，Gambardella，L.  M.，and  Schmidhuber，J.（2010）.  Deep  big  simple  neural  nets  for  handwritten  digit  recognition.  Neural  Computation，22，1–14.

Coates，A.  and  Ng，A.  Y.（2011）.  The  importance  of  encoding  versus  training  with  sparse  coding  and  vector  quantization.  In  ICML'2011.

Coates，A.，Lee，H.，and  Ng，A.  Y.（2011）.  An  analysis  of  single-layer  networks  in  unsuper-vised  feature  learning.  In  Proceedings  of  the  Thirteenth  International  Conference  on  Artificial  Intelligence  and  Statistics（AISTATS  2011）.

Coates，A.，Huval，B.，Wang，T.，Wu，D.，Catanzaro，B.，and  Andrew，N.（2013）.  Deep  learning  with  COTS  HPC  systems.  In  S.  Dasgupta  and  D.  McAllester，editors，Proceedings  of  the  30th  International  Conference  on  Machine  Learning（ICML-13），volume  28（3），pages  1337–1345.  JMLR  Workshop  and  Conference  Proceedings.

Cohen，N.，Sharir，O.，and  Shashua，A.（2015）.  On  the  expressive  power  of  deep  learning:  A  tensor  analysis.  arXiv:1509.05009.

Collobert，R.（2004）.  Large  Scale  Machine  Learning.  Ph.D.  thesis，Université  de  Paris  VI，LIP6.

Collobert，R.（2011）.  Deep  learning  for  efficient  discriminative  parsing.  In  AISTATS'2011.

Collobert，R.  and  Weston，J.（2008a）.  A  unified  architecture  for  natural  language  processing:  Deep  neural  networks  with  multitask  learning.  In  ICML'2008.

Collobert，R.  and  Weston，J.（2008b）.  A  unified  architecture  for  natural  language  processing:  Deep  neural  networks  with  multitask  learning.  In  ICML'2008.

Collobert，R.，Bengio，S.，and  Bengio，Y.（2001）.  A  parallel  mixture  of  SVMs  for  very  large  scale  problems.  Technical  Report  12，IDIAP.

Collobert，R.，Bengio，S.，and  Bengio，Y.（2002）.  Parallel  mixture  of  SVMs  for  very  large  scale  problem.  Neural  Computation.

Collobert，R.，Weston，J.，Bottou，L.，Karlen，M.，Kavukcuoglu，K.，and  Kuksa，P.（2011a）.  Natural  language  processing（almost）  from  scratch.  The  Journal  of  Machine  Learning  Research，12，2493–2537.

Collobert，R.，Kavukcuoglu，K.，and  Farabet，C.（2011b）.  Torch7:  A  Matlab-like  environment  for  machine  learning.  In  BigLearn，NIPS  Workshop.

Comon，P.（1994）.  Independent  component  analysis-a  new  concept？Signal  Processing，36，287–314.

Cortes，C.  and  Vapnik，V.（1995）.  Support  vector  networks.  Machine  Learning，20，273–297.

Couprie，C.，Farabet，C.，Najman，L.，and  LeCun，Y.（2013）.  Indoor  semantic  segmentation  using  depth  information.  In  International  Conference  on  Learning  Representations（ICLR2013）.

Courbariaux，M.，Bengio，Y.，and  David，J.-P.（2015）.  Low  precision  arithmetic  for  deep  learning.  In  Arxiv:1412.7024，ICLR'2015  Workshop.

Courville，A.，Bergstra，J.，and  Bengio，Y.（2011a）.  Unsupervised  models  of  images  by  spike-and-slab  RBMs.  In  ICML'2011.

Courville，A.，Bergstra，J.，and  Bengio，Y.（2011b）.  Unsupervised  models  of  images  by  spike-and-slab  RBMs.  In  ICM（1b）.

Courville，A.，Desjardins，G.，Bergstra，J.，and  Bengio，Y.（2014）.  The  spike-and-slab  RBM  and  extensions  to  discrete  and  sparse  data  distributions.  Pattern  Analysis  and  Machine  Intelligence，IEEE  Transactions  on，36（9），1874–1887.

Cover，T.  M.  and  Thomas，J.  A.（2006）.  Elements  of  Information  Theory，2nd  Edition.  Wiley-Interscience.

Cox，D.  and  Pinto，N.（2011）.  Beyond  simple  features:  A  large-scale  feature  search  approach  to  unconstrained  face  recognition.  In  Automatic  Face  &  Gesture  Recognition  and  Workshops（FG  2011），2011  IEEE  International  Conference  on，pages  8–15.  IEEE.

Cramér，H.（1946）.  Mathematical  methods  of  statistics.  Princeton  University  Press.

Crick，F.  H.  C.  and  Mitchison，G.（1983）.  The  function  of  dream  sleep.  Nature，304，111–114.

Cybenko，G.（1989）.  Approximation  by  superpositions  of  a  sigmoidal  function.  Mathematics  of  Control，Signals，and  Systems，2，303–314.

Dahl，G.  E.，Ranzato，M.，Mohamed，A.，and  Hinton，G.  E.（2010）.  Phone  recognition  with  the  mean-covariance  restricted  Boltzmann  machine.  In  Advances  in  Neural  Information  Processing  Systems（NIPS）.

Dahl，G.  E.，Yu，D.，Deng，L.，and  Acero，A.（2012）.  Context-dependent  pre-trained  deep  neural  networks  for  large  vocabulary  speech  recognition.  IEEE  Transactions  on  Audio，Speech，and  Language  Processing，20（1），33–42.

Dahl，G.  E.，Sainath，T.  N.，and  Hinton，G.  E.（2013）.  Improving  deep  neural  networks  for  LVCSR  using  rectified  linear  units  and  dropout.  In  ICASSP'2013.

Dahl，G.  E.，Jaitly，N.，and  Salakhutdinov，R.（2014）.  Multi-task  neural  networks  for  QSAR  predictions.  arXiv:1406.1231.

Dauphin，Y.  and  Bengio，Y.（2013）.  Stochastic  ratio  matching  of  RBMs  for  sparse  high-dimensional  inputs.  In  NIP（1）.

Dauphin，Y.，Glorot，X.，and  Bengio，Y.（2011）.  Large-scale  learning  of  embeddings  with  recon-struction  sampling.  In  ICML'2011.

Dauphin，Y.，Pascanu，R.，Gulcehre，C.，Cho，K.，Ganguli，S.，and  Bengio，Y.（2014）.  Identifying  and  attacking  the  saddle  point  problem  in  high-dimensional  non-convex  optimization.  In  NIPS'2014.

Davis，A.，Rubinstein，M.，Wadhwa，N.，Mysore，G.，Durand，F.，and  Freeman，W.  T.（2014）.  The  visual  microphone:  Passive  recovery  of  sound  from  video.  ACM  Transactions  on  Graphics（Proc.  SIGGRAPH），33（4），79:1–79:10.

Dayan，P.（1990）.  Reinforcement  comparison.  In  Connectionist  Models:  Proceedings  of  the  1990  Connectionist  Summer  School，San  Mateo，CA.

Dayan，P.  and  Hinton，G.  E.（1996）.  Varieties  of  Helmholtz  machine.  Neural  Networks，9（8），1385–1403.

Dayan，P.，Hinton，G.  E.，Neal，R.  M.，and  Zemel，R.  S.（1995）.  The  Helmholtz  machine.  Neural  computation，7（5），889–904.

Dean，J.，Corrado，G.，Monga，R.，Chen，K.，Devin，M.，Le，Q.，Mao，M.，Ranzato，M.，Senior，A.，Tucker，P.，Yang，K.，and  Ng，A.  Y.（2012）.  Large  scale  distributed  deep  networks.  In  NIPS'2012.

Dean，T.  and  Kanazawa，K.（1989）.  A  model  for  reasoning  about  persistence  and  causation.  Computational  Intelligence，5（3），142–150.

Deerwester，S.，Dumais，S.  T.，Furnas，G.  W.，Landauer，T.  K.，and  Harshman，R.（1990）.  Indexing  by  latent  semantic  analysis.  Journal  of  the  American  Society  for  Information  Science，41（6），391–407.

Delalleau，O.  and  Bengio，Y.（2011）.  Shallow  vs.  deep  sum-product  networks.  In  NIPS.

Deng，J.，Dong，W.，Socher，R.，Li，L.-J.，Li，K.，and  Fei-Fei，L.（2009）.  ImageNet:  A  Large-Scale  Hierarchical  Image  Database.  In  CVPR09.

Deng，J.，Berg，A.  C.，Li，K.，and  Fei-Fei，L.（2010a）.  What  does  classifying  more  than  10，000  image  categories  tell  us?  In  Proceedings  of  the  11th  European  Conference  on  Computer  Vision:  Part  V，ECCV'10，pages  71–84，Berlin，Heidelberg.  Springer-Verlag.

Deng，L.  and  Yu，D.（2014）.  Deep  learning–methods  and  applications.  Foundations  and  Trends  in  Signal  Processing.

Deng，L.，Seltzer，M.，Yu，D.，Acero，A.，Mohamed，A.，and  Hinton，G.（2010b）.  Binary  coding  of  speech  spectrograms  using  a  deep  auto-encoder.  In  Interspeech  2010，Makuhari，Chiba，Japan.

Denil，M.，Bazzani，L.，Larochelle，H.，and  de  Freitas，N.（2012）.  Learning  where  to  attend  with  deep  architectures  for  image  tracking.  Neural  Computation，24（8），2151–2184.

Denton，E.，Chintala，S.，Szlam，A.，and  Fergus，R.（2015）.  Deep  generative  image  models  using  a  Laplacian  pyramid  of  adversarial  networks.  NIPS.

Desjardins，G.  and  Bengio，Y.（2008）.  Empirical  evaluation  of  convolutional  RBMs  for  vision.  Technical  Report  1327，Département  d'Informatique  et  de  Recherche  Opérationnelle，Université  de  Montréal.

Desjardins，G.，Courville，A.  C.，Bengio，Y.，Vincent，P.，and  Delalleau，O.（2010）.  Tempered  Markov  chain  Monte  Carlo  for  training  of  restricted  Boltzmann  machines.  In  International  Conference  on  Artificial  Intelligence  and  Statistics，pages  145–152.

Desjardins，G.，Courville，A.，and  Bengio，Y.（2011）.  On  tracking  the  partition  function.  In  NIPS'2011.

Devlin，J.，Zbib，R.，Huang，Z.，Lamar，T.，Schwartz，R.，and  Makhoul，J.（2014）.  Fast  and  robust  neural  network  joint  models  for  statistical  machine  translation.  In  Proc.  ACL'2014.

Devroye，L.（2013）.  Non-Uniform  Random  Variate  Generation.  SpringerLink:  Bücher.  Springer  New  York.

DiCarlo，J.  J.（2013）.  Mechanisms  underlying  visual  object  recognition:Humans  vs.  neurons  vs.  machines.  NIPS  Tutorial.

Dinh，L.，Krueger，D.，and  Bengio，Y.（2014）.  NICE:  Non-linear  independent  components  esti-mation.  arXiv:1410.8516.

Donahue，J.，Hendricks，L.  A.，Guadarrama，S.，Rohrbach，M.，Venugopalan，S.，Saenko，K.，and  Darrell，T.（2014）.  Long-term  recurrent  convolutional  networks  for  visual  recognition  and  description.  arXiv:1411.4389.

Donoho，D.  L.  and  Grimes，C.（2003）.  Hessian  eigenmaps:  new  locally  linear  embedding  tech-niques  for  high-dimensional  data.  Technical  Report  2003-08，Dept.  Statistics，Stanford  University.

Dosovitskiy，A.，Springenberg，J.  T.，and  Brox，T.（2015）.  Learning  to  generate  chairs  with  convolutional  neural  networks.  In  Proceedings  of  the  IEEE  Conference  on  Computer  Vision  and  Pattern  Recognition，pages  1538–1546.

Doya，K.（1993）.  Bifurcations  of  recurrent  neural  networks  in  gradient  descent  learning.  IEEE  Transactions  on  Neural  Networks，1，75–80.

Dreyfus，S.  E.（1962）.  The  numerical  solution  of  variational  problems.  Journal  of  Mathematical  Analysis  and  Applications，5（1），30–45.

Dreyfus，S.  E.（1973）.  The  computational  solution  of  optimal  control  problems  with  time  lag.  IEEE  Transactions  on  Automatic  Control，18（4），383–385.

Drucker，H.  and  LeCun，Y.（1992）.  Improving  generalisation  performance  using  double  back-propagation.  IEEE  Transactions  on  Neural  Networks，3（6），991–997.

Duchi，J.，Hazan，E.，and  Singer，Y.（2011）.  Adaptive  subgradient  methods  for  online  learning  and  stochastic  optimization.  Journal  of  Machine  Learning  Research.

Dudik，M.，Langford，J.，and  Li，L.（2011）.  Doubly  robust  policy  evaluation  and  learning.  In  Proceedings  of  the  28th  International  Conference  on  Machine  learning，ICML  '11.

Dugas，C.，Bengio，Y.，Bélisle，F.，and  Nadeau，C.（2001）.  Incorporating  second-order  functional  knowledge  for  better  option  pricing.  In  T.  Leen，T.  Dietterich，and  V.  Tresp，editors，Advances  in  Neural  Information  Processing  Systems  13（NIPS'00），pages  472–478.  MIT  Press.

Dziugaite，G.  K.，Roy，D.  M.，and  Ghahramani，Z.（2015）.  Training  generative  neural  networks  via  maximum  mean  discrepancy  optimization.  arXiv  preprint  arXiv:1505.03906.

El  Hihi，S.  and  Bengio，Y.（1996）.  Hierarchical  recurrent  neural  networks  for  long-term  depen-dencies.  In  NIPS  8.  MIT  Press.

Elkahky，A.  M.，Song，Y.，and  He，X.（2015）.  A  multi-view  deep  learning  approach  for  cross  domain  user  modeling  in  recommendation  systems.  In  Proceedings  of  the  24th  International  Conference  on  World  Wide  Web，pages  278–288.

Elman，J.  L.（1993）.  Learning  and  development  in  neural  networks:  The  importance  of  starting  small.  Cognition，48，781–799.

Erhan，D.，Manzagol，P.-A.，Bengio，Y.，Bengio，S.，and  Vincent，P.（2009）.  The  difficulty  of  training  deep  architectures  and  the  effect  of  unsupervised  pre-training.  In  AISTATS'2009，pages  153–160.

Erhan，D.，Bengio，Y.，Courville，A.，Manzagol，P.，Vincent，P.，and  Bengio，S.（2010）.  Why  does  unsupervised  pre-training  help  deep  learning?  J.  Machine  Learning  Res.

Fahlman，S.  E.，Hinton，G.  E.，and  Sejnowski，T.  J.（1983）.  Massively  parallel  architectures  for  AI:  NETL，thistle，and  Boltzmann  machines.  In  Proceedings  of  the  National  Conference  on  Artificial  Intelligence  AAAI-83.

Fang，H.，Gupta，S.，Iandola，F.，Srivastava，R.，Deng，L.，Dollár，P.，Gao，J.，He，X.，Mitchell，M.，Platt，J.  C.，Zitnick，C.  L.，and  Zweig，G.（2015）.  From  captions  to  visual  concepts  and  back.  arXiv:1411.4952.

Farabet，C.，LeCun，Y.，Kavukcuoglu，K.，Culurciello，E.，Martini，B.，Akselrod，P.，and  Talay，S.（2011）.  Large-scale  FPGA-based  convolutional  networks.  In  R.  Bekkerman，M.  Bilenko，and  J.  Langford，editors，Scaling  up  Machine  Learning:  Parallel  and  Distributed  Approaches.  Cambridge  University  Press.

Farabet，C.，Couprie，C.，Najman，L.，and  LeCun，Y.（2013）.  Learning  hierarchical  features  for  scene  labeling.  IEEE  Transactions  on  Pattern  Analysis  and  Machine  Intelligence，35（8），1915–1929.

Fei-Fei，L.，Fergus，R.，and  Perona，P.（2006）.  One-shot  learning  of  object  categories.  IEEE  Transactions  on  Pattern  Analysis  and  Machine  Intelligence，28（4），594–611.

Finn，C.，Tan，X.  Y.，Duan，Y.，Darrell，T.，Levine，S.，and  Abbeel，P.（2015）.  Learning  visual  feature  spaces  for  robotic  manipulation  with  deep  spatial  autoencoders.  arXiv  preprint  arXiv:1509.06113.

Fisher，R.  A.（1936）.  The  use  of  multiple  measurements  in  taxonomic  problems.  Annals  of  Eugenics，7，179–188.

Földiák，P.（1989）.  Adaptive  network  for  optimal  linear  feature  extraction.  In  International  Joint  Conference  on  Neural  Networks（IJCNN），volume  1，pages  401–405，Washington  1989.  IEEE，New  York.

Franzius，M.，Sprekeler，H.，and  Wiskott，L.（2007）.  Slowness  and  sparseness  lead  to  place，head-direction，and  spatial-view  cells.

Franzius，M.，Wilbert，N.，and  Wiskott，L.（2008）.  Invariant  object  recognition  with  slow  feature  analysis.  In  Proceedings  of  the  18th  international  conference  on  Artificial  Neural  Networks，Part  I，ICANN  '08，pages  961–970，Berlin，Heidelberg.  Springer-Verlag.

Frasconi，P.，Gori，M.，and  Sperduti，A.（1997）.  On  the  efficient  classification  of  data  structures  by  neural  networks.  In  Proc.  Int.  Joint  Conf.  on  Artificial  Intelligence.

Frasconi，P.，Gori，M.，and  Sperduti，A.（1998）.  A  general  framework  for  adaptive  processing  of  data  structures.  IEEE  Transactions  on  Neural  Networks，9（5），768–786.

Freund，Y.  and  Schapire，R.  E.（1996a）.  Experiments  with  a  new  boosting  algorithm.  In  Machine  Learning:  Proceedings  of  Thirteenth  International  Conference，pages  148–156，USA.  ACM.

Freund，Y.  and  Schapire，R.  E.（1996b）.  Game  theory，on-line  prediction  and  boosting.  In  Proceedings  of  the  Ninth  Annual  Conference  on  Computational  Learning  Theory，pages  325–332.

Frey，B.  J.（1998）.  Graphical  models  for  machine  learning  and  digital  communication.  MIT  Press.

Frey，B.  J.，Hinton，G.  E.，and  Dayan，P.（1996）.  Does  the  wake-sleep  algorithm  learn  good  density  estimators?  In  D.  Touretzky，M.  Mozer，and  M.  Hasselmo，editors，Advances  in  Neural  Information  Processing  Systems  8（NIPS'95），pages  661–670.  MIT  Press，Cambridge，MA.

Frobenius，G.（1908）.  Über  matrizen  aus  positiven  elementen，s.  B.  Preuss.  Akad.  Wiss.  Berlin，Germany.

Fukushima，K.（1975）.  Cognitron:  A  self-organizing  multilayered  neural  network.  Biological  Cybernetics，20，121–136.

Fukushima，K.（1980）.  Neocognitron:  A  self-organizing  neural  network  model  for  a  mechanism  of  pattern  recognition  unaffected  by  shift  in  position.  Biological  Cybernetics，36，193–202.

Gal，Y.  and  Ghahramani，Z.（2015）.  Bayesian  convolutional  neural  networks  with  Bernoulli  approximate  variational  inference.  arXiv  preprint  arXiv:1506.02158.

Gallinari，P.，LeCun，Y.，Thiria，S.，and  Fogelman-Soulie，F.（1987）.  Memoires  associatives  distribuees.  In  Proceedings  of  COGNITIVA  87，Paris，La  Villette.

Garcia-Duran，A.，Bordes，A.，Usunier，N.，and  Grandvalet，Y.（2015）.  Combining  two  and  three-way  embeddings  models  for  link  prediction  in  knowledge  bases.  arXiv  preprint  arXiv:1506.00999.

Garofolo，J.  S.，Lamel，L.  F.，Fisher，W.  M.，Fiscus，J.  G.，and  Pallett，D.  S.（1993）.  Darpa  timit  acoustic-phonetic  continous  speech  corpus  cd-rom.  nist  speech  disc  1-1.1.  NASA  STI/Recon  Technical  Report  N，93，27403.

Garson，J.（1900）.  The  metric  system  of  identification  of  criminals，as  used  in  Great  Britain  and  Ireland.  The  Journal  of  the  Anthropological  Institute  of  Great  Britain  and  Ireland，（2），177–227.

Gers，F.  A.，Schmidhuber，J.，and  Cummins，F.（2000）.  Learning  to  forget:  Continual  prediction  with  LSTM.  Neural  computation，12（10），2451–2471.

Ghahramani，Z.  and  Hinton，G.  E.（1996）.  The  EM  algorithm  for  mixtures  of  factor  analyzers.  Technical  Report  CRG-TR-96-1，Dpt.  of  Comp.  Sci.，Univ.  of  Toronto.

Gillick，D.，Brunk，C.，Vinyals，O.，and  Subramanya，A.（2015）.  Multilingual  language  processing  from  bytes.  arXiv  preprint  arXiv:1512.00103.

Girshick，R.，Donahue，J.，Darrell，T.，and  Malik，J.（2015）.  Region-based  convolutional  networks  for  accurate  object  detection  and  segmentation.

Giudice，M.  D.，Manera，V.，and  Keysers，C.（2009）.  Programmed  to  learn?  The  ontogeny  of  mirror  neurons.  Dev.  Sci.，12（2），350–363.

Glorot，X.  and  Bengio，Y.（2010）.  Understanding  the  difficulty  of  training  deep  feedforward  neural  networks.  In  AISTATS'2010.

Glorot，X.，Bordes，A.，and  Bengio，Y.（2011a）.  Deep  sparse  rectifier  neural  networks.  In  AISTATS'2011.

Glorot，X.，Bordes，A.，and  Bengio，Y.（2011b）.  Domain  adaptation  for  large-scale  sentiment  classification:  A  deep  learning  approach.  In  ICML'2011.

Glorot，X.，Bordes，A.，and  Bengio，Y.（2011c）.  Domain  adaptation  for  large-scale  sentiment  classification:  A  deep  learning  approach.  In  ICM（1b），pages  97–110.

Goldberger，J.，Roweis，S.，Hinton，G.  E.，and  Salakhutdinov，R.（2005）.  Neighbourhood  components  analysis.  In  L.  Saul，Y.  Weiss，and  L.  Bottou，editors，Advances  in  Neural  Information  Processing  Systems  17（NIPS'04）.  MIT  Press.

Gong，S.，McKenna，S.，and  Psarrou，A.（2000）.  Dynamic  Vision:  From  Images  to  Face  Recognition.  Imperial  College  Press.

Goodfellow，I.，Le，Q.，Saxe，A.，and  Ng，A.（2009）.  Measuring  invariances  in  deep  networks.  In  Y.  Bengio，D.  Schuurmans，C.  Williams，J.  Lafferty，and  A.  Culotta，editors，Advances  in  Neural  Information  Processing  Systems  22（NIPS'09），pages  646–654.

Goodfellow，I.，Koenig，N.，Muja，M.，Pantofaru，C.，Sorokin，A.，and  Takayama，L.（2010）.  Help  me  help  you:  Interfaces  for  personal  robots.  In  Proc.  of  Human  Robot  Interaction（HRI），Osaka，Japan.  ACM  Press，ACM  Press.

Goodfellow，I.，Mirza，M.，Xiao，D.，Courville，A.，and  Bengio，Y.（2014a）.  An  empirical  inves-tigation  of  catastrophic  forgetting  in  gradient-based  neural  networks.  In  ICLR'14.

Goodfellow，I.  J.（2010）.  Technical  report:Multidimensional，downsampled  convolution  for  autoencoders.  Technical  report，Université  de  Montréal.

Goodfellow，I.  J.（2014）.  On  distinguishability  criteria  for  estimating  generative  models.  In  International  Conference  on  Learning  Representations，Workshops  Track.

Goodfellow，I.  J.，Courville，A.，and  Bengio，Y.（2011）.  Spike-and-slab  sparse  coding  for  unsu-pervised  feature  discovery.  In  NIPS  Workshop  on  Challenges  in  Learning  Hierarchical  Models.

Goodfellow，I.  J.，Warde-Farley，D.，Mirza，M.，Courville，A.，and  Bengio，Y.（2013a）.  Maxout  networks.  In  ICML'2013.

Goodfellow，I.  J.，Warde-Farley，D.，Mirza，M.，Courville，A.，and  Bengio，Y.（2013b）.  Maxout  networks.  In  ICM（1c），pages  1319–1327.

Goodfellow，I.  J.，Warde-Farley，D.，Mirza，M.，Courville，A.，and  Bengio，Y.（2013c）.  Maxout  networks.  Technical  Report  arXiv:1302.4389，Université  de  Montréal.

Goodfellow，I.  J.，Mirza，M.，Courville，A.，and  Bengio，Y.（2013d）.  Multi-prediction  deep  Boltzmann  machines.  In  NIP（1）.
Goodfellow，I.  J.，Warde-Farley，D.，Lamblin，P.，Dumoulin，V.，Mirza，M.，Pascanu，R.，Bergstra，J.，Bastien，F.，and  Bengio，Y.（2013e）.  Pylearn2:  a  machine  learning  research  library.  arXiv  preprint  arXiv:1308.4214.

Goodfellow，I.  J.，Courville，A.，and  Bengio，Y.（2013f）.  Scaling  up  spike-and-slab  models  for  unsupervised  feature  learning.  IEEE  T.  PAMI，pages  1902–1914.

Goodfellow，I.  J.，Courville，A.，and  Bengio，Y.（2013g）.  Scaling  up  spike-and-slab  models  for  un-supervised  feature  learning.  IEEE  Transactions  on  Pattern  Analysis  and  Machine  Intelligence，35（8），1902–1914.

Goodfellow，I.  J.，Shlens，J.，and  Szegedy，C.（2014b）.  Explaining  and  harnessing  adversarial  examples.  CoRR，abs/1412.6572.

Goodfellow，I.  J.，Pouget-Abadie，J.，Mirza，M.，Xu，B.，Warde-Farley，D.，Ozair，S.，Courville，A.，and  Bengio，Y.（2014c）.  Generative  adversarial  networks.  In  NIPS'2014.

Goodfellow，I.  J.，Bulatov，Y.，Ibarz，J.，Arnoud，S.，and  Shet，V.（2014d）.  Multi-digit  number  recognition  from  Street  View  imagery  using  deep  convolutional  neural  networks.  In  International  Conference  on  Learning  Representations.

Goodfellow，I.  J.，Vinyals，O.，and  Saxe，A.  M.（2015）.  Qualitatively  characterizing  neural  network  optimization  problems.  In  International  Conference  on  Learning  Representations.

Goodman，J.（2001）.  Classes  for  fast  maximum  entropy  training.  In  International  Conference  on  Acoustics，Speech  and  Signal  Processing（ICASSP），Utah.

Gori，M.  and  Tesi，A.（1992）.  On  the  problem  of  local  minima  in  backpropagation.  IEEE  Transactions  on  Pattern  Analysis  and  Machine  Intelligence，PAMI-14（1），76–86.

Gosset，W.  S.（1908）.  The  probable  error  of  a  mean.  Biometrika，6（1），1–25.  Originally  published  under  the  pseudonym“Student”.

Gouws，S.，Bengio，Y.，and  Corrado，G.（2014）.  BilBOWA:  Fast  bilingual  distributed  representations  without  word  alignments.  Technical  report，arXiv:1410.2455.

Graf，H.  P.  and  Jackel，L.  D.（1989）.  Analog  electronic  neural  network  circuits.  Circuits  and  Devices  Magazine，IEEE，5（4），44–49.

Graves，A.（2011）.  Practical  variational  inference  for  neural  networks.  In  NIPS'2011.

Graves，A.（2012）.  Supervised  Sequence  Labelling  with  Recurrent  Neural  Networks.  Studies  in  Computational  Intelligence.  Springer.

Graves，A.（2013）.  Generating  sequences  with  recurrent  neural  networks.  Technical  report，arXiv:1308.0850.

Graves，A.  and  Jaitly，N.（2014）.  Towards  end-to-end  speech  recognition  with  recurrent  neural  networks.  In  ICML'2014.

Graves，A.  and  Schmidhuber，J.（2005）.  Framewise  phoneme  classification  with  bidirectional  LSTM  and  other  neural  network  architectures.  Neural  Networks，18（5），602–610.

Graves，A.  and  Schmidhuber，J.（2009）.  Offine  handwriting  recognition  with  multidimensional  recurrent  neural  networks.  In  D.  Koller，D.  Schuurmans，Y.  Bengio，and  L.  Bottou，editors，NIPS'2008，pages  545–552.

Graves，A.，Fernández，S.，Gomez，F.，and  Schmidhuber，J.（2006）.  Connectionist  temporal  classification:  Labelling  unsegmented  sequence  data  with  recurrent  neural  networks.  In  ICML'2006，pages  369–376，Pittsburgh，USA.

Graves，A.，Liwicki，M.，Bunke，H.，Schmidhuber，J.，and  Fernández，S.（2008）.  Unconstrained  on-line  handwriting  recognition  with  recurrent  neural  networks.  In  J.  Platt，D.  Koller，Y.  Singer，and  S.  Roweis，editors，NIPS'2007，pages  577–584.

Graves，A.，Liwicki，M.，Fernández，S.，Bertolami，R.，Bunke，H.，and  Schmidhuber，J.（2009）.  A  novel  connectionist  system  for  unconstrained  handwriting  recognition.  Pattern  Analysis  and  Machine  Intelligence，IEEE  Transactions  on，31（5），855–868.

Graves，A.，Mohamed，A.，and  Hinton，G.（2013）.  Speech  recognition  with  deep  recurrent  neural  networks.  In  ICASSP'2013，pages  6645–6649.

Graves，A.，Wayne，G.，and  Danihelka，I.（2014）.  Neural  Turing  machines.  arXiv:1410.5401.

Grefenstette，E.，Hermann，K.  M.，Suleyman，M.，and  Blunsom，P.（2015）.  Learning  to  transduce  with  unbounded  memory.  In  NIPS'2015.

Greff，K.，Srivastava，R.  K.，Koutník，J.，Steunebrink，B.  R.，and  Schmidhuber，J.（2015）.  LSTM:  a  search  space  odyssey.  arXiv  preprint  arXiv:1503.04069.

Gregor，K.  and  LeCun，Y.（2010a）.  Emergence  of  complex-like  cells  in  a  temporal  product  network  with  local  receptivefields.  Technical  report，arXiv:1006.0448.

Gregor，K.  and  LeCun，Y.（2010b）.  Learning  fast  approximations  of  sparse  coding.  In  L.  Bottou  and  M.  Littman，editors，Proceedings  of  the  Twenty-seventh  International  Conference  on  Machine  Learning（ICML-10）.  ACM.

Gregor，K.，Danihelka，I.，Mnih，A.，Blundell，C.，and  Wierstra，D.（2014）.  Deep  autoregressive  networks.  In  International  Conference  on  Machine  Learning（ICML'2014）.

Gregor，K.，Danihelka，I.，Graves，A.，and  Wierstra，D.（2015）.  DRAW:  A  recurrent  neural  network  for  image  generation.  arXiv  preprint  arXiv:1502.04623.

Gretton，A.，Borgwardt，K.  M.，Rasch，M.  J.，Schölkopf，B.，and  Smola，A.（2012）.  A  kernel  two-sample  test.  The  Journal  of  Machine  Learning  Research，13（1），723–773.

Guillaume  Desjardins，Karen  Simonyan，R.  P.  K.  K.（2015）.  Natural  neural  networks.  Technical  report，arXiv:1507.00210.

Gulcehre，C.  and  Bengio，Y.（2013）.  Knowledge  matters:  Importance  of  prior  information  for  optimization.  Technical  Report  arXiv:1301.4083，Universite  de  Montreal.

Guo，H.  and  Gelfand，S.  B.（1992）.  Classification  trees  with  neural  network  feature  extraction.  Neural  Networks，IEEE  Transactions  on，3（6），923–933.

Gupta，S.，Agrawal，A.，Gopalakrishnan，K.，and  Narayanan，P.（2015）.  Deep  learning  with  limited  numerical  precision.  CoRR，abs/1502.02551.

Gutmann，M.  and  Hyvarinen，A.（2010）.  Noise-contrastive  estimation:  A  new  estimation  princi-ple  for  unnormalized  statistical  models.  In  Proceedings  of  The  Thirteenth  International  Conference  on  Artificial  Intelligence  and  Statistics（AISTATS'10）.

Hadsell，R.，Sermanet，P.，Ben，J.，Erkan，A.，Han，J.，Muller，U.，and  LeCun，Y.（2007）.  Online  learning  for  offroad  robots:  Spatial  label  propagation  to  learn  long-range  traversability.  In  Proceedings  of  Robotics:  Science  and  Systems，Atlanta，GA，USA.

Hajnal，A.，Maass，W.，Pudlak，P.，Szegedy，M.，and  Turan，G.（1993）.  Threshold  circuits  of  bounded  depth.  J.  Comput.  System.  Sci.，46，129–154.

Håstad，J.（1986）.  Almost  optimal  lower  bounds  for  small  depth  circuits.  In  Proceedings  of  the  18th  annual  ACM  Symposium  on  Theory  of  Computing，pages  6–20，Berkeley，California.  ACM  Press.

Håstad，J.  and  Goldmann，M.（1991）.  On  the  power  of  small-depth  threshold  circuits.  Computational  Complexity，1，113–129.

Hastie，T.，Tibshirani，R.，and  Friedman，J.（2001）.  The  elements  of  statistical  learning:  data  mining，inference  and  prediction.  Springer  Series  in  Statistics.  Springer  Verlag.

He，K.，Zhang，X.，Ren，S.，and  Sun，J.（2015）.  Delving  deep  into  rectifiers:  Surpassing  human-level  performance  on  ImageNet  classification.  arXiv  preprint  arXiv:1502.01852.

Hebb，D.  O.（1949）.  The  Organization  of  Behavior.  Wiley，New  York.

Henaff，M.，Jarrett，K.，Kavukcuoglu，K.，and  LeCun，Y.（2011）.  Unsupervised  learning  of  sparse  features  for  scalable  audio  classification.  In  ISMIR'11.

Henderson，J.（2003）.  Inducing  history  representations  for  broad  coverage  statistical  parsing.  In  HLT-NAACL，pages  103–110.

Henderson，J.（2004）.  Discriminative  training  of  a  neural  network  statistical  parser.  In  Proceedings  of  the  42nd  Annual  Meeting  on  Association  for  Computational  Linguistics，page  95.

Henniges，M.，Puertas，G.，Bornschein，J.，Eggert，J.，and  Lücke，J.（2010）.  Binary  sparse  coding.  In  Latent  Variable  Analysis  and  Signal  Separation，pages  450–457.  Springer.

Herault，J.  and  Ans，B.（1984）.  Circuits  neuronaux  à  synapses  modifiables:  Décodage  de  messages  composites  par  apprentissage  non  supervisé.  Comptes  Rendus  de  l'Académie  des  Sciences，299（III-13），525–528.

Hinton，G.，Deng，L.，Dahl，G.  E.，Mohamed，A.，Jaitly，N.，Senior，A.，Vanhoucke，V.，Nguyen，P.，Sainath，T.，and  Kingsbury，B.（2012a）.  Deep  neural  networks  for  acoustic  modeling  in  speech  recognition.  IEEE  Signal  Processing  Magazine，29（6），82–97.

Hinton，G.，Vinyals，O.，and  Dean，J.（2015）.  Distilling  the  knowledge  in  a  neural  network.  arXiv  preprint  arXiv:1503.02531.

Hinton，G.  E.（1989）.  Connectionist  learning  procedures.  Artificial  Intelligence，40，185–234.

Hinton，G.  E.（1990）.  Mapping  part-whole  hierarchies  into  connectionist  networks.  Artificial  Intelligence，46（1），47–75.

Hinton，G.  E.（1999）.  Products  of  experts.  In  Proceedings  of  the  Ninth  International  Conference  on  Artificial  Neural  Networks（ICANN），volume  1，pages  1–6，Edinburgh，Scotland.  IEE.

Hinton，G.  E.（2000）.  Training  products  of  experts  by  minimizing  contrastive  divergence.  Technical  Report  GCNU  TR  2000-004，Gatsby  Unit，University  College  London.

Hinton，G.  E.（2006）.  To  recognize  shapes，first  learn  to  generate  images.  Technical  Report  UTML  TR  2006-003，University  of  Toronto.

Hinton，G.  E.（2007a）.  How  to  do  backpropagation  in  a  brain.  Invited  talk  at  the  NIPS'2007  Deep  Learning  Workshop.

Hinton，G.  E.（2007b）.  Learning  multiple  layers  of  representation.  Trends  in  cognitive  sciences，11（10），428–434.

Hinton，G.  E.（2010）.  A  practical  guide  to  training  restricted  Boltzmann  machines.  Technical  Report  UTML  TR  2010-003，Comp.  Sc.，University  of  Toronto.

Hinton，G.  E.（2012）.  Tutorial  on  deep  learning.  IPAM  Graduate  Summer  School:  Deep  Learning，Feature  Learning.

Hinton，G.  E.  and  Ghahramani，Z.（1997）.  Generative  models  for  discovering  sparse  distributed  representations.  Philosophical  Transactions  of  the  Royal  Society  of  London.

Hinton，G.  E.  and  McClelland，J.  L.（1988）.  Learning  representations  by  recirculation.  In  NIPS'1987，pages  358–366.

Hinton，G.  E.  and  Roweis，S.（2003）.  Stochastic  neighbor  embedding.  In  NIPS'2002.

Hinton，G.  E.  and  Salakhutdinov，R.（2006）.  Reducing  the  dimensionality  of  data  with  neural  networks.  Science，313（5786），504–507.

Hinton，G.  E.  and  Sejnowski，T.  J.（1986）.  Learning  and  relearning  in  Boltzmann  machines.  In  D.  E.  Rumelhart  and  J.  L.  McClelland，editors，Parallel  Distributed  Processing，volume  1，chapter  7，pages  282–317.  MIT  Press，Cambridge.

Hinton，G.  E.  and  Sejnowski，T.  J.（1999）.  Unsupervised  learning:  foundations  of  neural  computation.  MIT  press.

Hinton，G.  E.  and  Shallice，T.（1991）.  Lesioning  an  attractor  network:  investigations  of  acquired  dyslexia.  Psychological  review，98（1），74.

Hinton，G.  E.  and  Zemel，R.  S.（1994）.  Autoencoders，minimum  description  length，and  Helmholtz  free  energy.  In  NIPS'1993.

Hinton，G.  E.，Sejnowski，T.  J.，and  Ackley，D.  H.（1984a）.  Boltzmann  machines:  Constraint  satisfaction  networks  that  learn.  Technical  Report  TR-CMU-CS-84-119，Carnegie-Mellon  Uni-versity，Dept.  of  Computer  Science.

Hinton，G.  E.，Sejnowski，T.  J.，and  Ackley，D.  H.（1984b）.  Boltzmann  machines:  Constraint  satisfaction  networks  that  learn.  Technical  Report  TR-CMU-CS-84-119，Carnegie-Mellon  Uni-versity，Dept.  of  Computer  Science.

Hinton，G.  E.，McClelland，J.，and  Rumelhart，D.（1986）.  Distributed  representations.  In  D.  E.  Rumelhart  and  J.  L.  McClelland，editors，Parallel  Distributed  Processing:  Explorations  in  the  Microstructure  of  Cognition，volume  1，pages  77–109.  MIT  Press，Cambridge.

Hinton，G.  E.，Revow，M.，and  Dayan，P.（1995a）.  Recognizing  handwritten  digits  using  mixtures  of  linear  models.  In  G.  Tesauro，D.  Touretzky，and  T.  Leen，editors，Advances  in  Neural  Information  Processing  Systems  7（NIPS'94），pages  1015–1022.  MIT  Press，Cambridge，MA.

Hinton，G.  E.，Dayan，P.，Frey，B.  J.，and  Neal，R.  M.（1995b）.  The  wake-sleep  algorithm  for  unsupervised  neural  networks.  Science，268，1558–1161.

Hinton，G.  E.，Dayan，P.，and  Revow，M.（1997）.  Modelling  the  manifolds  of  images  of  hand-written  digits.  IEEE  Transactions  on  Neural  Networks，8，65–74.

Hinton，G.  E.，Welling，M.，Teh，Y.  W.，and  Osindero，S.（2001）.  A  new  view  of  ICA.  In  Proceedings  of  3rd  International  Conference  on  Independent  Component  Analysis  and  Blind  Signal  Separation（ICA'01），pages  746–751，San  Diego，CA.

Hinton，G.  E.，Osindero，S.，and  Teh，Y.（2006a）.  A  fast  learning  algorithm  for  deep  belief  nets.  Neural  Computation，18，1527–1554.

Hinton，G.  E.，Osindero，S.，and  Teh，Y.-W.（2006b）.  A  fast  learning  algorithm  for  deep  belief  nets.  Neural  Computation，18，1527–1554.

Hinton，G.  E.，Deng，L.，Yu，D.，Dahl，G.  E.，Mohamed，A.，Jaitly，N.，Senior，A.，Vanhoucke，V.，Nguyen，P.，Sainath，T.  N.，and  Kingsbury，B.（2012b）.  Deep  neural  networks  for  acoustic  modeling  in  speech  recognition:The  shared  views  of  four  research  groups.  IEEE  Signal  Process.  Mag.，29（6），82–97.

Hinton，G.  E.，Srivastava，N.，Krizhevsky，A.，Sutskever，I.，and  Salakhutdinov，R.（2012c）.  Improving  neural  networks  by  preventing  co-adaptation  of  feature  detectors.  Technical  report，arXiv:1207.0580.

Hinton，G.  E.，Srivastava，N.，Krizhevsky，A.，Sutskever，I.，and  Salakhutdinov，R.（2012d）.  Improving  neural  networks  by  preventing  co-adaptation  of  feature  detectors.  Technical  report，arXiv:1207.0580.

Hinton，G.  E.，Vinyals，O.，and  Dean，J.（2014）.  Dark  knowledge.  Invited  talk  at  the  BayLearn  Bay  Area  Machine  Learning  Symposium.

Hochreiter，S.（1991a）.  Untersuchungen  zu  dynamischen  neuronalen  Netzen.  Diploma  thesis，T.U.  München.

Hochreiter，S.（1991b）.  Untersuchungen  zu  dynamischen  neuronalen  Netzen.  Diploma  thesis，Institut  für  Informatik，Lehrstuhl  Prof.  Brauer，Technische  Universität  München.

Hochreiter，S.  and  Schmidhuber，J.（1995）.  Simplifying  neural  nets  by  discoveringflat  minima.  In  Advances  in  Neural  Information  Processing  Systems  7，pages  529–536.  MIT  Press.

Hochreiter，S.  and  Schmidhuber，J.（1997）.  Long  short-term  memory.  Neural  Computation，9（8），1735–1780.

Hochreiter，S.，Bengio，Y.，and  Frasconi，P.（2001）.  Gradientflow  in  recurrent  nets:  the  difficulty  of  learning  long-term  dependencies.  In  J.  Kolen  and  S.  Kremer，editors，Field  Guide  to  Dynamical  Recurrent  Networks.  IEEE  Press.

Holi，J.  L.  and  Hwang，J.-N.（1993）.  Finite  precision  error  analysis  of  neural  network  hardware  implementations.  Computers，IEEE  Transactions  on，42（3），281–290.

Holt，J.  L.  and  Baker，T.  E.（1991）.  Back  propagation  simulations  using  limited  precision  calculations.  In  Neural  Networks，1991.，IJCNN-91-Seattle  International  Joint  Conference  on，volume  2，pages  121–126.  IEEE.

Hornik，K.，Stinchcombe，M.，and  White，H.（1989）.  Multilayer  feedforward  networks  are  universal  approximators.  Neural  Networks，2，359–366.

Hornik，K.，Stinchcombe，M.，and  White，H.（1990）.  Universal  approximation  of  an  unknown  mapping  and  its  derivatives  using  multilayer  feedforward  networks.  Neural  networks，3（5），551–560.

Hsu，F.-H.（2002）.  Behind  Deep  Blue:  Building  the  Computer  That  Defeated  the  World  Chess  Champion.  Princeton  University  Press，Princeton，NJ，USA.

Huang，F.  and  Ogata，Y.（2002）.  Generalized  pseudo-likelihood  estimates  for  Markov  random  fields  on  lattice.  Annals  of  the  Institute  of  Statistical  Mathematics，54（1），1–18.

Huang，P.-S.，He，X.，Gao，J.，Deng，L.，Acero，A.，and  Heck，L.（2013）.  Learning  deep  structured  semantic  models  for  web  search  using  clickthrough  data.  In  Proceedings  of  the  22nd  ACM  international  conference  on  Conference  on  information  &  knowledge  management，pages  2333–2338.  ACM.

Hubel，D.  and  Wiesel，T.（1968）.  Receptivefields  and  functional  architecture  of  monkey  striate  cortex.  Journal  of  Physiology（London），195，215–243.

Hubel，D.  H.  and  Wiesel，T.  N.（1959）.  Receptivefields  of  single  neurons  in  the  cat's  striate  cortex.  Journal  of  Physiology，148，574–591.

Hubel，D.  H.  and  Wiesel，T.  N.（1962）.  Receptivefields，binocular  interaction，and  functional  architecture  in  the  cat's  visual  cortex.  Journal  of  Physiology（London），160，106–154.

Huszar，F.（2015）.  How（not）  to  train  your  generative  model:  schedule  sampling，likelihood，adversary?  arXiv:1511.05101.

Hutter，F.，Hoos，H.，and  Leyton-Brown，K.（2011）.  Sequential  model-based  optimization  for  general  algorithm  configuration.  In  LION-5.  Extended  version  as  UBC  Tech  report  TR-2010-10.

Hyotyniemi，H.（1996）.  Turing  machines  are  recurrent  neural  networks.  In  STeP'96，pages  13–24.

Hyvärinen，A.（1999）.  Survey  on  independent  component  analysis.  Neural  Computing  Surveys，2，94–128.

Hyvärinen，A.（2005a）.  Estimation  of  non-normalized  statistical  models  using  score  matching.  Journal  of  Machine  Learning  Research，6，695–709.

Hyvärinen，A.（2005b）.  Estimation  of  non-normalized  statistical  models  using  score  matching.  J.  Machine  Learning  Res.，6.

Hyvärinen，A.（2007a）.  Connections  between  score  matching，contrastive  divergence，and  pseu-dolikelihood  for  continuous-valued  variables.  IEEE  Transactions  on  Neural  Networks，18，1529–1531.

Hyvärinen，A.（2007b）.  Some  extensions  of  score  matching.  Computational  Statistics  and  Data  Analysis，51，2499–2512.

Hyvärinen，A.  and  Hoyer，P.  O.（1999）.  Emergence  of  topography  and  complex  cell  properties  from  natural  images  using  extensions  of  ica.  In  NIPS，pages  827–833.

Hyvärinen，A.  and  Pajunen，P.（1999）.  Nonlinear  independent  component  analysis:  Existence  and  uniqueness  results.  Neural  Networks，12（3），429–439.

Hyvärinen，A.，Karhunen，J.，and  Oja，E.（2001a）.  Independent  Component  Analysis.  Wiley-Interscience.

Hyvärinen，A.，Hoyer，P.  O.，and  Inki，M.  O.（2001b）.  Topographic  independent  component  analysis.  Neural  Computation，13（7），1527–1558.

Hyvärinen，A.，Hurri，J.，and  Hoyer，P.  O.（2009）.  Natural  Image  Statistics:  A  probabilistic  approach  to  early  computational  vision.  Springer-Verlag.

Iba，Y.（2001）.  Extended  ensemble  Monte  Carlo.  International  Journal  of  Modern  Physics，C12，623–656.

Inayoshi，H.  and  Kurita，T.（2005）.  Improved  generalization  by  adding  both  auto-association  and  hidden-layer  noise  to  neural-network-based-classifiers.  IEEE  Workshop  on  Machine  Learning  for  Signal  Processing，pages  141–146.

Ioffe，S.  and  Szegedy，C.（2015）.  Batch  normalization:  Accelerating  deep  network  training  by  reducing  internal  covariate  shift.

Jacobs，R.  A.（1988）.  Increased  rates  of  convergence  through  learning  rate  adaptation.  Neural  networks，1（4），295–307.

Jacobs，R.  A.，Jordan，M.  I.，Nowlan，S.  J.，and  Hinton，G.  E.（1991）.  Adaptive  mixtures  of  local  experts.  Neural  Computation，3，79–87.

Jaeger，H.（2003）.  Adaptive  nonlinear  system  identification  with  echo  state  networks.  In  Advances  in  Neural  Information  Processing  Systems  15.

Jaeger，H.（2007a）.  Discovering  multiscale  dynamical  features  with  hierarchical  echo  state  networks.  Technical  report，Jacobs  University.

Jaeger，H.（2007b）.  Echo  state  network.  Scholarpedia，2（9），2330.

Jaeger，H.（2012）.  Long  short-term  memory  in  echo  state  networks:  Details  of  a  simulation  study.  Technical  report，Technical  report，Jacobs  University  Bremen.

Jaeger，H.  and  Haas，H.（2004）.  Harnessing  nonlinearity:  Predicting  chaotic  systems  and  saving  energy  in  wireless  communication.  Science，304（5667），78–80.

Jaeger，H.，Lukosevicius，M.，Popovici，D.，and  Siewert，U.（2007）.  Optimization  and  applications  of  echo  state  networks  with  leaky-integrator  neurons.  Neural  Networks，20（3），335–352.

Jain，V.，Murray，J.  F.，Roth，F.，Turaga，S.，Zhigulin，V.，Briggman，K.  L.，Helmstaedter，M.  N.，Denk，W.，and  Seung，H.  S.（2007）.  Supervised  learning  of  image  restoration  with  convolutional  networks.  In  Computer  Vision，2007.  ICCV  2007.  IEEE  11th  International  Conference  on，pages  1–8.  IEEE.

Jaitly，N.  and  Hinton，G.（2011）.  Learning  a  better  representation  of  speech  soundwaves  using  restricted  Boltzmann  machines.  In  Acoustics，Speech  and  Signal  Processing（ICASSP），2011  IEEE  International  Conference  on，pages  5884–5887.  IEEE.

Jaitly，N.  and  Hinton，G.  E.（2013）.  Vocal  tract  length  perturbation（VTLP）  improves  speech  recognition.  In  ICML'2013.

Jarrett，K.，Kavukcuoglu，K.，Ranzato，M.，and  LeCun，Y.（2009a）.  What  is  the  best  multi-stage  architecture  for  object  recognition?  In  Proc.  International  Conference  on  Computer  Vision（ICCV'09），pages  2146–2153.  IEEE.

Jarrett，K.，Kavukcuoglu，K.，Ranzato，M.，and  LeCun，Y.（2009b）.  What  is  the  best  multi-stage  architecture  for  object  recognition?  In  ICCV'09.

Jarzynski，C.（1997）.  Nonequilibrium  equality  for  free  energy  differences.  Phys.  Rev.  Lett.，78，2690–2693.

Jaynes，E.  T.（2003）.  Probability  Theory:  The  Logic  of  Science.  Cambridge  University  Press.

Jean，S.，Cho，K.，Memisevic，R.，and  Bengio，Y.（2014）.  On  using  very  large  target  vocabulary  for  neural  machine  translation.  arXiv:1412.2007.

Jelinek，F.  and  Mercer，R.  L.（1980）.  Interpolated  estimation  of  Markov  source  parameters  from  sparse  data.  In  E.  S.  Gelsema  and  L.  N.  Kanal，editors，Pattern  Recognition  in  Practice.  North-Holland，Amsterdam.

Jia，Y.（2013）.  Caffe:An  open  source  convolutional  architecture  for  fast  feature  embedding.  http://caffe.berkeleyvision.org/.

Jia，Y.，Huang，C.，and  Darrell，T.（2012）.  Beyond  spatial  pyramids:  Receptivefield  learning  for  pooled  image  features.  In  Computer  Vision  and  Pattern  Recognition（CVPR），2012  IEEE  Conference  on，pages  3370–3377.  IEEE.

Jim，K.-C.，Giles，C.  L.，and  Horne，B.  G.（1996）.  An  analysis  of  noise  in  recurrent  neural  networks:  convergence  and  generalization.  IEEE  Transactions  on  Neural  Networks，7（6），1424–1438.

Jordan，M.  I.（1998）.  Learning  in  Graphical  Models.  Kluwer，Dordrecht，Netherlands.

Joulin，A.  and  Mikolov，T.（2015）.  Inferring  algorithmic  patterns  with  stack-augmented  recurrent  nets.  arXiv  preprint  arXiv:1503.01007.

Jozefowicz，R.，Zaremba，W.，and  Sutskever，I.（2015）.  An  empirical  evaluation  of  recurrent  network  architectures.  In  ICML'2015.

Judd，J.  S.（1989）.  Neural  Network  Design  and  the  Complexity  of  Learning.  MIT  press.

Jutten，C.  and  Herault，J.（1991）.  Blind  separation  of  sources，part  I:  an  adaptive  algorithm  based  on  neuromimetic  architecture.  Signal  Processing，24，1–10.

Kahou，S.  E.，Pal，C.，Bouthillier，X.，Froumenty，P.，Gülçehre，c.，Memisevic，R.，Vincent，P.，Courville，A.，Bengio，Y.，Ferrari，R.  C.，Mirza，M.，Jean，S.，Carrier，P.  L.，Dauphin，Y.，Boulanger-Lewandowski，N.，Aggarwal，A.，Zumer，J.，Lamblin，P.，Raymond，J.-P.，Des-jardins，G.，Pascanu，R.，Warde-Farley，D.，Torabi，A.，Sharma，A.，Bengio，E.，Côté，M.，Konda，K.  R.，and  Wu，Z.（2013）.  Combining  modality  specific  deep  neural  networks  for  emotion  recognition  in  video.  In  Proceedings  of  the  15th  ACM  on  International  Conference  on  Multimodal  Interaction.

Kalchbrenner，N.  and  Blunsom，P.（2013）.  Recurrent  continuous  translation  models.  In  EMNLP'2013.

Kalchbrenner，N.，Danihelka，I.，and  Graves，A.（2015）.  Grid  long  short-term  memory.  arXiv  preprint  arXiv:1507.01526.

Kamyshanska，H.  and  Memisevic，R.（2015）.  The  potential  energy  of  an  autoencoder.  IEEE  Transactions  on  Pattern  Analysis  and  Machine  Intelligence.

Karpathy，A.  and  Li，F.-F.（2015）.  Deep  visual-semantic  alignments  for  generating  image  de-scriptions.  In  CVPR'2015.  arXiv:1412.2306.

Karpathy，A.，Toderici，G.，Shetty，S.，Leung，T.，Sukthankar，R.，and  Fei-Fei，L.（2014）.  Large-scale  video  classification  with  convolutional  neural  networks.  In  CVPR.

Karush，W.（1939）.  Minima  of  Functions  of  Several  Variables  with  Inequalities  as  Side  Constraints.  Master's  thesis，Dept.  of  Mathematics，Univ.  of  Chicago.

Katz，S.  M.（1987）.  Estimation  of  probabilities  from  sparse  data  for  the  language  model  compo-nent  of  a  speech  recognizer.  IEEE  Transactions  on  Acoustics，Speech，and  Signal  Processing，ASSP-35（3），400–401.

Kavukcuoglu，K.，Ranzato，M.，and  LeCun，Y.（2008）.  Fast  inference  in  sparse  coding  algorithms  with  applications  to  object  recognition.  Technical  report，Computational  and  Biological  Learn-ing  Lab，Courant  Institute，NYU.  Tech  Report  CBLL-TR-2008-12-01.

Kavukcuoglu，K.，Ranzato，M.-A.，Fergus，R.，and  LeCun，Y.（2009）.  Learning  invariant  features  through  topographicfilter  maps.  In  CVPR'2009.

Kavukcuoglu，K.，Sermanet，P.，Boureau，Y.-L.，Gregor，K.，Mathieu，M.，and  LeCun，Y.（2010）.  Learning  convolutional  feature  hierarchies  for  visual  recognition.  In  NIPS'2010.

Kelley，H.  J.（1960）.  Gradient  theory  of  optimalflight  paths.  ARS  Journal，30（10），947–954.

Khan，F.，Zhu，X.，and  Mutlu，B.（2011）.  How  do  humans  teach:  On  curriculum  learning  and  teaching  dimension.  In  Advances  in  Neural  Information  Processing  Systems  24（NIPS'11），pages  1449–1457.

Kim，S.  K.，McAfee，L.  C.，McMahon，P.  L.，and  Olukotun，K.（2009）.  A  highly  scalable  restricted  Boltzmann  machine  FPGA  implementation.  In  Field  Programmable  Logic  and  Applications，2009.  FPL  2009.  International  Conference  on，pages  367–372.  IEEE.

Kindermann，R.（1980）.  Markov  Random  Fields  and  Their  Applications（Contemporary  Mathe-matics；V.  1）.  American  Mathematical  Society.

Kingma，D.  and  Ba，J.（2014）.  Adam:  A  method  for  stochastic  optimization.  arXiv  preprint  arXiv:1412.6980.

Kingma，D.  and  LeCun，Y.（2010a）.  Regularized  estimation  of  image  statistics  by  score  matching.  In  NIPS'2010.

Kingma，D.  and  LeCun，Y.（2010b）.  Regularized  estimation  of  image  statistics  by  score  matching.  In  J.  Lafferty，C.  K.  I.  Williams，J.  Shawe-Taylor，R.  Zemel，and  A.  Culotta，editors，Advances  in  Neural  Information  Processing  Systems  23，pages  1126–1134.

Kingma，D.，Rezende，D.，Mohamed，S.，and  Welling，M.（2014）.  Semi-supervised  learning  with  deep  generative  models.  In  NIPS'2014.

Kingma，D.  P.（2013）.  Fast  gradient-based  inference  with  continuous  latent  variable  models  in  auxiliary  form.  Technical  report，arxiv:1306.0733.

Kingma，D.  P.  and  Welling，M.（2014a）.  Auto-encoding  variational  bayes.  In  Proceedings  of  the  International  Conference  on  Learning  Representations（ICLR）.

Kingma，D.  P.  and  Welling，M.（2014b）.  Efficient  gradient-based  inference  through  transforma-tions  between  bayes  nets  and  neural  nets.  Technical  report，arxiv:1402.0480.

Kirkpatrick，S.，Jr.，C.  D.  G.，and  Vecchi，M.  P.（1983）.  Optimization  by  simulated  annealing.  Science，220，671–680.

Kiros，R.，Salakhutdinov，R.，and  Zemel，R.（2014a）.  Multimodal  neural  language  models.  In  ICML'2014.

Kiros，R.，Salakhutdinov，R.，and  Zemel，R.（2014b）.  Unifying  visual-semantic  embeddings  with  multimodal  neural  language  models.  arXiv:1411.2539  ［cs.LG］.

Klementiev，A.，Titov，I.，and  Bhattarai，B.（2012）.  Inducing  crosslingual  distributed  representations  of  words.  In  Proceedings  of  COLING  2012.

Knowles-Barley，S.，Jones，T.  R.，Morgan，J.，Lee，D.，Kasthuri，N.，Lichtman，J.  W.，and  Pfister，H.（2014）.  Deep  learning  for  the  connectome.  GPU  Technology  Conference.

Koller，D.  and  Friedman，N.（2009）.  Probabilistic  Graphical  Models:  Principles  and  Techniques.  MIT  Press.

Konig，Y.，Bourlard，H.，and  Morgan，N.（1996）.  REMAP:Recursive  estimation  and  maxi-mization  of  a  posteriori  probabilities–application  to  transition-based  connectionist  speech  recognition.  In  D.  Touretzky，M.  Mozer，and  M.  Hasselmo，editors，Advances  in  Neural  Information  Processing  Systems  8（NIPS'95）.  MIT  Press，Cambridge，MA.

Koren，Y.（2009）.  The  BellKor  solution  to  the  Netflix  grand  prize.

Kotzias，D.，Denil，M.，de  Freitas，N.，and  Smyth，P.（2015）.  From  group  to  individual  labels  using  deep  features.  In  ACM  SIGKDD.

Koutnik，J.，Greff，K.，Gomez，F.，and  Schmidhuber，J.（2014）.  A  clockwork  RNN.  In  ICML'2014.

Kočiský，T.，Hermann，K.  M.，and  Blunsom，P.（2014）.  Learning  Bilingual  Word  Representations  by  Marginalizing  Alignments.  In  Proceedings  of  ACL.

Krause，O.，Fischer，A.，Glasmachers，T.，and  Igel，C.（2013）.  Approximation  properties  of  DBNs  with  binary  hidden  units  and  real-valued  visible  units.  In  ICML'2013.

Krizhevsky，A.（2010）.  Convolutional  deep  belief  networks  on  CIFAR-10.  Technical  report，Uni-versity  of  Toronto.  Unpublished  Manuscript:  http://cs.utoronto.ca/kriz/conv-cifar10-aug2010.pdf.

Krizhevsky，A.  and  Hinton，G.（2009）.  Learning  multiple  layers  of  features  from  tiny  images.  Technical  report，University  of  Toronto.

Krizhevsky，A.  and  Hinton，G.  E.（2011）.  Using  very  deep  autoencoders  for  content-based  image  retrieval.  In  ESANN.

Krizhevsky，A.，Sutskever，I.，and  Hinton，G.（2012a）.  ImageNet  classification  with  deep  convo-lutional  neural  networks.  In  NIPS'2012.

Krizhevsky，A.，Sutskever，I.，and  Hinton，G.（2012b）.  ImageNet  classification  with  deep  convolutional  neural  networks.  In  Advances  in  Neural  Information  Processing  Systems  25（NIPS'2012）.

Krueger，K.  A.  and  Dayan，P.（2009）.  Flexible  shaping:  how  learning  in  small  steps  helps.  Cognition，110，380–394.

Kuhn，H.  W.  and  Tucker，A.  W.（1951）.  Nonlinear  programming.  In  Proceedings  of  the  Sec-ond  Berkeley  Symposium  on  Mathematical  Statistics  and  Probability，pages  481–492，Berkeley，Calif.  University  of  California  Press.

Kumar，A.，Irsoy，O.，Ondruska，P.，Iyyer，M.，Bradbury，J.，Gulrajani，I.，and  Socher，R.（2015a）.  Ask  me  anything:  Dynamic  memory  networks  for  natural  language  processing.  Technical  report，arXiv:1506.07285.

Kumar，A.，Irsoy，O.，Su，J.，Bradbury，J.，English，R.，Pierce，B.，Ondruska，P.，Iyyer，M.，Gulrajani，I.，and  Socher，R.（2015b）.  Ask  me  anything:  Dynamic  memory  networks  for  natural  language  processing.  arXiv:1506.07285.

Kumar，M.  P.，Packer，B.，and  Koller，D.（2010）.  Self-paced  learning  for  latent  variable  models.  In  J.  Lafferty，C.  K.  I.  Williams，J.  Shawe-Taylor，R.  Zemel，and  A.  Culotta，editors，Advances  in  Neural  Information  Processing  Systems  23，pages  1189–1197.

Lang，K.  J.  and  Hinton，G.  E.（1988）.  The  development  of  the  time-delay  neural  network  archi-tecture  for  speech  recognition.  Technical  Report  CMU-CS-88-152，Carnegie-Mellon  University.
Lang，K.  J.，Waibel，A.  H.，and  Hinton，G.  E.（1990）.  A  time-delay  neural  network  architecture  for  isolated  word  recognition.  Neural  networks，3（1），23–43.

Langford，J.  and  Zhang，T.（2008）.  The  epoch-greedy  algorithm  for  contextual  multi-armed  bandits.  In  NIPS'2008，pages  1096–1103.

Lappalainen，H.，Giannakopoulos，X.，Honkela，A.，and  Karhunen，J.（2000）.  Nonlinear  independent  component  analysis  using  ensemble  learning:  Experiments  and  discussion.  In  Proc.  ICA.  Citeseer.

Larochelle，H.  and  Bengio，Y.（2008a）.  Classification  using  discriminative  restricted  Boltzmann  machines.  In  ICML'2008.

Larochelle，H.  and  Bengio，Y.（2008b）.  Classification  using  discriminative  restricted  Boltzmann  machines.  In  ICM（1a），pages  536–543.

Larochelle，H.  and  Hinton，G.  E.（2010）.  Learning  to  combine  foveal  glimpses  with  a  third-order  Boltzmann  machine.  In  Advances  in  Neural  Information  Processing  Systems  23，pages  1243–1251.

Larochelle，H.  and  Murray，I.（2011）.  The  Neural  Autoregressive  Distribution  Estimator.  In  AISTATS'2011.

Larochelle，H.，Erhan，D.，and  Bengio，Y.（2008）.  Zero-data  learning  of  new  tasks.  In  AAAI  Conference  on  Artificial  Intelligence.

Larochelle，H.，Bengio，Y.，Louradour，J.，and  Lamblin，P.（2009）.  Exploring  strategies  for  training  deep  neural  networks.  In  JML（1），pages  1–40.

Lasserre，J.  A.，Bishop，C.  M.，and  Minka，T.  P.（2006）.  Principled  hybrids  of  generative  and  discriminative  models.  In  Proceedings  of  the  Computer  Vision  and  Pattern  Recognition  Conference（CVPR'06），pages  87–94，Washington，DC，USA.  IEEE  Computer  Society.

Le，Q.，Ngiam，J.，Chen，Z.，hao  Chia，D.  J.，Koh，P.  W.，and  Ng，A.（2010）.  Tiled  convolutional  neural  networks.  In  J.  Lafferty，C.  K.  I.  Williams，J.  Shawe-Taylor，R.  Zemel，and  A.  Culotta，editors，Advances  in  Neural  Information  Processing  Systems  23（NIPS'10），pages  1279–1287.

Le，Q.，Ngiam，J.，Coates，A.，Lahiri，A.，Prochnow，B.，and  Ng，A.（2011）.  On  optimization  methods  for  deep  learning.  In  Proc.  ICML'2011.  ACM.

Le，Q.，Ranzato，M.，Monga，R.，Devin，M.，Corrado，G.，Chen，K.，Dean，J.，and  Ng，A.（2012）.  Building  high-level  features  using  large  scale  unsupervised  learning.  In  ICML'2012.

Le  Roux，N.  and  Bengio，Y.（2008）.  Representational  power  of  restricted  Boltzmann  machines  and  deep  belief  networks.  Neural  Computation，20（6），1631–1649.

Le  Roux，N.  and  Bengio，Y.（2010）.  Deep  belief  networks  are  compact  universal  approximators.  Neural  Computation，22（8），2192–2207.

LeCun，Y.（1985）.  Une  procédure  d'apprentissage  pour  Réseau  à  seuil  assymétrique.  In  Cognitiva  85:  A  la  Frontière  de  l'Intelligence  Artificielle，des  Sciences  de  la  Connaissance  et  des  Neurosciences，pages  599–604，Paris  1985.  CESTA，Paris.

LeCun，Y.（1986）.  Learning  processes  in  an  asymmetric  threshold  network.  In  E.  Bienenstock，F.  Fogelman-Soulié，and  G.  Weisbuch，editors，Disordered  Systems  and  Biological  Organization，pages  233–240.  Springer-Verlag，Berlin，Les  Houches  1985.

LeCun，Y.（1987）.  Modèles  connexionistes  de  l'apprentissage.  Ph.D.  thesis，Université  de  Paris  VI.

LeCun，Y.（1989）.  Generalization  and  network  design  strategies.  Technical  Report  CRG-TR-89-4，University  of  Toronto.

LeCun，Y.，Jackel，L.  D.，Boser，B.，Denker，J.  S.，Graf，H.  P.，Guyon，I.，Henderson，D.，Howard，R.  E.，and  Hubbard，W.（1989）.  Handwritten  digit  recognition:  Applications  of  neural  network  chips  and  automatic  learning.  IEEE  Communications  Magazine，27（11），41–46.

LeCun，Y.，Bottou，L.，Orr，G.  B.，and  Müller，K.-R.（1998a）.  Efficient  backprop.  In  Neural  Networks，Tricks  of  the  Trade，Lecture  Notes  in  Computer  Science  LNCS  1524.  Springer  Verlag.

LeCun，Y.，Bottou，L.，Orr，G.  B.，and  Müller，K.（1998b）.  Efficient  backprop.  In  Neural  Networks，Tricks  of  the  Trade.

LeCun，Y.，Bottou，L.，Bengio，Y.，and  Haffner，P.（1998c）.  Gradient  based  learning  applied  to  document  recognition.  Proc.  IEEE.

LeCun，Y.，Kavukcuoglu，K.，and  Farabet，C.（2010）.  Convolutional  networks  and  applications  in  vision.  In  Circuits  and  Systems（ISCAS），Proceedings  of  2010  IEEE  International  Symposium  on，pages  253–256.  IEEE.

L'Ecuyer，P.（1994）.  Efficiency  improvement  and  variance  reduction.  In  Proceedings  of  the  1994  Winter  Simulation  Conference，pages  122–132.

Lee，C.-Y.，Xie，S.，Gallagher，P.，Zhang，Z.，and  Tu，Z.（2014）.  Deeply-supervised  nets.  arXiv  preprint  arXiv:1409.5185.

Lee，H.，Battle，A.，Raina，R.，and  Ng，A.（2007）.  Efficient  sparse  coding  algorithms.  In  B.  Schölkopf，J.  Platt，and  T.  Hoffman，editors，Advances  in  Neural  Information  Processing  Systems  19（NIPS'06），pages  801–808.  MIT  Press.

Lee，H.，Ekanadham，C.，and  Ng，A.（2008）.  Sparse  deep  belief  net  model  for  visual  area  V2.  In  NIPS'07.

Lee，H.，Grosse，R.，Ranganath，R.，and  Ng，A.  Y.（2009）.  Convolutional  deep  belief  net-works  for  scalable  unsupervised  learning  of  hierarchical  representations.  In  L.  Bottou  and  M.  Littman，editors，Proceedings  of  the  Twenty-sixth  International  Conference  on  Machine  Learning（ICML'09）.  ACM，Montreal，Canada.

Lee，Y.  J.  and  Grauman，K.（2011）.  Learning  the  easy  thingsfirst:  self-paced  visual  category  discovery.  In  CVPR'2011.

Leibniz，G.  W.（1676）.  Memoir  using  the  chain  rule.（Cited  in  TMME  7:2&3  p  321-332，2010）.

Lenat，D.  B.  and  Guha，R.  V.（1989）.  Building  large  knowledge-based  systems；representation  and  inference  in  the  Cyc  project.  Addison-Wesley  Longman  Publishing  Co.，Inc.

Leshno，M.，Lin，V.  Y.，Pinkus，A.，and  Schocken，S.（1993）.  Multilayer  feedforward  networks  with  a  nonpolynomial  activation  function  can  approximate  any  function.  Neural  Networks，6，861–867.

Levenberg，K.（1944）.  A  method  for  the  solution  of  certain  non-linear  problems  in  least  squares.  Quarterly  Journal  of  Applied  Mathematics，II（2），164–168.

L'Hôpital，G.  F.  A.（1696）.  Analyse  des  infiniment  petits，pour  l'intelligence  des  lignes  courbes.  Paris:  L'Imprimerie  Royale.

Li，Y.，Swersky，K.，and  Zemel，R.  S.（2015）.  Generative  moment  matching  networks.  CoRR，abs/1502.02761.

Lin，T.，Horne，B.  G.，Tino，P.，and  Giles，C.  L.（1996）.  Learning  long-term  dependencies  is  not  as  difficult  with  NARX  recurrent  neural  networks.  IEEE  Transactions  on  Neural  Networks，7（6），1329–1338.

Lin，Y.，Liu，Z.，Sun，M.，Liu，Y.，and  Zhu，X.（2015）.  Learning  entity  and  relation  embeddings  for  knowledge  graph  completion.  In  Proc.  AAAI'15.

Linde，N.（1992）.  The  machine  that  changed  the  world，episode  3.  Documentary  miniseries.

Lindsey，C.  and  Lindblad，T.（1994）.  Review  of  hardware  neural  networks:  a  user's  perspective.  In  Proc.  Third  Workshop  on  Neural  Networks:  From  Biology  to  High  Energy  Physics，pages  195–202，Isola  d'Elba，Italy.

Linnainmaa，S.（1976）.  Taylor  expansion  of  the  accumulated  rounding  error.  BIT  Numerical  Mathematics，16（2），146–160.

LISA（2008）.  Deep  learning  tutorials:Restricted  Boltzmann  machines.  Technical  report，LISA  Lab，Université  de  Montréal.

Long，P.  M.  and  Servedio，R.  A.（2010）.  Restricted  Boltzmann  machines  are  hard  to  approximately  evaluate  or  simulate.  In  Proceedings  of  the  27th  International  Conference  on  Machine  Learning（ICML'10）.

Lotter，W.，Kreiman，G.，and  Cox，D.（2015）.  Unsupervised  learning  of  visual  structure  using  predictive  generative  networks.  arXiv  preprint  arXiv:1511.06380.

Lovelace，A.（1842）.  Notes  upon  L.  F.  Menabrea's“Sketch  of  the  Analytical  Engine  invented  by  Charles  Babbage”.

Lu，L.，Zhang，X.，Cho，K.，and  Renals，S.（2015）.  A  study  of  the  recurrent  neural  network  encoder-decoder  for  large  vocabulary  speech  recognition.  In  Proc.  Interspeech.

Lu，T.，Pál，D.，and  Pál，M.（2010）.  Contextual  multi-armed  bandits.  In  International  Conference  on  Artificial  Intelligence  and  Statistics，pages  485–492.

Luenberger，D.  G.（1984）.  Linear  and  Nonlinear  Programming.  Addison  Wesley.

Lukoševičius，M.  and  Jaeger，H.（2009）.  Reservoir  computing  approaches  to  recurrent  neural  network  training.  Computer  Science  Review，3（3），127–149.

Luo，H.，Shen，R.，Niu，C.，and  Ullrich，C.（2011）.  Learning  class-relevant  features  and  class-irrelevant  features  via  a  hybrid  third-order  RBM.  In  International  Conference  on  Artificial  Intelligence  and  Statistics，pages  470–478.

Luo，H.，Carrier，P.  L.，Courville，A.，and  Bengio，Y.（2013）.  Texture  modeling  with  convolutional  spike-and-slab  RBMs  and  deep  extensions.  In  AISTATS'2013.

Lyu，S.（2009）.  Interpretation  and  generalization  of  score  matching.  In  Proceedings  of  the  Twenty-fifth  Conference  in  Uncertainty  in  Artificial  Intelligence（UAI'09）.

Ma，J.，Sheridan，R.  P.，Liaw，A.，Dahl，G.  E.，and  Svetnik，V.（2015）.  Deep  neural  nets  as  a  method  for  quantitative  structure–activity  relationships.  J.  Chemical  information  and  modeling.

Maas，A.  L.，Hannun，A.  Y.，and  Ng，A.  Y.（2013）.  Rectifier  nonlinearities  improve  neural  network  acoustic  models.  In  ICML  Workshop  on  Deep  Learning  for  Audio，Speech，and  Language  Processing.

Maass，W.（1992）.  Bounds  for  the  computational  power  and  learning  complexity  of  analog  neural  nets（extended  abstract）.  In  Proc.  of  the  25th  ACM  Symp.  Theory  of  Computing，pages  335–344.

Maass，W.，Schnitger，G.，and  Sontag，E.  D.（1994）.  A  comparison  of  the  computational  power  of  sigmoid  and  Boolean  threshold  circuits.  Theoretical  Advances  in  Neural  Computation  and  Learning，pages  127–151.

Maass，W.，Natschlaeger，T.，and  Markram，H.（2002）.  Real-time  computing  without  stable  states:  A  new  framework  for  neural  computation  based  on  perturbations.  Neural  Computation，14（11），2531–2560.

MacKay，D.（2003）.  Information  Theory，Inference  and  Learning  Algorithms.  Cambridge  University  Press.

Maclaurin，D.，Duvenaud，D.，and  Adams，R.  P.（2015）.  Gradient-based  hyperparameter  optimization  through  reversible  learning.  arXiv  preprint  arXiv:1502.03492.

Mao，J.，Xu，W.，Yang，Y.，Wang，J.，and  Yuille，A.（2014）.  Deep  captioning  with  multimodal  recurrent  neural  networks（m-rnn）.  arXiv:1412.6632［cs.CV］.

Marcotte，P.  and  Savard，G.（1992）.  Novel  approaches  to  the  discrimination  problem.  Zeitschrift  für  Operations  Research（Theory），36，517–545.

Marlin，B.  and  de  Freitas，N.（2011）.  Asymptotic  efficiency  of  deterministic  estimators  for  discrete  energy-based  models:  Ratio  matching  and  pseudolikelihood.  In  UAI'2011.

Marlin，B.，Swersky，K.，Chen，B.，and  de  Freitas，N.（2010）.  Inductive  principles  for  restricted  Boltzmann  machine  learning.  In  AISTATS'2010，pages  509–516.

Marquardt，D.  W.（1963）.  An  algorithm  for  least-squares  estimation  of  non-linear  parameters.  Journal  of  the  Society  of  Industrial  and  Applied  Mathematics，11（2），431–441.

Marr，D.  and  Poggio，T.（1976）.  Cooperative  computation  of  stereo  disparity.  Science，194.

Martens，J.（2010）.  Deep  learning  via  Hessian-free  optimization.  In  ICML'2010，pages  735–742.

Martens，J.  and  Medabalimi，V.（2014）.  On  the  expressive  efficiency  of  sum  product  networks.  arXiv:1411.7717.

Martens，J.  and  Sutskever，I.（2011）.  Learning  recurrent  neural  networks  with  Hessian-free  optimization.  In  Proc.  ICML'2011.  ACM.

Mase，S.（1995）.  Consistency  of  the  maximum  pseudo-likelihood  estimator  of  continuous  state  space  Gibbsian  processes.  The  Annals  of  Applied  Probability，5（3），pp.  603–612.

McClelland，J.，Rumelhart，D.，and  Hinton，G.（1995）.  The  appeal  of  parallel  distributed  processing.  In  Computation  &  intelligence，pages  305–341.  American  Association  for  Artificial  Intelligence.

McCulloch，W.  S.  and  Pitts，W.（1943）.  A  logical  calculus  of  ideas  immanent  in  nervous  activity.  Bulletin  of  Mathematical  Biophysics，5，115–133.

Mead，C.  and  Ismail，M.（2012）.  Analog  VLSI  implementation  of  neural  systems，volume  80.  Springer  Science  &  Business  Media.

Melchior，J.，Fischer，A.，and  Wiskott，L.（2013）.  How  to  center  binary  deep  Boltzmann  machines.  arXiv  preprint  arXiv:1311.1354.

Memisevic，R.  and  Hinton，G.  E.（2007）.  Unsupervised  learning  of  image  transformations.  In  Proceedings  of  the  Computer  Vision  and  Pattern  Recognition  Conference（CVPR'07）.

Memisevic，R.  and  Hinton，G.  E.（2010）.  Learning  to  represent  spatial  transformations  with  factored  higher-order  Boltzmann  machines.  Neural  Computation，22（6），1473–1492.

Mesnil，G.，Dauphin，Y.，Glorot，X.，Rifai，S.，Bengio，Y.，Goodfellow，I.，Lavoie，E.，Muller，X.，Desjardins，G.，Warde-Farley，D.，Vincent，P.，Courville，A.，and  Bergstra，J.（2011）.  Unsupervised  and  transfer  learning  challenge:  a  deep  learning  approach.  In  JMLR  W&CP:  Proc.  Unsupervised  and  Transfer  Learning，volume  7.

Mesnil，G.，Rifai，S.，Dauphin，Y.，Bengio，Y.，and  Vincent，P.（2012）.  Surfing  on  the  manifold.  Learning  Workshop，Snowbird.

Miikkulainen，R.  and  Dyer，M.  G.（1991）.  Natural  language  processing  with  modular  PDP  networks  and  distributed  lexicon.  Cognitive  Science，15，343–399.

Mikolov，T.（2012）.  Statistical  Language  Models  based  on  Neural  Networks.  Ph.D.  thesis，Brno  University  of  Technology.

Mikolov，T.，Deoras，A.，Kombrink，S.，Burget，L.，and  Cernocky，J.（2011a）.  Empirical  evaluation  and  combination  of  advanced  language  modeling  techniques.  In  Proc.  12th  annual  conference  of  the  international  speech  communication  association（INTERSPEECH  2011）.

Mikolov，T.，Deoras，A.，Povey，D.，Burget，L.，and  Cernocky，J.（2011b）.  Strategies  for  training  large  scale  neural  network  language  models.  In  Proc.  ASRU'2011.

Mikolov，T.，Chen，K.，Corrado，G.，and  Dean，J.（2013a）.  Efficient  estimation  of  word  representations  in  vector  space.  In  International  Conference  on  Learning  Representations:  Workshops  Track.

Mikolov，T.，Le，Q.  V.，and  Sutskever，I.（2013b）.  Exploiting  similarities  among  languages  for  machine  translation.  Technical  report，arXiv:1309.4168.

Minka，T.（2005）.  Divergence  measures  and  message  passing.  Microsoft  Research  Cambridge  UK  Tech  Rep  MSRTR2005173，72（TR-2005-173）.

Minsky，M.  L.  and  Papert，S.  A.（1969）.  Perceptrons.  MIT  Press，Cambridge.

Mirza，M.  and  Osindero，S.（2014）.  Conditional  generative  adversarial  nets.  arXiv  preprint  arXiv:1411.1784.

Mishkin，D.  and  Matas，J.（2015）.  All  you  need  is  a  good  init.  arXiv  preprint  arXiv:1511.06422.

Misra，J.  and  Saha，I.（2010）.  Artificial  neural  networks  in  hardware:  A  survey  of  two  decades  of  progress.  Neurocomputing，74（1），239–255.

Mitchell，T.  M.（1997）.  Machine  Learning.  McGraw-Hill，New  York.

Miyato，T.，Maeda，S.，Koyama，M.，Nakae，K.，and  Ishii，S.（2015）.  Distributional  smoothing  with  virtual  adversarial  training.  In  ICLR.  Preprint:  arXiv:1507.00677.

Mnih，A.  and  Gregor，K.（2014）.  Neural  variational  inference  and  learning  in  belief  networks.  In  ICML'2014.

Mnih，A.  and  Hinton，G.  E.（2007）.  Three  new  graphical  models  for  statistical  language  mod-elling.  In  Z.  Ghahramani，editor，Proceedings  of  the  Twenty-fourth  International  Conference  on  Machine  Learning（ICML'07），pages  641–648.  ACM.

Mnih，A.  and  Hinton，G.  E.（2009）.  A  scalable  hierarchical  distributed  language  model.  In  D.  Koller，D.  Schuurmans，Y.  Bengio，and  L.  Bottou，editors，Advances  in  Neural  Information  Processing  Systems  21（NIPS'08），pages  1081–1088.

Mnih，A.  and  Kavukcuoglu，K.（2013）.  Learning  word  embeddings  efficiently  with  noise-contrastive  estimation.  In  C.  Burges，L.  Bottou，M.  Welling，Z.  Ghahramani，and  K.  Weinberger，editors，Advances  in  Neural  Information  Processing  Systems  26，pages  2265–2273.  Curran  Associates，Inc.

Mnih，A.  and  Teh，Y.  W.（2012）.  A  fast  and  simple  algorithm  for  training  neural  probabilistic  language  models.  In  ICML'2012，pages  1751–1758.

Mnih，V.  and  Hinton，G.（2010）.  Learning  to  detect  roads  in  high-resolution  aerial  images.  In  Proceedings  of  the  11th  European  Conference  on  Computer  Vision（ECCV）.

Mnih，V.，Larochelle，H.，and  Hinton，G.（2011）.  Conditional  restricted  Boltzmann  machines  for  structure  output  prediction.  In  Proc.  Conf.  on  Uncertainty  in  Artificial  Intelligence（UAI）.

Mnih，V.，Kavukcuoglo，K.，Silver，D.，Graves，A.，Antonoglou，I.，and  Wierstra，D.（2013）.  Playing  Atari  with  deep  reinforcement  learning.  Technical  report，arXiv:1312.5602.

Mnih，V.，Heess，N.，Graves，A.，and  Kavukcuoglu，K.（2014）.  Recurrent  models  of  visual  attention.  In  Z.  Ghahramani，M.  Welling，C.  Cortes，N.  Lawrence，and  K.  Weinberger，editors，NIPS'2014，pages  2204–2212.

Mnih，V.，Kavukcuoglo，K.，Silver，D.，Rusu，A.  A.，Veness，J.，Bellemare，M.  G.，Graves，A.，Riedmiller，M.，Fidgeland，A.  K.，Ostrovski，G.，Petersen，S.，Beattie，C.，Sadik，A.，Antonoglou，I.，King，H.，Kumaran，D.，Wierstra，D.，Legg，S.，and  Hassabis，D.（2015）.  Human-level  control  through  deep  reinforcement  learning.  Nature，518，529–533.

Mobahi，H.  and  Fisher，III，J.  W.（2015）.  A  theoretical  analysis  of  optimization  by  Gaussian  continuation.  In  AAAI'2015.

Mobahi，H.，Collobert，R.，and  Weston，J.（2009）.  Deep  learning  from  temporal  coherence  in  video.  In  L.  Bottou  and  M.  Littman，editors，Proceedings  of  the  26th  International  Conference  on  Machine  Learning，pages  737–744，Montreal.  Omnipress.

Mohamed，A.，Dahl，G.，and  Hinton，G.（2009）.  Deep  belief  networks  for  phone  recognition.

Mohamed，A.，Sainath，T.  N.，Dahl，G.，Ramabhadran，B.，Hinton，G.  E.，and  Picheny，M.  A.（2011）.  Deep  belief  networks  using  discriminative  features  for  phone  recognition.  In  Acoustics，Speech  and  Signal  Processing（ICASSP），2011  IEEE  International  Conference  on，pages  5060–5063.  IEEE.

Mohamed，A.，Dahl，G.，and  Hinton，G.（2012a）.  Acoustic  modeling  using  deep  belief  networks.  IEEE  Trans.  on  Audio，Speech  and  Language  Processing，20（1），14–22.

Mohamed，A.，Hinton，G.，and  Penn，G.（2012b）.  Understanding  how  deep  belief  networks  perform  acoustic  modelling.  In  Acoustics，Speech  and  Signal  Processing（ICASSP），2012  IEEE  International  Conference  on，pages  4273–4276.  IEEE.

Moller，M.（1993）.  Efficient  Training  of  Feed-Forward  Neural  Networks.  Ph.D.  thesis，Aarhus  University，Aarhus，Denmark.

Montavon，G.  and  Muller，K.-R.（2012）.  Deep  Boltzmann  machines  and  the  centering  trick.  In  G.  Montavon，G.  Orr，and  K.-R.  Müller，editors，Neural  Networks:  Tricks  of  the  Trade，volume  7700  of  Lecture  Notes  in  Computer  Science，pages  621–637.  Preprint:  http://arxiv.org/abs/1203.3783.

Montúfar，G.（2014）.  Universal  approximation  depth  and  errors  of  narrow  belief  networks  with  discrete  units.  Neural  Computation，26.

Montúfar，G.  and  Ay，N.（2011）.  Refinements  of  universal  approximation  results  for  deep  belief  networks  and  restricted  Boltzmann  machines.  Neural  Computation，23（5），1306–1319.

Montufar，G.  F.，Pascanu，R.，Cho，K.，and  Bengio，Y.（2014）.  On  the  number  of  linear  regions  of  deep  neural  networks.  In  NIPS'2014.

Mor-Yosef，S.，Samueloff，A.，Modan，B.，Navot，D.，and  Schenker，J.  G.（1990）.  Ranking  the  risk  factors  for  cesarean:  logistic  regression  analysis  of  a  nationwide  study.  Obstet  Gynecol，75（6），944–7.

Morin，F.  and  Bengio，Y.（2005）.  Hierarchical  probabilistic  neural  network  language  model.  In  AISTATS'2005.

Mozer，M.  C.（1992）.  The  induction  of  multiscale  temporal  structure.  In  J.  M.  S.  Hanson  and  R.  Lippmann，editors，Advances  in  Neural  Information  Processing  Systems  4（NIPS'91），pages  275–282，San  Mateo，CA.  Morgan  Kaufmann.

Murphy，K.  P.（2012）.  Machine  Learning:  a  Probabilistic  Perspective.  MIT  Press，Cambridge，MA，USA.

Murray，B.  U.  I.  and  Larochelle，H.（2014）.  A  deep  and  tractable  density  estimator.  In  ICML'2014.

Nair，V.  and  Hinton，G.（2010a）.  Rectified  linear  units  improve  restricted  Boltzmann  machines.  In  ICML'2010.

Nair，V.  and  Hinton，G.  E.（2009）.  3d  object  recognition  with  deep  belief  nets.  In  Y.  Bengio，D.  Schuurmans，J.  D.  Lafferty，C.  K.  I.  Williams，and  A.  Culotta，editors，Advances  in  Neural  Information  Processing  Systems  22，pages  1339–1347.  Curran  Associates，Inc.

Nair，V.  and  Hinton，G.  E.（2010b）.  Rectified  linear  units  improve  restricted  Boltzmann  machines.  In  L.  Bottou  and  M.  Littman，editors，Proceedings  of  the  Twenty-seventh  International  Conference  on  Machine  Learning（ICML-10），pages  807–814.  ACM.

Narayanan，H.  and  Mitter，S.（2010）.  Sample  complexity  of  testing  the  manifold  hypothesis.  In  J.  Lafferty，C.  K.  I.  Williams，J.  Shawe-Taylor，R.  Zemel，and  A.  Culotta，editors，Advances  in  Neural  Information  Processing  Systems  23，pages  1786–1794.

Naumann，U.（2008）.  Optimal  Jacobian  accumulation  is  NP-complete.  Mathematical  Programming，112（2），427–441.

Navigli，R.  and  Velardi，P.（2005）.  Structural  semantic  interconnections:  a  knowledge-based  approach  to  word  sense  disambiguation.  IEEE  Trans.  Pattern  Analysis  and  Machine  Intelligence，27（7），1075–1086.

Neal，R.  and  Hinton，G.（1999）.  A  view  of  the  EM  algorithm  that  justifies  incremental，sparse，and  other  variants.  In  M.  I.  Jordan，editor，Learning  in  Graphical  Models.  MIT  Press，Cambridge，MA.

Neal，R.  M.（1990）.  Learning  stochastic  feedforward  networks.  Technical  report.

Neal，R.  M.（1993）.  Probabilistic  inference  using  Markov  chain  Monte-Carlo  methods.  Technical  Report  CRG-TR-93-1，Dept.  of  Computer  Science，University  of  Toronto.

Neal，R.  M.（1994）.  Sampling  from  multimodal  distributions  using  tempered  transitions.  Technical  Report  9421，Dept.  of  Statistics，University  of  Toronto.

Neal，R.  M.（1996）.  Bayesian  Learning  for  Neural  Networks.  Lecture  Notes  in  Statistics.  Springer.

Neal，R.  M.（2001）.  Annealed  importance  sampling.  Statistics  and  Computing，11（2），125–139.

Neal，R.  M.（2005）.  Estimating  ratios  of  normalizing  constants  using  linked  importance  sampling.

Nesterov，Y.（1983）.  A  method  of  solving  a  convex  programming  problem  with  convergence  rate  O（1/k2）.  Soviet  Mathematics  Doklady，27，372–376.

Nesterov，Y.（2004）.  Introductory  lectures  on  convex  optimization:  a  basic  course.  Applied  optimization.  Kluwer  Academic  Publ.，Boston，Dordrecht，London.

Netzer，Y.，Wang，T.，Coates，A.，Bissacco，A.，Wu，B.，and  Ng，A.  Y.（2011）.  Reading  digits  in  natural  images  with  unsupervised  feature  learning.  Deep  Learning  and  Unsupervised  Feature  Learning  Workshop，NIPS.

Ney，H.  and  Kneser，R.（1993）.  Improved  clustering  techniques  for  class-based  statistical  language  modelling.  In  European  Conference  on  Speech  Communication  and  Technology（Eurospeech），pages  973–976，Berlin.

Ng，A.（2015）.  Advice  for  applying  machine  learning.  https://see.stanford.edu/materials/aimlcs229/ML-advice.pdf.

Niesler，T.  R.，Whittaker，E.  W.  D.，and  Woodland，P.  C.（1998）.  Comparison  of  part-of-speech  and  automatically  derived  category-based  language  models  for  speech  recognition.  In  International  Conference  on  Acoustics，Speech  and  Signal  Processing（ICASSP），pages  177–180.

Ning，F.，Delhomme，D.，LeCun，Y.，Piano，F.，Bottou，L.，and  Barbano，P.  E.（2005）.  To-ward  automatic  phenotyping  of  developing  embryos  from  videos.  Image  Processing，IEEE  Transactions  on，14（9），1360–1371.

Nocedal，J.  and  Wright，S.（2006）.  Numerical  Optimization.  Springer.

Norouzi，M.  and  Fleet，D.  J.（2011）.  Minimal  loss  hashing  for  compact  binary  codes.  In  ICML'2011.

Nowlan，S.  J.（1990）.  Competing  experts:  An  experimental  investigation  of  associative  mixture  models.  Technical  Report  CRG-TR-90-5，University  of  Toronto.

Nowlan，S.  J.  and  Hinton，G.  E.（1992）.  Adaptive  soft  weight  tying  using  Gaussian  mixtures.  In  J.  M.  S.  Hanson  and  R.  Lippmann，editors，Advances  in  Neural  Information  Processing  Systems  4（NIPS'91），pages  993–1000，San  Mateo，CA.  Morgan  Kaufmann.

Olshausen，B.  and  Field，D.  J.（2005）.  How  close  are  we  to  understanding  V1?  Neural  Computation，17，1665–1699.

Olshausen，B.  A.  and  Field，D.  J.（1996）.  Emergence  of  simple-cell  receptivefield  properties  by  learning  a  sparse  code  for  natural  images.  Nature，381，607–609.

Olshausen，B.  A.，Anderson，C.  H.，and  Van  Essen，D.  C.（1993）.  A  neurobiological  model  of  visual  attention  and  invariant  pattern  recognition  based  on  dynamic  routing  of  information.  J.  Neurosci.，13（11），4700–4719.

Opper，M.  and  Archambeau，C.（2009）.  The  variational  Gaussian  approximation  revisited.  Neural  computation，21（3），786–792.

Oquab，M.，Bottou，L.，Laptev，I.，and  Sivic，J.（2014）.  Learning  and  transferring  mid-level  image  representations  using  convolutional  neural  networks.  In  Computer  Vision  and  Pattern  Recognition（CVPR），2014  IEEE  Conference  on，pages  1717–1724.  IEEE.

Osindero，S.  and  Hinton，G.  E.（2008）.  Modeling  image  patches  with  a  directed  hierarchy  of  Markov  randomfields.  In  J.  Platt，D.  Koller，Y.  Singer，and  S.  Roweis，editors，Advances  in  Neural  Information  Processing  Systems  20（NIPS'07），pages  1121–1128，Cambridge，MA.  MIT  Press.

Ovid  and  Martin，C.（2004）.  Metamorphoses.  W.W.  Norton.

Paccanaro，A.  and  Hinton，G.  E.（2000）.  Extracting  distributed  representations  of  concepts  and  relations  from  positive  and  negative  propositions.  In  International  Joint  Conference  on  Neural  Networks（IJCNN），Como，Italy.  IEEE，New  York.

Paine，T.  L.，Khorrami，P.，Han，W.，and  Huang，T.  S.（2014）.  An  analysis  of  unsupervised  pre-training  in  light  of  recent  advances.  arXiv  preprint  arXiv:1412.6597.

Palatucci，M.，Pomerleau，D.，Hinton，G.  E.，and  Mitchell，T.  M.（2009）.  Zero-shot  learning  with  semantic  output  codes.  In  Y.  Bengio，D.  Schuurmans，J.  D.  Lafferty，C.  K.  I.  Williams，and  A.  Culotta，editors，Advances  in  Neural  Information  Processing  Systems  22，pages  1410–1418.  Curran  Associates，Inc.

Parker，D.  B.（1985）.  Learning-logic.  Technical  Report  TR-47，Center  for  Comp.  Research  in  Economics  and  Management  Sci.，MIT.

Pascanu，R.，Mikolov，T.，and  Bengio，Y.（2013a）.  On  the  difficulty  of  training  recurrent  neural  networks.  In  ICML'2013.

Pascanu，R.，Mikolov，T.，and  Bengio，Y.（2013b）.  On  the  difficulty  of  training  recurrent  neural  networks.  In  ICM（1c）.

Pascanu，R.，Gulcehre，C.，Cho，K.，and  Bengio，Y.（2014a）.  How  to  construct  deep  recurrent  neural  networks.  In  ICLR.

Pascanu，R.，Montufar，G.，and  Bengio，Y.（2014b）.  On  the  number  of  inference  regions  of  deep  feed  forward  networks  with  piece-wise  linear  activations.  In  ICL（1）.

Pati，Y.，Rezaiifar，R.，and  Krishnaprasad，P.（1993）.  Orthogonal  matching  pursuit:Recursive  function  approximation  with  applications  to  wavelet  decomposition.  In  Proceedings  of  the  27  th  Annual  Asilomar  Conference  on  Signals，Systems，and  Computers，pages  40–44.

Pearl，J.（1985）.  Bayesian  networks:  A  model  of  self-activated  memory  for  evidential  reasoning.  In  Proceedings  of  the  7th  Conference  of  the  Cognitive  Science  Society，University  of  California，Irvine，pages  329–334.

Pearl，J.（1988）.  Probabilistic  Reasoning  in  Intelligent  Systems:  Networks  of  Plausible  Inference.  Morgan  Kaufmann.

Perron，O.（1907）.  Zur  theorie  der  matrices.  Mathematische  Annalen，64（2），248–263.

Petersen，K.  B.  and  Pedersen，M.  S.（2006）.  The  matrix  cookbook.  Version  20051003.

Peterson，G.  B.（2004）.  A  day  of  great  illumination:  B.  F.  Skinner's  discovery  of  shaping.  Journal  of  the  Experimental  Analysis  of  Behavior，82（3），317–328.

Pham，D.-T.，Garat，P.，and  Jutten，C.（1992）.  Separation  of  a  mixture  of  independent  sources  through  a  maximum  likelihood  approach.  In  EUSIPCO，pages  771–774.

Pham，P.-H.，Jelaca，D.，Farabet，C.，Martini，B.，LeCun，Y.，and  Culurciello，E.（2012）.  Neu-Flow:  dataflow  vision  processing  system-on-a-chip.  In  Circuits  and  Systems（MWSCAS），2012  IEEE  55th  International  Midwest  Symposium  on，pages  1044–1047.  IEEE.

Pinheiro，P.  H.  O.  and  Collobert，R.（2014）.  Recurrent  convolutional  neural  networks  for  scene  labeling.  In  ICML'2014.

Pinheiro，P.  H.  O.  and  Collobert，R.（2015）.  From  image-level  to  pixel-level  labeling  with  con-volutional  networks.  In  Conference  on  Computer  Vision  and  Pattern  Recognition（CVPR）.

Pinto，N.，Cox，D.  D.，and  DiCarlo，J.  J.（2008）.  Why  is  real-world  visual  object  recognition  hard?  PLoS  Comput  Biol，4.

Pinto，N.，Stone，Z.，Zickler，T.，and  Cox，D.（2011）.  Scaling  up  biologically-inspired  computer  vision:  A  case  study  in  unconstrained  face  recognition  on  facebook.  In  Computer  Vision  and  Pattern  Recognition  Workshops（CVPRW），2011  IEEE  Computer  Society  Conference  on，pages  35–42.  IEEE.

Pollack，J.  B.（1990）.  Recursive  distributed  representations.  Artificial  Intelligence，46（1），77–105.

Polyak，B.  and  Juditsky，A.（1992）.  Acceleration  of  stochastic  approximation  by  averaging.  SIAM  J.  Control  and  Optimization，30（4），838–855.

Polyak，B.  T.（1964）.  Some  methods  of  speeding  up  the  convergence  of  iteration  methods.  USSR  Computational  Mathematics  and  Mathematical  Physics，4（5），1–17.

Poole，B.，Sohl-Dickstein，J.，and  Ganguli，S.（2014）.  Analyzing  noise  in  autoencoders  and  deep  networks.  CoRR，abs/1406.1831.

Poon，H.  and  Domingos，P.（2011）.  Sum-product  networks  for  deep  learning.  In  Learning  Workshop，Fort  Lauderdale，FL.

Presley，R.  K.  and  Haggard，R.  L.（1994）.  Afixed  point  implementation  of  the  backpropaga-tion  learning  algorithm.  In  Southeastcon  '94.  Creative  Technology  Transfer-A  Global  Affair.，Proceedings  of  the  1994  IEEE，pages  136–138.  IEEE.

Price，R.（1958）.  A  useful  theorem  for  nonlinear  devices  having  Gaussian  inputs.  IEEE  Transactions  on  Information  Theory，4（2），69–72.

Quiroga，R.  Q.，Reddy，L.，Kreiman，G.，Koch，C.，and  Fried，I.（2005）.  Invariant  visual  representation  by  single  neurons  in  the  human  brain.  Nature，435（7045），1102–1107.

Radford，A.，Metz，L.，and  Chintala，S.（2015）.  Unsupervised  representation  learning  with  deep  convolutional  generative  adversarial  networks.  arXiv  preprint  arXiv:1511.06434.

Raiko，T.，Yao，L.，Cho，K.，and  Bengio，Y.（2014）.  Iterative  neural  autoregressive  distribution  estimator（NADE-k）.  Technical  report，arXiv:1406.1485.

Raina，R.，Madhavan，A.，and  Ng，A.  Y.（2009a）.  Large-scale  deep  unsupervised  learning  using  graphics  processors.  In  L.  Bottou  and  M.  Littman，editors，Proceedings  of  the  Twenty-sixth  International  Conference  on  Machine  Learning（ICML'09），pages  873–880，New  York，NY，USA.  ACM.

Raina，R.，Madhavan，A.，and  Ng，A.  Y.（2009b）.  Large-scale  deep  unsupervised  learning  using  graphics  processors.  In  ICML'2009.

Ramsey，F.  P.（1926）.  Truth  and  probability.  In  R.  B.  Braithwaite，editor，The  Foundations  of  Mathematics  and  other  Logical  Essays，chapter  7，pages  156–198.  McMaster  University  Archive  for  the  History  of  Economic  Thought.

Ranzato，M.  and  Hinton，G.  H.（2010）.  Modeling  pixel  means  and  covariances  using  factorized  third-order  Boltzmann  machines.  In  CVPR'2010，pages  2551–2558.

Ranzato，M.，Poultney，C.，Chopra，S.，and  LeCun，Y.（2007a）.  Efficient  learning  of  sparse  representations  with  an  energy-based  model.  In  NIPS'2006.

Ranzato，M.，Poultney，C.，Chopra，S.，and  LeCun，Y.（2007b）.  Efficient  learning  of  sparse  representations  with  an  energy-based  model.  In  B.  Schölkopf，J.  Platt，and  T.  Hoffman，editors，Advances  in  Neural  Information  Processing  Systems  19（NIPS'06），pages  1137–1144.  MIT  Press.

Ranzato，M.，Huang，F.，Boureau，Y.，and  LeCun，Y.（2007c）.  Unsupervised  learning  of  invariant  feature  hierarchies  with  applications  to  object  recognition.  In  CVPR'07.

Ranzato，M.，Boureau，Y.，and  LeCun，Y.（2008）.  Sparse  feature  learning  for  deep  belief  networks.  In  NIPS'2007.

Ranzato，M.，Krizhevsky，A.，and  Hinton，G.  E.（2010a）.  Factored  3-way  restricted  Boltzmann  machines  for  modeling  natural  images.  In  Proceedings  of  AISTATS  2010.

Ranzato，M.，Mnih，V.，and  Hinton，G.（2010b）.  Generating  more  realistic  images  using  gated  MRFs.  In  NIPS'2010.

Rao，C.（1945）.  Information  and  the  accuracy  attainable  in  the  estimation  of  statistical  param-eters.  Bulletin  of  the  Calcutta  Mathematical  Society，37，81–89.

Rasmus，A.，Valpola，H.，Honkala，M.，Berglund，M.，and  Raiko，T.（2015）.  Semi-supervised  learning  with  ladder  network.  arXiv  preprint  arXiv:1507.02672.

Recht，B.，Re，C.，Wright，S.，and  Niu，F.（2011）.  Hogwild:  A  lock-free  approach  to  parallelizing  stochastic  gradient  descent.  In  NIPS'2011.

Reichert，D.  P.，Seriès，P.，and  Storkey，A.  J.（2011）.  Neuronal  adaptation  for  sampling-based  probabilistic  inference  in  perceptual  bistability.  In  Advances  in  Neural  Information  Processing  Systems，pages  2357–2365.

Rezende，D.  J.，Mohamed，S.，and  Wierstra，D.（2014）.  Stochastic  backpropagation  and  approx-imate  inference  in  deep  generative  models.  In  ICML'2014.  Preprint:arXiv:1401.4082.

Rifai，S.，Vincent，P.，Muller，X.，Glorot，X.，and  Bengio，Y.（2011a）.  Contractive  auto-encoders:  Explicit  invariance  during  feature  extraction.  In  ICML'2011.

Rifai，S.，Mesnil，G.，Vincent，P.，Muller，X.，Bengio，Y.，Dauphin，Y.，and  Glorot，X.（2011b）.  Higher  order  contractive  auto-encoder.  In  ECML  PKDD.

Rifai，S.，Dauphin，Y.，Vincent，P.，Bengio，Y.，and  Muller，X.（2011c）.  The  manifold  tangent  classifier.  In  NIPS'2011.

Rifai，S.，Dauphin，Y.，Vincent，P.，Bengio，Y.，and  Muller，X.（2011d）.  The  manifold  tangent  classifier.  In  NIPS'2011.  Student  paper  award.

Rifai，S.，Bengio，Y.，Dauphin，Y.，and  Vincent，P.（2012）.  A  generative  process  for  sampling  contractive  auto-encoders.  In  ICML'2012.

Ringach，D.  and  Shapley，R.（2004）.  Reverse  correlation  in  neurophysiology.  Cognitive  Science，28（2），147–166.

Roberts，S.  and  Everson，R.（2001）.  Independent  component  analysis:  principles  and  practice.  Cambridge  University  Press.

Robinson，A.  J.  and  Fallside，F.（1991）.  A  recurrent  error  propagation  network  speech  recognition  system.  Computer  Speech  and  Language，5（3），259–274.

Rockafellar，R.  T.（1997）.  Convex  analysis.  princeton  landmarks  in  mathematics.

Romero，A.，Ballas，N.，Ebrahimi  Kahou，S.，Chassang，A.，Gatta，C.，and  Bengio，Y.（2015）.  Fitnets:Hints  for  thin  deep  nets.  In  ICLR'2015，arXiv:1412.6550.

Rosen，J.  B.（1960）.  The  gradient  projection  method  for  nonlinear  programming.  part  i.  linear  constraints.  Journal  of  the  Society  for  Industrial  and  Applied  Mathematics，8（1），pp.  181–217.

Rosenblatt，F.（1958）.  The  perceptron:  A  probabilistic  model  for  information  storage  and  organization  in  the  brain.  Psychological  Review，65，386–408.

Rosenblatt，F.（1962）.  Principles  of  Neurodynamics.  Spartan，New  York.

Rosenblatt，M.（1956）.  Remarks  on  some  nonparametric  estimates  of  a  density  function.  The  Annals  of  Mathematical  Statistics，27（3），832–837.

Roweis，S.  and  Saul，L.  K.（2000）.  Nonlinear  dimensionality  reduction  by  locally  linear  embedding.  Science，290（5500）.

Roweis，S.，Saul，L.，and  Hinton，G.（2002）.  Global  coordination  of  local  linear  models.  In  T.  Dietterich，S.  Becker，and  Z.  Ghahramani，editors，Advances  in  Neural  Information  Processing  Systems  14（NIPS'01），Cambridge，MA.  MIT  Press.

Rubin，D.  B.  et  al.（1984）.  Bayesianly  justifiable  and  relevant  frequency  calculations  for  the  applied  statistician.  The  Annals  of  Statistics，12（4），1151–1172.

Rumelhart，D.，Hinton，G.，and  Williams，R.（1986a）.  Learning  representations  by  back-propagating  errors.  Nature，323，533–536.

Rumelhart，D.  E.，Hinton，G.  E.，and  Williams，R.  J.（1986b）.  Learning  internal  representations  by  error  propagation.  In  D.  E.  Rumelhart  and  J.  L.  McClelland，editors，Parallel  Distributed  Processing，volume  1，chapter  8，pages  318–362.  MIT  Press，Cambridge.

Rumelhart，D.  E.，Hinton，G.  E.，and  Williams，R.  J.（1986c）.  Learning  representations  by  back-propagating  errors.  Nature，323，533–536.

Rumelhart，D.  E.，McClelland，J.  L.，and  the  PDP  Research  Group（1986d）.  Parallel  Distributed  Processing:  Explorations  in  the  Microstructure  of  Cognition.  MIT  Press，Cambridge.

Russakovsky，O.，Deng，J.，Su，H.，Krause，J.，Satheesh，S.，Ma，S.，Huang，Z.，Karpathy，A.，Khosla，A.，Bernstein，M.，Berg，A.  C.，and  Fei-Fei，L.（2014a）.  ImageNet  Large  Scale  Visual  Recognition  Challenge.

Russakovsky，O.，Deng，J.，Su，H.，Krause，J.，Satheesh，S.，Ma，S.，Huang，Z.，Karpathy，A.，Khosla，A.，Bernstein，M.，et  al.（2014b）.  Imagenet  large  scale  visual  recognition  challenge.  arXiv  preprint  arXiv:1409.0575.

Russel，S.  J.  and  Norvig，P.（2003）.  Artificial  Intelligence:a  Modern  Approach.  Prentice  Hall.

Rust，N.，Schwartz，O.，Movshon，J.  A.，and  Simoncelli，E.（2005）.  Spatiotemporal  elements  of  macaque  V1  receptivefields.  Neuron，46（6），945–956.

Sainath，T.，Mohamed，A.，Kingsbury，B.，and  Ramabhadran，B.（2013）.  Deep  convolutional  neural  networks  for  LVCSR.  In  ICASSP  2013.

Salakhutdinov，R.（2010）.  Learning  in  Markov  randomfields  using  tempered  transitions.  In  Y.  Bengio，D.  Schuurmans，C.  Williams，J.  Lafferty，and  A.  Culotta，editors，Advances  in  Neural  Information  Processing  Systems  22（NIPS'09）.

Salakhutdinov，R.  and  Hinton，G.（2009a）.  Deep  Boltzmann  machines.  In  Proceedings  of  the  International  Conference  on  Artificial  Intelligence  and  Statistics，volume  5，pages  448–455.

Salakhutdinov，R.  and  Hinton，G.（2009b）.  Semantic  hashing.  In  International  Journal  of  Approximate  Reasoning.

Salakhutdinov，R.  and  Hinton，G.  E.（2007a）.  Learning  a  nonlinear  embedding  by  preserving  class  neighbourhood  structure.  In  Proceedings  of  AISTATS-2007.

Salakhutdinov，R.  and  Hinton，G.  E.（2007b）.  Semantic  hashing.  In  SIGIR'2007.

Salakhutdinov，R.  and  Hinton，G.  E.（2008）.  Using  deep  belief  nets  to  learn  covariance  kernels  for  Gaussian  processes.  In  J.  Platt，D.  Koller，Y.  Singer，and  S.  Roweis，editors，Advances  in  Neural  Information  Processing  Systems  20（NIPS'07），pages  1249–1256，Cambridge，MA.  MIT  Press.

Salakhutdinov，R.  and  Larochelle，H.（2010）.  Efficient  learning  of  deep  Boltzmann  machines.  In  Proceedings  of  the  Thirteenth  International  Conference  on  Artificial  Intelligence  and  Statistics（AISTATS  2010），JMLR  W&CP，volume  9，pages  693–700.

Salakhutdinov，R.  and  Mnih，A.（2008）.  Probabilistic  matrix  factorization.  In  NIPS'2008.

Salakhutdinov，R.  and  Murray，I.（2008）.  On  the  quantitative  analysis  of  deep  belief  networks.  In  W.  W.  Cohen，A.  McCallum，and  S.  T.  Roweis，editors，Proceedings  of  the  Twenty-fifth  International  Conference  on  Machine  Learning（ICML'08），volume  25，pages  872–879.  ACM.

Salakhutdinov，R.，Mnih，A.，and  Hinton，G.（2007）.  Restricted  Boltzmann  machines  for  collab-orativefiltering.  In  ICML.

Sanger，T.  D.（1994）.  Neural  network  learning  control  of  robot  manipulators  using  gradually  increasing  task  difficulty.  IEEE  Transactions  on  Robotics  and  Automation，10（3）.

Saul，L.  K.  and  Jordan，M.  I.（1996）.  Exploiting  tractable  substructures  in  intractable  networks.  In  D.  Touretzky，M.  Mozer，and  M.  Hasselmo，editors，Advances  in  Neural  Information  Processing  Systems  8（NIPS'95）.  MIT  Press，Cambridge，MA.

Saul，L.  K.，Jaakkola，T.，and  Jordan，M.  I.（1996）.  Meanfield  theory  for  sigmoid  belief  networks.  Journal  of  Artificial  Intelligence  Research，4，61–76.

Savich，A.  W.，Moussa，M.，and  Areibi，S.（2007）.  The  impact  of  arithmetic  representation  on  implementing  mlp-bp  on  fpgas:  A  study.  Neural  Networks，IEEE  Transactions  on，18（1），240–252.

Saxe，A.  M.，Koh，P.  W.，Chen，Z.，Bhand，M.，Suresh，B.，and  Ng，A.（2011）.  On  random  weights  and  unsupervised  feature  learning.  In  Proc.  ICML'2011.  ACM.

Saxe，A.  M.，McClelland，J.  L.，and  Ganguli，S.（2013）.  Exact  solutions  to  the  nonlinear  dynamics  of  learning  in  deep  linear  neural  networks.  In  ICLR.

Schaul，T.，Antonoglou，I.，and  Silver，D.（2014）.  Unit  tests  for  stochastic  optimization.  In  International  Conference  on  Learning  Representations.

Schmidhuber，J.（1992）.  Learning  complex，extended  sequences  using  the  principle  of  history  compression.  Neural  Computation，4（2），234–242.

Schmidhuber，J.（1996）.  Sequential  neural  text  compression.  IEEE  Transactions  on  Neural  Networks，7（1），142–146.

Schmidhuber，J.（2012）.  Self-delimiting  neural  networks.  arXiv  preprint  arXiv:1210.0118.

Schölkopf，B.  and  Smola，A.  J.（2002）.  Learning  with  kernels:  Support  vector  machines，regular-ization，optimization，and  beyond.  MIT  press.

Schölkopf，B.，Burges，C.  J.  C.，and  Smola，A.  J.（1998a）.  Advances  in  kernel  methods:  support  vector  learning.  MIT  Press，Cambridge，MA.

Schölkopf，B.，Smola，A.，and  Müller，K.-R.（1998b）.  Nonlinear  component  analysis  as  a  kernel  eigenvalue  problem.  Neural  Computation，10，1299–1319.

Schölkopf，B.，Burges，C.  J.  C.，and  Smola，A.  J.（1999）.  Advances  in  Kernel  Methods—Support  Vector  Learning.  MIT  Press，Cambridge，MA.

Schölkopf，B.，Janzing，D.，Peters，J.，Sgouritsa，E.，Zhang，K.，and  Mooij，J.（2012）.  On  causal  and  anticausal  learning.  In  ICML'2012，pages  1255–1262.

Schuster，M.（1999）.  On  supervised  learning  from  sequential  data  with  applications  for  speech  recognition.

Schuster，M.  and  Paliwal，K.（1997）.  Bidirectional  recurrent  neural  networks.  IEEE  Transactions  on  Signal  Processing，45（11），2673–2681.

Schwenk，H.（2007）.  Continuous  space  language  models.  Computer  speech  and  language，21，492–518.

Schwenk，H.（2010）.  Continuous  space  language  models  for  statistical  machine  translation.  The  Prague  Bulletin  of  Mathematical  Linguistics，93，137–146.

Schwenk，H.（2014）.  Cleaned  subset  of  WMT  '14  dataset.

Schwenk，H.  and  Bengio，Y.（1998）.  Training  methods  for  adaptive  boosting  of  neural  networks.  In  M.  Jordan，M.  Kearns，and  S.  Solla，editors，Advances  in  Neural  Information  Processing  Systems  10（NIPS'97），pages  647–653.  MIT  Press.

Schwenk，H.  and  Gauvain，J.-L.（2002）.  Connectionist  language  modeling  for  large  vocabulary  continuous  speech  recognition.  In  International  Conference  on  Acoustics，Speech  and  Signal  Processing（ICASSP），pages  765–768，Orlando，Florida.

Schwenk，H.，Costa-jussà，M.  R.，and  Fonollosa，J.  A.  R.（2006）.  Continuous  space  language  models  for  the  IWSLT  2006  task.  In  International  Workshop  on  Spoken  Language  Translation，pages  166–173.

Seide，F.，Li，G.，and  Yu，D.（2011）.  Conversational  speech  transcription  using  context-dependent  deep  neural  networks.  In  Interspeech  2011，pages  437–440.

Sejnowski，T.（1987）.  Higher-order  Boltzmann  machines.  In  AIP  Conference  Proceedings  151  on  Neural  Networks  for  Computing，pages  398–403.  American  Institute  of  Physics  Inc.

Series，P.，Reichert，D.  P.，and  Storkey，A.  J.（2010）.  Hallucinations  in  Charles  Bonnet  syndrome  induced  by  homeostasis:  a  deep  Boltzmann  machine  model.  In  Advances  in  Neural  Information  Processing  Systems，pages  2020–2028.

Sermanet，P.，Chintala，S.，and  LeCun，Y.（2012）.  Convolutional  neural  networks  applied  to  house  numbers  digit  classification.  In  International  Conference  on  Pattern  Recognition（ICPR  2012）.

Sermanet，P.，Kavukcuoglu，K.，Chintala，S.，and  LeCun，Y.（2013）.  Pedestrian  detection  with  unsupervised  multi-stage  feature  learning.  In  Proc.  International  Conference  on  Computer  Vision  and  Pattern  Recognition（CVPR'13）.  IEEE.

Shilov，G.（1977）.  Linear  Algebra.  Dover  Books  on  Mathematics  Series.  Dover  Publications.

Siegelmann，H.（1995）.  Computation  beyond  the  Turing  limit.  Science，268（5210），545–548.

Siegelmann，H.  and  Sontag，E.（1991）.  Turing  computability  with  neural  nets.  Applied  Mathe-matics  Letters，4（6），77–80.

Siegelmann，H.  T.  and  Sontag，E.  D.（1995）.  On  the  computational  power  of  neural  nets.  Journal  of  Computer  and  Systems  Sciences，50（1），132–150.

Sietsma，J.  and  Dow，R.（1991）.  Creating  artificial  neural  networks  that  generalize.  Neural  Networks，4（1），67–79.

Simard，D.，Steinkraus，P.  Y.，and  Platt，J.  C.（2003）.  Best  practices  for  convolutional  neural  networks.  In  ICDAR'2003.

Simard，P.  and  Graf，H.  P.（1994）.  Backpropagation  without  multiplication.  In  Advances  in  Neural  Information  Processing  Systems，pages  232–239.

Simard，P.，Victorri，B.，LeCun，Y.，and  Denker，J.（1992）.  Tangent  prop-A  formalism  for  specifying  selected  invariances  in  an  adaptive  network.  In  NIPS'1991.

Simard，P.  Y.，LeCun，Y.，and  Denker，J.（1993）.  Efficient  pattern  recognition  using  a  new  transformation  distance.  In  NIPS'92.

Simard，P.  Y.，LeCun，Y.  A.，Denker，J.  S.，and  Victorri，B.（1998）.  Transformation  invariance  in  pattern  recognition—tangent  distance  and  tangent  propagation.  Lecture  Notes  in  Computer  Science，1524.

Simons，D.  J.  and  Levin，D.  T.（1998）.  Failure  to  detect  changes  to  people  during  a  real-world  interaction.  Psychonomic  Bulletin  &  Review，5（4），644–649.

Simonyan，K.  and  Zisserman，A.（2015）.  Very  deep  convolutional  networks  for  large-scale  image  recognition.  In  ICLR.

Sjöberg，J.  and  Ljung，L.（1995）.  Overtraining，regularization  and  searching  for  a  minimum，with  application  to  neural  networks.  International  Journal  of  Control，62（6），1391–1407.

Skinner，B.  F.（1958）.  Reinforcement  today.  American  Psychologist，13，94–99.

Smolensky，P.（1986）.  Information  processing  in  dynamical  systems:  Foundations  of  harmony  theory.  In  D.  E.  Rumelhart  and  J.  L.  McClelland，editors，Parallel  Distributed  Processing，volume  1，chapter  6，pages  194–281.  MIT  Press，Cambridge.

Snoek，J.，Larochelle，H.，and  Adams，R.  P.（2012）.  Practical  Bayesian  optimization  of  machine  learning  algorithms.  In  NIPS'2012.

Socher，R.，Huang，E.  H.，Pennington，J.，Ng，A.  Y.，and  Manning，C.  D.（2011a）.  Dynamic  pooling  and  unfolding  recursive  autoencoders  for  paraphrase  detection.  In  NIPS'2011.

Socher，R.，Manning，C.，and  Ng，A.  Y.（2011b）.  Parsing  natural  scenes  and  natural  language  with  recursive  neural  networks.  In  Proceedings  of  the  Twenty-Eighth  International  Conference  on  Machine  Learning（ICML'2011）.

Socher，R.，Pennington，J.，Huang，E.  H.，Ng，A.  Y.，and  Manning，C.  D.（2011c）.  Semi-supervised  recursive  autoencoders  for  predicting  sentiment  distributions.  In  EMNLP'2011.

Socher，R.，Perelygin，A.，Wu，J.  Y.，Chuang，J.，Manning，C.  D.，Ng，A.  Y.，and  Potts，C.（2013a）.  Recursive  deep  models  for  semantic  compositionality  over  a  sentiment  treebank.  In  EMNLP'2013.

Socher，R.，Ganjoo，M.，Manning，C.  D.，and  Ng，A.  Y.（2013b）.  Zero-shot  learning  through  cross-modal  transfer.  In  27th  Annual  Conference  on  Neural  Information  Processing  Systems（NIPS  2013）.

Sohl-Dickstein，J.，Weiss，E.  A.，Maheswaranathan，N.，and  Ganguli，S.（2015）.  Deep  unsuper-vised  learning  using  nonequilibrium  thermodynamics.

Sohn，K.，Zhou，G.，and  Lee，H.（2013）.  Learning  and  selecting  features  jointly  with  point-wise  gated  Boltzmann  machines.  In  ICML'2013.

Solomonoff，R.  J.（1989）.  A  system  for  incremental  learning  based  on  algorithmic  probability.

Sontag，E.  D.（1998）.  VC  dimension  of  neural  networks.  NATO  ASI  Series  F  Computer  and  Systems  Sciences，168，69–96.

Sontag，E.  D.  and  Sussman，H.  J.（1989）.  Backpropagation  can  give  rise  to  spurious  local  minima  even  for  networks  without  hidden  layers.  Complex  Systems，3，91–106.

Sparkes，B.（1996）.  The  Red  and  the  Black:  Studies  in  Greek  Pottery.  Routledge.

Spitkovsky，V.  I.，Alshawi，H.，and  Jurafsky，D.（2010）.  From  baby  steps  to  leapfrog:  how“less  is  more”in  unsupervised  dependency  parsing.  In  HLT'10.

Squire，W.  and  Trapp，G.（1998）.  Using  complex  variables  to  estimate  derivatives  of  real  functions.  SIAM  Rev.，40（1），110–112.

Srebro，N.  and  Shraibman，A.（2005）.  Rank，trace-norm  and  max-norm.  In  Proceedings  of  the  18th  Annual  Conference  on  Learning  Theory，pages  545–560.  Springer-Verlag.

Srivastava，N.（2013）.  Improving  Neural  Networks  With  Dropout.  Master's  thesis，U.  Toronto.

Srivastava，N.  and  Salakhutdinov，R.（2012）.  Multimodal  learning  with  deep  Boltzmann  machines.  In  NIPS'2012.

Srivastava，N.，Salakhutdinov，R.  R.，and  Hinton，G.  E.（2013）.  Modeling  documents  with  deep  Boltzmann  machines.  arXiv  preprint  arXiv:1309.6865.

Srivastava，N.，Hinton，G.，Krizhevsky，A.，Sutskever，I.，and  Salakhutdinov，R.（2014）.  Dropout:  A  simple  way  to  prevent  neural  networks  from  overfitting.  Journal  of  Machine  Learning  Research，15，1929–1958.

Srivastava，R.  K.，Greff，K.，and  Schmidhuber，J.（2015）.  Highway  networks.  arXiv:1505.00387.

Steinkrau，D.，Simard，P.  Y.，and  Buck，I.（2005）.  Using  GPUs  for  machine  learning  algorithms.  2013  12th  International  Conference  on  Document  Analysis  and  Recognition，0，1115–1119.

Stoyanov，V.，Ropson，A.，and  Eisner，J.（2011）.  Empirical  risk  minimization  of  graphical  model  parameters  given  approximate  inference，decoding，and  model  structure.  In  Proceedings  of  the  14th  International  Conference  on  Artificial  Intelligence  and  Statistics（AISTATS），volume  15  of  JMLR  Workshop  and  Conference  Proceedings，pages  725–733，Fort  Lauderdale.  Supplemen-tary  material（4  pages）  also  available.

Sukhbaatar，S.，Szlam，A.，Weston，J.，and  Fergus，R.（2015）.  Weakly  supervised  memory  networks.  arXiv  preprint  arXiv:1503.08895.

Supancic，J.  and  Ramanan，D.（2013）.  Self-paced  learning  for  long-term  tracking.  In  CVPR'2013.

Sussillo，D.（2014）.  Random  walks:Training  very  deep  nonlinear  feed-forward  networks  with  smart  initialization.  CoRR，abs/1412.6558.

Sutskever，I.（2012）.  Training  Recurrent  Neural  Networks.  Ph.D.  thesis，Department  of  computer  science，University  of  Toronto.

Sutskever，I.  and  Hinton，G.  E.（2008）.  Deep  narrow  sigmoid  belief  networks  are  universal  approximators.  Neural  Computation，20（11），2629–2636.

Sutskever，I.  and  Tieleman，T.（2010）.  On  the  Convergence  Properties  of  Contrastive  Divergence.  In  AISTATS'2010.

Sutskever，I.，Hinton，G.，and  Taylor，G.（2009）.  The  recurrent  temporal  restricted  Boltzmann  machine.  In  NIPS'2008.

Sutskever，I.，Martens，J.，and  Hinton，G.  E.（2011）.  Generating  text  with  recurrent  neural  networks.  In  ICML'2011，pages  1017–1024.

Sutskever，I.，Martens，J.，Dahl，G.，and  Hinton，G.（2013）.  On  the  importance  of  initialization  and  momentum  in  deep  learning.  In  ICML.

Sutskever，I.，Vinyals，O.，and  Le，Q.  V.（2014）.  Sequence  to  sequence  learning  with  neural  networks.  In  NIPS'2014，arXiv:1409.3215.

Sutton，R.  and  Barto，A.（1998）.  Reinforcement  Learning:  An  Introduction.  MIT  Press.

Sutton，R.  S.，Mcallester，D.，Singh，S.，and  Mansour，Y.（2000）.  Policy  gradient  methods  for  reinforcement  learning  with  function  approximation.  In  NIPS'1999，pages  1057–1063.  MIT  Press.

Swersky，K.，Ranzato，M.，Buchman，D.，Marlin，B.，and  de  Freitas，N.（2011）.  On  autoencoders  and  score  matching  for  energy  based  models.  In  ICML'2011.  ACM.

Swersky，K.，Snoek，J.，and  Adams，R.  P.（2014）.  Freeze-thaw  Bayesian  optimization.  arXiv  preprint  arXiv:1406.3896.

Szegedy，C.，Liu，W.，Jia，Y.，Sermanet，P.，Reed，S.，Anguelov，D.，Erhan，D.，Vanhoucke，V.，and  Rabinovich，A.（2014a）.  Going  deeper  with  convolutions.  Technical  report，arXiv:1409.4842.

Szegedy，C.，Zaremba，W.，Sutskever，I.，Bruna，J.，Erhan，D.，Goodfellow，I.  J.，and  Fergus，R.（2014b）.  Intriguing  properties  of  neural  networks.  ICLR，abs/1312.6199.

Szegedy，C.，Vanhoucke，V.，Ioffe，S.，Shlens，J.，and  Wojna，Z.（2015）.  Rethinking  the  Inception  Architecture  for  Computer  Vision.  ArXiv  e-prints.

Taigman，Y.，Yang，M.，Ranzato，M.，and  Wolf，L.（2014）.  DeepFace:  Closing  the  gap  to  human-level  performance  in  face  verification.  In  CVPR'2014.

Tandy，D.  W.（1997）.  Works  and  Days:  A  Translation  and  Commentary  for  the  Social  Sciences.  University  of  California  Press.

Tang，Y.  and  Eliasmith，C.（2010）.  Deep  networks  for  robust  visual  recognition.  In  Proceedings  of  the  27th  International  Conference  on  Machine  Learning，June  21-24，2010，Haifa，Israel.

Tang，Y.，Salakhutdinov，R.，and  Hinton，G.（2012）.  Deep  mixtures  of  factor  analysers.  arXiv  preprint  arXiv:1206.4635.

Taylor，G.  and  Hinton，G.（2009）.  Factored  conditional  restricted  Boltzmann  machines  for  modeling  motion  style.  In  L.  Bottou  and  M.  Littman，editors，Proceedings  of  the  Twenty-sixth  International  Conference  on  Machine  Learning（ICML'09），pages  1025–1032，Montreal，Quebec，Canada.  ACM.

Taylor，G.，Hinton，G.  E.，and  Roweis，S.（2007）.  Modeling  human  motion  using  binary  latent  variables.  In  B.  Schölkopf，J.  Platt，and  T.  Hoffman，editors，Advances  in  Neural  Information  Processing  Systems  19（NIPS'06），pages  1345–1352.  MIT  Press，Cambridge，MA.

Teh，Y.，Welling，M.，Osindero，S.，and  Hinton，G.  E.（2003）.  Energy-based  models  for  sparse  overcomplete  representations.  Journal  of  Machine  Learning  Research，4，1235–1260.

Tenenbaum，J.，de  Silva，V.，and  Langford，J.  C.（2000）.  A  global  geometric  framework  for  nonlinear  dimensionality  reduction.  Science，290（5500），2319–2323.

Theis，L.，van  den  Oord，A.，and  Bethge，M.（2015）.  A  note  on  the  evaluation  of  generative  models.  arXiv:1511.01844.

Thompson，J.，Jain，A.，LeCun，Y.，and  Bregler，C.（2014）.  Joint  training  of  a  convolutional  network  and  a  graphical  model  for  human  pose  estimation.  In  NIPS'2014.

Thrun，S.（1995）.  Learning  to  play  the  game  of  chess.  In  NIPS'1994.

Tibshirani，R.  J.（1995）.  Regression  shrinkage  and  selection  via  the  lasso.  Journal  of  the  Royal  Statistical  Society  B，58，267–288.

Tieleman，T.（2008）.  Training  restricted  Boltzmann  machines  using  approximations  to  the  like-lihood  gradient.  In  ICML'2008，pages  1064–1071.

Tieleman，T.  and  Hinton，G.（2009）.  Using  fast  weights  to  improve  persistent  contrastive  diver-gence.  In  ICML'2009.

Tipping，M.  E.  and  Bishop，C.  M.（1999）.  Probabilistic  principal  components  analysis.  Journal  of  the  Royal  Statistical  Society  B，61（3），611–622.

Torralba，A.，Fergus，R.，and  Weiss，Y.（2008）.  Small  codes  and  large  databases  for  recognition.  In  Proceedings  of  the  Computer  Vision  and  Pattern  Recognition  Conference（CVPR'08），pages  1–8.

Touretzky，D.  S.  and  Minton，G.  E.（1985）.  Symbols  among  the  neurons:  Details  of  a  con-nectionist  inference  architecture.  In  Proceedings  of  the  9th  International  Joint  Conference  on  Artificial  Intelligence-Volume  1，IJCAI'85，pages  238–243，San  Francisco，CA，USA.  Morgan  Kaufmann  Publishers  Inc.

Tu，K.  and  Honavar，V.（2011）.  On  the  utility  of  curricula  in  unsupervised  learning  of  probabilistic  grammars.  In  IJCAI'2011.

Turaga，S.  C.，Murray，J.  F.，Jain，V.，Roth，F.，Helmstaedter，M.，Briggman，K.，Denk，W.，and  Seung，H.  S.（2010）.  Convolutional  networks  can  learn  to  generate  affinity  graphs  for  image  segmentation.  Neural  Computation，22，511–538.

Turian，J.，Ratinov，L.，and  Bengio，Y.（2010）.  Word  representations:  A  simple  and  general  method  for  semi-supervised  learning.  In  Proc.  ACL'2010，pages  384–394.

Töscher，A.，Jahrer，M.，and  Bell，R.  M.（2009）.  The  BigChaos  solution  to  the  Netflix  grand  prize.

Uria，B.，Murray，I.，and  Larochelle，H.（2013）.  Rnade:  The  real-valued  neural  autoregressive  density-estimator.  In  NIPS'2013.

van  den  Oörd，A.，Dieleman，S.，and  Schrauwen，B.（2013）.  Deep  content-based  music  recom-mendation.  In  NIPS'2013.

van  der  Maaten，L.  and  Hinton，G.  E.（2008）.  Visualizing  data  using  t-SNE.  J.  Machine  Learning  Res.，9.

Vanhoucke，V.，Senior，A.，and  Mao，M.  Z.（2011）.  Improving  the  speed  of  neural  networks  on  CPUs.  In  Proc.  Deep  Learning  and  Unsupervised  Feature  Learning  NIPS  Workshop.

Vapnik，V.  N.（1982）.  Estimation  of  Dependences  Based  on  Empirical  Data.  Springer-Verlag，Berlin.

Vapnik，V.  N.（1995）.  The  Nature  of  Statistical  Learning  Theory.  Springer，New  York.

Vapnik，V.  N.  and  Chervonenkis，A.  Y.（1971）.  On  the  uniform  convergence  of  relative  frequencies  of  events  to  their  probabilities.  Theory  of  Probability  and  Its  Applications，16，264–280.

Vincent，P.（2011）.  A  connection  between  score  matching  and  denoising  autoencoders.  Neural  Computation，23（7）.

Vincent，P.  and  Bengio，Y.（2003）.  Manifold  Parzen  windows.  In  NIPS'2002.  MIT  Press.

Vincent，P.，Larochelle，H.，Bengio，Y.，and  Manzagol，P.-A.（2008a）.  Extracting  and  composing  robust  features  with  denoising  autoencoders.  In  ICM（1a），pages  1096–1103.

Vincent，P.，Larochelle，H.，Bengio，Y.，and  Manzagol，P.-A.（2008b）.  Extracting  and  composing  robust  features  with  denoising  autoencoders.  In  ICML  2008.

Vincent，P.，Larochelle，H.，Lajoie，I.，Bengio，Y.，and  Manzagol，P.-A.（2010）.  Stacked  denoising  autoencoders:  Learning  useful  representations  in  a  deep  network  with  a  local  denoising  criterion.  J.  Machine  Learning  Res.，11.

Vincent，P.，de  Brébisson，A.，and  Bouthillier，X.（2015）.  Efficient  exact  gradient  update  for  training  deep  networks  with  very  large  sparse  targets.  In  C.  Cortes，N.  D.  Lawrence，D.  D.  Lee，M.  Sugiyama，and  R.  Garnett，editors，Advances  in  Neural  Information  Processing  Systems  28，pages  1108–1116.  Curran  Associates，Inc.

Vinyals，O.，Kaiser，L.，Koo，T.，Petrov，S.，Sutskever，I.，and  Hinton，G.（2014a）.  Grammar  as  a  foreign  language.  arXiv  preprint  arXiv:1412.7449.

Vinyals，O.，Toshev，A.，Bengio，S.，and  Erhan，D.（2014b）.  Show  and  tell:a  neural  image  caption  generator.  arXiv  1411.4555.

Vinyals，O.，Fortunato，M.，and  Jaitly，N.（2015a）.  Pointer  networks.  arXiv  preprint  arXiv:1506.03134.

Vinyals，O.，Toshev，A.，Bengio，S.，and  Erhan，D.（2015b）.  Show  and  tell:a  neural  image  caption  generator.  In  CVPR'2015.  arXiv:1411.4555.

Viola，P.  and  Jones，M.（2001）.  Robust  real-time  object  detection.  In  International  Journal  of  Computer  Vision.

Visin，F.，Kastner，K.，Cho，K.，Matteucci，M.，Courville，A.，and  Bengio，Y.（2015）.  ReNet:  A  recurrent  neural  network  based  alternative  to  convolutional  networks.  arXiv  preprint  arXiv:1505.00393.

Von  Melchner，L.，Pallas，S.  L.，and  Sur，M.（2000）.  Visual  behaviour  mediated  by  retinal  projections  directed  to  the  auditory  pathway.  Nature，404（6780），871–876.

Wager，S.，Wang，S.，and  Liang，P.（2013）.  Dropout  training  as  adaptive  regularization.  In  Advances  in  Neural  Information  Processing  Systems  26，pages  351–359.

Waibel，A.，Hanazawa，T.，Hinton，G.  E.，Shikano，K.，and  Lang，K.（1989）.  Phoneme  recognition  using  time-delay  neural  networks.  IEEE  Transactions  on  Acoustics，Speech，and  Signal  Processing，37，328–339.

Wan，L.，Zeiler，M.，Zhang，S.，LeCun，Y.，and  Fergus，R.（2013）.  Regularization  of  neural  networks  using  dropconnect.  In  ICML'2013.

Wang，S.  and  Manning，C.（2013）.  Fast  dropout  training.  In  ICML'2013.

Wang，Z.，Zhang，J.，Feng，J.，and  Chen，Z.（2014a）.  Knowledge  graph  and  text  jointly  embedding.  In  Proc.  EMNLP'2014.

Wang，Z.，Zhang，J.，Feng，J.，and  Chen，Z.（2014b）.  Knowledge  graph  embedding  by  translating  on  hyperplanes.  In  Proc.  AAAI'2014.

Warde-Farley，D.，Goodfellow，I.  J.，Courville，A.，and  Bengio，Y.（2014）.  An  empirical  analysis  of  dropout  in  piecewise  linear  networks.  In  ICL（1）.

Wawrzynek，J.，Asanovic，K.，Kingsbury，B.，Johnson，D.，Beck，J.，and  Morgan，N.（1996）.  Spert-II:  A  vector  microprocessor  system.  Computer，29（3），79–86.

Weaver，L.  and  Tao，N.（2001）.  The  optimal  reward  baseline  for  gradient-based  reinforcement  learning.  In  Proc.  UAI'2001，pages  538–545.

Weinberger，K.  Q.  and  Saul，L.  K.（2004a）.  Unsupervised  learning  of  image  manifolds  by  semidefi-nite  programming.  In  Proceedings  of  the  Computer  Vision  and  Pattern  Recognition  Conference（CVPR'04），volume  2，pages  988–995，Washington  D.C.

Weinberger，K.  Q.  and  Saul，L.  K.（2004b）.  Unsupervised  learning  of  image  manifolds  by  semidefinite  programming.  In  CVPR'2004，pages  988–995.

Weiss，Y.，Torralba，A.，and  Fergus，R.（2008）.  Spectral  hashing.  In  NIPS，pages  1753–1760.

Welling，M.，Zemel，R.  S.，and  Hinton，G.  E.（2002）.  Self  supervised  boosting.  In  Advances  in  Neural  Information  Processing  Systems，pages  665–672.

Welling，M.，Hinton，G.  E.，and  Osindero，S.（2003a）.  Learning  sparse  topographic  representa-tions  with  products  of  Student-t  distributions.  In  NIPS'2002.

Welling，M.，Zemel，R.，and  Hinton，G.  E.（2003b）.  Self-supervised  boosting.  In  S.  Becker，S.  Thrun，and  K.  Obermayer，editors，Advances  in  Neural  Information  Processing  Systems  15（NIPS'02），pages  665–672.  MIT  Press.

Welling，M.，Rosen-Zvi，M.，and  Hinton，G.  E.（2005）.  Exponential  family  harmoniums  with  an  application  to  information  retrieval.  In  L.  Saul，Y.  Weiss，and  L.  Bottou，editors，Advances  in  Neural  Information  Processing  Systems  17（NIPS'04），volume  17，Cambridge，MA.  MIT  Press.

Werbos，P.  J.（1981）.  Applications  of  advances  in  nonlinear  sensitivity  analysis.  In  Proceedings  of  the  10th  IFIP  Conference，31.8-4.9，NYC，pages  762–770.

Weston，J.，Bengio，S.，and  Usunier，N.（2010）.  Large  scale  image  annotation:  learning  to  rank  with  joint  word-image  embeddings.  Machine  Learning，81（1），21–35.

Weston，J.，Chopra，S.，and  Bordes，A.（2014）.  Memory  networks.  arXiv  preprint  arXiv:1410.3916.

Widrow，B.  and  Hoff，M.  E.（1960）.  Adaptive  switching  circuits.  In  1960  IRE  WESCON  Convention  Record，volume  4，pages  96–104.  IRE，New  York.

Wikipedia（2015）.  List  of  animals  by  number  of  neurons—Wikipedia，the  free  encyclopedia.  ［Online；accessed  4-March-2015］.

Williams，C.  K.  I.  and  Agakov，F.  V.（2002）.  Products  of  Gaussians  and  Probabilistic  Minor  Component  Analysis.  Neural  Computation，14（5），1169–1182.

Williams，C.  K.  I.  and  Rasmussen，C.  E.（1996）.  Gaussian  processes  for  regression.  In  D.  Touretzky，M.  Mozer，and  M.  Hasselmo，editors，Advances  in  Neural  Information  Processing  Systems  8（NIPS'95），pages  514–520.  MIT  Press，Cambridge，MA.

Williams，R.  J.（1992）.  Simple  statistical  gradient-following  algorithms  connectionist  reinforcement  learning.  Machine  Learning，8，229–256.

Williams，R.  J.  and  Zipser，D.（1989）.  A  learning  algorithm  for  continually  running  fully  recurrent  neural  networks.  Neural  Computation，1，270–280.

Wilson，D.  R.  and  Martinez，T.  R.（2003）.  The  general  inefficiency  of  batch  training  for  gradient  descent  learning.  Neural  Networks，16（10），1429–1451.

Wilson，J.  R.（1984）.  Variance  reduction  techniques  for  digital  simulation.  American  Journal  of  Mathematical  and  Management  Sciences，4（3），277–312.

Wiskott，L.  and  Sejnowski，T.  J.（2002）.  Slow  feature  analysis:  Unsupervised  learning  of  invari-ances.  Neural  Computation，14（4），715–770.

Wolpert，D.  and  MacReady，W.（1997）.  No  free  lunch  theorems  for  optimization.  IEEE  Transactions  on  Evolutionary  Computation，1，67–82.

Wolpert，D.  H.（1996）.  The  lack  of  a  priori  distinction  between  learning  algorithms.  Neural  Computation，8（7），1341–1390.

Wu，R.，Yan，S.，Shan，Y.，Dang，Q.，and  Sun，G.（2015）.  Deep  image:  Scaling  up  image  recognition.  arXiv:1501.02876.

Wu，Z.（1997）.  Global  continuation  for  distance  geometry  problems.  SIAM  Journal  of  Optimization，7，814–836.

Xiong，H.  Y.，Barash，Y.，and  Frey，B.  J.（2011）.  Bayesian  prediction  of  tissue-regulated  splicing  using  RNA  sequence  and  cellular  context.  Bioinformatics，27（18），2554–2562.

Xu，K.，Ba，J.  L.，Kiros，R.，Cho，K.，Courville，A.，Salakhutdinov，R.，Zemel，R.  S.，and  Bengio，Y.（2015）.  Show，attend  and  tell:  Neural  image  caption  generation  with  visual  attention.  In  ICML'2015，arXiv:1502.03044.

Yildiz，I.  B.，Jaeger，H.，and  Kiebel，S.  J.（2012）.  Re-visiting  the  echo  state  property.  Neural  networks，35，1–9.

Yosinski，J.，Clune，J.，Bengio，Y.，and  Lipson，H.（2014）.  How  transferable  are  features  in  deep  neural  networks?  In  NIPS  27，pages  3320–3328.  Curran  Associates，Inc.

Younes，L.（1998）.  On  the  convergence  of  Markovian  stochastic  algorithms  with  rapidly  decreasing  ergodicity  rates.  In  Stochastics  and  Stochastics  Models，pages  177–228.

Yu，D.，Wang，S.，and  Deng，L.（2010）.  Sequential  labeling  using  deep-structured  conditional  randomfields.  IEEE  Journal  of  Selected  Topics  in  Signal  Processing.

Zaremba，W.  and  Sutskever，I.（2014）.  Learning  to  execute.  arXiv  1410.4615.

Zaremba，W.  and  Sutskever，I.（2015）.  Reinforcement  learning  neural  Turing  machines.  arXiv:1505.00521.

Zaslavsky，T.（1975）.  Facing  Up  to  Arrangements:  Face-Count  Formulas  for  Partitions  of  Space  by  Hyperplanes.  Number  no.  154  in  Memoirs  of  the  American  Mathematical  Society.  American  Mathematical  Society.

Zeiler，M.  D.  and  Fergus，R.（2014）.  Visualizing  and  understanding  convolutional  networks.  In  ECCV'14.

Zeiler，M.  D.，Ranzato，M.，Monga，R.，Mao，M.，Yang，K.，Le，Q.，Nguyen，P.，Senior，A.，Vanhoucke，V.，Dean，J.，and  Hinton，G.  E.（2013）.  On  rectified  linear  units  for  speech  processing.  In  ICASSP  2013.

Zhou，B.，Khosla，A.，Lapedriza，A.，Oliva，A.，and  Torralba，A.（2015）.  Object  detectors  emerge  in  deep  scene  CNNs.  ICLR'2015，arXiv:1412.6856.

Zhou，J.  and  Troyanskaya，O.  G.（2014）.  Deep  supervised  and  convolutional  generative  stochastic  network  for  protein  secondary  structure  prediction.  In  ICML'2014.

Zhou，Y.  and  Chellappa，R.（1988）.  Computation  of  opticalflow  using  a  neural  network.  In  Neural  Networks，1988.，IEEE  International  Conference  on，pages  71–78.  IEEE.

Zöhrer，M.  and  Pernkopf，F.（2014）.  General  stochastic  networks  for  classification.  In  NIPS'2014.

索引

绝对值整流absolute  value  rectification

准确率accuracy

声学acoustic

激活函数activation  function

AdaGrad  AdaGrad

对抗adversarial

对抗样本adversarial  example

对抗训练adversarial  training

几乎处处almost  everywhere

几乎必然almost  sure

几乎必然收敛almost  sure  convergence

选择性剪接数据集alternative  splicing  dataset

原始采样ancestral  sampling

退火重要采样annealed  importance  sampling

专用集成电路application-specific  integrated  circuit

近似贝叶斯计算approximate  Bayesian  computa-tion

近似推断approximate  inference

架构architecture

人工智能artificial  intelligence

人工神经网络artificial  neural  network

渐近无偏asymptotically  unbiased

异步随机梯度下降Asynchoronous  Stochastic  Gradient  Descent

异步asynchronous

注意力机制attention  mechanism

属性attribute
自编码器autoencoder

自动微分automatic  differentiation

自动语音识别Automatic  Speech  Recognition

自回归网络auto-regressive  network

反向传播back  propagation

回退back-off

反向传播backprop

通过时间反向传播back-propagation  through  time

词袋bag  of  words

Bagging  bootstrap  aggregating

bandit  bandit

批量batch

批标准化batch  normalization

贝叶斯误差Bayes  error

贝叶斯规则Bayes'  rule

贝叶斯推断Bayesian  inference

贝叶斯网络Bayesian  network

贝叶斯概率Bayesian  probability

贝叶斯统计Bayesian  statistics

基准bechmark

信念网络belief  network

Bernoulli分布Bernoulli  distribution

基准baseline

BFGS  BFGS

偏置bias  in  affine  function

偏差bias  in  statistics

有偏biased

有偏重要采样biased  importance  sampling

偏差biass

二元语法bigram

二元关系binary  relation

二值稀疏编码binary  sparse  coding

比特bit

块坐标下降block  coordinate  descent

块吉布斯采样block  Gibbs  Sampling

玻尔兹曼分布Boltzmann  distribution

玻尔兹曼机Boltzmann  Machine

Boosting  Boosting

桥式采样bridge  sampling

广播broadcasting

磨合Burning-in

变分法calculus  of  variations

容量capacity

级联cascade

灾难遗忘catastrophic  forgetting

范畴分布categorical  distribution

因果因子causal  factor

因果模型causal  modeling

中心差分centered  difference

中心极限定理central  limit  theorem

链式法则chain  rule

混沌chaos

弦chord

弦图chordal  graph

梯度截断clip  gradient

截断梯度clipping  the  gradient

团clique

团势能clique  potential

闭式解closed  form  solution

级联coalesced

编码code

协同过滤collaborativefiltering

列column

列空间column  space

共因common  cause

完全图complete  graph

复杂细胞complex  cell

计算图computational  graph

计算机视觉Computer  Vision

概念漂移concept  drift

条件计算conditional  computation

条件概率conditional  probability

条件独立的conditionally  independent

共轭conjugate

共轭方向conjugate  directions

共轭梯度conjugate  gradient

联结主义connectionism

一致性consistency

约束优化constrained  optimization

特定环境下的独立context-specific  independences

contextual  bandit  contextual  bandit

延拓法continuation  method

收缩contractive

收缩自编码器contractive  autoencoder

对比散度contrastive  divergence

凸优化Convex  optimization

卷积convolution

卷积玻尔兹曼机Convolutional  Boltzmann  Machine

卷积网络convolutional  net

卷积神经网络convolutional  neural  network

坐标上升coordinate  ascent

坐标下降coordinate  descent

共父coparent

相关系数correlation

代价cost
代价函数cost  function

协方差covariance

协方差矩阵covariance  matrix

协方差RBM  covariance  RBM

覆盖coverage

准则criterion

临界点critical  point

临界温度critical  temperatures

互相关函数cross-correlation

交叉熵cross-entropy

累积函数cumulative  function

课程学习curriculum  learning

维数灾难curse  of  dimensionality

曲率curvature

控制论cybernetics

衰减damping

数据生成分布data  generating  distribution

数据生成过程data  generating  process

数据并行data  parallelism

数据点data  point

数据集dataset

数据集增强dataset  augmentation

决策树decision  tree

解码器decoder

分解decompose

深度信念网络deep  belief  network

深度玻尔兹曼机Deep  Boltzmann  Machine

深度回路deep  circuit

深度前馈网络deep  feedforward  network

深度生成模型deep  generative  model

深度学习deep  learning

深度模型deep  model

深度网络deep  network

信任度degree  of  belief

去噪denoising

去噪自编码器denoising  autoencoder

去噪得分匹配denoising  score  matching

依赖dependency

深度depth

导数derivative

描述description

设计矩阵design  matrix

细致平衡detailed  balance

探测级detector  stage

确定性deterministic

对角矩阵diagonal  matrix

微分熵differential  entropy

微分方程differential  equation

降维dimensionality  reduction

Dirac  delta  函数Dirac  delta  function

Dirac  分布dirac  distribution

有向directed

有向图模型directed  graphical  model

有向模型Directed  Model

方向导数directional  derivative

判别RBM  discriminative  RBM

判别器网络discriminator  network

分布式表示distributed  representation

深度神经网络DNN

领域自适应domain  adaption

点积dot  product

双反向传播double  backprop

双重分块循环矩阵doubly  block  circulant  matrix

降采样downsampling

Dropout  Dropout

Dropout  Boosting  Dropout  Boosting

d-分离d-separation

动态规划dynamic  programming

动态结构dynamic  structure

提前终止early  stopping

回声状态网络echo  state  network

有效容量effective  capacity

特征分解eigendecomposition

特征值eigenvalue

特征向量eigenvector

基本单位向量elementary  basis  vectors

元素对应乘积element-wise  product

嵌入embedding

经验分布empirical  distribution

经验频率empirical  frequency

经验风险empirical  risk

经验风险最小化empirical  risk  minimization

编码器encoder

端到端的end-to-end

能量函数energy  function

基于能量的模型Energy-based  model

集成ensemble

集成学习ensemble  learning

轮epoch

轮数epochs

等式约束equality  constraint

均衡分布Equilibrium  Distribution

等变equivariance

等变表示equivariant  representations
误差条error  bar

误差函数error  function

误差度量error  metric

错误率error  rate

估计量estimator

欧几里得范数Euclidean  norm

欧拉-拉格朗日方程Euler-Lagrange  Equation

证据下界evidence  lower  bound

样本example

额外误差excess  error

期望expectation

期望最大化expectation  maximization

E步expectation  step

期望值expected  value

经验experience

专家网络expert  network

相消解释explaining  away

相消解释作用explaining  away  effect

解释因子explanatory  factort

梯度爆炸exploding  gradient

开发exploitation

探索exploration

指数分布exponential  distribution

因子factor

因子分析factor  analysis

因子图factor  graph

因子factorial

分解factorization

分解的factorized

变差因素factors  of  variation

快速Dropout  fast  dropout

快速持续性对比散度fast  persistent  contrastive  di-vergence

可行feasible

特征feature

特征提取器feature  extractor

特征映射feature  map

特征选择feature  selection

反馈feedback

前向feedforward

前馈分类器feedforward  classifier

前馈网络feedforward  network

前馈神经网络feedforward  neural  network

现场可编程门阵列field  programmable  gated  array

精调fine-tune

精调fine-tuning

有限差分finite  difference

第一层first  layer

不动点方程fixed  point  equation

定点运算fixed-point  arithmetic

翻转flip

浮点运算float-point  arithmetic

遗忘门forget  gate

前向传播forward  propagation

傅里叶变换Fourier  transform

中央凹fovea

自由能free  energy

频率派概率frequentist  probability

频率派统计frequentist  statistics

Frobenius范数Frobenius  norm

F分数F-score

全full

泛函functional

泛函导数functional  derivative

Gabor函数Gabor  function

Gamma分布Gamma  distribution

门控gated

门控循环网络gated  recurrent  net

门控循环单元gated  recurrent  unit

门控RNN  gated  RNN

选通器gater

高斯分布Gaussian  distribution

高斯核Gaussian  kernel

高斯混合模型Gaussian  Mixture  Model

高斯混合体Gaussian  mixtures

高斯输出分布Gaussian  output  distribution

高斯RBM  Gaussian  RBM

Gaussian-Bernoulli  RBM  Gaussian-Bernoulli  RBM

通用GPU  general  purpose  GPU

泛化generalization

泛化误差generalization  error

广义函数generalized  function

广义Lagrange函数generalized  Lagrange  function

广义Lagrangian  generalized  Lagrangian

广义伪似然generalized  pseudolikelihood

广义伪似然估计generalized  pseudolikelihood  esti-mator

广义得分匹配generalized  score  matching

生成式对抗框架generative  adversarial  framework

生成式对抗网络generative  adversarial  network

生成模型generative  model

生成式建模generative  modeling

生成矩匹配网络generative  moment  matching  net-work

生成随机网络generative  stochastic  network

生成器网络generator  network

吉布斯分布Gibbs  distribution

Gibbs采样Gibbs  Sampling

吉布斯步数Gibbs  steps

全局对比度归一化Global  contrast  normalization

全局极小值global  minima

全局最小点global  minimum

梯度gradient

梯度上升gradient  ascent

梯度截断gradient  clipping

梯度下降gradient  descent

图模型graphical  model

图形处理器Graphics  Processing  Unit

贪心greedy

贪心算法greedy  algorithm

贪心逐层预训练greedy  layer-wise  pretraining

贪心逐层训练greedy  layer-wise  training

贪心逐层无监督预训练greedy  layer-wise  unsuper-vised  pretraining

贪心监督预训练greedy  supervised  pretraining

贪心无监督预训练greedy  unsupervised  pretraining

网格搜索grid  search

Hadamard乘积Hadamard  product

汉明距离Hamming  distance

硬专家混合体hard  mixture  of  experts

硬双曲正切函数hard  tanh

簧风琴harmonium

哈里斯链Harris  Chain

Helmholtz机Helmholtz  machine

Hessian  Hessian

异方差heteroscedastic

隐藏层hidden  layer

隐马尔可夫模型Hidden  Markov  Model

隐藏单元hidden  unit

隐藏变量hidden  variable

爬山hill  climbing

超参数hyperparameter

超参数优化hyperparameter  optimization

假设空间hypothesis  space

同分布的identically  distributed

可辨认的identifiable

单位矩阵identity  matrix

独立同分布假设i.i.d.  assumption

病态ill  conditioning

不道德immorality

重要采样Importance  Sampling

相互独立的independent

独立成分分析independent  component  analysis

独立同分布independent  identically  distributed

独立子空间分析independent  subspace  analysis

索引index  of  matrix

指示函数  indicator  function

不等式约束inequality  constraint

推断inference

无限infinite

信息检索information  retrieval

内积inner  product

输入input

输入分布input  distribution

干预查询intervention  query

不变invariant

求逆invert

Isomap  Isomap

各向同性isotropic

Jacobian  Jacobian

Jacobian矩阵Jacobian  matrix

联合概率分布joint  probability  distribution

Karush-Kuhn-Tucker  Karush-Kuhn-Tucker

核函数kernel  function

核机器kernel  machine

核方法kernel  method

核技巧kernel  trick

KL散度KL  divergence

知识库knowledge  base

知识图谱knowledge  graph

Krylov方法Krylov  method

KL散度Kullback-Leibler（KL）  divergence

标签label

标注labeled

拉格朗日乘子Lagrange  multiplier

语言模型language  model

Laplace分布Laplace  distribution

大学习步骤large  learning  step

潜在latent

潜层latent  layer

潜变量latent  variable

大数定理Law  of  large  number

逐层的layer-wise

L-BFGS  L-BFGS

渗漏整流线性单元Leaky  ReLU

渗漏单元leaky  unit

学成learned

学习近似推断learned  approximate  inference

学习器learner

学习率learning  rate

勒贝格可积Lebesgue-integrable

左特征向量left  eigenvector

左奇异向量left  singular  vector

莱布尼兹法则Leibniz's  rule

似然likelihood

线搜索line  search

线性自回归网络linear  auto-regressive  network

线性分类器linear  classifier

线性组合linear  combination

线性相关linear  dependence

线性因子模型linear  factor  model

线性模型linear  model

线性回归linear  regression

线性阈值单元linear  threshold  units

线性无关linearly  independent

链接预测link  prediction

链接重要采样linked  importance  sampling

Lipschitz  Lipschitz

Lipschitz常数Lipschitz  constant

Lipschitz连续Lipschitz  continuous

流体状态机liquid  state  machine

局部条件概率分布local  conditional  probability  dis-tribution

局部不变性先验local  constancy  prior

局部对比度归一化local  contrast  normalization

局部下降local  descent

局部核local  kernel

局部极大值local  maxima

局部极大点local  maximum

局部极小值local  minima

局部极小点local  minimum

对数尺度logarithmic  scale

逻辑回归logistic  regression

logistic  sigmoid  logistic  sigmoid

分对数logit

对数线性模型log-linear  model

长短期记忆long  short-term  memory

长期依赖long-term  dependency

环loop

环状信念传播loopy  belief  propagation

损失loss

损失函数loss  function

机器学习machine  learning

机器学习模型machine  learning  model

机器翻译machine  translation

主对角线main  diagonal

流形manifold

流形假设manifold  hypothesis

流形学习manifold  learning

边缘概率分布marginal  probability  distribution

马尔可夫链Markov  Chain

马尔可夫链蒙特卡罗Markov  Chain  Monte  Carlo

马尔可夫网络Markov  network

马尔可夫随机场Markov  randomfield

掩码mask

矩阵matrix

矩阵逆matrix  inversion

矩阵乘积matrix  product

最大范数max  norm

池pool

最大池化max  pooling

极大值maxima

M步maximization  step

最大后验Maximum  A  Posteriori

最大似然maximum  likelihood

最大似然估计maximum  likelihood  estimation

最大平均偏差maximum  mean  discrepancy

maxout  maxout

maxout单元maxout  unit

平均绝对误差mean  absolute  error

均值和协方差RBM  mean  and  covariance  RBM

学生t分布均值乘积mean  product  of  Student  t-distribution

均方误差mean  squared  error

均值-协方差RBM  mean-covariance  restricted  Boltzmann  machine

均匀场meanfield

均值场mean-field

测度论measure  theory

零测度measure  zero

记忆网络memory  network

信息传输message  passing

小批量minibatch

小批量随机minibatch  stochastic

极小值minima

极小点minimum

混合mixing

混合时间mixing  Time

混合密度网络mixture  density  network

混合分布mixture  distribution

专家混合体mixture  of  experts

模态modality

峰值mode

模型model

模型平均model  averaging

模型压缩model  compression

模型可辨识性model  identifiability

模型并行model  parallelism

矩moment

矩匹配moment  matching

动量momentum

蒙特卡罗Monte  Carlo

Moore-Penrose伪逆Moore-Penrose  pseudoinverse

道德化moralization

道德图moralized  graph

多层感知机multilayer  perceptron

多峰值multimodal

多模态学习multimodal  learning

多项式分布multinomial  distribution

Multinoulli分布multinoulli  distribution

多预测深度玻尔兹曼机multi-prediction  deep  Boltzmann  machine

多任务学习multitask  learning

多维正态分布multivariate  normal  distribution

朴素贝叶斯naive  Bayes

奈特nats

自然语言处理Natural  Language  Processing

最近邻nearest  neighbor

最近邻图nearest  neighbor  graph

最近邻回归nearest  neighbor  regression

负定negative  definite

负部函数negative  part  function

负相negative  phase

半负定negative  semidefinite

Nesterov动量Nesterov  momentum

网络network

神经自回归密度估计器neural  auto-regressive  den-sity  estimator

神经自回归网络neural  auto-regressive  network

神经语言模型Neural  Language  Model

神经机器翻译Neural  Machine  Translation

神经网络neural  network

神经网络图灵机neural  Turing  machine

牛顿法Newton's  method

n-gram  n-gram

没有免费午餐定理no  free  lunch  theorem

噪声noise

噪声分布noise  distribution

噪声对比估计noise-contrastive  estimation

非凸nonconvex

非分布式nondistributed

非分布式表示nondistributed  representation

非线性共轭梯度nonlinear  conjugate  gradients

非线性独立成分估计nonlinear  independent  com-ponents  estimation

非参数non-parametric

范数norm

正态分布normal  distribution

正规方程normal  equation

归一化的normalized

标准初始化normalized  initialization

数值numeric  value

数值优化numerical  optimization

对象识别object  recognition

目标objective

目标函数objective  function

奥卡姆剃刀Occam's  razor

one-hot  one-hot

一次学习one-shot  learning

在线online

在线学习online  learning

操作operation

最优容量optimal  capacity

原点origin

正交orthogonal

正交矩阵orthogonal  matrix

标准正交orthonormal

输出output

输出层output  layer

过完备overcomplete

过估计overestimation

过拟合overfitting

过拟合机制overfitting  regime

上溢overflow

并行分布式处理Parallel  Distributed  Processing

并行回火parallel  tempering

参数parameter

参数服务器parameter  server

参数共享parameter  sharing

有参情况parametric  case

参数化整流线性单元parametric  ReLU

偏导数partial  derivative

配分函数Partition  Function

性能度量performance  measures

性能度量performance  metrics

置换不变性permutation  invariant

持续性对比散度persistent  contrastive  divergence

音素phoneme

语音phonetic

分段piecewise

点估计point  estimator

策略policy

策略梯度policy  gradient

池化pooling

池化函数pooling  function

病态条件poor  conditioning

正定positive  definite

正部函数positive  part  function

正相positive  phase

半正定positive  semidefinite

后验概率posterior  probability

幂方法power  method

PR曲线PR  curve

精度precision

精度矩阵precision  matrix

预测稀疏分解predictive  sparse  decomposition

预训练pretraining

初级视觉皮层primary  visual  cortex

主成分分析principal  components  analysis

先验概率prior  probability

先验概率分布prior  probability  distribution

概率PCA  probabilistic  PCA

概率密度函数probability  density  function

概率分布probability  distribution

概率质量函数probability  mass  function

专家之积product  of  expert

乘法法则product  rule

成比例proportional

提议分布proposal  distribution

伪似然pseudolikelihood

象限对quadrature  pair

量子力学quantum  mechanics

径向基函数radial  basis  function

随机搜索random  search

随机变量random  variable

值域range

比率匹配ratio  matching

召回率recall

接受域receptivefield

再循环recirculation

推荐系统recommender  system

重构reconstruction

重构误差reconstruction  error

整流线性rectified  linear

整流线性变换rectified  linear  transformation

整流线性单元rectified  linear  unit

整流网络rectifier  network

循环recurrence

循环卷积网络recurrent  convolutional  network

循环网络recurrent  network

循环神经网络recurrent  neural  network

回归regression

正则化regularization

正则化regularize

正则化项regularizer

强化学习reinforcement  learning

关系relation

关系型数据库relational  database

重参数化reparametrization

重参数化技巧reparametrization  trick

表示representation

表示学习representation  learning

表示容量representational  capacity

储层计算reservoir  computing

受限玻尔兹曼机Restricted  Boltzmann  Machine

反向相关reverse  correlation

反向模式累加reverse  mode  accumulation

岭回归ridge  regression

右特征向量right  eigenvector

右奇异向量right  singular  vector

风险risk

行row

扫视saccade

鞍点saddle  point

无鞍牛顿法saddle-free  Newton  method

相同same

样本均值sample  mean

样本方差sample  variance

饱和saturate

标量scalar

得分score

得分匹配score  matching

二阶导数second  derivative

二阶导数测试second  derivative  test

第二层second  layer

二阶方法second-order  method

自对比估计self-contrastive  estimation

自信息self-information

语义哈希semantic  hashing

半受限玻尔兹曼机semi-restricted  Boltzmann  Ma-chine

半监督semi-supervised

半监督学习semi-supervised  learning

可分离的separable

分离的separate

分离separation

情景setting

浅度回路shadow  circuit

香农熵Shannon  entropy

香农shannons

塑造shaping

短列表shortlist

sigmoid  sigmoid

sigmoid信念网络sigmoid  Belief  Network

简单细胞simple  cell

奇异的singular

奇异值singular  value

奇异值分解singular  value  decomposition

奇异向量singular  vector

跳跃连接skip  connection

慢特征分析slow  feature  analysis

慢性原则slowness  principle

平滑smoothing

平滑先验smoothness  prior

softmax  softmax

softmax函数softmax  function

softmax单元softmax  unit

softplus  softplus

softplus函数softplus  function

生成子空间span

稀疏sparse

稀疏激活sparse  activation

稀疏编码sparse  coding

稀疏连接sparse  connectivity

稀疏初始化sparse  initialization

稀疏交互sparse  interactions

稀疏权重sparse  weights

谱半径spectral  radius

语音识别Speech  Recognition

sphering  sphering

尖峰和平板spike  and  slab

尖峰和平板RBM  spike  and  slab  RBM

虚假模态spurious  modes

方阵square

标准差standard  deviation

标准差standard  error

标准正态分布standard  normal  distribution

声明statement

平稳的stationary

平稳分布Stationary  Distribution

驻点stationary  point

统计效率statistic  efficiency

统计学习理论statistical  learning  theory

统计量statistics

最陡下降steepest  descent

随机stochastic

随机课程stochastic  curriculum

随机梯度上升Stochastic  Gradient  Ascent

随机梯度下降stochastic  gradient  descent

随机矩阵Stochastic  Matrix

随机最大似然stochastic  maximum  likelihood

流stream

步幅stride

结构学习structure  learning

结构化概率模型structured  probabilistic  model

结构化变分推断structured  variational  inference

亚原子subatomic

子采样subsample

求和法则sum  rule

和–积网络sum-product  network

监督supervised

监督学习supervised  learning

监督学习算法supervised  learning  algorithm

监督模型supervised  model

监督预训练supervised  pretraining

支持向量support  vector

代理损失函数surrogate  loss  function

符号symbol

符号表示symbolic  representation

对称symmetric

切面距离tangent  distance

切平面tangent  plane

正切传播tangent  prop

目标  target

泰勒taylor

导师驱动过程teacher  forcing

温度temperature

回火转移tempered  transition

回火tempering

张量tensor

测试误差test  error

测试集test  set

碰撞情况the  collider  case

绑定的权重tied  weights

Tikhonov正则Tikhonov  regularization

平铺卷积tiled  convolution

时延神经网络time  delay  neural  network

时间步time  step

Toeplitz矩阵Toeplitz  matrix

标记token

容差tolerance

地质ICA  topographic  ICA

训练误差training  error

训练集training  set

转录transcribe

转录系统transcription  system

迁移学习transfer  learning

转移transition

转置transpose

三角不等式triangle  inequality

三角形化triangulate

三角形化图triangulated  graph

三元语法trigram

无偏unbiased

无偏样本方差unbiased  sample  variance

欠完备undercomplete

欠定的underdetermined

欠估计underestimation

欠拟合underfitting

欠拟合机制underfitting  regime

下溢underflow

潜在underlying

潜在成因underlying  cause

无向undirected

无向模型undirected  model

展开图unfolded  graph

展开unfolding

均匀分布uniform  distribution

一元语法unigram

单峰值unimodal

单元unit

单位范数unit  norm

单位向量unit  vector

万能近似定理universal  approximation  theorem

万能近似器universal  approximator

万能函数近似器universal  function  approximator

未标注unlabeled

未归一化概率函数unnormalized  probability  func-tion

非共享卷积unshared  convolution

无监督unsupervised

无监督学习unsupervised  learning

无监督学习算法unsupervised  learning  algorithm

无监督预训练unsupervised  pretraining

有效valid

验证集validation  set

梯度消失与爆炸问题vanishing  and  exploding  gra-dient  problem

梯度消失vanishing  gradient

Vapnik-Chervonenkis维度Vapnik-Chervonenkis  dimension

变量消去variable  elimination

方差variance

方差减小variance  reduction

变分自编码器variational  auto-encoder

变分导数variational  derivative

变分自由能variational  free  energy

变分推断variational  inference

向量vector

虚拟对抗样本virtual  adversarial  example

虚拟对抗训练virtual  adversarial  training

可见层visible  layer

V-结构V-structure

醒眠wake  sleep

warp  warp

支持向量机support  vector  machine

无向图模型undirected  graphical  model

权重weight

权重衰减weight  decay

权重比例推断规则weight  scaling  inference  rule

权重空间对称性weight  space  symmetry

条件概率分布conditional  probability  distribution

白化whitening

宽度width

赢者通吃winner-take-all

正切传播tangent  propagation

流形正切分类器manifold  tangent  classifier

词嵌入word  embedding

词义消歧word-sense  disambiguation

零数据学习zero-data  learning

零次学习zero-shot  learning

诚邀读者

人民邮电出版