Lynx Roundup, June 26th 2018

Daily roundup of Data Science news around the industry, 6/26/2018.

Matthew Alhonte
Don’t let your ethical judgement go to sleep
We need to build organizations that are self-critical and avoid corporate self-deception.
Connecting the Dots in Early Drug Discovery at Novartis

Bold claim! My gut says it's one of those things that's Wrong But Proving It's Wrong Will Lead To Neat Things.

You may be interested in my new arXiv paper, joint work with Xi Cheng, an undergraduate at UC Davis (now heading to Cornell for grad school); Bohdan Khomtchouk, a postdoc in biology at Stanford; and Pete Mohanty, a Science, Engineering & Education Fellow in statistics at Stanford. The paper is of a provocative nature, and we welcome feedback.

A summary of the paper is:

  • We present a very simple, informal mathematical argument that neural networks (NNs) are in essence polynomial regression (PR). We refer to this as NNAEPR.
  • NNAEPR implies that we can use our knowledge of the “old-fashioned” method of PR to gain insight into how NNs — widely viewed somewhat warily as a “black box” — work inside.
  • One such insight is that the outputs of an NN layer will be prone to multicollinearity, with the problem becoming worse with each successive layer. This in turn may explain why convergence issues often develop in NNs. It also suggests that NN users tend to use overly large networks.
  • NNAEPR suggests that one may abandon using NNs altogether, and simply use PR instead.
  • We investigated this on a wide variety of datasets, and found that in every case PR did as well as, and often better than, NNs.
  • We have developed a feature-rich R package, polyreg, to facilitate using PR in multivariate settings.

Much work remains to be done (see paper), but our results so far are very encouraging. By using PR, one can avoid the headaches of NNs, such as selecting good combinations of tuning parameters, dealing with convergence problems, and so on.

Also available are the slides for our presentation at GRAIL on this project.
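
To get a rough, hands-on feel for the comparison, here is a minimal scikit-learn sketch in Python (not the authors' polyreg R package) that pits a degree-2 polynomial fit against a small multilayer perceptron on a synthetic regression task. The dataset, polynomial degree, and network size are arbitrary choices for illustration.

```python
# Illustrative only: polynomial regression vs. a small neural network
# on a synthetic regression task, scored by test-set R^2.
from sklearn.datasets import make_friedman1
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "Old-fashioned" polynomial regression: expand the features, fit a linear model.
pr = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X_tr, y_tr)

# A small fully connected network for comparison.
nn = make_pipeline(StandardScaler(),
                   MLPRegressor(hidden_layer_sizes=(64, 64),
                                max_iter=5000, random_state=0)).fit(X_tr, y_tr)

print("polynomial regression R^2:", pr.score(X_te, y_te))
print("neural network R^2:       ", nn.score(X_te, y_te))
```

Whether the two land close together on a given dataset is exactly the kind of question the paper invites readers to test on their own data.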

The 5 Clustering Algorithms Data Scientists Need to Know - KDnuggets
Today, we’re going to look at 5 popular clustering algorithms that data scientists need to know and their pros and cons!
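
If you want to try one of these algorithms right away, here is a minimal K-Means sketch in Python with scikit-learn (one of the usual suspects on such lists); the blob dataset and cluster count are just placeholders.

```python
# Minimal K-Means example on a synthetic blob dataset.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)   # one centroid per cluster
print(km.labels_[:10])       # cluster assignment for the first ten points
```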
Instance vs. Static vs. Class Methods in Python: The Important Differences
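
As a quick refresher on the distinction the headline refers to, here is a small illustrative Python class (not taken from the linked article) defining one method of each kind:

```python
class Counter:
    """Toy class showing the three method flavours."""

    total_created = 0                        # class-level state

    def __init__(self, start=0):
        self.value = start                   # per-instance state
        Counter.total_created += 1

    def increment(self, step=1):             # instance method: receives `self`
        self.value += step
        return self.value

    @classmethod
    def how_many(cls):                       # class method: receives `cls`, sees class state
        return cls.total_created

    @staticmethod
    def is_valid_step(step):                 # static method: no implicit first argument
        return isinstance(step, int) and step > 0


c = Counter()
c.increment(3)
print(c.value)                    # 3
print(Counter.how_many())         # 1
print(Counter.is_valid_step(-2))  # False
```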
Roundup

Matthew Alhonte

Supervillain in somebody's action hero movie. Experienced a radioactive freak accident at a young age which rendered him part-snake and strangely adept at Python.