In this series we interview professors and contributors to our fields of passion: computational statistics, data science, and machine learning. The first interviewee is Dr. Ioannis Kosmidis, a Senior Lecturer at University College London; during our first term he taught a workshop on GLMs and statistical learning in the n>>p setting. His research interests also include data-analytic applications in sport science and health.
Nandan: First of all, thank you very much for taking the time to talk to us. We are very eager to learn more from you, so let's get started with our first question: Which papers have fundamentally changed the way you think?
Kosmidis: There are many papers that have deeply affected me, but let me give you my top three:
The first one is a paper by Bradley Efron from 1975. Statisticians have long known that procedures and models that are linear (e.g. canonically-linked exponential family models) have nice properties in learning and inference problems. In his paper, Efron provided a framework that describes how much one can deviate from these linear cases before these good properties start deteriorating. It’s a beautiful paper, which distills elements from 70 years’ worth of statistical thinking. Furthermore, it is accompanied by an amazing discussion, covering topics such as the construction of confidence intervals, second-order efficiency and unbiased estimators. All of these concepts are put under the same umbrella and quantified by one key quantity: the statistical curvature. This is one of the papers that helped me a lot in my own research and made me think hard about the models I use and the impact they have on what I want to achieve with an analysis.
The second paper is well known to graduate students in Data Science: Leo Breiman’s paper about the two cultures in statistics and machine learning – inference and prediction. In that paper, Breiman takes a pro-prediction attitude, arguing that the relevant problems are prediction-based. One of my favorite parts of that paper is actually Cox’s commentary. Cox tries to balance Breiman’s enthusiasm for prediction in a very elegant way. He argues that optimal prediction, as a target, is often dangerous; in many settings in statistics you are faced with the following problem: data can only be collected under certain conditions, but predictions are needed under a different set of conditions. For example, think of modeling the spread of a new dangerous virus. Of course you wouldn’t have past data on its behavior. So without any idea of the dynamics you are trying to model, a black-box prediction algorithm might give you strong predictive results that are actually useless for the question at hand. One of the things that a statistician ought to be good at is recovering data-generating mechanisms, testing scientific hypotheses, and developing the means to do so.
And finally, there is a wonderful report by Donoho called ’50 Years of Data Science’. Starting with the fundamental work by John Tukey, Donoho outlines the history of our craft. He asks fundamental questions about what defines a “Data Scientist” and how Data Science should be taught, and he comes up with a list of courses that a Data Science program should include. For example, I fully agree with his proposal to include a course on “Legal, Policy, and Ethical Considerations for Data Scientists”. Furthermore, a course that takes an in-depth look at the connection between causality and experimental design should find a home in every Data Science program. Clever experimental design is key for good inference.
It is really important to be clear about which question you are trying to answer and to stay open-minded whenever a new data-analytic setting comes in. You can’t answer everything with a neural network. In some cases it is much better to have a model with only a few parameters that actually explains something. In this way, subject-matter experts are able to improve their work. This is much harder with complex black boxes.
Robert: Successful companies such as Google DeepMind, SpaceX and Alphabet Inc. are largely focused on optimizing certain prediction tasks by building complex models (Deep Reinforcement Learning, etc.). Do you think that this hype could pose a potential danger for the state of inference?
Kosmidis: No, not at all. Such developments are hugely important and very inspirational. In order to understand these systems we need complex models. But this is not all. These extremely non-parametric models often give you a baseline, something to try to outperform with a model that encodes common sense and scientific knowledge, and can be used for inference on relevant hypotheses.
Robert: What do you think about p-value manipulation and the fact that so far (almost) only significant effects are being published in economics, social sciences and psychology? Does scientific evolution really apply to those fields?
Kosmidis: In 2011, Jonathan Schooler proposed in a Nature paper to introduce a repository of negative or non-significant results. Whenever you get a non-significant result, one of two things might have happened: there is actually no effect, or you have used the wrong methods to try to discover an effect. If we had a repository of all those analyses, others could revisit and re-attempt them.
A move towards raising awareness of the issues you are referring to is the ASA’s statement on p-values from 2016. The statement outlines six principles on what p-values are and are not, and is endorsed by well-known statisticians.
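Not part of the interview, but the publication-bias point above can be made concrete with a minimal simulation (standard library only; the sample sizes, study count, and 0.05 threshold are illustrative choices of ours): even when the null hypothesis is true in every study, a fixed significance threshold still flags a predictable fraction of them, and if only those ever get published, the literature records effects that do not exist.

```python
import math
import random

random.seed(0)

def one_sample_z_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming known unit variance."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    # P(|Z| >= |z|) for a standard normal, via the complementary error function
    return math.erfc(abs(z) / math.sqrt(2))

n_studies, n_obs, alpha = 10_000, 30, 0.05
false_positives = 0
for _ in range(n_studies):
    # Every study draws from N(0, 1): the null hypothesis is true by construction
    sample = [random.gauss(0, 1) for _ in range(n_obs)]
    if one_sample_z_p(sample) < alpha:
        false_positives += 1

rate = false_positives / n_studies
print(f"'Significant' results under a true null: {rate:.1%}")
```

The rate lands near 5% by construction; a repository of the remaining ~95% of analyses, as Schooler proposes, is exactly what lets readers see that the “significant” studies are the tail of a null distribution rather than evidence of real effects.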
Nandan: Coming to one of your key research areas – data analytics for health and fitness. How do you imagine our data-driven health system evolving in the next 10 to 20 years?
Kosmidis: I imagine that there is going to be a huge revolution in data-analytic solutions for health and fitness. All of us carry many devices that can now communicate very efficiently with each other. They collect data on a regular schedule and can update us on the actions we could take to improve, for example, our fitness. In a sense, the algorithms will be able to map the data that we generate to our idea of the future, and by putting inference and prediction together, you as a user will be able to intervene on yourself, online.
Nandan: Jumping to another topic: What are the most valuable, annoying, or essential attributes of the PhD students you have supervised or hope to supervise? And is there any final advice you can give to young and aspiring Data Science students?
Kosmidis: I believe that the most important quality for a PhD student is curiosity.
My advice: Keep an open mind!
Robert and Nandan: That wraps it up. Again, thank you very much for your time and thoughts. It was a pleasure to learn from you.
Breiman, Leo. “Statistical modeling: The two cultures (with comments and a rejoinder by the author).” Statistical Science 16.3 (2001): 199-231.
Donoho, David. “50 Years of Data Science.” Tukey Centennial Workshop, Princeton, NJ. 2015.
Efron, Bradley. “Defining the curvature of a statistical problem (with applications to second order efficiency).” The Annals of Statistics (1975): 1189-1242.
Schooler, Jonathan. “Unpublished results hide the decline effect.” Nature 470.7335 (2011): 437.
Wasserstein, Ronald L., and Nicole A. Lazar. “The ASA’s statement on p-values: context, process, and purpose.” The American Statistician 70.2 (2016): 129-133.