**Link:** https://arxiv.org/pdf/1602.04938v1.pdf

**Authors:** Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin

I have been busy with finals for the past few days, so this post is to make up for last week. The structure of the old posts is quite boring, so I decided to go “freestyle” in this one.

To begin with, I must say that I was super excited about reading this paper. It emphasizes very important issues we have to consider when deploying machine learning systems in the real world. Machine learning models make predictions, but we humans make decisions. Hence, it is up to us whether or not to use the predictions from a model. This is a matter of **trust**. We have to be careful about trusting machine learning models because either too much or too little trust can be problematic.

“When should I trust a model?” was also the theme of my research last year, which basically measured how much a model deserves to be trusted. My answer to the question is: when the model is calibrated, that is, reliable and aware of its own mistakes. We employ the notion of calibration and calculate the calibration error of various models. Simply put, a model is calibrated if, when we aggregate all instances where the model predicts an event with X% confidence, those events happen X% of the time.
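To make the definition concrete, here is a minimal sketch of one common way to measure calibration: a binned expected calibration error. This is only an illustration; the function name, the bin count, and the toy data are my own assumptions, and it is not necessarily the exact estimator used in the research mentioned above.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then average |accuracy - confidence|
    over the bins, weighting each bin by its share of the predictions."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Toy data (made up): ten predictions at 80% confidence, eight of them right.
calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)
# Same confidence, but only half are right: the model is overconfident.
overconfident = expected_calibration_error([0.8] * 10, [True] * 5 + [False] * 5)
```

In the first toy case the error is essentially zero, because events predicted with 80% confidence do happen 80% of the time. In the second it is about 0.3, the gap between the stated confidence and the actual accuracy.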

This paper answers the question from a different perspective. In essence, it argues that a model appears more trustworthy to us if we know what it is doing. In other words, the model has to be interpretable and its decisions have to be explainable.

*Digression*: I really like how the authors discuss many cases where merely looking at the accuracy metric fools us. One interesting case is when a model is found to make predictions (of some medical condition) that are strongly correlated with the patients’ IDs. This sounds silly and funny but, at the same time, scary to me because this is the first time I have noticed it, which means that there is a non-zero probability that I have done something just as silly and funny in the past. I think many people would have the same feeling. Okay, from now on, remove all identifying information!

The cool thing about this research is that it is model-agnostic and does not modify anything about the targeted model. The proposed solution is to **locally** approximate a black-box model with an interpretable model. Suppose we want to explain the prediction for an instance $x$: we sample data points in the neighborhood of $x$ and weight them according to their proximity to $x$. Then we construct an optimization problem aiming to find an interpretable model that is locally faithful to the explained model with respect to those data points. To force the approximating model to be simple, we also add a regularization term to penalize its complexity.
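The recipe above can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the authors' implementation: `black_box` is a made-up model, the perturbations are Gaussian noise on raw features, the proximity weight is an exponential kernel, and the interpretable model is a linear model fit by weighted ridge regression (the paper works with interpretable representations and a sparse linear model instead).

```python
import math
import random


def black_box(point):
    """Hypothetical opaque model: a nonlinear score squashed into (0, 1)."""
    x1, x2 = point
    return 1.0 / (1.0 + math.exp(-(3.0 * x1 - x2 ** 2)))


def solve(a, b):
    """Solve the linear system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w


def explain_locally(model, x, n_samples=2000, noise=0.5, kernel_width=0.75,
                    ridge=1e-3, seed=0):
    """Fit a proximity-weighted linear surrogate to `model` around `x`.

    Returns [intercept, slope_1, ..., slope_d] of the local linear model.
    """
    rng = random.Random(seed)
    d = len(x)
    # Accumulate the normal equations of weighted least squares.
    a = [[0.0] * (d + 1) for _ in range(d + 1)]
    b = [0.0] * (d + 1)
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, noise) for xi in x]        # sample near x
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        weight = math.exp(-dist2 / kernel_width ** 2)       # closer = heavier
        row = [1.0] + [zi - xi for zi, xi in zip(z, x)]     # centered at x
        y = model(z)                                        # query the black box
        for i in range(d + 1):
            b[i] += weight * row[i] * y
            for j in range(d + 1):
                a[i][j] += weight * row[i] * row[j]
    for i in range(1, d + 1):
        a[i][i] += ridge  # complexity penalty on the slopes, not the intercept
    return solve(a, b)


intercept, w1, w2 = explain_locally(black_box, [0.0, 1.0])
```

Around the point $(0, 1)$ the surrogate should report a positive weight for the first feature and a negative weight for the second, which is the kind of local, human-readable summary an explanation is meant to provide. Note that the black box is only queried, never modified.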

The second contribution of this paper is an algorithm for picking data instances that represent major correlations with the features. This problem can be cast as follows: we are given an $m \times n$ board in which each cell is either 0 or 1, indicating whether some row is correlated with some column, and each column has a weight; the goal is to pick a subset of rows of bounded size (without a budget, picking every row would be trivially optimal) that maximizes the sum of the weights of the columns correlated with at least one row in the subset. The authors propose a greedy solution but, as a person who did algorithms, I smell some matching or max-flow taste here (but I couldn’t figure it out).
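The board formulation above is a weighted coverage problem, and a greedy strategy for it can be sketched as follows. The function name, the toy board, and the weights are made up for illustration; this follows the formulation in this post rather than the authors' code.

```python
def greedy_pick(board, column_weights, budget):
    """Greedily choose up to `budget` rows, each time taking the row that adds
    the most weight from columns that are not covered yet."""
    covered = set()
    chosen = []
    for _ in range(budget):
        best_row, best_gain = None, 0.0
        for r, row in enumerate(board):
            if r in chosen:
                continue
            # Marginal gain: weight of the newly covered columns only.
            gain = sum(column_weights[c] for c, cell in enumerate(row)
                       if cell and c not in covered)
            if gain > best_gain:
                best_row, best_gain = r, gain
        if best_row is None:  # no remaining row adds any new weight
            break
        chosen.append(best_row)
        covered.update(c for c, cell in enumerate(board[best_row]) if cell)
    return chosen, sum(column_weights[c] for c in covered)


# Toy 4 x 4 board (made up): rows are instances, columns are features.
board = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
weights = [3.0, 1.0, 2.0, 4.0]
rows, total = greedy_pick(board, weights, budget=2)  # rows == [0, 2], total == 8.0
```

With a budget of two rows, the greedy choice covers columns 0, 1, and 3 for a total weight of 8. The paper notes that because this coverage objective is submodular, the greedy choice is guaranteed to be within a factor of $1 - 1/e$ of the optimum.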

Evaluating these methods is challenging. The results section is very impressive, especially the experiments with human subjects. In one experiment, they show that non-experts can improve a model substantially in just a few minutes by feature engineering with the help of explanations produced by their method. In another experiment, they present to experts predictions made by a contaminated classifier, which always predicts “wolf” if there is snow in an image and “husky” otherwise. The experts believe the model at first but, after seeing the explanations, their trust goes down.

I wish there were more space for the authors to say more about their experimental setups. In many experiments, they construct a mysterious “truly important” feature set without saying what it contains or how and why it was selected. This makes their evaluation a little less transparent.

Overall, this is a very thorough piece of research. It manages to answer many open questions in this line of work at the same time. I hope research on assessing and interpreting machine learning models will receive more attention in the future.