What Types of Insights Are Possible
Many people say machine learning models are "black boxes": they can make good predictions, but you can't understand the logic behind those predictions. This is true only in the sense that most data scientists don't yet know how to extract insights from models.
However, this micro-course will teach you techniques to extract the following insights from sophisticated machine learning models:
- Which features in the data did the model consider most important?
- For any single prediction from a model, how did each feature in the data affect that particular prediction?
- How does each feature affect the model's predictions in a big-picture sense (what is its typical effect when considered over a large number of possible predictions)?
Why Are These Insights Valuable
These insights have many uses, including:
- Debugging
- Informing feature engineering
- Directing future data collection
- Informing human decision-making
- Building trust
Debugging
The world has a lot of unreliable, disorganized and generally dirty data. Writing preprocessing code adds another potential source of errors. Add in the potential for target leakage, and it is the norm rather than the exception to have errors at some point in a real data science project.
Given the frequency and potentially disastrous consequences of bugs, debugging is one of the most valuable skills in data science. Understanding the patterns a model is finding will help you identify when those are at odds with your knowledge of the real world, and this is typically the first step in tracking down bugs.
Informing Feature Engineering
Feature engineering is usually the most effective way to improve model accuracy. Feature engineering usually involves repeatedly creating new features using transformations of your raw data or features you have previously created.
Sometimes you can go through this process using nothing but intuition about the underlying topic. But you'll need more direction when you have hundreds of raw features or when you lack background knowledge about the topic you are working on.
A Kaggle competition to predict loan defaults gives an extreme example. This competition had hundreds of raw features. For privacy reasons, the features had names like f1, f2, f3 rather than common English names. This simulated a scenario where you have little intuition about the raw data.
One competitor found that the difference between two of the features, specifically f527 - f528, created a very powerful new feature. Models including that difference as a feature were far better than models without it. But how might you think of creating this variable when you start with hundreds of variables?
The techniques you'll learn in this micro-course would make it transparent that f527 and f528 are important features, and that their roles are tightly entangled. This will direct you to consider transformations of these two variables, and likely find the "golden feature" of f527 - f528.
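To see why a difference of two features can matter more than either feature alone, here is a minimal sketch on synthetic data. It is not the competition data or the competitor's method: the names f_a and f_b are hypothetical stand-ins for f527 and f528, and the assumption is that both share a large common component that hides a small signal driving the target.

```python
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
n = 2000

# Assumption: a large shared component dominates both raw features,
# while a much smaller signal is what actually determines the target.
shared = [rng.gauss(0, 10) for _ in range(n)]
signal = [rng.gauss(0, 1) for _ in range(n)]

f_a = [s + g for s, g in zip(shared, signal)]   # hypothetical stand-in for f527
f_b = shared                                    # hypothetical stand-in for f528
target = [1 if g > 0 else 0 for g in signal]    # synthetic binary outcome

# The engineered feature: subtracting removes the shared component.
diff = [a - b for a, b in zip(f_a, f_b)]

corr_a = pearson(f_a, target)
corr_b = pearson(f_b, target)
corr_diff = pearson(diff, target)

print(f"f_a alone:  {corr_a:+.3f}")
print(f"f_b alone:  {corr_b:+.3f}")
print(f"f_a - f_b:  {corr_diff:+.3f}")
```

In this setup each raw feature is only weakly related to the target, while the difference is strongly related. A model that can only look at one variable at a time would struggle here, which is why it helps to learn that two features are entangled.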
As an increasing number of datasets start with hundreds or thousands of raw features, this approach is becoming increasingly important.
Directing Future Data Collection
You have no control over datasets you download online. But many businesses and organizations using data science have opportunities to expand the types of data they collect. Collecting new types of data can be expensive or inconvenient, so they only want to do it if they know it will be worthwhile. Model-based insights give you a good understanding of the value of the features you currently have, which will help you reason about which new types of data may be most helpful.
Informing Human Decision-Making
Some decisions are made automatically by models. Amazon doesn't have humans (or elves) scurry to decide what to show you whenever you go to their website. But many important decisions are made by humans. For these decisions, insights can be more valuable than predictions.
Building Trust
Many people won't assume they can trust your model for important decisions without verifying some basic facts. This is a smart precaution given the frequency of data errors. In practice, showing insights that fit their general understanding of the problem will help build trust, even among people without deep knowledge of data science.
Keep Going
The first technique you'll learn is permutation importance.
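As a rough preview, permutation importance is commonly described as shuffling one feature column at a time and measuring how much the model's error grows. The sketch below follows that description using only the standard library. The predict function is a hypothetical stand-in for a trained model, the data is synthetic, and the helper names are illustrative rather than any library's API.

```python
import random

def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def predict(row):
    # Hypothetical "trained model": leans heavily on feature 0,
    # slightly on feature 1, and ignores feature 2 entirely.
    return 3.0 * row[0] + 0.5 * row[1]

def permutation_importance(rows, y, n_features, rng):
    """Increase in error after shuffling each feature column in turn."""
    baseline = mean_squared_error(y, [predict(r) for r in rows])
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        error = mean_squared_error(y, [predict(r) for r in permuted])
        importances.append(error - baseline)
    return importances

rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
y = [3.0 * r[0] + 0.5 * r[1] + rng.gauss(0, 0.1) for r in rows]

importances = permutation_importance(rows, y, 3, rng)
for j, score in enumerate(importances):
    print(f"feature {j}: error increase {score:.3f}")
```

Shuffling the feature the model ignores leaves the predictions unchanged, so its score is zero, while the feature the model relies on most shows the largest error increase.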