Flashield's Blog

Just For My Daily Diary

07.exercise-data-leakage [Exercise: Data Leakage]

This notebook is an exercise in the Intermediate Machine Learning course. You can reference the tutorial at this link.


Most people find target leakage very tricky until they've thought about it for a long time.

So, before trying to think about leakage in the housing price example, we'll go through a few examples in other applications. Things will feel more familiar once you come back to a question about house prices.

Setup

The questions below will give you feedback on your answers. Run the following cell to set up the feedback system.

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex7 import *
print("Setup Complete")
Setup Complete

Step 1: The Data Science of Shoelaces

Nike has hired you as a data science consultant to help them save money on shoe materials. Your first assignment is to review a model one of their employees built to predict how many shoelaces they'll need each month. The features going into the machine learning model include:

  • The current month (January, February, etc.)
  • Advertising expenditures in the previous month
  • Various macroeconomic features (like the unemployment rate) as of the beginning of the current month
  • The amount of leather they ended up using in the current month

The results show the model is almost perfectly accurate if you include the feature about how much leather they used, but only moderately accurate if you leave that feature out. You realize this is because the amount of leather they use is a perfect indicator of how many shoes they produce, which in turn tells you how many shoelaces they need.

Do you think the leather used feature constitutes a source of data leakage? If your answer is "it depends," what does it depend on?

After you have thought about your answer, check it against the solution below.

# Check your answer (Run this code cell to receive credit!)
q_1.check()

Correct:

This is tricky, and it depends on details of how the data is collected (which is common when thinking about leakage). Is the amount of leather to be used in a month decided at the beginning of that month? If so, the feature is fine. But if the amount is determined during the month, you would not have access to it when you make the prediction. If you have a guess at the beginning of the month and it is subsequently changed during the month, the actual amount used during the month cannot be used as a feature (because it causes leakage).
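One way to make the timing question concrete is to record, for every feature, the date on which its value becomes known and compare that with the date of the prediction. The sketch below is only an illustration: the feature names, the dates, and the `usable_features` helper are all made up for this example and are not part of the exercise.

```python
from datetime import date

# Hypothetical metadata: the date on which each feature's value becomes known
# for the month being predicted (June 2024 in this made-up example).
FEATURE_KNOWN_ON = {
    "current_month": date(2024, 6, 1),
    "ad_spend_prev_month": date(2024, 6, 1),
    "unemployment_rate_month_start": date(2024, 6, 1),
    "leather_used_current_month": date(2024, 6, 30),  # assumed: only known at month end
}


def usable_features(known_on, prediction_date):
    """Return the features whose values already exist when the prediction is made."""
    return sorted(name for name, known in known_on.items() if known <= prediction_date)


prediction_date = date(2024, 6, 1)  # assumed: we predict at the start of the month
safe = usable_features(FEATURE_KNOWN_ON, prediction_date)
leaky = sorted(set(FEATURE_KNOWN_ON) - set(safe))

print("safe:", safe)
print("leaky:", leaky)
```

Under these assumptions only the leather-used feature is flagged. If the leather amount were fixed on the first of the month instead, changing its date in the dictionary would move it to the safe list, which is exactly the "it depends" in the answer.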

Step 2: Return of the Shoelaces

You have a new idea. You could use the amount of leather Nike ordered (rather than the amount they actually used) leading up to a given month as a predictor in your shoelace model.

Does this change your answer about whether there is a leakage problem? If you answer "it depends," what does it depend on?

# Check your answer (Run this code cell to receive credit!)
q_2.check()

Correct:

This could be fine, but it depends on whether they order shoelaces first or leather first. If they order shoelaces first, you won't know how much leather they've ordered when you predict their shoelace needs. If they order leather first, then you'll have that number available when you place your shoelace order, and you should be ok.

Step 3: Getting Rich With Cryptocurrencies?

You saved Nike so much money that they gave you a bonus. Congratulations.

Your friend, who is also a data scientist, says he has built a model that will let you turn your bonus into millions of dollars. Specifically, his model predicts the price of a new cryptocurrency (like Bitcoin, but a newer one) one day ahead of the moment of prediction. His plan is to purchase the cryptocurrency whenever the model says the price of the currency (in dollars) is about to go up.

The most important features in his model are:

  • Current price of the currency
  • Amount of the currency sold in the last 24 hours
  • Change in the currency price in the last 24 hours
  • Change in the currency price in the last 1 hour
  • Number of new tweets in the last 24 hours that mention the currency

The value of the cryptocurrency in dollars has fluctuated up and down by over $100 in the last year, and yet his model's average error is less than $1. He says this is proof his model is accurate, and you should invest with him, buying the currency whenever the model says it is about to go up.

Is he right? If there is a problem with his model, what is it?

# Check your answer (Run this code cell to receive credit!)
q_3.check()

Correct:

There is no source of leakage here. These features should be available at the moment you want to make a prediction, and they're unlikely to be changed in the training data after the prediction target is determined. But the way he describes accuracy could be misleading if you aren't careful. If the price moves gradually, today's price will be an accurate predictor of tomorrow's price, but it may not tell you whether it's a good time to invest. For instance, if it is $100 today, a model predicting a price of $100 tomorrow may seem accurate, even if it can't tell you whether the price is going up or down from the current price. A better prediction target would be the change in price over the next day. If you can consistently predict whether the price is about to go up or down (and by how much), you may have a winning investment opportunity.
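The point about misleading accuracy can be illustrated with a small simulation. The sketch below is not the friend's model: it uses a made-up random-walk price series (an assumption chosen only because it moves gradually) and a trivial "model" that predicts tomorrow's price to be today's price.

```python
import random

random.seed(0)

# Made-up price series that drifts gradually (a random walk starting at $100).
prices = [100.0]
for _ in range(365):
    prices.append(prices[-1] + random.gauss(0, 1.0))

# Trivial "model": tomorrow's price is predicted to equal today's price.
today, tomorrow = prices[:-1], prices[1:]
mae = sum(abs(t1 - t0) for t0, t1 in zip(today, tomorrow)) / len(today)
price_range = max(prices) - min(prices)

# The same predictions, scored against the change in price instead of the level.
changes = [t1 - t0 for t0, t1 in zip(today, tomorrow)]
predicted_changes = [0.0] * len(changes)  # "same as today" means "no change"
change_mae = sum(abs(c - p) for c, p in zip(changes, predicted_changes)) / len(changes)
typical_move = sum(abs(c) for c in changes) / len(changes)

print(f"price range over the year: {price_range:.2f}")
print(f"error on price level:      {mae:.2f}")
print(f"error on price change:     {change_mae:.2f} (typical daily move: {typical_move:.2f})")
```

The error on the price level looks small next to the yearly range, yet when the target is the daily change, the same predictions miss by the full size of a typical daily move and never say whether the price will go up or down.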

Step 4: Preventing Infections

An agency that provides healthcare wants to predict which patients from a rare surgery are at risk of infection, so it can alert the nurses to be especially careful when following up with those patients.

You want to build a model. Each row in the modeling dataset will be a single patient who received the surgery, and the prediction target will be whether they got an infection.

Some surgeons may do the procedure in a manner that raises or lowers the risk of infection. But how can you best incorporate the surgeon information into the model?

You have a clever idea.

  1. Take all surgeries by each surgeon and calculate that surgeon's infection rate.
  2. For each patient in the data, find out who the surgeon was and plug in that surgeon's average infection rate as a feature.

Does this pose any target leakage issues?

Does it pose any train-test contamination issues?

# Check your answer (Run this code cell to receive credit!)
q_4.check()

Correct:

This poses a risk of both target leakage and train-test contamination (though you may be able to avoid both if you are careful).

You have target leakage if a given patient's outcome contributes to the infection rate for their surgeon, which is then plugged back into the prediction model for whether that patient becomes infected. You can avoid target leakage if you calculate the surgeon's infection rate by using only the surgeries before the patient we are predicting for. Calculating this for each surgery in your training data may be a little tricky.

You also have a train-test contamination problem if you calculate this using all surgeries a surgeon performed, including those from the test set. The result would be that your model could look very accurate on the test set, even if it wouldn't generalize well to new patients after the model is deployed. This would happen because the surgeon-risk feature accounts for data in the test set. Test sets exist to estimate how the model will do when seeing new data. So this contamination defeats the purpose of the test set.
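The "only the surgeries before the patient" calculation can be sketched in a few lines. This is a minimal illustration on made-up data, assuming the rows are already sorted by surgery date; the `prior_infection_rates` helper and the surgeon labels are hypothetical.

```python
from collections import defaultdict

# Made-up surgeries, assumed to be sorted by date: (surgeon, infected)
surgeries = [
    ("A", 0), ("B", 1), ("A", 1), ("A", 0), ("B", 0), ("A", 1),
]


def prior_infection_rates(rows):
    """For each surgery, the surgeon's infection rate over EARLIER surgeries only."""
    counts = defaultdict(int)      # surgeries seen so far, per surgeon
    infections = defaultdict(int)  # infections seen so far, per surgeon
    rates = []
    for surgeon, infected in rows:
        n = counts[surgeon]
        # The current patient's outcome is not included; None means no history yet.
        rates.append(infections[surgeon] / n if n else None)
        counts[surgeon] += 1
        infections[surgeon] += infected
    return rates


rates = prior_infection_rates(surgeries)
for (surgeon, infected), rate in zip(surgeries, rates):
    print(surgeon, infected, rate)
```

Because each row's feature is built before that row's outcome is added to the running totals, no patient's own result feeds into their feature. Applying the same rule in time order also keeps later (test) surgeries out of the features for earlier (training) rows.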

Step 5: Housing Prices

You will build a model to predict housing prices. The model will be deployed on an ongoing basis, to predict the price of a new house when a description is added to a website. Here are four features that could be used as predictors.

  1. Size of the house (in square meters)
  2. Average sales price of homes in the same neighborhood
  3. Latitude and longitude of the house
  4. Whether the house has a basement

You have historical data to train and validate the model.

Which of the features is most likely to be a source of leakage?

# Fill in the line below with one of 1, 2, 3 or 4.
potential_leakage_feature = 2

# Check your answer
q_5.check()

Correct:

2 is the source of target leakage. Here is an analysis for each feature:

  1. The size of a house is unlikely to be changed after it is sold (though technically it's possible). But typically this will be available when we need to make a prediction, and the data won't be modified after the home is sold. So it is pretty safe.

  2. We don't know the rules for when this is updated. If the field is updated in the raw data after a home was sold, and the home's sale is used to calculate the average, this constitutes a case of target leakage. At an extreme, if only one home is sold in the neighborhood, and it is the home we are trying to predict, then the average will be exactly equal to the value we are trying to predict. In general, for neighborhoods with few sales, the model will perform very well on the training data. But when you apply the model, the home you are predicting won't have been sold yet, so this feature won't work the same as it did in the training data.

  3. These don't change, and will be available at the time we want to make a prediction. So there's no risk of target leakage here.

  4. This also doesn't change, and it is available at the time we want to make a prediction. So there's no risk of target leakage here.
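The extreme case described for feature 2 is easy to reproduce on toy data. The sketch below uses made-up sales and a hypothetical `neighborhood_average` helper to compare an average that includes the home's own sale with one built only from earlier sales.

```python
# Made-up past sales, assumed to be in time order: (neighborhood, sale_price)
sales = [
    ("north", 300_000),
    ("north", 340_000),
    ("south", 500_000),  # the only sale in its neighborhood
]


def neighborhood_average(rows, include_own_sale):
    """Average neighborhood price for each row.

    include_own_sale=True averages every sale in the neighborhood (leaky);
    include_own_sale=False uses only sales that came before the row.
    """
    features = []
    for i, (hood, _) in enumerate(rows):
        prices = [p for j, (h, p) in enumerate(rows)
                  if h == hood and (include_own_sale or j < i)]
        features.append(sum(prices) / len(prices) if prices else None)
    return features


leaky = neighborhood_average(sales, include_own_sale=True)
safe = neighborhood_average(sales, include_own_sale=False)

print("leaky:", leaky)
print("safe: ", safe)
```

With the home's own sale included, the feature for the single "south" sale is 500000.0, exactly the price being predicted. Restricted to earlier sales, that row has no value at all, which is what the model would actually see for a home that has not been sold yet.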

# q_5.hint()
# q_5.solution()

Conclusion

Leakage is a hard and subtle issue. You should be proud if you picked up on the issues in these examples.

Now you have the tools to make highly accurate models, and to pick up on the most difficult practical problems that arise when applying these models to real problems.

There is still a lot of room to build knowledge and experience. Try out a Competition or look through our Datasets to practice your new skills.

Again, Congratulations!
