You've built a model. But how good is it?
In this lesson, you will learn to use model validation to measure the quality of your model. Measuring model quality is the key to iteratively improving your models.
What is Model Validation
You'll want to evaluate almost every model you ever build. In most (though not all) applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to what actually happens?
Many people make a huge mistake when measuring predictive accuracy. They make predictions with their training data and compare those predictions to the target values in the training data. You'll see the problem with this approach and how to solve it in a moment, but let's think about how we'd do this first.
You'd first need to summarize the model quality in an understandable way. If you compare predicted and actual home values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted and actual values would be pointless. We need to summarize this into a single metric.
There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE). Let's break down this metric starting with the last word, error.
The prediction error for each house is:
error = actual − predicted
So, if a house cost \$150,000 and you predicted it would cost \$100,000, the error is \$50,000.
With the MAE metric, we take the absolute value of each error. This converts each error to a positive number. We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be stated as:
On average, our predictions are off by about X.
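The arithmetic behind MAE can be sketched by hand in a few lines of Python. This is only an illustration: the four actual and predicted prices below are made-up values, not figures from the Melbourne data, and the variable names are chosen for this example.

```python
# Hypothetical actual and predicted prices for four houses (illustrative values only)
actual = [150_000, 200_000, 320_000, 410_000]
predicted = [100_000, 215_000, 300_000, 410_000]

# error = actual - predicted, computed per house
errors = [a - p for a, p in zip(actual, predicted)]

# Taking absolute values stops over- and under-predictions from cancelling out
absolute_errors = [abs(e) for e in errors]

# MAE is the average of the absolute errors
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [50000, -15000, 20000, 0]
print(f"On average, our predictions are off by about ${mae:,.0f}.")  # ... about $21,250.
```

The first house reproduces the example above: an actual price of \$150,000 and a prediction of \$100,000 give an error of \$50,000.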
To calculate MAE, we first need a model. The code below builds one.
# Load the data and build the model
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Load data
melbourne_file_path = '../00 datasets/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Drop rows that have a missing value in any column
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
# ('Lattitude' and 'Longtitude' are spelled this way in the dataset)
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                      'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(X, y)
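With a fitted model in hand, the naive check described earlier (predicting on the training data and comparing against the training targets) might look like the sketch below. It is a minimal, self-contained illustration rather than the lesson's own code: the tiny `toy_data` table and its values are invented stand-ins for the Melbourne data, and the `toy_` names are hypothetical. `mean_absolute_error` is the MAE function from `sklearn.metrics`.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data standing in for the Melbourne dataset (values are made up)
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 5],
    "Landsize": [150.0, 320.0, 600.0, 280.0, 750.0],
    "Price": [480_000.0, 720_000.0, 1_100_000.0, 690_000.0, 1_450_000.0],
})
toy_X = toy_data[["Rooms", "Landsize"]]
toy_y = toy_data["Price"]

# Fit a decision tree, mirroring the model-building code above
toy_model = DecisionTreeRegressor(random_state=0)
toy_model.fit(toy_X, toy_y)

# Predict on the same rows the model was trained on ("in-sample" predictions)
in_sample_predictions = toy_model.predict(toy_X)

# Compare those predictions with the training targets
in_sample_mae = mean_absolute_error(toy_y, in_sample_predictions)
print(in_sample_mae)
```

On this toy data the in-sample score comes out at or very near zero, because an unconstrained decision tree can memorize its training rows. That is a first hint at why scoring a model on its own training data is misleading.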