This notebook is an exercise in the Introduction to Machine Learning course. You can reference the tutorial at this link.
Recap
You've built your first model, and now it's time to optimize the size of the tree to make better predictions. Run this cell to set up your coding environment where the previous step left off.
# Code you have previously used to load data
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)
# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: {:,.0f}".format(val_mae))
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex5 import *
print("\nSetup complete")
Validation MAE: 29,653
Setup complete
Exercises
You could write the function get_mae yourself. For now, we'll supply it. This is the same function you read about in the previous lesson. Just run the cell below.
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree capped at max_leaf_nodes leaves and return its validation MAE
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
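To make the number returned by get_mae concrete: mean absolute error is the average of the absolute differences between actual and predicted values. The sketch below computes it by hand in plain Python. The helper name manual_mae and the prices are made up for illustration; they are not part of scikit-learn or the course data.

```python
def manual_mae(actual, predicted):
    """Mean absolute error: the average of |actual - predicted|."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


# Made-up sale prices, for illustration only (not from the Iowa data)
actual_prices = [200000, 150000, 310000]
predicted_prices = [210000, 140000, 300000]

example_mae = manual_mae(actual_prices, predicted_prices)
print("Example MAE: {:,.0f}".format(example_mae))
```

Every prediction here is off by 10,000, so the average error is also 10,000. Swapping the two arguments gives the same result, because the absolute value ignores the sign of each difference.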
Step 1: Compare Different Tree Sizes
Write a loop that tries each of the candidate values for max_leaf_nodes listed below.
Call the get_mae function on each value of max_leaf_nodes. Store the output in some way that allows you to select the value of max_leaf_nodes that gives the most accurate model on your data.
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Write loop to find the ideal tree size from candidate_max_leaf_nodes
# Explicit-loop alternative that prints each score:
# for node in candidate_max_leaf_nodes:
#     print(node, get_mae(node, train_X, val_X, train_y, val_y))
# Map each candidate tree size to its validation MAE
scores = {node: get_mae(node, train_X, val_X, train_y, val_y) for node in candidate_max_leaf_nodes}
# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
best_tree_size = min(scores, key=scores.get)
# Check your answer
step_1.check()
Correct
# The lines below will show you a hint or the solution.
# step_1.hint()
step_1.solution()
Solution:
# Here is a short solution with a dict comprehension.
# The lesson gives an example of how to do this with an explicit loop.
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)
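If min(scores, key=scores.get) looks opaque, the sketch below may help. Iterating over a dict yields its keys, and the key argument tells min to compare them by their stored scores, so the call returns the tree size with the lowest MAE. The MAE values in example_scores are hypothetical, chosen only to illustrate the mechanics, and the explicit loop shows the same selection done by hand.

```python
# Hypothetical validation MAEs keyed by max_leaf_nodes (made-up numbers)
example_scores = {
    5: 35000.0,
    25: 29000.0,
    50: 27500.0,
    100: 27000.0,
    250: 27900.0,
    500: 28400.0,
}

# min() iterates over the dict's keys and compares example_scores.get(key)
best_by_min = min(example_scores, key=example_scores.get)

# Equivalent explicit loop: track the lowest MAE seen so far
best_by_loop = None
lowest_mae = float("inf")
for leaf_size, mae in example_scores.items():
    if mae < lowest_mae:
        lowest_mae = mae
        best_by_loop = leaf_size

print(best_by_min, best_by_loop)
```

Both approaches pick the same key. The dict version is shorter, while the loop makes it easier to add extra logic such as printing each score.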
Step 2: Fit Model Using All Data
You know the best tree size. If you were going to deploy this model in practice, you could usually make it more accurate by training on all of the data while keeping that tree size. That is, you don't need to hold out the validation data now that you've made all your modeling decisions.
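The two phases described above (hold out data while choosing the tree size, then refit on everything) can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the Iowa housing set. All demo_ names and the candidate list are assumptions made for the example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption: not the Iowa dataset)
rng = np.random.default_rng(0)
demo_X = rng.uniform(0, 10, size=(400, 2))
demo_y = 3.0 * demo_X[:, 0] + np.sin(demo_X[:, 1]) + rng.normal(0, 0.5, size=400)

# Phase 1: hold out validation data only while choosing the tree size
tr_X, va_X, tr_y, va_y = train_test_split(demo_X, demo_y, random_state=1)
demo_candidates = [5, 25, 50, 100]
demo_scores = {}
for size in demo_candidates:
    tree = DecisionTreeRegressor(max_leaf_nodes=size, random_state=0)
    tree.fit(tr_X, tr_y)
    demo_scores[size] = mean_absolute_error(va_y, tree.predict(va_X))
demo_best_size = min(demo_scores, key=demo_scores.get)

# Phase 2: with the size chosen, refit on every row, validation rows included
demo_final_model = DecisionTreeRegressor(max_leaf_nodes=demo_best_size, random_state=0)
demo_final_model.fit(demo_X, demo_y)
print(demo_best_size, demo_final_model.get_n_leaves())
```

Note that the final model has no held-out data left to score it on, so the validation MAE from phase 1 is the only estimate of its accuracy.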
# Use the best tree size found in Step 1
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
# Fit the final model on all of the data
final_model.fit(X, y)
# Check your answer
step_2.check()
Correct
# step_2.hint()
step_2.solution()
Solution:
# Fit the model with best_tree_size. Fill in argument to make optimal size
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
# fit the final model
final_model.fit(X, y)
You've tuned this model and improved your results. But we are still using Decision Tree models, which are not very sophisticated by modern machine learning standards. In the next step you will learn to use Random Forests to improve your models even more.
Keep Going
You are ready for Random Forests.