Flashield's Blog

Just For My Daily Diary

06.exercise-xgboost [Exercise: XGBoost]

This notebook is an exercise in the Intermediate Machine Learning course. You can reference the tutorial at this link.


In this exercise, you will use your new knowledge to train a model with gradient boosting.

Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex6 import *
print("Setup Complete")
Setup Complete

You will work with the Housing Prices Competition for Kaggle Learn Users dataset from the previous exercise.

[Image: Ames Housing dataset]

Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice              
X.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()

# One-hot encode the data (to shorten the code, we use pandas)
X_train = pd.get_dummies(X_train)
X_valid = pd.get_dummies(X_valid)
X_test = pd.get_dummies(X_test)
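# Align the validation and test columns with the training columns;
# one-hot categories missing from a split become NaN columns, which
# XGBoost can handle natively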
X_train, X_valid = X_train.align(X_valid, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)

Step 1: Build model

Part A

In this step, you'll build and train your first model with gradient boosting.

  • Begin by setting my_model_1 to an XGBoost model. Use the XGBRegressor class, and set the random seed to 0 (random_state=0). Leave all other parameters as default.
  • Then, fit the model to the training data in X_train and y_train.
from xgboost import XGBRegressor

# Define the model
#my_model_1 = ____ # Your code here

# Fit the model
#____ # Your code here

my_model_1 = XGBRegressor(random_state=0)
my_model_1.fit(X_train, y_train)

# Check your answer
step_1.a.check()

Correct

# Lines below will give you a hint or solution code
#step_1.a.hint()
step_1.a.solution()

Solution:

# Define the model
my_model_1 = XGBRegressor(random_state=0)

# Fit the model
my_model_1.fit(X_train, y_train)

Part B

Set predictions_1 to the model's predictions for the validation data. Recall that the validation features are stored in X_valid.

from sklearn.metrics import mean_absolute_error

# Get predictions
#predictions_1 = ____ # Your code here

predictions_1 = my_model_1.predict(X_valid)
# Check your answer
step_1.b.check()

Correct

# Lines below will give you a hint or solution code
#step_1.b.hint()
step_1.b.solution()

Solution:

# Get predictions
predictions_1 = my_model_1.predict(X_valid)

Part C

Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions for the validation set. Recall that the labels for the validation data are stored in y_valid.

# Calculate MAE
#mae_1 = ____ # Your code here
mae_1 = mean_absolute_error(y_valid, predictions_1)
# Print MAE
print("Mean Absolute Error:" , mae_1)

# Check your answer
step_1.c.check()
Mean Absolute Error: 18161.82412510702

Correct

# Lines below will give you a hint or solution code
#step_1.c.hint()
step_1.c.solution()

Solution:

# Calculate MAE
mae_1 = mean_absolute_error(predictions_1, y_valid)
print("Mean Absolute Error:" , mae_1)

Step 2: Improve the model

Now that you've trained a default model as a baseline, it's time to tinker with the parameters to see if you can get better performance!

  • Begin by setting my_model_2 to an XGBoost model, using the XGBRegressor class. Use what you learned in the previous tutorial to figure out how to change the default parameters (like n_estimators and learning_rate) to get better results.
  • Then, fit the model to the training data in X_train and y_train.
  • Set predictions_2 to the model's predictions for the validation data. Recall that the validation features are stored in X_valid.
  • Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions on the validation set. Recall that the labels for the validation data are stored in y_valid.

In order for this step to be marked correct, your model in my_model_2 must attain lower MAE than the model in my_model_1.

# Explore how the validation MAE responds to different values of n_estimators
for i in range(1, 10):
    my_model_2 = XGBRegressor(random_state=0, n_estimators=i*50)
    my_model_2.fit(X_train, y_train)
    predictions_2 = my_model_2.predict(X_valid)
    mae_2 = mean_absolute_error(y_valid, predictions_2)
    print(i*50, mae_2)
50 18095.682871361303
100 18161.82412510702
150 18293.454850706337
200 18308.09936857877
250 18298.81161708048
300 18298.990020333906
350 18298.399628103594
400 18298.096358625855
450 18298.159768300513
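
The sweep above suggests that, at the default learning rate, the validation MAE bottoms out near 50 trees. Rather than sweeping by hand, you can let XGBoost stop adding trees once the validation score stalls. A minimal sketch (model_es is an illustrative name; this assumes xgboost >= 1.6, where early_stopping_rounds is a constructor argument, while older releases pass it to fit() instead):

# Stop adding trees after 5 rounds with no improvement on the validation set
model_es = XGBRegressor(n_estimators=1000, learning_rate=0.05,
                        early_stopping_rounds=5, random_state=0)
model_es.fit(X_train, y_train,
             eval_set=[(X_valid, y_valid)],
             verbose=False)
print("Best iteration:", model_es.best_iteration)
print("Validation MAE:", mean_absolute_error(y_valid, model_es.predict(X_valid)))
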
# Define the model
#my_model_2 = ____ # Your code here

# Fit the model
#____ # Your code here

# Get predictions
#predictions_2 = ____ # Your code here

# Calculate MAE
#mae_2 = ____ # Your code here

# my_model_2 = XGBRegressor(random_state=0, n_estimators=50)
my_model_2 = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model_2.fit(X_train, y_train)
predictions_2 = my_model_2.predict(X_valid)
mae_2 = mean_absolute_error(y_valid, predictions_2)

# Uncomment to print MAE
# print("Mean Absolute Error:" , mae_2)

# Check your answer
step_2.check()

Correct

my_model_2.n_estimators
1000
# Lines below will give you a hint or solution code
# step_2.hint()
step_2.solution()

Solution:

# Define the model
my_model_2 = XGBRegressor(n_estimators=1000, learning_rate=0.05)

# Fit the model
my_model_2.fit(X_train, y_train)

# Get predictions
predictions_2 = my_model_2.predict(X_valid)

# Calculate MAE
mae_2 = mean_absolute_error(predictions_2, y_valid)
print("Mean Absolute Error:" , mae_2)

Step 3: Break the model

In this step, you will create a model that performs worse than the original model in Step 1. This will help you to develop your intuition for how to set parameters. You might even find that you accidentally get better performance, which is ultimately a nice problem to have and a valuable learning experience!

  • Begin by setting my_model_3 to an XGBoost model, using the XGBRegressor class. Use what you learned in the previous tutorial to figure out how to change the default parameters (like n_estimators and learning_rate) to design a model to get high MAE.
  • Then, fit the model to the training data in X_train and y_train.
  • Set predictions_3 to the model's predictions for the validation data. Recall that the validation features are stored in X_valid.
  • Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions on the validation set. Recall that the labels for the validation data are stored in y_valid.

In order for this step to be marked correct, your model in my_model_3 must attain higher MAE than the model in my_model_1.

# Define the model
#my_model_3 = ____

# Fit the model
#____ # Your code here

# Get predictions
#predictions_3 = ____

# Calculate MAE
#mae_3 = ____

my_model_3 = XGBRegressor(random_state=0, n_estimators=300)
my_model_3.fit(X_train, y_train)
predictions_3 = my_model_3.predict(X_valid)
mae_3 = mean_absolute_error(y_valid, predictions_3)

# Print MAE
print("Mean Absolute Error:" , mae_3)

# Check your answer
step_3.check()
Mean Absolute Error: 18298.990020333906

Correct

# Lines below will give you a hint or solution code
step_3.hint()
step_3.solution()

Hint: In the official solution for this problem, we chose to greatly decrease the number of trees in the model by tinkering with the n_estimators parameter.

Solution:

# Define the model
my_model_3 = XGBRegressor(n_estimators=1)

# Fit the model
my_model_3.fit(X_train, y_train)

# Get predictions
predictions_3 = my_model_3.predict(X_valid)

# Calculate MAE
mae_3 = mean_absolute_error(predictions_3, y_valid)
print("Mean Absolute Error:" , mae_3)

Keep going

Continue to learn about data leakage. This is an important issue for a data scientist to understand, and it has the potential to ruin your models in subtle and dangerous ways!


Have questions or comments? Visit the course discussion forum to chat with other learners.
