This notebook is an exercise in the Intermediate Machine Learning course. You can reference the tutorial at this link.
As a warm-up, you'll review some machine learning fundamentals and submit your initial results to a Kaggle competition.
Setup
The questions below will give you feedback on your work. Run the following cell to set up the feedback system.
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex1 import *
print("Setup Complete")
Setup Complete
You will work with data from the Housing Prices Competition for Kaggle Learn Users to predict home prices in Iowa using 79 explanatory variables describing (almost) every aspect of the homes.
Run the next code cell without changes to load the training and validation features in `X_train` and `X_valid`, along with the prediction targets in `y_train` and `y_valid`. The test features are loaded in `X_test`. (If you need to review features and prediction targets, please check out this short tutorial. To read about model validation, look here. Alternatively, if you'd prefer to look through a full course to review all of these topics, start here.)
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')
# Obtain target and predictors
y = X_full.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = X_full[features].copy()
X_test = X_test_full[features].copy()
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
random_state=0)
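
As a rough illustration of what the split above does, here is a minimal plain-Python sketch of a shuffled 80/20 holdout split. It is not scikit-learn's implementation, and the shuffle order will differ from `train_test_split`; the helper name `holdout_split` and the ten row Ids are hypothetical and exist only for this example.

```python
import random


def holdout_split(row_ids, train_fraction=0.8, seed=0):
    """Shuffle row ids and split them into training and validation lists.

    Hypothetical helper for illustration only; not part of scikit-learn.
    """
    shuffled = list(row_ids)
    # A seeded generator makes the split reproducible, like random_state=0.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


row_ids = list(range(1, 11))  # ten made-up house Ids
train_ids, valid_ids = holdout_split(row_ids)

# Every row lands in exactly one of the two sets.
print(len(train_ids), len(valid_ids))
```

The point to take away is that the validation rows are never shown to the model during fitting, so the error measured on them approximates performance on unseen homes.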
Use the next cell to print the first several rows of the data. It's a nice way to get an overview of the data you will use in your price prediction model.
X_train.head()
Id | LotArea | YearBuilt | 1stFlrSF | 2ndFlrSF | FullBath | BedroomAbvGr | TotRmsAbvGrd |
---|---|---|---|---|---|---|---|
619 | 11694 | 2007 | 1828 | 0 | 2 | 3 | 9 |
871 | 6600 | 1962 | 894 | 0 | 1 | 2 | 5 |
93 | 13360 | 1921 | 964 | 0 | 1 | 2 | 5 |
818 | 13265 | 2002 | 1689 | 0 | 2 | 3 | 7 |
303 | 13704 | 2001 | 1541 | 0 | 2 | 3 | 6 |
The next code cell defines five different random forest models. Run this code cell without changes. (To review random forests, look here.)
from sklearn.ensemble import RandomForestRegressor
# Define the models
model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)
models = [model_1, model_2, model_3, model_4, model_5]
To select the best model out of the five, we define a function `score_model()` below. This function returns the mean absolute error (MAE) from the validation set. Recall that the best model will obtain the lowest MAE. (To review mean absolute error, look here.)
Run the code cell without changes.
from sklearn.metrics import mean_absolute_error
# Function for comparing different models
def score_model(model, X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

for i in range(0, len(models)):
    mae = score_model(models[i])
    print("Model %d MAE: %d" % (i+1, mae))
Model 1 MAE: 24015
Model 2 MAE: 23740
Model 3 MAE: 23528
Model 4 MAE: 23996
Model 5 MAE: 23706
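
To make the numbers above concrete, here is a small sketch of how a mean absolute error is computed by hand. The three sale prices and predictions are invented for illustration, and `mean_absolute_error_by_hand` is a hypothetical stand-in for the `mean_absolute_error` function imported from scikit-learn.

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up sale prices and predictions, in dollars.
actual = [200000, 150000, 310000]
predicted = [210000, 140000, 300000]

mae = mean_absolute_error_by_hand(actual, predicted)
print(mae)  # 10000.0
```

Read this way, an MAE of 23528 means the model's validation predictions are off by about $23,500 on average, in either direction.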
Step 1: Evaluate several models
Use the above results to fill in the line below. Which model is the best model? Your answer should be one of `model_1`, `model_2`, `model_3`, `model_4`, or `model_5`.
# Fill in the best model
best_model = model_3
# Check your answer
step_1.check()
Correct
# Lines below will give you a hint or solution code
#step_1.hint()
step_1.solution()
Solution:
best_model = model_3
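
Rather than reading the winner off the printed output, the comparison can also be done in code. The sketch below uses the MAE values printed earlier, copied into a dictionary by hand; the name `mae_by_model` is an assumption for this example and does not appear in the exercise.

```python
# Validation MAE for each model, copied from the printed results.
mae_by_model = {
    "model_1": 24015,
    "model_2": 23740,
    "model_3": 23528,
    "model_4": 23996,
    "model_5": 23706,
}

# The best model is the one with the lowest validation MAE.
best_name = min(mae_by_model, key=mae_by_model.get)
print(best_name, mae_by_model[best_name])  # model_3 23528
```

Note that the gap between the best and worst model here is under $500, so the ranking could plausibly change with a different validation split.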
Step 2: Generate test predictions
Great. You know how to evaluate what makes an accurate model. Now it's time to go through the modeling process and make predictions. In the line below, create a Random Forest model with the variable name `my_model`.
# Define a model
my_model = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=55) # Your code here
# Check your answer
step_2.check()
Correct
# Lines below will give you a hint or solution code
#step_2.hint()
step_2.solution()
Solution:
# Define a model
my_model = best_model
Run the next code cell without changes. The code fits the model to the training and validation data, and then generates test predictions that are saved to a CSV file. These test predictions can be submitted directly to the competition!
# Fit the model to all labeled data (training and validation rows combined)
my_model.fit(X, y)
# Generate test predictions
preds_test = my_model.predict(X_test)
# Save predictions in format used for competition scoring
output = pd.DataFrame({'Id': X_test.index,
'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
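
Before submitting, it can help to sanity-check the file. The sketch below is one possible check, assuming the competition expects exactly the two columns written above (`Id` and `SalePrice`) with one positive price per test row. The function `validate_submission`, the sample Ids, and the sample prices are all hypothetical; it uses only the standard library so it runs without the competition data.

```python
import csv
import io


def validate_submission(csv_text, expected_ids):
    """Return a list of problems found in a submission; empty means it looks fine."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ["Id", "SalePrice"]:
        return ["unexpected header: %r" % (reader.fieldnames,)]

    problems = []
    rows = list(reader)
    ids = [int(row["Id"]) for row in rows]
    if ids != list(expected_ids):
        problems.append("Id column does not match the test set index")

    for row in rows:
        try:
            price = float(row["SalePrice"])
        except ValueError:
            problems.append("Id %s: SalePrice is not a number" % row["Id"])
            continue
        # price != price is True only for NaN.
        if price != price or price <= 0:
            problems.append("Id %s: SalePrice must be positive" % row["Id"])
    return problems


# Made-up two-row submission for illustration.
sample = "Id,SalePrice\n1461,125000.5\n1462,158000.0\n"
print(validate_submission(sample, [1461, 1462]))  # []
```

In the notebook, the same function could be pointed at the real file by passing `open('submission.csv').read()` and `X_test.index` instead of the sample values.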
Submit your results
Once you have successfully completed Step 2, you're ready to submit your results to the leaderboard! First, you'll need to join the competition if you haven't already. So open a new window by clicking on this link. Then click on the Join Competition button.
Next, follow the instructions below:
- Begin by clicking on the blue Save Version button in the top right corner of the window. This will generate a pop-up window.
- Ensure that the Save and Run All option is selected, and then click on the blue Save button.
- This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the Save Version button. This pulls up a list of versions on the right of the screen. Click on the ellipsis (...) to the right of the most recent version, and select Open in Viewer. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
- Click on the Output tab on the right of the screen. Then, click on the file you would like to submit, and click on the blue Submit button to submit your results to the leaderboard.
You have now successfully submitted to the competition!
If you want to keep working to improve your performance, select the blue Edit button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.
Keep going
You've made your first model. But how can you quickly make it better?
Learn how to improve your competition results by incorporating columns with missing values.