This notebook is an exercise in the Intermediate Machine Learning course. You can reference the tutorial at this link.
By encoding categorical variables, you'll obtain your best results thus far!
Setup
The questions below will give you feedback on your work. Run the following cell to set up the feedback system.
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex3 import *
print("Setup Complete")
Setup Complete
In this exercise, you will work with data from the Housing Prices Competition for Kaggle Learn Users.
Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X = pd.read_csv('../input/train.csv', index_col='Id')
X_test = pd.read_csv('../input/test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X.SalePrice
X.drop(['SalePrice'], axis=1, inplace=True)
# To keep things simple, we'll drop columns with missing values
cols_with_missing = [col for col in X.columns if X[col].isnull().any()]
X.drop(cols_with_missing, axis=1, inplace=True)
X_test.drop(cols_with_missing, axis=1, inplace=True)
# Break off validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y,
train_size=0.8, test_size=0.2,
random_state=0)
Use the next code cell to print the first five rows of the data.
X_train.head()
MSSubClass | MSZoning | LotArea | Street | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | ... | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | |||||||||||||||||||||
619 | 20 | RL | 11694 | Pave | Reg | Lvl | AllPub | Inside | Gtl | NridgHt | ... | 108 | 0 | 0 | 260 | 0 | 0 | 7 | 2007 | New | Partial |
871 | 20 | RL | 6600 | Pave | Reg | Lvl | AllPub | Inside | Gtl | NAmes | ... | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 | WD | Normal |
93 | 30 | RL | 13360 | Pave | IR1 | HLS | AllPub | Inside | Gtl | Crawfor | ... | 0 | 44 | 0 | 0 | 0 | 0 | 8 | 2009 | WD | Normal |
818 | 20 | RL | 13265 | Pave | IR1 | Lvl | AllPub | CulDSac | Gtl | Mitchel | ... | 59 | 0 | 0 | 0 | 0 | 0 | 7 | 2008 | WD | Normal |
303 | 20 | RL | 13704 | Pave | IR1 | Lvl | AllPub | Corner | Gtl | CollgCr | ... | 81 | 0 | 0 | 0 | 0 | 0 | 1 | 2006 | WD | Normal |
5 rows × 60 columns
Notice that the dataset contains both numerical and categorical variables. You'll need to encode the categorical data before training a model.
To compare different models, you'll use the same score_dataset() function from the tutorial. This function reports the mean absolute error (MAE) from a random forest model.
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
Step 1: Drop columns with categorical data
You'll get started with the most straightforward approach. Use the code cell below to preprocess the data in X_train and X_valid to remove columns with categorical data. Set the preprocessed DataFrames to drop_X_train and drop_X_valid, respectively.
# Fill in the lines below: drop columns in training and validation data
# drop_X_train = X_train.drop(X.columns[X.dtypes == 'object'],axis=1)
# drop_X_valid = X_valid.drop(X.columns[X.dtypes == 'object'],axis=1)
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
# Check your answers
step_1.check()
Correct
# Lines below will give you a hint or solution code
#step_1.hint()
step_1.solution()
Solution:
# Drop columns in training and validation data
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
Run the next code cell to get the MAE for this approach.
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
MAE from Approach 1 (Drop categorical variables):
17837.82570776256
Before jumping into label encoding, we'll investigate the dataset. Specifically, we'll look at the 'Condition2' column. The code cell below prints the unique entries in both the training and validation sets.
print("Unique values in 'Condition2' column in training data:", X_train['Condition2'].unique())
print("nUnique values in 'Condition2' column in validation data:", X_valid['Condition2'].unique())
Unique values in 'Condition2' column in training data: ['Norm' 'PosA' 'Feedr' 'PosN' 'Artery' 'RRAe']
Unique values in 'Condition2' column in validation data: ['Norm' 'RRAn' 'RRNn' 'Artery' 'Feedr' 'PosN']
Step 2: Label encoding
Part A
If you now write code to:
- fit a label encoder to the training data, and then
- use it to transform both the training and validation data,
you'll get an error. Can you see why this is the case? (You'll need to use the above output to answer this question.)
# Check your answer (Run this code cell to receive credit!)
step_2.a.check()
Correct:
Fitting an ordinal encoder to a column in the training data creates a corresponding integer-valued label for each unique value that appears in the training data. In the case that the validation data contains values that don't also appear in the training data, the encoder will throw an error, because these values won't have an integer assigned to them. Notice that the 'Condition2' column in the validation data contains the values 'RRAn' and 'RRNn', but these don't appear in the training data -- thus, if we try to use an ordinal encoder with scikit-learn, the code will throw an error.
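To make this concrete, here is a minimal sketch of the failing call (assuming scikit-learn's OrdinalEncoder; the exact error message may differ between versions):
from sklearn.preprocessing import OrdinalEncoder
demo_encoder = OrdinalEncoder()
# The encoder only learns the categories that appear in the training column
demo_encoder.fit(X_train[['Condition2']])
# Uncommenting the line below raises a ValueError, because 'RRAn' and 'RRNn' were not seen during fit
# demo_encoder.transform(X_valid[['Condition2']])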
step_2.a.hint()
Hint: Are there any values that appear in the validation data but not in the training data?
This is a common problem that you'll encounter with real-world data, and there are many approaches to fixing this issue. For instance, you can write a custom label encoder to deal with new categories. The simplest approach, however, is to drop the problematic categorical columns.
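As an aside, one possible "custom" approach is sketched below. It assumes a reasonably recent scikit-learn, where OrdinalEncoder can map unseen categories to a placeholder value instead of raising an error; this is only an illustration, and the exercise itself takes the simpler route of dropping the problematic columns.
from sklearn.preprocessing import OrdinalEncoder
# Map any category not seen during fitting to -1 instead of raising an error
safe_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
safe_train = safe_encoder.fit_transform(X_train[['Condition2']])
safe_valid = safe_encoder.transform(X_valid[['Condition2']])  # 'RRAn' and 'RRNn' become -1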
Run the code cell below to save the problematic columns to a Python list bad_label_cols. Likewise, columns that can be safely label encoded are stored in good_label_cols.
# All categorical columns
object_cols = [col for col in X_train.columns if X_train[col].dtype == "object"]
# Columns that can be safely label encoded
good_label_cols = [col for col in object_cols if
set(X_train[col]) == set(X_valid[col])]
# Problematic columns that will be dropped from the dataset
bad_label_cols = list(set(object_cols)-set(good_label_cols))
print('Categorical columns that will be label encoded:', good_label_cols)
print('\nCategorical columns that will be dropped from the dataset:', bad_label_cols)
Categorical columns that will be label encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'LotConfig', 'BldgType', 'HouseStyle', 'ExterQual', 'CentralAir', 'KitchenQual', 'PavedDrive', 'SaleCondition']
Categorical columns that will be dropped from the dataset: ['RoofMatl', 'RoofStyle', 'Foundation', 'Exterior1st', 'Exterior2nd', 'Condition1', 'Utilities', 'Neighborhood', 'Heating', 'Functional', 'ExterCond', 'Condition2', 'SaleType', 'HeatingQC', 'LandSlope']
Part B
Use the next code cell to label encode the data in X_train and X_valid. Set the preprocessed DataFrames to label_X_train and label_X_valid, respectively.
- We have provided code below to drop the categorical columns in bad_label_cols from the dataset.
- You should label encode the categorical columns in good_label_cols.
# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)
# Apply label encoder
# ____ # Your code here
# This LabelEncoder-based approach also encodes the columns correctly
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for col in set(good_label_cols):
    label_X_train[col] = label_encoder.fit_transform(X_train[col])
    label_X_valid[col] = label_encoder.transform(X_valid[col])
# Standard answer (the approach shown in the solution)
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
label_X_train[good_label_cols] = ordinal_encoder.fit_transform(X_train[good_label_cols])
label_X_valid[good_label_cols] = ordinal_encoder.transform(X_valid[good_label_cols])
# Check your answer
step_2.b.check()
Correct
# Lines below will give you a hint or solution code
#step_2.b.hint()
step_2.b.solution()
Solution:
# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)
# Apply ordinal encoder
ordinal_encoder = OrdinalEncoder()
label_X_train[good_label_cols] = ordinal_encoder.fit_transform(X_train[good_label_cols])
label_X_valid[good_label_cols] = ordinal_encoder.transform(X_valid[good_label_cols])
Run the next code cell to get the MAE for this approach.
print("MAE from Approach 2 (Label Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
MAE from Approach 2 (Label Encoding):
17575.291883561644
So far, you've tried two different approaches to dealing with categorical variables. And, you've seen that encoding categorical data yields better results than removing columns from the dataset.
Soon, you'll try one-hot encoding. Before then, there's one additional topic we need to cover. Begin by running the next code cell without changes.
# Get number of unique entries in each column with categorical data
object_nunique = list(map(lambda col: X_train[col].nunique(), object_cols))
d = dict(zip(object_cols, object_nunique))
# Print number of unique entries by column, in ascending order
sorted(d.items(), key=lambda x: x[1])
[('Street', 2),
('Utilities', 2),
('CentralAir', 2),
('LandSlope', 3),
('PavedDrive', 3),
('LotShape', 4),
('LandContour', 4),
('ExterQual', 4),
('KitchenQual', 4),
('MSZoning', 5),
('LotConfig', 5),
('BldgType', 5),
('ExterCond', 5),
('HeatingQC', 5),
('Condition2', 6),
('RoofStyle', 6),
('Foundation', 6),
('Heating', 6),
('Functional', 6),
('SaleCondition', 6),
('RoofMatl', 7),
('HouseStyle', 8),
('Condition1', 9),
('SaleType', 9),
('Exterior1st', 15),
('Exterior2nd', 16),
('Neighborhood', 25)]
Step 3: Investigating cardinality
Part A
The output above shows, for each column with categorical data, the number of unique values in the column. For instance, the 'Street' column in the training data has two unique values: 'Grvl' and 'Pave', corresponding to a gravel road and a paved road, respectively.
We refer to the number of unique entries of a categorical variable as the cardinality of that categorical variable. For instance, the 'Street' variable has cardinality 2.
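For a quick check (a small illustration, not required for the exercise), the cardinality of a single column can be computed directly:
# 'Street' has cardinality 2
print(X_train['Street'].nunique())   # 2
print(X_train['Street'].unique())    # ['Pave' 'Grvl']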
Use the output above to answer the questions below.
{col: X_train[col].unique() for col in X_train.select_dtypes(['object'])}
{'MSZoning': array(['RL', 'FV', 'RM', 'RH', 'C (all)'], dtype=object),
'Street': array(['Pave', 'Grvl'], dtype=object),
'LotShape': array(['Reg', 'IR1', 'IR2', 'IR3'], dtype=object),
'LandContour': array(['Lvl', 'HLS', 'Bnk', 'Low'], dtype=object),
'Utilities': array(['AllPub', 'NoSeWa'], dtype=object),
'LotConfig': array(['Inside', 'CulDSac', 'Corner', 'FR2', 'FR3'], dtype=object),
'LandSlope': array(['Gtl', 'Mod', 'Sev'], dtype=object),
'Neighborhood': array(['NridgHt', 'NAmes', 'Crawfor', 'Mitchel', 'CollgCr', 'Somerst',
'MeadowV', 'BrkSide', 'Edwards', 'OldTown', 'Veenker', 'Gilbert',
'IDOTRR', 'Timber', 'Blmngtn', 'StoneBr', 'SWISU', 'SawyerW',
'NWAmes', 'Sawyer', 'NoRidge', 'ClearCr', 'BrDale', 'NPkVill',
'Blueste'], dtype=object),
'Condition1': array(['Norm', 'PosN', 'Feedr', 'Artery', 'PosA', 'RRAn', 'RRAe', 'RRNn',
'RRNe'], dtype=object),
'Condition2': array(['Norm', 'PosA', 'Feedr', 'PosN', 'Artery', 'RRAe'], dtype=object),
'BldgType': array(['1Fam', 'Twnhs', 'TwnhsE', '2fmCon', 'Duplex'], dtype=object),
'HouseStyle': array(['1Story', '2Story', 'SLvl', '1.5Fin', '2.5Unf', '2.5Fin', 'SFoyer',
'1.5Unf'], dtype=object),
'RoofStyle': array(['Hip', 'Gable', 'Flat', 'Mansard', 'Shed', 'Gambrel'], dtype=object),
'RoofMatl': array(['CompShg', 'Membran', 'Tar&Grv', 'WdShake', 'WdShngl', 'Metal',
'Roll'], dtype=object),
'Exterior1st': array(['CemntBd', 'MetalSd', 'Wd Sdng', 'VinylSd', 'Plywood', 'HdBoard',
'BrkFace', 'AsbShng', 'WdShing', 'AsphShn', 'Stucco', 'BrkComm',
'ImStucc', 'CBlock', 'Stone'], dtype=object),
'Exterior2nd': array(['CmentBd', 'MetalSd', 'Wd Sdng', 'VinylSd', 'BrkFace', 'Plywood',
'HdBoard', 'Wd Shng', 'AsbShng', 'Brk Cmn', 'AsphShn', 'Stucco',
'ImStucc', 'Stone', 'CBlock', 'Other'], dtype=object),
'ExterQual': array(['Ex', 'TA', 'Gd', 'Fa'], dtype=object),
'ExterCond': array(['TA', 'Gd', 'Ex', 'Fa', 'Po'], dtype=object),
'Foundation': array(['PConc', 'CBlock', 'BrkTil', 'Stone', 'Slab', 'Wood'], dtype=object),
'Heating': array(['GasA', 'OthW', 'GasW', 'Grav', 'Wall', 'Floor'], dtype=object),
'HeatingQC': array(['Ex', 'Gd', 'TA', 'Fa', 'Po'], dtype=object),
'CentralAir': array(['Y', 'N'], dtype=object),
'KitchenQual': array(['Gd', 'TA', 'Ex', 'Fa'], dtype=object),
'Functional': array(['Typ', 'Min1', 'Min2', 'Maj2', 'Mod', 'Maj1'], dtype=object),
'PavedDrive': array(['Y', 'P', 'N'], dtype=object),
'SaleType': array(['New', 'WD', 'COD', 'ConLD', 'ConLw', 'Oth', 'ConLI', 'Con', 'CWD'],
dtype=object),
'SaleCondition': array(['Partial', 'Normal', 'Abnorml', 'Family', 'Alloca', 'AdjLand'],
dtype=object)}
object_nunique = [X_train[col].nunique() for col in object_cols]
pd_obj_nunique = pd.DataFrame(object_nunique, index = object_cols, columns=['col_nunique'])
# Fill in the line below: How many categorical variables in the training data
# have cardinality greater than 10?
high_cardinality_numcols = (pd_obj_nunique['col_nunique'] > 10).sum()
# Fill in the line below: How many columns are needed to one-hot encode the
#'Neighborhood' variable in the training data?
num_cols_neighborhood = pd_obj_nunique.loc['Neighborhood']['col_nunique']
# Check your answers
step_3.a.check()
Correct
# Lines below will give you a hint or solution code
# step_3.a.hint()
step_3.a.solution()
Solution:
# How many categorical variables in the training data
# have cardinality greater than 10?
high_cardinality_numcols = 3
# How many columns are needed to one-hot encode the
# 'Neighborhood' variable in the training data?
num_cols_neighborhood = 25
Part B
For large datasets with many rows, one-hot encoding can greatly expand the size of the dataset. For this reason, we typically will only one-hot encode columns with relatively low cardinality. Then, high cardinality columns can either be dropped from the dataset, or we can use label encoding.
As an example, consider a dataset with 10,000 rows, and containing one categorical column with 100 unique entries.
- If this column is replaced with the corresponding one-hot encoding, how many entries are added to the dataset?
- If we instead replace the column with the label encoding, how many entries are added?
Use your answers to fill in the lines below.
from sklearn.preprocessing import OneHotEncoder
OH_encoder = OneHotEncoder(sparse_output=False)
OH_X_train = OH_encoder.fit_transform(X_train[good_label_cols])
OH_X_valid = OH_encoder.transform(X_valid[good_label_cols])
OH_X_train.shape[1]
52
# Fill in the line below: How many entries are added to the dataset by
# replacing the column with a one-hot encoding?
OH_entries_added = 10000*(100-1)
# Fill in the line below: How many entries are added to the dataset by
# replacing the column with a label encoding?
label_entries_added = 0
# Check your answers
step_3.b.check()
Correct
# Lines below will give you a hint or solution code
# step_3.b.hint()
step_3.b.solution()
Solution:
# How many entries are added to the dataset by
# replacing the column with a one-hot encoding?
OH_entries_added = 1e4*100 - 1e4
# How many entries are added to the dataset by
# replacing the column with an ordinal encoding?
label_entries_added = 0
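As a quick sanity check of that arithmetic: one-hot encoding turns the single column (10,000 entries) into 100 columns (1,000,000 entries), while a label encoding keeps a single column.
# Entries added by one-hot encoding: 100 columns * 10,000 rows, minus the original 10,000 entries
print(int(1e4 * 100 - 1e4))  # 990000
# A label (ordinal) encoding still occupies one column, so no entries are added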
Next, you'll experiment with one-hot encoding. But, instead of encoding all of the categorical variables in the dataset, you'll only create a one-hot encoding for columns with cardinality less than 10.
Run the code cell below without changes to set low_cardinality_cols to a Python list containing the columns that will be one-hot encoded. Likewise, high_cardinality_cols contains a list of categorical columns that will be dropped from the dataset.
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]
# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)
Categorical columns that will be one-hot encoded: ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'KitchenQual', 'Functional', 'PavedDrive', 'SaleType', 'SaleCondition']
Categorical columns that will be dropped from the dataset: ['Exterior1st', 'Neighborhood', 'Exterior2nd']
Step 4: One-hot encoding
Use the next code cell to one-hot encode the data in X_train and X_valid. Set the preprocessed DataFrames to OH_X_train and OH_X_valid, respectively.
- The full list of categorical columns in the dataset can be found in the Python list object_cols.
- You should only one-hot encode the categorical columns in low_cardinality_cols. All other categorical columns should be dropped from the dataset.
from sklearn.preprocessing import OneHotEncoder
# Use as many lines of code as you need!
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
OH_X_train
MSSubClass | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | BsmtFinSF1 | BsmtFinSF2 | BsmtUnfSF | TotalBsmtSF | ... | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Id | |||||||||||||||||||||
619 | 20 | 11694 | 9 | 5 | 2007 | 2007 | 48 | 0 | 1774 | 1822 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
871 | 20 | 6600 | 5 | 5 | 1962 | 1962 | 0 | 0 | 894 | 894 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
93 | 30 | 13360 | 5 | 7 | 1921 | 2006 | 713 | 0 | 163 | 876 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
818 | 20 | 13265 | 8 | 5 | 2002 | 2002 | 1218 | 0 | 350 | 1568 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
303 | 20 | 13704 | 7 | 5 | 2001 | 2002 | 0 | 0 | 1541 | 1541 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
764 | 60 | 9430 | 8 | 5 | 1999 | 1999 | 1163 | 0 | 89 | 1252 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
836 | 20 | 9600 | 4 | 7 | 1950 | 1995 | 442 | 0 | 625 | 1067 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1217 | 90 | 8930 | 6 | 5 | 1978 | 1978 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
560 | 120 | 3196 | 7 | 5 | 2003 | 2004 | 0 | 0 | 1374 | 1374 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
685 | 60 | 16770 | 7 | 5 | 1998 | 1998 | 0 | 0 | 1195 | 1195 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1168 rows × 155 columns
#OH_X_train = ____ # Your code here
#OH_X_valid = ____ # Your code here
# An alternative approach
OH_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
OH_encoder.fit(X_train[low_cardinality_cols])
OH_cols_train = OH_encoder.transform(X_train[low_cardinality_cols])
OH_cols_valid = OH_encoder.transform(X_valid[low_cardinality_cols])
OH_X_train = pd.concat([X_train.select_dtypes(exclude=['object']), pd.DataFrame(OH_cols_train, index=X_train.index),], axis=1)
OH_X_valid = pd.concat([X_valid.select_dtypes(exclude=['object']), pd.DataFrame(OH_cols_valid, index=X_valid.index),], axis=1)
# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
# Check your answer
step_4.check()
Correct
# Lines below will give you a hint or solution code
#step_4.hint()
step_4.solution()
Solution:
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
Run the next code cell to get the MAE for this approach.
print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
MAE from Approach 3 (One-Hot Encoding):
17525.345719178084
Generate test predictions and submit your results
After you complete Step 4, if you'd like to use what you've learned to submit your results to the leaderboard, you'll need to preprocess the test data before generating predictions.
This step is completely optional, and you do not need to submit results to the leaderboard to successfully complete the exercise.
Check out the previous exercise if you need help with remembering how to join the competition or save your results to CSV. Once you have generated a file with your results, follow the instructions below:
- Begin by clicking on the blue Save Version button in the top right corner of the window. This will generate a pop-up window.
- Ensure that the Save and Run All option is selected, and then click on the blue Save button.
- This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the Save Version button. This pulls up a list of versions on the right of the screen. Click on the ellipsis (...) to the right of the most recent version, and select Open in Viewer. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
- Click on the Output tab on the right of the screen. Then, click on the file you would like to submit, and click on the blue Submit button to submit your results to the leaderboard.
You have now successfully submitted to the competition!
If you want to keep working to improve your performance, select the blue Edit button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.
# (Optional) Your code here
model = RandomForestRegressor(n_estimators=100, random_state=55)
model.fit(OH_X_train, y_train)
from sklearn.impute import SimpleImputer
num_imputer = SimpleImputer(strategy='median')
imputed_num_X_test = pd.DataFrame(num_imputer.fit_transform(X_test.select_dtypes(exclude=['object'])),
index=X_test.index,
columns=X_test.select_dtypes(exclude=['object']).columns)
imputed_num_X_test.isnull().sum()
MSSubClass 0
LotArea 0
OverallQual 0
OverallCond 0
YearBuilt 0
YearRemodAdd 0
BsmtFinSF1 0
BsmtFinSF2 0
BsmtUnfSF 0
TotalBsmtSF 0
1stFlrSF 0
2ndFlrSF 0
LowQualFinSF 0
GrLivArea 0
BsmtFullBath 0
BsmtHalfBath 0
FullBath 0
HalfBath 0
BedroomAbvGr 0
KitchenAbvGr 0
TotRmsAbvGrd 0
Fireplaces 0
GarageCars 0
GarageArea 0
WoodDeckSF 0
OpenPorchSF 0
EnclosedPorch 0
3SsnPorch 0
ScreenPorch 0
PoolArea 0
MiscVal 0
MoSold 0
YrSold 0
dtype: int64
from sklearn.impute import SimpleImputer
obj_imputer = SimpleImputer(strategy='most_frequent')
imputed_obj_X_test = pd.DataFrame(obj_imputer.fit_transform(X_test.select_dtypes(['object'])),
index=X_test.index,
columns=X_test.select_dtypes(['object']).columns)
imputed_obj_X_test.isnull().sum() > 0
MSZoning False
Street False
LotShape False
LandContour False
Utilities False
LotConfig False
LandSlope False
Neighborhood False
Condition1 False
Condition2 False
BldgType False
HouseStyle False
RoofStyle False
RoofMatl False
Exterior1st False
Exterior2nd False
ExterQual False
ExterCond False
Foundation False
Heating False
HeatingQC False
CentralAir False
KitchenQual False
Functional False
PavedDrive False
SaleType False
SaleCondition False
dtype: bool
imputed_X_test = pd.concat([imputed_num_X_test, imputed_obj_X_test], axis=1)
imputed_X_test[low_cardinality_cols].isnull().sum()
MSZoning 0
Street 0
LotShape 0
LandContour 0
Utilities 0
LotConfig 0
LandSlope 0
Condition1 0
Condition2 0
BldgType 0
HouseStyle 0
RoofStyle 0
RoofMatl 0
ExterQual 0
ExterCond 0
Foundation 0
Heating 0
HeatingQC 0
CentralAir 0
KitchenQual 0
Functional 0
PavedDrive 0
SaleType 0
SaleCondition 0
dtype: int64
OH_cols_test = OH_encoder.transform(imputed_X_test[low_cardinality_cols])
OH_X_test = pd.concat([imputed_X_test.select_dtypes(exclude=['object']), pd.DataFrame(OH_cols_test, index=X_test.index),], axis=1)
OH_X_test.columns = OH_X_test.columns.astype('str')
y_preds = model.predict(OH_X_test)
submit_output = pd.DataFrame({'Id':X_test.index,
'SalePrice':y_preds})
submit_output.to_csv('submission.csv', index=False)
Keep going
With missing value handling and categorical encoding, your modeling process is getting complex. This complexity gets worse when you want to save your model to use in the future. The key to managing this complexity is something called pipelines.
Learn to use pipelines to preprocess datasets with categorical variables, missing values and any other messiness your data throws at you.
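As a preview (a minimal sketch, not part of this exercise), the imputation, encoding, and modeling steps above could be bundled into a single scikit-learn pipeline roughly like this:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
numerical_cols = X_train.select_dtypes(exclude=['object']).columns
categorical_cols = X_train.select_dtypes(include=['object']).columns
# Impute numerical columns; impute and one-hot encode categorical columns
preprocessor = ColumnTransformer(transformers=[
    ('num', SimpleImputer(strategy='median'), numerical_cols),
    ('cat', Pipeline(steps=[
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_cols),
])
# Preprocessing and the model travel together, so the same steps apply to any new data
pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0)),
])
pipeline.fit(X_train, y_train)
print(mean_absolute_error(y_valid, pipeline.predict(X_valid)))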