This notebook is an exercise in the Feature Engineering course. You can reference the tutorial at this link.
Introduction
In this exercise you'll start developing the features you identified in Exercise 2 as having the most potential. As you work through it, take a moment to look at the data documentation again and consider whether the features we're creating make sense from a real-world perspective, and whether any useful combinations stand out to you.
Run this cell to set everything up!
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex3 import *
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
def score_dataset(X, y, model=XGBRegressor()):
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score
# Prepare data
df = pd.read_csv("../input/fe-course-data/ames.csv")
X = df.copy()
y = X.pop("SalePrice")
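The score returned by score_dataset is an RMSLE. As a rough illustration of what that metric measures, here is a minimal sketch in plain NumPy. The helper name rmsle and the prices are made up for this example, and it assumes the log(1 + x) form that scikit-learn's mean squared log error uses.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + x) on both inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Made-up sale prices and predictions, for illustration only
y_true = np.array([200_000.0, 150_000.0, 320_000.0])
y_pred = np.array([210_000.0, 140_000.0, 300_000.0])

error = rmsle(y_true, y_pred)
print(error)
```

Because the errors are taken on a log scale, the metric penalizes relative rather than absolute differences, and lower values are better.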
Let's start with a few mathematical combinations. We'll focus on features describing areas: sharing the same units (square feet) makes them easy to combine in sensible ways. Since we're using XGBoost (a tree-based model), we'll focus on ratios and sums.
1) Create Mathematical Transforms
Create the following features:
LivLotRatio: the ratio of GrLivArea to LotArea
Spaciousness: the sum of FirstFlrSF and SecondFlrSF divided by TotRmsAbvGrd
TotalOutsideSF: the sum of WoodDeckSF, OpenPorchSF, EnclosedPorch, Threeseasonporch, and ScreenPorch
# YOUR CODE HERE
X_1 = pd.DataFrame() # dataframe to hold new features
#X_1["LivLotRatio"] = ____
#X_1["Spaciousness"] = ____
#X_1["TotalOutsideSF"] = ____
X_1["LivLotRatio"] = X.GrLivArea/X.LotArea
X_1["Spaciousness"] = (X.FirstFlrSF + X.SecondFlrSF)/X.TotRmsAbvGrd
X_1["TotalOutsideSF"] = X[["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]].sum(axis=1)
# Check your answer
q_1.check()
Correct
# Lines below will give you a hint or solution code
#q_1.hint()
q_1.solution()
Solution:
X_1["LivLotRatio"] = df.GrLivArea / df.LotArea
X_1["Spaciousness"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd
X_1["TotalOutsideSF"] = df.WoodDeckSF + df.OpenPorchSF + df.EnclosedPorch + df.Threeseasonporch + df.ScreenPorch
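To see what these transforms produce, here is a minimal sketch on a small made-up frame. The column names mirror the Ames columns used above, but the values are invented and only two of the outdoor-area columns are included.

```python
import pandas as pd

# Made-up values; column names mirror the Ames columns used above
toy = pd.DataFrame({
    "GrLivArea": [1500, 2000],
    "LotArea": [6000, 10000],
    "FirstFlrSF": [1000, 1200],
    "SecondFlrSF": [500, 800],
    "TotRmsAbvGrd": [6, 8],
    "WoodDeckSF": [100, 0],
    "OpenPorchSF": [50, 0],
})

feats = pd.DataFrame(index=toy.index)
# Ratio of two areas: the share of the lot covered by living space
feats["LivLotRatio"] = toy["GrLivArea"] / toy["LotArea"]
# Sum of areas divided by a count: average floor area per room
feats["Spaciousness"] = (toy["FirstFlrSF"] + toy["SecondFlrSF"]) / toy["TotRmsAbvGrd"]
# Row-wise sum across several area columns
feats["OutsideSF"] = toy[["WoodDeckSF", "OpenPorchSF"]].sum(axis=1)
print(feats)
```

Selecting the columns as a list and calling sum(axis=1) gives the same result as chaining + operators, and is easier to extend when more columns are added.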
If you've discovered an interaction effect between a numeric feature and a categorical feature, you might want to model it explicitly using a one-hot encoding, like so:
# One-hot encode Categorical feature, adding a column prefix "Cat"
X_new = pd.get_dummies(df.Categorical, prefix="Cat")
# Multiply row-by-row
X_new = X_new.mul(df.Continuous, axis=0)
# Join the new features to the feature set
X = X.join(X_new)
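The pattern above uses placeholder column names. As a runnable sketch of the same idea, the example below applies it to a tiny made-up frame; the column names Style and Area are hypothetical, and dtype=int is passed to get_dummies only to make the indicator values explicit integers.

```python
import pandas as pd

# Made-up data; "Style" and "Area" are hypothetical column names
toy = pd.DataFrame({
    "Style": ["Ranch", "Duplex", "Ranch"],
    "Area": [1500, 1800, 2100],
})

# One indicator column per category
dummies = pd.get_dummies(toy["Style"], prefix="Style", dtype=int)

# Multiply each indicator column by the numeric feature, row by row
interaction = dummies.mul(toy["Area"], axis=0)
print(interaction)
```

Each row carries its area in exactly one column and zeros elsewhere, so a model can fit a separate effect of the numeric feature for each category.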
2) Interaction with a Categorical
We discovered an interaction between BldgType and GrLivArea in Exercise 2. Now create their interaction features.
# YOUR CODE HERE
# One-hot encode BldgType. Use `prefix="Bldg"` in `get_dummies`
#X_2 = ____
X_2 = pd.get_dummies(X["BldgType"], prefix="Bldg")
# Multiply
#X_2 = ____
X_2 = X_2.multiply(X["GrLivArea"], axis=0)
# Check your answer
q_2.check()
Correct
# Lines below will give you a hint or solution code
#q_2.hint()
q_2.solution()
Solution:
X_2 = pd.get_dummies(df.BldgType, prefix="Bldg")
X_2 = X_2.mul(df.GrLivArea, axis=0)
3) Count Feature
Let's try creating a feature that describes how many kinds of outdoor areas a dwelling has. Create a feature PorchTypes that counts how many of the following are greater than 0.0:
WoodDeckSF
OpenPorchSF
EnclosedPorch
Threeseasonporch
ScreenPorch
X[["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]].gt(0).sum(axis=1)
0 2
1 2
2 2
3 0
4 2
..
2925 1
2926 1
2927 2
2928 2
2929 2
Length: 2930, dtype: int64
X_3 = pd.DataFrame()
# YOUR CODE HERE
#X_3["PorchTypes"] = ____
X_3["PorchTypes"] = X[["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]].gt(0).sum(axis=1)
# Check your answer
q_3.check()
Correct
# Lines below will give you a hint or solution code
#q_3.hint()
q_3.solution()
Solution:
X_3 = pd.DataFrame()
X_3["PorchTypes"] = df[[
    "WoodDeckSF",
    "OpenPorchSF",
    "EnclosedPorch",
    "Threeseasonporch",
    "ScreenPorch",
]].gt(0.0).sum(axis=1)
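The count works in two steps: gt(0.0) turns each area into a boolean, and summing across a row counts the True values. A minimal sketch on made-up numbers, using only three of the columns:

```python
import pandas as pd

# Made-up areas for three dwellings
toy = pd.DataFrame({
    "WoodDeckSF": [120.0, 0.0, 0.0],
    "OpenPorchSF": [30.0, 0.0, 45.0],
    "ScreenPorch": [0.0, 0.0, 80.0],
})

# Step 1: boolean mask, True where the dwelling has that kind of area
is_present = toy.gt(0.0)

# Step 2: sum across each row; True counts as 1, False as 0
porch_types = is_present.sum(axis=1)
print(porch_types.tolist())
```

The first dwelling has two kinds of outdoor area, the second has none, and the third has two.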
4) Break Down a Categorical Feature
MSSubClass describes the type of a dwelling:
df.MSSubClass.unique()
array(['One_Story_1946_and_Newer_All_Styles', 'Two_Story_1946_and_Newer',
'One_Story_PUD_1946_and_Newer',
'One_and_Half_Story_Finished_All_Ages', 'Split_Foyer',
'Two_Story_PUD_1946_and_Newer', 'Split_or_Multilevel',
'One_Story_1945_and_Older', 'Duplex_All_Styles_and_Ages',
'Two_Family_conversion_All_Styles_and_Ages',
'One_and_Half_Story_Unfinished_All_Ages',
'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages',
'One_Story_with_Finished_Attic_All_Ages',
'PUD_Multilevel_Split_Level_Foyer',
'One_and_Half_Story_PUD_All_Ages'], dtype=object)
You can see that there is a more general categorization described (roughly) by the first word of each category. Create a feature containing only these first words by splitting MSSubClass at the first underscore _. (Hint: pass the argument n=1 to the split method.)
X_4 = pd.DataFrame()
# YOUR CODE HERE
#____
X_4['MSClass'] = df["MSSubClass"].str.split("_", n=1, expand=True)[0]
# Check your answer
q_4.check()
Correct
# Lines below will give you a hint or solution code
#q_4.hint()
q_4.solution()
Solution:
X_4 = pd.DataFrame()
X_4["MSClass"] = df.MSSubClass.str.split("_", n=1, expand=True)[0]
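To see what n=1 and expand=True contribute, here is a small sketch on three of the category strings listed earlier. With n=1 the string is split at most once, so everything after the first underscore stays together in the second column.

```python
import pandas as pd

s = pd.Series([
    "One_Story_1946_and_Newer_All_Styles",
    "Split_Foyer",
    "Duplex_All_Styles_and_Ages",
])

# expand=True returns a DataFrame with one column per piece;
# n=1 limits the split to the first underscore only
parts = s.str.split("_", n=1, expand=True)
first_word = parts[0]
remainder = parts[1]

print(first_word.tolist())
print(remainder.tolist())
```

Without n=1, the split would produce as many columns as the longest category has words; column 0 would still hold the first word, but the result would be wider than needed.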
5) Use a Grouped Transform
The value of a home often depends on how it compares to typical homes in its neighborhood. Create a feature MedNhbdArea that describes the median of GrLivArea grouped on Neighborhood.
X_5 = pd.DataFrame()
# YOUR CODE HERE
#X_5["MedNhbdArea"] = ____
X_5["MedNhbdArea"] = X.groupby("Neighborhood")["GrLivArea"].transform("median")
# Check your answer
q_5.check()
Correct
# Lines below will give you a hint or solution code
#q_5.hint()
q_5.solution()
Solution:
X_5 = pd.DataFrame()
X_5["MedNhbdArea"] = df.groupby("Neighborhood")["GrLivArea"].transform("median")
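Unlike an aggregation, transform returns one value per original row, so the group median lines up with the rows it was computed from. The sketch below shows this on made-up data; the follow-up column AreaVsNhbd is a hypothetical illustration of comparing a home to its neighborhood and is not part of the exercise.

```python
import pandas as pd

# Made-up neighborhoods and living areas
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "GrLivArea": [1000, 1400, 2000, 900, 1100],
})

# Broadcast each neighborhood's median back onto its rows
toy["MedNhbdArea"] = toy.groupby("Neighborhood")["GrLivArea"].transform("median")

# Hypothetical follow-up feature: living area relative to the neighborhood median
toy["AreaVsNhbd"] = toy["GrLivArea"] / toy["MedNhbdArea"]
print(toy)
```

Every row in neighborhood A receives 1400 and every row in B receives 1000, so the new column has the same length and index as the original frame.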
Now you've made your first new feature set! If you like, you can run the cell below to score the model with all of your new features added:
X_new = X.join([X_1, X_2, X_3, X_4, X_5])
score_dataset(X_new, y)
0.13954039591355258
score_dataset(X, y)
0.14336777904947118
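Since RMSLE is an error, the lower score is the better one. The arithmetic below compares the two values printed above; the variable names are made up for this sketch, and the result applies only to this particular run.

```python
# Scores copied from the two cells above
baseline = 0.14336777904947118       # score_dataset(X, y)
with_features = 0.13954039591355258  # score_dataset(X_new, y)

absolute_gain = baseline - with_features
relative_gain = absolute_gain / baseline
print(f"{absolute_gain:.5f} ({relative_gain:.1%})")
```

The new features reduce the error by roughly 2.7% relative to the baseline.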
Keep Going
Untangle spatial relationships by adding cluster labels to your dataset.