This notebook is an exercise in the Feature Engineering course. You can reference the tutorial at this link.

Introduction

In this exercise you'll start developing the features you identified in Exercise 2 as having the most potential. As you work through this exercise, you might take a moment to look at the data documentation again and consider whether the features we're creating make sense from a real-world perspective, and whether there are any useful combinations that stand out to you.

Run this cell to set everything up!

# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex3 import *

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

def score_dataset(X, y, model=XGBRegressor()):
    # Work on a copy so the caller's dataframe is not modified in place
    X = X.copy()
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score

# Prepare data
df = pd.read_csv("../input/fe-course-data/ames.csv")
X = df.copy()
y = X.pop("SalePrice")
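
As a rough illustration of the metric named in the scoring comment above, the sketch below computes RMSLE by hand with NumPy. The helper name `rmsle` and the sample prices are made up for this example; it assumes the usual definition based on log(1 + y), which is the one scikit-learn's mean squared log error uses.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + y) for both arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Hypothetical sale prices and predictions, for illustration only
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]
error = rmsle(actual, predicted)
print(error)
```

Because the error is measured on a log scale, it reflects relative rather than absolute differences, so a miss of a few percent costs roughly the same for a cheap house as for an expensive one.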

Let's start with a few mathematical combinations. We'll focus on features describing areas: sharing the same units (square feet) makes it easy to combine them in sensible ways. Since we're using XGBoost (a tree-based model), we'll focus on ratios and sums, which trees can only approximate from the raw columns.

1) Create Mathematical Transforms

Create the following features:

LivLotRatio: the ratio of GrLivArea to LotArea
Spaciousness: the sum of FirstFlrSF and SecondFlrSF divided by TotRmsAbvGrd
TotalOutsideSF: the sum of WoodDeckSF, OpenPorchSF, EnclosedPorch, Threeseasonporch, and ScreenPorch

# YOUR CODE HERE
X_1 = pd.DataFrame()  # dataframe to hold new features

#X_1["LivLotRatio"] = ____
#X_1["Spaciousness"] = ____
#X_1["TotalOutsideSF"] = ____

X_1["LivLotRatio"] = X.GrLivArea / X.LotArea
X_1["Spaciousness"] = (X.FirstFlrSF + X.SecondFlrSF) / X.TotRmsAbvGrd
X_1["TotalOutsideSF"] = X[
    ["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]
].sum(axis=1)

# Check your answer
q_1.check()

Correct

# Lines below will give you a hint or solution code
#q_1.hint()
q_1.solution()

Solution:


X_1["LivLotRatio"] = df.GrLivArea / df.LotArea
X_1["Spaciousness"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd
X_1["TotalOutsideSF"] = df.WoodDeckSF + df.OpenPorchSF + df.EnclosedPorch + df.Threeseasonporch + df.ScreenPorch
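
To see what these transforms compute, here is a minimal sketch on a two-row frame. The values are invented for illustration (they are not rows from ames.csv), the names `toy` and `toy_features` are hypothetical, and only two of the five outdoor-area columns are included to keep it short.

```python
import pandas as pd

# Two made-up houses; the values are illustrative, not taken from ames.csv
toy = pd.DataFrame({
    "GrLivArea": [1500, 2000],
    "LotArea": [6000, 10000],
    "FirstFlrSF": [900, 1200],
    "SecondFlrSF": [600, 800],
    "TotRmsAbvGrd": [6, 8],
    "WoodDeckSF": [100, 0],
    "OpenPorchSF": [50, 40],
})

toy_features = pd.DataFrame()
# Ratio of living area to lot area
toy_features["LivLotRatio"] = toy.GrLivArea / toy.LotArea
# Average floor area per above-ground room
toy_features["Spaciousness"] = (toy.FirstFlrSF + toy.SecondFlrSF) / toy.TotRmsAbvGrd
# Row-wise sum over a subset of the outdoor-area columns
toy_features["TotalOutsideSF"] = toy[["WoodDeckSF", "OpenPorchSF"]].sum(axis=1)
print(toy_features)
```

Selecting the columns as a list and calling `.sum(axis=1)` gives the same result as adding the columns one by one, as the solution does, when no values are missing.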

If you've discovered an interaction effect between a numeric feature and a categorical feature, you might want to model it explicitly using a one-hot encoding, like so:

# One-hot encode Categorical feature, adding a column prefix "Cat"
X_new = pd.get_dummies(df.Categorical, prefix="Cat")

# Multiply row-by-row
X_new = X_new.mul(df.Continuous, axis=0)

# Join the new features to the feature set
X = X.join(X_new)
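
The pattern above uses placeholder column names. As a small runnable sketch of the same idea, the example below applies it to a three-row frame with invented values. Passing `dtype=int` to `get_dummies` is a choice made here so the indicator columns are 0/1 integers regardless of the pandas version; it is not part of the original snippet.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "BldgType": ["OneFam", "Duplex", "OneFam"],
    "GrLivArea": [1500, 1200, 2000],
})

# dtype=int gives 0/1 indicator columns instead of booleans
dummies = pd.get_dummies(toy["BldgType"], prefix="Bldg", dtype=int)
# Multiply each indicator column by the numeric feature, aligned on the row index
interaction = dummies.mul(toy["GrLivArea"], axis=0)
print(interaction)
```

Each resulting column holds the living area for rows of that building type and zero elsewhere, which lets a model fit a separate effect of area for each category.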

2) Interaction with a Categorical

We discovered an interaction between BldgType and GrLivArea in Exercise 2. Now create their interaction features.

# YOUR CODE HERE
# One-hot encode BldgType. Use `prefix="Bldg"` in `get_dummies`
#X_2 = ____ 
X_2 = pd.get_dummies(X["BldgType"], prefix="Bldg")
# Multiply
#X_2 = ____
X_2 = X_2.multiply(X["GrLivArea"], axis=0)

# Check your answer
q_2.check()

Correct

# Lines below will give you a hint or solution code
#q_2.hint()
q_2.solution()

Solution:


X_2 = pd.get_dummies(df.BldgType, prefix="Bldg")
X_2 = X_2.mul(df.GrLivArea, axis=0)

3) Count Feature

Let's try creating a feature that describes how many kinds of outdoor areas a dwelling has. Create a feature PorchTypes that counts how many of the following are greater than 0.0:

WoodDeckSF
OpenPorchSF
EnclosedPorch
Threeseasonporch
ScreenPorch

X[["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]].gt(0).sum(axis=1)

0       2
1       2
2       2
3       0
4       2
       ..
2925    1
2926    1
2927    2
2928    2
2929    2
Length: 2930, dtype: int64

X_3 = pd.DataFrame()

# YOUR CODE HERE
#X_3["PorchTypes"] = ____
X_3["PorchTypes"] = X[["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]].gt(0).sum(axis=1)

# Check your answer
q_3.check()

Correct

# Lines below will give you a hint or solution code
#q_3.hint()
q_3.solution()

Solution:


X_3 = pd.DataFrame()

X_3["PorchTypes"] = df[[
    "WoodDeckSF",
    "OpenPorchSF",
    "EnclosedPorch",
    "Threeseasonporch",
    "ScreenPorch",
]].gt(0.0).sum(axis=1)
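
The counting step works in two stages: `gt(0)` turns each area into a True/False flag, and `sum(axis=1)` counts the True values per row. A minimal sketch on invented values, using only three of the five columns (the names `toy`, `is_present`, and `porch_types` are hypothetical):

```python
import pandas as pd

# Made-up outdoor areas for three houses
toy = pd.DataFrame({
    "WoodDeckSF": [0, 100, 50],
    "OpenPorchSF": [20, 0, 0],
    "ScreenPorch": [0, 0, 30],
})

# Stage 1: flag which areas are present (strictly greater than zero)
is_present = toy.gt(0)
# Stage 2: count the flags across columns; True counts as 1
porch_types = is_present.sum(axis=1)
print(porch_types)
```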

4) Break Down a Categorical Feature

MSSubClass describes the type of a dwelling:

df.MSSubClass.unique()

array(['One_Story_1946_and_Newer_All_Styles', 'Two_Story_1946_and_Newer',
       'One_Story_PUD_1946_and_Newer',
       'One_and_Half_Story_Finished_All_Ages', 'Split_Foyer',
       'Two_Story_PUD_1946_and_Newer', 'Split_or_Multilevel',
       'One_Story_1945_and_Older', 'Duplex_All_Styles_and_Ages',
       'Two_Family_conversion_All_Styles_and_Ages',
       'One_and_Half_Story_Unfinished_All_Ages',
       'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages',
       'One_Story_with_Finished_Attic_All_Ages',
       'PUD_Multilevel_Split_Level_Foyer',
       'One_and_Half_Story_PUD_All_Ages'], dtype=object)

You can see that there is a more general categorization described (roughly) by the first word of each category. Create a feature containing only these first words by splitting MSSubClass at the first underscore _. (Hint: In the split method use an argument n=1.)

X_4 = pd.DataFrame()

# YOUR CODE HERE
#____
X_4['MSClass'] = df["MSSubClass"].str.split("_", n=1, expand=True)[0]

# Check your answer
q_4.check()

Correct

# Lines below will give you a hint or solution code
#q_4.hint()
q_4.solution()

Solution:


X_4 = pd.DataFrame()

X_4["MSClass"] = df.MSSubClass.str.split("_", n=1, expand=True)[0]
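
To illustrate what `n=1` does, the sketch below splits three of the category strings listed earlier. With `n=1` the string is cut only at the first underscore, so `expand=True` yields exactly two columns: the first word and the remainder. The variable names are hypothetical.

```python
import pandas as pd

sub_class = pd.Series([
    "One_Story_1946_and_Newer_All_Styles",
    "Split_Foyer",
    "PUD_Multilevel_Split_Level_Foyer",
])

# Split at the first underscore only; column 0 is the first word, column 1 the rest
parts = sub_class.str.split("_", n=1, expand=True)
first_word = parts[0]
print(parts)
```

Without `n=1`, every underscore would produce a new column, and rows with fewer pieces would be padded with missing values.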

5) Use a Grouped Transform

The value of a home often depends on how it compares to typical homes in its neighborhood. Create a feature MedNhbdArea that describes the median of GrLivArea grouped on Neighborhood.

X_5 = pd.DataFrame()

# YOUR CODE HERE
#X_5["MedNhbdArea"] = ____
X_5["MedNhbdArea"] = X.groupby("Neighborhood")["GrLivArea"].transform("median")

# Check your answer
q_5.check()

Correct

# Lines below will give you a hint or solution code
#q_5.hint()
q_5.solution()

Solution:


X_5 = pd.DataFrame()

X_5["MedNhbdArea"] = df.groupby("Neighborhood")["GrLivArea"].transform("median")
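
Unlike an aggregation, `transform` returns one value per original row, so the group median is broadcast back to every house in that neighborhood. A minimal sketch with invented data follows; the extra column `AreaVsNhbdMedian` is a hypothetical follow-up feature, not part of the exercise.

```python
import pandas as pd

# Made-up neighborhoods and living areas
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "GrLivArea": [1000, 1500, 2000, 800, 1200],
})

# Median living area of each neighborhood, repeated on every row of that group
toy["MedNhbdArea"] = toy.groupby("Neighborhood")["GrLivArea"].transform("median")
# Hypothetical follow-up: how a house compares to its neighborhood's median
toy["AreaVsNhbdMedian"] = toy["GrLivArea"] / toy["MedNhbdArea"]
print(toy)
```

Because the result has the same index as the input, it can be assigned directly as a new column without a merge.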

Now you've made your first new feature set! If you like, you can run the cell below to score the model with all of your new features added:

X_new = X.join([X_1, X_2, X_3, X_4, X_5])
score_dataset(X_new, y)

0.13954039591355258

For comparison, score the original features alone as a baseline:

score_dataset(X, y)

0.14336777904947118
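
To put the two scores side by side, the short calculation below computes the absolute and relative change in RMSLE. The numbers are copied from the outputs above; the variable names are made up for this sketch.

```python
# Scores copied from the two outputs above (RMSLE, lower is better)
score_with_new_features = 0.13954039591355258
score_baseline = 0.14336777904947118

absolute_gain = score_baseline - score_with_new_features
relative_gain = absolute_gain / score_baseline
print(f"RMSLE dropped by {absolute_gain:.5f} ({relative_gain:.1%} of the baseline)")
```

This is a modest improvement from a single run of 5-fold cross-validation; judging whether it is reliable would take repeated runs or a held-out set.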

Keep Going

Untangle spatial relationships by adding cluster labels to your dataset.
