Selecting Data for Modeling

Your dataset had too many variables to wrap your head around, or even to print out nicely. How can you pare down this overwhelming amount of data to something you can understand?

We'll start by picking a few variables using our intuition. Later courses will show you statistical techniques to automatically prioritize variables.

To choose variables (columns), we first need to see a list of all columns in the dataset. The DataFrame's columns property provides this (the last line of the code below).

import pandas as pd

melbourne_file_path = '../00 datasets/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')
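
As a small, self-contained illustration of the same idea, the sketch below builds a toy DataFrame and lists its columns. The values are invented for illustration and are not read from the Melbourne file; the names toy_data and column_names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the housing data; the values are made up for illustration.
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [1035000.0, 1465000.0, 1600000.0],
    "Landsize": [156.0, 134.0, 120.0],
})

# .columns returns an Index object holding the column labels.
print(toy_data.columns)

# tolist() converts the Index into an ordinary Python list of strings,
# which can be easier to scan when there are many columns.
column_names = toy_data.columns.tolist()
print(column_names)
```

Column labels must be matched exactly when selecting data later, so it helps to copy them from this list rather than retype them. The Melbourne data, for instance, spells two of its columns 'Lattitude' and 'Longtitude'.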

# The Melbourne data has some missing values (some houses for which some variables weren't recorded).
# We'll learn to handle missing values in a later tutorial.
# Your Iowa data doesn't have missing values in the columns you use.
# So we will take the simplest option for now and drop houses with missing values from our data.
# Don't worry about this much for now, though the code is:

# dropna drops missing values (think of na as "not available")
# axis=0 drops rows (houses) that contain at least one missing value
melbourne_data = melbourne_data.dropna(axis=0)
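
To see what dropna(axis=0) does, here is a minimal sketch on a toy DataFrame with a few deliberately missing entries. The data and the names toy_data, missing_per_column, and cleaned are assumptions made for this example only.

```python
import numpy as np
import pandas as pd

# Toy data with two missing entries (np.nan), invented for illustration.
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [79.0, np.nan, 142.0, 210.0],
    "Price": [1035000.0, 1465000.0, np.nan, 1876000.0],
})

# Count the missing values in each column before dropping anything.
missing_per_column = toy_data.isna().sum()
print(missing_per_column)

# axis=0 drops every row that contains at least one missing value.
cleaned = toy_data.dropna(axis=0)
print(len(toy_data), "rows before,", len(cleaned), "rows after")

# The surviving rows keep their original index labels.
print(cleaned.index.tolist())
```

Because the surviving rows keep their original index labels, the row labels in output printed after a dropna call are not necessarily consecutive.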

There are many ways to select a subset of your data. The Pandas course covers these in more depth, but we will focus on two approaches for now.

1. Dot notation, which we use to select the "prediction target"
2. Selecting with a column list, which we use to select the "features"

Selecting The Prediction Target

You can pull out a variable with dot notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data.

We'll use dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called y. So the code we need to save the house prices in the Melbourne data is:

y = melbourne_data.Price
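
The following sketch, on invented toy data, suggests that dot notation and bracket notation with a single column name return the same Series. The names toy_data, y_dot, and y_bracket are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration.
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

# Dot notation and single-name bracket notation both select one column.
y_dot = toy_data.Price
y_bracket = toy_data["Price"]

# Both results are pandas Series holding the same values.
print(type(y_dot))
print(y_dot.equals(y_bracket))
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method; bracket notation works for any column name.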

Choosing "Features"

The columns that are passed into our model (and later used to make predictions) are called "features." In our case, those would be the columns used to predict the home price. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features.

For now, we'll build a model with only a few features. Later on you'll see how to iterate and compare models built with different features.

We select multiple features by providing a list of column names inside brackets. Each item in that list should be a string (with quotes).

Here is an example:

melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

By convention, this data is called X.

X = melbourne_data[melbourne_features]
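
As a rough illustration on invented toy data, selecting with a list of column names returns a DataFrame that contains only those columns. The names toy_data, toy_features, and X_toy are assumptions for this sketch.

```python
import pandas as pd

# Toy data, invented for illustration.
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1.0, 2.0, 1.0],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

# A list of column names selects several columns at once.
toy_features = ["Rooms", "Bathroom"]
X_toy = toy_data[toy_features]

# The result is a DataFrame with one column per listed name.
print(type(X_toy))
print(X_toy.shape)

# A name that does not match a column exactly raises a KeyError.
try:
    toy_data[["Rooms", "Bathrooms"]]
except KeyError as err:
    print("KeyError:", err)
```

The KeyError case is worth keeping in mind here, because the names in the features list have to match the dataset's own spelling, including 'Lattitude' and 'Longtitude'.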

Let's quickly review the data we'll be using to predict house prices. The describe method summarizes each column, and the head method shows the top few rows.

X.describe()

	Rooms	Bathroom	Landsize	Lattitude	Longtitude
count	6196.000000	6196.000000	6196.000000	6196.000000	6196.000000
mean	2.931407	1.576340	471.006940	-37.807904	144.990201
std	0.971079	0.711362	897.449881	0.075850	0.099165
min	1.000000	1.000000	0.000000	-38.164920	144.542370
25%	2.000000	1.000000	152.000000	-37.855438	144.926198
50%	3.000000	1.000000	373.000000	-37.802250	144.995800
75%	4.000000	2.000000	628.000000	-37.758200	145.052700
max	8.000000	8.000000	37000.000000	-37.457090	145.526350

X.head()

	Rooms	Bathroom	Landsize	Lattitude	Longtitude
1	2	1.0	156.0	-37.8079	144.9934
2	3	2.0	134.0	-37.8093	144.9944
4	4	1.0	120.0	-37.8072	144.9941
6	3	2.0	245.0	-37.8024	144.9993
7	2	1.0	256.0	-37.8060	144.9954

Visually checking your data with these commands is an important part of a data scientist's job. You'll frequently find surprises in the dataset that deserve further inspection.
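
One way to follow up on a surprise is to read individual statistics out of the describe result, which is itself a DataFrame. The sketch below uses invented toy data with one unusually large Landsize value; the names X_toy and summary are hypothetical.

```python
import pandas as pd

# Toy features with one unusually large Landsize value, invented for illustration.
X_toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [156.0, 134.0, 120.0, 37000.0],
})

# describe() returns a DataFrame indexed by statistic name.
summary = X_toy.describe()
print(summary)

# Individual statistics can be looked up by row and column label.
print(summary.loc["max", "Landsize"])
print(summary.loc["50%", "Landsize"])

# head(n) shows the first n rows.
print(X_toy.head(2))
```

In the Melbourne output above, the maximum Landsize (37000) is far larger than the 75% value (628), which is the kind of gap that may deserve a closer look.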

Building Your Model

You will use the scikit-learn library to create your models. When coding, this library is written as sklearn, as you will see in the sample code. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
Fit: Capture patterns from provided data. This is the heart of modeling.
Predict: Use the fitted model to generate predictions.
Evaluate: Determine how accurate the model's predictions are.

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable.

from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure the same results on each run
melbourne_model = DecisionTreeRegressor(random_state=55)

# Fit model
melbourne_model.fit(X, y)
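
The remaining two steps, Predict and Evaluate, might look like the sketch below. It uses a small invented dataset rather than the Melbourne file. The use of mean_absolute_error as the accuracy measure is an assumption made for this example, and the names toy_data, X_toy, y_toy, and toy_model are hypothetical.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Toy data, invented for illustration; each row has a unique feature combination.
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3, 5],
    "Bathroom": [1.0, 2.0, 1.0, 1.0, 3.0],
    "Price": [1035000.0, 1465000.0, 1600000.0, 1876000.0, 2100000.0],
})
X_toy = toy_data[["Rooms", "Bathroom"]]
y_toy = toy_data.Price

# Define and fit, as in the code above.
toy_model = DecisionTreeRegressor(random_state=55)
toy_model.fit(X_toy, y_toy)

# Predict: the fitted model returns one predicted price per row of features.
predictions = toy_model.predict(X_toy)
print(predictions)

# Evaluate: mean absolute error between the true and predicted prices.
# Assumption: MAE is used here only as one possible accuracy measure.
mae = mean_absolute_error(y_toy, predictions)
print(mae)
```

Note that this sketch scores the model on the same rows it was fitted on, so the error it reports is optimistic; a fair evaluation would use houses the model has not seen.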
