Flashield's Blog

Just For My Daily Diary

06.course-random-forests [Random Forests]

Introduction

Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.

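To make this trade-off concrete, here is a minimal sketch (on a synthetic dataset, not the Melbourne data used below) of how validation error changes as a single tree is allowed more leaves:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic regression data, standing in for the housing data below
X, y = make_regression(n_samples=1000, n_features=7, noise=20, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Validation error typically falls, then rises, as the tree gets more leaves:
# too few leaves underfits, too many overfits
for max_leaf_nodes in [5, 50, 500, 5000]:
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    mae = mean_absolute_error(val_y, model.predict(val_X))
    print(f"max_leaf_nodes={max_leaf_nodes}: validation MAE = {mae:.0f}")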

Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. But, many models have clever ideas that can lead to better performance. We'll look at the random forest as an example.

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.

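You can see this averaging directly in scikit-learn: a fitted forest exposes its component trees through its estimators_ attribute, and its prediction is (up to floating-point precision) the mean of the individual trees' predictions. A minimal sketch on synthetic data:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Each component tree makes its own prediction...
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
# ...and the forest's prediction is their mean
print(np.allclose(per_tree.mean(axis=0), forest.predict(X)))  # True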

Example

You've already seen the code to load the data a few times. At the end of data-loading, we have the following variables:

  • train_X
  • val_X
  • train_y
  • val_y
import pandas as pd

# Load data
# melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_file_path = '../00 datasets/melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Filter rows with missing values
melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                      'YearBuilt', 'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]

from sklearn.model_selection import train_test_split

# Split data into training and validation data, for both features and target.
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

We build a random forest model similarly to how we built a decision tree in scikit-learn, this time using the RandomForestRegressor class instead of DecisionTreeRegressor.

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
191669.7536453626

Conclusion

There is likely room for further improvement, but this is a big improvement over the best decision tree error of 250,000. There are parameters that let you change the performance of the Random Forest much as we changed the maximum depth of the single decision tree. But one of the best features of Random Forest models is that they generally work reasonably well even without this tuning.

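As a minimal sketch of that kind of tuning (reusing train_X, val_X, train_y, and val_y from the example above; the exact scores depend on your data), you could sweep n_estimators, the number of trees in the forest, much as we swept the size of a single tree:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# More trees usually reduces variance at the cost of training time;
# the gains typically flatten out past a few hundred trees
for n_estimators in [10, 100, 500]:
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=1)
    model.fit(train_X, train_y)
    mae = mean_absolute_error(val_y, model.predict(val_X))
    print(f"n_estimators={n_estimators}: validation MAE = {mae:,.0f}")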

Your Turn

Try using a Random Forest model yourself and see how much it improves your model.
