Using Pandas to Get Familiar With Your Data

The first step in any machine learning project is familiarize yourself with the data. You'll use the Pandas library for this. Pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as pd. We do this with the command

使用 Pandas 熟悉您的数据

任何机器学习项目的第一步都是熟悉数据。为此，您将使用 Pandas 库。 Pandas 是数据科学家用于探索和操作数据的主要工具。大多数人在代码中将 pandas 缩写为pd。我们用命令来做到这一点

import pandas as pd

The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database.

Pandas has powerful methods for most things you'll want to do with this type of data.

As an example, we'll look at data about home prices in Melbourne, Australia. In the hands-on exercises, you will apply the same processes to a new dataset, which has home prices in Iowa.

The example (Melbourne) data is at the file path ../input/melbourne-housing-snapshot/melb_data.csv.

We load and explore the data with the following commands:

Pandas 库最重要的部分是 DataFrame。 DataFrame 保存您可能认为是表格的数据类型。这类似于 Excel 中的工作表或 SQL 数据库中的表。

Pandas 拥有强大的方法来处理您想要对此类数据执行的大多数操作。

作为示例，我们将查看澳大利亚墨尔本的有关房价的数据。在实践练习中，您将向新数据集应用相同的过程，该数据集包含爱荷华州的房价。

示例（墨尔本）数据位于文件路径 ../input/melbourne-housing-snapshot/melb_data.csv。

我们使用以下命令加载并探索数据：

# save filepath to variable for easier access
# melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
melbourne_file_path = '../00 datasets/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 
# print a summary of the data in Melbourne data
melbourne_data.describe()

	Rooms	Price	Distance	Postcode	Bedroom2	Bathroom	Car	Landsize	BuildingArea	YearBuilt	Lattitude	Longtitude	Propertycount
count	13580.000000	1.358000e+04	13580.000000	13580.000000	13580.000000	13580.000000	13518.000000	13580.000000	7130.000000	8205.000000	13580.000000	13580.000000	13580.000000
mean	2.937997	1.075684e+06	10.137776	3105.301915	2.914728	1.534242	1.610075	558.416127	151.967650	1964.684217	-37.809203	144.995216	7454.417378
std	0.955748	6.393107e+05	5.868725	90.676964	0.965921	0.691712	0.962634	3990.669241	541.014538	37.273762	0.079260	0.103916	4378.581772
min	1.000000	8.500000e+04	0.000000	3000.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1196.000000	-38.182550	144.431810	249.000000
25%	2.000000	6.500000e+05	6.100000	3044.000000	2.000000	1.000000	1.000000	177.000000	93.000000	1940.000000	-37.856822	144.929600	4380.000000
50%	3.000000	9.030000e+05	9.200000	3084.000000	3.000000	1.000000	2.000000	440.000000	126.000000	1970.000000	-37.802355	145.000100	6555.000000
75%	3.000000	1.330000e+06	13.000000	3148.000000	3.000000	2.000000	2.000000	651.000000	174.000000	1999.000000	-37.756400	145.058305	10331.000000
max	10.000000	9.000000e+06	48.100000	3977.000000	20.000000	8.000000	10.000000	433014.000000	44515.000000	2018.000000	-37.408530	145.526350	21650.000000

Interpreting Data Description

The results show 8 numbers for each column in your original dataset. The first number, the count, shows how many rows have non-missing values.

Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.

The second value is the mean, which is the average. Under that, std is the standard deviation, which measures how numerically spread out the values are.

To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number.

解释数据描述

结果显示原始数据集中每列 8 个数字。第一个数字，计数，显示有多少行具有非缺失值。

缺失值的产生有多种原因。例如，在测量一卧室房屋时，不会收集第二卧室的尺寸。我们将回到丢失数据的主题。

第二个值是 mean，即平均值。其中，std 是标准差，它衡量值的数值分布情况。

要解释 min、25%、50%、75% 和 max 值，请想象将每列从最低值到最高值排序。第一个（最小）值是最小值。如果您浏览列表的四分之一，您会发现一个大于值的 25% 且小于值的 75% 的数字。这就是 25% 值（发音为25%）。第 50 个和第 75 个百分位数的定义类似，max 是最大数字。

Your Turn

Get started with your first coding exercise

到你了

开始您的第一次编码练习

02.course-basic-data-exploration【基础数据探索】