Using Pandas to Get Familiar With Your Data
The first step in any machine learning project is to familiarize yourself with the data. You'll use the Pandas library for this. Pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as pd. We do this with the command:
import pandas as pd
The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database.
Pandas has powerful methods for most things you'll want to do with this type of data.
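To make the table analogy concrete, here is a minimal sketch that builds a small DataFrame by hand. The column names and values are made up for illustration and are not taken from the Melbourne data.

```python
import pandas as pd

# Hypothetical toy data (illustrative values only, not from the Melbourne dataset)
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [650000, 903000, 1330000],
})

# Each dictionary key becomes a column; each list position becomes a row
print(homes.shape)          # (number of rows, number of columns)
print(list(homes.columns))  # column names
print(homes)                # the table itself
```

Each column behaves like a column in a spreadsheet, and each row is one record (here, one home).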
As an example, we'll look at data about home prices in Melbourne, Australia. In the hands-on exercises, you will apply the same processes to a new dataset, which has home prices in Iowa.
The example (Melbourne) data is at the file path ../input/melbourne-housing-snapshot/melb_data.csv.
We load and explore the data with the following commands:
# save filepath to variable for easier access
# (path as given in the text; point it at your own copy if it lives elsewhere,
#  e.g. '../00 datasets/melbourne-housing-snapshot/melb_data.csv')
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path)
# print a summary of the data in Melbourne data
melbourne_data.describe()
| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 13580.000000 | 1.358000e+04 | 13580.000000 | 13580.000000 | 13580.000000 | 13580.000000 | 13518.000000 | 13580.000000 | 7130.000000 | 8205.000000 | 13580.000000 | 13580.000000 | 13580.000000 |
mean | 2.937997 | 1.075684e+06 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.967650 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
std | 0.955748 | 6.393107e+05 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.079260 | 0.103916 | 4378.581772 |
min | 1.000000 | 8.500000e+04 | 0.000000 | 3000.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1196.000000 | -38.182550 | 144.431810 | 249.000000 |
25% | 2.000000 | 6.500000e+05 | 6.100000 | 3044.000000 | 2.000000 | 1.000000 | 1.000000 | 177.000000 | 93.000000 | 1940.000000 | -37.856822 | 144.929600 | 4380.000000 |
50% | 3.000000 | 9.030000e+05 | 9.200000 | 3084.000000 | 3.000000 | 1.000000 | 2.000000 | 440.000000 | 126.000000 | 1970.000000 | -37.802355 | 145.000100 | 6555.000000 |
75% | 3.000000 | 1.330000e+06 | 13.000000 | 3148.000000 | 3.000000 | 2.000000 | 2.000000 | 651.000000 | 174.000000 | 1999.000000 | -37.756400 | 145.058305 | 10331.000000 |
max | 10.000000 | 9.000000e+06 | 48.100000 | 3977.000000 | 20.000000 | 8.000000 | 10.000000 | 433014.000000 | 44515.000000 | 2018.000000 | -37.408530 | 145.526350 | 21650.000000 |
Interpreting Data Description
The results show 8 numbers for each numeric column in your original dataset. The first number, the count, shows how many rows have non-missing values.
Missing values arise for many reasons. For example, the size of the second bedroom wouldn't be collected when surveying a one-bedroom house. We'll come back to the topic of missing data.
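As a rough illustration of how the count relates to missing values, the sketch below uses a tiny made-up DataFrame (the values are hypothetical; only the column names echo the Melbourne data). It compares the total number of rows with the per-column count of non-missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: BuildingArea is missing for two of the four homes
toy = pd.DataFrame({
    "Rooms": [1, 2, 3, 4],
    "BuildingArea": [np.nan, 93.0, 126.0, np.nan],
})

total_rows = len(toy)        # all rows, whether or not values are missing
non_missing = toy.count()    # per-column number of non-missing values
missing = toy.isna().sum()   # per-column number of missing values

print(total_rows)
print(non_missing)
print(missing)
```

A column whose count is lower than the number of rows, such as BuildingArea and YearBuilt in the summary above, has missing entries.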
The second value is the mean, which is the average. Under that, std is the standard deviation, which measures how numerically spread out the values are.
To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter of the way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number.
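The sorting picture can be checked on a small example. This is a minimal sketch with made-up prices, chosen so that the quarter, half, and three-quarter positions land exactly on values in the list; when a position falls between two values, pandas interpolates between them by default.

```python
import pandas as pd

# Hypothetical prices, already sorted from lowest to highest
prices = pd.Series([100, 200, 300, 400, 500], name="Price")

summary = prices.describe()

# The same percentiles computed directly with quantile()
q25 = prices.quantile(0.25)
q50 = prices.quantile(0.50)
q75 = prices.quantile(0.75)

print(summary)
print(q25, q50, q75)
```

Here the 25% value is the second item in the sorted list, the 50% value (the median) is the middle item, and the 75% value is the fourth item.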
Your Turn
Get started with your first coding exercise