Introduction
In this micro-course, you'll learn all about pandas, the most popular Python library for data analysis.
Along the way, you'll complete several hands-on exercises with real-world data. We recommend that you work on the exercises while reading the corresponding tutorials.
In this tutorial, you will learn how to create your own data, along with how to work with data that already exists.
Getting started
To use pandas, you'll typically start with the following line of code.
import pandas as pd
Creating data
There are two core objects in pandas: the DataFrame and the Series.
DataFrame
A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.
For example, consider the following simple DataFrame:
pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
| | Yes | No |
|---|---|---|
| 0 | 50 | 131 |
| 1 | 21 | 2 |
In this example, the "0, No" entry has the value of 131. The "0, Yes" entry has a value of 50, and so on.
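To make the "row, column" addressing concrete, here is a minimal sketch that looks up individual entries of the DataFrame above. The variable name `votes` is only an assumption for illustration, and `.loc` is one of several ways pandas offers to select an entry by its labels.

```python
import pandas as pd

# Same toy table as above; the name `votes` is an illustrative assumption.
votes = pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})

# .loc takes the row label first, then the column label.
print(votes.loc[0, 'No'])   # 131
print(votes.loc[0, 'Yes'])  # 50
print(votes.loc[1, 'No'])   # 2
```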
DataFrame entries are not limited to integers. For instance, here's a DataFrame whose values are strings:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 'Sue': ['Pretty good.', 'Bland.']})
| | Bob | Sue |
|---|---|---|
| 0 | I liked it. | Pretty good. |
| 1 | It was awful. | Bland. |
We are using the pd.DataFrame() constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (Bob and Sue in this example), and whose values are lists of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.
The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, ...) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.
The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:
pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'],
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])
| | Bob | Sue |
|---|---|---|
| Product A | I liked it. | Pretty good. |
| Product B | It was awful. | Bland. |
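Once the rows carry labels of our own, those labels are what we use to look entries up. The following is a small sketch of that idea; the variable name `reviews` is an assumption made for illustration.

```python
import pandas as pd

# Same review table as above, stored in a hypothetical variable `reviews`.
reviews = pd.DataFrame(
    {'Bob': ['I liked it.', 'It was awful.'],
     'Sue': ['Pretty good.', 'Bland.']},
    index=['Product A', 'Product B'],
)

# The Index now holds our labels instead of 0, 1, ...
print(list(reviews.index))              # ['Product A', 'Product B']

# Entries are addressed by row label and column label.
print(reviews.loc['Product B', 'Sue'])  # Bland.
```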
Series
A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:
pd.Series([1, 2, 3, 4, 5])
0 1
1 2
2 3
3 4
4 5
dtype: int64
A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:
pd.Series([30, 35, 40], index=['2015 Sales', '2016 Sales', '2017 Sales'], name='Product A')
2015 Sales 30
2016 Sales 35
2017 Sales 40
Name: Product A, dtype: int64
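As a quick check of how the index and name behave, the sketch below builds the same Series and reads both back. The variable name `sales` is an assumption for illustration.

```python
import pandas as pd

# Same Series as above, kept in a hypothetical variable `sales`.
sales = pd.Series(
    [30, 35, 40],
    index=['2015 Sales', '2016 Sales', '2017 Sales'],
    name='Product A',
)

print(sales.name)            # Product A
print(list(sales.index))     # ['2015 Sales', '2016 Sales', '2017 Sales']

# Values are looked up by their row label.
print(sales['2016 Sales'])   # 35
```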
The Series and the DataFrame are intimately related. It's helpful to think of a DataFrame as actually being just a bunch of Series "glued together". We'll see more of this in the next section of this tutorial.
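One way to see the "glued together" picture is to build a DataFrame out of two named Series and then pull a column back out. This is only a sketch: the Product B figures are made-up values for illustration, and `pd.concat` with `axis=1` is just one of several ways to combine Series into a table.

```python
import pandas as pd

years = ['2015 Sales', '2016 Sales', '2017 Sales']
product_a = pd.Series([30, 35, 40], index=years, name='Product A')
# Hypothetical numbers, invented only for this example.
product_b = pd.Series([21, 34, 11], index=years, name='Product B')

# Place the two Series side by side; each Series name becomes a column name.
sales = pd.concat([product_a, product_b], axis=1)
print(list(sales.columns))     # ['Product A', 'Product B']
print(sales.shape)             # (3, 2)

# Selecting a single column hands back a Series again.
column = sales['Product A']
print(type(column).__name__)   # Series
print(column.name)             # Product A
```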
Reading data files
Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won't actually be creating our own data by hand. Instead, we'll be working with data that already exists.
Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:
Product A,Product B,Product C
30,21,9
35,34,1
41,11,11
So a CSV file is a table of values separated by commas. Hence the name: "Comma-Separated Values", or CSV.
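To see how comma-separated text turns into a table, here is a minimal sketch that parses the sample values above from an in-memory string rather than a file. The use of `io.StringIO` and the variable names are assumptions made so the example is self-contained, and the sample is written without trailing commas.

```python
import io

import pandas as pd

# The sample CSV content, held in a string instead of a file on disk.
csv_text = """Product A,Product B,Product C
30,21,9
35,34,1
41,11,11
"""

# read_csv accepts a file-like object as well as a path.
products = pd.read_csv(io.StringIO(csv_text))

print(list(products.columns))        # ['Product A', 'Product B', 'Product C']
print(products.shape)                # (3, 3)
print(products.loc[2, 'Product C'])  # 11
```

The first line of the text becomes the column names, and each later line becomes one row with the default 0, 1, 2 row labels.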
Let's now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We'll use the pd.read_csv() function to read the data into a DataFrame. The call looks like this:
wine_reviews = pd.read_csv("../00 datasets/zynicide/wine-reviews/winemag-data-130k-v2.csv")
We can use the shape attribute to check how large the resulting DataFrame is:
wine_reviews.shape
(129971, 14)
So our new DataFrame has nearly 130,000 records split across 14 different columns. That's about 1.8 million entries!
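The entry count is simply rows times columns. The sketch below works that out from the shape reported above and, as a sanity check on a toy table, compares it with the `size` attribute. The variable names and the toy table are assumptions for illustration.

```python
import pandas as pd

# Shape reported for the wine reviews data: (rows, columns).
rows, cols = (129971, 14)
print(rows * cols)        # 1819594

# The same relationship on a small table we can inspect by eye.
toy = pd.DataFrame({'Yes': [50, 21], 'No': [131, 2]})
toy_rows, toy_cols = toy.shape
print(toy.shape)          # (2, 2)
print(toy.size)           # 4, the total number of entries
```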
We can examine the contents of the resultant DataFrame using the head() method, which grabs the first five rows:
wine_reviews.head()
| | Unnamed: 0 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | 2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
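head() also accepts an optional row count if five rows is not what you want. The sketch below shows this on a small made-up table; the DataFrame `numbers` is an assumption for illustration and has nothing to do with the wine data.

```python
import pandas as pd

# A hypothetical eight-row table, just large enough to show the cutoff.
numbers = pd.DataFrame({'n': range(8)})

# With no argument, head() returns the first five rows.
print(len(numbers.head()))          # 5

# Passing a number asks for that many rows instead.
print(len(numbers.head(3)))         # 3
print(list(numbers.head(3)['n']))   # [0, 1, 2]
```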
The pd.read_csv() function is feature-rich, with over 30 optional parameters you can specify. For example, you can see in this dataset that the CSV file has a built-in index, which pandas did not pick up on automatically. To make pandas use that column for the index (instead of creating a new one from scratch), we can specify an index_col.
wine_reviews = pd.read_csv("../00 datasets/zynicide/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
wine_reviews.head()
| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
| 2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
| 3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
| 4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
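The effect of index_col is easiest to see on a tiny file. The sketch below parses a three-row excerpt of the wine data, trimmed to two columns for brevity and held in an in-memory string; that trimming, the use of `io.StringIO`, and the variable names are all assumptions made to keep the example self-contained.

```python
import io

import pandas as pd

# The first header field is empty, as in a CSV saved with its index.
csv_text = """,country,points
0,Italy,87
1,Portugal,87
2,US,87
"""

# Without index_col, the unnamed first column is read as ordinary data.
default = pd.read_csv(io.StringIO(csv_text))
print(list(default.columns))   # ['Unnamed: 0', 'country', 'points']

# With index_col=0, the first column becomes the row labels instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))   # ['country', 'points']
print(indexed.shape)           # (3, 2)
```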
Your turn
If you haven't started the exercise, you can get started here.