Introduction
介绍
Oftentimes data will come to us with column names, index names, or other naming conventions that we are not satisfied with. In that case, you'll learn how to use pandas functions to change the names of the offending entries to something better.
通常,数据会带有我们不满意的列名、索引名或其他命名约定。 在这种情况下,您将学习如何使用 pandas 函数将有问题的条目的名称更改为更好的名称。
You'll also explore how to combine data from multiple DataFrames and/or Series.
您还将探索如何组合来自多个DataFrames和/或Series的数据。
To start the exercise for this topic, please click here.
要开始本主题的练习,请单击此处。
Renaming
重命名
The first function we'll introduce here is rename()
, which lets you change index names and/or column names. For example, to change the points
column in our dataset to score
, we would do:
我们在这里介绍的第一个函数是rename()
,它允许您更改索引名称和/或列名称。 例如,要将数据集中的points
列更改为score
,我们可以这样做:
import pandas as pd
pd.set_option('display.max_rows', 5)
reviews = pd.read_csv("../00 datasets/zynicide/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
reviews.rename(columns={'points': 'score'})
country | description | designation | score | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 13 columns
rename()
lets you rename index or column values by specifying a index
or column
keyword parameter, respectively. It supports a variety of input formats, but usually a Python dictionary is the most convenient. Here is an example using it to rename some elements of the index.
rename()
允许您通过分别指定 index
或 column
关键字参数来重命名索引或列值。 它支持多种输入格式,但通常Python字典是最方便的。 这是一个使用它来重命名索引的某些元素的示例。
reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
firstEntry | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
secondEntry | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 13 columns
You'll probably rename columns very often, but rename index values very rarely. For that, set_index()
is usually more convenient.
您可能会经常重命名列,但很少重命名索引值。 为此,set_index()
通常更方便。
Both the row index and the column index can have their own name
attribute. The complimentary rename_axis()
method may be used to change these names. For example:
行索引和列索引都可以有自己的name
属性。 可以使用附带的rename_axis()
方法来更改这些名称。 例如:
reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
fields | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
wines | |||||||||||||
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129969 | France | A dry style of Pinot Gris, this is crisp with ... | NaN | 90 | 32.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Marcel Deiss 2012 Pinot Gris (Alsace) | Pinot Gris | Domaine Marcel Deiss |
129970 | France | Big, rich and off-dry, this is powered by inte... | Lieu-dit Harth Cuvée Caroline | 90 | 21.0 | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... | Gewürztraminer | Domaine Schoffit |
129971 rows × 13 columns
Combining
组合
When performing operations on a dataset, we will sometimes need to combine different DataFrames and/or Series in non-trivial ways. Pandas has three core methods for doing this. In order of increasing complexity, these are concat()
, join()
, and merge()
. Most of what merge()
can do can also be done more simply with join()
, so we will omit it and focus on the first two functions here.
当对数据集执行操作时,我们有时需要以较为简便的方式组合不同的 DataFrame 和/或 Series。 Pandas 有三个核心方法来做到这一点。 按照复杂性的增加顺序,它们是concat()
、join()
和merge()
。 merge()
能做的大部分事情也可以用 join()
更简单地完成,所以我们将省略它并在这里重点关注前两个函数。
The simplest combining method is concat()
. Given a list of elements, this function will smush those elements together along an axis.
最简单的组合方法是concat()
。 给定一个元素列表,此函数会将这些元素沿轴混合在一起。
This is useful when we have data in different DataFrame or Series objects but having the same fields (columns). One example: the YouTube Videos dataset, which splits the data up based on country of origin (e.g. Canada and the UK, in this example). If we want to study multiple countries simultaneously, we can use concat()
to smush them together:
当我们在不同的 DataFrame 或 Series 对象中拥有数据但具有相同的字段(列)时,这非常有用。 一个示例:YouTube 视频数据集,它根据来源国家/地区(例如本例中的加拿大和英国)拆分数据。 如果我们想同时研究多个国家,我们可以使用 concat()
将它们组合在一起:
canadian_youtube = pd.read_csv("../00 datasets/datasnaek/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../00 datasets/datasnaek/youtube-new/GBvideos.csv")
pd.concat([canadian_youtube, british_youtube])
video_id | trending_date | title | channel_title | category_id | publish_time | tags | views | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | video_error_or_removed | description | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | n1WpP7iowLc | 17.14.11 | Eminem - Walk On Water (Audio) ft. Beyoncé | EminemVEVO | 10 | 2017-11-10T17:00:03.000Z | Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... | 17158579 | 787425 | 43420 | 125882 | https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg | False | False | False | Eminem's new track Walk on Water ft. Beyoncé i... |
1 | 0dBIkQ4Mz1M | 17.14.11 | PLUSH - Bad Unboxing Fan Mail | iDubbbzTV | 23 | 2017-11-13T17:00:00.000Z | plush|"bad unboxing"|"unboxing"|"fan mail"|"id... | 1014651 | 127794 | 1688 | 13030 | https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg | False | False | False | STill got a lot of packages. Probably will las... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
38914 | -DRsfNObKIQ | 18.14.06 | Eleni Foureira - Fuego - Cyprus - LIVE - First... | Eurovision Song Contest | 24 | 2018-05-08T20:32:32.000Z | Eurovision Song Contest|"2018"|"Lisbon"|"Cypru... | 14317515 | 151870 | 45875 | 26766 | https://i.ytimg.com/vi/-DRsfNObKIQ/default.jpg | False | False | False | Eleni Foureira represented Cyprus at the first... |
38915 | 4YFo4bdMO8Q | 18.14.06 | KYLE - Ikuyo feat. 2 Chainz & Sophia Black [A... | SuperDuperKyle | 10 | 2018-05-11T04:06:35.000Z | Kyle|"SuperDuperKyle"|"Ikuyo"|"2 Chainz"|"Soph... | 607552 | 18271 | 274 | 1423 | https://i.ytimg.com/vi/4YFo4bdMO8Q/default.jpg | False | False | False | Debut album 'Light of Mine' out now: http://ky... |
79797 rows × 16 columns
The middlemost combiner in terms of complexity is join()
. join()
lets you combine different DataFrame objects which have an index in common. For example, to pull down videos that happened to be trending on the same day in both Canada and the UK, we could do the following:
就复杂性而言,最适中的组合器是join()
。 join()
允许您组合具有共同索引的不同 DataFrame 对象。 例如,要获取恰好在加拿大和英国同一天流行的视频,我们可以执行以下操作:
left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])
left.join(right, lsuffix='_CAN', rsuffix='_UK')
video_id_CAN | channel_title_CAN | category_id_CAN | publish_time_CAN | tags_CAN | views_CAN | likes_CAN | dislikes_CAN | comment_count_CAN | thumbnail_link_CAN | ... | tags_UK | views_UK | likes_UK | dislikes_UK | comment_count_UK | thumbnail_link_UK | comments_disabled_UK | ratings_disabled_UK | video_error_or_removed_UK | description_UK | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
title | trending_date | |||||||||||||||||||||
Eminem - Walk On Water (Audio) ft. Beyoncé | 17.14.11 | n1WpP7iowLc | EminemVEVO | 10 | 2017-11-10T17:00:03.000Z | Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... | 17158579 | 787425 | 43420 | 125882 | https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg | ... | Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... | 17158579.0 | 787420.0 | 43420.0 | 125882.0 | https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg | False | False | False | Eminem's new track Walk on Water ft. Beyoncé i... |
PLUSH - Bad Unboxing Fan Mail | 17.14.11 | 0dBIkQ4Mz1M | iDubbbzTV | 23 | 2017-11-13T17:00:00.000Z | plush|"bad unboxing"|"unboxing"|"fan mail"|"id... | 1014651 | 127794 | 1688 | 13030 | https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Trump Advisor Grovels To Trudeau | 18.14.06 | lbMKLzQ4cNQ | The Young Turks | 25 | 2018-06-13T04:00:05.000Z | 180612__TB02SorryExcuse|"News"|"Politics"|"The... | 115225 | 2115 | 182 | 1672 | https://i.ytimg.com/vi/lbMKLzQ4cNQ/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
【完整版】遇到恐怖情人該怎麼辦?2018.06.13小明星大跟班 | 18.14.06 | POTgw38-m58 | 我愛小明星大跟班 | 24 | 2018-06-13T16:00:03.000Z | 吳宗憲|"吳姍儒"|"小明星大跟班"|"Sandy"|"Jacky wu"|"憲哥"|"中天... | 107392 | 300 | 62 | 251 | https://i.ytimg.com/vi/POTgw38-m58/default.jpg | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
40900 rows × 28 columns
The lsuffix
and rsuffix
parameters are necessary here because the data has the same column names in both British and Canadian datasets. If this wasn't true (because, say, we'd renamed them beforehand) we wouldn't need them.
此处需要lsuffix
和rsuffix
参数,因为数据在英国和加拿大数据集中具有相同的列名称。 如果这不是真的(因为,比如说,我们事先重命名了它们),我们就不需要它们。
Your turn
到你了
If you haven't started the exercise, you can get started here.
如果您还没有开始练习,可以从这里开始。