Introduction

介绍

Oftentimes data will come to us with column names, index names, or other naming conventions that we are not satisfied with. In that case, you'll learn how to use pandas functions to change the names of the offending entries to something better.

通常，数据会带有我们不满意的列名、索引名或其他命名约定。在这种情况下，您将学习如何使用 pandas 函数将有问题的条目的名称更改为更好的名称。

You'll also explore how to combine data from multiple DataFrames and/or Series.

您还将探索如何组合来自多个DataFrames和/或Series的数据。

To start the exercise for this topic, please click here.

要开始本主题的练习，请单击此处。

Renaming

重命名

The first function we'll introduce here is rename(), which lets you change index names and/or column names. For example, to change the points column in our dataset to score, we would do:

我们在这里介绍的第一个函数是rename()，它允许您更改索引名称和/或列名称。例如，要将数据集中的points列更改为score，我们可以这样做：

import pandas as pd
pd.set_option('display.max_rows', 5)
reviews = pd.read_csv("../00 datasets/zynicide/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

reviews.rename(columns={'points': 'score'})

	country	description	designation	score	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
...	...	...	...	...	...	...	...	...	...	...	...	...	...
129969	France	A dry style of Pinot Gris, this is crisp with ...	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte...	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...	Gewürztraminer	Domaine Schoffit

129971 rows × 13 columns

rename() lets you rename index or column values by specifying a index or column keyword parameter, respectively. It supports a variety of input formats, but usually a Python dictionary is the most convenient. Here is an example using it to rename some elements of the index.

rename() 允许您通过分别指定 index 或 column 关键字参数来重命名索引或列值。它支持多种输入格式，但通常Python字典是最方便的。这是一个使用它来重命名索引的某些元素的示例。

reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
firstEntry	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
secondEntry	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
...	...	...	...	...	...	...	...	...	...	...	...	...	...
129969	France	A dry style of Pinot Gris, this is crisp with ...	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte...	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...	Gewürztraminer	Domaine Schoffit

129971 rows × 13 columns

You'll probably rename columns very often, but rename index values very rarely. For that, set_index() is usually more convenient.

您可能会经常重命名列，但很少重命名索引值。为此，set_index()通常更方便。

Both the row index and the column index can have their own name attribute. The complimentary rename_axis() method may be used to change these names. For example:

行索引和列索引都可以有自己的name属性。可以使用附带的rename_axis()方法来更改这些名称。例如：

reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')

fields	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
wines
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
...	...	...	...	...	...	...	...	...	...	...	...	...	...
129969	France	A dry style of Pinot Gris, this is crisp with ...	NaN	90	32.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Marcel Deiss 2012 Pinot Gris (Alsace)	Pinot Gris	Domaine Marcel Deiss
129970	France	Big, rich and off-dry, this is powered by inte...	Lieu-dit Harth Cuvée Caroline	90	21.0	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car...	Gewürztraminer	Domaine Schoffit

129971 rows × 13 columns

Combining

组合

When performing operations on a dataset, we will sometimes need to combine different DataFrames and/or Series in non-trivial ways. Pandas has three core methods for doing this. In order of increasing complexity, these are concat(), join(), and merge(). Most of what merge() can do can also be done more simply with join(), so we will omit it and focus on the first two functions here.

当对数据集执行操作时，我们有时需要以较为简便的方式组合不同的 DataFrame 和/或 Series。 Pandas 有三个核心方法来做到这一点。按照复杂性的增加顺序，它们是concat()、join()和merge()。 merge() 能做的大部分事情也可以用 join() 更简单地完成，所以我们将省略它并在这里重点关注前两个函数。

The simplest combining method is concat(). Given a list of elements, this function will smush those elements together along an axis.

最简单的组合方法是concat()。给定一个元素列表，此函数会将这些元素沿轴混合在一起。

This is useful when we have data in different DataFrame or Series objects but having the same fields (columns). One example: the YouTube Videos dataset, which splits the data up based on country of origin (e.g. Canada and the UK, in this example). If we want to study multiple countries simultaneously, we can use concat() to smush them together:

当我们在不同的 DataFrame 或 Series 对象中拥有数据但具有相同的字段（列）时，这非常有用。一个示例：YouTube 视频数据集，它根据来源国家/地区（例如本例中的加拿大和英国）拆分数据。如果我们想同时研究多个国家，我们可以使用 concat() 将它们组合在一起：

canadian_youtube = pd.read_csv("../00 datasets/datasnaek/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../00 datasets/datasnaek/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])

	video_id	trending_date	title	channel_title	category_id	publish_time	tags	views	likes	dislikes	comment_count	thumbnail_link	comments_disabled	ratings_disabled	video_error_or_removed	description
0	n1WpP7iowLc	17.14.11	Eminem - Walk On Water (Audio) ft. Beyoncé	EminemVEVO	10	2017-11-10T17:00:03.000Z	Eminem\|"Walk"\|"On"\|"Water"\|"Aftermath/Shady/In...	17158579	787425	43420	125882	https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg	False	False	False	Eminem's new track Walk on Water ft. Beyoncé i...
1	0dBIkQ4Mz1M	17.14.11	PLUSH - Bad Unboxing Fan Mail	iDubbbzTV	23	2017-11-13T17:00:00.000Z	plush\|"bad unboxing"\|"unboxing"\|"fan mail"\|"id...	1014651	127794	1688	13030	https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg	False	False	False	STill got a lot of packages. Probably will las...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
38914	-DRsfNObKIQ	18.14.06	Eleni Foureira - Fuego - Cyprus - LIVE - First...	Eurovision Song Contest	24	2018-05-08T20:32:32.000Z	Eurovision Song Contest\|"2018"\|"Lisbon"\|"Cypru...	14317515	151870	45875	26766	https://i.ytimg.com/vi/-DRsfNObKIQ/default.jpg	False	False	False	Eleni Foureira represented Cyprus at the first...
38915	4YFo4bdMO8Q	18.14.06	KYLE - Ikuyo feat. 2 Chainz & Sophia Black [A...	SuperDuperKyle	10	2018-05-11T04:06:35.000Z	Kyle\|"SuperDuperKyle"\|"Ikuyo"\|"2 Chainz"\|"Soph...	607552	18271	274	1423	https://i.ytimg.com/vi/4YFo4bdMO8Q/default.jpg	False	False	False	Debut album 'Light of Mine' out now: http://ky...

79797 rows × 16 columns

The middlemost combiner in terms of complexity is join(). join() lets you combine different DataFrame objects which have an index in common. For example, to pull down videos that happened to be trending on the same day in both Canada and the UK, we could do the following:

就复杂性而言，最适中的组合器是join()。 join() 允许您组合具有共同索引的不同 DataFrame 对象。例如，要获取恰好在加拿大和英国同一天流行的视频，我们可以执行以下操作：

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')

		video_id_CAN	channel_title_CAN	category_id_CAN	publish_time_CAN	tags_CAN	views_CAN	likes_CAN	dislikes_CAN	comment_count_CAN	thumbnail_link_CAN	...	tags_UK	views_UK	likes_UK	dislikes_UK	comment_count_UK	thumbnail_link_UK	comments_disabled_UK	ratings_disabled_UK	video_error_or_removed_UK	description_UK
title	trending_date
Eminem - Walk On Water (Audio) ft. Beyoncé	17.14.11	n1WpP7iowLc	EminemVEVO	10	2017-11-10T17:00:03.000Z	Eminem\|"Walk"\|"On"\|"Water"\|"Aftermath/Shady/In...	17158579	787425	43420	125882	https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg	...	Eminem\|"Walk"\|"On"\|"Water"\|"Aftermath/Shady/In...	17158579.0	787420.0	43420.0	125882.0	https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg	False	False	False	Eminem's new track Walk on Water ft. Beyoncé i...
PLUSH - Bad Unboxing Fan Mail	17.14.11	0dBIkQ4Mz1M	iDubbbzTV	23	2017-11-13T17:00:00.000Z	plush\|"bad unboxing"\|"unboxing"\|"fan mail"\|"id...	1014651	127794	1688	13030	https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Trump Advisor Grovels To Trudeau	18.14.06	lbMKLzQ4cNQ	The Young Turks	25	2018-06-13T04:00:05.000Z	180612__TB02SorryExcuse\|"News"\|"Politics"\|"The...	115225	2115	182	1672	https://i.ytimg.com/vi/lbMKLzQ4cNQ/default.jpg	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
【完整版】遇到恐怖情人該怎麼辦？2018.06.13小明星大跟班	18.14.06	POTgw38-m58	我愛小明星大跟班	24	2018-06-13T16:00:03.000Z	吳宗憲\|"吳姍儒"\|"小明星大跟班"\|"Sandy"\|"Jacky wu"\|"憲哥"\|"中天...	107392	300	62	251	https://i.ytimg.com/vi/POTgw38-m58/default.jpg	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

40900 rows × 28 columns

The lsuffix and rsuffix parameters are necessary here because the data has the same column names in both British and Canadian datasets. If this wasn't true (because, say, we'd renamed them beforehand) we wouldn't need them.

此处需要lsuffix和rsuffix参数，因为数据在英国和加拿大数据集中具有相同的列名称。如果这不是真的（因为，比如说，我们事先重命名了它们），我们就不需要它们。

Your turn

到你了

If you haven't started the exercise, you can get started here.

如果您还没有开始练习，可以从这里开始。

06.course-renaming-and-combining【重命名及组合】

06.course-renaming-and-combining【重命名及组合】

Introduction

介绍

Renaming

重命名

Combining

组合

Your turn

到你了

Leave a Reply Cancel reply