Flashield's Blog

Just For My Daily Diary

Flashield's Blog

Just For My Daily Diary

06.course-renaming-and-combining【重命名及组合】

Introduction

介绍

Oftentimes data will come to us with column names, index names, or other naming conventions that we are not satisfied with. In that case, you'll learn how to use pandas functions to change the names of the offending entries to something better.

通常,数据会带有我们不满意的列名、索引名或其他命名约定。 在这种情况下,您将学习如何使用 pandas 函数将有问题的条目的名称更改为更好的名称。

You'll also explore how to combine data from multiple DataFrames and/or Series.

您还将探索如何组合来自多个DataFrames和/或Series的数据。

To start the exercise for this topic, please click here.

要开始本主题的练习,请单击此处

Renaming

重命名

The first function we'll introduce here is rename(), which lets you change index names and/or column names. For example, to change the points column in our dataset to score, we would do:

我们在这里介绍的第一个函数是rename(),它允许您更改索引名称和/或列名称。 例如,要将数据集中的points列更改为score,我们可以这样做:

import pandas as pd
pd.set_option('display.max_rows', 5)
reviews = pd.read_csv("../00 datasets/zynicide/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
reviews.rename(columns={'points': 'score'})
country description designation score price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

rename() lets you rename index or column values by specifying a index or column keyword parameter, respectively. It supports a variety of input formats, but usually a Python dictionary is the most convenient. Here is an example using it to rename some elements of the index.

rename() 允许您通过分别指定 indexcolumn 关键字参数来重命名索引或列值。 它支持多种输入格式,但通常Python字典是最方便的。 这是一个使用它来重命名索引的某些元素的示例。

reviews.rename(index={0: 'firstEntry', 1: 'secondEntry'})
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
firstEntry Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
secondEntry Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

You'll probably rename columns very often, but rename index values very rarely. For that, set_index() is usually more convenient.

您可能会经常重命名列,但很少重命名索引值。 为此,set_index()通常更方便。

Both the row index and the column index can have their own name attribute. The complimentary rename_axis() method may be used to change these names. For example:

行索引和列索引都可以有自己的name属性。 可以使用附带的rename_axis()方法来更改这些名称。 例如:

reviews.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')
fields country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
wines
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
... ... ... ... ... ... ... ... ... ... ... ... ... ...
129969 France A dry style of Pinot Gris, this is crisp with ... NaN 90 32.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Marcel Deiss 2012 Pinot Gris (Alsace) Pinot Gris Domaine Marcel Deiss
129970 France Big, rich and off-dry, this is powered by inte... Lieu-dit Harth Cuvée Caroline 90 21.0 Alsace Alsace NaN Roger Voss @vossroger Domaine Schoffit 2012 Lieu-dit Harth Cuvée Car... Gewürztraminer Domaine Schoffit

129971 rows × 13 columns

Combining

组合

When performing operations on a dataset, we will sometimes need to combine different DataFrames and/or Series in non-trivial ways. Pandas has three core methods for doing this. In order of increasing complexity, these are concat(), join(), and merge(). Most of what merge() can do can also be done more simply with join(), so we will omit it and focus on the first two functions here.

当对数据集执行操作时,我们有时需要以较为简便的方式组合不同的 DataFrame 和/或 Series。 Pandas 有三个核心方法来做到这一点。 按照复杂性的增加顺序,它们是concat()join()merge()merge() 能做的大部分事情也可以用 join() 更简单地完成,所以我们将省略它并在这里重点关注前两个函数。

The simplest combining method is concat(). Given a list of elements, this function will smush those elements together along an axis.

最简单的组合方法是concat()。 给定一个元素列表,此函数会将这些元素沿轴混合在一起。

This is useful when we have data in different DataFrame or Series objects but having the same fields (columns). One example: the YouTube Videos dataset, which splits the data up based on country of origin (e.g. Canada and the UK, in this example). If we want to study multiple countries simultaneously, we can use concat() to smush them together:

当我们在不同的 DataFrame 或 Series 对象中拥有数据但具有相同的字段(列)时,这非常有用。 一个示例:YouTube 视频数据集,它根据来源国家/地区(例如本例中的加拿大和英国)拆分数据。 如果我们想同时研究多个国家,我们可以使用 concat() 将它们组合在一起:

canadian_youtube = pd.read_csv("../00 datasets/datasnaek/youtube-new/CAvideos.csv")
british_youtube = pd.read_csv("../00 datasets/datasnaek/youtube-new/GBvideos.csv")

pd.concat([canadian_youtube, british_youtube])
video_id trending_date title channel_title category_id publish_time tags views likes dislikes comment_count thumbnail_link comments_disabled ratings_disabled video_error_or_removed description
0 n1WpP7iowLc 17.14.11 Eminem - Walk On Water (Audio) ft. Beyoncé EminemVEVO 10 2017-11-10T17:00:03.000Z Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... 17158579 787425 43420 125882 https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg False False False Eminem's new track Walk on Water ft. Beyoncé i...
1 0dBIkQ4Mz1M 17.14.11 PLUSH - Bad Unboxing Fan Mail iDubbbzTV 23 2017-11-13T17:00:00.000Z plush|"bad unboxing"|"unboxing"|"fan mail"|"id... 1014651 127794 1688 13030 https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg False False False STill got a lot of packages. Probably will las...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38914 -DRsfNObKIQ 18.14.06 Eleni Foureira - Fuego - Cyprus - LIVE - First... Eurovision Song Contest 24 2018-05-08T20:32:32.000Z Eurovision Song Contest|"2018"|"Lisbon"|"Cypru... 14317515 151870 45875 26766 https://i.ytimg.com/vi/-DRsfNObKIQ/default.jpg False False False Eleni Foureira represented Cyprus at the first...
38915 4YFo4bdMO8Q 18.14.06 KYLE - Ikuyo feat. 2 Chainz & Sophia Black [A... SuperDuperKyle 10 2018-05-11T04:06:35.000Z Kyle|"SuperDuperKyle"|"Ikuyo"|"2 Chainz"|"Soph... 607552 18271 274 1423 https://i.ytimg.com/vi/4YFo4bdMO8Q/default.jpg False False False Debut album 'Light of Mine' out now: http://ky...

79797 rows × 16 columns

The middlemost combiner in terms of complexity is join(). join() lets you combine different DataFrame objects which have an index in common. For example, to pull down videos that happened to be trending on the same day in both Canada and the UK, we could do the following:

就复杂性而言,最适中的组合器是join()join() 允许您组合具有共同索引的不同 DataFrame 对象。 例如,要获取恰好在加拿大和英国同一天流行的视频,我们可以执行以下操作:

left = canadian_youtube.set_index(['title', 'trending_date'])
right = british_youtube.set_index(['title', 'trending_date'])

left.join(right, lsuffix='_CAN', rsuffix='_UK')
video_id_CAN channel_title_CAN category_id_CAN publish_time_CAN tags_CAN views_CAN likes_CAN dislikes_CAN comment_count_CAN thumbnail_link_CAN ... tags_UK views_UK likes_UK dislikes_UK comment_count_UK thumbnail_link_UK comments_disabled_UK ratings_disabled_UK video_error_or_removed_UK description_UK
title trending_date
Eminem - Walk On Water (Audio) ft. Beyoncé 17.14.11 n1WpP7iowLc EminemVEVO 10 2017-11-10T17:00:03.000Z Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... 17158579 787425 43420 125882 https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg ... Eminem|"Walk"|"On"|"Water"|"Aftermath/Shady/In... 17158579.0 787420.0 43420.0 125882.0 https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg False False False Eminem's new track Walk on Water ft. Beyoncé i...
PLUSH - Bad Unboxing Fan Mail 17.14.11 0dBIkQ4Mz1M iDubbbzTV 23 2017-11-13T17:00:00.000Z plush|"bad unboxing"|"unboxing"|"fan mail"|"id... 1014651 127794 1688 13030 https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Trump Advisor Grovels To Trudeau 18.14.06 lbMKLzQ4cNQ The Young Turks 25 2018-06-13T04:00:05.000Z 180612__TB02SorryExcuse|"News"|"Politics"|"The... 115225 2115 182 1672 https://i.ytimg.com/vi/lbMKLzQ4cNQ/default.jpg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
【完整版】遇到恐怖情人該怎麼辦?2018.06.13小明星大跟班 18.14.06 POTgw38-m58 我愛小明星大跟班 24 2018-06-13T16:00:03.000Z 吳宗憲|"吳姍儒"|"小明星大跟班"|"Sandy"|"Jacky wu"|"憲哥"|"中天... 107392 300 62 251 https://i.ytimg.com/vi/POTgw38-m58/default.jpg ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

40900 rows × 28 columns

The lsuffix and rsuffix parameters are necessary here because the data has the same column names in both British and Canadian datasets. If this wasn't true (because, say, we'd renamed them beforehand) we wouldn't need them.

此处需要lsuffixrsuffix参数,因为数据在英国和加拿大数据集中具有相同的列名称。 如果这不是真的(因为,比如说,我们事先重命名了它们),我们就不需要它们。

Your turn

到你了

If you haven't started the exercise, you can get started here.

如果您还没有开始练习,可以从这里开始

06.course-renaming-and-combining【重命名及组合】

Leave a Reply

Your email address will not be published. Required fields are marked *

Scroll to top