Flashield's Blog

Just For My Daily Diary

05.exercise-inconsistent-data-entry [Exercise: Inconsistent Data Entry]

This notebook is an exercise in the Data Cleaning course. You can reference the tutorial at this link.


In this exercise, you'll apply what you learned in the Inconsistent data entry tutorial.

Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

from learntools.core import binder
binder.bind(globals())
from learntools.data_cleaning.ex5 import *
print("Setup Complete")
Setup Complete

Get our environment set up

The first thing we'll need to do is load in the libraries and dataset we'll be using. We use the same dataset from the tutorial.

# modules we'll use
import pandas as pd
import numpy as np

# helpful modules
import fuzzywuzzy
from fuzzywuzzy import process
import chardet

# read in all our data
professors = pd.read_csv("../input/pakistan-intellectual-capital/pakistan_intellectual_capital.csv")

# set seed for reproducibility
np.random.seed(0)

Next, we'll redo all of the work that we did in the tutorial.

# convert to lower case
professors['Country'] = professors['Country'].str.lower()
# remove leading and trailing white spaces
professors['Country'] = professors['Country'].str.strip()

# get the top 10 closest matches to "south korea"
countries = professors['Country'].unique()
matches = fuzzywuzzy.process.extract("south korea", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

def replace_matches_in_column(df, column, string_to_match, min_ratio = 47):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, 
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input matches 
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")

replace_matches_in_column(df=professors, column='Country', string_to_match="south korea")
countries = professors['Country'].unique()
All done!
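
To make the role of min_ratio concrete, here is a minimal sketch of the same filter-by-score idea. It is not fuzzywuzzy's implementation: it swaps in difflib.SequenceMatcher from the standard library as the scorer, so scores may differ from fuzzywuzzy's. The helper names token_sort_ratio_sketch and close_matches_sketch and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(a, b):
    """Score two strings from 0 to 100 after sorting their words.

    A rough stand-in for fuzzywuzzy's token_sort_ratio, not a reimplementation.
    """
    # lower-case, split on whitespace, and sort the words so that
    # word order does not affect the score
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


def close_matches_sketch(target, candidates, min_ratio=47):
    """Return the candidates whose score against target is >= min_ratio."""
    return [c for c in candidates
            if token_sort_ratio_sketch(target, c) >= min_ratio]


# hypothetical sample values, not taken from the dataset
sample_countries = ["south korea", "southkorea", "germany", "usa"]
close = close_matches_sketch("south korea", sample_countries)
print(close)
```

With these sample values, only the two spellings of South Korea pass the threshold. Raising min_ratio makes the filter stricter, and lowering it risks pulling in unrelated names.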

1) Examine another column

Write code below to take a look at all the unique values in the "Graduated from" column.

# TODO: Your code here
pd.set_option("display.max_rows", None)
professors["Graduated from"].value_counts().sort_index()
Graduated from
 Columbia University                                                                         2
 Delft University of Technology                                                              1
 Iowa State University                                                                       1
 University of Central Florida                                                               1
 University of Innsbruck                                                                     1
 University of Texas at Arlington (UTA)                                                      1
 University of Turin                                                                         1
Abasyn University                                                                            2
Abdul Wali Khan University, Mardan                                                           2
Abdus Salam School of Mathematical Sciences,GC University                                    1
Agricultural University Peshawar                                                             2
Allama Iqbal Open University                                                                 2
Asian Institute of Technology                                                               12
Aston University, Birmingham                                                                 1
Australian National University, Caneberra                                                    1
BUKC                                                                                         1
Bahauddin Zakariya University                                                                3
Bahria University                                                                            9
Bahria University,Islamabad                                                                 12
Balochistan University of Information Technology, Engineering and Management Sciences        2
Barani Institute of Information Technology                                                   1
Beaconhouse National University                                                              2
Beihang University                                                                           1
Beijing Institute of Technology                                                              2
Beijing Institute of Technology Beijing                                                      1
Beijing University of Posts & Telecommunications                                             1
Biztek Institute Of Business & Technology,Karachi                                            1
Blekinge Institute of Technology                                                             3
Brock University Canada                                                                      1
Brunel University                                                                            3
CECOS University of Information Technology and Emerging Sciences,Peshawar                    1
COMSATS Institute of Information Technology                                                 14
COMSATS Institute of Information Technology,Islamabad                                        6
COMSATS Institute of Information Technology,Lahore                                           4
COMSATS Institute of Information Technology,Vehari                                           1
COMSATS Institute of Information Technology,Wah Cantt                                        3
California State University                                                                  1
Capital University of Science & Technology                                                   2
Capital University of Science and Technology                                                 1
Carnegie Mellon University, Pittsburgh                                                       1
Centre for Advanced Studies in Engineering                                                   6
Chalmers University of Technology                                                            1
Chinese Academy of Sciences                                                                  3
Chosun University                                                                            1
City University of Science and Technology                                                    4
Colorado State University                                                                    1
Colorado Technical University                                                                1
Columbia University                                                                          2
Concordia University,Montreal                                                                1
Coventry University                                                                          1
Cranfield University                                                                         1
DUET,Karachi                                                                                 1
DePaul University, Chicago                                                                   1
Dresden University Of Technology, Dresden                                                    1
Eindhoven University of Technology (TU/e)                                                    2
FAST– National University of Computer and Emerging Sciences                                 67
FAST– National University of Computer and Emerging Sciences,Chiniot-Faisalabad               1
FAST– National University of Computer and Emerging Sciences,Islamabad                        7
FAST– National University of Computer and Emerging Sciences,Lahore                           5
FAST– National University of Computer and Emerging Sciences,Peshawar                         3
Fatima Jinnah Women University, Rawalpindi                                                   2
Fedral Urdu University                                                                       1
Florida Atlantic University                                                                  1
Foundation University                                                                        2
Galilée - Université Paris 13                                                                1
George Mason University                                                                      1
George Washington University                                                                 2
Georgetown University,DC                                                                     1
Ghulam Ishaq Khan Institute of Science and Technology                                       13
Gomal University                                                                             1
Government College University                                                                5
Government College University, Faisalabad                                                    2
Government College University,Faisalabad                                                     2
Graz University of Technology                                                                1
Grenoble                                                                                     1
Griffith University                                                                          1
Griffith University,Nathan Campus                                                            1
Guildford                                                                                    1
Gwangju Institute of Science and Technology                                                  2
HITEC University,Taxila                                                                      1
Hamburg University of Technology                                                             1
Hamdard University                                                                          31
Hanyang University, Ansan                                                                    1
Harbin Institute of Technology                                                               2
Huazhong University of Science and Technology (HUST), Wuhan                                  1
IBMS KP Agricultural University Peshawar                                                     1
INRIA Saclay Ile-de-France                                                                   1
INSA de Lyon, Rhone                                                                          1
IQRA University                                                                              2
IQRA University,Islamabad                                                                    2
IQRA University,Karachi                                                                      1
ISRA University                                                                              4
Illinois Institute of Technology                                                             1
Ilmenau University of Technology                                                             1
Imperial College, University of London                                                       1
Information Technology University (ITU)                                                      1
Institute Of Managment Sciences, Peshawar                                                    2
Institute of Business Administration                                                         7
Institute of Business Administration,Karachi                                                 6
Institute of Business Administration,Sukkur                                                  1
Institute of Management Sciences, Peshawar                                                   5
International Islamic University                                                             2
International Islamic University,Islamabad                                                  32
Islamia College University                                                                   1
JKU                                                                                          1
JNU                                                                                          1
Jinnah University for Women                                                                  9
John Moorse University, Liverpool                                                            1
Jonkoping University                                                                         1
KTH Royal Institute of Technology                                                            5
King Abdullah University of Science and Technology                                           1
Kingston University London                                                                   1
Kohat University of Science & Technology, Kohat                                              1
Kyung Hee University                                                                         3
Kyungpook National University                                                                2
Kyushu University,Fukuoka                                                                    1
Lahore College for Women University                                                          2
Lahore Leads University                                                                      3
Lahore University of Management Sciences                                                    27
Linköping University                                                                         2
Liverpool John Moores University                                                             3
London University                                                                            1
Loughborough University                                                                      3
Manchester Metropolitan University                                                           2
Manchester University                                                                        1
Massachusetts Institute of Technology                                                        2
Max Planck Institute for Computer Science                                                    2
Mehran University of Engineering & Technology                                                9
Mid Sweden University                                                                        2
Middle East Technical University                                                             1
Middlesex University                                                                         1
Minhaj University Lahore                                                                     1
Mohammad Ali Jinnah University                                                              18
Monash University                                                                            2
Muroran Institute of Technology,Hokkaido                                                     1
Myongji University                                                                           1
NCSU                                                                                         1
NED University of Engineering And Technology                                                34
Nancy 2 University                                                                           1
Nanyang Tech University                                                                      1
National College of Business Administration and Economics                                    7
National Textile University                                                                  6
National University of Modern Languages                                                      4
National University of Modern Languages,Islamabad                                            1
National University of Sciences and Technology                                              45
National University of Singapore                                                             1
New York Institute of Technology                                                             1
North Dakota State University                                                                2
Northeastern University,Boston                                                               2
Norwegian University of Science and Technology (NTNU),                                       1
Nottingham Trent University                                                                  1
Oxford Brookes University                                                                    1
PAF-Karachi Institute of Economics and Technology                                           16
Pace University, New York                                                                    1
Pakistan Institute of Engineering and Applied Sciences                                       4
Paris Descartes University                                                                   2
Paris Tech University of Eurecom                                                             1
Pir Mehr Ali Shah Arid Agriculture University                                               11
Pohang University of Science and Technology                                                  1
Politecnico di Milano                                                                        3
Politecnico di Torino                                                                        7
Pompeu Fabra University Barcelona                                                            1
Preston                                                                                      1
Punjab University College of Information Technology                                         12
Purdue University                                                                            3
Quaid-e-Awam University of Engineering, Science & Technology                                14
Quaid-i-Azam University                                                                     13
Queen Mary University of London                                                              3
RWTH Aachen University                                                                       4
Razak School of Engineering and Advanced Technology, Universiti Teknologi Malaysia (UTM)     1
Riphah International University                                                              1
Riphah International University,Faisalabad                                                   1
Rutgers State University of New Jersey, NJ                                                   1
SRH Hochschule Heidelberg                                                                    1
SSindh Agriculture University                                                                1
Saarland University                                                                          1
Sapienza University of Rome                                                                  1
Sardar Bahadur Khan Women's University                                                       3
Seoul National University                                                                    1
Shah Abdul Latif University, Khairpur                                                        1
Shaheed Zulfikar Ali Bhutto Institute of Science and Technology                             15
Shaheed Zulfikar Ali Bhutto Institute of Science and Technology,Islamabad                    1
Sindh Agriculture University                                                                 1
Sindh University                                                                            13
Sir Syed University of Engineering and Technology                                           47
Skolkovo Institute of Science and Technology,                                                1
South Asian University                                                                       1
Staffordshire University                                                                     1
Stanford University                                                                          1
State University of New York System                                                          1
Stockholm University                                                                         1
Sungkyunkwan University                                                                      1
Superior University, Lahore                                                                  1
Swansea                                                                                      1
Swedish University of Agricultural Sciences, Uppsala                                         1
Swinburne University Of Technology                                                           1
TU Berlin                                                                                    1
TU Wien                                                                                      1
Technical University of Braunschweig                                                         1
Technical University of Graz                                                                 1
Temple University                                                                            1
The Islamia University of Bahawalpur                                                         2
The Ohio State University                                                                    1
The Queens University of Belfast                                                             1
The State University of New Jersey                                                           1
The University of Auckland                                                                   1
The University of Birmingham                                                                 1
The University of Cambridge                                                                  1
The University of Leeds                                                                      1
The University of Manchester                                                                 1
The University of Queensland                                                                 1
The University of Texas at Austin                                                            1
The University of York                                                                       1
Tilburg University                                                                           1
Tokyo Institute of Technology                                                                1
Tsinghua University                                                                          2
United Nations University International Institute for Software Technology (UNU-IIST)         1
Univ of Porto/Univ of Aveiro Portugal/Uni of Minho                                           1
Universite d'Evry Val d'Essonne                                                              1
Universiti Putra Malaysia Putra                                                              1
Universiti Technologi                                                                        1
Universiti Teknologi PETRONAS                                                                8
Universiti Tun Hussein Onn Malaysia                                                          1
University Institute of Information Technology                                               1
University Of Caen                                                                           1
University Of Oslo                                                                           1
University Of Salford                                                                        1
University Of Southern California                                                            1
University Of Waterloo                                                                       1
University Paris                                                                             1
University of Abertay Dundee                                                                 1
University of Agriculture                                                                    1
University of Agriculture Faisalabad                                                         1
University of Agriculture, Faisalabad                                                        9
University of Agriculture, Faisalabad                                                        2
University of Arid Agriculture                                                               1
University of Balochistan                                                                    2
University of Bath                                                                           1
University of Bayreuth                                                                       1
University of BedfordShire                                                                   1
University of Bedfordshire                                                                   1
University of Bergen                                                                         1
University of Birmingham                                                                     2
University of Bologna                                                                        1
University of Bonn                                                                           1
University of Bradford                                                                       1
University of Bristol                                                                        1
University of British Columbia                                                               1
University of Canterbury                                                                     1
University of Central Florida                                                                2
University of Central Missouri                                                               1
University of Central Punjab                                                                 8
University of Colorado                                                                       1
University of Dundee                                                                         1
University of Engineering & Technology                                                       1
University of Engineering and Technology                                                    18
University of Engineering and Technology,Peshawar                                            5
University of Engineering and Technology,Taxila                                             17
University of Essex                                                                          5
University of Florida                                                                        2
University of Freiburg                                                                       1
University of Genova                                                                         1
University of Glasgow                                                                        1
University of Grenoble                                                                       3
University of Gujrat                                                                         3
University of Huddersfield                                                                   1
University of Illinois                                                                       2
University of Innsbruck                                                                      3
University of Karachi                                                                       15
University of Kent                                                                           2
University of Konstanz                                                                       2
University of Kuala Lumpur                                                                   1
University of Lahore                                                                         5
University of Leeds                                                                          5
University of Leicester                                                                      6
University of Limerick                                                                       3
University of Liverpool                                                                      1
University of Malaga                                                                         1
University of Malaya                                                                         5
University of Management and Technology                                                      6
University of Manchester                                                                     4
University of Manchester Institute of Science and Technology                                 1
University of Mississippi                                                                    1
University of New South Wales, Sydney                                                        1
University of Nice, Sophia Antipolis                                                         2
University of Northampton                                                                    1
University of Notre Dame Indiana                                                             1
University of Orleans                                                                        1
University of Oviedo                                                                         1
University of Paisley                                                                        1
University of Paris                                                                          1
University of Paris-Est                                                                      2
University of Patras                                                                         1
University of Peshawar                                                                      11
University of Pittsburgh                                                                     1
University of Plymouth                                                                       1
University of Porto                                                                          1
University of Regina                                                                         1
University of Rochester                                                                      1
University of Rome Tor Vergata                                                               1
University of Saarland                                                                       1
University of Salford                                                                        1
University of Shanghai for Science and Technology                                            1
University of South Australia                                                                2
University of South Brittany                                                                 1
University of South Florida                                                                  1
University of Southampton                                                                    2
University of Southern California                                                            1
University of Stirling                                                                       1
University of Stuttgart                                                                      1
University of Sunderland                                                                     4
University of Surrey                                                                         2
University of Sussex                                                                         3
University of Technology                                                                     7
University of Trento                                                                         2
University of Turbat                                                                         1
University of Ulm                                                                            1
University of Vienna                                                                         2
University of Virginia                                                                       1
University of Wales                                                                          1
University of Wales,Aberystwyth                                                              1
University of Westminster                                                                    1
University of York                                                                           1
University of the Punjab                                                                    34
University of the Punjab,Gujranwala                                                          1
University of the West Scotland                                                              1
University of Liverpool John Moores University                                               1
Universität Salzburg                                                                         1
Université Henri Poincaré, Nancy 1,                                                          1
Université de la Rochelle                                                                    1
Universtiy of Karachi                                                                        2
Universtiy of Lahore                                                                         2
Uppsala University                                                                           1
Usman Institute of Technology                                                                1
Usman Institute of Technology (Hamdard University)                                           1
Vienna University of Technology                                                              9
Virtual University of Pakistan                                                              10
Vrije University, Amsterdam                                                                  2
Wayne State University                                                                       2
Xiamen university                                                                            1
Zhejiang University                                                                          1
 Boston University                                                                           1
 Hongik University                                                                           1
 Nanyang Technological University                                                            1
 National University of Sciences and Technology-NIIT                                         1
 University of Bedfordshire                                                                  1
 University of Bonn                                                                          1
 University of Missouri, KC                                                                  1
 University of Windsor                                                                       1
Åbo Akademi University,                                                                      1
Name: count, dtype: int64

Do you notice any inconsistencies in the data? Can any of them be fixed by removing white space at the beginning and end of cells?

Once you have answered these questions, run the code cell below to get credit for your work.

# Check your answer (Run this code cell to receive credit!)
q1.check()

Correct:

There are inconsistencies that can be fixed by removing white spaces at the beginning and end of cells. For instance, "University of Central Florida" and " University of Central Florida" both appear in the column.

# Line below will give you a hint
q1.hint()

Hint: Use unis = professors['Graduated from'].unique() to take a look at the unique values in the 'Graduated from' column. You may find it useful to sort the data before printing it.

2) Do some text pre-processing

Strip the white space from the beginning and end of every entry in the "Graduated from" column of the professors DataFrame.

# TODO: Your code here
professors["Graduated from"] = professors["Graduated from"].str.strip()

# Check your answer
q2.check()

Correct

# Lines below will give you a hint or solution code
#q2.hint()
q2.solution()

Solution:

professors['Graduated from'] = professors['Graduated from'].str.strip()
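
To see what this step does, here is a minimal sketch on a made-up miniature of the column. The `toy` DataFrame and its four values are assumptions for illustration only, not rows from the real dataset. It flags entries that change under `str.strip()` and compares the number of unique values before and after stripping.

```python
import pandas as pd

# Hypothetical miniature of the 'Graduated from' column (not the real data)
toy = pd.DataFrame({"Graduated from": [
    "University of Bonn",
    " University of Bonn",   # leading space
    "Uppsala University",
    "Uppsala University ",   # trailing space
]})

col = toy["Graduated from"]

# Entries that differ from their stripped form carry leading/trailing white space
padded = col[col != col.str.strip()]
print(sorted(padded.unique()))

# Count unique values before and after stripping
before = col.nunique()
toy["Graduated from"] = col.str.strip()
after = toy["Graduated from"].nunique()
print(before, "->", after)
```

If the count of unique values drops after stripping, some entries differed only by surrounding white space.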

3) Continue working with countries

In the tutorial, we focused on cleaning up inconsistencies in the "Country" column. Run the code cell below to view the list of unique values that we ended with.

# get all the unique values in the 'Country' column
countries = professors['Country'].unique()

# sort them alphabetically and then take a closer look
countries.sort()
countries
array(['australia', 'austria', 'canada', 'china', 'finland', 'france',
       'germany', 'greece', 'hongkong', 'ireland', 'italy', 'japan',
       'macau', 'malaysia', 'mauritius', 'netherland', 'new zealand',
       'norway', 'pakistan', 'portugal', 'russian federation',
       'saudi arabia', 'scotland', 'singapore', 'south korea', 'spain',
       'sweden', 'thailand', 'turkey', 'uk', 'urbana', 'usa', 'usofa'],
      dtype=object)

Take another look at the "Country" column and see if there's any more data cleaning we need to do.

It looks like 'usa' and 'usofa' should be the same country. Correct the "Country" column in the dataframe so that 'usofa' appears instead as 'usa'.

Use the most recent version of the DataFrame (with the white space at the beginning and end of cells removed) from question 2.

fuzzywuzzy.process.extract("usa", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
[('usa', 100),
 ('usofa', 75),
 ('austria', 60),
 ('australia', 50),
 ('spain', 50),
 ('urbana', 44),
 ('uk', 40),
 ('malaysia', 36),
 ('pakistan', 36),
 ('portugal', 36)]
# TODO: Your code here!
replace_matches_in_column(df=professors, column='Country', string_to_match="usa", min_ratio=70)

# Check your answer
q3.check()
All done!

Correct

# Lines below will give you a hint or solution code
#q3.hint()
q3.solution()

Solution:

matches = fuzzywuzzy.process.extract("usa", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
replace_matches_in_column(df=professors, column='Country', string_to_match="usa", min_ratio=70)
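
Why does a threshold of 70 work here? The sketch below is a rough illustration, not fuzzywuzzy itself. It assumes the score behaves like the standard library's `difflib.SequenceMatcher` ratio applied to token-sorted strings. `token_sort_score` is a hypothetical helper written for this example, and the filter assumes that `replace_matches_in_column` keeps matches whose score is at least `min_ratio`.

```python
from difflib import SequenceMatcher

def token_sort_score(query, choice):
    """Simplified stand-in for fuzzywuzzy's token_sort_ratio (0-100)."""
    # Sort whitespace-separated tokens so word order does not affect the score
    a = " ".join(sorted(query.lower().split()))
    b = " ".join(sorted(choice.lower().split()))
    return round(100 * SequenceMatcher(None, a, b).ratio())

# A few of the country values from the exercise
countries = ["usa", "usofa", "austria"]
scores = {c: token_sort_score("usa", c) for c in countries}
print(scores)

# Keep only candidates at or above the threshold (assumed filter rule)
min_ratio = 70
close_matches = [c for c, s in scores.items() if s >= min_ratio]
print(close_matches)
```

Under these assumptions 'usofa' scores 75 and 'austria' scores 60, so any threshold between 61 and 75 replaces 'usofa' while leaving 'austria' alone.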

Congratulations!

Congratulations on completing the Data Cleaning course on Kaggle Learn!

To practice your new skills, you're encouraged to download and investigate some of Kaggle's Datasets.
