In this notebook, we're going to learn how to clean up inconsistent text entries.

在本笔记本中，我们将学习如何清理不一致的文本条目。

Let's get started!

让我们开始吧！

Get our environment set up

设置我们的环境

The first thing we'll need to do is load in the libraries and dataset we'll be using.

我们需要做的第一件事是加载我们将使用的库和数据集。

# pip install fuzzywuzzy, python-Levenshtein 

# modules we'll use
import pandas as pd
import numpy as np

# helpful modules
import fuzzywuzzy
from fuzzywuzzy import process
import charset_normalizer
# 这个字符探测模块效果更好些
import chardet

# read in all our data
professors = pd.read_csv("../00 datasets/alexisbcook/pakistan-intellectual-capital/pakistan_intellectual_capital.csv")

# set seed for reproducibility
np.random.seed(0)

Do some preliminary text pre-processing

进行一些初步的文本预处理

We'll begin by taking a quick look at the first few rows of the data.

我们首先快速浏览一下数据的前几行。

professors.head()

	Unnamed: 0	S#	Teacher Name	University Currently Teaching	Department	Province University Located	Designation	Terminal Degree	Graduated from	Country	Year	Area of Specialization/Research Interests	Other Information
0	2	3	Dr. Abdul Basit	University of Balochistan	Computer Science & IT	Balochistan	Assistant Professor	PhD	Asian Institute of Technology	Thailand	NaN	Software Engineering & DBMS	NaN
1	4	5	Dr. Waheed Noor	University of Balochistan	Computer Science & IT	Balochistan	Assistant Professor	PhD	Asian Institute of Technology	Thailand	NaN	DBMS	NaN
2	5	6	Dr. Junaid Baber	University of Balochistan	Computer Science & IT	Balochistan	Assistant Professor	PhD	Asian Institute of Technology	Thailand	NaN	Information processing, Multimedia mining	NaN
3	6	7	Dr. Maheen Bakhtyar	University of Balochistan	Computer Science & IT	Balochistan	Assistant Professor	PhD	Asian Institute of Technology	Thailand	NaN	NLP, Information Retrieval, Question Answering...	NaN
4	24	25	Samina Azim	Sardar Bahadur Khan Women's University	Computer Science	Balochistan	Lecturer	BS	Balochistan University of Information Technolo...	Pakistan	2005.0	VLSI Electronics DLD Database	NaN

Say we're interested in cleaning up the "Country" column to make sure there's no data entry inconsistencies in it. We could go through and check each row by hand, of course, and hand-correct inconsistencies when we find them. There's a more efficient way to do this, though!

假设我们有兴趣清理国家/地区列，以确保其中不存在数据输入不一致的情况。当然，我们可以手动检查每一行，并在发现不一致时手动更正它们。不过，有一种更有效的方法可以做到这一点！

# get all the unique values in the 'Country' column
countries = professors['Country'].unique()

# sort them alphabetically and then take a closer look
countries.sort()
countries

array([' Germany', ' New Zealand', ' Sweden', ' USA', 'Australia',
       'Austria', 'Canada', 'China', 'Finland', 'France', 'Greece',
       'HongKong', 'Ireland', 'Italy', 'Japan', 'Macau', 'Malaysia',
       'Mauritius', 'Netherland', 'New Zealand', 'Norway', 'Pakistan',
       'Portugal', 'Russian Federation', 'Saudi Arabia', 'Scotland',
       'Singapore', 'South Korea', 'SouthKorea', 'Spain', 'Sweden',
       'Thailand', 'Turkey', 'UK', 'USA', 'USofA', 'Urbana', 'germany'],
      dtype=object)

Just looking at this, I can see some problems due to inconsistent data entry: ' Germany', and 'germany', for example, or ' New Zealand' and 'New Zealand'.

只要看一下这个，我就可以看到由于数据输入不一致而导致的一些问题：例如 Germany和germany，或者 New Zealand和New Zealand。

The first thing I'm going to do is make everything lower case (I can change it back at the end if I like) and remove any white spaces at the beginning and end of cells. Inconsistencies in capitalizations and trailing white spaces are very common in text data and you can fix a good 80% of your text data entry inconsistencies by doing this.

我要做的第一件事是将所有内容都变成小写（如果我愿意，我可以在末尾将其更改回来）并删除单元格开头和结尾处的所有空格。大小写和尾随空格的不一致在文本数据中非常常见，通过执行此操作，您可以修复 80% 的文本数据输入不一致问题。

# convert to lower case
professors['Country'] = professors['Country'].str.lower()
# remove trailing white spaces
professors['Country'] = professors['Country'].str.strip()

Next we're going to tackle more difficult inconsistencies.

接下来我们将解决更困难的不一致问题。

Use fuzzy matching to correct inconsistent data entry

使用模糊匹配来纠正不一致的数据输入

Alright, let's take another look at the 'Country' column and see if there's any more data cleaning we need to do.

好吧，让我们再看一下国家/地区列，看看是否还需要进行更多的数据清理。

# get all the unique values in the 'Country' column
countries = professors['Country'].unique()

# sort them alphabetically and then take a closer look
countries.sort()
countries

array(['australia', 'austria', 'canada', 'china', 'finland', 'france',
       'germany', 'greece', 'hongkong', 'ireland', 'italy', 'japan',
       'macau', 'malaysia', 'mauritius', 'netherland', 'new zealand',
       'norway', 'pakistan', 'portugal', 'russian federation',
       'saudi arabia', 'scotland', 'singapore', 'south korea',
       'southkorea', 'spain', 'sweden', 'thailand', 'turkey', 'uk',
       'urbana', 'usa', 'usofa'], dtype=object)

professors['Country'].value_counts().sort_index()

Country
australia              10
austria                20
canada                  6
china                  16
finland                 1
france                 23
germany                23
greece                  1
hongkong                1
ireland                 3
italy                  17
japan                   3
macau                   1
malaysia               25
mauritius               1
netherland              6
new zealand             2
norway                  3
pakistan              776
portugal                2
russian federation      1
saudi arabia            1
scotland                1
singapore               3
south korea             8
southkorea              7
spain                   3
sweden                 17
thailand               12
turkey                  1
uk                     92
urbana                  1
usa                    45
usofa                  10
Name: count, dtype: int64

It does look like there is another inconsistency: 'southkorea' and 'south korea' should be the same.

看起来确实存在另一个不一致之处：southkorea和south korea应该是相同的。

We're going to use the fuzzywuzzy package to help identify which strings are closest to each other. This dataset is small enough that we could probably could correct errors by hand, but that approach doesn't scale well. (Would you want to correct a thousand errors by hand? What about ten thousand? Automating things as early as possible is generally a good idea. Plus, it’s fun!)

我们将使用 fuzzywuzzy 包来帮助识别哪些字符串彼此最接近。这个数据集足够小，我们可以手动纠正错误，但这种方法的扩展性不好。（您想手动纠正一千个错误吗？一万个错误又如何？尽早实现自动化通常是个好主意。而且，这很有趣！）

Fuzzy matching: The process of automatically finding text strings that are very similar to the target string. In general, a string is considered "closer" to another one the fewer characters you'd need to change if you were transforming one string into another. So "apple" and "snapple" are two changes away from each other (add "s" and "n") while "in" and "on" and one change away (rplace "i" with "o"). You won't always be able to rely on fuzzy matching 100%, but it will usually end up saving you at least a little time.

模糊匹配：自动查找与目标字符串非常相似的文本字符串的过程。一般来说，如果将一个字符串转换为另一个字符串，则需要更改的字符越少，则认为一个字符串与另一个字符串越接近。因此，apple和snapple彼此相距两个变化（添加s和n），而in和on则相距一个变化（将i替换为o）。您并不总是能够 100% 依赖模糊匹配，但它通常最终会至少为您节省一点时间。

Fuzzywuzzy returns a ratio given two strings. The closer the ratio is to 100, the smaller the edit distance between the two strings. Here, we're going to get the ten strings from our list of cities that have the closest distance to "south korea".

Fuzzywuzzy 返回给定两个字符串的比率。该比率越接近 100，两个字符串之间的编辑距离越小。在这里，我们将从距离south korea最近的城市列表中获取十个字符串。

# get the top 10 closest matches to "south korea"
matches = fuzzywuzzy.process.extract("south korea", countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

[('south korea', 100),
 ('southkorea', 48),
 ('saudi arabia', 43),
 ('norway', 35),
 ('austria', 33),
 ('ireland', 33),
 ('pakistan', 32),
 ('portugal', 32),
 ('scotland', 32),
 ('australia', 30)]

We can see that two of the items in the cities are very close to "south korea": "south korea" and "southkorea". Let's replace all rows in our "Country" column that have a ratio of > 47 with "south korea".

我们可以看到城市中有两个项目与south korea非常接近：south korea和southkorea。让我们将Country列中比率 > 47 的所有行替换为south korea。

To do this, I'm going to write a function. (It's a good idea to write a general purpose function you can reuse if you think you might have to do a specific task more than once or twice. This keeps you from having to copy and paste code too often, which saves time and can help prevent mistakes.)

为此，我将编写一个函数。（如果您认为可能需要多次执行特定任务，那么编写一个可以重用的通用函数是一个好主意。这可以让您不必过于频繁地复制和粘贴代码，从而节省时间并可以帮助您防止错误。）

# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio = 47):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings, 
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only get matches with a ratio > 90
    close_matches = [matches[0] for matches in matches if matches[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input matches 
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")

Now that we have a function, we can put it to the test!

现在我们有了一个函数，我们可以对其进行测试！

# use the function we just wrote to replace close matches to "south korea" with "south korea"
replace_matches_in_column(df=professors, column='Country', string_to_match="south korea", min_ratio=47)

All done!

And now let's check the unique values in our "Country" column again and make sure we've tidied up "south korea" correctly.

现在让我们再次检查国家/地区列中的唯一值，并确保我们已正确清理了south korea。

# get all the unique values in the 'Country' column
countries = professors['Country'].unique()

# sort them alphabetically and then take a closer look
countries.sort()
countries

array(['australia', 'austria', 'canada', 'china', 'finland', 'france',
       'germany', 'greece', 'hongkong', 'ireland', 'italy', 'japan',
       'macau', 'malaysia', 'mauritius', 'netherland', 'new zealand',
       'norway', 'pakistan', 'portugal', 'russian federation',
       'saudi arabia', 'scotland', 'singapore', 'south korea', 'spain',
       'sweden', 'thailand', 'turkey', 'uk', 'urbana', 'usa', 'usofa'],
      dtype=object)

Excellent! Now we only have "south korea" in our dataframe and we didn't have to change anything by hand.

出色的！现在我们的dataframe中只有south korea，我们不需要手动更改任何内容。

Your turn!

到你了！

Therre are still some inconsistencies in the dataset. Continue cleaning the data in a final exercise.

数据集中仍然存在一些不一致的情况。在最后的练习中继续清理数据 。

05.course-inconsistent-data-entry【数据不一致】

M	T	W	T	F	S	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30