8. Working with Data: Data Preparation

![Sketchnote by @sketchthedocs](https://www.aiknowledge.cn/images/初学者的数据科学课程/08-DataPreparation.webp)

Data Preparation - Sketchnote by @nitya

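Pre-lecture quiz

Depending on its source, raw data may contain inconsistencies that cause challenges in analysis and modeling. In other words, this data can be categorized as "dirty" and will need to be cleaned up. This lesson focuses on techniques for cleaning and transforming data to handle missing, inaccurate, or incomplete data. The topics covered in this lesson use Python and the Pandas library and are demonstrated in the notebook within this directory.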

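The importance of cleaning data

Ease of use and reuse: When data is properly organized and normalized, it is easier to search, use, and share with others.
Consistency: Data science often requires working with more than one dataset, where datasets from different sources need to be joined together. Making sure that each individual dataset follows a common standard ensures that the data is still useful when they are all merged into one dataset.
Model accuracy: Data that has been cleaned improves the accuracy of the models that rely on it.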

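Common cleaning goals and strategies

Exploring a dataset: Data exploration, which is covered in a later lesson, can help you discover data that needs to be cleaned up. Observing the values within a dataset can set expectations of what the rest of it will look like, or give you an idea of the problems that can be resolved. Exploration can involve basic querying, visualizations, and sampling.
Formatting: Depending on the source, data can be inconsistent in how it is presented. This can cause problems when searching for and representing a value: it is visible in the dataset but is not properly represented in visualizations or query results. Common formatting problems involve whitespace, dates, and data types. Resolving formatting issues is typically up to the people who are using the data. For example, standards for how dates and numbers are presented can differ by country.
Duplications: Data that occurs more than once can produce inaccurate results and usually should be removed. This is common when joining two or more datasets together. However, duplicates in joined datasets sometimes contain pieces that provide additional information and may need to be preserved.
Missing data: Missing data can cause inaccuracies as well as weak or biased results. Sometimes this can be resolved by reloading the data, by filling in the missing values with computation and code (such as Python), or simply by removing the value and its corresponding data. Data can be missing for many reasons, and the action taken to resolve missing values can depend on how and why they went missing in the first place.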
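The formatting problems listed above (whitespace, dates, and data types) can be sketched with a few standard pandas calls. The `raw` DataFrame and its column names are hypothetical and exist only for illustration; real data will usually need more careful handling.

```python
import pandas as pd

# Hypothetical raw records: stray whitespace, dates stored as text,
# and numbers stored as strings.
raw = pd.DataFrame({
    "city": ["  Paris", "Lyon  ", " Nice "],
    "visited": ["2021-01-05", "2021-02-10", "2021-03-15"],
    "visitors": ["10", "20", "30"],
})

clean = raw.copy()
clean["city"] = clean["city"].str.strip()            # remove leading/trailing whitespace
clean["visited"] = pd.to_datetime(clean["visited"])  # parse ISO date strings into datetimes
clean["visitors"] = clean["visitors"].astype(int)    # convert numeric strings to integers

print(clean.dtypes)
```

Working on a copy keeps the raw data available for comparison if a conversion turns out to be wrong.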

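Exploring DataFrame information

Learning goal: By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.

Once you have loaded your data into pandas, it will more likely than not be in a DataFrame (refer to the previous lesson for a detailed overview). However, if the dataset in your DataFrame has 60,000 rows and 400 columns, how do you even begin to get a sense of what you are working with? Fortunately, pandas provides some convenient tools to quickly look at overall information about a DataFrame, in addition to the first few and last few rows.

To explore this functionality, we will import the Python scikit-learn library and use an iconic dataset: the Iris dataset.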


import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)
0	5.1	3.5	1.4	0.2
1	4.9	3.0	1.4	0.2
2	4.7	3.2	1.3	0.2
3	4.6	3.1	1.5	0.2
4	5.0	3.6	1.4	0.2

DataFrame.info: To start off, the info() method prints a summary of the content present in a DataFrame. Let's take a look at this dataset to see what we have:


iris_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB

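From this, we know that the Iris dataset has 150 entries in four columns with no null entries. All of the data is stored as 64-bit floating-point numbers.

DataFrame.head(): Next, to check the actual content of the DataFrame, we use the head() method. Let's see what the first few rows of our iris_df look like: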


iris_df.head()


   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

DataFrame.tail(): Conversely, to check the last few rows of the DataFrame, we use the tail() method:


iris_df.tail()


     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

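Takeaway: Even just by looking at the metadata about a DataFrame, or at its first and last few rows, you can get an immediate idea of the size, shape, and content of the data you are dealing with.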
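Beyond info(), head(), and tail(), a few other standard pandas attributes and methods give a similar quick overview. The sketch below uses a small hand-built DataFrame (named `small_df` here purely for illustration) holding the five rows from the head() output, so it runs without scikit-learn.

```python
import pandas as pd

# The five rows shown in the head() output, rebuilt by hand.
small_df = pd.DataFrame({
    "sepal length (cm)": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal width (cm)": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal length (cm)": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal width (cm)": [0.2, 0.2, 0.2, 0.2, 0.2],
})

print(small_df.shape)       # (number of rows, number of columns)
print(small_df.dtypes)      # data type of each column
print(small_df.describe())  # count, mean, std, min, quartiles, and max per numeric column
```

The same calls work unchanged on the full iris_df.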

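Dealing with missing data

Learning goal: By the end of this subsection, you should know how to replace or remove null values from DataFrames.

Most of the time, the datasets you want to use (or have to use) have missing values in them. How missing data is handled carries subtle tradeoffs that can affect your final analysis and real-world outcomes.

Pandas handles missing values in two ways. The first you have seen in previous sections: NaN, or Not a Number. This is actually a special value that is part of the IEEE floating-point specification, and it is only used to indicate missing floating-point values.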

For missing values apart from floats, pandas uses the Python None object. While it might seem confusing that you will encounter two different kinds of values that say essentially the same thing, there are sound programmatic reasons for this design choice and, in practice, going this route enables pandas to deliver a good compromise for the vast majority of cases. Notwithstanding this, both None and NaN carry restrictions that you need to be mindful of with regards to how they can be used.
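The difference between the two markers can be seen in a short sketch. The variable name `numeric` is chosen only for illustration.

```python
import numpy as np
import pandas as pd

# NaN is a float, so arithmetic works, but the result is always NaN.
print(np.nan + 1)        # nan
print(np.nan == np.nan)  # False: NaN is not equal to anything, including itself

# None is a Python object; arithmetic with it raises an error.
try:
    None + 1
except TypeError as err:
    print("TypeError:", err)

# In a numeric Series, pandas converts None to NaN and uses a float dtype.
numeric = pd.Series([1, None, 3])
print(numeric.dtype)     # float64
```

Because NaN never compares equal to itself, checks such as `value == np.nan` do not work; use the pandas null-detection methods instead.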

Check out more about NaN and None from the notebook!

Detecting null values: In pandas, the isnull() and notnull() methods are your primary methods for detecting null data. Both return Boolean masks over your data. We will be using numpy for NaN values:


import numpy as np

example1 = pd.Series([0, np.nan, '', None])
example1.isnull()


0    False
1     True
2    False
3     True
dtype: bool

Look closely at the output. Does any of it surprise you? While 0 is an arithmetic null, it's nevertheless a perfectly good integer and pandas treats it as such. '' is a little more subtle. While we used it in Section 1 to represent an empty string value, it is nevertheless a string object and not a representation of null as far as pandas is concerned.

Now, let's turn this around and use these methods in a manner more like you will use them in practice. You can use Boolean masks directly as a Series or DataFrame index, which can be useful when trying to work with isolated missing (or present) values.
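A minimal sketch of using the masks as an index follows. The Series `values` holds the same entries as example1; the names `present` and `missing` are illustrative only.

```python
import numpy as np
import pandas as pd

values = pd.Series([0, np.nan, '', None])  # same entries as example1

present = values[values.notnull()]  # keep only the entries that are not null
missing = values[values.isnull()]   # keep only the entries that are null

print(present.index.tolist())  # [0, 2]
print(missing.index.tolist())  # [1, 3]
print(values.isnull().sum())   # 2 null entries in total
```

Summing the mask works because True counts as 1, which makes it a quick way to count missing values.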

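Takeaway: Both the isnull() and notnull() methods produce similar results when you use them in DataFrames: they show the results and the index of those results, which will help you enormously as you wrestle with your data.

Dropping null values: Beyond identifying missing values, pandas provides a convenient means to remove null values from Series and DataFrames. (Particularly on large data sets, it is often more advisable to simply remove missing [NA] values from your analysis than deal with them in other ways.) To see this in action, let's return to example1: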


example1 = example1.dropna()
example1


0    0
2     
dtype: object

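Note that this should look like your output from example3[example3.notnull()]. The difference here is that, rather than just indexing on the masked values, dropna has removed those missing values from the Series example1.

Because DataFrames have two dimensions, they afford more options for dropping data.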


example2 = pd.DataFrame([[1,      np.nan, 7], 
                         [2,      5,      8], 
                         [np.nan, 6,      9]])
example2

	0	1	2
0	1.0	NaN	7
1	2.0	5.0	8
2	NaN	6.0	9

(Did you notice that pandas upcast two of the columns to floats to accommodate the NaNs?)
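The upcasting can be confirmed by inspecting the column dtypes. This sketch rebuilds the same data under the illustrative name `frame`.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1,      np.nan, 7],
                      [2,      5,      8],
                      [np.nan, 6,      9]])  # same data as example2

print(frame.dtypes)
# Columns 0 and 1 contain NaN, so they are stored as float64;
# column 2 has no missing values and stays int64.
```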

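You cannot drop a single value from a DataFrame, so you have to drop full rows or columns. Depending on what you are doing, you might want to do one or the other, and so pandas gives you options for both. Because in data science columns generally represent variables and rows represent observations, you are more likely to drop rows of data; the default setting for dropna() is to drop all rows that contain any null values: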


example2.dropna()


	0	1	2
1	2.0	5.0	8

If necessary, you can drop NA values from columns instead. Use axis=1 (or the equivalent axis='columns') to do so:


example2.dropna(axis='columns')

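Notice that this can drop a lot of data that you might want to keep, particularly in smaller datasets. What if you just want to drop rows or columns that contain several or even just all null values? You specify those settings in dropna with the how and thresh parameters.

By default, how='any' (if you would like to check for yourself or see what other parameters the method has, run example4.dropna? in a code cell). You could alternatively specify how='all' so as to drop only rows or columns that contain all null values. Let's expand our example DataFrame to see this in action.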


example2[3] = np.nan
example2

	0	1	2	3
0	1.0	NaN	7	NaN
1	2.0	5.0	8	NaN
2	NaN	6.0	9	NaN

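The thresh parameter gives you finer-grained control: you set the number of non-null values that a row or column needs to have in order to be kept: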


example2.dropna(axis='rows', thresh=3)


	0	1	2	3
1	2.0	5.0	8	NaN

Here, the first and last rows have been dropped, because they contain only two non-null values.
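The how parameter described earlier can be compared against the default in a short sketch. The name `frame` is illustrative; the data matches the expanded example2.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1,      np.nan, 7, np.nan],
                      [2,      5,      8, np.nan],
                      [np.nan, 6,      9, np.nan]])  # same data as the expanded example2

# how='all' drops a column only if every value in it is null, so only column 3 goes.
print(frame.dropna(axis='columns', how='all'))

# The default how='any' drops every column with at least one null, leaving only column 2.
print(frame.dropna(axis='columns'))
```

With how='all', the partially filled columns 0 and 1 survive, which is often what you want on small datasets.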

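Filling null values: Depending on your dataset, it can sometimes make more sense to fill null values with valid ones rather than drop them. You could use isnull to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides fillna, which returns a copy of the Series or DataFrame with the missing values replaced with one of your choosing. Let's create another example Series to see how this works in practice.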


example3 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
example3


a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

You can fill all of the null entries with a single value, such as 0:


example3.fillna(0)


a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

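You can forward-fill null values, which is to use the last valid value to fill a null:


example3.ffill()  # same result as fillna(method='ffill'), which newer pandas versions deprecate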


a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

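You can also back-fill to propagate the next valid value backward to fill a null:


example3.bfill()  # same result as fillna(method='bfill'), which newer pandas versions deprecate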


a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

As you might guess, this works the same with DataFrames, but you can also specify an axis along which to fill null values. Taking the previously used example2 again:


example2.ffill(axis=1)  # same result as fillna(method='ffill', axis=1)


	0	1	2	3
0	1.0	1.0	7.0	7.0
1	2.0	5.0	8.0	8.0
2	NaN	6.0	9.0	9.0

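Notice that when a previous value is not available for forward-filling, the null value remains.

Takeaway: There are multiple ways to deal with missing values in your datasets. The specific strategy you use (removing them, replacing them, or even how you replace them) should be dictated by the particulars of that data. You will develop a better sense of how to deal with missing values the more you handle and interact with datasets.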
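As an illustration of how the replacement strategy can vary, the sketch below compares the manual isnull approach with fillna, and then fills with the mean of the observed values instead of a constant. Filling with the mean is only one possible choice, shown here as an assumption rather than a recommendation; the names `scores`, `manual`, `filled_zero`, and `filled_mean` are illustrative.

```python
import numpy as np
import pandas as pd

scores = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))  # same values as example3

# Manual approach: select the null positions with a mask and assign into a copy.
manual = scores.copy()
manual[manual.isnull()] = 0

# fillna does the same in one call and returns a new Series.
filled_zero = scores.fillna(0)
print(manual.equals(filled_zero))  # True

# Fill with the mean of the non-null values instead of a constant.
filled_mean = scores.fillna(scores.mean())  # mean of 1, 2, 3 is 2.0
print(filled_mean)
```

Whether a constant, a neighboring value, or a summary statistic is appropriate depends on why the values are missing.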

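Removing duplicate data

Learning goal: By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.

In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, pandas provides an easy means of detecting and removing duplicate entries.

Identifying duplicates: duplicated: You can easily spot duplicate values using the duplicated method in pandas, which returns a Boolean mask indicating whether an entry in a DataFrame is a duplicate of an earlier one. Let's create another example DataFrame to see this in action.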


example4 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
                         'numbers': [1, 2, 1, 3, 3]})
example4

	letters	numbers
0	A	1
1	B	2
2	A	1
3	B	3
4	B	3


example4.duplicated()


0    False
1    False
2     True
3    False
4     True
dtype: bool

Dropping duplicates: drop_duplicates: simply returns a copy of the data for which all of the duplicated values are False:


example4.drop_duplicates()


	letters	numbers
0	A	1
1	B	2
3	B	3

Both duplicated and drop_duplicates default to considering all columns, but you can specify that they examine only a subset of columns in your DataFrame:


example4.drop_duplicates(['letters'])


	letters	numbers
0	A	1
1	B	2

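Takeaway: Removing duplicate data is an essential part of almost every data-science project. Duplicate data can change the results of your analyses and give you inaccurate results!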
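Two variations on the duplicate checks are sketched below: passing a column subset to duplicated, and using the keep parameter of drop_duplicates to retain the last occurrence rather than the first. The name `frame` is illustrative; the data matches example4.

```python
import pandas as pd

frame = pd.DataFrame({'letters': ['A', 'B', 'A', 'B', 'B'],
                      'numbers': [1, 2, 1, 3, 3]})  # same data as example4

# Check for duplicates using only the 'letters' column.
print(frame.duplicated(subset=['letters']).tolist())
# [False, False, True, True, True]

# keep='last' keeps the final occurrence of each duplicated row instead of the first.
print(frame.drop_duplicates(keep='last').index.tolist())
# [1, 2, 4]
```

Which occurrence to keep matters when later rows are more up to date than earlier ones, for example after appending a newer extract to an older one.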

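Challenge

All of the material discussed is provided as a Jupyter notebook. In addition, there are exercises after each section. Give them a try!

Post-lecture quiz

Review & Self Study

There are many ways to discover and approach preparing your data for analysis and modeling, and cleaning data is an important step that calls for hands-on experience. Try these challenges from Kaggle to explore techniques that this lesson did not cover.

Assignment

Evaluating Data from a Form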


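    **Disclaimer**:
    This document was translated by the 灏天文库 team. Although we strive for accuracy, please be aware that the translation may contain errors or inaccuracies. The document in its original language should be considered authoritative. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.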
