Pandas

dyh 收录于 Note

2022-10-09 约 2968 字预计阅读 6 分钟次阅读

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

参考

1
2


import numpy as np
import pandas as pd

Object creation

Series是一维数据

1
2


s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

DataFrame是一张表格

1
2


dates = pd.date_range("20130101", periods=6)
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

1
2


df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

	A	B	C	D
2013-01-01	1.672611	1.010890	1.873754	1.026126
2013-01-02	1.099263	0.330668	-1.858497	-1.445272
2013-01-03	-0.833637	1.416958	-0.525263	-0.661282
2013-01-04	-0.667948	0.947373	0.011333	-0.303310
2013-01-05	1.412637	-1.539891	-0.789662	1.382914
2013-01-06	-1.925618	-0.395157	0.215027	0.800317

通过传递字典来构造DataFrame

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df2

	A	B	C	D	E	F
0	1.0	2013-01-02	1.0	3	test	foo
1	1.0	2013-01-02	1.0	3	train	foo
2	1.0	2013-01-02	1.0	3	test	foo
3	1.0	2013-01-02	1.0	3	train	foo

DataFrames每一列可以有不同的值

1

df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Viewing data

使用DataFrame.head()和DataFrame.tail()来查看前几行和后几行，如果不指定参数则除了第一行（或者最后一行）都显示

1

df.head()

	A	B	C	D
2013-01-01	1.672611	1.010890	1.873754	1.026126
2013-01-02	1.099263	0.330668	-1.858497	-1.445272
2013-01-03	-0.833637	1.416958	-0.525263	-0.661282
2013-01-04	-0.667948	0.947373	0.011333	-0.303310
2013-01-05	1.412637	-1.539891	-0.789662	1.382914

1

df.tail(3)

	A	B	C	D
2013-01-04	-0.667948	0.947373	0.011333	-0.303310
2013-01-05	1.412637	-1.539891	-0.789662	1.382914
2013-01-06	-1.925618	-0.395157	0.215027	0.800317

使用DataFrame.index 和 DataFrame.columns来获取行表头和列表头

1

df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

1

df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

使用DataFrame.to_numpy()可以将DataFrame转换成numpy，这将有很大的开销，因为DataFrame允许每一列的类型不一样，numpy不行

1

df.to_numpy()

array([[ 1.67261051,  1.01088994,  1.8737536 ,  1.02612614],
       [ 1.09926258,  0.33066777, -1.85849732, -1.44527193],
       [-0.83363676,  1.41695768, -0.52526275, -0.66128225],
       [-0.66794837,  0.94737275,  0.01133288, -0.30331012],
       [ 1.41263728, -1.53989118, -0.78966244,  1.38291399],
       [-1.92561833, -0.39515657,  0.21502717,  0.80031675]])

1

df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

注意，转化成numpy后将不再包含行列表头信息

describe()可以显示一些基本的统计信息

1

df.describe()

	A	B	C	D
count	6.000000	6.000000	6.000000	6.000000
mean	0.126218	0.295140	-0.178885	0.133249
std	1.466634	1.099208	1.242481	1.106116
min	-1.925618	-1.539891	-1.858497	-1.445272
25%	-0.792215	-0.213700	-0.723563	-0.571789
50%	0.215657	0.639020	-0.256965	0.248503
75%	1.334294	0.995011	0.164104	0.969674
max	1.672611	1.416958	1.873754	1.382914

转置

1

df.T

	2013-01-01	2013-01-02	2013-01-03	2013-01-04	2013-01-05	2013-01-06
A	1.672611	1.099263	-0.833637	-0.667948	1.412637	-1.925618
B	1.010890	0.330668	1.416958	0.947373	-1.539891	-0.395157
C	1.873754	-1.858497	-0.525263	0.011333	-0.789662	0.215027
D	1.026126	-1.445272	-0.661282	-0.303310	1.382914	0.800317

DataFrame.sort_index()按某个轴的表头排序

1

df.sort_index(axis=1, ascending=False)

	D	C	B	A
2013-01-01	1.026126	1.873754	1.010890	1.672611
2013-01-02	-1.445272	-1.858497	0.330668	1.099263
2013-01-03	-0.661282	-0.525263	1.416958	-0.833637
2013-01-04	-0.303310	0.011333	0.947373	-0.667948
2013-01-05	1.382914	-0.789662	-1.539891	1.412637
2013-01-06	0.800317	0.215027	-0.395157	-1.925618

DataFrame.sort_values()按某一行（列）排序

1

df.sort_values(by="B")

	A	B	C	D
2013-01-05	1.412637	-1.539891	-0.789662	1.382914
2013-01-06	-1.925618	-0.395157	0.215027	0.800317
2013-01-02	1.099263	0.330668	-1.858497	-1.445272
2013-01-04	-0.667948	0.947373	0.011333	-0.303310
2013-01-01	1.672611	1.010890	1.873754	1.026126
2013-01-03	-0.833637	1.416958	-0.525263	-0.661282

Selection

Getting

选择一个列

1

df["A"]

2013-01-01    1.672611
2013-01-02    1.099263
2013-01-03   -0.833637
2013-01-04   -0.667948
2013-01-05    1.412637
2013-01-06   -1.925618
Freq: D, Name: A, dtype: float64

使用切片的方式选择行

1

df[0:3]

	A	B	C	D
2013-01-01	1.672611	1.010890	1.873754	1.026126
2013-01-02	1.099263	0.330668	-1.858497	-1.445272
2013-01-03	-0.833637	1.416958	-0.525263	-0.661282

1

df["20130102":"20130104"]

	A	B	C	D
2013-01-02	1.099263	0.330668	-1.858497	-1.445272
2013-01-03	-0.833637	1.416958	-0.525263	-0.661282
2013-01-04	-0.667948	0.947373	0.011333	-0.303310

Selection by label

获取某一行

1

df.loc[dates[0]]

A    1.672611
B    1.010890
C    1.873754
D    1.026126
Name: 2013-01-01 00:00:00, dtype: float64

通过标签选择一片区域

1

df.loc[:, ["A", "B"]]

	A	B
2013-01-01	1.672611	1.010890
2013-01-02	1.099263	0.330668
2013-01-03	-0.833637	1.416958
2013-01-04	-0.667948	0.947373
2013-01-05	1.412637	-1.539891
2013-01-06	-1.925618	-0.395157

1

df.loc["20130102":"20130104", ["A", "B"]]

	A	B
2013-01-02	1.099263	0.330668
2013-01-03	-0.833637	1.416958
2013-01-04	-0.667948	0.947373

1

df.loc["20130102", ["A", "B"]]

A    1.099263
B    0.330668
Name: 2013-01-02 00:00:00, dtype: float64

1

df.loc[dates[0], "A"]

1.6726105100658228

1

df.at[dates[0], "A"]

1.6726105100658228

Selection by position

通过传入的参数选择某一行

1

df.iloc[3]

A   -0.667948
B    0.947373
C    0.011333
D   -0.303310
Name: 2013-01-04 00:00:00, dtype: float64

通过传入的参数切片

1

df.iloc[3:5, 0:2]

	A	B
2013-01-04	-0.667948	0.947373
2013-01-05	1.412637	-1.539891

通过列表选择行和列

1

df.iloc[[1, 2, 4], [0, 2]]

	A	C
2013-01-02	1.099263	-1.858497
2013-01-03	-0.833637	-0.525263
2013-01-05	1.412637	-0.789662

对行直接切片

1

df.iloc[1:3, :]

	A	B	C	D
2013-01-02	1.099263	0.330668	-1.858497	-1.445272
2013-01-03	-0.833637	1.416958	-0.525263	-0.661282

对列直接切片

1

df.iloc[:, 1:3]

	B	C
2013-01-01	1.010890	1.873754
2013-01-02	0.330668	-1.858497
2013-01-03	1.416958	-0.525263
2013-01-04	0.947373	0.011333
2013-01-05	-1.539891	-0.789662
2013-01-06	-0.395157	0.215027

直接获得某个值

1

df.iloc[1, 1]

0.33066776522090735

1

df.iat[1, 1]

0.33066776522090735

Boolean indexing

使用单个列的值来选择

1

df[df["A"] > 0]

	A	B	C	D
2013-01-01	1.672611	1.010890	1.873754	1.026126
2013-01-02	1.099263	0.330668	-1.858497	-1.445272
2013-01-05	1.412637	-1.539891	-0.789662	1.382914

从DataFrame中选择

1

df[df > 0]

	A	B	C	D
2013-01-01	1.672611	1.010890	1.873754	1.026126
2013-01-02	1.099263	0.330668	NaN	NaN
2013-01-03	NaN	1.416958	NaN	NaN
2013-01-04	NaN	0.947373	0.011333	NaN
2013-01-05	1.412637	NaN	NaN	1.382914
2013-01-06	NaN	NaN	0.215027	0.800317

使用isin()来过滤

1
2
3


df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2

	A	B	C	D	E
2013-01-01	1.672611	1.010890	1.873754	1.026126	one
2013-01-02	1.099263	0.330668	-1.858497	-1.445272	one
2013-01-03	-0.833637	1.416958	-0.525263	-0.661282	two
2013-01-04	-0.667948	0.947373	0.011333	-0.303310	three
2013-01-05	1.412637	-1.539891	-0.789662	1.382914	four
2013-01-06	-1.925618	-0.395157	0.215027	0.800317	three

1

df2[df2["E"].isin(["two", "four"])]

	A	B	C	D	E
2013-01-03	-0.833637	1.416958	-0.525263	-0.661282	two
2013-01-05	1.412637	-1.539891	-0.789662	1.382914	four

Setting

设置新列将自动按索引对齐数据

1

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))

1

s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

通过label来设置值

1

df.at[dates[0], "A"] = 0

通过位置来设置值

1

df.iat[0, 1] = 0

1

df

	A	B	C	D
2013-01-01	0.000000	0.000000	1.873754	5
2013-01-02	1.099263	0.330668	-1.858497	5
2013-01-03	-0.833637	1.416958	-0.525263	5
2013-01-04	-0.667948	0.947373	0.011333	5
2013-01-05	1.412637	-1.539891	-0.789662	5
2013-01-06	-1.925618	-0.395157	0.215027	5

使用where设置值

1

df2 = df.copy()

1

df2[df2 > 0] = -df2

1

df2

	A	B	C	D
2013-01-01	0.000000	0.000000	-1.873754	-5
2013-01-02	-1.099263	-0.330668	-1.858497	-5
2013-01-03	-0.833637	-1.416958	-0.525263	-5
2013-01-04	-0.667948	-0.947373	-0.011333	-5
2013-01-05	-1.412637	-1.539891	-0.789662	-5
2013-01-06	-1.925618	-0.395157	-0.215027	-5

Missing data

np.nan表示缺失的数据

reindex可以更改、增加、删除某个轴上的索引，返回一份拷贝

1

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])

1

df1.loc[dates[0] : dates[1], "E"] = 1

1

df1

	A	B	C	D	E
2013-01-01	0.000000	0.000000	1.873754	5	1.0
2013-01-02	1.099263	0.330668	-1.858497	5	1.0
2013-01-03	-0.833637	1.416958	-0.525263	5	NaN
2013-01-04	-0.667948	0.947373	0.011333	5	NaN

DataFrame.dropna()将有缺失值的行抛弃

1

df1.dropna(how="any")

	A	B	C	D	E
2013-01-01	0.000000	0.000000	1.873754	5	1.0
2013-01-02	1.099263	0.330668	-1.858497	5	1.0

DataFrame.fillna()填充缺失的数据

1

df1.fillna(value=5)

	A	B	C	D	E
2013-01-01	0.000000	0.000000	1.873754	5	1.0
2013-01-02	1.099263	0.330668	-1.858497	5	1.0
2013-01-03	-0.833637	1.416958	-0.525263	5	5.0
2013-01-04	-0.667948	0.947373	0.011333	5	5.0

isna()把nan的块变成true，其他位false

1

pd.isna(df1)

	A	B	C	D	E
2013-01-01	False	False	False	False	False
2013-01-02	False	False	False	False	False
2013-01-03	False	False	False	False	True
2013-01-04	False	False	False	False	True

Operations

Stats

平均值

1

df.mean()

A   -0.152551
B    0.126658
C   -0.178885
D    5.000000
dtype: float64

另一个轴的平均值

1

df.mean(1)

2013-01-01    1.718438
2013-01-02    1.142858
2013-01-03    1.264515
2013-01-04    1.322689
2013-01-05    1.020771
2013-01-06    0.723563
Freq: D, dtype: float64

不同维度时需要对齐和广播

1

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

1

s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

1

df.sub(s, axis="index")

	A	B	C	D
2013-01-01	NaN	NaN	NaN	NaN
2013-01-02	NaN	NaN	NaN	NaN
2013-01-03	-1.833637	0.416958	-1.525263	4.0
2013-01-04	-3.667948	-2.052627	-2.988667	2.0
2013-01-05	-3.587363	-6.539891	-5.789662	0.0
2013-01-06	NaN	NaN	NaN	NaN

Apply

DataFrame.apply()应用某个用户自定义的函数

1

df.apply(np.cumsum)

	A	B	C	D
2013-01-01	0.000000	0.000000	1.873754	5
2013-01-02	1.099263	0.330668	0.015256	10
2013-01-03	0.265626	1.747625	-0.510006	15
2013-01-04	-0.402323	2.694998	-0.498674	20
2013-01-05	1.010315	1.155107	-1.288336	25
2013-01-06	-0.915304	0.759950	-1.073309	30

1

df.apply(lambda x: x.max() - x.min())

A    3.338256
B    2.956849
C    3.732251
D    0.000000
dtype: float64

Histogramming

1

s = pd.Series(np.random.randint(0, 7, size=10))

1

s

0    3
1    5
2    6
3    5
4    5
5    0
6    5
7    2
8    3
9    5
dtype: int32

1

s.value_counts()

5    5
3    2
6    1
0    1
2    1
dtype: int64

String Methods

1

s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

1

s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

Merge

Concat

concat()

1

df = pd.DataFrame(np.random.randn(10, 4))

1

df

	0	1	2	3
0	-0.197911	-0.211569	0.648594	-1.123691
1	-1.486624	-0.347519	0.168468	-1.687462
2	-2.242102	0.853538	-1.275269	2.274398
3	-1.695120	0.262447	-0.312296	-0.977294
4	0.606254	-0.358807	0.517332	-0.495694
5	0.400137	-0.557555	0.032655	0.057777
6	0.842657	1.707752	0.159613	-0.077072
7	-1.918256	0.666318	-0.842085	-0.655119
8	0.955265	-1.577956	-0.073302	0.049012
9	0.518293	-1.392874	-1.549751	0.232338

1
2


# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]

1

pd.concat(pieces)

	0	1	2	3
0	-0.197911	-0.211569	0.648594	-1.123691
1	-1.486624	-0.347519	0.168468	-1.687462
2	-2.242102	0.853538	-1.275269	2.274398
3	-1.695120	0.262447	-0.312296	-0.977294
4	0.606254	-0.358807	0.517332	-0.495694
5	0.400137	-0.557555	0.032655	0.057777
6	0.842657	1.707752	0.159613	-0.077072
7	-1.918256	0.666318	-0.842085	-0.655119
8	0.955265	-1.577956	-0.073302	0.049012
9	0.518293	-1.392874	-1.549751	0.232338

Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.

Join

merge()

1

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})

1

right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

1

left

	key	lval
0	foo	1
1	foo	2

1

right

	key	rval
0	foo	4
1	foo	5

1

pd.merge(left, right, on="key")

	key	lval	rval
0	foo	1	4
1	foo	1	5
2	foo	2	4
3	foo	2	5

另一个例子

1

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})

1

right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

1

left

	key	lval
0	foo	1
1	bar	2

1

right

	key	rval
0	foo	4
1	bar	5

1

pd.merge(left, right, on="key")

	key	lval	rval
0	foo	1	4
1	bar	2	5

Grouping

group一般执行以下步骤

Splitting the data into groups based on some criteria

Applying a function to each group independently

Combining the results into a data structure

1
2
3
4
5
6
7
8


df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

1

df

	A	B	C	D
0	foo	one	0.441899	0.453832
1	bar	one	1.074023	-0.715191
2	foo	two	1.716625	-1.196256
3	bar	three	0.037677	-1.320146
4	foo	two	-0.545053	-0.372236
5	bar	two	-0.763277	-1.350397
6	foo	one	-0.962743	1.600875
7	foo	three	0.684661	-0.004039

1

df.groupby("A")[["C", "D"]].sum()

	C	D
A
bar	0.348422	-3.385734
foo	1.335389	0.482176

1

df.groupby(["A", "B"]).sum()

		C	D
A	B
bar	one	1.074023	-0.715191
	three	0.037677	-1.320146
	two	-0.763277	-1.350397
foo	one	-0.520844	2.054707
	three	0.684661	-0.004039
	two	1.171571	-1.568492

Reshaping

Stack

1
2
3
4
5
6


tuples = list(
    zip(
        ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
        ["one", "two", "one", "two", "one", "two", "one", "two"],
    )
)

1

index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

1

df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

1

df2 = df[:4]

1

df2

		A	B
first	second
bar	one	0.124214	0.765203
bar	two	-0.519057	2.075762
baz	one	-0.830936	-0.544587
baz	two	1.660069	1.475826

1

stacked = df2.stack()

1

stacked

first  second   
bar    one     A    0.124214
               B    0.765203
       two     A   -0.519057
               B    2.075762
baz    one     A   -0.830936
               B   -0.544587
       two     A    1.660069
               B    1.475826
dtype: float64

1

stacked.unstack()

		A	B
first	second
bar	one	0.124214	0.765203
bar	two	-0.519057	2.075762
baz	one	-0.830936	-0.544587
baz	two	1.660069	1.475826

1

stacked.unstack(1)

	second	one	two
first
bar	A	0.124214	-0.519057
bar	B	0.765203	2.075762
baz	A	-0.830936	1.660069
baz	B	-0.544587	1.475826

1

stacked.unstack(0)

	first	bar	baz
second
one	A	0.124214	-0.830936
one	B	0.765203	-0.544587
two	A	-0.519057	1.660069
two	B	2.075762	1.475826

Pivot tables

1
2
3
4
5
6
7
8
9


df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "three"] * 3,
        "B": ["A", "B", "C"] * 4,
        "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
        "D": np.random.randn(12),
        "E": np.random.randn(12),
    }
)

1

df

	A	B	C	D	E
0	one	A	foo	-0.341421	-0.546007
1	one	B	foo	1.618503	0.611306
2	two	C	foo	-0.941145	1.940351
3	three	A	bar	0.799316	0.184894
4	one	B	bar	1.621951	-0.394846
5	one	C	bar	-1.334491	0.177758
6	two	A	foo	-2.192872	-0.217708
7	three	B	foo	-1.127164	-0.266335
8	one	C	foo	-0.305296	0.404488
9	one	A	bar	-0.479922	0.379061
10	two	B	bar	-2.166614	-2.103981
11	three	C	bar	1.932436	0.609512

1

pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])

	C	bar	foo
A	B
one	A	-0.479922	-0.341421
	B	1.621951	1.618503
	C	-1.334491	-0.305296
three	A	0.799316	NaN
	B	NaN	-1.127164
	C	1.932436	NaN
two	A	NaN	-2.192872
	B	-2.166614	NaN
	C	NaN	-0.941145

Time series

Categoricals

1
2
3


df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)

转化成类别类型

1

df["grade"] = df["raw_grade"].astype("category")

1

df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

重命名

1

new_categories = ["very good", "good", "very bad"]

1

df["grade"] = df["grade"].cat.rename_categories(new_categories)

Plotting

1

import matplotlib.pyplot as plt

1

plt.close("all")

1

ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))

1

ts = ts.cumsum()

1

ts.plot()

<AxesSubplot:>

1

plt.show()

1
2
3


df = pd.DataFrame(
    np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
)

1

df = df.cumsum()

1

plt.figure();

<Figure size 432x288 with 0 Axes>

1

df.plot();

1

plt.legend(loc='best');

No handles with labels found to put in legend.

目录