pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
参考
1
2
|
import numpy as np
import pandas as pd
|
Object creation
Series
是一维数据
1
2
|
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
|
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
DataFrame
是一张表格
1
2
|
dates = pd.date_range("20130101", periods=6)
dates
|
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
1
2
|
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
|
|
A |
B |
C |
D |
2013-01-01 |
1.672611 |
1.010890 |
1.873754 |
1.026126 |
2013-01-02 |
1.099263 |
0.330668 |
-1.858497 |
-1.445272 |
2013-01-03 |
-0.833637 |
1.416958 |
-0.525263 |
-0.661282 |
2013-01-04 |
-0.667948 |
0.947373 |
0.011333 |
-0.303310 |
2013-01-05 |
1.412637 |
-1.539891 |
-0.789662 |
1.382914 |
2013-01-06 |
-1.925618 |
-0.395157 |
0.215027 |
0.800317 |
通过传递字典来构造DataFrame
1
2
3
4
5
6
7
8
9
10
11
|
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
df2
|
|
A |
B |
C |
D |
E |
F |
0 |
1.0 |
2013-01-02 |
1.0 |
3 |
test |
foo |
1 |
1.0 |
2013-01-02 |
1.0 |
3 |
train |
foo |
2 |
1.0 |
2013-01-02 |
1.0 |
3 |
test |
foo |
3 |
1.0 |
2013-01-02 |
1.0 |
3 |
train |
foo |
DataFrames每一列可以有不同的值
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
Viewing data
使用DataFrame.head()
和DataFrame.tail()
来查看前几行和后几行,如果不指定参数则除了第一行(或者最后一行)都显示
|
A |
B |
C |
D |
2013-01-01 |
1.672611 |
1.010890 |
1.873754 |
1.026126 |
2013-01-02 |
1.099263 |
0.330668 |
-1.858497 |
-1.445272 |
2013-01-03 |
-0.833637 |
1.416958 |
-0.525263 |
-0.661282 |
2013-01-04 |
-0.667948 |
0.947373 |
0.011333 |
-0.303310 |
2013-01-05 |
1.412637 |
-1.539891 |
-0.789662 |
1.382914 |
|
A |
B |
C |
D |
2013-01-04 |
-0.667948 |
0.947373 |
0.011333 |
-0.303310 |
2013-01-05 |
1.412637 |
-1.539891 |
-0.789662 |
1.382914 |
2013-01-06 |
-1.925618 |
-0.395157 |
0.215027 |
0.800317 |
使用DataFrame.index
和 DataFrame.columns
来获取行表头和列表头
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
Index(['A', 'B', 'C', 'D'], dtype='object')
使用DataFrame.to_numpy()
可以将DataFrame转换成numpy,这将有很大的开销,因为DataFrame允许每一列的类型不一样,numpy不行
array([[ 1.67261051, 1.01088994, 1.8737536 , 1.02612614],
[ 1.09926258, 0.33066777, -1.85849732, -1.44527193],
[-0.83363676, 1.41695768, -0.52526275, -0.66128225],
[-0.66794837, 0.94737275, 0.01133288, -0.30331012],
[ 1.41263728, -1.53989118, -0.78966244, 1.38291399],
[-1.92561833, -0.39515657, 0.21502717, 0.80031675]])
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
注意,转化成numpy后将不再包含行列表头信息
describe()
可以显示一些基本的统计信息
|
A |
B |
C |
D |
count |
6.000000 |
6.000000 |
6.000000 |
6.000000 |
mean |
0.126218 |
0.295140 |
-0.178885 |
0.133249 |
std |
1.466634 |
1.099208 |
1.242481 |
1.106116 |
min |
-1.925618 |
-1.539891 |
-1.858497 |
-1.445272 |
25% |
-0.792215 |
-0.213700 |
-0.723563 |
-0.571789 |
50% |
0.215657 |
0.639020 |
-0.256965 |
0.248503 |
75% |
1.334294 |
0.995011 |
0.164104 |
0.969674 |
max |
1.672611 |
1.416958 |
1.873754 |
1.382914 |
转置
|
2013-01-01 |
2013-01-02 |
2013-01-03 |
2013-01-04 |
2013-01-05 |
2013-01-06 |
A |
1.672611 |
1.099263 |
-0.833637 |
-0.667948 |
1.412637 |
-1.925618 |
B |
1.010890 |
0.330668 |
1.416958 |
0.947373 |
-1.539891 |
-0.395157 |
C |
1.873754 |
-1.858497 |
-0.525263 |
0.011333 |
-0.789662 |
0.215027 |
D |
1.026126 |
-1.445272 |
-0.661282 |
-0.303310 |
1.382914 |
0.800317 |
DataFrame.sort_index()
按某个轴的表头排序
1
|
df.sort_index(axis=1, ascending=False)
|
|
D |
C |
B |
A |
2013-01-01 |
1.026126 |
1.873754 |
1.010890 |
1.672611 |
2013-01-02 |
-1.445272 |
-1.858497 |
0.330668 |
1.099263 |
2013-01-03 |
-0.661282 |
-0.525263 |
1.416958 |
-0.833637 |
2013-01-04 |
-0.303310 |
0.011333 |
0.947373 |
-0.667948 |
2013-01-05 |
1.382914 |
-0.789662 |
-1.539891 |
1.412637 |
2013-01-06 |
0.800317 |
0.215027 |
-0.395157 |
-1.925618 |
DataFrame.sort_values()
按某一行(列)排序
|
A |
B |
C |
D |
2013-01-05 |
1.412637 |
-1.539891 |
-0.789662 |
1.382914 |
2013-01-06 |
-1.925618 |
-0.395157 |
0.215027 |
0.800317 |
2013-01-02 |
1.099263 |
0.330668 |
-1.858497 |
-1.445272 |
2013-01-04 |
-0.667948 |
0.947373 |
0.011333 |
-0.303310 |
2013-01-01 |
1.672611 |
1.010890 |
1.873754 |
1.026126 |
2013-01-03 |
-0.833637 |
1.416958 |
-0.525263 |
-0.661282 |
Selection
Getting
选择一个列
2013-01-01 1.672611
2013-01-02 1.099263
2013-01-03 -0.833637
2013-01-04 -0.667948
2013-01-05 1.412637
2013-01-06 -1.925618
Freq: D, Name: A, dtype: float64
使用切片的方式选择行
|
A |
B |
C |
D |
2013-01-01 |
1.672611 |
1.010890 |
1.873754 |
1.026126 |
2013-01-02 |
1.099263 |
0.330668 |
-1.858497 |
-1.445272 |
2013-01-03 |
-0.833637 |
1.416958 |
-0.525263 |
-0.661282 |
1
|
df["20130102":"20130104"]
|
|
A |
B |
C |
D |
2013-01-02 |
1.099263 |
0.330668 |
-1.858497 |
-1.445272 |
2013-01-03 |
-0.833637 |
1.416958 |
-0.525263 |
-0.661282 |
2013-01-04 |
-0.667948 |
0.947373 |
0.011333 |
-0.303310 |
Selection by label
获取某一行
A 1.672611
B 1.010890
C 1.873754
D 1.026126
Name: 2013-01-01 00:00:00, dtype: float64
通过标签选择一片区域
|
A |
B |
2013-01-01 |
1.672611 |
1.010890 |
2013-01-02 |
1.099263 |
0.330668 |
2013-01-03 |
-0.833637 |
1.416958 |
2013-01-04 |
-0.667948 |
0.947373 |
2013-01-05 |
1.412637 |
-1.539891 |
2013-01-06 |
-1.925618 |
-0.395157 |
1
|
df.loc["20130102":"20130104", ["A", "B"]]
|
|
A |
B |
2013-01-02 |
1.099263 |
0.330668 |
2013-01-03 |
-0.833637 |
1.416958 |
2013-01-04 |
-0.667948 |
0.947373 |
1
|
df.loc["20130102", ["A", "B"]]
|
A 1.099263
B 0.330668
Name: 2013-01-02 00:00:00, dtype: float64
1.6726105100658228
1.6726105100658228
Selection by position
通过传入的参数选择某一行
A -0.667948
B 0.947373
C 0.011333
D -0.303310
Name: 2013-01-04 00:00:00, dtype: float64
通过传入的参数切片
|
A |
B |
2013-01-04 |
-0.667948 |
0.947373 |
2013-01-05 |
1.412637 |
-1.539891 |
通过列表选择行和列
1
|
df.iloc[[1, 2, 4], [0, 2]]
|
|
A |
C |
2013-01-02 |
1.099263 |
-1.858497 |
2013-01-03 |
-0.833637 |
-0.525263 |
2013-01-05 |
1.412637 |
-0.789662 |
对行直接切片
|
A |
B |
C |
D |
2013-01-02 |
1.099263 |
0.330668 |
-1.858497 |
-1.445272 |
2013-01-03 |
-0.833637 |
1.416958 |
-0.525263 |
-0.661282 |
对列直接切片
|
B |
C |
2013-01-01 |
1.010890 |
1.873754 |
2013-01-02 |
0.330668 |
-1.858497 |
2013-01-03 |
1.416958 |
-0.525263 |
2013-01-04 |
0.947373 |
0.011333 |
2013-01-05 |
-1.539891 |
-0.789662 |
2013-01-06 |
-0.395157 |
0.215027 |
直接获得某个值
0.33066776522090735
0.33066776522090735
Boolean indexing
使用单个列的值来选择
|
A |
B |
C |
D |
2013-01-01 |
1.672611 |
1.010890 |
1.873754 |
1.026126 |
2013-01-02 |
1.099263 |
0.330668 |
-1.858497 |
-1.445272 |
2013-01-05 |
1.412637 |
-1.539891 |
-0.789662 |
1.382914 |
从DataFrame中选择
|
A |
B |
C |
D |
2013-01-01 |
1.672611 |
1.010890 |
1.873754 |
1.026126 |
2013-01-02 |
1.099263 |
0.330668 |
NaN |
NaN |
2013-01-03 |
NaN |
1.416958 |
NaN |
NaN |
2013-01-04 |
NaN |
0.947373 |
0.011333 |
NaN |
2013-01-05 |
1.412637 |
NaN |
NaN |
1.382914 |
2013-01-06 |
NaN |
NaN |
0.215027 |
0.800317 |
使用isin()
来过滤
1
2
3
|
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2
|
|
A |
B |
C |
D |
E |
2013-01-01 |
1.672611 |
1.010890 |
1.873754 |
1.026126 |
one |
2013-01-02 |
1.099263 |
0.330668 |
-1.858497 |
-1.445272 |
one |
2013-01-03 |
-0.833637 |
1.416958 |
-0.525263 |
-0.661282 |
two |
2013-01-04 |
-0.667948 |
0.947373 |
0.011333 |
-0.303310 |
three |
2013-01-05 |
1.412637 |
-1.539891 |
-0.789662 |
1.382914 |
four |
2013-01-06 |
-1.925618 |
-0.395157 |
0.215027 |
0.800317 |
three |
1
|
df2[df2["E"].isin(["two", "four"])]
|
|
A |
B |
C |
D |
E |
2013-01-03 |
-0.833637 |
1.416958 |
-0.525263 |
-0.661282 |
two |
2013-01-05 |
1.412637 |
-1.539891 |
-0.789662 |
1.382914 |
four |
Setting
设置新列将自动按索引对齐数据
1
|
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
|
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
通过label来设置值
1
|
df.at[dates[0], "A"] = 0
|
通过位置来设置值
|
A |
B |
C |
D |
2013-01-01 |
0.000000 |
0.000000 |
1.873754 |
5 |
2013-01-02 |
1.099263 |
0.330668 |
-1.858497 |
5 |
2013-01-03 |
-0.833637 |
1.416958 |
-0.525263 |
5 |
2013-01-04 |
-0.667948 |
0.947373 |
0.011333 |
5 |
2013-01-05 |
1.412637 |
-1.539891 |
-0.789662 |
5 |
2013-01-06 |
-1.925618 |
-0.395157 |
0.215027 |
5 |
使用where设置值
|
A |
B |
C |
D |
2013-01-01 |
0.000000 |
0.000000 |
-1.873754 |
-5 |
2013-01-02 |
-1.099263 |
-0.330668 |
-1.858497 |
-5 |
2013-01-03 |
-0.833637 |
-1.416958 |
-0.525263 |
-5 |
2013-01-04 |
-0.667948 |
-0.947373 |
-0.011333 |
-5 |
2013-01-05 |
-1.412637 |
-1.539891 |
-0.789662 |
-5 |
2013-01-06 |
-1.925618 |
-0.395157 |
-0.215027 |
-5 |
Missing data
np.nan表示缺失的数据
reindex可以更改、增加、删除某个轴上的索引,返回一份拷贝
1
|
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
|
1
|
df1.loc[dates[0] : dates[1], "E"] = 1
|
|
A |
B |
C |
D |
E |
2013-01-01 |
0.000000 |
0.000000 |
1.873754 |
5 |
1.0 |
2013-01-02 |
1.099263 |
0.330668 |
-1.858497 |
5 |
1.0 |
2013-01-03 |
-0.833637 |
1.416958 |
-0.525263 |
5 |
NaN |
2013-01-04 |
-0.667948 |
0.947373 |
0.011333 |
5 |
NaN |
DataFrame.dropna()
将有缺失值的行抛弃
|
A |
B |
C |
D |
E |
2013-01-01 |
0.000000 |
0.000000 |
1.873754 |
5 |
1.0 |
2013-01-02 |
1.099263 |
0.330668 |
-1.858497 |
5 |
1.0 |
DataFrame.fillna()
填充缺失的数据
|
A |
B |
C |
D |
E |
2013-01-01 |
0.000000 |
0.000000 |
1.873754 |
5 |
1.0 |
2013-01-02 |
1.099263 |
0.330668 |
-1.858497 |
5 |
1.0 |
2013-01-03 |
-0.833637 |
1.416958 |
-0.525263 |
5 |
5.0 |
2013-01-04 |
-0.667948 |
0.947373 |
0.011333 |
5 |
5.0 |
isna()
把nan的块变成true,其他位false
|
A |
B |
C |
D |
E |
2013-01-01 |
False |
False |
False |
False |
False |
2013-01-02 |
False |
False |
False |
False |
False |
2013-01-03 |
False |
False |
False |
False |
True |
2013-01-04 |
False |
False |
False |
False |
True |
Operations
Stats
平均值
A -0.152551
B 0.126658
C -0.178885
D 5.000000
dtype: float64
另一个轴的平均值
2013-01-01 1.718438
2013-01-02 1.142858
2013-01-03 1.264515
2013-01-04 1.322689
2013-01-05 1.020771
2013-01-06 0.723563
Freq: D, dtype: float64
不同维度时需要对齐和广播
1
|
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
|
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
1
|
df.sub(s, axis="index")
|
|
A |
B |
C |
D |
2013-01-01 |
NaN |
NaN |
NaN |
NaN |
2013-01-02 |
NaN |
NaN |
NaN |
NaN |
2013-01-03 |
-1.833637 |
0.416958 |
-1.525263 |
4.0 |
2013-01-04 |
-3.667948 |
-2.052627 |
-2.988667 |
2.0 |
2013-01-05 |
-3.587363 |
-6.539891 |
-5.789662 |
0.0 |
2013-01-06 |
NaN |
NaN |
NaN |
NaN |
Apply
DataFrame.apply()
应用某个用户自定义的函数
|
A |
B |
C |
D |
2013-01-01 |
0.000000 |
0.000000 |
1.873754 |
5 |
2013-01-02 |
1.099263 |
0.330668 |
0.015256 |
10 |
2013-01-03 |
0.265626 |
1.747625 |
-0.510006 |
15 |
2013-01-04 |
-0.402323 |
2.694998 |
-0.498674 |
20 |
2013-01-05 |
1.010315 |
1.155107 |
-1.288336 |
25 |
2013-01-06 |
-0.915304 |
0.759950 |
-1.073309 |
30 |
1
|
df.apply(lambda x: x.max() - x.min())
|
A 3.338256
B 2.956849
C 3.732251
D 0.000000
dtype: float64
Histogramming
1
|
s = pd.Series(np.random.randint(0, 7, size=10))
|
0 3
1 5
2 6
3 5
4 5
5 0
6 5
7 2
8 3
9 5
dtype: int32
5 5
3 2
6 1
0 1
2 1
dtype: int64
String Methods
1
|
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
|
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
Merge
Concat
concat()
1
|
df = pd.DataFrame(np.random.randn(10, 4))
|
|
0 |
1 |
2 |
3 |
0 |
-0.197911 |
-0.211569 |
0.648594 |
-1.123691 |
1 |
-1.486624 |
-0.347519 |
0.168468 |
-1.687462 |
2 |
-2.242102 |
0.853538 |
-1.275269 |
2.274398 |
3 |
-1.695120 |
0.262447 |
-0.312296 |
-0.977294 |
4 |
0.606254 |
-0.358807 |
0.517332 |
-0.495694 |
5 |
0.400137 |
-0.557555 |
0.032655 |
0.057777 |
6 |
0.842657 |
1.707752 |
0.159613 |
-0.077072 |
7 |
-1.918256 |
0.666318 |
-0.842085 |
-0.655119 |
8 |
0.955265 |
-1.577956 |
-0.073302 |
0.049012 |
9 |
0.518293 |
-1.392874 |
-1.549751 |
0.232338 |
1
2
|
# break it into pieces
pieces = [df[:3], df[3:7], df[7:]]
|
|
0 |
1 |
2 |
3 |
0 |
-0.197911 |
-0.211569 |
0.648594 |
-1.123691 |
1 |
-1.486624 |
-0.347519 |
0.168468 |
-1.687462 |
2 |
-2.242102 |
0.853538 |
-1.275269 |
2.274398 |
3 |
-1.695120 |
0.262447 |
-0.312296 |
-0.977294 |
4 |
0.606254 |
-0.358807 |
0.517332 |
-0.495694 |
5 |
0.400137 |
-0.557555 |
0.032655 |
0.057777 |
6 |
0.842657 |
1.707752 |
0.159613 |
-0.077072 |
7 |
-1.918256 |
0.666318 |
-0.842085 |
-0.655119 |
8 |
0.955265 |
-1.577956 |
-0.073302 |
0.049012 |
9 |
0.518293 |
-1.392874 |
-1.549751 |
0.232338 |
Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.
Join
merge()
1
|
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
|
1
|
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
|
1
|
pd.merge(left, right, on="key")
|
|
key |
lval |
rval |
0 |
foo |
1 |
4 |
1 |
foo |
1 |
5 |
2 |
foo |
2 |
4 |
3 |
foo |
2 |
5 |
另一个例子
1
|
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
|
1
|
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
|
1
|
pd.merge(left, right, on="key")
|
|
key |
lval |
rval |
0 |
foo |
1 |
4 |
1 |
bar |
2 |
5 |
Grouping
group一般执行以下步骤
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
1
2
3
4
5
6
7
8
|
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
|
|
A |
B |
C |
D |
0 |
foo |
one |
0.441899 |
0.453832 |
1 |
bar |
one |
1.074023 |
-0.715191 |
2 |
foo |
two |
1.716625 |
-1.196256 |
3 |
bar |
three |
0.037677 |
-1.320146 |
4 |
foo |
two |
-0.545053 |
-0.372236 |
5 |
bar |
two |
-0.763277 |
-1.350397 |
6 |
foo |
one |
-0.962743 |
1.600875 |
7 |
foo |
three |
0.684661 |
-0.004039 |
1
|
df.groupby("A")[["C", "D"]].sum()
|
|
C |
D |
A |
|
|
bar |
0.348422 |
-3.385734 |
foo |
1.335389 |
0.482176 |
1
|
df.groupby(["A", "B"]).sum()
|
|
|
C |
D |
A |
B |
|
|
bar |
one |
1.074023 |
-0.715191 |
three |
0.037677 |
-1.320146 |
two |
-0.763277 |
-1.350397 |
foo |
one |
-0.520844 |
2.054707 |
three |
0.684661 |
-0.004039 |
two |
1.171571 |
-1.568492 |
Reshaping
Stack
1
2
3
4
5
6
|
tuples = list(
zip(
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
)
)
|
1
|
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
|
1
|
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
|
|
|
A |
B |
first |
second |
|
|
bar |
one |
0.124214 |
0.765203 |
two |
-0.519057 |
2.075762 |
baz |
one |
-0.830936 |
-0.544587 |
two |
1.660069 |
1.475826 |
first second
bar one A 0.124214
B 0.765203
two A -0.519057
B 2.075762
baz one A -0.830936
B -0.544587
two A 1.660069
B 1.475826
dtype: float64
|
|
A |
B |
first |
second |
|
|
bar |
one |
0.124214 |
0.765203 |
two |
-0.519057 |
2.075762 |
baz |
one |
-0.830936 |
-0.544587 |
two |
1.660069 |
1.475826 |
|
second |
one |
two |
first |
|
|
|
bar |
A |
0.124214 |
-0.519057 |
B |
0.765203 |
2.075762 |
baz |
A |
-0.830936 |
1.660069 |
B |
-0.544587 |
1.475826 |
|
first |
bar |
baz |
second |
|
|
|
one |
A |
0.124214 |
-0.830936 |
B |
0.765203 |
-0.544587 |
two |
A |
-0.519057 |
1.660069 |
B |
2.075762 |
1.475826 |
Pivot tables
1
2
3
4
5
6
7
8
9
|
df = pd.DataFrame(
{
"A": ["one", "one", "two", "three"] * 3,
"B": ["A", "B", "C"] * 4,
"C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
"D": np.random.randn(12),
"E": np.random.randn(12),
}
)
|
|
A |
B |
C |
D |
E |
0 |
one |
A |
foo |
-0.341421 |
-0.546007 |
1 |
one |
B |
foo |
1.618503 |
0.611306 |
2 |
two |
C |
foo |
-0.941145 |
1.940351 |
3 |
three |
A |
bar |
0.799316 |
0.184894 |
4 |
one |
B |
bar |
1.621951 |
-0.394846 |
5 |
one |
C |
bar |
-1.334491 |
0.177758 |
6 |
two |
A |
foo |
-2.192872 |
-0.217708 |
7 |
three |
B |
foo |
-1.127164 |
-0.266335 |
8 |
one |
C |
foo |
-0.305296 |
0.404488 |
9 |
one |
A |
bar |
-0.479922 |
0.379061 |
10 |
two |
B |
bar |
-2.166614 |
-2.103981 |
11 |
three |
C |
bar |
1.932436 |
0.609512 |
1
|
pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
|
|
C |
bar |
foo |
A |
B |
|
|
one |
A |
-0.479922 |
-0.341421 |
B |
1.621951 |
1.618503 |
C |
-1.334491 |
-0.305296 |
three |
A |
0.799316 |
NaN |
B |
NaN |
-1.127164 |
C |
1.932436 |
NaN |
two |
A |
NaN |
-2.192872 |
B |
-2.166614 |
NaN |
C |
NaN |
-0.941145 |
Time series
Categoricals
1
2
3
|
df = pd.DataFrame(
{"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
|
转化成类别类型
1
|
df["grade"] = df["raw_grade"].astype("category")
|
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
重命名
1
|
new_categories = ["very good", "good", "very bad"]
|
1
|
df["grade"] = df["grade"].cat.rename_categories(new_categories)
|
Plotting
1
|
import matplotlib.pyplot as plt
|
1
|
ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
|
<AxesSubplot:>
1
2
3
|
df = pd.DataFrame(
np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
)
|
<Figure size 432x288 with 0 Axes>
1
|
plt.legend(loc='best');
|
No handles with labels found to put in legend.