Pandas Notes

Useful links

Joyful Pandas: http://joyfulpandas.datawhale.club/Content/

Pandas learning blog: https://re-thought.com/author/anna/

DataFrame

Basic operations

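Change a column's data type: df['col_name'] = df['col_name'].astype(float)
Convert a column's values to a list: df['col_name'].to_list()
Sort by a column's values: df.sort_values(by="DUT", ascending=True); ascending=True sorts in ascending order, False in descending order
Add a column that holds the same value in every row: df['col_name'] = value
Reorder columns / select columns in a given order: df = df[['col1', 'col2']]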
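The basic operations above can be tried end to end on a tiny frame. This is only a sketch: the sample data and the "source" column are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for this sketch.
df = pd.DataFrame({"DUT": ["3", "1", "2"], "name": ["c", "a", "b"]})

df["DUT"] = df["DUT"].astype(float)             # strings -> floats
dut_values = df["DUT"].to_list()                # plain Python list
df = df.sort_values(by="DUT", ascending=True)   # ascending order
df["source"] = "lab"                            # same value in every row
df = df[["source", "DUT"]]                      # reorder and keep only these columns
print(df)
```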

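# ids is the column to check for duplicates, e.g. ids = df['col_name']
# Keep every row whose id appears more than once
repeat_df = df[ids.isin(ids[ids.duplicated()])].sort_index()
# Iterate over the index labels of df
for x in df.index:
    # Read the value of one column at this index label
    item = df.loc[x, "col_name"]
    # Add this row to error_df. DataFrame.append was removed in pandas 2.0, so use pd.concat;
    # df.loc[[x]] selects by label (df.iloc[i] would need an int position instead)
    error_df = pd.concat([error_df, df.loc[[x]]])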

Merging (handling duplicate column names with suffixes)

# suffixes defaults to ('_x', '_y')
new_df = pd.merge(src_df1, src_df2, on='col_name', suffixes=('_L', '_R'))

df1.merge(df2, on=['Name', 'Class'], how='left')
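A small sketch of how suffixes resolves a column name that exists on both sides. The frames and the "key"/"score" names are invented for the example.

```python
import pandas as pd

# Both frames have a 'score' column, so the merge has to rename them.
left = pd.DataFrame({"key": [1, 2], "score": [90, 80]})
right = pd.DataFrame({"key": [1, 3], "score": [70, 60]})

# how='left' keeps every row of `left`; unmatched rows get NaN on the right side.
merged = pd.merge(left, right, on="key", how="left", suffixes=("_L", "_R"))
print(merged)
```

Only columns that clash get a suffix; the join key itself stays as "key".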

Grouping

df.groupby(grouping key)[source column].operation
df.groupby('col_name')
df.groupby('Gender')['Longevity'].mean()
df.groupby(['School', 'Gender'])['Height'].mean() # group by two keys

# Group by column A; the result is a groupby object
df_gp = df.groupby("A")

# Number of groups
len(df_gp.count())
df_gp.size().size
df_gp.ngroups

# Group keys as a list
list(df_gp.size().index.values)

# Get the rows of the group whose key is "School" as a DataFrame
df_gp.get_group("School")
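The groupby calls above, run on made-up data (the School/Gender/Height values are assumptions chosen so the results are easy to check by hand):

```python
import pandas as pd

df = pd.DataFrame({
    "School": ["A", "A", "B"],
    "Gender": ["F", "M", "F"],
    "Height": [160.0, 170.0, 165.0],
})

gp = df.groupby("School")
mean_height = gp["Height"].mean()   # one mean per school
n_groups = gp.ngroups               # number of groups
keys = list(gp.size().index)        # group keys, sorted
school_a = gp.get_group("A")        # rows that belong to group "A"
print(mean_height)
```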

Renaming columns / index

Rename one or more index/column labels:

df.rename(index={0: 'mean'}, columns={'one': '1'}, inplace=True)

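Use a column as the new index: DataFrame.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
- keys: a column label, or a list of column labels/arrays, to use as the index
- drop: default True; remove the column(s) used as the new index from the columns
- append: whether to append the column(s) to the existing index; default False
- inplace: bool; whether to modify the original DataFrame instead of returning a new one; default False
- verify_integrity: check the new index for duplicates. When False, the check is deferred until necessary, which makes the method faster; default False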

# Use the values of col_name as the new index; drop defaults to True, so the original column is removed
df.set_index('col_name', inplace=True)
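A minimal sketch of set_index together with rename, including what verify_integrity=True does when the chosen column has duplicates. The "id" column and its values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"id": ["x", "y"], "one": [1, 2]})

# 'id' becomes the index and, because drop=True by default, leaves the columns.
indexed = df.set_index("id")
renamed = indexed.rename(index={"x": "first"}, columns={"one": "1"})

# With verify_integrity=True, duplicate index values raise a ValueError.
dupes = pd.DataFrame({"id": ["x", "x"], "one": [1, 2]})
try:
    dupes.set_index("id", verify_integrity=True)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(renamed)
```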

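Selecting specific rows and columns

loc: label-based selection using row labels (index) and column labels (columns), i.e. row names and column names. Usage:

df.loc[index_name, col_name]
- the row selector can be a single label, a list of labels, or a label slice
- the column selector can be a single label, a list of labels, or a label slice
- df.loc[df.one > 2] / df.loc[df.one > 2, ['one']] returns the rows that satisfy a boolean condition on a column, optionally restricted to the given columns (['one'] returns a DataFrame, 'one' returns a Series); a column accessed as an attribute in the condition is written without quotes

iloc: integer-position-based selection; row and column positions start at 0.

df[...]: selects columns with df['col_name'] and rows with a slice, df[5:10] (row positions, start inclusive, end exclusive)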

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
df.loc['b':'c']  # label slices include both endpoints
df.iloc[1:3]     # position slices: start inclusive, end exclusive
   one  two
b  2.0  2.0
c  3.0  3.0

df.loc[['b', 'd'], ['one', 'two']]
df.iloc[[1,3], [0,1]]
   one  two
b  2.0  2.0
d  NaN  4.0

df.loc['d', 'one']
df.iloc[3,0]
nan

-----col-----
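df['col_name']  df[['col1', 'col2']] # select columns by name
df[df.columns[0]] # first column as a Series (the column at position 0)
df[df.columns[0:1]] # first column as a DataFrame
df[df.columns[2:]] # third column through the last (slicing works like string slicing: start inclusive, end exclusive)
-----row-----
df[5:10] # rows 6 through 10 (positions 5 to 9)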
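For reference, a runnable version of the loc/iloc comparisons, rebuilding the same four-row frame used in these notes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0, np.nan], "two": [1.0, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

by_label = df.loc["b":"c"]   # label slice, both ends included
by_pos = df.iloc[1:3]        # position slice, end excluded

filtered = df.loc[df.one > 2, ["one"]]   # list of columns -> DataFrame
as_series = df.loc[df.one > 2, "one"]    # single label -> Series

first_col_df = df[df.columns[0:1]]       # slice of column labels -> DataFrame
print(by_label)
```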

columns_to_use_list = ['col1', 'col2', 'col3']
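# Select columns by name while reading: names that are not in the file are silently ignored
# Duplicate column names get a suffix (.1, .2, ...) when read, so duplicates listed in columns_to_use_list need the matching suffix
df = pd.read_csv(filename, header=0, usecols=lambda c: c in columns_to_use_list)
# Selecting after reading raises a KeyError if a name is missing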
df = df[columns_to_use_list]
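A self-contained sketch of the difference between the two selection styles, using an in-memory CSV so no file is needed. The CSV content and the "missing" column name are made up.

```python
import io
import pandas as pd

csv_text = "col1,col2,extra\n1,2,3\n"
wanted = ["col1", "col2", "missing"]   # 'missing' is not in the CSV

# The callable form just skips names that are not in the file.
df = pd.read_csv(io.StringIO(csv_text), header=0, usecols=lambda c: c in wanted)

# Selecting with a list afterwards is strict and raises KeyError.
try:
    df[wanted]
    strict_failed = False
except KeyError:
    strict_failed = True
print(df.columns.tolist())
```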

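Appending / concatenating DataFrames with the same column labels

Add one row: df.append([{'col1': value}], ignore_index=True); columns missing from the dict are filled with NaN
Concatenate two DataFrames: df1.append(df2, ignore_index=True/False)
- ignore_index: default False, which keeps the original index; True generates a new integer index starting from 0
- DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat([df1, df2], ignore_index=True) is the replacement and takes the same ignore_index argument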
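A short sketch of the same append operations written with pd.concat, which works on current pandas versions. The frames and values are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})
df2 = pd.DataFrame({"col1": [3], "col2": ["c"]})

kept = pd.concat([df1, df2])                           # original index labels kept
renumbered = pd.concat([df1, df2], ignore_index=True)  # fresh 0..n-1 index

# Adding one row from a dict: wrap it in a one-row DataFrame first.
new_row = pd.DataFrame([{"col1": 4}])                  # no 'col2' given
with_row = pd.concat([renumbered, new_row], ignore_index=True)
print(with_row)
```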

Series

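Iterate over the index and values of a Series

for index, value in test.items():
    print(index, value)
# iteritems() is the older spelling of items(); it was removed in pandas 2.0
for index, value in test.iteritems():
    print(index, value)

Get the maximum value of a Series and its index label: series.max(), series.idxmax()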
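A quick check of items(), max(), and idxmax() on a made-up Series:

```python
import pandas as pd

s = pd.Series([3, 9, 4], index=["a", "b", "c"])

pairs = list(s.items())   # (index label, value) tuples
largest = s.max()         # the maximum value
largest_at = s.idxmax()   # the index label where the maximum occurs
print(pairs, largest, largest_at)
```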

Converting a Series to a one-row DataFrame

s = df.mean()
s_df = s.to_frame().T  # same as s.to_frame().transpose()
s_df.rename(index={0: 'mean'}, inplace=True)
one    2.0                   one  two
two    2.5     ->      mean  2.0  2.5

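Operations common to DataFrame and Series

Transpose: .transpose() / .T
Round values to a given number of decimal places (this changes the stored values, not just the display format): .round(decimals) / .round({'col1': 2, 'col2': 3}) (the dict form is DataFrame-only); returns a new object and leaves the original unchanged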
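A small sketch showing per-column rounding and that the original frame is left untouched. The values are arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1.23456], "col2": [2.34567]})

# Different precision per column; returns a new DataFrame.
rounded = df.round({"col1": 2, "col2": 3})
print(rounded)
print(df)   # still holds the unrounded values
```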

df.to_excel: adjusting column widths

Using pd.ExcelWriter

# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')

# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')  # this is where Sheet1 is created

# Get the xlsxwriter objects from the dataframe writer object.
workbook  = writer.book
worksheet = writer.sheets['Sheet1']  # this is where the sheet is selected

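Adjusting Excel column widths with ExcelWriter

# Auto-fit every column (set_column is an XlsxWriter worksheet method, so the engine is set explicitly):
writer = pd.ExcelWriter('/path/file.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='sheetName', index=False, na_rep='NaN')

for column in df:
    # Width = longest cell text or the header, whichever is longer
    column_length = max(df[column].astype(str).map(len).max(), len(str(column)))
    col_idx = df.columns.get_loc(column)
    writer.sheets['sheetName'].set_column(col_idx, col_idx, column_length)

# Set one column's width manually by column name
col_idx = df.columns.get_loc('columnName')
writer.sheets['sheetName'].set_column(col_idx, col_idx, 15)

# Set one column's width manually by column position (0 = first column)
writer.sheets['sheetName'].set_column(0, 0, 15)

# Write the file last; widths must be set before this call.
# writer.save() was removed in pandas 2.0, use close()
writer.close()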

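Reading very large or very small numbers in pandas / handling scientific notation

Scientific notation (numbers containing e): a way of writing very large or very small numbers as a number between 1 and 10 multiplied by a power of 10. (Python and pandas use this notation, and pandas automatically shows very large or very small floats in it.)
- 2.3e-5: 2.3 times 10 to the power of -5 = 0.000023
- 4.5e+6: 4.5 times 10 to the power of 6 = 4500000
Python float type: inf
- positive and negative infinity are written float("inf") and float("-inf")
- the string is case-insensitive: float("inf"), float("INF"), and float('Inf') all work
- in > and < comparisons, every finite number is greater than float("-inf") and less than float("inf"); infinity is equal to itself
Python float type: nan (not a number)
- nan is not a number; arithmetic that involves nan generally yields nan
- comparisons with >, <, and == always return False (nan == nan is also False)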

import math
# isinf() isnan() 
n = float('inf')
print(math.isinf(n))  # True
m = float('nan')
print(math.isnan(m))  # True
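The comparison rules for inf and nan listed above can be verified directly with plain Python floats (no pandas needed):

```python
import math

pos_inf = float("inf")
neg_inf = float("-inf")
nan = float("nan")

# Every finite float sits strictly between -inf and +inf.
between = neg_inf < -1e308 < 1e308 < pos_inf
# Spelling is case-insensitive and infinity equals itself.
same_inf = pos_inf == float("INF")
# nan compares False with everything, including itself.
nan_comparisons = [nan == nan, nan < 1.0, nan > 1.0]
# Arithmetic with nan stays nan.
still_nan = math.isnan(nan + 1)
print(between, same_inf, nan_comparisons, still_nan)
```

Because nan == nan is False, math.isnan (or pd.isna in pandas) is the reliable way to test for it.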

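Suppressing automatic scientific notation in pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(5)**10, columns=['random'])
# Option 1: use .round()
df.round(5)  # 5 = number of decimal places

# Option 2: format the values as strings
df.apply(lambda col: col.map(lambda x: '%.5f' % x))
df['random'].apply(lambda x: '%.5f' % x)

# Option 3: use .set_option()
# Note that .set_option() changes the behavior globally (for the whole session, e.g. a Jupyter notebook), so it is not a temporary fix.
pd.set_option('display.float_format', lambda x: '%.5f' % x)
# Restore the default pandas behavior with .reset_option().
pd.reset_option('display.float_format')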

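to_numeric: converting data types

pd.to_numeric quickly converts and filters numbers stored as strings. Main parameters:
- errors: how to handle non-numeric values: 'ignore' returns the input unchanged, 'raise' (default) raises an error, 'coerce' sets them to NaN
- downcast: target type to downcast to: 'integer', 'signed', 'unsigned', or 'float'; default None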
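A minimal sketch of the errors and downcast parameters, using made-up string data:

```python
import pandas as pd

s = pd.Series(["1", "2.5", "abc"])

# 'coerce' turns anything that is not a number into NaN.
coerced = pd.to_numeric(s, errors="coerce")

# The default, errors='raise', fails on the first bad value.
try:
    pd.to_numeric(s)
    raised = False
except ValueError:
    raised = True

# downcast picks the smallest dtype that can hold the values.
small = pd.to_numeric(pd.Series(["1", "2", "3"]), downcast="integer")
print(coerced, small.dtype)
```

Dropping the NaN rows afterwards (coerced.dropna()) is one way to use 'coerce' as a filter for valid numbers.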