Common pandas data-processing methods

Python效率工程 2020-09-10

No preamble; straight to the practical material.

Contents

Joining two datasets like in an RDB (including left join)
Indexing into a dataset with a multi-level index
Renaming columns
Adding a new column
Applying functions to multiple columns
Handling NaN
Chaining conditions

Joining two datasets like in an RDB

import pandas as pd
import numpy as np
import string
df1 = pd.DataFrame({
    'id': list(string.ascii_letters[:10]),
    'sales': np.random.rand(10),
})
df2 = pd.DataFrame({
    'id': list(string.ascii_letters[3:13]),
    'count': np.random.rand(10),
})
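# Inner join (the default): keeps only the ids present in both frames
df3 = pd.merge(df1, df2, on='id')
# Left join: keeps every row of df1; ids missing from df2 get NaN in 'count'
df4 = pd.merge(df1, df2, on='id', how='left')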

Output

In [49]: df = pd.merge(df1, df2, on='id')

In [50]: df
Out[50]: 
  id     sales     count
0  d  0.875390  0.902091
1  e  0.193723  0.546562
2  f  0.250351  0.858821
3  g  0.203284  0.867194
4  h  0.473823  0.991851
5  i  0.602875  0.435074
6  j  0.320997  0.193434

In [52]: pd.merge(df1, df2, on='id', how='left')
Out[52]: 
  id     sales     count
0  a  0.051131       NaN
1  b  0.571569       NaN
2  c  0.846158       NaN
3  d  0.875390  0.902091
4  e  0.193723  0.546562
5  f  0.250351  0.858821
6  g  0.203284  0.867194
7  h  0.473823  0.991851
8  i  0.602875  0.435074
9  j  0.320997  0.193434
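
Because the frames above are filled with random numbers, the output changes on every run. The following is a minimal sketch with small fixed frames (the values and the names `left` and `right` are illustrative assumptions, not from the original) that shows the same inner and left joins, plus the `indicator=True` option of `pd.merge` for spotting rows that found no match:

```python
import pandas as pd

# Small fixed frames so the result is reproducible (illustrative values).
left = pd.DataFrame({'id': ['a', 'b', 'c', 'd'], 'sales': [10, 20, 30, 40]})
right = pd.DataFrame({'id': ['c', 'd', 'e'], 'count': [1, 2, 3]})

# Inner join: only ids present in both frames survive.
inner = pd.merge(left, right, on='id')

# Left join with an extra '_merge' column describing where each row came from.
left_join = pd.merge(left, right, on='id', how='left', indicator=True)

# Rows of `left` that had no partner in `right`.
unmatched = left_join[left_join['_merge'] == 'left_only']

print(inner)
print(left_join)
print(unmatched['id'].tolist())
```

The `_merge` column is handy when a left join is used as a lookup and you want to check how many keys failed to match.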

Indexing into a dataset with a multi-level index

df = pd.DataFrame({
    'A': list('aaadeff'),
    'B': list('abbgeaz'),
    'Val': np.random.rand(7),
})
df = df.set_index(['A', 'B'])
idx = pd.IndexSlice
df.loc[idx[:,['g']], :]

Output

Out[1]: 
          Val
A B          
d g  0.257651
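
As a hedged sketch of a few other ways to select from the same kind of frame, the example below reuses the `A`/`B` keys from above with fixed values in place of random ones (an assumption made for reproducibility). It sorts the index first, since slicing a MultiIndex generally expects a sorted index, and then selects by the second level, by the first level, and with `xs`:

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('aaadeff'),
    'B': list('abbgeaz'),
    'Val': [1, 2, 3, 4, 5, 6, 7],  # fixed values instead of random ones
})
# Sort the index so that slice-based selection is safe.
df = df.set_index(['A', 'B']).sort_index()

idx = pd.IndexSlice
# Any value of A, B restricted to 'a' or 'b'.
by_second_level = df.loc[idx[:, ['a', 'b']], :]
# All rows whose first level A equals 'a' (the A level is dropped).
by_first_level = df.loc['a']
# Cross-section: rows with B == 'g' (the B level is dropped).
cross_section = df.xs('g', level='B')

print(by_second_level)
print(by_first_level)
print(cross_section)
```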

Renaming columns

# rename() returns a new DataFrame, so assign the result back
df = df.rename(columns={
    'S': 'Strength',
    'W': 'Weak',
    'O': 'Opportunity',
    'T': 'Threat',
})
df.columns
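
A common pitfall is expecting `rename` to change the frame it is called on. The sketch below, using a hypothetical one-row frame with the `S`, `W`, `O`, `T` columns from the snippet above, shows that the original stays untouched, and that keys matching no column are silently ignored unless `errors='raise'` is passed:

```python
import pandas as pd

# Hypothetical one-row frame, only to have the columns available.
df = pd.DataFrame({'S': [1], 'W': [2], 'O': [3], 'T': [4]})

renamed = df.rename(columns={'S': 'Strength', 'W': 'Weak'})
print(list(df.columns))       # the original frame keeps its column names
print(list(renamed.columns))  # the returned frame carries the new names

# With errors='raise', a key that matches no column raises a KeyError.
raised = False
try:
    df.rename(columns={'X': 'Unknown'}, errors='raise')
except KeyError as exc:
    raised = True
    print('rename failed:', exc)
```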

Adding a new column

# Insert an empty column named 'newCol' at position 3 (0-based); this modifies df in place
df.insert(loc=3, column='newCol', value=None)
df.head()
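
`insert` is one of several ways to add a column. As a rough comparison, assuming a small hypothetical frame with columns `a`, `b`, `c`, the sketch below places a column at a chosen position with `insert`, appends one with plain assignment, and builds a new frame with `assign`:

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# insert() works in place and lets you pick the position (0-based).
df.insert(loc=1, column='newCol', value=None)

# Plain assignment appends the new column at the end, also in place.
df['total'] = df['a'] + df['b'] + df['c']

# assign() leaves df untouched and returns a new frame.
with_ratio = df.assign(ratio=df['a'] / df['total'])

print(df)
print(with_ratio)
```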

Applying functions to multiple columns or rows

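cols = ['A', 'B', 'C']
# Process column by column (x is a Series, so string methods go through .str)
df[cols] = df[cols].apply(lambda x: x.str.lower(), axis=0)
# Process row by row
df[cols] = df[cols].apply(lambda x: x.str.lower(), axis=1)
# Note the axis parameter:
# axis=0 means the lambda's argument x is one column
# axis=1 means the lambda's argument x is one row across columns "A", "B", "C"
# The lambda body can also call a function; that function must return the processed x, e.g. lambda x: deal_x(x)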
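
To make the `axis` distinction concrete, here is a minimal sketch on a hypothetical frame of strings (the data and the extra column `n` are assumptions for illustration). With `axis=0` the function receives each column as a Series; with `axis=1` it receives each row:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'A': ['X', 'Y'], 'B': ['Foo', 'Bar'], 'C': ['Q', 'R'], 'n': [1, 2]})
cols = ['A', 'B', 'C']

# axis=0: the function sees one column (a Series) at a time.
by_column = df[cols].apply(lambda col: col.str.lower(), axis=0)

# axis=1: the function sees one row at a time; here each row collapses to one string.
by_row = df[cols].apply(lambda row: '-'.join(row), axis=1)

# Element-wise alternative: map a plain Python function over each column.
elementwise = df[cols].apply(lambda col: col.map(str.lower))

print(by_column)
print(by_row)
```

For simple string operations the column-wise form is usually preferable, because `.str` methods are vectorized while `axis=1` calls the function once per row.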

Handling NaN

# Count the rows that contain at least one NaN
df.shape[0] - df.dropna().shape[0]
# Select the rows where column 'B' is NaN
df[df['B'].isnull()]
# Select the rows where column 'B' is not NaN
df[df['B'].notnull()]
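
The row-count trick above answers "how many rows have a NaN somewhere". A short sketch on a hypothetical three-row frame shows a few related counts with `isna()`, and checks that the row count agrees with the `dropna()` formula:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 has one NaN, row 1 has two, row 2 has none.
df = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [np.nan, np.nan, 'x']})

nan_per_column = df.isna().sum()                   # NaN count for each column
rows_with_nan = int(df.isna().any(axis=1).sum())   # rows with at least one NaN
total_nan = int(df.isna().sum().sum())             # NaN cells in the whole frame

# Same number as the dropna()-based formula.
rows_with_nan_via_dropna = df.shape[0] - df.dropna().shape[0]

# Replace NaN with a per-column default (the defaults are arbitrary choices).
filled = df.fillna({'A': 0.0, 'B': 'missing'})

print(nan_per_column)
print(rows_with_nan, total_nan)
print(filled)
```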

Chaining conditions

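# OR (|): rows where blood is 'B' or 'AB'; each condition needs its own parentheses
df[(df['blood'] == 'B') | (df['blood'] == 'AB')]
# AND (&): rows where both conditions hold; a single value cannot be both 'B' and 'AB',
# so this particular pair always yields an empty result and only illustrates the syntax
df[(df['blood'] == 'B') & (df['blood'] == 'AB')]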
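
For a more realistic use of `&`, the conditions usually involve different columns. The sketch below assumes a hypothetical frame with `name` and `age` columns next to `blood` (neither is in the original), and also shows `isin` as a shorter way to write several `|` comparisons on one column, and `~` for negation:

```python
import pandas as pd

# Hypothetical data; 'name' and 'age' are added only for illustration.
df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cid', 'Dee'],
    'blood': ['A', 'B', 'AB', 'O'],
    'age': [25, 31, 42, 19],
})

# isin() replaces a chain of == comparisons joined with |.
b_or_ab = df[df['blood'].isin(['B', 'AB'])]

# & combines conditions on different columns.
b_or_ab_over_40 = df[df['blood'].isin(['B', 'AB']) & (df['age'] > 40)]

# ~ negates a condition.
not_b = df[~(df['blood'] == 'B')]

print(b_or_ab)
print(b_or_ab_over_40)
print(not_b)
```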

The code in the previous article on the quicksort algorithm did not display correctly, so it is attached here:

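def quick_sort(L, left_point, right_point):
    # A range with fewer than two elements is already sorted
    if left_point >= right_point:
        return
    # On the first call the pivot is at position 0, the left pointer is 0
    # and the right pointer is len(L)-1; remember the bounds of this range
    start = left_point
    end = right_point
    pivot = L[left_point]
    # Use the first element as the pivot and swap values larger than the pivot
    # on the left with values no larger than the pivot on the right
    end_flag = True  # True while the right pointer is the one that moves
    while left_point < right_point:
        # Move the right pointer left until it reaches a value no larger than the pivot
        while end_flag and left_point < right_point and L[right_point] > pivot:
            right_point -= 1
        end_flag = False
        if left_point < right_point and L[left_point] > pivot:
            L[left_point], L[right_point] = L[right_point], L[left_point]
            end_flag = True
        # The left pointer moves right only once the right pointer has stopped
        if not end_flag and left_point < right_point:
            left_point += 1
    # After the swaps the two pointers have met; move the pivot to the meeting position
    # Now everything left of the pivot is no larger than it, everything right is larger
    L[start] = L[left_point]
    L[left_point] = pivot
    # Recursively sort the part to the left of the pivot
    quick_sort(L, start, right_point-1)
    # Recursively sort the part to the right of the pivot
    quick_sort(L, right_point+1, end)


L = [6, 1, 2, 7, 9, 3, 4, 5, 10, 8]
quick_sort(L, 0, len(L)-1)
# Print once after sorting instead of on every recursive call
print(L)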
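
A simple way to gain confidence in any hand-written sort is to compare it against Python's built-in `sorted` on random inputs. The sketch below does that for a separate, self-contained quicksort that uses the Lomuto partition scheme (last element as pivot), which is a different scheme from the two-pointer version above; the function name `quick_sort_lomuto` is a hypothetical choice for this example:

```python
import random


def quick_sort_lomuto(items, low=0, high=None):
    """Sort `items` in place using the Lomuto partition scheme."""
    if high is None:
        high = len(items) - 1
    if low >= high:
        return
    pivot = items[high]   # last element of the range is the pivot
    boundary = low        # items[low:boundary] hold values <= pivot
    for i in range(low, high):
        if items[i] <= pivot:
            items[i], items[boundary] = items[boundary], items[i]
            boundary += 1
    # Put the pivot between the two partitions.
    items[boundary], items[high] = items[high], items[boundary]
    quick_sort_lomuto(items, low, boundary - 1)
    quick_sort_lomuto(items, boundary + 1, high)


data = [6, 1, 2, 7, 9, 3, 4, 5, 10, 8]
quick_sort_lomuto(data)
print(data)

# Randomized check against the built-in sort, including empty lists and duplicates.
random.seed(0)
for _ in range(100):
    sample = [random.randint(0, 20) for _ in range(random.randint(0, 30))]
    expected = sorted(sample)
    quick_sort_lomuto(sample)
    assert sample == expected
```

The same loop can be pointed at any other in-place sort function by swapping the call inside it.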

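This article is reposted from Python效率工程. If you believe it infringes your rights, please email contact@modb.pro with supporting evidence; once verified, 墨天轮 will remove the content immediately.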
