Ex_treme's blog.

一次千万级EXCLE数据处理任务

2018/06/08 Share

一次千万级EXCLE数据处理任务

21个十万级EXCLE数据,要求对根据一列对二列作统计
痛点分析

  • 不能直接打开—太卡

解决办法:python包操作EXCLE

  • 数据量太大,pandas无法一次性统计

list可变长,两次循环做频率统计

  • 文件体积太大,读取慢

存为csv

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import xlrd


for j in range(1,22):

workbook = xlrd.open_workbook('/home/pzs741/PycharmProjects/math/B题附件/B题 附件' + str(j) + '.xlsx')
booksheet = workbook.sheet_by_index(0)
s_list = []

for i in range(booksheet._dimnrows):
text = booksheet.row_values(i)
if text[1] not in s_list:
s_list.append(text[1])

count_list = [0 for i in range(s_list.__len__())]
for i in range(booksheet._dimnrows):
text = booksheet.row_values(i)
s_index = s_list.index(text[1])
count_list[s_index]+=1

with open('data/'+ str(j) +'.csv',encoding='utf-8',mode='w') as f:
for index,i in enumerate(s_list):
add_list = ['0'+str(j)+'/04/2018',i,str(count_list[index])]
line = ",".join(add_list)+'\n'
f.write(line)
print('完成第{}个文件!'.format(j))
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import xlrd


for j in range(1,22):
#
workbook = xlrd.open_workbook('/home/pzs741/PycharmProjects/math/B题附件/B题 附件' + str(j) + '.xlsx')
# workbook = xlrd.open_workbook('/home/pzs741/PycharmProjects/math/B题附件/B题 附件12.xlsx')
booksheet = workbook.sheet_by_index(0)




for i in range(booksheet._dimnrows):
text = booksheet.row_values(i)
text = [str(i) for i in text]
ship = text[1]
with open('ship/'+ship+'.csv',encoding='utf-8',mode='a') as f:
f.write(",".join(text)+'\n')

print('完成第{}个文件!!!'.format(j))

20行代码表示任务之简单,但跑完才发现错误耽误了时间,测试集和debug才是正确的打开方式。

CATALOG
  1. 1. 一次千万级EXCLE数据处理任务