Python Study Notes: Feather, an Efficient Data Format

Author: 风君子, in Software

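I. Background

When reading data in Python, the usual sources are json, csv, txt, and xlsx files, or a direct database query.

Large datasets are commonly stored as csv, but csv files take up a lot of disk space and are slow to save and load.

Feather is a binary format that is faster to read and write and, once compressed, more compact.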

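II. What is Feather?

Feather is a file format for storing data frames.

In one sentence: a compressed binary file format with fast reads and writes.

Feather is part of the Apache Arrow project. Because it performs so well, it is also packaged separately and can be installed with pip.

pandas supports reading and writing Feather as well.

It was originally designed for fast data exchange between Python and R. The goal was simple: move data frames in and out of memory as efficiently as possible.

Conveniently, R, Julia, and Python can all parse Feather files, which makes the format a strong tool for passing data among the three languages, with excellent read and write speed.

Today Feather is no longer limited to Python and R; most mainstream programming languages can work with Feather files.

However, the format was not designed for long-term storage and is intended mainly for ordinary short-term storage.

- This is hard to pin down: what counts as long-term or short-term, and where is the line?

- For long-term storage, Feather's compression is not the best available; Parquet is worth a look. Feather can be used for long-term storage too, it is just not the optimal choice.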

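III. Usage

In Python, Feather files can be handled either through pandas or through the feather package.

That said, it is better to avoid pandas' built-in to_feather and read_feather: because of version-compatibility issues, calling feather's own API directly is the safer choice.

1. Installation

Note: do not install with pip install feather. The install appears to succeed, but reading a file then fails with ImportError: cannot import name 'getuid' from 'os' (D:\anaconda\lib\os.py).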

# pip
pip install feather-format
# installs pyarrow as a dependency, e.g. pyarrow-5.0.0-cp38-cp38-win_amd64.whl

# conda
conda install -c conda-forge feather-format  # raised an error when tested
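
After installing, it can help to confirm which of the similarly named packages actually ended up in the environment. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, not part of any of these packages.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "feather" is the package the note above warns against installing.
for name in ("pyarrow", "feather-format", "feather"):
    print(name, "->", installed_version(name))
```

If `feather` shows a version while `feather-format` shows None, the wrong package was probably installed.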

2. Test dataset

Build a frame of random numbers with 5 columns and 10 million rows.

import os

import feather
import numpy as np
import pandas as pd

# Work in a known directory so the output files are easy to find
os.chdir(r'C:\Users\111\Desktop')

# Seed the global generator; seed is a function and must be called
np.random.seed(2021)
df_size = 10000000

df = pd.DataFrame({
    'a': np.random.rand(df_size),
    'b': np.random.rand(df_size),
    'c': np.random.rand(df_size),
    'd': np.random.rand(df_size),
    'e': np.random.rand(df_size)
    })
df.head()
# Example output from the original run; exact values will differ
'''
          a         b         c         d         e
0  0.515694  0.879751  0.346675  0.998066  0.647965
1  0.648172  0.044250  0.546985  0.668001  0.460173
2  0.774530  0.354780  0.034965  0.259252  0.037479
3  0.843657  0.956277  0.059882  0.394459  0.088319
4  0.263218  0.409887  0.149357  0.971544  0.657425
'''
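
For benchmarks that need to be repeated, a seeded generator keeps the test data identical between runs. The sketch below uses NumPy's `default_rng` as an alternative to the global seed. `make_frame` is a hypothetical helper, and the row count is kept small so it runs quickly.

```python
import numpy as np
import pandas as pd


def make_frame(n_rows, seed=2021):
    """Build a frame of uniform random floats with columns a..e."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({col: rng.random(n_rows) for col in "abcde"})


df_a = make_frame(1_000)
df_b = make_frame(1_000)
print(df_a.equals(df_b))  # True: same seed, same values
```

Calling `make_frame(10_000_000)` would produce a frame of the same shape as the one used in these notes.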

3. Using pandas

Saving

Use DataFrame.to_feather() to save directly. The syntax is:

df.to_feather(path, compression=None, compression_level=None)
# -- path: file path
# -- compression: whether and how to compress; one of 'zstd', 'lz4', 'uncompressed'
# -- compression_level: compression level (not supported by lz4)

df.to_feather('data.feather')

Loading

df = pd.read_feather('data.feather')

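4. Using feather directly

The native feather API mirrors the pandas one. Read speed is about the same; in the write benchmark in the next section, the native call was faster.

Saving

feather.write_dataframe(df, 'data2.feather')

Loading

df = feather.read_dataframe('data2.feather')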

5. csv vs. feather

Write speed comparison

# Timing module
import time

# 1. Plain csv
start = time.time()
df.to_csv('data_csv.csv')
end = time.time()
print('CSV Running time: %s Seconds' % (end - start))

# 2. Native feather ("YS" in the labels and file names)
start = time.time()
feather.write_dataframe(df, 'data_feather_ys.feather')
end = time.time()
print('YS-feather Running time: %s Seconds' % (end - start))

# 3. pandas feather
start = time.time()
df.to_feather('data_feather_pd.feather')
end = time.time()
print('Pd-feather Running time: %s Seconds' % (end - start))
'''
CSV Running time: 93.85435080528259 Seconds
YS-feather Running time: 0.3590412139892578 Seconds
Pd-feather Running time: 4.7694432735443115 Seconds
'''

Read speed comparison

# Timing module
import time

# 1. Plain csv
start = time.time()
df1 = pd.read_csv('data_csv.csv')
end = time.time()
print('CSV Running time: %s Seconds' % (end - start))

# 2. Native feather
start = time.time()
df2 = feather.read_dataframe('data_feather_ys.feather')
end = time.time()
print('YS-feather Running time: %s Seconds' % (end - start))

# 3. pandas feather
start = time.time()
df3 = pd.read_feather('data_feather_pd.feather')
end = time.time()
print('Pd-feather Running time: %s Seconds' % (end - start))

'''
CSV Running time: 11.32979965209961 Seconds
YS-feather Running time: 0.34105563163757324 Seconds
Pd-feather Running time: 0.45678043365478516 Seconds
'''

File size comparison

# As shown in the file manager
# data_csv.csv             -- 0.97G
# data_feather_ys.feather  -- 381M
# data_feather_pd.feather  -- 381M

# Get the file size with os (in MB)
import os

def get_FileSize(filePath):
    filePath = str(filePath)
    fsize = os.path.getsize(filePath)
    fsize = fsize / float(1024 * 1024)  # bytes -> MB
    return round(fsize, 2)

print(get_FileSize('data_feather_ys.feather'))  # 381.57
print(get_FileSize('data_feather_pd.feather'))  # 381.57
print(get_FileSize('data_csv.csv'))             # 1003.63

# Compute the size ratio of feather to csv
standard_ratio = os.stat('data_feather_ys.feather').st_size / os.stat('data_csv.csv').st_size
print(f'Standard feather compression ratio is {standard_ratio*100:.1f}%')
# Standard feather compression ratio is 38.0%
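
The ratio is just one file size divided by the other: with the sizes measured here, 381.57 / 1003.63 ≈ 0.380, or 38.0%. The sketch below shows the same calculation with `pathlib`, using two throwaway files of 380 and 1000 bytes in place of the real ones. `size_ratio` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def size_ratio(small, large):
    """Return the size of `small` as a fraction of the size of `large`."""
    return Path(small).stat().st_size / Path(large).stat().st_size


with tempfile.TemporaryDirectory() as tmp:
    small = Path(tmp) / "small.bin"
    large = Path(tmp) / "large.bin"
    small.write_bytes(b"\0" * 380)    # stand-in for the feather file
    large.write_bytes(b"\0" * 1000)   # stand-in for the csv file
    ratio = size_ratio(small, large)

print(f"ratio: {ratio * 100:.1f}%")  # ratio: 38.0%
```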

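IV. Summary

Feather offers a clear performance improvement over csv.

It suits medium-sized data (on the order of gigabytes); for example, a 4 GB csv file might take only about 700 MB as a Feather file.
Reads and writes are far faster than csv, and unlike a database it is portable, so it works well as an intermediate format for transferring data.
Like csv, Feather supports reading only the columns you need from a file, which reduces memory use.

df = pd.read_feather(path='data.feather', columns=["a","b","c"])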

Parquet is a format that aims for higher compression and is also worth considering as a replacement for csv.

