Stop Using CSVs for Storage – Here Are the Top 5 Alternatives


CSVs cost you time, disk space, and money. Here are five alternatives every data scientist must know.

Dario Radečić

Sep 20, 2021

7 min read

Photo by jaikishan patel on Unsplash

Everyone and their grandmother knows what a CSV file is. But is it the optimal way to store data? Heck no. It’s probably the worst storage format if you don’t plan to view or edit data on the fly.

If you’re storing large volumes of data, opting for CSVs will cost you both time and money.

Today you’ll learn about five CSV alternatives. Each provides an advantage, either in read/write time or in file size. Some are even better in all areas.

Let’s set up the environment before going over the file formats.


Getting started – Environment setup

You’ll need a couple of libraries to follow along. The best practice is to install them inside a virtual environment, so that’s exactly what you’ll do. The following code snippet creates a new virtual environment through Anaconda and installs every required library:

conda create --name file_formats python=3.8
conda activate file_formats
conda install -c conda-forge numpy pandas fastavro pyarrow feather-format jupyter jupyterlab

Once the installation finishes, you can execute the following command to start a JupyterLab session:

jupyter lab

The next step is to import the libraries and create an arbitrary dataset. You’ll make one with 5 columns and 10M rows:

import numpy as np
import pandas as pd
import feather
import pickle
import pyarrow as pa
import pyarrow.orc as orc
from fastavro import writer, reader, parse_schema

np.random.seed(42)

df_size = 10_000_000
df = pd.DataFrame({
    'a': np.random.rand(df_size),
    'b': np.random.rand(df_size),
    'c': np.random.rand(df_size),
    'd': np.random.rand(df_size),
    'e': np.random.rand(df_size)
})
df.head()

Here’s how it looks:

Image 1 – Random dummy dataset (image by author)

You now have everything needed to start experimenting with different data formats. Let’s cover ORC first.


ORC

ORC stands for Optimized Row Columnar. It’s a data format optimized for reads and writes in Hive. Since Hive is painfully slow, the folks at Hortonworks developed the ORC file format to speed it up.

In Python, you can use the read_orc() function from Pandas to read ORC files. Unfortunately, there’s no equivalent function for writing them, so you’ll have to use PyArrow.

Here’s an example of writing Pandas DataFrames:

table = pa.Table.from_pandas(df, preserve_index=False)
orc.write_table(table, '10M.orc')

And here’s the command for reading ORC files:

df = pd.read_orc('10M.orc')

You can learn more about ORC here:

Stop Using CSVs for Storage – This File Format is Faster and Lighter


Avro

Avro is an open-source project that provides data serialization and exchange services for Apache Hadoop. It stores a JSON-like schema alongside the data, so the correct data types are known in advance; that’s what makes the compact binary encoding possible.

Avro has an API for every major programming language, but it doesn’t support Pandas by default.

Here’s the set of commands for saving a Pandas DataFrame to an Avro file:

# 1. Define the schema
schema = {
    'doc': 'Float data',
    'name': 'Data',
    'namespace': 'data',
    'type': 'record',
    'fields': [
        {'name': 'a', 'type': 'float'},
        {'name': 'b', 'type': 'float'},
        {'name': 'c', 'type': 'float'},
        {'name': 'd', 'type': 'float'},
        {'name': 'e', 'type': 'float'},
    ]
}
parsed_schema = parse_schema(schema)

# 2. Convert pd.DataFrame to records - list of dictionaries
records = df.to_dict('records')

# 3. Write to Avro file
with open('10M.avro', 'wb') as out:
    writer(out, parsed_schema, records)

Reading Avro files is no picnic either:

# 1. List to store the records
avro_records = []

# 2. Read the Avro file
with open('10M.avro', 'rb') as fo:
    avro_reader = reader(fo)
    for record in avro_reader:
        avro_records.append(record)

# 3. Convert to pd.DataFrame
df = pd.DataFrame(avro_records)

You can learn more about Avro here:

CSV Files for Storage? Absolutely Not. Use Apache Avro Instead


Parquet

Apache Parquet is a data storage format designed for efficiency. The reason is its columnar storage architecture, which lets you quickly skip data that isn’t relevant. This way, both queries and aggregations are faster, resulting in hardware savings.

The best news is – Pandas has full support for Parquet files.

Here’s the command for writing a Pandas DataFrame to a Parquet file:

df.to_parquet('10M.parquet')

And here’s the equivalent for reading:

df = pd.read_parquet('10M.parquet')

You can learn more about Parquet here:

CSV Files for Storage? No Thanks. There’s a Better Option


Pickle

You can use the pickle module to serialize objects and save them to a file. Likewise, you can deserialize the file later to load the objects back when needed. Pickle has one major advantage over other formats: you can use it to store any Python object. One of the most widely used applications is saving machine learning models after training completes.

The biggest downside is that Pickle is Python-specific, so cross-language support isn’t guaranteed. It could be a deal-breaker for any project requiring data communication between Python and R, for example.

Here’s how to write a Pandas DataFrame to a Pickle file:

with open('10M.pkl', 'wb') as f:
    pickle.dump(df, f)

You’ll only have to change the file mode when reading a Pickle file:

with open('10M.pkl', 'rb') as f:
    df = pickle.load(f)
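The DataFrame example above generalizes: pickle round-trips almost any Python object, which is why it’s popular for saving trained models. A quick sketch with a plain dictionary standing in for a model (the name model_like and the file 'model.pkl' are illustrative, not from the article):

```python
import pickle

# Any Python object works - here a dict standing in for a trained model
model_like = {'weights': [0.1, 0.2, 0.3], 'bias': 0.5, 'name': 'demo'}

with open('model.pkl', 'wb') as f:
    pickle.dump(model_like, f)

with open('model.pkl', 'rb') as f:
    restored = pickle.load(f)

print(restored == model_like)  # True
```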

You can learn more about Pickle here:

Stop Using CSVs for Storage – Pickle is an 80 Times Faster Alternative


Feather

Feather is a data format for storing data frames. It’s designed around a simple premise – to push data frames in and out of memory as efficiently as possible. It was initially designed for fast communication between Python and R, but you’re not limited to this use case.

You can use the feather library to work with Feather files in Python. It’s currently the fastest option covered here.

Here’s the command for saving Pandas DataFrames to a Feather file:

feather.write_dataframe(df, '10M.feather')

And here’s the command for reading:

df = feather.read_dataframe('10M.feather')

You can learn more about Feather here:

Stop Using CSVs for Storage – This File Format Is 150 Times Faster


Comparison time – Which data format should you use?

Many highly optimized file formats are useless if you need to change or even view the data on the fly. If that’s not the case, you should generally avoid CSVs.

Below is a write-time comparison between CSV and every other data format mentioned. The goal was to save the previously created 10Mx5 dataset locally:

Image 2 – Write time comparison in seconds (CSV: 34.7; ORC: 9.66; Avro: 9.58; Parquet: 2.06; Pickle: 0.5; Feather: 0.304) (image by author)

The differences are astronomical. Feather is roughly 115 times faster than CSV at storing an identical dataset. Even if you go with something more widely compatible, such as Parquet, writes are still about 17 times faster.

Let’s talk about the read times next. The goal is to compare how much time it takes to read identical datasets in different formats:

Image 3 – Read time comparison in seconds (CSV: 3.83; ORC: 3.02, Avro: 27.6; Parquet: 1.23; Pickle: 0.193; Feather: 1.16) (image by author)

CSVs aren’t so terrible here. Apache Avro is the absolute worst because of the parsing it requires. Pickle is the fastest, so it looks like the most promising option if you work only in Python.

And finally, let’s compare the file sizes on the disk:

Image 4 – File size comparison in MB (CSV: 963.5; ORC: 379.5; Avro: 200.3; Parquet: 401.4; Pickle: 400; Feather: 400.1) (image by author)

Things don’t look good for CSVs. The file size reduction ranges from 2.4x to 4.8x, depending on the format.

To summarize, if you store gigabytes of data daily, choosing the correct file format is crucial. If you’re working only in Python, you can’t go wrong with Pickle. If you need something a bit more versatile, go with any other mentioned format.


What are your thoughts on these CSV alternatives? Which one(s) do you use when viewing and editing data on the fly isn’t required? Let me know in the comments below.
