Apache Arrow User Guide —— Reading and writing Parquet files

肥叔菌 2023-01-28

Reading Parquet files

The parquet::arrow::FileReader class reads data for an entire file or row group into an ::arrow::Table. The StreamReader and StreamWriter classes read and write fields column by column and row by row using a C++ input/output streams approach. This approach is offered for ease of use and type safety. It is also useful when data must be streamed, that is, when files are read and written incrementally. Note that the StreamReader and StreamWriter classes will not perform as well as the table-based API, because of the type checking and because column values are processed one at a time.

The parquet::arrow::FileReader requires an ::arrow::io::RandomAccessFile instance representing the input file. Finer-grained options are available through the parquet::arrow::FileReaderBuilder helper class.

#include "parquet/arrow/reader.h"

{
   // ...
   arrow::Status st;
   arrow::MemoryPool* pool = arrow::default_memory_pool();
   std::shared_ptr<arrow::io::RandomAccessFile> input = ...;

   // Open Parquet file reader
   std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
   st = parquet::arrow::OpenFile(input, pool, &arrow_reader);
   if (!st.ok()) {
      // Handle error instantiating file reader...
   }

   // Read entire file as a single Arrow table
   std::shared_ptr<arrow::Table> table;
   st = arrow_reader->ReadTable(&table);
   if (!st.ok()) {
      // Handle error reading Parquet data...
   }
}

The StreamReader allows Parquet files to be read using standard C++ input operators, which ensures type safety. Note that types must match the schema exactly: if the schema field is an unsigned 16-bit integer, you must supply a uint16_t. Exceptions are used to signal errors. A ParquetException is thrown in the following circumstances: an attempt to read a field by supplying the incorrect type, an attempt to read beyond the end of a row, or an attempt to read beyond the end of the file.

#include "arrow/io/file.h"
#include "parquet/stream_reader.h"

{
   std::shared_ptr<arrow::io::ReadableFile> infile;

   PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open("test.parquet"));

   parquet::StreamReader os{parquet::ParquetFileReader::Open(infile)};

   // The variable types must match the schema of the file exactly.
   std::string article;
   float price;
   uint32_t quantity;

   while (!os.eof()) {
      os >> article >> price >> quantity >> parquet::EndRow;
      // ...
   }
}

Writing Parquet files

The parquet::arrow::WriteTable() function writes an entire ::arrow::Table to an output file.

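#include "arrow/io/file.h"
#include "parquet/arrow/writer.h"

{
   // `table` is assumed to be an existing arrow::Table holding the data to write.
   std::shared_ptr<arrow::io::FileOutputStream> outfile;
   PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open("test.parquet"));

   // The last argument is the chunk size: the maximum number of rows per row group.
   PARQUET_THROW_NOT_OK(
      parquet::arrow::WriteTable(table, arrow::default_memory_pool(), outfile, 3));
}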

The StreamWriter allows Parquet files to be written using standard C++ output operators. This type-safe approach also ensures that rows are written without omitting fields, and allows new row groups to be created automatically (after a certain volume of data) or explicitly by using the EndRowGroup stream modifier. Exceptions are used to signal errors. A ParquetException is thrown in the following circumstances: an attempt to write a field using an incorrect type, an attempt to write too many fields in a row, or an attempt to skip a required field.

#include "arrow/io/file.h"
#include "parquet/stream_writer.h"

{
   std::shared_ptr<arrow::io::FileOutputStream> outfile;
   PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open("test.parquet"));

   parquet::WriterProperties::Builder builder;
   std::shared_ptr<parquet::schema::GroupNode> schema;

   // Set up builder with required compression type etc.
   // Define schema.
   // ...

   parquet::StreamWriter os{parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};

   // Loop over some data structure which provides the required
   // fields to be written and write each row.
   for (const auto& a : getArticles()) {
      os << a.name() << a.price() << a.quantity() << parquet::EndRow;
   }
}

The Parquet format is a space-efficient columnar storage format for complex data. The Parquet C++ implementation is part of the Apache Arrow project and benefits from tight integration with the Arrow C++ classes and facilities.

Supported Parquet features

The Parquet format has many features, and Parquet C++ supports a subset of them.

Page types
Unsupported page type: INDEX_PAGE. When reading a Parquet file, pages of this type are ignored.

Compression
Unsupported compression codec: LZO.

(1) On the read side, Parquet C++ is able to decompress both the regular LZ4 block format and the ad-hoc Hadoop LZ4 format used by the reference Parquet implementation. On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.

Encodings

(1) Only supported for encoding definition and repetition levels, not values.
(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version 2.4 or greater is selected in WriterProperties::version().

Types
Physical types

(1) Can be mapped to other Arrow types, depending on the logical type (see below).
(2) On the write side, ArrowWriterProperties::support_deprecated_int96_timestamps() must be enabled.
(3) On the write side, an Arrow LargeBinary can also be mapped to BYTE_ARRAY.

Logical types
Specific logical types can override the default Arrow type mapping for a given physical type.

(1) On the write side, the Parquet physical type INT32 is generated.
(2) On the write side, a FIXED_LENGTH_BYTE_ARRAY is always emitted.
(3) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.
(4) On the write side, an Arrow LargeUtf8 is also mapped to a Parquet STRING.
(5) On the write side, an Arrow LargeList or FixedSizeList is also mapped to a Parquet LIST.
(6) On the read side, a key with multiple values does not get deduplicated, in contradiction with the Parquet specification.
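
Note (3) implies a change of unit: an Arrow Date64 counts milliseconds since the UNIX epoch, while a Parquet DATE counts days since the UNIX epoch in an INT32. The arithmetic can be sketched as follows. Date64ToParquetDate is a hypothetical helper, not a function of Parquet C++, and it assumes the input is a whole number of days.

```cpp
#include <cstdint>
#include <stdexcept>

// Number of milliseconds in one day.
constexpr std::int64_t kMillisPerDay = 86400000;

// Hypothetical helper: converts a Date64 value (milliseconds since the UNIX
// epoch) into a Parquet DATE value (days since the UNIX epoch, INT32).
// Assumption: the input must be an exact multiple of one day.
std::int32_t Date64ToParquetDate(std::int64_t millis_since_epoch) {
  if (millis_since_epoch % kMillisPerDay != 0) {
    throw std::invalid_argument("Date64 value is not a whole number of days");
  }
  return static_cast<std::int32_t>(millis_since_epoch / kMillisPerDay);
}
```

For instance, 2023-01-28 is 1674864000000 milliseconds after the epoch as a Date64, which corresponds to day 19385 as a Parquet DATE.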

Unsupported logical types: JSON, BSON, UUID. If such a type is encountered when reading a Parquet file, the default physical type mapping is used (for example, a Parquet JSON column may be read as Arrow Binary or FixedSizeBinary).

Converted types
While converted types are deprecated in the Parquet format (they are superseded by logical types), they are recognized and emitted by the Parquet C++ implementation so as to maximize compatibility with other Parquet implementations.

Special cases
An Arrow Extension type is written out as its storage type. It can still be recreated at read time using Parquet metadata (see “Roundtripping Arrow types” below).
An Arrow Dictionary type is written out as its value type. It can still be recreated at read time using Parquet metadata (see “Roundtripping Arrow types” below).

Roundtripping Arrow types
While there is no bijection between Arrow types and Parquet types, it is possible to serialize the Arrow schema as part of the Parquet file metadata. This is enabled using ArrowWriterProperties::store_schema().
On the read path, the serialized schema will be automatically recognized and will recreate the original Arrow data, converting the Parquet data as required (for example, a LargeList will be recreated from the Parquet LIST type).
As an example, when serializing an Arrow LargeList to Parquet:
The data is written out as a Parquet LIST.
When read back, the Parquet LIST data is decoded as an Arrow LargeList if ArrowWriterProperties::store_schema() was enabled when writing the file; otherwise, it is decoded as an Arrow List.

Serialization details
The Arrow schema is serialized as an Arrow IPC schema message, then base64-encoded and stored under the ARROW:schema metadata key in the Parquet file metadata.
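
To make the last two steps concrete, here is a small standard-library sketch that base64-encodes a byte buffer and stores it under the ARROW:schema key. The byte buffer stands in for a real Arrow IPC schema message, and Base64Encode and StoreArrowSchema are illustrative names rather than Parquet C++ functions; the key-value metadata is modeled as a plain std::map.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Standard base64 alphabet (RFC 4648), with '=' used for padding.
static const char kBase64Alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Encode arbitrary bytes as base64 text, three input bytes at a time.
std::string Base64Encode(const std::vector<std::uint8_t>& bytes) {
  std::string out;
  out.reserve(((bytes.size() + 2) / 3) * 4);
  for (std::size_t i = 0; i < bytes.size(); i += 3) {
    const std::size_t remaining = bytes.size() - i;
    // Pack up to three bytes into a 24-bit group.
    std::uint32_t group = static_cast<std::uint32_t>(bytes[i]) << 16;
    if (remaining > 1) group |= static_cast<std::uint32_t>(bytes[i + 1]) << 8;
    if (remaining > 2) group |= static_cast<std::uint32_t>(bytes[i + 2]);
    // Emit four 6-bit symbols, padding when the group is incomplete.
    out += kBase64Alphabet[(group >> 18) & 63];
    out += kBase64Alphabet[(group >> 12) & 63];
    out += remaining > 1 ? kBase64Alphabet[(group >> 6) & 63] : '=';
    out += remaining > 2 ? kBase64Alphabet[group & 63] : '=';
  }
  return out;
}

// Store the encoded schema message under the ARROW:schema metadata key.
// `schema_message` is a placeholder for the serialized IPC schema bytes.
void StoreArrowSchema(const std::vector<std::uint8_t>& schema_message,
                      std::map<std::string, std::string>* key_value_metadata) {
  (*key_value_metadata)["ARROW:schema"] = Base64Encode(schema_message);
}
```

A reader would reverse the process: look up ARROW:schema, base64-decode the value, and parse the result as an IPC schema message.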

Limitations
Writing or reading back FixedSizeList data with null entries is not supported.

Encryption
Parquet C++ implements all features specified in the encryption specification, except for encryption of column index and bloom filter modules. More specifically, Parquet C++ supports:

AES_GCM_V1 and AES_GCM_CTR_V1 encryption algorithms.
AAD suffix for Footer, ColumnMetaData, Data Page, Dictionary Page, Data PageHeader, Dictionary PageHeader module types. Other module types (ColumnIndex, OffsetIndex, BloomFilter Header, BloomFilter Bitset) are not supported.
EncryptionWithFooterKey and EncryptionWithColumnKey modes.
Encrypted Footer and Plaintext Footer modes.
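
To give a feel for what the AAD suffix of a module contains, the sketch below assembles one for the six supported module types. It is an illustration only: the byte layout (file identifier, one-byte module type, then two-byte little-endian row group, column and page ordinals) and the numeric module codes are assumptions based on the Parquet modular encryption specification and should be checked against it. MakeAadSuffix and AppendShortLE are hypothetical names, not Parquet C++ functions.

```cpp
#include <cstdint>
#include <string>

// Module types for which the AAD suffix is supported, with one-byte codes
// assumed from the Parquet encryption specification.
enum class ModuleType : std::uint8_t {
  kFooter = 0,
  kColumnMetaData = 1,
  kDataPage = 2,
  kDictionaryPage = 3,
  kDataPageHeader = 4,
  kDictionaryPageHeader = 5,
};

// Append a 16-bit ordinal in little-endian byte order.
inline void AppendShortLE(std::string* out, std::uint16_t value) {
  out->push_back(static_cast<char>(value & 0xFF));
  out->push_back(static_cast<char>((value >> 8) & 0xFF));
}

// Hypothetical helper: builds the AAD suffix for one module.
std::string MakeAadSuffix(const std::string& file_id, ModuleType type,
                          std::uint16_t row_group, std::uint16_t column,
                          std::uint16_t page) {
  std::string aad = file_id;
  aad.push_back(static_cast<char>(static_cast<std::uint8_t>(type)));
  // The footer is identified by the file and module type alone.
  if (type == ModuleType::kFooter) return aad;
  AppendShortLE(&aad, row_group);
  AppendShortLE(&aad, column);
  // Only data pages and their headers carry a page ordinal.
  if (type == ModuleType::kDataPage || type == ModuleType::kDataPageHeader) {
    AppendShortLE(&aad, page);
  }
  return aad;
}
```

Because the suffix differs for every module, a ciphertext copied from one position in a file to another would fail authentication at its new position.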
