R2 Data Catalog
R2 Data Catalog sinks write processed data from pipelines as Apache Iceberg tables to R2 Data Catalog. Iceberg tables provide ACID transactions, schema evolution, and time travel capabilities for analytics workloads.
To create an R2 Data Catalog sink, run the pipelines sinks create command and specify the sink type, target bucket, namespace, and table name:
The sink creates the specified namespace and table if they do not exist. You cannot create a sink for an Iceberg table that already exists.
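As a rough sketch of how the four inputs named above (sink type, bucket, namespace, table) might be assembled into a command line, the Python helper below builds an argument list. The flag names (`--type`, `--bucket`, `--namespace`, `--table`), the `r2-data-catalog` type value, the `npx wrangler` prefix, and all resource names are assumptions for illustration and are not confirmed by this page; check the command's own help output before relying on them.

```python
import shlex


def build_sink_create_command(sink_name, bucket, namespace, table):
    """Return an argv list for creating an R2 Data Catalog sink.

    Every flag name below is an assumption made for illustration.
    """
    return [
        "npx", "wrangler", "pipelines", "sinks", "create", sink_name,
        "--type", "r2-data-catalog",  # assumed flag and value
        "--bucket", bucket,           # assumed flag
        "--namespace", namespace,     # assumed flag
        "--table", table,             # assumed flag
    ]


# Hypothetical resource names, used only to show the shape of the command.
command = build_sink_create_command("my-sink", "my-bucket", "my_namespace", "my_table")
print(shlex.join(command))
```

Building the command as a list rather than a single string keeps each value as a separate argument, so names containing spaces or shell metacharacters are not reinterpreted if the list is later passed to `subprocess.run`.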
R2 Data Catalog sinks only support Parquet format. JSON format is not supported for Iceberg tables.
Configure Parquet compression for optimal storage and query performance:
Available compression options:
- zstd (default) - Best compression ratio
- snappy - Fastest compression
- gzip - Good compression, widely supported
- lz4 - Fast compression with reasonable ratio
- uncompressed - No compression
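If sink settings are generated by a script, it can help to validate the compression choice before it reaches the CLI. The following is a minimal sketch in Python; the option names and the default come from the list above, while the helper name `resolve_compression` is hypothetical and not part of any Cloudflare tooling.

```python
# Compression options and descriptions as listed in this section.
PARQUET_COMPRESSION_OPTIONS = {
    "zstd": "best compression ratio (default)",
    "snappy": "fastest compression",
    "gzip": "good compression, widely supported",
    "lz4": "fast compression with reasonable ratio",
    "uncompressed": "no compression",
}
DEFAULT_COMPRESSION = "zstd"


def resolve_compression(choice=None):
    """Return a validated compression name, falling back to the default."""
    if choice is None:
        return DEFAULT_COMPRESSION
    normalized = choice.strip().lower()
    if normalized not in PARQUET_COMPRESSION_OPTIONS:
        valid = ", ".join(sorted(PARQUET_COMPRESSION_OPTIONS))
        raise ValueError(f"Unsupported compression {choice!r}; expected one of: {valid}")
    return normalized


print(resolve_compression())          # falls back to the default
print(resolve_compression("Snappy"))  # normalized to lowercase
```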
Row groups are sets of rows in a Parquet file that are stored together, affecting memory usage and query performance. Configure the target row group size in MB:
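To get a feel for how the target row group size relates to file layout, the sketch below estimates how many row groups a file of a given size would contain. It is a deliberate simplification: it treats file size and row group size as directly comparable and ignores compression, encoding, and footer overhead, so real files will differ. The function name and the example numbers are illustrative assumptions.

```python
import math


def estimate_row_groups(file_size_mb, target_row_group_mb):
    """Estimate the number of row groups in a file (simplified model)."""
    if file_size_mb <= 0 or target_row_group_mb <= 0:
        raise ValueError("sizes must be positive")
    # The last row group may be smaller than the target, hence the ceiling.
    return math.ceil(file_size_mb / target_row_group_mb)


print(estimate_row_groups(100, 32))   # 4
print(estimate_row_groups(100, 128))  # 1
```

Under this model, a smaller target produces more row groups per file, each needing less memory to buffer, while a larger target produces fewer, larger groups.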
Control when data is written to Iceberg tables. Tune the write interval and maximum file size based on your needs:
- Lower values: More frequent writes, smaller files, lower latency
- Higher values: Less frequent writes, larger files, better query performance
Set how often files are written (default: 300 seconds):
Set the maximum file size in MB before a new file is created:
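The trade-off between lower and higher values can be illustrated with a small simulation. This sketch assumes a file is closed when either the interval elapses or the size limit is reached, whichever comes first; that combination rule, the `RollPolicy` class, the 100 MB size limit, and the synthetic event stream are all assumptions for illustration, and only the 300-second default comes from the text above. The check also runs only when an event arrives, which a real sink would not be limited to.

```python
from dataclasses import dataclass


@dataclass
class RollPolicy:
    interval_seconds: float = 300.0  # default interval stated above
    max_file_size_mb: float = 100.0  # illustrative value, not a documented default

    def should_roll(self, seconds_open, buffered_mb):
        """Close the file when either threshold is reached."""
        return (seconds_open >= self.interval_seconds
                or buffered_mb >= self.max_file_size_mb)


def simulate(policy, events):
    """Return the sizes (MB) of files written for (timestamp_s, size_mb) events."""
    files = []
    opened_at = None
    buffered = 0.0
    for timestamp, size_mb in events:
        if opened_at is None:
            opened_at = timestamp  # first event opens a new file
        buffered += size_mb
        if policy.should_roll(timestamp - opened_at, buffered):
            files.append(buffered)
            opened_at = None
            buffered = 0.0
    if buffered > 0:
        files.append(buffered)  # flush whatever is left at the end
    return files


# Synthetic stream: 10 MB arriving every 60 seconds for 10 minutes.
events = [(t * 60, 10) for t in range(10)]

print(simulate(RollPolicy(interval_seconds=300), events))  # [60.0, 40.0]
print(simulate(RollPolicy(interval_seconds=120), events))  # [30.0, 30.0, 30.0, 10.0]
```

With the same input, the shorter interval yields four smaller files instead of two larger ones, which matches the guidance above: lower values mean more frequent writes and smaller files.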
R2 Data Catalog sinks require an API token with R2 Admin Read & Write permissions. This permission grants the sink access to both R2 Data Catalog and R2 storage.
