Deleting data
Deleting data from R2 Data Catalog, or from any Apache Iceberg catalog, must be done as transactional operations through the catalog itself. Deleting metadata or data files directly from storage can corrupt the catalog.
R2 Data Catalog can automatically manage table maintenance operations such as snapshot expiration and compaction. These continuous operations help keep latency and storage costs down.
- Snapshot expiration: Automatically removes old snapshots. This reduces metadata overhead. Data files are not removed until orphan file removal is run.
- Compaction: Merges small data files into larger ones. This optimizes read performance and reduces the number of files read during queries.
If you do not enable automatic maintenance, you must perform these operations manually.
Learn more in the table maintenance documentation.
```sh
# Enable automatic snapshot expiration for the entire catalog
npx wrangler r2 bucket catalog snapshot-expiration enable my-bucket \
  --older-than-days 30 \
  --retain-last 5
```
```sh
# Enable automatic compaction for the entire catalog
npx wrangler r2 bucket catalog compaction enable my-bucket \
  --target-size 256
```

Refer to additional examples in the manage catalogs documentation.
You need to manually delete data for:
- Complying with data retention policies such as GDPR or CCPA.
- Selective deletes using conditional logic.
- Removing stale or unreferenced files that R2 Data Catalog does not manage.
The following are basic examples using PySpark, but similar operations can be performed with other Iceberg-compatible engines. To configure PySpark, refer to our example or the official PySpark documentation ↗.
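For orientation, here is a minimal sketch of a PySpark session wired to an Iceberg REST catalog. The catalog name `r2dc` matches the examples below; the URI, warehouse, and token values are placeholders you must replace with the values from your bucket's catalog settings, and the runtime version shown is illustrative (refer to the linked example for the exact configuration).

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("r2dc-deletes")
    # Iceberg runtime version is illustrative; match your Spark/Scala versions
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register "r2dc" as an Iceberg REST catalog
    .config("spark.sql.catalog.r2dc", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.r2dc.type", "rest")
    .config("spark.sql.catalog.r2dc.uri", "<CATALOG_URI>")        # placeholder
    .config("spark.sql.catalog.r2dc.warehouse", "<WAREHOUSE>")    # placeholder
    .config("spark.sql.catalog.r2dc.token", "<API_TOKEN>")        # placeholder
    .getOrCreate()
)
```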
```python
# Creates new snapshots and marks old files for cleanup
spark.sql("""
    DELETE FROM r2dc.namespace.table_name
    WHERE column_name = 'value'
""")
```
```python
# The following is effectively a TRUNCATE operation
spark.sql("DELETE FROM r2dc.namespace.table_name")
```
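Because `DELETE FROM` writes a new snapshot rather than erasing files, deleted rows remain reachable through time travel until snapshots are expired. A quick sketch for verifying this, assuming Spark 3.3+ where `VERSION AS OF` is supported (the snapshot ID shown is illustrative):

```python
# List the table's snapshot history via Iceberg's snapshots metadata table
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM r2dc.namespace.table_name.snapshots
""").show()

# Query the table as it existed at an earlier snapshot
# (snapshot ID below is illustrative)
spark.sql("""
    SELECT * FROM r2dc.namespace.table_name
    VERSION AS OF 3051729675574597004
""").show()
```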
```python
# For large deletes, use partitioned tables and delete entire
# partitions for faster performance
spark.sql("""
    DELETE FROM r2dc.namespace.table_name
    WHERE date_partition < '2024-01-01'
""")
```

```python
# Removes table from catalog but keeps data files in R2 storage
spark.sql("DROP TABLE r2dc.namespace.table_name")
```
```python
# ⚠️ DANGER: Permanently deletes all data files from R2.
# This operation cannot be undone.
spark.sql("DROP TABLE r2dc.namespace.table_name PURGE")
```
```python
# Use CASCADE to drop all tables within the namespace
spark.sql("DROP NAMESPACE r2dc.namespace_name CASCADE")
```
```python
# To permanently delete data files, PURGE each table before dropping
# the namespace with CASCADE. This can be done with a loop over all
# tables in the namespace.
tables = spark.sql("SHOW TABLES IN r2dc.namespace_name").collect()
for row in tables:
    table_name = row['tableName']
    spark.sql(f"DROP TABLE r2dc.namespace_name.{table_name} PURGE")
spark.sql("DROP NAMESPACE r2dc.namespace_name CASCADE")
```

```python
# Remove old metadata and data files marked for deletion.
# The following retains the last 5 snapshots and expires snapshots
# older than Nov 28, 2024.
spark.sql("""
    CALL r2dc.system.expire_snapshots(
        table => 'r2dc.namespace_name.table_name',
        older_than => TIMESTAMP '2024-11-28 00:00:00',
        retain_last => 5
    )
""")
```
```python
# Removes unreferenced data files (orphan files) from R2 storage
spark.sql("""
    CALL r2dc.system.remove_orphan_files(
        table => 'namespace.table_name'
    )
""")
```
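By default, `remove_orphan_files` only considers files older than a safety window so that files from in-flight writes are not deleted. If you need a different cutoff, Iceberg's procedure accepts an `older_than` argument; a hedged sketch (the timestamp is illustrative):

```python
# Only remove orphan files older than the given cutoff; keep the
# window generous to avoid deleting files that an in-flight write
# has not yet committed
spark.sql("""
    CALL r2dc.system.remove_orphan_files(
        table => 'namespace.table_name',
        older_than => TIMESTAMP '2024-11-28 00:00:00'
    )
""")
```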
```python
# Rewrite data files with a target file size (for example, 512 MB)
spark.sql("""
    CALL r2dc.system.rewrite_data_files(
        table => 'r2dc.namespace_name.table_name',
        options => map('target-file-size-bytes', '536870912')
    )
""")
```

Apache Iceberg uses a layered metadata structure to manage table data efficiently. The key components and file structure:
- metadata.json: Top-level JSON file pointing to the current snapshot
- snapshot-*: Immutable table state for a given point in time
- manifest-list-*.avro: An Avro file listing all manifest files for a given snapshot
- manifest-file-*.avro: An Avro file tracking data files and their statistics
- data-*.parquet: Parquet files containing actual table data
- Note: Unchanged manifest files are reused across snapshots
```
metadata.json                           # Points to the current snapshot
├── table schema
├── partition spec
└── sort order
snapshots/
├── snapshot-3051729675574597004.avro   # Snapshot 1 (historical)
│   └── manifest-list-abc123.avro       # Manifest list
│       ├── manifest-file-001.avro      # Manifest file
│       │   ├── data-00001.parquet  (10 MB, 50K rows)
│       │   ├── data-00002.parquet  (12 MB, 60K rows)
│       │   └── data-00003.parquet  (11 MB, 55K rows)
│       └── manifest-file-002.avro
│           ├── data-00004.parquet  (9 MB, 45K rows)
│           └── data-00005.parquet  (10 MB, 50K rows)
└── snapshot-3051729675574597005.avro   # Snapshot 2 (current)
    └── manifest-list-def456.avro       # Manifest list
        ├── manifest-file-001.avro      # Reused from Snapshot 1
        │   ├── data-00001.parquet
        │   ├── data-00002.parquet
        │   └── data-00003.parquet
        └── manifest-file-003.avro      # New
            ├── data-00006.parquet  (11 MB, 53K rows)
            ├── data-00007.parquet  (10 MB, 51K rows)
            └── data-00008.parquet  (12 MB, 58K rows)
```
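You can inspect each layer of this structure directly: Iceberg exposes metadata tables (`snapshots`, `manifests`, `files`, and others) that can be queried like ordinary tables. For example:

```python
# Snapshot history: one row per table state
spark.sql("SELECT * FROM r2dc.namespace.table_name.snapshots").show()

# Manifest files referenced by the current snapshot
spark.sql("SELECT * FROM r2dc.namespace.table_name.manifests").show()

# Data files tracked by the current snapshot, with per-file statistics
spark.sql("SELECT * FROM r2dc.namespace.table_name.files").show()
```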
Apache Iceberg supports two deletion modes: Copy-on-Write (COW) and Merge-on-Read (MOR). Both create a new snapshot and mark old files for cleanup, but handle the deletion differently:
| Aspect | Copy-on-Write (COW) | Merge-on-Read (MOR) |
|---|---|---|
| How deletes work | Rewrites data files without deleted rows | Creates delete files marking rows to skip |
| Query performance | Fast (no merge needed) | Slower (requires read-time merge) |
| Write performance | Slower (rewrites data files) | Fast (only writes delete markers) |
| Storage impact | Creates new data files immediately | Accumulates delete files over time |
| Maintenance needs | Snapshot expiration | Snapshot expiration + compaction (rewrite_data_files) |
| Best for | Read-heavy workloads | Write-heavy workloads with frequent small mutations |
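The mode is chosen per table through Iceberg table properties (`write.delete.mode`, and analogously `write.update.mode` and `write.merge.mode`). A minimal sketch:

```python
# Switch a table's DELETE strategy to merge-on-read
# (valid values are 'copy-on-write' and 'merge-on-read')
spark.sql("""
    ALTER TABLE r2dc.namespace.table_name
    SET TBLPROPERTIES ('write.delete.mode' = 'merge-on-read')
""")
```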
These operations work the same way for both COW and MOR tables:
| Operation | What it does | Data deleted? | Reversible? |
|---|---|---|---|
| `DELETE FROM` | Removes rows matching a condition | No (marked for cleanup) | Via time travel¹ |
| `DROP TABLE` | Removes table from catalog | No | Yes (if data files still exist) |
| `DROP TABLE ... PURGE` | Removes table and deletes data | Yes | No |
| `expire_snapshots` | Cleans up old snapshots/files | Yes | No |
| `remove_orphan_files` | Removes unreferenced files | Yes | No |
For Merge-on-Read tables, you may also need to apply accumulated deletes manually to maintain query performance:
| Operation | What it does | When to use |
|---|---|---|
| `rewrite_data_files` (compaction) | Applies deletes and consolidates files | When query performance degrades due to many delete files |
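Iceberg's `rewrite_data_files` procedure accepts a `delete-file-threshold` option that forces rewriting of any data file with at least that many associated delete files, which is a common way to fold accumulated merge-on-read deletes back into the data. A hedged sketch (the threshold value is illustrative):

```python
# Rewrite every data file that has one or more delete files attached,
# materializing accumulated merge-on-read deletes
spark.sql("""
    CALL r2dc.system.rewrite_data_files(
        table => 'namespace.table_name',
        options => map('delete-file-threshold', '1')
    )
""")
```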
- Table maintenance - Learn about automatic maintenance operations
- R2 Data Catalog - Overview and getting started guide
- Query data - Query tables with R2 SQL
- Apache Iceberg Maintenance ↗ - Official Iceberg documentation on table maintenance
¹ Time travel is available until `expire_snapshots` is called.