# Deleting data
Deleting data from R2 Data Catalog, or from any Apache Iceberg catalog, must be done through transactional operations issued via the catalog itself. Manually deleting metadata or data files directly from the bucket can corrupt the catalog.
```sh
# Enable automatic snapshot expiration for entire catalog
npx wrangler r2 bucket catalog snapshot-expiration enable my-bucket \
  --older-than-days 30 \
  --retain-last 5
```

```sh
# Enable automatic compaction for entire catalog
npx wrangler r2 bucket catalog compaction enable my-bucket \
  --target-size 256
```

More information can be found in the table maintenance and manage catalogs documentation.
The following are basic examples using PySpark but similar operations can be performed using other Iceberg-compatible engines. To configure PySpark, refer to our example or the official PySpark documentation ↗.
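For reference, a minimal PySpark session sketch is shown below. It assumes the REST catalog endpoint and warehouse naming used by R2 Data Catalog; the account ID, bucket name, API token, and Iceberg runtime version are all placeholders you must substitute, so treat this as a starting point rather than a complete configuration.

```python
# Minimal sketch of a SparkSession wired to R2 Data Catalog as an Iceberg
# REST catalog named "r2dc". All <...> values are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("r2dc-maintenance")
    # Iceberg runtime and SQL extensions (needed for CALL ... procedures)
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:<version>")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register the catalog under the name "r2dc" used throughout this page
    .config("spark.sql.catalog.r2dc", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.r2dc.type", "rest")
    .config("spark.sql.catalog.r2dc.uri",
            "https://catalog.cloudflarestorage.com/<account_id>/<bucket>")  # placeholder
    .config("spark.sql.catalog.r2dc.warehouse", "<account_id>_<bucket>")    # placeholder
    .config("spark.sql.catalog.r2dc.token", "<api_token>")                  # placeholder
    .getOrCreate()
)
```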
```python
# Creates new snapshots and marks old files for cleanup
spark.sql("""
  DELETE FROM r2dc.namespace.table_name
  WHERE column_name = 'value'
""")
```

```python
# The following is effectively a TRUNCATE operation
spark.sql("DELETE FROM r2dc.namespace.table_name")
```

```python
# For large deletes, use partitioned tables and delete entire partitions
# for faster performance:
spark.sql("""
  DELETE FROM r2dc.namespace.table_name
  WHERE date_partition < '2024-01-01'
""")
```

```python
# Removes table from catalog but keeps data files in R2 storage
spark.sql("DROP TABLE r2dc.namespace.table_name")
```

```python
# ⚠️ DANGER: Permanently deletes all data files from R2
# This operation cannot be undone
spark.sql("DROP TABLE r2dc.namespace.table_name PURGE")
```

```python
# Use CASCADE to drop all tables within the namespace
spark.sql("DROP NAMESPACE r2dc.namespace_name CASCADE")
```

```python
# You will need to PURGE the tables before running CASCADE to permanently
# delete data files. This can be done with a loop over all tables in the
# namespace.
tables = spark.sql("SHOW TABLES IN r2dc.namespace_name").collect()
for row in tables:
    table_name = row['tableName']
    spark.sql(f"DROP TABLE r2dc.namespace_name.{table_name} PURGE")
spark.sql("DROP NAMESPACE r2dc.namespace_name CASCADE")
```

```python
# Remove old metadata and data files marked for deletion.
# The following retains the last 5 snapshots and deletes files older
# than Nov 28, 2024.
spark.sql("""
  CALL r2dc.system.expire_snapshots(
    table => 'r2dc.namespace_name.table_name',
    older_than => TIMESTAMP '2024-11-28 00:00:00',
    retain_last => 5
  )
""")
```

```python
# Removes unreferenced data files from R2 storage (orphan files)
spark.sql("""
  CALL r2dc.system.remove_orphan_files(
    table => 'namespace.table_name'
  )
""")
```

```python
# Rewrite data files with a target file size (e.g., 512 MB)
spark.sql("""
  CALL r2dc.system.rewrite_data_files(
    table => 'r2dc.namespace_name.table_name',
    options => map('target-file-size-bytes', '536870912')
  )
""")
```

Apache Iceberg uses a layered metadata structure to manage table data efficiently. Here are the key components and file structure:
- metadata.json: Top-level JSON file pointing to the current snapshot
- snapshot-*: Immutable table state for a given point in time
- manifest-list-*.avro: An Avro file listing all manifest files for a given snapshot
- manifest-file-*.avro: An Avro file tracking data files and their statistics
- data-*.parquet: Parquet files containing actual table data
- Note: Unchanged manifest files are reused across snapshots
```
metadata.json                        # points to the current snapshot; holds
│                                    # table schema, partition spec, sort order
└── snapshots/
    ├── snapshot-3051729675574597004.avro    # Snapshot 1 (historical)
    │   └── manifest-list-abc123.avro        # manifest list
    │       ├── manifest-file-001.avro       # manifest file
    │       │   ├── data-00001.parquet  (10 MB, 50K rows)
    │       │   ├── data-00002.parquet  (12 MB, 60K rows)
    │       │   └── data-00003.parquet  (11 MB, 55K rows)
    │       └── manifest-file-002.avro
    │           ├── data-00004.parquet  (9 MB, 45K rows)
    │           └── data-00005.parquet  (10 MB, 50K rows)
    └── snapshot-3051729675574597005.avro    # Snapshot 2 (current)
        └── manifest-list-def456.avro        # manifest list
            ├── manifest-file-001.avro       # reused from Snapshot 1
            │   ├── data-00001.parquet
            │   ├── data-00002.parquet
            │   └── data-00003.parquet
            └── manifest-file-003.avro       # new
                ├── data-00006.parquet  (11 MB, 53K rows)
                ├── data-00007.parquet  (10 MB, 51K rows)
                └── data-00008.parquet  (12 MB, 58K rows)
```

(The nesting above shows references between files, not the physical directory layout.)
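The pointer chain from `metadata.json` down to the data files can be illustrated with a small, self-contained sketch. The dictionary below is a simplified stand-in for the real files (which are a JSON metadata file plus Avro manifest lists and manifest files), not the actual Iceberg formats; only the reference structure is mirrored.

```python
# Simplified stand-in for Iceberg's metadata chain: current snapshot id ->
# snapshot -> manifest files -> data files. Illustrative structure only.
metadata = {
    "current-snapshot-id": 3051729675574597005,
    "snapshots": {
        3051729675574597004: {  # Snapshot 1 (historical)
            "manifest-list": "manifest-list-abc123.avro",
            "manifests": {
                "manifest-file-001.avro": [
                    "data-00001.parquet", "data-00002.parquet", "data-00003.parquet"],
                "manifest-file-002.avro": [
                    "data-00004.parquet", "data-00005.parquet"],
            },
        },
        3051729675574597005: {  # Snapshot 2 (current) reuses manifest-file-001
            "manifest-list": "manifest-list-def456.avro",
            "manifests": {
                "manifest-file-001.avro": [
                    "data-00001.parquet", "data-00002.parquet", "data-00003.parquet"],
                "manifest-file-003.avro": [
                    "data-00006.parquet", "data-00007.parquet", "data-00008.parquet"],
            },
        },
    },
}

def live_data_files(metadata):
    """Resolve the current snapshot to the set of data files it references."""
    snapshot = metadata["snapshots"][metadata["current-snapshot-id"]]
    return sorted(f for files in snapshot["manifests"].values() for f in files)

print(live_data_files(metadata))
```

Files reachable only from historical snapshots (here, `data-00004.parquet` and `data-00005.parquet`) are exactly what `expire_snapshots` is allowed to delete: nothing in the current snapshot references them.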
Apache Iceberg supports two deletion modes: Copy-on-Write (COW) and Merge-on-Read (MOR). Both create a new snapshot and mark old files for cleanup, but handle the deletion differently:
| Aspect | Copy-on-Write (COW) | Merge-on-Read (MOR) |
|---|---|---|
| How deletes work | Rewrites data files without deleted rows | Creates delete files marking rows to skip |
| Query performance | Fast (no merge needed) | Slower (requires read-time merge) |
| Write performance | Slower (rewrites data files) | Fast (only writes delete markers) |
| Storage impact | Creates new data files immediately | Accumulates delete files over time |
| Maintenance needs | Snapshot expiration | Snapshot expiration + compaction (rewrite_data_files) |
| Best for | Read-heavy workloads | Write-heavy workloads with frequent small mutations |
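The deletion mode is a per-table Iceberg property. As a sketch, assuming a SparkSession `spark` configured for the catalog as above, it can be switched with the standard `write.delete.mode` property (sibling properties `write.update.mode` and `write.merge.mode` control updates and merges the same way):

```python
# Switch DELETE handling for one table to merge-on-read.
# Assumes `spark` is a SparkSession configured for the r2dc catalog.
spark.sql("""
  ALTER TABLE r2dc.namespace.table_name SET TBLPROPERTIES (
    'write.delete.mode' = 'merge-on-read'
  )
""")
```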
These operations work the same way for both COW and MOR tables:
| Operation | What it does | Data deleted? | Reversible? |
|---|---|---|---|
| `DELETE FROM` | Removes rows matching condition | No (marked for cleanup) | Via time travel[^1] |
| `DROP TABLE` | Removes table from catalog | No | Yes (if data files exist) |
| `DROP TABLE ... PURGE` | Removes table and deletes data | Yes | No |
| `expire_snapshots` | Cleans up old snapshots/files | Yes | No |
| `remove_orphan_files` | Removes unreferenced files | Yes | No |
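"Reversible via time travel" means a reader can still query an older snapshot until it is expired. Conceptually, resolving a timestamp to a snapshot just means picking the most recent snapshot committed at or before the requested time; a toy sketch (illustrative names, not the real Iceberg API):

```python
# Sketch of timestamp-based time travel: choose the most recent snapshot
# committed at or before the requested timestamp. Illustrative only.
snapshots = [  # (commit timestamp in ms, snapshot id), ordered by commit time
    (1700000000000, 3051729675574597004),
    (1700100000000, 3051729675574597005),
]

def snapshot_as_of(snapshots, ts_ms):
    candidates = [sid for committed, sid in snapshots if committed <= ts_ms]
    if not candidates:
        raise ValueError("no snapshot at or before requested timestamp")
    return candidates[-1]  # list is ordered by commit time

print(snapshot_as_of(snapshots, 1700050000000))  # → 3051729675574597004 (Snapshot 1)
```

In Spark SQL the equivalent is `SELECT ... FROM r2dc.namespace.table_name TIMESTAMP AS OF '...'` (or `VERSION AS OF <snapshot_id>`); once `expire_snapshots` removes a snapshot, resolution to it fails and the deleted rows are unrecoverable.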
For Merge-on-Read tables, you may need to manually apply accumulated delete files to maintain query performance:
| Operation | What it does | When to use |
|---|---|---|
| `rewrite_data_files` (compaction) | Applies deletes and consolidates files | When query performance degrades due to many delete files |
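The cost of the read-time merge can be seen in a toy sketch: a MOR reader must check every scanned row against the accumulated positional delete files, which is why compaction helps once those files pile up. (The structures below are illustrative, not Iceberg's actual delete-file format.)

```python
# Toy merge-on-read: positional delete files record (data_file, row_position)
# pairs; the reader skips those positions when scanning. Illustrative only.
data_files = {
    "data-00001.parquet": ["alice", "bob", "carol"],
    "data-00002.parquet": ["dave", "erin"],
}
# Each DELETE appends entries here instead of rewriting the data files.
position_deletes = {("data-00001.parquet", 1), ("data-00002.parquet", 0)}

def scan(data_files, position_deletes):
    """Read-time merge: yield rows whose (file, position) is not deleted."""
    for fname, rows in sorted(data_files.items()):
        for pos, row in enumerate(rows):
            if (fname, pos) not in position_deletes:
                yield row

print(list(scan(data_files, position_deletes)))  # → ['alice', 'carol', 'erin']
```

Running `rewrite_data_files` applies these deletes permanently: it writes new data files without the deleted rows, after which the delete files no longer need to be consulted at read time.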
- Table maintenance - Learn about automatic maintenance operations
- R2 Data Catalog - Overview and getting started guide
- Query data - Query tables with R2 SQL
- Apache Iceberg Maintenance ↗ - Official Iceberg documentation on table maintenance
[^1]: Time travel is available until `expire_snapshots` is called.