Skip to content
Cloudflare Docs

Deleting data

Deleting data in R2 Data Catalog

Deleting data from R2 Data Catalog or any Apache Iceberg catalog requires that operations are done in a transaction through the catalog itself. Manually deleting metadata or data files directly can lead to data catalog corruption.

Examples of enabling automatic table maintenance in R2 Data Catalog

Terminal window
# Enable automatic snapshot expiration for entire catalog
npx wrangler r2 bucket catalog snapshot-expiration enable my-bucket \
--older-than-days 30 \
--retain-last 5
# Enable automatic compaction for entire catalog
npx wrangler r2 bucket catalog compaction enable my-bucket \
--target-size 256

More information can be found in the table maintenance and manage catalogs documentation.

Examples of deleting data from R2 Data Catalog using PySpark

The following are basic examples using PySpark but similar operations can be performed using other Iceberg-compatible engines. To configure PySpark, refer to our example or the official PySpark documentation.

Deleting rows from a table

Python
# Creates new snapshots and marks old files for cleanup
spark.sql("""
DELETE FROM r2dc.namespace.table_name
WHERE column_name = 'value'
""")
# The following is effectively a TRUNCATE operation
spark.sql("DELETE FROM r2dc.namespace.table_name")
# For large deletes, use partitioned tables and delete entire partitions for faster performance:
spark.sql("""
DELETE FROM r2dc.namespace.table_name
WHERE date_partition < '2024-01-01'
""")

Dropping tables and namespaces

Python
# Removes table from catalog but keeps data files in R2 storage
spark.sql("DROP TABLE r2dc.namespace.table_name")
# ⚠️ DANGER: Permanently deletes all data files from R2
# This operation cannot be undone
spark.sql("DROP TABLE r2dc.namespace.table_name PURGE")
# Use CASCADE to drop all tables within the namespace
spark.sql("DROP NAMESPACE r2dc.namespace_name CASCADE")
# You will need to PURGE the tables before running CASCADE to permanently delete data files
# This can be done with a loop over all tables in the namespace
tables = spark.sql("SHOW TABLES IN r2dc.namespace_name").collect()
for row in tables:
table_name = row['tableName']
spark.sql(f"DROP TABLE r2dc.namespace_name.{table_name} PURGE")
spark.sql("DROP NAMESPACE r2dc.namespace_name CASCADE")

Manual maintenance operations

Python
# Remove old metadata and data files marked for deletion
# The following retains the last 5 snapshots and deletes files older than Nov 28, 2024
spark.sql("""
CALL r2dc.system.expire_snapshots(
table => 'r2dc.namespace_name.table_name',
older_than => TIMESTAMP '2024-11-28 00:00:00',
retain_last => 5
)
""")
# Removes unreferenced data files from R2 storage (orphan files)
spark.sql("""
CALL r2dc.system.remove_orphan_files(
table => 'namespace.table_name'
)
""")
# Rewrite data files with a target file size (e.g., 512 MB)
spark.sql("""
CALL r2dc.system.rewrite_data_files(
table => 'r2dc.namespace_name.table_name',
options => map('target-file-size-bytes', '536870912')
)
""")

About Apache Iceberg metadata

Apache Iceberg uses a layered metadata structure to manage table data efficiently. Here are the key components and file structure:

  • metadata.json: Top-level JSON file pointing to the current snapshot
  • snapshot-*: Immutable table state for a given point in time
  • manifest-list-*.avro: An Avro file listing all manifest files for a given snapshot
  • manifest-file-*.avro: An Avro file tracking data files and their statistics
  • data-*.parquet: Parquet files containing actual table data
  • Note: Unchanged manifest files are reused across snapshots
  • Directorymetadata.json Metadata File - Points to current snapshot
    • Table Schema
    • Partition Spec
    • Sort Order
    • DirectorySnapshots
      • Directorysnapshot-3051729675574597004.avro Snapshot 1 (Historical)
        • Directorymanifest-list-abc123.avro Manifest List
          • Directorymanifest-file-001.avro Manifest File
            • data-00001.parquet (10 MB, 50K rows)
            • data-00002.parquet (12 MB, 60K rows)
            • data-00003.parquet (11 MB, 55K rows)
          • Directorymanifest-file-002.avro
            • data-00004.parquet (9 MB, 45K rows)
            • data-00005.parquet (10 MB, 50K rows)
      • Directorysnapshot-3051729675574597005.avro Snapshot 2 (Current)
        • Directorymanifest-list-def456.avro Manifest List
          • Directorymanifest-file-001.avro (reused from Snapshot 1)
            • data-00001.parquet
            • data-00002.parquet
            • data-00003.parquet
          • Directorymanifest-file-003.avro (new)
            • data-00006.parquet (11 MB, 53K rows)
            • data-00007.parquet (10 MB, 51K rows)
            • data-00008.parquet (12 MB, 58K rows)

What happens during deletion

Apache Iceberg supports two deletion modes: Copy-on-Write (COW) and Merge-on-Read (MOR). Both create a new snapshot and mark old files for cleanup, but handle the deletion differently:

AspectCopy-on-Write (COW)Merge-on-Read (MOR)
How deletes workRewrites data files without deleted rowsCreates delete files marking rows to skip
Query performanceFast (no merge needed)Slower (requires read-time merge)
Write performanceSlower (rewrites data files)Fast (only writes delete markers)
Storage impactCreates new data files immediatelyAccumulates delete files over time
Maintenance needsSnapshot expirationSnapshot expiration + compaction (rewrite_data_files)
Best forRead-heavy workloadsWrite-heavy workloads with frequent small mutations

Common deletion operations

These operations work the same way for both COW and MOR tables:

OperationWhat it doesData deleted?Reversible?
DELETE FROMRemoves rows matching conditionNo (marked for cleanup)Via time travel1
DROP TABLERemoves table from catalogNoYes (if data files exist)
DROP TABLE ... PURGERemoves table and deletes dataYesNo
expire_snapshotsCleans up old snapshots/filesYesNo
remove_orphan_filesRemoves unreferenced filesYesNo

MOR-specific operations

For Merge-on-Read tables, you may need to manually apply deletes for performance:

OperationWhat it doesWhen to use
rewrite_data_files (compaction)Applies deletes and consolidates filesWhen query performance degrades due to many delete files

Footnotes

  1. Time travel available until expire_snapshots is called