Deleting data

Deleting data from R2 Data Catalog or any Apache Iceberg catalog requires that operations are done in a transaction through the catalog itself. Manually deleting metadata or data files directly can lead to data catalog corruption.

Automatic table maintenance

R2 Data Catalog can automatically manage table maintenance operations such as snapshot expiration and compaction. These continuous operations help keep latency and storage costs down.

Snapshot expiration: Automatically removes old snapshots. This reduces metadata overhead. Data files are not removed until orphan file removal is run.
Compaction: Merges small data files into larger ones. This optimizes read performance and reduces the number of files read during queries.

Without enabling automatic maintenance, you need to manually handle these operations.

Learn more in the table maintenance documentation.

Examples of enabling automatic table maintenance in R2 Data Catalog

# Enable automatic snapshot expiration for entire catalog
npx wrangler r2 bucket catalog snapshot-expiration enable my-bucket \
  --older-than-days 30 \
  --retain-last 5

# Enable automatic compaction for entire catalog
npx wrangler r2 bucket catalog compaction enable my-bucket \
  --target-size 256

Refer to additional examples in the manage catalogs documentation.

Manually deleting and removing data

You need to manually delete data for:

Complying with data retention policies such as GDPR or CCPA.
Selective based deletes using conditional logic.
Removing stale or unreferenced files that R2 Data Catalog does not manage.

The following are basic examples using PySpark but similar operations can be performed using other Iceberg-compatible engines. To configure PySpark, refer to our example or the official PySpark documentation ↗.

Deleting rows from a table

# Creates new snapshots and marks old files for cleanup
spark.sql("""
  DELETE FROM r2dc.namespace.table_name
  WHERE column_name = 'value'
""")

# The following is effectively a TRUNCATE operation
spark.sql("DELETE FROM r2dc.namespace.table_name")

# For large deletes, use partitioned tables and delete entire partitions for faster performance:
spark.sql("""
    DELETE FROM r2dc.namespace.table_name
    WHERE date_partition < '2024-01-01'
""")

Dropping tables and namespaces

# Removes table from catalog but keeps data files in R2 storage
spark.sql("DROP TABLE r2dc.namespace.table_name")

# ⚠️  DANGER: Permanently deletes all data files from R2
# This operation cannot be undone
spark.sql("DROP TABLE r2dc.namespace.table_name PURGE")

# Use CASCADE to drop all tables within the namespace
spark.sql("DROP NAMESPACE r2dc.namespace_name CASCADE")

# You will need to PURGE the tables before running CASCADE to permanently delete data files
# This can be done with a loop over all tables in the namespace
tables = spark.sql("SHOW TABLES IN r2dc.namespace_name").collect()
for row in tables:
  table_name = row['tableName']
  spark.sql(f"DROP TABLE r2dc.namespace_name.{table_name} PURGE")
spark.sql("DROP NAMESPACE r2dc.namespace_name CASCADE")

Manual maintenance operations

# Remove old metadata and data files marked for deletion
# The following retains the last 5 snapshots and deletes files older than Nov 28, 2024
spark.sql("""
  CALL r2dc.system.expire_snapshots(
    table => 'r2dc.namespace_name.table_name',
    older_than => TIMESTAMP '2024-11-28 00:00:00',
     retain_last => 5
  )
""")

# Removes unreferenced data files from R2 storage (orphan files)
spark.sql("""
  CALL r2dc.system.remove_orphan_files(
    table => 'namespace.table_name'
  )
""")

# Rewrite data files with a target file size (e.g., 512 MB)
spark.sql("""
  CALL r2dc.system.rewrite_data_files(
    table => 'r2dc.namespace_name.table_name',
    options => map('target-file-size-bytes', '536870912')
  )
""")

About Apache Iceberg metadata

Apache Iceberg uses a layered metadata structure to manage table data efficiently. Here are the key components and file structure:

metadata.json: Top-level JSON file pointing to the current snapshot
snapshot-*: Immutable table state for a given point in time
manifest-list-*.avro: An Avro file listing all manifest files for a given snapshot
manifest-file-*.avro: An Avro file tracking data files and their statistics
data-*.parquet: Parquet files containing actual table data
Note: Unchanged manifest files are reused across snapshots

Directorymetadata.json Metadata File - Points to current snapshot
- Table Schema
- Partition Spec
- Sort Order
- DirectorySnapshots
  - Directorysnapshot-3051729675574597004.avro Snapshot 1 (Historical)
    Directorymanifest-list-abc123.avro Manifest List
    Directorymanifest-file-001.avro Manifest File
    data-00001.parquet (10 MB, 50K rows)
    data-00002.parquet (12 MB, 60K rows)
    data-00003.parquet (11 MB, 55K rows)
    Directorymanifest-file-002.avro
    data-00004.parquet (9 MB, 45K rows)
    data-00005.parquet (10 MB, 50K rows)
  - Directorysnapshot-3051729675574597005.avro Snapshot 2 (Current)
    Directorymanifest-list-def456.avro Manifest List
    Directorymanifest-file-001.avro (reused from Snapshot 1)
    data-00001.parquet
    data-00002.parquet
    data-00003.parquet
    Directorymanifest-file-003.avro (new)
    data-00006.parquet (11 MB, 53K rows)
    data-00007.parquet (10 MB, 51K rows)
    data-00008.parquet (12 MB, 58K rows)

What happens during deletion

Apache Iceberg supports two deletion modes: Copy-on-Write (COW) and Merge-on-Read (MOR). Both create a new snapshot and mark old files for cleanup, but handle the deletion differently:

Aspect	Copy-on-Write (COW)	Merge-on-Read (MOR)
How deletes work	Rewrites data files without deleted rows	Creates delete files marking rows to skip
Query performance	Fast (no merge needed)	Slower (requires read-time merge)
Write performance	Slower (rewrites data files)	Fast (only writes delete markers)
Storage impact	Creates new data files immediately	Accumulates delete files over time
Maintenance needs	Snapshot expiration	Snapshot expiration + compaction (`rewrite_data_files`)
Best for	Read-heavy workloads	Write-heavy workloads with frequent small mutations

Common deletion operations

These operations work the same way for both COW and MOR tables:

Operation	What it does	Data deleted?	Reversible?
`DELETE FROM`	Removes rows matching condition	No (marked for cleanup)	Via time travel¹
`DROP TABLE`	Removes table from catalog	No	Yes (if data files exist)
`DROP TABLE ... PURGE`	Removes table and deletes data	Yes	No
`expire_snapshots`	Cleans up old snapshots/files	Yes	No
`remove_orphan_files`	Removes unreferenced files	Yes	No

MOR-specific operations

For Merge-on-Read tables, you may need to manually apply deletes for performance:

Operation	What it does	When to use
`rewrite_data_files` (compaction)	Applies deletes and consolidates files	When query performance degrades due to many delete files

Table maintenance - Learn about automatic maintenance operations
R2 Data Catalog - Overview and getting started guide
Query data - Query tables with R2 SQL
Apache Iceberg Maintenance ↗ - Official Iceberg documentation on table maintenance

Time travel available until expire_snapshots is called ↩