Deleting data
Date: 2025-12-18
Deleting data from R2 Data Catalog, or from any Apache Iceberg catalog, requires that every operation run as a transaction through the catalog itself. Manually deleting metadata or data files can corrupt the catalog.
More information can be found in the table maintenance and manage catalogs documentation.
The following are basic examples using PySpark, but similar operations can be performed with other Iceberg-compatible engines. To configure PySpark, refer to our example or the official PySpark documentation.
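As a minimal sketch of a delete that goes through the catalog rather than touching files, the helper below issues a `DELETE FROM` statement via a Spark session. The helper name `delete_rows` and the catalog, namespace, and table names in the example call are hypothetical; the sketch assumes `spark` is a SparkSession that is already configured for your Iceberg catalog.

```python
def delete_rows(spark, table: str, condition: str) -> str:
    """Run a row-level delete through the catalog and return the SQL used.

    `spark` is assumed to be a SparkSession already configured with an
    Iceberg catalog; only its `sql` method is used here.
    """
    statement = f"DELETE FROM {table} WHERE {condition}"
    # The engine commits this as a new snapshot through the catalog;
    # no metadata or data file is edited by hand.
    spark.sql(statement)
    return statement


# Example call (hypothetical catalog, namespace, and table names):
# delete_rows(spark, "r2dc.default.events", "event_date < DATE '2024-01-01'")
```

Returning the statement makes it easy to log exactly what was submitted.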
Apache Iceberg uses a layered metadata structure to manage table data efficiently. Here are the key components and file structure:
- metadata.json: Top-level JSON file pointing to the current snapshot
- snapshot-*: Immutable table state for a given point in time
- manifest-list-*.avro: An Avro file listing all manifest files for a given snapshot
- manifest-file-*.avro: An Avro file tracking data files and their statistics
- data-*.parquet: Parquet files containing actual table data
- Note: Unchanged manifest files are reused across snapshots
- metadata.json: Metadata file, points to current snapshot
  - Table Schema
  - Partition Spec
  - Sort Order
  - Snapshots
    - snapshot-3051729675574597004.avro: Snapshot 1 (Historical)
      - manifest-list-abc123.avro: Manifest List
        - manifest-file-001.avro: Manifest File
          - data-00001.parquet (10 MB, 50K rows)
          - data-00002.parquet (12 MB, 60K rows)
          - data-00003.parquet (11 MB, 55K rows)
        - manifest-file-002.avro
          - data-00004.parquet (9 MB, 45K rows)
          - data-00005.parquet (10 MB, 50K rows)
    - snapshot-3051729675574597005.avro: Snapshot 2 (Current)
      - manifest-list-def456.avro: Manifest List
        - manifest-file-001.avro (reused from Snapshot 1)
          - data-00001.parquet
          - data-00002.parquet
          - data-00003.parquet
        - manifest-file-003.avro (new)
          - data-00006.parquet (11 MB, 53K rows)
          - data-00007.parquet (10 MB, 51K rows)
          - data-00008.parquet (12 MB, 58K rows)
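The layering above can be modeled in a few lines of Python. This is an illustrative sketch only, not Iceberg's actual on-disk format or API: the class names are hypothetical, and the row counts are taken from the example tree. It shows how two snapshots can share one manifest while pointing at different sets of data files.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str
    rows: int


@dataclass(frozen=True)
class Manifest:
    path: str
    data_files: Tuple[DataFile, ...]


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: str
    manifests: Tuple[Manifest, ...]

    def data_files(self) -> List[DataFile]:
        # Walk manifest list -> manifests -> data files.
        return [f for m in self.manifests for f in m.data_files]

    def row_count(self) -> int:
        return sum(f.rows for f in self.data_files())


m1 = Manifest("manifest-file-001.avro", (
    DataFile("data-00001.parquet", 50_000),
    DataFile("data-00002.parquet", 60_000),
    DataFile("data-00003.parquet", 55_000),
))
m2 = Manifest("manifest-file-002.avro", (
    DataFile("data-00004.parquet", 45_000),
    DataFile("data-00005.parquet", 50_000),
))
m3 = Manifest("manifest-file-003.avro", (
    DataFile("data-00006.parquet", 53_000),
    DataFile("data-00007.parquet", 51_000),
    DataFile("data-00008.parquet", 58_000),
))

snapshot_1 = Snapshot(3051729675574597004, "manifest-list-abc123.avro", (m1, m2))
# Snapshot 2 reuses m1 unchanged and adds m3; m2 is no longer referenced.
snapshot_2 = Snapshot(3051729675574597005, "manifest-list-def456.avro", (m1, m3))

print(snapshot_1.row_count())  # 260000
print(snapshot_2.row_count())  # 327000
```

Because snapshots are immutable, the files listed only under Snapshot 1 stay on disk until that snapshot is expired.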
Apache Iceberg supports two deletion modes: Copy-on-Write (COW) and Merge-on-Read (MOR). Both create a new snapshot and mark old files for cleanup, but handle the deletion differently:
| Aspect | Copy-on-Write (COW) | Merge-on-Read (MOR) |
| --- | --- | --- |
| How deletes work | Rewrites data files without deleted rows | Creates delete files marking rows to skip |
| Query performance | Fast (no merge needed) | Slower (requires read-time merge) |
| Write performance | Slower (rewrites data files) | Fast (only writes delete markers) |
| Storage impact | Creates new data files immediately | Accumulates delete files over time |
| Maintenance needs | Snapshot expiration | Snapshot expiration + compaction (rewrite_data_files) |
| Best for | Read-heavy workloads | Write-heavy workloads with frequent small mutations |
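The difference between the two modes can be illustrated with a toy model in plain Python. This is a conceptual sketch, not how an Iceberg engine is implemented: data files are represented as lists of rows, and the MOR delete file is simplified to a set of (file index, row position) markers. All function names are hypothetical.

```python
def cow_delete(data_files, predicate):
    """Copy-on-Write: rewrite every affected file without the deleted rows."""
    new_files = []
    for rows in data_files:
        if any(predicate(r) for r in rows):
            # Affected file: write a replacement that omits matching rows.
            new_files.append([r for r in rows if not predicate(r)])
        else:
            # Untouched file: the new snapshot keeps pointing at it.
            new_files.append(rows)
    return new_files


def mor_delete(data_files, predicate):
    """Merge-on-Read: leave data files alone and record positions to skip."""
    delete_file = {
        (file_idx, pos)
        for file_idx, rows in enumerate(data_files)
        for pos, r in enumerate(rows)
        if predicate(r)
    }
    return data_files, delete_file


def mor_read(data_files, delete_file):
    """Readers merge data files with the delete markers at query time."""
    return [
        r
        for file_idx, rows in enumerate(data_files)
        for pos, r in enumerate(rows)
        if (file_idx, pos) not in delete_file
    ]


files = [[1, 2, 3], [4, 5, 6], [7, 9]]
is_even = lambda r: r % 2 == 0

cow_files = cow_delete(files, is_even)
mor_files, markers = mor_delete(files, is_even)

# Both modes expose the same rows to readers.
cow_rows = [r for rows in cow_files for r in rows]
mor_rows = mor_read(mor_files, markers)
```

In this model the COW write does more work up front (two files are rewritten), while the MOR write only records three markers and defers the filtering to every subsequent read.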
These operations work the same way for both COW and MOR tables:
| Operation | What it does | Data deleted? | Reversible? |
| --- | --- | --- | --- |
| DELETE FROM | Removes rows matching condition | No (marked for cleanup) | Via time travel¹ |
| DROP TABLE | Removes table from catalog | No | Yes (if data files exist) |
| DROP TABLE ... PURGE | Removes table and deletes data | Yes | No |
| expire_snapshots | Cleans up old snapshots/files | Yes | No |
| remove_orphan_files | Removes unreferenced files | Yes | No |
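The two cleanup procedures in the table are typically invoked from Spark SQL with `CALL`. The helpers below are a sketch that only builds the statements, following the Iceberg Spark procedure syntax; the helper names, the catalog name `r2dc`, and the table name are hypothetical, and you should confirm that your engine and catalog support each procedure before running it.

```python
def expire_snapshots_sql(catalog: str, table: str, older_than: str,
                         retain_last: int = 1) -> str:
    """Build a CALL statement that expires snapshots older than a timestamp."""
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{older_than}', "
        f"retain_last => {retain_last})"
    )


def remove_orphan_files_sql(catalog: str, table: str) -> str:
    """Build a CALL statement that removes files no snapshot references."""
    return f"CALL {catalog}.system.remove_orphan_files(table => '{table}')"


# Hypothetical catalog and table names, for illustration only.
expire_stmt = expire_snapshots_sql("r2dc", "default.events", "2025-01-01 00:00:00")
orphan_stmt = remove_orphan_files_sql("r2dc", "default.events")

# With a configured SparkSession these would be run as:
# spark.sql(expire_stmt)
# spark.sql(orphan_stmt)
```

Both operations physically delete files, so time travel to an expired snapshot is no longer possible afterwards.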
For Merge-on-Read tables, you may need to manually apply deletes for performance:
| Operation | What it does | When to use |
| --- | --- | --- |
| rewrite_data_files (compaction) | Applies deletes and consolidates files | When query performance degrades due to many delete files |
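A compaction call for a Merge-on-Read table might be built as follows. This is a sketch that assumes the Iceberg Spark `rewrite_data_files` procedure and its `delete-file-threshold` option are available in your engine version; the helper name, catalog name, and table name are hypothetical.

```python
from typing import Optional


def rewrite_data_files_sql(catalog: str, table: str,
                           delete_file_threshold: Optional[int] = None) -> str:
    """Build a CALL statement that compacts data files and applies deletes."""
    call = f"CALL {catalog}.system.rewrite_data_files(table => '{table}'"
    if delete_file_threshold is not None:
        # Rewrite any data file with at least this many associated deletes.
        call += (
            ", options => map('delete-file-threshold', "
            f"'{delete_file_threshold}')"
        )
    return call + ")"


# Hypothetical catalog and table names, for illustration only.
compact_stmt = rewrite_data_files_sql("r2dc", "default.events", delete_file_threshold=1)

# With a configured SparkSession this would be run as:
# spark.sql(compact_stmt)
```

A low threshold applies deletes aggressively at the cost of rewriting more files per run.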
