Date: 2008-08-03

Overview

This guide walks you through:

Creating your first R2 bucket and enabling its data catalog.

Creating an API token needed for query engines to authenticate with your data catalog.

Using PyIceberg to create your first Iceberg table in a marimo Python notebook.

Using PyIceberg to load sample data into your table and query it.

Prerequisites

Node.js version manager

Use a Node version manager like Volta or nvm to avoid permission issues and change Node.js versions. Wrangler, discussed later in this guide, requires a Node version of 16.17.0 or later.

1. Create an R2 bucket

Wrangler CLI

If not already logged in, run:

    npx wrangler login

Create an R2 bucket:

    npx wrangler r2 bucket create r2-data-catalog-tutorial

Dashboard

From the Cloudflare dashboard, select R2 Object Storage from the sidebar.

Select Create bucket.

Enter the bucket name: r2-data-catalog-tutorial

Select Create bucket.

2. Enable the data catalog for your bucket

Wrangler CLI

Then, enable the catalog on your chosen R2 bucket:

    npx wrangler r2 bucket catalog enable r2-data-catalog-tutorial

Dashboard

From the Cloudflare dashboard, select R2 Object Storage from the sidebar.

Select the bucket: r2-data-catalog-tutorial.

Switch to the Settings tab, scroll down to R2 Data Catalog, and select Enable.

Once enabled, note the Catalog URI and Warehouse name.

3. Create an API token

Iceberg clients (including PyIceberg) must authenticate to the catalog with a Cloudflare API token that has both R2 and catalog permissions.

From the Cloudflare dashboard, select R2 Object Storage from the sidebar.

Expand the API dropdown and select Manage API tokens.

Select Create API token.

Select the R2 Token text to edit your API token name.

Under Permissions, choose the Admin Read & Write permission.

Select Create API Token.

Note the Token value.

4. Install uv

You need a Python package manager. This guide uses uv. If you do not already have uv installed, follow the installing uv guide.

5. Install marimo

We will use marimo as a Python notebook.

Create a directory where our notebook will be stored:

    mkdir r2-data-catalog-notebook

Change into our new directory:

    cd r2-data-catalog-notebook

Create a new Python virtual environment:

    uv venv

Activate the Python virtual environment:

    source .venv/bin/activate

Install marimo with uv:

    uv pip install marimo
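The notebook in section 6 imports pandas, pyarrow, and pyiceberg in addition to marimo. As a minimal sketch, assuming you want to confirm these are importable in the active virtual environment before opening the notebook, the following check lists any that are missing. The helper name missing_packages and the NOTEBOOK_IMPORTS list are illustrative, not part of marimo or uv.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


# Top-level modules imported by the notebook in section 6.
NOTEBOOK_IMPORTS = ["marimo", "pandas", "pyarrow", "pyiceberg"]

missing = missing_packages(NOTEBOOK_IMPORTS)
if missing:
    print("Not installed yet:", ", ".join(missing))
else:
    print("All notebook imports are available.")
```

Any package reported as missing can likely be installed the same way as marimo, with uv pip install followed by the package name.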

6. Create a Python notebook to interact with the data warehouse

Create a file called r2-data-catalog-tutorial.py.

Paste the following code snippet into your r2-data-catalog-tutorial.py file:

    import marimo

    __generated_with = "0.11.31"
    app = marimo.App(width="medium")


    @app.cell
    def _():
        import marimo as mo
        return (mo,)


    @app.cell
    def _():
        import pandas
        import pyarrow as pa
        import pyarrow.compute as pc
        import pyarrow.parquet as pq

        from pyiceberg.catalog.rest import RestCatalog
        from pyiceberg.exceptions import NamespaceAlreadyExistsError

        # Define catalog connection details (replace variables)
        WAREHOUSE = "<WAREHOUSE>"
        TOKEN = "<TOKEN>"
        CATALOG_URI = "<CATALOG_URI>"

        # Connect to R2 Data Catalog
        catalog = RestCatalog(
            name="my_catalog",
            warehouse=WAREHOUSE,
            uri=CATALOG_URI,
            token=TOKEN,
        )
        return (
            CATALOG_URI,
            NamespaceAlreadyExistsError,
            RestCatalog,
            TOKEN,
            WAREHOUSE,
            catalog,
            pa,
            pandas,
            pc,
            pq,
        )


    @app.cell
    def _(NamespaceAlreadyExistsError, catalog):
        # Create default namespace if needed
        try:
            catalog.create_namespace("default")
        except NamespaceAlreadyExistsError:
            pass
        return


    @app.cell
    def _(pa):
        # Create simple PyArrow table
        df = pa.table({
            "id": [1, 2, 3],
            "name": ["Alice", "Bob", "Charlie"],
            "score": [80.0, 92.5, 88.0],
        })
        return (df,)


    @app.cell
    def _(catalog, df):
        # Create or load Iceberg table
        test_table = ("default", "people")
        if not catalog.table_exists(test_table):
            print(f"Creating table: {test_table}")
            table = catalog.create_table(
                test_table,
                schema=df.schema,
            )
        else:
            table = catalog.load_table(test_table)
        return table, test_table


    @app.cell
    def _(df, table):
        # Append data
        table.append(df)
        return


    @app.cell
    def _(table):
        print("Table contents:")
        scanned = table.scan().to_arrow()
        print(scanned.to_pandas())
        return (scanned,)


    @app.cell
    def _():
        # Optional cleanup. To run, uncomment the lines below and run the cell.
        # print(f"Deleting table: {test_table}")
        # catalog.drop_table(test_table)
        # print("Table dropped.")
        return


    if __name__ == "__main__":
        app.run()

Replace the CATALOG_URI and WAREHOUSE variables with your values from section 2, and the TOKEN variable with your value from section 3.
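Hardcoding the token in a notebook file makes it easy to commit by accident. As one possible alternative, sketched here under assumptions, the three values could be read from environment variables and rejected if they are unset or still hold a placeholder such as "<TOKEN>". The variable names R2_WAREHOUSE, R2_CATALOG_URI, and R2_TOKEN, the CatalogSettings class, and the load_catalog_settings helper are all hypothetical and not defined by Cloudflare or PyIceberg.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping


@dataclass(frozen=True)
class CatalogSettings:
    warehouse: str
    catalog_uri: str
    # Keep the secret out of the dataclass repr so it is not printed by accident.
    token: str = field(repr=False)


def load_catalog_settings(env: Mapping[str, str] = os.environ) -> CatalogSettings:
    """Read the three connection values, rejecting unset or placeholder ones."""
    values = {}
    # Hypothetical variable names chosen for this sketch.
    for key in ("R2_WAREHOUSE", "R2_CATALOG_URI", "R2_TOKEN"):
        value = env.get(key, "").strip()
        if not value or (value.startswith("<") and value.endswith(">")):
            raise ValueError(f"{key} is not set or still holds a placeholder")
        values[key] = value
    return CatalogSettings(
        warehouse=values["R2_WAREHOUSE"],
        catalog_uri=values["R2_CATALOG_URI"],
        token=values["R2_TOKEN"],
    )


# Example with made-up values; a real run would rely on os.environ instead.
example = load_catalog_settings({
    "R2_WAREHOUSE": "example-warehouse",
    "R2_CATALOG_URI": "https://catalog.example.invalid",
    "R2_TOKEN": "example-token",
})
print(example)
```

In the notebook, the fields of the returned object could then be passed to RestCatalog in place of the WAREHOUSE, CATALOG_URI, and TOKEN constants.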

In the Python notebook above, you:

Connect to your catalog.

Create the default namespace.

Create a simple PyArrow table.

Create (or load) the people table in the default namespace.

Append sample data to the table.

Print the contents of the table.

(Optional) Drop the people table we created for this tutorial.

Learn more

