Getting started

Overview

This guide walks you through:

Creating your first R2 bucket and enabling its data catalog.
Creating an API token needed for query engines to authenticate with your data catalog.
Using PyIceberg to create your first Iceberg table in a marimo Python notebook.
Using PyIceberg to load sample data into your table and query it.

Prerequisites

Sign up for a Cloudflare account.
Install Node.js.

Node.js version manager

Use a Node version manager like Volta or nvm to avoid permission issues and to switch between Node.js versions. Wrangler, used later in this guide, requires Node.js 16.17.0 or later.
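
As a rough illustration of the version requirement, the check below compares a version string, such as the output of node --version, against the 16.17.0 minimum. The helper names parse_version and meets_minimum are hypothetical and not part of Wrangler, and the sketch assumes plain MAJOR.MINOR.PATCH versions without pre-release suffixes.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v18.12.1' or '18.12.1' into (18, 12, 1)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def meets_minimum(installed: str, minimum: str = "16.17.0") -> bool:
    """Return True if the installed version is at least the minimum."""
    # Tuples compare element by element, so (16, 17, 0) > (16, 9, 9).
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # Example strings; in practice pass the output of `node --version`.
    for candidate in ("v16.16.0", "v16.17.0", "v20.11.1"):
        print(candidate, meets_minimum(candidate))
```

Comparing integer tuples rather than raw strings avoids the common mistake of treating "16.9.0" as newer than "16.17.0".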

1. Create an R2 bucket

If you are not already logged in to Wrangler, run:
```
npx wrangler login
```

Create an R2 bucket:
```
npx wrangler r2 bucket create r2-data-catalog-tutorial
```

2. Enable the data catalog for your bucket

Enable the catalog on your R2 bucket with Wrangler:
```
npx wrangler r2 bucket catalog enable r2-data-catalog-tutorial
```

Note the Warehouse and Catalog URI values for your bucket. You will need them in step 6.

3. Create an API token

Iceberg clients (including PyIceberg) must authenticate to the catalog with a Cloudflare API token that has both R2 and catalog permissions.

From the Cloudflare dashboard, select R2 Object Storage from the sidebar.
Expand the API dropdown and select Manage API tokens.
Select Create API token.
Select the R2 Token text to edit your API token name.
Under Permissions, choose the Admin Read & Write permission.
Select Create API Token.
Note the Token value. You will need it in step 6.
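
To show how the token from this step is used, the sketch below builds (but does not send) the configuration request that an Iceberg REST client issues when it connects: a GET to the catalog's v1/config endpoint with the token in an Authorization: Bearer header. The helper name build_config_request and the example URI, warehouse, and token values are placeholders for illustration only. PyIceberg performs the equivalent request for you.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_config_request(catalog_uri: str, warehouse: str, token: str) -> Request:
    """Build the GET /v1/config request described by the Iceberg REST catalog spec."""
    query = urlencode({"warehouse": warehouse})
    url = f"{catalog_uri.rstrip('/')}/v1/config?{query}"
    # The API token travels as a bearer token in the Authorization header.
    return Request(url, headers={"Authorization": f"Bearer {token}"}, method="GET")


if __name__ == "__main__":
    # Placeholder values only; substitute your own Catalog URI, Warehouse, and token.
    request = build_config_request(
        "https://catalog.example.com/my-account/my-bucket",
        "my-account_my-bucket",
        "example-token",
    )
    print(request.get_method(), request.full_url)
```

Because the token grants read and write access, treat it like a password and avoid committing it to source control.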

4. Install uv

This guide uses uv as its Python package manager. If you do not already have uv installed, follow the installing uv guide.

5. Install marimo

This guide uses marimo as its Python notebook.

Create a directory to store your notebook:
```
mkdir r2-data-catalog-notebook
```
Change into the new directory:
```
cd r2-data-catalog-notebook
```
Create a new Python virtual environment:
```
uv venv
```
Activate the Python virtual environment:
```
source .venv/bin/activate
```
Install marimo and the packages the notebook imports with uv:
```
uv pip install marimo pyarrow pyiceberg pandas
```
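
Before opening the notebook, it can help to confirm that the packages it imports are available in the active virtual environment. The following is a minimal sketch using only the standard library; the function name missing_modules is made up for this example.

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Top-level modules imported by the notebook in step 6.
    required = ["marimo", "pyiceberg", "pyarrow", "pandas"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All notebook dependencies are importable.")
```

Run it with the virtual environment activated so that it inspects the same interpreter the notebook will use.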

6. Create a Python notebook to interact with the data warehouse

Create a file called r2-data-catalog-tutorial.py.

Paste the following code snippet into your r2-data-catalog-tutorial.py file:

```
import marimo

__generated_with = "0.11.31"
app = marimo.App(width="medium")


@app.cell
def _():
    import marimo as mo
    return (mo,)


@app.cell
def _():
    import pandas
    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    from pyiceberg.catalog.rest import RestCatalog
    from pyiceberg.exceptions import NamespaceAlreadyExistsError

    # Define catalog connection details (replace variables)
    WAREHOUSE = "<WAREHOUSE>"
    TOKEN = "<TOKEN>"
    CATALOG_URI = "<CATALOG_URI>"

    # Connect to R2 Data Catalog
    catalog = RestCatalog(
        name="my_catalog",
        warehouse=WAREHOUSE,
        uri=CATALOG_URI,
        token=TOKEN,
    )
    return (
        CATALOG_URI,
        NamespaceAlreadyExistsError,
        RestCatalog,
        TOKEN,
        WAREHOUSE,
        catalog,
        pa,
        pandas,
        pc,
        pq,
    )


@app.cell
def _(NamespaceAlreadyExistsError, catalog):
    # Create default namespace if needed
    try:
        catalog.create_namespace("default")
    except NamespaceAlreadyExistsError:
        pass
    return


@app.cell
def _(pa):
    # Create simple PyArrow table
    df = pa.table({
        "id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
        "score": [80.0, 92.5, 88.0],
    })
    return (df,)


@app.cell
def _(catalog, df):
    # Create or load Iceberg table
    test_table = ("default", "people")
    if not catalog.table_exists(test_table):
        print(f"Creating table: {test_table}")
        table = catalog.create_table(
            test_table,
            schema=df.schema,
        )
    else:
        table = catalog.load_table(test_table)
    return table, test_table


@app.cell
def _(df, table):
    # Append data
    table.append(df)
    return


@app.cell
def _(table):
    print("Table contents:")
    scanned = table.scan().to_arrow()
    print(scanned.to_pandas())
    return (scanned,)


@app.cell
def _():
    # Optional cleanup. To run, uncomment the lines below and run the cell
    # print(f"Deleting table: {test_table}")
    # catalog.drop_table(test_table)
    # print("Table dropped.")
    return


if __name__ == "__main__":
    app.run()
```

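Replace the CATALOG_URI and WAREHOUSE values with the Catalog URI and Warehouse from step 2, and the TOKEN value with the API token from step 3. Then, with the virtual environment activated, open the notebook by running marimo edit r2-data-catalog-tutorial.py.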
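
If you would rather not paste the token into the notebook file, one option is to read the three values from environment variables. The sketch below is an assumption-laden example, not part of the tutorial notebook: the variable names R2_CATALOG_URI, R2_WAREHOUSE, and R2_CATALOG_TOKEN and the helper names are invented here, and only the RestCatalog call mirrors the notebook above.

```python
import os
from typing import Mapping

# Hypothetical environment variable names chosen for this sketch.
REQUIRED_VARS = ("R2_CATALOG_URI", "R2_WAREHOUSE", "R2_CATALOG_TOKEN")


def read_catalog_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect connection settings, rejecting missing or placeholder values."""
    settings = {}
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        # Catch values left as "<TOKEN>"-style placeholders.
        if not value or (value.startswith("<") and value.endswith(">")):
            raise ValueError(f"Set {name} to a real value before connecting")
        settings[name] = value
    return settings


def connect(settings: Mapping[str, str]):
    """Open the catalog with the same RestCatalog arguments as the notebook."""
    # Imported lazily so the settings check works without PyIceberg installed.
    from pyiceberg.catalog.rest import RestCatalog

    return RestCatalog(
        name="my_catalog",
        warehouse=settings["R2_WAREHOUSE"],
        uri=settings["R2_CATALOG_URI"],
        token=settings["R2_CATALOG_TOKEN"],
    )
```

In a notebook cell, catalog = connect(read_catalog_settings()) would then replace the hard-coded WAREHOUSE, TOKEN, and CATALOG_URI assignments.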

In the Python notebook above, you:

Connect to your catalog.
Create the default namespace.
Create a simple PyArrow table.
Create (or load) the people table in the default namespace.
Append sample data to the table.
Print the contents of the table.
(Optional) Drop the people table we created for this tutorial.
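
The notebook prints every row in the table. To query a subset instead, PyIceberg's table.scan() accepts a row_filter expression and a selected_fields tuple. The sketch below assumes the people table and its name and score columns from this tutorial; the function name scan_high_scores is hypothetical.

```python
def scan_high_scores(table, min_score: float):
    """Return the name and score columns for rows with score >= min_score.

    `table` is expected to be a PyIceberg table, such as the one returned by
    catalog.load_table(("default", "people")) in the notebook.
    """
    scan = table.scan(
        row_filter=f"score >= {min_score}",  # filter expression parsed by PyIceberg
        selected_fields=("name", "score"),   # read only the columns that are needed
    )
    # Materialize the matching rows as a PyArrow table.
    return scan.to_arrow()
```

In a new notebook cell, print(scan_high_scores(table, 85.0).to_pandas()) would show only the higher-scoring rows. Note that re-running the append cell adds the sample rows again, so results can contain duplicates.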

Learn more

Managing catalogs: Enable or disable R2 Data Catalog on your bucket, retrieve configuration details, and authenticate your Iceberg engine.

Connect to Iceberg engines: Find detailed setup instructions for Apache Spark and other common query engines.
