Cloudflare Docs

Getting started

Overview

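This guide walks you through creating an R2 bucket, enabling its data catalog, creating an API token, and using PyIceberg in a marimo notebook to create an Iceberg table, append data to it, and read it back.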

Prerequisites

  1. Sign up for a Cloudflare account.
  2. Install Node.js.

Node.js version manager

Use a Node version manager like Volta or nvm to avoid permission issues and change Node.js versions. Wrangler, discussed later in this guide, requires a Node version of 16.17.0 or later.
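To confirm that an installed Node.js meets the 16.17.0 minimum before running Wrangler, a small check like the one below may help. This is a minimal sketch, not part of Wrangler: the function names are hypothetical, and it assumes `node --version` prints a string such as `v18.17.1`.

```python
import re
import shutil
import subprocess

MINIMUM_NODE = (16, 17, 0)  # minimum version stated in this guide


def parse_node_version(text):
    """Turn a string such as 'v18.17.1' into a tuple such as (18, 17, 1)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"Unrecognized Node.js version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text, minimum=MINIMUM_NODE):
    """Return True when the version string is at least the minimum."""
    return parse_node_version(text) >= minimum


def installed_node_version():
    """Return the output of `node --version`, or None if node is not on PATH."""
    node = shutil.which("node")
    if node is None:
        return None
    result = subprocess.run(
        [node, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Calling `meets_minimum(installed_node_version())` after checking that the version is not `None` tells you whether to switch versions with Volta or nvm first.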

1. Create an R2 bucket

  1. If not already logged in, run:

    npx wrangler login
  2. Create an R2 bucket:

    npx wrangler r2 bucket create r2-data-catalog-tutorial

2. Enable the data catalog for your bucket

Enable the catalog on the R2 bucket you created:

npx wrangler r2 bucket catalog enable r2-data-catalog-tutorial

Note the Warehouse and Catalog URI values shown in the command output. You will need them in step 6.

3. Create an API token

Iceberg clients (including PyIceberg) must authenticate to the catalog with a Cloudflare API token that has both R2 and catalog permissions.

  1. From the Cloudflare dashboard, select R2 Object Storage from the sidebar.

  2. Expand the API dropdown and select Manage API tokens.

  3. Select Create API token.

  4. Select the R2 Token text to edit your API token name.

  5. Under Permissions, choose the Admin Read & Write permission.

  6. Select Create API Token.

  7. Note the Token value.

4. Install uv

This guide uses uv as the Python package manager. If you do not already have uv installed, follow the installing uv guide.

5. Install marimo

This guide uses marimo as the Python notebook.

  1. Create a directory to store the notebook:

    mkdir r2-data-catalog-notebook
  2. Change into the new directory:

    cd r2-data-catalog-notebook
  3. Create a new Python virtual environment:

    uv venv
  4. Activate the Python virtual environment:

    source .venv/bin/activate
  5. Install marimo, along with the libraries the notebook imports, with uv:

    uv pip install marimo pyiceberg pyarrow pandas
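
The notebook in the next step imports marimo, pandas, pyarrow, and pyiceberg. A minimal sketch for checking that the active virtual environment can find all of them is shown below. The helper name `missing_modules` is hypothetical and uses only the Python standard library.

```python
import importlib.util

# Top-level modules imported by the notebook in this guide.
NOTEBOOK_MODULES = ("marimo", "pandas", "pyarrow", "pyiceberg")


def missing_modules(names=NOTEBOOK_MODULES):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

If `missing_modules()` returns a non-empty list, installing those packages with `uv pip install` before opening the notebook should avoid import errors in the first cells.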

6. Create a Python notebook to interact with the data warehouse

  1. Create a file called r2-data-catalog-tutorial.py.

  2. Paste the following code snippet into your r2-data-catalog-tutorial.py file:

    import marimo

    __generated_with = "0.11.31"
    app = marimo.App(width="medium")

    @app.cell
    def _():
        import marimo as mo
        return (mo,)

    @app.cell
    def _():
        import pandas
        import pyarrow as pa
        import pyarrow.compute as pc
        import pyarrow.parquet as pq
        from pyiceberg.catalog.rest import RestCatalog
        from pyiceberg.exceptions import NamespaceAlreadyExistsError

        # Define catalog connection details (replace variables)
        WAREHOUSE = "<WAREHOUSE>"
        TOKEN = "<TOKEN>"
        CATALOG_URI = "<CATALOG_URI>"

        # Connect to R2 Data Catalog
        catalog = RestCatalog(
            name="my_catalog",
            warehouse=WAREHOUSE,
            uri=CATALOG_URI,
            token=TOKEN,
        )
        return (
            CATALOG_URI,
            NamespaceAlreadyExistsError,
            RestCatalog,
            TOKEN,
            WAREHOUSE,
            catalog,
            pa,
            pandas,
            pc,
            pq,
        )

    @app.cell
    def _(NamespaceAlreadyExistsError, catalog):
        # Create default namespace if needed
        try:
            catalog.create_namespace("default")
        except NamespaceAlreadyExistsError:
            pass
        return

    @app.cell
    def _(pa):
        # Create simple PyArrow table
        df = pa.table({
            "id": [1, 2, 3],
            "name": ["Alice", "Bob", "Charlie"],
            "score": [80.0, 92.5, 88.0],
        })
        return (df,)

    @app.cell
    def _(catalog, df):
        # Create or load Iceberg table
        test_table = ("default", "people")
        if not catalog.table_exists(test_table):
            print(f"Creating table: {test_table}")
            table = catalog.create_table(
                test_table,
                schema=df.schema,
            )
        else:
            table = catalog.load_table(test_table)
        return table, test_table

    @app.cell
    def _(df, table):
        # Append data
        table.append(df)
        return

    @app.cell
    def _(table):
        print("Table contents:")
        scanned = table.scan().to_arrow()
        print(scanned.to_pandas())
        return (scanned,)

    @app.cell
    def _():
        # Optional cleanup. To run uncomment and run cell
        # print(f"Deleting table: {test_table}")
        # catalog.drop_table(test_table)
        # print("Table dropped.")
        return

    if __name__ == "__main__":
        app.run()
  3. Replace the CATALOG_URI and WAREHOUSE placeholders with the values from step 2, and the TOKEN placeholder with the token value from step 3.
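
Pasting the token directly into the notebook file makes it easy to leak through version control. One alternative, sketched below under assumptions, is to read the three connection values from environment variables. The variable names (`R2_CATALOG_WAREHOUSE`, `R2_CATALOG_URI`, `R2_CATALOG_TOKEN`) and the helper `catalog_settings` are hypothetical choices for this example, not names that Cloudflare or PyIceberg define.

```python
import os

# Hypothetical environment variable names chosen for this sketch.
ENV_NAMES = {
    "warehouse": "R2_CATALOG_WAREHOUSE",
    "uri": "R2_CATALOG_URI",
    "token": "R2_CATALOG_TOKEN",
}


def catalog_settings(environ=None):
    """Collect the catalog connection settings from environment variables.

    Raises KeyError listing every variable that is unset or empty.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in ENV_NAMES.values() if not environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[name] for key, name in ENV_NAMES.items()}
```

The returned keys match the keyword arguments the notebook already passes to `RestCatalog`, so the connection cell could use `RestCatalog(name="my_catalog", **catalog_settings())` in place of the hardcoded placeholders.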

In the Python notebook above, you:

  1. Connect to your catalog.
  2. Create the default namespace.
  3. Create a simple PyArrow table.
  4. Create (or load) the people table in the default namespace.
  5. Append sample data to the table.
  6. Print the contents of the table.
  7. (Optional) Drop the people table you created for this tutorial.
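
The notebook prints every row with `table.scan()`. As a possible next step, PyIceberg's `scan` also accepts a `row_filter` expression and a `selected_fields` tuple, so only matching rows and columns are read. The helper below is a hypothetical sketch that assumes `table` is the people table loaded in the notebook, with the `name` and `score` columns defined there.

```python
def scan_top_scores(table, minimum_score):
    """Return names and scores for rows whose score exceeds minimum_score.

    `table` is expected to be the PyIceberg table loaded in the notebook.
    """
    # row_filter takes a filter expression string; selected_fields limits
    # the columns that are read back.
    scan = table.scan(
        row_filter=f"score > {float(minimum_score)}",
        selected_fields=("name", "score"),
    )
    return scan.to_arrow()
```

In a new notebook cell that receives `table`, calling `scan_top_scores(table, 85).to_pandas()` would show only the higher-scoring rows. Because the append cell adds the sample rows on every run, the result may contain duplicates after repeated runs.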
