Build an end-to-end data pipeline
Date: 2025-09-25
Learn how to create an end-to-end data pipeline using Cloudflare Pipelines, R2 Data Catalog, and R2 SQL for real-time transaction analysis.
In this tutorial, you will learn how to build a complete data pipeline using Cloudflare Pipelines, R2 Data Catalog, and R2 SQL. It also includes a sample Python script that generates financial transaction data and sends it to your Pipeline, where it can be queried with R2 SQL or any Apache Iceberg-compatible query engine.
This tutorial demonstrates how to:
- Set up R2 Data Catalog to store your transaction events in an Apache Iceberg table
- Set up a Cloudflare Pipeline
- Create transaction data with fraud patterns to send to your Pipeline
- Query your data using R2 SQL for fraud analysis
- Sign up for a Cloudflare account.
- Install Node.js.
- Install Python 3.8+ for the data generation script.
You will need API tokens to interact with Cloudflare services.
- In the Cloudflare dashboard, go to the Account API tokens page.
- Select Create Token.
- Select Get started next to Create Custom Token.
- Enter a name for your API token.
- Under Permissions, choose:
  - Workers Pipelines with Read, Send, and Edit permissions
  - Workers R2 Data Catalog with Read and Edit permissions
  - Workers R2 SQL with Read permissions
  - Workers R2 Storage with Read and Edit permissions
- Optionally, add a TTL to this token.
- Select Continue to summary.
- Click Create Token.
- Note the Token value.
Export your new token as an environment variable:
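A script that needs the token can read it back from the environment instead of hard-coding it. The following is a minimal sketch; the variable name `WRANGLER_R2_SQL_AUTH_TOKEN` and the helper `load_api_token` are assumptions for illustration, so substitute whatever name you exported.

```python
import os

# Assumption: the token was exported under this name; adjust to match your shell.
TOKEN_ENV_VAR = "WRANGLER_R2_SQL_AUTH_TOKEN"


def load_api_token(env=os.environ, name=TOKEN_ENV_VAR):
    """Return the API token from the environment, or raise a clear error."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set {name} to your Cloudflare API token before continuing.")
    return token
```

Failing early with a clear message avoids confusing authentication errors later in the tutorial.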
If this is your first time using Wrangler, make sure to log in.
Create an R2 bucket:
- In the Cloudflare dashboard, go to the R2 object storage page.
- Select Create bucket.
- Enter the bucket name: fraud-pipeline
- Select Create bucket.
Enable the catalog on your R2 bucket:
When you run this command, take note of the "Warehouse" and "Catalog URI". You will need these later.
- In the Cloudflare dashboard, go to the R2 object storage page.
- Select the bucket: fraud-pipeline.
- Switch to the Settings tab, scroll down to R2 Data Catalog, and select Enable.
- Once enabled, note the Catalog URI and Warehouse name.
R2 Data Catalog can automatically compact tables for you. In production event streaming use cases, it is common to end up with many small files, so it is recommended to enable compaction. Since the tutorial only demonstrates a sample use case, this step is optional.
- In the Cloudflare dashboard, go to the R2 object storage page.
- Select the bucket: fraud-pipeline.
- Switch to the Settings tab, scroll down to R2 Data Catalog, select the edit icon, and select Enable.
- Choose a target file size or leave the default, then select Save.
First, create a schema file called raw_transactions_schema.json with the following JSON schema:
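As a hedged illustration of what such a schema file might contain, the sketch below writes one from Python. The `fields` layout, the field names, and the type strings (`string`, `float64`, `bool`) are all assumptions chosen to fit a transaction event with an amount and a fraud flag; check them against the Pipelines schema documentation before use.

```python
import json

# Assumed schema: field names and type strings are illustrative, not authoritative.
SCHEMA = {
    "fields": [
        {"name": "transaction_id", "type": "string", "required": True},
        {"name": "user_id", "type": "string", "required": True},
        {"name": "amount", "type": "float64", "required": True},
        {"name": "transaction_timestamp", "type": "string", "required": True},
        {"name": "location", "type": "string", "required": False},
        {"name": "merchant_category", "type": "string", "required": False},
        {"name": "is_fraud", "type": "bool", "required": True},
    ]
}


def write_schema(path="raw_transactions_schema.json", schema=SCHEMA):
    """Write the schema dictionary to disk as indented JSON and return the path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(schema, f, indent=2)
    return path
```

Calling `write_schema()` with no arguments creates the file in the current directory.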
Create a stream to receive incoming fraud detection events:
The output should look like this:
Create a sink that writes data to your R2 bucket as Apache Iceberg tables:
Connect your stream to your sink with SQL:
- In the Cloudflare dashboard, go to Pipelines > Pipelines.
- Select Create Pipeline.
- Connect to a Stream:
  - Pipeline name: raw_events
  - Enable HTTP endpoint for sending data: Enabled
  - HTTP authentication: Disabled (default)
  - Select Next
- Define Input Schema:
  - Select JSON editor
  - Copy in the schema:
  - Select Next
- Define Sink:
  - Select your R2 bucket: fraud-pipeline
  - Storage type: R2 Data Catalog
  - Namespace: fraud_detection
  - Table name: transactions
  - Advanced Settings: Change Maximum Time Interval to 30 seconds
  - Select Next
- Credentials:
  - Disable Automatically create an Account API token for your sink
  - Enter Catalog Token from step 1
  - Select Next
- Pipeline Definition:
  - Leave the default SQL query:
  - Select Create Pipeline
- After pipeline creation, note the Stream ID for the next step.
Create a Python script to generate realistic transaction data with fraud patterns:
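A minimal sketch of such a generator is shown below. It is not the tutorial's original script: it uses only the standard library, and the field names, the fraud pattern (flagged transactions skew toward large amounts), and the `STREAM_URL` placeholder are assumptions for illustration. Replace the placeholder with the HTTP ingest URL of your own stream.

```python
import json
import random
import uuid
from datetime import datetime, timezone
from urllib import request

# Assumption: placeholder URL; replace with your stream's HTTP ingest endpoint.
STREAM_URL = "https://<STREAM_ID>.ingest.cloudflare.com"

CATEGORIES = ["grocery", "electronics", "travel", "restaurant", "gas"]
LOCATIONS = ["US", "GB", "DE", "BR", "SG"]


def make_transaction(rng, fraud_rate=0.1):
    """Build one synthetic transaction; roughly fraud_rate of them are flagged."""
    is_fraud = rng.random() < fraud_rate
    # Illustrative fraud pattern: flagged transactions have larger amounts.
    amount = rng.uniform(500, 5000) if is_fraud else rng.uniform(1, 300)
    return {
        "transaction_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "user_id": f"user_{rng.randint(1, 1000)}",
        "amount": round(amount, 2),
        "transaction_timestamp": datetime.now(timezone.utc).isoformat(),
        "location": rng.choice(LOCATIONS),
        "merchant_category": rng.choice(CATEGORIES),
        "is_fraud": is_fraud,
    }


def make_batch(size, seed=None):
    """Generate a list of transactions; pass a seed for repeatable amounts and flags."""
    rng = random.Random(seed)
    return [make_transaction(rng) for _ in range(size)]


def send_batch(events, url=STREAM_URL):
    """POST the events as a JSON array and return the HTTP status code."""
    body = json.dumps(events).encode("utf-8")
    req = request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    batch = make_batch(100)
    flagged = sum(e["is_fraud"] for e in batch)
    print(f"generated {len(batch)} events, {flagged} flagged as fraud")
    # send_batch(batch)  # uncomment once STREAM_URL points at your stream
```

Keeping generation separate from sending lets you inspect a batch locally before any data reaches the stream.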
Install the required Python dependency and run the script:
Now you can analyze your fraud detection data using R2 SQL. Here are some example queries:
Create a new sink that will write the filtered data to a new Apache Iceberg table in R2 Data Catalog:
Now create a new SQL query that reads from the original raw_events_stream stream and writes only transactions that are flagged as fraud and have an amount over 1,000.
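To make the filter condition concrete, here is a small Python sketch of the same predicate applied to in-memory events. It only mirrors the logic for local checking; the field names `is_fraud` and `amount` and the helper `is_high_value_fraud` are assumptions, and the actual filtering is done by the pipeline's SQL.

```python
def is_high_value_fraud(event, threshold=1000):
    """True when the event is flagged as fraud and its amount is strictly over the threshold."""
    return bool(event.get("is_fraud")) and float(event.get("amount", 0)) > threshold


# Hypothetical sample events used only to exercise the predicate.
events = [
    {"transaction_id": "t1", "amount": 250.0, "is_fraud": True},
    {"transaction_id": "t2", "amount": 1800.0, "is_fraud": True},
    {"transaction_id": "t3", "amount": 4200.0, "is_fraud": False},
    {"transaction_id": "t4", "amount": 1000.0, "is_fraud": True},
]

filtered = [e["transaction_id"] for e in events if is_high_value_fraud(e)]
print(filtered)  # ['t2']
```

Only `t2` passes: `t1` is below the threshold, `t3` is not flagged, and `t4` is exactly 1,000 rather than over it.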
Query the table and check the results:
Also verify that the non-fraudulent events are being filtered out:
You should see the following output:
You have successfully built an end-to-end data pipeline using Cloudflare's data platform. Through this tutorial, you have learned to:
- Use R2 Data Catalog: Leverage Apache Iceberg tables for efficient data storage
- Set up Cloudflare Pipelines: Create streams, sinks, and pipelines for data ingestion
- Generate sample data: Create transaction data with some basic fraud patterns
- Query your tables with R2 SQL: Access raw and processed data tables stored in R2 Data Catalog