Spark (Scala)
Below is an example of how you can build an Apache Spark application (with Scala) that connects to R2 Data Catalog. This application is built to run locally, but it can be adapted to run on a cluster.
- Sign up for a Cloudflare account.
- Create an R2 bucket and enable the data catalog.
- Create an R2 API token with both R2 and data catalog permissions.
- Install Java 17, Spark 3.5.3, and sbt 1.10.11.
  - Note: This example depends on these specific tool versions; other versions may not work.
  - Tip: SDKMAN is a convenient package manager for installing SDKs.
To start, create a new empty project directory somewhere on your machine.
Inside that directory, create the following file at src/main/scala/com/example/R2DataCatalogDemo.scala. This will serve as the main entry point for your Spark application.
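A minimal sketch of what this entry point could look like is shown here. It configures an Iceberg REST catalog on the Spark session, writes a small table, and reads it back. The catalog name (mydemo), the environment variable names (CATALOG_URI, WAREHOUSE, TOKEN), the namespace and table names, and the sample rows are all illustrative assumptions, not values required by R2 Data Catalog.

```scala
package com.example

import org.apache.spark.sql.SparkSession

object R2DataCatalogDemo {

  // Hypothetical catalog name; it becomes the prefix for Spark SQL identifiers.
  val CatalogName = "mydemo"

  /** Builds the Spark settings for an Iceberg REST catalog. */
  def catalogConf(uri: String, warehouse: String, token: String): Map[String, String] = {
    val prefix = s"spark.sql.catalog.$CatalogName"
    Map(
      "spark.sql.extensions" ->
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
      prefix -> "org.apache.iceberg.spark.SparkCatalog",
      s"$prefix.type" -> "rest",
      s"$prefix.uri" -> uri,
      s"$prefix.warehouse" -> warehouse,
      s"$prefix.token" -> token
    )
  }

  def main(args: Array[String]): Unit = {
    // Assumed variable names; sys.env(...) throws NoSuchElementException if one is unset.
    val uri = sys.env("CATALOG_URI")
    val warehouse = sys.env("WAREHOUSE")
    val token = sys.env("TOKEN")

    val builder = SparkSession.builder()
      .appName("R2 Data Catalog Demo")
      .master("local[*]") // run locally; drop this when submitting to a cluster

    // Apply every catalog setting to the session builder.
    val spark = catalogConf(uri, warehouse, token)
      .foldLeft(builder) { case (b, (key, value)) => b.config(key, value) }
      .getOrCreate()

    import spark.implicits._

    // Illustrative sample data.
    val data = Seq((1, "Alice", 25), (2, "Bob", 30), (3, "Charlie", 35))
      .toDF("id", "name", "age")

    // Create a namespace and write the rows as an Iceberg table.
    spark.sql(s"CREATE NAMESPACE IF NOT EXISTS $CatalogName.demo_namespace")
    data.writeTo(s"$CatalogName.demo_namespace.demo_table").createOrReplace()

    // Read the table back with a simple filter.
    spark.sql(s"SELECT * FROM $CatalogName.demo_namespace.demo_table WHERE age > 30").show()

    spark.stop()
  }
}
```

Keeping the catalog settings in a separate function makes them easy to inspect or reuse when adapting the job to a cluster.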
For building this application and managing dependencies, we will use sbt (“simple build tool”). The following is an example build.sbt file to place at the root of your project. It is configured to produce a "fat JAR", bundling all required dependencies.
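One possible build.sbt is sketched here. The project name and version are chosen so that the assembled JAR is named R2DataCatalogDemo-assembly-1.0.jar under target/scala-2.12. The Iceberg version and the merge strategy rules are assumptions; adjust them to whatever Iceberg release you have verified against Spark 3.5.

```scala
// build.sbt
name := "R2DataCatalogDemo"
version := "1.0"

// Spark 3.5.x is commonly distributed for Scala 2.12.
scalaVersion := "2.12.18"

val sparkVersion = "3.5.3"
// Assumed Iceberg release; use one that supports Spark 3.5.
val icebergVersion = "1.8.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.iceberg" % "iceberg-spark-runtime-3.5_2.12" % icebergVersion,
  "org.apache.iceberg" % "iceberg-aws-bundle" % icebergVersion
)

// Entry point recorded in the assembled JAR's manifest.
assembly / mainClass := Some("com.example.R2DataCatalogDemo")

// Resolve duplicate files that appear when many JARs are merged into one.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", _*) => MergeStrategy.concat
  case PathList("META-INF", _*)             => MergeStrategy.discard
  case "reference.conf"                     => MergeStrategy.concat
  case _                                    => MergeStrategy.first
}
```

Service loader files under META-INF/services are concatenated rather than discarded because Spark and Iceberg use them to discover implementations at runtime.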
To enable the sbt-assembly plugin (used to build fat JARs), add the following to a new file at project/assembly.sbt:
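A minimal sketch of that file is a single plugin declaration. The plugin version shown is an assumption; any sbt-assembly release compatible with your sbt version should work.

```scala
// project/assembly.sbt
// Plugin version is an assumption; check the sbt-assembly releases for a current one.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")
```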
Make sure Java, Spark, and sbt are installed and available in your shell. If you are using SDKMAN, you can install them as shown below:
With everything installed, you can now build the project by running sbt clean assembly from the project root. This will generate a single bundled JAR file.
After building, the output JAR should be located at target/scala-2.12/R2DataCatalogDemo-assembly-1.0.jar.
To run the application, you will use spark-submit. Below is an example shell script (submit.sh) that includes the necessary Java compatibility flags for Spark on Java 17:
Before running it, make sure the script is executable (for example, with chmod +x submit.sh):
At this point, your project directory should be structured like this:
- Makefile
- README.md
- build.sbt
- project/
  - assembly.sbt
  - build.properties
  - project/
- submit.sh
- src/
  - main/
    - scala/
      - com/
        - example/
          - R2DataCatalogDemo.scala
Before submitting the job, make sure you have set the required environment variables for your catalog URI, warehouse, and Cloudflare API token.
You are now ready to run the job:
