Yavor is a PM at Snowflake working on developer experience. Previously at Docker, Auth0, Hulu, and Microsoft Azure.
11 March 2025
As part of Snowpark Container Services (SPCS), Snowflake offers a Jobs concept, which is well-suited for some important workloads:
A Job enables you to run a containerized (Docker) workload on various CPU/GPU configurations, work with Snowflake data via SQL or block storage to process data, and shut down and release resources when the Job is done.
Over the next few quarters, we’ll be investing in multiple usability improvements to make the Jobs model an even better fit for the above scenarios.
This week we will talk about asynchronous processing support in Jobs: the ability to run multiple Jobs in parallel.
This tutorial gives a broad overview of how to use Jobs for a simple task such as processing data in a table. The Job is executed synchronously by the EXECUTE JOB SERVICE statement, meaning the SQL statement will not return until the Job is complete.
With longer-running Jobs, this starts to present some challenges: the EXECUTE JOB SERVICE statement continues to run and incur charges while the Job is running, without delivering value to the user, since the session is blocked.
To motivate this further, let’s introduce an ML customer scenario: doing text analytics on table data. The code for this sample is available here. We have developed a container that supports a variety of text analysis tasks, including summarization and sentiment analysis.
Here is an example of a summarization Job running over a table containing Google reviews.
EXECUTE JOB SERVICE
  IN COMPUTE POOL CPU_S
  NAME=summarization_job_sync
  FROM SPECIFICATION $$
spec:
  containers:
  - name: main
    image: REGISTRY/REPO/TEXT_ANALYSIS_IMAGE:TEXT_ANALYSIS_IMAGE_VERSION
    env:
      SNOWFLAKE_WAREHOUSE: XSMALL
    args:
    - "--task=summarization"
    - "--source_table=google_reviews.google_reviews.sample_reviews"
    - "--source_id_column=REVIEW_HASH"
    - "--source_value_column=REVIEW_TEXT"
    - "--result_table=results"
$$;
This Job executes synchronously and completes in about 3 minutes while processing 100 rows of data on an XS Compute Pool.
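Once the Job finishes, you can inspect the output of its container with the SYSTEM$GET_SERVICE_LOGS function. A minimal sketch, assuming the job name and container name from the example above:

```sql
-- Fetch up to 100 trailing log lines from instance 0 of the "main"
-- container of the summarization job.
SELECT SYSTEM$GET_SERVICE_LOGS('summarization_job_sync', 0, 'main', 100);
```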
Let’s now modify this to add the optional ASYNC property to the Job to trigger asynchronous execution. In fact, let’s run two copies of the same Job, with one configured to do summarization and the other to do sentiment analysis:
EXECUTE JOB SERVICE
  IN COMPUTE POOL CPU_S
  NAME=summarization_job_async
  ASYNC=TRUE
  FROM SPECIFICATION $$
spec:
  containers:
  - name: main
    image: REGISTRY/REPO/TEXT_ANALYSIS_IMAGE:TEXT_ANALYSIS_IMAGE_VERSION
    env:
      SNOWFLAKE_WAREHOUSE: XSMALL
    args:
    - "--task=summarization"
    - "--source_table=google_reviews.google_reviews.sample_reviews"
    - "--source_id_column=REVIEW_HASH"
    - "--source_value_column=REVIEW_TEXT"
    - "--result_table=results"
$$;
EXECUTE JOB SERVICE
  IN COMPUTE POOL CPU_S
  NAME=sentiment_job_async
  ASYNC=TRUE
  FROM SPECIFICATION $$
spec:
  containers:
  - name: main
    image: REGISTRY/REPO/TEXT_ANALYSIS_IMAGE:TEXT_ANALYSIS_IMAGE_VERSION
    env:
      SNOWFLAKE_WAREHOUSE: XSMALL
    args:
    - "--task=sentiment"
    - "--source_table=google_reviews.google_reviews.sample_reviews"
    - "--source_id_column=REVIEW_HASH"
    - "--source_value_column=REVIEW_TEXT"
    - "--result_table=results"
$$;
The two Jobs now run asynchronously in parallel, each spinning up its own instance of the container to process its data. The statements above return in about 11 seconds, and the Jobs complete in the background.
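Because the statements return before the Jobs finish, you will typically want to poll their progress. A minimal sketch using the SYSTEM$GET_SERVICE_STATUS function, assuming the job names from the examples above:

```sql
-- Each call returns a JSON description of the job's containers,
-- including a status field (e.g. PENDING, READY, DONE, FAILED).
SELECT SYSTEM$GET_SERVICE_STATUS('summarization_job_async');
SELECT SYSTEM$GET_SERVICE_STATUS('sentiment_job_async');
```

You can re-run these queries until both Jobs report a terminal status, then read the results table populated by each container.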
Over the next few quarters, we will be detailing further improvements to Jobs such as batch processing, execution history, and scheduling, among others.
The code for this sample is available here if you’d like to try it out.