Yavor is a PM at Snowflake working on developer experience. Previously at Docker, Auth0, Hulu, and Microsoft Azure.
11 March 2025
As part of Snowpark Container Services (SPCS), Snowflake offers a Jobs concept, which is well-suited for some important workloads:
A Job enables you to run a containerized (Docker) workload on various CPU/GPU configurations, process Snowflake data via SQL or block storage, and shut down and release resources when the Job is done.
Over the next few quarters, we’ll be investing in multiple usability improvements to make the Jobs model an even better fit for the above scenarios.
This week we will talk about asynchronous processing support in Jobs - the ability to run multiple jobs in parallel.
This tutorial gives a broad overview of how to use Jobs for a simple task such as processing data in a table. The Job is executed synchronously by the EXECUTE JOB SERVICE statement, meaning the SQL query will not return until the Job is complete.
With longer-running Jobs, this starts to present some challenges: the EXECUTE statement continues to run and incur charges while the Job is running, without delivering value to the user, since the session is blocked.
To motivate this further, let’s introduce an ML customer scenario: doing text analytics on table data. The code for this sample is available here. We have developed a container that supports a variety of text analysis tasks, including summarization and sentiment analysis.
Here is an example of a summarization Job running over a table containing Google reviews.
EXECUTE JOB SERVICE IN COMPUTE POOL CPU_S
  NAME = summarization_job_sync
  FROM SPECIFICATION $$
spec:
  containers:
  - name: main
    image: REGISTRY/REPO/TEXT_ANALYSIS_IMAGE:TEXT_ANALYSIS_IMAGE_VERSION
    env:
      SNOWFLAKE_WAREHOUSE: XSMALL
    args:
    - "--task=summarization"
    - "--source_table=google_reviews.google_reviews.sample_reviews"
    - "--source_id_column=REVIEW_HASH"
    - "--source_value_column=REVIEW_TEXT"
    - "--result_table=results"
$$;
This task executes synchronously and completes in about 3 minutes while processing 100 rows of data on an XS Compute Pool.
Let’s now modify this to use the optional ASYNC property on the Job to trigger asynchronous execution. In fact, let’s run two copies of the same Job, with one configured to do summarization and the other to do sentiment analysis:
EXECUTE JOB SERVICE IN COMPUTE POOL CPU_S
  NAME = summarization_job_async
  ASYNC = TRUE
  FROM SPECIFICATION $$
spec:
  containers:
  - name: main
    image: REGISTRY/REPO/TEXT_ANALYSIS_IMAGE:TEXT_ANALYSIS_IMAGE_VERSION
    env:
      SNOWFLAKE_WAREHOUSE: XSMALL
    args:
    - "--task=summarization"
    - "--source_table=google_reviews.google_reviews.sample_reviews"
    - "--source_id_column=REVIEW_HASH"
    - "--source_value_column=REVIEW_TEXT"
    - "--result_table=results"
$$;
EXECUTE JOB SERVICE IN COMPUTE POOL CPU_S
  NAME = sentiment_job_async
  ASYNC = TRUE
  FROM SPECIFICATION $$
spec:
  containers:
  - name: main
    image: REGISTRY/REPO/TEXT_ANALYSIS_IMAGE:TEXT_ANALYSIS_IMAGE_VERSION
    env:
      SNOWFLAKE_WAREHOUSE: XSMALL
    args:
    - "--task=sentiment"
    - "--source_table=google_reviews.google_reviews.sample_reviews"
    - "--source_id_column=REVIEW_HASH"
    - "--source_value_column=REVIEW_TEXT"
    - "--result_table=results"
$$;
The two Jobs will now run asynchronously in parallel, with each one spinning up its own instance of the container to process its data. The above query now completes in about 11 seconds, and the Jobs complete asynchronously in the background.
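While the async Jobs are running, the session is free to do other work. Since each Job is a named service object, you can also peek at its progress with the regular service commands; the check below is not part of the sample, and the exact output columns may vary by release:
-- List services (including Job services) running on the compute pool.
SHOW SERVICES IN COMPUTE POOL CPU_S;
-- Inspect a single Job's current status and spec.
DESCRIBE SERVICE summarization_job_async;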
Now that we are able to run Jobs in parallel, we need to consider how to orchestrate SQL workflows around those Jobs. For example, you may wish to run some DML queries to work with the results of parallel Jobs. To wait for the completion of all Jobs, SPCS provides the job!spcs_wait_for()
function, which can be invoked in the above context as follows:
CALL summarization_job_async!spcs_wait_for('DONE', 600);
CALL sentiment_job_async!spcs_wait_for('DONE', 600);
The spcs_wait_for function takes the Job status to wait for as its first argument and the maximum timeout (in seconds) as its second. To wait for multiple Jobs, invoke the function sequentially, once for each Job in the parallel set. Using this function to wait for a single Job works, but at that point you might as well use a synchronous Job. The power here is scaling out your workload horizontally by running many concurrent Jobs, without losing the ability to detect the overall completion of the work.
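As a quick illustration, once both waits return, any downstream SQL can safely consume the output the Jobs wrote to the results table named in the Job args, for example:
-- Both Jobs have reported DONE, so the results table is fully populated.
SELECT COUNT(*) AS processed_rows FROM results;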
Once a Job executes, the metadata about the execution is retained for 7 days. This is valuable if you kick off long-running Jobs and need to check back on the outcome of a run on a subsequent day. You can access the execution history via the get_job_history()
table function as shown:
SELECT
  name,
  status,
  created_time,
  completed_time,
  DATEDIFF('second', created_time, completed_time) AS duration_seconds
FROM TABLE(SNOWFLAKE.SPCS.GET_JOB_HISTORY(RESULT_LIMIT=>5));
See the documentation linked above for the full list of fields available as part of the history.
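As one practical example, you can filter the history to surface failed runs for follow-up. The query below assumes failed runs are reported with a status value of 'FAILED'; check the documentation for the exact status values:
-- Surface recently failed Jobs for troubleshooting.
SELECT name, status, created_time, completed_time
FROM TABLE(SNOWFLAKE.SPCS.GET_JOB_HISTORY(RESULT_LIMIT=>50))
WHERE status = 'FAILED'
ORDER BY created_time DESC;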
Pairing with the history view above, you will likely also want to access the logs generated by the Job after it completes. This is especially useful for troubleshooting a Job that failed unexpectedly. By default, all logs emitted to stdout
and stderr
by the container are stored in your Snowflake account-level Event Table and can easily be retrieved by querying the table directly (as shown at the link) or by using the job!spcs_get_logs()
convenience function like so:
SELECT * FROM TABLE(summarization_job_sync!spcs_get_logs());
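If you prefer to query the Event Table directly (for example, to filter by time range or log level), the query takes roughly the shape sketched below. Here my_event_table is a placeholder for whatever event table your account has configured, and the attribute names follow the documented SPCS logging schema:
-- Sketch of a direct Event Table query; my_event_table is a placeholder.
SELECT timestamp, value AS log_line
FROM my_event_table
WHERE record_type = 'LOG'
  AND resource_attributes:"snow.service.name" = 'SUMMARIZATION_JOB_SYNC'
ORDER BY timestamp;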
Over the next few quarters, we will be detailing further improvements to Jobs such as batch processing, execution history, and scheduling, among others.
The code for this sample is available here if you’d like to try it out.