Triggering small ETL workloads

March 16, 2026

by R. Tyler Croy

deltalake

parquet

The deceptively simple trick for building cheaper data platforms is to process less data. Processing more data usually means more waste, so we seek to eliminate waste and overhead by processing data at most once and only when necessary. Except in cases where large merge operations must be done across data streams, most workloads can fit into a smaller package to perform their transformation in Python or Rust.

Adopting the medallion architecture usually means that the bronze/silver layers have ingestion and refinement work which lends itself well to event-driven linear data processing. With S3 Event Notifications, an AWS Lambda can be triggered to operate on data as it arrives rather than in periodic batches.

For example:

flowchart LR
    Input["Producer"]-- write parquet -->Storage["S3"]
    Storage-- event -->SQS
    SQS-- trigger -->Process["Lambda"]
    Process-- commit -->Delta["Delta"]

In this pipeline some producer, which could be another Lambda, Airbyte, or any Apache Parquet compatible data writer, creates data files in S3. Assuming that data must be processed, validated, or cleaned in some way, S3 Event Notifications can be configured to trigger a Lambda when each file arrives.

One key characteristic of the Lambda triggering process is that each .parquet file can be processed as it is produced rather than after a 24 hour period, or whatever a typical batch schedule might look like. Processing these files incrementally allows for massively parallel processing using the built-in scaling of AWS Lambda.

When building with Apache Spark you might find yourself configuring a 10 node cluster with 8 vCPUs per node, i.e. 80 total vCPUs. Whether you need that much for any given job execution is not knowable ahead of time, which means every time you launch that cluster you're paying for all 80 vCPUs. Launching a Spark cluster is not a cheap operation in itself, so there are also overhead costs simply to stand the cluster up!

The equivalent Lambda architecture would only run 80 vCPUs' worth of compute if there were enough files to process; otherwise everything scales down to zero. An additional benefit of the Lambda-based approach is that most accounts have a default quota of 1,000 concurrent Lambda executions, which means your data pipeline could theoretically process a thousand input files concurrently, finishing as fast as possible and delivering the necessary data further down the pipeline.

We find the data processing Lambda much easier to reason about and test because it focuses on processing a single parquet file at a time inside of a loop.
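As a minimal sketch of such a handler, assuming the `super-data-lake` bucket from the Terraform below and a hypothetical silver table path (the actual validation logic is elided):

```python
import json


def object_keys(event):
    """Extract S3 object keys from an SQS event whose messages are
    SNS-wrapped S3 Event Notifications."""
    keys = []
    for record in event.get("Records", []):
        envelope = json.loads(record["body"])        # SQS body carries the SNS envelope
        s3_event = json.loads(envelope["Message"])   # whose Message is the S3 notification
        for s3_record in s3_event.get("Records", []):
            keys.append(s3_record["s3"]["object"]["key"])
    return keys


def handler(event, context):
    # Imported inside the handler so the parsing helper above can be
    # unit tested without pyarrow or deltalake installed.
    import pyarrow.parquet as pq
    from deltalake import write_deltalake

    for key in object_keys(event):
        # Process exactly one parquet file per loop iteration.
        table = pq.read_table(f"s3://super-data-lake/{key}")
        # ...validate or clean `table` here...
        write_deltalake("s3://super-data-lake/tables/silver/events",
                        table, mode="append")
```

Because `object_keys` is a pure function over the event payload, the trickiest part of the handler, unwrapping the SNS envelope, can be tested entirely offline.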

When we deploy these pipelines we typically connect S3 Event Notifications to SNS and then to an SQS queue per AWS Lambda function for triggering. Sparing you some infrastructure details, routing through SNS first allows for much more rapid deployment of additional Lambdas as needed.

resource "aws_sns_topic" "bronze" {
  name   = "bronze-data-landing"
  policy = data.aws_iam_policy_document.topic.json
}

resource "aws_s3_bucket" "bucket" {
  bucket = "super-data-lake"
}

resource "aws_s3_bucket_notification" "new_parquet" {
  bucket = aws_s3_bucket.bucket.id

  topic {
    topic_arn     = aws_sns_topic.bronze.arn
    events        = ["s3:ObjectCreated:*"]
    filter_prefix = "tables/bronze/"
    filter_suffix = ".parquet"
  }
}

resource "aws_sns_topic" "commits" {
  name   = "data-lake-committed-txns"
  policy = data.aws_iam_policy_document.topic.json
}

resource "aws_s3_bucket_notification" "delta_commits" {
  bucket = aws_s3_bucket.bucket.id

  topic {
    topic_arn     = aws_sns_topic.commits.arn
    events        = ["s3:ObjectCreated:*"]
    filter_prefix = "tables/"
    filter_suffix = ".json"
  }
}

When designing the data processing side of this equation it is helpful to treat medallion as the foundation and approach each Lambda as a true function rather than a "job" or "data pipeline". For example, imagine input data where each row contains a JSON array of multiple event records. Those might be exploded into individual records with some computation then performed in a traditional Spark job. In a smaller event-driven architecture we would recommend deploying two Lambda functions: one to explode the inner JSON array, and a second to perform the necessary computation on each row.
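The first of those two functions can stay a pure, testable transform. A sketch, assuming a hypothetical input shape where each row carries its nested records in an "events" column:

```python
import json


def explode(rows):
    """Explode rows whose "events" column holds a JSON-encoded array,
    producing one output row per inner event record with the parent
    row's remaining columns carried along."""
    exploded = []
    for row in rows:
        parent = {k: v for k, v in row.items() if k != "events"}
        for event in json.loads(row["events"]):
            exploded.append({**parent, **event})
    return exploded
```

The second Lambda then receives already-flattened rows and only has to implement the per-row computation, which keeps each function small enough to test in isolation.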

This approach works exceptionally well for high-performance low-cost data processing, except in the case of merge.

The Merge Situation

Merge operations are fundamentally tricky from a data standpoint because we are typically seeking to unify two disparate datasets, and in my experience merge/join operations typically bring together two different cadences of data as well. For example, clickstream style data usually gets merged/joined at some point against online transactional data representing user/browser state to produce engagement metrics. There are two aspects of these situations that add complexity: point in time correctness and working set limitations.

Dealing with point in time correctness is usually something to be solved at an architectural level. Deciding what the acceptable merge window might be for two disparate datasets usually involves discussions around what we consider a "session length" or acceptable batch size. Scheduled batches usually hide these tradeoffs since all data is being processed once per hour/day/etc, meaning everything is always at that granularity in the produced data.

Working set limitations are trickier to navigate in the "small compute" architecture of AWS Lambda. If 32 GB of data is needed to perform a merge or join effectively, then there's no way to avoid pulling 32 GB of memory together somewhere to perform the operation. These situations are where Spark can and does do plenty of heavy lifting. When larger compute resources are still needed, we typically recommend File Triggers be used for launching jobs in a similar manner to the triggered AWS Lambdas.
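The routing decision itself can stay simple. A sketch, assuming we cap Lambda-bound work at its current 10 GB memory ceiling and fall back to a file-triggered job otherwise:

```python
LAMBDA_MAX_MEMORY = 10 * 1024**3  # Lambda's current 10 GB memory ceiling


def route_merge(working_set_bytes):
    """Pick a compute target for a merge based on its estimated working set."""
    if working_set_bytes < LAMBDA_MAX_MEMORY:
        return "lambda"
    # Too big for small compute: launch e.g. a file-triggered Spark job instead.
    return "file-triggered-job"
```

The hard part in practice is estimating `working_set_bytes` up front; file sizes in the S3 event are a reasonable first approximation for append-heavy workloads.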

Triggering workloads when data is available means only using the cloud resources that are strictly necessary to process the data in our data pipelines. In an environment where every second counts, running for fewer seconds means lower bills and faster results!


If your team is looking to improve the performance of your streaming or batch infrastructure with Delta Lake, drop me an email and we'll chat!