Introduction
DataZen is a data pipeline and replication platform that moves data directly from virtually any source system into any target system. Rather than landing raw data into intermediate storage layers and building separate ETL processes for each transformation stage, DataZen applies an inline ETL architecture that processes data in memory as it flows through the pipeline, with optional Change Data Capture (CDC) to identify and forward only the records that have actually changed.
Because DataZen creates universal and portable Change Logs, captured data can be decoupled from both source and target systems and replayed to virtually any platform in the shape each consumer expects. The result is a simpler integration model that reduces orchestration overhead, avoids schema-bound staging tables, and lowers the total cost of data ingestion compared with traditional zone-based architectures.
Traditional Ingestion vs. the DataZen Approach
Most enterprise data ingestion today follows a layered design. Whether described as a Medallion architecture (Bronze, Silver, Gold) or as raw / stage / curated zones, implementations built with tools such as Azure Data Factory, AWS Glue, or Apache Airflow typically follow a similar pattern:
- Raw landing — An ingestion layer pushes source data into object storage (Azure Blob, ADLS, S3, GCS) in its native format: flat files, Parquet, JSON, CSV, and so on.
- Silver / staging — A separate ETL layer reads the raw files, applies cleansing and conforming logic, and materializes intermediate tables—often in a SQL environment such as Snowflake, Databricks, Microsoft Fabric, or a dedicated staging database.
- Gold / curated — Additional ETL produces business-ready datasets for replication into data warehouses, semantic models, reporting, and dashboarding.
This framework is well understood and widely adopted, but it carries structural costs. Each layer requires its own orchestration, storage, and schema management. Intermediate tables accumulate over time, creating a schema-bound ETL layer that is expensive to operate and difficult to evolve. Teams must coordinate file formats, table definitions, and pipeline schedules across multiple systems, and even modest schema changes can ripple through every downstream layer.
A critical distinction is how the Bronze / Raw layer participates in the pipeline. In traditional architectures it is inline—a required intermediate store on the critical path. Data cannot move to Silver or Gold until a full extract lands in blob storage first. That makes the raw zone a potential single point of failure: if the landing step fails or falls behind, every downstream layer stalls. It also becomes expensive over time, even on inexpensive object storage, because raw layers typically store complete, undeduplicated extracts rather than change-filtered datasets.
Left: Traditional zone architecture—Bronze and Silver are inline, orchestrated steps on the critical path. Right: DataZen—the pipeline runs inline; raw storage and change logs are optional side-loads.
DataZen changes this model. The raw layer becomes optional: it can still exist when you need a full historical archive, but it does not have to. Instead of landing every extract inline, DataZen can produce change logs: smaller, delta-oriented datasets that carry source schema information, metadata, and, when required, signing and encryption. Change logs are themselves optional, and whether to retain them depends on the pipeline:
- Point-to-point scenarios — A change log can be discarded as soon as the target is updated; no persistent intermediate store is required.
- Enterprise replication scenarios — Change logs should be retained as a pure change record, decoupling capture from delivery so multiple targets can consume the same delta over time.
Customers can therefore keep raw data, change logs, both, or neither. When raw storage is used, it becomes a side-loading store rather than an inline, orchestrated dependency—data flows through the pipeline directly while archival copies are written in parallel when needed.
The Silver layer follows the same principle. It remains available for long-running processes that genuinely require persistent staging, but DataZen's transient ETL execution pane allows the vast majority of Silver-layer work to happen inline without durable intermediate tables. When complex SQL is needed, transient tables are created automatically from the pipeline's current schema and dropped when the execution block completes—eliminating the schema drift and maintenance burden that accumulates in traditional staging environments.
DataZen offers a different integration paradigm built around three core principles:
- Direct sourcing — Read from databases, HTTP/S APIs, messaging platforms, cloud drives, and file stores without first landing a full raw copy in blob storage.
- Mini-batch extraction — Use high watermarks and paging windows to retrieve only the records needed for each execution, rather than reprocessing entire datasets on every run. See the Watermark Pattern and Window Capture Pattern for details.
- Inline transformation — Apply ETL logic in memory as data moves through the pipeline, enforcing schema on write at the target when required, instead of maintaining persistent staging schemas across multiple layers.
DataZen does not prevent you from writing to a data lake or warehouse. It changes when, where, and whether intermediate storage exists: transformation runs inline within the pipeline, and intermediate storage becomes an optional side-load rather than a mandatory orchestrated layer, so you can write raw archives or change logs only when they add value.
Change Capture and Mini-Batching
A central reason traditional architectures re-read large raw datasets is the absence of efficient change detection at the source. DataZen addresses this with an advanced Change Capture engine that works alongside mini-batch extraction—reading data in controlled batches rather than monolithic full extracts whenever possible.
At the highest level, four foundational read patterns cover the most common ingestion scenarios. More composite patterns (such as Watermark + CDC for duplicate filtering) are documented in the Patterns Overview.
Common Ingestion Patterns
1) Full Copy — Always reads the entire data set from the source on every execution. Best suited for initial loads, full reloads, or small sources where optimization is unnecessary. See the Snapshot Pattern.
2) Full + CDC — Reads the entire data set but applies Synthetic CDC to keep only records that are new, updated, or deleted. Reduces downstream volume while the source is still read in full. See Synthetic CDC.
3) Watermark — Reads only what changed at the source by applying a high watermark filter (timestamp, monotonic ID, and similar fields). Minimizes both source reads and downstream processing. See the Watermark Pattern.
4) Window + CDC — Reads from an earlier point in time using a sliding window that intentionally overlaps with prior runs, then applies Synthetic CDC to keep only records that actually changed. Used when the source cannot support a precise forward-only watermark. See the Window Capture Pattern.
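As a rough illustration of the Watermark pattern, the Python sketch below filters an in-memory source by a stored boundary value and returns the new boundary to persist for the next run. It is not DataZen's implementation; the `read_since` helper and the `updated_at` field are hypothetical names chosen for the example.

```python
from typing import Any

Row = dict[str, Any]

def read_since(source: list[Row], watermark: int,
               field: str = "updated_at") -> tuple[list[Row], int]:
    """Return rows whose watermark field is above the stored boundary,
    plus the new boundary to persist for the next execution.
    (Hypothetical helper, for illustration only.)"""
    changed = [r for r in source if r[field] > watermark]
    new_watermark = max((r[field] for r in changed), default=watermark)
    return changed, new_watermark

source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 105},
    {"id": 3, "updated_at": 110},
]

# First run: the stored watermark is 100, so only rows 2 and 3 are read.
first_batch, wm = read_since(source, watermark=100)

# Second run with no new source changes: nothing is read, watermark is unchanged.
second_batch, wm2 = read_since(source, watermark=wm)
```

A Window + CDC read would instead start from a boundary earlier than the stored one, accept the overlap, and rely on change detection to discard rows already seen.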
Per-Page Mini-Batching
DataZen supports a batch-oriented read model for HTTP/S and database sources. Rather than loading an entire result set into memory at once, the reader retrieves data in pages—using SQL paging, HTTP offset/limit parameters, or cursor-based APIs—and processes each page before requesting the next.
When paging is enabled, an optional inner pipeline can execute once per page. This allows ETL operations (schema enforcement, filtering, enrichment, masking, and more) to run on each page independently, keeping the in-memory footprint small even when the total source data set is very large. After all pages complete, an optional outer ETL pipeline continues with any remaining transformations, optionally followed by Synthetic CDC and writer ETL, before capture or push. In SQL CDC, this is expressed using APPLY PIPELINE inside a SELECT operation; see SELECT HTTP for HTTP paging details.
Per-page mini-batching is orthogonal to the four read patterns above—any pattern can be combined with paging. For example, a watermark-based database read can page through changed records in batches, running inline ETL on each page while the high watermark advances across the full execution.
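The inner/outer pipeline split can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not DataZen code: `read_pages`, `run_paged`, and the two transformation functions are hypothetical, and list slicing stands in for SQL or HTTP offset/limit paging.

```python
from typing import Any, Callable, Iterator

Row = dict[str, Any]

def read_pages(source: list[Row], page_size: int) -> Iterator[list[Row]]:
    """Yield the source one page at a time (stand-in for offset/limit paging)."""
    for offset in range(0, len(source), page_size):
        yield source[offset:offset + page_size]

def run_paged(source: list[Row], page_size: int,
              inner: Callable[[list[Row]], list[Row]],
              outer: Callable[[list[Row]], list[Row]]) -> list[Row]:
    """Run the inner pipeline once per page, then the outer pipeline once
    on the combined output after all pages complete."""
    collected: list[Row] = []
    for page in read_pages(source, page_size):
        collected.extend(inner(page))
    return outer(collected)

def mask_and_filter(page: list[Row]) -> list[Row]:
    # Inner ETL: drop inactive rows and mask the email column, page by page.
    return [{**r, "email": "***"} for r in page if r["active"]]

def sort_by_id(rows: list[Row]) -> list[Row]:
    # Outer ETL: a step that needs the full result rather than a single page.
    return sorted(rows, key=lambda r: r["id"])

source = [
    {"id": i, "email": f"user{i}@example.com", "active": i % 2 == 0}
    for i in range(5, 0, -1)
]
result = run_paged(source, page_size=2, inner=mask_and_filter, outer=sort_by_id)
```

Because filtering and masking happen per page, only the reduced rows are held until the outer step runs.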
Change Capture Building Blocks
- High watermarks limit each read to records above a previously stored boundary value, enabling efficient forward-only paging over time.
- Synthetic CDC compares key columns across executions to eliminate unchanged records and optionally detect deletions when the source provides no native change feed. See Synthetic CDC.
- Native CDC can consume change streams directly from source systems that expose them.
Together, these options dramatically reduce the volume of records processed on each run. Instead of landing a full extract and re-deriving changes in a Silver layer, DataZen identifies what changed at extraction time—or reads only the changed subset at the source—and carries the result forward through the pipeline.
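To make the idea of Synthetic CDC concrete, here is a minimal Python sketch that keeps one fingerprint per key from the previous execution and compares it with the current extract. The fingerprinting scheme (SHA-256 over a sorted JSON rendering) and the `synthetic_cdc` function are assumptions made for illustration; the source does not describe how DataZen stores or compares state internally.

```python
import hashlib
import json
from typing import Any

Row = dict[str, Any]

def row_hash(row: Row) -> str:
    """Stable fingerprint of a row's content (illustrative choice of hash)."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

def synthetic_cdc(previous: dict[Any, str], current: list[Row], key: str):
    """Compare the current extract with fingerprints kept from the prior run.
    Returns (inserts, updates, deleted_keys, new_state)."""
    new_state = {r[key]: row_hash(r) for r in current}
    inserts = [r for r in current if r[key] not in previous]
    updates = [r for r in current
               if r[key] in previous and previous[r[key]] != new_state[r[key]]]
    deleted = [k for k in previous if k not in new_state]
    return inserts, updates, deleted, new_state

run1 = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
run2 = [{"id": 1, "qty": 5}, {"id": 2, "qty": 9}, {"id": 3, "qty": 1}]
run3 = [{"id": 2, "qty": 9}]

# First execution: everything is new.
ins1, upd1, del1, state = synthetic_cdc({}, run1, "id")
# Second execution: row 1 is unchanged and dropped, row 2 changed, row 3 is new.
ins2, upd2, del2, state = synthetic_cdc(state, run2, "id")
# Third execution: rows 1 and 3 disappeared from the source.
ins3, upd3, del3, state = synthetic_cdc(state, run3, "id")
```

Only the inserts, updates, and deleted keys would be forwarded downstream; unchanged rows never leave the comparison step.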
Inline ETL and the Execution Pane
Most DataZen transformations run in an in-memory ETL architecture. As records flow through the pipeline, components such as schema enforcement, filtering, masking, hashing, and HTTP enrichment operate on the current dataset without persisting intermediate tables. When a target requires a defined schema, DataZen applies schema on write at push time rather than maintaining a separate conforming layer.
For transformations that exceed what in-memory processing can handle efficiently (complex JOINs, aggregations, multi-step SQL logic, or operations against large reference datasets), DataZen can switch the execution pane from in-memory to database-native ETL inline, directly within the same pipeline. Using the @pipelinedata() function (see EXEC), the engine automatically:
- Creates transient tables from the current pipeline schema
- Executes the SQL batch against the selected database engine
- Rehydrates the pipeline with the result set
- Drops the transient tables when the block completes
A single pipeline can contain as many ETL execution blocks as needed, and each block can target a different database engine—MySQL, PostgreSQL, Snowflake, SQL Server, and others—without pre-provisioning staging schemas. This execution pane switching gives you the full power of relational SQL where it is needed, while keeping simple transformations lightweight and in memory.
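The four steps of an execution block (create, execute, rehydrate, drop) can be sketched in Python with the standard `sqlite3` module standing in for the target database engine. This is an analogy under stated assumptions rather than DataZen's mechanism: the `exec_block` helper and the `pipelinedata` table name are hypothetical, the table is created without column types, and the sketch assumes the pipeline holds at least one row.

```python
import sqlite3
from typing import Any

Row = dict[str, Any]

def exec_block(conn: sqlite3.Connection, rows: list[Row], sql: str,
               table: str = "pipelinedata") -> list[Row]:
    """Create a transient table from the current rows, run the SQL batch,
    return the result set as the new dataset, and drop the table."""
    columns = list(rows[0].keys())  # schema taken from the current dataset
    col_list = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TEMP TABLE "{table}" ({col_list})')
    try:
        conn.executemany(
            f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})',
            [tuple(r[c] for c in columns) for r in rows],
        )
        cursor = conn.execute(sql)
        names = [d[0] for d in cursor.description]
        # Rehydrate: the query result becomes the pipeline's dataset.
        return [dict(zip(names, values)) for values in cursor.fetchall()]
    finally:
        # The transient table is dropped whether or not the SQL succeeded.
        conn.execute(f'DROP TABLE "{table}"')

conn = sqlite3.connect(":memory:")
rows = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]
totals = exec_block(
    conn, rows,
    "SELECT region, SUM(amount) AS total FROM pipelinedata "
    "GROUP BY region ORDER BY region",
)
leftover = conn.execute(
    "SELECT name FROM sqlite_temp_master WHERE name = 'pipelinedata'"
).fetchall()
```

After the call, `totals` holds the aggregated rows and no staging table remains in the database.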
The SQL CDC scripting language exposes this model through commands such as APPLY SCHEMA, APPLY TX, and EXEC. Document-oriented sources can be transformed into relational datasets using document-to-table and CSV-to-table pipeline operations. HTTP-based enrichment is covered in the following section.
Inline HTTP Data Enhancement
Pipelines often need to call external HTTP endpoints that perform calculations on the current pipeline dataset, transform it, enrich it, or consume it for downstream integration. Think of this as a dedicated HTTP execution pane: another pane within the same inline pipeline, alongside the DataZen in-memory pane and the database engine pane described above.
As an example, a pipeline could call an external AWS Lambda function to calculate fuel consumption anomalies from real-time fuel readings and return additional fields with the anomaly result for each driver. When the HTTP call completes, the response can optionally rehydrate the pipeline as the new dataset before other ETL operations continue.
When data must be sent to an external API, it is usually formatted first as a JSON payload or another supported format. This is typically done with ADD COLUMN FORMAT operations, optionally followed by ZIP COLUMN to batch rows into a single payload. After APPLY HTTP returns, the response can be unwound back into a regular dataset using APPLY TX. This is one step in the overall pipeline and can be repeated as many times as needed.
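The format, batch, call, and unwind sequence can be modeled in a few lines of Python. The sketch below is conceptual only: it does not use SQL CDC syntax, the function names are hypothetical, and `fake_anomaly_endpoint` is a local stand-in for an external service (with an arbitrary 50-liter threshold) so that the example runs without network access.

```python
import json
from typing import Any

Row = dict[str, Any]

def format_rows(rows: list[Row]) -> list[str]:
    """Format each row as its own JSON document (analogous to a format column)."""
    return [json.dumps({"driver": r["driver"], "liters": r["liters"]}) for r in rows]

def zip_payload(docs: list[str]) -> str:
    """Batch the per-row documents into a single JSON array payload."""
    return "[" + ",".join(docs) + "]"

def fake_anomaly_endpoint(body: str) -> str:
    """Local stand-in for an external function: flags readings above 50 liters."""
    readings = json.loads(body)
    return json.dumps([{**r, "anomaly": r["liters"] > 50} for r in readings])

def unwind(response_body: str) -> list[Row]:
    """Turn the JSON array response back into a regular row-based dataset."""
    return json.loads(response_body)

rows = [
    {"driver": "a", "liters": 42, "truck": "T-1"},
    {"driver": "b", "liters": 63, "truck": "T-2"},
]
payload = zip_payload(format_rows(rows))
enriched = unwind(fake_anomaly_endpoint(payload))
```

Here the response replaces the dataset, which mirrors the optional rehydration step; a real pipeline could instead merge the returned fields back into the original rows.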
Common use cases include:
- Anomaly detection and enrichment — Send calculated or aggregated pipeline data to cloud functions (AWS Lambda, Azure Functions) or custom REST APIs and merge results back into the dataset.
- Claim-Check pattern — Some HTTP sources return only a unique identifier; use that value in a follow-on HTTP call to retrieve the full payload. See Implementing a Claim-Check Pattern.
- Image and document analysis — Send images to AI vision endpoints, PDF documents to summarization services, or other content to LLM agents for description and classification. See Calling an LLM Inline.
- Selective outbound integration — Use the WHEN setting on APPLY HTTP to choose which rows are sent to the endpoint. This is useful when only certain records require external processing; for example, sending alerts to another system when logical errors are detected in the data.
For additional examples, see Calling an HTTP Function.
Advanced Topics
The following table summarizes advanced integration and deployment capabilities. Additional topics will be added here as the platform evolves.
| Topic | Summary |
|---|---|
| Webhooks | DataZen can be invoked by other applications through HTTP/S webhooks. A webhook is always implemented as a Writer Pipeline, which means it depends on change logs: incoming data is captured before processing continues. Exposing a Writer Pipeline as a webhook makes it available for cloud integration scenarios. Because a webhook exposes a pipeline to the network, it should validate that incoming data matches expectations before any downstream processing runs. See the Hook Pattern. |
| MCP Tools | DataZen can expose a Read Pipeline as an MCP Tool with optional input parameters, allowing pipelines to be injected directly into LLMs and AI agent workflows to return data and/or take action. For example, a pipeline that returns the latest sales figures can be exposed as an MCP Tool so an LLM can answer decision-makers with recent internal data. Parameter injection is supported through DECLARE operations in SQL CDC scripts. See SQL CDC Introduction (MCP Tool usage). |
| AI Agents | Pipelines can call AI and LLM endpoints inline using APPLY HTTP to enrich, classify, or summarize data as it moves through the pipeline. In the opposite direction, exposing Read Pipelines as MCP Tools lets external AI agents query governed DataZen datasets rather than accessing source systems directly. Together, these patterns support bidirectional integration between DataZen pipelines and agent-based applications. |
| DataZen API | Enterprise customers can use the DataZen API to manage pipelines programmatically, access change logs, retrieve execution history, and obtain debug logs. API access is scope-based; see Security and API Scopes for an overview of available endpoints and permissions. |
| GitHub | Enterprise customers can link a DataZen agent to a GitHub repository to support a more formal release process and track pipeline changes over time. This enables version-controlled pipeline definitions alongside standard software development workflows. |
| Cloud vs. On-Prem | DataZen can run as a pure-cloud pipeline environment managed through an online portal, or as an on-premises service for environments with stricter security or connectivity requirements. The majority of features are available in both deployment models; notable exceptions include certain messaging consumer scenarios and some ODBC capabilities that may differ between cloud and on-prem installations. |
