DataZen Documentation
DataZen User Guide

High Watermark Values

Some jobs support the ability to track the "last highest value" of a field from the data source so that future calls can retrieve only the data that changed. Normally, this value is a DateTime or Timestamp data type, or an integer (or long) value. For example, a database system may have a timestamp field that can be used for a high watermark. Twitter offers an id field that contains a numeric value that keeps growing. A SharePoint List contains a LastModified field that can be used for this purpose.

Generally speaking a high watermark is used as an optimization technique that limits how future data is retrieved so that only the changes are extracted. High watermark values are usually not necessary when the source system is a CDC stream itself or a messaging platform.

Using the High Watermark feature

Built-in High Watermark Support

You can leverage the built-in high watermark support of DataZen in most cases; for complex scenarios, you may need to implement a custom high watermark. In addition, high watermark values are used differently depending on the source system of a reader. See the Watermark Pattern documentation for details on how to implement this feature.

When using SQL CDC, the built-in high watermark is only supported for the first SELECT operation detected in the script. If a high watermark is needed for additional SELECT operations, or you need to share a high watermark value across multiple pipelines, use the custom method explained below.

For example, the following SQL CDC operation keeps track of a high watermark value automatically. Note that we are careful to order the result by the modified date to ensure we can page the records successfully over time. When using the built-in high watermark feature, you do not need to take any additional actions; the high watermark will automatically be saved with the pipeline if the pipeline completes successfully.

SELECT * FROM DB [sql] 
  (SELECT TOP 250 name, modify_date FROM sys.tables WHERE modify_date > '@highwatermark' ORDER BY modify_date)
  WITH HWM 'modify_date';

Custom High Watermark

You may need to track high values separately from the built-in feature for a number of reasons such as when working with unsupported data types, when multiple SELECT operations are used, or when you need complex custom calculations. In addition, you may need multiple pipelines to share the same high watermark. Regardless of the reason for using a custom high watermark, the approach is similar:

  • Use the #getvar() or #trygetvar() to retrieve an environment variable value
  • Use one of the #data.calc_hwm functions to calculate the new high watermark
  • Call the SET operation to save the new calculated high watermark value

For example, consider this script that uses an environment variable called saved_hwm as a numeric value:

-- This script assumes the variable saved_hwm already exists
PRINT 'Initial HWM value: #getvar(saved_hwm)';
SELECT * FROM DB [sql] (SELECT TOP 100 * FROM mytable WHERE id > #getvar(saved_hwm) ORDER BY id );
PRINT 'New HWM candidate value: #data.calc_hwmnumber(id, #getvar(saved_hwm))';
SET @saved_hwm = '#data.calc_hwmnumber(id, #getvar(saved_hwm))';

A more robust example takes into account that the script may fail at some point in the middle and the intent is to save the high watermark only if the script succeeds. Also, this script does not assume the environment variable was previously created and initializes the value to 0 if it does not.

-- declare a script-level parameter called @hwm to avoid calling the #getvar() function multiple times
DECLARE @hmw = '#trygetvar(saved_hwm, 0)';
PRINT 'Initial HWM value: {{@hwm}}';
SELECT * FROM DB [sql] (SELECT TOP 100 * FROM mytable WHERE id > {{@hwm}} ORDER BY id );

PRINT 'New HWM candidate value: #data.calc_hwmnumber(id, {{@hwm}})';
-- do some more work... if an error is thrown, the high watermark will not be saved

-- Now save the high watermark in the environment variable before exiting since no errors took place if we got this far
-- If at this point the id column is no longer part of the pipeline, consider using the #statebag() function right after the select operation
-- to temporarily keep the new high watermark in memory, and use #statebag() here instead 
SET @saved_hwm = '#data.calc_hwmnumber(id, {{@hwm}})' CREATE SKIP_PREVIEW;

View/Edit High Watermark Values using the Manager

When high watermark values are captured and stored by DataZen, you may view and edit them. In some implementations, when the high watermark value is managed externally, this feature is not available.

To edit high watermark values (last read or last deleted), select the desired job from the list of jobs in DataZen Manager. Shortly after clicking on it, the right panel shows most job settings, including the current high watermark value in Last Read Pointer (in this example, a DateTime).

If the job holds a high watermark, the Edit Pointers button on the right panel will be enabled; click on it.

This screen shows both the Last Read Pointer and Last Delete Pointer when available. You can manually edit the value. The value can be modified as follows:

  • Reset (null): resets the value to NULL; all available data will be read again
  • Date/Time Value: Selects a date/time value from a date picker
  • Numeric Value: Enter a numeric value
  • Custom Value: Free-form text

In some cases, this setting may be an array when a job holds multiple pointers. When entering a date as free-form, use the following notation: YYYY-MM-DD hh:mm:ss.nnn

If not set correctly, changing this value may cause the job to fail.