Striim 3.9.6 documentation

Recovering applications

Subject to the following limitations, Striim applications can be recovered after planned downtime or most cluster failures with no loss of data:

  • Recovery must have been enabled when the application was created.

    Note

    Enabling recovery will have a modest impact on memory and disk requirements and event processing rates, since additional information required by the recovery process is added to each event.

  • All sources and targets to be recovered, as well as any CQs, windows, and other components connecting them, must be in the same application. Alternatively, they may be divided among multiple applications provided the streams connecting those applications are persisted to Kafka (see Persisting a stream to Kafka and Using the Striim Forwarding Agent).

  • Data from a CDC reader with a Tables property that maps a source table to multiple target tables (for example, Tables:'DB1.SOURCE1,DB2.TARGET1;DB1.SOURCE1,DB2.TARGET2') cannot be recovered.

  • Data from time-based windows that use system time rather the ON <timestamp field name> option cannot be recovered.

  • Data from sources using the HTTPReader, MultiFileReader, TCPReader, or UDPReader adapters cannot be recovered unless the source's output is a Kafka stream (see Introducing Kafka streams).

  • Standalone sources and WActionStores (see Loading standalone sources, caches, and WActionStores) are not recoverable unless persisted to Kafka (see Persisting a stream to Kafka).

  • Data from sources using an HP NonStop reader can be recovered provided that the AuditTrails property is set to its default value, merged.

  • Caches are reloaded from their sources. If the data in the source has changed in the meantime, the application's output may be different than it would have been.

  • If objects were added or renamed by DDL replication (see Including DDL operations in OracleReader output, and the Tables properties do not use wildcards, you must edit the application to add all new and renamed objects to both the OracleReader and DatabaseWriter Tables properties before restarting the application after a cluster failure.

  • Known issue DEV-13186: Recovery fails when KinesisWriter target has 250 or more shards. The error will include "Timeout while waiting for a remote call on member ..."

In some situations, after recovery there may be duplicate events.

  • ADLSWriterGen1, ADLSWriterGen2, AzureBlobWriter, FileWriter, HDFSWriter, and S3Writer restart rollover from the beginning and depending on rollover settings (see Setting output names and rollover / upload policies) may overwrite existing files. For example, if prior to the crash there were file00, file01, and the current file was file02, after recovery writing would restart from file00, and eventually overwrite all three existing files, so you may wish to back up or move the existing files before initiating recovery. After recovery, the target files may include duplicate events; the number of possible duplicates is limited to the Rollover Policy eventcount value.

  • When HiveWriter is using SQL APPEND or INSERT INTO rather than MERGE, after recovery there may be some duplicate events.

  • When KafkaWriter is in sync mode (see Setting KafkaWriter's mode property: sync versus async), if the Kafka topic's retention period is shorter than the time that has passed since the cluster failure, after recovery there may be some duplicate events, and striim.server.log will contain a warning, "Start offset of the topic is different from local checkpoint (Possible reason - Retention period of the messages expired or Messages were deleted manually). Updating the start offset ..."

  • After recovery, RedshiftWriter targets may include some duplicate events.

  • Recovered flows that include WActionStores should have no duplicate events. Recovered flows that do not include WActionStores may have some duplicate events from around the time of failure, except when a target guarantees no duplicate events ("exactly once processing," also called E1P).

To enable Striim applications to recover from system failures, you must do two things:

1. Enable persistence of all of the application's WActionStores.

2. Specify the RECOVERY option in the CREATE APPLICATION statement. The syntax is:

CREATE APPLICATION <application name> RECOVERY 5 SECOND INTERVAL;

For example:

CREATE APPLICATION PosApp RECOVERY 5 SECOND INTERVAL;

With this setting, when PosApp application is restarted after a system failure, it will resume exactly where it left off.

recovering_status.png

While recovery is in progress, the application status will be RECOVERING SOURCES, indicated by a red arrow in the web UI. The shorter the recovery interval, the less time it will take for Striim to recover from a failure. Longer recovery intervals require fewer disk writes during normal operation. To see detailed recovery status, enter MON <namespace>.<application name> <node> (see Using the MON command).