I have spent over a decade of my career developing and maintaining public flood warning systems. Through this work, I learned how to build reliable anomaly detection algorithms that could flag outliers such as timeouts from remote rain gauges or out-of-bounds readings from IoT sensors.
In data analysis, anomalies represent data points that diverge significantly from established and expected patterns. Anomaly detection systems allow engineers and analysts to more easily identify these outliers within large datasets across various domains. Instead of sifting through a continuous stream of e-commerce events or weather readings, engineers can pinpoint unexpected or interesting events by using anomaly detection techniques.
What is Anomaly Detection?
Anomaly detection (also called "outlier detection") is the process of examining single data points on univariate or multivariate axes to determine whether they deviate significantly from population norms.
Anomaly detection systems use different types of anomaly detection algorithms, ranging from out-of-bounds checks to complex machine learning models. Anomaly detection algorithms can be as simple as true/false expressions, or they can be quite complex; decision trees, support vector machines (SVMs), neural networks, k-nearest neighbors (kNN), and isolation forests are common machine learning techniques for outlier detection on high-dimensional data.
Anomaly Detection Use Cases
Anomaly detection algorithms can be applied across a variety of use cases spanning many domains and industries, including:
- Credit card fraud detection
- Demand forecasting
- Inventory management
- Real-time personalization
- IoT system monitoring
- Predictive maintenance
- Intrusion detection
- Cybersecurity
Any system with sensors that generate events can benefit from an anomaly detection system. Sensors don't have to be physical devices; they can range from product or website instrumentation that measures customer interactions with an application, to remote IoT sensors broadcasting data, to a stream of inventory, price, and location data generated by a mix of IoT sensors, user inputs, and customer interactions.
Any system that creates a stream of data can benefit from anomaly detection.
Real-time Anomaly Detection: Challenges and Solutions
The underlying statistical techniques for anomaly detection have existed for centuries, and their adoption within data science began decades ago. Until recently, however, most anomaly detection techniques and algorithms were applied to static datasets. Due to bottlenecks at the data connection and publishing layers, delivery of the analysis generally took minutes, hours, or even days.
As data volumes and velocities have exploded, the need for real-time anomaly detection has become critical. Batch data systems are giving way to real-time analytics and event-driven architectures, and traditional outlier detection systems struggle to keep pace with the influx of real-time data.
Challenges with traditional algorithms for real-time anomaly detection
While advanced anomaly detection methods can be useful in some scenarios, they tend to be computationally intensive, requiring offline model development against training data and frequent or repeated passes through changing datasets.
For example, Gaussian Mixture Models (GMMs) make several passes through the data to fit a set of Gaussian distributions. K-means clustering is another multi-pass method, building clusters that are then used to find abnormalities in the test dataset.
Traditional anomaly detection algorithms can introduce latency that is unacceptable for real-time systems.
Regardless of the specific algorithm chosen, the main concern is the latency these techniques introduce. Latency becomes a critical factor when designing scalable real-time architectures. Quick detection and response are crucial in many domains, but many common anomaly detection algorithms introduce significant delays due to complex computations or the need for large training datasets.
Supervised vs. Unsupervised Methods for Real-Time Anomaly Detection
Anomaly detection algorithms can be broadly divided into supervised and unsupervised approaches, each with distinct advantages and limitations for real-time applications.
Supervised Anomaly Detection uses labeled training data to develop a classification model. These models explicitly mark specific data points as normal or anomalous and "learn" the defining characteristics of normal behavior based on the labeled dataset.
Supervised detection allows for high precision in anomaly identification, especially for multivariate use cases, as the model can effectively flag points that deviate significantly from the learned norm across many variables.
However, supervised learning often requires multiple passes through the data for effective training, making it less ideal for the continuous nature of real-time data streams. Manually labeling data introduces latency, hindering its use in situations requiring real-time detection.
Unsupervised Anomaly Detection, on the other hand, works quite well in scenarios where streaming data must be analyzed in real time, labeled data is scarce, or the definition of "normal" is constantly evolving.
Rather than use training data sets, unsupervised anomaly detection algorithms analyze the raw data on the fly, identifying inherent patterns and statistical properties.
Data points that fall outside the established patterns or statistical expectations by a significant margin are flagged as potential anomalies in real time. This approach is particularly well-suited for real-time anomaly detection, as it does not require pre-labeled data and can adapt to changing data trends.
Unsupervised methods offer a compelling approach to real-time anomaly detection. Techniques like Z-score and Interquartile Range (IQR) calculations can be used within unsupervised methods to identify outliers and potential anomalies in real time thanks to their statistical power and low computational requirements.
5 Real-Time Anomaly Detection Algorithms with Example Code
Necessity is the mother of invention, and real-time databases such as the open-source Apache Druid, Apache Pinot, and ClickHouse have emerged to help us achieve real-time anomaly detection. These columnar databases support the analysis of large amounts of data in real time across hundreds or even thousands of concurrent connections.
The emergence of real-time databases, increased processing power, and higher network bandwidths have provided the foundation for new tools and platforms to build real-time anomaly detection systems. With query response times measured in milliseconds, it is now possible to build systems that detect anomalies almost instantly, regardless of scale.
New technology, specifically real-time databases, makes it possible to run complex real-time anomaly detection algorithms using online SQL queries.
Here, you'll find 5 examples of real-time anomaly detection algorithms that can be built using SQL queries over a real-time database. These examples have been developed using Tinybird, a real-time data platform built on ClickHouse. Tinybird is an ideal platform to build unsupervised anomaly detection systems with nothing but SQL and publish algorithms as scalable REST APIs.
Each code snippet uses query parameters supported by Tinybird's templating library, which allows you to define and pass dynamic variables to SQL queries exposed as APIs. For more information on how query parameters are defined in Tinybird, read this.
Tinybird allows us to publish SQL queries as dynamic APIs. We can pass configurable thresholds or time periods as query parameters to each published endpoint.
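As a quick illustration of that pattern, the sketch below parameterizes a time window over a hypothetical `sensor_data` data source (the `%` on the first line enables Tinybird's templating; the data source name and the `detection_period_minutes` parameter are assumptions for this example):

```sql
%
-- Minimal sketch of a parameterized Tinybird query.
-- sensor_data(id, timestamp, value) is a hypothetical data source;
-- detection_period_minutes is an example parameter with a default of 60.
SELECT
    id,
    timestamp,
    value
FROM sensor_data
WHERE timestamp > now() - INTERVAL {{Int32(detection_period_minutes, 60)}} MINUTE
```

Publishing this query as an API endpoint lets callers override the default window, for example with `?detection_period_minutes=120`.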
Out-Of-Range Anomaly Detection
The out-of-range anomaly detection algorithm compares data with a set of maximum and/or minimum values. This algorithm uses a simple query and can be applied to individual data points unsupervised. As each new data point arrives, it is tested to see if it is within the acceptable range, and flagged as an anomaly if not.
For example, in flood warning systems, sensors such as water level gauges have a defined maximum value for valid readings. These thresholds are based on the local "channel" cross-sections; readings above the threshold are not physically possible.
Below is a simple SQL implementation for out-of-range anomaly detection, using the Tinybird templating library to pass query parameters (`min_value` and `max_value`) that define the acceptable range:
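(A minimal sketch; `sensor_data` is a hypothetical data source with `id`, `timestamp`, and `value` columns, and the parameter defaults are illustrative.)

```sql
%
-- Flag readings outside the configurable [min_value, max_value] range.
-- sensor_data(id, timestamp, value) is a hypothetical data source.
SELECT
    id,
    timestamp,
    value
FROM sensor_data
WHERE value < {{Float32(min_value, 0)}}
   OR value > {{Float32(max_value, 100)}}
ORDER BY timestamp DESC
```

Any rows returned by the published endpoint represent readings outside the configured bounds.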
To find a more detailed tutorial on out-of-range anomaly detection using SQL, read this.
Timeout Anomaly Detection
The timeout anomaly detection algorithm finds the most recent timestamp for a sensor and checks whether it falls outside an acceptable timeout window. This check should run on a schedule proportional to the expected reporting frequency.
For example, rain gauges are typically programmed to report every hour, even on sunny days. A sensor whose last reported timestamp is more than 60 minutes in the past could be considered "timed out". This expected interval can be configured based on real-world needs; for example, you may only want to be alerted to a timeout if the gauge has not reported data for 24 hours.
Below is a basic SQL implementation for timeout anomaly detection, using the Tinybird templating library to pass a query parameter that defines the timeout period:
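(A minimal sketch over the same hypothetical `sensor_data` source; `timeout_seconds` is an illustrative parameter defaulting to one hour.)

```sql
%
-- Flag sensors whose most recent reading is older than the timeout window.
-- sensor_data(id, timestamp, value) is a hypothetical data source.
SELECT
    id,
    max(timestamp) AS latest_timestamp
FROM sensor_data
GROUP BY id
HAVING max(timestamp) < now() - INTERVAL {{Int32(timeout_seconds, 3600)}} SECOND
```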
To find a more detailed tutorial on timeout anomaly detection using SQL, read this.
Rate-Of-Change Anomaly Detection
The rate-of-change anomaly detection algorithm retrieves two consecutive data points, determines the rate of change, or slope, between them, and compares that slope to a configurable maximum allowable value.
The rate-of-change method is important because it assesses data in a larger context beyond just a single data point. Retrieving previous values can be challenging depending on how data is stored. In the case of ClickHouse, we use a time window function to retrieve this value.
Rate-of-change anomalies are fundamental for flood warning systems, as they can flag rapidly increasing water levels and rain accumulations. They are also critical for data validation as most monitored phenomena have physical constraints on reasonable rates of change. For example, a large reservoir with a measured rate of change of several inches in a minute would indicate either an inaccurate measurement or a dam failure.
Below is an SQL implementation for rate-of-change anomaly detection:
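(A minimal sketch using ClickHouse window functions over the same hypothetical `sensor_data` source; `max_slope` is an illustrative parameter expressed in value units per second.)

```sql
%
-- Compare each reading to the previous reading from the same sensor and
-- flag pairs whose rate of change exceeds max_slope (units per second).
-- sensor_data(id, timestamp, value) is a hypothetical data source.
WITH readings AS (
    SELECT
        id,
        timestamp,
        value,
        lagInFrame(value) OVER w AS previous_value,
        lagInFrame(timestamp) OVER w AS previous_timestamp
    FROM sensor_data
    WINDOW w AS (PARTITION BY id ORDER BY timestamp
                 ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
)
SELECT
    id,
    timestamp,
    value,
    (value - previous_value)
        / dateDiff('second', previous_timestamp, timestamp) AS slope
FROM readings
WHERE previous_timestamp > toDateTime(0)  -- skip the first reading per sensor
  AND dateDiff('second', previous_timestamp, timestamp) > 0
  AND abs(slope) > {{Float32(max_slope, 1.0)}}
```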
To find a more detailed tutorial on rate-of-change anomaly detection using SQL, read this.
Interquartile Range (IQR) Anomaly Detection
The interquartile range (IQR) anomaly detection algorithm detects values that fall outside a range based on the first and third quartiles of the data included in a configurable recent time window. This type of detection makes it possible to identify short-term anomalies even as data trends shift due to seasonality or other mid-to-long-term effects.
Below is an SQL implementation for IQR anomaly detection:
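(A minimal sketch over the same hypothetical `sensor_data` source; `window_minutes` and `iqr_multiplier` are illustrative parameters, with the conventional 1.5 × IQR fence as the default.)

```sql
%
-- Flag readings that fall below Q1 - k*IQR or above Q3 + k*IQR,
-- computed per sensor over a recent, configurable time window.
-- sensor_data(id, timestamp, value) is a hypothetical data source.
WITH stats AS (
    SELECT
        id,
        quantile(0.25)(value) AS q1,
        quantile(0.75)(value) AS q3
    FROM sensor_data
    WHERE timestamp > now() - INTERVAL {{Int32(window_minutes, 60)}} MINUTE
    GROUP BY id
)
SELECT
    d.id,
    d.timestamp,
    d.value
FROM sensor_data AS d
INNER JOIN stats AS s ON d.id = s.id
WHERE d.timestamp > now() - INTERVAL {{Int32(window_minutes, 60)}} MINUTE
  AND (
       d.value < s.q1 - {{Float32(iqr_multiplier, 1.5)}} * (s.q3 - s.q1)
    OR d.value > s.q3 + {{Float32(iqr_multiplier, 1.5)}} * (s.q3 - s.q1)
  )
```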
To find a more detailed tutorial on IQR anomaly detection using SQL, read this.
Z-Score Anomaly Detection
The Z-score anomaly detection algorithm calculates a Z-score for each data point based on data averages and standard deviations, with any Z-score above a configurable threshold identified as an anomaly.
As with the IQR method, the Z-score approach can adapt to dynamic baselines even in an unsupervised setting, as the "normal" range of readings shifts over time.
For a flood warning example, the Z-score method can identify when water levels and sustained rain accumulations are rising fast enough to cause flooding even when short-term rates of change are not anomalous.
Below is an SQL implementation for Z-score anomaly detection:
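(A minimal sketch over the same hypothetical `sensor_data` source; `window_minutes` and `zscore_threshold` are illustrative parameters, with three standard deviations as a common default threshold.)

```sql
%
-- Compute a Z-score for each reading against the per-sensor mean and
-- standard deviation over a recent, configurable time window.
-- sensor_data(id, timestamp, value) is a hypothetical data source.
WITH stats AS (
    SELECT
        id,
        avg(value) AS mean_value,
        stddevPop(value) AS stddev_value
    FROM sensor_data
    WHERE timestamp > now() - INTERVAL {{Int32(window_minutes, 60)}} MINUTE
    GROUP BY id
)
SELECT
    d.id,
    d.timestamp,
    d.value,
    (d.value - s.mean_value) / s.stddev_value AS zscore
FROM sensor_data AS d
INNER JOIN stats AS s ON d.id = s.id
WHERE d.timestamp > now() - INTERVAL {{Int32(window_minutes, 60)}} MINUTE
  AND s.stddev_value > 0
  AND abs(zscore) > {{Float32(zscore_threshold, 3.0)}}
```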
To find a more detailed tutorial on Z-score anomaly detection using SQL, read this.
Closing and Additional Resources
As you design your anomaly detection system, consider the importance of responsiveness in your algorithm. What latency is acceptable between a real-world anomaly occurrence and its statistical detection?
Long latencies measured in hours or days may be acceptable in certain domains. For others, like healthcare, manufacturing, vehicle geospatial systems, and fraud detection, anomalous data must be detected within seconds (or less).
For more information on real-time anomaly detection, check out these resources: