This guide walks through the complete DeltaMax V2 workflow — from multi-period data generation using the H–A–B framework, to advanced anomaly detection with KNN, Trust Score computation, and seamless integration with Google Cloud for visualization and AI-powered analysis.
DeltaMax V2 is designed to enable structured data validation, statistical drift detection, machine learning–based anomaly identification, and business-level risk interpretation.
It provides the commands and structure required to run each step end-to-end and adapt it to your project setup.
The workflow is organized into five core sections:
Multi-Period Data Generation (H–A–B Framework) - Generates Historical (H), Previous (A), and Current (B) datasets to enable structured drift and stability comparison.
Trust Score Computation - Combines anomaly signals, drift metrics, and data integrity checks into a normalized, interpretable risk score.
Data Quality Checks - Executes KNN anomaly detection, statistical tests, PSI analysis, and structural validations to detect inconsistencies and risk signals.
Google Cloud & BigQuery Integration - Operationalizes outputs through Cloud Storage ingestion and BigQuery modeling for scalable analytics.
Visualization & Agentic Intelligence - Delivers executive dashboards and AI-powered natural language querying for real-time risk exploration.
Step-1 : Search for DeltaMax on Google Cloud Marketplace
1) Create a new project on Google Cloud.
2) Associate the project with your corporate billing account.
Step-2 : Deploy the DeltaMax Virtual Machine in your Organization
Provision and configure the DeltaMax VM within your Organization settings, and verify the following:
VPC
Subnet
Zones
Firewalls
Step-3 : Update the VM and install Python prerequisites
Ensure the VM's packages are up to date; if required, install the latest python3-full package and the venv module (for example, sudo apt update && sudo apt install python3-full python3-venv on Debian/Ubuntu images).
All DeltaMax V2 scripts are already integrated with:
Google Cloud Storage (GCS) upload logic
BigQuery table loading logic
Automatic dataset/table creation (if configured)
Once the scripts are executed, outputs are:
Uploaded to GCS
Loaded into BigQuery
Immediately available for reporting in Looker Studio
⚠️ Important:
Before running in a new environment, you must update the cloud credentials/config variables inside all scripts.
Required Configuration Variables (Present in Scripts)
The scripts reference the Google Cloud Project ID, the BigQuery dataset and table names, and the Cloud Storage bucket. Wherever these values appear inside any file, they must be changed to match your environment.
All subsequent commands are run from inside your project folder.
Step-4 : Synthetic Multi-Period Data Generation (DeltaMax_synthetic_data_generator.py)
This step generates synthetic datasets from January to August, where January–June act as Historical (H), July is the Previous period (A), and August is the Current period (B). The August dataset includes controlled variations to simulate drift and anomalies for H–A–B risk evaluation.
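The H–A–B assignment and drift injection can be sketched as follows. The schema (a single revenue column) and the 1.3× drift factor are illustrative assumptions, not the generator's actual parameters:

```python
import random

# Hypothetical schema: each record is {month, period, revenue}. The real
# generator's columns and drift parameters may differ.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug"]

def period_label(month: str) -> str:
    """Map a month to its H-A-B role: Jan-Jun = Historical (H),
    Jul = Previous (A), Aug = Current (B)."""
    if month == "Aug":
        return "B"
    if month == "Jul":
        return "A"
    return "H"

def generate(month: str, n: int = 100, seed: int = 7) -> list[dict]:
    rng = random.Random(f"{seed}-{month}")
    # Inject controlled drift into the Current period (B) only.
    shift = 1.3 if period_label(month) == "B" else 1.0
    return [
        {"month": month,
         "period": period_label(month),
         "revenue": rng.gauss(100, 10) * shift}
        for _ in range(n)
    ]

data = [row for m in MONTHS for row in generate(m)]
```

The key property is that only the August (B) slice carries the injected deviation, so downstream comparisons against H and A have a known drift signal to recover.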
Step-5 : Dataset Profiling & Structural Summary
This step scans all generated monthly datasets (January–August) and produces a consolidated structural summary, including row counts, column counts, storage size, and month-to-month entity churn.
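A minimal profiling pass of this kind might look like the sketch below, assuming each month's data is held as a list of records keyed by a business_id column (the entity column name is an assumption):

```python
# Minimal structural-summary sketch: row counts, column counts, and
# month-to-month entity churn (symmetric difference of entity IDs).
def profile(datasets: dict[str, list[dict]]) -> list[dict]:
    summary, prev_ids = [], None
    for month, rows in datasets.items():
        ids = {r["business_id"] for r in rows}
        churn = len(ids ^ prev_ids) if prev_ids is not None else 0
        summary.append({
            "month": month,
            "rows": len(rows),
            "columns": len(rows[0]) if rows else 0,
            "entity_churn_vs_prev": churn,
        })
        prev_ids = ids
    return summary

demo = {
    "Jul": [{"business_id": 1, "x": 5}, {"business_id": 2, "x": 6}],
    "Aug": [{"business_id": 2, "x": 7}, {"business_id": 3, "x": 8}],
}
report = profile(demo)
```

Storage size would be added from the file system (e.g. os.path.getsize per dataset file); it is omitted here to keep the sketch self-contained.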
Step-6 : H–A–B Data Consolidation & BigQuery Load (03_hab.py)
This step combines all monthly datasets into a single structured H–A–B master file to streamline downstream drift analysis, anomaly detection, and Trust Score computation.
Important Configuration Note : BigQuery integration credentials are defined directly inside the script.
Before running in your environment, open the file using nano or vim and update the project configuration accordingly.
If deploying in a different environment, make sure to update:
Google Cloud Project ID
BigQuery Dataset name
Target Table name
Cloud Storage Bucket
Failure to update these values will result in data being uploaded to the wrong project.
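These values typically sit near the top of the script as constants; the placeholder names below are hypothetical and should be matched to the actual variables defined in the file:

```python
# Hypothetical placeholder names -- match them to the actual variables
# inside 03_hab.py before running.
GCP_PROJECT_ID = "your-gcp-project-id"   # Google Cloud Project ID
BQ_DATASET = "deltamax_analytics"        # BigQuery dataset name
BQ_TABLE = "hab_master"                  # target table name
GCS_BUCKET = "your-deltamax-bucket"      # Cloud Storage bucket
```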
Step-7 : Trust Score Computation (04_Trust_Score.py)
Executes anomaly detection, data integrity validation, and multi-period drift analysis (H–A–B) to quantify dataset stability and risk exposure.
It aggregates anomaly health, drift health, and business rule compliance into a weighted Trust Score (0–100), providing a single interpretable risk metric for the current month.
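The aggregation can be sketched as a weighted average of the three component healths; the weights below are illustrative assumptions, since the script's actual weighting is not documented here:

```python
def trust_score(anomaly_health: float, drift_health: float,
                rule_compliance: float,
                weights=(0.4, 0.35, 0.25)) -> float:
    """Combine component healths (each in [0, 1]) into a 0-100 score.
    The weights are illustrative, not DeltaMax's actual configuration."""
    components = (anomaly_health, drift_health, rule_compliance)
    if not all(0.0 <= c <= 1.0 for c in components):
        raise ValueError("component healths must be in [0, 1]")
    return round(100 * sum(w * c for w, c in zip(weights, components)), 2)

score = trust_score(0.84, 0.78, 0.95)
```

Normalizing each component to [0, 1] before weighting is what makes the final 0–100 score directly interpretable as a single risk metric.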
Step-8 : Isolation Forest & IQR Anomaly Detection (M1_HAB_ISO_IQR.py)
Runs Isolation Forest and IQR-based outlier detection across H–A–B datasets to identify global and statistical anomalies in entity behavior.
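The IQR half of this step follows Tukey's fences; a stdlib-only sketch is below (the Isolation Forest half would additionally require scikit-learn's sklearn.ensemble.IsolationForest):

```python
def iqr_outliers(values: list[float], k: float = 1.5) -> list[float]:
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule).
    Quartiles via linear interpolation on the sorted data."""
    s = sorted(values)
    def quantile(q: float) -> float:
        pos = q * (len(s) - 1)
        lo, frac = int(pos), pos - int(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + frac * (s[hi] - s[lo])
    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

flagged = iqr_outliers([10, 11, 12, 11, 10, 12, 11, 95])
```

The contrast between the two methods matters later in the dashboards: IQR is a per-column statistical rule, while Isolation Forest scores whole records, which is why their anomaly counts diverge in the figures below.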
Step-9 : Variance Drift Analysis (M2_HAB_VARIANCE.py)
Measures variance shifts across H–A–B periods to detect distribution instability and structural data changes.
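A variance-shift measure between two periods can be sketched as below; the use of population variance and of a simple absolute/relative shift is an assumption about the module's statistic:

```python
from statistics import pvariance

def variance_shift(hist: list[float], curr: list[float]) -> dict:
    """Absolute and relative change in population variance between periods."""
    v_h, v_c = pvariance(hist), pvariance(curr)
    return {
        "hist_var": v_h,
        "curr_var": v_c,
        "abs_shift": v_c - v_h,
        # Relative shift; guard against a zero-variance baseline.
        "rel_shift": (v_c - v_h) / v_h if v_h else float("inf"),
    }

shift = variance_shift([10, 12, 11, 13], [5, 25, 1, 30])
```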
Step-10 : Statistical T-Test Drift Detection (M3_HAB_T_TEST.py)
Performs Welch’s T-test across periods to detect statistically significant mean shifts between Historical, Previous, and Current datasets.
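Welch's t statistic can be computed from the sample means and variances alone; a stdlib sketch follows (p-values would additionally need the Welch–Satterthwaite degrees of freedom and a t-distribution, e.g. scipy.stats.ttest_ind with equal_var=False):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic: compares means without assuming equal
    variances or equal sample sizes."""
    va, vb = variance(a), variance(b)       # sample variances
    se = sqrt(va / len(a) + vb / len(b))    # std. error of the mean difference
    return (mean(a) - mean(b)) / se

t = welch_t([1, 2, 3], [4, 5, 6])
```

Welch's variant is the right default here because the H, A, and B datasets have very different sizes (six months vs one month each).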
Step-11 : Missing Value Anomaly Detection (M4_missing_anomalies.py)
Identifies abnormal missingness patterns in the current dataset compared to historical benchmarks.
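One simple form of this benchmark comparison: flag any column whose current missing-rate differs from its historical rate by more than a threshold. The 10% absolute threshold below is an illustrative assumption:

```python
def missing_rate_flags(hist_rates: dict[str, float],
                       curr_rates: dict[str, float],
                       threshold: float = 0.10) -> list[str]:
    """Flag columns whose current missing-rate deviates from the
    historical benchmark by more than `threshold` (absolute difference).
    The 10% default is an illustrative assumption."""
    return [col for col, h in hist_rates.items()
            if abs(curr_rates.get(col, 0.0) - h) > threshold]

flags = missing_rate_flags(
    {"revenue": 0.02, "region": 0.05, "year": 0.00},
    {"revenue": 0.30, "region": 0.06, "year": 0.00},
)
```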
Step-12 : Population Stability Index (PSI) Analysis (M5_PSI.py)
Calculates PSI scores to quantify distribution drift between H–A–B datasets and measure population stability.
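The PSI formula itself is standard; here is a sketch over pre-binned proportions. Note that the module's reported scores (e.g. 20.54 in Figure 14) suggest a different scale or binning than this raw form, which typically yields values well below 1:

```python
from math import log

def psi(expected: list[float], actual: list[float],
        eps: float = 1e-6) -> float:
    """Population Stability Index over matched bin proportions:
    PSI = sum((a_i - e_i) * ln(a_i / e_i)). A common reading of the
    raw score: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant.
    `eps` avoids log(0) for empty bins."""
    return sum((a - e) * log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

score = psi([0.5, 0.5], [0.7, 0.3])
```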
Step-13 : Decimal Formatting Mismatch Detection (M6_Decimal_Formatting_Mismatches.py)
Detects numeric precision and decimal formatting inconsistencies between historical and current datasets.
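A simple version of this check compares each value's decimal precision against the dominant historical precision; the majority-vote baseline here is an assumption about the module's actual rule:

```python
from collections import Counter

def decimal_places(value: str) -> int:
    """Digits after the decimal point in a string-formatted number."""
    _, _, frac = value.partition(".")
    return len(frac)

def formatting_mismatches(hist: list[str], curr: list[str]) -> list[str]:
    """Flag current values whose decimal precision differs from the
    most common historical precision (majority-vote baseline)."""
    baseline = Counter(decimal_places(v) for v in hist).most_common(1)[0][0]
    return [v for v in curr if decimal_places(v) != baseline]

bad = formatting_mismatches(["10.25", "11.40", "9.99"],
                            ["12.50", "13.125", "14"])
```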
Step-14 : String Length Anomaly Detection
Validates string field consistency by detecting abnormal length deviations across structured text attributes.
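A z-score on string lengths is one way to implement this; the 3-sigma cutoff below is an illustrative assumption:

```python
from statistics import mean, pstdev

def length_outliers(hist: list[str], curr: list[str],
                    z: float = 3.0) -> list[str]:
    """Flag current strings whose length is more than `z` standard
    deviations from the historical mean length (z=3 is an assumption)."""
    lengths = [len(s) for s in hist]
    mu, sigma = mean(lengths), pstdev(lengths)
    if sigma == 0:
        # Uniform historical lengths: any deviation at all is anomalous.
        return [s for s in curr if len(s) != mu]
    return [s for s in curr if abs(len(s) - mu) / sigma > z]

hist = ["ACME Ltd", "Bolt Inc", "Zen Corp"]   # all length 8
odd = length_outliers(hist, ["Nova Gmb", "A" * 60])
```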
Step-15 : Unique Business Integrity Check (M8_unique_business.py)
Ensures entity uniqueness and detects duplicate or conflicting business identifiers across H–A–B datasets.
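A minimal uniqueness check of this kind, assuming a single business-key column per dataset:

```python
from collections import Counter

def duplicate_ids(ids: list[str]) -> list[str]:
    """Identifiers appearing more than once in a dataset -- a uniqueness
    violation for a business-key column."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)

dupes = duplicate_ids(["B1", "B2", "B2", "B3", "B1"])
```

Running this per period (H, A, B) separately distinguishes within-period duplicates, which are integrity violations, from the same business legitimately appearing in multiple periods.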
Visualization & Agentic Intelligence
Step-24 : Visualization through Looker Studio
Note: This visualization is an example of how the data can be presented in Looker Studio using our generated datasets. It is intended to illustrate potential insights and patterns rather than represent finalized outputs.
For customized dashboards and tailored reporting, please reach out to Katalyststreet to help visualize your outputs effectively.
Figure 1 shows that the highest number of anomalies appears in H vs B (103,459), followed by H vs A (96,686) and A vs B (95,825), indicating stronger deviations when Historical data is compared with B.
Figure 2 shows that H vs B has the highest anomaly rate at 94.63%, followed by H vs A at 92.68%, and A vs B at 87.65%, confirming that distribution shifts are strongest between Historical and B.
Figure 3 shows that H vs B has the highest anomaly count (29), while H vs A (2) and A vs B (1) show very few anomalies, indicating limited model-detected deviations outside the Historical vs B comparison.
Figure 4 shows H vs B has the highest rate at 2.65%, whereas H vs A (0.19%) and A vs B (0.09%) remain very low, suggesting that most records across these comparisons are considered normal by the model.
Figure 5 highlights the key features contributing to anomalies when comparing Historical vs B. Both methods identify similar drivers, but IQR generally assigns higher impact severity than Isolation Forest, showing that IQR is more sensitive while Isolation Forest provides a more conservative view of feature influence.
Figure 6 shows that variance shifts across columns differ by comparison. The year column stands out with very large changes — 105,008.83 between Historical vs Current Month — while most other columns remain relatively stable.
Figure 7 shows that Historical vs Current Month has the highest rate (81.58%), followed by Historical vs Previous Month (73.68%), while Previous vs Current Month is lower (47.37%). This indicates stronger distribution shifts when comparing against Historical data.
Figure 8 shows that Historical vs Current Month exhibits the largest severity (30.91), with lower values for Historical vs Previous Month (17.48) and Previous vs Current Month (12.43). This confirms that the strongest deviations occur between Historical and Current Month.
Figure 9 shows that the average number of missing fields has dropped sharply from 5.44 in Historical data to 0.21 in the Current Month, indicating improved data completeness.
Figure 10 shows that businesses with null values have reduced significantly, from 113,875 historically to 17,240 in the Current Month, reflecting better data quality.
Figure 11 shows anomaly health scores across datasets. Historical (H) has the highest score (0.98), followed by Previous Month (A) at 0.94, and Current Month (B) at 0.84, indicating slightly reduced data health in the Current Month.
Figure 12's Health Comparison panel shows that health scores are distributed across datasets with H (35.5%), A (34.1%), and B (30.4%), indicating relatively balanced health but a slightly lower share for B.
Comparison by IQR Flag shows that anomalies flagged by the IQR method are concentrated in B (50.5%), followed by A (37.3%), while H contributes the least (12.2%).
Comparison by Missing Flag shows that missing data is entirely concentrated in B (100%), with no missing values in H or A.
Comparison by Risk Score shows that risk scores are highest in B (67.6%), moderate in A (24.4%), and minimal in H (8%), suggesting elevated risk in the current dataset.
Figure 13 shows that anomalies detected by the K-Nearest Neighbours method vary across datasets, with B having the highest anomaly rate (4.33%), A moderate (3.36%), and H the lowest (1%).
Figure 14 shows PSI scores of 20.54 (Historical vs Current), 13.3 (Historical vs Previous), and 5.22 (Previous vs Current), indicating significant distribution shifts, with the largest change observed between historical and current data.