DDoS Predictive Risk – Project Documentation

Automatic Risk Calculation & Loss Forecasting (Component 4)

1. Project overview

This project implements Component 4: automatic risk calculation, predictive loss forecasting, and dynamic USD-to-local-currency conversion for DDoS risk. The backend aggregates attack data, runs an ARIMA forecast, fetches exchange rates, and computes risk and loss. The frontend dashboard shows key metrics, risk forecast charts, financial impact, reporting, what-if simulation, and alerts.

Risk formula: Risk (Loss) = Probability × Cost_per_Attack_USD × Exchange_Rate

2. Project structure & files

PathRole
backend/app/main.pyFastAPI app: risk overview, report, alerts, export CSV, what-if, manual attack, demo load
backend/app/config.pySettings: dataset dir, currencies, cost per attack, history/forecast days, min training points
backend/app/data_loader.pyLoad CICDDoS CSVs, hourly counts, aggregate daily/hourly, demo series generator, merge manual attacks
backend/app/forecast.pyARIMA training and forecast; probability of high-attack day
backend/app/risk_engine.pyExchange-rate fetch, forecast timestamp builder
backend/app/schemas.pyPydantic models for API request/response
notebooks/01_clean_cic_ddos_dataset.ipynbClean CIC-DDoS2019 CSVs → cleaned_events.csv, hourly_attack_counts.csv
notebooks/02_train_and_evaluate_models.ipynbClassification (RF, XGBoost) and forecasting (ARIMA, XGBoost); metrics and plots
scripts/run_train_and_evaluate.pyCLI script: same classification + ARIMA forecasting pipeline, writes results to results/
frontend/src/React (Vite): Dashboard, Risk Forecast, Financial Impact, Reporting, What-if, Alerts, Data & Model
data/CICDDoS2019/Raw CIC-DDoS2019 CSV files (LDAP, MSSQL, NetBIOS, Portmap, Syn, UDP, UDPLag, etc.)
data/cleaned/Output of notebook 01: cleaned_events.csv, hourly_attack_counts.csv
results/Charts: metrics bar, confusion matrix, forecast actual vs predicted, recall per class
models/Saved models and model_metadata.json

3. Dataset used

CIC-DDoS2019 is used for both classification and time-series forecasting.

4. Notebooks

01_clean_cic_ddos_dataset.ipynb

Loads all CSVs from data/CICDDoS2019/, normalizes columns, parses Timestamp and Label, drops duplicates (by Timestamp + Label), sorts by time. Writes:

Configurable MAX_ROWS_PER_FILE limits rows per CSV or uses full files.

02_train_and_evaluate_models.ipynb

Two main parts:

5. How the models were created

6. Models used, hyperparameters & why

ModelUseHyperparametersWhy
Random Forest Classification (attack type) n_estimators=100, max_depth=15, random_state=42, n_jobs=-1 Robust, less overfitting with depth limit; baseline for comparison with XGBoost.
XGBoost (XGBClassifier) Classification (attack type) n_estimators=100, max_depth=8, random_state=42, eval_metric="mlogloss" Typically better accuracy/precision/recall on tabular data; handles mixed features and class imbalance well.
ARIMA Time-series forecast (attack count) order=(1,1,1) Simple, interpretable; one lag and one MA term with differencing (I=1) for non-stationary counts.
XGBoost (XGBRegressor) Forecasting (notebook/script only) n_estimators=100, max_depth=6, random_state=42 Compare to ARIMA; better at capturing spikes/non-linear patterns in the evaluation plot.

Other settings: train_test_split test_size=0.25, stratify=y; StandardScaler fit on train only; RANDOM_STATE=42 / random_state=42 for reproducibility. Backend config (env): HISTORY_DAYS=90, FORECAST_DAYS=10, MIN_TRAINING_POINTS=5, AVG_COST_PER_ATTACK_USD=500, BASE_CURRENCY, TARGET_CURRENCY (e.g. LKR).

7. How components are connected

8. How the model works in the app

The live dashboard uses only the ARIMA-based forecasting path (no classification in the API).

  1. Backend loads last 90 days of attack counts (daily or hourly aggregated to daily for display).
  2. If < 30 points: forecast = constant (median of last 8). If ≥ 30: ARIMA(1,1,1) is fitted and forecasts the next 10 days.
  3. Forecast timestamps are the 10 days starting the day after the last history date.
  4. Probability of “high attack day” = fraction of forecast days above the 75th percentile of history.
  5. Risk score = 0.55 × (probability of high attack) + 0.45 × (expected probability); scaled 0–100. Level: Low / Medium / High / Critical.
  6. Loss per day = attack_probability × AVG_COST_PER_ATTACK_USD × exchange rate (USD → LKR or other).

What-if uses the same baseline (expected daily loss from overview); user inputs cost per attack, mitigation %, and budget; backend returns projected loss and savings.

9. Model performance

Classification and forecasting performance are evaluated in the notebooks and script and reflected in the results/ images.

Classification metrics (Accuracy, Precision, Recall, F1)

Both Random Forest and XGBoost achieve high scores; XGBoost is typically slightly better across all four metrics (e.g. ~0.97–0.98 vs ~0.95 for RF).

Classification metrics: Accuracy, Precision, Recall, F1 – Random Forest vs XGBoost

Figure 1: Classification metrics (Accuracy, Precision, Recall, F1) – Random Forest vs XGBoost.

Confusion matrix (XGBoost)

Shows true vs predicted labels for the seven classes (BENIGN, LDAP, MSSQL, NetBIOS, Portmap, Syn, UDP). Diagonal = correct; off-diagonal = confusions. Notable confusion often between NetBIOS and Portmap.

Confusion Matrix (XGBoost)

Figure 2: Confusion Matrix (XGBoost).

Recall per class (XGBoost)

Recall for each attack/benign class. Most classes near 1.0; Portmap often lower (~0.81), consistent with NetBIOS/Portmap confusion.

Recall per class (XGBoost)

Figure 3: Recall per class (XGBoost).

Forecasting: Actual vs Predicted (ARIMA & XGBoost)

Time-series plot of actual attack count vs ARIMA and XGBoost predictions. ARIMA is smoother and can miss large spikes; XGBoost often tracks spikes better. The production API uses ARIMA for simplicity and stability.

Forecasting: Actual vs Predicted (ARIMA and XGBoost)

Figure 4: Forecasting – Actual vs Predicted (ARIMA & XGBoost).

10. Dashboard screens & images

This section references and explains the six dashboard screenshots. Keep these files in the same folder as index.html: Screenshot 2026-03-08 045215.png (Dashboard), 045218.png (Risk Forecast), 045224.png (Financial Impact), 045228.png (Reporting), 045238.png (What-if), 045242.png (Alerts).

Figure 5: Dashboard – Screenshot 2026-03-08 045215.png

Dashboard – key metrics and risk formula

Dashboard: key metrics and risk formula.

What it shows: The main dashboard with “DDoS Predictive Risk” and “Automatic Risk Calculation & Loss Forecasting.” The Key metrics block shows: Risk score (e.g. 100/100, Critical), High attack probability (e.g. 100.0% – probability of a high-activity period), Expected daily attacks (forecasted attack count average), Expected daily loss (USD) and Expected daily loss (Local) (e.g. LKR at the current USD/LKR rate). The Risk formula is stated as: Risk (Loss) = Probability × Cost per attack (USD) × Exchange rate, with cost per attack (e.g. $500) and the exchange-rate source (e.g. open.er-api.com). Header controls include currency/granularity badge, “Load demo,” and “Refresh.”

Explanation: This is the overview of predictive risk and financial impact. All numbers come from GET /api/risk/overview. The risk score and high attack probability are computed from the ARIMA forecast and history; expected daily attacks and losses use the same forecast and the risk formula with the live exchange rate. "Load demo" toggles demo data; "Refresh" refetches from the API.

Figure 6: Risk Forecast – Screenshot 2026-03-08 045218.png

Risk Forecast – attack count and probability charts

Risk Forecast: history (90d) + forecast (10d), attack count and probability.

What it shows: The “Risk Forecast” page with subtitle “Last 90 days (history) and next 10 days (forecast). Attack frequency and probability.” Chart 1 – “Attack count: History (90d) + Forecast (10d)”: a line chart with time on the x-axis (e.g. 9 Mar 26–19 Mar 26). One line is history (actual attack counts over the last 90 days); a second, dashed line is the forecast (next 10 days). With demo or sparse data the forecast can be a flat line (e.g. constant 5 attacks/day). Chart 2 – “Predicted attack probability (next 10 days)”: a line over the same 10-day window showing daily attack probability (0–1). Often high and flat (e.g. ~0.9) when the forecast is constant.

How it connects: Data is from overview.history and overview.forecast. History is backend-aggregated daily (or hourly→daily) counts; forecast is ARIMA (or median baseline when <30 points). Probability is derived from predicted_attacks and the 95th-percentile baseline.

Figure 7: Financial Impact – Screenshot 2026-03-08 045224.png

Financial Impact – loss USD and local

Financial Impact: exchange rate and loss in USD and LKR.

What it shows: The “Financial Impact” page: “Last 90 days (history) and next 10 days (forecast). Loss in USD and local currency.” An Exchange rate card shows the currency pair (e.g. USD/LKR), rate (e.g. 310.7199), and source URL. Two line charts: Loss (USD): History (90d) + Forecast (10d) – daily loss in dollars (y-axis), time on x-axis; and Loss (local currency): History (90d) + Forecast (10d) – same structure in LKR (or configured target). Tooltips on the lines show date and loss value (e.g. $500.00 or LKR 155,359.93).

How it connects: History and forecast loss come from the same overview: each day’s loss = attack_probability × cost_per_attack × exchange_rate. The backend uses fetch_exchange_rate() for live conversion; the frontend uses overview.history and overview.forecast for both charts.

Figure 8: Reporting – Screenshot 2026-03-08 045228.png

Reporting – executive summary and risk windows

Reporting: executive summary, loss bands, top risk windows.

What it shows: The “Reporting” page with three blocks. Executive summary: Risk score (e.g. 100/100, Critical), Expected daily loss ($500.00 USD), Expected 7-day loss ($3,500), Expected 30-day loss ($15,000), plus “Refresh report” and “Export CSV” buttons. Loss bands (planning): A table with Best case (e.g. $375/day, LKR 116,519.95), Expected ($500/day, LKR 155,359.93), Worst case (e.g. $625/day, LKR 194,199.91). Top risk windows: A list of future dates (e.g. 10 Mar 26–14 Mar 26, 00:00) with for each: predicted attacks (e.g. 5), probability (e.g. 100.0%), loss in USD and LKR. A “Reporting notes” section explains how risk and loss are computed.

How it connects: All from GET /api/risk/report. Loss bands are 0.75×, 1.0×, 1.25× of expected daily loss. Top risk windows are the forecast points sorted by loss (highest first), limited to five. Export CSV uses GET /api/risk/export.csv.

Figure 9: What-if Simulator – Screenshot 2026-03-08 045238.png

What-if Simulator

What-if Simulator: inputs and simulation result.

What it shows: The “What-if Simulator” page: “Adjust assumptions to estimate savings from mitigation and security budget.” Simulation inputs: Cost per attack (USD), Mitigation effectiveness (%), Monthly mitigation budget (USD), and a “Run simulation” button. Simulation result: After running, four cards show Baseline daily loss (e.g. $5,125), Projected daily loss (e.g. $4,100 after mitigation), Daily savings (e.g. $1,025), and Net 30-day savings (e.g. $29,750 – savings over 30 days minus the monthly budget).

How it connects: Frontend sends the three inputs to POST /api/risk/what-if (with optional ?demo=1). Backend uses the current overview’s expected daily loss to get a baseline probability, then applies the user’s cost per attack and mitigation % to compute projected loss and savings; net 30-day savings = (daily savings × 30) − budget.

Figure 10: Active Alerts – Screenshot 2026-03-08 045242.png

Active alerts

Active alerts: threshold-based alerts and current values.

What it shows: The “Alerts” page: “Threshold-based alerts for proactive risk monitoring.” An “Active alerts” card lists only the rules that are currently triggered. Each row has a severity bar (e.g. yellow=high, red=critical, blue=medium), the rule name (e.g. “High attack probability,” “High expected daily loss,” “Insufficient training points”), the threshold (e.g. ≥65%, ≥$350/day, <12 points), and the current value (e.g. 100.0%, $500.00/day, 1). A “Refresh alerts” button re-fetches from the API.

How it connects: Data from GET /api/risk/alerts. The backend builds alerts from the risk overview: high attack probability (probability_of_high_attack ≥ 0.65), high daily loss (expected_daily_loss_usd ≥ 350), insufficient training points (training_points < 12). Only triggered rules are returned, so the list is empty when no thresholds are exceeded.

11. Special code snippets

Risk formula and forecast (backend)

backend/app/main.py – building forecast points and loss:

probability_baseline = max(1.0, float(series["attack_count"].quantile(0.95)))
for ts, predicted_attacks in zip(forecast_timestamps, forecast_values):
    attack_probability = min(1.0, predicted_attacks / probability_baseline)
    loss_usd = attack_probability * settings.avg_cost_per_attack_usd
    loss_local = loss_usd * exchange.rate
    points.append(ForecastPoint(timestamp=..., predicted_attacks=..., attack_probability=..., loss_usd=..., loss_local=...))

ARIMA forecast and short-series fallback

backend/app/forecast.py – training and forecasting:

def train_arima_and_forecast(daily_counts: pd.DataFrame, forecast_days: int) -> list[float]:
    series = daily_counts["attack_count"].astype(float)
    if len(series) < 30:
        fallback = float(series.tail(min(8, len(series))).median()) if len(series) > 0 else 0.0
        return [max(0.0, fallback) for _ in range(forecast_days)]
    model = ARIMA(series, order=(1, 1, 1))
    fitted = model.fit()
    forecast = fitted.forecast(steps=forecast_days)
    return [max(0.0, float(v)) for v in forecast.tolist()]

Risk score and level

backend/app/main.py – score and level from probabilities:

def _compute_risk_score(probability_of_high_attack: float, expected_probability: float) -> int:
    score = (probability_of_high_attack * 0.55) + (expected_probability * 0.45)
    return int(round(max(0.0, min(1.0, score)) * 100))

def _risk_level(score: int) -> str:
    if score >= 75: return "Critical"
    if score >= 55: return "High"
    if score >= 35: return "Medium"
    return "Low"

Manual attack recording (API)

backend/app/main.py – in-memory manual attacks merged into series:

@app.post("/api/risk/attack", response_model=ManualAttackResponse)
def record_manual_attack(request: ManualAttackRequest) -> ManualAttackResponse:
    date_str = request.date or date.today().isoformat()
    _manual_attacks[date_str] = _manual_attacks.get(date_str, 0) + request.count
    return ManualAttackResponse(ok=True, date=date_str, count=request.count, message=...)

_merge_manual_attacks(series, granularity) adds these counts to the loaded series so overview/report reflect them.

Classification training (script)

scripts/run_train_and_evaluate.py – classifiers and metrics:

rf = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=RANDOM_STATE, n_jobs=-1)
xgb_model = xgb.XGBClassifier(n_estimators=100, max_depth=8, random_state=RANDOM_STATE, n_jobs=-1, eval_metric="mlogloss")
# ...
m_rf = metrics(y_test, y_pred_rf)   # Accuracy, Precision, Recall, F1
m_xgb = metrics(y_test, y_pred_xgb)
# Confusion matrix: confusion_matrix(y_test, y_pred_xgb)

Config (env-driven)

backend/app/config.py – main settings:

dataset_dir, base_currency, target_currency, avg_cost_per_attack_usd,
history_days=90, forecast_days=10, min_training_points=5

Back to contents