05. Threat Hunting with Machine Learning

Cyber threats are evolving rapidly, with sophisticated attack vectors targeting enterprise networks, cloud infrastructures, and IoT devices. Traditional rule-based security solutions, such as signature-based Intrusion Detection Systems (IDS), struggle to detect zero-day attacks and advanced persistent threats (APTs). This is where machine learning (ML) and AI-driven threat hunting come into play, allowing cybersecurity professionals to detect anomalies, correlate behaviors, and analyze massive security logs in real time.

In this article, we will explore how Python can be leveraged to build machine learning models for proactive threat hunting. The focus will be on three core areas:

Anomaly Detection with Scikit-Learn – Using unsupervised learning to detect unusual network activity.
Behavioral Analysis & Event Correlation – Identifying malicious user behavior and correlating attack indicators.
NLP for Threat Detection in Logs – Leveraging natural language processing (NLP) to analyze security logs for attack signatures.

Each section provides simplified, worked code examples, practical use cases, and explanations of common pitfalls in ML-based threat detection.

1. Anomaly Detection with Scikit-Learn

Why Anomaly Detection in Threat Hunting?

Anomalies in network traffic or user behavior often indicate potential cyber threats. While traditional security tools rely on static rules, machine learning-based anomaly detection can uncover previously unknown attack patterns by identifying deviations from normal behavior.

Common cybersecurity anomalies include:

Unusual login patterns (e.g., accessing a system from two distant locations within minutes).
Uncommon data transfers (e.g., massive data exfiltration from a critical server).
Network traffic spikes (e.g., sudden high traffic from a single IP, signaling a DDoS attempt).
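
The first pattern above is often called "impossible travel". A minimal sketch of how it could be checked, assuming each login has already been geolocated to a latitude and longitude (the 900 km/h ceiling, the function names, and the sample logins are illustrative assumptions, not part of any particular product):

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 900.0  # Assumed ceiling, roughly airliner cruising speed

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_impossible_travel(login_a, login_b, max_kmh=MAX_PLAUSIBLE_KMH):
    """Return True if moving between the two logins would exceed max_kmh.

    Each login is a (timestamp, latitude, longitude) tuple.
    """
    (t1, lat1, lon1), (t2, lat2, lon2) = login_a, login_b
    hours = abs((t2 - t1).total_seconds()) / 3600
    distance = haversine_km(lat1, lon1, lat2, lon2)
    if hours == 0:
        # Simultaneous logins from different places are always implausible
        return distance > 0
    return distance / hours > max_kmh

# Hypothetical logins: London, then New York ten minutes later
london = (datetime(2024, 1, 1, 12, 0), 51.5074, -0.1278)
new_york = (datetime(2024, 1, 1, 12, 10), 40.7128, -74.0060)
print(is_impossible_travel(london, new_york))
```

In practice, VPNs and mobile carriers can make IP geolocation unreliable, so a check like this is better treated as one signal among several than as a verdict.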

Building an Anomaly Detection Model

We’ll use Scikit-Learn’s Isolation Forest algorithm to detect anomalies in network traffic logs.

Step 1: Load the Dataset

import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Load sample network traffic dataset
df = pd.read_csv("network_traffic_logs.csv")

# Model inputs: packet size and connection duration
# (IP addresses are categorical and would need encoding before use)
X = df[['packet_size', 'connection_duration']]

# Standardize features to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Step 2: Train the Isolation Forest Model

# Train Isolation Forest for anomaly detection
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
model.fit(X_scaled)

# Predict anomalies (-1 indicates an anomaly)
df['anomaly'] = model.predict(X_scaled)
df['anomaly'] = df['anomaly'].apply(lambda x: 'Anomaly' if x == -1 else 'Normal')

# Display flagged anomalies
print(df[df['anomaly'] == 'Anomaly'])

Interpreting the Results

Normal traffic is labeled as “Normal”.
Anomalous activities (e.g., large data transfers, long connection durations) are flagged as “Anomaly”.
Security teams can use this data to investigate potential intrusions, insider threats, or malware infections.
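
A binary label does not say how unusual a record is. As a rough sketch, the records can instead be ranked by the model's anomaly score so that analysts start with the most extreme ones. The data below is synthetic and made up for illustration, with two columns standing in for packet_size and connection_duration; scaling is skipped to keep the sketch short.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-in for [packet_size, connection_duration]; all values are made up
normal = rng.normal(loc=[500.0, 30.0], scale=[50.0, 5.0], size=(500, 2))
injected = np.array([[5000.0, 600.0], [4200.0, 450.0], [6100.0, 720.0]])
X = np.vstack([normal, injected])  # rows 500-502 are the injected outliers

model = IsolationForest(n_estimators=100, random_state=42)
model.fit(X)

# decision_function: lower means more anomalous; negative values fall on
# the anomaly side of the model's threshold
scores = model.decision_function(X)

# Indices of the five most anomalous rows, most anomalous first
top5 = np.argsort(scores)[:5]
for idx in top5:
    print(f"row {idx}: score={scores[idx]:.3f}, features={X[idx]}")
```

Sorting by score also makes it possible to review a fixed number of records per day rather than everything the model labels as anomalous.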

Common Pitfalls & Improvements

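False Positives: The contamination parameter sets the share of records the model will flag (0.05 flags roughly the 5% most isolated records, whether or not they are malicious). Lower it, or threshold on the raw anomaly scores, if analysts are flooded with benign alerts.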
Feature Selection: Include more features, such as protocol type, port numbers, and packet timestamps, for better detection.
Real-time Monitoring: Integrate the model into a SIEM system to trigger alerts on anomalies.
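
For the real-time case, one possible arrangement is to fit the scaler and the model once on a baseline of known traffic and then score each new event as it arrives. The sketch below assumes a synthetic baseline and a hypothetical score_event helper; forwarding the result to a SIEM is left out because it depends on the product in use.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic baseline of [packet_size, connection_duration]; values are made up
baseline = rng.normal(loc=[500.0, 30.0], scale=[50.0, 5.0], size=(1000, 2))

# Fit scaler and model together so new events get the same scaling as the baseline
detector = make_pipeline(
    StandardScaler(),
    IsolationForest(n_estimators=100, contamination="auto", random_state=42),
)
detector.fit(baseline)

def score_event(packet_size, connection_duration):
    """Return (label, score) for one new event; label is -1 for an anomaly."""
    row = np.array([[packet_size, connection_duration]])
    label = int(detector.predict(row)[0])
    score = float(detector.decision_function(row)[0])
    return label, score

print(score_event(505.0, 31.0))      # typical-looking event
print(score_event(50000.0, 3600.0))  # far outside the baseline range
```

Wrapping both steps in a single pipeline avoids a common mistake: re-fitting the scaler on incoming data, which would shift what counts as "normal" with every batch.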

2. Behavioral Analysis & Event Correlation

Why is Behavioral Analysis Important?

Instead of relying solely on signature-based detection, behavioral analysis tracks user actions over time to identify potential threats.

Common malicious behaviors include:

Lateral movement (e.g., a compromised account accessing multiple servers abnormally).
Repeated privilege escalations (e.g., an attacker attempting to gain admin access).
Abnormal file access patterns (e.g., downloading large amounts of sensitive data).

Building a User Behavior Model

We’ll use log data to analyze user authentication patterns.

Step 1: Load Authentication Logs

df = pd.read_csv("auth_logs.csv")

# Expected columns:
# timestamp, user, action (Login, Logout, Failed Login), source_ip

df['timestamp'] = pd.to_datetime(df['timestamp'])

Step 2: Detect Suspicious Login Attempts

# Count failed login attempts per user (across the whole log, regardless of time)
failed_attempts = df[df['action'] == 'Failed Login']['user'].value_counts()

# Flag users with excessive failed attempts
suspicious_users = failed_attempts[failed_attempts > 5]
print("Suspicious Users:", suspicious_users)
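
A total count over the whole log treats six failures in one minute the same as six failures spread over a month. A sketch of a time-aware variant, using only the standard library and assuming events arrive sorted by timestamp (the find_bursts name, the five-minute window, and the sample events are illustrative):

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def find_bursts(events, window=timedelta(minutes=5), threshold=5):
    """Return users with more than `threshold` failed logins inside any sliding window.

    `events` is an iterable of (timestamp, user, action) tuples sorted by timestamp.
    """
    recent = defaultdict(deque)  # user -> timestamps of recent failures
    flagged = set()
    for ts, user, action in events:
        if action != "Failed Login":
            continue
        q = recent[user]
        q.append(ts)
        # Drop failures that have fallen out of the window
        while ts - q[0] > window:
            q.popleft()
        if len(q) > threshold:
            flagged.add(user)
    return flagged

# Hypothetical events: alice fails 6 times in one minute, bob fails 6 times over 6 hours
start = datetime(2024, 1, 1, 9, 0)
events = [(start + timedelta(seconds=10 * i), "alice", "Failed Login") for i in range(6)]
events += [(start + timedelta(hours=i), "bob", "Failed Login") for i in range(6)]
events.sort(key=lambda e: e[0])
print(find_bursts(events))
```

Both users have the same total, but only alice exceeds the threshold within a single window.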

Event Correlation for Threat Hunting

To correlate multiple attack indicators, we use a graph-based approach:

Step 3: Graph-Based Threat Correlation

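import networkx as nx

G = nx.Graph()

# Add one edge per distinct (user, source IP) pair, tagging each node with
# its type so users and IPs can be told apart later
for user, ip in df[['user', 'source_ip']].drop_duplicates().itertuples(index=False):
    G.add_node(user, kind='user')
    G.add_node(ip, kind='ip')
    G.add_edge(user, ip)

# Flag users seen from many distinct IPs, and IPs used by many distinct users
THRESHOLD = 10  # Tune to your environment
for node, kind in G.nodes(data='kind'):
    if G.degree(node) > THRESHOLD:
        print(f"Potential threat detected: {kind} {node} (degree {G.degree(node)})")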
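
Degree alone looks at one node at a time. Correlation usually means grouping everything that is linked, directly or indirectly, into a single finding. The sketch below does this with a plain breadth-first search over user and IP pairs, using only the standard library; the correlated_groups name and the sample pairs are hypothetical.

```python
from collections import defaultdict, deque

def correlated_groups(pairs):
    """Group users and IPs that are linked, directly or indirectly, by shared logins."""
    neighbors = defaultdict(set)
    for user, ip in pairs:
        u, i = ("user", user), ("ip", ip)
        neighbors[u].add(i)
        neighbors[i].add(u)

    seen, groups = set(), []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        queue, group = deque([start]), set()
        # Breadth-first search collects everything reachable from `start`
        while queue:
            node = queue.popleft()
            group.add(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        groups.append(group)
    return groups

# Hypothetical (user, source_ip) pairs
pairs = [
    ("alice", "198.51.100.7"),
    ("bob", "198.51.100.7"),
    ("bob", "203.0.113.9"),
    ("carol", "10.0.0.5"),
]
for group in correlated_groups(pairs):
    print(sorted(group))
```

Here alice and bob end up in one group because they share an IP, so an alert on either account can be investigated together with the other.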

Why Is Behavioral Analysis Game-Changing?

Helps Detect APTs & Insider Threats – Gradual changes in behavior can reveal stealthy activity that no single signature matches.
Reduces Alert Fatigue – Correlating related events into one finding gives analysts fewer, higher-context alerts than a stream of isolated ones.

3. NLP for Threat Detection in Logs

Why NLP for Security Log Analysis?

Security logs contain unstructured textual data that can be analyzed using Natural Language Processing (NLP) to detect cyber threats.

Common use cases include:

Detecting phishing emails via text analysis.
Identifying brute-force attacks in logs.
Extracting attack patterns from SIEM alerts.

Step 1: Preprocessing Log Data

import re
from sklearn.feature_extraction.text import CountVectorizer

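# Sample log lines (in practice, read these from a file or SIEM export)
logs = [
    "Failed login attempt from IP 192.168.1.10",
    "Brute-force detected: 100 failed attempts from user admin",
    "User login success from IP 10.10.10.5"
]

IP_PATTERN = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

# Clean logs: lowercase, mask IPs with a placeholder token so they are not
# mangled into digit strings, then replace remaining punctuation with spaces
def preprocess(log):
    log = IP_PATTERN.sub('ipaddr', log.lower())
    return re.sub(r'[^\w\s]', ' ', log)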

logs_cleaned = [preprocess(log) for log in logs]

Step 2: Detecting Attack Keywords

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(logs_cleaned)

# Display the learned vocabulary (candidate keywords, not yet a detection)
print(vectorizer.get_feature_names_out())
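
The vocabulary by itself does not flag anything. One simple way to turn it into a signal is to count hits against a watchlist of suspicious terms. This is a sketch using only the standard library; the WATCHLIST contents and the keyword_hits helper are hypothetical and would need to reflect your own log sources.

```python
import re
from collections import Counter

# Hypothetical watchlist; a real one would come from your own detections
WATCHLIST = {"failed", "brute", "unauthorized", "denied"}

def keyword_hits(log):
    """Count watchlist terms in one log line (case-insensitive, word-level)."""
    tokens = re.findall(r"[a-z]+", log.lower())
    return Counter(t for t in tokens if t in WATCHLIST)

logs = [
    "Failed login attempt from IP 192.168.1.10",
    "Brute-force detected: 100 failed attempts from user admin",
    "User login success from IP 10.10.10.5",
]

for log in logs:
    hits = keyword_hits(log)
    print(sum(hits.values()), dict(hits), "|", log)
```

Keyword matching is brittle (a line such as "0 failed attempts" still counts as a hit), so it works best as a coarse filter ahead of a trained classifier or analyst review.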

Advanced NLP: Using Named Entity Recognition (NER) for Threat Detection

import spacy

nlp = spacy.load("en_core_web_sm")

log = "Unauthorized root access attempt from IP 203.0.113.45"
doc = nlp(log)

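# Extract entities. Note: en_core_web_sm is a general-purpose English model,
# not one trained on security logs, so IPs and usernames may come back
# unlabeled or with generic labels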
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")
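
Because a general-purpose NER model is not trained to recognize indicators such as IP addresses, a rule-based extractor is a common complement. The sketch below pulls IPv4 addresses out of a log line with a regular expression and validates them with the standard library's ipaddress module; the extract_ipv4 name is illustrative.

```python
import ipaddress
import re

# Candidate dotted quads; ipaddress then rejects out-of-range octets such as 999
IPV4_CANDIDATE = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?!\d|\.\d)")

def extract_ipv4(log):
    """Return the valid IPv4 addresses found in a log line, in order of appearance."""
    found = []
    for match in IPV4_CANDIDATE.findall(log):
        try:
            found.append(str(ipaddress.IPv4Address(match)))
        except ipaddress.AddressValueError:
            continue  # Looked like an IP but is not a valid one
    return found

log = "Unauthorized root access attempt from IP 203.0.113.45"
print(extract_ipv4(log))
```

The same idea extends to other indicators (hashes, domains, usernames), each with its own pattern and validation step.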

Machine learning is changing how threat hunting is done, helping security teams move from reactive to proactive defense. Anomaly detection, behavioral analysis, and NLP can each surface activity that static signatures miss, including sophisticated attacks, insider threats, and previously unseen exploits.

As cyber threats continue to evolve, machine learning-powered cybersecurity solutions will be critical in detecting, analyzing, and mitigating modern attacks. The Python-based techniques outlined in this article provide a strong foundation for building AI-driven threat intelligence platforms that can protect organizations in real time.