Insight - Documentation

Overview

Insight is a passive web threat scanner. Submit any URL and it fetches the page's public resources (HTML, scripts, HTTP headers, and the TLS certificate), then analyses them purely on their content. It makes no calls to reputation databases or external threat intelligence APIs.

Because detection is content-based, Insight catches zero-day campaigns, freshly registered phishing domains, and newly injected skimmers that reputation feeds haven't yet indexed. It is a companion tool to vault1337.com and shares the same design system.

The result of each scan is a prioritised findings report covering JavaScript threats, phishing indicators, domain intelligence, security misconfigurations, and the full detected technology stack — each finding categorised by severity (CRITICAL / HIGH / MEDIUM / LOW / INFO) with supporting evidence extracted directly from the page content.

Disclaimer: All scan results are provided for informational purposes only and must be independently verified by a trained security analyst before any action is taken. Findings may contain false positives or miss threats not covered by the current detection rules.

Requirements

Requirement	Version	Notes
Python	3.11+	Global or virtual environment
Node.js	18+	For the frontend dev server
Redis	7+	Required for Celery task queue — Docker is the easiest option

1. Start Redis

Redis must be running before the backend or Celery worker will start successfully.

Docker (any OS — recommended)

docker run -d -p 6379:6379 redis:7-alpine

Native (Linux / macOS)

redis-server

2. Backend

Clone the repository

git clone https://github.com/DanDreadless/Insight.git
cd Insight/backend

Install Python dependencies

pip install -r requirements.txt

Configure the environment

Copy the sample env file and set at minimum a SECRET_KEY value.

cp ../.env.sample ../.env
# Edit ../.env — set SECRET_KEY to a long random string
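One way to produce a suitably long random value is sketched below using Python's standard `secrets` module. The helper name `generate_secret_key` and the 64-byte length are illustrative choices, not project requirements.

```python
import secrets


def generate_secret_key(num_bytes: int = 64) -> str:
    """Return a URL-safe random string suitable for a SECRET_KEY value."""
    # token_urlsafe draws from the OS CSPRNG; 64 random bytes encode to 86 characters.
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Paste the printed value into ../.env as SECRET_KEY=<value>
    print(generate_secret_key())
```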

Run migrations and start the server

python manage.py migrate
python manage.py runserver

The Django API will be available at http://localhost:8000.

3. Celery Worker

Scans are executed as asynchronous Celery tasks. The worker must be running in a separate terminal — scans will queue but never run without it.

Start the worker (from the `backend/` directory)

celery -A insight worker -l info

4. Frontend

The React frontend runs on its own dev server and proxies all /api/ requests to Django on port 8000.

Install and start (from the `frontend/` directory)

npm install
npm run dev

The dev server starts at http://localhost:5173.

5. Verify

URL	Expected
`http://localhost:5173`	React frontend — URL input and scan interface
`http://localhost:8000/api/health/`	`{"status":"ok"}`
`http://localhost:8000/api/schema/swagger-ui/`	Interactive API documentation

Run the test suite

# All tests
python manage.py test scanner

# Single module
python manage.py test scanner.tests.test_validators

Docker (Alternative)

If you prefer not to install Python and Node.js locally, the full stack can be started with Docker Compose.

Configure and start

cp .env.sample .env   # edit SECRET_KEY
docker-compose up --build

The frontend is served on port 5173 and the backend on port 8000.

Teardown

# Stop and remove volumes
docker-compose down -v

# Full clean (removes images too)
docker-compose down --rmi all --volumes --remove-orphans

Environment Variables

All configuration lives in .env at the repo root. Copy from .env.sample — the only required change for local development is setting SECRET_KEY.

Variable	Default	Notes
`SECRET_KEY`	(insecure sample)	Must be changed — startup fails in production if unchanged
`DEBUG`	`True`	Set `False` in production
`REDIS_URL`	`redis://localhost:6379/0`	Use `rediss://` (TLS) in production
`DATABASE_URL`	`sqlite:///db.sqlite3`	Use PostgreSQL in production
`CORS_ALLOWED_ORIGINS`	`http://localhost:5173`	Frontend origin
`RATE_LIMIT_SCANS_PER_HOUR`	`5`	Per IP address
`MAX_SCAN_RESOURCES`	`50`	Maximum number of external scripts analysed per scan
`SCAN_TIMEOUT_SECONDS`	`60`	Hard Celery task time limit
`CARAPACE_URL`	(unset)	Base URL of the Carapace API, e.g. `http://carapace:8080`. Screenshots disabled if unset.
`CARAPACE_API_KEY`	(unset)	Optional API key sent in `X-Api-Key` header to Carapace
`CARAPACE_SCREENSHOT_TIMEOUT`	`30`	Per-request timeout in seconds for Carapace render calls
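As an illustration of how the table's defaults could be applied, the sketch below reads a few of the variables from the environment. The function `load_scan_settings` and its dictionary keys are hypothetical and are not taken from Insight's actual settings module.

```python
import os


def load_scan_settings(env=None) -> dict:
    """Read scan-related variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "redis_url": env.get("REDIS_URL", "redis://localhost:6379/0"),
        "rate_limit_scans_per_hour": int(env.get("RATE_LIMIT_SCANS_PER_HOUR", "5")),
        "max_scan_resources": int(env.get("MAX_SCAN_RESOURCES", "50")),
        "scan_timeout_seconds": int(env.get("SCAN_TIMEOUT_SECONDS", "60")),
        # Screenshots are disabled when CARAPACE_URL is unset or empty.
        "carapace_url": env.get("CARAPACE_URL") or None,
        "carapace_screenshot_timeout": int(env.get("CARAPACE_SCREENSHOT_TIMEOUT", "30")),
    }
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in isolation.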

How It Works

When a URL is submitted, Insight fetches the page and all linked external scripts using an SSRF-safe HTTP fetcher. No external threat intelligence APIs are called — all analysis is performed against the raw content returned by the target server.

Five analysis modules run against every scan:

JavaScript analyser — 59 checks against all scripts collected from the page
HTML analyser — 33+ structural checks against the page markup
Domain intelligence — 15 checks on the hostname and TLD
Header analyser — 15 checks on HTTP response headers
SSL analyser — 6 checks on the TLS certificate

Each check emits zero or more findings. A context-collapse engine fires additional synthetic findings when combinations of signals indicate coordinated attack infrastructure.

Verdict & Scoring

Verdict derivation

Verdict	Condition
MALICIOUS	Any CRITICAL finding
SUSPICIOUS	Any HIGH finding, or 2+ MEDIUM findings
CLEAN	Only LOW and INFO findings
UNKNOWN	No findings at all
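The table can be read as a small decision function. The sketch below is one possible reading, not Insight's actual code; `derive_verdict` is a hypothetical name. The table does not say what a single MEDIUM finding yields, so treating it as CLEAN here is an assumption.

```python
from collections import Counter


def derive_verdict(severities: list[str]) -> str:
    """Map a list of finding severities to a verdict, following the table above."""
    if not severities:
        return "UNKNOWN"
    counts = Counter(s.upper() for s in severities)
    if counts["CRITICAL"] > 0:
        return "MALICIOUS"
    if counts["HIGH"] > 0 or counts["MEDIUM"] >= 2:
        return "SUSPICIOUS"
    # Assumption: exactly one MEDIUM finding is not covered by the table;
    # this sketch lets it fall through to CLEAN.
    return "CLEAN"
```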

Context collapse rules

Synthetic findings are generated when multiple signals combine to indicate a coordinated attack pattern:

High-risk TLD + external form action + missing security headers → HIGH "phishing infrastructure"
DGA domain + hidden iframe + obfuscated JS → CRITICAL "drive-by malware delivery"
Brand impersonation + phishing form (± new certificate) → CRITICAL "active phishing campaign"
Keylogger / skimmer + DevTools evasion → CRITICAL "sophisticated targeted malware"
Fake CAPTCHA / ClickFix UI + clipboard write → CRITICAL "ClickFix malware delivery"
Injected unknown external script + ClickFix / clipboard payload → CRITICAL "compromised site delivering ClickFix malware"
Newly registered domain (≤ 30 days) + high-risk TLD → HIGH "newly registered high-risk domain" — purpose-built attack infrastructure signal
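A rule set like the one above can be modelled as required-signal sets that are matched against the signals observed in a scan. The sketch below covers four of the rules only. The signal identifiers (`high_risk_tld`, `obfuscated_js`, and so on) and the `collapse` function are hypothetical names chosen for illustration.

```python
# Each rule: (signals that must all be present, severity, synthetic finding title).
RULES = [
    ({"high_risk_tld", "external_form_action", "missing_security_headers"},
     "HIGH", "phishing infrastructure"),
    ({"dga_domain", "hidden_iframe", "obfuscated_js"},
     "CRITICAL", "drive-by malware delivery"),
    ({"fake_captcha", "clipboard_write"},
     "CRITICAL", "ClickFix malware delivery"),
    ({"newly_registered_domain", "high_risk_tld"},
     "HIGH", "newly registered high-risk domain"),
]


def collapse(signals: set[str]) -> list[tuple[str, str]]:
    """Return a (severity, title) pair for every rule whose signals are all present."""
    return [(severity, title)
            for required, severity, title in RULES
            if required <= signals]
```

Using subset tests keeps each rule declarative, so a new combination is one more tuple rather than new control flow.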

JavaScript Analysis

All scripts are run through jsbeautifier before analysis (files ≤ 256 KB). Both the original and beautified forms are checked. Base64-encoded strings are decoded and the plaintext payload is appended to the evidence block where printable.
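The Base64 step described above can be sketched roughly as follows. The regular expression, the 16-character minimum, and the function name `decode_printable_base64` are assumptions for illustration and are not Insight's actual implementation.

```python
import base64
import binascii
import re

# Quoted literals of 16+ Base64 characters; the minimum length is an arbitrary choice.
B64_LITERAL = re.compile(r"""["']([A-Za-z0-9+/]{16,}={0,2})["']""")


def decode_printable_base64(script: str) -> list[str]:
    """Decode Base64-looking string literals and keep those that are printable text."""
    decoded = []
    for match in B64_LITERAL.finditer(script):
        try:
            raw = base64.b64decode(match.group(1), validate=True)
            text = raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not text once decoded
        # str.isprintable() is strict: payloads containing newlines are skipped.
        if text.isprintable():
            decoded.append(text)
    return decoded
```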

CRITICAL

Check	Pattern
Encoded payload execution	`eval(atob(...))`, `eval(unescape(...))`, `eval(decodeURIComponent(...))` chains
Session theft	Cookie / localStorage read + outbound `fetch` / XHR / `sendBeacon`
Credential harvester	Form submit hijack + external `fetch` / `sendBeacon` + `preventDefault`
Keylogger	`keydown` / `keyup` listener reading `event.key` + outbound network call
Magecart skimmer	DOM query targeting card / CVV fields + exfiltration + encoding or polling
Crypto miner	Stratum protocol strings, CoinHive / CryptoLoot names, WebWorker + WASM pattern
Unix shell dropper	`base64 -d \| bash` pattern embedded in JS strings
PowerShell dropper	`irm ... \| iex` / `Invoke-RestMethod ... \| iex` embedded in JS strings
HTML smuggling	`new Blob([...])` + `URL.createObjectURL` + auto-download trigger
Web3 wallet drainer	`window.ethereum` + `eth_sendTransaction` / `eth_signTypedData` / `personal_sign`
Malicious service worker	`navigator.serviceWorker.register()` from `blob:` or `data:` URI
Remote code execution	`fetch()` + `.then()` / `await` + `eval()` / `new Function()` in same async chain — compromised WordPress pattern
Decrypt-then-execute	`crypto.subtle.decrypt` / `importKey` + `eval()` / `new Function()` — encrypted payload executed at runtime
ClickFix clipboard payload	`navigator.clipboard.writeText()` argument contains shell command indicators (PowerShell, mshta, cmd.exe, `\| bash`, etc.) — content-based detection regardless of click handler

HIGH

Check	Pattern
Obfuscator.io fingerprint	`_0x` array-rotation variable pattern
Character-code obfuscation	`String.fromCharCode(...)` building strings character by character
High entropy string	Shannon entropy > 5.5 bits/char on literals > 64 chars
Dynamic hidden iframe	`createElement('iframe')` + `display:none` / `width:0`
Forced download	`createElement('a')` + `.download` + `.click()`
Beacon exfiltration	`navigator.sendBeacon()` to external domain
Clipboard hijack	`navigator.clipboard.writeText()` outside a recognisable click handler
Script injection	`document.write()` injecting external `<script src="https://...">`
Shell string	`bash -c` execution string embedded in JS
C2 infrastructure	`curl` / `wget` to bare IP address
External service worker	`serviceWorker.register()` loading from an external domain
Living off Trusted Sites (LoTS)	Exfiltration via Telegram Bot API, Discord webhook, Slack webhook, Google Apps Script, Webhook.site, Pipedream, RequestBin, Pastebin API
Dynamic module import	`import('https://...')` loading an ES module from an unknown external URL

MEDIUM

Check	Pattern
URL evasion	`['a','b','c'].join('')` array-split string construction
Moderate entropy	Shannon entropy 4.8–5.5 bits/char (possibly encoded)
Auto-redirect	`window.location` inside `setTimeout` < 3000ms
Right-click disable	`contextmenu` event + `preventDefault()`
DevTools detection	`outerWidth` / `outerHeight` delta, `__REACT_DEVTOOLS_GLOBAL_HOOK__`, console timing tricks

HTML Analysis

Severity	Check
CRITICAL	Phishing form: `action` domain ≠ page domain + brand keyword in page title
CRITICAL	Shell command (PowerShell, mshta, cmd, iex) embedded in HTML `data-*` attribute or event handler — ClickFix payload storage pattern
CRITICAL	Shell command embedded in hidden HTML element (`display:none`, hidden input, `<template>`) — ClickFix payload storage pattern
HIGH	Phishing form: `action` domain ≠ page domain (no brand signal in title)
HIGH	Hidden iframe (`display:none`, `width=0`, `height=0`, off-screen position)
HIGH	`<base href>` pointing to external domain (URL hijacking)
HIGH	Meta refresh redirect with delay ≤ 2s to external domain
HIGH	Login form transmitting credentials over plain HTTP
HIGH	Fake browser update page: browser/update terminology + executable download link (SocGholish / ClearFake)
HIGH	Fake CAPTCHA / ClickFix: human-verification text + Win+R / terminal execution instructions (expanded: "click to fix", "browser verification", "run the following command" variants)
HIGH	Clickjacking overlay: full-viewport fixed/absolute element with z-index > 100 + click handler
HIGH	External script loaded from unknown domain that is also `dns-prefetch`-staged in the same page — deliberate WordPress compromise pattern (e.g. WPCode injection)
MEDIUM	`<base href>` present — same origin, verify it is intentional
MEDIUM	Meta refresh redirect (any delay)
MEDIUM	Right-click disabled via `oncontextmenu="return false"`
MEDIUM	Suspicious executable download link (`.exe`, `.msi`, `.ps1`, `.bat`, `.hta`, etc.)
MEDIUM	Inline script dominates page content (script > 3× non-script HTML)
MEDIUM	`<noscript>` block contains external URL redirect
MEDIUM	Sensitive keywords in HTML comments (password, api_key, token, secret, etc.)
MEDIUM	Resources loaded from IPFS gateways (takedown-resistant phishing / drainer hosting)
MEDIUM	External script preloaded via `<link rel="preload" as="script">` or `<link rel="prefetch">` from unknown domain — WordPress malware injection staging pattern
MEDIUM	External `<script src>` from unknown domain without dns-prefetch staging
LOW	External scripts loaded without Subresource Integrity (SRI)
LOW	Password field missing `autocomplete` attribute
LOW	CSS `user-select: none` disabling text selection
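The phishing-form checks hinge on comparing a form's `action` target with the page's own domain. The sketch below shows the idea with the standard-library HTML parser. It compares full hostnames rather than registered domains, which is a simplification, and the names `FormActionCollector` and `external_form_actions` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FormActionCollector(HTMLParser):
    """Collect the action attribute of every <form> tag."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            action = dict(attrs).get("action")
            if action:
                self.actions.append(action)


def external_form_actions(html: str, page_url: str) -> list[str]:
    """Return absolute form targets whose hostname differs from the page's."""
    page_host = urlparse(page_url).hostname
    collector = FormActionCollector()
    collector.feed(html)
    external = []
    for action in collector.actions:
        target = urljoin(page_url, action)  # resolves relative actions
        host = urlparse(target).hostname
        if host and host != page_host:
            external.append(target)
    return external
```

A production check would more likely compare registered domains, for example with tldextract, so that `login.example.com` posting to `api.example.com` is not flagged.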

Domain Intelligence

Domain checks run against the hostname of the scanned URL. Detection is based on the domain string itself and no DNS lookups are made; the registration-age checks are the exception, as they require WHOIS data.

Severity	Check
CRITICAL	Subdomain token is a typosquat of a known brand (Levenshtein edit distance 1)
CRITICAL	Subdomain contains exact brand keyword (e.g. `paypal.attacker.com`)
HIGH	SLD (registered domain) is a typosquat of a known brand (edit distance 1)
HIGH	IDN / homograph attack — Cyrillic or mixed-script characters in domain
HIGH	Brand keyword in registered domain (e.g. `paypal-secure.com`) — attacker owns the SLD
HIGH	DGA probability score > 0.8: strong algorithmic generation signal, characteristics consistent with C2 infrastructure
MEDIUM	High-risk TLD (`.xyz`, `.top`, `.click`, `.loan`, `.zip`, `.cyou`, and 20+ more): strong context signal, not conclusive alone
MEDIUM	DGA probability score 0.6–0.8 (consonant ratio + entropy + English subword absence)
MEDIUM	Digit substitution in SLD (e.g. `g00gle`, `faceb00k`)
MEDIUM	Excessive subdomain depth (> 4 labels)
MEDIUM	Hosted on abuse-prone free platform with long random subdomain (Cloudflare R2, Pages.dev, Firebase)
MEDIUM	Newly registered domain (≤ 30 days old) — disproportionately present in threat feeds; requires WHOIS data
MEDIUM	Subdomain encodes a domain via dot-to-hyphen substitution (e.g. `support-paypal-com.zapier.app`) — phishing-as-a-service technique using free subdomain hosting; HIGH if hosted on an abuse-prone platform
MEDIUM	Delivery/postal brand keyword embedded in SLD (USPS, FedEx, DHL) without being the official site — fake parcel notification phishing
INFO	Recently registered domain (31–90 days old) — context signal for analysts

Brands monitored include: PayPal, Google, Microsoft, Apple, Amazon, Facebook, Instagram, Netflix, Steam, Coinbase, Binance, MetaMask, Ledger, Trezor, Trust Wallet, OpenSea, Roblox, Discord, Twitch, Spotify, Chase, Barclays, and more.

Header Analysis

15 checks on HTTP response headers. All checks are passive — no additional requests are made.

Severity philosophy: Missing defensive headers are configuration debt, not threat indicators. Industry consensus (Cobalt, OWASP, Invicti, pentest report standards) rates them LOW/INFO. They only escalate in meaning when combined with active threat signals — handled by the context collapse engine.

Severity	Check
HIGH	Site served over unencrypted HTTP — credentials and sessions exposed in plaintext
HIGH	End-of-life server software (Apache 2.2, PHP 5.x, IIS 6/7) — unpatched CVEs, likely compromised or abandoned infra
HIGH	CORS misconfiguration: wildcard `Access-Control-Allow-Origin: *` + `Access-Control-Allow-Credentials: true`
LOW	Missing `X-Content-Type-Options: nosniff`
LOW	Missing HSTS on HTTPS site — SSL-stripping attack possible via active MitM
LOW	HSTS `max-age` below recommended 1 year
LOW	Server header discloses software version — reconnaissance aid
LOW	`X-Powered-By` header exposes backend technology
LOW	Insecure cookie flags — missing `HttpOnly`, `Secure`, or `SameSite`
LOW	CORS wildcard `Access-Control-Allow-Origin: *` (acceptable for public APIs, noted as LOW)
LOW	CSP allows `unsafe-inline` or `unsafe-eval` — weakens XSS protection
INFO	Missing Content-Security-Policy — hardening gap, not a threat signal
INFO	Missing X-Frame-Options — hardening gap, not a threat signal
INFO	Missing Referrer-Policy — privacy gap
INFO	Missing Permissions-Policy — rarely set by any site
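A few of the checks in the table can be expressed as simple lookups on a case-insensitive header map. The sketch below covers the CORS, `nosniff`, HSTS, and CSP rows only. The function name, the finding labels, and the choice to report only the HIGH CORS finding when credentials are also allowed are assumptions.

```python
import re

ONE_YEAR_SECONDS = 31_536_000


def check_headers(headers: dict[str, str], is_https: bool = True) -> list[tuple[str, str]]:
    """Return (severity, description) findings for a subset of the header checks."""
    h = {k.lower(): v for k, v in headers.items()}
    findings = []

    origin = h.get("access-control-allow-origin", "").strip()
    credentials = h.get("access-control-allow-credentials", "").strip().lower()
    if origin == "*":
        if credentials == "true":
            findings.append(("HIGH", "CORS wildcard origin with credentials"))
        else:
            findings.append(("LOW", "CORS wildcard origin"))

    if h.get("x-content-type-options", "").strip().lower() != "nosniff":
        findings.append(("LOW", "Missing X-Content-Type-Options: nosniff"))

    if is_https:
        hsts = h.get("strict-transport-security")
        if hsts is None:
            findings.append(("LOW", "Missing HSTS"))
        else:
            match = re.search(r"max-age=(\d+)", hsts, re.IGNORECASE)
            if match and int(match.group(1)) < ONE_YEAR_SECONDS:
                findings.append(("LOW", "HSTS max-age below 1 year"))

    if "content-security-policy" not in h:
        findings.append(("INFO", "Missing Content-Security-Policy"))
    return findings
```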

SSL Analysis

6 checks on the TLS certificate. The certificate is retrieved directly — no third-party certificate transparency APIs are used.

Severity	Check
HIGH	Certificate expires in fewer than 14 days
HIGH	Self-signed certificate
HIGH	Hostname / SAN mismatch
HIGH	Let's Encrypt certificate issued to a brand-impersonating domain
MEDIUM	Deprecated TLS version (1.0 or 1.1) negotiated
INFO	Certificate issued within the last 7 days — new cert on suspicious domain is a phishing indicator

Technology Detection

Identifies the technology stack from HTML, script sources, HTTP headers, and cookies. Displayed as colour-coded badges with logos on the results page.

Category	Technologies detected
CMS	WordPress, Drupal, Joomla, Ghost, Shopify, Wix, Squarespace, Webflow, HubSpot CMS
JS Framework	React, Next.js, Vue, Nuxt, Angular, Svelte, SvelteKit, Ember, Backbone.js, Astro, Remix, Gatsby, Solid.js
Build Tool	Vite, webpack
JS Library	jQuery, Lodash, Axios, GSAP, Three.js, Alpine.js, htmx, Socket.io, Chart.js, D3.js, Swiper, Pusher
CSS Framework	Bootstrap, Tailwind CSS, Bulma, Font Awesome, UIkit
Backend	PHP, Python, Node.js, Express, ASP.NET, Laravel, Django, Ruby on Rails, Java, Flask, FastAPI, Symfony, Spring Boot
Web Server	nginx, Apache, Caddy, Gunicorn, LiteSpeed
CDN	Cloudflare, AWS CloudFront, Fastly, Akamai, jsDelivr
Hosting	Vercel, Netlify, GitHub Pages, Firebase, Render
Analytics	Google Analytics, Google Tag Manager, Facebook Pixel, Hotjar, Intercom, Mixpanel, Plausible, Matomo, TikTok Pixel, LinkedIn Insight, Cloudflare Web Analytics
Security	Cloudflare Turnstile, reCAPTCHA, hCaptcha, Cloudflare Bot Management
Payment	Stripe, PayPal, Square, Klarna

Visual Renderer (Carapace)

Carapace is an optional companion service that captures a screenshot of each scanned URL in headless Chromium. It renders the page with JavaScript fully enabled but with all network requests intercepted and blocked, so dynamic overlays (ClickFix, SocGholish, ClearFake, drainers) execute and appear in the screenshot, revealing the actual attack UI. A verdict badge is composited onto every screenshot.

Carapace runs as a separate Docker service alongside the Insight backend. When CARAPACE_URL is configured, every scan automatically calls the POST /render endpoint and the screenshot is stored with the scan result. If Carapace is unavailable the scan continues normally — it is entirely best-effort.

In addition to the screenshot, Carapace returns a threat report with:

A risk score (0–100) derived from renderer-level observations
Threat flags — findings from the renderer converted directly into Insight findings under the Renderer category
Technology detections — a DOM-parsed tech stack that is merged with Insight's own BeautifulSoup-based detections, catching things a static parser may miss

Carapace is open-source and available at github.com/DanDreadless/Carapace. It is designed to be deployed as a sidecar alongside Insight — see the Carapace README for setup instructions.

Renderer Findings

Threat flags from Carapace are converted to Insight findings in the Renderer category. Sanitisation-behaviour codes that fire on almost every real page (e.g. BLOCKED_ELEMENT_SCRIPT, NETWORK_ATTEMPT_BLOCKED) are suppressed — only signals with independent threat value are surfaced.

Severity	Flag	Description
CRITICAL	Drive-by download blocked	Renderer intercepted an automatic file download — filename, MIME type, and SHA-256 recorded without executing the file
HIGH	JS eval detected	`eval()` or `new Function()` execution observed at render time — runtime obfuscation
HIGH	Exfiltration attempt blocked	Outbound XHR / fetch to an external domain intercepted by the network block
HIGH	Credential field on HTTP	Password input found on an unencrypted page — credentials would be sent in plaintext
MEDIUM	Redirect chain detected	Multiple HTTP redirects observed before final page load — common in traffic distribution systems
MEDIUM	Suspicious download link	Executable file linked from page content (`.exe`, `.msi`, `.ps1`, etc.)
MEDIUM	DevTools evasion detected	Renderer-side debugger / devtools detection attempt observed

API Endpoints

All endpoints are under /api/. No authentication is required — rate limiting is enforced per IP address.

Method	Path	Description
`POST`	`/api/scan/`	Submit a URL for scanning. Returns `{"id": "...", "status": "PENDING"}`. 202 on success, 429 if rate limited.
`GET`	`/api/scan/{id}/`	Poll scan status and retrieve full results once complete.
`GET`	`/api/scan/{id}/stream/`	Server-Sent Events stream — events: `status_update`, `complete`, `error`. Auto-retry on disconnect (3 attempts, 2s delay).
`GET`	`/api/scan/{id}/source/`	Re-fetch the raw source of a URL that belongs to a completed scan. Only permits URLs within the original scan scope.
`GET`	`/api/history/`	Paginated list of completed scans. Supports `?q=` URL substring filter and `?page=`.
`GET`	`/api/health/`	Health check — returns `{"status":"ok"}`.
`GET`	`/api/schema/swagger-ui/`	Interactive API documentation (drf-spectacular).

SSRF protection: All outbound HTTP requests made during a scan go through an SSRF-safe fetcher that rejects non-HTTP(S) schemes, resolves DNS and blocks RFC1918 / loopback / link-local addresses, caps response bodies at 5 MB, and enforces hard timeouts. It is not possible to use the scan endpoint to probe internal network resources.
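The scheme and address checks described above can be approximated with the standard `ipaddress` module, as in the sketch below. It is not Insight's fetcher: the function names are hypothetical, and the body-size cap and timeouts are left out.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_blocked_address(ip: str) -> bool:
    """True for private (including RFC 1918), loopback, and link-local addresses."""
    addr = ipaddress.ip_address(ip)
    return addr.is_private or addr.is_loopback or addr.is_link_local


def is_url_allowed(url: str, resolve=socket.getaddrinfo) -> bool:
    """Reject non-HTTP(S) schemes and hosts that resolve to a blocked address."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        infos = resolve(parsed.hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False
    # Every resolved address must be acceptable, not just the first one.
    return bool(infos) and all(not is_blocked_address(info[4][0]) for info in infos)
```

A check like this only helps if the subsequent connection goes to the address that was validated. Resolving once for the check and again for the request leaves room for DNS rebinding.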

Tech Stack

Layer	Technology
Backend	Python 3.11 / Django 5.2 / Django REST Framework
Task queue	Celery + Redis
API docs	drf-spectacular — Swagger UI at `/api/schema/swagger-ui/`
Frontend	React 19 / TypeScript / Vite / Tailwind CSS 4
Database	SQLite (development) / PostgreSQL (production)
Cache / broker	Redis
Visual renderer	Carapace — Chromium-headless screenshot service (optional sidecar)

Acknowledgements

Insight is built on a number of excellent open-source libraries.

Backend & Analysis

Python, Django, Django REST Framework, Celery, jsbeautifier, BeautifulSoup4, tldextract, Requests, drf-spectacular

Frontend

React, Vite, TypeScript, Tailwind CSS, Axios, React Router, Simple Icons

Infrastructure

Redis, PostgreSQL, Gunicorn, Docker

Visual Renderer

Carapace
