Overview
Insight is a passive web threat scanner. Submit any URL and it fetches the page's public resources (HTML, scripts, HTTP headers, and the TLS certificate), then analyses them purely on their content. It makes no calls to reputation databases or external threat intelligence APIs.
Because detection is content-based, Insight catches zero-day campaigns, freshly registered phishing domains, and newly injected skimmers that reputation feeds haven't yet indexed. It is a companion tool to vault1337.com and shares the same design system.
The result of each scan is a prioritised findings report covering JavaScript threats, phishing indicators, domain intelligence, security misconfigurations, and the full detected technology stack — each finding categorised by severity (CRITICAL / HIGH / MEDIUM / LOW / INFO) with supporting evidence extracted directly from the page content.
Requirements
| Requirement | Version | Notes |
|---|---|---|
| Python | 3.11+ | Global or virtual environment |
| Node.js | 18+ | For the frontend dev server |
| Redis | 7+ | Required for Celery task queue — Docker is the easiest option |
1. Start Redis
Redis must be running before the backend or Celery worker will start successfully.
Docker (any OS — recommended)
docker run -d -p 6379:6379 redis:7-alpine
Native (Linux / macOS)
redis-server
2. Backend
Clone the repository
git clone https://github.com/DanDreadless/Insight.git
cd Insight/backend
Install Python dependencies
pip install -r requirements.txt
Configure the environment
Copy the sample env file and set at minimum a SECRET_KEY value.
cp ../.env.sample ../.env
# Edit ../.env and set SECRET_KEY to a long random string
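One way to produce a suitably long random value is Python's standard `secrets` module. This is a minimal sketch; the helper name `generate_secret_key` and the 50-byte length are assumptions for illustration, not project requirements.

```python
import secrets


def generate_secret_key(num_bytes: int = 50) -> str:
    """Return a URL-safe random string suitable for pasting into SECRET_KEY."""
    # token_urlsafe draws from the OS cryptographic random source.
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    print(generate_secret_key())
```

Paste the printed value into `../.env` as the `SECRET_KEY` entry.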
Run migrations and start the server
python manage.py migrate
python manage.py runserver
The Django API will be available at http://localhost:8000.
3. Celery Worker
Scans are executed as asynchronous Celery tasks. The worker must be running in a separate terminal — scans will queue but never run without it.
Start the worker (from the backend/ directory)
celery -A insight worker -l info
4. Frontend
The React frontend runs on its own dev server and proxies all /api/ requests to Django on port 8000.
Install and start (from the frontend/ directory)
npm install
npm run dev
The dev server starts at http://localhost:5173.
5. Verify
| URL | Expected |
|---|---|
| http://localhost:5173 | React frontend: URL input and scan interface |
| http://localhost:8000/api/health/ | {"status":"ok"} |
| http://localhost:8000/api/schema/swagger-ui/ | Interactive API documentation |
Run the test suite
# All tests
python manage.py test scanner
# Single module
python manage.py test scanner.tests.test_validators
Docker (Alternative)
If you prefer not to install Python and Node.js locally, the full stack can be started with Docker Compose.
Configure and start
cp .env.sample .env   # edit SECRET_KEY
docker-compose up --build
Frontend on :5173, backend on :8000.
Teardown
# Stop and remove volumes
docker-compose down -v
# Full clean (removes images too)
docker-compose down --rmi all --volumes --remove-orphans
Environment Variables
All configuration lives in .env at the repo root. Copy from .env.sample — the only required change for local development is setting SECRET_KEY.
| Variable | Default | Notes |
|---|---|---|
| SECRET_KEY | (insecure sample) | Must be changed; startup fails in production if unchanged |
| DEBUG | True | Set False in production |
| REDIS_URL | redis://localhost:6379/0 | Use rediss:// (TLS) in production |
| DATABASE_URL | sqlite:///db.sqlite3 | Use PostgreSQL in production |
| CORS_ALLOWED_ORIGINS | http://localhost:5173 | Frontend origin |
| RATE_LIMIT_SCANS_PER_HOUR | 5 | Per IP address |
| MAX_SCAN_RESOURCES | 50 | External scripts analysed per scan |
| SCAN_TIMEOUT_SECONDS | 60 | Hard Celery task time limit |
| CARAPACE_URL | (unset) | Base URL of the Carapace API, e.g. http://carapace:8080. Screenshots disabled if unset. |
| CARAPACE_API_KEY | (unset) | Optional API key sent in X-Api-Key header to Carapace |
| CARAPACE_SCREENSHOT_TIMEOUT | 30 | Per-request timeout in seconds for Carapace render calls |
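To illustrate how the defaults in the table behave, here is a rough sketch of reading a few of these variables in Python. The `ScanSettings` class and `load_scan_settings` function are hypothetical names for this example; the project's real settings module may be organised quite differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ScanSettings:
    """Hypothetical container for a subset of the variables in the table."""
    redis_url: str
    rate_limit_scans_per_hour: int
    max_scan_resources: int
    scan_timeout_seconds: int
    carapace_url: Optional[str]
    carapace_screenshot_timeout: int

    @property
    def screenshots_enabled(self) -> bool:
        # Per the table, screenshots are disabled when CARAPACE_URL is unset.
        return bool(self.carapace_url)


def load_scan_settings(env: Mapping[str, str] = os.environ) -> ScanSettings:
    """Read settings from an environment mapping, applying the documented defaults."""
    return ScanSettings(
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        rate_limit_scans_per_hour=int(env.get("RATE_LIMIT_SCANS_PER_HOUR", "5")),
        max_scan_resources=int(env.get("MAX_SCAN_RESOURCES", "50")),
        scan_timeout_seconds=int(env.get("SCAN_TIMEOUT_SECONDS", "60")),
        carapace_url=env.get("CARAPACE_URL") or None,
        carapace_screenshot_timeout=int(env.get("CARAPACE_SCREENSHOT_TIMEOUT", "30")),
    )
```

Passing the mapping as a parameter keeps the function easy to exercise without touching the real process environment.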
How It Works
When a URL is submitted, Insight fetches the page and all linked external scripts using an SSRF-safe HTTP fetcher. No external threat intelligence APIs are called — all analysis is performed against the raw content returned by the target server.
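The core idea behind an SSRF-safe fetcher can be sketched as a pre-flight check that refuses any URL resolving to an internal address. The code below is an illustration only, not Insight's actual fetcher; the function names `is_public_address` and `is_url_safe_to_fetch` are assumptions.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(ip_text: str) -> bool:
    """True only for addresses outside private, loopback and other special ranges."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
    )


def is_url_safe_to_fetch(url: str, resolve=socket.getaddrinfo) -> bool:
    """Hypothetical pre-flight check: http(s) only, every resolved IP must be public."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        port = parts.port or (443 if parts.scheme == "https" else 80)
        infos = resolve(parts.hostname, port, type=socket.SOCK_STREAM)
    except (ValueError, OSError, UnicodeError):
        return False
    # Reject if any resolved address is internal, not just the first one.
    return bool(infos) and all(is_public_address(info[4][0]) for info in infos)
```

A production fetcher would typically also connect to the exact IP it validated (to resist DNS rebinding) and repeat the check on every redirect hop; this sketch covers only the address classification step.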
Five analysis modules run against every scan:
- JavaScript analyser — 59 checks against all scripts collected from the page
- HTML analyser — 33+ structural checks against the page markup
- Domain intelligence — 14 checks on the hostname and TLD
- Header analyser — 15 checks on HTTP response headers
- SSL analyser — 6 checks on the TLS certificate
Each check emits zero or more findings. A context-collapse engine fires additional synthetic findings when combinations of signals indicate coordinated attack infrastructure.
Verdict & Scoring
Verdict derivation
| Verdict | Condition |
|---|---|
| MALICIOUS | Any CRITICAL finding |
| SUSPICIOUS | Any HIGH finding, or 2+ MEDIUM findings |
| CLEAN | Only LOW and INFO findings |
| UNKNOWN | No findings at all |
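The table can be read as a small decision function over the severities of a scan's findings. The sketch below is one interpretation, with a hypothetical function name `derive_verdict`. Note that the table does not say what happens with exactly one MEDIUM finding and nothing higher; treating that case as CLEAN is an assumption made here.

```python
from collections import Counter
from typing import Iterable


def derive_verdict(severities: Iterable[str]) -> str:
    """Map a scan's finding severities to a verdict, following the table above."""
    counts = Counter(s.upper() for s in severities)
    if not counts:
        return "UNKNOWN"      # no findings at all
    if counts["CRITICAL"] > 0:
        return "MALICIOUS"    # any CRITICAL finding
    if counts["HIGH"] > 0 or counts["MEDIUM"] >= 2:
        return "SUSPICIOUS"   # any HIGH, or two or more MEDIUM
    # ASSUMPTION: a single MEDIUM finding is not covered by the table;
    # this sketch lets it fall through to CLEAN.
    return "CLEAN"
```

The checks run from most to least severe, so a scan with both CRITICAL and HIGH findings is reported as MALICIOUS.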
Context collapse rules
Synthetic findings are generated when multiple signals combine to indicate a coordinated attack pattern:
- High-risk TLD + external form action + missing security headers → HIGH "phishing infrastructure"
- DGA domain + hidden iframe + obfuscated JS → CRITICAL "drive-by malware delivery"
- Brand impersonation + phishing form (± new certificate) → CRITICAL "active phishing campaign"
- Keylogger / skimmer + DevTools evasion → CRITICAL "sophisticated targeted malware"
- Fake CAPTCHA / ClickFix UI + clipboard write → CRITICAL "ClickFix malware delivery"
- Injected unknown external script + ClickFix / clipboard payload → CRITICAL "compromised site delivering ClickFix malware"
- Newly registered domain (≤ 30 days) + high-risk TLD → HIGH "newly registered high-risk domain" — purpose-built attack infrastructure signal
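A simple way to picture these rules is as sets of required signals: a rule fires when every signal it needs is present in the scan. The sketch below encodes four of the rules above. The signal identifiers (`high_risk_tld`, `clipboard_write`, and so on) and the `CollapseRule` type are invented for this example and are not Insight's internal names.

```python
from typing import FrozenSet, List, NamedTuple, Set


class CollapseRule(NamedTuple):
    required: FrozenSet[str]  # signals that must all be present
    severity: str
    title: str


# Hypothetical encoding of a subset of the rules listed above.
RULES = [
    CollapseRule(
        frozenset({"high_risk_tld", "external_form_action", "missing_security_headers"}),
        "HIGH", "phishing infrastructure"),
    CollapseRule(
        frozenset({"dga_domain", "hidden_iframe", "obfuscated_js"}),
        "CRITICAL", "drive-by malware delivery"),
    CollapseRule(
        frozenset({"fake_captcha", "clipboard_write"}),
        "CRITICAL", "ClickFix malware delivery"),
    CollapseRule(
        frozenset({"new_domain", "high_risk_tld"}),
        "HIGH", "newly registered high-risk domain"),
]


def collapse(signals: Set[str]) -> List[CollapseRule]:
    """Return every rule whose required signals are a subset of the observed signals."""
    return [rule for rule in RULES if rule.required <= signals]
```

Because each rule is evaluated independently, one scan can trigger several synthetic findings at once, for example a new domain on a high-risk TLD that also hosts a fake CAPTCHA with a clipboard write.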
JavaScript Analysis
All scripts are run through jsbeautifier before analysis (files ≤ 256 KB). Both the original and beautified forms are checked. Base64-encoded strings are decoded and the plaintext payload is appended to the evidence block where printable.
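The Base64 decoding step might look roughly like the following sketch, which uses only the standard library and leaves out the jsbeautifier pass. The function name `decode_base64_literals`, the regular expression, and the 16-character minimum are all assumptions made for illustration.

```python
import base64
import binascii
import re
import string

# Quoted runs of Base64 alphabet characters; the minimum length is an assumption.
B64_LITERAL = re.compile(r"""["']([A-Za-z0-9+/]{16,}={0,2})["']""")
PRINTABLE = set(string.printable)


def decode_base64_literals(js_source: str) -> list[str]:
    """Return decoded plaintext for Base64 string literals that decode to printable text."""
    decoded = []
    for match in B64_LITERAL.finditer(js_source):
        try:
            raw = base64.b64decode(match.group(1), validate=True)
            text = raw.decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not text
        if text and all(ch in PRINTABLE for ch in text):
            decoded.append(text)
    return decoded
```

For a script containing `eval(atob("..."))`, the decoded plaintext is what an analyst would want to see alongside the finding, since the encoded form says little on its own.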
CRITICAL
| Check | Pattern |
|---|---|
| Encoded payload execution | eval(atob(...)), eval(unescape(...)), eval(decodeURIComponent(...)) chains |
| Session theft | Cookie / localStorage read + outbound fetch / XHR / sendBeacon |
| Credential harvester | Form submit hijack + external fetch / sendBeacon + preventDefault |
| Keylogger | keydown / keyup listener reading event.key + outbound network call |
| Magecart skimmer | DOM query targeting card / CVV fields + exfiltration + encoding or polling |
| Crypto miner | Stratum protocol strings, CoinHive / CryptoLoot names, WebWorker + WASM pattern |
| Unix shell dropper | base64 -d | bash pattern embedded in JS strings |
| PowerShell dropper | irm ... | iex / Invoke-RestMethod ... | iex embedded in JS strings |
| HTML smuggling | new Blob([...]) + URL.createObjectURL + auto-download trigger |
| Web3 wallet drainer | window.ethereum + eth_sendTransaction / eth_signTypedData / personal_sign |
| Malicious service worker | navigator.serviceWorker.register() from blob: or data: URI |
| Remote code execution | fetch() + .then() / await + eval() / new Function() in same async chain — compromised WordPress pattern |
| Decrypt-then-execute | crypto.subtle.decrypt / importKey + eval() / new Function() — encrypted payload executed at runtime |
| ClickFix clipboard payload | navigator.clipboard.writeText() argument contains shell command indicators (PowerShell, mshta, cmd.exe, | bash, etc.) — content-based detection regardless of click handler |
HIGH
| Check | Pattern |
|---|---|
| Obfuscator.io fingerprint | _0x array-rotation variable pattern |
| Character-code obfuscation | String.fromCharCode(...) building strings character by character |
| High entropy string | Shannon entropy > 5.5 bits/char on literals > 64 chars |
| Dynamic hidden iframe | createElement('iframe') + display:none / width:0 |
| Forced download | createElement('a') + .download + .click() |
| Beacon exfiltration | navigator.sendBeacon() to external domain |
| Clipboard hijack | navigator.clipboard.writeText() outside a recognisable click handler |
| Script injection | document.write() injecting external <script src="https://..."> |
| Shell string | bash -c execution string embedded in JS |
| C2 infrastructure | curl / wget to bare IP address |
| External service worker | serviceWorker.register() loading from an external domain |
| Living off Trusted Sites (LoTS) | Exfiltration via Telegram Bot API, Discord webhook, Slack webhook, Google Apps Script, Webhook.site, Pipedream, RequestBin, Pastebin API |
| Dynamic module import | import('https://...') loading an ES module from an unknown external URL |
MEDIUM
| Check | Pattern |
|---|---|
| URL evasion | ['a','b','c'].join('') array-split string construction |
| Moderate entropy | Shannon entropy 4.8–5.5 bits/char (possibly encoded) |
| Auto-redirect | window.location inside setTimeout < 3000ms |
| Right-click disable | contextmenu event + preventDefault() |
| DevTools detection | outerWidth / outerHeight delta, __REACT_DEVTOOLS_GLOBAL_HOOK__, console timing tricks |
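The two entropy checks in the tables above rely on Shannon entropy measured in bits per character. Here is a minimal sketch of that calculation together with the documented thresholds. The function names are hypothetical, and applying the 64-character minimum to the MEDIUM band as well is an assumption, since the table states it only for the HIGH check.

```python
import math
from collections import Counter
from typing import Optional


def shannon_entropy(text: str) -> float:
    """Shannon entropy of a string in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def classify_literal(literal: str) -> Optional[str]:
    """Apply the documented thresholds to one string literal."""
    if len(literal) <= 64:
        return None  # ASSUMPTION: length floor applied to both bands
    entropy = shannon_entropy(literal)
    if entropy > 5.5:
        return "HIGH"
    if entropy >= 4.8:
        return "MEDIUM"
    return None
```

As a reference point, a string that uses 32 distinct characters equally often has an entropy of exactly 5 bits per character, which falls in the MEDIUM band.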
HTML Analysis
| Severity | Check |
|---|---|
| CRITICAL | Phishing form: action domain ≠ page domain + brand keyword in page title |
| CRITICAL | Shell command (PowerShell, mshta, cmd, iex) embedded in HTML data-* attribute or event handler — ClickFix payload storage pattern |
| CRITICAL | Shell command embedded in hidden HTML element (display:none, hidden input, <template>) — ClickFix payload storage pattern |
| HIGH | Phishing form: action domain ≠ page domain (no brand signal in title) |
| HIGH | Hidden iframe (display:none, width=0, height=0, off-screen position) |
| HIGH | <base href> pointing to external domain (URL hijacking) |
| HIGH | Meta refresh redirect with delay ≤ 2s to external domain |
| HIGH | Login form transmitting credentials over plain HTTP |
| HIGH | Fake browser update page: browser/update terminology + executable download link (SocGholish / ClearFake) |
| HIGH | Fake CAPTCHA / ClickFix: human-verification text + Win+R / terminal execution instructions (expanded: "click to fix", "browser verification", "run the following command" variants) |
| HIGH | Clickjacking overlay: full-viewport fixed/absolute element with z-index > 100 + click handler |
| HIGH | External script loaded from unknown domain that is also dns-prefetch-staged in the same page — deliberate WordPress compromise pattern (e.g. WPCode injection) |
| MEDIUM | <base href> present — same origin, verify it is intentional |
| MEDIUM | Meta refresh redirect (any delay) |
| MEDIUM | Right-click disabled via oncontextmenu="return false" |
| MEDIUM | Suspicious executable download link (.exe, .msi, .ps1, .bat, .hta, etc.) |
| MEDIUM | Inline script dominates page content (script > 3× non-script HTML) |
| MEDIUM | <noscript> block contains external URL redirect |
| MEDIUM | Sensitive keywords in HTML comments (password, api_key, token, secret, etc.) |
| MEDIUM | Resources loaded from IPFS gateways (takedown-resistant phishing / drainer hosting) |
| MEDIUM | External script preloaded via <link rel="preload" as="script"> or <link rel="prefetch"> from unknown domain — WordPress malware injection staging pattern |
| MEDIUM | External <script src> from unknown domain without dns-prefetch staging |
| LOW | External scripts loaded without Subresource Integrity (SRI) |
| LOW | Password field missing autocomplete attribute |
| LOW | CSS user-select: none disabling text selection |
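As an illustration of the "action domain ≠ page domain" check, the sketch below collects form actions that point at a different host than the page. It uses the standard library's `html.parser` rather than BeautifulSoup and compares full hostnames instead of registered domains, which is a simplification. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlsplit


class FormActionCollector(HTMLParser):
    """Collects absolute form action URLs whose host differs from the page host."""

    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.page_host = urlsplit(page_url).hostname
        self.external_actions: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "form":
            return
        action = dict(attrs).get("action") or ""
        # Relative and empty actions resolve against the page URL.
        target = urljoin(self.page_url, action)
        target_host = urlsplit(target).hostname
        if target_host and target_host != self.page_host:
            self.external_actions.append(target)


def find_external_form_actions(page_url: str, html: str) -> List[str]:
    collector = FormActionCollector(page_url)
    collector.feed(html)
    return collector.external_actions
```

On its own an external action corresponds to the HIGH row above; combining it with a brand keyword in the page title is what the table escalates to CRITICAL.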
Domain Intelligence
Domain checks run against the hostname of the scanned URL. Most checks make no WHOIS or DNS lookups and rely on the domain string alone; the exceptions are the domain-age checks in the table, which require WHOIS data.
| Severity | Check |
|---|---|
| CRITICAL | Subdomain token is a typosquat of a known brand (Levenshtein edit distance 1) |
| CRITICAL | Subdomain contains exact brand keyword (e.g. paypal.attacker.com) |
| HIGH | SLD (registered domain) is a typosquat of a known brand (edit distance 1) |
| HIGH | IDN / homograph attack — Cyrillic or mixed-script characters in domain |
| HIGH | Brand keyword in registered domain (e.g. paypal-secure.com) — attacker owns the SLD |
| MEDIUM | High-risk TLD (.xyz, .top, .click, .loan, .zip, .cyou, and 20+ more) — strong context signal, not conclusive alone |
| HIGH | DGA probability score > 0.8 — strong algorithmic generation signal, characteristics consistent with C2 infrastructure |
| MEDIUM | DGA probability score 0.6–0.8 (consonant ratio + entropy + English subword absence) |
| MEDIUM | Digit substitution in SLD (e.g. g00gle, faceb00k) |
| MEDIUM | Excessive subdomain depth (> 4 labels) |
| MEDIUM | Hosted on abuse-prone free platform with long random subdomain (Cloudflare R2, Pages.dev, Firebase) |
| MEDIUM | Newly registered domain (≤ 30 days old) — disproportionately present in threat feeds; requires WHOIS data |
| MEDIUM | Subdomain encodes a domain via dot-to-hyphen substitution (e.g. support-paypal-com.zapier.app) — phishing-as-a-service technique using free subdomain hosting; HIGH if hosted on an abuse-prone platform |
| MEDIUM | Delivery/postal brand keyword embedded in SLD (USPS, FedEx, DHL) without being the official site — fake parcel notification phishing |
| INFO | Recently registered domain (31–90 days old) — context signal for analysts |
Brands monitored include: PayPal, Google, Microsoft, Apple, Amazon, Facebook, Instagram, Netflix, Steam, Coinbase, Binance, MetaMask, Ledger, Trezor, Trust Wallet, OpenSea, Roblox, Discord, Twitch, Spotify, Chase, Barclays, and more.
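The typosquat checks depend on Levenshtein edit distance. A compact sketch of the distance calculation and a distance-1 brand match follows. The `BRANDS` tuple is a small sample taken from the list above, and `typosquat_of` is a hypothetical helper name.

```python
from typing import Optional

# Small sample of the monitored brands, lower-cased for comparison.
BRANDS = ("paypal", "google", "microsoft", "amazon")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a rolling row of the dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]


def typosquat_of(label: str) -> Optional[str]:
    """Return the brand that a domain label is exactly one edit away from, if any."""
    label = label.lower()
    for brand in BRANDS:
        if edit_distance(label, brand) == 1:
            return brand
    return None
```

Requiring a distance of exactly 1 means the genuine brand label (distance 0) is not reported as a typosquat of itself.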
Header Analysis
15 checks on HTTP response headers. All checks are passive — no additional requests are made.
Severity philosophy: Missing defensive headers are configuration debt, not threat indicators. Industry consensus (Cobalt, OWASP, Invicti, pentest report standards) rates them LOW/INFO. They only escalate in meaning when combined with active threat signals — handled by the context collapse engine.
| Severity | Check |
|---|---|
| HIGH | Site served over unencrypted HTTP — credentials and sessions exposed in plaintext |
| HIGH | End-of-life server software (Apache 2.2, PHP 5.x, IIS 6/7) — unpatched CVEs, likely compromised or abandoned infra |
| HIGH | CORS misconfiguration: wildcard Access-Control-Allow-Origin: * + Access-Control-Allow-Credentials: true |
| LOW | Missing X-Content-Type-Options: nosniff |
| LOW | Missing HSTS on HTTPS site — SSL-stripping attack possible via active MitM |
| LOW | HSTS max-age below recommended 1 year |
| LOW | Server header discloses software version — reconnaissance aid |
| LOW | X-Powered-By header exposes backend technology |
| LOW | Insecure cookie flags — missing HttpOnly, Secure, or SameSite |
| LOW | CORS wildcard Access-Control-Allow-Origin: * (acceptable for public APIs, noted as LOW) |
| LOW | CSP allows unsafe-inline or unsafe-eval — weakens XSS protection |
| INFO | Missing Content-Security-Policy — hardening gap, not a threat signal |
| INFO | Missing X-Frame-Options — hardening gap, not a threat signal |
| INFO | Missing Referrer-Policy — privacy gap |
| INFO | Missing Permissions-Policy — rarely set by any site |
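A few of the rows above can be expressed as a passive function over a header mapping. This is a sketch covering only the CORS, HSTS, and nosniff checks; the function name and finding labels are assumptions. Whether the real analyser reports the LOW wildcard finding in addition to the HIGH one is not stated, so the sketch reports only the more severe of the two.

```python
import re
from typing import List, Mapping, Tuple

ONE_YEAR_SECONDS = 31_536_000


def analyse_headers(headers: Mapping[str, str], is_https: bool) -> List[Tuple[str, str]]:
    """Return (severity, description) pairs for a subset of the header checks."""
    h = {name.lower(): value for name, value in headers.items()}
    findings: List[Tuple[str, str]] = []

    origin = h.get("access-control-allow-origin", "").strip()
    credentials = h.get("access-control-allow-credentials", "").strip().lower()
    if origin == "*" and credentials == "true":
        findings.append(("HIGH", "CORS wildcard origin with credentials"))
    elif origin == "*":
        findings.append(("LOW", "CORS wildcard origin"))

    if is_https:
        hsts = h.get("strict-transport-security")
        if hsts is None:
            findings.append(("LOW", "Missing HSTS"))
        else:
            match = re.search(r'max-age\s*=\s*"?(\d+)', hsts, re.IGNORECASE)
            if match is None or int(match.group(1)) < ONE_YEAR_SECONDS:
                findings.append(("LOW", "HSTS max-age below one year"))

    if h.get("x-content-type-options", "").strip().lower() != "nosniff":
        findings.append(("LOW", "Missing X-Content-Type-Options: nosniff"))
    return findings
```

Header names are lower-cased first because HTTP header names are case-insensitive.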
SSL Analysis
6 checks on the TLS certificate. The certificate is retrieved directly — no third-party certificate transparency APIs are used.
| Severity | Check |
|---|---|
| HIGH | Certificate expires in fewer than 14 days |
| HIGH | Self-signed certificate |
| HIGH | Hostname / SAN mismatch |
| HIGH | Let's Encrypt certificate issued to a brand-impersonating domain |
| MEDIUM | Deprecated TLS version (1.0 or 1.1) negotiated |
| INFO | Certificate issued within the last 7 days — new cert on suspicious domain is a phishing indicator |
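The expiry check reduces to date arithmetic on the certificate's `notAfter` field. The sketch below assumes the field is in the text format returned by Python's `ssl.SSLSocket.getpeercert()`, and the helper names are hypothetical. A certificate that has already expired yields a negative day count and is therefore flagged as well.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between `now` (epoch seconds, defaults to the current time) and notAfter."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def expiry_finding(not_after: str, now: Optional[float] = None) -> Optional[str]:
    """Return "HIGH" when fewer than 14 days remain, per the table above."""
    return "HIGH" if days_until_expiry(not_after, now) < 14 else None
```

Accepting `now` as a parameter keeps the check deterministic when it is exercised against fixed dates.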
Technology Detection
Identifies the technology stack from HTML, script sources, HTTP headers, and cookies. Displayed as colour-coded badges with logos on the results page.
| Category | Technologies detected |
|---|---|
| CMS | WordPress, Drupal, Joomla, Ghost, Shopify, Wix, Squarespace, Webflow, HubSpot CMS |
| JS Framework | React, Next.js, Vue, Nuxt, Angular, Svelte, SvelteKit, Ember, Backbone.js, Astro, Remix, Gatsby, Solid.js |
| Build Tool | Vite, webpack |
| JS Library | jQuery, Lodash, Axios, GSAP, Three.js, Alpine.js, htmx, Socket.io, Chart.js, D3.js, Swiper, Pusher |
| CSS Framework | Bootstrap, Tailwind CSS, Bulma, Font Awesome, UIkit |
| Backend | PHP, Python, Node.js, Express, ASP.NET, Laravel, Django, Ruby on Rails, Java, Flask, FastAPI, Symfony, Spring Boot |
| Web Server | nginx, Apache, Caddy, Gunicorn, LiteSpeed |
| CDN | Cloudflare, AWS CloudFront, Fastly, Akamai, jsDelivr |
| Hosting | Vercel, Netlify, GitHub Pages, Firebase, Render |
| Analytics | Google Analytics, Google Tag Manager, Facebook Pixel, Hotjar, Intercom, Mixpanel, Plausible, Matomo, TikTok Pixel, LinkedIn Insight, Cloudflare Web Analytics |
| Security | Cloudflare Turnstile, reCAPTCHA, hCaptcha, Cloudflare Bot Management |
| Payment | Stripe, PayPal, Square, Klarna |
Visual Renderer (Carapace)
Carapace is an optional companion service that provides a Chromium-headless visual screenshot of each scanned URL. It renders the page with JavaScript fully enabled but all network requests intercepted and blocked — allowing dynamic overlays (ClickFix, SocGholish, ClearFake, drainers) to execute and render visibly in the screenshot, revealing the actual attack UI. A verdict badge is composited onto every screenshot.
Carapace runs as a separate Docker service alongside the Insight backend. When CARAPACE_URL is configured, every scan automatically calls the POST /render endpoint and the screenshot is stored with the scan result. If Carapace is unavailable the scan continues normally; it is entirely best-effort.
In addition to the screenshot, Carapace returns a threat report with:
- A risk score (0–100) derived from renderer-level observations
- Threat flags: findings from the renderer converted directly into Insight findings under the Renderer category
- Technology detections: a DOM-parsed tech stack that is merged with Insight's own BeautifulSoup-based detections, catching things a static parser may miss
Renderer Findings
Threat flags from Carapace are converted to Insight findings in the Renderer category. Sanitisation-behaviour codes that fire on almost every real page (e.g. BLOCKED_ELEMENT_SCRIPT, NETWORK_ATTEMPT_BLOCKED) are suppressed; only signals with independent threat value are surfaced.
| Severity | Flag | Description |
|---|---|---|
| CRITICAL | Drive-by download blocked | Renderer intercepted an automatic file download — filename, MIME type, and SHA-256 recorded without executing the file |
| HIGH | JS eval detected | eval() or new Function() execution observed at render time — runtime obfuscation |
| HIGH | Exfiltration attempt blocked | Outbound XHR / fetch to an external domain intercepted by the network block |
| HIGH | Credential field on HTTP | Password input found on an unencrypted page — credentials would be sent in plaintext |
| MEDIUM | Redirect chain detected | Multiple HTTP redirects observed before final page load — common in traffic distribution systems |
| MEDIUM | Suspicious download link | Executable file linked from page content (.exe, .msi, .ps1, etc.) |
| MEDIUM | DevTools evasion detected | Renderer-side debugger / devtools detection attempt observed |
API Endpoints
All endpoints are under /api/. No authentication is required — rate limiting is enforced per IP address.
| Method | Path | Description |
|---|---|---|
| POST | /api/scan/ | Submit a URL for scanning. Returns {"id": "...", "status": "PENDING"}. 202 on success, 429 if rate limited. |
| GET | /api/scan/{id}/ | Poll scan status and retrieve full results once complete. |
| GET | /api/scan/{id}/stream/ | Server-Sent Events stream. Events: status_update, complete, error. Auto-retry on disconnect (3 attempts, 2s delay). |
| GET | /api/scan/{id}/source/ | Re-fetch the raw source of a URL that belongs to a completed scan. Only permits URLs within the original scan scope. |
| GET | /api/history/ | Paginated list of completed scans. Supports ?q= URL substring filter and ?page=. |
| GET | /api/health/ | Health check; returns {"status":"ok"}. |
| GET | /api/schema/swagger-ui/ | Interactive API documentation (drf-spectacular). |
Tech Stack
| Layer | Technology |
|---|---|
| Backend | Python 3.11 / Django 5.2 / Django REST Framework |
| Task queue | Celery + Redis |
| API docs | drf-spectacular — Swagger UI at /api/schema/swagger-ui/ |
| Frontend | React 19 / TypeScript / Vite / Tailwind CSS 4 |
| Database | SQLite (development) / PostgreSQL (production) |
| Cache / broker | Redis |
| Visual renderer | Carapace — Chromium-headless screenshot service (optional sidecar) |
Acknowledgements
Insight is built on a number of excellent open-source libraries.
Backend & Analysis
Frontend
Infrastructure
Visual Renderer