The split workflow
I needed LLMs for security research but could not send sensitive data to external APIs. The obvious answer — use a local model for everything — ran into a quality wall fast. Local 8B models handle structured extraction but struggle with the multi-step reasoning that Snort rule generation requires. The workflow settled into a split: local models for anything touching sensitive data, remote models for tasks that need better reasoning. The deciding factor is what leaves the machine.
I came across Foundation-Sec-8B while looking for something better than a general-purpose model for security work. Cisco released it as a security-domain fine-tune of Llama 3.1 8B, trained on vulnerability data, threat reports and security documentation. It runs on a single GPU and inference is fast enough for batch processing. For CVE triage, IOC extraction and initial classification of threat data, it handles the work without any data leaving the local network.
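Getting it running locally is mostly plumbing. A minimal setup sketch, under two assumptions not stated in this post: that the model is served through Ollama (the `foundation-sec-8b:latest` tag in the JSON output and hexwhisper's `--llm ollama` flag both point that way) and that the registry tag is as written:

```shell
# Assumptions: the model is published under this Ollama tag, and the
# llm-ollama plugin is used to expose Ollama models to the llm CLI.
ollama pull foundation-sec-8b
llm install llm-ollama
llm aliases set foundation-sec foundation-sec-8b:latest
```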
Claude handles the tasks where reasoning quality matters more than data sensitivity: Snort rule generation from a vulnerability description, multi-step attack chain analysis and reviewing detection logic for gaps. The output quality difference between an 8B local model and a frontier model is significant for these tasks.
CVE analysis in practice
The most common workflow is CVE analysis. A vulnerability writeup goes in, structured output comes out: root cause, affected components, CWE classification, ATT&CK technique mapping and Snort rule suggestions. The prompt template is plain text, versioned in git and iterated on over months.
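With the llm CLI, templates are YAML files (saved with `llm --system '...' --save vuln_research`, or edited via `llm templates edit`), so versioning them in git means keeping the templates directory under version control. A sketch of what such a template might look like; the prompt text here is illustrative, not the actual template:

```yaml
# Hypothetical vuln_research template -- the real one is not shown in this post.
system: >
  Analyze this vulnerability report and present a 10 point summary covering:
  root cause, affected components, CWE classification, ATT&CK technique
  mapping, and suggested Snort rules.
```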
Let's have a look at the concrete workflow. The llm tool handles all of this from the command line:
$ cat cve-2024-3094-writeup.txt | llm -m foundation-sec -t vuln_research -j
The -j flag gives structured JSON output:
{
  "model": "foundation-sec-8b:latest",
  "text": {
    "system": "Analyze this vulnerability report and present a 10 point summary ...",
    "user": "<writeup contents>",
    "assistant": "1. XZ Utils backdoor (CVE-2024-3094) is a supply chain attack ...\n2. CWE-506: Embedded Malicious Code ...\n3. ATT&CK: T1195.002 Supply Chain Compromise ...\n4. Snort suggestion: alert tcp any any -> any 22 (msg:\"Possible XZ backdoor ...\"; ...)"
  },
  "stats": {
    "tokens": 1247,
    "tokens_per_second": 38.6,
    "runtime": 32.3
  }
}
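Downstream scripts can treat this JSON as a small API. A minimal parser, assuming only the shape shown above (`text.assistant` for the analysis, `stats` for throughput); the field selection is an assumption, not part of the llm tool:

```python
import json

def load_analysis(path):
    """Parse one `llm -j` output file into the fields downstream tools use.
    The shape (model / text.assistant / stats) is assumed from the example."""
    with open(path) as f:
        doc = json.load(f)
    return {
        "model": doc["model"],
        "summary": doc["text"]["assistant"],
        "tokens_per_second": doc["stats"]["tokens_per_second"],
    }
```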
Comparing the same writeup across models shows the quality gap:
$ cat writeup.txt | llm -m foundation-sec -t vuln_research -j > out_local.json
$ cat writeup.txt | llm -m claude -t vuln_research -j > out_claude.json
Foundation-Sec-8B handles the structured extraction well: it identifies CWE categories, pulls out affected version ranges and maps to ATT&CK techniques. Where it falls short is generating a Snort rule that would actually catch exploitation in network traffic. The rule syntax is correct, but the detection logic is often too broad or misses the key indicator.
Claude produces better Snort rules. Not production-ready (nothing generated is), but closer to a useful starting point. It handles multi-step reasoning: this is a heap overflow in the HTTP parser, the malformed input appears in this header field, the Snort rule needs to match this byte pattern after the header name. That chain of reasoning is where model size matters.
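Side-by-side review is easier when the candidate rules are pulled out of each model's prose first. A small helper for that, assuming the `-j` output shape shown earlier and that rules appear inline as `alert ...` lines; the regex is deliberately loose:

```python
import json
import re

def extract_snort_rules(path):
    """Pull Snort-style alert lines out of one analysis file so the local
    and remote outputs can be compared rule-by-rule. Assumes the -j JSON
    shape shown earlier in the post."""
    with open(path) as f:
        text = json.load(f)["text"]["assistant"]
    return re.findall(r"alert \w+ [^\n]+", text)
```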
Batch processing and local models
The batch workflow runs the prompt against a directory of CVE writeups. The tidal pipeline collects the raw data. llm processes it:
$ ./tidal/bin/collect-all.sh
$ for f in tidal/data/cve-*.json; do
    cat "$f" | llm -m foundation-sec -t vuln_research -j > "analyzed/$(basename "$f")"
  done
Processing 50 CVEs through Foundation-Sec-8B takes about 20 minutes on local hardware. The same batch through Claude's API takes less time but costs money and sends the data externally.
For internal vulnerability data — pre-disclosure, customer-reported, in-progress research — the local model is the only option. The quality tradeoff is acceptable because the output is a first pass, not a final product. A researcher reviews every output, corrects the classification and writes the actual detection. The LLM saves time on the structured extraction, not on the analysis.
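When a 20-minute batch dies halfway through, rerunning the shell loop redoes finished work. A resumable driver along the same lines, sketched in Python; the `llm` invocation mirrors the loop above, and the runner is injectable so the skip logic can be tested without a model:

```python
import subprocess
from pathlib import Path

def analyze_batch(src_dir, out_dir, run=None):
    """Run each collected CVE writeup through the local model, skipping
    files that already have output (so an interrupted batch can resume).
    `run` defaults to invoking the llm CLI as in the shell loop above."""
    run = run or (lambda cmd, text: subprocess.run(
        cmd, input=text, capture_output=True, text=True, check=True).stdout)
    out_dir = Path(out_dir)
    out_dir.mkdir(exist_ok=True)
    done = []
    for src in sorted(Path(src_dir).glob("cve-*.json")):
        dst = out_dir / src.name
        if dst.exists():  # already analyzed in a previous run
            continue
        result = run(["llm", "-m", "foundation-sec", "-t", "vuln_research", "-j"],
                     src.read_text())
        dst.write_text(result)
        done.append(src.name)
    return done
```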
Generating test packets for rule validation is part of the same pipeline. hexwhisper takes a natural language description and produces Wireshark-compatible hex:
$ hexwhisper "malformed HTTP request with directory traversal in URI" --llm ollama
The output is structured JSON with per-layer hex data, field names matching Wireshark display filters and the complete packet hex ready for text2pcap. Not a substitute for real traffic captures, but useful for quick validation that a Snort rule triggers on the expected pattern.
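text2pcap expects a classic hex dump (offset, then bytes) rather than a bare hex string, so a small formatting shim sits between the JSON and the pcap. A sketch, assuming the complete packet hex comes out of hexwhisper's JSON as one contiguous string:

```python
def to_text2pcap(packet_hex):
    """Format a raw hex string into the offset/byte layout text2pcap parses:
    a hex offset, two spaces, then up to 16 space-separated byte values."""
    data = bytes.fromhex(packet_hex)
    lines = []
    for off in range(0, len(data), 16):
        chunk = data[off:off + 16]
        lines.append(f"{off:06x}  " + " ".join(f"{b:02x}" for b in chunk))
    return "\n".join(lines) + "\n"
```

Written to a file, `text2pcap dump.txt out.pcap` produces a capture that can be replayed against the Snort rule under test.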
What the 8B model actually knows
Foundation-Sec-8B has useful domain knowledge. It knows CVE numbering conventions, CWE taxonomy, CVSS scoring factors and common vulnerability classes. It handles security-specific terminology without the confusion that general-purpose models sometimes show: conflating buffer overflow types, misidentifying protocol layers. The fine-tuning on security data shows in these areas.
Where it falls short is synthesis. Given a novel vulnerability description, it will correctly classify it but will not generate detection strategies tailored to the specific vulnerability mechanics. It produces template-like Snort rules. This is the expected tradeoff at 8B parameters.
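The difference is easiest to see side by side. Both rules below are invented for illustration (neither comes from a model run in this post, and the header name is hypothetical): the first keys on almost nothing specific, the second keys on the vulnerability mechanics:

```
# Template-like: syntactically valid, but fires on any POST to port 80.
alert tcp any any -> any 80 (msg:"Possible exploit attempt"; content:"POST"; sid:1000001;)

# Tailored: matches an oversized value in the specific header the parser
# overflows on (header name hypothetical).
alert tcp any any -> any 80 (msg:"HTTP parser heap overflow attempt"; content:"X-Custom-Hdr|3a 20|"; isdataat:512,relative; sid:1000002;)
```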
Tradeoffs after six months
Local models are mandatory for sensitive data, adequate for structured extraction and weak at multi-step reasoning. Remote models produce better analysis but require data to leave the local environment. The split workflow handles both constraints.
The tooling around this — the llm CLI, prompt templates, batch scripts — is more useful than any individual model. Models improve and get replaced. The workflow, the prompt templates and the evaluation criteria persist across model changes. That's where the investment is worth making.