Local-Llm on danilchenko.dev

https://www.danilchenko.dev/tags/local-llm/
Recent content in Local-Llm on danilchenko.dev
Generator: Hugo. Language: en-us. Last updated: Sat, 11 Apr 2026 06:00:00 +0000

TriAttention Compresses KV Cache 10.7x — How Trigonometry Fixed Long-Context Reasoning
https://www.danilchenko.dev/posts/2026-04-11-triattention-kv-cache-compression-long-reasoning/
Sat, 11 Apr 2026 06:00:00 +0000
TriAttention uses pre-RoPE vector concentration and trigonometric scoring to compress KV cache 10.7x while matching full attention accuracy on reasoning tasks.

How to Run Gemma 4 Locally With Ollama, llama.cpp, and vLLM
https://www.danilchenko.dev/posts/2026-04-07-run-gemma-4-locally-ollama-llama-cpp-vllm/
Tue, 07 Apr 2026 06:00:00 +0000
Step-by-step guide to running Google Gemma 4 locally on your hardware with Ollama, llama.cpp, and vLLM, including model picks, VRAM requirements, and real gotchas.

Google's TurboQuant Compresses LLM Memory 6x With Zero Accuracy Loss — Here's How It Works
https://www.danilchenko.dev/posts/2026-03-27-google-turboquant-llm-compression-6x-zero-accuracy-loss/
Fri, 27 Mar 2026 06:00:00 +0000
Google's TurboQuant algorithm compresses LLM KV cache memory by 6x with zero accuracy loss and no retraining needed. We break down the ICLR 2026 paper.