Lokale LLMs auf einem M4 mit 16 GB: Ein Testtag zwischen Genie und Speicherwarnung

Datum: 2026-04-24
Autor: Testteam (aka: wir, Kaffee und Activity Monitor)

DEUTSCH

Warum dieser Test überhaupt?

Weil Benchmarks in schönen Diagrammen nett sind, aber im echten Leben nicht beantworten, was man wirklich wissen will:

Läuft das Modell stabil auf echter Alltags-Hardware?
Antwortet es zuverlässig auf Business-Aufgaben?
Ist es nur schnell oder auch brauchbar?
Und ganz wichtig: Muss jemand den ganzen Tag daneben sitzen und manuell Tabellen pflegen?

Wir haben also nicht „schau mal, wie viele Tokens pro Sekunde“ getestet, sondern „kann ich damit in der Praxis arbeiten, ohne jeden zweiten Run zu verfluchen“.

Das Setup: Mac mini M4, 16 GB RAM, Realität statt Rechenzentrum

Unser Testsystem war bewusst kein Monster-Server, sondern ein praxisnahes Setup:

Gerät: Mac mini (M4)
RAM: 16 GB
Tooling: LM Studio + dokumentierte Screenshots + strukturierte Score-Dateien
Bewertungslogik pro Run:
- Accuracy/Faithfulness (0–5)
- Format/Compliance (0–5)
- Robustness (0–5)
- Practicality (0–5)
- plus Pass/Fail, Latenz, Output-Tokens, Tokens/s

Die 4 Modellkonfigurationen im Test

Getestet wurden:

qwen3.6-35b-a3b-ud (stark komprimiert, UD/Unsloth)
qwen3.6-35b-a3b-jangtq2 (stark komprimiert, JANGTQ2)
ternary-bonsai-8b-mlx
google/gemma-4-e2b

Wichtig: Die zwei Qwen-Varianten wurden teilweise unter demselben Modellnamen in Logs geführt, daher erfolgte die Auswertung in einer gemeinsamen Hauptsequenz plus expliziter Erwähnung beider Varianten.

Die 8 Testfelder (immer gleich, für alle Modelle)

Damit der Vergleich fair bleibt, mussten alle Modelle dieselben Aufgaben lösen:

MIX-HR-001 – HR-Faktencheck + Absage + Compliance
RAG-001 – Long-Context-Recall (Eskalationsregel E-17)
CS-001 – Kundenservice-Deeskalation + valides JSON
MKT-002 – Marketing-Slogans + LinkedIn + Krisenantwort
LEG-001 – DSGVO-Risiko + Klausel + Grenzfall
FIN-001 – Zahlenanalyse + Executive Summary + Python-Funktion
PROD-001 – Sicherheitsurteil + technisches JSON + Shutdown-Prozess
CODE-002 – SQL-Injection-Fix + FastAPI + Passwort-Hashing-Best-Practice

Der eigentliche Bossfight: Qwen laden auf 16 GB

Der wichtigste Praxispunkt des ganzen Tages:

Die Qwen-Modelle waren im Normalzustand nicht stabil ladbar. Nicht „ein bisschen langsam“, sondern „LM Studio sagt nein“.

Beobachtet wurde bei hohen Default-Settings sinngemäß: Not enough resources to load the model with the current settings.

Stabil wurde es erst nach gezieltem Tuning. Das funktionierende Profil sah so aus:

Context Length: ~4096 (statt sehr hoch)
GPU Offload: 40
CPU Thread Pool: 10
Evaluation Batch Size: 512
Unified KV Cache: aktiv
Offload KV Cache to GPU Memory: aktiv
Keep Model in Memory: aktiv
Try mmap(): aktiv
Flash Attention: aktiv

Kurz: Das Modell war nicht „kaputt“, aber ohne Runtime-Disziplin nicht praxistauglich auf dieser Hardware.

Was lief gut, was tat weh?

Qwen (UD/JANGTQ2)

Stärken:

inhaltlich oft sehr stark
gute Antworten in HR, RAG, Marketing, Legal, Production

Schwächen:

klar settings-sensitiv
unter Last spürbarer Memory-/Swap-Druck
in der Coding-Sequenz einmal Context-Limit-bedingter Output-Cut

Qwen war damit wie ein sehr talentierter Kollege, der hervorragende Arbeit macht, aber nur wenn der Schreibtisch exakt richtig eingestellt ist.

Ternary Bonsai 8B MLX

Stärken:

schnell, klar, stabil
in mehreren Bereichen sehr präzise
im Alltag überraschend stark konkurrenzfähig

Einziger nennenswerter Abzug:

FIN-001 war vollständig, aber die Ziel-Abweichungslogik wurde uneinheitlich dargestellt (inhaltlich brauchbar, methodisch nicht perfekt).

Ternary war der Kandidat, der ohne großes Drama liefert und dabei konstant wirkt.

Gemma 4 E2B

Stärken:

insgesamt beste Balance aus Qualität, Stabilität und Geschwindigkeit
sehr robuste Antworten in fast allen Feldern
besonders stark im Coding/Security-Paket

Abzug:

in PROD-001 wurden im JSON die Grenzwerte nicht sauber auf die Sollwerte gesetzt (inhaltliche Sicherheitslogik war aber korrekt).

Gemma war an dem Tag der „ich mache einfach meinen Job“-Performer.

Der versteckte Produktivitätskiller: Erfassungsworkflow

Die größte Bremse war nicht die Modellqualität, sondern die Begleitdokumentation:

Rohdaten waren da (Screenshots, Antworten)
aber nicht jeder Run wurde sofort konsistent in strukturierte Tabellen übernommen
dadurch stieg der manuelle Nachpflegeaufwand massiv

Das ist der Kern-Learning-Punkt: Nicht nur LLMs benchmarken, sondern den Erfassungsprozess genauso ernst nehmen wie die Antworten.

Unsere Zeitbilanz (heute)

Erster Screenshot: 13:45:16
Letzter Screenshot: 19:56:21
Brutto-Zeitfenster: 6h 11m 05s
Netto-Aktivzeit (je nach Pausenfilter): grob 3.5h+ aktive Testarbeit

Anders gesagt: Ja, wir haben viel gelernt. Ja, es war aufwändig. Und ja, der Activity Monitor hatte heute auch einen vollen Arbeitstag.

Sternchenbewertung (Antwortqualität gesamt)

google/gemma-4-e2b: ***** (4.88/5)
ternary-bonsai-8b-mlx: ***** (4.75/5)
qwen3.6-35b-a3b (UD/JANGTQ2 zusammen): ****- (4.28/5)

Zusammenfassung und Fazit (DE)

Wenn wir nur nach „gefühlter Alltagstauglichkeit“ entscheiden:

Gemma 4 E2B ist aktuell die beste Primärwahl.
Ternary Bonsai 8B MLX ist ein sehr starker Zweitkandidat und muss sich nicht verstecken.
Qwen (UD/JANGTQ2) bleibt qualitativ stark, braucht aber diszipliniertes Runtime-Tuning auf 16-GB-Hardware.

Das wichtigste Meta-Fazit: Der nächste große Gewinn kommt nicht nur von besseren Modellen, sondern von einem modularen, automatisierten Test- und Reporting-Flow.

ENGLISH

Why we ran this test in the first place

Because pretty benchmark charts rarely answer the painful real-world questions:

Does the model run reliably on normal hardware?
Is the output business-usable, not just verbose?
Is it fast and accurate, or just fast and confident?
And the big one: can you test at scale without babysitting logs all day?

So this was a practical stress test, not a vanity metric exercise.

The setup: Mac mini M4, 16 GB RAM, real constraints

We intentionally used a constrained but realistic machine:

Device: Mac mini (M4)
RAM: 16 GB
Stack: LM Studio + screenshot evidence + structured scoring sheets
Scoring dimensions per run:
- Accuracy/Faithfulness (0–5)
- Format/Compliance (0–5)
- Robustness (0–5)
- Practicality (0–5)
- plus Pass/Fail, latency, output tokens, tokens/s
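
To keep these dimensions consistent from run to run, each result could be stored as a small typed record. The following is a minimal sketch, not the team's actual schema: the class name `RunScore`, the field names, and the unweighted mean used for `quality` are all assumptions, since the scoring sheets are not shown here. Tokens/s is derived from output tokens and total latency, which is also an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunScore:
    """One scored run. Field names are illustrative, not the original schema."""
    model: str
    task_id: str
    accuracy: int       # Accuracy/Faithfulness, 0-5
    format_: int        # Format/Compliance, 0-5
    robustness: int     # Robustness, 0-5
    practicality: int   # Practicality, 0-5
    passed: bool        # Pass/Fail
    latency_s: float    # wall-clock seconds for the whole answer
    output_tokens: int

    def __post_init__(self) -> None:
        # Reject scores outside the 0-5 scale early, before they reach a table.
        for name in ("accuracy", "format_", "robustness", "practicality"):
            value = getattr(self, name)
            if not 0 <= value <= 5:
                raise ValueError(f"{name} must be in 0..5, got {value}")

    @property
    def quality(self) -> float:
        """Unweighted mean of the four dimensions (an assumption)."""
        return mean((self.accuracy, self.format_, self.robustness, self.practicality))

    @property
    def tokens_per_s(self) -> float:
        """Output tokens divided by total latency; 0.0 if latency is missing."""
        return self.output_tokens / self.latency_s if self.latency_s > 0 else 0.0


# Illustrative values only, not a measured result.
demo = RunScore("demo-model", "CS-001", 5, 4, 5, 4, True, 10.0, 500)
print(demo.quality, demo.tokens_per_s)
```

Validating the 0–5 range at construction time means a typo in a score fails loudly at entry rather than surfacing later in an average.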

The 4 model configurations

qwen3.6-35b-a3b-ud
qwen3.6-35b-a3b-jangtq2
ternary-bonsai-8b-mlx
google/gemma-4-e2b

Important: the two Qwen variants were partly logged under the same model name, so we evaluated them as one main sequence and note the variant explicitly where it matters.

The 8 test domains (identical for each model)

MIX-HR-001 – HR fact-check + rejection email + compliance
RAG-001 – long-context recall (E-17 escalation rule)
CS-001 – customer de-escalation + valid JSON
MKT-002 – slogans + LinkedIn post + public crisis response
LEG-001 – GDPR risk + legal clause + edge-case handling
FIN-001 – numerical analysis + executive summary + Python function
PROD-001 – safety decision + technical JSON + shutdown process
CODE-002 – SQL injection fix + FastAPI endpoint + password hashing best practice
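
Several of these tasks demand valid JSON, which is one of the few Format/Compliance criteria that can be checked mechanically. A minimal sketch of such a check follows. The function name `check_json_output` and the required keys in the usage line are hypothetical, because the actual CS-001 and PROD-001 schemas are not given here.

```python
import json

FENCE = "`" * 3  # Markdown code fence that models often wrap JSON in


def check_json_output(raw: str, required_keys: set[str]) -> tuple[bool, str]:
    """Return (ok, reason) for a model answer that should be one JSON object."""
    text = raw.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()
        if lines[-1].strip() == FENCE:
            lines = lines[:-1]          # drop the closing fence
        text = "\n".join(lines[1:])     # drop the opening fence line
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top-level value is not an object"
    missing = required_keys - set(data)
    if missing:
        return False, f"missing keys: {sorted(missing)}"
    return True, "ok"


# Hypothetical keys, for illustration only.
print(check_json_output('{"reply": "We are sorry.", "priority": "high"}',
                        {"reply", "priority"}))
```

A check like this only covers syntax and key presence. Whether the values are correct, as in the PROD-001 deduction, still needs a task-specific comparison or a human reviewer.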

The real boss level: loading Qwen on 16 GB

This was the key operational finding:

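Qwen did not load reliably with default settings. This was not a case of "a bit slow": at high context lengths, LM Studio refused to load the model with a resource error along the lines of "Not enough resources to load the model with the current settings."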

The stable profile required tuning:

Context length: ~4096
GPU offload: 40
CPU threads: 10
Eval batch: 512
Unified KV cache: enabled
Offload KV cache to GPU memory: enabled
Keep model in memory: enabled
Try mmap(): enabled
Flash attention: enabled

In short: strong model, but only with disciplined runtime configuration on this hardware.
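
Because the result depends so heavily on these settings, it may help to store the profile as data and compare it with what was actually set before each run. This is a sketch under assumptions: the key names are our own labels, not LM Studio's configuration schema, and the context length of "~4096" is written here as exactly 4096.

```python
# Settings as listed in the stable profile above. Key names are illustrative
# labels, NOT LM Studio's internal configuration format.
QWEN_STABLE_PROFILE = {
    "context_length": 4096,
    "gpu_offload": 40,
    "cpu_threads": 10,
    "eval_batch_size": 512,
    "unified_kv_cache": True,
    "offload_kv_cache_to_gpu": True,
    "keep_model_in_memory": True,
    "try_mmap": True,
    "flash_attention": True,
}


def profile_diff(expected: dict, actual: dict) -> dict:
    """Return {setting: (expected, actual)} for every setting that differs."""
    keys = expected.keys() | actual.keys()
    return {
        key: (expected.get(key), actual.get(key))
        for key in sorted(keys)
        if expected.get(key) != actual.get(key)
    }


# Example: someone raised the context length back to a high value.
noted = dict(QWEN_STABLE_PROFILE, context_length=32768)
print(profile_diff(QWEN_STABLE_PROFILE, noted))
```

Logging the diff next to each run makes it possible to tell later whether a failed load or a clipped answer came from the model or from a drifted setting.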

What worked and what hurt

Qwen (UD/JANGTQ2)

Pros:

often excellent content quality
strong performance across multiple business tasks

Cons:

highly sensitive to memory/runtime settings
visible swap pressure under load
in one coding run, the output was cut off at the context limit

Ternary Bonsai 8B MLX

Pros:

very fast and responsive
consistently strong, practical answers
surprisingly competitive overall

Main deduction:

FIN-001 was complete, but the deviation-from-target logic was presented inconsistently (usable in substance, not methodically clean).

Gemma 4 E2B

Pros:

best overall balance (quality + speed + stability)
robust across almost all domains
especially strong in coding/security tasks

Deduction:

In PROD-001, the limit values in the JSON were not set cleanly to the expected target values (the safety reasoning itself was correct).

The hidden bottleneck: capture workflow

The biggest pain point was not model intelligence. It was process friction:

raw evidence existed,
but normalization into structured scoring files was not always immediate,
which increased manual cleanup effort dramatically.

Lesson learned: your evaluation pipeline is part of your benchmark.
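
One low-effort way to reduce this friction is to append a structured row the moment a run finishes, instead of reconstructing it later from screenshots. The sketch below assumes a flat CSV file and hypothetical column names. It is not the tooling used on the test day.

```python
import csv
from pathlib import Path

# Hypothetical column set; adjust to the real scoring sheet.
FIELDS = ["timestamp", "model", "task_id", "accuracy", "format", "robustness",
          "practicality", "passed", "latency_s", "output_tokens", "screenshot"]


def append_run(path: Path, row: dict) -> None:
    """Append one run to a CSV file, writing the header on first use."""
    missing = [field for field in FIELDS if field not in row]
    if missing:
        # Refuse incomplete records so gaps are fixed while the run is fresh.
        raise ValueError(f"incomplete run record, missing: {missing}")
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

Rejecting incomplete rows is the point of the design: the missing value is cheapest to fill in immediately after the run, not hours later.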

Time spent (today)

First screenshot: 13:45:16
Last screenshot: 19:56:21
Gross window: 6h 11m 05s
Active test effort: roughly 3.5h+ depending on break filtering

Yes, the models were tested. Also yes, the humans were tested.
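
The gross window follows directly from the two timestamps. The active time depends on how breaks are filtered, and that filter is not specified here, so the sketch below assumes one simple rule: a gap between consecutive screenshots counts as work only if it is at most 10 minutes. Both the rule and the threshold are assumptions.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M:%S"


def gross_window(first: str, last: str) -> timedelta:
    """Time between the first and last screenshot on the same day."""
    return datetime.strptime(last, TIME_FORMAT) - datetime.strptime(first, TIME_FORMAT)


def active_time(stamps: list[str],
                max_gap: timedelta = timedelta(minutes=10)) -> timedelta:
    """Sum gaps between consecutive screenshots, skipping gaps above max_gap."""
    times = sorted(datetime.strptime(stamp, TIME_FORMAT) for stamp in stamps)
    total = timedelta()
    for earlier, later in zip(times, times[1:]):
        gap = later - earlier
        if gap <= max_gap:  # longer gaps are treated as breaks (assumed rule)
            total += gap
    return total


print(gross_window("13:45:16", "19:56:21"))  # 6:11:05
```

With the full list of screenshot timestamps, `active_time` gives a reproducible figure in place of a rough estimate, and the threshold can be reported alongside it.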

Star rating (overall answer quality)

google/gemma-4-e2b: ***** (4.88/5)
ternary-bonsai-8b-mlx: ***** (4.75/5)
qwen3.6-35b-a3b (UD/JANGTQ2 combined): ****- (4.28/5)
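
The star strings above are consistent with rounding each score to the nearest whole star. Assuming that is the intended rule (it is not stated explicitly), the mapping can be reproduced with a few lines. The function name `stars` is hypothetical.

```python
def stars(score: float, scale: int = 5) -> str:
    """Render a score as filled '*' and empty '-' marks, rounding half up."""
    filled = max(0, min(scale, int(score + 0.5)))
    return "*" * filled + "-" * (scale - filled)


for name, score in [("google/gemma-4-e2b", 4.88),
                    ("ternary-bonsai-8b-mlx", 4.75),
                    ("qwen3.6-35b-a3b", 4.28)]:
    print(f"{name}: {stars(score)} ({score}/5)")
```

Note that whole-star rounding hides the gap between 4.88 and 4.75, which is why the numeric score is printed next to the stars.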

Summary and final verdict (EN)

If we optimize for practical pilot readiness:

Gemma 4 E2B is the top primary candidate.
Ternary Bonsai 8B MLX is a very strong second option.
Qwen (UD/JANGTQ2) remains high-quality but requires strict runtime tuning on 16 GB hardware.

Big-picture conclusion: The next major improvement will come from a modular capture-scoring-reporting pipeline, not only from switching to newer models.

SRPSKI

Zašto smo radili ovaj test?

Zato što lepi benchmark grafikoni ne odgovaraju na najvažnija praktična pitanja:

Da li model stabilno radi na realnom hardveru?
Da li su odgovori upotrebljivi za posao?
Da li je samo brz ili i tačan?
I najvažnije: da li test može da se vodi bez celodnevnog ručnog praćenja?

Dakle, cilj je bio praksa, ne samo brojke.

Setup: Mac mini M4, 16 GB RAM, realna ograničenja

Koristili smo namerno realno ograničen sistem:

Uređaj: Mac mini (M4)
RAM: 16 GB
Alati: LM Studio + screenshot evidencija + strukturisane score tabele
Ocenjivanje po run-u:
- Accuracy/Faithfulness (0–5)
- Format/Compliance (0–5)
- Robustness (0–5)
- Practicality (0–5)
- plus Pass/Fail, latency, output tokens, tokens/s

Korišćene 4 konfiguracije modela

qwen3.6-35b-a3b-ud
qwen3.6-35b-a3b-jangtq2
ternary-bonsai-8b-mlx
google/gemma-4-e2b

Napomena: dve Qwen varijante su delom vođene pod sličnim nazivima, pa su u glavnoj sekvenci posmatrane zajedno uz jasne napomene.

Osam test-polja (isto za sve modele)

MIX-HR-001 – HR provera činjenica + odbijenica + compliance
RAG-001 – long-context recall (E-17 pravilo eskalacije)
CS-001 – deeskalacija korisnika + validan JSON
MKT-002 – slogani + LinkedIn + javni krizni odgovor
LEG-001 – GDPR rizik + klauzula + granični slučaj
FIN-001 – analitika + executive summary + Python funkcija
PROD-001 – safety odluka + tehnički JSON + shutdown procedura
CODE-002 – SQL injection fix + FastAPI + best practice za hash lozinki

Glavni izazov: Qwen na 16 GB

Ključni operativni nalaz:

Qwen nije bio stabilno upotrebljiv na podrazumevanim visokim podešavanjima. Dobijali smo greške zbog resursa.

Stabilan profil je zahtevao tuning:

Context: ~4096
GPU offload: 40
CPU threads: 10
Eval batch: 512
Unified KV cache: uključen
Offload KV cache to GPU memory: uključen
Keep model in memory: uključen
Try mmap(): uključen
Flash attention: uključen

Ukratko: odličan potencijal, ali traži disciplinovano runtime podešavanje.

Šta je bilo dobro, a šta nas je kočilo?

Qwen (UD/JANGTQ2)

Prednosti:

često vrlo kvalitetan sadržaj
jaki rezultati kroz više poslovnih scenarija

Slabosti:

osetljivost na podešavanja i memorijski budžet
vidljiv swap pritisak pod opterećenjem
jedan coding run presečen zbog context limita

Ternary Bonsai 8B MLX

Prednosti:

veoma brz i responzivan
stabilni, praktični odgovori
vrlo konkurentan u ukupnom zbiru

Glavni minus:

FIN-001 je kompletan, ali logika odstupanja od cilja nije potpuno dosledna.

Gemma 4 E2B

Prednosti:

najbolji ukupni balans kvaliteta, brzine i stabilnosti
robustan kroz skoro sve test-polja
posebno jak u coding/security delu

Minus:

u PROD-001 JSON vrednosti nisu ubačene po očekivanim max specifikacijama (iako je safety zaključak bio dobar).

Skriveni problem: workflow beleženja

Najveći problem nije bio kvalitet modela, već proces:

sirova evidencija je postojala,
ali normalizacija u strukturisane tabele nije uvek bila trenutna,
što je povećalo ručni rad.

Pouka: pipeline za evidenciju je jednako važan kao i sam model.

Vreme (danas)

Prvi screenshot: 13:45:16
Poslednji screenshot: 19:56:21
Bruto prozor: 6h 11m 05s
Aktivni deo rada: približno 3.5h+ (zavisno od filtera pauza)

Drugim rečima: testirali smo modele, ali i sopstveno strpljenje.

Zvezdana ocena (ukupan kvalitet odgovora)

google/gemma-4-e2b: ***** (4.88/5)
ternary-bonsai-8b-mlx: ***** (4.75/5)
qwen3.6-35b-a3b (UD/JANGTQ2 zajedno): ****- (4.28/5)

Sažetak i zaključak (SR)

Ako gledamo praktičnu spremnost za pilot:

Gemma 4 E2B je prvi izbor.
Ternary Bonsai 8B MLX je vrlo jak drugi kandidat.
Qwen (UD/JANGTQ2) ostaje kvalitetan, ali traži striktna runtime podešavanja na 16 GB hardveru.

Glavni zaključak: Sledeći veliki napredak neće doći samo od „još boljeg modela“, već od modularnog i automatizovanog capture-scoring-reporting procesa.
