Very cool to hear your perspective in how you are using the small LLMs! I’ve been experimenting extensively with local LLM stacks on:
• M1 Max (MLX native)
• LM Studio (GLM, MLX, GGUFs)
• Llama.cp (GGUFs)
• n8n for orchestration + automation (multi-stage LLM
workflows)
My emerging use cases:
-Rapid narration scripting
-Roleplay agents with embedded prompt personas
-Reviewing image/video attachments + structuring copy for clarity
-Local RAG and eval pipelines
My current lineup of small LLMs (this changes every month depending on what is updated):
MLX-native models (mlx-community):
-Qwen2.5-VL-7B-Instruct-bf16 → excellent VQA and instruction following
-InternVL3-8B-3bit → fast, memory-light, solid for doc summarization
-GLM-Z1-9B-bf16 → reliable multilingual output + inference density
GGUF via LM Studio / llama.cpp:
-Gemma-3-12B-it-qat → well-aligned, solid for RP dialogue
-Qwen2.5-0.5B-MLX-4bit → blazing fast; chaining 2+ agents at once
-GLM-4-32B-0414-8bit (Cobra4687) → great for iterative copy drafts
Emerging / niche models tested:
MedFound-7B-GGUF → early tests for narrative medicine tasks
llama-3.2-3B-storyteller-Q4_K_M → small, quick, capable of structured hooks
PersonalityParty_saiga_fp32-i1 → RP grounding experiments (still rough)
I test most new LLMs on release. QAT models in particular are showing promise, balancing speed + fidelity for chained inference.
The meta-trend: models are getting better, smaller, faster, especially for edge workflows.
Happy to swap notes if others are mixing MLX, GGUF, and RAG in low-latency pipelines.
Very cool to hear your perspective in how you are using the small LLMs! I’ve been experimenting extensively with local LLM stacks on:
• M1 Max (MLX native)
• LM Studio (GLM, MLX, GGUFs)
• Llama.cp (GGUFs)
• n8n for orchestration + automation (multi-stage LLM workflows)
My emerging use cases: -Rapid narration scripting -Roleplay agents with embedded prompt personas -Reviewing image/video attachments + structuring copy for clarity -Local RAG and eval pipelines
My current lineup of small LLMs (this changes every month depending on what is updated):
MLX-native models (mlx-community):
-Qwen2.5-VL-7B-Instruct-bf16 → excellent VQA and instruction following
-InternVL3-8B-3bit → fast, memory-light, solid for doc summarization
-GLM-Z1-9B-bf16 → reliable multilingual output + inference density
GGUF via LM Studio / llama.cpp:
-Gemma-3-12B-it-qat → well-aligned, solid for RP dialogue
-Qwen2.5-0.5B-MLX-4bit → blazing fast; chaining 2+ agents at once
-GLM-4-32B-0414-8bit (Cobra4687) → great for iterative copy drafts
Emerging / niche models tested:
MedFound-7B-GGUF → early tests for narrative medicine tasks
X-Ray_Alpha-mlx-8Bit → experimental story/dialogue hybrid
llama-3.2-3B-storyteller-Q4_K_M → small, quick, capable of structured hooks
PersonalityParty_saiga_fp32-i1 → RP grounding experiments (still rough)
I test most new LLMs on release. QAT models in particular are showing promise, balancing speed + fidelity for chained inference. The meta-trend: models are getting better, smaller, faster, especially for edge workflows.
Happy to swap notes if others are mixing MLX, GGUF, and RAG in low-latency pipelines.