Your Starter Guide to Playing with Local AI Models
This guide provides a beginner-friendly introduction to running Large Language Models (LLMs) locally. It aims to take users from knowing nothing to successfully having an AI on their computer by simplifying complex information. The article covers hardware-specific installation instructions (Nvidia, AMD, Mac, older machines), and explains fundamental AI concepts like LLMs, fine-tuning, context, model sizes, quantization, file formats (GGML/GGUF/GPTQ/exl2), and essential loading settings. It prioritizes getting users started with practical steps rather than deep technical dives.
• main points
1. Excellent for absolute beginners, demystifying complex local AI setup.
2. Provides clear, hardware-specific installation recommendations for various operating systems.
3. Breaks down essential LLM terminology and concepts in an accessible manner.
• unique insights
1. Offers practical advice on choosing model sizes and quantization levels based on VRAM limitations.
2. Explains the trade-offs between different quantization methods (e.g., q4_K_S vs. q6) and their impact on performance and quality.
• practical applications
Enables users to overcome initial hurdles in setting up and running local LLMs, providing a clear path to experimentation.
• key topics
1. Local LLM Installation
2. LLM Terminology (LLM, Fine-tune, Context)
3. Model Selection (Size, Quantization, Formats)
• key insights
1. Simplifies the often-intimidating process of setting up local AI models.
2. Provides actionable advice tailored to different hardware configurations and operating systems.
3. Demystifies jargon and technical concepts for a non-expert audience.
• learning outcomes
1. Successfully install and run an LLM on their local machine.
2. Understand fundamental concepts related to LLMs, fine-tuning, and model parameters.
3. Confidently choose appropriate LLM models and configurations based on their hardware.
The first step to playing with your own local AI is setting up the necessary software. The installation process can vary depending on your computer's hardware. We'll break down the recommendations for different graphics cards and operating systems to ensure a smooth setup.
“ Nvidia Graphics Card (Windows/Linux)
AMD users also have solid options for running AI models locally. On Windows, Koboldcpp continues to be the top recommendation due to its excellent AMD support. It also provides API capabilities for tools like SillyTavern. Refer to the quick start guide: [https://github.com/LostRuins/koboldcpp/wiki#quick-start](https://github.com/LostRuins/koboldcpp/wiki#quick-start). For more in-depth information on AMD-specific configurations, consult this resource: [https://github.com/YellowRoseCx/koboldcpp-rocm/releases](https://github.com/YellowRoseCx/koboldcpp-rocm/releases).
For Linux users with AMD GPUs, Oobabooga is also a viable option. It offers support for AMD hardware and can function as an API. You can find installation details here: [https://github.com/oobabooga/text-generation-webui/blob/main/docs/One-Click-Installers.md#using-an-amd-gpu-in-linux](https://github.com/oobabooga/text-generation-webui/blob/main/docs/One-Click-Installers.md#using-an-amd-gpu-in-linux). As with Nvidia users on Linux, consider using Docker for Oobabooga for a robust setup.
“ Mac Users
If you have an older machine with limited VRAM (2GB or less) or an older CPU, running local AI models can be a challenge, but it's not impossible. Start small and experiment. GPT4All is a good starting point as it's CPU-based on Windows and supports Metal on Mac, offering smaller models. After that, consider Koboldcpp, known for its lightweight nature and good performance: [https://github.com/LostRuins/koboldcpp/wiki#quick-start](https://github.com/LostRuins/koboldcpp/wiki#quick-start).
When working with limited resources, focus on smaller models (e.g., 7b) and heavily quantized versions (e.g., q3_K_S). It's often better to run a smaller model, even with some of its layers offloaded to the CPU, than to try to force a larger model into insufficient VRAM. The key is trial and error to find what works best for your specific hardware. Don't expect top-tier speeds, but you can still achieve functional results.
“ Understanding the Basics: Key AI Concepts
An LLM, or Large Language Model, is the core 'brain' of an AI. It's the engine that performs the thinking and processing, enabling AI to understand and generate human-like text. You can think of it as your personal ChatGPT running on your computer. Many models you encounter are based on foundational models like Meta's Llama or Llama 2, which are then modified or 'fine-tuned' for specific purposes. Llama 2 generally offers improved capabilities and context handling over its predecessor.
“ What is a Fine-Tune?
'Context' is crucial for an LLM to understand and respond appropriately. LLMs don't inherently remember past interactions; you need to provide them with relevant information in each prompt. This includes character descriptions, conversation history, and instructions. The 'context window' is the limit on how much information can be sent. Llama models typically have a context window of 2048 tokens (roughly 1500 words), while Llama 2 models support 4096 tokens. A larger context window allows for more detailed prompts and longer conversations. Fortunately, most AI programs handle context management for you.
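To make the context window a bit more concrete, here is a minimal sketch of how a program might keep a conversation inside the limit. The words-per-token ratio is an assumption taken from the guide's own figure (2048 tokens is roughly 1500 words); real tokenizers count differently, and the function names are made up for illustration.

```python
# Rough context budgeting. The ratio below is an assumption derived from
# "2048 tokens is roughly 1500 words"; real tokenizers will differ.
WORDS_PER_TOKEN = 1500 / 2048  # about 0.73 words per token


def estimate_tokens(text: str) -> int:
    """Estimate the token count of a text from its word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def trim_history(instructions: str, history: list[str], context_window: int) -> list[str]:
    """Keep the newest messages that still fit next to the instructions."""
    budget = context_window - estimate_tokens(instructions)
    kept: list[str] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept
```

Calling `trim_history(instructions, history, 4096)` would model a Llama 2 sized window. Your AI program most likely does something smarter than this, so treat it as an illustration of the idea rather than how any particular tool works.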
“ Where and How to Get LLMs
LLMs are categorized by their size, denoted in 'b' for billions of parameters (e.g., 3b, 7b, 13b, 70b). A larger number of parameters generally correlates with a more intelligent and capable model. A 70b model can feel very human-like, while a 3b model might struggle with sustained conversation. However, don't underestimate smaller models; 13b models can be surprisingly effective.
'Quantization' is a process that compresses these models, reducing their file size and memory requirements. Quantized models are denoted by 'q' followed by a number (e.g., q2, q3, q4, q5, q6, q8). A smaller 'q' number means a more compressed and potentially less capable model. For instance, a 34b model quantized to q3 might be around 17GB, a significant reduction from its full size. A general rule of thumb is that a more heavily quantized version of a larger model (e.g., 34b q3) is often superior to a lightly quantized version of a smaller model (e.g., 13b q8). The goal is to find the largest 'b' size, and then the highest 'q' level, that fits within your VRAM.
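If you'd like to sanity-check sizes before downloading, the arithmetic can be sketched roughly like this. It assumes file size is about parameters times bits per weight divided by 8, and it treats the effective bits per weight as an input you supply: the guide's 17GB figure for a 34b q3 works out to about 4 bits per weight in this formula, so the 'q' number alone isn't the whole story. The function names and candidate list are hypothetical.

```python
def estimate_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billions * bits_per_weight / 8


def pick_model(candidates, vram_gb):
    """Pick the largest 'b' size, then the highest bits, that fits in VRAM.

    candidates is a list of (params_billions, bits_per_weight) tuples.
    Returns None when nothing fits.
    """
    fitting = [c for c in candidates if estimate_size_gb(*c) <= vram_gb]
    # Tuples compare by parameter count first, then by bits per weight.
    return max(fitting, default=None)


# Example candidates (illustrative values only)
options = [(7, 4), (13, 4), (13, 8), (34, 4)]
print(pick_model(options, vram_gb=12))  # (13, 4)
```

This ignores the extra memory the context itself needs, so leave yourself some headroom rather than filling VRAM to the last gigabyte.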
“ File Types: GGML, GGUF, GPTQ, and exl2
When loading an LLM, several settings can impact performance and stability:
* **Context (ctx):** Set this to the maximum context window supported by your model (usually 2048 or 4096). Check the model's readme on Hugging Face for specifics.
* **ROPE Settings:** Avoid altering these advanced settings (alpha, rope compress, etc.) unless you understand their function. For GGUF models, most programs automatically configure these.
* **Threads:** Set this to the number of CPU cores your system has. For Macs, it's often advisable to subtract 4 from the total core count to account for 'Efficiency Cores' which are less suited for AI tasks.
* **GPU Layers (n-gpu-layers/ngl):** For Mac users, any number greater than 0 enables GPU offloading. For Windows/Linux, start with around 50 layers. If your VRAM can accommodate the entire model, set it to the model's total layers or higher. If not, gradually reduce the number of layers until the model runs smoothly, allowing some layers to be processed by the CPU.
* **Koboldcpp Specifics:** For Koboldcpp, you can initially leave 'BLAS threads' blank. Other checkboxes and fields can be explored later as you become more familiar with the software.
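The thread and GPU layer advice above can be sketched as a couple of small helpers. This is only an illustration of the trial-and-error loop: `loads_ok` is a hypothetical callback you would write yourself (for example, by loading the model in your software and checking whether it runs smoothly), and the step size of 5 is an arbitrary assumption.

```python
def suggest_threads(total_cores: int, is_mac: bool) -> int:
    """Threads = core count, minus 4 on Macs to skip the efficiency cores."""
    threads = total_cores - 4 if is_mac else total_cores
    return max(1, threads)  # never go below one thread


def find_gpu_layers(total_layers: int, loads_ok, start: int = 50, step: int = 5) -> int:
    """Lower the GPU layer count until loads_ok(layers) reports success.

    loads_ok is a hypothetical callback: it should try the model with that
    many layers on the GPU and return True if it runs smoothly.
    """
    layers = min(start, total_layers)  # no point asking for more than exist
    while layers > 0 and not loads_ok(layers):
        layers = max(0, layers - step)
    return layers  # 0 means everything stays on the CPU
```

For example, `suggest_threads(12, is_mac=True)` gives 8, and `find_gpu_layers(40, loads_ok)` starts at 40 layers and steps down by 5 until your check passes.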
“ Choosing the Right Model Flavor
Performance is typically measured in 'Tokens Per Second' (T/s), which should be displayed by your AI software. Weak hardware might yield 1-2 T/s, while high-end systems (like a 3090/4090 or Mac Studio Ultra) can achieve 15-20 T/s or more with 13b models.
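To get a feel for what those numbers mean in practice, here is a quick bit of arithmetic. The 300-token reply length is just an assumed example, and the function names are made up for illustration.

```python
def tokens_per_second(tokens_generated: int, seconds: float) -> float:
    """Speed in T/s: tokens produced divided by the time it took."""
    return tokens_generated / seconds


def seconds_for_reply(reply_tokens: int, speed_tps: float) -> float:
    """How long a reply of a given length takes at a given speed."""
    return reply_tokens / speed_tps


# A 300-token reply (assumed length) at the low and high end of the range
print(seconds_for_reply(300, 2))   # 150.0 seconds on weak hardware
print(seconds_for_reply(300, 20))  # 15.0 seconds on a high-end system
```

So the gap between 2 T/s and 20 T/s is the difference between waiting a couple of minutes and waiting a few seconds for the same answer.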
If your model is acting 'dumb' or performing poorly, consider these common issues:
* **Incorrect Settings:** Revisit your context size and ROPE settings. Ensure they are appropriate for the model.
* **Preset Issues:** Experiment with different presets like 'Deterministic' or 'Divine Intellect' in your software. These can provide a good baseline. Research presets suitable for your specific task.
* **Low Context:** If you're sending very little information (under 500 tokens) in your prompt, the model might perform poorly. Gradually increasing the prompt length can improve its responses. Models tend to perform better as they receive more context, up to their limit.
* **Hardware Limitations:** Ensure your hardware is capable of running the model you've chosen, especially considering VRAM limitations. You may need to use smaller or more heavily quantized models.