
With the rise of open-source large language models (LLMs), the ability to run them efficiently on local devices is becoming a game-changer. In this guide, we’ll dive into using llama.cpp, an open-source C++ library that lets you run LLMs like Llama 3 locally. Whether you’re a developer or a machine learning enthusiast, this step-by-step tutorial will help you get started with llama.cpp, an easy-to-install library that optimizes LLM inference on your hardware, from a desktop computer to cloud-based infrastructure.
What You Need to Get Started
Before diving into the specifics, it’s important to ensure your hardware is ready. To follow this guide, having at least 8 GB of VRAM is recommended. However, llama.cpp offers flexibility with optimizations, especially when it comes to model quantization, which we’ll cover in a bit.
This tutorial works with models like Llama-3-8B-Instruct, but you can choose other models available on Hugging Face.
Understanding llama.cpp
So, what is llama.cpp? Essentially, it’s a lightweight C++ library designed to simplify the process of running LLMs locally. It allows for efficient inference on different hardware setups, from basic desktops to high-performance cloud servers.
With llama.cpp, you’ll benefit from several features:
- Top Performance: It provides cutting-edge inference for large models.
- Easy Setup: Installing llama.cpp is straightforward, requiring no external dependencies.
- Quantization Support: Reduces model size by converting them to lower precision, helping run models on devices with limited memory.
- Multi-platform Compatibility: It works on macOS, Windows, and Linux, with support for Docker and FreeBSD.
- Efficient Resource Use: llama.cpp takes full advantage of both CPU and GPU resources, making hybrid inference possible.
Challenges You Might Face
While llama.cpp is an incredible tool, it’s not without its challenges:
- Sparse Documentation: Because it’s open-source, the available documentation can sometimes be lacking. However, the active community is a great resource.
- Technical Expertise: Setting up and running models, particularly with custom configurations, can be a bit complex, especially for users new to C++ or machine learning infrastructure.
Step-by-Step Setup
Let’s go through the installation and setup process so you can get llama.cpp up and running with the Llama 3 model.
1. Cloning the Repository
First, you’ll need to download the llama.cpp repository from GitHub. You can do this with the following commands:
$ git clone https://github.com/ggerganov/llama.cpp
$ cd llama.cpp
2. Building llama.cpp
Next, you’ll need to build the project. The build process varies slightly depending on whether you’re using macOS or Windows.
For macOS, with Metal support, you can run:
$ make llama-server
Alternatively, you can use CMake:
$ cmake -B build
$ cmake --build build --config Release
For Windows, with CUDA support, the commands are slightly different:
C:\Users\Bob> make llama-server LLAMA_CUDA=1
Or using CMake:
C:\Users\Bob> cmake -B build -DLLAMA_CUDA=ON
C:\Users\Bob> cmake --build build --config Release
Alternatively, you can download a prebuilt release from the official repository’s releases page.
Once completed, you’re ready to start using the model.
Downloading and Preparing the Model
In this example, we’ll use the Meta-Llama-3-8B-Instruct model, though you can adapt this for any model you prefer.
1. Install Hugging Face CLI
To begin, install the Hugging Face command line interface:
$ pip install -U "huggingface_hub[cli]"
Create an account on Hugging Face if you don’t have one already, and generate your access token from their settings page. You’ll need this to access the models.
2. Login to Hugging Face
Log in with your Hugging Face credentials:
$ huggingface-cli login
Once you’re logged in, accept the terms for the Llama-3-8B-Instruct model and wait for access approval.
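If you prefer to authenticate from Python instead of the CLI, the huggingface_hub library provides a login() helper. This is a minimal sketch; the token string is a placeholder for the access token you generated on the settings page.
from huggingface_hub import login

# Authenticate with the access token from your Hugging Face settings page.
# Replace the placeholder (or read the token from an environment variable).
login(token="hf_your_access_token_here")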
3. Downloading the Model
You have two main options when downloading the model: non-quantized or GGUF quantized.
- Option 1: Non-quantized Model
To download the non-quantized model:
$ huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --exclude "original/*" --local-dir models/Meta-Llama-3-8B-Instruct
After downloading, you’ll need to install Python dependencies and convert the model to GGUF format:
$ python -m pip install -r requirements.txt
$ python convert-hf-to-gguf.py models/Meta-Llama-3-8B-Instruct
- Option 2: Quantized Model
If you’re working with hardware constraints, you can download a quantized GGUF build directly; replace path_to_gguf_model with the repository ID of the GGUF model you want (a Python alternative using the huggingface_hub API is sketched just after this list):
$ huggingface-cli download path_to_gguf_model --exclude "original/*" --local-dir models/Meta-Llama-3-8B-Instruct
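If you’d rather script the download than use the CLI, the huggingface_hub package also offers snapshot_download. This is a minimal sketch using the non-quantized repository from Option 1; for Option 2, swap in the repository ID of your chosen GGUF build.
from huggingface_hub import snapshot_download

# Download the model repository into the same local directory used by the CLI
# command above, skipping the original/ checkpoint files.
snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B-Instruct",
    local_dir="models/Meta-Llama-3-8B-Instruct",
    ignore_patterns=["original/*"],
)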
Using Quantization for Hardware Optimization
Quantization allows you to run models on devices with less memory, such as systems with under 16 GB of VRAM. By reducing the precision of the model weights from 16-bit to 4-bit, you can save a lot of memory without sacrificing too much output quality.
To quantize the model:
$ ./llama-quantize ./models/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf ./models/Meta-Llama-3-8B-Instruct/ggml-model-Q4_K_M.gguf Q4_K_M
This will produce a quantized model that is ready for local inference.
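To get a feel for the savings, here is a rough back-of-the-envelope estimate in Python. The 8B parameter count comes from the model name; the roughly 4.5 bits per weight for Q4_K_M is an assumption, since K-quant formats add some per-block overhead, and the figures cover weights only (no KV cache or activations).
# Rough weight-memory estimate for an 8B-parameter model at different precisions.
params = 8e9

fp16_gb  = params * 16 / 8 / 1e9   # 16 bits per weight  -> ~16 GB
q4km_gb  = params * 4.5 / 8 / 1e9  # ~4.5 bits per weight (assumed) -> ~4.5 GB

print(f"FP16 weights:   ~{fp16_gb:.1f} GB")
print(f"Q4_K_M weights: ~{q4km_gb:.1f} GB")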
Running llama-server
Once your model is set up, you can launch the llama-server to handle HTTP requests, allowing you to interact with the model using standard APIs.
For macOS:
$ ./llama-server -m models/Meta-Llama-3-8B-Instruct/ggml-model-Q4_K_M.gguf -c 2048
On Windows, use:
C:\Users\Bob> llama-server.exe -m models\Meta-Llama-3-8B-Instruct\ggml-model-Q4_K_M.gguf -c 2048
Now, you can start sending requests to http://localhost:8080.
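Before building anything on top of it, you can quickly check that the server is up. llama-server exposes a health endpoint; the snippet below simply polls it, assuming the default port 8080.
import requests

# Quick smoke test: the server returns a small JSON status payload once it is ready.
resp = requests.get("http://localhost:8080/health", timeout=5)
print(resp.status_code, resp.text)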
Building a Python Chatbot
You can also create a simple Python chatbot to interact with your model. Here’s a basic script to send requests to the llama-server:
import requests

def get_response(server_url, messages, temperature=0.7, max_tokens=4096):
    # Send a chat request to llama-server's OpenAI-compatible endpoint and
    # return the assistant's reply as plain text.
    data = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": False,  # request the full response at once rather than a token stream
    }
    response = requests.post(f"{server_url}/v1/chat/completions", json=data)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

def chatbot(server_url):
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    while True:
        user_input = input("You: ")
        if user_input.lower() == "exit":
            break
        messages.append({"role": "user", "content": user_input})
        answer = get_response(server_url, messages)
        # Keep the reply in the history so the model retains conversation context.
        messages.append({"role": "assistant", "content": answer})
        print("Assistant:", answer)

if __name__ == "__main__":
    chatbot("http://localhost:8080")
This simple script creates a chatbot that communicates with the llama-server to generate responses.
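The script above requests complete responses. If you want tokens to appear as they are generated, the OpenAI-compatible endpoint also supports streaming. The sketch below assumes the usual server-sent-events format ("data: ..." lines ending with "data: [DONE]"), which OpenAI-style streaming APIs use; treat it as a starting point rather than a definitive client.
import json
import requests

def stream_response(server_url, messages, temperature=0.7, max_tokens=4096):
    # Request a streamed chat completion and print tokens as they arrive.
    data = {
        "messages": messages,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "stream": True,
    }
    with requests.post(f"{server_url}/v1/chat/completions", json=data, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line:
                continue
            payload = line.decode("utf-8")
            if payload.startswith("data: "):
                payload = payload[len("data: "):]
            if payload.strip() == "[DONE]":
                break
            chunk = json.loads(payload)
            # Each chunk carries an incremental "delta" holding a piece of the reply.
            delta = chunk["choices"][0]["delta"].get("content", "")
            print(delta, end="", flush=True)
    print()

if __name__ == "__main__":
    stream_response(
        "http://localhost:8080",
        [{"role": "user", "content": "Say hello in one sentence."}],
    )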
Conclusion
Using llama.cpp to run large language models like Llama 3 locally can be a powerful and efficient solution, especially when high-performance inference is required. Despite its complexity, llama.cpp provides flexibility with multi-platform support, quantization, and hardware acceleration. Although it may be challenging for newcomers, its growing community and extensive features make it a top choice for developers and researchers alike.
For those willing to dive deep, llama.cpp offers endless possibilities for local and cloud-based LLM inference.