AI Modules¶
Notolog Editor integrates three AI inference modules for the AI Assistant feature. Each module supports different backends and use cases.
Overview¶
| Module | Backend | Use Case | Model Format |
|---|---|---|---|
| OpenAI API | Cloud API | Cloud-based inference | N/A (API) |
| On Device LLM | ONNX Runtime GenAI | Local inference with hardware acceleration | ONNX |
| Module llama.cpp | llama-cpp-python | Local inference with GGUF models | GGUF |
1. OpenAI API Module¶
Description¶
The OpenAI API module enables cloud-based inference using OpenAI's language models (GPT-4o, GPT-5.2, etc.) or any OpenAI-compatible API endpoint.
Requirements¶
- OpenAI API key
- Internet connection
- Optional: Custom API endpoint for compatible services (Azure OpenAI, LocalAI, etc.)
Configuration Settings (UI Labels)¶
| UI Label | Description |
|---|---|
| API URL | API endpoint URL |
| API Key | Your OpenAI API key (stored encrypted) |
| Supported Models | Model name dropdown (e.g., gpt-4o, gpt-5) |
| System Prompt | Initial instructions for the model |
| Temperature | Controls response randomness (0-100) |
| Maximum Response Tokens | Maximum response length |
| Prompt History Size | Number of conversation turns to retain |
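These settings map onto standard OpenAI-style chat completion parameters. As a rough illustration (not Notolog's internal code), an equivalent request with the official `openai` Python package might look like the sketch below; the model name, endpoint, and the assumed mapping of the 0-100 Temperature slider to the API's 0.0-1.0 range are placeholders:

```python
from openai import OpenAI

# Hypothetical values mirroring the UI settings above (not Notolog internals).
client = OpenAI(
    api_key="sk-...",                      # API Key (stored encrypted by Notolog)
    base_url="https://api.openai.com/v1",  # API URL; point at a compatible endpoint if needed
)

response = client.chat.completions.create(
    model="gpt-4o",  # Supported Models
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},  # System Prompt
        {"role": "user", "content": "Summarize my note."},
    ],
    temperature=0.7,  # UI value 70 on the 0-100 scale (assumed 0.0-1.0 mapping)
    max_tokens=512,   # Maximum Response Tokens
)
print(response.choices[0].message.content)
```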
Pros & Cons¶
| Pros | Cons |
|---|---|
| Most capable models | Requires internet |
| No local hardware requirements | API costs |
| Easy setup | Data sent to cloud |
2. On Device LLM Module (ONNX)¶
Description¶
The On Device LLM module uses ONNX Runtime GenAI for local inference with hardware acceleration support.
Requirements¶
- Python 3.10-3.13 (Note: onnxruntime-genai does not yet support Python 3.14)
- ONNX model files
- Package: `onnxruntime-genai` (included with Notolog)
Supported Models¶
Download ONNX-optimized models from:
- Hugging Face ONNX Models
- Microsoft Phi-3 ONNX
Model directories should contain the `.onnx` model files and a `genai_config.json` configuration file.
Configuration Settings (UI Labels)¶
| UI Label | Description |
|---|---|
| ONNX Model Location | Directory containing ONNX model files |
| Temperature | Controls response randomness (0-100, displayed as 0.0-1.0) |
| Maximum Response Tokens | Maximum response length (0 = unlimited) |
| Hardware Acceleration | Execution provider selection |
| Prompt History Size | Number of conversation turns to retain |
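Under the hood, these settings drive an ONNX Runtime GenAI generation loop. A minimal standalone sketch based on the `onnxruntime-genai` Python examples (the model path is a placeholder, and the exact API can differ between package versions):

```python
import onnxruntime_genai as og

# ONNX Model Location (placeholder path)
model_dir = "/path/to/Phi-3-mini-4k-instruct-onnx/cpu-int4-rtn-block-32"
model = og.Model(model_dir)
tokenizer = og.Tokenizer(model)

params = og.GeneratorParams(model)
params.set_search_options(
    do_sample=True,
    temperature=0.7,  # Temperature (UI 0-100, displayed as 0.0-1.0)
    max_length=512,   # Maximum Response Tokens
)

# Phi-3 style chat template; other models use different templates.
prompt = "<|user|>\nSummarize my note.<|end|>\n<|assistant|>\n"
generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode(prompt))

stream = tokenizer.create_stream()
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```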
Hardware Acceleration Providers¶
| Provider | Platform | Hardware Required | Package |
|---|---|---|---|
| CPU (Default) | All | Any CPU | onnxruntime-genai |
| CUDA | Linux/Windows | NVIDIA GPU + CUDA + cuDNN | onnxruntime-genai-cuda |
| DirectML | Windows | DirectX 12 GPU | onnxruntime-genai-directml |
| OpenVINO | Windows/Linux | Intel CPU/GPU/VPU | onnxruntime-genai |
| CoreML | macOS | Apple Silicon/Neural Engine | onnxruntime-genai |
| TensorRT RTX | Windows/Linux | NVIDIA RTX GPU | onnxruntime-genai |
| QNN | Linux | Qualcomm Snapdragon NPU | onnxruntime-genai |
| MIGraphX | Linux | AMD GPU with ROCm | onnxruntime-genai |
Provider Details¶
CPU (Default)¶
- Hardware: Any x86_64 or ARM64 CPU
- Performance: Baseline, no acceleration
- Use Case: Universal fallback, development
CUDA (NVIDIA GPUs)¶
- Hardware: NVIDIA GPU with sufficient VRAM
- Requirements:
- CUDA Toolkit
- cuDNN library
- NVIDIA drivers
- Install on Ubuntu 24.04+: the CUDA runtime libraries (cuBLAS, cuDNN) must be available; see the Troubleshooting section below
- Install package: `pip install onnxruntime-genai-cuda` (replaces base package)
- Model Selection: Use CUDA-optimized models (look for `cuda` in the model directory name)
- Use Case: NVIDIA GPU users on Linux/Windows

Important: Choose models that fit your GPU memory:
- Phi-3-mini (~4GB VRAM) - Good for most GPUs
- Phi-3-medium (~14GB VRAM) - Requires a high-end GPU
DirectML (Windows)¶
- Hardware: Any DirectX 12 compatible GPU (NVIDIA, AMD, Intel)
- Performance: Good GPU acceleration on Windows
- Use Case: Windows users with modern GPUs
- Install: `pip install onnxruntime-genai-directml` (replaces base package)
OpenVINO (Intel)¶
- Hardware: Intel CPUs (with AVX2/AVX-512), Intel GPUs (Iris, Arc), Intel VPUs
- Performance: Optimized for Intel hardware
- Use Case: Intel-based systems
CoreML (Apple)¶
- Hardware: Apple Silicon (M1/M2/M3/M4), Neural Engine
- Performance: Optimized for Apple hardware
- Use Case: macOS on Apple Silicon
TensorRT RTX (NVIDIA)¶
- Hardware: NVIDIA RTX 20/30/40 series GPUs
- Performance: Highly optimized for NVIDIA RTX GPUs
- Use Case: NVIDIA RTX GPU users
QNN (Qualcomm)¶
- Hardware: Qualcomm Snapdragon with NPU (AI Engine)
- Performance: Optimized for Snapdragon AI accelerators
- Use Case: Qualcomm-based devices
MIGraphX (AMD)¶
- Hardware: AMD GPUs with ROCm support
- Performance: Optimized for AMD GPUs
- Use Case: AMD GPU users on Linux
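Provider selection normally happens through the Hardware Acceleration setting, but recent `onnxruntime-genai` releases also expose it programmatically. A hedged sketch, for illustration only (the `og.Config` provider API exists only in newer package versions; older versions read the provider from the model's `genai_config.json`, and Notolog's own logic may differ):

```python
import onnxruntime_genai as og

model_dir = "/path/to/onnx-model"  # placeholder path

try:
    # Request a specific execution provider (name depends on the installed package variant).
    config = og.Config(model_dir)
    config.clear_providers()
    config.append_provider("cuda")
    model = og.Model(config)
    print("Loaded with the CUDA execution provider")
except Exception as err:
    # Fall back to the default CPU provider if the GPU provider is unavailable.
    print(f"GPU provider unavailable ({err}); loading on CPU")
    model = og.Model(model_dir)
```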
Model Selection Tips¶
- CPU models: Look for `cpu-int4` in the model name (e.g., `Phi-3-mini-4k-instruct-onnx/cpu-int4-rtn-block-32`)
- CUDA/GPU models: Look for `cuda` in the directory name (e.g., `Phi-3-mini-4k-instruct-onnx-cuda/cuda-int4-rtn-block-32`)
- Model size matters: Phi-3-mini (~2GB) vs Phi-3-medium (~9GB) - choose based on your GPU VRAM
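When in doubt about whether a downloaded directory is usable, a quick stdlib-only check for the files ONNX Runtime GenAI expects can help (see also the troubleshooting notes below). The key layout inside `genai_config.json` follows what the Microsoft Phi-3 ONNX releases use and may vary by model:

```python
import json
from pathlib import Path

def check_onnx_model_dir(model_dir: str) -> None:
    """Report whether a directory looks like an onnxruntime-genai model."""
    path = Path(model_dir)
    onnx_files = list(path.glob("*.onnx"))
    config_file = path / "genai_config.json"

    print(f"*.onnx files found: {len(onnx_files)}")
    print(f"genai_config.json present: {config_file.exists()}")

    if config_file.exists():
        config = json.loads(config_file.read_text())
        # Provider options layout as seen in Microsoft's Phi-3 ONNX models (assumption).
        providers = (
            config.get("model", {})
            .get("decoder", {})
            .get("session_options", {})
            .get("provider_options", [])
        )
        print(f"Configured providers: {providers or ['cpu (default)']}")

check_onnx_model_dir("/path/to/Phi-3-mini-4k-instruct-onnx/cpu-int4-rtn-block-32")
```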
Pros & Cons¶
| Pros | Cons |
|---|---|
| Local/private inference | Model download required |
| Hardware acceleration | Limited model availability |
| No API costs | Hardware-dependent performance |
| Works offline | Provider compatibility varies |
3. Module llama.cpp (GGUF)¶
Description¶
The Module llama.cpp uses llama-cpp-python for local inference with GGUF format models.
Requirements¶
- Python 3.10-3.14
- GGUF model file
- Package: `llama-cpp-python` (optional extra)
Installation¶
```bash
# Install with llama.cpp support
pip install "notolog[llama]"

# Or install separately
pip install llama-cpp-python
```
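A quick way to confirm the optional dependency is importable before enabling the module (a simple check, not something Notolog requires):

```python
# Verify the optional llama-cpp-python dependency is available.
try:
    import llama_cpp
    print(f"llama-cpp-python {llama_cpp.__version__} is installed")
except ImportError:
    print('llama-cpp-python is missing; run: pip install "notolog[llama]"')
```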
Supported Models¶
GGUF models from:
- Hugging Face GGUF Models
- TheBloke's Collection (historic archive, 2,000+ models)
Popular models:
- Llama 2/3
- Mistral
- Phi-2/3
- Gemma
- Qwen
Configuration Settings (UI Labels)¶
| UI Label | Description |
|---|---|
| Model Location | Path to .gguf file |
| Context Window Size | Maximum context size (default: 2048) |
| Chat Formats | Model-specific chat template dropdown |
| System Prompt | Custom system instructions |
| Response Temperature | Controls response randomness (0-100) |
| Max Tokens per Response | Maximum response length (0 = context window limit) |
| Size of the Prompt History | Conversation history limit |
Chat Formats¶
| Format | Models |
|---|---|
| `auto` | Auto-detect from model metadata |
| `chatml` | Mistral, Qwen, many others |
| `llama-2` | Llama 2 models |
| `llama-3` | Llama 3 models |
| `gemma` | Google Gemma |
| `phi-3` | Microsoft Phi-3 |
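These settings correspond to `llama-cpp-python`'s `Llama` constructor and chat-completion parameters. A minimal sketch for illustration (the model path and values are placeholders, not Notolog's implementation):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model.Q4_K_M.gguf",  # Model Location
    n_ctx=2048,             # Context Window Size
    chat_format="llama-3",  # Chat Formats (omit to auto-detect from model metadata)
    n_gpu_layers=-1,        # GPU Layers: -1 offloads all layers (Metal/CUDA builds)
)

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},  # System Prompt
        {"role": "user", "content": "Summarize my note."},
    ],
    temperature=0.7,  # Response Temperature (UI 0-100 scale, assumed 0.0-1.0 mapping)
    max_tokens=256,   # Max Tokens per Response (0 = up to the context window limit)
)
print(result["choices"][0]["message"]["content"])
```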
Performance Tips¶
- The module automatically uses all CPU cores
- Quantized models (Q4_K_M, Q5_K_M) offer good quality/speed balance
- Larger context windows require more memory
- macOS Apple Silicon (M1/M2/M3/M4): Automatically uses Metal GPU acceleration (GPU Layers = Auto)
macOS GPU Acceleration (Metal)¶
On Apple Silicon Macs (M1/M2/M3/M4), the Module llama.cpp automatically uses Metal GPU acceleration when GPU Layers is set to Auto.
Configuration: Go to Settings → Module llama.cpp tab → GPU Layers:
| Value | Behavior |
|---|---|
| Auto (default) | Platform auto-detection: GPU on Apple Silicon, CPU on Intel Mac/Linux/Windows |
| -1 | Offload all layers to GPU (explicit GPU mode) |
| 0 | CPU-only mode (recommended for Intel Macs) |
| 1-999 | Partial GPU offloading (advanced) |
What GPU Layers Does
The n_gpu_layers parameter controls how many transformer layers are offloaded from CPU to GPU memory. Each layer offloaded reduces CPU load and speeds up inference, but requires corresponding GPU VRAM. Setting -1 offloads all layers for maximum GPU acceleration, while 0 keeps everything on CPU.
Auto vs -1
Auto intelligently selects the best mode for your hardware. On Apple Silicon it uses GPU (-1), on Intel Mac it uses CPU (0). -1 always forces GPU mode regardless of platform.
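A rough illustration of the kind of platform check behind Auto (a sketch of the idea only; Notolog's actual detection logic may differ):

```python
import platform

def auto_gpu_layers() -> int:
    """Pick an n_gpu_layers value the way the Auto setting is described above."""
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return -1  # Apple Silicon: offload all layers to the Metal GPU
    return 0       # Intel Mac / Linux / Windows: stay on CPU

print(auto_gpu_layers())
```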
Pros & Cons¶
| Pros | Cons |
|---|---|
| Wide model compatibility | May be slower than dedicated GPU |
| Large model ecosystem | Model file required |
| Works offline | Memory-intensive for large models |
| No API costs | |
| Metal GPU on Apple Silicon | |
Comparison¶
Performance (Relative)¶
| Module | Speed | Quality | Privacy | Ease of Setup |
|---|---|---|---|---|
| OpenAI API | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ |
| On Device LLM (GPU) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| On Device LLM (CPU) | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Module llama.cpp | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
Recommendations¶
| Use Case | Recommended Module |
|---|---|
| Best quality, no hardware limits | OpenAI API |
| Privacy-focused, have NVIDIA GPU | On Device LLM + CUDA |
| Privacy-focused, Windows GPU | On Device LLM + DirectML |
| Privacy-focused, Apple Silicon | On Device LLM + CoreML |
| Privacy-focused, Intel CPU | On Device LLM + OpenVINO |
| Wide model choice, privacy | Module llama.cpp |
| Offline work required | On Device LLM or Module llama.cpp |
Troubleshooting¶
On Device LLM (ONNX) Common Issues¶
Error: "Unknown provider name 'X'" - The provider is not supported in your onnxruntime-genai build - Solution: Use CPU or install the correct package variant
Error: "CUDA execution provider is not enabled" - CUDA requires separate package - Solution: pip install onnxruntime-genai-cuda (replaces base package)
Error: "libcublasLt.so.12: cannot open shared object file" or "libcudnn.so.9: cannot open shared object file" - CUDA runtime libraries not installed - Solution (Ubuntu 24.04+):
Error: "Failed to allocate memory for requested buffer" or "Could not allocate the key-value cache buffer" - GPU doesn't have enough VRAM for the model or response generation - Note: Notolog automatically: 1. Falls back from GPU to CPU when model initialization fails 2. Reduces max_length (response tokens) when generator allocation fails - If issues persist: 1. Use a smaller model (e.g., Phi-3-mini instead of Phi-3-medium) 2. Reduce "Maximum Response Tokens" in settings 3. Close other GPU applications 4. Manually select CPU provider in settings
Error: "Model not found" or "error opening genai_config.json" - Model directory structure is incompatible - ONNX Runtime GenAI requires genai_config.json + .onnx files in the same directory - Common cause: Models optimized for transformers.js have different structure and lack genai_config.json - Solutions: 1. Use models specifically built for onnxruntime-genai (look for genai_config.json) 2. Recommended sources: microsoft/ on Hugging Face or models with "onnx-genai" in name 3. Example compatible model: microsoft/Phi-3-mini-4k-instruct-onnx. Navigate to the cpu_and_mobile/cpu-int4-rtn-block-32 directory inside the model folder.
Chat format tokens appearing in output (e.g., "<|assistant|>")
- The model's `genai_config.json` may have incorrect stop tokens or chat template
- Solutions:
  1. Try a different quantization of the same model
  2. Check if the model's `genai_config.json` has proper stop_strings defined
  3. Use official Microsoft ONNX models, which have tested configurations
Module llama.cpp Common Issues¶
Error: "Model not found" - File path is incorrect or file doesn't exist - Solution: Use full path to the .gguf file
Model loading hangs (Intel Macs only)
- Set GPU Layers to "0" in Settings, or downgrade: `pip install llama-cpp-python==0.2.90 --force-reinstall`
Slow performance
- The context window is too large
- Solution: Reduce Context Window Size in settings
Out of memory
- The model is too large for the available RAM
- Solution: Use a smaller or more heavily quantized model (e.g., Q4_K_M)
macOS-Specific Issues¶
Installation error: "zsh: no matches found: notolog[llama]"
- zsh interprets square brackets as glob patterns
- Solutions:
```bash
# Option 1: Quote the package specification
pip install "notolog[llama]"

# Option 2: Escape the brackets
pip install notolog\[llama\]

# Option 3: Install llama-cpp-python separately
pip install notolog
pip install llama-cpp-python
```
Metal warnings: "skipping kernel_*_bf16 (not supported)"
- These are informational messages, not errors
- BF16 (bfloat16) operations require newer Apple Silicon (M1+) or Metal 3
- Intel Macs don't support BF16 and will use FP16/FP32 fallbacks automatically
- The model will still work correctly, just without BF16 optimization
Context window warning: "n_ctx_per_seq (2048) < n_ctx_train (40960)"
- Informational only - the model has a larger training context than configured
- To use more context, increase "Context Window Size" in settings
- Note: A larger context uses more memory and may slow down inference
Package Installation Summary¶
Base Installation¶
Includes: `onnxruntime-genai` (CPU)

Optional Extras¶
GPU Acceleration (Manual)¶
```bash
# NVIDIA CUDA (Linux/Windows) - replaces base onnxruntime-genai
pip uninstall onnxruntime-genai
pip install onnxruntime-genai-cuda

# DirectML (Windows) - replaces base onnxruntime-genai
pip uninstall onnxruntime-genai
pip install onnxruntime-genai-directml

# To switch back to CPU-only (use --force-reinstall to ensure a clean install)
pip uninstall onnxruntime-genai-cuda  # or -directml
pip install --force-reinstall onnxruntime-genai
```
Important Notes:
- CUDA, DirectML, and base packages all provide the same `onnxruntime_genai` Python module
- They cannot be installed together - the last installed package wins
- If you uninstall the GPU package, use `--force-reinstall` when reinstalling the base package
- After changing packages, restart the application
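Because all three packages provide the same `onnxruntime_genai` module, it is not always obvious which variant is active. A small stdlib-only check (package names taken from the table above):

```python
from importlib.metadata import distributions

# List installed onnxruntime-genai variants (base, -cuda, -directml).
variants = sorted(
    (dist.metadata["Name"] or "")
    for dist in distributions()
    if (dist.metadata["Name"] or "").lower().startswith("onnxruntime-genai")
)
print(variants or "no onnxruntime-genai package installed")
```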