Resolving CP949 Errors in Local LLM Benchmarking and Building an Automatic Model Recommendation System

Ever run into CP949 encoding errors when benchmarking local LLMs, or felt frustrated by the lack of model management features? In this post, I'll share my experience overcoming CP949 encoding issues and building an automatic model recommendation system to enhance local model research and management capabilities.

Attempts and Pitfalls

Initially, I wanted to build a simple feature in the admin page to switch and benchmark local models. I also prepared a more diverse set of benchmark questions in Korean.

// riel_agent/src/app/admin/tabs/LocalModelLabTab.tsx (excerpt)

import { Button, Select, Input } from '@mantine/core';
import { useState, useEffect } from 'react';
import {
  getLocalModels,
  switchLocalModel,
  runBenchmark,
  getBenchmarkResults,
} from '../../api/admin'; // Actual API call functions

function LocalModelLabTab() {
  const [models, setModels] = useState<string[]>([]);
  const [selectedModel, setSelectedModel] = useState<string>('');
  const [benchmarkQuestions, setBenchmarkQuestions] = useState<string[]>([]);
  const [benchmarkResults, setBenchmarkResults] = useState<any>(null);

  useEffect(() => {
    // Load local model list
    getLocalModels().then(setModels);
    // Load Korean benchmark questions (expanded to 25)
    // ...
  }, []);

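  const handleModelChange = async (modelName: string | null) => {
    if (!modelName) return; // Mantine's Select passes null when the selection is cleared
    await switchLocalModel(modelName); // Actual model switching API
    setSelectedModel(modelName);
  };

  const handleRunBenchmark = async () => {
    if (!selectedModel) return; // Nothing to benchmark until a model is selected
    const results = await runBenchmark(selectedModel, benchmarkQuestions); // Actual benchmark execution API
    setBenchmarkResults(results);
  };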

  // ... UI rendering ...

  return (
    <div>
      <Select
        label="Select Local Model"
        data={models}
        value={selectedModel}
        onChange={handleModelChange}
      />
      <Button onClick={handleRunBenchmark}>Run Benchmark</Button>
      {/* Results display section */}
    </div>
  );
}

export default LocalModelLabTab;

Switching models and expanding questions were relatively straightforward. The problem arose when running benchmarks, especially with Korean data, where I frequently encountered CP949 encoding errors.

UnicodeEncodeError: 'cp949' codec can't encode characters in position 1-3: illegal multibyte sequence

When I saw this error, I assumed it was just a Korean string-handling issue. I tried changing the encoding settings in the Python files and explicitly encoding and decoding strings as UTF-8. After hours of this, the error was still there.

# riel_backend/api/local_llm.py (part of initial attempts)

import json

def process_text_with_model(text: str, model_name: str) -> str:
    # ... Model call logic ...
    # CP949 error occurred here
    # text = text.encode('utf-8').decode('cp949', errors='ignore') # Attempts like this
    # ...
    pass
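
A small sketch of why attempts like the commented-out line above cannot fix this kind of error. A Python str is already Unicode, so re-encoding it in memory only mislabels the bytes, and errors='ignore' silently drops whatever fails to decode. The sample string is my own illustration, not data from the benchmark set.

```python
# Illustration only: the sample text is hypothetical, not taken from the benchmark set.
original = "안녕하세요"

# Encode as UTF-8, then pretend those bytes are CP949. errors="ignore" hides
# the mismatch by silently dropping anything that cannot be decoded.
mangled = original.encode("utf-8").decode("cp949", errors="ignore")

# A matching encode/decode pair is lossless.
round_trip = original.encode("utf-8").decode("utf-8")

print(repr(mangled))           # garbled text, not the original string
print(round_trip == original)  # True
```

The encoding only matters at the boundary where text becomes bytes (files, pipes, the console), which is where the next section ends up looking.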

The Cause

After hours of debugging, I finally pinpointed the root cause. The problem was not in how the Python code handled strings. The local LLM worker was writing model responses to disk as CP949, the default text encoding in some environments (notably Korean-locale Windows), and the write failed whenever a response contained a character that CP949 cannot represent.

# tools/local_llm_worker/worker.py (suspected point of failure)

def save_output(output_data: dict):
    # ...
    with open(output_file_path, 'w', encoding='cp949') as f: # <-- Problem occurred here
        json.dump(output_data, f, ensure_ascii=False)
    # ...

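With ensure_ascii=False, json.dump writes non-ASCII characters as they are instead of escaping them as \uXXXX sequences. Because the file was opened with encoding='cp949', the file object then had to encode those characters as CP949, and it raised UnicodeEncodeError for any character outside the CP949 repertoire. CP949 can encode Hangul, so the failing characters are ones it does not cover, for example emoji or unusual symbols in the model output.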
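
The failure can be reproduced in isolation. The following is a minimal sketch, assuming a made-up response that contains an emoji; try_dump is a hypothetical helper written for this illustration and is not part of the worker.

```python
import json
import os
import tempfile


def try_dump(data, encoding, ensure_ascii):
    """Write data as JSON to a temp file; return None on success or the UnicodeEncodeError."""
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding) as f:
            json.dump(data, f, ensure_ascii=ensure_ascii)
        return None
    except UnicodeEncodeError as exc:
        return exc
    finally:
        os.remove(path)


sample = {"answer": "좋아요 😀"}  # hypothetical model output containing an emoji

print(type(try_dump(sample, "cp949", ensure_ascii=False)).__name__)  # UnicodeEncodeError
print(try_dump(sample, "cp949", ensure_ascii=True))                  # None
print(try_dump(sample, "utf-8", ensure_ascii=False))                 # None
```

The second call succeeds only because ensure_ascii=True escapes every non-ASCII character, which keeps the file pure ASCII but makes it hard to read. Writing as UTF-8 keeps the text readable and avoids the error.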

The Solution

The fix was simple: modify the local LLM worker to explicitly use utf-8 encoding when saving files.

# tools/local_llm_worker/worker.py (after modification)

import json

def save_output(output_data: dict):
    # ...
    with open(output_file_path, 'w', encoding='utf-8') as f: # <-- Changed to utf-8
        json.dump(output_data, f, ensure_ascii=False, indent=4) # Added indent for better readability
    # ...
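
One way to guard against a regression is to read the file back with the same encoding and compare. This is a rough sketch, not the worker's actual code: here the path is passed in as a parameter, load_output is a hypothetical companion function, and the sample payload is invented.

```python
import json
from pathlib import Path
from typing import Any, Dict


def save_output(output_data: Dict[str, Any], output_file_path: Path) -> None:
    """Write the data as human-readable UTF-8 JSON."""
    output_file_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_file_path, "w", encoding="utf-8") as f:
        json.dump(output_data, f, ensure_ascii=False, indent=4)


def load_output(output_file_path: Path) -> Dict[str, Any]:
    """Read the data back, using the same explicit encoding as the writer."""
    with open(output_file_path, "r", encoding="utf-8") as f:
        return json.load(f)
```

The reader needs the explicit encoding as well. If only the writer is fixed, a reader that falls back to the platform default can still fail or return garbled text on a CP949 system.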

Along with this, I built a system to automatically download models, benchmark them, and recommend better ones.

# tools/local_llm_bench/auto_bench.py (automatic benchmark loop)

import os
import json
import time
from typing import List, Dict

# Import necessary functions (e.g., download_model, run_single_benchmark, get_best_model)
from .utils import download_model, run_single_benchmark, get_best_model
from ..local_llm_worker.worker import process_prompt # Import prompt processing function from worker module

def auto_benchmark_loop(model_dir: str, benchmark_prompts_path: str, num_iterations: int = 5):
    current_best_model = None
    candidate_models = ["model_a", "model_b", "model_c"] # Actual model list would be fetched dynamically

    for i in range(num_iterations):
        print(f"Iteration {i+1}/{num_iterations}")

        # 1. Download candidate models (if they don't exist yet)
        for model_name in candidate_models:
            if not os.path.exists(os.path.join(model_dir, model_name)):
                print(f"Downloading {model_name}...")
                download_model(model_name, model_dir) # Actual download function

        # 2. Benchmark current best model
        if current_best_model:
            print(f"Benchmarking current best model: {current_best_model}")
            results = run_single_benchmark(current_best_model, benchmark_prompts_path)
            # Analyze and save results
            # ...

        # 3. Benchmark all candidate models
        all_results: Dict[str, List[float]] = {}
        for model_name in candidate_models:
            print(f"Benchmarking candidate model: {model_name}")
            results = run_single_benchmark(model_name, benchmark_prompts_path)
            all_results[model_name] = results['scores'] # Example: list of scores

        # 4. Select best model based on latest results
        new_best_model = get_best_model(all_results) # Actual best model selection logic

        if new_best_model != current_best_model:
            print(f"New best model found: {new_best_model}. Updating...")
            current_best_model = new_best_model
            # Notify the system about the best model via admin API, etc.
            # switchLocalModel(current_best_model) # Example
        else:
            print("Current best model remains the best.")

        if i < num_iterations - 1:
            time.sleep(60 * 5) # Wait 5 minutes before the next iteration; skip the wait after the last one

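if __name__ == "__main__":
    # Because of the relative imports above, run this as a module from the repository root,
    # e.g. python -m tools.local_llm_bench.auto_bench, rather than as a plain script.
    MODEL_DIRECTORY = "/path/to/local/models" # Actual path
    PROMPTS_FILE = "tools/local_llm_bench/prompts.json"
    auto_benchmark_loop(MODEL_DIRECTORY, PROMPTS_FILE, num_iterations=10)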
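
The selection step depends on get_best_model, which is not shown in this post. As a rough sketch of what it could look like, assuming higher scores are better and that the mean is a reasonable aggregate (both are assumptions, and the real helper in utils may differ):

```python
from statistics import mean
from typing import Dict, List, Optional


def get_best_model(all_results: Dict[str, List[float]]) -> Optional[str]:
    """Return the model with the highest mean score, or None if nothing was scored."""
    # Skip models with no scores so mean() never sees an empty list.
    averages = {name: mean(scores) for name, scores in all_results.items() if scores}
    if not averages:
        return None
    # On a tie, max() keeps the first model in insertion order.
    return max(averages, key=averages.get)
```

Returning None for an empty result set lets the caller keep its current best model instead of crashing when every benchmark run fails.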

During this process, I discovered that the Gemma2:2b model performed significantly better than the EXAONE model I was using previously. I documented and shared this finding.

## Gemma2:2b Model Performance Analysis (As of June 15, 2026)

Recently, I've been analyzing the performance of various models using my automated local model benchmarking system. In particular, I've confirmed that the **Gemma2:2b** model shows a significant advantage over the **EXAONE** model, which I was using previously, in terms of Korean language processing and overall response quality.

**Key Observations:**

*   **Response Speed:** Gemma2:2b maintained a similar response speed to EXAONE while generating higher quality results.
*   **Korean Comprehension:** Gemma2:2b provided much more accurate and natural answers to complex and nuanced Korean questions.
*   **Creative Generation:** Gemma2:2b also scored higher in its ability to generate creative responses to given prompts.

These findings suggest that Gemma2:2b should be prioritized when building local LLM systems in the future.

Results

Research, management, and benchmarking of local models are now handled through the admin page and the automated benchmark loop.
The CP949 encoding errors no longer occur during benchmark runs now that the worker writes its output as UTF-8.
In my benchmark runs, Gemma2:2b outperformed EXAONE, and I documented the result.

Summary — To Avoid the Same Pitfalls

[ ] When performing file I/O in a local environment, do not rely on the operating system's default encoding (CP949 on Korean-locale Windows); always pass encoding='utf-8' explicitly.
[ ] json.dump itself has no encoding parameter. To prevent garbled Korean and encoding errors, pass encoding='utf-8' to open() and ensure_ascii=False to json.dump.
[ ] Build automated scripts for local LLM model management and benchmarking to improve model performance and ensure efficient operation.
[ ] Regularly benchmark various models, and when you discover a high-performing model, immediately document it and incorporate it into your system.
[ ] When encountering errors like UnicodeEncodeError: 'cp949' codec can't encode characters..., investigate not only the encoding issues of the code itself but also the entire system environment and file I/O logic.
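
As a starting point for the environment checks in the list above, this small diagnostic prints the encodings Python will fall back to when none is given. describe_encodings is a hypothetical helper name; the calls it uses are from the standard library.

```python
import locale
import sys
from typing import Dict


def describe_encodings() -> Dict[str, str]:
    """Collect the encodings Python uses when no explicit encoding is passed."""
    return {
        "open() default": locale.getpreferredencoding(False),
        "stdout": str(sys.stdout.encoding),
        "filesystem": sys.getfilesystemencoding(),
        "utf8_mode": str(bool(sys.flags.utf8_mode)),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(f"{name}: {value}")
```

On a Korean-locale Windows machine the first entry will typically report cp949. Setting the environment variable PYTHONUTF8=1 or running Python with -X utf8 (Python 3.7 and later) switches the default to UTF-8, though passing encoding='utf-8' explicitly is still the safer habit.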