Resolving CP949 Errors in Local LLM Benchmarking and Building an Automatic Model Recommendation System
Ever run into CP949 encoding errors when benchmarking local LLMs, or felt frustrated by the lack of model management features? In this post, I'll share how I tracked down a CP949 encoding error and built an automatic model recommendation system to make local model research and management easier.
Attempts and Pitfalls
Initially, I wanted to build a simple feature in the admin page to switch and benchmark local models. I also prepared a more diverse set of benchmark questions in Korean.
// riel_agent/src/app/admin/tabs/LocalModelLabTab.tsx (excerpt)
import { Button, Select, Input } from '@mantine/core';
import { useState, useEffect } from 'react';
import {
  getLocalModels,
  switchLocalModel,
  runBenchmark,
  getBenchmarkResults,
} from '../../api/admin'; // Actual API call functions

function LocalModelLabTab() {
  const [models, setModels] = useState<string[]>([]);
  const [selectedModel, setSelectedModel] = useState<string>('');
  const [benchmarkQuestions, setBenchmarkQuestions] = useState<string[]>([]);
  const [benchmarkResults, setBenchmarkResults] = useState<any>(null);

  useEffect(() => {
    // Load local model list
    getLocalModels().then(setModels);
    // Load Korean benchmark questions (expanded to 25)
    // ...
  }, []);

  // Mantine's Select passes null when the selection is cleared, so guard against it
  const handleModelChange = async (modelName: string | null) => {
    if (!modelName) return;
    await switchLocalModel(modelName); // Actual model switching API
    setSelectedModel(modelName);
  };

  const handleRunBenchmark = async () => {
    const results = await runBenchmark(selectedModel, benchmarkQuestions); // Actual benchmark execution API
    setBenchmarkResults(results);
  };

  // ... UI rendering ...
  return (
    <div>
      <Select
        label="Select Local Model"
        data={models}
        value={selectedModel}
        onChange={handleModelChange}
      />
      <Button onClick={handleRunBenchmark}>Run Benchmark</Button>
      {/* Results display section */}
    </div>
  );
}

export default LocalModelLabTab;
Switching models and expanding questions were relatively straightforward. The problem arose when running benchmarks, especially with Korean data, where I frequently encountered CP949 encoding errors.
UnicodeEncodeError: 'cp949' codec can't encode characters in position 1-3: illegal multibyte sequence
Seeing this error message, I initially thought it was just a Korean string processing issue. So, I tried changing the encoding settings in Python files or explicitly encoding/decoding strings to utf-8. However, after hours of struggling, the problem persisted.
# riel_backend/api/local_llm.py (part of initial attempts)
import json

def process_text_with_model(text: str, model_name: str) -> str:
    # ... Model call logic ...
    # CP949 error occurred here
    # text = text.encode('utf-8').decode('cp949', errors='ignore')  # Attempts like this
    # ...
    pass
The Cause
After hours of debugging, I finally pinpointed the root cause. It wasn't an encoding issue in the Python script's own string handling. The local LLM worker was opening its output file with CP949, the legacy default encoding on Korean-locale Windows, so every model response was forced through that codec when it was saved.
# tools/local_llm_worker/worker.py (suspected point of failure)
def save_output(output_data: dict):
    # ...
    with open(output_file_path, 'w', encoding='cp949') as f:  # <-- Problem occurred here
        json.dump(output_data, f, ensure_ascii=False)
    # ...
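With ensure_ascii=False, json.dump writes non-ASCII characters as-is instead of escaping them as \uXXXX sequences. Because the file was opened with encoding='cp949', every character then had to be encoded as CP949 on write, and any character outside CP949's repertoire (an emoji in a model response, for example) raised UnicodeEncodeError.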
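The failure can be reproduced without a worker or a model. The following is a minimal sketch, not the worker's actual code: `sample` is made-up data, `can_encode` is a hypothetical helper, and the emoji stands in for any character that CP949 cannot represent.

```python
import json

# Hypothetical model response: Hangul plus an emoji (illustrative data only).
sample = {"answer": "안녕하세요 😀"}

# ensure_ascii=False keeps non-ASCII characters as-is in the JSON text.
text = json.dumps(sample, ensure_ascii=False)


def can_encode(s: str, encoding: str) -> bool:
    """Return True if every character in s is representable in the given encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode("안녕하세요", "cp949"))         # Hangul alone
print(can_encode(text, "cp949"))                # Hangul plus emoji
print(can_encode(text, "utf-8"))                # same text, UTF-8
print(can_encode(json.dumps(sample), "cp949"))  # default ensure_ascii=True
```

The last line hints at why this kind of bug can stay hidden: with the default ensure_ascii=True, json emits pure ASCII escapes, which any codec can write, so the problem only surfaces once ensure_ascii=False meets a narrow file encoding.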
The Solution
The fix was simple: modify the local LLM worker to explicitly use utf-8 encoding when saving files.
# tools/local_llm_worker/worker.py (after modification)
import json

def save_output(output_data: dict):
    # ...
    with open(output_file_path, 'w', encoding='utf-8') as f:  # <-- Changed to utf-8
        json.dump(output_data, f, ensure_ascii=False, indent=4)  # Added indent for better readability
    # ...
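To check a fix like this, a round trip through a temporary file is usually enough. The sketch below is an illustration under assumptions: it passes the output path as a parameter (the original elides where `output_file_path` comes from), and `load_output` and the sample dictionary are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_output(output_data: dict, output_file_path: Path) -> None:
    """Write output_data as UTF-8 JSON, keeping non-ASCII characters readable."""
    with open(output_file_path, "w", encoding="utf-8") as f:
        json.dump(output_data, f, ensure_ascii=False, indent=4)


def load_output(output_file_path: Path) -> dict:
    """Read the file back with the same explicit encoding."""
    with open(output_file_path, "r", encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "result.json"
    # Made-up response containing Hangul and an emoji.
    original = {"model": "example-model", "answer": "한국어 응답 😀"}
    save_output(original, path)
    restored = load_output(path)
    raw_bytes = path.read_bytes()

print(restored == original)
```

Reading with the same explicit encoding matters as much as writing with it; a reader that falls back to the platform default would reintroduce the mismatch on the other side.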
Along with this, I built a system to automatically download models, benchmark them, and recommend better ones.
# tools/local_llm_bench/auto_bench.py (automatic benchmark loop)
import os
import json
import time
from typing import List, Dict

# Import necessary functions (e.g., download_model, run_single_benchmark, get_best_model)
from .utils import download_model, run_single_benchmark, get_best_model
from ..local_llm_worker.worker import process_prompt  # Import prompt processing function from worker module

def auto_benchmark_loop(model_dir: str, benchmark_prompts_path: str, num_iterations: int = 5):
    current_best_model = None
    candidate_models = ["model_a", "model_b", "model_c"]  # Actual model list would be fetched dynamically

    for i in range(num_iterations):
        print(f"Iteration {i+1}/{num_iterations}")

        # 1. Download candidate models (if they don't exist yet)
        for model_name in candidate_models:
            if not os.path.exists(os.path.join(model_dir, model_name)):
                print(f"Downloading {model_name}...")
                download_model(model_name, model_dir)  # Actual download function

        # 2. Benchmark current best model
        if current_best_model:
            print(f"Benchmarking current best model: {current_best_model}")
            results = run_single_benchmark(current_best_model, benchmark_prompts_path)
            # Analyze and save results
            # ...

        # 3. Benchmark all candidate models
        all_results: Dict[str, List[float]] = {}
        for model_name in candidate_models:
            print(f"Benchmarking candidate model: {model_name}")
            results = run_single_benchmark(model_name, benchmark_prompts_path)
            all_results[model_name] = results['scores']  # Example: list of scores

        # 4. Select best model based on latest results
        new_best_model = get_best_model(all_results)  # Actual best model selection logic
        if new_best_model != current_best_model:
            print(f"New best model found: {new_best_model}. Updating...")
            current_best_model = new_best_model
            # Notify the system about the best model via admin API, etc.
            # switchLocalModel(current_best_model)  # Example
        else:
            print("Current best model remains the best.")

        time.sleep(60 * 5)  # Wait before the next iteration

if __name__ == "__main__":
    # The relative imports above only resolve when this file is run as a module (python -m ...), not as a script
    MODEL_DIRECTORY = "/path/to/local/models"  # Actual path
    PROMPTS_FILE = "tools/local_llm_bench/prompts.json"
    auto_benchmark_loop(MODEL_DIRECTORY, PROMPTS_FILE, num_iterations=10)
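The loop above leaves `get_best_model` unspecified. One plausible shape for it, shown here only as a sketch, is to rank models by mean score. The averaging rule, the alphabetical tie-break, and the sample scores are all assumptions for illustration, not the selection logic the original system uses.

```python
from statistics import mean
from typing import Dict, List, Optional


def get_best_model(all_results: Dict[str, List[float]]) -> Optional[str]:
    """Return the model with the highest mean score, or None if nothing was scored."""
    # Skip models with no scores so mean() never sees an empty list.
    scored = {name: mean(scores) for name, scores in all_results.items() if scores}
    if not scored:
        return None
    # Iterating names in sorted order makes ties resolve to the alphabetically first model.
    return max(sorted(scored), key=lambda name: scored[name])


# Made-up scores, for illustration only.
example = {
    "model_a": [0.62, 0.70, 0.58],
    "model_b": [0.81, 0.77, 0.79],
    "model_c": [],
}
print(get_best_model(example))  # model_b
```

A mean is easy to reason about, but it hides variance across questions; a real selector might also require a minimum margin before switching, so the recommended model does not flip between runs on noise.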
During this process, I discovered that the Gemma2:2b model performed significantly better than the EXAONE model I was using previously. I documented and shared this finding.
## Gemma2:2b Model Performance Analysis (As of June 15, 2026)
Recently, I've been analyzing the performance of various models using my automated local model benchmarking system. In particular, I've confirmed that the **Gemma2:2b** model shows a significant advantage over the **EXAONE** model, which I was using previously, in terms of Korean language processing and overall response quality.
**Key Observations:**
* **Response Speed:** Gemma2:2b maintained a similar response speed to EXAONE while generating higher quality results.
* **Korean Comprehension:** Gemma2:2b provided much more accurate and natural answers to complex and nuanced Korean questions.
* **Creative Generation:** Gemma2:2b also scored higher in its ability to generate creative responses to given prompts.
These findings suggest that Gemma2:2b should be prioritized when building local LLM systems in the future.
Results
- Research, management, and benchmarking capabilities for local models have been significantly enhanced.
- The CP949 encoding errors encountered during benchmark execution have been completely resolved, improving system stability.
- It was objectively confirmed and documented that the Gemma2:2b model outperforms EXAONE.
Summary — To Avoid the Same Pitfalls
- [ ] When performing file I/O in a local environment, do not rely on the operating system's default encoding (CP949 on Korean-locale Windows); always explicitly use utf-8.
- [ ] When using Python's json.dump, prevent Korean garbling and encoding errors by passing encoding='utf-8' to open() when writing the file, along with ensure_ascii=False on json.dump itself.
- [ ] Build automated scripts for local LLM model management and benchmarking to improve model performance and ensure efficient operation.
- [ ] Regularly benchmark various models, and when you discover a high-performing model, immediately document it and incorporate it into your system.
- [ ] When encountering errors like UnicodeEncodeError: 'cp949' codec can't encode characters..., investigate not only the encoding issues of the code itself but also the entire system environment and file I/O logic.
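
For the first and last items on this checklist, it helps to see which encodings Python falls back to on a given machine. The snippet below is a small diagnostic sketch using only the standard library; `encoding_report` is a hypothetical helper name, not part of the project.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the encodings Python falls back to when none is specified."""
    return {
        # What open() uses in text mode when no encoding argument is given.
        "open_default": locale.getpreferredencoding(False),
        # May be None if stdout has been replaced by an object without an encoding.
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        # True when UTF-8 mode is on (python -X utf8 or PYTHONUTF8=1).
        "utf8_mode": bool(sys.flags.utf8_mode),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```

On a Korean-locale Windows machine without UTF-8 mode, open_default typically reports cp949. Enabling UTF-8 mode changes that fallback, but passing encoding='utf-8' explicitly remains the safer habit because it does not depend on how the interpreter was launched.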