Need help with your JSON?
Try our JSON Formatter tool to automatically identify and fix syntax errors in your JSON.
Resolving Unicode Character Issues in JSON Documents
Unicode characters in JSON documents can create subtle and challenging issues that are difficult to debug. From encoding problems to improper escaping, Unicode-related errors often cause JSON parsing failures that can be frustrating to resolve. This article explores common Unicode issues in JSON and provides practical strategies for identifying and fixing them.
Understanding Unicode in JSON
The JSON specification (ECMA-404 and RFC 8259) fully supports Unicode. However, there are specific rules for how Unicode characters should be represented in JSON strings.
Unicode Representation in JSON:
- Direct representation - Unicode characters can be included directly in JSON strings (e.g., "Hello, 世界")
- Escape sequences - Unicode characters can be represented with escape sequences (e.g., "\u4E16\u754C" for "世界")
- Control characters - Control characters (U+0000 to U+001F) must always be escaped in JSON
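These rules are easy to verify directly. The sketch below uses standard `JSON.parse` behavior to show that the direct and escaped representations decode to the same string, and that a literal control character is rejected:

```javascript
// Direct and escaped representations decode to the same JavaScript string.
const direct = JSON.parse('{"greeting": "Hello, 世界"}');
const escaped = JSON.parse('{"greeting": "Hello, \\u4E16\\u754C"}');
console.log(direct.greeting === escaped.greeting); // true

// Control characters must use escapes; a literal newline is rejected.
JSON.parse('{"ok": "line1\\nline2"}');     // parses fine: escaped newline
try {
  JSON.parse('{"bad": "line1\nline2"}');   // literal U+000A inside the string
} catch (e) {
  console.log(e.name);                     // SyntaxError
}
```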
JSON Encoding Requirements:
JSON documents must be encoded in UTF-8, UTF-16, or UTF-32. While the specification allows all three, UTF-8 is by far the most commonly used encoding for JSON in practice.
Common Unicode Issues in JSON
1. Incorrect Encoding
Using the wrong character encoding is one of the most common sources of Unicode issues in JSON.
Encoding Problem Example:
```javascript
// JSON saved with ISO-8859-1 (Latin-1) encoding instead of UTF-8
{
  "name": "Café"  // 'é' is encoded differently in ISO-8859-1 vs UTF-8
}

// When read with a UTF-8 expectation, the ISO-8859-1 'é' byte (0xE9)
// is invalid UTF-8, producing replacement characters (mojibake) or,
// depending on the parser, a SyntaxError
```
When JSON containing non-ASCII characters is saved with the wrong encoding, those characters get misinterpreted during parsing.
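The mismatch is visible at the byte level. In this minimal Node.js sketch, the same bytes are decoded twice with the standard `TextDecoder` API; only the correct encoding recovers the text:

```javascript
// "Café" saved as ISO-8859-1: the 'é' is the single byte 0xE9,
// which is not valid UTF-8 on its own.
const latin1Bytes = Buffer.from([
  0x7B, 0x22, 0x6E, 0x61, 0x6D, 0x65, 0x22, 0x3A, // {"name":
  0x22, 0x43, 0x61, 0x66, 0xE9, 0x22, 0x7D,       // "Café"}
]);

// Decoding with the wrong encoding yields U+FFFD replacement characters.
const wrong = new TextDecoder('utf-8').decode(latin1Bytes);
console.log(wrong); // {"name":"Caf�"}

// Decoding with the correct encoding recovers the original text.
const right = new TextDecoder('iso-8859-1').decode(latin1Bytes);
console.log(JSON.parse(right).name); // Café
```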
2. Unpaired Surrogate Characters
Surrogate pairs are used to represent characters outside the Basic Multilingual Plane (BMP) in UTF-16. Issues occur when these pairs are improperly split or handled.
Surrogate Pair Issue:
```javascript
// Incorrect: unpaired high surrogate
{ "emoji": "\uD83D" }

// Correct: complete surrogate pair for '😀' (U+1F600)
{ "emoji": "\uD83D\uDE00" }
```
Certain characters like emoji require surrogate pairs in UTF-16. Incomplete pairs cause parsing errors.
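You can observe the two-code-unit structure directly in JavaScript, since its strings are UTF-16. This sketch uses the standard `charCodeAt`/`codePointAt` methods:

```javascript
const emoji = '😀'; // U+1F600, outside the BMP

// In UTF-16 (JavaScript strings), this is two code units:
console.log(emoji.length);                     // 2
console.log(emoji.charCodeAt(0).toString(16)); // d83d (high surrogate)
console.log(emoji.charCodeAt(1).toString(16)); // de00 (low surrogate)

// codePointAt / fromCodePoint work with the full code point:
console.log(emoji.codePointAt(0).toString(16));       // 1f600
console.log(String.fromCodePoint(0x1F600) === emoji); // true

// JSON.parse accepts a lone surrogate escape, but the resulting
// string contains an unpaired code unit that will break on re-encoding:
const lone = JSON.parse('{"emoji": "\\uD83D"}').emoji;
console.log(lone.length); // 1 — an unpaired high surrogate
```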
3. Byte Order Mark (BOM) Issues
The Byte Order Mark (BOM) is a special Unicode character that can appear at the beginning of a file to indicate its encoding.
BOM Issue Example:
```javascript
// File starts with a UTF-8 BOM (not visible here, but present in the file)
{ "property": "value" }

// This can cause errors like:
// SyntaxError: Unexpected token '' in JSON at position 0
```
While some parsers handle BOMs correctly, others interpret them as part of the JSON content, causing syntax errors.
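At the byte level, the UTF-8 BOM is the three-byte sequence `EF BB BF`. A minimal sketch of stripping it before decoding (the buffer here is constructed inline for illustration; in practice it would come from `fs.readFileSync` without an encoding argument):

```javascript
// Check the raw bytes for the UTF-8 BOM (EF BB BF) and strip it.
function stripUtf8Bom(buffer) {
  if (buffer.length >= 3 &&
      buffer[0] === 0xEF && buffer[1] === 0xBB && buffer[2] === 0xBF) {
    return buffer.slice(3);
  }
  return buffer;
}

// Example: a buffer that starts with a BOM.
const withBom = Buffer.concat([
  Buffer.from([0xEF, 0xBB, 0xBF]),
  Buffer.from('{"property": "value"}'),
]);

const data = JSON.parse(stripUtf8Bom(withBom).toString('utf8'));
console.log(data.property); // value
```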
4. Escape Sequence Errors
Improper formatting of Unicode escape sequences is another common issue.
Escape Sequence Errors:
```javascript
// Incorrect: missing leading zeros (must be 4 hex digits)
{ "symbol": "\u3A" }

// Correct:
{ "symbol": "\u003A" }

// Incorrect: invalid hexadecimal characters
{ "value": "\uGHIJ" }

// Correct (if intended as literal text):
{ "value": "\\uGHIJ" }
```
Unicode escape sequences in JSON must be exactly 4 hexadecimal digits. If you want to represent the literal sequence "\uGHIJ", the backslash itself must be escaped.
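These cases behave as follows with standard `JSON.parse` (the extra backslashes in the JavaScript source are JavaScript-level escaping of the JSON text):

```javascript
// A valid escape: exactly four hex digits.
console.log(JSON.parse('{"symbol": "\\u003A"}').symbol); // :

// Too few digits — the parser rejects it.
try {
  JSON.parse('{"symbol": "\\u3A"}');
} catch (e) {
  console.log(e.name); // SyntaxError
}

// To keep the literal text \uGHIJ, escape the backslash itself.
console.log(JSON.parse('{"value": "\\\\uGHIJ"}').value); // \uGHIJ
```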
5. Control Characters
JSON requires that all control characters (U+0000 to U+001F) be escaped with special sequences.
Control Character Issue:
```javascript
// Incorrect: contains literal newline and tab characters
{ "description": "First line
Second line	indented" }

// Correct: using escape sequences
{ "description": "First line\nSecond line\tindented" }
```
Literal control characters in JSON strings cause syntax errors. They must be represented with their escape sequences.
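In practice you rarely need to escape these by hand: `JSON.stringify` does it for you, and the escaped form round-trips losslessly. A quick sketch:

```javascript
// JSON.stringify escapes control characters automatically:
const text = 'First line\nSecond line\tindented';
const json = JSON.stringify({ description: text });
console.log(json); // {"description":"First line\nSecond line\tindented"}

// The round trip preserves the original string:
console.log(JSON.parse(json).description === text); // true

// Hand-built JSON with a literal newline fails:
try {
  JSON.parse('{"description": "First line\nSecond line"}');
} catch (e) {
  console.log(e.name); // SyntaxError
}
```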
Detecting Unicode Issues in JSON
1. Check Error Messages
Parse errors related to Unicode issues often have specific characteristics:
Typical Unicode-Related Error Messages:
| Error Message Pattern | Likely Unicode Issue |
|---|---|
| "Unexpected token '' at position 0" | BOM or encoding mismatch |
| "Invalid Unicode escape sequence" | Malformed \u escape |
| "Unexpected token at position X" | Character encoding issue |
| "Unexpected end of input" | Unpaired surrogate character |
| "Invalid character" | Control character or invalid byte |
2. Visual Inspection
Sometimes, Unicode issues are visible when you examine the JSON:
- Look for question marks, boxes, or other placeholder characters that indicate encoding problems
- Watch for "mojibake" (garbled text) where characters don't display as expected
- Check for unexpected characters at the beginning of the file (potential BOM)
3. Use Specialized Tools
Various tools can help identify Unicode issues in JSON:
Helpful Tools:
- Hex editors - Show the actual byte values, revealing encoding issues
- Encoding detectors - Tools like chardet (Python) can identify the actual encoding used
- Unicode inspectors - Display detailed information about each character
- JSON validators - Many provide specific Unicode-related error messages
Resolving Unicode Issues in JSON
1. Fix Encoding Issues
Ensure your JSON is saved with the correct encoding:
JavaScript Example:
```javascript
// Node.js example: read and save with explicit UTF-8 encoding
const fs = require('fs');

// Read with explicit encoding
const jsonString = fs.readFileSync('data.json', 'utf8');

// Process the JSON...
const data = JSON.parse(jsonString);

// Save with explicit UTF-8 encoding
fs.writeFileSync('fixed.json', JSON.stringify(data, null, 2), 'utf8');
```
2. Handle BOMs Correctly
If your JSON files have a BOM, you can either remove it or ensure your parser handles it:
BOM Removal Example:
```javascript
const fs = require('fs');

// Remove a BOM if present (reading with 'utf8' decodes it to U+FEFF)
function removeBOM(str) {
  if (str.charCodeAt(0) === 0xFEFF) {
    return str.slice(1);
  }
  return str;
}

// Read JSON from file
const jsonWithBOM = fs.readFileSync('data.json', 'utf8');
const jsonWithoutBOM = removeBOM(jsonWithBOM);

// Now parse
const data = JSON.parse(jsonWithoutBOM);
```
3. Properly Escape Unicode Characters
For maximum compatibility, consider escaping non-ASCII characters:
Unicode Escaping Example:
```javascript
// Escape every non-ASCII UTF-16 code unit as a \uXXXX sequence.
// Characters outside the BMP occupy two code units in a JavaScript
// string, so they naturally become a surrogate-pair escape such as
// \uD83D\uDE00 — which is exactly how JSON represents them.
function escapeNonAscii(str) {
  return str.replace(/[^\x00-\x7F]/g, function (char) {
    return '\\u' + char.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0');
  });
}

// Stringify first, then escape. Escaping inside a JSON.stringify
// replacer would not work: stringify would re-escape the inserted
// backslashes, producing \\u4E16 instead of \u4E16.
function safeStringify(obj) {
  return escapeNonAscii(JSON.stringify(obj));
}

// Example
const data = { message: "Hello, 世界!" };
const safeJson = safeStringify(data);
// Result: {"message":"Hello, \u4E16\u754C!"}
```
4. Fix Surrogate Pair Issues
If you need to handle surrogate pairs manually:
Surrogate Pair Validation:
```javascript
// Check a string for unpaired surrogate code units
function detectUnpairedSurrogates(str) {
  const issues = [];
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    // A high surrogate (0xD800–0xDBFF) must be followed by a low surrogate
    if (code >= 0xD800 && code <= 0xDBFF) {
      if (i + 1 >= str.length ||
          str.charCodeAt(i + 1) < 0xDC00 ||
          str.charCodeAt(i + 1) > 0xDFFF) {
        issues.push({
          position: i,
          issue: 'Unpaired high surrogate',
          code: code.toString(16),
        });
      } else {
        // Skip the next character: it's a valid low surrogate
        i++;
      }
    }
    // A low surrogate (0xDC00–0xDFFF) on its own is also invalid
    else if (code >= 0xDC00 && code <= 0xDFFF) {
      issues.push({
        position: i,
        issue: 'Isolated low surrogate',
        code: code.toString(16),
      });
    }
  }
  return issues;
}

// Example usage
const jsonString = '{"emoji": "\uD83D"}'; // incomplete surrogate pair
const obj = JSON.parse(jsonString);
const issues = detectUnpairedSurrogates(obj.emoji);
console.log(issues); // [{ position: 0, issue: 'Unpaired high surrogate', code: 'd83d' }]
```
5. Normalize Unicode Representations
Some Unicode characters can be represented in multiple ways, which can cause comparison issues:
Unicode Normalization:
```javascript
// Normalize all string keys and values to Unicode Normalization Form C (NFC)
function normalizeJson(jsonString) {
  const obj = JSON.parse(jsonString);

  // Recursively normalize all string values
  function normalizeObject(obj) {
    if (typeof obj === 'string') {
      return obj.normalize('NFC');
    } else if (Array.isArray(obj)) {
      return obj.map(normalizeObject);
    } else if (obj !== null && typeof obj === 'object') {
      const result = {};
      for (const key of Object.keys(obj)) {
        // Normalize both keys and values
        result[key.normalize('NFC')] = normalizeObject(obj[key]);
      }
      return result;
    }
    return obj;
  }

  return JSON.stringify(normalizeObject(obj));
}

// Example with accented characters
const json = '{"café": "résumé"}';
const normalized = normalizeJson(json);
```
Unicode normalization ensures consistent representation of characters that can be written in multiple ways (like accented characters).
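The difference is easy to demonstrate with the standard `String.prototype.normalize` method: a precomposed 'é' (U+00E9) and 'e' plus a combining accent (U+0065 U+0301) look identical but compare unequal until normalized:

```javascript
// 'é' can be one code point (NFC) or 'e' + combining accent (NFD).
const composed = 'caf\u00E9';    // café, precomposed
const decomposed = 'cafe\u0301'; // café, e + U+0301 COMBINING ACUTE ACCENT

console.log(composed === decomposed);                                   // false
console.log(composed.normalize('NFC') === decomposed.normalize('NFC')); // true
console.log(composed.length, decomposed.length);                        // 4 5

// Without normalization, JSON produced from the two forms differs:
console.log(JSON.stringify(composed) === JSON.stringify(decomposed));   // false
```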
Prevention Strategies
Follow these best practices to prevent Unicode issues in your JSON:
- Always specify encoding explicitly when reading or writing JSON files
- Use UTF-8 without BOM as your standard encoding for JSON files
- Validate JSON after generation, especially if it contains non-ASCII characters
- Consider escaping all non-ASCII characters for maximum compatibility
- Use a library that handles Unicode correctly rather than manually constructing JSON
- Include charset=utf-8 in Content-Type headers when serving JSON over HTTP
HTTP Header Example:
Content-Type: application/json; charset=utf-8
Always specify the character encoding when serving JSON via HTTP to ensure proper interpretation.
Language-Specific Considerations
Unicode Handling by Language:
| Language | Unicode Handling in JSON |
|---|---|
| JavaScript | Good native support, but surrogate pair issues can occur with String.fromCharCode |
| Python | Excellent Unicode support; specify encoding with open(file, encoding='utf-8') |
| Java | Good support; use InputStreamReader with the correct charset |
| Ruby | Strong Unicode handling; specify File.read(file, encoding: 'utf-8') |
| PHP | Uneven support; use json_encode with the JSON_UNESCAPED_UNICODE flag |
| C# | Good native support; prefer Encoding.UTF8 for file operations |
Conclusion
Unicode character issues in JSON documents can be complex to diagnose and fix, but understanding the common problems and their solutions will help you handle these situations effectively. By following encoding best practices, properly escaping characters when needed, and using the right tools for detection and validation, you can prevent most Unicode-related JSON errors.
Remember that JSON's simplicity as a data format depends on correct handling of Unicode. Taking the time to ensure your JSON is properly encoded and formatted will save you from difficult debugging sessions and ensure your data flows smoothly between systems, regardless of what languages or special characters it contains.