Need help with your JSON?
Try our JSON Formatter tool to automatically identify and fix syntax errors in your JSON.
Resolving Unicode Character Issues in JSON Documents
Unicode characters in JSON documents can create subtle and challenging issues that are difficult to debug. From encoding problems to improper escaping, Unicode-related errors often cause JSON parsing failures that can be frustrating to resolve. This article explores common Unicode issues in JSON and provides practical strategies for identifying and fixing them.
Understanding Unicode in JSON
The JSON specification (ECMA-404 and RFC 8259) fully supports Unicode. However, there are specific rules for how Unicode characters should be represented in JSON strings.
Unicode Representation in JSON:
- Direct representation - Unicode characters can be included directly in JSON strings (e.g., "Hello, 世界")
- Escape sequences - Unicode characters can be represented with escape sequences (e.g., "\u4E16\u754C" for "世界")
- Control characters - Control characters (U+0000 to U+001F) must always be escaped in JSON
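These rules are easy to verify directly. The sketch below uses standard `JSON.parse` behavior to show that the direct and escaped representations decode to the same string, and that a literal control character is rejected:

```javascript
// Direct and escaped representations decode to the same JavaScript string.
const direct = JSON.parse('{"greeting": "Hello, 世界"}');
const escaped = JSON.parse('{"greeting": "Hello, \\u4E16\\u754C"}');
console.log(direct.greeting === escaped.greeting); // true

// Control characters must use escapes; a literal newline is rejected.
JSON.parse('{"ok": "line1\\nline2"}');     // parses fine: escaped newline
try {
  JSON.parse('{"bad": "line1\nline2"}');   // literal U+000A inside the string
} catch (e) {
  console.log(e.name);                     // SyntaxError
}
```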
JSON Encoding Requirements:
JSON documents must be encoded in UTF-8, UTF-16, or UTF-32. While the specification allows all three, UTF-8 is by far the most commonly used encoding for JSON in practice.
Common Unicode Issues in JSON
1. Incorrect Encoding
Using the wrong character encoding is one of the most common sources of Unicode issues in JSON.
Encoding Problem Example:
```javascript
// JSON saved with ISO-8859-1 (Latin-1) encoding instead of UTF-8
{
  "name": "Café"  // 'é' is encoded differently in ISO-8859-1 vs UTF-8
}

// When read with a UTF-8 expectation, the ISO-8859-1 'é' byte (0xE9)
// is invalid UTF-8, producing replacement characters (mojibake) or,
// depending on the parser, a SyntaxError
```
When JSON containing non-ASCII characters is saved with the wrong encoding, those characters get misinterpreted during parsing.
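The mismatch is visible at the byte level. In this minimal Node.js sketch, the same bytes are decoded twice with the standard `TextDecoder` API; only the correct encoding recovers the text:

```javascript
// "Café" saved as ISO-8859-1: the 'é' is the single byte 0xE9,
// which is not valid UTF-8 on its own.
const latin1Bytes = Buffer.from([
  0x7B, 0x22, 0x6E, 0x61, 0x6D, 0x65, 0x22, 0x3A, // {"name":
  0x22, 0x43, 0x61, 0x66, 0xE9, 0x22, 0x7D,       // "Café"}
]);

// Decoding with the wrong encoding yields U+FFFD replacement characters.
const wrong = new TextDecoder('utf-8').decode(latin1Bytes);
console.log(wrong); // {"name":"Caf�"}

// Decoding with the correct encoding recovers the original text.
const right = new TextDecoder('iso-8859-1').decode(latin1Bytes);
console.log(JSON.parse(right).name); // Café
```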
2. Unpaired Surrogate Characters
Surrogate pairs are used to represent characters outside the Basic Multilingual Plane (BMP) in UTF-16. Issues occur when these pairs are improperly split or handled.
Surrogate Pair Issue:
```javascript
// Incorrect: unpaired high surrogate
{ "emoji": "\uD83D" }

// Correct: complete surrogate pair for '😀' (U+1F600)
{ "emoji": "\uD83D\uDE00" }
```
Certain characters like emoji require surrogate pairs in UTF-16. Incomplete pairs cause parsing errors.
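You can observe the two-code-unit structure directly in JavaScript, since its strings are UTF-16. This sketch uses the standard `charCodeAt`/`codePointAt` methods:

```javascript
const emoji = '😀'; // U+1F600, outside the BMP

// In UTF-16 (JavaScript strings), this is two code units:
console.log(emoji.length);                     // 2
console.log(emoji.charCodeAt(0).toString(16)); // d83d (high surrogate)
console.log(emoji.charCodeAt(1).toString(16)); // de00 (low surrogate)

// codePointAt / fromCodePoint work with the full code point:
console.log(emoji.codePointAt(0).toString(16));       // 1f600
console.log(String.fromCodePoint(0x1F600) === emoji); // true

// JSON.parse accepts a lone surrogate escape, but the resulting
// string contains an unpaired code unit that will break on re-encoding:
const lone = JSON.parse('{"emoji": "\\uD83D"}').emoji;
console.log(lone.length); // 1 — an unpaired high surrogate
```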
3. Byte Order Mark (BOM) Issues
The Byte Order Mark (BOM) is a special Unicode character that can appear at the beginning of a file to indicate its encoding.
BOM Issue Example:
```javascript
// File starts with a UTF-8 BOM (not visible here, but present in the file)
{ "property": "value" }

// This can cause errors like:
// SyntaxError: Unexpected token '' in JSON at position 0
```
While some parsers handle BOMs correctly, others interpret them as part of the JSON content, causing syntax errors.
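At the byte level, the UTF-8 BOM is the three-byte sequence `EF BB BF`. A minimal sketch of stripping it before decoding (the buffer here is constructed inline for illustration; in practice it would come from `fs.readFileSync` without an encoding argument):

```javascript
// Check the raw bytes for the UTF-8 BOM (EF BB BF) and strip it.
function stripUtf8Bom(buffer) {
  if (buffer.length >= 3 &&
      buffer[0] === 0xEF && buffer[1] === 0xBB && buffer[2] === 0xBF) {
    return buffer.slice(3);
  }
  return buffer;
}

// Example: a buffer that starts with a BOM.
const withBom = Buffer.concat([
  Buffer.from([0xEF, 0xBB, 0xBF]),
  Buffer.from('{"property": "value"}'),
]);

const data = JSON.parse(stripUtf8Bom(withBom).toString('utf8'));
console.log(data.property); // value
```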
4. Escape Sequence Errors
Improper formatting of Unicode escape sequences is another common issue.
Escape Sequence Errors:
```javascript
// Incorrect: missing leading zeros (must be 4 hex digits)
{ "symbol": "\u3A" }

// Correct:
{ "symbol": "\u003A" }

// Incorrect: invalid hexadecimal characters
{ "value": "\uGHIJ" }

// Correct (if intended as literal text):
{ "value": "\\uGHIJ" }
```
Unicode escape sequences in JSON must be exactly 4 hexadecimal digits. If you want to represent the literal sequence "\uGHIJ", the backslash itself must be escaped.
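These cases behave as follows with standard `JSON.parse` (the extra backslashes in the JavaScript source are JavaScript-level escaping of the JSON text):

```javascript
// A valid escape: exactly four hex digits.
console.log(JSON.parse('{"symbol": "\\u003A"}').symbol); // :

// Too few digits — the parser rejects it.
try {
  JSON.parse('{"symbol": "\\u3A"}');
} catch (e) {
  console.log(e.name); // SyntaxError
}

// To keep the literal text \uGHIJ, escape the backslash itself.
console.log(JSON.parse('{"value": "\\\\uGHIJ"}').value); // \uGHIJ
```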
5. Control Characters
JSON requires that all control characters (U+0000 to U+001F) be escaped with special sequences.
Control Character Issue:
```javascript
// Incorrect: contains literal newline and tab characters
{ "description": "First line
Second line	indented" }

// Correct: using escape sequences
{ "description": "First line\nSecond line\tindented" }
```
Literal control characters in JSON strings cause syntax errors. They must be represented with their escape sequences.
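In practice you rarely need to escape these by hand: `JSON.stringify` does it for you, and the escaped form round-trips losslessly. A quick sketch:

```javascript
// JSON.stringify escapes control characters automatically:
const text = 'First line\nSecond line\tindented';
const json = JSON.stringify({ description: text });
console.log(json); // {"description":"First line\nSecond line\tindented"}

// The round trip preserves the original string:
console.log(JSON.parse(json).description === text); // true

// Hand-built JSON with a literal newline fails:
try {
  JSON.parse('{"description": "First line\nSecond line"}');
} catch (e) {
  console.log(e.name); // SyntaxError
}
```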
Detecting Unicode Issues in JSON
1. Check Error Messages
Parse errors related to Unicode issues often have specific characteristics:
Typical Unicode-Related Error Messages:
| Error Message Pattern | Likely Unicode Issue |
|---|---|
| "Unexpected token '' at position 0" | BOM or encoding mismatch |
| "Invalid Unicode escape sequence" | Malformed \u escape |
| "Unexpected token at position X" | Character encoding issue |
| "Unexpected end of input" | Unpaired surrogate character |
| "Invalid character" | Control character or invalid byte |
2. Visual Inspection
Sometimes, Unicode issues are visible when you examine the JSON:
- Look for question marks, boxes, or other placeholder characters that indicate encoding problems
- Watch for "mojibake" (garbled text) where characters don't display as expected
- Check for unexpected characters at the beginning of the file (potential BOM)
3. Use Specialized Tools
Various tools can help identify Unicode issues in JSON:
Helpful Tools:
- Hex editors - Show the actual byte values, revealing encoding issues
- Encoding detectors - Tools like chardet (Python) can identify the actual encoding used
- Unicode inspectors - Display detailed information about each character
- JSON validators - Many provide specific Unicode-related error messages
Resolving Unicode Issues in JSON
1. Fix Encoding Issues
Ensure your JSON is saved with the correct encoding:
JavaScript Example:
```javascript
// Node.js example: read and save with explicit UTF-8 encoding
const fs = require('fs');

// Read with explicit encoding
const jsonString = fs.readFileSync('data.json', 'utf8');

// Process the JSON...
const data = JSON.parse(jsonString);

// Save with explicit UTF-8 encoding
fs.writeFileSync('fixed.json', JSON.stringify(data, null, 2), 'utf8');
```
2. Handle BOMs Correctly
If your JSON files have a BOM, you can either remove it or ensure your parser handles it:
BOM Removal Example:
```javascript
const fs = require('fs');

// Remove a BOM if present (reading with 'utf8' decodes it to U+FEFF)
function removeBOM(str) {
  if (str.charCodeAt(0) === 0xFEFF) {
    return str.slice(1);
  }
  return str;
}

// Read JSON from file
const jsonWithBOM = fs.readFileSync('data.json', 'utf8');
const jsonWithoutBOM = removeBOM(jsonWithBOM);

// Now parse
const data = JSON.parse(jsonWithoutBOM);
```
3. Properly Escape Unicode Characters
For maximum compatibility, consider escaping non-ASCII characters:
Unicode Escaping Example:
```javascript
// Escape every non-ASCII UTF-16 code unit as a \uXXXX sequence.
// Characters outside the BMP occupy two code units in a JavaScript
// string, so they naturally become a surrogate-pair escape such as
// \uD83D\uDE00 — which is exactly how JSON represents them.
function escapeNonAscii(str) {
  return str.replace(/[^\x00-\x7F]/g, function (char) {
    return '\\u' + char.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0');
  });
}

// Stringify first, then escape. Escaping inside a JSON.stringify
// replacer would not work: stringify would re-escape the inserted
// backslashes, producing \\u4E16 instead of \u4E16.
function safeStringify(obj) {
  return escapeNonAscii(JSON.stringify(obj));
}

// Example
const data = { message: "Hello, 世界!" };
const safeJson = safeStringify(data);
// Result: {"message":"Hello, \u4E16\u754C!"}
```
4. Fix Surrogate Pair Issues
If you need to handle surrogate pairs manually:
Surrogate Pair Validation:
```javascript
// Check a string for unpaired surrogate code units
function detectUnpairedSurrogates(str) {
  const issues = [];
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i);
    // A high surrogate (0xD800–0xDBFF) must be followed by a low surrogate
    if (code >= 0xD800 && code <= 0xDBFF) {
      if (i + 1 >= str.length ||
          str.charCodeAt(i + 1) < 0xDC00 ||
          str.charCodeAt(i + 1) > 0xDFFF) {
        issues.push({
          position: i,
          issue: 'Unpaired high surrogate',
          code: code.toString(16),
        });
      } else {
        // Skip the next character: it's a valid low surrogate
        i++;
      }
    }
    // A low surrogate (0xDC00–0xDFFF) on its own is also invalid
    else if (code >= 0xDC00 && code <= 0xDFFF) {
      issues.push({
        position: i,
        issue: 'Isolated low surrogate',
        code: code.toString(16),
      });
    }
  }
  return issues;
}

// Example usage
const jsonString = '{"emoji": "\uD83D"}'; // incomplete surrogate pair
const obj = JSON.parse(jsonString);
const issues = detectUnpairedSurrogates(obj.emoji);
console.log(issues); // [{ position: 0, issue: 'Unpaired high surrogate', code: 'd83d' }]
```
5. Normalize Unicode Representations
Some Unicode characters can be represented in multiple ways, which can cause comparison issues:
Unicode Normalization:
```javascript
// Normalize all string keys and values to Unicode Normalization Form C (NFC)
function normalizeJson(jsonString) {
  const obj = JSON.parse(jsonString);

  // Recursively normalize all string values
  function normalizeObject(obj) {
    if (typeof obj === 'string') {
      return obj.normalize('NFC');
    } else if (Array.isArray(obj)) {
      return obj.map(normalizeObject);
    } else if (obj !== null && typeof obj === 'object') {
      const result = {};
      for (const key of Object.keys(obj)) {
        // Normalize both keys and values
        result[key.normalize('NFC')] = normalizeObject(obj[key]);
      }
      return result;
    }
    return obj;
  }

  return JSON.stringify(normalizeObject(obj));
}

// Example with accented characters
const json = '{"café": "résumé"}';
const normalized = normalizeJson(json);
```
Unicode normalization ensures consistent representation of characters that can be written in multiple ways (like accented characters).
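The difference is easy to demonstrate with the standard `String.prototype.normalize` method: a precomposed 'é' (U+00E9) and 'e' plus a combining accent (U+0065 U+0301) look identical but compare unequal until normalized:

```javascript
// 'é' can be one code point (NFC) or 'e' + combining accent (NFD).
const composed = 'caf\u00E9';    // café, precomposed
const decomposed = 'cafe\u0301'; // café, e + U+0301 COMBINING ACUTE ACCENT

console.log(composed === decomposed);                                   // false
console.log(composed.normalize('NFC') === decomposed.normalize('NFC')); // true
console.log(composed.length, decomposed.length);                        // 4 5

// Without normalization, JSON produced from the two forms differs:
console.log(JSON.stringify(composed) === JSON.stringify(decomposed));   // false
```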
Prevention Strategies
Follow these best practices to prevent Unicode issues in your JSON:
- Always specify encoding explicitly when reading or writing JSON files
- Use UTF-8 without BOM as your standard encoding for JSON files
- Validate JSON after generation, especially if it contains non-ASCII characters
- Consider escaping all non-ASCII characters for maximum compatibility
- Use a library that handles Unicode correctly rather than manually constructing JSON
- Include charset=utf-8 in Content-Type headers when serving JSON over HTTP
HTTP Header Example:
Content-Type: application/json; charset=utf-8
Always specify the character encoding when serving JSON via HTTP to ensure proper interpretation.
Language-Specific Considerations
Unicode Handling by Language:
| Language | Unicode Handling in JSON |
|---|---|
| JavaScript | Good native support, but surrogate pair issues can occur with String.fromCharCode |
| Python | Excellent Unicode support; specify encoding with open(file, encoding='utf-8') |
| Java | Good support; use InputStreamReader with the correct charset |
| Ruby | Strong Unicode handling; specify File.read(file, encoding: 'utf-8') |
| PHP | Uneven support; use json_encode with the JSON_UNESCAPED_UNICODE flag |
| C# | Good native support; prefer Encoding.UTF8 for file operations |
Conclusion
Unicode character issues in JSON documents can be complex to diagnose and fix, but understanding the common problems and their solutions will help you handle these situations effectively. By following encoding best practices, properly escaping characters when needed, and using the right tools for detection and validation, you can prevent most Unicode-related JSON errors.
Remember that JSON's simplicity as a data format depends on correct handling of Unicode. Taking the time to ensure your JSON is properly encoded and formatted will save you from difficult debugging sessions and ensure your data flows smoothly between systems, regardless of what languages or special characters it contains.