Charset and Encoding Cheat Sheet

Here’s a cheat sheet for character sets (charsets) and character encodings:

Character Sets

  • ASCII:
    • 7-bit character set.
    • Includes basic Latin alphabet, numerals, and control characters.
    • Registered with IANA under the charset name US-ASCII.
  • ISO-8859-1 (Latin-1):
    • 8-bit character set.
    • Covers most Western European languages.
    • Common in older HTML documents; browsers treat the ISO-8859-1 label as Windows-1252.
    • Other ISO-8859-x parts cover other scripts and language groups (for example, ISO-8859-5 for Cyrillic).
  • UTF-8:
    • Variable-width encoding of the Unicode character set.
    • Compatible with ASCII.
    • Dominant character encoding on the web.
    • Supports all Unicode characters.
  • UTF-16:
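    • Variable-width encoding of Unicode (its fixed-width predecessor, UCS-2, cannot represent characters outside the Basic Multilingual Plane).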
    • Supports the entire Unicode character set.
    • Used internally by languages like JavaScript and Java.
  • UTF-32:
    • Fixed-width encoding.
    • Supports the entire Unicode character set.
    • Less common due to larger file sizes.
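
As a rough illustration of the coverage differences listed above, the following Python sketch checks which strings a given encoding can represent. The helper name can_encode is made up for this example; the codec names are the ones Python's standard library accepts.

```python
def can_encode(s: str, encoding: str) -> bool:
    """Return True if every character of s can be represented in `encoding`."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Pure ASCII text yields identical bytes in ASCII, ISO-8859-1, and UTF-8.
text = "Hello"
same_bytes = (
    text.encode("ascii") == text.encode("iso-8859-1") == text.encode("utf-8")
)
print(same_bytes)                          # True

print(can_encode("café", "ascii"))         # False: é is outside ASCII
print(can_encode("café", "iso-8859-1"))    # True
print(can_encode("€", "iso-8859-1"))       # False: no euro sign in Latin-1
print(can_encode("€", "utf-8"))            # True
```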

Common Encodings

  • UTF-8:
    • Variable-width encoding.
    • Default for web pages and most modern applications.
    • One to four bytes per character.
  • UTF-16:
    • Variable-width encoding.
    • One or two 16-bit code units per character.
    • Used in Windows internals.
  • UTF-32:
    • Fixed-width encoding.
    • One 32-bit code unit (four bytes) per character.
    • Less common due to larger size.
  • ISO-8859-1:
    • Single-byte encoding.
    • Limited to 256 characters.
  • Windows-1252:
    • Superset of the printable characters in ISO-8859-1.
    • Replaces the C1 control range (0x80 to 0x9F) with printable characters, including smart quotes and the euro sign.
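
To make the size differences concrete, here is a small Python sketch that measures how many bytes one character occupies in each Unicode encoding, and shows where Windows-1252 and ISO-8859-1 disagree. The function name byte_widths is hypothetical; the "-le" codec variants are used so that no BOM is counted.

```python
def byte_widths(ch: str) -> tuple[int, int, int]:
    """Return the byte length of `ch` in (UTF-8, UTF-16, UTF-32)."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),   # explicit byte order: no BOM added
        len(ch.encode("utf-32-le")),
    )


for ch in ["A", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X}", byte_widths(ch))
# U+0041 (1, 2, 4)
# U+00E9 (2, 2, 4)
# U+20AC (3, 2, 4)
# U+1F600 (4, 4, 4)   <- needs two UTF-16 code units (a surrogate pair)

# The same byte means different things in the two single-byte encodings.
smart_quote = b"\x93".decode("cp1252")     # left double quotation mark, U+201C
control_char = b"\x93".decode("latin-1")   # C1 control character, U+0093
```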

HTML and HTTP

<meta> Tag:

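Declare the document's encoding in HTML. Place the tag within the first 1024 bytes of the document, ideally as the first element in <head>.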

<meta charset="UTF-8">

HTTP Header:

Declare the encoding in the HTTP response headers. The header takes precedence over a <meta> declaration in the document.

Content-Type: text/html; charset=UTF-8
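
On the receiving side, the charset parameter has to be pulled out of the header value. One way to sketch this in Python is with the standard library's email.message.Message, which parses MIME-style headers. The function name charset_from_content_type and the UTF-8 fallback are assumptions made for this example.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` (an assumption of this sketch) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lowercase, or None.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=UTF-8"))       # utf-8
print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # utf-8 (fallback)
```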

Unicode BOM (Byte Order Mark)

UTF-8 BOM:

  • Optional in UTF-8 (bytes EF BB BF).
  • Acts only as an encoding signature; UTF-8 has no byte-order variants.
  • Not recommended for general use.

UTF-16 BOM:

  • Identifies the byte order (FE FF for big-endian, FF FE for little-endian) and signals that the data is UTF-16.
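
The BOM byte sequences are available as constants in Python's codecs module. The sketch below shows them and how the standard codecs treat a leading BOM; the variable names are illustrative only.

```python
import codecs

print(codecs.BOM_UTF8)       # b'\xef\xbb\xbf'
print(codecs.BOM_UTF16_BE)   # b'\xfe\xff'
print(codecs.BOM_UTF16_LE)   # b'\xff\xfe'

data = codecs.BOM_UTF8 + "hi".encode("utf-8")

# Plain "utf-8" keeps the BOM as a U+FEFF character at the start.
kept = data.decode("utf-8")          # '\ufeffhi'
# "utf-8-sig" strips a leading BOM if one is present.
stripped = data.decode("utf-8-sig")  # 'hi'

# The generic "utf-16" codec writes a BOM when encoding and uses it to
# pick the byte order when decoding; "utf-16-le"/"utf-16-be" write none.
with_bom = "hi".encode("utf-16")
without_bom = "hi".encode("utf-16-le")
big_endian = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
decoded = big_endian.decode("utf-16")  # 'hi'
```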

Character Encoding Detection

  • Auto-Detection:
    • When no encoding is declared, browsers may guess it from the document content. The guess can be wrong, so declare the encoding explicitly.
  • <meta> Tag:
    • Use <meta charset="..."> in HTML to specify encoding.
  • Byte Order Mark (BOM):
    • UTF-8, UTF-16, and UTF-32 data may begin with a BOM that identifies the encoding and, for UTF-16 and UTF-32, the byte order.
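
The detection sources above can be combined into a simple fallback chain. The Python sketch below is one possible ordering, not a reproduction of any browser's algorithm: it checks for a BOM first, then uses a declared encoding if the caller supplies one, then tries strict UTF-8, and finally falls back to Windows-1252. The function name guess_encoding, the declared parameter, and the Windows-1252 fallback are all assumptions of this example.

```python
import codecs
from typing import Optional

# UTF-32 BOMs are checked before UTF-16 because the UTF-32-LE BOM
# (FF FE 00 00) begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes, declared: Optional[str] = None) -> str:
    """Return a Python codec name that is a plausible choice for `data`."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name              # a BOM wins over any declaration
    if declared:
        return declared              # e.g. from an HTTP header or <meta> tag
    try:
        data.decode("utf-8")         # strict: fails on invalid sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "windows-1252"        # hypothetical legacy fallback


print(guess_encoding(codecs.BOM_UTF8 + b"hi"))        # utf-8-sig
print(guess_encoding("é".encode("utf-8")))            # utf-8
print(guess_encoding(b"caf\xe9"))                     # windows-1252
print(guess_encoding(b"abc", declared="iso-8859-1"))  # iso-8859-1
```

A content-based guess like the last two steps is only a heuristic: many byte sequences are valid in several single-byte encodings at once, which is why an explicit declaration is preferable.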

Tips

  • Consistency:
    • Ensure consistency between character set declarations in HTML, HTTP headers, and actual content.
  • Default to UTF-8:
    • Use UTF-8 for new projects, and for existing ones wherever constraints allow.
  • Encode/Decode Properly:
    • Use the language's encoding and decoding functions or libraries, and pass the encoding explicitly rather than relying on platform defaults.
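
As an example of these tips in Python, the sketch below writes and reads a file with an explicit encoding, shows the garbling ("mojibake") that results from decoding with the wrong encoding, and demonstrates the errors parameter. The file name sample.txt and the variable names are illustrative only.

```python
import os
import tempfile

text = "naïve café"

# Pass the encoding explicitly when opening text files.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()
print(round_trip == text)   # True

# Decoding UTF-8 bytes as Windows-1252 produces mojibake.
garbled = text.encode("utf-8").decode("windows-1252")
print(garbled)              # naÃ¯ve cafÃ©

# If nothing was lost, reversing the wrong step recovers the text.
repaired = garbled.encode("windows-1252").decode("utf-8")
print(repaired == text)     # True

# The errors parameter controls what happens with unrepresentable input.
lossy_decode = b"caf\xe9".decode("utf-8", errors="replace")   # 'caf\ufffd'
lossy_encode = "€".encode("ascii", errors="replace")          # b'?'
```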

This cheat sheet provides a quick reference to common character sets and encodings.