Here’s a cheat sheet for character sets (charsets) and character encodings:
Character Sets
- ASCII:
- 7-bit character set.
- Includes basic Latin alphabet, numerals, and control characters.
- Encoded as US-ASCII.
- ISO-8859-1 (Latin-1):
- 8-bit character set.
- Covers most Western European languages.
- Historically common in HTML documents; modern browsers treat the ISO-8859-1 label as Windows-1252.
- Other ISO-8859-x variations for different language sets.
- UTF-8:
- Variable-width encoding.
- Compatible with ASCII.
- Dominant character encoding on the web.
- Supports all Unicode characters.
- UTF-16:
- Variable-width encoding (one or two 16-bit code units per character).
- Supports the entire Unicode character set.
- Used internally by languages like JavaScript and Java.
- UTF-32:
- Fixed-width encoding.
- Supports the entire Unicode character set.
- Less common due to larger file sizes.
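As a rough illustration of the list above, the following Python sketch checks that ASCII text produces identical bytes in UTF-8, and that the single-byte sets cannot represent every Unicode character while the UTF encodings can. The helper name `can_encode` is made up for this example.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text can be represented in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# ASCII text yields exactly the same bytes in US-ASCII and UTF-8.
ascii_compatible = "Hello".encode("ascii") == "Hello".encode("utf-8")
print("UTF-8 is ASCII-compatible:", ascii_compatible)

# "é" is outside ASCII but inside ISO-8859-1; the euro sign is outside both.
for char in ("A", "é", "\u20ac"):
    for encoding in ("ascii", "iso-8859-1", "utf-8", "utf-16", "utf-32"):
        print(repr(char), encoding, can_encode(char, encoding))
```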
Common Encodings
- UTF-8:
- Variable-width encoding.
- Default for web pages and most modern applications.
- One to four bytes per character.
- UTF-16:
- Variable-width encoding.
- One or two 16-bit code units per character.
- Used in Windows internals.
- UTF-32:
- Fixed-width encoding.
- One 32-bit code unit (four bytes) per character.
- Less common due to larger size.
- ISO-8859-1:
- Single-byte encoding.
- Limited to 256 characters.
- Windows-1252:
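- Superset of the printable characters in ISO-8859-1.
- Uses bytes 0x80 to 0x9F for printable characters such as smart quotes and the euro sign, where ISO-8859-1 has control codes.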
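The sizes listed above can be checked with a short Python sketch. It uses the `-le` codec variants so that no BOM is counted in the lengths, and it also shows how the same byte decodes differently in Windows-1252 and ISO-8859-1. The helper name `encoded_size` is an assumption for illustration.

```python
def encoded_size(char: str, encoding: str) -> int:
    """Number of bytes needed to store a single character in encoding."""
    return len(char.encode(encoding))


# "A", "é", the euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for char in samples:
    sizes = {
        enc: encoded_size(char, enc)
        for enc in ("utf-8", "utf-16-le", "utf-32-le")
    }
    print(f"U+{ord(char):04X}", sizes)

# Byte 0x93 is a left smart quote in Windows-1252
# but a C1 control code in ISO-8859-1.
cp1252_char = b"\x93".decode("cp1252")
latin1_char = b"\x93".decode("iso-8859-1")
print(f"cp1252: U+{ord(cp1252_char):04X}, iso-8859-1: U+{ord(latin1_char):04X}")
```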
HTML and HTTP
- <meta> Tag:
- Set the character set in HTML: <meta charset="UTF-8">
- HTTP Header:
- Set the character set in HTTP response headers: Content-Type: text/html; charset=UTF-8
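On the receiving side, the charset parameter has to be read back out of the Content-Type header. A minimal Python sketch, assuming the header value is already available as a string, can lean on the standard library's `email.message.Message` parser; the function name `charset_from_content_type` is hypothetical.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lowercase, or default if none is declared.
    """
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_charset(default)


print(charset_from_content_type("text/html; charset=UTF-8"))
print(charset_from_content_type("text/html", default="unknown"))
```

Using an existing header parser avoids hand-written string splitting, which tends to miss quoted values and extra parameters.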
Unicode BOM (Byte Order Mark)
- UTF-8 BOM:
- Optional in UTF-8.
- Acts only as an encoding signature; UTF-8 has no byte-order variants.
- Not recommended for general use.
- UTF-16 BOM:
- Identifies byte order (big-endian or little-endian) and encoding.
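A BOM check can be sketched in Python with the constants from the standard `codecs` module. The function name `sniff_bom` is an assumption for this example. The UTF-32 BOMs are not covered in the text above; they are included because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one, so it has to be tested first.

```python
import codecs

# Longer BOMs come first: BOM_UTF32_LE starts with the bytes of BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


payload = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
encoding, skip = sniff_bom(payload)
if encoding is not None:
    # Decode the bytes after the BOM with the detected encoding.
    print(encoding, payload[skip:].decode(encoding))
```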
Character Encoding Detection
- Auto-Detection:
- Browsers may guess the encoding from the document content when none is declared; the guess can be wrong.
- <meta> Tag:
- Use <meta charset="..."> in HTML to specify encoding.
- Byte Order Mark (BOM):
- Some encodings include a BOM to indicate the byte order and encoding.
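To give a feel for how a declared encoding can be read from the document itself, here is a deliberately simplified Python sketch that searches the first 1024 bytes for a charset declaration in a <meta> tag. Real browsers follow a more elaborate prescan algorithm; the regular expression and the function name `meta_charset` are assumptions made for illustration.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def meta_charset(html_bytes: bytes):
    """Return the lowercase charset declared in a <meta> tag, or None."""
    match = _META_CHARSET.search(html_bytes[:1024])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


page = b'<!DOCTYPE html><html><head><meta charset="UTF-8"></head></html>'
print(meta_charset(page))
```

The search works on raw bytes because the encoding is not yet known at this point; the declaration itself is plain ASCII in every ASCII-compatible encoding.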
Tips
- Consistency:
- Ensure consistency between character set declarations in HTML, HTTP headers, and actual content.
- Default to UTF-8:
- Use UTF-8 for new projects and wherever else possible.
- Encode/Decode Properly:
- Use appropriate functions/libraries for encoding and decoding text in programming languages.
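The following Python sketch shows what goes wrong when declared and actual encodings disagree, and one possible way to decode defensively. The helper name `decode_text` and its fallback policy (substituting U+FFFD for undecodable bytes) are assumptions; other policies, such as failing loudly, may suit a given project better.

```python
def decode_text(data: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes strictly; on failure, replace bad bytes with U+FFFD."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return data.decode(encoding, errors="replace")


# Always name the encoding explicitly when converting text to bytes.
raw = "café".encode("utf-8")

# Decoding UTF-8 bytes as ISO-8859-1 raises no error but yields mojibake.
mojibake = raw.decode("iso-8859-1")
print(mojibake)

# As long as nothing was lost, the mix-up can be reversed.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")
print(repaired)

# ISO-8859-1 bytes are not valid UTF-8, so the fallback path is taken.
print(decode_text("café".encode("iso-8859-1")))
```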
This cheat sheet provides a quick reference to common character sets and encodings.