"Why is my 10-character text 30 bytes?" — Character count and byte count may sound similar, but they measure very different things. This article explains the difference and how encoding affects byte count.
Character Count vs. Byte Count
What Is Character Count?
Character count is simply the number of characters. Whether it's English or Japanese, each character counts as one.
Example: "Hello" → 5 characters, "こんにちは" → 5 characters
What Is Byte Count?
Byte count is the size of data. It represents how much storage a computer needs to save the text.
Key point: Different characters require different numbers of bytes.
| Character Type | UTF-8 Bytes | Shift-JIS Bytes |
|---|---|---|
| ASCII (A, 1, @) | 1 byte | 1 byte |
| Half-width katakana (ｱ, ｲ) | 3 bytes | 1 byte |
| Hiragana (あ, い) | 3 bytes | 2 bytes |
| Katakana (ア, イ) | 3 bytes | 2 bytes |
| Kanji (山, 川) | 3 bytes | 2 bytes |
| Emoji (😀) | 4 bytes | Not supported |
So the same 5 characters can have very different byte sizes:
- "Hello" → 5 bytes in UTF-8
- "こんにちは" → 15 bytes in UTF-8
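The difference is easy to check yourself. The following is a minimal Python sketch, not part of any particular tool; `byte_len` is a hypothetical helper name, and `"shift_jis"` is the name of Python's built-in Shift-JIS codec.

```python
def byte_len(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


for sample in ("Hello", "こんにちは"):
    # len() on a Python str counts Unicode code points, not bytes.
    print(
        sample,
        "chars:", len(sample),
        "utf-8:", byte_len(sample, "utf-8"),
        "shift_jis:", byte_len(sample, "shift_jis"),
    )
```

Both strings report 5 characters, but the Japanese string needs 15 bytes in UTF-8 and 10 bytes in Shift-JIS.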
What Is Encoding?
Encoding (character encoding) is the set of rules that converts characters into byte sequences a computer can process. Here are the most common ones.
UTF-8
The most widely used encoding today. It can represent virtually every character in every language.
- How it works: Variable-width, using 1 to 4 bytes per character; ASCII takes 1 byte, most Japanese characters take 3 bytes
- Used in: Websites, programming, email (current standard)
- Advantage: Universal language support; ASCII stays compact at 1 byte
Shift-JIS
An encoding developed specifically for Japanese text.
- How it works: ASCII takes 1 byte, Japanese characters take 2 bytes
- Used in: Legacy Japanese systems, CSV files for Excel
- Advantage: More byte-efficient for Japanese text (2 bytes vs. UTF-8's 3 bytes)
UTF-16
An encoding used internally by Windows and Java.
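- How it works: Most characters take 2 bytes; characters outside the Basic Multilingual Plane, such as most emoji, take 4 bytes (a surrogate pair)
- Used in: Windows internals, Java
- Advantage: Most everyday characters share the same 2-byte width, which simplifies processing as long as 4-byte surrogate pairs are handled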
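A short Python sketch of UTF-16 sizes, including a character that needs a surrogate pair. The helper name `utf16_byte_len` is an assumption for illustration. The `"utf-16-le"` codec is used because Python's plain `"utf-16"` codec prepends a 2-byte byte order mark, which would inflate the count.

```python
def utf16_byte_len(text: str) -> int:
    """Return the UTF-16 size of `text` in bytes, without a byte order mark."""
    return len(text.encode("utf-16-le"))


print(utf16_byte_len("A"))           # 2
print(utf16_byte_len("あ"))          # 2
print(utf16_byte_len("\U0001F600"))  # 4 (U+1F600 is the emoji 😀)
```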
When Byte Count Matters
Database Design
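Some databases measure column limits in bytes rather than characters; whether VARCHAR(255) means 255 bytes or 255 characters depends on the database and its settings. When the limit is 255 bytes and the data is stored as UTF-8, one Japanese character takes 3 bytes, so the column can hold at most 85 Japanese characters (255 ÷ 3).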
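One way to stay inside a byte-limited column is to trim text by bytes without cutting a character in half. This is a minimal Python sketch under the assumption that the limit really is in bytes; `truncate_to_bytes` is a hypothetical helper, not a database API.

```python
def truncate_to_bytes(text: str, max_bytes: int, encoding: str = "utf-8") -> str:
    """Return the longest prefix of `text` whose encoded size fits in `max_bytes`."""
    used = 0
    kept = []
    for ch in text:
        size = len(ch.encode(encoding))
        if used + size > max_bytes:
            break  # adding this character would exceed the limit
        used += size
        kept.append(ch)
    return "".join(kept)


print(len(truncate_to_bytes("あ" * 100, 255)))  # 85
```

Encoding character by character keeps the sketch independent of the encoding's internal layout, at the cost of some speed on very long strings.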
Input Forms
Some web forms enforce byte-count limits rather than character-count limits. Older banking and government systems in particular may use Shift-JIS-based byte limits.
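A form backed by such a system might validate input along these lines. This is only a sketch: `check_byte_limit` and its return shape are assumptions for illustration, and a real system may use a different Shift-JIS variant.

```python
def check_byte_limit(text: str, limit: int, encoding: str = "shift_jis"):
    """Return (ok, byte_count); byte_count is None if `text` cannot be encoded."""
    try:
        size = len(text.encode(encoding))
    except UnicodeEncodeError:
        # The encoding has no representation for some character (e.g. emoji).
        return (False, None)
    return (size <= limit, size)


print(check_byte_limit("山田太郎", 10))     # (True, 8)
print(check_byte_limit("\U0001F600", 10))  # (False, None)
```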
Email Subject Lines
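Email header lines, including the subject, are limited by RFC 5322 to 998 characters per line, with a recommended maximum of 78. Non-ASCII subjects such as Japanese must also be converted into ASCII "encoded-words" (RFC 2047), which makes them longer, so the encoded length matters more than the visible character count.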
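To see how much a Japanese subject grows once it is made safe for a mail header, here is a small sketch using Python's standard `email.header` module. The subject text is just an example value.

```python
from email.header import Header, decode_header, make_header

subject = "こんにちは"

# Header.encode() produces an ASCII-only RFC 2047 encoded-word.
encoded = Header(subject, "utf-8").encode()
print(encoded)
print("characters:", len(subject))
print("UTF-8 bytes:", len(subject.encode("utf-8")))
print("encoded header length:", len(encoded))

# Decoding the header recovers the original subject.
decoded = str(make_header(decode_header(encoded)))
```

The encoded form is longer than the raw UTF-8 bytes because it adds the charset label and a transfer encoding on top of them.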
CSV Files
CSV files intended for Excel in Japanese environments often use Shift-JIS encoding. Opening a UTF-8 CSV that has no byte order mark (BOM) in Excel can cause garbled text (mojibake).
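Two common ways to produce an Excel-friendly CSV are sketched below in Python: UTF-8 with a byte order mark (`"utf-8-sig"`) or the Windows variant of Shift-JIS (`"cp932"`). The helper `csv_bytes` and the sample rows are assumptions for illustration; which option works best depends on the Excel version and locale.

```python
import csv
import io


def csv_bytes(rows, encoding: str) -> bytes:
    """Serialize `rows` as CSV text and return it encoded with `encoding`."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue().encode(encoding)


rows = [["名前", "都市"], ["山田", "東京"]]

utf8_with_bom = csv_bytes(rows, "utf-8-sig")  # starts with the bytes EF BB BF
shift_jis = csv_bytes(rows, "cp932")

print(len(utf8_with_bom), len(shift_jis))  # 33 22
```

The same data is 33 bytes as UTF-8 with BOM and 22 bytes as Shift-JIS, which mirrors the 3-byte versus 2-byte difference per Japanese character.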
How to Check Byte Count
sakutto's character counter lets you check byte counts in three encodings simultaneously:
- UTF-8 bytes: For websites and programming
- Shift-JIS bytes: For legacy systems and Excel-compatible CSVs
- UTF-16 bytes: For Windows internal processing
Full-Width, Half-Width, and Byte Count
In Japanese computing, the concepts of "full-width" and "half-width" characters are closely related to byte count.
| Type | Shift-JIS Bytes | Examples |
|---|---|---|
| Half-width | 1 byte | A, 1, ｱ |
| Full-width | 2 bytes | Ａ, １, ア, あ, 漢 |
This "half-width = 1 byte, full-width = 2 bytes" relationship is based on Shift-JIS. In UTF-8, this correspondence does not hold.
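The relationship can be checked with Python's standard `unicodedata` module, which reports each character's East Asian width class ("Na" narrow, "H" half-width, "F" full-width, "W" wide). The helper `describe` is a hypothetical name; look-alike characters are written as escapes so the half-width and full-width forms cannot be confused.

```python
import unicodedata


def describe(ch: str):
    """Return (width class, Shift-JIS bytes, UTF-8 bytes) for one character."""
    return (
        unicodedata.east_asian_width(ch),
        len(ch.encode("shift_jis")),
        len(ch.encode("utf-8")),
    )


samples = [
    "A",       # ASCII letter
    "\uff71",  # half-width katakana ｱ
    "\uff21",  # full-width Latin Ａ
    "\u30a2",  # full-width katakana ア
    "あ",
    "漢",
]
for ch in samples:
    print(ch, describe(ch))
```

In the Shift-JIS column the half-width characters take 1 byte and the full-width ones take 2, while in the UTF-8 column everything except ASCII takes 3 bytes.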
FAQ
Why does Japanese take 3 bytes in UTF-8?
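UTF-8 was designed to stay compatible with ASCII, so code points U+0000 to U+007F take 1 byte. The 2-byte range covers only U+0080 to U+07FF (1,920 code points), which holds scripts such as Latin with accents, Greek, Cyrillic, Hebrew, and Arabic. Everything from U+0800 to U+FFFF takes 3 bytes, and hiragana, katakana, and the common kanji all sit in that range because the CJK repertoire is far too large to fit lower down.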
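The byte length of a character in UTF-8 follows directly from its code point. A minimal Python sketch of that rule, with `utf8_length` as a hypothetical helper name:

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 uses for the single character `ch`."""
    code_point = ord(ch)
    if code_point <= 0x7F:
        return 1  # ASCII
    if code_point <= 0x7FF:
        return 2  # e.g. Latin with accents, Greek, Cyrillic
    if code_point <= 0xFFFF:
        return 3  # rest of the Basic Multilingual Plane, including kana and common kanji
    return 4      # supplementary planes, including most emoji


for ch in ("A", "é", "あ", "山", "\U0001F600"):
    print(ch, hex(ord(ch)), utf8_length(ch))
```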
Should I use Shift-JIS or UTF-8?
UTF-8 is the standard for modern web development and programming. Use Shift-JIS only for Excel-compatible CSV files or when integrating with legacy systems. For new projects, always choose UTF-8.
How many bytes does an emoji take?
In UTF-8, most single-code-point emoji take 4 bytes. Shift-JIS cannot represent emoji at all. Compound emoji (such as those with skin tone modifiers or joined family emoji) combine multiple code points and consume even more bytes, even though they display as one symbol.
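A quick Python sketch of how compound emoji add up. The emoji are written as escape sequences so the exact code points are visible; the variable names are only for illustration.

```python
thumbs_up = "\U0001F44D"                 # one code point
with_skin_tone = "\U0001F44D\U0001F3FD"  # base emoji + skin tone modifier
# Three emoji joined by two zero-width joiners (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

for label, text in (
    ("thumbs up", thumbs_up),
    ("with skin tone", with_skin_tone),
    ("family", family),
):
    print(label, "code points:", len(text), "UTF-8 bytes:", len(text.encode("utf-8")))
```

The three strings take 4, 8, and 18 bytes in UTF-8, although each is typically rendered as a single symbol.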
How can I save bytes in forms with byte limits?
To reduce byte count, convert full-width characters to half-width. For example, changing full-width numbers "１２３" (9 bytes in UTF-8) to half-width "123" (3 bytes) saves 6 bytes.
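The conversion can be automated. The sketch below maps full-width ASCII variants (U+FF01 to U+FF5E) and the ideographic space (U+3000) to their half-width equivalents and leaves kana and kanji untouched. `FULL_TO_HALF` and `to_half_width_ascii` are hypothetical names for illustration.

```python
# Full-width ASCII variants sit exactly 0xFEE0 above their ASCII counterparts.
FULL_TO_HALF = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20  # ideographic space -> ASCII space


def to_half_width_ascii(text: str) -> str:
    """Replace full-width letters, digits, and symbols with ASCII equivalents."""
    return text.translate(FULL_TO_HALF)


before = "\uff11\uff12\uff13"  # full-width digits １２３
after = to_half_width_ascii(before)
print(after, len(before.encode("utf-8")), "->", len(after.encode("utf-8")))  # 123 9 -> 3
```

Python's `unicodedata.normalize("NFKC", text)` performs a similar conversion, but it also turns half-width katakana into full-width katakana, which would increase the byte count in Shift-JIS.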
Summary
Character count and byte count are different concepts, and byte count varies depending on the encoding. In UTF-8, one Japanese character takes 3 bytes; in Shift-JIS, it takes 2. sakutto's character counter lets you check byte counts across multiple encodings at once.