Character Count vs. Byte Count

"Why is my 10-character text 30 bytes?" — Character count and byte count may sound similar, but they measure very different things. This article explains the difference and how encoding affects byte count.

What Is Character Count?

Character count is simply the number of characters. Whether it's English or Japanese, each character counts as one.

Example: "Hello" → 5 characters, "こんにちは" → 5 characters

What Is Byte Count?

Byte count is the size of data. It represents how much storage a computer needs to save the text.

Key point: Different characters require different numbers of bytes.

Character Type	UTF-8 Bytes	Shift-JIS Bytes
ASCII (A, 1, @)	1 byte	1 byte
Half-width katakana (ｱ, ｲ)	3 bytes	1 byte
Hiragana (あ, い)	3 bytes	2 bytes
Katakana (ア, イ)	3 bytes	2 bytes
Kanji (山, 川)	3 bytes	2 bytes
Emoji (😀)	4 bytes	Not supported
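The figures in the table can be checked with a few lines of Python. This is a minimal sketch: `byte_count` is a hypothetical helper name, and it assumes Python's built-in `utf-8` and `shift_jis` codecs are an acceptable stand-in for the encodings discussed here.

```python
from typing import Optional


def byte_count(text: str, encoding: str) -> Optional[int]:
    """Bytes needed for `text` in `encoding`, or None if it cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None


# One sample character per row of the table above.
samples = [
    ("ASCII", "A"),
    ("Half-width katakana", "ｱ"),
    ("Hiragana", "あ"),
    ("Katakana", "ア"),
    ("Kanji", "山"),
    ("Emoji", "\U0001F600"),
]

for label, ch in samples:
    # len(ch) is the character (code point) count; it is 1 for every sample.
    print(label, len(ch), byte_count(ch, "utf-8"), byte_count(ch, "shift_jis"))
```

The emoji row yields None for Shift-JIS because the codec raises an error instead of producing bytes.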

So the same 5 characters can have very different byte sizes:

"Hello" → 5 bytes in UTF-8
"こんにちは" → 15 bytes in UTF-8

What Is Encoding?

Encoding (character encoding) is the set of rules that converts characters into byte sequences a computer can process. Here are the most common ones.

UTF-8

The most widely used encoding today. It can represent virtually every character in every language.

How it works: ASCII takes 1 byte, Japanese characters take 3 bytes
Used in: Websites, programming, email (current standard)
Advantage: Universal language support; ASCII stays compact at 1 byte

Shift-JIS

An encoding developed specifically for Japanese text.

How it works: ASCII takes 1 byte, Japanese characters take 2 bytes
Used in: Legacy Japanese systems, CSV files for Excel
Advantage: More byte-efficient for Japanese text (2 bytes vs. UTF-8's 3 bytes)

UTF-16

An encoding used internally by Windows and Java.

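How it works: Most characters take 2 bytes; characters outside the Basic Multilingual Plane, such as most emoji, take 4 bytes (a surrogate pair)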
Used in: Windows internals, Java
Advantage: Near-fixed-width, making processing simpler
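As a rough illustration of how UTF-16 sizes behave, the sketch below counts bytes with Python's `utf-16-le` codec. The function name `utf16_bytes` is an assumption made for this example; the little-endian variant is chosen because the plain `utf-16` codec prepends a 2-byte byte order mark, which would inflate the count.

```python
def utf16_bytes(text: str) -> int:
    """Bytes in UTF-16 (little-endian, no byte order mark)."""
    return len(text.encode("utf-16-le"))


ascii_size = utf16_bytes("A")            # 2 bytes, even for ASCII
kana_size = utf16_bytes("あ")            # 2 bytes
emoji_size = utf16_bytes("\U0001F600")   # 4 bytes: one surrogate pair

# The plain "utf-16" codec adds a 2-byte byte order mark in front.
with_bom = len("A".encode("utf-16"))

print(ascii_size, kana_size, emoji_size, with_bom)
```

Note that ASCII text doubles in size compared with UTF-8, while Japanese text shrinks from 3 bytes to 2 bytes per character.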

When Byte Count Matters

Database Design

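Column length limits such as VARCHAR(255) may be counted in bytes or in characters, depending on the database and its settings. For example, Oracle's VARCHAR2(255) is limited to 255 bytes by default, while VARCHAR(255) in MySQL and PostgreSQL allows 255 characters. When the limit is in bytes and the data is stored as UTF-8, one Japanese character takes 3 bytes, so a 255-byte column can hold at most 85 Japanese characters.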
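When a column is limited in bytes, text usually has to be trimmed without cutting a multi-byte character in half. The following is a minimal sketch of one way to do that in Python; `truncate_to_bytes` is a hypothetical helper, and it assumes the column stores UTF-8.

```python
def truncate_to_bytes(text: str, max_bytes: int, encoding: str = "utf-8") -> str:
    """Return the longest prefix of `text` that fits in `max_bytes` bytes."""
    encoded = text.encode(encoding)
    if len(encoded) <= max_bytes:
        return text
    # Cutting the byte string may leave a partial character at the end;
    # errors="ignore" drops that incomplete tail when decoding.
    return encoded[:max_bytes].decode(encoding, errors="ignore")


long_text = "あ" * 100                       # 100 characters, 300 bytes in UTF-8
fitted = truncate_to_bytes(long_text, 255)
print(len(fitted), len(fitted.encode("utf-8")))
```

With a 255-byte limit, the result holds 85 hiragana characters, matching the arithmetic above (255 ÷ 3 = 85).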

Input Forms

Some web forms enforce byte-count limits rather than character-count limits. Older banking and government systems in particular may use Shift-JIS-based byte limits.
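A byte-limited form field can be pre-checked before submission. The sketch below assumes a hypothetical field limited to a given number of Shift-JIS bytes; `fits_shift_jis_limit` is an illustrative name, and real systems may use a Shift-JIS variant with a slightly different character set.

```python
def fits_shift_jis_limit(text: str, max_bytes: int) -> bool:
    """True if `text` can be encoded as Shift-JIS within `max_bytes` bytes."""
    try:
        size = len(text.encode("shift_jis"))
    except UnicodeEncodeError:
        # Characters such as emoji have no Shift-JIS representation.
        return False
    return size <= max_bytes


# Four kanji need 8 bytes, so an 8-byte field accepts them and a 7-byte one does not.
print(fits_shift_jis_limit("山田太郎", 8))
print(fits_shift_jis_limit("山田太郎", 7))
```

Treating unencodable characters as a failure, rather than silently replacing them, makes the rejection visible to the caller.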

Email Subject Lines

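Email headers are limited by line length under RFC 5322: each line must be no longer than 998 characters and should be no longer than 78. A Japanese subject must also be encoded into ASCII using RFC 2047 encoded-words, each at most 75 characters long, so the encoded size grows much faster than the visible character count.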
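To see how a Japanese subject grows once it is prepared for an email header, the sketch below uses Python's standard `email.header` module. The subject text is an arbitrary example, and UTF-8 is assumed as the charset.

```python
from email.header import Header, decode_header, make_header

subject = "こんにちは"  # 5 characters, 15 bytes in UTF-8

# Encode the subject as an ASCII-only header value (RFC 2047 encoded-word).
encoded = Header(subject, "utf-8").encode()

# Decode it again to confirm that nothing was lost.
decoded = str(make_header(decode_header(encoded)))

print(len(subject), len(encoded))
```

The encoded form is longer than the original because it carries the charset name, an encoding marker, and the ASCII-encoded bytes.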

CSV Files

CSV files intended for Excel, particularly in Japanese environments, often use Shift-JIS encoding. Opening a UTF-8 CSV that has no byte order mark (BOM) in Excel can cause garbled text (mojibake).
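Two common ways to write a CSV that Excel is likely to read correctly are UTF-8 with a byte order mark, or the Windows variant of Shift-JIS. The sketch below shows both in Python; the file names, the sample rows, and the `write_csv` helper are assumptions made for illustration.

```python
import csv
import os
import tempfile

rows = [["名前", "都市"], ["山田", "東京"]]


def write_csv(path: str, data: list, encoding: str) -> None:
    """Write `data` as CSV using the given text encoding."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(data)


out_dir = tempfile.mkdtemp()
bom_path = os.path.join(out_dir, "utf8_bom.csv")
sjis_path = os.path.join(out_dir, "sjis.csv")

write_csv(bom_path, rows, "utf-8-sig")  # UTF-8 preceded by a byte order mark
write_csv(sjis_path, rows, "cp932")     # Windows variant of Shift-JIS

print(os.path.getsize(bom_path), os.path.getsize(sjis_path))
```

The Shift-JIS file is smaller for the same rows, but it cannot hold characters outside its repertoire, such as emoji.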

Free Tool

Character Counter

Count characters, words, lines, and bytes in real time. Great for social media posts and reports.

Try it now →

How to Check Byte Count

sakutto's character counter lets you check byte counts in three encodings simultaneously:

UTF-8 bytes: For websites and programming
Shift-JIS bytes: For legacy systems and Excel-compatible CSVs
UTF-16 bytes: For Windows internal processing

Full-Width, Half-Width, and Byte Count

In Japanese computing, the concepts of "full-width" and "half-width" characters are closely related to byte count.

Type	Shift-JIS Bytes	Examples
Half-width	1 byte	A, 1, ｱ
Full-width	2 bytes	Ａ, １, ア, あ, 漢

This "half-width = 1 byte, full-width = 2 bytes" relationship is based on Shift-JIS. In UTF-8, this correspondence does not hold: half-width katakana, for example, take 3 bytes each, the same as full-width katakana.
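The difference between the two encodings can be seen by encoding the same characters both ways. This is a small sketch using Python's built-in codecs; `byte_widths` is a hypothetical helper name.

```python
def byte_widths(ch: str) -> tuple:
    """Return (Shift-JIS bytes, UTF-8 bytes) for a single character."""
    return len(ch.encode("shift_jis")), len(ch.encode("utf-8"))


half_width = ["A", "1", "ｱ"]
full_width = ["Ａ", "１", "ア", "あ", "漢"]

# Shift-JIS follows the half-width = 1, full-width = 2 pattern.
sjis_half = [byte_widths(ch)[0] for ch in half_width]
sjis_full = [byte_widths(ch)[0] for ch in full_width]

# UTF-8 does not: half-width katakana already needs 3 bytes.
utf8_half = [byte_widths(ch)[1] for ch in half_width]

print(sjis_half, sjis_full, utf8_half)
```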

FAQ

Why does Japanese take 3 bytes in UTF-8?

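UTF-8 was designed to stay compatible with ASCII, so code points U+0000 to U+007F take 1 byte. Beyond that, the byte length depends only on the code point value: U+0080 to U+07FF take 2 bytes, U+0800 to U+FFFF take 3 bytes, and U+10000 and above take 4 bytes. Hiragana, katakana, and the common kanji are all assigned code points in the U+0800 to U+FFFF range, so they fall into the 3-byte group.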
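The byte length of a character in UTF-8 follows directly from its code point. The sketch below reproduces that rule by hand and compares it with Python's own encoder; `utf8_length` is a hypothetical name, and surrogate code points (which cannot be encoded) are not handled.

```python
def utf8_length(ch: str) -> int:
    """UTF-8 byte length of one character, derived from its code point."""
    cp = ord(ch)
    if cp <= 0x7F:       # ASCII
        return 1
    if cp <= 0x7FF:      # e.g. Latin letters with accents, Greek, Cyrillic
        return 2
    if cp <= 0xFFFF:     # rest of the Basic Multilingual Plane, incl. kana and kanji
        return 3
    return 4             # supplementary planes, incl. most emoji


for ch in ["A", "é", "あ", "山", "\U0001F600"]:
    # The hand-written rule agrees with the real encoder.
    print(hex(ord(ch)), utf8_length(ch), len(ch.encode("utf-8")))
```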

Should I use Shift-JIS or UTF-8?

UTF-8 is the standard for modern web development and programming. Use Shift-JIS only for Excel-compatible CSV files or when integrating with legacy systems. For new projects, always choose UTF-8.

How many bytes does an emoji take?

In UTF-8, most emoji take 4 bytes, because their code points are above U+FFFF. Shift-JIS cannot represent emoji at all. Compound emoji (such as those with skin tone modifiers, or several emoji joined by a zero-width joiner) combine multiple code points and consume even more bytes.
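The effect of compound emoji can be measured directly. In this sketch the emoji are written as escape sequences so the code points are explicit; `describe` is a hypothetical helper name.

```python
def describe(text: str) -> tuple:
    """Return (code point count, UTF-8 byte count) for `text`."""
    return len(text), len(text.encode("utf-8"))


single = "\U0001F600"                      # grinning face
skin_tone = "\U0001F44D\U0001F3FD"         # thumbs up + medium skin tone modifier
# man + zero-width joiner + woman + zero-width joiner + girl
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

for emoji in (single, skin_tone, family):
    print(describe(emoji))
```

Each of these is displayed as one symbol, yet the skin-tone variant needs 8 bytes and the family sequence 18 bytes, since each zero-width joiner adds 3 bytes of its own.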

How can I save bytes in forms with byte limits?

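To reduce byte count, convert full-width alphanumerics and symbols to half-width. For example, changing full-width numbers "１２３" (9 bytes in UTF-8) to half-width "123" (3 bytes) saves 6 bytes. In UTF-8 this saving does not extend to katakana, because half-width katakana still take 3 bytes each.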
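One way to perform this conversion is a fixed translation table, since the full-width forms U+FF01 to U+FF5E sit at a constant offset from ASCII U+0021 to U+007E. The sketch below is an illustration under that assumption; `to_half_width` and `FULL_TO_HALF` are hypothetical names, and katakana is deliberately left untouched.

```python
# Map full-width ASCII variants (U+FF01..U+FF5E) to ASCII (U+0021..U+007E).
FULL_TO_HALF = {0xFF01 + i: 0x21 + i for i in range(94)}
# The ideographic (full-width) space maps to an ordinary space.
FULL_TO_HALF[0x3000] = 0x20


def to_half_width(text: str) -> str:
    """Convert full-width alphanumerics and symbols to their ASCII forms."""
    return text.translate(FULL_TO_HALF)


before = "１２３"
after = to_half_width(before)
print(len(before.encode("utf-8")), len(after.encode("utf-8")))
```

Python's `unicodedata.normalize("NFKC", text)` performs a similar conversion, but it also changes other characters (for instance, it turns half-width katakana into full-width), so a narrow table like this one keeps the change predictable.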

Summary

Character count and byte count are different concepts, and byte count varies depending on the encoding. In UTF-8, one Japanese character takes 3 bytes; in Shift-JIS, it takes 2. sakutto's character counter lets you check byte counts across multiple encodings at once.

