Character Count vs. Byte Count | UTF-8 and Shift-JIS Explained

Tags: character count, byte count, UTF-8, Shift-JIS, encoding

"Why is my 10-character text 30 bytes?" — Character count and byte count may sound similar, but they measure very different things. This article explains the difference and how encoding affects byte count.

Character Count vs. Byte Count

What Is Character Count?

Character count is simply the number of characters. Whether it's English or Japanese, each character counts as one.

Example: "Hello" → 5 characters, "こんにちは" → 5 characters

What Is Byte Count?

Byte count is the size of data. It represents how much storage a computer needs to save the text.

Key point: Different characters require different numbers of bytes.

| Character Type | UTF-8 Bytes | Shift-JIS Bytes |
| --- | --- | --- |
| ASCII (A, 1, @) | 1 byte | 1 byte |
| Half-width katakana (ｱ, ｲ) | 3 bytes | 1 byte |
| Hiragana (あ, い) | 3 bytes | 2 bytes |
| Katakana (ア, イ) | 3 bytes | 2 bytes |
| Kanji (山, 川) | 3 bytes | 2 bytes |
| Emoji (😀) | 4 bytes | Not supported |

So the same 5 characters can have very different byte sizes:

  • "Hello" → 5 bytes in UTF-8
  • "こんにちは" → 15 bytes in UTF-8
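The difference is easy to check in code. Below is a minimal Python sketch; the helper name `char_and_byte_counts` is made up for this illustration. It treats "character count" as the number of Unicode code points, which is what Python's `len()` reports for a string.

```python
def char_and_byte_counts(text: str):
    """Return (character count, UTF-8 byte count) for text."""
    # len() on a str counts Unicode code points, not bytes.
    # Encoding to UTF-8 first gives the size the text takes when stored.
    return len(text), len(text.encode("utf-8"))


for sample in ("Hello", "こんにちは"):
    chars, size = char_and_byte_counts(sample)
    print(f"{sample}: {chars} characters, {size} bytes in UTF-8")
```

Both samples report 5 characters, but the byte counts differ (5 and 15).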

What Is Encoding?

Encoding (character encoding) is the set of rules that converts characters into byte sequences a computer can process. Here are the most common ones.

UTF-8

The most widely used encoding today. It can represent virtually every character in every language.

  • How it works: Variable-width, using 1 to 4 bytes per character. ASCII takes 1 byte; most Japanese characters take 3 bytes
  • Used in: Websites, programming, email (current standard)
  • Advantage: Universal language support; ASCII stays compact at 1 byte

Shift-JIS

An encoding developed specifically for Japanese text.

  • How it works: ASCII takes 1 byte, Japanese characters take 2 bytes
  • Used in: Legacy Japanese systems, CSV files for Excel
  • Advantage: More byte-efficient for Japanese text (2 bytes vs. UTF-8's 3 bytes)

UTF-16

An encoding used internally by Windows and Java.

  • How it works: Most characters take 2 bytes; characters outside the Basic Multilingual Plane, such as most emoji, take 4 bytes
  • Used in: Windows internals, Java
  • Advantage: Near-fixed-width, making processing simpler
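UTF-16 is "near" fixed-width rather than fixed-width because characters beyond U+FFFF are stored as a pair of 2-byte units (a surrogate pair). A small Python sketch of this, where `utf16_byte_count` is a hypothetical helper name:

```python
def utf16_byte_count(text: str) -> int:
    """UTF-16 size of text in bytes, without a byte order mark."""
    # The "utf-16-le" codec is used because plain "utf-16" prepends
    # a 2-byte byte order mark, which would inflate the count.
    return len(text.encode("utf-16-le"))


print(utf16_byte_count("A"))   # ASCII letter: one 2-byte unit
print(utf16_byte_count("あ"))  # hiragana: one 2-byte unit
print(utf16_byte_count("😀"))  # emoji beyond U+FFFF: two 2-byte units
```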

When Byte Count Matters

Database Design

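Database columns have length limits such as VARCHAR(255). Whether that number counts bytes or characters depends on the database and its settings: Oracle's VARCHAR2 counts bytes by default, for example, while MySQL and PostgreSQL count characters. Where the limit is in bytes, one Japanese character takes 3 bytes in UTF-8, so a 255-byte column can hold at most 85 Japanese characters (255 ÷ 3 = 85).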
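When a limit is expressed in bytes, text has to be cut on a character boundary, not in the middle of a multi-byte character. The following Python sketch shows one way to do that; `truncate_to_bytes` is a hypothetical helper, and it assumes the input is valid text in the chosen encoding.

```python
def truncate_to_bytes(text: str, max_bytes: int, encoding: str = "utf-8") -> str:
    """Cut text so its encoded form fits in max_bytes without splitting a character."""
    encoded = text.encode(encoding)[:max_bytes]
    # The byte slice may end in the middle of a character;
    # errors="ignore" drops that trailing partial character.
    return encoded.decode(encoding, errors="ignore")


long_text = "あ" * 100  # 100 characters, 300 bytes in UTF-8
print(len(truncate_to_bytes(long_text, 255)))  # characters that fit in 255 bytes
```

With a 255-byte limit, 85 of the 100 characters survive, matching the arithmetic above.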

Input Forms

Some web forms enforce byte-count limits rather than character-count limits. Older banking and government systems in particular may use Shift-JIS-based byte limits.

Email Subject Lines

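Email header lines, including the subject, have length limits: RFC 5322 caps each line at 998 characters and recommends at most 78. A Japanese subject must also be converted into ASCII-only "encoded words" (RFC 2047), each limited to 75 characters, so the encoded subject is considerably longer than the visible text. For Japanese emails, byte count becomes especially important.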
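To see how much a Japanese subject grows once it is encoded for transport, here is a short sketch using Python's standard `email.header` module. The function name `encode_subject` is an assumption for this example, and real mail libraries may choose a different charset or encoding.

```python
from email.header import Header


def encode_subject(subject: str) -> str:
    """Encode a subject line as ASCII-only RFC 2047 encoded-word(s)."""
    return Header(subject, "utf-8").encode()


subject = "こんにちは"
encoded = encode_subject(subject)
print(encoded)
print(len(subject), "visible characters,", len(encoded), "ASCII characters in the header")
```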

CSV Files

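CSV files intended for Excel have traditionally used Shift-JIS encoding. Opening a UTF-8 CSV that has no byte order mark (BOM) in Japanese-locale Excel can cause garbled text (mojibake), because Excel may interpret the bytes as Shift-JIS.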
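As an illustration, the Python sketch below writes the same rows in two encodings commonly used for Excel-bound files: UTF-8 with a BOM (`utf-8-sig`) and `cp932`, Microsoft's variant of Shift-JIS. The file names, the sample rows, and the `write_csv` helper are invented for this example, and how a given Excel version reacts to each file is not something the code can verify.

```python
import csv
import os
import tempfile

rows = [["name", "city"], ["山田", "東京"]]


def write_csv(path: str, encoding: str) -> None:
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding=encoding, newline="") as f:
        csv.writer(f).writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    bom_path = os.path.join(tmp, "utf8_bom.csv")
    sjis_path = os.path.join(tmp, "sjis.csv")
    write_csv(bom_path, "utf-8-sig")  # UTF-8 with a leading byte order mark
    write_csv(sjis_path, "cp932")     # Shift-JIS as used on Windows
    with open(bom_path, "rb") as f:
        bom_bytes = f.read()
    with open(sjis_path, "rb") as f:
        sjis_bytes = f.read()

print(bom_bytes[:3])                     # the BOM bytes at the start of the file
print(len(bom_bytes), len(sjis_bytes))   # same rows, different file sizes
```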

Free Tool

Character Counter

Count characters, words, lines, and bytes in real time. Great for social media posts and reports.

Try it now →

How to Check Byte Count

sakutto's character counter lets you check byte counts in three encodings simultaneously:

  • UTF-8 bytes: For websites and programming
  • Shift-JIS bytes: For legacy systems and Excel-compatible CSVs
  • UTF-16 bytes: For Windows internal processing
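If you want the same three numbers from a script, a rough equivalent can be sketched in Python. This is not how sakutto's tool is implemented; `byte_counts` is a hypothetical helper that relies on Python's built-in codecs and reports `None` when an encoding cannot represent the text.

```python
from typing import Dict, Optional


def byte_counts(text: str) -> Dict[str, Optional[int]]:
    """Byte count per encoding; None when the text cannot be encoded."""
    # "utf-16-le" is used so the count excludes a byte order mark.
    codecs_by_label = {"UTF-8": "utf-8", "Shift-JIS": "shift_jis", "UTF-16": "utf-16-le"}
    counts: Dict[str, Optional[int]] = {}
    for label, codec in codecs_by_label.items():
        try:
            counts[label] = len(text.encode(codec))
        except UnicodeEncodeError:
            counts[label] = None  # e.g. emoji in Shift-JIS
    return counts


for sample in ("Hello", "こんにちは", "😀"):
    print(sample, byte_counts(sample))
```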

Full-Width, Half-Width, and Byte Count

In Japanese computing, the concepts of "full-width" and "half-width" characters are closely related to byte count.

| Type | Shift-JIS Bytes | Examples |
| --- | --- | --- |
| Half-width | 1 byte | A, 1, ｱ |
| Full-width | 2 bytes | Ａ, １, ア, あ, 漢 |

This "half-width = 1 byte, full-width = 2 bytes" relationship is based on Shift-JIS. In UTF-8, this correspondence does not hold: half-width katakana and full-width katakana both take 3 bytes.
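The relationship can be inspected character by character. The sketch below uses Python's `unicodedata.east_asian_width`, which reports Unicode's width class for a character ("Na" and "H" are narrow or half-width, "F" and "W" are full-width or wide), next to the byte counts in both encodings. The helper name `width_and_bytes` is made up for this example.

```python
import unicodedata


def width_and_bytes(ch: str):
    """Return (East Asian Width class, Shift-JIS bytes, UTF-8 bytes) for one character."""
    return (
        unicodedata.east_asian_width(ch),
        len(ch.encode("shift_jis")),
        len(ch.encode("utf-8")),
    )


# Half-width examples first, then their full-width counterparts.
for ch in ["A", "ｱ", "Ａ", "ア", "あ", "漢"]:
    print(ch, *width_and_bytes(ch))
```

In the Shift-JIS column the half-width characters come out as 1 byte and the full-width ones as 2, while the UTF-8 column shows 3 bytes for everything except ASCII.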

FAQ

Why does Japanese take 3 bytes in UTF-8?

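UTF-8 was designed to represent ASCII efficiently in 1 byte. The byte length of every other character follows from its Unicode code point: U+0080 to U+07FF take 2 bytes, U+0800 to U+FFFF take 3 bytes, and U+10000 and above take 4 bytes. Hiragana, katakana, and the common kanji are all assigned code points between U+3000 and U+9FFF, so they fall in the 3-byte range.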
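In UTF-8, the number of bytes is determined entirely by the character's Unicode code point. The Python sketch below applies the standard UTF-8 range boundaries and cross-checks the result against Python's own encoder; `utf8_length` is a hypothetical helper name.

```python
def utf8_length(ch: str) -> int:
    """Bytes UTF-8 needs for one character, derived from its code point."""
    cp = ord(ch)
    if cp <= 0x7F:      # U+0000 to U+007F: ASCII
        return 1
    if cp <= 0x7FF:     # U+0080 to U+07FF: e.g. accented Latin, Greek, Cyrillic
        return 2
    if cp <= 0xFFFF:    # U+0800 to U+FFFF: includes kana and common kanji
        return 3
    return 4            # U+10000 and above: e.g. most emoji


for ch in ["A", "é", "あ", "山", "😀"]:
    # Cross-check the range rule against the real encoder.
    assert utf8_length(ch) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} {ch}: {utf8_length(ch)} bytes")
```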

Should I use Shift-JIS or UTF-8?

UTF-8 is the standard for modern web development and programming. Use Shift-JIS only for Excel-compatible CSV files or when integrating with legacy systems. For new projects, always choose UTF-8.

How many bytes does an emoji take?

In UTF-8, most emoji take 4 bytes. Shift-JIS cannot represent emoji at all. Compound emoji combine multiple code points and consume even more bytes: a thumbs-up with a skin tone modifier, for example, is 2 code points and 8 bytes.
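A quick way to see why compound emoji are larger is to count their code points. In this Python sketch the emoji are written as escape sequences so the individual code points stay visible; `describe` is a hypothetical helper, and the three samples are chosen only for illustration.

```python
def describe(text: str):
    """Return (number of code points, UTF-8 byte count)."""
    return len(text), len(text.encode("utf-8"))


samples = {
    "grinning face": "\U0001F600",
    # Base emoji followed by a skin tone modifier.
    "thumbs up + skin tone": "\U0001F44D\U0001F3FD",
    # Three emoji joined by zero-width joiners (U+200D).
    "family": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
}

for name, text in samples.items():
    code_points, size = describe(text)
    print(f"{name}: {code_points} code point(s), {size} bytes in UTF-8")
```

Each of these displays as a single symbol, yet the family example is 5 code points and 18 bytes, since each joiner adds 3 bytes on top of the three 4-byte emoji.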

How can I save bytes in forms with byte limits?

To reduce byte count, convert full-width letters, digits, and symbols to half-width. For example, changing full-width numbers "１２３" (9 bytes in UTF-8) to half-width "123" (3 bytes) saves 6 bytes.
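One way to do this conversion in code is Unicode NFKC normalization, sketched below in Python with a hypothetical `to_halfwidth` helper. Treat it as a starting point rather than a drop-in converter: NFKC rewrites other compatibility characters too, and it turns half-width katakana into full-width katakana, which goes in the opposite direction for Shift-JIS byte limits.

```python
import unicodedata


def to_halfwidth(text: str) -> str:
    """Fold full-width letters, digits, and symbols to ASCII via NFKC."""
    return unicodedata.normalize("NFKC", text)


before = "１２３ＡＢＣ"
after = to_halfwidth(before)
print(before, len(before.encode("utf-8")), "bytes")
print(after, len(after.encode("utf-8")), "bytes")

# Caveat: half-width katakana is widened, not preserved.
print(to_halfwidth("ｱ"))
```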

Summary

Character count and byte count are different concepts, and byte count varies depending on the encoding. In UTF-8, one Japanese character takes 3 bytes; in Shift-JIS, it takes 2. sakutto's character counter lets you check byte counts across multiple encodings at once.
