They Look the Same, but They're Different - Unicode's Visual Attacks
Do apple.com and аpple.com look the same to you? They're actually different. The first "а" in the second URL is not the Latin letter "a" but the Cyrillic letter "а" (U+0430). Indistinguishable to the human eye, but to a computer they are completely different characters - and therefore different domains.
Unicode is a groundbreaking system that can handle characters from around the world uniformly, but its rich character set also creates security threats.
Homograph Attacks - How Fake Domains Are Created
A homograph attack (IDN homograph attack) creates fake domain names using different characters that look identical (or very similar).
- Latin "a" (U+0061) and Cyrillic "а" (U+0430)
- Latin "o" (U+006F) and Cyrillic "о" (U+043E)
- Latin "p" (U+0070) and Cyrillic "р" (U+0440)
- Latin "e" (U+0065) and Cyrillic "е" (U+0435)
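Combining these, an attacker can build аррӏе.com entirely from Cyrillic characters (with the Cyrillic palochka "ӏ", U+04CF, standing in for "l", which has no look-alike in the list above). In many fonts the result is nearly indistinguishable from apple.com. Used as a phishing site URL, it could deceive even careful users.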
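To see the difference that the eye misses, it helps to print the code point and Unicode name of each character. The following is a minimal sketch using Python's standard unicodedata module; the helper name describe is hypothetical and chosen only for illustration.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

latin = "apple"
spoof = "\u0430pple"  # Cyrillic "а" (U+0430) followed by Latin "pple"

# The two strings render alike but are not equal.
print(latin == spoof)  # False

for row in describe(spoof):
    print(row)
```

The first row of the output names the leading character as a Cyrillic letter, which is exactly the information a visual inspection cannot provide.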
Browser Countermeasures
Major browsers counter homograph attacks by displaying domains that mix scripts (multiple writing systems) in Punycode notation. For example, аpple.com (Cyrillic "а" followed by Latin "pple") is displayed as xn--pple-43d.com, making the deception immediately obvious.
However, when the entire domain consists of a single script (e.g., all Cyrillic), some browsers may display it as-is without converting to Punycode.
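The conversion and the mixed-script check can be approximated in a few lines. This is a rough sketch, not how any browser actually implements its policy: it assumes Python's built-in "idna" codec (which follows the older IDNA 2003 rules) and guesses each letter's script from the first word of its Unicode name. Real implementations rely on proper Unicode script data, such as that described in Unicode Technical Standard #39. The function names to_ascii, scripts_used, and looks_mixed are hypothetical.

```python
import unicodedata

def to_ascii(domain):
    """Convert a domain to its ASCII (Punycode) form with the built-in IDNA codec."""
    return domain.encode("idna").decode("ascii")

def scripts_used(label):
    """Heuristic: treat the first word of a letter's Unicode name as its script."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

def looks_mixed(domain):
    """True if any dot-separated label contains letters from more than one script."""
    return any(len(scripts_used(label)) > 1 for label in domain.split("."))

mixed = "\u0430pple.com"                        # Cyrillic "а" + Latin "pple"
whole = "\u0430\u0440\u0440\u04cf\u0435.com"    # every letter of the label is Cyrillic

print(to_ascii(mixed))     # xn--pple-43d.com
print(looks_mixed(mixed))  # True
print(looks_mixed(whole))  # False: a single-script label passes this check
```

The last line shows why a mixed-script rule alone is not enough: a label built from one script passes it, so the check has to be combined with other signals.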
Invisible Characters - The Threat of Zero-Width Characters
Unicode includes several "zero-width characters" that are not displayed on screen.
- U+200B: Zero Width Space
- U+200C: Zero Width Non-Joiner
- U+200D: Zero Width Joiner
- U+FEFF: Zero Width No-Break Space (BOM)
These characters are invisible, but they still count as characters, so they change the results of string comparisons and hash calculations. This has both uses and abuses:
- Watermarking: A technique that embeds zero-width characters in confidential documents to identify who leaked them. By inserting different patterns of zero-width characters for each recipient, the leaker can be identified from the pattern in the leaked document
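- Password issues: If zero-width characters are picked up when a password is copied and pasted, the password looks correct but is rejected at login
- Code tampering: Inserting zero-width characters into source code can produce code that looks identical but behaves differently, for example two identifiers that render the same. This is related to the Trojan Source attacks, which mainly abuse directional control characters and look-alike characters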
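The effects listed above can be reproduced in a short sketch. It covers only the four zero-width characters named in this section, and the watermark scheme (zero-width space for bit 0, zero-width non-joiner for bit 1, inserted after the first character) is a hypothetical encoding invented for illustration, not a description of any real product. All function names are assumptions.

```python
import hashlib

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def find_zero_width(text):
    """Return (index, code point) for every zero-width character found."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ch in ZERO_WIDTH]

def strip_zero_width(text):
    """Remove the zero-width characters listed in ZERO_WIDTH."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)

def embed_id(text, recipient_id, bits=8):
    """Hypothetical watermark: hide recipient_id as invisible bits after the first character."""
    pattern = "".join(
        "\u200c" if (recipient_id >> i) & 1 else "\u200b"
        for i in reversed(range(bits))
    )
    return text[:1] + pattern + text[1:]

def extract_id(text):
    """Read the hidden bits back, most significant bit first."""
    value = 0
    for ch in text:
        if ch in ("\u200b", "\u200c"):
            value = (value << 1) | int(ch == "\u200c")
    return value

typed = "hunter2"
pasted = "hun\u200bter2"  # looks the same, contains a zero-width space

print(typed == pasted)  # False
print(hashlib.sha256(typed.encode()).hexdigest() ==
      hashlib.sha256(pasted.encode()).hexdigest())  # False
print(find_zero_width(pasted))  # [(3, 'U+200B')]

marked = embed_id("Confidential", 42)
print(extract_id(marked))  # 42
```

Stripping the characters before comparison restores equality, which is also why a leaked document that has been passed through such a filter no longer carries the watermark.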
Directional Control Characters - Manipulating Text Flow
Unicode includes characters that control text display direction. Since Arabic and Hebrew are written right-to-left, these control characters serve legitimate purposes, but they can also be exploited for attacks.
- U+202E: Right-to-Left Override. Forces all subsequent text to display right-to-left
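For example, a file whose real name is "document_" + U+202E + "fdp.exe" displays on screen as document_exe.pdf, because everything after the invisible override character is rendered in reverse. The user thinks it's a PDF file and double-clicks, but it's actually an .exe file (executable).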
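One way to defuse this trick is to make the control characters visible and to read the extension from the stored name rather than the rendered one. The sketch below assumes a set of directional control characters (U+200E, U+200F, U+202A to U+202E, U+2066 to U+2069); the helper names escape_bidi and real_extension are hypothetical.

```python
import os

# Directional marks, embeddings/overrides, and isolates.
BIDI_CONTROLS = {
    chr(cp)
    for cp in (0x200E, 0x200F, *range(0x202A, 0x202F), *range(0x2066, 0x206A))
}

def escape_bidi(name):
    """Replace each directional control character with a visible <U+XXXX> tag."""
    return "".join(f"<U+{ord(ch):04X}>" if ch in BIDI_CONTROLS else ch for ch in name)

def real_extension(name):
    """Return the extension of the stored name, ignoring directional controls."""
    cleaned = "".join(ch for ch in name if ch not in BIDI_CONTROLS)
    return os.path.splitext(cleaned)[1].lower()

name = "document_\u202efdp.exe"  # renders as document_exe.pdf in many UIs

print(escape_bidi(name))     # document_<U+202E>fdp.exe
print(real_extension(name))  # .exe
```

Escaping rather than silently deleting the character keeps the tampering visible to whoever is looking at the file list.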
What Developers Should Watch For
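- When accepting user input for identifiers such as usernames, passwords, and filenames, sanitize (remove or reject) zero-width characters and directional control characters. Free-form text may legitimately need them, for example in Persian or in emoji sequences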
- When validating domain names, convert to Punycode before comparison
- When displaying filenames, remove or escape directional control characters
- During code reviews, use tools to detect injected zero-width and directional control characters (e.g., grep -P '[\x{200B}-\x{200F}\x{202A}-\x{202E}\x{2066}-\x{2069}\x{FEFF}]' in a UTF-8 locale)
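Where grep with Perl-style patterns is not available, a similar check can be scripted. This is a minimal sketch under the assumption that the characters of interest are U+200B to U+200F, U+202A to U+202E, U+2066 to U+2069, and U+FEFF; the function name scan_text is hypothetical, and a real review tool would likely need an allowlist for files that use these characters legitimately.

```python
import re

# Zero-width characters, directional marks, overrides, isolates, and the BOM.
SUSPICIOUS = re.compile("[\u200b-\u200f\u202a-\u202e\u2066-\u2069\ufeff]")

def scan_text(text):
    """Return (line number, column, code point) for each suspicious character."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in SUSPICIOUS.finditer(line):
            findings.append((line_no, match.start() + 1, f"U+{ord(match.group()):04X}"))
    return findings

sample = "ok = True\nif is_admin\u200b:\n    pass  # \u202e end\n"

for finding in scan_text(sample):
    print(finding)
```

Line and column numbers are 1-based so that the output can be matched against what an editor shows.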
Summary
Unicode is a remarkable system that handles characters from around the world uniformly, but its richness also creates security threats. Homograph attacks, zero-width characters, directional control characters - knowing about these threats heightens your vigilance against phishing site URLs and suspicious filenames. When accessing IP Check-san, make it a habit to verify that the domain name in the URL bar is correct.