I � Unicode
A curated collection of fascinating, funny, weird, and strange Unicode phenomena from my talk at the Webnesday St. Gallen Meetup.
📊 View the presentation slides
Special thanks to Pascal Helfenstein for organizing this fantastic event, Frontify for hosting us, and my fellow speakers Erdem (keyboards) and Christoph Bühler (hypecycles) for making it such an engaging evening!
Unicode is far more than just "international characters"—it's a deep rabbit hole of linguistic history, cultural nuance, and technical complexity that continues to surprise developers daily. This page serves as an extended resource for my presentation, featuring real-world Unicode gotchas, security implications, and delightfully bizarre examples.
Getting Started with Unicode
Before diving into the weird and wonderful world of Unicode oddities, let's start with some excellent introductions to Unicode basics:
Essential Reading
These articles provide a solid foundation for understanding Unicode:
- The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!) by Nikita Prokopov: A modern, practical guide that covers what's changed since Joel Spolsky's famous 2003 article. Key insight: "In 2023, it's no longer a question: with a 98% probability, it's UTF-8. Finally! We can stick our heads in the sand again!" Covers UTF-8, grapheme clusters, normalization, and why "🤦🏼♂️".length gives different answers in different programming languages.
- An Introduction to Unicode by Aleksandr Hovhannisyan: A comprehensive deep-dive into Unicode and UTF-8 encoding with mathematical explanations. Covers the history from ASCII to Unicode, how UTF-8 works under the hood with bitwise operations, character boundaries, and self-synchronization. Perfect for understanding the technical details of how UTF-8 encoding actually works.
- Unicode for Curious Developers by Julien Sobczak: An incredibly detailed guide covering the complete story of Unicode from ancient cave paintings to modern emojis. Explores the Unicode Standard, character database, encoding forms (UTF-8, UTF-16, UTF-32), and implementation details in various programming languages. Includes extensive code examples and practical guidance for developers.
- Unicode in Five Minutes by Richard Harris: A concise but comprehensive overview covering normalization, casefolding, sorting, encodings, and practical Unicode issues. Excellent coverage of "gotchas" like grapheme clusters, variation selectors, and internationalized domain names. Great for developers who need practical Unicode knowledge quickly.
These articles will help you understand the fundamentals before we explore the more unusual aspects of Unicode below.
🐍 Real-World Unicode Gotchas
The Hyphen That Broke Everything
A perfect example comes from Joseph Carboni's experience parsing financial data. He was extracting dollar amounts from PDFs, expecting negative values to have hyphens. His regex [^-.0-9] should have preserved hyphens, but negative values kept coming out positive!
The culprit? Two visually identical but different characters:
- HYPHEN-MINUS (U+002D): - (the one on your keyboard)
- HYPHEN (U+2010): ‐ (the "proper" typographic hyphen)
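import unicodedata

# These look identical but aren't!
hyphen_minus = "\u002d"  # HYPHEN-MINUS
hyphen = "\u2010"  # HYPHEN

# The solution: Use Unicode categories
def is_dash(char):
    return unicodedata.category(char) == "Pd"  # Punctuation, dash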
Discussion on Hacker News reveals this is surprisingly common!
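The failure and one possible fix can be sketched in Python. This is a minimal sketch, not Carboni's actual code: the names `parse_amount_naive`, `normalize_dashes`, and `parse_amount` are hypothetical, and treating MINUS SIGN (U+2212) as a dash is an extra assumption of mine, since that character is in category Sm rather than Pd.

```python
import re
import unicodedata

# Same pattern as in the story: drop everything except '-', '.', and ASCII digits.
STRIP_NON_NUMERIC = re.compile(r"[^-.0-9]")


def parse_amount_naive(raw):
    """Strip and parse. A leading U+2010 is silently removed with the rest."""
    return float(STRIP_NON_NUMERIC.sub("", raw))


def normalize_dashes(text):
    """Map every dash-like character to ASCII HYPHEN-MINUS (assumption: U+2212 too)."""
    return "".join(
        "-" if unicodedata.category(ch) == "Pd" or ch == "\u2212" else ch
        for ch in text
    )


def parse_amount(raw):
    """Normalize dashes first, then strip and parse."""
    return float(STRIP_NON_NUMERIC.sub("", normalize_dashes(raw)))


pdf_value = "\u2010$1,234.56"  # starts with HYPHEN (U+2010), not HYPHEN-MINUS
print(parse_amount_naive(pdf_value))  # sign is lost
print(parse_amount(pdf_value))        # sign survives
```

Mapping the whole Pd category is deliberately broad. If the input can contain dashes that are not signs, a narrower allow-list is probably the safer choice.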
The Greek Question Mark Prank
Replace a semicolon (;) with a Greek question mark (;) in someone's code and watch them go insane trying to find the syntax error. They're visually identical but completely different Unicode characters!
- Semicolon: U+003B (;)
- Greek Question Mark: U+037E (;)
Tools like mimic can help identify these confusables. HN discussion
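If you suspect you are the victim, a few lines of Python can point at the offending character. This is a rough sketch with a hypothetical helper name (`find_non_ascii`). It simply reports every non-ASCII character, so it will also flag legitimate text in strings and comments.

```python
import unicodedata


def find_non_ascii(source):
    """Return (line, column, code point, name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


# The first statement ends in U+037E instead of a real semicolon.
pranked = "int x = 1\u037e\nint y = 2;\n"
for hit in find_non_ascii(pranked):
    print(hit)
```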
🌍 Programming in Other Scripts
قلب (Qalb) - Arabic Programming Language
قلب is a programming language that explores the role of human culture in coding by using Arabic script. It demonstrates how deeply embedded English/Latin assumptions are in our programming tools and thinking.
🎭 Security Implications
IDN Homograph Attacks
IDN Homograph attacks exploit visually similar characters from different scripts to create deceptive domain names:
- аpple.com (using Cyrillic 'а' instead of Latin 'a')
- goog1е.com (using Cyrillic 'е' instead of Latin 'e')
These attacks rely on internationalized domain names, which are stored in DNS as Punycode, an ASCII encoding of Unicode labels. The browser might show the Unicode version, but the actual domain uses encoded ASCII like xn--pple-43d.com.
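Python's standard library is enough to see both sides of such a domain. The sketch below uses the built-in `idna` codec (which implements the older IDNA 2003 rules) and a very rough mixed-script check based on Unicode character names. The helper names are hypothetical, and the name-prefix heuristic is an assumption of convenience, not a real script-detection algorithm.

```python
import unicodedata


def to_ascii_hostname(hostname):
    """Encode a hostname to its ASCII (Punycode) form with the built-in codec."""
    return hostname.encode("idna").decode("ascii")


def scripts_in_label(label):
    """Rough script guess: the first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


spoofed = "\u0430pple.com"  # first letter is CYRILLIC SMALL LETTER A
print(to_ascii_hostname(spoofed))
print(scripts_in_label("\u0430pple"))  # more than one script is a warning sign
```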
Right-to-Left Override Attacks
Using the Right-to-Left Override character (U+202E), attackers can make malicious files appear safe:
document.[U+202E]txt.exe
This appears as document.exe.txt but is actually an executable file! The RLO character makes everything after it render right to left, so the stored "txt.exe" is displayed as "exe.txt".
Example: http://www.example.com?site/moc.elgoog.www//:ptth - try copying this URL!
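A defensive check is to look for directional formatting characters before trusting a file name or URL. The sketch below is a minimal version under my own assumptions: the helper names are hypothetical, and the set covers only the explicit embedding, override, and isolate controls.

```python
# Explicit bidi controls: LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI.
BIDI_CONTROLS = frozenset("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")


def find_bidi_controls(text):
    """Return (index, code point) for every bidi control in the text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


def reveal_bidi_controls(text):
    """Replace invisible bidi controls with a visible [U+XXXX] marker."""
    return "".join(
        f"[U+{ord(ch):04X}]" if ch in BIDI_CONTROLS else ch for ch in text
    )


stored_name = "document.\u202etxt.exe"  # really ends in .exe
print(find_bidi_controls(stored_name))
print(reveal_bidi_controls(stored_name))
```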
🔤 Extreme Unicode Examples
The 15KB Character
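There exists a Unicode "character" that takes up approximately 15,000 bytes when encoded. Since a single code point never needs more than 4 bytes in UTF-8, a size like that has to come from many code points rendered together as one visible character. You can calculate character byte sizes at mothereff.in/byte-counter. Reddit source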
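Byte counts like this are easy to check locally. A minimal sketch in Python follows. `utf8_size` is a hypothetical helper, and the stacked string at the end is my own illustration of how one visible "character" can grow large, not the character from the Reddit thread.

```python
def utf8_size(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


print(utf8_size("A"))           # ASCII
print(utf8_size("\u00e9"))      # LATIN SMALL LETTER E WITH ACUTE
print(utf8_size("\ua66e"))      # CYRILLIC LETTER MULTIOCULAR O
print(utf8_size("\U000130B8"))  # an Egyptian hieroglyph outside the BMP

# One base letter with thousands of combining acute accents stacked on it.
stacked = "a" + "\u0301" * 7500
print(utf8_size(stacked))
```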
ꙮ - The Multiocular O
Meet ꙮ (U+A66E) - the Cyrillic letter multiocular O with many eyes!
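𱁬 - The Character with the Most Strokes
𱁬 (U+3106C) is the Taito kanji with a whopping 84 strokes, making it one of the most complex characters in Unicode. It stacks 䨺 (three copies of 雲, "cloud") on top of 龘 (three copies of 龍, "dragon"); 䨺 on its own has only 36 strokes.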
Ancient Scripts
- 힘 (Korean): U+D798 - Graphemica
- 𓈝 (Egyptian Hieroglyph): Ancient Egyptian cow symbol
- 𒐫 (Cuneiform): Cuneiform number from the Cuneiform Numbers and Punctuation block
Egyptian Hieroglyphs with... Personality
The hieroglyphs 𓂸𓂹𓂺 have caused quite a stir in the Unicode community. Adrian Kennard wrote about them in Unicode Dicks with an entertaining Hacker News discussion.
📚 Essential Reading for Developers
The Modern Unicode Developer's Bible
The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 by Nikita Prokopov is hands-down the best comprehensive Unicode guide for developers. This brilliant article covers:
Key Insights from Tonsky's Guide
- UTF-8 has won: 98% of the web uses UTF-8, so we can finally stop worrying about encoding detection
- Grapheme clusters matter more than code points: What users see as "one character" often spans multiple code points
- String length is broken in most languages: Only Swift and Elixir get "🤦🏼♂️".length right (answer: 1)
- Use Unicode libraries for everything: Even basic operations like strlen, indexOf, and substring need proper Unicode handling
- Unicode updates yearly: Rules change, new emoji are added, and your app needs to keep up
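The differing answers for the facepalm emoji come from counting different units. The sketch below spells the emoji out with escapes so the invisible Zero Width Joiner and variation selector cannot get lost in copy-paste. `length_report` is a hypothetical helper, and it does not attempt grapheme segmentation, which needs more than these built-in calls.

```python
# FACE PALM + skin tone modifier + ZERO WIDTH JOINER + MALE SIGN + VARIATION SELECTOR-16
FACEPALM = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"


def length_report(text):
    """Measure the same string in three different units."""
    return {
        "code_points": len(text),                           # what Python's len() counts
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # what JavaScript's .length counts
        "utf8_bytes": len(text.encode("utf-8")),            # what Rust's .len() counts
    }


print(length_report(FACEPALM))
```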
Real-World Developer Problems It Solves
- Why "🤦🏼♂️" reports different lengths in different languages (Python: 5, JavaScript: 7, Rust: 17 bytes)
- How normalization prevents "Å" !== "Å" !== "Å" comparison failures
- When String.toLowerCase() requires a Locale parameter (Turkish i/I problem)
- Why UTF-16 surrogate pairs still matter for JavaScript/Java/.NET developers
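The three look-alike "Å" strings are easiest to tell apart when written as escapes. This is a small sketch using Python's `unicodedata` module. `same_text` is a hypothetical helper, and NFC is just one reasonable choice of normalization form.

```python
import unicodedata

PRECOMPOSED = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE
ANGSTROM = "\u212b"     # ANGSTROM SIGN
COMBINING = "A\u030a"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(PRECOMPOSED == ANGSTROM, PRECOMPOSED == COMBINING)  # raw comparison
print(same_text(PRECOMPOSED, ANGSTROM), same_text(PRECOMPOSED, COMBINING))
```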
Essential Takeaways
- Extended Grapheme Clusters are what humans think of as "characters"
- Normalization is required before any string comparison
- Locale matters for case conversion and rendering
- Even English text uses Unicode beyond ASCII (curly quotes, em dashes, café)
This guide perfectly complements the weird examples on this page by explaining the underlying technical foundations that make Unicode both powerful and occasionally maddening.
More Essential Reading
Dark Corners of Unicode by eevee - A brilliant deep dive into the practical problems you'll encounter with Unicode in the real world:
- Terminal rendering nightmares: Why emoji overlap text in VTE and cursor positioning breaks in Konsole
- JavaScript's broken string type: How "💣".length returns 2 because JavaScript uses UTF-16 surrogate pairs
- The wcwidth() disaster: Different implementations report different character widths, breaking text everywhere
- Sorting is impossible: German "ß" vs "ss", Turkish dotless "ı", and why normalization isn't a silver bullet
- There's no such thing as emoji: The arbitrary definition of what counts as emoji and why fonts matter
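The surrogate-pair arithmetic behind that JavaScript length of 2 is short enough to write out. This is a sketch in Python with a hypothetical function name, cross-checked against Python's own UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


bomb = "\U0001F4A3"
high, low = to_surrogate_pair(ord(bomb))
print(f"{high:04X} {low:04X}")
print(bomb.encode("utf-16-be").hex())  # same two code units, as bytes
```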
I Can Text You A Pile of Poo, But I Can't Write My Name by Aditya Mukerjee - A powerful critique of Unicode's cultural and representational problems:
- Second-class languages: Bengali (7th most spoken language) missing basic characters for decades
- Han Unification controversy: Forcing Chinese, Japanese, and Korean into shared character sets
- Colonial echoes: How Unicode Consortium's composition reflects historical power imbalances
- Emoji prioritization: 1,000 emoji characters while people can't write their own names correctly
💻 Developer Deep Dives
Monospace Fonts and Unicode Ligatures
Shaping Ligatures in Monospace Fonts by Josh Leeb explores the technical challenges of implementing Unicode ligatures in code editors:
- The ligature spacer problem: How "#{" becomes 4 glyphs instead of 3 when shaped
- Invisible glyphs: The mysterious "LIGSPACE" character with 2×0 pixel dimensions
- Monospace constraints: Why fonts use phantom spacer glyphs to maintain fixed-width requirements
- Rendering complexity: Real-world text shaping is never as simple as "one character = one glyph"
This perfectly illustrates how even "simple" programming contexts involve deep Unicode complexity.
👹 Zalgo Text - When Unicode Goes Wrong
Zalgo text uses combining diacritics to create chaotic, "corrupted" looking text:
H̵̛͕̞̦̰̜͍̰̥̟͆̏͂̌͑ͅä̷͔̟͓̬̯̟͍̭͉͈̮͙̣̯̬͚̞̭̍̀̾͠m̴̡̧̛̝̯̹̗̹̤̲̺̟̥̈̏͊̔̑̍͆̌̀̚͝͝b̴̢̢̫̝̠̗̼̬̻̮̺̭͔̘͑̆̎̚ư̵̧̡̥̙̭̿̈̀̒̐̊͒͑r̷̡̡̲̼̖͎̫̮̜͇̬͌͘g̷̹͍͎̬͕͓͕̐̃̈́̓̆̚͝ẻ̵̡̼̬̥̹͇̭͔̯̉͛̈́̕r̸̮̖̻̮̣̗͚͖̝̂͌̾̓̀̿̔̀͋̈́͌̈́̋͜
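Because Zalgo text is just base letters followed by piles of combining marks, it can be measured and cleaned with the `unicodedata` module. The helper names below are hypothetical. Note that stripping every nonspacing mark is a blunt instrument, since it also removes legitimate accents.

```python
import unicodedata


def count_combining_marks(text):
    """Count nonspacing marks (general category Mn) in the text."""
    return sum(1 for ch in text if unicodedata.category(ch) == "Mn")


def strip_combining_marks(text):
    """Decompose to NFD, then drop every nonspacing mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


zalgo = "Z\u0351\u0300a\u0321\u034al\u0352g\u0323o\u0307"
print(count_combining_marks(zalgo))
print(strip_combining_marks(zalgo))
print(strip_combining_marks("caf\u00e9"))  # the accent on a real word is lost too
```

A gentler variant would cap the number of marks per base character instead of removing all of them.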
🔧 Developer Tools and Libraries
String Length: Not What You Think
Counting emoji and Unicode characters is trickier than it seems: How to Count Emojis with JavaScript
Libraries and Tools
- PHP: Symfony String Component - Unicode-aware string handling
- Python: Use u'...' for Unicode-aware strings (Python 2) or just strings in Python 3+
- Fonts: Symbola - Comprehensive Unicode font for ancient scripts
Fun Unicode Tools
Regional Indicator Generator
LingoJam Regional Indicator Generator - Convert text to spaced-out flag emoji letters like 🇨 🇴 🇩 🇪.
How it works: Regional Indicator symbols (U+1F1E6-U+1F1FF) represent letters A-Z and are intended for ISO 3166-1 country codes. When two valid codes are adjacent, they become flag emoji (🇺🇸, 🇨🇦), but when spaced out, they create stylized text effects popular on Discord and social media.
The Unicode trick: Each "letter" is actually U+1F1E8 REGIONAL INDICATOR SYMBOL LETTER C, etc. - originally designed for encoding country flags, but cleverly repurposed for decorative text.
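The mapping from A-Z to Regional Indicator symbols is a fixed offset, so a generator like this fits in a few lines. This is a sketch with a hypothetical function name. Whether two adjacent symbols actually render as a flag depends on the font and platform.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def to_regional_indicators(letters, separator=""):
    """Map ASCII letters A-Z to Regional Indicator symbols."""
    symbols = []
    for letter in letters.upper():
        if not "A" <= letter <= "Z":
            raise ValueError(f"not an ASCII letter: {letter!r}")
        symbols.append(chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A")))
    return separator.join(symbols)


print(to_regional_indicators("US"))         # adjacent pair, may render as a flag
print(to_regional_indicators("CODE", " "))  # spaced out, renders as boxed letters
```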
Unicode Mirror Characters
Some characters have "mirrored" versions for right-to-left text: Stack Overflow discussion
📊 Unicode Visualization
The Big Picture
Ian Albert's Unicode Chart - A massive visual representation of the entire Unicode space. Perfect for understanding the scale and organization of Unicode blocks.
🐦 Twitter/X's Unicode Reality Check
Twitter's character counting documentation reveals the messy reality of Unicode in production systems at massive scale:
The "280 Character" Lie
Twitter's character limit isn't actually 280 Unicode characters:
- Most characters count as 1: Latin-1, basic punctuation, directional marks (U+0000-U+10FF)
- CJK characters count as 2: Chinese/Japanese/Korean users get only 140 characters max
- ALL emoji count as 2: Even simple ones like 👾, regardless of underlying complexity
- Complex emoji still count as 2: 👨👩👧👦 (7 Unicode code points) = 2 Twitter "characters"
Unicode Normalization in Production
Twitter normalizes all text to NFC (Normalization Form C) before counting:
"café" (composed): 0x63 0x61 0x66 0xC3 0xA9 = 4 characters
"café" (decomposed): 0x63 0x61 0x66 0x65 0xCC 0x81 = still 4 characters (after NFC)
The t.co URL Hack
All URLs become exactly 23 characters regardless of actual length:
- https://example.com = 23 characters
- https://reallyreallylongdomainname.com/with/many/paths = also 23 characters
Zero Width Joiner Magic
Twitter recognizes emoji sequences using Zero Width Joiner (U+200D) but counts them as 2 characters total:
- 👨🎤 = 👨 + ZWJ + 🎤 = 2 Twitter characters (not 3 Unicode code points)
This demonstrates how even major platforms struggle with Unicode complexity and create arbitrary rules to make things work. It also shows why Unicode awareness matters—CJK users get half the character limit of English users!
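The core of the counting rules above can be approximated in a few lines. This is a deliberately simplified sketch, not Twitter's implementation. `weighted_length` is a hypothetical name, it uses only the U+0000-U+10FF range quoted above as the weight-1 range, and it ignores the special handling of emoji sequences and URLs.

```python
import unicodedata


def weighted_length(text):
    """NFC-normalize, then count 1 for U+0000-U+10FF and 2 for everything else."""
    normalized = unicodedata.normalize("NFC", text)
    return sum(1 if ord(ch) <= 0x10FF else 2 for ch in normalized)


print(weighted_length("caf\u00e9"))            # composed
print(weighted_length("cafe\u0301"))           # decomposed, same count after NFC
print(weighted_length("\u65e5\u672c\u8a9e"))   # three CJK characters
```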
🎯 Key Takeaways
- Never assume character equality - Always normalize and compare properly
- Security matters - Unicode can be weaponized for phishing and attacks
- Cultural context is everything - Scripts carry deep cultural meaning
- Test with real data - PDF extractions and copy-paste introduce surprising characters
- Use proper libraries - Don't roll your own Unicode handling
Unicode isn't just a technical specification—it's a reflection of human linguistic diversity, complete with all the complexity, beauty, and occasional chaos that entails. Every weird edge case has a story, often rooted in centuries of cultural and typographic history.
🎥 Recommended Unicode Talks
Unicode: What Everyone Should Know - A wonderful deep-dive talk exploring Unicode fundamentals, practical challenges, and real-world implications for developers. Perfect complement to the resources and examples on this page.
Unicode Talks Collection - Additional insights into Unicode complexity and practical developer challenges.
Unicode Technical Deep-Dive - Further exploration of Unicode implementation details and real-world scenarios.
Unicode Advanced Topics - Extended discussion of Unicode complexities and advanced implementation considerations.
Found a great Unicode oddity or have a war story to share? Unicode never stops surprising us!