python-unicode

1 Unicode in Python

2 Two types of "string" in Python

  • (Byte) string and unicode
"Nguyễn Tuấn Anh"  # 'Nguy\xe1\xbb\x85n Tu\xe1\xba\xa5n Anh'
u"Nguyễn Tuấn Anh" # u'Nguy\u1ec5n Tu\u1ea5n Anh'

3 Encoding & decoding

  • Let's just assume utf-8
  • You can encode unicode strings into byte strings, and decode byte strings into unicode strings
s = "Nguyễn Tuấn Anh"  # 'Nguy\xe1\xbb\x85n Tu\xe1\xba\xa5n Anh'
u = u"Nguyễn Tuấn Anh" # u'Nguy\u1ec5n Tu\u1ea5n Anh'
s == u.encode("utf8")  # True
u == s.decode("utf8")  # True
  • If you know what a thing is, you can convert it, but…
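
A minimal sketch of the round trip, written so it runs on Python 2.7 and on Python 3.3+ (the byte literal uses escapes so the source stays ASCII). The variable names byte_count and char_count are only for illustration. It also shows that the two types count different things: bytes versus code points.

```python
s = b"Nguy\xe1\xbb\x85n Tu\xe1\xba\xa5n Anh"   # UTF-8 encoded bytes
u = u"Nguy\u1ec5n Tu\u1ea5n Anh"               # unicode code points

# Encoding and decoding with the same codec are inverses of each other.
assert u.encode("utf-8") == s
assert s.decode("utf-8") == u

# len() counts bytes for the byte string and code points for the unicode one.
byte_count = len(s)
char_count = len(u)
assert byte_count > char_count   # each accented letter takes 3 bytes in UTF-8
```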

4 You usually don't know

  • Some libraries accept/return unicodes, some byte strings, some both!
  • To complicate things further, byte strings have an encode method, and unicode strings have a decode method
"Nguyen Tuan Anh".encode("utf8")  # 'Nguyen Tuan Anh'
"Nguyen Tuan Anh".decode("utf8")  # u'Nguyen Tuan Anh'
u"Nguyen Tuan Anh".encode("utf8") # 'Nguyen Tuan Anh'
u"Nguyen Tuan Anh".decode("utf8") # u'Nguyen Tuan Anh'

5 .decode & .encode seem to work fine!

  • Yes, due to implicit conversion
  • Until you start using non-ascii characters
"Nguyễn Tuấn Anh".encode("utf8")  # UnicodeDecodeError (no, not an encode error)
"Nguyễn Tuấn Anh".decode("utf8")  # u'Nguy\u1ec5n Tu\u1ea5n Anh'
u"Nguyễn Tuấn Anh".encode("utf8") # 'Nguy\xe1\xbb\x85n Tu\xe1\xba\xa5n Anh'
u"Nguyễn Tuấn Anh".decode("utf8") # UnicodeEncodeError (no, not a decode error)
  • Which you often don't, during development
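
Why does encoding a byte string raise a decode error? In Python 2, calling .encode on a byte string first decodes it with the default codec (ascii, unless it has been changed) and only then encodes. The sketch below spells out that hidden step; encode_like_py2_str is a hypothetical helper written for illustration, not part of any library, and it runs on Python 2.7 and 3.3+.

```python
def encode_like_py2_str(byte_string, coding="utf-8"):
    # Hypothetical helper: makes the implicit conversion explicit.
    # Step 1 (hidden in Python 2): decode the bytes with the ascii codec.
    as_unicode = byte_string.decode("ascii")
    # Step 2: the encode that was actually requested.
    return as_unicode.encode(coding)

# Pure ascii survives the hidden step, so everything seems to work.
assert encode_like_py2_str(b"Nguyen Tuan Anh") == b"Nguyen Tuan Anh"

# Non-ascii bytes fail in step 1, hence a *decode* error from an encode call.
try:
    encode_like_py2_str(b"Nguy\xe1\xbb\x85n")
    failed_with = None
except UnicodeDecodeError as e:
    failed_with = type(e).__name__
assert failed_with == "UnicodeDecodeError"
```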

6 How about str & unicode functions?

  • Similar to decode & encode, except faster, and
    • They default to the ascii codec
    • str doesn't take an encoding argument at all
s = "Nguyễn Tuấn Anh"
u = u"Nguyễn Tuấn Anh"
unicode(s)         # UnicodeDecodeError
unicode(s, "utf8") # Ok
unicode(u)         # Ok
unicode(u, "utf8") # No, sorry: TypeError (decoding unicode is not supported)
str(u)             # UnicodeEncodeError
str(u, "utf8")     # Sorry, no: TypeError (str takes at most 1 argument)
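
A rough sketch of what the ascii default means in practice, assuming the default encoding has been left at ascii: unicode(s) behaves like s.decode("ascii") and str(u) behaves like u.encode("ascii"). The explicit method calls below reproduce both failures and run on Python 2.7 and 3.3+; the *_failed_at names are only for illustration.

```python
s = b"Nguy\xe1\xbb\x85n Tu\xe1\xba\xa5n Anh"
u = u"Nguy\u1ec5n Tu\u1ea5n Anh"

# What unicode(s) does under the hood: decode with ascii.
try:
    s.decode("ascii")
    decode_failed_at = None
except UnicodeDecodeError as e:
    decode_failed_at = e.start   # index of the first offending byte

# What str(u) does under the hood: encode with ascii.
try:
    u.encode("ascii")
    encode_failed_at = None
except UnicodeEncodeError as e:
    encode_failed_at = e.start   # index of the first offending character

# Both fail at the first accented letter, right after "Nguy".
assert decode_failed_at == 4
assert encode_failed_at == 4

# Naming the codec explicitly avoids the ascii default.
assert s.decode("utf-8") == u
assert u.encode("utf-8") == s
```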

7 So, how to survive

  • Use unicode for all your functions
  • Carefully, patiently, thoroughly read the documentation to know what goes in and out of libraries (twisted & warp like strings (most of the time))
  • Additionally, when receiving data, assume everyone wants to stab you in the back. This may save you from having to know what libraries give you
  • Most importantly, test with non-ascii characters

8 Like this

def uni(s, coding="utf-8"):
    if isinstance(s, unicode):
        return s
    return unicode(s, coding)

# XXX: You would assume 'str' is symmetric to 'unicode'...
def st(maybe_unicode_or_contain_unicode):
    try:
        return str(maybe_unicode_or_contain_unicode)
    except UnicodeEncodeError:
        # str() failed on non-ascii text, so encode explicitly instead
        return unicode(maybe_unicode_or_contain_unicode).encode("utf-8")

# XXX: And you just want to print the exception, to debug
def print_exception(e):
    # Hilarity ensues if you do this
    # print Exception(u"Nguyễn Tuấn Anh")
    print st(e)
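
One way to act on the "test with non-ascii characters" advice is sketched below. It is a variant of the same idea (decode on the way in, work in unicode, encode on the way out) that avoids the unicode builtin so it runs on Python 2.7 and 3.3+. The names TEXT_TYPE, to_text, greet and to_wire are hypothetical, chosen for this example only.

```python
TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3

def to_text(value, coding="utf-8"):
    # Text passes through untouched; anything else is assumed to be bytes.
    if isinstance(value, TEXT_TYPE):
        return value
    return value.decode(coding)

def greet(name):
    # Hypothetical application function: accepts either kind, returns text.
    return u"Xin ch\u00e0o, " + to_text(name)

def to_wire(text, coding="utf-8"):
    # Encode only at the boundary, when bytes are actually required.
    return text.encode(coding)

# Test with non-ascii input, once as bytes and once as text.
expected = u"Xin ch\u00e0o, Nguy\u1ec5n"
assert greet(b"Nguy\xe1\xbb\x85n") == expected
assert greet(u"Nguy\u1ec5n") == expected
assert to_wire(expected) == b"Xin ch\xc3\xa0o, Nguy\xe1\xbb\x85n"
```

Feeding the same name in both forms is the point: ascii-only test data would pass even if to_text were missing.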

Date: 2012-07-20 00:01:01 ICT

Author: Nguyễn Tuấn Anh
