Python 2 unicode!

ASCII, Unicode, UTF-8, double-bytes, single-bytes. Four bytes. JSON encodings!

Welcome to the Python 2 world of strings.
Strings that can be Unicode, 8-bit, and probably something else that we might see in future versions of this very beloved language.

Let me quickly jump into an issue:

something = "Whattheheck" # Чезафигня

When you try to run the above script, you will probably see Python panicking:

SyntaxError: Non-ASCII character '\xd0' ...

OK, what does it all mean, and why can't we have a non-ASCII character in the source code of a Python file?

There is a hint on the PEP 263 page:
Python will default to ASCII as standard encoding if no other
encoding hints are given.

What it means is that whenever a Python source file contains any characters other than ASCII, even inside a comment, the interpreter gets really unhappy.
If you want to write comments or define a string with a value in another language, you need to declare the encoding of the source file.
Here is how to do it.

Add the statement below on the first or second line of the Python file.

# -*- coding: utf-8 -*-
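To make the "first or second line" rule more tangible, here is a rough sketch of how such a declaration can be located. It is a simplification and not the interpreter's real code: the function name declared_encoding is made up for this post, and the pattern is the one quoted in the Python language reference for encoding declarations.

```python
import re

# Pattern quoted in the Python language reference for encoding declarations.
CODING_RE = re.compile(r"coding[=:]\s*([-\w.]+)")


def declared_encoding(source_lines, default="ascii"):
    """Return the encoding named in a comment on line 1 or 2, else `default`.

    Hypothetical helper: a simplified illustration of the PEP 263 rule,
    with Python 2's ASCII fallback as the default.
    """
    for line in source_lines[:2]:
        if line.lstrip().startswith("#"):
            match = CODING_RE.search(line)
            if match:
                return match.group(1)
    return default


print(declared_encoding(["#!/usr/bin/env python", "# -*- coding: utf-8 -*-"]))  # utf-8
print(declared_encoding(["import sys", "x = 1", "# -*- coding: utf-8 -*-"]))    # ascii
```

The second call falls back to ASCII because the declaration sits on the third line, where it is too late to count.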

Now let's try to adjust the source code a little.

Import the chardet library, which helps to detect encodings, and run the script again:

# -*- coding: utf-8 -*-
import chardet
# sudo pip install chardet
something = "Whattheheck"  # English translation for Чезафигня
something_rus = "Чезафигня"
print chardet.detect(something) # {'confidence': 1.0, 'encoding': 'ascii'}
print chardet.detect(something_rus) # {'confidence': 0.99, 'encoding': 'utf-8'}

We just declared the source encoding to be ‘UTF-8’, but chardet still reports ASCII for the English string and ‘UTF-8’ only for its translation.
Aren’t all the strings supposed to be ‘UTF-8’ at this point? Not really.

And there is nothing surprising here: the English string contains only ASCII bytes, and those bytes are identical in ASCII and UTF-8, so chardet reports the narrower encoding.

# -*- coding: utf-8 -*-

sets the encoding of the source file only. It does not turn your string values into unicode objects, and it does not change the encoding used for implicit conversions.

In other words, the declaration only tells the interpreter how to decode the bytes of the file itself.
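A small sketch of the difference between text and the bytes that store it may help here. It is written to run on both Python 2.7 and Python 3, and the variable names are just for illustration.

```python
# -*- coding: utf-8 -*-
text = u"Чезафигня"            # text: a sequence of 9 code points
raw = text.encode("utf-8")     # bytes: what actually sits in the file

print(len(text))  # 9
print(len(raw))   # 18, each Cyrillic letter takes two bytes in UTF-8

# Decoding with the right codec restores the text...
assert raw.decode("utf-8") == text

# ...while a wrong guess silently yields different characters (mojibake).
garbled = raw.decode("latin-1")
print(len(garbled))  # 18
```

The bytes never change; only the codec used to read them decides which characters come out.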

The time has come to go crazy and try to print out something really cool!
We are going to decode a byte string into a unicode string using the UTF-8 codec.

# -*- coding: utf-8 -*-
something_rus = "Чезафигня"
# Let's see how the string looks in unicode representation
print unicode(something_rus, "utf-8")
# ... UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-8: ordinal not in range(128)

Oh my!
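The decoding itself worked, but print then has to encode the unicode object for the output stream, and in this environment it falls back to ASCII.
We still have to deal with encoding issues to be able to print a unicode string!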
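The same failure can be reproduced without print by asking for ASCII explicitly. This is only a sketch that runs on Python 2.7 and Python 3; it shows the error and two ways around it when you pick the output encoding yourself.

```python
# -*- coding: utf-8 -*-
text = u"Чезафигня"

try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print(exc.reason)  # ordinal not in range(128)

# Choosing the output encoding explicitly avoids the implicit ASCII step.
utf8_bytes = text.encode("utf-8")

# errors="replace" substitutes "?" for whatever the codec cannot express.
lossy = text.encode("ascii", "replace")
print(len(lossy))  # 9
```

Encoding explicitly at the point of output is usually the least surprising option, because nothing depends on what the environment guesses.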

By the way, in a source file declared as UTF-8, a plain string literal will produce a byte string that is encoded using UTF-8, while a literal with the u prefix results in a unicode object.

# -*- coding: utf-8 -*-

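did not actually change the default encoding used for implicit conversions, and if we call

import sys
print sys.getdefaultencoding() # ascii

we will see that it is still ASCII. We can change that by calling sys.setdefaultencoding. Python 2 removes this function from sys at startup, so the module has to be reloaded first (a widely discouraged hack, but good enough for a demo):

import sys
reload(sys)  # restores setdefaultencoding, which site.py deletes at startup
sys.setdefaultencoding('utf-8')

# And finally we are able to print
something_rus = "Чезафигня"
print(unicode(something_rus)) # Чезафигня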

The Python print statement converts its argument into a str, in our case using UTF-8 encoding. If we were to evaluate unicode(something_rus) in the interactive console without print, the output would be the escaped representation, which looks like u'\u0427\u0435\u0437…'
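If you want to look at the escapes without an interactive console, the unicode_escape codec gives the same picture. A minimal sketch that behaves the same on Python 2.7 and Python 3:

```python
# -*- coding: utf-8 -*-
text = u"Чезафигня"

# unicode_escape turns every non-ASCII code point into a \uXXXX escape.
escaped = text.encode("unicode_escape")
print(escaped.decode("ascii"))
# \u0427\u0435\u0437\u0430\u0444\u0438\u0433\u043d\u044f
```

Each escape is simply the code point of one letter written in hexadecimal.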

Now we can officially start tackling Unicode and deal with our As and Бс.

When we form a unicode string, we form a sequence of code points. A unicode string can be used internally in the application, but on output we have to encode it into a byte string.
Internally, a unicode string could look like u"\u0411\u0411\u0411".

The rules for translating a Unicode string into a sequence of bytes are called an encoding.
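To see that an encoding really is just a set of rules, here is a sketch that takes the same two code points and writes them out under three different encodings. The choice of codecs is mine, purely for illustration, and the code runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
text = u"AБ"  # Latin A (U+0041) followed by Cyrillic Б (U+0411)

code_points = [ord(ch) for ch in text]
print([hex(cp) for cp in code_points])  # ['0x41', '0x411']

codecs_to_try = ("utf-8", "utf-16-be", "cp1251")
encoded_forms = {
    codec: list(bytearray(text.encode(codec))) for codec in codecs_to_try
}

for codec in codecs_to_try:
    as_hex = " ".join("%02x" % b for b in encoded_forms[codec])
    print("%-9s %s" % (codec, as_hex))
# utf-8     41 d0 91
# utf-16-be 00 41 04 11
# cp1251    41 c1
```

Same code points, three different byte sequences: that is all an encoding is.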

Now we can poke ‘UTF-8’, just a little.


– Python 2 default encoding is ASCII.

– For the source.

– For the variables.

– Hmm…

– Declare # -*- coding: utf-8 -*- when you want to deal with utf-8 in the source file.

– Set the default encoding when you want to call unicode and all that good stuff.

– Call string.decode('utf-8') or unicode(string, 'utf-8') when you want to convert a byte string to a unicode object.

  • A string of ASCII text is also valid UTF-8 text.
  • UTF-8 is fairly compact; most commonly used characters are turned into one or two bytes, and values less than 128 occupy only a single byte.
  • If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize. It’s also unlikely that random 8-bit data will look like valid UTF-8.
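The three properties above can be checked with a few lines of code. This is a sketch under my own naming (is_continuation and next_start are hypothetical helpers, not library functions), and it runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

def is_continuation(byte):
    """UTF-8 continuation bytes always have the bit pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def next_start(data, pos):
    """Return the index of the first code point boundary at or after `pos`."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# 1. ASCII text is byte-for-byte identical in UTF-8.
assert u"hello".encode("ascii") == u"hello".encode("utf-8")

# 2. The encoded length grows with the code point value.
samples = (u"A", u"Б", u"€", u"\U0001F600")
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]

# 3. Resynchronize after losing the first byte of a two-byte character.
data = bytearray(u"Бa".encode("utf-8"))  # d0 91 61
damaged = data[1:]                       # starts mid-character: 91 61
start = next_start(damaged, 0)
recovered = damaged[start:].decode("utf-8")
print(recovered)  # a
```

The damaged character is lost, but everything after it can still be decoded, which is exactly the resynchronization property.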
