Unicode for Pentesters – The Basics

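It is often possible to bypass WAFs or input filters by abusing Unicode normalization: the filter inspects the raw input, but the application later normalizes or decodes it into the characters the filter was meant to block. There are many ways of writing Unicode characters that I’m sure you’ve seen before: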
%u0048
\u0048
%E2%99%A5
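To make those three notations concrete, here is a minimal Python sketch that decodes each one, followed by a toy normalization bypass. The helper names and the `naive_filter_allows` check are hypothetical and only for illustration; a real WAF is far more involved, and the `\u` helper ignores surrogate pairs.

```python
import re
import unicodedata
from urllib.parse import unquote


def decode_percent_u(text: str) -> str:
    # Non-standard %uXXXX form: four hex digits naming one code point.
    return re.sub(r"%u([0-9A-Fa-f]{4})", lambda m: chr(int(m.group(1), 16)), text)


def decode_backslash_u(text: str) -> str:
    # Backslash-u form used in JavaScript and JSON string literals.
    return re.sub(r"\\u([0-9A-Fa-f]{4})", lambda m: chr(int(m.group(1), 16)), text)


def decode_percent_utf8(text: str) -> str:
    # Standard percent-encoding, with the decoded bytes read as UTF-8.
    return unquote(text, encoding="utf-8")


def naive_filter_allows(payload: str) -> bool:
    # Hypothetical filter: only looks for a literal ASCII '<'.
    return "<" not in payload


print(decode_percent_u("%u0048"))        # H
print(decode_backslash_u(r"\u0048"))     # H
print(decode_percent_utf8("%E2%99%A5"))  # the heart character, U+2665

# Fullwidth '<' (U+FF1C) and '>' (U+FF1E) are not the ASCII characters...
payload = "\uFF1Cscript\uFF1E"
print(naive_filter_allows(payload))      # True

# ...but NFKC normalization folds them into ASCII '<' and '>'.
normalized = unicodedata.normalize("NFKC", payload)
print(normalized)                        # <script>
```

If the application normalizes after the filter has run, the payload that passed the check is no longer the payload that gets used.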

But how does Unicode actually work, and what do these notations mean?

The ‘Joel on Software’ blog post on Unicode, which I highly recommend for getting a basic idea of the subject, covers some common Unicode misconceptions. In essence, characters no longer work the old ASCII way, where each character is a 7-bit value from 0 to 127 stored in a single byte. Unicode instead assigns every character a ‘code point’, which is just a number; it says nothing about how that number is stored in memory, because that’s what encoding is for. So Hello, in Unicode, is:

U+0048 U+0065 U+006C U+006C U+006F
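As a quick check, Python’s built-in `ord` returns the code point of a character, so the list above can be reproduced in a couple of lines. The `code_points` helper is my own name for illustration.

```python
def code_points(text: str) -> list[str]:
    # Format each character's code point in the usual U+XXXX notation.
    return [f"U+{ord(ch):04X}" for ch in text]


print(" ".join(code_points("Hello")))
# U+0048 U+0065 U+006C U+006C U+006F
```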

Note that these are just points in the already defined Unicode map (see http://www.unicode.org/). No mapping between characters and bytes in memory has happened yet, because that’s what encoding is for.

A common Unicode misconception is that Unicode is simply a 16-bit code with 65,536 possible characters. This is not correct: code points run from U+0000 to U+10FFFF, which is far more than 16 bits can hold. The misconception makes sense, as it seems like Unicode just adds an extra byte to each character, but it does not work like that.
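A small sketch of why the 16-bit picture breaks down: the emoji U+1F600 is picked here only as a convenient example of a character above U+FFFF. Its code point does not fit in 16 bits, so UTF-16 has to spend two 16-bit units (a surrogate pair) on it.

```python
import sys

grin = "\U0001F600"  # GRINNING FACE, chosen as an example above U+FFFF

print(hex(ord(grin)))        # 0x1f600
print(ord(grin) > 0xFFFF)    # True: does not fit in 16 bits
print(hex(sys.maxunicode))   # 0x10ffff, the highest code point

# In UTF-16 the character takes two 16-bit units, i.e. four bytes.
utf16 = grin.encode("utf-16-be")
print(len(utf16), utf16.hex())  # 4 d83dde00
```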

UTF-8 is the dominant encoding on the web, and can be described as follows:

“UTF-8 encodes each Unicode character as a variable number of 1 to 4 octets, where the number of octets depends on the integer value assigned to the Unicode character. It is an efficient encoding of Unicode documents that use mostly US-ASCII characters because it represents each character in the range U+0000 through U+007F as a single octet.”

  • http://www.utf-8.com/
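The variable length described in that quote is easy to see by encoding a few characters. This is only a sketch with sample characters of my own choosing, one for each length from 1 to 4 octets; `utf8_octets` is a hypothetical helper name.

```python
def utf8_octets(ch: str) -> str:
    # Return the UTF-8 encoding of one character as spaced uppercase hex.
    return ch.encode("utf-8").hex(" ").upper()


# 'H', e-acute, the heart from %E2%99%A5, and an emoji above U+FFFF
samples = ["H", "\u00E9", "\u2665", "\U0001F600"]

for ch in samples:
    octets = utf8_octets(ch)
    count = len(octets.split())
    print(f"U+{ord(ch):04X} -> {count} octet(s): {octets}")
# U+0048 -> 1 octet(s): 48
# U+00E9 -> 2 octet(s): C3 A9
# U+2665 -> 3 octet(s): E2 99 A5
# U+1F600 -> 4 octet(s): F0 9F 98 80
```

Note that the three octets for U+2665 are exactly the `%E2%99%A5` form from the start of the post, just percent-encoded.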

To be continued.

References: