
UTF-8 and Byte Order Marks


The problem

Creating a file of UTF-8 data to pass to a hash or signature function in the CryptoSys PKI Toolkit is tricky. The "text" file containing the input must contain exactly the correct bytes, with no Byte Order Mark (BOM) at the start and no trailing CR-LF characters at the end. UTF-8 files created by some applications (like .NET) may have these additional bytes added to them whether you want them or not. If there are any additional bytes, even just one, the hash or signature will be wrong!

Worse, when you open these files in a UTF-8-aware text editor, you won't see these extra bytes, because the editor is expecting them and doesn't show them. And if you open and then save, your editor may add these extra bytes without telling you. Windows Notepad does this, for example.

Hexdump utility

Here is a simple command-line program based on the hexdump command in Linux. We will use it to examine the data files.

hexdump for Windows

Download the EXE file and put it in a directory that Windows will search. Open a command prompt window (Start > Run > cmd) or (Start > Programs > Accessories > Command Prompt).
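If the Windows utility is not to hand, a similar listing can be approximated in a few lines of Python. This is only a minimal sketch, not the hexdump program itself: the function name `hexdump_lines` is made up for illustration, and the layout simply imitates the listings on this page (six-digit hex offset, sixteen bytes per row, bytes outside printable ASCII shown as dots).

```python
def hexdump_lines(data: bytes, width: int = 16):
    """Yield rows imitating the listings on this page (hypothetical helper)."""
    pad = width * 3 - 1  # width of a full hex column: 2 digits per byte plus separators
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Bytes outside printable ASCII (0x20..0x7e) are shown as dots.
        text_part = "".join(chr(b) if 0x20 <= b <= 0x7e else "." for b in chunk)
        # Pad the hex column so the text column lines up on a short last row.
        yield f"{offset:06x}  {hex_part:<{pad}}  {text_part}"


def dump_file(path: str) -> None:
    """Print a dump of the file at `path`, read in binary mode so no byte is altered."""
    with open(path, "rb") as f:
        for row in hexdump_lines(f.read()):
            print(row)
```

Calling `dump_file("Muestra-v2_PipedString-UTF8.txt")` from the directory holding the example file should produce rows in the same shape as the examples that follow. Opening the file in binary mode matters: text mode would decode the bytes and could hide the very differences you are looking for.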

Example 1: a correct UTF-8 file

> hexdump -C Muestra-v2_PipedString-UTF8.txt

Note the "-C" option (that's a capital letter C) and enclose the filename in quotes if it has spaces in it, e.g. "file name". These three example files can be downloaded as a zipped file (1.6 kB).

000000  7c 7c 32 2e 30 7c 41 7c 31 7c 32 30 30 39 2d 30  ||2.0|A|1|2009-0
000010  38 2d 31 36 54 31 36 3a 33 30 3a 30 30 7c 31 7c  8-16T16:30:00|1|
000020  32 30 30 39 7c 69 6e 67 72 65 73 6f 7c 55 6e 61  2009|ingreso|Una
000030  20 73 6f 6c 61 20 65 78 68 69 62 69 63 69 c3 b3   sola exhibici..
000040  6e 7c 33 35 30 2e 30 30 7c 35 2e 32 35 7c 33 39  n|350.00|5.25|39
000050  37 2e 32 35 7c 49 53 50 39 30 30 39 30 39 51 38  7.25|ISP900909Q8
000060  38 7c 49 6e 64 75 73 74 72 69 61 73 20 64 65 6c  8|Industrias del
000070  20 53 75 72 20 50 6f 6e 69 65 6e 74 65 2c 20 53   Sur Poniente, S
000080  2e 41 2e 20 64 65 20 43 2e 56 2e 7c 41 6c 76 61  .A. de C.V.|Alva
000090  72 6f 20 4f 62 72 65 67 c3 b3 6e 7c 33 37 7c 33  ro Obreg..n|37|3
0000a0  7c 43 6f 6c 2e 20 52 6f 6d 61 20 4e 6f 72 74 65  |Col. Roma Norte
0000b0  7c 4d c3 a9 78 69 63 6f 7c 43 75 61 75 68 74 c3  |M..xico|Cuauht.
0000c0  a9 6d 6f 63 7c 44 69 73 74 72 69 74 6f 20 46 65  .moc|Distrito Fe
0000d0  64 65 72 61 6c 7c 4d c3 a9 78 69 63 6f 7c 30 36  deral|M..xico|06
0000e0  37 30 30 7c 50 69 6e 6f 20 53 75 61 72 65 7a 7c  700|Pino Suarez|
0000f0  32 33 7c 43 65 6e 74 72 6f 7c 4d 6f 6e 74 65 72  23|Centro|Monter
000100  72 65 79 7c 4d 6f 6e 74 65 72 72 65 79 7c 4e 75  rey|Monterrey|Nu
000110  65 76 6f 20 4c c3 a9 6f 6e 7c 4d c3 a9 78 69 63  evo L..on|M..xic
000120  6f 7c 39 35 34 36 30 7c 43 41 55 52 33 39 30 33  o|95460|CAUR3903
000130  31 32 53 38 37 7c 52 6f 73 61 20 4d 61 72 c3 ad  12S87|Rosa Mar..
000140  61 20 43 61 6c 64 65 72 c3 b3 6e 20 55 72 69 65  a Calder..n Urie
000150  67 61 73 7c 54 6f 70 6f 63 68 69 63 6f 7c 35 32  gas|Topochico|52
000160  7c 4a 61 72 64 69 6e 65 73 20 64 65 6c 20 56 61  |Jardines del Va
000170  6c 6c 65 7c 4d 6f 6e 74 65 72 72 65 79 7c 4d 6f  lle|Monterrey|Mo
000180  6e 74 65 72 72 65 79 7c 4e 75 65 76 6f 20 4c 65  nterrey|Nuevo Le
000190  c3 b3 6e 7c 4d c3 a9 78 69 63 6f 7c 39 35 34 36  ..n|M..xico|9546
0001a0  35 7c 31 30 7c 43 61 6a 61 7c 56 61 73 6f 73 20  5|10|Caja|Vasos 
0001b0  64 65 63 6f 72 61 64 6f 73 7c 32 30 2e 30 30 7c  decorados|20.00|
0001c0  32 30 30 7c 31 7c 70 69 65 7a 61 7c 43 68 61 72  200|1|pieza|Char
0001d0  6f 6c 61 20 6d 65 74 c3 a1 6c 69 63 61 7c 31 35  ola met..lica|15
0001e0  30 2e 30 30 7c 31 35 30 7c 49 56 41 7c 31 35 2e  0.00|150|IVA|15.
0001f0  30 30 7c 35 32 2e 35 30 7c 7c                    00|52.50||

Note that

  1. The very first two characters are the pipe symbol '|' with hex code 0x7c.
  2. The very last two characters are the pipe symbol '|' with hex code 0x7c.
  3. Words which contain accented characters like "México" on the 12th line (0000b0) use two bytes for the accented letter, here 0xc3 and 0xa9 for é. These bytes are shown as dots in the ASCII dump on the right-hand side. This means the data is in UTF-8 encoding, which is what we want.
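The three observations above can also be checked programmatically. The following is a rough sketch under the assumption that the required data always begins and ends with `||`, as in this example; the function name `check_piped_utf8` is hypothetical and not part of any toolkit.

```python
def check_piped_utf8(data: bytes) -> list:
    """Return a list of problems found; an empty list means the bytes look right."""
    problems = []
    if not data.startswith(b"||"):
        problems.append("does not start with '||' (0x7c 0x7c)")
    if not data.endswith(b"||"):
        problems.append("does not end with '||' (0x7c 0x7c)")
    try:
        # Strict decoding rejects a lone Latin-1 byte such as 0xe9.
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"not valid UTF-8 at byte offset {exc.start}")
    return problems
```

A file that passes all three checks returns an empty list. Note that a leading BOM is itself valid UTF-8, so it is caught by the first check rather than the third.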

Example 2: a file in ISO-8859-1 (Latin-1) encoding

> hexdump -C Muestra-v2_PipedString-Latin1.txt
000000  7c 7c 32 2e 30 7c 41 7c 31 7c 32 30 30 39 2d 30  ||2.0|A|1|2009-0
000010  38 2d 31 36 54 31 36 3a 33 30 3a 30 30 7c 31 7c  8-16T16:30:00|1|
000020  32 30 30 39 7c 69 6e 67 72 65 73 6f 7c 55 6e 61  2009|ingreso|Una
000030  20 73 6f 6c 61 20 65 78 68 69 62 69 63 69 f3 6e   sola exhibici.n
000040  7c 33 35 30 2e 30 30 7c 35 2e 32 35 7c 33 39 37  |350.00|5.25|397
000050  2e 32 35 7c 49 53 50 39 30 30 39 30 39 51 38 38  .25|ISP900909Q88
000060  7c 49 6e 64 75 73 74 72 69 61 73 20 64 65 6c 20  |Industrias del 
000070  53 75 72 20 50 6f 6e 69 65 6e 74 65 2c 20 53 2e  Sur Poniente, S.
000080  41 2e 20 64 65 20 43 2e 56 2e 7c 41 6c 76 61 72  A. de C.V.|Alvar
000090  6f 20 4f 62 72 65 67 f3 6e 7c 33 37 7c 33 7c 43  o Obreg.n|37|3|C
0000a0  6f 6c 2e 20 52 6f 6d 61 20 4e 6f 72 74 65 7c 4d  ol. Roma Norte|M
0000b0  e9 78 69 63 6f 7c 43 75 61 75 68 74 e9 6d 6f 63  .xico|Cuauht.moc
0000c0  7c 44 69 73 74 72 69 74 6f 20 46 65 64 65 72 61  |Distrito Federa
0000d0  6c 7c 4d e9 78 69 63 6f 7c 30 36 37 30 30 7c 50  l|M.xico|06700|P
0000e0  69 6e 6f 20 53 75 61 72 65 7a 7c 32 33 7c 43 65  ino Suarez|23|Ce
0000f0  6e 74 72 6f 7c 4d 6f 6e 74 65 72 72 65 79 7c 4d  ntro|Monterrey|M
000100  6f 6e 74 65 72 72 65 79 7c 4e 75 65 76 6f 20 4c  onterrey|Nuevo L
000110  e9 6f 6e 7c 4d e9 78 69 63 6f 7c 39 35 34 36 30  .on|M.xico|95460
000120  7c 43 41 55 52 33 39 30 33 31 32 53 38 37 7c 52  |CAUR390312S87|R
000130  6f 73 61 20 4d 61 72 ed 61 20 43 61 6c 64 65 72  osa Mar.a Calder
000140  f3 6e 20 55 72 69 65 67 61 73 7c 54 6f 70 6f 63  .n Uriegas|Topoc
000150  68 69 63 6f 7c 35 32 7c 4a 61 72 64 69 6e 65 73  hico|52|Jardines
000160  20 64 65 6c 20 56 61 6c 6c 65 7c 4d 6f 6e 74 65   del Valle|Monte
000170  72 72 65 79 7c 4d 6f 6e 74 65 72 72 65 79 7c 4e  rrey|Monterrey|N
000180  75 65 76 6f 20 4c 65 f3 6e 7c 4d e9 78 69 63 6f  uevo Le.n|M.xico
000190  7c 39 35 34 36 35 7c 31 30 7c 43 61 6a 61 7c 56  |95465|10|Caja|V
0001a0  61 73 6f 73 20 64 65 63 6f 72 61 64 6f 73 7c 32  asos decorados|2
0001b0  30 2e 30 30 7c 32 30 30 7c 31 7c 70 69 65 7a 61  0.00|200|1|pieza
0001c0  7c 43 68 61 72 6f 6c 61 20 6d 65 74 e1 6c 69 63  |Charola met.lic
0001d0  61 7c 31 35 30 2e 30 30 7c 31 35 30 7c 49 56 41  a|150.00|150|IVA
0001e0  7c 31 35 2e 30 30 7c 35 32 2e 35 30 7c 7c        |15.00|52.50||

This time, words with accented characters like "México" are shown with only one byte for the letter é, with hex value 0xe9. This is not what we want. The file is in ISO-8859-1 (Latin-1) encoding and you need to convert it to UTF-8; otherwise your hash values and signatures will be wrong.
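The conversion itself amounts to decoding the bytes as Latin-1 and re-encoding the resulting text as UTF-8. Here is a minimal sketch, assuming the input really is ISO-8859-1; the helper name `latin1_to_utf8` is made up for illustration.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8 (assumes the input really is Latin-1)."""
    return data.decode("latin-1").encode("utf-8")


sample = b"|M\xe9xico|"             # 0xe9 is the single Latin-1 byte for e-acute
converted = latin1_to_utf8(sample)
print(converted.hex(" "))           # 7c 4d c3 a9 78 69 63 6f 7c
```

Be careful to apply this only once. Latin-1 decoding accepts any byte sequence, so running it on data that is already UTF-8 will not fail; it will silently produce doubly encoded output.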

Example 3: a UTF-8 file with a Byte Order Mark and extra lines

> hexdump -C Muestra-v2_PipedString-UTF8-BOM.txt
000000  ef bb bf 7c 7c 32 2e 30 7c 41 7c 31 7c 32 30 30  ...||2.0|A|1|200
000010  39 2d 30 38 2d 31 36 54 31 36 3a 33 30 3a 30 30  9-08-16T16:30:00
000020  7c 31 7c 32 30 30 39 7c 69 6e 67 72 65 73 6f 7c  |1|2009|ingreso|
000030  55 6e 61 20 73 6f 6c 61 20 65 78 68 69 62 69 63  Una sola exhibic
000040  69 c3 b3 6e 7c 33 35 30 2e 30 30 7c 35 2e 32 35  i..n|350.00|5.25
000050  7c 33 39 37 2e 32 35 7c 49 53 50 39 30 30 39 30  |397.25|ISP90090
000060  39 51 38 38 7c 49 6e 64 75 73 74 72 69 61 73 20  9Q88|Industrias 
000070  64 65 6c 20 53 75 72 20 50 6f 6e 69 65 6e 74 65  del Sur Poniente
000080  2c 20 53 2e 41 2e 20 64 65 20 43 2e 56 2e 7c 41  , S.A. de C.V.|A
000090  6c 76 61 72 6f 20 4f 62 72 65 67 c3 b3 6e 7c 33  lvaro Obreg..n|3
0000a0  37 7c 33 7c 43 6f 6c 2e 20 52 6f 6d 61 20 4e 6f  7|3|Col. Roma No
0000b0  72 74 65 7c 4d c3 a9 78 69 63 6f 7c 43 75 61 75  rte|M..xico|Cuau
0000c0  68 74 c3 a9 6d 6f 63 7c 44 69 73 74 72 69 74 6f  ht..moc|Distrito
0000d0  20 46 65 64 65 72 61 6c 7c 4d c3 a9 78 69 63 6f   Federal|M..xico
0000e0  7c 30 36 37 30 30 7c 50 69 6e 6f 20 53 75 61 72  |06700|Pino Suar
0000f0  65 7a 7c 32 33 7c 43 65 6e 74 72 6f 7c 4d 6f 6e  ez|23|Centro|Mon
000100  74 65 72 72 65 79 7c 4d 6f 6e 74 65 72 72 65 79  terrey|Monterrey
000110  7c 4e 75 65 76 6f 20 4c c3 a9 6f 6e 7c 4d c3 a9  |Nuevo L..on|M..
000120  78 69 63 6f 7c 39 35 34 36 30 7c 43 41 55 52 33  xico|95460|CAUR3
000130  39 30 33 31 32 53 38 37 7c 52 6f 73 61 20 4d 61  90312S87|Rosa Ma
000140  72 c3 ad 61 20 43 61 6c 64 65 72 c3 b3 6e 20 55  r..a Calder..n U
000150  72 69 65 67 61 73 7c 54 6f 70 6f 63 68 69 63 6f  riegas|Topochico
000160  7c 35 32 7c 4a 61 72 64 69 6e 65 73 20 64 65 6c  |52|Jardines del
000170  20 56 61 6c 6c 65 7c 4d 6f 6e 74 65 72 72 65 79   Valle|Monterrey
000180  7c 4d 6f 6e 74 65 72 72 65 79 7c 4e 75 65 76 6f  |Monterrey|Nuevo
000190  20 4c 65 c3 b3 6e 7c 4d c3 a9 78 69 63 6f 7c 39   Le..n|M..xico|9
0001a0  35 34 36 35 7c 31 30 7c 43 61 6a 61 7c 56 61 73  5465|10|Caja|Vas
0001b0  6f 73 20 64 65 63 6f 72 61 64 6f 73 7c 32 30 2e  os decorados|20.
0001c0  30 30 7c 32 30 30 7c 31 7c 70 69 65 7a 61 7c 43  00|200|1|pieza|C
0001d0  68 61 72 6f 6c 61 20 6d 65 74 c3 a1 6c 69 63 61  harola met..lica
0001e0  7c 31 35 30 2e 30 30 7c 31 35 30 7c 49 56 41 7c  |150.00|150|IVA|
0001f0  31 35 2e 30 30 7c 35 32 2e 35 30 7c 7c 0d 0a 0d  15.00|52.50||...
000200  0a 0d 0a 0d 0a 0d 0a 0d 0a 0d 0a                 ...........

This file is in UTF-8 encoding (see the double dots in "M..xico") but

  1. The first three bytes are 0xef, 0xbb, 0xbf. Together these form the UTF-8 Byte Order Mark, which indicates to an application reading the file that the data following is UTF-8 encoded.
  2. At the end of the file is a sequence of 0x0d, 0x0a, 0x0d, 0x0a, ... bytes. These are CR-LF pairs (newline characters) giving the file a few extra "lines" at the end.
  3. The data, therefore, does not begin and end with two '|' symbols, as required.

If you compute the hash value or signature on this file, it will be wrong, because the extra bytes before and after the required data change the result.
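A quick way to see whether a file carries these extra bytes is to inspect its first three bytes and count any CR or LF bytes at the end. The sketch below illustrates this; `describe_extra_bytes` is a hypothetical helper name, and it assumes the required data itself never ends in a newline, as is the case for the piped string here.

```python
import codecs


def describe_extra_bytes(data: bytes) -> dict:
    """Report a leading UTF-8 BOM and the number of trailing CR/LF bytes."""
    has_bom = data.startswith(codecs.BOM_UTF8)  # codecs.BOM_UTF8 is b"\xef\xbb\xbf"
    stripped = data.rstrip(b"\r\n")             # removes any run of 0x0d / 0x0a at the end
    return {
        "bom": has_bom,
        "trailing_newline_bytes": len(data) - len(stripped),
    }


report = describe_extra_bytes(b"\xef\xbb\xbf||2.0|A||\r\n\r\n")
print(report)  # {'bom': True, 'trailing_newline_bytes': 4}
```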

Byte Order Marks

A Byte Order Mark (BOM) is used in Unicode to indicate the "endianness" of the data. This is useful for UTF-16 (an extension of the older UCS-2), where each character is stored as a 16-bit unit of two bytes, or as a pair of such units (four bytes) for the less common characters; in practice you will almost always see just two bytes. Different computers store each pair of bytes in a different order (big-endian or little-endian) depending on their architecture. The BOM character for UTF-16 is U+FEFF, which is stored as (0xFE, 0xFF) in big-endian order or (0xFF, 0xFE) in little-endian order. These may show up as þÿ or ÿþ respectively when you view the file.

The BOM for UTF-8 has three bytes 0xef, 0xbb, 0xbf and may show up as ï»¿ when you view the file or as ´╗┐ on the command-line console. The use of a BOM for UTF-8 is not recommended. It does not give any indication about byte order (despite its name) and UTF-8 data can be detected by a simple test anyway. Even so, Windows Notepad will add this BOM when it saves.
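All of these byte sequences are the same character, U+FEFF, written in different encodings. The following short sketch makes that concrete using only Python's standard codecs; the variable names are illustrative.

```python
import codecs

BOM_CHAR = "\ufeff"  # the Unicode character used as a byte order mark

utf16_be = BOM_CHAR.encode("utf-16-be")  # bytes fe ff
utf16_le = BOM_CHAR.encode("utf-16-le")  # bytes ff fe
utf8 = BOM_CHAR.encode("utf-8")          # bytes ef bb bf

for name, raw in [("UTF-16 BE", utf16_be), ("UTF-16 LE", utf16_le), ("UTF-8", utf8)]:
    print(f"{name:10} {raw.hex(' ')}")

# Decoding as Latin-1 mimics a viewer that misreads the BOM as three text characters.
misread = utf8.decode("latin-1")

# The "utf-8-sig" codec drops a leading BOM when decoding, if one is present.
text = (codecs.BOM_UTF8 + b"||2.0||").decode("utf-8-sig")
print(text)  # ||2.0||
```

The `utf-8-sig` codec is convenient when you want the text, but remember that a hash or signature is computed over bytes, so the BOM has to be absent from the bytes actually passed to the function.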

The problem for hash values and signatures

If you are using the tools from CryptoSys PKI to read in data from a file, the bytes will be read exactly as they are. The toolkit does not "test" for BOMs and ignore them. This is by design. Any message digest hash values or signatures computed from data with these extra bytes will be wrong.

The correct MD5 hash of the data in the first example above is (0x)4CD8ED248D7A02314C50778A37D1522D.
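To see how sensitive the digest is, the sketch below hashes a short stand-in string with and without the extra bytes, using Python's standard `hashlib` rather than the CryptoSys PKI functions. The sample string is an assumption for illustration only and is not the full example data, so its digest will not match the value quoted above.

```python
import codecs
import hashlib

# Short stand-in for the piped string (NOT the full example data).
clean = "||2.0|A|M\u00e9xico||".encode("utf-8")
with_extras = codecs.BOM_UTF8 + clean + b"\r\n"  # BOM in front, one CR-LF behind

digest_clean = hashlib.md5(clean).hexdigest().upper()
digest_extras = hashlib.md5(with_extras).hexdigest().upper()

print(digest_clean)
print(digest_extras)
print(digest_clean == digest_extras)  # False
```

The two digests have nothing in common, even though a UTF-8-aware editor would display both inputs identically.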

To fix

To fix, either use a hex editor to make sure the file is correct (we use Frhed Free Hex Editor) or take extra care when creating your text files and check with the hexdump utility before using.

If you use Notepad++ then open your text file and use the menu options Encoding > Convert to UTF-8 without BOM then save again.
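The same clean-up can be scripted. This is a minimal sketch, assuming the required data never legitimately ends in CR or LF bytes (true for the piped string above); the function names `strip_extra_bytes` and `clean_file` are hypothetical.

```python
import codecs


def strip_extra_bytes(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM and any trailing CR/LF bytes."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.rstrip(b"\r\n")


def clean_file(src_path: str, dst_path: str) -> None:
    """Write a cleaned copy of src_path to dst_path, working on raw bytes throughout."""
    with open(src_path, "rb") as f:
        data = f.read()
    with open(dst_path, "wb") as f:
        f.write(strip_extra_bytes(data))
```

Writing to a separate output file keeps the original intact, so you can compare the two with the hexdump utility before using the cleaned version.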


Alternatively, rather than reading from a file, put your data in a string and pass that. In VB6, replace this

strDataFile = "Muestra-v2_PipedString-UTF8.txt"
ReDim abDigest(PKI_MD5_BYTES - 1)
nRet = HASH_File(abDigest(0), PKI_MD5_BYTES, strDataFile, PKI_HASH_MD5)

with this

strData = "||2.0|A|1|2009-08-16T16:30:00 ... IVA|15.00|52.50||"
' Convert string to UTF-8.
nLen = CNV_UTF8FromLatin1("", 0, strData)
strDataUTF8 = String(nLen, " ")
nLen = CNV_UTF8FromLatin1(strDataUTF8, Len(strDataUTF8), strData)
' Convert string to bytes
abData = StrConv(strDataUTF8, vbFromUnicode)
' Compute the MD5 hash value
ReDim abDigest(PKI_MD5_BYTES - 1)
nRet = HASH_Bytes(abDigest(0), PKI_MD5_BYTES, abData(0), UBound(abData) + 1, PKI_HASH_MD5)

In VB.NET, it's much simpler:

strData = "||2.0|A|1|2009-08-16T16:30:00 ... IVA|15.00|52.50||"
' Convert string to bytes in UTF-8 encoding
abData = System.Text.Encoding.UTF8.GetBytes(strData)
abDigest = Hash.BytesFromBytes(abData, HashAlgorithm.Md5)

See also Creating the message digest of the piped-string on the SAT Mexico page.

Contact

For more information or to comment on this page, please send us a message.

This page last updated 10 September 2025