SMS 7-bit Text Encoding
Over the last couple of weeks, I’ve been working on firmware that responds to commands sent over LTE text messages. Although the LTE stuff in my project is all done using a SIMCom SIM7670G modem over its AT interface, and that interface can nominally handle encoding and decoding SMS “user data” (the text in the text message) using something like AT+CMGS=...Hello world!, I’ve wound up implementing the encoding/decoding of the Protocol Data Units as specified in 3GPP TS 23.040, so the interface looks more like AT+CMGS=...C8329BFD06DDDF72363904. In this post, I’ll describe how this all works, at least to the extent that I now understand it!
A couple notes:
- I more-or-less interchangeably use the words “bytes” and “octets”.
- The 3GPP specs are available online, but they’re quite dense and long (the relevant one for SMS is a couple hundred pages), so I mostly link to Wikipedia.
As in other aspects of telephony, there are a few layers of history involved, and the complexity comes mainly from the points of contact between those layers. The oldest layer here, I think, comes from GSM in the early ’90s. It’s the “default 7-bit alphabet” GSM 03.38 / 3GPP 23.038, encoded with a 7-bit scheme. Up to 140 bytes can fit in the user data of an SMS message; using this scheme, the character limit is:
140 bytes * 8 bits/byte = 1120 bits; 1120 bits / 7 bits/character = 160 characters
The basic alphabet has enough characters for English (significantly overlapping with ASCII) and some support for Western European languages. In Rust source it might look like:
#[rustfmt::skip]
static GSM7_CHARSET: [char; 128] = [
'@', '£', '$', '¥', 'è', 'é', 'ù', 'ì', 'ò', 'Ç', '\n', 'Ø', 'ø', '\r', 'Å', 'å',
'Δ', '_', 'Φ', 'Γ', 'Λ', 'Ω', 'Π', 'Ψ', 'Σ', 'Θ', 'Ξ', '\x1B', 'Æ', 'æ', 'ß', 'É',
' ', '!', '"', '#', '¤', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/',
'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '?',
'¡', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O',
'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'Ä', 'Ö', 'Ñ', 'Ü', '§',
'¿', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o',
'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ä', 'ö', 'ñ', 'ü', 'à',
];
To encode a message, each character is first converted to a number using the alphabet, e.g. ‘¥’ is index 3 in the table above. In source code, maybe that’s:
/// Map a char to a 7b value using GSM 03.38 / 3GPP 23.038 default alphabet
fn char_to_septet(input: char) -> u8 {
    for (i, c) in GSM7_CHARSET.iter().enumerate() {
        if *c == input {
            return i as u8;
        }
    }
    panic!("input is not in GSM7_CHARSET");
}
Since the alphabet has 128 characters (indices 0 through 127), the high bit in an 8-bit numeric representation is always 0. The 7-bit encoding drops this high bit and packs the remaining bits together least-significant first, using a scheme like this:
-------------------map to 7b integer using default alphabet-------------------
Character: 'H' | 'e' | 'l'
integer: 0x48 | 0x65 | 0x6c
bits: 1 0 0 1 0 0 0 | 1 1 0 0 1 0 1 | 1 1 0 1 1 0 0
: H6 H5 H4 H3 H2 H1 H0 | e6 e5 e4 e3 e2 e1 e0 | l6 l5 l4 l3 l2 l1 l0
-------------------pack 7b septets in to 8b octets aka bytes-------------------
: e0 H6 H5 H4 H3 H2 H1 H0 | l1 l0 e6 e5 e4 e3 e2 e1 | l2 l1 l0 l6
bits: 1 1 0 0 1 0 0 0 | 0 0 1 1 0 0 1 0 | 1 0 0 1
integer: 0xC8 | 0x32 | 0x9
A simple implementation might be done like this:
/// Convert a slice of 7-bit "septets" to 7-bit encoded bytes
fn to_7bit(input: &[u8]) -> Vec<u8> {
    let mut buf = 0u16;
    let mut buffered_bits = 0usize;
    let mut input = input.iter();
    let mut ret = Vec::new();
    loop {
        if buffered_bits < 8 {
            // The 3GPP spec calls the 7-bit representations "septets"
            if let Some(septet) = input.next() {
                buf |= (*septet as u16) << buffered_bits;
                buffered_bits += 7;
                println!(" Loaded septet 0x{septet:02X}, {buffered_bits:2}b buffered 0x{buf:04X}");
                continue;
            }
        }
        if buffered_bits == 0 {
            return ret;
        }
        let octet = (buf & 0xFF) as u8;
        ret.push(octet);
        buf >>= 8;
        buffered_bits = buffered_bits.saturating_sub(8);
        println!(" Output octet 0x{octet:02X}, {buffered_bits:2}b buffered 0x{buf:04X}");
    }
}

fn main() {
    let mut numeric = Vec::new();
    for c in "Hello world!".chars() {
        numeric.push(char_to_septet(c));
    }
    println!("Septets {numeric:X?}");
    let encoded = to_7bit(&numeric);
    println!("Octets {:X?}", encoded);
}
This program produces output:
Septets [48, 65, 6C, 6C, 6F, 20, 77, 6F, 72, 6C, 64, 21]
Loaded septet 0x48, 7b buffered 0x0048
Loaded septet 0x65, 14b buffered 0x32C8
Output octet 0xC8, 6b buffered 0x0032
Loaded septet 0x6C, 13b buffered 0x1B32
Output octet 0x32, 5b buffered 0x001B
Loaded septet 0x6C, 12b buffered 0x0D9B
Output octet 0x9B, 4b buffered 0x000D
Loaded septet 0x6F, 11b buffered 0x06FD
Output octet 0xFD, 3b buffered 0x0006
Loaded septet 0x20, 10b buffered 0x0106
Output octet 0x06, 2b buffered 0x0001
Loaded septet 0x77, 9b buffered 0x01DD
Output octet 0xDD, 1b buffered 0x0001
Loaded septet 0x6F, 8b buffered 0x00DF
Output octet 0xDF, 0b buffered 0x0000
Loaded septet 0x72, 7b buffered 0x0072
Loaded septet 0x6C, 14b buffered 0x3672
Output octet 0x72, 6b buffered 0x0036
Loaded septet 0x64, 13b buffered 0x1936
Output octet 0x36, 5b buffered 0x0019
Loaded septet 0x21, 12b buffered 0x0439
Output octet 0x39, 4b buffered 0x0004
Output octet 0x04, 0b buffered 0x0000
Octets [C8, 32, 9B, FD, 6, DD, DF, 72, 36, 39, 4]
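Unpacking is the mirror image of that loop. The following is only a minimal sketch and not the decoder used for the later examples in this post; the name `unpack_septets` and its signature are assumptions made for illustration. It takes the number of septets as an argument, because the octet count alone does not always say how many septets were packed.

```rust
/// Unpack `septet_count` septets from 7-bit packed octets.
/// (Hypothetical helper, written for illustration only.)
fn unpack_septets(octets: &[u8], septet_count: usize) -> Vec<u8> {
    let mut buf = 0u16;
    let mut buffered_bits = 0usize;
    let mut octets = octets.iter();
    let mut out = Vec::with_capacity(septet_count);
    while out.len() < septet_count {
        if buffered_bits < 7 {
            match octets.next() {
                Some(octet) => {
                    // New bits go above the ones still waiting in the buffer
                    buf |= (*octet as u16) << buffered_bits;
                    buffered_bits += 8;
                }
                // Ran out of data before reaching septet_count
                None => break,
            }
        }
        out.push((buf & 0x7F) as u8);
        buf >>= 7;
        buffered_bits -= 7;
    }
    out
}

fn main() {
    let octets: [u8; 11] = [
        0xC8, 0x32, 0x9B, 0xFD, 0x06, 0xDD, 0xDF, 0x72, 0x36, 0x39, 0x04,
    ];
    let septets = unpack_septets(&octets, 12);
    println!("Septets {septets:X?}");
}
```

Fed the octets from the encoder run above, this should give back the same twelve septets; turning them into text is then an index into GSM7_CHARSET for each value.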
However, of course it’s not quite that simple! One nuance and two complications arise.
The nuance is perhaps not necessary to consider deeply in the modern world, but it’s easy enough to understand, and to implement a probably-superfluous workaround for someone else not doing so a few decades ago… Every 7th encoded byte will have one bit from one character, and seven bits from the next. But, since there might not be a “next” character, the decoder can’t know exactly how many septets are encoded in a message just based on the number of octets. There is a length field in the PDU, which contains the number of septets in the message (when 7-bit encoding is used), so a good decoder will read that length field and know when to stop decoding septets. However, apparently the decoding algorithm on some phones wasn’t good at stopping on the specified septet, and in cases where the last byte had only one bit of valid data, that algorithm would interpret the 7 remaining bits as valid too! With the above encoding algorithm, the remaining 7 bits would be 0, so a text of “awesome” encoded using it might decode on these phones as “awesome@”.
To work around this, when the last encoded byte has only one valid bit, the remaining 7 are to be padded with carriage return, aka '\r' or 0x0D in our alphabet. This is just a matter of setting the octet local variable in our encoder like this:
let octet = if buffered_bits == 1 {
    // Workaround for poor decoders: keep the one valid bit (bit 0) and
    // fill bits 1..=7 with a CR septet (0x0D)
    (buf & 0xFF) as u8 | (0x0D << 1)
} else {
    (buf & 0xFF) as u8
};
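The same condition can also be stated in terms of the septet count: 7n bits leave exactly one bit in the last octet when n % 8 == 7. As a rough sketch under that observation, the padding could be applied as a separate step after packing. The helper names `pack` and `pad_with_cr` are assumptions for illustration, and `pack` is a compact variant of the encoder above without the debug output.

```rust
/// Pack septets into octets, least-significant bits first (no padding).
fn pack(septets: &[u8]) -> Vec<u8> {
    let mut buf = 0u16;
    let mut buffered_bits = 0usize;
    let mut out = Vec::new();
    for septet in septets {
        buf |= ((*septet & 0x7F) as u16) << buffered_bits;
        buffered_bits += 7;
        if buffered_bits >= 8 {
            out.push((buf & 0xFF) as u8);
            buf >>= 8;
            buffered_bits -= 8;
        }
    }
    // Flush the final partial octet, if there is one
    if buffered_bits > 0 {
        out.push((buf & 0xFF) as u8);
    }
    out
}

/// OR a CR septet into the 7 spare bits of the last octet, but only when
/// the last octet carries a single valid bit (septet count % 8 == 7).
fn pad_with_cr(packed: &mut [u8], septet_count: usize) {
    if septet_count % 8 == 7 {
        if let Some(last) = packed.last_mut() {
            *last |= 0x0D_u8 << 1;
        }
    }
}

fn main() {
    // "awesome" as septets from the default alphabet
    let septets: [u8; 7] = [0x61, 0x77, 0x65, 0x73, 0x6F, 0x6D, 0x65];
    let mut packed = pack(&septets);
    println!("Unpadded {packed:X?}");
    pad_with_cr(&mut packed, septets.len());
    println!("Padded   {packed:X?}");
}
```

For “awesome” the last octet changes from 0x01 to 0x1B, so a decoder that over-reads sees a trailing carriage return instead of ‘@’.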
The first complication is the simpler one: people needed a bigger alphabet.
As the Wikipedia page linked above describes, there are several language-specific alphabets in the 3GPP standard, which can be selected using data in the User Data Header (UDH - to be discussed below!); however, I’m not sure these are used in practice anymore. Experiments I’ve done so far with modern Android and iOS phones indicate the only 7b character set in use is the default one. Characters not in the default 7b alphabet are encoded with UTF-16, which might be discussed in a later post. However, all phones I’ve tested do use the default alphabet “extension”, which adds a few characters to the alphabet by prefixing them with a septet of the ESC code 0x1B. These new characters are encoded in two septets; ‘~’ for instance would be 0x1B, 0x3D. The above code might be extended to support the extension like:
#[rustfmt::skip]
static GSM7_CHARSET_EXTENSION: [(u8, char); 10] = [
(0x0A, '\x0C'),
(0x14, '^'),
(0x28, '{'),
(0x29, '}'),
(0x2F, '\\'),
(0x3C, '['),
(0x3D, '~'),
(0x3E, ']'),
(0x40, '|'),
(0x65, '€'),
];
/// Map a char to 7b value(s) using GSM 03.38 / 3GPP 23.038 default alphabet
fn char_to_septet(input: char) -> (u8, Option<u8>) {
    for (i, c) in GSM7_CHARSET.iter().enumerate() {
        if *c == input {
            return (i as u8, None);
        }
    }
    for (i, c) in GSM7_CHARSET_EXTENSION {
        if c == input {
            return (0x1b, Some(i));
        }
    }
    panic!("input is not in GSM7_CHARSET nor extension");
}

// Packing septets together is unchanged to support the extension
fn main() {
    let mut numeric = Vec::new();
    for c in "Hello {^~^}".chars() {
        let (first, second) = char_to_septet(c);
        numeric.push(first);
        if let Some(c) = second {
            numeric.push(c);
        }
    }
    println!("Septets {numeric:X?}");
    // Septets [48, 65, 6C, 6C, 6F, 20, 1B, 28, 1B, 14, 1B, 3D, 1B, 14, 1B, 29]
}
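One practical consequence is that the number of characters no longer equals the number of septets, which matters when checking a message against the 160-septet limit. A minimal sketch of such a count follows; `septet_len` and `GSM7_EXTENSION_CHARS` are names made up for this example, and it assumes every character is already known to be in either the default alphabet or the extension table.

```rust
/// Characters reachable only through the ESC (0x1B) extension table.
const GSM7_EXTENSION_CHARS: [char; 10] =
    ['\x0C', '^', '{', '}', '\\', '[', '~', ']', '|', '€'];

/// Count the septets a message occupies: two for each extension character
/// (ESC plus the code), one for everything else.
///
/// Assumes the text only contains characters from the default alphabet or
/// its extension; nothing is validated here.
fn septet_len(text: &str) -> usize {
    text.chars()
        .map(|c| if GSM7_EXTENSION_CHARS.contains(&c) { 2usize } else { 1 })
        .sum()
}

fn main() {
    for text in ["Hello world!", "Hello {^~^}"] {
        println!("{text:?} needs {} septets", septet_len(text));
    }
}
```

“Hello {^~^}” is 11 characters but 16 septets, matching the septet list printed by the program above.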
The bigger complication with 7-bit encoding is due to the User Data Header (UDH), which among other purposes can be used to keep track of the parts of messages that require >140B of user data (>160 characters, assuming none are from the extension). The UDH is included at the beginning of the user data itself, so if the UDH is 6B, then the message data itself can only use 140B - 6B = 134B. Perhaps splitting and re-combining longer SMS messages would make another post too!
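To put a number on that, here is a small sketch of the capacity arithmetic. The helper name `max_text_septets` is an assumption for illustration, and `udh_len` is taken to be the full size of the UDH in octets, including its own length octet.

```rust
/// Maximum user data size in a single SMS, in octets.
const MAX_USER_DATA_OCTETS: usize = 140;

/// Septets of text that fit alongside a UDH of `udh_len` octets.
/// (Hypothetical helper; `udh_len` includes the UDH's length octet.)
fn max_text_septets(udh_len: usize) -> usize {
    // Integer division rounds down, discarding leftover bits that are too
    // few to hold a whole septet.
    MAX_USER_DATA_OCTETS.saturating_sub(udh_len) * 8 / 7
}

fn main() {
    println!("no UDH: {} septets", max_text_septets(0));
    println!("6B UDH: {} septets", max_text_septets(6));
}
```

With no UDH this gives the familiar 160, and with a 6B UDH it gives 153 septets per message part.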
Encoding after the UDH would be simple enough if phones could start decoding the data after the UDH. But it appears that the UDH was added to the spec after there were phones in circulation that used the 7-bit encoding, and with this happening in the ’90s, those phones weren’t able to be updated. A decision was made to send SMS messages with UDHs in a way that these pre-UDH phones could still receive, though perhaps outputting a few garbage characters (decoding the UDH data as if it were 7-bit text) at the beginning of the message. To make this work, the 7-bit encoded data needs to be shifted so that it lines up with the decoding algorithm on these pre-UDH phones.
To illustrate this problem, a text “yes please” encodes as [F9, F2, 1C, 04, 67, 97, C3, F3, 32] using the above encoder. If a UDH consisting of [05, 00, 03, 42, 01, 02] (indicating concatenated SMS part 1 of 2 with 8-bit reference number 0x42) is prepended to those encoded bytes, a naive decoder completely fails to decode it:
println!("Octets {:X?}", encoded);
let decoded = from_7bit(&encoded, input_septets, 0);
println!("Decoded: {decoded}");
let udh = [0x05, 0x00, 0x03, 0x42, 0x01, 0x02];
let mut just_concat = Vec::new();
just_concat.extend_from_slice(&udh);
just_concat.extend_from_slice(&encoded);
println!("Concatenated: {just_concat:X?}");
let decoded = from_7bit(&just_concat, input_septets, 0);
println!("Decoded: {decoded}");
Gives output:
Septets [79, 65, 73, 20, 70, 6C, 65, 61, 73, 65]
Octets [F9, F2, 1C, 4, 67, 97, C3, F3, 32]
Decoded: yes please
Concatenated: [5, 0, 3, 42, 1, 2, F9, F2, 1C, 4, 67, 97, C3, F3, 32]
Decoded: é@øΔΛ¡¡ör9
However, overwriting the beginning of the encoded data with the UDH does get some of the input text through - because the 7-bit encoded data stream is synchronised with the naive decoder state:
for (i, d) in udh.iter().enumerate() {
    encoded[i] = *d;
}
println!("Overwritten: {encoded:X?}");
let decoded = from_7bit(&encoded, input_septets, 0);
println!("Decoded: {decoded}");
Output:
Octets [F9, F2, 1C, 4, 67, 97, C3, F3, 32]
Decoded: yes please
Overwritten: [5, 0, 3, 42, 1, 2, C3, F3, 32]
Decoded: é@øΔΛ¡¡ase
One solution is to prepend padding to the input, so that the first real character starts after the UDH bytes, and then overwrite the padding with the UDH as above, but this is a bit hacky:
let udh: [u8; 6] = [0x05, 0x00, 0x03, 0x42, 0x01, 0x02];
// Number of septets that the UDH will overwrite. Adding 6 causes the
// division by 7 to round up.
let pad_septets = ((udh.len() * 8) + 6) / 7;
let mut input = String::new();
for _ in 0..pad_septets {
    input.push(' ');
}
input.push_str("yes please");
// encoding works the same as before
let mut numeric = Vec::new();
for c in input.chars() {
    let (first, second) = char_to_septet(c);
    numeric.push(first);
    if let Some(second) = second {
        numeric.push(second);
    }
}
// Count septets rather than bytes of `input`: the two differ as soon as the
// text contains an extension or non-ASCII character.
let input_septets = numeric.len();
let mut encoded = to_7bit(&numeric);
println!("Octets {:X?}", encoded);
for (i, d) in udh.iter().enumerate() {
    encoded[i] = *d;
}
println!("Overwritten: {encoded:X?}");
let decoded = from_7bit(&encoded, input_septets, 0);
println!("Decoded: {decoded}");
Giving the desired behaviour:
Octets [20, 10, 8, 4, 2, 81, F2, E5, 39, 8, CE, 2E, 87, E7, 65]
Overwritten: [5, 0, 3, 42, 1, 2, F2, E5, 39, 8, CE, 2E, 87, E7, 65]
Decoded: é@øΔΛ¡@yes please
One reason I say this approach is “hacky” is that prepending the padding to the input will, in practice, usually involve a copy of the entire input string, more-or-less doubling the RAM requirement of this basic implementation and wasting some CPU cycles too. The naive concatenation approach doesn’t have that problem, but of course it didn’t work either. (A more advanced approach might work for streaming data, avoiding the need for the encoder to have the complete message in memory at once.)
Looking at the encoding algorithm, we can observe that after the first byte is output, there are 6 bits of buffered data; after the second byte is output, there are 5, etc down to 0 then the pattern restarts at 6. So, we can introduce a new argument to the encoder function, which makes it behave as if it had already output some number of bytes:
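/// Convert a slice of 7-bit "septets" to 7-bit encoded bytes
///
/// skip_bytes is used to preserve synchronization of the decoder when a User
/// Data Header (UDH) of length `skip_bytes` is at the beginning of the 7-bit
/// encoded data field.
fn to_7bit(input: &[u8], skip_bytes: usize) -> Vec<u8> {
    let mut buf = 0u16;
    // Bits left in the buffer after `skip_bytes` octets have been output:
    // 0, 6, 5, 4, 3, 2, 1, 0, 6, ... The outer `% 7` keeps this at 0 (not 7)
    // when skip_bytes is a multiple of 7, including when there is no UDH.
    let mut buffered_bits = (7 - (skip_bytes % 7)) % 7;
    // remainder of the method is unchanged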
This change lets us simply append the encoded data on to the UDH, in a way that will decode the message body:
let input = "yes please";
let mut numeric = Vec::new();
for c in input.chars() {
    let (first, second) = char_to_septet(c);
    numeric.push(first);
    if let Some(second) = second {
        numeric.push(second);
    }
}
let udh: [u8; 6] = [0x05, 0x00, 0x03, 0x42, 0x01, 0x02];
let encoded = to_7bit(&numeric, udh.len());
println!("Octets {:X?}", encoded);
// Calculate encoded_septets to account for both the UDH and the actual data
let encoded_septets = ((udh.len() * 8) + 6) / 7 + numeric.len();
let mut just_concat = Vec::new();
just_concat.extend_from_slice(&udh);
just_concat.extend_from_slice(&encoded);
println!("Concatenated: {just_concat:X?}");
let decoded = from_7bit(&just_concat, encoded_septets, 0);
println!("Decoded: {decoded}");
Output:
Octets [F2, E5, 39, 8, CE, 2E, 87, E7, 65]
Concatenated: [5, 0, 3, 42, 1, 2, F2, E5, 39, 8, CE, 2E, 87, E7, 65]
Decoded: é@øΔΛ¡@yes please
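For reference, a compact, self-contained version of an encoder with this behaviour might look like the sketch below. It is a restructured variant and not the exact code from the post: the names `fill_bits` and `pack_after_udh` are assumptions for illustration, the debug output and the CR workaround are left out, and the starting bit count is reduced modulo 7 so that a header length of 0 or 7 octets adds no padding.

```rust
/// Number of zero "fill" bits needed after a UDH of `udh_len` octets so
/// that the first text septet starts on a septet boundary.
fn fill_bits(udh_len: usize) -> usize {
    (7 - (udh_len % 7)) % 7
}

/// Pack septets so the result can be appended directly after a UDH of
/// `udh_len` octets. (Hypothetical name; no CR padding, no debug output.)
fn pack_after_udh(septets: &[u8], udh_len: usize) -> Vec<u8> {
    let mut buf = 0u16;
    // Start as if the fill bits were already sitting in the buffer
    let mut buffered_bits = fill_bits(udh_len);
    let mut out = Vec::new();
    for septet in septets {
        buf |= ((*septet & 0x7F) as u16) << buffered_bits;
        buffered_bits += 7;
        if buffered_bits >= 8 {
            out.push((buf & 0xFF) as u8);
            buf >>= 8;
            buffered_bits -= 8;
        }
    }
    // Flush a final partial octet, but only if any text was packed at all
    if buffered_bits > 0 && !septets.is_empty() {
        out.push((buf & 0xFF) as u8);
    }
    out
}

fn main() {
    // "yes please" as septets from the default alphabet
    let septets: [u8; 10] = [0x79, 0x65, 0x73, 0x20, 0x70, 0x6C, 0x65, 0x61, 0x73, 0x65];
    let udh: [u8; 6] = [0x05, 0x00, 0x03, 0x42, 0x01, 0x02];
    let mut user_data = udh.to_vec();
    user_data.extend(pack_after_udh(&septets, udh.len()));
    println!("Concatenated: {user_data:X?}");
}
```

With a 6B header there is one fill bit, and the packed text comes out as the same nine octets shown above; with a header length of 0 the result is identical to the plain encoder.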
The observant reader may have caught on to the last argument to the so-far-undisclosed decoder method; it’s called skip_bytes. Using that will result in what we actually want - a text message without the leading garbage characters!
// Read UDH length out of the User Data field; the first octet counts the
// UDH octets that follow it, so add 1 to include the length octet itself
let udh_len_bytes = just_concat[0] as usize + 1;
let udh_len_septets = ((udh_len_bytes * 8) + 6) / 7;
// Decode the data after the UDH
let decoded = from_7bit(
    &just_concat[udh_len_bytes..],
    encoded_septets - udh_len_septets,
    udh_len_bytes,
);
Output:
Decoded: yes please
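Because the text is aligned to septet boundaries counted from the start of the user data, another way to look at the decoding is that septet number i always starts at bit 7 * i, header or not. The sketch below uses that view to pull out just the text septets. It is not the from_7bit decoder used above; `septet_at` and `text_septets` are hypothetical names, and the length argument is assumed to be the septet count from the PDU covering header, fill bits and text together.

```rust
/// Read the 7-bit value that starts `bit_offset` bits into `data`.
fn septet_at(data: &[u8], bit_offset: usize) -> u8 {
    let byte = bit_offset / 8;
    let shift = bit_offset % 8;
    // Out-of-range octets read as 0 so a bad length cannot cause a panic
    let lo = data.get(byte).copied().unwrap_or(0) as u16;
    let hi = data.get(byte + 1).copied().unwrap_or(0) as u16;
    ((((hi << 8) | lo) >> shift) & 0x7F) as u8
}

/// Extract the text septets from user data that begins with a UDH.
/// `udl_septets` counts the UDH, the fill bits and the text together.
fn text_septets(user_data: &[u8], udl_septets: usize) -> Vec<u8> {
    let Some(&udhl) = user_data.first() else {
        return Vec::new();
    };
    // The first octet counts the header octets that follow it
    let udh_len_bytes = udhl as usize + 1;
    // Round up to whole septets; this is where the text begins
    let udh_len_septets = (udh_len_bytes * 8 + 6) / 7;
    (udh_len_septets..udl_septets)
        .map(|i| septet_at(user_data, i * 7))
        .collect()
}

fn main() {
    let user_data: [u8; 15] = [
        0x05, 0x00, 0x03, 0x42, 0x01, 0x02, // UDH
        0xF2, 0xE5, 0x39, 0x08, 0xCE, 0x2E, 0x87, 0xE7, 0x65, // text
    ];
    // 7 septets covered by the UDH + 10 septets of text
    let septets = text_septets(&user_data, 17);
    println!("Septets {septets:X?}");
}
```

For the concatenated example this yields the ten septets of “yes please”, with none of the header bytes leaking into the text.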
My aim is to publish an implementation of 7-bit encoding and decoding as part of a crate for working with the SIMCom SIM76xx series of LTE modules, but until then the Rust Playground I’ve been using for the examples here (including a decoder) can be found here