SMS 7-bit Text Encoding

Over the last couple weeks, I’ve been working on a firmware that responds to commands sent over LTE text messages. The LTE stuff in my project is all done using a SIMCom SIM7670G modem over its AT interface, and that interface can nominally handle encoding and decoding SMS “user data” (the text in the text message) using something like AT+CMGS=...Hello world!. Even so, I’ve wound up implementing the encoding/decoding of the Protocol Data Units specified in 3GPP TS 23.040 myself, so the interface looks more like AT+CMGS=...C8329BFD06DDDF72363904. In this post, I’ll describe how this all works, at least to the extent that I now understand it!

A couple notes:

I more-or-less interchangeably use the words “bytes” and “octets”.
The 3GPP specs are available online; however, they’re quite dense and long (the relevant one for SMS is a couple hundred pages), so I mostly link to Wikipedia.

As in other aspects of telephony, there are a few layers of history involved, and the complexity comes mainly from the points of contact between those layers. The oldest layer here, I think, comes from GSM in the early ’90s. It’s the “default 7-bit alphabet” of GSM 03.38 / 3GPP TS 23.038, packed with a 7-bit scheme. Up to 140 bytes can fit in the user data of an SMS message; using this scheme, the character limit is:

140 bytes * 8 bits/byte = 1120 bits; 1120 bits / 7 bits/character = 160 characters
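
The same arithmetic works in both directions, which is handy when sizing buffers. A minimal sketch, with made-up function names (`max_septets` and `octets_needed` are not from any spec or library):

```rust
/// How many whole 7-bit septets fit in `octets` bytes (rounds down).
fn max_septets(octets: usize) -> usize {
    octets * 8 / 7
}

/// How many bytes are needed to hold `septets` packed 7-bit values
/// (rounds up, because a partly used last byte still has to be sent).
fn octets_needed(septets: usize) -> usize {
    (septets * 7 + 7) / 8
}
```

For example, `max_septets(140)` is 160, and the 12 characters of “Hello world!” need `octets_needed(12)` = 11 bytes.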

The basic alphabet has enough characters for English (significantly overlapping with ASCII) and some support for Western-European languages. In Rust source it might look like:

#[rustfmt::skip]
static GSM7_CHARSET: [char; 128] = [
    '@', '£', '$', '¥', 'è', 'é', 'ù', 'ì',  'ò', 'Ç', '\n', 'Ø',    'ø', '\r', 'Å', 'å',
    'Δ', '_', 'Φ', 'Γ', 'Λ', 'Ω', 'Π', 'Ψ',  'Σ', 'Θ', 'Ξ',  '\x1B', 'Æ', 'æ',  'ß', 'É',
    ' ', '!', '"', '#', '¤', '%', '&', '\'', '(', ')', '*',  '+',    ',', '-',  '.', '/',
    '0', '1', '2', '3', '4', '5', '6', '7',  '8', '9', ':',  ';',    '<', '=',  '>', '?',
    '¡', 'A', 'B', 'C', 'D', 'E', 'F', 'G',  'H', 'I', 'J',  'K',    'L', 'M',  'N', 'O',
    'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W',  'X', 'Y', 'Z',  'Ä',    'Ö', 'Ñ',  'Ü', '§',
    '¿', 'a', 'b', 'c', 'd', 'e', 'f', 'g',  'h', 'i', 'j',  'k',    'l', 'm',  'n', 'o',
    'p', 'q', 'r', 's', 't', 'u', 'v', 'w',  'x', 'y', 'z',  'ä',    'ö', 'ñ',  'ü', 'à',
];

To encode a message, each character is first converted to a number using the alphabet, e.g. ‘¥’ is index 3 in the table above. In source code, maybe that’s:

/// Map a char to a 7b value using GSM 03.38 / 3GPP 23.038 default alphabet
fn char_to_septet(input: char) -> u8 {
    for (i, c) in GSM7_CHARSET.iter().enumerate() {
        if *c == input {
            return i as u8;
        }
    }
    panic!("input is not in GSM7_CHARSET");
}

Since the alphabet has 128 characters (indices 0 to 127), the high bit in an 8-bit numeric representation is always 0. The 7-bit encoding drops this high bit and packs the remaining bits together least-significant first, using a scheme like this:

-------------------map to 7b integer using default alphabet-------------------
Character: 'H'                  | 'e'                  | 'l'
  integer: 0x48                 | 0x65                 | 0x6c
     bits: 1  0  0  1  0  0  0  | 1  1  0  0  1  0  1  | 1  1  0  1  1  0  0
         : H6 H5 H4 H3 H2 H1 H0 | e6 e5 e4 e3 e2 e1 e0 | l6 l5 l4 l3 l2 l1 l0
-------------------pack 7b septets into 8b octets aka bytes-------------------
         : e0 H6 H5 H4 H3 H2 H1 H0 | l1 l0 e6 e5 e4 e3 e2 e1 | n2 n1 n0 l6 l5 l4 l3 l2
     bits: 1  1  0  0  1  0  0  0  | 0  0  1  1  0  0  1  0  | 1  0  0  1  1  0  1  1
  integer: 0xC8                    | 0x32                    | 0x9B (n = next character, the second 'l')

A simple implementation might be done like this:

/// Convert a slice of 7-bit "septets" to 7-bit encoded bytes
fn to_7bit(input: &[u8]) -> Vec<u8> {
    let mut buf = 0u16;
    let mut buffered_bits = 0usize;
    
    let mut input = input.iter();
    let mut ret = Vec::new();
    loop {
        if buffered_bits < 8 {
            // The 3GPP spec calls the 7-bit representations "septets"
            if let Some(septet) = input.next() {
                buf |= (*septet as u16) << buffered_bits;
                buffered_bits += 7;
                println!(" Loaded septet 0x{septet:02X}, {buffered_bits:2}b buffered 0x{buf:04X}");
                continue;
            }
        }
        
        if buffered_bits == 0 {
            return ret;
        }
        
        let octet = (buf & 0xFF) as u8;
        ret.push(octet);
        buf >>= 8;
        buffered_bits = buffered_bits.saturating_sub(8);
        println!("  Output octet 0x{octet:02X}, {buffered_bits:2}b buffered 0x{buf:04X}");
    }
}

fn main() {
    let mut numeric = Vec::new();
    for c in "Hello world!".chars() {
        numeric.push(char_to_septet(c));
    }
    println!("Septets {numeric:X?}");
    let encoded = to_7bit(&numeric);
    println!("Octets {:X?}", encoded);
}

This program produces output:

Septets [48, 65, 6C, 6C, 6F, 20, 77, 6F, 72, 6C, 64, 21]
 Loaded septet 0x48,  7b buffered 0x0048
 Loaded septet 0x65, 14b buffered 0x32C8
  Output octet 0xC8,  6b buffered 0x0032
 Loaded septet 0x6C, 13b buffered 0x1B32
  Output octet 0x32,  5b buffered 0x001B
 Loaded septet 0x6C, 12b buffered 0x0D9B
  Output octet 0x9B,  4b buffered 0x000D
 Loaded septet 0x6F, 11b buffered 0x06FD
  Output octet 0xFD,  3b buffered 0x0006
 Loaded septet 0x20, 10b buffered 0x0106
  Output octet 0x06,  2b buffered 0x0001
 Loaded septet 0x77,  9b buffered 0x01DD
  Output octet 0xDD,  1b buffered 0x0001
 Loaded septet 0x6F,  8b buffered 0x00DF
  Output octet 0xDF,  0b buffered 0x0000
 Loaded septet 0x72,  7b buffered 0x0072
 Loaded septet 0x6C, 14b buffered 0x3672
  Output octet 0x72,  6b buffered 0x0036
 Loaded septet 0x64, 13b buffered 0x1936
  Output octet 0x36,  5b buffered 0x0019
 Loaded septet 0x21, 12b buffered 0x0439
  Output octet 0x39,  4b buffered 0x0004
  Output octet 0x04,  0b buffered 0x0000
Octets [C8, 32, 9B, FD, 6, DD, DF, 72, 36, 39, 4]

However, of course it’s not quite that simple! One nuance and two complications arise.

The nuance is perhaps not necessary to consider deeply in the modern world, but it’s easy enough to understand, and to implement a probably-superfluous workaround for someone else not doing so a few decades ago… Every 7th encoded byte will have one bit from one character, and seven bits from the next. But, since there might not be a “next” character, the decoder can’t know exactly how many septets are encoded in a message just based on the number of octets. There is a length field in the PDU, which contains the number of septets in the message (when 7-bit encoding is used), so a good decoder will read that length field and know when to stop decoding septets. However, apparently the decoding algorithm on some phones wasn’t good at stopping on the specified septet, and in cases where the last byte had only one bit of valid data, that algorithm would interpret the 7 remaining bits as valid too! With the above encoding algorithm, the remaining 7 bits would be 0, so a text of “awesome” encoded using it might decode on these phones as “awesome@”.

To work around this, when the last encoded byte has only one valid bit, the remaining 7 are to be padded with carriage return aka '\r' or 0x0D in our alphabet. This is just a matter of setting the octet local variable in our encoder like this:

        let octet = if buffered_bits == 1 {
            // Workaround for poor decoders: fill the 7 spare bits with CR
            (buf & 0xFF) as u8 | (0x0D << 1)
        } else {
            (buf & 0xFF) as u8
        };
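
The same condition can also be expressed in terms of the septet count: the last byte holds exactly one valid bit whenever the number of septets is 7 more than a multiple of 8 (7, 15, 23, ...). A small sketch of the padding as a separate post-processing step, assuming the packer left the spare bits as zero (as the encoder above does); `pad_final_octet` is a hypothetical name:

```rust
/// Carriage return in the default alphabet.
const CR: u8 = 0x0D;

/// If the last packed byte carries only one valid bit, fill its upper
/// seven bits with CR so that sloppy decoders show a harmless character.
/// `septet_count` is the number of septets that were packed into `packed`.
fn pad_final_octet(packed: &mut [u8], septet_count: usize) {
    if septet_count % 8 == 7 {
        if let Some(last) = packed.last_mut() {
            *last |= CR << 1;
        }
    }
}
```

Worked by hand, “awesome” packs to [E1, 7B, 79, FE, 6E, 97, 01] without padding; with the CR fill the last byte becomes 0x01 | 0x1A = 0x1B.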

The first complication is the simpler one: people needed a bigger alphabet.

As the Wikipedia page linked above describes, there are several language-specific alphabets in the 3GPP standard, which can be selected using data in the User Data Header (UDH, to be discussed below!). However, I’m not sure these are used in practice anymore. Experiments I’ve done so far with modern Android and iOS phones indicate the only 7-bit character set in use is the default one. Messages containing characters that aren’t in the default 7-bit alphabet are instead encoded with UTF-16, which might be discussed in a later post. However, all phones I’ve tested do use the default alphabet “extension”, which adds a few characters to the alphabet by prefixing them with a septet holding the ESC code 0x1B. These new characters are encoded in two septets; ‘~’ for instance would be 0x1B, 0x3D. The above code might be extended to support the extension like:

#[rustfmt::skip]
static GSM7_CHARSET_EXTENSION: [(u8, char); 10] = [
    (0x0A, '\x0C'),
    (0x14, '^'),
    (0x28, '{'),
    (0x29, '}'),
    (0x2F, '\\'),
    (0x3C, '['),
    (0x3D, '~'),
    (0x3E, ']'),
    (0x40, '|'),
    (0x65, '€'),
];

/// Map a char to 7b value(s) using GSM 03.38 / 3GPP 23.038 default alphabet
fn char_to_septet(input: char) -> (u8, Option<u8>) {
    for (i, c) in GSM7_CHARSET.iter().enumerate() {
        if *c == input {
            return (i as u8, None);
        }
    }
    
    for (i, c) in GSM7_CHARSET_EXTENSION {
        if c == input {
            return (0x1B, Some(i));
        }
    }
    
    panic!("input is not in GSM7_CHARSET nor extension");
}

// Packing septets together is unchanged to support the extension

fn main() {
    let mut numeric = Vec::new();
    for c in "Hello {^~^}".chars() {
        let (first, second) = char_to_septet(c);
        numeric.push(first);
        if let Some(c) = second {
            numeric.push(c);
        }
    }
    println!("Septets {numeric:X?}");
    // Septets [48, 65, 6C, 6C, 6F, 20, 1B, 28, 1B, 14, 1B, 3D, 1B, 14, 1B, 29]
}
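
Because extension characters cost two septets, the number of characters in a message is no longer the number of septets. Before encoding, it can be useful to know how many septets a string needs, or whether it can be sent with the 7-bit alphabet at all. A minimal sketch under those assumptions; `gsm7_septet_len` and `GSM7_EXTENSION_CHARS` are names invented for this example, and the main table is repeated only so the snippet stands alone:

```rust
#[rustfmt::skip]
static GSM7_CHARSET: [char; 128] = [
    '@', '£', '$', '¥', 'è', 'é', 'ù', 'ì',  'ò', 'Ç', '\n', 'Ø',    'ø', '\r', 'Å', 'å',
    'Δ', '_', 'Φ', 'Γ', 'Λ', 'Ω', 'Π', 'Ψ',  'Σ', 'Θ', 'Ξ',  '\x1B', 'Æ', 'æ',  'ß', 'É',
    ' ', '!', '"', '#', '¤', '%', '&', '\'', '(', ')', '*',  '+',    ',', '-',  '.', '/',
    '0', '1', '2', '3', '4', '5', '6', '7',  '8', '9', ':',  ';',    '<', '=',  '>', '?',
    '¡', 'A', 'B', 'C', 'D', 'E', 'F', 'G',  'H', 'I', 'J',  'K',    'L', 'M',  'N', 'O',
    'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W',  'X', 'Y', 'Z',  'Ä',    'Ö', 'Ñ',  'Ü', '§',
    '¿', 'a', 'b', 'c', 'd', 'e', 'f', 'g',  'h', 'i', 'j',  'k',    'l', 'm',  'n', 'o',
    'p', 'q', 'r', 's', 't', 'u', 'v', 'w',  'x', 'y', 'z',  'ä',    'ö', 'ñ',  'ü', 'à',
];

/// The characters reachable through the ESC (0x1B) extension.
static GSM7_EXTENSION_CHARS: [char; 10] =
    ['\x0C', '^', '{', '}', '\\', '[', '~', ']', '|', '€'];

/// Number of septets `text` needs, or `None` if some character is in
/// neither table (such a message can't use the 7-bit alphabet).
fn gsm7_septet_len(text: &str) -> Option<usize> {
    let mut total = 0;
    for c in text.chars() {
        if GSM7_CHARSET.contains(&c) {
            total += 1; // one septet
        } else if GSM7_EXTENSION_CHARS.contains(&c) {
            total += 2; // ESC + one septet
        } else {
            return None;
        }
    }
    Some(total)
}
```

With this, “Hello {^~^}” is 11 characters but 16 septets, matching the septet list above. Returning `None` instead of panicking also gives the caller a chance to fall back to another encoding.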

The bigger complication with 7-bit encoding is due to the User Data Header (UDH), which among other purposes can be used to keep track of the parts of messages that require >140B of user data (>160 characters, assuming none are from the extension). The UDH is included in the beginning of the user data itself, so if the UDH is 6B, then the message data itself can only use 140B - 6B = 134B. Perhaps splitting and re-combining longer SMS messages would make another post too!
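
To make the numbers concrete, here is a small sketch of the 6-byte concatenation header with an 8-bit reference number, plus the septet budget left once a header is present. The function names are invented for this example. The budget calculation assumes the message text has to start on a septet boundary after the header, which is the alignment discussed in the rest of this post.

```rust
/// Maximum size of the user data field, in bytes.
const MAX_USER_DATA_OCTETS: usize = 140;

/// Build a concatenated-SMS User Data Header with an 8-bit reference.
fn concat_udh(reference: u8, total_parts: u8, sequence: u8) -> [u8; 6] {
    [
        0x05,        // UDH length: number of header bytes that follow
        0x00,        // information element: concatenation, 8-bit reference
        0x03,        // length of this element's data
        reference,   // same value in every part of one message
        total_parts, // how many parts the message has
        sequence,    // which part this is, counting from 1
    ]
}

/// Septets left for text when the user data starts with `udh_len` bytes
/// of header. The header is rounded up to a whole number of septets.
fn septets_left(udh_len: usize) -> usize {
    let total = MAX_USER_DATA_OCTETS * 8 / 7;
    let udh_septets = (udh_len * 8 + 6) / 7;
    total - udh_septets
}
```

So a 6-byte header leaves `septets_left(6)` = 153 septets per part, rather than 160.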

Encoding after the UDH would be simple enough if phones could start decoding the data after the UDH. But, it appears that the UDH was added to the spec after there were phones in circulation that used the 7-bit encoding, and with this happening in the ’90s, those phones weren’t able to be updated. A decision was made to send SMS messages with UDHs in a way that these pre-UDH phones could still receive, but perhaps outputting a few garbage characters (decoding the UDH data as if it were 7-bit text) at the beginning of the message. To make this work, the 7-bit encoded data needs to be shifted so that it lines up with the decoding algorithm on these pre-UDH phones.

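To illustrate this problem, a text “yes please” encodes as [F9, F2, 1C, 04, 67, 97, C3, F3, 32] using the above encoder. Suppose a UDH consisting of [05, 00, 03, 42, 01, 02] is prepended to those encoded bytes. This is a concatenated-SMS header with 8-bit reference number 0x42; the last two octets are the total number of parts and then this part’s sequence number, so a real “1 of 2” header would end 02, 01, but the exact values don’t matter for this illustration. A naive decoder then fails to decode the message entirely: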

    println!("Octets {:X?}", encoded);

    let decoded = from_7bit(&encoded, input_septets, 0);
    println!("Decoded: {decoded}");
    
    let udh = [0x05, 0x00, 0x03, 0x42, 0x01, 0x02];

    let mut just_concat = Vec::new();
    just_concat.extend_from_slice(&udh);
    just_concat.extend_from_slice(&encoded);
    println!("Concatenated: {just_concat:X?}");

    let decoded = from_7bit(&just_concat, input_septets, 0);
    println!("Decoded: {decoded}");

Gives output:

Septets [79, 65, 73, 20, 70, 6C, 65, 61, 73, 65]
Octets [F9, F2, 1C, 4, 67, 97, C3, F3, 32]
Decoded: yes please
Concatenated: [5, 0, 3, 42, 1, 2, F9, F2, 1C, 4, 67, 97, C3, F3, 32]
Decoded: é@øΔΛ¡¡ör9

However, overwriting the beginning of the encoded data with the UDH does get some of the input text through - because the 7-bit encoded data stream is synchronised with the naive decoder state:

    for (i, d) in udh.iter().enumerate() {
        encoded[i] = *d;
    }
    println!("Overwritten: {encoded:X?}");
    let decoded = from_7bit(&encoded, input_septets, 0);
    
    println!("Decoded: {decoded}");

Output:

Octets [F9, F2, 1C, 4, 67, 97, C3, F3, 32]
Decoded: yes please
Overwritten: [5, 0, 3, 42, 1, 2, C3, F3, 32]
Decoded: é@øΔΛ¡¡ase

One solution is to prepend padding to the input, so that the first real character starts after the UDH bytes, and then overwrite the padding with the UDH as above. This works, but it’s a bit hacky:

    let udh: [u8; 6] = [0x05, 0x00, 0x03, 0x42, 0x01, 0x02];
    
    // Number of septets that the UDH will overwrite.  Adding 6 causes the
    // division by 7 to round up.
    let pad_septets = ((udh.len() * 8) + 6) / 7;
    
    let mut input = String::new();
    for _ in 0..pad_septets {
        input.push(' ');
    }
    
    input.push_str("yes please");
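    
    // encoding works the same as before
    let mut numeric = Vec::new();
    for c in input.chars() {
        let (first, second) = char_to_septet(c);
        numeric.push(first);
        if let Some(second) = second {
            numeric.push(second);
        }
    }
    // Count septets, not bytes of `input`: extension characters take two
    // septets, and non-ASCII characters take more than one UTF-8 byte.
    let input_septets = numeric.len();
    let mut encoded = to_7bit(&numeric);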
    println!("Octets {:X?}", encoded);
    
    for (i, d) in udh.iter().enumerate() {
        encoded[i] = *d;
    }
    println!("Overwritten: {encoded:X?}");
    let decoded = from_7bit(&encoded, input_septets, 0);
    
    println!("Decoded: {decoded}");

Giving the desired behaviour:

Octets [20, 10, 8, 4, 2, 81, F2, E5, 39, 8, CE, 2E, 87, E7, 65]
Overwritten: [5, 0, 3, 42, 1, 2, F2, E5, 39, 8, CE, 2E, 87, E7, 65]
Decoded: é@øΔΛ¡@yes please

One reason I say this approach is “hacky” is that prepending the padding to input, in practice, will usually involve a copy of the entire input string, more-or-less doubling the RAM requirement of this basic implementation and wasting some CPU cycles too. The naive appending approach doesn’t have this problem, but of course it didn’t work either. (A more advanced approach might work for streaming data, avoiding the need for the encoder to have the complete message in memory at once.)

Looking at the encoding algorithm, we can observe that after the first byte is output, there are 6 bits of buffered data; after the second byte is output, there are 5, etc down to 0 then the pattern restarts at 6. So, we can introduce a new argument to the encoder function, which makes it behave as if it had already output some number of bytes:

/// Convert a slice of 7-bit "septets" to 7-bit encoded bytes
///
/// skip_bytes is used to preserve synchronization of the decoder when a User
/// Data Header (UDH) of length `skip_bytes` sits at the beginning of the
/// 7-bit encoded data field.
fn to_7bit(input: &[u8], skip_bytes: usize) -> Vec<u8> {
    let mut buf = 0u16;
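    // Start with the fill bits that pad the header out to a septet boundary.
    // The outer `% 7` keeps this at 0 when skip_bytes is a multiple of 7
    // (including 0); without it a spurious all-zero septet would be emitted.
    let mut buffered_bits = (7 - (skip_bytes % 7)) % 7;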

    // remainder of the method is unchanged

This change lets us simply append the encoded data on to the UDH, in a way that will decode the message body:

    let input = "yes please";
    
    let mut numeric = Vec::new();
    for c in input.chars() {
        let (first, second) = char_to_septet(c);
        numeric.push(first);
        if let Some(second) = second {
            numeric.push(second);
        }
    }

    let udh: [u8; 6] = [0x05, 0x00, 0x03, 0x42, 0x01, 0x02];
    let encoded = to_7bit(&numeric, udh.len());
    
    println!("Octets {:X?}", encoded);

    // Calculate encoded_septets to account for both the UDH and the actual data
    let encoded_septets = ((udh.len() * 8) + 6) / 7 + numeric.len();
    
    let mut just_concat = Vec::new();
    just_concat.extend_from_slice(&udh);
    just_concat.extend_from_slice(&encoded);
    println!("Concatenated: {just_concat:X?}");

    let decoded = from_7bit(&just_concat, encoded_septets, 0);
    
    println!("Decoded: {decoded}");

Octets [F2, E5, 39, 8, CE, 2E, 87, E7, 65]
Concatenated: [5, 0, 3, 42, 1, 2, F2, E5, 39, 8, CE, 2E, 87, E7, 65]
Decoded: é@øΔΛ¡@yes please
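
Another way to think about the same alignment is in absolute bit positions: round the header up to a whole number of septets, then place message septet number i at bit offset (header septets + i) * 7 in the user data. The sketch below is an alternative formulation for comparison, not the encoder from this post. `build_user_data` is a hypothetical name, and its input is assumed to already be a slice of septets.

```rust
/// Build the complete user data field: the UDH bytes followed by the
/// packed septets, aligned to the septet boundary after the header.
fn build_user_data(udh: &[u8], septets: &[u8]) -> Vec<u8> {
    // Header length rounded up to whole septets
    let udh_septets = (udh.len() * 8 + 6) / 7;
    let total_septets = udh_septets + septets.len();

    // Enough bytes for every septet, rounded up; starts out all zero
    let mut out = vec![0u8; (total_septets * 7 + 7) / 8];
    out[..udh.len()].copy_from_slice(udh);

    for (i, &s) in septets.iter().enumerate() {
        let bit = (udh_septets + i) * 7;
        let (byte, shift) = (bit / 8, bit % 8);
        // Widen so that bits shifted past bit 7 are kept for the next byte
        let wide = ((s & 0x7F) as u16) << shift;
        out[byte] |= (wide & 0xFF) as u8;
        if shift > 1 {
            // The septet straddles a byte boundary
            out[byte + 1] |= (wide >> 8) as u8;
        }
    }
    out
}
```

Given the UDH and the “yes please” septets from above, this yields the same 15 bytes as the concatenated output: [5, 0, 3, 42, 1, 2, F2, E5, 39, 8, CE, 2E, 87, E7, 65]. With an empty header it reduces to the plain packing. The trade-off is that it needs the whole output buffer up front, whereas the buffered encoder can emit one byte at a time.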

The observant reader may have caught on to the last argument to the so-far-undisclosed decoder method; it’s called skip_bytes. Using that will result in what we actually want - a text message without the leading garbage characters!

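    // Read UDH length out of the User Data field.  The first octet counts the
    // header bytes that follow it, hence the + 1 (done in usize so that a
    // length octet of 0xFF can't overflow).
    let udh_len_bytes = just_concat[0] as usize + 1;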
    let udh_len_septets = ((udh_len_bytes * 8) + 6) / 7;
    
    // Decode the data after the UDH
    let decoded = from_7bit(
        &just_concat[udh_len_bytes..],
        encoded_septets - udh_len_septets,
        udh_len_bytes,
    );

Decoded: yes please
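
The decoder itself isn’t listed in this post. As a rough idea of what the unpacking half might look like, here is a minimal sketch that takes the same three arguments but returns raw septets rather than a String. `unpack_7bit` is a hypothetical name and this is not the Playground’s from_7bit. Mapping septets back to characters (including the ESC extension) would be a separate step.

```rust
/// Unpack `septet_count` septets from 7-bit packed `octets`.
///
/// `skip_bytes` is the length of a UDH that preceded `octets` in the user
/// data; the fill bits that pad it to a septet boundary are discarded.
fn unpack_7bit(octets: &[u8], septet_count: usize, skip_bytes: usize) -> Vec<u8> {
    // Fill bits still to be dropped before the first real septet
    let mut to_drop = (7 - (skip_bytes % 7)) % 7;
    let mut buf = 0u16;
    let mut buffered_bits = 0usize;
    let mut out = Vec::with_capacity(septet_count);

    for &octet in octets {
        if out.len() >= septet_count {
            break; // honour the length field, ignore trailing bits
        }
        buf |= (octet as u16) << buffered_bits;
        buffered_bits += 8;

        if to_drop > 0 {
            buf >>= to_drop;
            buffered_bits -= to_drop;
            to_drop = 0;
        }

        while buffered_bits >= 7 && out.len() < septet_count {
            out.push((buf & 0x7F) as u8);
            buf >>= 7;
            buffered_bits -= 7;
        }
    }
    out
}
```

Called on the 9 bytes after the UDH with `septet_count` 10 and `skip_bytes` 6, it gives back the septets of “yes please”. Called on the plain encoding with `skip_bytes` 0, it gives the same result.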

My aim is to publish an implementation of 7-bit encoding and decoding as part of a crate for working with the SIMCom SIM76xx series of LTE modules, but until then the Rust Playground I’ve been using for the examples here (including a decoder) can be found here