One of my trapping projects uses an LTE modem to send and receive text messages. The modem provides an AT interface, which I’m managing with atat. The modem’s AT protocol for text messages supports either a simple “text mode” or a more versatile “PDU mode”, but I couldn’t get sending to work in text mode, at least on the one carrier (2degrees) that I have a SIM card for.

PDU mode presents a text message as something like 07914622207276F2040B914622468921F300005250908115718418C779FB06DAA0CA787AD94D2E933729D086E2726D28, so I needed to parse those PDUs. Further, some commands, like CMGL, return multiple SMS messages; the response is specified as:

5)If PDU mode (AT+CMGF=0)and Command successful:
+CMGL: <index>,<stat>,[<alpha>],<length>
<pdu>[
+CMGL: <index>,<stat>,[<alpha>],<length>
<pdu>
[...]]
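
To make the shape of that response concrete, here is a minimal std-only sketch that pairs each +CMGL header line with the PDU line after it. It is not the nom parser this post builds towards: CmglEntry and split_cmgl are hypothetical names, and the sketch assumes <stat> is numeric (as it is in PDU mode), that <alpha> contains no commas, and that the final OK line has already been stripped.

```rust
/// One entry of a PDU-mode `+CMGL` listing (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
struct CmglEntry<'a> {
    index: u32,
    stat: u8,
    length: usize,
    pdu: &'a str,
}

/// Splits a PDU-mode CMGL response into entries, or returns `None` if a
/// header or PDU line is missing or malformed.
fn split_cmgl(response: &str) -> Option<Vec<CmglEntry<'_>>> {
    let mut entries = Vec::new();
    // `lines()` handles both "\n" and "\r\n"; blank lines are skipped.
    let mut lines = response.lines().map(str::trim).filter(|l| !l.is_empty());
    while let Some(header) = lines.next() {
        let fields = header.strip_prefix("+CMGL:")?;
        let mut parts = fields.split(',').map(str::trim);
        let index: u32 = parts.next()?.parse().ok()?;
        let stat: u8 = parts.next()?.parse().ok()?;
        let _alpha = parts.next()?; // optional field, assumed comma-free
        let length: usize = parts.next()?.parse().ok()?;
        // The PDU is the line that follows its header.
        let pdu = lines.next()?;
        entries.push(CmglEntry { index, stat, length, pdu });
    }
    Some(entries)
}
```

The two-lines-per-entry pairing is the part that needs special handling, since each record depends on the line after its header.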

The CMGL response spans multiple newline-separated lines, which is a bit atypical and which atat can’t handle out of the box. atat does support custom parsers written using nom though, so I wound up writing a parser for SMS PDUs in nom!

(n.b. this was done with the slightly older nom 7.1.3 as used by atat; the examples here use 8.0.0)
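
As a taste of what parsing a PDU involves, here is a std-only sketch that decodes just the first field of the example PDU above, the service-centre (SMSC) address. It is an illustration rather than the parser from my project: decode_smsc is a hypothetical name, and the sketch only handles decimal digits and the 'F' padding nibble.

```rust
/// Decodes the service-centre (SMSC) address at the start of a hex-encoded
/// SMS PDU. Returns the number and the remaining, still-encoded hex digits.
fn decode_smsc(pdu: &str) -> Option<(String, &str)> {
    // First octet: length of the SMSC field in octets (type byte + digits).
    let len = usize::from(u8::from_str_radix(pdu.get(0..2)?, 16).ok()?);
    let field = pdu.get(2..2 + len * 2)?;
    let rest = pdu.get(2 + len * 2..)?;
    if len == 0 {
        return Some((String::new(), rest)); // no SMSC supplied
    }
    let type_of_address = u8::from_str_radix(field.get(0..2)?, 16).ok()?;
    let mut number = String::new();
    if type_of_address == 0x91 {
        number.push('+'); // 0x91: international numbering
    }
    // Each octet holds two decimal digits with the nibbles swapped;
    // a trailing 'F' pads numbers with an odd digit count.
    for pair in field.as_bytes()[2..].chunks_exact(2) {
        for digit in [pair[1], pair[0]] {
            match digit {
                b'0'..=b'9' => number.push(char::from(digit)),
                b'F' | b'f' => {}
                _ => return None, // other semi-octet values are not handled here
            }
        }
    }
    Some((number, rest))
}
```

Applied to the example PDU, this yields "+64220227672" and leaves the hex starting at "040B91" for the next field. Returning the decoded value together with the unconsumed input is the same shape that nom parsers use.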

About nom

nom is a toolkit for making parsers; it’s more structured than writing a parser from scratch, but not as formal as a parsing expression grammar (PEG, a specification that can be used to make a recursive descent parser). The nom approach feels quite functional, and results in code that looks similar to a PEG. nom was originally intended for parsing binary files, but has been expanded to support text, which the examples here use for legibility.

nom Parsers

A parser in nom is anything that implements the Parser trait for an input type I and an output type O. An implementation is provided for functions with the canonical signature:

fn(input: I) -> IResult<I, O>;

(more to follow on IResult<>)

These parser functions are usually quite small; the ones nom provides do things like extract an encoded number, or a fixed sequence, from a slice of bytes. More complicated parsers (like the one for PDUs) are made by combining other parsers.

use nom::IResult;
use nom::bytes::complete::tag;

/// Parses the string "Hello" at the beginning of input
fn hello(input: &str) -> IResult<&str, &str> {
    tag("Hello")(input)
}

fn main() {
    let input = "Hello this is a test";
    let (input, output) = hello(input).unwrap();
    println!("input: {input:?}, output: {output:?}");
}

This produces output input: " this is a test", output: "Hello". So, we can see that these parsers return the parsed information along with remaining un-parsed input.

nom calls itself a “parser combinator library”; combinators combine parsers. For example, we can use opt() with hello() like this:

    // Also needs: use nom::{combinator::opt, Parser};
    let input = "This doesn't start with hello";
    let (input, output) = opt(hello).parse(input).unwrap();
    println!("input: {input:?}, output: {output:?}");

Which outputs input: "This doesn't start with hello", output: None
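
To see why parsers of this shape compose so easily, here is a std-only miniature of the two pieces used so far. This is a simplified sketch, not how nom implements them: MiniResult, mini_tag and mini_opt are hypothetical names, and the error side is reduced to the input at the point of failure.

```rust
/// Simplified stand-in for IResult: Ok((remaining input, output)),
/// or Err(input at the point of failure).
type MiniResult<'a, O> = Result<(&'a str, O), &'a str>;

/// Returns a parser that matches `expected` at the start of the input,
/// in the spirit of tag().
fn mini_tag<'a>(expected: &'static str) -> impl Fn(&'a str) -> MiniResult<'a, &'a str> {
    move |input| match input.strip_prefix(expected) {
        Some(rest) => Ok((rest, &input[..expected.len()])),
        None => Err(input),
    }
}

/// Wraps a parser so that a failure becomes `None` and consumes nothing,
/// in the spirit of opt().
fn mini_opt<'a, O>(
    parser: impl Fn(&'a str) -> MiniResult<'a, O>,
) -> impl Fn(&'a str) -> MiniResult<'a, Option<O>> {
    move |input| match parser(input) {
        Ok((rest, output)) => Ok((rest, Some(output))),
        Err(_) => Ok((input, None)),
    }
}
```

Because a combinator takes a parser and returns something with the same signature, the result can be passed to yet another combinator, which is how larger parsers get built up.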

More Complicated Parsers

From that first example, we see a nom pattern in the use of tag(): sometimes, we want to write functions that take arguments beyond the ones in the signature of a canonical parser, fn(input: I) -> IResult<I, O>;. One solution is to create a type and manually implement Parser on it, but if a new type isn’t justified, another option is to create a function that returns a parser:

use nom::bytes::complete::tag;
use nom::error::ErrorKind;
use nom::IResult;

/// Parses the first match from a list of candidate strings
fn starts_with<'a>(
    candidates: &'a [&str],
) -> impl Fn(&'a str) -> IResult<&'a str, &'a str> {
    move |input| {
        for candidate in candidates {
            let res = tag(*candidate)(input);
            if res.is_ok() {
                return res;
            }
        }

        // Typically, we focus on using nom's provided combinators to make
        // parsers, so we just use whatever errors they provide rather than
        // specifying them manually...
        Err(nom::Err::Error(nom::error::Error::new(
            input,
            ErrorKind::Tag,
        )))
    }
}

fn main() {
    let greetings_evening = ["Hello", "Tē nā koutou"];
    let evening_parser = starts_with(&greetings_evening);

    let input = "Tē nā koutou, Code Craft";
    
    let (input, output) = evening_parser(input).unwrap();

    println!("input: {input:?}, output: {output:?}");
}

Which produces output input: ", Code Craft", output: "Tē nā koutou"

Error Handling

nom parsers and combinators return IResult<>, which is aliased to the regular Result<> type: type IResult<I, O, E = Error<I>> = Result<(I, O), Err<E>>;. Although the naming is somewhat confusing, the scheme is fairly straightforward; control flow and error reporting are addressed separately, with the Err<E> enum used for control flow, and the E/Error type parameter used for error reporting.

Err<E> is an enum with three variants:

pub enum Err<Failure, Error = Failure> {
  /// There was not enough data
  Incomplete(Needed),

  /// The parser had an error (recoverable)
  Error(Error),

  /// The parser had an unrecoverable error: we got to the right
  /// branch and we know other branches won't work, so backtrack
  /// as fast as possible
  Failure(Failure),
}

The distinction between the Error and Failure variants lets a parser tell the enclosing combinator how to proceed. Error means some other parser might still succeed, so combinators such as opt() (used above) and alt() (used below) recover from it. Failure means no other parser could work (e.g. malformed input), so parsing should stop entirely.
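
The difference is easiest to see in a std-only miniature. This sketch is not nom's implementation: FlowErr, pair and first_of are hypothetical names, and the "key=value" grammar is made up for the illustration.

```rust
/// Miniature of nom's control-flow enum (the Incomplete variant is left out).
#[derive(Debug, PartialEq)]
enum FlowErr {
    /// Recoverable: another alternative may still match.
    Error(&'static str),
    /// Unrecoverable: stop trying alternatives.
    Failure(&'static str),
}

type FlowResult<'a> = Result<(&'a str, &'a str), FlowErr>;

/// Builds a parser for "key=value" (a made-up grammar for this sketch).
fn pair<'a>(key: &'static str) -> impl Fn(&'a str) -> FlowResult<'a> {
    move |input| {
        // A different key is recoverable: some other parser may want it.
        let after_key = input.strip_prefix(key).ok_or(FlowErr::Error("wrong key"))?;
        // Once the key matched we are on the right branch, so a missing '='
        // means the input is malformed.
        let value = after_key
            .strip_prefix('=')
            .ok_or(FlowErr::Failure("expected '='"))?;
        Ok(("", value))
    }
}

/// Tries each parser in turn, in the spirit of alt(): carries on after an
/// Error, but returns immediately on success or Failure.
fn first_of<'a, P>(parsers: &[P], input: &'a str) -> FlowResult<'a>
where
    P: Fn(&'a str) -> FlowResult<'a>,
{
    let mut last = FlowErr::Error("no parsers supplied");
    for parser in parsers {
        match parser(input) {
            Err(FlowErr::Error(reason)) => last = FlowErr::Error(reason),
            other => return other,
        }
    }
    Err(last)
}
```

With the parsers [pair("name"), pair("size")], the input "size=10" succeeds on the second parser after the first reports Error, while "name:bob" stops at the first parser with Failure and the second is never tried.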

The E/Error type parameter defaults to a simple struct Error<I> which provides lightweight error reporting functionality. It contains the remaining input at the point of failure (hence the I parameter), and a code that indicates which nom parser failed.

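use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::character::complete::{char, space0};
use nom::combinator::{opt, rest};
use nom::{Err, IResult, Parser};

/// Parses one of the known greeting phrases
fn greeting(input: &str) -> IResult<&str, &str> {
    alt([tag("Hello"), tag("Tē nā koutou")]).parse(input)
}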

/// Parses the message after a greeting
fn message(input: &str) -> IResult<&str, &str> {
    // pro tip: The Parser trait is implemented for tuples of parsers
    let (input, _greeting) = (
        greeting,
        char(','),
        opt(space0)
    ).parse(input)?;

    rest(input)
}

fn main() {
    let input = "Hello; Your tax returns are due.";

    if let Err(err) = message(input) {
        match err {
            Err::Incomplete(needed) => {
                println!("Incomplete: {needed:?}");
            }
            Err::Error(err) | Err::Failure(err) => {
                eprintln!("{}", convert_error(input, err));
            }
        }
    }
}

This generates:

Parsing error:
Input: Hello; Your tax returns are due.
Error:      ^ Char
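
The convert_error() helper called in that example isn't shown, so here is one way a formatter like it could work. This is a hypothetical std-only sketch: render_error is an invented name, and it assumes that `remaining` is a suffix of `input`, which holds for the input field that the default Error carries.

```rust
/// Hypothetical formatter. `remaining` is the unparsed input at the point of
/// failure and `code` names the kind of parser that failed.
fn render_error(input: &str, remaining: &str, code: &str) -> String {
    // The failure offset is how much input was consumed before the error.
    let offset = input.len().saturating_sub(remaining.len());
    // Count characters rather than bytes so the caret lines up for
    // non-ASCII input; fall back to column 0 if the offset is not valid.
    let column = input.get(..offset).map_or(0, |consumed| consumed.chars().count());
    format!(
        "Parsing error:\nInput: {input}\nError: {}^ {code}",
        " ".repeat(column)
    )
}
```

The whole report is derived from two values, the leftover input and the error code, which is all the lightweight default error type provides.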

Improving Error Handling: Different Error Types

The easiest improvement in error messages is to switch from Error<I> to VerboseError<I>, which was recently moved to the nom-language crate. Changing the return type of message() and greeting() to IResult<&str, &str, VerboseError<&str>> allows for generating errors like this, with a single call to convert_error():

0: at line 1:
Hello; Your tax returns are due.
     ^
expected ',', found ;

Switching between error types is easier when custom parsers are generic over the error type:

use nom::error::ParseError;

fn greeting<'a, E: ParseError<&'a str>>(
    input: &'a str,
) -> IResult<&'a str, &'a str, E> {
    alt([tag("Hello"), tag("Tē nā koutou")]).parse(input)
}

Improving Error Handling: Adding Context

VerboseError can also accumulate context from the stack of parsers to further improve error messages; the context is attached with the context() combinator. context() is a no-op when used with error types that don’t record context, such as the default Error:

use nom_language::error::{context, VerboseError};

/// Parses the message after a greeting
fn message(input: &str) -> IResult<&str, &str, VerboseError<&str>> {
    let (input, _greeting) = context(
        "message(), discarding greeting",
        (
            greeting,
            char(','),
            opt(space0),
        ),
    )
    .parse(input)?;

    context("message(), returning message", rest).parse(input)
}

This generates:

0: at line 1:
Hello; Your tax returns are due.
     ^
expected ',', found ;

1: at line 1, in message(), discarding greeting:
Hello; Your tax returns are due.
^
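
The two frames in that output come from errors being accumulated on the way back up the stack of parsers. Here is a std-only miniature of the idea, as I understand it; it is not nom's implementation, and TraceError, expect_comma, greeting_then_comma and with_context are hypothetical names.

```rust
/// Miniature of a context-accumulating error: each frame pairs the input
/// position where something went wrong with a description.
#[derive(Debug, PartialEq)]
struct TraceError<'a> {
    frames: Vec<(&'a str, &'static str)>,
}

type TraceResult<'a, O> = Result<(&'a str, O), TraceError<'a>>;

/// Leaf parser: expects a ',' at the start of the input.
fn expect_comma(input: &str) -> TraceResult<'_, char> {
    match input.strip_prefix(',') {
        Some(rest) => Ok((rest, ',')),
        None => Err(TraceError { frames: vec![(input, "expected ','")] }),
    }
}

/// Skips "Hello", then requires a comma.
fn greeting_then_comma(input: &str) -> TraceResult<'_, ()> {
    let rest = input.strip_prefix("Hello").ok_or_else(|| TraceError {
        frames: vec![(input, "expected greeting")],
    })?;
    let (rest, _comma) = expect_comma(rest)?;
    Ok((rest, ()))
}

/// In the spirit of context(): on failure, appends a frame recording where
/// the wrapped parser started and the supplied label.
fn with_context<'a, O>(
    label: &'static str,
    parser: impl Fn(&'a str) -> TraceResult<'a, O>,
) -> impl Fn(&'a str) -> TraceResult<'a, O> {
    move |input| {
        parser(input).map_err(|mut err| {
            err.frames.push((input, label));
            err
        })
    }
}
```

Running with_context("message(), discarding greeting", greeting_then_comma) on the tax-return input yields two frames: the innermost one points at the ';', and the context frame points at the start of the input, matching the carets in the output above.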

Further Reading:

A well-written post with some background on parsing and (an older version of) nom; its main topic is an approach to error recovery in systems like IDEs, where invalid code is expected: https://eyalkalderon.com/blog/nom-error-recovery/ . The error recovery mechanism is described in this paper

For better user-facing error reporting of larger input, have a look at nom_locate. This promises to give finer-grained reporting of where errors were encountered, but wasn’t needed in my application.