Parsing with nom
One of my trapping projects involves using an LTE modem to send and receive text messages. The modem provides an AT interface, which I’m using atat to manage. The modem’s AT protocol for text messages supports either a simple “text mode” or a more-versatile “PDU mode”, but I couldn’t get sending to work with text mode, at least on the one carrier (2degrees) that I have a SIM card for.
PDU mode presents a text message as something like 07914622207276F2040B914622468921F300005250908115718418C779FB06DAA0CA787AD94D2E933729D086E2726D28, so I needed to parse those PDUs. Further, some commands like CMGL return multiple SMS messages, with a specification:
5)If PDU mode (AT+CMGF=0)and Command successful:
+CMGL: <index>,<stat>,[<alpha>],<length>
<pdu>[
+CMGL: <index>,<stat>,[<alpha>],<length>
<pdu>
[...]]
The messages in a CMGL AT response are separated by newlines, which is a bit atypical, and which atat can’t handle out-of-the-box… atat supports custom parsers written using nom though, so I wound up writing a parser for SMS PDUs in nom!
(n.b. this was done with the slightly older nom 7.1.3 used by atat; the examples here use 8.0.0)
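To make the shape of a CMGL response concrete, here is a minimal sketch using only the standard library. It is not the nom parser I ended up writing. The CmglEntry type and the split_cmgl function are names made up for illustration, and the sketch assumes that <alpha> is empty (or at least contains no commas) and that trailing status lines such as OK have already been stripped.

```rust
/// One entry from a PDU-mode `+CMGL` listing (hypothetical type).
#[derive(Debug, PartialEq)]
struct CmglEntry<'a> {
    index: u32,
    stat: u8,
    /// Length reported by the modem in the header line.
    length: usize,
    /// The PDU, still hex-encoded.
    pdu: &'a str,
}

/// Splits a PDU-mode `+CMGL` response body into entries.
/// Returns `None` if a header or PDU line is missing or malformed.
fn split_cmgl(body: &str) -> Option<Vec<CmglEntry<'_>>> {
    // Lines alternate between a header and a PDU; skip blank lines.
    let mut lines = body.lines().map(str::trim).filter(|l| !l.is_empty());
    let mut entries = Vec::new();
    while let Some(header) = lines.next() {
        // Header: "+CMGL: <index>,<stat>,[<alpha>],<length>"
        let fields: Vec<&str> = header
            .strip_prefix("+CMGL:")?
            .split(',')
            .map(str::trim)
            .collect();
        if fields.len() != 4 {
            return None;
        }
        // The PDU must be an even number of hex digits.
        let pdu = lines.next()?;
        if pdu.len() % 2 != 0 || !pdu.bytes().all(|b| b.is_ascii_hexdigit()) {
            return None;
        }
        entries.push(CmglEntry {
            index: fields[0].parse().ok()?,
            stat: fields[1].parse().ok()?,
            length: fields[3].parse().ok()?,
            pdu,
        });
    }
    Some(entries)
}
```

Each entry keeps the PDU as an undecoded hex string, so decoding the PDU itself stays a separate step.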
About nom
nom is a toolkit for making parsers; it’s more structured than making a parser from scratch, but it’s not as formal as a parsing expression grammar (PEG, which is a specification that can be used to make a recursive descent parser). The nom approach feels quite functional, and results in code that looks similar to a PEG. nom was originally intended for parsing binary files, but has been expanded to support text, which the examples here use for legibility.
nom Parsers
A parser in nom is a function which implements the Parser trait for an input type I and an output type O. An implementation is provided for functions with the canonical signature:
fn(input: I) -> IResult<I, O>;
(more to follow on IResult<>)
These parser functions are usually quite small; the ones nom provides do things like extract an encoded number, or a fixed sequence, from a slice of bytes. More complicated parsers (like for PDUs) are made by combining other parsers.
use nom::IResult;
use nom::bytes::complete::tag;
/// Parses the string "Hello" at the beginning of input
fn hello(input: &str) -> IResult<&str, &str> {
    tag("Hello")(input)
}
fn main() {
    let input = "Hello this is a test";
    let (input, output) = hello(input).unwrap();
    println!("input: {input:?}, output: {output:?}");
}
This produces output input: " this is a test", output: "Hello". So, we can see that these parsers return the parsed information along with remaining un-parsed input.
nom calls itself a “parser combinator library”; combinators combine parsers. For example, we can use opt() with hello() like this:
let input = "This doesn't start with hello";
let (input, output) = opt(hello).parse(input).unwrap();
println!("input: {input:?}, output: {output:?}");
Which outputs input: "This doesn't start with hello", output: None
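It can help to see how little machinery is involved in that contract. The following is a hand-rolled sketch using only the standard library, not nom's actual implementation: MiniResult, mini_tag, and mini_opt are hypothetical stand-ins for IResult, tag(), and opt(), and a bare &str takes the place of a real error type.

```rust
/// Simplified stand-in for nom's `IResult`: on success, the remaining
/// input comes first and the parsed output second.
type MiniResult<'a, O> = Result<(&'a str, O), &'a str>;

/// Returns a parser that matches `expected` at the start of the input
/// (a hand-rolled analogue of `tag()`).
fn mini_tag<'a>(expected: &'static str) -> impl Fn(&'a str) -> MiniResult<'a, &'a str> {
    move |input: &'a str| match input.strip_prefix(expected) {
        Some(rest) => Ok((rest, &input[..expected.len()])),
        // Report the input at the point where matching failed.
        None => Err(input),
    }
}

/// Wraps `parser` so that a failure becomes `None` and consumes nothing
/// (a hand-rolled analogue of `opt()`).
fn mini_opt<'a, O>(
    parser: impl Fn(&'a str) -> MiniResult<'a, O>,
) -> impl Fn(&'a str) -> MiniResult<'a, Option<O>> {
    move |input: &'a str| match parser(input) {
        Ok((rest, output)) => Ok((rest, Some(output))),
        Err(_) => Ok((input, None)),
    }
}
```

Calling mini_opt(mini_tag("Hello")) on input that doesn't start with "Hello" returns the input untouched along with None, mirroring the opt(hello) result above.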
More Complicated Parsers
From that first example, we see a nom pattern in the use of tag(): sometimes, we want to write functions that take arguments beyond the ones in the signature of a canonical parser, fn(input: I) -> IResult<I, O>. One solution is to create a type and manually implement Parser on it, but if a new type isn’t justified, another option is to create a function that returns a parser:
use nom::IResult;
use nom::bytes::complete::tag;
use nom::error::ErrorKind;
/// Parses the first match from a list of candidate strings
fn starts_with<'a>(
    candidates: &'a [&'a str],
) -> impl Fn(&'a str) -> IResult<&'a str, &'a str> {
    move |input| {
        for candidate in candidates {
            let res = tag(*candidate)(input);
            if res.is_ok() {
                return res;
            }
        }
        // Typically, we focus on using nom's provided combinators to make
        // parsers, so just use whatever errors they provide rather than
        // specifying them manually...
        Err(nom::Err::Error(nom::error::Error::new(
            input,
            ErrorKind::Tag,
        )))
    }
}
fn main() {
    let greetings_evening = ["Hello", "Tē nā koutou"];
    let evening_parser = starts_with(&greetings_evening);
    let input = "Tē nā koutou, Code Craft";
    let (input, output) = evening_parser(input).unwrap();
    println!("input: {input:?}, output: {output:?}");
}
Which produces output input: ", Code Craft", output: "Tē nā koutou"
Error Handling
nom parsers and combinators return IResult<>, which is an alias for the regular Result<> type: type IResult<I, O, E = Error<I>> = Result<(I, O), Err<E>>;
Although the naming is somewhat confusing, the scheme is fairly straightforward; control flow and error reporting are addressed separately, with the Err<E> enum used for control flow, and the E/Error type parameter used for error reporting.
Err<E> is an enum with three variants:
pub enum Err<Failure, Error = Failure> {
    /// There was not enough data
    Incomplete(Needed),
    /// The parser had an error (recoverable)
    Error(Error),
    /// The parser had an unrecoverable error: we got to the right
    /// branch and we know other branches won't work, so backtrack
    /// as fast as possible
    Failure(Failure),
}
The distinction between the Error and Failure variants lets a parser tell combinators whether some other parser might still be expected to succeed (Error, which combinators like opt() as used above, or alt(), recover from), or whether parsing should fail entirely because no other parser could work (Failure, e.g. malformed input).
The E/Error type parameter defaults to a simple struct, Error<I>, which provides lightweight error reporting functionality. It contains the remaining input at the point of failure (hence the I parameter), and a code that indicates which nom parser failed.
/// Parses one of the known greeting phrases
fn greeting(input: &str) -> IResult<&str, &str> {
    alt([tag("Hello"), tag("Tē nā koutou")]).parse(input)
}
/// Parses the message after a greeting
fn message(input: &str) -> IResult<&str, &str> {
    // pro tip: The Parser trait is implemented for tuples of parsers
    let (input, _greeting) = (
        greeting,
        char(','),
        opt(space0)
    ).parse(input)?;
    rest(input)
}
fn main() {
    let input = "Hello; Your tax returns are due.";
    if let Err(err) = message(input) {
        match err {
            Err::Incomplete(needed) => {
                println!("Incomplete: {needed:?}");
            }
            Err::Error(err) | Err::Failure(err) => {
                eprintln!("{}", convert_error(input, err));
            }
        }
    }
}
This generates:
Parsing error:
Input: Hello; Your tax returns are due.
Error:      ^ Char
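The helper that rendered this output isn't shown in this post. One way such a function could work is sketched below, using only the standard library: since the default error carries the remaining input at the point of failure, the caret position is simply how much of the original input had been consumed. The name render_error and its signature are assumptions for illustration, with the error kind passed in as a plain string.

```rust
/// Renders a caret under the position where parsing stopped.
/// `remaining` is expected to be a suffix of `input` (the un-parsed input
/// at the point of failure); `kind` labels the parser that failed.
fn render_error(input: &str, remaining: &str, kind: &str) -> String {
    // The failure offset is however much input was consumed before the error.
    let offset = input.len().saturating_sub(remaining.len());
    // Count characters rather than bytes so multi-byte text still lines up.
    let column = input.get(..offset).unwrap_or(input).chars().count();
    format!(
        "Parsing error:\nInput: {input}\nError: {caret:>width$} {kind}",
        caret = "^",
        width = column + 1,
    )
}
```

Counting characters is still only an approximation for terminal alignment (wide or combining characters would need more care), but it handles text like "Tē nā koutou" correctly.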
Improving Error Handling: Different Error Types
The easiest improvement in error messages is to switch from Error<I> to VerboseError<I>, which was recently moved to the nom-language crate. Changing the return type of message() and greeting() to IResult<&str, &str, VerboseError<&str>> allows for generating errors like this, with a single call to convert_error():
0: at line 1:
Hello; Your tax returns are due.
     ^
expected ',', found ;
Switching between error types can be made easier by making custom parsers generic over the error type:
fn greeting<'a, E: ParseError<&'a str>>(
    input: &'a str,
) -> IResult<&'a str, &'a str, E> {
    alt([tag("Hello"), tag("Tē nā koutou")]).parse(input)
}
Improving Error Handling: Adding Context
VerboseError accumulates context from the stack of Parsers to further improve error messages. Context is attached with the context() combinator, which is a no-op when using error types that don’t use the context, such as the default Error:
/// Parses the message after a greeting
fn message(input: &str) -> IResult<&str, &str, VerboseError<&str>> {
    let (input, _greeting) = context(
        "message(), discarding greeting",
        (
            // greeting() is the generic version, so it adapts to VerboseError
            greeting,
            char(','),
            opt(space0),
        ),
    )
    .parse(input)?;
    context("message(), returning message", rest).parse(input)
}
0: at line 1:
Hello; Your tax returns are due.
     ^
expected ',', found ;
1: at line 1, in message(), discarding greeting:
Hello; Your tax returns are due.
^
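To illustrate how entries like these can pile up, here is a simplified model using only the standard library. It is not how nom or nom-language implement VerboseError, and MiniVerboseError, mini_char, hello_comma, and mini_context are hypothetical names. The innermost parser records where it failed, and each enclosing context wrapper appends an entry pointing at the input where it started.

```rust
/// Simplified model of a verbose error: a stack of (input position, note)
/// entries, innermost first.
#[derive(Debug, PartialEq)]
struct MiniVerboseError<'a> {
    entries: Vec<(&'a str, String)>,
}

type MiniResult<'a, O> = Result<(&'a str, O), MiniVerboseError<'a>>;

/// Returns a parser that matches a single expected character.
fn mini_char<'a>(expected: char) -> impl Fn(&'a str) -> MiniResult<'a, char> {
    move |input: &'a str| match input.strip_prefix(expected) {
        Some(rest) => Ok((rest, expected)),
        None => Err(MiniVerboseError {
            entries: vec![(input, format!("expected '{expected}'"))],
        }),
    }
}

/// Matches "Hello" followed by ','.
fn hello_comma(input: &str) -> MiniResult<'_, ()> {
    let rest = input.strip_prefix("Hello").ok_or_else(|| MiniVerboseError {
        entries: vec![(input, "expected \"Hello\"".to_string())],
    })?;
    let (rest, _) = mini_char(',')(rest)?;
    Ok((rest, ()))
}

/// Wraps `parser`; on error, appends an entry recording the label and the
/// input at which the wrapped parser started.
fn mini_context<'a, O>(
    label: &'static str,
    parser: impl Fn(&'a str) -> MiniResult<'a, O>,
) -> impl Fn(&'a str) -> MiniResult<'a, O> {
    move |input: &'a str| {
        parser(input).map_err(|mut err| {
            err.entries.push((input, format!("in {label}")));
            err
        })
    }
}
```

Running mini_context("message(), discarding greeting", hello_comma) on the tax-return input yields two entries: the first points at the ';' where a comma was expected, and the second points at the start of the line, matching the two carets in the output above.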
Further Reading:
Well-written post with some background on parsing and (an older version of) nom; the main topic is an approach to error recovery in the context of systems like IDEs, where invalid code is expected: https://eyalkalderon.com/blog/nom-error-recovery/ . The error recovery mechanism is described in this paper
For better user-facing error reporting of larger input, have a look at nom_locate. This promises to give finer-grained reporting of where errors were encountered, but wasn’t needed in my application.