Tim Van Wassenhove

Passionate geek, interested in Technology. Proud father of two

16 Jun 2022

Parsing with Pest

A couple of weeks ago I was working on datafusion-catalogprovider-glue, a catalog provider for DataFusion that sources its tables from AWS Glue.

The AWS SDK for Rust returns the data type of each column in a table as a UTF-8 string. A crawler will use data types that are supported by Athena.

Parsing the initial data types was pretty simple:

match glue_type {
    "int" => Ok(DataType::Int32),
    "boolean" => Ok(DataType::Boolean),
    "bigint" => Ok(DataType::Int64),
    "float" => Ok(DataType::Float32),
    ...
}

Support for decimal and arrays remained simple with some regular expressions and recursion:

lazy_static! {
    // `static ref` is only valid inside lazy_static!; raw strings avoid double escaping.
    static ref DECIMAL_RE: Regex = Regex::new(r"^decimal\((?P<precision>\d+)\s*,\s*(?P<scale>\d+)\)$").unwrap();
    static ref ARRAY_RE: Regex = Regex::new(r"^array<(?P<array_type>.+)>$").unwrap();
}

_ => {
    if let Some(decimal_cg) = DECIMAL_RE.captures(glue_type) {
        let precision = decimal_cg
            .name("precision")
            .unwrap()
            .as_str()
            .parse()
            .unwrap();
        let scale = decimal_cg.name("scale").unwrap().as_str().parse().unwrap();
        Ok(DataType::Decimal(precision, scale))
    } else if let Some(array_cg) = ARRAY_RE.captures(glue_type) {
        let array_type = array_cg.name("array_type").unwrap().as_str();
        let field = Self::map_to_arrow_field(glue_name, array_type)?;
        Ok(DataType::List(Box::new(field)))
    ...

But then I also needed to support maps and structs, which require matching arbitrarily nested constructs. That calls for something like balancing groups, which Rust's regex crate does not support. And I gave up :)
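To illustrate why nesting is the sticking point: the comma that separates a map's key type from its value type can only be found by tracking how deep inside `<...>` or `(...)` you are. The sketch below is not code from the project; `split_top_level` is a hypothetical helper that uses only the standard library and assumes well-formed input without whitespace.

```rust
/// Splits `s` on commas that are not nested inside `<...>` or `(...)`.
/// Hypothetical helper for illustration; assumes brackets are balanced.
fn split_top_level(s: &str) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut depth = 0usize; // current nesting level
    let mut start = 0; // byte offset where the current part begins
    for (i, c) in s.char_indices() {
        match c {
            '<' | '(' => depth += 1,
            '>' | ')' => depth = depth.saturating_sub(1),
            ',' if depth == 0 => {
                parts.push(&s[start..i]);
                start = i + 1;
            }
            _ => {}
        }
    }
    parts.push(&s[start..]);
    parts
}
```

Called with `string,map<string,decimal(10,2)>` (the inside of a `map<...>`), this returns the two parts `string` and `map<string,decimal(10,2)>`. A single regular expression without recursion cannot keep that depth counter, which is where a real parser starts to pay off.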

I decided to leverage a proper parser and wrote an EBNF grammar, but I was unable to find a good parser generator for Rust. So I settled on Pest, a general-purpose parser written in Rust with a focus on accessibility, correctness, and performance.

Following the examples in the Pest book made it easy to write the parsing expression grammar (PEG), glue_datatype.pest. Using the Pest plugin for IDEA made it even easier:

DataType = _{ SimpleType | ArrayType | MapType | StructType }
SimpleType = _{ TinyInt | SmallInt | Int | Boolean | BigInt | Float | Double | Binary | Date | Timestamp | String | Char | Varchar | Decimal }

TinyInt = { "tinyint" }
SmallInt = { "smallint" }
Int = { "integer" | "int" } // longer literal first: PEG choice is ordered

Decimal = { "decimal(" ~ number ~ "," ~ number ~ ")" }
ArrayType = { "array<" ~ DataType ~ ">" }
MapType = { "map<" ~ DataType ~ "," ~ DataType ~ ">" }
StructType = { "struct<" ~ structFields ~ ">" }
structFields = _{ structField ~ ("," ~ structField)* }
structField = { ident ~ ":" ~ DataType }

I decided to first parse this into a Rust enum, GlueDataType, and only then map from this enum to DataFusion/Arrow data types.
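A minimal sketch of that intermediate step could look like the code below. It is an illustration rather than the project's code: the variant names, the `Cursor` type and `parse_glue_type` are assumptions, only a handful of types are covered, and a small hand-written recursive-descent parser stands in for the Pest-generated one so that the example runs with the standard library alone. Whitespace handling and validation of field names are left out.

```rust
/// Hypothetical intermediate representation; variant names are assumptions.
#[derive(Debug, PartialEq)]
enum GlueDataType {
    Int,
    BigInt,
    Decimal(usize, usize),
    Array(Box<GlueDataType>),
    Map(Box<GlueDataType>, Box<GlueDataType>),
    Struct(Vec<(String, GlueDataType)>),
}

/// Hand-written stand-in for the Pest-generated parser: a cursor over unparsed input.
struct Cursor<'a> {
    rest: &'a str,
}

impl<'a> Cursor<'a> {
    /// Consumes `lit` if the remaining input starts with it.
    fn eat(&mut self, lit: &str) -> bool {
        let tail = self.rest.strip_prefix(lit);
        self.rest = tail.unwrap_or(self.rest);
        tail.is_some()
    }

    fn expect(&mut self, lit: &str) -> Result<(), String> {
        if self.eat(lit) {
            Ok(())
        } else {
            Err(format!("expected `{}` before `{}`", lit, self.rest))
        }
    }

    /// Consumes the longest prefix whose characters satisfy `pred`.
    fn take_while(&mut self, pred: impl Fn(char) -> bool) -> &'a str {
        let rest: &'a str = self.rest;
        let end = rest.find(|c: char| !pred(c)).unwrap_or(rest.len());
        self.rest = &rest[end..];
        &rest[..end]
    }

    fn number(&mut self) -> Result<usize, String> {
        let digits = self.take_while(|c| c.is_ascii_digit());
        digits.parse::<usize>().map_err(|e| format!("bad number `{}`: {}", digits, e))
    }

    /// Mirrors the `DataType` rule: nested types are parsed by recursive calls.
    fn data_type(&mut self) -> Result<GlueDataType, String> {
        if self.eat("array<") {
            let item = self.data_type()?;
            self.expect(">")?;
            Ok(GlueDataType::Array(Box::new(item)))
        } else if self.eat("map<") {
            let key = self.data_type()?;
            self.expect(",")?;
            let value = self.data_type()?;
            self.expect(">")?;
            Ok(GlueDataType::Map(Box::new(key), Box::new(value)))
        } else if self.eat("struct<") {
            let mut fields = Vec::new();
            loop {
                let name = self.take_while(|c| c.is_alphanumeric() || c == '_');
                self.expect(":")?;
                fields.push((name.to_string(), self.data_type()?));
                if !self.eat(",") {
                    break;
                }
            }
            self.expect(">")?;
            Ok(GlueDataType::Struct(fields))
        } else if self.eat("decimal(") {
            let precision = self.number()?;
            self.expect(",")?;
            let scale = self.number()?;
            self.expect(")")?;
            Ok(GlueDataType::Decimal(precision, scale))
        } else if self.eat("bigint") {
            Ok(GlueDataType::BigInt)
        } else if self.eat("integer") || self.eat("int") {
            // Longer literal first, so "integer" is not cut short at "int".
            Ok(GlueDataType::Int)
        } else {
            Err(format!("unknown type at `{}`", self.rest))
        }
    }
}

/// Parses a complete Glue type string and rejects trailing input.
fn parse_glue_type(input: &str) -> Result<GlueDataType, String> {
    let mut cursor = Cursor { rest: input };
    let parsed = cursor.data_type()?;
    if cursor.rest.is_empty() {
        Ok(parsed)
    } else {
        Err(format!("trailing input `{}`", cursor.rest))
    }
}
```

With a tree like this in hand, the second step is a plain recursive `match` over `GlueDataType` that builds the corresponding Arrow type, and it no longer has to care about strings at all.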

In summary, this exercise felt like a walk in the park ;)