Parsing with Pest
A couple of weeks ago I was working on datafusion-catalogprovider-glue, a CatalogProvider for DataFusion that sources its catalog from AWS Glue.
The AWS SDK for Rust returns the data type of each column in a table as a UTF-8 string. A Glue crawler uses the data types supported by Athena.
Parsing the initial data types was pretty simple:
match glue_type {
    "int" => Ok(DataType::Int32),
    "boolean" => Ok(DataType::Boolean),
    "bigint" => Ok(DataType::Int64),
    "float" => Ok(DataType::Float32),
    ...
}
Support for decimals and arrays remained simple, using a couple of regular expressions and recursion:
static ref DECIMAL_RE: Regex = Regex::new("^decimal\\((?P<precision>\\d+)\\s*,\\s*(?P<scale>\\d+)\\)$").unwrap();
static ref ARRAY_RE: Regex = Regex::new("^array<(?P<array_type>.+)>$").unwrap();
_ => {
    if let Some(decimal_cg) = DECIMAL_RE.captures(glue_type) {
        let precision = decimal_cg
            .name("precision")
            .unwrap()
            .as_str()
            .parse()
            .unwrap();
        let scale = decimal_cg.name("scale").unwrap().as_str().parse().unwrap();
        Ok(DataType::Decimal(precision, scale))
    } else if let Some(array_cg) = ARRAY_RE.captures(glue_type) {
        let array_type = array_cg.name("array_type").unwrap().as_str();
        let field = Self::map_to_arrow_field(glue_name, array_type)?;
        Ok(DataType::List(Box::new(field)))
    ...
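But then I also needed to support maps and structs, which require matching arbitrarily nested constructs. Regular expressions can only do that with extensions such as balancing groups, which the Rust regex crate does not offer. And I gave up :)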
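To make the nesting problem concrete, here is a minimal sketch that uses only the standard library. The helper name `split_top_level_comma` is hypothetical and not part of the project. It finds the comma separating the key and value types inside `map<...>`, which only works by tracking bracket depth, something a flat pattern cannot express:

```rust
/// Splits the inside of a `map<...>` at the first comma that is not nested
/// inside angle brackets or parentheses. Hypothetical helper, illustration only.
fn split_top_level_comma(inner: &str) -> Option<(&str, &str)> {
    let mut depth = 0usize;
    for (i, c) in inner.char_indices() {
        match c {
            // Entering a nested type such as `array<` or `decimal(`.
            '<' | '(' => depth += 1,
            // Leaving a nested type; unbalanced input yields `None`.
            '>' | ')' => depth = depth.checked_sub(1)?,
            // Only a comma at depth 0 separates key type and value type.
            ',' if depth == 0 => return Some((&inner[..i], &inner[i + 1..])),
            _ => {}
        }
    }
    None
}

fn main() {
    // The commas inside the nested map and inside decimal(...) must be skipped.
    let inner = "string,array<map<int,decimal(10,2)>>";
    assert_eq!(
        split_top_level_comma(inner),
        Some(("string", "array<map<int,decimal(10,2)>>"))
    );
    // No top-level comma at all.
    assert_eq!(split_top_level_comma("array<map<int,string>>"), None);
}
```

Every nested construct would need this kind of bookkeeping by hand, which is exactly what a grammar gives you for free.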
I decided to use a proper parser instead. I wrote an EBNF grammar, but could not find a good parser generator for Rust, so I settled on Pest, a general-purpose parser written in Rust with a focus on accessibility, correctness, and performance.
Following the examples in the Pest book made it easy to write the parsing expression grammar (PEG), glue_datatype.pest. The Pest plugin for IDEA made it even easier:
DataType = _{ SimpleType | ArrayType | MapType | StructType }
SimpleType = _{ TinyInt | SmallInt | Int | Boolean | BigInt | Float | Double | Binary | Date | Timestamp | String | Char | Varchar | Decimal }
TinyInt = { "tinyint" }
SmallInt = { "smallint" }
Int = { "integer" | "int" } // longer literal first: PEG choice is ordered, so "int" would shadow "integer"
Decimal = { "decimal(" ~ number ~ "," ~ number ~ ")" }
ArrayType = { "array<" ~ DataType ~ ">" }
MapType = { "map<" ~ DataType ~ "," ~ DataType ~ ">" }
StructType = { "struct<" ~ structFields ~ ">" }
structFields = _{ structField ~ ("," ~ structField)* }
structField = { ident ~ ":" ~ DataType }
I decided to first parse this into a Rust enum, GlueDataType, and only then map from this enum to DataFusion/Arrow data types.
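The two-step approach could look roughly like the sketch below. It is not the project's actual code: the variants of `GlueDataType` are guessed from the grammar rules, and `ArrowType` is a simplified local stand-in for Arrow's `DataType` so that the example runs without extra crates.

```rust
/// Hypothetical intermediate representation; variants mirror a few grammar rules.
#[derive(Debug, Clone, PartialEq)]
enum GlueDataType {
    Int,
    BigInt,
    Boolean,
    Float,
    Decimal(usize, usize),
    Array(Box<GlueDataType>),
    Map(Box<GlueDataType>, Box<GlueDataType>),
    Struct(Vec<(String, GlueDataType)>),
}

/// Simplified stand-in for Arrow's `DataType` (assumption, not the real API).
#[derive(Debug, Clone, PartialEq)]
enum ArrowType {
    Int32,
    Int64,
    Boolean,
    Float32,
    Decimal(usize, usize),
    List(Box<ArrowType>),
    Map(Box<ArrowType>, Box<ArrowType>),
    Struct(Vec<(String, ArrowType)>),
}

/// Second step: map the parsed Glue type to the target type, recursing into
/// nested types.
fn to_arrow(t: &GlueDataType) -> ArrowType {
    match t {
        GlueDataType::Int => ArrowType::Int32,
        GlueDataType::BigInt => ArrowType::Int64,
        GlueDataType::Boolean => ArrowType::Boolean,
        GlueDataType::Float => ArrowType::Float32,
        GlueDataType::Decimal(p, s) => ArrowType::Decimal(*p, *s),
        GlueDataType::Array(inner) => ArrowType::List(Box::new(to_arrow(inner))),
        GlueDataType::Map(k, v) => {
            ArrowType::Map(Box::new(to_arrow(k)), Box::new(to_arrow(v)))
        }
        GlueDataType::Struct(fields) => ArrowType::Struct(
            fields
                .iter()
                .map(|(name, ty)| (name.clone(), to_arrow(ty)))
                .collect(),
        ),
    }
}

fn main() {
    // What a parser might produce for `array<struct<id:bigint,price:decimal(10,2)>>`.
    let parsed = GlueDataType::Array(Box::new(GlueDataType::Struct(vec![
        ("id".to_string(), GlueDataType::BigInt),
        ("price".to_string(), GlueDataType::Decimal(10, 2)),
    ])));
    println!("{:?}", to_arrow(&parsed));
}
```

Keeping the parse result separate from the Arrow mapping means the grammar can be tested on its own, and the mapping stays a plain recursive match.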
In summary this exercise felt like a walk in the park ;)