A Comment-Preserving Pretty-Printer
The goal is to implement a comment-preserving autoformatter for a lambda
calculus language using s-expressions. Without comments, the grammar of the
languages is (with v
as a metavariable for identifiers and n
as a
metavariable for literal integers):
v ∈ Vars
n ∈ Integers
e ::= v : variable
| n : literal integer
| ( fun v e ) : abstraction
| ( e e ) : application
In our little language comments begin with a semicolon (;
) character and
extend to the end of the line. Comments can occur anywhere whitespace can occur.
For example, the following would be a valid program in our little language:
; Definition of the Y combinator
(fun f ; function to make recursive
; remember, 'fun' is 'lambda'
((fun x
(f (x
x)
))
(fun x (f (x x)))))
After pretty-printing (with a target of 25 characters per line), it will look like:
; Definition of the Y combinator
(fun f ; function to make recursive
; remember, 'fun' is 'lambda'
((fun x (f (x x)))
(fun x (f (x x)))))
There are two steps involved in auto-formatting the language:
- parsing in a way that preserves comment information, and
- rendering the syntax tree.
First, we have our language extensions and imports. We’ll be using the following libraries in this implementation:
- text,
- megaparsec,
- parser-combinators,
- parsers, and
- prettyprinter.
{-# LANGUAGE DuplicateRecordFields #-}
{-# LANGUAGE ImportQualifiedPost #-}
{-# LANGUAGE OverloadedStrings #-}
module PrettyPrintWithComments where
import Control.Applicative (many)
import Data.Foldable (asum)
import Data.Functor (void)
import Data.List (intersperse)
import Data.List.NonEmpty (NonEmpty, some1)
import Data.Maybe (catMaybes, isNothing)
import Data.Text (Text)
import Data.Text qualified as Text
import Data.Void (Void)
import Prettyprinter (Doc, Pretty (pretty))
import Prettyprinter qualified as PP
import Text.Megaparsec qualified as P
import Text.Megaparsec.Char qualified as P
import Text.Megaparsec.Char.Lexer qualified as Lexer
import Text.Parser.Combinators qualified as P (skipOptional)
Preserving Comments While Parsing
What information about the comments we need to put the syntax tree depends on what about the comments we want to preserve while formatting the code. In this example, we want to keep comments on the same line as some code on that line, and we want to maintain separation between blocks of comments that are on their own lines and separated by empty lines.
newtype Comment = Comment
-- | The content of the comment, not including the leading comment
{ -- character
commentContent :: Text
}deriving (Show)
data CommentSection = CommentSection
-- | The comment on the same line as the lexeme, if there is one
{ commentSameLine :: Maybe Comment
-- | Blocks of comments following lexemes
, commentBlocks :: [NonEmpty Comment]
}deriving (Show)
If we wanted to preserve user-inserted line breaks or otherwise use them as hints for formatting, we would also need to store where the line breaks occur.
We’ll be using a parser combinator library. The typical way to use a parser combinator is to first define the combinators for dealing with lexemes, and then use those to define the parsers for syntactic elements. Whitespace is handled by having lexeme parsers consume and discard all trailing whitespace.
Because we want to preserve comment information, instead of having the whitespace parser discard all of its information, we will have it preserve the comment information that we are interested in.
When we parse a single line of a comment, we will consume the trailing whitespace from the concluding newline through either the end of the next line (not including the newline) or the first non-whitespace character, whichever comes first.
type Parser = P.Parsec Void Text
commentP :: Parser Comment
= do
commentP $ P.char ';'
void <- P.takeWhileP (Just "character") (/= '\n')
content <* P.skipMany P.hspace1
P.skipOptional P.newline pure $ Comment content
A block of comments is just one or more single comment lines.
commentBlockP :: Parser (NonEmpty Comment)
= some1 commentP commentBlockP
And comment blocks can be separated by lines of only whitespace.
commentBlocksP :: Parser [NonEmpty Comment]
= fmap catMaybes . many . P.choice $
commentBlocksP Just <$> commentBlockP
[ Nothing <$ P.space1
, ]
Given these definitions, we make a new definition of space
that serves the
same purpose as the one from Megaparse.Char.Lexer
, except that it also
produces the comments it encountered while consuming whitespace.
space :: Parser CommentSection
= P.hidden $ do
space
P.skipMany P.hspace1<- P.optional commentP
sameLineC <- commentBlocksP
cs pure $ CommentSection sameLineC cs
We define a similar variants of lexeme
and symbol
. Since token parsers
consume trailing whitespace, parsing a token produces the comments included in
that trailing whitespace.
lexeme :: Parser a -> Parser (a, CommentSection)
= do
lexeme p <- p
x <- space
cs pure (x, cs)
symbol :: Text -> Parser CommentSection
= fmap snd . lexeme . P.string symbol
With that, we define parsers for the tokens in our language.
lparenP :: Parser CommentSection
= symbol "("
lparenP
rparenP :: Parser CommentSection
= symbol ")"
rparenP
funP :: Parser CommentSection
= symbol "fun"
funP
dotP :: Parser CommentSection
= symbol "fun"
dotP
colonP :: Parser CommentSection
= symbol "::"
colonP
identifierP :: Parser (Text, CommentSection)
= lexeme (fst <$> P.match (P.letterChar *> many P.alphaNumChar))
identifierP
literalP :: Parser (Integer, CommentSection)
= lexeme Lexer.decimal literalP
The definition of the AST is standard, except for one field on each of the Abs
and App
constructors which are used to capture information about the comments
that appear within those syntactic constructs.
Despite including the comments that follow a lexical token with the parsing of
that token, when representing the AST it is easier to work only with the
comments contained within the syntactic element (in part due to how the
prettyprinter
library handles indentation and newlines, which we’ll explain
later).
The inclusion only of the comments that occur within the syntactic elements is
why the Lit
and Var
constructors have no comment field.
For AbsComments
and AppComments
, the comment fields is named after the
preceding syntactic element. Because only the comments contained within the
overall syntactic construct are included, there is an lparenComments
field,
but no rparenComments
field. The fields that would trail the right parenthesis
are instead included in the containing construct.
type Var = Text
data Exp
= Lit Integer
| Var Var
| Abs AbsComments Var Exp
| App AppComments Exp Exp
deriving (Show)
data AbsComments = AbsComments
lparenComments :: CommentSection
{ keywordComments :: CommentSection
, boundNameComments :: CommentSection
, bodyComments :: CommentSection
,
}deriving (Show)
data AppComments = AppComments
lparenComments :: CommentSection
{ functionComments :: CommentSection
, argumentComments :: CommentSection
,
}deriving (Show)
The overall program starts at the beginning of the input and goes to the end of
the input, and so includes all leading and trailing comments outside of the
expressions being parsed. Leading comments can’t occur on the same line as some
preceding code, because there is no preceding code, so the leading comments are
just a list of Comment
blocks instead of a CommentSection
.
data Program = Program [NonEmpty Comment] Exp CommentSection
deriving (Show)
As with the parsers for the lexical tokens, the parsers for the syntactic elements produce both the parsed value and the trailing comment. With the literal and variable parsers, this is straightforward: the parsers just wrap the values with the appropriate constructors.
litP :: Parser (Exp, CommentSection)
= do
litP <- literalP
(n, c) pure (Lit n, c)
varP :: Parser (Exp, CommentSection)
= do
varP <- identifierP
(v, c) pure $ (Var v, c)
For abstraction and application, we first define the parsers for the content within the parentheses. Because we aren’t parsing the parentheses here, we don’t have the comment following the left parenthesis and so must take it as an argument. We also don’t need to return an extra comment, because the comment trailing the last token parsed is still within the construct.
absP :: Parser (CommentSection -> Exp)
= do
absP <- funP
kwc <- identifierP
(x, xc) <- expP
(e, ec) pure $ \lp -> Abs (AbsComments lp kwc xc ec) x e
appP :: Parser (CommentSection -> Exp)
= do
appP <- expP
(f, fc) <- expP
(e, ec) pure $ \lp -> App (AppComments lp fc ec) f e
We then use the withParensP
helper to handle parsing the parentheses,
providing the comment trailing the left parenthesis and producing the comment
trailing the right parenthesis.
parensP :: Parser a -> Parser (CommentSection, a, CommentSection)
= (,,) <$> lparenP <*> p <*> rparenP
parensP p
withParensP :: Parser (CommentSection -> b) -> Parser (b, CommentSection)
= do
withParensP p <- parensP p
(lc, f, rc) pure $ (f lc, rc)
With that, we can assemble our expression parser and program parser.
expP :: Parser (Exp, CommentSection)
=
expP
asum
[ withParensP (asum [absP, appP])
, varP
, litP
]
programP :: Parser Program
= do
programP
P.skipMany P.hspace1<- commentBlocksP
leadingC <- expP
(e, trailingC) pure $ Program leadingC e trailingC
Pretty-Printing with Comments
Now that we can parse programs while preserving comment information, we can move
on to rendering the syntax in our desired format. We will use prettyprint
Wadler-Leijen pretty printing library to handle the actual layout.
First, we define how to pretty-print the comments. This is mostly straightforward,
except that the final newline of a comment block has to be omitted. The nest
function from the prettyprint
library does not affect indentation until after
the next newline (rather than affecting indentation of the first character after
the newline). Therefore, if we want the indentation of anything following a
comment to differ from the indentation of the comment, we need to be able to
emit the newline after the close of the indentation group.
instance Pretty Comment where
= (";" <>) . pretty . Text.stripEnd $ commentContent c
pretty c =
prettyList cs if null cs
then mempty
else mconcat $ intersperse PP.hardline (map pretty cs)
prettyCommentBlocks :: [NonEmpty Comment] -> Doc a
=
prettyCommentBlocks mconcat . PP.punctuate (PP.hardline <> PP.hardline) . map pretty
instance Pretty CommentSection where
CommentSection sl cs) =
pretty (mconcat
maybe mempty ((PP.space <>) . pretty) sl
[ if null cs then mempty else PP.hardline
,
, prettyCommentBlocks cs ]
Despite preserving this ability, in our choice of pretty-printing below, we don’t end up needing it, and so we define the following helpers that we can use. However, if you want to have the closing parenthesis of an expression at a different indentation level than the indentation level of the previous part of the expression, then you will need to render the comment blocks and the final newline separately.
isEmpty :: CommentSection -> Bool
CommentSection sl cs) = isNothing sl && null cs
isEmpty (
prettyCommentSectionOrElse :: Doc ann -> CommentSection -> Doc ann
=
prettyCommentSectionOrElse alt cb mconcat
[ pretty cbif isEmpty cb then alt else PP.hardline
,
]
prettyCommentSection :: CommentSection -> Doc ann
= prettyCommentSectionOrElse mempty
prettyCommentSection
hasComment :: Exp -> Bool
= case e of
hasComment e Lit _ -> False
Var _ -> False
Abs (AbsComments lpc kwc vc bc) _ b ->
or
not (isEmpty lpc)
[ not (isEmpty kwc)
, not (isEmpty vc)
, not (isEmpty bc)
,
, hasComment b
]App (AppComments lpc fc xc) f x ->
or
not (isEmpty lpc)
[ not (isEmpty fc)
, not (isEmpty xc)
,
, hasComment f
, hasComment x ]
With all of this, the actual rendering is straightforward. Most of the implementation is about choices of how the rendered code should look:
- What should the indentation level be? For this, I’ve chosen values that I think look nice.
- Should a space be turned into newline (i.e., should we use
space
orsoftine
) if there isn’t enough room? We don’t in some cases because it would not move the expression any further to the left.
…and so on.
instance Pretty Exp where
= PP.group $ case e of
pretty e Lit n -> pretty n
Var v -> pretty v
Abs (AbsComments lpc kwc vc bc) v b ->
let vbBreak =
if not (isEmpty kwc) || hasComment b
then PP.hardline
else PP.softline
in mconcat $
1 . mconcat $
[ PP.nest "("
[
, prettyCommentSection lpc"fun"
,
]5 . mconcat $
, PP.nest
[ prettyCommentSectionOrElse PP.space kwc
, pretty v
]3 . mconcat $
, PP.nest
[ prettyCommentSectionOrElse vbBreak vc
, pretty b
, prettyCommentSection bc")"
,
]
]App (AppComments lpc fc xc) f x ->
let fxBreak =
if hasComment f || hasComment x
then PP.hardline
else PP.softline
in mconcat $
"("
[ . mconcat $
, PP.align
[ prettyCommentSection lpc
, pretty f
, prettyCommentSectionOrElse fxBreak fc
, pretty x
, prettyCommentSection xc")"
,
]
]
instance Pretty Program where
Program leadingC e trailingC) =
pretty (mconcat
[ prettyCommentBlocks leadingCif null leadingC then mempty else PP.hardline
,
, pretty e
, prettyCommentSectionOrElse PP.hardline trailingC ]
We are subject to the usual limitations of Wadler-Leijen pretty-printers in terms of where the layout algorithm will decide to insert line breaks. For example, when rendering our original example at 40 columns wide, the break is not in the ideal location:
; Definition of the Y combinator
(fun f ; function to make recursive
; remember, 'fun' is 'lambda'
((fun x (f (x x))) (fun x (f (x
x)))))
Depending on the desired output for your pretty-printer, you might need to do
additional wrangling of prettyprint
, use a different rendering algorithm (such
as a Bernardy
pretty-printer)
or use a custom algorithm.