When people think of linting, static code analysis for programming languages usually comes to mind first; markup languages rarely do.

In this article, I would like to share how our team developed ZK Client MVVM Linter, an XML linter that automates migration assessment for our new Client MVVM feature in the upcoming ZK 10 release. The basic idea is to compile a catalog of known compatibility issues as lint rules, so that users can review the issues the linter flags before committing to the migration.

For those unfamiliar with ZK, ZK is a Java framework for building enterprise applications; ZUL (ZK User Interface Markup Language) is its XML-based language for simplifying user interface creation. Through sharing our experience developing ZK Client MVVM Linter, we hope XML linters can find broader applications.

File Parsing

The Problem

Like other popular linters, our ZUL linter starts by parsing source code into an AST (abstract syntax tree). Although Java provides several libraries for XML parsing, they lose the original line and column numbers of elements during parsing. Because the subsequent analysis stage needs this positional information to report compatibility issues precisely, our first task was to find a way to obtain the original line and column numbers and store them in the AST.

How We Address This

After exploring different online sources, we found a Stack Overflow solution that leverages the event-driven nature of the SAX parser to store the end position of each start tag in the AST. Its key observation is that the parser invokes the startElement method whenever it encounters the closing '>' character of a start tag. The position returned by the locator at that moment is therefore the end position of the start tag, which makes startElement the ideal place to create new AST nodes and store their end positions.


public static Document parse(File file) throws Exception {
    Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(file, new DefaultHandler() {
        private Locator _locator;
        private final Stack<Node> _stack = new Stack<>();

        @Override
        public void setDocumentLocator(Locator locator) {
            _locator = locator;
            _stack.push(document);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attributes) {
            // Create a new AST node
            Element element = document.createElement(qName);
            for (int i = 0; i < attributes.getLength(); i++)
                element.setAttribute(attributes.getQName(i), attributes.getValue(i));
            // Store its end position
            int lineNumber = _locator.getLineNumber(), columnNumber = _locator.getColumnNumber();
            element.setUserData("position", lineNumber + ":" + columnNumber, null);
            _stack.push(element);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            Node element = _stack.pop();
            _stack.peek().appendChild(element);
        }
    });
    return document;
}
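To illustrate how a later analysis stage might read the stored positions back, here is a small self-contained sketch. It is not the linter's actual code: the PositionReport class and its report method are hypothetical names, the input is parsed from a string rather than a file, and attributes are omitted for brevity. The exact column values depend on the SAX parser in use, so the sketch does not assume any particular number.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only
final class PositionReport {

    /** Builds a DOM tree in which every element carries the "line:column" where its start tag ends. */
    static Document parse(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(document);
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    private Locator locator;

                    @Override
                    public void setDocumentLocator(Locator locator) {
                        this.locator = locator;
                    }

                    @Override
                    public void startElement(String uri, String localName, String qName, Attributes attributes) {
                        Element element = document.createElement(qName);
                        // The locator points just past the start tag at this moment
                        element.setUserData("position",
                                locator.getLineNumber() + ":" + locator.getColumnNumber(), null);
                        stack.push(element);
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) {
                        Node element = stack.pop();
                        stack.peek().appendChild(element);
                    }
                });
        return document;
    }

    /** Lists every element with its stored position, in document order. */
    static List<String> report(String xml) {
        try {
            List<String> lines = new ArrayList<>();
            collect(parse(xml), lines);
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    private static void collect(Node node, List<String> out) {
        if (node instanceof Element)
            out.add(node.getNodeName() + " ends at " + node.getUserData("position"));
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling())
            collect(child, out);
    }

    public static void main(String[] args) {
        report("<a>\n  <b/>\n</a>").forEach(System.out::println);
    }
}
```

A lint rule could use the same getUserData("position") lookup to attach a line and column to each issue it reports.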

Building on the solution above, we implemented a more sophisticated parser that also stores the position of each attribute. It uses the end position returned by the locator as a reference point, which reduces the task to finding each attribute's position relative to that end position. We started with a simple idea: iteratively find and remove the last occurrence of each attribute name and value from the buffer. For example, if <elem attr1="value" attr2="value"> ends at 3:34 (line 3, column 34), our parser performs the following steps:


1. Initialize buffer = <elem attr1="value" attr2="value">
2. Find buffer.lastIndexOf("value") = 28 → Update buffer = <elem attr1="value" attr2="
3. Find buffer.lastIndexOf("attr2") = 21 → Update buffer = <elem attr1="value"
4. Find buffer.lastIndexOf("value") = 14 → Update buffer = <elem attr1="
5. Find buffer.lastIndexOf("attr1") = 7 → Update buffer = <elem

The indices here are 1-based, so they map directly to column numbers. From steps 3 and 5, we can conclude that attr2 and attr1 start at 3:21 and 3:7, respectively.
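The simple approach above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the linter's actual code: the SimpleAttributeLocator class, its locate method, and its parameters are hypothetical, the start tag is assumed to sit on a single line, and attribute values are assumed to appear in the source exactly as the parser reports them (for example, no entity references).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only
final class SimpleAttributeLocator {

    /**
     * @param startTag   the full text of the start tag, from '<' to '>'
     * @param endColumn  1-based column of the closing '>' (taken from the locator)
     * @param attributes {name, value} pairs in document order
     * @return attribute name mapped to the 1-based column where the name starts
     */
    static Map<String, Integer> locate(String startTag, int endColumn, String[][] attributes) {
        Map<String, Integer> columns = new LinkedHashMap<>();
        // Column of the first buffer character; String indices are 0-based
        int firstColumn = endColumn - startTag.length() + 1;
        String buffer = startTag;
        // Walk the attributes from last to first, shrinking the buffer as we go
        for (int i = attributes.length - 1; i >= 0; i--) {
            String name = attributes[i][0];
            String value = attributes[i][1];
            // Drop everything from the last occurrence of the value onward
            buffer = buffer.substring(0, buffer.lastIndexOf(value));
            // The last occurrence of the name in what remains is this attribute
            int nameIndex = buffer.lastIndexOf(name);
            buffer = buffer.substring(0, nameIndex);
            columns.put(name, firstColumn + nameIndex);
        }
        return columns;
    }

    public static void main(String[] args) {
        String[][] attributes = {{"attr1", "value"}, {"attr2", "value"}};
        // prints {attr2=21, attr1=7}
        System.out.println(locate("<elem attr1=\"value\" attr2=\"value\">", 34, attributes));
    }
}
```

Searching from the end matters because the same text, such as "value" here, can occur several times in one tag; removing the matched suffix after each step keeps later searches from hitting an attribute that has already been located.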

We then improved the mechanism to handle other formatting variations, such as a single start tag spanning multiple lines or multiple start tags on a single line. To do so, we introduced two stacks: a start index stack, which stores the buffer indices where new lines start, and a leading space stack, which stores the number of leading spaces on each line. For example, consider a start tag that starts on line 1 and ends at 3:20 (line 3, column 20):


<elem attr1="value
    across 2 lines"
    attr2 = "value">

Our parser will perform the following steps:


1. Initialize buffer = <elem attr1="value across 2 lines" attr2 = "value"> (each line break and its indentation shown as a single space)
2. Initialize startIndexes = [0, 19, 35] and leadingSpaces = [0, 4, 4]
3. Find buffer.lastIndexOf("value") = 45
4. Find buffer.lastIndexOf("attr2") = 36 → lineNumber = 3, startIndexes = [0, 19, 35] and leadingSpaces = [0, 4, 4] → columnNumber = 36 - startIndexes.peek() + leadingSpaces.peek() = 5
5. Find buffer.lastIndexOf("value across 2 lines") = 14
6. Find buffer.lastIndexOf("attr1") = 7 → Update lineNumber = 1, startIndexes = [0], and leadingSpaces = [0] → columnNumber = 7 - startIndexes.peek() + leadingSpaces.peek() = 7

From steps 4 and 6, we can conclude that attr2 and attr1 start at 3:5 and 1:7, respectively.
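The multi-line variant can be sketched in the same way. Again, this is only an illustration under assumptions, not the linter's implementation: the MultiLineAttributeLocator class and its locate signature are hypothetical, the start tag is passed in as its raw source lines, each line break plus indentation is collapsed into one space when the buffer is built, and attribute values are given in that same collapsed form. It covers a single start tag spanning several lines; several start tags on one line are not handled here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only
final class MultiLineAttributeLocator {

    /**
     * @param lines      raw source lines of one start tag, first line first
     * @param endLine    line number where the tag ends (taken from the locator)
     * @param attributes {name, value} pairs in document order
     * @return attribute name mapped to "line:column" of the attribute name
     */
    static Map<String, String> locate(String[] lines, int endLine, String[][] attributes) {
        StringBuilder joined = new StringBuilder();
        Deque<Integer> startIndexes = new ArrayDeque<>();   // buffer index where each line starts
        Deque<Integer> leadingSpaces = new ArrayDeque<>();  // indentation of each line
        for (String line : lines) {
            int indent = 0;
            while (indent < line.length() && Character.isWhitespace(line.charAt(indent))) indent++;
            if (joined.length() > 0) joined.append(' ');
            startIndexes.push(joined.length());
            leadingSpaces.push(indent);
            joined.append(line.substring(indent));
        }

        Map<String, String> positions = new LinkedHashMap<>();
        String buffer = joined.toString();
        int lineNumber = endLine;
        for (int i = attributes.length - 1; i >= 0; i--) {
            buffer = buffer.substring(0, buffer.lastIndexOf(attributes[i][1]));
            int nameIndex = buffer.lastIndexOf(attributes[i][0]);
            buffer = buffer.substring(0, nameIndex);
            int index = nameIndex + 1; // 1-based, as in the walkthrough above
            // Pop lines that start after this attribute, moving up one line each time
            while (startIndexes.peek() >= index) {
                startIndexes.pop();
                leadingSpaces.pop();
                lineNumber--;
            }
            int columnNumber = index - startIndexes.peek() + leadingSpaces.peek();
            positions.put(attributes[i][0], lineNumber + ":" + columnNumber);
        }
        return positions;
    }

    public static void main(String[] args) {
        String[] lines = {"<elem attr1=\"value", "    across 2 lines\"", "    attr2 = \"value\">"};
        String[][] attributes = {{"attr1", "value across 2 lines"}, {"attr2", "value"}};
        // prints {attr2=3:5, attr1=1:7}
        System.out.println(locate(lines, 3, attributes));
    }
}
```

Because attributes are visited from last to first, the two stacks only ever need to be popped, never rebuilt: once an attribute is found on an earlier line, no later line is needed again.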