
Dump Clang ASTs of all C++ files in a compilation database as JSON

Using Clang, one can obtain a JSON representation of the Abstract Syntax Tree (AST) of a C++ file as follows:

clang++ -Xclang -ast-dump=json -fsyntax-only example.cxx > example.json

Unfortunately, this JSON representation omits much of the information that is available to the parser. Alternatives to this approach of obtaining an AST of a C++ file are discussed at the end of this post.
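
To get a feel for the shape of such a dump, here is a minimal sketch that walks the tree and lists function declarations. It assumes the node layout Clang emits (a "kind" key, an optional "name", and child nodes under "inner"). The SAMPLE_AST literal is a hand-written, heavily abridged stand-in rather than real Clang output, and the helper names walk and function_signatures are made up for this illustration.

```python
import json

# Hand-written, heavily abridged stand-in for a real dump (assumption: real
# dumps carry many more keys per node, such as "id", "loc" and "range").
SAMPLE_AST = json.loads("""
{
  "kind": "TranslationUnitDecl",
  "inner": [
    {
      "kind": "FunctionDecl",
      "name": "add",
      "type": {"qualType": "int (int, int)"},
      "inner": [
        {"kind": "ParmVarDecl", "name": "x", "type": {"qualType": "int"}},
        {"kind": "ParmVarDecl", "name": "y", "type": {"qualType": "int"}},
        {"kind": "CompoundStmt", "inner": [{"kind": "ReturnStmt"}]}
      ]
    }
  ]
}
""")


def walk(node):
    """Yield the node and all of its descendants in pre-order."""
    yield node
    for child in node.get("inner", []):
        yield from walk(child)


def function_signatures(ast):
    """Collect (name, type) pairs of all FunctionDecl nodes in the tree."""
    return [
        (node["name"], node["type"]["qualType"])
        for node in walk(ast)
        if node.get("kind") == "FunctionDecl"
    ]


print(function_signatures(SAMPLE_AST))  # prints [('add', 'int (int, int)')]
```

On a real dump, the same traversal would also visit every declaration pulled in from included headers, so filtering by source location is likely to be needed in practice.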

The AST JSON file can be very large compared to the C++ source code it was created from, because the dump also covers everything the file includes. For example, a simple "Hello, World!" program which uses #include <iostream> results in a JSON file of approximately 120 megabytes! Compressing the file is therefore all but mandatory. Using gzip, I was able to bring the size down to approximately 5 megabytes.
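
The reason gzip works so well here is that the dump is highly repetitive. The following sketch illustrates the effect with Python's gzip module on synthetic data. The fake_ast structure is an invented stand-in, not real Clang output, so the sizes it prints say nothing about the ratio of a real dump.

```python
import gzip
import json

# Synthetic stand-in for an AST dump: many structurally similar nodes.
fake_ast = {
    "kind": "TranslationUnitDecl",
    "inner": [
        {"kind": "FunctionDecl", "name": f"f{i}", "type": {"qualType": "void ()"}}
        for i in range(10_000)
    ],
}

raw = json.dumps(fake_ast).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")

# Decompressing and parsing again yields the original structure.
restored = json.loads(gzip.decompress(compressed))
```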

Because the AST of a C++ source file very much depends on the compiler flags, one needs access to them. One way to obtain them is via a compilation database. To automatically dump the ASTs of all C++ source files in this compilation database, I wrote the following script:

#!/usr/bin/python3

# Copyright 2022 Pascal Schmid
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# You may find more information about this script at
# https://paschmid.ch/

# https://clang.llvm.org/docs/JSONCompilationDatabase.html

import argparse
import gzip
import json
import math
import subprocess
import tempfile
from pathlib import Path


def main():
    parser = argparse.ArgumentParser(
        description="Dump the AST of all C++ files given in a "
        + "compilation database as gzipped JSON files using Clang"
    )
    parser.add_argument(
        "build_path",
        type=Path,
        help="Path to the directory which contains the compilation database",
    )
    parser.add_argument(
        "ast_path",
        type=Path,
        help="Path to the directory into which the outputs should be placed",
    )
    args = parser.parse_args()

    assert args.build_path.is_dir()

    compilation_database_file = args.build_path / "compile_commands.json"
    assert compilation_database_file.is_file()
    with compilation_database_file.open(mode="rt") as fp:
        compilation_database = json.load(fp)

    assert isinstance(compilation_database, list)

    args.ast_path.mkdir(exist_ok=True)

    counter = 0
    for command_object in compilation_database:
        counter += 1

        assert "directory" in command_object
        compile_directory = Path(command_object["directory"])
        assert compile_directory.is_dir()

        assert "file" in command_object
        source_file = Path(command_object["file"])
        if not source_file.is_absolute():
            source_file = (compile_directory / source_file).resolve(strict=True)
            assert source_file.is_absolute()
        assert source_file.is_file()
        assert "\n" not in str(source_file)
        print(source_file)

        assert "command" in command_object
        compile_command = command_object["command"]
        assert compile_command.startswith("/usr/bin/clang++ ")
        compiler_args = compile_command[compile_command.index(" ") :]

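        # Pass the original arguments to Clang via a response file (@file),
        # so that their shell-style quoting does not have to be parsed here.
        with tempfile.NamedTemporaryFile() as response_file:
            response_file.write(bytes(compiler_args, "utf-8"))
            # Flush so that Clang sees the complete contents of the file.
            response_file.flush()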

            proc = subprocess.run(
                [
                    "/usr/bin/clang++",
                    "-Xclang",
                    "-ast-dump=json",
                    "-fsyntax-only",
                    "@" + str(response_file.name),
                ],
                cwd=compile_directory,
                stdout=subprocess.PIPE,
            )

        # Zero-pad the counter to the number of digits of the entry count.
        ast_file = Path(args.ast_path) / "{:0{numbers}d}.json.gz".format(
            counter, numbers=len(str(len(compilation_database)))
        )
        assert not ast_file.exists()
        with gzip.open(ast_file, "wb") as fp:
            fp.write(proc.stdout)
        assert ast_file.is_file()


if __name__ == "__main__":
    main()
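
The script expects every entry of the compilation database to carry the keys "directory", "file" and "command". As a rough sketch of what such an entry looks like, the snippet below parses a hypothetical entry (the paths and flags are made up for illustration) and separates the compiler from its arguments. It uses shlex.split to show the individual arguments. The script itself does not do this; it hands the argument string to Clang unchanged via a response file.

```python
import json
import shlex

# Hypothetical entry; paths and flags are invented for illustration.
SAMPLE_DATABASE = json.loads("""
[
  {
    "directory": "/home/user/example/build",
    "command": "/usr/bin/clang++ -DExample_EXPORTS -fPIC -o CMakeFiles/Example.dir/example.cxx.o -c /home/user/example/example.cxx",
    "file": "/home/user/example/example.cxx"
  }
]
""")

entry = SAMPLE_DATABASE[0]

# Separate the compiler path from everything that follows it.
compiler, _, compiler_args = entry["command"].partition(" ")

# Split the shell-style argument string into individual arguments.
arguments = shlex.split(compiler_args)

print(compiler)
print(arguments)
```

Note that the compilation database format also allows an "arguments" list in place of the "command" string; the script above only handles "command" and would need adjusting for databases that use "arguments".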

The following example demonstrates how to use the script.

Assume the C++ file of interest is example.cxx, with the following content:

int add(int x, int y)
{
    return x + y;
}

Create a CMakeLists.txt file containing this:

cmake_minimum_required(VERSION 3.25)

project(Example LANGUAGES CXX)

add_library(Example SHARED example.cxx)

Configure the project using CMake to obtain the compilation database (build/compile_commands.json):

cmake \
    -DCMAKE_EXPORT_COMPILE_COMMANDS:BOOL=ON \
    -DCMAKE_CXX_COMPILER=/usr/bin/clang++ \
    -S . \
    -B build

Run the script to dump the ASTs:

python3 ./main.py build/ ast/

The outputs can then be found in the ast directory, where the number in each filename is the 1-based position of the corresponding entry in the compilation database.
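
To work with the results, the files have to be decompressed and parsed again. Here is a minimal sketch of how that might look. The function name load_asts is made up for this example, and the demonstration writes two tiny placeholder files into a temporary directory instead of reading real dumps.

```python
import gzip
import json
import tempfile
from pathlib import Path


def file_number(path):
    """Return the number encoded in a name such as '01.json.gz'."""
    return int(path.name.split(".")[0])


def load_asts(ast_dir):
    """Yield (number, ast) for every *.json.gz file, ordered by number."""
    for path in sorted(Path(ast_dir).glob("*.json.gz"), key=file_number):
        with gzip.open(path, "rt", encoding="utf-8") as fp:
            yield file_number(path), json.load(fp)


# Demonstration with placeholder files instead of real AST dumps.
with tempfile.TemporaryDirectory() as tmp:
    for number in (2, 1):
        target = Path(tmp) / f"{number}.json.gz"
        with gzip.open(target, "wt", encoding="utf-8") as fp:
            json.dump({"kind": "TranslationUnitDecl", "inner": []}, fp)
    loaded = list(load_asts(tmp))

print([number for number, _ in loaded])  # prints [1, 2]
```

Because the script starts counting at 1, the file with number n belongs to entry n - 1 of the list loaded from compile_commands.json.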

Unfortunately, the AST files do not contain enough information for the project I have in mind. I may have to resort to either libclang or LibTooling. However, I do not like C++, so I might use the Python bindings to libclang instead.

I hope the AST dumps will be more complete in future versions of Clang.

This blog post by Pascal Schmid is licensed under CC BY-SA 4.0.