Comment développer un analyseur de code en quelques heures-javaDidacticiel-php.cn

L'analyse statique est un outil robuste qui aide les développeurs à contrôler la qualité du code. Essayons de développer un analyseur simple pour Lua en utilisant Java et voyons ce qu'il y a sous le capot de l'analyseur statique.

How to develop code analyzer in hours

Petit avant-propos

Nous allons écrire l'analyseur statique pour Lua en Java.

Pourquoi Lua ? Sa syntaxe est simple et il n'est pas nécessaire de s'embourber dans les détails. C'est aussi un bon langage, ~~surtout comparé à JavaScript~~. Pourquoi Java ? Il dispose d'une pile technologique complète pour nous, et Java est également un langage de développement convivial.

Non :) Cependant, l'article explique comment nous avons développé un véritable pilote d'analyseur lors du hackathon interne. Le titre n’est donc pas qu’un simple piège à clics : nous avons en réalité eu 48 heures. Notre équipe était composée de trois développeurs, donc si vous souhaitez essayer ce solo, préparez-vous à y consacrer un peu plus de temps.

Comme il ne s'agit que d'un pilote, nous appellerons notre analyseur « Mun » dans l'article.

Qu'est-ce qu'un analyseur ?

Avant de commencer, il est préférable de comprendre ce qu'est un analyseur et de définir l'étendue du travail. Dans l’ensemble, c’est clair : récupérez le code et râlez-vous s’il y a quelque chose qui ne va pas ici. De quoi avons-nous besoin exactement ? Nous nous intéressons aux aspects suivants de l'analyse statique :

Lexer et analyseur. Nous prenons le code source et le transformons en un arbre facile à utiliser (AST).
AST (ou arbre de syntaxe abstraite) est une manière de représenter la structure de données d'un programme sous forme d'arborescence. Il contient des données sur la syntaxe du programme.
Données sémantiques. Seules les données de syntaxe ne suffisent pas pour l’analyse. Nous avons donc besoin d’un mécanisme supplémentaire qui regroupe les données sémantiques (de signification) dans l’arborescence. Par exemple, il pourrait s'agir de la portée de la variable.
Analyse des flux de données. Si nous souhaitons effectuer une analyse approfondie, nous pouvons essayer de prédire les valeurs des variables dans les nœuds du programme. Par exemple, cela permet de détecter les erreurs liées à la division zéro.

Cela semble fastidieux, mais ce n'est que le cœur de l'analyseur. Au fait, un compilateur a la même chose dans le frontend. Cependant, la partie la plus délicate (la génération de code) se cache juste dans le backend.

Quel est le plan pour le noyau ? Le périmètre est le suivant :

écrire des règles de diagnostic pour détecter les erreurs dans le code ;
collecter les erreurs détectées et émettre un rapport ;
créez notre propre plugin pour afficher les avertissements (facultatif).

En effet, trop pendant 48 heures :) Certaines choses doivent être abandonnées, certaines rationalisées et certaines réutilisées. Au fur et à mesure que nous parcourons l'article, nous examinerons de tels cas.

Pour mieux comprendre, schématisons-le :

How to develop code analyzer in hours

Cela suffit pour commencer. Vous pouvez en savoir plus sur l'analyse statique sur notre site Web.

Cœur

Lexer et analyseur

Théorie

Donc, lexer et parser sont la base de l'analyseur, nous ne pouvons pas aller plus loin sans eux. Si nous voulons quelque chose au-delà des expressions régulières.

C'est assez simple avec un lexer : c'est fastidieux de manipuler le code source sous forme de texte. Nous devons donc le traduire en une représentation intermédiaire, c'est-à-dire le diviser en jetons. Cela dit, le lexer ne doit pas être intelligent. C'est un incontournable car toutes les choses les plus difficiles et les plus compliquées vont à un analyseur.

Puis passons à un analyseur. Il prend les jetons entrants, les calcule et crée AST. Voici un bref aperçu.

Les langues ont des grammaires et il existe différents types d'analyseurs pour travailler avec elles. Si nous l'écrivions nous-mêmes, il serait préférable d'écrire le simple analyseur LL(1) pour une grammaire sans contexte. Comme nous l'avons déjà dit, Lua a une grammaire intuitive donc cela suffirait.

LL signifie que la chaîne d'entrée est lue de gauche à droite et que la sortie de gauche à droite est construite pour elle. En général, il ne suffit pas de regarder le jeton actuel, (LL(0)) , nous devrons donc peut-être examiner k jetons à l'avance. Un tel analyseur est appelé LL(k).

Cependant, nous n'en avons pas besoin, car nous n'allons rien écrire. Pourquoi? Nous n'avons que 48 heures et pas le temps de créer l'analyseur, surtout si vous ne l'avez jamais développé.

Approche choisie

Quelle alternative avons-nous pour écrire notre lexer et notre analyseur ? Générateurs ! Il s'agit de toute une classe d'utilitaires qui nécessitent uniquement que nous donnions un fichier de grammaire spécialement décrit en entrée pour générer l'analyseur.

Nous avons choisi ANTLR v4. L'outil est également écrit en Java, ce qui le rend très simple à utiliser. Au fil de nombreuses années de développement, il a commencé à très bien se porter.

Le problème se cache ici aussi : nous devons savoir comment écrire le fichier de grammaire pour l'analyseur. Heureusement, il n'est pas difficile de trouver des options toutes faites sur GitHub, nous allons donc partir d'ici.

Une fois que nous avons configuré le projet avec ANTLR, passons à la construction de l'arbre de syntaxe abstraite.

Arbre de syntaxe abstraite

En parlant d'arbres : ANTLR est bien conçu pour visualiser l'analyse de code. Par exemple, voici le calcul factoriel :

function fact (n)
  if n == 0 then
    return 1
  else
    return n * fact(n-1)
  end
end

print("enter a number:")
a = io.read("*number")
print(fact(a))

Copier après la connexion

On peut obtenir l'arbre suivant :

How to develop code analyzer in hours

Nous pensons que c'est probablement plus proche de l'arbre d'analyse en ce qui concerne la classification. Nous pouvons nous arrêter ici et commencer à travailler avec.

Mais nous ne ferons pas cela et ne le convertirons pas en AST. Pourquoi? Il serait plus facile pour nous, en tant qu'équipe Java, de travailler avec une arborescence similaire à celle de notre analyseur. Vous pouvez en savoir plus sur le développement Java ici. L'exemple de hiérarchie de classes de Spoon ressemble à ceci :

How to develop code analyzer in hours

Eh bien, assez pour reporter le travail manuel, il est temps d'écrire du code. Nous ne montrons pas entièrement le code dans son intégralité ; cela n'a aucun sens car il est volumineux et inesthétique. Pour vous donner un petit aperçu de notre façon de penser, je vais laisser quelques extraits de code sous un spoiler pour les personnes intéressées.

On commence la manipulation depuis la cime de l'arbre :

public void enterStart_(LuaParser.Start_Context ctx) {
    _file = new CtFileImpl();
    _file.setBlocks(new ArrayList<>());
    for (var chunk : ctx.children) {
        if (chunk instanceof LuaParser.ChunkContext) {
            CtBlock block = getChunk((LuaParser.ChunkContext)chunk);
            if (block == null)
                continue;

            block.setParent(_file);
            _file.getBlocks().add(block);
        }
    }
}

private CtBlock getChunk(LuaParser.ChunkContext ctx) {
    for (var block : ctx.children) {
        if (block instanceof LuaParser.BlockContext) {
            return getBlock((LuaParser.BlockContext)block);
        }
    }
    return  null;
}

private CtBlock getBlock(LuaParser.BlockContext ctx) {
    CtBlock block = new CtBlockImpl();
    block.setLine(ctx.start.getLine());
    block.setColumn(ctx.start.getCharPositionInLine());
    block.setStatements(new ArrayList<>());

    for (var statement : ctx.children) {
        if (statement instanceof LuaParser.StatContext) {
            var statements = getStatement((LuaParser.StatContext)statement);
            for (var ctStatement : statements) {
                ctStatement.setParent(block);
                block.getStatements().add(ctStatement);
            }
        }
    }
    return block;
}

Copier après la connexion

Nous parcourons simplement l'arbre de haut en bas, construisant notre arbre au fur et à mesure. Tôt ou tard, nous atteindrons les nœuds terminaux. Voici les paramètres de la fonction :

private List<CtParameter> parseParameters(
    LuaParser.NamelistContext ctx,
    CtElement parent
) {
    var parameters = new ArrayList<CtParameter>();
    for (var child : ctx.children) {
        if (Objects.equals(child.toString(), ","))
            continue;

        var parameter = new CtParameterImpl();
        parameter.setParameterName(child.toString());
        parameter.setParent(parent);
        parameters.add(parameter);
    }
    return parameters;
}

Copier après la connexion

Cela ne semble pas non plus être très difficile : nous transformons simplement un objet en un autre. Terminons également la liste des codes ici. Nous espérons que son fonctionnement est clair.

Franchement, cette approche ne conduit pas à de gros gains à long terme. Au lieu de convertir le texte en arbre, nous convertissons un arbre en un autre. Les deux tâches sont plutôt fastidieuses. Quelles options avons-nous ?

Nous pouvons écrire notre lexer et notre analyseur à partir de zéro. C'est bien, mais pas quand on est limité par les délais et les compétences.
Nous pouvons configurer l'ANTLR pour qu'il génère immédiatement l'AST souhaité. Cela semble trop beau pour être vrai, mais nous devons encore étudier ANTLR, ce qui serait également une perte de temps importante.
Solution de mise sur le marché rapide. Nous devons travailler avec ce que nous avons et convertir l’arbre résultant en arbre cible. Pas bon, mais c'est supportable.
Nous ne nous convertirons pas du tout. S'il n'y avait pas de développeurs d'analyseurs Java dans l'équipe, nous l'aurions fait.

La section serait clairement incomplète si nous n'apportions pas l'exemple d'analyse de code pour notre AST. Vous pouvez examiner la jolie empreinte de l'arbre pour le même calcul factoriel sous le spoiler.

CtGlobal:
  CtFile:
    CtFunction: func
      Parameters:
        CtParameter: n
      CtBlock:
        CtIf:
          Condition:
            CtBinaryOperator: Equals
              Left:
                CtVariableRead: n
              Right:
                CtLiteral: 0
          Then block
            CtBlock:
              CtReturn:
                CtLiteral: 1    
          Else block
            CtBlock:
              CtReturn:
                CtBinaryOperator: Multiplication
                  Left:
                    CtVariableRead: n
                  Right:
                    CtInvocation: fact
                      Arguments:
                        CtBinaryOperator: Minus
                          Left:
                            CtVariableRead: n
                          Right:
                            CtLiteral: 1
    CtInvocation: print
      Arguments:
        CtLiteral: "enter a number:"
    CtAssignment:
      Left:
        CtVariableWrite: a
      Right:
        CtInvocation:
          CtFieldRead: 
            Target: io
            Field: read
          Arguments:
            CtParameter: "*number"
    CtInvocation: print
      Arguments:
        CtInvocation: fact
          Arguments:
            CtVariableRead: a

Copier après la connexion

Terminons ici par un petit commentaire sur la terminologie. Bien qu'il soit préférable d'appeler notre entité un traducteur d'arbre, appelons-la simplement un analyseur pour l'instant.

Visiteur

Ensuite, nous développerons la fonctionnalité qui effectue l'analyse et l'utilise pour créer des règles de diagnostic. C'est une traversée d'arbre. Dans l’ensemble, il est facile de comprendre comment implémenter l’itération arborescente. Cependant, nous devons faire des choses utiles avec les nœuds de l'arborescence. Le modèle Visiteur est sur scène.

Vous souvenez-vous de ce modèle captivant du livre classique de GoF qui a une implémentation sympa mais un scénario d'utilisation plutôt vague ? Nous disposons donc d’un cas de référence sur la manière dont il est utilisé dans des circonstances réelles. Pour garder l'article concis, je ne montrerai pas comment cela est implémenté dans le livre, mais je montrerai comment nous le faisons dans l'analyseur.

Commençons simple, la traversée de l'arbre. Nous définissons la classe CtScanner et ajoutons deux méthodes scan pour un seul élément et leur collection.

public <T extends CtElement> void scan(T element) {
    if (element != null) {
        element.accept(this);
    }
}

public <T extends CtElement> void scan(List<T> elements) {
    if (elements != null) {
        for (var element : elements) {
            scan(element);
        }
    }
}

Copier après la connexion

Voyez-vous ceci accepter de CtElement ? Dans notre cas, toute classe qui hérite de l'interface CtVisitable doit implémenter la méthode void accept(CtAbstractVisitor visiteur). Nous parlerons de CtAbstractVisitor un peu plus tard. Maintenant, il suffit de savoir que CtScanner en est hérité.

Voici à quoi ressemble accepter dans CtAssignment :

@Override
public void accept(CtAbstractVisitor visitor){
    visitor.visitCtAssignment(this);
}

Copier après la connexion

Oui, c'est un jeu d'enfant. Les nœuds appellent leur méthode dans le visiteur. Dans notre CtScanner il devrait y avoir la méthode pour chaque classe du nœud de l'arbre :

@Override
public void visitCtIf(CtIf ctIf) {
    scan(ctIf.getCondition());
    scan((CtStatement) ctIf.getThenStatement());
    scan((CtStatement) ctIf.getElseStatement());
}

@Override
public <T> void visitCtLiteral(CtLiteral<T> literal) {
}
@Override
public void visitCtStatement(CtStatement statement) {
}

// ....

Copier après la connexion

Revenons maintenant à CtAbstractVisitor, une interface extraite de notre CtScanner. Il inclut des méthodes pour visiter les nœuds de l'arbre, mais uniquement ces méthodes, pas de méthodes scan. Lors de l'implémentation des méthodes de visiteur, soit nous laissons un espace réservé pour les surcharges futures (s'il s'agit du nœud terminal), soit nous continuons à déplier les nœuds de l'arbre en effectuant une descente récursive le long de celui-ci. C'est tout ce que nous devons savoir pour continuer.

Semantic parsing

Introduction

It'd seem that's all with the core. For example, now our analyzer can catch simple errors like the variable self-assignment. We call it the signature analysis. To provide more advanced analysis, we'd like to obtain more data about what's going on in the code. The parser has done its job, it's time to create new entities.

So far, we've completely followed the way of compilers regarding of the analyzer structure, but now our path diverges. If Lua had a compiler, it'd start analysis to ensure that the code is syntactically correct. Although our tools will continue to overlap in features.

For our needs, let's create SemanticResolver for two purposes:

define the variable scope;
evaluate the variable type based on the duck typing.

However, we'll call it from the parser along with its traversal. No decorator this time.

To keep the app structure seamless, we'll contain all the semantic data in the same AST nodes. First, let's define the necessary properties in the CtVariableAccess interface:

CtBlock getScope();
void setScope(CtBlock block);

TypeKind getTypeKind();
void setTypeKind(TypeKind type);

Copier après la connexion

Variable scope

Let's start with scope; this will be our key tool for defining the variable along with its name. First, we define the variable entity inside SemanticResolver. To keep it short, we'll show you only the interface, but the gist should be clear:

public static class Variable {
    public Variable(String identifier);
    public String getIdentifier();
    public CtBlock getScope();
    public void setScope(CtBlock block);
    public void setType(TypeKind type);
    public TypeKind getType();
    // Methods use only identifier
    @Override
    public boolean equals(Object o);
    @Override
    public int hashCode();
}

Copier après la connexion

Let's also further define the stack for variable scopes:

private final Stack<Pair<CtBlock, HashSet<Variable>>> stack = new Stack<>();
private final CtGlobal global;
public SemanticResolver(CtGlobal global) {
    pushStack(global);
    this.global = stack.peek();
}

public void pushStack(CtBlock block) {
    stack.push(Pair.of(block, new HashSet<>()));
}
public void popStack() {
    stack.pop();
}

Copier après la connexion

The stack entry consists of a tuple of scope and variables set registered there. The stack operation is mundane. Here's how it works in the parser:

private CtBlock getBlock(LuaParser.BlockContext ctx) {
    CtBlock block = new CtBlockImpl();
    resolver.pushStack(block);

    // ....

    resolver.popStack();
    return block;
}

Copier après la connexion

We have to register the variables somehow. If a variable is local, it's simple—let's take the current scope and pass the variable there:

public CtBlock registerLocal(Variable variable) {
    var scope = stack.pop();
    variable.setScope(scope.getLeft());
    scope.getRight().add(variable);
    stack.push(scope);
    return scope.getLeft();
}

Copier après la connexion

If the local keyword isn't used, the variable will be either global or be declared somewhere above. So, we first go through the stack and double-check if it exists:

public CtBlock registerUndefined(Variable variable) {
    var pair = lookupPair(variable);
    pair.getRight().add(variable);
    return pair.getLeft();
}

public Pair<CtBlock, HashSet<Variable>> lookupPair(Variable variable) {
    var buf = new Stack<Pair<CtBlock, HashSet<Variable>>>();
    Pair<CtBlock, HashSet<Variable>> res = null;
    while (!stack.isEmpty()) {
        var scope = stack.pop();
        buf.push(scope);
        if (scope.getRight().contains(variable)) {
            res = scope;
            break;
        }
    }

    while (!buf.isEmpty()) {
        stack.push(buf.pop());
    }

    if (res == null) {
        return global;
    }

    return res;
}

Copier après la connexion

We can set the scope to variables when we have the entry:

private CtVariableWrite getVariableWriteInternal(
        ParseTree ctx,
        boolean isLocal
) {
    var node = new CtVariableWriteImpl();
    node.setVariableName(ctx.getChild(0).toString());
    CtBlock scope;
    if (isLocal) {
        scope = resolver.registerLocal(
             new SemanticResolver.Variable(node.getVariableName()));
    } else {
        scope = resolver.registerUndefined(
            new SemanticResolver.Variable(node.getVariableName()));
    }
    node.setScope(scope);
    return node;
}

Copier après la connexion

And defining it for readings:

private CtExpression getExpression(LuaParser.ExpContext ctx) {
    // ....
    if (child instanceof LuaParser.PrefixexpContext) {
        // ....
        var scope = resolver.lookupScope(
            new SemanticResolver.Variable(variableRead.getVariableName())
        );
        variableRead.setScope(scope);

        return variableRead;
    }
    // ....
}

Copier après la connexion

We won't show the lookupScope code. It's a single-line wrapper over lookupPair, which we can see above. Here, we can end with the variable scope. We'll check the mechanism in the diagnostic rule on a separate section. Now we continue working on the semantic parsing. Now, let's move to the variable type determination.

Duck typing

How to determine the variable type? Indeed, we obtain them from the literals. Let's further define the type and the enumeration for them:

public interface CtLiteral<T> extends CtExpression, CtVisitable {
  // ....
  void setTypeKind(TypeKind kind);
  TypeKind getTypeKind();
}

Copier après la connexion

public enum TypeKind {
    Undefined,
    Number,
    String,
    Boolean,
    Nil
}

Copier après la connexion

Thus, the data type can be numeric, string, logical, and nil. However, it'll be undefined by default. The split of undefined and nil may seem far-fetched, but it's okay for the pilot.

We store the literal type only in the tree node, doing it from the parser:

private <T> CtLiteralImpl<T> createLiteral(
    // ....
    TypeKind type,
) {
    // ....

    literal.setTypeKind(type);
    return literal;
}

Copier après la connexion

However, the variable type will be both in the tree and in SemanticResolver. So, we can request it during further traversal and the AST building:

private ArrayList<CtAssignment> parseAssignments(LuaParser.StatContext ctx) {
    // ....
    for (int i = 0; i < variables.size(); i++) {
        var assignment = new CtAssignmentImpl();
        var variable = variables.get(i);

        // ....


        variable.setTypeKind(resolver.lookupType(variable.getVariableName()));
        resolver.setType(
            variable.getVariableName(),
            variable.getScope(),
            SemanticResolver.evaluateExpressionType(expression)
        );
    }
    return assignments;
}

Copier après la connexion

There is no error in the operator precedence. Let the stored variable have the type from its past assignment. It'll facilitate our work in the future. As for the methods used here, there's nothing incredible about the lookupType implementation—it's basically the same as lookupPair. There's nothing complex about setType:

public void setType(String variable, CtBlock scope, TypeKind type) {
    var opt = stack.stream()
             .filter(x -> Objects.equals(x.getLeft(), scope))
             .findFirst();
    if (opt.isPresent()) {
        var pair = opt.get();
        var newVar = new Variable(variable);
        var meta = pair.getRight()
                       .stream()
                       .filter(x -> x.equals(newVar))
                       .findFirst();
        meta.ifPresent(value -> value.setType(type));
    }
}

Copier après la connexion

However, evaluateExpressionType is trickier. Computing variable types in dynamic languages can be a bit of a hassle. Just look at the jokes about JavaScript and string concatenation. However, firstly, Lua has a separate operator '..', and secondly, we're trying to keep the process simple. So, we'll only determine whether all the operands are the same type. We'll use the familiar CtScanner.

public static TypeKind evaluateExpressionType(CtExpression expression) {
    Mutable<TypeKind> type = new MutableObject<>(null);
    var typeEvaluator = new CtScanner() {
        private boolean stop = false;
        @Override
        public void scan(CtElement el) {
            if (stop) { return; }

            if (el instanceof CtVariableRead || el instanceof CtLiteral<?>) {
                var newType = el instanceof CtVariableRead
                              ? ((CtVariableRead) el).getTypeKind()
                              : ((CtLiteral<?>) el).getTypeKind();

                if (newType.equals(TypeKind.Undefined)) {
                    type.setValue(TypeKind.Undefined);
                    stop = true;
                    return;
                } else if (type.getValue() == null) {
                    type.setValue(newType);
                } else if (!type.getValue().equals(newType)) {
                    type.setValue(TypeKind.Undefined);
                    stop = true;
                    return;
                }
            }
            super.scan(el);
        }
    };
    typeEvaluator.scan(expression);
    return type.getValue();
}

Copier après la connexion

In parseAssignments, we've set the assignment type to the (CtVariableWrite) variable but forgot about reading (CtVariableRead). Let's fix it:

private CtExpression getExpression(LuaParser.ExpContext ctx) {
    // ....
    if (child instanceof LuaParser.PrefixexpContext) {
        // ....
        variableRead.setTypeKind(
            resolver.lookupType(variableRead.getVariableName())
        );
        var scope = resolver.lookupScope(
            new SemanticResolver.Variable(variableRead.getVariableName()));
        variableRead.setScope(scope);

        return variableRead;
    }
    // ....
}

Copier après la connexion

We've completed the semantic analysis and almost ready to start searching for bugs.

Data-flow analysis

Inside structure

First, we make two quick stops. While the topic of the data-flow analysis deserves a series of articles, it'd be wrong to skip over it. Here, we won't dive deep into theory but just try to memorize the values set by literals.

First, let's fall into the sin of self-copying and define the variable entity for DataFlow again but in a simpler way. Again, we'll only show you the interface:

private static class Variable {
    private Variable(String identifier, CtBlock scope);
    // Methods use identifier and scope
    @Override
    public boolean equals(Object o);
    @Override
    public int hashCode();
}

Copier après la connexion

Here's the rest of the class content:

public class DataFlow {
    private static class Variable {
        // ....
    }
    Map<Variable, Object> variableCache = new HashMap<>();

    public void scanDataFlow(CtElement element) {
        if (element instanceof CtAssignment) {
            CtAssignment variableWrite = (CtAssignment) element;
            if (variableWrite.getAssignment() instanceof CtLiteral<?>) {
                var assigned = variableWrite.getAssigned();
                var variable = new Variable(
                    assigned.getVariableName(),
                    assigned.getScope()
                );
                variableCache.put(
                    variable, 
                    getValue(variableWrite.getAssignment())
                );
            }
        }
    }

    public Object getValue(CtExpression expression) {
        if (expression instanceof CtVariableRead) {
            CtVariableRead variableRead = (CtVariableRead) expression;
            var variable = new Variable(
                variableRead.getVariableName(),
                variableRead.getScope()
            );
            return variableCache.getOrDefault(variable, null);
        } else if (expression instanceof CtLiteral<?>) {
            return ((CtLiteral<?>) expression).getValue();
        }
        return null;
    }
}

Copier après la connexion

It's quite simple: in scanDataFlow, we associate the value with the variable, and in getValue, we extract that value for the set node. Everything is simple because we don't factor in branching, loops, or even expressions. Why don't we factor them in? Branching is the very topic that deserves its own series of articles. What about expressions? Well, we didn't make it in two days. Given what we've achieved in the last two days, we think this task is feasible. We just postpone it as a homework.

That's all. It's clear that such a solution is far from a real product, but we have laid some foundation. Then we can either try to enhance the code—and we'll introduce data-flow analysis via AST. Or we can redo everything properly and build the control-flow graph.

We've done the class implementation, but we haven't discussed how to use it. We've written the class, but what shall we do with it? Let's talk about it in the following section. Now we just say that DataFlow works right before calling the diagnostic rules. Those are called when traversing the finished AST. Thus, the rules will have access to the current variable values. This is called Environment; you can see it in your debugger.

Walker

Welcome to the last section regarding the core. We've already had the AST that is full of semantic data, as well as the data-flow analysis that is just waiting to be run. It's a good time to put it all together and set the stage for our diagnostic rules.

How is the analysis performed? It's simple: we get on the topmost tree node, and then we start recursive traversal. That is, we need something that will traverse the tree. We have CtScanner for that. Based on it, we define MunInvoker:

public class MunInvoker extends CtScanner {
    private final List<MunRule> rules = new ArrayList<>();
    private final Analyzer analyzer;

    public MunInvoker(Analyzer analyzer) {
        this.analyzer = analyzer;
        rules.add(new M6005(analyzer));
        rules.add(new M6020(analyzer));
        rules.add(new M7000(analyzer));
        rules.add(new M7001(analyzer));
    }

    @Override
    public <T extends CtElement> void scan(T element) {
        if (element != null) {
            analyzer.getDataFlow().scanDataFlow(element);
            rules.forEach(element::accept);
            super.scan(element);
        }
    }
}

Copier après la connexion

You can notice a few unknown things in the code:

the Analyzer class. It encapsulates the whole analysis process and contains shared resources that need to be accessed within the rules. In our case, this is the DataFlow instance. We'll get back to Analyzer later;
four confusing classes that are added to rules. We'll talk about them in the next section, so don't panic. A little spoiler for you :)

Otherwise, the class operation shouldn't raise any questions. Each time we enter any tree node, the analyzer rule is called, and the variable values are evaluated right before it. Next, the traversal continues in line with to the CtScanner algorithm.

Analysis

Preparing for writing diagnostic rules

Rule class

So, we have the analyzer core prototype—it's a good time to start analyzing something.

The base for our rules is ready—it's the CtAbstractVisitor class. The analysis goes as follows: the rule overloads a few visitors and analyzes the data contained in the AST nodes. Let's extend CtAbstractVisitor with the abstract MunRule class, which we use to create rules. In this class, we also define the addRule method that generates warnings.

Speaking of warnings: what data do they need? First is a warning message to demonstrate users what they may have been mistaken about. Second, the user needs to know where and what the analyzer warns. So, let's add data about the file where the analyzer has detected the troublesome code block and the location of this code fragment.

Here's what the MunRule class looks like:

public abstract class MunRule extends CtAbstractVisitor {
    private Analyzer analyzer;
    public void MunRule(Analyzer analyzer) {
        this.analyzer = analyzer;
    }
    protected Analyzer getAnalyzer() {
        return analyzer;
    }

    protected void addRule(String message, CtElement element) {
        var warning = new Warning();
        warning.message = message;

        WarningPosition pos = new WarningPosition(
            Analyzer.getFile(), 
            element.getLine(), 
            element.getColumn() + 1
        );
        warning.positions.add(pos);
        analyzer.addWarning(warning);
    }

    public DataFlow getDataFlow() {
        return analyzer.getDataFlow();
    }
}

Copier après la connexion

The WarningPosition and Warning classes are just data stores, so we won't list them. We'll talk about addWarning now.

Merging it together

The last thing to prepare is how we're going to view the diagnostic rules. To do this, we combine all our features together, using the already mentioned Analyzer class. Actually, here it is:

public class Analyzer {
    private DataFlow dataFlow = new DataFlow();

    public DataFlow getDataFlow() {
        return dataFlow;
    }

    public CtElement getAst(String pathToFile) throws IOException {
        InputStream inputStream = new FileInputStream(pathToFile);
        Lexer lexer = new LuaLexer(CharStreams.fromStream(inputStream));

        ParseTreeWalker walker = new ParseTreeWalker();
        var listener = new LuaAstParser();
        walker.walk(listener, new LuaParser(
            new CommonTokenStream(lexer)
        ).start_());
        return listener.getFile();
    }

    protected void addWarning(Warning warning) {
        Main.logger.info(
            "WARNING: " + warning.code + " "
            + warning.message + " ("
            + warning.positions.get(0).line + ", "
            + warning.positions.get(0).column + ")");
      }      

    public MunInvoker getMunInvoker() {
        return new MunInvoker(this);
    }

    public void analyze(String pathToFile) {
        try {
            var top = getAst(pathToFile);
            var invoker = getMunInvoker();
            invoker.scan(top);
        }
        catch (IOException ex) {
            Main.logger.error("IO error: " + ex.getMessage());
        }
    }
}

Copier après la connexion

To explain how it works, we'll give you an example of the whole analysis process:

In the getAst method, we build our AST using the scheme lexer —> parser —> tree translator;
Then we call MunIvoker, which traverses the tree and calls our diagnostic rules along with the data-flow analysis;
If necessary, the rules call the Analyzer class to get the DataFlow instance;
They call addWarning when the analyzer spots suspicious code fragment. That one just outputs the message to the log.

We've got the prep work—it's time to start writing diagnostic rules.

Writing diagnostic rules

Assigning a variable to itself

We've decided to start writing the rules with a simple one: PVS-Studio has the Java diagnostic rule, V6005, where the analyzer check if a variable is assigned to itself. It can simply be copied and slightly adapted to our tree. Since our analyzer is called Mun, we start the numbers of our diagnostic rules with M. Let's create the M6005 class, extending MunRule, and override the visitCtAssignment method in it. The following check will be located in the method:

public class M6005 extends MunRule {
    private void addRule(CtVariableAccess variable) {
        addRule("The variable is assigned to itself.", variable);
    }
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        if (RulesUtils.equals(assignment.getAssigned(), 
            assignment.getAssignment())) {
            addRule(assignment.getAssigned());
        }
    }
}

Copier après la connexion

The used RulesUtils.equals method is a wrapper and overload for another equals method that checks the name and scope:

public static boolean equals(CtVariableAccess left, CtVariableAccess right) {
    return left.getVariableName().equals(right.getVariableName())
           && left.getScope().equals(right.getScope());
}

Copier après la connexion

We need to check the scope, because the next code fragment isn't an assignment of the variable to itself:

local a = 5;
begin
    local a = a;
end

Copier après la connexion

Now, we can test the diagnostic rule on some simple code and see if it works. The following example will issue the warning on the marked line: "M6005 The variable is assigned to itself":

local a = 5;
local b = 3;

if (b > a) then
    a = a;      <=
end

Copier après la connexion

Zero division

Well, we've warmed up, so let's move on. The analyzer has already had the primitive data-flow analysis (DataFlow) that we can and should use. Again, let's look at one of our existing diagnostic rules, V6020, where the analyzer checks for the zero division. We try to adapt it for our analyzer. The divisor can be either the variable as zero or a literal as zero. We need to access the cache of variables to check their value.

Here's the simple implementation of such a diagnostic rule:

public class M6020 extends MunRule {
  private void addWarning(CtElement expression, String opText) {
    addRule(String.format(
            "%s by zero. Denominator '%s' == 0.",
            opText, expression instanceof CtLiteral
                ? ((CtLiteral) expression).getValue()
                : ((CtVariableRead) expression).getVariableName()
        ),
        expression
    );
  }

  @Override
  public void visitCtBinaryOperator(CtBinaryOperator operator) {
    BinaryOperatorKind opKind = operator.getKind();
    if (opKind != BinaryOperatorKind.DIV && opKind != BinaryOperatorKind.MOD) {
      return;
    }

    apply(operator.getRightHandOperand(), opKind == BinaryOperatorKind.MOD);
  }


  private void apply(CtExpression expr, boolean isMod) {
    Object variable = getDataFlow().getValue(expr);
    if (variable instanceof Integer) {
      if ((Integer) variable == 0) {
        String opText = isMod ? "Mod" : "Divide";
        addWarning(expr, opText);
      }
    }
  }
}

Copier après la connexion

We can see that the diagnostic rule works on simple examples and issues the following warning: "M6020 Divide by zero. Denominator 'b' == 0" on these lines:

local a = 5;
local b = 0;
local c = a / b;   <=
local d = a / 0;   <=

Copier après la connexion

If you have made the expression evaluator as your homework, you may try the diagnostic rule on this code:

local b = 7;
local b = b - 7;
local c = a / b;

Copier après la connexion

Overwritten types

Since we're writing the Lua analyzer, we need to write diagnostic rules for this language. Let's start simple.

Lua is a dynamic scripting language. Let's use this feature and write the diagnostic rule that allows us to catch overwritten types.

We'll also need to pick a new number for the diagnostic rule. Earlier we just copied them from the Java analyzer. Now, it seems like it's time to start a new thousand—the seventh. Who knows what diagnostic rule it'll be in PVS-Studio, but at the time of writing this article, it will be Lua.

Knowing about types, we facilitate our case: we need to check that the left and right assignments are of different types. Note that we ignore Undefined for the left part and Nil for the right part. The code looks like this:

public class M7000 extends MunRule {
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        var assigned = assignment.getAssigned();
        var exprType = SemanticResolver.evaluateExpressionType(
            assignment.getAssignment());
        if (assigned.getTypeKind().equals(TypeKind.Undefined)
            || exprType.equals(TypeKind.Nil)
        ) {
            return;
        }

        if (!assigned.getTypeKind().equals(exprType)) {
            addRule(
                String.format(
                    "Type of the variable %s is overridden from %s to %s.",
                    assigned.getTypeKind().toString(),
                    exprType.toString()
                assigned,
            );
        }
    }
}

Copier après la connexion

It's time to check the diagnostic rule in real cases. In the following example, our analyzer issues the warning only on the last line:

local a = "string";

if (true) then
  local a = 5;
end

a = 5;     <=

Copier après la connexion

The analyzer warning: "M7000 Type of the variable a is overridden from Integer to String".

Lost local

Let's finish smoothly. The Lua plugin for VS Code has a diagnostic rule that detects global lowercase variables. This check can help detect the forgotten local identifiers. Let's implement the same diagnostic rule in our analyzer.

Here, just as before, we'll need to use the variable scope data that is obtained via the semantic analysis. We just find where the variable was declared. That's also where the variable is assigned a value. Then check its scope and name. If the variable is global and starts with a lowercase, the analyzer will warn. Easy-peasy.

Let's create a new class and override the visitCtAssignment method in it again. That way, we can look for the problematic global variables:

public class M7001 extends MunRule {
    @Override
    public void visitCtAssignment(CtAssignment assignment) {
        var variable = assignment.getAssigned();
        var firstLetter = variable.getVariableName().substring(0, 1);

        if (variable.getScope() instanceof CtGlobal && 
            !firstLetter.equals(firstLetter.toUpperCase())) {
            addRule("Global variable in lowercase initial.", variable);
        }
    }
}

Copier après la connexion

Let's check the diagnostic rule work. It issues the warning: "M7001 Global variable in lowercase initial." It's issued on the second line in the code snippet below:

function sum_numbers(b, c)
    a = b + c;  <=
    return a;
end

local a = sum_numbers(10, 5);

Copier après la connexion

Alright, we've written the diagnostic rules, we've done with the analyzer (and it even works). Breathe out. Now we can enjoy the fruits of our labors.

Viewing warnings

Above we've already shown the code that allows us to view warnings in a console or file. Here is how the process of working with the analyzer looks like in the console. We run the analyzer using the command:

java -jar mun-analyzer.jar -input "C:\munproject\test.lua"

And we get:

How to develop code analyzer in hours

However, first, the analyzer is a tool, and it should be user-friendly. Instead of working with the console, it'd be much more convenient to work with a plugin.

Il est temps d'emprunter à nouveau. PVS-Studio dispose déjà de plugins pour de nombreux IDE différents, par exemple pour Visual Studio Code et IntelliJ IDEA. Vous pouvez afficher les avertissements et les parcourir. Étant donné que les rapports de l'analyseur ont des formats standardisés, nous pouvons simplement emprunter l'algorithme de génération de rapports JSON à notre analyseur Java. L'algorithme est vaste et ennuyeux, nous ne le montrerons donc pas. Il faut quand même l'exécuter depuis la ligne de commande mais avec l'argument* -output "D:git/MunProject/report.json"*.

Ensuite, nous pouvons ouvrir le rapport dans IntelliJ IDEA ou VS Code et consulter les avertissements de l'analyseur Lua :

How to develop code analyzer in hours

Doux ! Nous pouvons désormais utiliser l'analyseur aux fins prévues sans sacrifier la convivialité.

Annonce Astra

Alors, avons-nous écrit l'analyseur à part entière ? Euh, pas exactement. À la fin de ce long voyage, nous avons le plus vrai pilote, qui traverse toutes les étapes. Cependant, les possibilités de croissance sont énormes :

Améliorations principales :
- améliorer l'analyse des flux de données ;
- considérez les flux de contrôle ;
- ajouter une analyse interprocédurale et intermodulaire. Vous pouvez lire comment nous l'avons fait en C++ ici ;
- ajouter le mécanisme d'annotation pour aider à améliorer l'analyse du flux de données et le typage du canard ;
- fournir plus de données sémantiques ;
- affiner les mécanismes existants ;
- et n'oubliez pas un meilleur analyseur.
Améliorations des règles de diagnostic :
- valoriser ceux existants;
- écrivez-en plus de nouveaux ;
- écrivez des règles de diagnostic plus complexes au-delà des quelques dizaines de lignes.
Améliorations de la convivialité de l'analyseur :
- créer un support de plugin approprié ;
- intégrer dans CI/CD.
Tests unitaires et tests de régression pour vérifier les performances des règles de diagnostic dans leur développement et leur modification.

Et bien plus encore. En d’autres termes, le chemin qui mène du pilote à l’outil à part entière est assez épineux. Ainsi, PVS-Studio se concentre sur les directions existantes : C#, C, C++ et Java au lieu de nouvelles. Si vous écrivez dans l'une de ces langues, vous pouvez essayer notre analyseur.

Épilogue

L'article s'est avéré beaucoup plus long que prévu, alors n'hésitez pas à laisser un commentaire si vous êtes arrivé à la fin :) Nous serions ravis de recevoir vos commentaires.

Si le sujet vous intéresse, vous pouvez lire sur le développement d'analyseurs pour nos langages :