Improved Source Differencing

One typical application of works on source-code evolution is the differencing between two programs. A typical application is the user interface of version control systems, in which two versions of the same file can be compared . The evolved parts are visualized in a way that makes it easy for developers to understand the changes. For us, having a working differencing solution in C RET is also a step towards providing a snippet completion recommender, as proposals also need to be integrated into existing source code, which is a special case of differencing.

Originally, source-code differencing was text-based, but more recent differencing algorithms are based on tree structures (e.g., general data, AST). We were interested in an analysis of how well SSTs are suited as the basis for works on source-code differencing. To this end, Markus Zimmermann has answered two research questions in his Bachelor thesis.

    • Can source-code be differenced on the basis of SSTs?
    • Is it possible to use domain-knowledge to improve the visualize of semantic changes (e.g., wrapping statement in conditional; moving a statement)?

In addition, we were also interested in getting first hand experience in applying SSTs to real problems to learn about possible limitations of the representation.

Improved Edit Script We extended the existing ChangeDistiller algorithm, that extracts an edit script of two programs. The script contains all required changes to get from the first program to the second, using only insert, delete, change, or move operations. The algorithm to create this script consists of two steps. First, all nodes of the two programs are being matched to find unchanged parts, then the required operations are identified that reflect the change. We have introduced several heuristics to improve the matching of nodes through applying domain-specific heuristics, e.g., to recognize renaming of methods through detecting equal method bodies. In addition, we introduce some transformations to the resulting edit script, e.g., better recognition of moves instead of multiple insert/delete combinations, to make it easier for developers to understand the semantics of the difference.

Part of the work was the implementation of an actual diff viewer that visualizes the edit script, a screenshot of the resulting tool is shown in the following. The main goal was to create a proof-of-concept system available for an empirical study, but we also tested several visualization alternatives for move and change operations.

Results The work has shown that SSTs are indeed a solid foundation for source-code differencing. It is possible to create an edit script directly on the representation without an additional intermediate state. Integrating the technique into assistance tools seems to be possible. Through our own experience when working with SSTs, we learned that there are some properties that could be improved in future work.

Big Nodes SSTs simplify the AST by favoring big nodes. For example, in contrast to regular ASTs, in which a method declaration would typically have separate nodes for return type, method name, and its signature, SSTs aggregates this information in a single node and stores the information using our naming scheme. While storing this information in a single string make it convenient to store and to work with when traversing the AST, however, it makes it harder to match nodes of two trees meaning- fully. Traditional differencing approaches work better with smaller and simpler nodes and breaking down several big SST nodes would simplify the matching.

Missing Blocks Another simplification that is introduced by SSTs is that nested blocks are not modeled as SST nodes, but as list properties of the containing node. It turned out that this makes it harder for the algorithm to differentiate child nodes, if multiple nested blocks exist (e.g. an if statement has then and else blocks).

Normalization The SST transformation normalize the source code by removing in- valid or incomplete parts and by factoring out nested expression statements to new artificial variables. Our results show that the normalization makes it harder to match nodes. Both the existence of several unknown expressions and the stable names for the artificial variables affect the matching and, as a result, make the creation of a comprehensible visualization harder.

Overall, the results are very promising. Using SSTs for source-code differencing works well, and being able to directly use SSTs in algorithms for source-code differencing is a first step towards code-snippet completion. Future work should try to use edit scripts for integrating code-snippet proposals into existing source code.

However, we also identified chances for further improvement in future work. The matching would get easier if the differencing step either reverts the normalization or if an unnormalized SST would be used in the first place. Future work could try to separate transformation and normalization of SSTs to preserve both the benefits for the static analysis gained through the normalization and the benefits for source-code differencing gained from a representation that is closer to the actual sources.