Today I drove to Cambridge (it’s just 1h each way) to attend a talk given by Daniel Ratiu, who is visiting Microsoft Research this week. He presented a summary of his PhD work, which he will defend in precisely one month. Daniel’s main message was that program analyses can benefit from the use of some domain knowledge, instead of being mostly based on the syntax and structure of the code (e.g. call graphs).
His approach consists of first extracting a graph of the program’s identifiers and how they are related (e.g. X is-a Y if class X extends class Y). Given a similar graph representing the domain’s ontology, he then uses the identifiers and the graph’s structure to try to match the program’s concepts to the domain’s concepts. This then allows one to see the conceptual coverage of a program, its conceptual redundancy, the logical cohesion of an architecture, etc.
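To make the matching step a bit more concrete, here is a minimal sketch in Python. It is my own simplification, not Daniel's implementation: it matches only on the words inside identifiers and ignores the graph structure that his approach also uses. All names (`split_identifier`, `match_concepts`, `coverage`) and the toy data are made up for illustration.

```python
import re


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def match_concepts(identifiers, domain_concepts):
    """Map each identifier to the domain concepts named by one of its words."""
    matches = {}
    for ident in identifiers:
        hits = set(split_identifier(ident)) & domain_concepts
        if hits:
            matches[ident] = hits
    return matches


def coverage(matches, domain_concepts):
    """Fraction of domain concepts named by at least one identifier."""
    covered = set().union(*matches.values())
    return len(covered) / len(domain_concepts)


# Hypothetical program graph: (X, "is-a", Y) when class X extends class Y.
program_triples = {
    ("MenuItem", "is-a", "Widget"),
    ("PopupMenu", "is-a", "Menu"),
}
identifiers = {node for s, _, t in program_triples for node in (s, t)}

# Hypothetical domain concepts (the node names of a domain ontology).
domain = {"menu", "button", "table", "widget"}

matches = match_concepts(identifiers, domain)
print(sorted(matches))            # identifiers that name a domain concept
print(coverage(matches, domain))  # conceptual coverage of this toy program
```

In this toy example the program mentions menus and widgets but not buttons or tables, so it covers half of the domain's concepts.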
However, there are usually no domain ontologies available. So, he and his colleagues used the approach to extract the concept graph from several APIs for the same domain (e.g. GUIs) and then did pairwise intersections: any concept belonging to at least 2 APIs would be added to the ontology. In this way they obtained e.g. an ontology of GUI artefacts (menus, buttons, etc.) and could see which implementations don’t cover important artefacts (e.g. AWT doesn’t provide tables). The approach of course relies on the program using good identifiers (not just cryptic abbreviations) and on those identifiers reflecting the domain’s vocabulary.
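The "at least 2 APIs" rule can be sketched in a few lines of Python. This is only an illustration under my own assumptions: concepts are reduced to plain strings, the function names are hypothetical, and the concept sets are invented toy data, apart from AWT lacking tables, which is the example from the talk.

```python
from collections import Counter


def shared_concepts(api_concepts, min_apis=2):
    """Return the concepts that occur in at least `min_apis` of the APIs."""
    counts = Counter()
    for concepts in api_concepts.values():
        counts.update(set(concepts))  # count each concept once per API
    return {c for c, n in counts.items() if n >= min_apis}


def missing_concepts(api_concepts, api, ontology):
    """Return the ontology concepts that the given API does not cover."""
    return ontology - set(api_concepts[api])


# Toy data: "toolkit_b" and "toolkit_c" are placeholders, not real libraries.
apis = {
    "awt": {"menu", "button", "window"},
    "toolkit_b": {"menu", "button", "table", "window"},
    "toolkit_c": {"button", "table", "tree"},
}

ontology = shared_concepts(apis)
print(sorted(ontology))
print(missing_concepts(apis, "awt", ontology))
```

Counting occurrences across all APIs gives the same result as taking the union of all pairwise intersections, and the threshold is then easy to raise if two APIs turn out to be too weak a criterion.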
Another example he showed was a visualisation of the packages or classes of a program that refer to a certain concept. This can expose non-optimal modularizations of code w.r.t. the domain’s concepts.
Daniel also gave me a demo of his tool, available as an Eclipse plugin. I especially liked the fact that the extraction of the concept graph and the matching of graphs can be parametrized in a simple configuration file. For example, one can choose WordNet to make identifiers uniform (e.g. children is mapped to child), although WordNet has several shortcomings for the technical domains found in most software, e.g. for concepts related to GUIs. I also liked that Daniel made the concept graphs he extracted publicly available in two easy-to-process formats, namely OWL and a text file with one triple (node-relation-node) per arc.
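To show why the one-triple-per-line format is easy to process, here is a small Python sketch that reads such a file into an adjacency structure. I am assuming whitespace-separated fields and node names without spaces; the actual separator in Daniel's files may differ, and the sample triples (including the "has" relation) are invented.

```python
from collections import defaultdict


def parse_triples(text):
    """Parse lines of the assumed form 'node relation node' into tuples."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        source, relation, target = line.split()
        triples.append((source, relation, target))
    return triples


def adjacency(triples):
    """Group outgoing arcs by source node."""
    graph = defaultdict(list)
    for source, relation, target in triples:
        graph[source].append((relation, target))
    return dict(graph)


# Invented sample content, not taken from the published graphs.
sample = """
Button is-a Widget
Menu is-a Widget
Menu has MenuItem
"""

graph = adjacency(parse_triples(sample))
print(graph["Menu"])
```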
Overall, quite a nice talk and body of work; a pity it was very poorly attended.
As an added bonus of taking the time to go to Cambridge, the Microsoft Research reception hall has some goodies for visitors: copies of the DVD of the Royal Institution’s Christmas 2008 lecture and of the 2020 Science report. I’m already looking forward to watching and reading them.