Four Ways of Creating Domain Specific Languages

Language is a process of free creation; its laws and principles are fixed, but the manner in which the principles of generation are used is free and infinitely varied. Even the interpretation and use of words involves a process of free creation. ~ Noam Chomsky

The creation of a new domain specific modeling (DSM) language has never been an easy task. A primary reason is that there is no single one-size-fits-all recipe that tells us how to define DSM languages in a generic way. For the same reason, the definition of the language can also be the most interesting and exciting part of the language “creation” process. One of the most difficult and critical aspects of language design is to capture and define the constituent elements of a DSM language. The goal here is the identification of:
  • the language concepts (otherwise known as language constructs),
  • the relationships and constraints among the concepts, and
  • the dynamics (otherwise known as execution behavior) of the language.
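To make these three elements a bit more tangible, here is a minimal Python sketch of a toy state-machine language. Everything in it (State, Transition, Machine, the lamp example) is a hypothetical illustration and is not taken from [1]: the dataclasses stand for the concepts and their relationships, check() stands for the constraints, and run() stands for the dynamics.

```python
from dataclasses import dataclass, field
from typing import List


# Concepts of a toy state-machine language (hypothetical names).
@dataclass(frozen=True)
class State:
    name: str


# Relationship between concepts: a transition links two states via an event.
@dataclass(frozen=True)
class Transition:
    source: State
    event: str
    target: State


@dataclass
class Machine:
    initial: State
    states: List[State] = field(default_factory=list)
    transitions: List[Transition] = field(default_factory=list)

    def check(self) -> List[str]:
        """Constraints: return the violations found (empty means well-formed)."""
        errors = []
        known = set(self.states)
        if self.initial not in known:
            errors.append(f"initial state {self.initial.name!r} is not declared")
        seen = set()
        for t in self.transitions:
            if t.source not in known or t.target not in known:
                errors.append(f"transition on {t.event!r} uses an undeclared state")
            if (t.source, t.event) in seen:
                errors.append(f"state {t.source.name!r} has two transitions on {t.event!r}")
            seen.add((t.source, t.event))
        return errors

    def run(self, events: List[str]) -> State:
        """Dynamics: fire matching transitions; events without one are ignored."""
        current = self.initial
        for event in events:
            for t in self.transitions:
                if t.source == current and t.event == event:
                    current = t.target
                    break
        return current


# A model written in the toy language: a lamp toggled by a button.
off, on = State("off"), State("on")
lamp = Machine(
    initial=off,
    states=[off, on],
    transitions=[Transition(off, "press", on), Transition(on, "press", off)],
)
```

In a real DSM language the dynamics are usually realized by code generators or an interpreter on the target platform; the run() method above only stands in for that part.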
There are various ways to define the concepts of a DSM language, and usually there are also various means that can help (and/or drive) the definition. Considering the means, you can use, for example:
  • knowledge and expertise of domain experts and designers,
  • existing libraries and APIs capturing the domain already,
  • documentation,
  • technology roadmap,
  • and if it exists, the product family engineering process of an organization.
Naturally, having more input sources in the language definition process does not necessarily simplify the definition of the language. It can actually make it harder to establish the concepts of the domain because of inconsistencies between the various existing artifacts (e.g. source code vs. documentation).

Considering the possible ways, Juha-Pekka Tolvanen and Steven Kelly have identified in [1] four general types of approaches (based on more than 20 cases of DSM definition):

  1. Domain Expert’s Concept
  2. Generation Output
  3. Look and Feel of System Built
  4. Variability Space
We have applied some of these approaches in the various projects that I have been involved with. In the following sections I will revisit these approaches and share some experience with their application.

1. Domain Expert’s or Developer’s Concepts (Top-Down Approach)

This style of language definition is based on directly identifying the concepts applied by the domain experts and the developers who are supposed to create the models in the designed DSM language. The definition of the language is typically an interactive and iterative process carried out together with the domain experts. In this process, the use of existing notations such as UML or BPMN can help in establishing the language; however, not all domain experts may be familiar with these notations, and discussions can easily lose focus. This problem can often be overcome by the use of simple graph notations (nodes, arrows, labels, colors, etc.). Alternatively, sketching the language in a simple textual format often works out relatively well too. With this style of language definition, the resulting languages tend to be vertical DSLs (narrower by nature and pertaining to a certain type of industry, such as IP telephony, home automation, insurance products, etc.) rather than horizontal DSLs (technical and broad in nature).

The most important aspect of this approach is that it takes the concepts directly from the domain experts instead of, for example, implementation artifacts such as source code that implements a domain model. In this way, the DSM language is established first and is mapped to one or more target execution platforms (using general-purpose programming languages) later. Hence, there is a better chance of achieving a higher level of abstraction than with bottom-up approaches. Since it is a top-down approach, there is also a better chance of achieving 100% code generation.
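The idea of "language first, target platforms later" can be sketched in a few lines of Python. The Rule concept, the home-automation example, and the read_*/sensors/actuators names in the generated code are all hypothetical; the point is only that one model, expressed purely in domain terms, can be mapped to several targets by separate generators.

```python
from dataclasses import dataclass
from typing import List


# Hypothetical domain concept, named the way a domain expert might phrase it:
# "when <sensor> exceeds <threshold>, do <action>".
@dataclass(frozen=True)
class Rule:
    sensor: str
    threshold: int
    action: str


def to_c(rules: List[Rule]) -> str:
    """Map the model to a (hypothetical) C runtime API."""
    return "\n".join(
        f"if (read_{r.sensor}() > {r.threshold}) {{ {r.action}(); }}" for r in rules
    )


def to_java(rules: List[Rule]) -> str:
    """Map the same model to a (hypothetical) Java runtime API."""
    return "\n".join(
        f'if (sensors.read("{r.sensor}") > {r.threshold}) {{ actuators.run("{r.action}"); }}'
        for r in rules
    )


# The model itself contains no platform detail at all.
model = [Rule("temperature", 25, "start_fan")]
c_code = to_c(model)
java_code = to_java(model)
```

Because the model carries no platform detail, adding a new target only means adding another generator function; the models themselves stay untouched.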

When one can apply this definition technique, it usually means that the domain has already been discovered and is relatively well established, especially if several domain experts use a common domain vocabulary. In other words, a language designer who can apply this style of language definition has a head start over those who still need to establish the vocabulary of the domain first.

Depending on the project context, it can also happen that the domain experts do not agree entirely on the definition of each domain concept. Alternatively, they may not agree on the necessity of certain concepts proposed by other experts. One way to resolve these disagreements is to figure out which modeling problems can actually be solved by the proposed concept. It can also help to check how much modeling effort each of the concept alternatives would require. Both techniques should help in determining the value of the modeling concepts in question.

2. Generation Output (Bottom-Up Approach)

Another broadly applied style of language definition is based on identifying the concepts of the language indirectly by “extracting” them from existing source code (mostly written in a general-purpose programming language such as Java, C or C++). This style is typically applied when there is a large amount of legacy code that consists of recurring idioms and patterns expressing certain domain concepts, or concepts from a higher level of abstraction, and the automation of these idioms and patterns is required in future products. Another typical application of this approach is when the DSL is quickly “prototyped” by coding in a general-purpose programming language first.

The advantage of this language definition style is that the patterns and idioms applied in the source code already show, through concrete examples, how the domain concepts are actually used. In other words, the source code already contains model instances that “only” need to be extracted by the language engineers. If the idioms and patterns are well modularized in the given general-purpose language (GPL), it is relatively easy to extract the domain concepts.
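As a rough illustration of such an extraction, the Python sketch below pulls model instances out of a made-up C fragment. The register_handler idiom, the event names, and the handler names are invented for this example, and the regular expression is a deliberate simplification: on real legacy code a proper parser would normally be needed.

```python
import re

# Invented legacy fragment: the recurring idiom binds an event to a handler.
LEGACY_C = """
void init(void) {
    register_handler(EVT_CALL_SETUP, on_call_setup);
    log_init();
    register_handler(EVT_CALL_RELEASE, on_call_release);
}
"""

# The idiom as a pattern; the two groups correspond to the two concepts
# (event, handler) that the idiom expresses.
IDIOM = re.compile(r"register_handler\(\s*(\w+)\s*,\s*(\w+)\s*\)")


def extract_bindings(source):
    """Return one model instance (a dict) per occurrence of the idiom."""
    return [{"event": event, "handler": handler} for event, handler in IDIOM.findall(source)]


bindings = extract_bindings(LEGACY_C)
```

Each extracted dictionary is effectively an instance of a candidate language concept ("event binding"); unrelated calls such as log_init() are simply not matched.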

It can also happen, especially in large-scale legacy systems, that the idioms and patterns are not applied consistently, or that the patterns expressing a certain domain concept are intertwined with other technological concerns (e.g. inter-process communication, profiling, error handling, etc.). In other words, less structured and/or more complex code makes it harder to isolate the domain concepts of the language. For the same reason, 100% code generation is not always possible when this style of language definition is applied, which also makes the verification of the models more difficult later.

The bottom-up nature of the process may seriously jeopardize reaching an abstraction level for the language that provides a sufficient return on investment. Furthermore, dedicated domain libraries and APIs can help to isolate the domain concepts further; however, the question arises: is it worth creating an explicit (external) DSL over the existing API? Alternatively, realizing the domain as an internal DSL may be sufficient, especially if the target language supports the development of internal DSLs (e.g. Python). Fortunately, the choice of implementation technique only has to be made in the language realization phase.
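To give an impression of what an internal DSL over an existing API can look like in Python, here is a small sketch that uses method chaining. The call-routing domain, the Route class, and the dispatch function are hypothetical and serve only as an illustration; they do not come from an actual project or library.

```python
class Route:
    """One routing rule, built up through chained calls (a fluent interface)."""

    def __init__(self, name):
        self.name = name
        self.conditions = {}
        self.target = None

    def when(self, **conditions):
        # Keyword arguments read like domain statements: when(caller="alice").
        self.conditions.update(conditions)
        return self

    def forward_to(self, target):
        self.target = target
        return self

    def matches(self, call):
        return all(call.get(key) == value for key, value in self.conditions.items())


def dispatch(routes, call, default="voicemail"):
    """Return the target of the first matching route, or the default."""
    for route in routes:
        if route.matches(call):
            return route.target
    return default


# The "model" is plain Python, yet it reads close to the domain vocabulary.
routes = [
    Route("vip").when(caller="alice").forward_to("mobile"),
    Route("office hours").when(period="day").forward_to("desk"),
]
```

No parser, editor, or generator is needed here, which is the main appeal of the internal route; the price is that the notation stays bound to the syntax of the host language.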

All in all, I believe that this style of language definition can lead to DSLs that offer a relatively low to intermediate level of abstraction in the design process and help in the separation and automated construction of technological concerns during the implementation process. However, 100% code generation is not easy to achieve (and often not worth achieving), and the resulting languages tend to be horizontal DSLs rather than vertical ones.

3. Look and Feel of the System Built

The third style of language definition is applicable to “products whose design can be understood by seeing, touching or by hearing”. In this category of language definition, the end-user product concepts act as the modeling constructs / abstractions of the DSM language under construction. The authors of [1] give an example of a language that can be used to develop UI applications for Series 60 and Symbian-based smartphones (based on the types of widgets available on these platforms).

To tell the truth, this is a type of language definition that I have not encountered in my work, since I am involved with completely different types of systems. So if you are looking for more experience with this style of language definition, I recommend looking into [1].

4. Variability Space

The last style of language definition is based on expressing variability (and commonality) of a product line. By using this approach, the concepts of the resulting DSL can capture the complete variability space of the product line - that is, the product assets and their possible composition. Since the DSM language focuses on expressing variations, the resulting models strongly resemble configurations of products. DSLs defined by this style are strongly declarative (DSLs are declarative in general): models typically describe what problem they solve instead of showing how they solve it. The language concepts are typically identified by using a systematic commonality/variability analysis - for example, see SCV analysis presented in [2]. Finally, the uniqueness of this language definition is that the DSL is typically shaped not only by an existing product line (i.e. existing artifacts) but also by the vision and anticipation of product experts and developers predicting future, potential product variations.
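The configuration-like, declarative character of such models can be sketched as follows. The thermostat product line, its variation points, and the single cross-cutting constraint are made up for this illustration; a real variability space would be the result of a commonality/variability analysis such as the one in [2].

```python
# Hypothetical variability space: each variation point with its allowed variants.
VARIATION_POINTS = {
    "display": {"none", "lcd", "touch"},
    "connectivity": {"none", "wifi", "zigbee"},
    "power": {"battery", "mains"},
}

# Cross-cutting constraints between variation points: (description, predicate).
CONSTRAINTS = [
    (
        "touch display requires mains power",
        lambda c: c["display"] != "touch" or c["power"] == "mains",
    ),
]


def validate(config):
    """Return the reasons why a product configuration falls outside the space."""
    errors = []
    for point, allowed in VARIATION_POINTS.items():
        if point not in config:
            errors.append(f"missing variation point: {point}")
        elif config[point] not in allowed:
            errors.append(f"{point}: {config[point]!r} is not a valid variant")
    # Constraints are only meaningful once every variation point is bound.
    if not errors:
        for description, holds in CONSTRAINTS:
            if not holds(config):
                errors.append(f"violated: {description}")
    return errors


# A model is just a declaration of what the product is, not how it is built.
basic_model = {"display": "lcd", "connectivity": "wifi", "power": "battery"}
```

Anticipated future variants would show up here as additional entries in VARIATION_POINTS, which is one way to read the remark that the language is also shaped by the vision of the product experts.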

This style of definition typically gives a very high return on investment (ROI) for two reasons: 1) the resulting DSL is defined at a very high abstraction level - the concepts of the language directly express product assets, and 2) the anticipation of future variations can lead to a combination of pre-aligned meta-models and code generators that are easy to maintain and extend. In addition, the declarative nature of the DSL and its close coupling with product assets make the models easy to communicate and share with different stakeholders.

A high ROI may sound very attractive; however, this style of language definition is the most difficult one to carry out. Realization of the language can be especially challenging if the platform architecture is not expressive enough to support product variations, or not flexible enough to accommodate future product variants. For this reason, an incremental development strategy is often applied together with a reactive product line engineering approach: the most common concepts of the product family are implemented first to maximize the ROI already in the first versions of the language.

Another challenge with this approach is to come up with the dynamics (i.e. runtime behavior) of the language. This is because the language definition first focuses on modeling the product variations, and the execution of the product is often addressed only in a later / separate stage (consider, for example, the execution architecture, code generation, etc.). One way to handle this problem is to simply start “coding” how the model is supposed to run (without considering the identified product variations) and to refactor the code / architecture later based on the previously identified product variation points.

Naturally, I can imagine that there are other approaches out there besides the four described here. If you know of another language definition style, please let me know; I am really interested in hearing about it.

[1] Tolvanen, J.-P.; Kelly, S.: Defining Domain-Specific Modeling Languages to Automate Product Derivation: Collected Experiences. Proceedings of the 9th International Software Product Line Conference, H. Obbink and K. Pohl (Eds.) Springer-Verlag, LNCS 3714, pp. 198 – 209, 2005.

[2] Coplien, J.; Hoffman, D.; Weiss, D.: Commonality and Variability in Software Engineering. IEEE Software 15, 6 (November/December 1998), pp. 37-45.


Niels said...

Hi Istvan,

really interesting article. Provides a nice overview of the techniques used to create DSL's. Bookmarked it!

Small comment: often these techniques are used to create the various DSL's at different abstraction levels. For example, bottom-up for the DSL's at the lower level of abstraction, top-down for the DSL's that reach into the product domain.

So I guess it's not always a matter of simply choosing between the different techniques, but more about picking the right technique based on the DSL that needs to be created.

What is your opinion on this?


István Nagy said...

Hi Niels,

regarding your question on the choice, you are right.

Picking the proper (or applicable) technique depends on the goals to be achieved. In addition, the choice can be further restricted by constraints imposed by the project. So sometimes, we may not have a choice...

Thanks for your feedback!