Converting CIM MOF files to Java files

Brief description of translatecim

The Common Information Model (CIM) is a data representation defined by the Distributed Management Task Force (DMTF). The model is defined by a set of MOF files distributed by the DMTF. MOF is a language also defined by the DMTF. TranslateCIM is a program that translates the MOF files into Java source files. TranslateCIM is written using ANTLR, a parser generator system. TranslateCIM was written to allow the author to write a network management application that is compatible with the CIM standard.

CIM defines about 1200 standard objects for use in applications. Trouble is, it's "language agnostic", meaning the objects aren't defined in a real language. CIM is specified in MOF (Managed Object Format), a pseudo-language. There are other programs that translate CIM into Java, but they are old, incomplete, proprietary, nonportable, or otherwise sucky. TranslateCIM is free and portable. It parses the latest version of CIM as of December 2008. It produces legal Java output, though work remains to be done to make the output files reflect all the details of CIM. TranslateCIM uses ANTLR 3, a modern parser generator system, so it would be possible to write additional back ends to produce languages other than Java.

Introduction

The Distributed Management Task Force (DMTF) describes the Common Information Model (CIM) in the CIM Infrastructure Specification. The specification describes CIM concepts and a language called Managed Object Format (MOF), which is *not* the Meta Object Facility defined by the OMG. MOF is used to define the classes in the CIM. There is one MOF file for each class in the CIM. There are about 1200 MOF files organized in a directory tree. This web page describes a program named TranslateCIM that translates CIM MOF files into corresponding Java files suitable for use in applications.

In the CIM MOF distribution directory, a special master MOF file contains "include" statements that include all the other files. The master file starts by including 2 files that define qualifiers used in all the other files. The remaining include statements include the rest of the MOF files, each of which defines a single CIM class. The include statements are in order such that superclasses are defined before subclasses, as specified in the CIM Infrastructure Specification, section "4.5.2 Subclasses", lines 956-957.

Once the output files have been created, they must be compiled in the same order that they appear in the include file, to ensure that classes are defined before they are referenced. To make this easy, the translator program outputs a shell script that contains "javac" commands that compile the files in the proper order.
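As a rough illustration of the idea, here is a minimal sketch of how such a script could be built from the list of generated files, kept in include order. The class and method names are made up for this example and are not taken from TranslateCIM, and a real script would probably also need classpath and output-directory options on each "javac" command.

```java
import java.util.List;

// Hypothetical helper, not TranslateCIM's actual code.
class CompileScriptSketch {
    /** Builds a shell script with one javac command per file, preserving the given order. */
    static String buildScript(List<String> javaFilesInIncludeOrder) {
        StringBuilder script = new StringBuilder("#!/bin/sh\n");
        for (String file : javaFilesInIncludeOrder) {
            // Superclasses come first in the list, so they are compiled first.
            script.append("javac ").append(file).append('\n');
        }
        return script.toString();
    }
}
```

Because the list is never sorted or deduplicated, the order of the master MOF file's include statements carries straight through to the script.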

The translator program is generated by ANTLR, a translator generator system designed for jobs like this. The stuff that I used to learn ANTLR is in ~/antrlplay, including Pedro Assis's grammar for ANTLR 2. My grammar for ANTLR 3 is in ~/TranslateCIM.

The translator opens the master MOF file and processes all the include statements. Whenever it encounters the end of a class definition, it writes a Java file. The translator creates an output directory tree that follows the layout of the input directory tree. In general, the output directory contains a Java file for every MOF file in the input directory.
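The mapping from an input file to its output file can be sketched with java.nio.file.Path. This is only an assumption about how the mirroring might be done; the names OutputPathSketch and outputFor are invented for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not TranslateCIM's actual code.
class OutputPathSketch {
    /** Maps a MOF file under inputRoot to the matching .java path under outputRoot. */
    static Path outputFor(Path inputRoot, Path mofFile, Path outputRoot) {
        // Path of the MOF file relative to the input tree, e.g. Device/CIM_DiskPartition.mof
        Path relative = inputRoot.relativize(mofFile);
        // Same file name with the .mof extension swapped for .java
        String javaName = relative.getFileName().toString().replaceAll("\\.mof$", ".java");
        // Same relative directory, but under the output tree
        return outputRoot.resolve(relative).resolveSibling(javaName);
    }
}
```

For example, with an input root of "cim" and an output root of "out", the file cim/Device/CIM_DiskPartition.mof maps to out/Device/CIM_DiskPartition.java.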

The translator reads the entire set of CIM files and writes the entire set of Java files. It does this instead of processing one file at a time because

ANTLR generates translators that have 2 phases of translation: a lexical analysis phase followed by a parsing phase. The translator reads all ~1200 CIM MOF files and lexically analyzes them to produce a single stream of tokens in memory. When that's done, it starts the parsing phase, which reads the token stream and writes one output file whenever it encounters the end of a class definition. The translator uses ANTLR stringTemplates to create the actual output, which should make it relatively easy to make the translator produce some other output language than Java if desired.

ANTLR uses recursive-descent parsing, which allows the process to be broken into pieces naturally. The "output" of each rule is passed up to the enclosing rule so that the processing at each level can be modular. Each parser rule may return value(s) that represent the rule. The returned values are often in the form of a stringTemplate or "st", but can be any Java type. For example, the list of qualifiers for a class is best represented as a Java HashMap, with the name of the qualifier as the key.
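To make the qualifier example concrete, here is a small sketch of the kind of map a qualifier-list rule might return to its enclosing rule. It is plain Java, not ANTLR grammar code, and the class and method names are hypothetical. Storing qualifier values as strings is also just an assumption for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not TranslateCIM's actual code.
class QualifierMapSketch {
    /** Collects {name, value} pairs into a map keyed by qualifier name. */
    static Map<String, String> collect(String[][] nameValuePairs) {
        Map<String, String> qualifiers = new HashMap<>();
        for (String[] pair : nameValuePairs) {
            qualifiers.put(pair[0], pair[1]);
        }
        return qualifiers;
    }

    /** The enclosing rule can then ask simple questions of the map. */
    static boolean isDeprecated(Map<String, String> qualifiers) {
        return qualifiers.containsKey("Deprecated");
    }
}
```

The point of the map is that the class-level rule doesn't need to know how qualifiers were parsed. It only looks up the names it cares about.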

Translation issues

NULL

The CIM Infrastructure Specification says, in section 4.11.6,

All types can be initialized to the predefined constant NULL, which indicates that no value is provided. The details of the internal implementation of the NULL value are not mandated by this document.

This almost doesn't matter, since the CIM MOF files don't actually use the keyword "NULL" except in comments. There is one kinda/sorta exception: in Device/CIM_SpareConfigurationCapabilities.mof, a uint32 is initialized to "null". This seems to be a typo - the word should probably be capitalized. Anyway, the obvious way to support "NULL" in Java is via Java's "null" keyword. That works fine for Java objects but not for primitive types. Furthermore, the comment next to the usage indicates a clear difference between "NULL" and a zero value. To handle it, I put a special case in the parser that treats "null" as "NULL". The proper "fix" is to change TranslateCIM so that it doesn't generate code that uses Java primitives. Until I do that, the Java produced by TranslateCIM won't be compatible with the CIM specification. Sigh.
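The difference between primitives and objects is easy to show in a few lines. This is a sketch of the problem, not of TranslateCIM's output; the class name is made up, and using long/Long for a uint32 is only an assumption for the example.

```java
// Hypothetical illustration, not generated by TranslateCIM.
class NullValueSketch {
    // A primitive field always holds a number; 0 is indistinguishable from "no value".
    long primitiveValue;

    // A boxed field can be null, which can stand in for CIM's NULL.
    Long boxedValue;

    /** Shows that only the boxed type can tell NULL apart from zero. */
    static String describe(Long value) {
        return value == null ? "NULL (no value provided)" : "value " + value;
    }
}
```

With boxed types, an uninitialized property reads as null and a property set to zero reads as 0, which matches the distinction made in the MOF comment.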

Unimplemented MOF constructs

The CIM Infrastructure Specification defines MOF, a language used to define CIM classes. MOF is defined very specifically in the specification. There's even a Backus-Naur grammar in Appendix A. Some parts of the MOF language are not actually used in any of the MOF files that define CIM classes. They are omitted from the translator. The omitted parts include:

The ANTLR grammar that defines the translator was adapted from the Backus-Naur grammar in the CIM Specification. Pedro Assis made the initial changes to turn it into a grammar that could be accepted by ANTLR version 2. I made further changes to turn it into an ANTLR v3 grammar, and then I refined it further to make it parse the actual CIM files and produce output files.

CIM field names (property names) violate Java naming conventions

Java naming conventions specify that field names (variables) should start with a lowercase letter. In CIM, these are called "properties", and they start with a capital letter. The translator converts CIM properties into Java fields. I could either force the first character to lowercase, or just let them violate the Java convention. Dunno what to do yet. For now, I use the CIM names, violating Java convention.
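If I went with forcing the first character to lowercase, the conversion might look something like this sketch. The names are hypothetical and this is not in TranslateCIM today.

```java
// Hypothetical helper, not TranslateCIM's actual code.
class PropertyNameSketch {
    /** Lowercases only the first character of a CIM property name. */
    static String toJavaFieldName(String cimName) {
        if (cimName == null || cimName.isEmpty()) {
            return cimName;
        }
        return Character.toLowerCase(cimName.charAt(0)) + cimName.substring(1);
    }
}
```

This turns "ReplacementPolicy" into "replacementPolicy". One wrinkle with the simple rule: a name that starts with an acronym, such as a hypothetical "OSType", would become "oSType", which looks odd and is one reason the question isn't settled.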

Deprecated things: to omit or not to omit?

Should the translator emit classes, fields and methods marked "Deprecated"? It's nice not to have them cluttering up libraries, but omitting them may not be right in some situations. By default, the translator emits deprecated things, but a command-line option named "nodeprecated" (name inspired by javadoc options) causes them to be skipped.

Unfortunately, there are some mistakes in the CIM MOF files. Some deprecated classes are extended by non-deprecated classes. When you run translateCIM with the "-nodeprecated" option, these classes will cause "class xxx not defined" errors and/or the resulting Java won't compile. There are some mealy-mouthed comments about this in the CIM MOF files, saying it'll be fixed when there's a "major Schema release". If you really want to produce Java that doesn't have deprecated stuff, you'll have to manually remove the "Deprecated" qualifier from the class definition in the following files in CIM 2.20.1:

Of course, the above list is likely to change with every release of CIM.
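Since the list changes, it could be recomputed for each release. Here is a sketch of one way to find the troublesome classes, assuming the translator has already collected each class's superclass and the set of deprecated class names. TranslateCIM doesn't do this today, and all names in the example are invented.

```java
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not TranslateCIM's actual code.
class DeprecatedParentSketch {
    /**
     * Returns the deprecated classes that are extended by at least one
     * non-deprecated class. superclassOf maps a class name to its superclass name;
     * root classes simply have no entry.
     */
    static SortedSet<String> deprecatedButStillExtended(Map<String, String> superclassOf,
                                                        Set<String> deprecated) {
        SortedSet<String> result = new TreeSet<>();
        for (Map.Entry<String, String> entry : superclassOf.entrySet()) {
            String child = entry.getKey();
            String parent = entry.getValue();
            if (!deprecated.contains(child) && deprecated.contains(parent)) {
                result.add(parent);
            }
        }
        return result;
    }
}
```

The returned names are the classes whose "Deprecated" qualifier would have to be removed by hand before running with "-nodeprecated".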

Experimental things: to omit or not to omit?

Should the translator emit classes marked "Experimental"? By default, the translator emits them, but a command-line option named "omitExperimental" will cause them to be skipped.

Enums: to emit or not?

CIM defines many enumerations, many of which can be represented as enums in Java. Java enums provide type safety at compile-time. By default, the translator will convert CIM data properties into Java enums when it can. A command-line option named "noenums" (name inspired by javadoc options) turns this behavior off and causes the translator to produce older-style, less type-safe "typesafe enums".
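For readers who haven't seen both styles, here is a sketch of the two forms side by side. The type names and values are made up for the example, and the exact shape of the code TranslateCIM generates in each mode may differ.

```java
// Hypothetical example types, not generated by TranslateCIM.

// Java enum: the compiler rejects anything that isn't one of the listed constants.
enum StatusSketch {
    UNKNOWN, OK, ERROR
}

// Older-style "typesafe enum" pattern: a final class with a private constructor
// and a fixed set of public constants.
final class OldStyleStatusSketch {
    public static final OldStyleStatusSketch UNKNOWN = new OldStyleStatusSketch("Unknown");
    public static final OldStyleStatusSketch OK = new OldStyleStatusSketch("OK");
    public static final OldStyleStatusSketch ERROR = new OldStyleStatusSketch("Error");

    private final String name;

    private OldStyleStatusSketch(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return name;
    }
}
```

The enum form also works in switch statements and comes with valueOf and values for free, which the hand-written pattern has to provide itself if it wants them.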

Dangerous backslashes in cimv215.mof

In CIM version 2.15, the master MOF file contained include statements with file names that had Microsoft-Windows-style backslashes to delimit directories. MOF syntax uses backslashes to introduce escape sequences, so this wasn't legal MOF. As it happens, the "backslash sequences" in the file didn't match any of the escape sequences used in MOF string literals. A parser would be justified if it were to emit "unrecognized escape sequence" errors. Instead, just to make it work, I put a "NonEscapeSequence" token into the grammar to explicitly allow "\C" and "\P" as special escape sequences. All the file names in the file happen to start with "C" or "P", so this fixes the problem. What a hack.

Names of Values in properties

Duplicates

Some CIM properties have Values that have names that are duplicates. For example, in Device/CIM_AssociatedCacheMemory.mof, the "ReplacementPolicy" property has a Values qualifier with the word "Unknown" defined twice. Device/CIM_DiskPartition.java has many duplicate occurrences of "Microsoft". Values strings must be unique, whether enums are generated or not, so TranslateCIM appends a digit to each name to make it unique, producing "Microsoft", "Microsoft1", "Microsoft2", etc.
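The renaming scheme can be sketched as follows. This is a simplified illustration with invented names, not the code in TranslateCIM. It doesn't guard against a generated name such as "Microsoft1" colliding with a name that already appears in the list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not TranslateCIM's actual code.
class UniqueNamesSketch {
    /** Leaves the first occurrence of a name alone and appends 1, 2, ... to repeats. */
    static List<String> makeUnique(List<String> names) {
        Map<String, Integer> timesSeen = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String name : names) {
            // merge returns the updated count: 1 for the first occurrence, 2 for the second, ...
            int count = timesSeen.merge(name, 1, Integer::sum);
            result.add(count == 1 ? name : name + (count - 1));
        }
        return result;
    }
}
```

Given "Microsoft", "Microsoft", "Unknown", "Microsoft", the sketch produces "Microsoft", "Microsoft1", "Unknown", "Microsoft2".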

Reserved words

In cimv218Experimental-MOFs/Core/CIM_BaseMetricDefinition.mof, the "DataType" property has Values named "boolean" and "datetime". If we're generating enums, we can't use those names. TranslateCIM does <TBD> to fix the problem.

Building and running the translator

Quick reminder of how to do it

jrm
mvn clean
mkdir -p /tmp/org/dmtf/cim
mp
rp
jp
jd

More details

To use maven to convert the grammar into a translator:

cd ~/TranslateCIM
mvn package

To run the translator, see the "rp" function in my .bashrc file. This is the hack way to run the translator. The right way is via maven, as described in the next section.

Maven philosophy says that each maven project should produce a single artifact: a jar file or something. We now produce an executable jar file containing the translate-cim program. For the Kyben project, we need another jar file containing the CIM objects. That's a different jar file, so I'll need another Maven project to execute the translator to convert CIM files into Java files, then compile the Java files into class files, then package them into a jar file. Oh, yeah, and produce javadoc. If I ever get around to writing Kyben, I'll need yet a third maven project for the Kyben code.

I think I need at least the following separate Maven projects:

1. translate-cim
Runs ANTLR to convert the grammar into Java, compiles the Java, runs tests, and packages the program into an executable jar.
2. cim-java
Depends on translate-cim. Runs translate-cim to convert CIM mof files into Java, compiles the Java, runs tests, and packages the resulting classes into a jar. For this project, I have to write a Maven plugin for translate-cim.
3. kyben
Depends on cim-java.

The cim-java project needs to run translate-cim, which in the Maven world means I have to modify translate-cim to exist within the Maven ecosystem. I'll have to modify translate-cim to turn it into a Maven mojo, and then use maven-plugin-plugin to create a Maven plugin that runs the mojo. Then, in the pom.xml file for cim-java, I can bind the translate-cim mojo to the lifecycle phase named generate-sources.

I wrote translate-cim to get its parameters from the command line. In the Maven ecosystem, a mojo gets its parameters from pom.xml files or from the Maven command line (like when I say "mvn -Dtest=TestOneFile"). To get parameter values from the Maven ecosystem, I just need to use javadoc annotations in the translate-cim program to tell the Maven plugin builder which variables are parameters. Cool. I'm sure that the same application can then be run from a command line or as a Maven plugin. Now it's just a matter of filling in the details :-).

Errors and Testing the translator

How should I test that TranslateCIM is "correct"? Perhaps it is correct if

The above tests aren't really "unit" tests. To implement the above I don't make use of ANTLR's unit testing framework named "gunit", because gunit is designed to test individual grammar rules. I have implemented jUnit tests that test each of the errors that my Java code can produce.

The recommended way to report an error in ANTLR is to throw an ANTLR RecognitionException and let ANTLR report the problem and try to recover. This approach makes a translator that "keeps on truckin'" - it doesn't die on the first error it encounters. That's fine for compilers, but I'm parsing a relatively fixed set of input files, so TranslateCIM either works or it doesn't. I don't need to keep on truckin'. The ANTLR book describes how to change a translator to make it quit on the first error, but I had trouble making it work. I couldn't figure out how to make the lexer phase die with a "can't find include file" error.

Instead of throwing a RecognitionException and letting ANTLR try to recover, I throw my own unchecked exceptions, defined in file TranslateCIMExceptions. I couldn't throw checked exceptions because there isn't any way to make ANTLR generate the "throws" clauses that would be needed in the methods ANTLR generates.

Also, I don't want the user to see a big ugly stack dump whenever there's an error, so I catch my exceptions in the main function and output only the one-line error message in the exception.
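The pattern looks roughly like this. It's a sketch with invented names (the real exception classes live in TranslateCIMExceptions and aren't reproduced here), showing an unchecked exception that is caught at the top level and reduced to a one-line message.

```java
import java.io.PrintStream;

// Hypothetical exception type standing in for the ones in TranslateCIMExceptions.
class TranslationErrorSketch extends RuntimeException {
    TranslationErrorSketch(String message) {
        super(message);
    }
}

// Hypothetical top-level wrapper, not TranslateCIM's actual main function.
class ErrorReportingSketch {
    /** Runs the translation and returns an exit status: 0 on success, 1 on a reported error. */
    static int run(Runnable translation, PrintStream err) {
        try {
            translation.run();
            return 0;
        } catch (TranslationErrorSketch e) {
            // Print only the message, never the stack trace.
            err.println(e.getMessage());
            return 1;
        }
    }
}
```

A real main function would call something like System.exit(run(..., System.err)). Unexpected exceptions other than the translator's own still produce a stack trace, which is probably what you want for genuine bugs.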

Use "mvn test" to run tests. This compiles and executes the test programs stored in ~/TranslateCIM/src/test/java/com/kyben/translatecim. Following maven conventions, the programs read test data files stored in ~/TranslateCIM/src/test/resources. Maven automatically adds that directory to the classpath, so that test programs can access the files by calling getResourceAsStream. This convention allows maven to avoid referring to files by explicit file paths. To use this convention yet still allow TranslateCIM to accept explicit file paths on the command line, the test programs execute TranslateCIM slightly differently than stand-alone TranslateCIM executes. Stand-alone TranslateCIM parses explicit file names from the command line. Test cases use getResourceAsStream, then extract the file names, and feed those to TranslateCIM. Odd, but this approach allows me to have test cases in the maven-style development tree in the maven-standard place, yet still have a real program that accepts file names on the command line.

To-dos

Map CIM types to Java types

I map CIM datatypes into Java datatypes as shown by the templates at the bottom of the TranslateCIM.stg file. It's not possible to map things perfectly. Here are some of the problems and issues:

Generate good javadoc

Just for reference, and to show what TranslateCIM should aim to beat, the DMTF's CIM Javadoc is at /nets/intro/staff/siemsen/tmp/cim_schema_2/. To install it there, I followed the directions in the README file under "Get the CIM documentation".

The javadoc created by running the "javadoc" tool on the TranslateCIM output files is at http://localhost/nets/intro/staff/siemsen/internal/projects/kyben/TranslateCIM/javadoc/

For information about the options to the "javadoc" tool, see http://127.0.0.1/java/docs/technotes/tools/solaris/javadoc.html

Other to-dos