Codes

By "codes" we typically mean standardized representations of real-world properties or attributes. Dependencies among codes may be standardized and captured as well. As we try to consolidate or integrate data from disparate systems, we often find that the values chosen to denote the same meaning differ from one source system to another. Code management is a process that maintains information about the precise meaning of each code, often in more than one language, and its representations in participating systems. It encompasses both automated processes that allow the registration of new, unknown codes and manual workflows that enhance and refine the information maintained about such codes.

To set the stage, assume multiple provisioning sources send different literals to represent Day-of-Week. Source A uses numbers 0-6, where 0 means "Sunday," 1 "Monday," etc. Source B uses numbers 1-7 starting from Sunday, Source C uses 0-6 starting from Monday, and Source D uses the letter codes SU, MO, TU, WE, TH, FR and SA. Value 2 means Tuesday for source A, Monday for source B, and Wednesday for source C, while it means nothing for source D. Conversely, Saturday is represented by codes "6", "7", "5" and "SA" for the four sources respectively.

It is often desirable to map each sourced representation to a standard code associated with the intended meaning, and to use that code consistently in reporting. To achieve this, we may establish an agreed representation that we define as "canonical." In the case of the days of the week, for example, we may choose the range 1-7 starting from Sunday. To report Days-of-Week we can also use short or long descriptive names in different languages. So we associate each canonical value with abbreviations, names, comments and/or descriptions (possibly shown in online dictionaries) in various languages. We refer to the capability of expressing descriptive information in multiple languages as "National Language Support" (NLS).
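
As an illustration of the mapping just described, the following Python sketch translates each source's literals into the canonical 1-7 range (1 = Sunday) and looks up a label by language. The dictionary and function names are assumptions made for this example, and only English and French labels are included.

```python
# Canonical Day-of-Week codes 1-7 starting from Sunday, with labels in two languages.
CANONICAL_NAMES = {
    1: {"en": "Sunday", "fr": "dimanche"},
    2: {"en": "Monday", "fr": "lundi"},
    3: {"en": "Tuesday", "fr": "mardi"},
    4: {"en": "Wednesday", "fr": "mercredi"},
    5: {"en": "Thursday", "fr": "jeudi"},
    6: {"en": "Friday", "fr": "vendredi"},
    7: {"en": "Saturday", "fr": "samedi"},
}

# Hypothetical per-source lookup tables: source literal -> canonical code.
SOURCE_TO_CANONICAL = {
    "A": {str(i): i + 1 for i in range(7)},             # 0-6, 0 = Sunday
    "B": {str(i): i for i in range(1, 8)},              # 1-7, 1 = Sunday
    "C": {str(i): (i + 1) % 7 + 1 for i in range(7)},   # 0-6, 0 = Monday
    "D": {code: i + 1 for i, code in
          enumerate(["SU", "MO", "TU", "WE", "TH", "FR", "SA"])},
}


def to_canonical(source, literal):
    """Return the canonical code for a source literal, or None if unmapped."""
    return SOURCE_TO_CANONICAL[source].get(str(literal))


def label(canonical, lang="en"):
    """Return the descriptive name of a canonical code in the given language."""
    return CANONICAL_NAMES[canonical][lang]


# The four source representations of Saturday all resolve to canonical code 7.
saturday_codes = [to_canonical(s, v)
                  for s, v in [("A", "6"), ("B", "7"), ("C", "5"), ("D", "SA")]]
```

Keeping one table per source means the same literal can carry a different meaning in each source without any conflict.
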
NLS can also be applied at the source-specific domain level for each source system. Consequently, we may have English and French names for each canonical day-of-week code, and we can also capture the slightly different ways in which each source system presents the same information. For example, Source A may display all names in capital letters, while the canonical codes and the other sources use title case.

In the model below, canonical codes are identified within each type by their canonical code value. Similarly, at the source-specific level every type has its own codes. Universal keys (UKEY) act as handles for canonical codes. Alternatively, a sequential enumeration key within each canonical type can be used by facts to reference canonical codes. Enumeration keys are slightly better for optimization purposes and can be smaller in size, so they are desirable for use in very large facts. Each source-specific code points to its corresponding canonical code based on common meaning. Language-specific descriptive information is associated with both canonical and sourced codes.
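
A minimal sketch of these entities in Python may help make the relationships concrete. The class and field names are assumptions for illustration, and generating the UKEY from a UUID is likewise an assumption, since the text does not say how universal keys are produced.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CanonicalCode:
    code_type: str   # e.g. "DAY-OF-WEEK"
    code: str        # canonical value within the type, e.g. "7"
    key: int         # sequential enumeration key within the type
    # Universal handle; a UUID is an assumption made for this sketch.
    ukey: str = field(default_factory=lambda: uuid.uuid4().hex)
    labels: dict = field(default_factory=dict)   # language -> name (NLS)


@dataclass
class SourcedCode:
    source: str      # e.g. "A"
    code_type: str
    code: str        # literal as sent by the source
    canonical: Optional[CanonicalCode] = None    # link by common meaning
    labels: dict = field(default_factory=dict)   # source-specific NLS


def display_name(sourced, lang="en"):
    """Prefer the source-specific label, then fall back to the canonical one."""
    if lang in sourced.labels:
        return sourced.labels[lang]
    if sourced.canonical is not None:
        return sourced.canonical.labels.get(lang, "Undefined")
    return "Undefined"


saturday = CanonicalCode("DAY-OF-WEEK", "7", key=7,
                         labels={"en": "Saturday", "fr": "samedi"})
# Source A shows names in capital letters; Source D relies on canonical labels.
src_a = SourcedCode("A", "DAY-OF-WEEK", "6", canonical=saturday,
                    labels={"en": "SATURDAY"})
src_d = SourcedCode("D", "DAY-OF-WEEK", "SA", canonical=saturday)
```

Here `display_name(src_a)` yields the source-specific "SATURDAY", while `display_name(src_d, "fr")` falls back to the canonical French label.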

Code Representation - Facts Reference Canonical Codes

The diagram shows facts pointing to canonical codes; however, they could instead point to sourced codes. Is one approach better than the other? Let's review the choices:

  1. Facts may point to the canonical domain entry that captures the intended meaning. In our example, all facts that refer to Thursday would use the key that points to the canonical entry for code 5. This is consistent with the diagram above.
  2. Alternatively, the fact may point to the locally sourced domain value, and whenever the common representation is required the translation is made at reporting time. This may appear to be an odd choice, because facts that denote "Saturday" must point to local codes "SA", 5, 6 or 7 depending on the source they came from. For this approach, keys are better defined at the source-specific code level, leaving the UKEY as the only canonical-level key.

While pointing to the canonical entry is better in most respects, and intuitively more elegant, it has one disadvantage: it makes support of real-time exception handling more complex. Here is why. Say we receive a value that is not already in our code system, or that is not associated with a canonical entry. Not only do we have to create the new sourced code, but we also have to invent a phantom canonical node in order to link the fact to it. If, for example, we receive a record from source D with the code value "LU", the system cannot surmise that this may be a French code for Monday and point the fact to the "Monday" canonical code. All the system knows is that there is no well-defined mapping to a canonical entry for the incoming source code "LU".

If the fact did not have to point to a canonical code, but simply to a source code entry for code "LU" from source D for domain DAY-OF-WEEK, things would become much simpler. The system can automatically register a new sourced code entry for "LU" with its own new key and "Undefined" as its name and description. This allows the fact to be processed and made accessible without worrying about a canonical code. Later, the correct association to the proper canonical entry can be made, the descriptive information updated, and the proper reporting labels restored, all without having to update the fact. If a new canonical code needs to be generated, it can be dealt with at that time, as can all needed associations. In our case, the new code "LU" would simply be linked to the canonical entry for "Monday". This scenario is depicted in the following diagram. Notice the highlighted path of entities required to link the unknown source to a fact reference.
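
The "LU" scenario can be sketched as follows. This is only an illustration under stated assumptions: the `CodeRegistry` class, its method names, and the fact layout (including the `amount` field) are hypothetical and not part of the model described here.

```python
from itertools import count


class CodeRegistry:
    """Hypothetical registry in which facts reference sourced-code keys."""

    def __init__(self):
        self._next_key = count(1)
        self.sourced = {}   # (source, code_type, code) -> entry
        self.by_key = {}    # sourced key -> entry

    def resolve(self, source, code_type, code):
        """Return the sourced-code key, registering unknown codes on the fly."""
        ident = (source, code_type, code)
        entry = self.sourced.get(ident)
        if entry is None:
            # Unknown code: register it with a new key and a placeholder name.
            entry = {"key": next(self._next_key),
                     "name": "Undefined",
                     "canonical": None}
            self.sourced[ident] = entry
            self.by_key[entry["key"]] = entry
        return entry["key"]

    def link(self, source, code_type, code, canonical, name):
        """Manual follow-up: associate the code with its canonical entry."""
        entry = self.sourced[(source, code_type, code)]
        entry["canonical"] = canonical
        entry["name"] = name

    def report_label(self, key):
        """Label used at reporting time for a fact's code key."""
        return self.by_key[key]["name"]


registry = CodeRegistry()

# A record arrives from source D with the unknown code "LU".
fact = {"amount": 10, "day_key": registry.resolve("D", "DAY-OF-WEEK", "LU")}
label_before = registry.report_label(fact["day_key"])

# Later, a person links "LU" to canonical Monday (code 2); the fact is untouched.
registry.link("D", "DAY-OF-WEEK", "LU", canonical=2, name="Monday")
label_after = registry.report_label(fact["day_key"])
```

The fact keeps the same key throughout; only the sourced-code entry changes, so the reporting label moves from "Undefined" to "Monday" without reprocessing the fact.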

Code Representation - Facts Reference Source Codes

To achieve the same effect with the first approach, every time a new code is discovered a new canonical entry and its keys have to be generated so that the fact can be linked to the new code. If the source code is later pointed to "Monday", the generated canonical code will remain, also representing Monday for the few facts that were processed prior to human intervention. The highlighted entities in the first diagram depict the trace of objects that need to be created automatically in order to link the incoming unknown code to the fact key.

Another complication arises when some non-trivial mapping is involved. For example, we may be looking at the area code of a phone number and deriving the location code, or deriving the state from the zip code. Depending on the source country, these look-ups are slightly different and may need to be resolved at the local level. This is shown in the diagram above via the Map-To-Code entity, which would represent the area code or zip code in the example. The sourced code captures the derived location code or the state, respectively. Real-time exceptions can still be processed when a new zip code shows up, but this time the matched code and the source code will be the same. If the mapping used were to a canonical entry, then real-time processing would not be able to establish the correct mapping unless it always created an "unknown" canonical entry as described above.
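
A rough sketch of the Map-To-Code look-up with real-time exception handling follows. The table contents, the function name and the placeholder zip code "00000" are assumptions chosen for illustration only.

```python
# Hypothetical Map-To-Code table per source: matched code (zip) -> sourced code (state).
MAP_TO_CODE = {
    "US": {"10001": "NY", "94105": "CA"},
}

# Sourced codes already known for the source, keyed by (source, code).
sourced_codes = {
    ("US", "NY"): {"name": "New York", "canonical": "NY"},
    ("US", "CA"): {"name": "California", "canonical": "CA"},
}


def derive_sourced_code(source, matched):
    """Derive the sourced code from a matched value such as a zip code.

    When no mapping exists, the matched value itself is registered as the
    sourced code, so the fact can still be processed in real time.
    """
    derived = MAP_TO_CODE.get(source, {}).get(matched)
    if derived is None:
        derived = matched   # matched code and sourced code are the same
        MAP_TO_CODE.setdefault(source, {})[matched] = derived
    # Register a placeholder entry only if the sourced code is not yet known.
    sourced_codes.setdefault((source, derived),
                             {"name": "Undefined", "canonical": None})
    return derived


known = derive_sourced_code("US", "10001")     # resolves through the map
unknown = derive_sourced_code("US", "00000")   # registered as its own code
```

Once someone reviews the placeholder entry, the Map-To-Code row can be repointed to the proper state code without touching the facts already loaded.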
