Each App Store image includes metadata contained inside a JSON file that specifies various interface and configuration options. Some of this metadata is visible in the App Store screen and/or the Create Cluster and Create New Job screens, and is described in the Interface Metadata section of this article. The interface metadata, along with other application configuration metadata, is contained inside the Catalog JSON file that is described in the Catalog JSON File section of this article.
The interface-related metadata used by the App Store interface is described in the sections below.
This article uses the CDH 5.4.3 with Cloudera Manager Catalog entry as an
example for explaining the EPIC Catalog (App Store) entry JSON properties.
The cdh54CM.json file is located in the /opt/bluedata/catalog/entries/system directory.
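A Catalog entry is ordinary JSON and can be inspected with standard tools. The sketch below parses a minimal, hypothetical fragment modeled on the identification properties described later in this article (it does not read the actual cdh54CM.json file):

```python
import json

# A minimal, hypothetical Catalog entry fragment modeled on cdh54CM.json;
# the field names match those described in this article.
entry_json = """
{
  "distro_id": "cdh54CM",
  "label": {
    "name": "CDH 5.4.3 with Cloudera Manager",
    "description": "CDH 5.4.3 with MRv1/YARN and HBase support."
  },
  "version": "2.0.1",
  "epic_compatible_versions": ["3.4"],
  "categories": ["Hadoop", "HBase"]
}
"""

entry = json.loads(entry_json)
print(entry["distro_id"], entry["version"])  # cdh54CM 2.0.1
print(", ".join(entry["categories"]))        # Hadoop, HBase
```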
Catalog entry properties can be broadly segregated by purpose, as described in the following sections.
The identification blob appears as follows:
"distro_id": "cdh54CM",
"label": {
"name": "CDH 5.4.3 with Cloudera Manager",
"description": "CDH 5.4.3 with MRv1/YARN and HBase support. Includes Pig, Hive, Hue and Spark."
},
"version": "2.0.1",
"epic_compatible_versions": ["3.4"],
"categories": [ "Hadoop", "HBase" ],
In this blob:
- distro_id is a unique identifier for either a Catalog entry or a versioned set of Catalog entries. It represents a particular application or application-framework setup as created and maintained by a particular author or organization. The EPIC interface and API currently allow only one Catalog entry with a given distro ID to be installed for use at any given time. Each distro ID corresponds to one "tile" in the Images tab of the App Store screen. EPIC may also reference the distro ID when determining appropriate Add-On image entries that can be added to a cluster, because an add-on may have a distro ID requirement.
- label contains the following parameters:
  - name, which is the "short name" of the Catalog entry. The Catalog API does not allow entries with different distro IDs to share the same name.
  - description, which is a longer, more detailed blurb about the entry.
- version is a discriminator between multiple Catalog entries that share the same distro ID. It is expected to adhere to a simple pattern of digits separated by dots in the format a.b.c, where:
  - a.b is the version number, such as the 2.0 in "version": "2.0.1". You may assign any version you want to the Catalog entry, and each Catalog entry will have its own unique distro ID. This version represents iterations of this Catalog entry; it does not necessarily represent the version of any software deployed in a cluster. For example, you may have a CDH 5.4 Catalog entry that you deploy as Version 1.0 followed by 1.1, 2.0, etc. EPIC installs the newest available version of a given distro ID when instructed to install or upgrade that distro ID.
  - c is the optional build number, such as the 1 in "version": "2.0.1". App Workbench stores the first value used for c when the distro ID is created. Future versions of the same distro ID will automatically increment the build number based on the last value stored in the system, provided that you do not change the c value in the JSON file. In this example, the first-ever build of this distro ID will be version 2.0.1, the next version will be 2.0.2, and so forth. Manually entering a new build number that is equal to or less than the stored build value will not have any effect until you change the version number by modifying the a and/or b values, such as by moving from version 2.0.1 to version 2.1.1 or 3.0.1. Manually entering a new build number that is higher than the stored build value will increment the build number to the new value. For example, if the stored build value is 5 and you enter a build number that is less than or equal to 5, then the next build number will be 6; however, if the stored build value is 5 and you enter a build number of 10, then the next build number will be 10 and will increment from there.
- epic_compatible_versions lists the EPIC platform versions where this Catalog entry may be used. An asterisk (*) may be used in a version string as a wildcard.
- categories is a list of strings used by the EPIC interface to group Catalog entries during cluster creation. These values appear in the Select Cluster Type pull-down menu.

The components blob appears as follows:
"image": {
"checksum": "b07e8cfea8a9c1a6cdc6990b1da29b9f",
"import_url": "http://s3.amazonaws.com/bluedata-vmimages/Cloudera-CDH-CM-5.4.3-v2.tgz"
},
"setup_package": {
"checksum": "7560c8841c1400e0e4a4ba3dac1ba8d7",
"import_url": "http://s3.amazonaws.com/bluedata-vmimages/cdh5-cm-setup.tgz"
},
In this blob:
image is a property that identifies the location for the
image used to launch virtual nodes for this Catalog entry. In EPIC versions 2.0
and above, this will be an image for launching a Docker container. This location
can be specified in either of two ways: import_url,
which is the http (not https) URL from which the image can be
downloaded. This must be accompanied by the checksum,
which is the MD5 checksum of the image. This method is used for normal
Catalog entry distribution. The image will be downloaded into the images
download cache directory when the entry is installed, and the downloaded
image may be automatically deleted in certain garbage-collection
situations when the Catalog entry is not in use and not present in any
Catalog feed.source_file/opt/bluedata/catalog/images/). Only the file system
is necessary, not the complete path. No checksum is provided in this
case. This method is used for either development or site-local entries.
In this case, EPIC will never automatically download the designated
image file.setup_package is similar to the image property except for the
configuration scripts package that runs inside the launched virtual node. In
this case, the download cache directory is /opt/bluedata/catalog/guestconfig.The services blob appears as follows:
"services": [
{
"id": "hbase_master",
"exported_service": "hbase",
"label": {
"name": "HMaster"
},
"endpoint" : {
"url_scheme" : "http",
"port" : "60010",
"path" : "/",
"is_dashboard" : true
}
},
{
"id": "hbase_worker",
"label": {
"name": "HRegionServer"
},
"endpoint" : {
"url_scheme" : "http",
"port" : "60030",
"path" : "/",
"is_dashboard" : true
}
},
{
"id": "hbase_thrift",
"label": {
"name": "HBase Thrift service."
}
},
...
],
In this example, services is a list of service objects. The
defined services will be referenced by other elements of this JSON file to determine
which services are active on which nodes within the cluster. That information will
then be used to compose the cluster's service endpoints and dashboard links.
Setup scripts also use service identifiers to register those services with vAgent, so that necessary services can be properly started and restarted along with the virtual node. Setup scripts can also choose to wait for a vAgent-registered service to be active on a node in order to coordinate multi-node setup across the cluster. A common example of such a registered service is sshd.
In this blob:
- id is an identifier that must be unique within the scope of this JSON file. It is used by other objects in this file to reference this service. It is also used in the setup scripts when composing a key for registering a service with vAgent, or when waiting on a registered service to start.
- exported_service is an optional property that has an agreed-by-convention value for a service that is referenced from outside the cluster. This property can have an optional qualifiers list of descriptive qualifiers for that exported service, again with agreed-by-convention values. qualifiers may only be defined if exported_service is defined.
- label uses the same format as the entry's label:
  - name, which briefly describes the service. This property is currently used only when composing clickable service-dashboard links in the EPIC interface; however, it is required for all services.
  - description, which is an optional property with more details.
- endpoint describes the network endpoint of the service:
  - auth_token: true|false indicates whether (true) or not (false) the endpoint requires an authentication token.
  - is_dashboard is a Boolean property of the endpoint that indicates whether this is a URL that can (and should) be viewed from a web browser, such as in the EPIC interface.
  - The url_scheme, port, and path properties of this object are used to compose a service URL. These properties have the following constraints: url_scheme must be defined if is_dashboard is true; port must be defined; path is optional.
  - Defining an endpoint object triggers the creation of a NAT port mapping for this service if EPIC is running inside an EC2 instance.
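As an illustration of how the url_scheme, port, and path properties combine, the following sketch composes a dashboard URL for the hbase_master endpoint shown earlier. The host name is hypothetical, and this is not EPIC's actual implementation:

```python
# Sketch only: compose a service URL from an endpoint object, per the
# url_scheme/port/path rules described above. The host name is hypothetical.
def service_url(host, endpoint):
    scheme = endpoint.get("url_scheme", "http")  # required when is_dashboard is true
    path = endpoint.get("path", "")              # path is optional
    return f"{scheme}://{host}:{endpoint['port']}{path}"  # port must be defined

hbase_master_endpoint = {
    "url_scheme": "http",
    "port": "60010",
    "path": "/",
    "is_dashboard": True,
}

print(service_url("node1.example.com", hbase_master_endpoint))
# http://node1.example.com:60010/
```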
The node_roles blob appears as follows:
"node_roles": [
{
"id": "controller",
"cardinality": "1",
"anti_affinity_group_id": "CM",
"min_cores": "4",
"min_memory": "12288"
},
{
"id": "standby",
"cardinality": "1",
"anti_affinity_group_id": "CM"
},
{
"id": "arbiter",
"cardinality": "1",
"anti_affinity_group_id": "CM"
},
{
"id": "worker",
"cardinality": "1+"
}
],
In this example, node_roles is a list of objects describing roles
that may be deployed for this Catalog entry. Each role is a particular configuration
instantiated from the entry's virtual node image and configured by the setup
scripts. The configuration associated with a particular role is broadly left up to
the setup scripts, and thus varies widely from entry to entry; however, there are
certain constraints and semantics associated with specific roles in the current EPIC
release (for non-Add-On entries):
- The recognized role IDs are controller, worker, standby, and arbiter. If applicable, these roles will be created using the Master Node Flavor specified in the EPIC interface when you create the cluster.
- Jobs are submitted to the controller-role node. If the cluster also includes a standby-role node, then that standby will be tried as an alternate target for job submission if the Controller node is unresponsive.
- Only the worker role is allowed to have scale-out cardinality (see below); the worker role MUST have scale-out cardinality.
- A cluster may include worker, standby, and arbiter nodes. Cluster expansion will increase the number of worker nodes.

The properties of each role object are:
- id is an identifier that must be unique within the scope of this JSON file. It is used by other objects in this file to reference this role. It is also used by EPIC as described above, and may also be referenced by the setup scripts.
- cardinality describes the number of nodes in this role that will be deployed, if/when this role is selected for use in a cluster. If the cardinality string consists of just an integer, then a fixed number of nodes will be deployed for this role. If the cardinality string is an integer followed by +, then a variable number of nodes may be deployed in this role, with the integer as the minimum. This kind of value is referred to as a "scale-out" cardinality.
- anti_affinity_group_id, if it has a specified value, causes nodes deployed from this role and/or from any other role with the same anti_affinity_group_id to be placed on different physical hosts. If this constraint cannot be satisfied, then the cluster creation/expansion will be rejected. Anti-affinity is typically used to reduce the physical resources shared by a set of nodes, making it less likely for a single physical fault to affect them all. This constraint only applies to nodes within a given cluster; anti-affinity is not enforced among nodes from different clusters.
- min_cores is an optional property that specifies a minimum number of virtual cores that must be provided by the flavor used to deploy this role.
- min_memory is an optional property that specifies a minimum memory size that must be met by the flavor used to deploy this role.

The configuration blob appears as follows:
"config": {
"selected_roles": [
...
],
"node_services": [
...
],
"config_meta": [
...
],
"config_choices": [
...
]
},
The remainder of the JSON file describes which node roles will be deployed into the cluster, and which services will be present on any node with a given role. This information may depend on choices provided by the UI/API user when they are creating the cluster.
- selected_roles lists IDs of roles that will be deployed.
- node_services lists IDs of services that will be present on nodes of a given role, if that role is deployed.
- config_meta lists string key/value pairs that can be referenced by the setup scripts.
- config_choices lists both the choices available to the UI/API user and the possible selections for each choice. This is a potentially recursive data structure, in that a selection may include another config object, which in turn may contain selected_roles/node_services/config_meta/config_choices properties.

This structure means that the top-level selected_roles, node_services, and config_meta property values will apply regardless of any user-provided input about choice selections. User-provided input may then have consequences such as activating additional roles and/or services in the cluster, and/or adding more elements to the config_meta KV store.

For example, in the CDH 5.4.3 JSON:
- The mrtype choice has two valid selections: mrv1 and yarn. If yarn is selected for the mrtype choice, then:
  - The controller and worker roles are selected for deployment.
  - The yarn_rm and job_history_server services are selected to be present on the controller role node.
  - The yarn_nm service is selected to be present on the worker role nodes.
  - The yarn_nm service is also selected to be present on the standby and arbiter role nodes.
- A nested yarn_ha choice is then enabled, with valid selections true or false. If true is selected for yarn_ha, then:
  - The controller, standby, arbiter, and worker roles must be defined.
  - The zookeeper service is selected to be present on the controller, standby, and arbiter role nodes.
  - The yarn_rm and hdfs_rm services are selected to be present on the standby role node.

The selected_roles blob appears as follows:
"selected_roles": [
"controller",
"standby",
"arbiter",
"worker"
],
The value of the selected_roles property is a list of role IDs.
The example shown above is taken from the choice selection that activates HBase
support.
At the top level of this entry's config, the selected_roles property is an empty list; no roles at all will be activated unless the user provides some input (choice selections). This is a valid arrangement and reflects the fact that, for this Catalog entry, some choices must be made before any usable application framework can be provided in this cluster. By contrast, some other Catalog entries have roles and services that are always selected.
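The relationship between an empty top-level selected_roles list and choice-driven activation can be sketched as follows. This illustrates the recursive config structure only; it is not EPIC's actual resolver, and the choice data is abbreviated from this entry:

```python
# Sketch: walk a config object and its nested config_choices, collecting the
# role IDs activated by a given set of choice selections. Illustrative only.
def resolve_roles(config, selections):
    roles = set(config.get("selected_roles", []))
    for choice in config.get("config_choices", []):
        selected_id = selections.get(choice["id"])
        for sel in choice.get("selections", []):
            if sel["id"] == selected_id and "config" in sel:
                roles |= resolve_roles(sel["config"], selections)
    return roles

config = {
    "selected_roles": [],  # empty at the top level, as in this entry
    "config_choices": [
        {"id": "hbase", "selections": [
            {"id": False},
            {"id": True, "config": {"selected_roles": ["controller", "standby", "arbiter", "worker"]}},
        ]},
        {"id": "mrtype", "selections": [
            {"id": "mrv1", "config": {"selected_roles": ["controller", "worker"]}},
            {"id": "yarn", "config": {"selected_roles": ["controller", "worker"]}},
        ]},
    ],
}

print(sorted(resolve_roles(config, {"hbase": False, "mrtype": "yarn"})))
# ['controller', 'worker']
```

With no selections at all, no roles are resolved, matching the empty top-level list described above.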
The node_services blob appears as follows:
"node_services": [
{
"role_id": "controller",
"service_ids": [ "ganglia", "ganglia_api", "ssh", "gmetad", "gmond", "httpd" ]
},
{
"role_id": "standby",
"service_ids": [ "ssh", "gmond" ]
},
{
"role_id": "arbiter",
"service_ids": [ "ssh", "gmond" ]
},
{
"role_id": "worker",
"service_ids": [ "ssh", "gmond" ]
}
],
Each element of this list is a node_services object that describes
the services available on a given role. The role may or may not be selected; this
data structure simply indicates that if a certain role is selected (according to
choice selections), then these are the services a node with that role will provide.
The top-level node_services in this example Catalog entry are all
of the ancillary services that don't depend on choices like HBase support or MR
type.
The properties of each node_services object are:
- role_id references the value of the id property of a node_role object defined within this same Catalog entry JSON.
- service_ids is a list of id values of service objects defined within this same Catalog entry JSON.

The config_meta blob appears as follows:
"config_meta": {
"streaming_jar": "/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar",
"impala_jar_version": "0.1-SNAPSHOT",
"cdh_major_version": "CDH5",
"cdh_full_version": "5.4.3",
"cdh_parcel_version": "5.4.3-1.cdh5.4.3.p0.6",
"cdh_parcel_repo": "http://archive.cloudera.com/cdh5/parcels/5.4.3"
},
In this example, config_meta is a key-value store. These values
are only used by the scripts in the guest package and are thus completely opaque to
EPIC. These values may be referenced during node setup. For example, the streaming_jar value is conventionally referenced by the script
that runs Hadoop Streaming jobs.
Choice selections may cause the definition of multiple config_meta
lists that together form the KV store visible to the in-guest scripts. To avoid
confusion, key conflicts are not allowed. For example, it is legal for mutually
exclusive choice selections to define different values for a key, but it is not
legal for the same key to be defined more than once when composing the KV store that
results from a particular set of choice selections.
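The key-conflict rule can be sketched as a simple merge with duplicate detection. This is an illustration only, not EPIC's implementation; the mr_framework key is hypothetical:

```python
# Sketch of the key-conflict rule: config_meta maps contributed by the active
# choice selections merge into one KV store, and a key defined more than once
# in that merged view is rejected. The "mr_framework" key is hypothetical.
def merge_config_meta(*metas):
    merged = {}
    for meta in metas:
        for key, value in meta.items():
            if key in merged:
                raise ValueError(f"duplicate config_meta key: {key}")
            merged[key] = value
    return merged

top_level = {"cdh_major_version": "CDH5", "cdh_full_version": "5.4.3"}
yarn_meta = {"mr_framework": "yarn"}  # hypothetical key from a choice selection

print(merge_config_meta(top_level, yarn_meta))
```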
The config_choices blob appears as follows:
"config_choices": [
{
"id": "hbase",
"type": "boolean",
"label": {
"name": "HBase"
},
"selections": [
{
"id": false
},
{
"id": true,
"config": {
...
}
}
]
},
{
"id": "mrtype",
"type": "multi",
"label": {
"name": "MR Type"
},
"selections": [
{
"id": "mrv1",
"label": {
"name": "MRv1"
},
"config": {
...
}
},
{
"id": "yarn",
"label": {
"name": "YARN"
},
"preferred": true,
"config": {
"selected_roles": [
"controller",
"worker"
],
"node_services": [
...
],
"config_choices": [
{
"id": "yarn_ha",
"type": "boolean",
"label": {
"name": "YARN and HDFS High Availability"
},
"selections": [
{
"id": false
},
{
"id": true,
"config": {
...
}
}
]
},
{
"label": {
"name": "CLouderaManagerServer"
},
"type": "string",
"id": "clouderamanager-server"
}
]
}
}
]
}
]
This blob lists the choices available to the API/UI user when creating a cluster.
Each choice has some number of valid selections (either Boolean or multiple-choice)
that can be provided to satisfy that choice. A given selection can then contain a
nested config, as described previously.
In this example, one choice describes whether or not to activate HBase support. Another describes the choice between using MRv1 or YARN. If YARN is selected, then there is a further choice as to whether to activate YARN and HDFS High Availability.
Each of these choices activates certain roles for deployment and selects certain services to be present on nodes of given roles.
This structure is fairly generic; however, EPIC constrains the choices to those currently defined among the various Catalog entries provided as part of the EPIC release. Please contact BlueData support if you wish to define choices in a Catalog entry that you are authoring.
The properties of each choice object are:
- id is a choice identifier. It can be referenced by the setup scripts (which can see all choice selections made for cluster creation). Each selection object must contain an id property that is the selection value. The possible values for this property are limited to the set of choices present in the Catalog provided with the EPIC release.
- type describes the selection value type. This property may have one of the following values:
  - boolean: Selection values are either true or false. This selection type does not require a label.
  - multi: Selection values are a defined set of strings. This selection type must have a label object that describes the selection. This object includes a required name and an optional description, which will be used by future EPIC versions to drive various interface behaviors.
- selections lists the valid selections for this choice. A selection may include an optional preferred property. If this is set to true, the EPIC interface will default to this selection value when presenting the choice. A selection may contain an optional nested config object that describes the configuration activated by the selection.
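Putting the type, selections, and preferred properties together, the following sketch shows how an interface might validate a selection and fall back to the preferred default. It is illustrative only, not EPIC's implementation:

```python
# Sketch: validate a user's selection against a choice object, defaulting to
# the selection marked "preferred" (or the first selection) when none is given.
def pick_selection(choice, selected=None):
    valid = [s["id"] for s in choice["selections"]]
    if selected is None:
        preferred = [s["id"] for s in choice["selections"] if s.get("preferred")]
        selected = preferred[0] if preferred else valid[0]
    if selected not in valid:
        raise ValueError(f"invalid selection {selected!r} for choice {choice['id']}")
    return selected

# The mrtype choice from this entry, abbreviated.
mrtype = {
    "id": "mrtype",
    "type": "multi",
    "selections": [
        {"id": "mrv1", "label": {"name": "MRv1"}},
        {"id": "yarn", "label": {"name": "YARN"}, "preferred": True},
    ],
}

print(pick_selection(mrtype))          # yarn  (preferred default)
print(pick_selection(mrtype, "mrv1"))  # mrv1
```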