Each App Store image includes metadata contained inside a JSON file that specifies various interface and configuration options. Some of this metadata is visible in the App Store screen and/or the Create Cluster and Create New Job screens, and is described in the Interface Metadata section of this article. The interface metadata, along with other application configuration metadata, is contained inside the Catalog JSON file that is described in the Catalog JSON File section of this article.
The interface-related metadata used by the App Store interface is described in the sections below.
This article uses the CDH 5.4.3 with Cloudera Manager Catalog entry as an
example for explaining the EPIC Catalog (App Store) entry JSON properties.
The cdh54CM.json file is located in the /opt/bluedata/catalog/entries/system directory.
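A Catalog entry is ordinary JSON and can be inspected with standard tools. The sketch below parses a minimal, hypothetical fragment modeled on the identification properties described later in this article (it does not read the actual cdh54CM.json file):

```python
import json

# A minimal, hypothetical Catalog entry fragment modeled on cdh54CM.json;
# the field names match those described in this article.
entry_json = """
{
  "distro_id": "cdh54CM",
  "label": {
    "name": "CDH 5.4.3 with Cloudera Manager",
    "description": "CDH 5.4.3 with MRv1/YARN and HBase support."
  },
  "version": "2.0.1",
  "epic_compatible_versions": ["3.4"],
  "categories": ["Hadoop", "HBase"]
}
"""

entry = json.loads(entry_json)
print(entry["distro_id"], entry["version"])  # cdh54CM 2.0.1
print(", ".join(entry["categories"]))        # Hadoop, HBase
```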
Catalog entry properties can be broadly segregated by purpose, as described in the following sections.
The identification blob appears as follows:
"distro_id": "cdh54CM",
"label": {
"name": "CDH 5.4.3 with Cloudera Manager",
"description": "CDH 5.4.3 with MRv1/YARN and HBase support. Includes Pig, Hive, Hue and Spark."
},
"version": "2.0.1",
"epic_compatible_versions": ["3.4"],
"categories": [ "Hadoop", "HBase" ],
In this blob:
- distro_id is a unique identifier for either a Catalog entry or a versioned set of Catalog entries. It represents a particular application or application-framework setup as created and maintained by a particular author or organization. The EPIC interface and API currently allow only one Catalog entry with a given distro ID to be installed for use at any given time. Each distro ID corresponds to one "tile" in the Images tab of the App Store screen. EPIC may also reference the distro ID when determining appropriate Add-On image entries that can be added to a cluster, because an add-on may have a distro ID requirement.
- label contains the following parameters:
  - name, which is the "short name" of the Catalog entry. The Catalog API does not allow entries with different distro IDs to share the same name.
  - description, which is a longer, more detailed blurb about the entry.
- version is a discriminator between multiple Catalog entries that share the same distro ID. It is expected to adhere to a simple pattern of digits separated by dots in the format a.b.c, where:
  - a.b is the version number, such as the 2.0 in "version": "2.0.1". You may assign any version you want to the Catalog entry, and each Catalog entry will have its own unique distro ID. This version represents iterations of this Catalog entry; it does not necessarily represent the version of any software deployed in a cluster. For example, you may have a CDH 5.4 Catalog entry that you deploy as Version 1.0 followed by 1.1, 2.0, etc. EPIC installs the newest available version of a given distro ID when instructed to install or upgrade that distro ID.
  - c is the optional build number, such as the 1 in "version": "2.0.1". App Workbench stores the first value used for c when the distro ID is created. Future versions of the same distro ID will automatically increment the build number based on the last value stored in the system, provided that you do not change the c value in the JSON file. In this example, the first-ever build of this distro ID will be version 2.0.1, the next version will be 2.0.2, and so forth. Manually entering a new build number that is equal to or less than the stored build value will not have any effect until you change the version number by modifying the a and/or b values, such as by moving from version 2.0.1 to version 2.1.1 or 3.0.1. Manually entering a new build number that is higher than the stored build value will increment the build number to the new value. For example, if the stored build value is 5 and you enter a build number that is less than or equal to 5, then the next build number will be 6; however, if the stored build value is 5 and you enter a build number of 10, then the next build number will be 10 and will increment from there.
- epic_compatible_versions lists the EPIC platform versions where this Catalog entry may be used. An asterisk (*) may be used in a version string as a wildcard.
- categories is a list of strings used by the EPIC interface to group Catalog entries during cluster creation. These values appear in the Select Cluster Type pull-down menu.

The components blob appears as follows:
"image": {
"checksum": "b07e8cfea8a9c1a6cdc6990b1da29b9f",
"import_url": "http://s3.amazonaws.com/bluedata-vmimages/Cloudera-CDH-CM-5.4.3-v2.tgz"
},
"setup_package": {
"checksum": "7560c8841c1400e0e4a4ba3dac1ba8d7",
"import_url": "http://s3.amazonaws.com/bluedata-vmimages/cdh5-cm-setup.tgz"
},
In this blob:
image is a property that identifies the location for the
image used to launch virtual nodes for this Catalog entry. In EPIC versions 2.0
and above, this will be an image for launching a Docker container. This location
can be specified in either of two ways: import_url,
which is the http (not https) URL from which the image can be
downloaded. This must be accompanied by the checksum,
which is the MD5 checksum of the image. This method is used for normal
Catalog entry distribution. The image will be downloaded into the images
download cache directory when the entry is installed, and the downloaded
image may be automatically deleted in certain garbage-collection
situations when the Catalog entry is not in use and not present in any
Catalog feed.source_file/opt/bluedata/catalog/images/). Only the file system
is necessary, not the complete path. No checksum is provided in this
case. This method is used for either development or site-local entries.
In this case, EPIC will never automatically download the designated
image file.setup_package is similar to the image property except for the
configuration scripts package that runs inside the launched virtual node. In
this case, the download cache directory is /opt/bluedata/catalog/guestconfig.The services blob appears as follows:
"services": [
{
"id": "hbase_master",
"exported_service": "hbase",
"label": {
"name": "HMaster"
},
"endpoint" : {
"url_scheme" : "http",
"port" : "60010",
"path" : "/",
"is_dashboard" : true
}
},
{
"id": "hbase_worker",
"label": {
"name": "HRegionServer"
},
"endpoint" : {
"url_scheme" : "http",
"port" : "60030",
"path" : "/",
"is_dashboard" : true
}
},
{
"id": "hbase_thrift",
"label": {
"name": "HBase Thrift service."
}
},
...
],
In this example, services is a list of service objects. The
defined services will be referenced by other elements of this JSON file to determine
which services are active on which nodes within the cluster. That information will
then be used to compose the cluster's service endpoints and dashboard links.
Setup scripts also use service identifiers to register those services with vAgent, so that necessary services can be properly started and restarted along with the virtual node. Setup scripts can also choose to wait for a vAgent-registered service to be active on a node in order to coordinate multi-node setup across the cluster. A common example of such a registered service is sshd.
In this blob:
- id is an identifier that must be unique within the scope of this JSON file. It is used by other objects in this file to reference this service. It is also used in the setup scripts when composing a key for registering a service with vAgent, or when waiting on a registered service to start.
- exported_service is an optional property that has an agreed-by-convention value for a service that is referenced from outside the cluster. This property can have an optional qualifiers list of descriptive qualifiers for that exported service, again with agreed-by-convention values. qualifiers may only be defined if exported_service is defined.
- label uses the same format as the entry's label:
  - name, which briefly describes the service. This property is currently used only when composing clickable service-dashboard links in the EPIC interface; however, it is required for all services.
  - description, which is an optional property with more details.
- endpoint describes the network endpoint of the service:
  - auth_token: true|false indicates whether (true) or not (false) the endpoint requires an authentication token.
  - is_dashboard is a Boolean property of the endpoint that indicates whether this is a URL that can (and should) be viewed from a web browser, such as in the EPIC interface.
  - The url_scheme, port, and path properties of this object are used to compose a service URL. These properties have the following constraints: url_scheme must be defined if is_dashboard is true; port must be defined; path is optional.
  - Defining an endpoint object triggers the creation of a NAT port mapping for this service if EPIC is running inside an EC2 instance.
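As an illustration of how the url_scheme, port, and path properties combine, the following sketch composes a dashboard URL for the hbase_master endpoint shown earlier. The host name is hypothetical, and this is not EPIC's actual implementation:

```python
# Sketch only: compose a service URL from an endpoint object, per the
# url_scheme/port/path rules described above. The host name is hypothetical.
def service_url(host, endpoint):
    scheme = endpoint.get("url_scheme", "http")  # required when is_dashboard is true
    path = endpoint.get("path", "")              # path is optional
    return f"{scheme}://{host}:{endpoint['port']}{path}"  # port must be defined

hbase_master_endpoint = {
    "url_scheme": "http",
    "port": "60010",
    "path": "/",
    "is_dashboard": True,
}

print(service_url("node1.example.com", hbase_master_endpoint))
# http://node1.example.com:60010/
```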
The node_roles blob appears as follows:
"node_roles": [
{
"id": "controller",
"cardinality": "1",
"anti_affinity_group_id": "CM",
"min_cores": "4",
"min_memory": "12288"
},
{
"id": "standby",
"cardinality": "1",
"anti_affinity_group_id": "CM"
},
{
"id": "arbiter",
"cardinality": "1",
"anti_affinity_group_id": "CM"
},
{
"id": "worker",
"cardinality": "1+"
}
],
In this example, node_roles is a list of objects describing roles
that may be deployed for this Catalog entry. Each role is a particular configuration
instantiated from the entry's virtual node image and configured by the setup
scripts. The configuration associated with a particular role is broadly left up to
the setup scripts, and thus varies widely from entry to entry; however, there are
certain constraints and semantics associated with specific roles in the current EPIC
release (for non-Add-On entries):
- The recognized role IDs are controller, worker, standby, and arbiter. If applicable, these roles will be created using the Master Node Flavor specified in the EPIC interface when you create the cluster.
- Jobs are submitted to the controller-role node. If the cluster also includes a standby-role node, then that standby will be tried as an alternate target for job submission if the Controller node is unresponsive.
- Only the worker role is allowed to have scale-out cardinality (see below); the worker role MUST have scale-out cardinality.
- A cluster may include worker, standby, and arbiter nodes. Cluster expansion will increase the number of worker nodes.

The properties of each role object are:
- id is an identifier that must be unique within the scope of this JSON file. It is used by other objects in this file to reference this role. It is also used by EPIC as described above, and may also be referenced by the setup scripts.
- cardinality describes the number of nodes in this role that will be deployed, if/when this role is selected for use in a cluster. If the cardinality string consists of just an integer, then a fixed number of nodes will be deployed for this role. If the cardinality string is an integer followed by +, then a variable number of nodes may be deployed in this role, with the integer as the minimum. This kind of value is referred to as a "scale-out" cardinality.
- anti_affinity_group_id, if it has a specified value, causes nodes deployed from this role and/or from any other role with the same anti_affinity_group_id to be placed on different physical hosts. If this constraint cannot be satisfied, then the cluster creation/expansion will be rejected. Anti-affinity is typically used to reduce the physical resources shared by a set of nodes, making it less likely for a single physical fault to affect them all. This constraint only applies to nodes within a given cluster; anti-affinity is not enforced among nodes from different clusters.
- min_cores is an optional property that specifies a minimum number of virtual cores that must be provided by the flavor used to deploy this role.
- min_memory is an optional property that specifies a minimum memory size that must be met by the flavor used to deploy this role.

The configuration blob appears as follows:
"config": {
"selected_roles": [
...
],
"node_services": [
...
],
"config_meta": [
...
],
"config_choices": [
...
]
},
The remainder of the JSON file describes which node roles will be deployed into the cluster, and which services will be present on any node with a given role. This information may depend on choices provided by the UI/API user when they are creating the cluster.
- selected_roles lists IDs of roles that will be deployed.
- node_services lists IDs of services that will be present on nodes of a given role, if that role is deployed.
- config_meta lists string key/value pairs that can be referenced by the setup scripts.
- config_choices lists both the choices available to the UI/API user and the possible selections for each choice. This is a potentially recursive data structure, in that a selection may include another config object, which in turn may contain selected_roles/node_services/config_meta/config_choices properties.

This structure means that the top-level selected_roles, node_services, and config_meta property values will apply regardless of any user-provided input about choice selections. User-provided input may then have consequences such as activating additional roles and/or services in the cluster, and/or adding more elements to the config_meta KV store.

For example, in the CDH 5.4.3 JSON:
- The mrtype choice has two valid selections: mrv1 and yarn. If yarn is selected for the mrtype choice, then:
  - The controller and worker roles are selected for deployment.
  - The yarn_rm and job_history_server services are selected to be present on the controller role node.
  - The yarn_nm service is selected to be present on the worker role nodes.
  - The yarn_nm service is also selected to be present on the standby and arbiter role nodes.
- A nested yarn_ha choice is then enabled, with valid selections true or false. If true is selected for yarn_ha, then:
  - The controller, standby, arbiter, and worker roles must be defined.
  - The zookeeper service is selected to be present on the controller, standby, and arbiter role nodes.
  - The yarn_rm and hdfs_rm services are selected to be present on the standby role node.

The selected_roles blob appears as follows:
"selected_roles": [
"controller",
"standby",
"arbiter",
"worker"
],
The value of the selected_roles property is a list of role IDs.
The example shown above is taken from the choice selection that activates HBase
support.
At the top level of this entry's config, the selected_roles property is an empty list; no roles at all will be activated unless the user provides some input (choice selections). This is a valid arrangement and reflects the fact that, for this Catalog entry, some choices must be made before any usable application framework can be provided in this cluster. By contrast, some other Catalog entries have roles and services that are always selected.
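The relationship between an empty top-level selected_roles list and choice-driven activation can be sketched as follows. This illustrates the recursive config structure only; it is not EPIC's actual resolver, and the choice data is abbreviated from this entry:

```python
# Sketch: walk a config object and its nested config_choices, collecting the
# role IDs activated by a given set of choice selections. Illustrative only.
def resolve_roles(config, selections):
    roles = set(config.get("selected_roles", []))
    for choice in config.get("config_choices", []):
        selected_id = selections.get(choice["id"])
        for sel in choice.get("selections", []):
            if sel["id"] == selected_id and "config" in sel:
                roles |= resolve_roles(sel["config"], selections)
    return roles

config = {
    "selected_roles": [],  # empty at the top level, as in this entry
    "config_choices": [
        {"id": "hbase", "selections": [
            {"id": False},
            {"id": True, "config": {"selected_roles": ["controller", "standby", "arbiter", "worker"]}},
        ]},
        {"id": "mrtype", "selections": [
            {"id": "mrv1", "config": {"selected_roles": ["controller", "worker"]}},
            {"id": "yarn", "config": {"selected_roles": ["controller", "worker"]}},
        ]},
    ],
}

print(sorted(resolve_roles(config, {"hbase": False, "mrtype": "yarn"})))
# ['controller', 'worker']
```

With no selections at all, no roles are resolved, matching the empty top-level list described above.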
The node_services blob appears as follows:
"node_services": [
{
"role_id": "controller",
"service_ids": [ "ganglia", "ganglia_api", "ssh", "gmetad", "gmond", "httpd" ]
},
{
"role_id": "standby",
"service_ids": [ "ssh", "gmond" ]
},
{
"role_id": "arbiter",
"service_ids": [ "ssh", "gmond" ]
},
{
"role_id": "worker",
"service_ids": [ "ssh", "gmond" ]
}
],
Each element of this list is a node_services object that describes
the services available on a given role. The role may or may not be selected; this
data structure simply indicates that if a certain role is selected (according to
choice selections), then these are the services a node with that role will provide.
The top-level node_services in this example Catalog entry are all
of the ancillary services that don't depend on choices like HBase support or MR
type.
The properties of each node_services object are:
- role_id references the value of the id property of a node_role object defined within this same Catalog entry JSON.
- service_ids is a list of id values of service objects defined within this same Catalog entry JSON.

The config_meta blob appears as follows:
"config_meta": {
"streaming_jar": "/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar",
"impala_jar_version": "0.1-SNAPSHOT",
"cdh_major_version": "CDH5",
"cdh_full_version": "5.4.3",
"cdh_parcel_version": "5.4.3-1.cdh5.4.3.p0.6",
"cdh_parcel_repo": "http://archive.cloudera.com/cdh5/parcels/5.4.3"
},
In this example, config_meta is a key-value store. These values
are only used by the scripts in the guest package and are thus completely opaque to
EPIC. These values may be referenced during node setup. For example, the streaming_jar value is conventionally referenced by the script
that runs Hadoop Streaming jobs.
Choice selections may cause the definition of multiple config_meta
lists that together form the KV store visible to the in-guest scripts. To avoid
confusion, key conflicts are not allowed. For example, it is legal for mutually
exclusive choice selections to define different values for a key, but it is not
legal for the same key to be defined more than once when composing the KV store that
results from a particular set of choice selections.
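The key-conflict rule can be sketched as a simple merge with duplicate detection. This is an illustration only, not EPIC's implementation; the mr_framework key is hypothetical:

```python
# Sketch of the key-conflict rule: config_meta maps contributed by the active
# choice selections merge into one KV store, and a key defined more than once
# in that merged view is rejected. The "mr_framework" key is hypothetical.
def merge_config_meta(*metas):
    merged = {}
    for meta in metas:
        for key, value in meta.items():
            if key in merged:
                raise ValueError(f"duplicate config_meta key: {key}")
            merged[key] = value
    return merged

top_level = {"cdh_major_version": "CDH5", "cdh_full_version": "5.4.3"}
yarn_meta = {"mr_framework": "yarn"}  # hypothetical key from a choice selection

print(merge_config_meta(top_level, yarn_meta))
```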
The config_choices blob appears as follows:
"config_choices": [
{
"id": "hbase",
"type": "boolean",
"label": {
"name": "HBase"
},
"selections": [
{
"id": false
},
{
"id": true,
"config": {
...
}
}
]
},
{
"id": "mrtype",
"type": "multi",
"label": {
"name": "MR Type"
},
"selections": [
{
"id": "mrv1",
"label": {
"name": "MRv1"
},
"config": {
...
}
},
{
"id": "yarn",
"label": {
"name": "YARN"
},
"preferred": true,
"config": {
"selected_roles": [
"controller",
"worker"
],
"node_services": [
...
],
"config_choices": [
{
"id": "yarn_ha",
"type": "boolean",
"label": {
"name": "YARN and HDFS High Availability"
},
"selections": [
{
"id": false
},
{
"id": true,
"config": {
...
}
}
]
},
{
"label": {
"name": "CLouderaManagerServer"
},
"type": "string",
"id": "clouderamanager-server"
}
]
}
}
]
}
]
This blob lists the choices available to the API/UI user when creating a cluster.
Each choice has some number of valid selections (either Boolean or multiple-choice)
that can be provided to satisfy that choice. A given selection can then contain a
nested config, as described previously.
In this example, one choice describes whether or not to activate HBase support. Another describes the choice between using MRv1 or YARN. If YARN is selected, then there is a further choice as to whether to activate YARN and HDFS High Availability.
Each of these choices activates certain roles for deployment and selects certain services to be present on nodes of given roles.
This structure is fairly generic; however, EPIC constrains the choices to those currently defined among the various Catalog entries provided as part of the EPIC release. Please contact BlueData support if you wish to define choices in a Catalog entry that you are authoring.
The properties of each choice object are:
- id is a choice identifier. It can be referenced by the setup scripts (which can see all choice selections made for cluster creation). Each selection object must contain an id property that is the selection value. The possible values for this property are limited to the set of choices present in the Catalog provided with the EPIC release.
- type describes the selection value type. This property may have one of the following values:
  - boolean: Selection values are either true or false. This selection type does not require a label.
  - multi: Selection values are a defined set of strings. This selection type must have a label object that describes the selection. This object includes a required name and an optional description, which will be used by future EPIC versions to drive various interface behaviors.
- selections lists the valid selections for this choice. A selection may include an optional preferred property. If this is set to true, the EPIC interface will default to this selection value when presenting the choice. A selection may contain an optional nested config object that describes the configuration activated by the selection.
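Putting the type, selections, and preferred properties together, the following sketch shows how an interface might validate a selection and fall back to the preferred default. It is illustrative only, not EPIC's implementation:

```python
# Sketch: validate a user's selection against a choice object, defaulting to
# the selection marked "preferred" (or the first selection) when none is given.
def pick_selection(choice, selected=None):
    valid = [s["id"] for s in choice["selections"]]
    if selected is None:
        preferred = [s["id"] for s in choice["selections"] if s.get("preferred")]
        selected = preferred[0] if preferred else valid[0]
    if selected not in valid:
        raise ValueError(f"invalid selection {selected!r} for choice {choice['id']}")
    return selected

# The mrtype choice from this entry, abbreviated.
mrtype = {
    "id": "mrtype",
    "type": "multi",
    "selections": [
        {"id": "mrv1", "label": {"name": "MRv1"}},
        {"id": "yarn", "label": {"name": "YARN"}, "preferred": True},
    ],
}

print(pick_selection(mrtype))          # yarn  (preferred default)
print(pick_selection(mrtype, "mrv1"))  # mrv1
```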