This document describes the layout of a deployment with cells v2, including
deployment considerations for security and scale and recommended practices and
tips for running and maintaining cells v2 for admins and operators. It is
focused on code present in Pike and later, and while it is geared towards
people who want to have multiple cells for whatever reason, the nature of the
cells v2 support in Nova means that it applies in some way to all deployments.
Overview
The purpose of the cells functionality in nova is to allow larger deployments
to shard their many compute nodes into cells. All nova deployments are by
definition cells deployments, even if most will only ever have a single cell.
This means a multi-cell deployment will not be radically different from a
“standard” nova deployment.
Consider such a deployment. It will consist of the following components:
The nova-api service which provides the external REST API to
users.
The nova-scheduler and placement
services which are
responsible for tracking resources and deciding which compute node instances
should be on.
An “API database” that is used primarily by nova-api and
nova-scheduler (called API-level services below) to track
location information about instances, as well as a temporary location for
instances being built but not yet scheduled.
The nova-conductor service which offloads long-running tasks for
the API-level services and insulates compute nodes from direct database access.
The nova-compute service which manages the virt driver and
hypervisor host.
A “cell database” which is used by API, conductor and compute
services, and which houses the majority of the information about
instances.
A “cell0 database” which is just like the cell database, but
contains only instances that failed to be scheduled. This database mimics a
regular cell, but has no compute nodes and is used only as a place to put
instances that fail to land on a real compute node (and thus a real cell).
A message queue which allows the services to communicate with each
other via RPC.
In smaller deployments, there will typically be a single message queue that all
services share and a single database server which hosts the API database, a
single cell database, as well as the required cell0 database. Because we only
have one “real” cell, we consider this a “single-cell deployment”.
In larger deployments, we can opt to shard the deployment using multiple cells.
In this configuration there will still only be one global API database but
there will be a cell database (where the bulk of the instance information
lives) for each cell, each containing a portion of the instances for the entire
deployment within, as well as per-cell message queues and per-cell
nova-conductor instances. There will also be an additional
nova-conductor instance, known as a super conductor, to handle
API-level operations.
In these larger deployments, each of the nova services will use a cell-specific
configuration file, all of which will at a minimum specify a message queue
endpoint (i.e. transport_url). Most of the services will also contain database
connection configuration information (i.e. database.connection), while
API-level services that need access to the global routing and placement
information will also be configured to reach the API database
(i.e. api_database.connection).
API-level services need to be able to contact other services in all of
the cells. Since they only have one configured transport_url and
database.connection, they look up the information for the other cells
in the API database, with records called cell mappings.
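As an illustrative sketch (the hostnames and credentials below are placeholders,
not values taken from this guide), a cell-level service such as nova-conductor
or nova-compute might be configured with only its own cell’s message queue and
database:
[DEFAULT]
transport_url = rabbit://nova:rabbitpass@cell1-mq:5672/

[database]
connection = mysql+pymysql://nova:dbpass@cell1-db/nova_cell1?charset=utf8
An API-level service would additionally carry the API database connection:
[api_database]
connection = mysql+pymysql://nova:dbpass@api-db/nova_api?charset=utf8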
Note
The API database must have cell mapping records that match the transport_url
and database.connection configuration options of the lower-level services. See
the nova-manage Cells v2 Commands for more information about how to create and
examine these records.
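For example, the existing cell mapping records can be examined with:
$ nova-manage cell_v2 list_cells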
The following section goes into more detail about the difference between
single-cell and multi-cell deployments.
Service layout
The services generally have a well-defined communication pattern that
dictates their layout in a deployment. In a small/simple scenario, the
rules do not have much of an impact as all the services can
communicate with each other on a single message bus and in a single
cell database. However, as the deployment grows, scaling and security
concerns may drive separation and isolation of the services.
Single cell
This is a diagram of the basic services that a simple (single-cell) deployment
would have, as well as the relationships (i.e. communication paths) between
them:
digraph services {
graph [pad="0.35", ranksep="0.65", nodesep="0.55", concentrate=true];
node [fontsize=10 fontname="Monospace"];
edge [arrowhead="normal", arrowsize="0.8"];
labelloc=bottom;
labeljust=left;
{ rank=same
api [label="nova-api"]
apidb [label="API Database" shape="box"]
scheduler [label="nova-scheduler"]
}
{ rank=same
mq [label="MQ" shape="diamond"]
conductor [label="nova-conductor"]
}
{ rank=same
cell0db [label="Cell0 Database" shape="box"]
celldb [label="Cell Database" shape="box"]
compute [label="nova-compute"]
}
api -> mq -> compute
conductor -> mq -> scheduler
api -> apidb
api -> cell0db
api -> celldb
conductor -> apidb
conductor -> cell0db
conductor -> celldb
}
All of the services are configured to talk to each other over the same
message bus, and there is only one cell database where live instance
data resides. The cell0 database is present (and required) but as no
compute nodes are connected to it, this is still a “single cell”
deployment.
Multiple cells
In order to shard the services into multiple cells, a number of things
must happen. First, the message bus must be split into pieces along
the same lines as the cell database. Second, a dedicated conductor
must be run for the API-level services, with access to the API
database and a dedicated message queue. We call this super conductor
to distinguish its place and purpose from the per-cell conductor nodes.
digraph services2 {
graph [pad="0.35", ranksep="0.65", nodesep="0.55", concentrate=true];
node [fontsize=10 fontname="Monospace"];
edge [arrowhead="normal", arrowsize="0.8"];
labelloc=bottom;
labeljust=left;
subgraph api {
api [label="nova-api"]
scheduler [label="nova-scheduler"]
conductor [label="super conductor"]
{ rank=same
apimq [label="API MQ" shape="diamond"]
apidb [label="API Database" shape="box"]
}
api -> apimq -> conductor
api -> apidb
conductor -> apimq -> scheduler
conductor -> apidb
}
subgraph clustercell0 {
label="Cell 0"
color=green
cell0db [label="Cell Database" shape="box"]
}
subgraph clustercell1 {
label="Cell 1"
color=blue
mq1 [label="Cell MQ" shape="diamond"]
cell1db [label="Cell Database" shape="box"]
conductor1 [label="nova-conductor"]
compute1 [label="nova-compute"]
conductor1 -> mq1 -> compute1
conductor1 -> cell1db
}
subgraph clustercell2 {
label="Cell 2"
color=red
mq2 [label="Cell MQ" shape="diamond"]
cell2db [label="Cell Database" shape="box"]
conductor2 [label="nova-conductor"]
compute2 [label="nova-compute"]
conductor2 -> mq2 -> compute2
conductor2 -> cell2db
}
api -> mq1 -> conductor1
api -> mq2 -> conductor2
api -> cell0db
api -> cell1db
api -> cell2db
conductor -> cell0db
conductor -> cell1db
conductor -> mq1
conductor -> cell2db
conductor -> mq2
}
It is important to note that services in the lower cell boxes only
have the ability to call back to the placement API but cannot access
any other API-layer services via RPC, nor do they have access to the
API database for global visibility of resources across the cloud.
This is intentional and provides security and failure domain
isolation benefits, but also has impacts on some things that would
otherwise require this any-to-any communication style. Check
Operations requiring upcalls below for the most up-to-date information
about any caveats that may be present due to this limitation.
Usage
As noted previously, all deployments are now effectively cells v2 deployments.
As a result, setup of any nova deployment - even those that intend to only have
one cell - will involve some level of cells configuration. These changes are
configuration-related, involving both the main nova configuration file and some
extra records in the databases.
All nova deployments must now have the following databases available
and configured:
The “API” database
One special “cell” database called “cell0”
One (or eventually more) “cell” databases
Thus, a small nova deployment will have an API database, a cell0, and
what we will call here a “cell1” database. High-level tracking
information is kept in the API database. Instances that are never
scheduled are relegated to the cell0 database, which is effectively a
graveyard of instances that failed to start. All successful/running
instances are stored in “cell1”.
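If these databases do not exist yet, they can be created up front on the
database server. As a minimal sketch, assuming MySQL and the database names
used in the examples that follow:
CREATE DATABASE nova_api;
CREATE DATABASE nova_cell0;
CREATE DATABASE nova;
The appropriate GRANT statements for the nova database user are also required
but are omitted here.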
Note
Since Nova services make use of both the configuration file and some database
records, starting or restarting those services with an incomplete
configuration could lead to an incorrect deployment. Only restart the services
once you have completed the steps described below.
Note
The following examples show the full expanded command line usage of the setup
commands. This is to make it easier to visualize which of the various URLs are
used by each of the commands. However, you should be able to put all of that
in the config file and nova-manage will use those values. If need be, you can
create separate config files and pass them as nova-manage --config-file
foo.conf to control the behavior without specifying things on the command
line.
Configuring a new deployment
If you are installing Nova for the first time and have no compute hosts in the
database yet then it will be necessary to configure cell0 and at least one
additional “real” cell. To begin, ensure your API database schema has been
populated using the nova-manage api_db sync command. Ensure the connection
information for this database is stored in the nova.conf file using the
api_database.connection config option:
[api_database]
connection = mysql+pymysql://root:secretmysql@dbserver/nova_api?charset=utf8
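With that connection configured, the API schema can be applied; nova-manage
reads api_database.connection from the config file, so no extra arguments
should be needed:
$ nova-manage api_db sync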
Since there may be multiple “cell” databases (and in fact everyone
will have cell0 and cell1 at a minimum), connection info for these is
stored in the API database. Thus, the API database must exist, and your
configuration must provide information on how to connect to it, before
continuing to the steps below, so that nova-manage can find your other
databases.
Next, we will create the necessary records for the cell0 database. To
do that we will first use nova-manage cell_v2 map_cell0 to create
and map cell0. For example:
$ nova-manage cell_v2 map_cell0 \
--database_connection mysql+pymysql://root:secretmysql@dbserver/nova_cell0?charset=utf8
Note
If you don’t specify --database_connection then the commands will use the
database.connection value from your config file and mangle the database name
to have a _cell0 suffix.
Warning
If your databases are on separate hosts then you should specify
--database_connection or make certain that the nova.conf being used has the
database.connection value pointing to the same user/password/host that will
work for the cell0 database.
If the cell0 mapping was created incorrectly, it can be deleted
using the nova-manage cell_v2 delete_cell command before running
nova-manage cell_v2 map_cell0 again with the proper database
connection value.
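For example, assuming you have already looked up the cell UUID with
nova-manage cell_v2 list_cells:
$ nova-manage cell_v2 delete_cell --cell_uuid <cell0_uuid>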
We will then use nova-manage db sync to apply the database schema to
this new database. For example:
$ nova-manage db sync \
--database_connection mysql+pymysql://root:secretmysql@dbserver/nova_cell0?charset=utf8
Since no hosts are ever in cell0, nothing further is required for its setup.
Note that all deployments only ever have one cell0, as it is special, so once
you have done this step you never need to do it again, even if you add more
regular cells.
Now, we must create another cell which will be our first “regular”
cell, which has actual compute hosts in it, and to which instances can
actually be scheduled. First, we create the cell record using
nova-manage cell_v2 create_cell. For example:
$ nova-manage cell_v2 create_cell \
--name cell1 \
--database_connection mysql+pymysql://root:secretmysql@127.0.0.1/nova?charset=utf8 \
--transport-url rabbit://stackrabbit:secretrabbit@mqserver:5672/
Note
It is a good idea to specify a name for the new cell you create so you can
easily look up cell UUIDs with the nova-manage cell_v2 list_cells
command later if needed.
Note
The nova-manage cell_v2 create_cell command will print the UUID
of the newly-created cell if --verbose
is passed, which is useful if you
need to run commands like nova-manage cell_v2 discover_hosts
targeted at a specific cell.
At this point, the API database can now find the cell database, and further
commands will attempt to look inside. If this is a completely fresh database
(such as if you’re adding a cell, or if this is a new deployment), then you
will need to run nova-manage db sync on it to initialize the
schema.
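For example, using the same connection string that was given to create_cell
above:
$ nova-manage db sync \
--database_connection mysql+pymysql://root:secretmysql@127.0.0.1/nova?charset=utf8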
Now we have a cell, but no hosts are in it which means the scheduler will never
actually place instances there. The next step is to scan the database for
compute node records and add them into the cell we just created. For this step,
you must have had a compute node started such that it registers itself as a
running service. You can identify this using the openstack compute
service list command:
$ openstack compute service list --service nova-compute
Once that has happened, you can scan and add it to the cell using the
nova-manage cell_v2 discover_hosts command:
$ nova-manage cell_v2 discover_hosts
This command will connect to any databases for which you have created cells (as
above), look for hosts that have registered themselves there, and map those
hosts in the API database so that they are visible to the scheduler as
available targets for instances. Any time you add more compute hosts to a cell,
you need to re-run this command to map them from the top-level so they can be
utilized. You can also configure a periodic task to have Nova discover new
hosts automatically by setting the scheduler.discover_hosts_in_cells_interval
option to a time interval in seconds. The periodic task is run by the
nova-scheduler service, so you must be sure to configure it on all of your
nova-scheduler hosts.
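For example, to have the scheduler look for new hosts every five minutes:
[scheduler]
discover_hosts_in_cells_interval = 300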
Note
In the future, whenever you add new compute hosts, you will need to run the
nova-manage cell_v2 discover_hosts command after starting them to map them to
the cell if you did not configure automatic host discovery using
scheduler.discover_hosts_in_cells_interval.
Adding a new cell to an existing deployment
You can add additional cells to your deployment using the same steps used above
to create your first cell. We can create a new cell record using
nova-manage cell_v2 create_cell. For example:
$ nova-manage cell_v2 create_cell \
--name cell2 \
--database_connection mysql+pymysql://root:secretmysql@127.0.0.1/nova?charset=utf8 \
--transport-url rabbit://stackrabbit:secretrabbit@mqserver:5672/
Note
It is a good idea to specify a name for the new cell you create so you can
easily look up cell UUIDs with the nova-manage cell_v2 list_cells
command later if needed.
Note
The nova-manage cell_v2 create_cell command will print the UUID
of the newly-created cell if --verbose
is passed, which is useful if you
need to run commands like nova-manage cell_v2 discover_hosts
targeted at a specific cell.
You can repeat this step for each cell you wish to add to your deployment. Your
existing cell database will be re-used - this simply informs the top-level API
database about your existing cell databases.
Once you’ve created your new cell, use nova-manage cell_v2 discover_hosts to
map compute hosts to cells. This is only necessary if you haven’t enabled
automatic discovery using the scheduler.discover_hosts_in_cells_interval
option. For example:
$ nova-manage cell_v2 discover_hosts
Note
This command will search for compute hosts in each cell database and map them
to the corresponding cell. This can be slow, particularly for larger
deployments. You may wish to specify the --cell_uuid option, which will limit
the search to a specific cell. You can use the nova-manage cell_v2 list_cells
command to look up cell UUIDs if you are going to specify --cell_uuid.
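For example, to limit the search to a single cell (the UUID is a placeholder):
$ nova-manage cell_v2 discover_hosts --cell_uuid <cell_uuid>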
Finally, run the nova-manage cell_v2 map_instances command to map
existing instances to the new cell(s). For example:
$ nova-manage cell_v2 map_instances
Note
This command will search for instances in each cell database and map them to
the correct cell. This can be slow, particularly for larger deployments. You
may wish to specify the --cell_uuid option, which will limit the search to a
specific cell. You can use the nova-manage cell_v2 list_cells command to look
up cell UUIDs if you are going to specify --cell_uuid.
Note
The --max-count option can be specified if you would like to limit the number
of instances to map in a single run. If --max-count is not specified, all
instances will be mapped. Repeated runs of the command will start from where
the last run finished so it is not necessary to increase --max-count to
finish. An exit code of 0 indicates that all instances have been mapped. An
exit code of 1 indicates that there are remaining instances that need to be
mapped.
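For example, to map at most 1000 instances per run for a specific cell (the
UUID is a placeholder):
$ nova-manage cell_v2 map_instances --cell_uuid <cell_uuid> --max-count 1000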
Template URLs in Cell Mappings
Starting in the 18.0.0 (Rocky) release, the URLs provided in the cell mappings
for --database_connection and --transport-url can contain variables which are
evaluated each time they are loaded from the database, and the values of which
are taken from the corresponding base options in the host’s configuration
file. The base URL is parsed and the following elements may be substituted into
the cell mapping URL (using rabbit://bob:s3kret@myhost:123/nova?sync=true#extra):
Cell Mapping URL Variables

Variable    Meaning                                                            Part of example URL
scheme      The part before the ://                                            rabbit
username    The username part of the credentials                               bob
password    The password part of the credentials                               s3kret
hostname    The hostname or address                                            myhost
port        The port number (must be specified)                                123
path        The “path” part of the URL (without leading slash)                 nova
query       The full query string arguments (without leading question mark)    sync=true
fragment    Everything after the first hash mark                               extra
Variables are provided in curly brackets, like {username}. A simple template
of rabbit://{username}:{password}@otherhost/{path} will generate a full URL of
rabbit://bob:s3kret@otherhost/nova when used with the above example.
Note
The database.connection and transport_url values are not reloaded from the
configuration file during a SIGHUP, which means that a full service restart
will be required to notice changes in a cell mapping record if variables are
changed.
Note
The transport_url option can contain an extended syntax for the “netloc” part
of the URL (i.e. userA:passwordA@hostA:portA,userB:passwordB@hostB:portB). In
this case, substitutions of the form username1, username2, etc will be honored
and can be used in the template URL.
The templating of these URLs may be helpful in order to provide each service host
with its own credentials for, say, the database. Without templating, all hosts
will use the same URL (and thus credentials) for accessing services like the
database and message queue. By using a URL with a template that results in the
credentials being taken from the host-local configuration file, each host will
use different values for those connections.
Assuming you have two service hosts that are normally configured with the cell0
database as their primary connection, their (abbreviated) configurations would
look like this:
[database]
connection = mysql+pymysql://service1:foo@myapidbhost/nova_cell0
and:
[database]
connection = mysql+pymysql://service2:bar@myapidbhost/nova_cell0
Without cell mapping template URLs, they would still use the same credentials
(as stored in the mapping) to connect to the cell databases. However, consider
template URLs like the following:
mysql+pymysql://{username}:{password}@mycell1dbhost/nova
and:
mysql+pymysql://{username}:{password}@mycell2dbhost/nova
Using the first service and cell1 mapping, the calculated URL that will actually
be used for connecting to that database will be:
mysql+pymysql://service1:foo@mycell1dbhost/nova
Design
Prior to the introduction of cells v2, when a request hit the Nova API for a
particular instance, the instance information was fetched from the database.
The information contained the hostname of the compute node on which the
instance was currently located. If the request needed to take action on the
instance (which it generally would), the hostname was used to calculate the
name of a queue and a message was written there which would eventually find its
way to the proper compute node.
The meat of the cells v2 feature was to split this hostname lookup into two parts
that yielded three pieces of information instead of one. Basically, instead of
merely looking up the name of the compute node on which an instance was
located, we also started obtaining database and queue connection information.
Thus, when asked to take action on instance $foo, we now:
Look up the three-tuple of (database, queue, hostname) for that instance
Connect to that database and fetch the instance record
Connect to the queue and send the message to the proper hostname queue
The above differs from the previous organization in two ways. First, we now
need to do two database lookups before we know where the instance lives.
Second, we need to demand-connect to the appropriate database and queue. Both
of these changes had performance implications, but it was possible to mitigate
them through the use of things like a memcache of instance mapping information
and pooling of connections to database and queue systems. The number of cells
will always be much smaller than the number of instances.
There were also availability implications with the new feature since something
like an instance list, which might query multiple cells, could end up with a
partial result if there is a database failure in a cell. These issues can be
mitigated, as discussed in Handling cell failures. A database failure within a
cell would cause larger issues than a partial list result, so the expectation
is that it would be addressed quickly and cells v2 will handle it by indicating
in the response that the data may not be complete.
Caveats
Note
Many of these caveats have been addressed since the introduction of cells v2
in the 16.0.0 (Pike) release. These are called out below.
Cross-cell move operations
Support for cross-cell cold migration and resize was introduced in the 21.0.0
(Ussuri) release. This is documented in
Cross-cell resize. Prior to this release, it was
not possible to cold migrate or resize an instance from a host in one cell to a
host in another cell.
It is not currently possible to live migrate, evacuate or unshelve an instance
from a host in one cell to a host in another cell.
Console proxies
Starting from the 18.0.0 (Rocky) release, console proxies must be run per cell
because console token authorizations are stored in cell databases. This means
that each console proxy server must have access to the
database.connection
information for the cell database
containing the instances for which it is proxying console access. This
functionality was added as part of the convert-consoles-to-objects spec.
Operations requiring upcalls
If you deploy multiple cells with a superconductor as described above,
computes and cell-based conductors will not have the ability to speak
to the scheduler as they are not connected to the same MQ. This is by
design for isolation, but currently the processes are not in place to
implement some features without such connectivity. Thus, anything that
requires a so-called “upcall” will not function. This impacts the
following:
Instance reschedules during boot and resize (part 1)
Instance affinity reporting from the compute nodes to scheduler
The late anti-affinity check during server create and evacuate
Querying host aggregates from the cell
Attaching a volume and [cinder] cross_az_attach = False
Instance reschedules during boot and resize (part 2)
The first is simple: if you boot an instance and it gets scheduled to a
compute node but fails, it would normally be re-scheduled to another node.
That requires scheduler intervention and thus it will not work in Pike with a
multi-cell layout. If you do not rely on reschedules for covering up transient
compute-node failures, then this will not affect you. To ensure you do not
make futile attempts at rescheduling, you should set scheduler.max_attempts to
1 in nova.conf.
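For example:
[scheduler]
max_attempts = 1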
The second two are related. The summary is that some of the facilities that
Nova has for ensuring that affinity/anti-affinity is preserved between
instances do not function in Pike with a multi-cell layout. If you don’t use
affinity operations, then this will not affect you. To make sure you don’t
make futile attempts at the affinity check, you should set
workarounds.disable_group_policy_check_upcall to True and
filter_scheduler.track_instance_changes to False in nova.conf.
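For example:
[workarounds]
disable_group_policy_check_upcall = True

[filter_scheduler]
track_instance_changes = False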
The fourth was previously only a problem when performing live migrations using
the since-removed XenAPI driver and not specifying --block-migrate. The driver
would attempt to figure out if block migration should be performed based on
source and destination hosts being in the same aggregate. Since aggregate data
had moved to the API database, the cell conductor would not be able to access
the aggregate information and would fail.
The fifth is a problem because when a volume is attached to an instance in the
nova-compute service, and [cinder]/cross_az_attach=False is set in nova.conf,
we attempt to look up the availability zone that the instance is in, which
includes getting any host aggregates that the instance.host is in. Since the
aggregates are in the API database and the cell conductor cannot access that
information, this will fail. In the future this check could be
moved to the nova-api service such that the availability zone between the
instance and the volume is checked before we reach the cell, except in the
case of boot from volume where the nova-compute
service itself creates the volume and must tell Cinder in which availability
zone to create the volume. Long-term, volume creation during boot from volume
should be moved to the top-level superconductor which would eliminate this AZ
up-call check problem.
The sixth is detailed in bug 1781286 and is similar to the first issue.
The issue is that servers created without a specific availability zone
will have their AZ calculated during a reschedule based on the alternate host
selected. Determining the AZ for the alternate host requires an “up call” to
the API DB.