This document describes the Inventory Service, and simultaneously attempts to provide a template from which we can build further Architecture documents.
This document is not meant to be a complete, rigorous treatment; it is only meant to be sufficient for conversion to specifications.
Where this document refers to possible development paths or optional features, it is in no way meant to prefer or recommend them, merely to provide warnings and context.
Context that is not directly modified by this proposal is sometimes provided to justify the proposal or to indicate future marketing or requirements direction.
Puppet already collects more information about its clients than most organizations have about their networks, and we already cache that data as YAML. However, there is currently no built-in way for users to get access to that data, nor any way to rely on it for operations like dynamic grouping. The primary purpose of this proposal is to solve those two issues.
Also, the shared-nothing architecture that Puppet uses has resulted in compromises in how we manage and distribute this inventory information. This central shared service removes the need for those compromises, although the fixes themselves are not specifically in scope.
Additionally, this is a first step toward a CMDB-like service from Puppet Labs. With it in place, and an interface to view it, we can begin exploring an add-on application for a more complete commercial CMDB.
This document describes multiple possible directions once this service exists but should in no way be tied to those directions – the service should be treated as a stand-alone deliverable, and any downstream work should be architected and specified separately.
- Agent – included merely for context. This proposal does not recommend changing anything on the agent.
- Master – receives data from agent and will talk to Inventory Service
- Dashboard – human interface for Puppet ecosystem. Will read data from Inventory Service
- Inventory Service (new) – owns all inventory data
- Facts – inventory data collected from client (generally generated by Facter)
- Catalog – included only for context. Resource graph whose generation requires inventory data
- Node – wrapper object that includes the Fact data and potentially other data
Inventory data is currently sent by every client to a server. The client usually runs an agent process, which collects the data and sends it to a master process. The master uses the ‘yaml’ terminus for Facts to cache the data on the local filesystem by node name. Updating the inventory data also invalidates the cached Node data on the server, forcing a requery of the Node information (and preventing stale data from being used).
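As a concrete illustration of the current caching behavior, the sketch below reads a node’s cached facts from the default yaml terminus layout. The vardir path and node name here are assumptions for illustration only, not part of the proposal.

```ruby
require 'yaml'

# Illustrative sketch only: the yaml terminus caches one file per node,
# named for the node, under the server's vardir. The path below assumes
# a default installation layout; adjust for your deployment.
vardir    = "/var/lib/puppet"
node_name = "agent1.example.com"

facts_file = File.join(vardir, "yaml", "facts", "#{node_name}.yaml")

if File.exist?(facts_file)
  facts = YAML.load_file(facts_file)
  puts "Cached facts for #{node_name}: #{facts.inspect}"
else
  puts "No cached facts for #{node_name}"
end
```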
If there are multiple master processes behind a load balancer, only the master a given client contacts receives that client’s updated facts, so each client must keep talking to that master. Rather than impose strict configuration requirements, the agent uploads the Fact data and requests a catalog in a single ‘get’ call. Because of line length limitations, the facts are compressed, and even the compressed data sometimes exceeds those limits.
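The compression step can be sketched roughly as follows. The exact serialization and encoding used on the wire are assumptions here for illustration, not a specification of the protocol.

```ruby
require 'yaml'
require 'zlib'
require 'cgi'

# Illustrative sketch only: serialize the facts, compress them, and
# encode the result so it can travel as a parameter of the single
# catalog 'get' call. The wire format shown is an assumption.
facts = { "hostname" => "agent1", "operatingsystem" => "Debian" }

serialized = facts.to_yaml
compressed = Zlib::Deflate.deflate(serialized)
encoded    = CGI.escape([compressed].pack("m0"))  # base64, then URL-escape

puts "raw: #{serialized.bytesize} bytes, encoded: #{encoded.bytesize} bytes"
```

Even with compression, a node with thousands of facts can still overflow a request line, which is part of the motivation for moving this data to a dedicated service.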
Current data flow looks like this:
Note that the single upload-and-compile call is drawn as two connections, but it is in fact a single request.
The crux of the proposal is that a separate process be started to own the inventory data, and that the master process be configured to send its facts to the Inventory Service when they are updated by the client and to pull them from it when necessary.
This process will also be a ‘master’ process, and this proposal does not cover how it should implement data management. It is somewhat obvious, however, that a ‘Facts’ terminus will be used to store and retrieve data – starting with the existing ‘yaml’ terminus seems a sufficient first step. This process will need authorization updates to allow a configurable set of servers to upload fact data.
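For example, the authorization change might look something like the following auth.conf fragment on the Inventory Service; the path and hostnames are illustrative assumptions only.

```
# Illustrative only: allow two hypothetical masters to query and save facts
path /facts
auth yes
method find, search, save
allow master1.example.com, master2.example.com
```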
All other master processes must be modified and/or configured to send all facts they receive to this process, and to use this service to find facts when they are needed (such as for compilation).
The Dashboard application then needs to be modified to provide a view of this data, so that a viewer can see any inventory data that exists for a given node. At a minimum, this view should include when the data was last updated.
This proposed architecture converts the data flow to this:
Because Puppet already ships with a Facts terminus type sufficient for retrieving and storing data via RESTful calls (the Facts REST terminus), it might be as simple as reconfiguring all other master processes to use this terminus rather than their current yaml terminus (with appropriate configuration as necessary, via the facts_terminus setting). Alternatively, a new terminus type could be created specifically for talking to this service.
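A minimal sketch of that reconfiguration, assuming the existing REST terminus, might look like this in puppet.conf on each compiling master; the hostname and the commented setting name are assumptions, not defined configuration.

```
# Illustrative puppet.conf fragment for a compiling master.
[master]
    facts_terminus = rest
    # The REST terminus would also need to be told where the Inventory
    # Service lives, e.g. via some setting along the lines of:
    #   inventory_server = inventory.example.com
```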
The Dashboard has ‘Smart Groups’ on its roadmap, which would require a Query interface (e.g., ‘find all hosts whose operatingsystem is Debian’), along with a probably much richer interface overall (e.g., the ability to know which parameters can be queried). This proposal does not cover that Query interface, although the service might later need to be extended to provide one.
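Purely as a sketch of what such a query might look like, the snippet below builds a hypothetical search URL; the endpoint path and parameter names are invented for illustration and are not part of this proposal.

```ruby
require 'uri'

# Illustrative sketch only: one possible shape for a fact-query call
# the Dashboard might issue against the Inventory Service. The host,
# endpoint, and parameter names are assumptions.
uri = URI("https://inventory.example.com:8140/production/facts_search/search")
uri.query = URI.encode_www_form("facts.operatingsystem" => "Debian")

puts uri
```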
Over time we will want to provide more historical information beyond when a given data set was updated. We currently call only for a timestamp indicating how up to date the data is, but eventually we’ll want to know when a parameter is added, removed, or modified, and provide a feed of those changes. Most likely, some kind of event stream is the appropriate solution for this view, but that is all outside the scope of this proposal.