Memory Map Protocol
Motivation
Cantal scans the whole system at 2-second intervals, and this includes the metrics of your application. If cantal polled applications through some kind of remote procedure call (RPC), it would depend too heavily on application responsiveness to provide fine-grained statistics. For many synchronous applications this just doesn’t work, because the application may occasionally wait more than 2 seconds for a database.
Cantal therefore uses another kind of inter-process communication (IPC): shared memory. As you will see later in the text, it requires almost zero configuration and allows efficient collection of statistics for most kinds of programs. For scripting languages it’s also practically zero-cost. For fully-threaded programs it’s usually as cheap as any other approach you could implement.
For scripting languages, the shared memory approach described here also makes it possible to look inside an application that is currently slow or unresponsive.
Overview of Files and Discovery
Cantal discovers metrics by scanning the environment variables of running processes. Whenever it sees CANTAL_PATH in the environment of a process, the metrics are gathered from there.
For CANTAL_PATH=/run/myapp, cantal will look into:
- /run/myapp.meta for metadata (names, sizes, and alignment) of the metrics
- /run/myapp.values for the metric values
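The discovery rule above can be sketched as follows. This is only an illustration, not cantal's actual code: the helper names `cantal_files` and `read_proc_environ` are hypothetical, and reading `/proc/<pid>/environ` assumes a Linux system.

```python
import os


def cantal_files(environ):
    """Return (meta_path, values_path) for an environment dict, or None.

    Hypothetical helper: it only mirrors the naming rule from the text,
    CANTAL_PATH + ".meta" and CANTAL_PATH + ".values".
    """
    base = environ.get("CANTAL_PATH")
    if not base:
        return None
    return base + ".meta", base + ".values"


def read_proc_environ(pid):
    """Parse /proc/<pid>/environ (Linux only) into a dict of strings."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        raw = f.read()
    env = {}
    # Entries are NUL-separated "KEY=VALUE" byte strings.
    for item in raw.split(b"\0"):
        key, sep, value = item.partition(b"=")
        if sep:
            env[os.fsdecode(key)] = os.fsdecode(value)
    return env
```

For example, `cantal_files({"CANTAL_PATH": "/run/myapp"})` yields the two paths listed above, and a process without the variable yields `None`.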
Here is a short example of the contents of the meta file:
counter 8: {"metric": "requests.number"}
counter 8: {"metric": "requests.duration", "unit": "ms"}
It describes two 8-byte (64-bit) unsigned integers, both of which are ever-growing counters.
Here is the respective .values file (displayed in the format of hexdump -C):
00000000 61 00 00 00 00 00 00 00 67 62 00 00 00 00 00 00 |a.......gb......|
Here we can see that there have been 97 requests (0x61), each lasting almost 260 milliseconds on average (0x6267 / 0x61 ≈ 259.7; the bytes 67 62 are read in little-endian order, which is the host byte order here).
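As a quick check of this arithmetic, the 16 bytes from the hexdump can be decoded with Python's `struct` module. The sketch assumes the dump was taken on a little-endian host, hence the explicit `<` in the format string.

```python
import struct

# The 16 bytes shown in the hexdump above.
raw = bytes.fromhex("6100000000000000" "6762000000000000")

# Two unsigned 64-bit integers, little-endian (assumed host byte order).
requests, duration_ms = struct.unpack("<QQ", raw)
average_ms = duration_ms / requests

print(requests, duration_ms, round(average_ms, 1))  # 97 25191 259.7
```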
The files must reside on an in-memory file system (tmpfs). On a typical system the /run folder is a good place. On some systems /tmp is a tmpfs too, but be careful not to put the files on an HDD or SSD. In docker containers you need to map a tmpfs folder from the host system (example command line: docker run -v /run/containers/my1:/run/cantal -e CANTAL_PATH=/run/cantal ...). In a container run by lithos, a !Statedir is a good place.
Metadata File Format
The metadata file format is simple: each line describes one metric (or a padding). The format of a line is:
TYPE NUMBER_OF_BYTES TYPE_PARAM: JSON_METADATA
For example:
level 8 signed: {"metric": "memoryusage"}
counter 8: {"metric": "requests_processed"}
The TYPE_PARAM is optional and is currently used only for the level type, where it can be either signed or float (unsigned will be added in the future).
The JSON_METADATA field is a subset of JSON, currently limited as follows (we may extend it to a larger subset or to full JSON later):
- Serialized data should contain no newlines (you can’t pretty print json)
- Only a dictionary (object) with string keys and string values is supported
The keys and the values of the dictionary may be arbitrary, but each dictionary as a whole (its full set of keys and values) must be unique within the file.
A padding entry is just:
pad 123
Where 123 is the number of bytes.
The values, with their respective lengths, are stored consecutively in the .values file, in the same order as the entries in the metadata. The pad entries can be used to align counters to addresses that are multiples of 8, or whatever is needed for efficient accounting.
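To make the line format and the resulting layout concrete, here is a minimal parsing sketch. The function name `parse_meta` is hypothetical and the code is not cantal's parser; it only follows the rules stated above (optional TYPE_PARAM, JSON after the colon, pad lines without JSON) and computes each entry's offset in the .values file.

```python
import json


def parse_meta(text):
    """Return a list of (offset, type, size, type_param, metadata) tuples.

    Offsets are byte positions in the .values file, obtained by summing
    the sizes of all preceding entries, including pad entries.
    """
    entries = []
    offset = 0
    for line in text.splitlines():
        if not line.strip():
            continue
        # Everything before the first colon is "TYPE SIZE [TYPE_PARAM]".
        head, sep, tail = line.partition(":")
        fields = head.split()
        kind, size = fields[0], int(fields[1])
        param = fields[2] if len(fields) > 2 else None
        # Pad lines have no colon and therefore no JSON metadata.
        meta = json.loads(tail) if sep else None
        entries.append((offset, kind, size, param, meta))
        offset += size
    return entries
```

Given the two example lines above with a `pad 8` between them, the entries land at offsets 0, 8, and 16.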
The metadata file is immutable. To create a metadata file, write it under a temporary name and then do an atomic rename to put it at the right path.
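The write-then-rename step might look like the sketch below. The function name `write_meta_atomically` and the temporary file prefix are illustrative assumptions; `os.replace` performs an atomic rename when the temporary file is on the same file system as the target, which is why the temporary file is created in the target directory.

```python
import os
import tempfile


def write_meta_atomically(path, lines):
    """Write metadata lines to a temporary file, then rename it into place."""
    directory = os.path.dirname(path) or "."
    # Same directory as the target, so the rename stays on one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".meta-tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            for line in lines:
                f.write(line + "\n")
        # mkstemp creates the file with mode 0600; widening it so another
        # user can read it is an assumption about the deployment.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

A reader therefore sees either no metadata file or a complete one, never a partially written file.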
Values File Format
The values file is a binary file that contains raw values in host byte order, written consecutively one after another without any file structure of its own; in other words, its structure is defined entirely by the metadata file. For example, if we have the following metadata:
counter 8: {"metric": "requests.number"}
counter 8: {"metric": "requests.duration", "unit": "ms"}
pad 48
state 64: {"value": "request.sql.request"}
There are exactly:
- 8 bytes, integer in host byte order, counter of the number of requests
- 8 bytes, integer in host byte order, sum of the duration of all requests
- 48 bytes of padding, any garbage can be there, but usually just zeros
- 64 bytes state, first bytes of the sql request that is currently going on
The file size is 128 bytes. As you can see, the state is aligned to 64 bytes, which places it on its own CPU cache line. This means two processors can write the counters and the state simultaneously without any contention. The right alignment is very CPU-dependent and is optional as far as the file format is concerned, but client implementations usually add some kind of padding.
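The padding in this example can be computed mechanically. The sketch below is one possible approach, not a prescribed one: the `ALIGN` table uses the recommended alignments from this document, and the function name `layout` is hypothetical.

```python
import json

# Recommended alignments per type; anything else is left unaligned.
ALIGN = {"counter": 8, "level": 8, "state": 64}


def layout(metrics):
    """Build metadata lines, inserting pad entries where alignment needs them.

    metrics is a list of (type, size, type_param_or_None, metadata_dict).
    Returns (lines, total_size_in_bytes).
    """
    lines = []
    offset = 0
    for kind, size, param, meta in metrics:
        align = ALIGN.get(kind, 1)
        gap = -offset % align  # bytes needed to reach the next multiple
        if gap:
            lines.append(f"pad {gap}")
            offset += gap
        head = f"{kind} {size}" + (f" {param}" if param else "")
        # json.dumps emits a single line by default, as the format requires.
        lines.append(f"{head}: {json.dumps(meta)}")
        offset += size
    return lines, offset
```

Feeding it the two counters and the 64-byte state from the example produces the same four lines, including `pad 48`, and a total size of 128 bytes.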
Overall, the file is structured so that a program can atomically adjust any counter directly in shared memory, without ever duplicating metrics or delaying the submission of statistics. (For some very fast and heavily threaded programs there may still be a lot of contention, and traditional techniques may be applied; but please do benchmarks first.)
Data Types
Data types that are currently supported by the cantal agent:
Type Name | Allowed Sizes | Alignment (recommended) | Description |
---|---|---|---|
counter | 8 bytes (64bit) | 8 bytes | A 64bit ever-growing counter. |
level | 8 bytes (64bit) | 8 bytes | A current value of something, may grow or decrease |
state | 16-65535 bytes | 64 bytes | An arbitrary string value that is visible in cantal. No history of it is stored. |
pad | 1-65535 bytes | – | No data |
More types and sizes will be implemented later.
The counter is the most useful type. You should increment the value of a counter using atomic operations (unless you have a GIL, so any small write is atomic) and never overwrite it with a whole new value. It’s fine to initialize the counter to zero on application restart; you don’t need to persist the value anywhere.
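For a scripting language, incrementing a counter in the shared values file could be sketched as below. The class name `Counters` is hypothetical. The code assumes a Unix system and a single writer per counter: the read and the write-back are two separate steps, so a program that increments the same counter from several threads would need a lock around `incr`.

```python
import mmap
import os
import struct

# Unsigned 64-bit integer in native (host) byte order.
COUNTER = struct.Struct("=Q")


class Counters:
    """Memory-map a values file and increment counters in place."""

    def __init__(self, path, size):
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        try:
            os.ftruncate(fd, size)  # a newly created file is zero-filled
            self._map = mmap.mmap(fd, size)  # shared mapping on Unix
        finally:
            os.close(fd)  # the mapping keeps its own descriptor

    def incr(self, offset, amount=1):
        (value,) = COUNTER.unpack_from(self._map, offset)
        # Wrap around at 64 bits instead of raising on overflow.
        COUNTER.pack_into(self._map, offset, (value + amount) & 0xFFFFFFFFFFFFFFFF)
```

With the two-counter layout from the earlier example, a request handler would call `incr(0)` once per request and `incr(8, elapsed_ms)` with the measured duration.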
Good use cases for counter are:
- Number of requests
- Total duration of requests
- Tasks processed
From the above, the following values can be derived. Your application should not write them; they are calculated by cantal itself:
- The number of requests per second (or per any other unit of time)
- The average duration of each request
- Tasks processed per second
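The derivation amounts to taking differences between two consecutive samples of the counters. The sketch below illustrates the arithmetic only and is not cantal's implementation; the function name `derive` and the handling of a counter that went backwards (treated as an application restart) are assumptions.

```python
def derive(prev, curr, interval_seconds):
    """Return (requests_per_second, average_duration) from two samples.

    prev and curr map metric names to raw counter values.
    Returns (None, None) if the counter decreased, i.e. it was reset.
    """
    d_requests = curr["requests.number"] - prev["requests.number"]
    d_duration = curr["requests.duration"] - prev["requests.duration"]
    if d_requests < 0 or d_duration < 0:
        return None, None  # assumed: application restarted, skip this interval
    rate = d_requests / interval_seconds
    # No requests in the interval means there is no average to report.
    average = d_duration / d_requests if d_requests else None
    return rate, average
```

For instance, going from 97 requests and 25191 ms to 117 requests and 30391 ms over 2 seconds gives 10 requests per second at 260 ms each.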
Good use cases for level are:
- Memory used by object pool
- Current queue size
Don’t use level for things like the number of operations per second. Use counter instead. This gives correct statistics even if the collection interval changes, when something is slow, and so on.
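Unlike a counter, a level is simply overwritten with the current value. A minimal sketch of such a write follows; it assumes that the signed parameter means a 64-bit signed integer and that float means an 8-byte IEEE 754 double, both in host byte order. The name `set_level` is hypothetical.

```python
import struct

# Assumed encodings for the level TYPE_PARAM values, 8 bytes each.
LEVEL_FORMATS = {"signed": "=q", "float": "=d"}


def set_level(buf, offset, value, type_param="signed"):
    """Overwrite the 8-byte level at the given offset with the current value."""
    struct.pack_into(LEVEL_FORMATS[type_param], buf, offset, value)
```

Here `buf` can be any writable buffer, such as a `bytearray` in tests or a memory-mapped values file in a real program.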