Introduction

DCTV is a data exploration toolkit designed for both interactive and batch analysis of trace files and other heterogeneous time series data. It's designed to answer complex questions about the sort of data that one frequently finds in records of system activity.

Important features of DCTV are:

SQL1999 querying of trace files
specialized relational algebra and SQL syntax for time series
comprehensive dimensional analysis for unit conversion and error detection
support for analyzing very large (larger than memory) trace files
powerful GUI for interactive trace exploration

Use cases include:

examining CPU time spent by a particular application
examining CPU time spent in part of an application
examining memory activity of the whole system to determine what caused a game to miss a frame deadline
finding which functions cause the most page faults during app startup
tracking down slow memory leaks
finding why a real-time thread took too long to run and poll a device
bulk analysis of traces from production to extract metrics for a dashboard

DCTV is a "power user" tool: using it effectively requires an understanding of both the system components that generate the trace events being queried and an understanding of SQL-like declarative query systems. This document aims to describe and document DCTV's functionality, walk through a few examples of trace analysis, and invite the reader to investigate further.

Quick start

Getting DCTV

1. Be running gLinux (we'll port eventually)
2. git clone sso://team/dctv/dctv
3. make dev
4. follow prompts; install dependencies
5. while the build is broken, complain to dancol@, goto 2
6. ./dctv

Hello world

 SELECT COUNT(*) FROM mytrace.scheduler.timeslices_p_cpu;
COUNT()
-------
  32362

Background

Life is just one damned thing after another. Arnold J. Toynbee

Purpose of DCTV

A trace file by itself is of limited utility: it's gigabytes of detailed, low-level records of system activity. When we analyze a trace file, what we really want to do is pose questions to that trace file and get back meaningful answers. The information we want lies in the non-trivial relationships between trace events, the relationships between relationships, and so on, in a way that puts limits on the kind of trace analysis that it's possible to do using ad-hoc analysis of trace events themselves.

After we pose questions to a trace file and get answers, we frequently want to use these answers as the basis for further questions. In this way, we gradually increase the level of abstraction of our analysis, moving from questions posed in terms of raw trace events to ones posed in terms of the problem we're actually trying to solve.

DCTV is a question-answering machine. By incrementally constructing queries and then querying against them (for example, using the WITH construction), users extract increasingly abstract data from trace files, data not directly represented by discrete and specific low-level events in a trace. The SQL REPL and the GUI both provide information-querying capabilities.

DCTV also provides a standard library of ready-made building blocks that users can query during trace analysis.

Other trace analysis tools

DCTV is not the first such tool for trace analysis. It integrates the best parts of WPA, LISA, and Perfetto's trace analysis models.

TODO(dancol): flesh out this section

Document conventions

This document currently assumes the reader is familiar with the basics of SQL and the basics of trace processing, focusing on DCTV's specific features in this area.

Time tables

Some figures below are "time tables" (they have "Time ▶" in the upper-left). They represent timelines, where each row in the table is a separate and independent data series. Some tables represent operands and results; in this case, a thick black line separates the input rows and output rows.

Function signatures

Table-valued function signatures are given in Python syntax, with a bare * signifying that all arguments following the * are keyword-only and cannot be specified positionally. (That is, if a function signature is foo(*, bar=7), then you have to write either foo() (using bar's default value) or foo(bar=>5) (specifying an explicit value for the keyword argument); you can't write foo(1), because bar can't be specified positionally.)
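
The same rule can be tried out in plain Python. This is only an illustration of the signature convention: foo and bar are the placeholder names from the paragraph above, and Python spells the keyword argument bar=5 where DCTV's SQL syntax spells it bar=>5.

```python
def foo(*, bar=7):
    """Placeholder function whose only parameter, bar, is keyword-only."""
    return bar


print(foo())       # falls back to bar's default value
print(foo(bar=5))  # explicit keyword argument

try:
    foo(1)         # a positional argument is rejected
except TypeError as exc:
    print("TypeError:", exc)
```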

Data model

DCTV is designed around querying one or more trace files using SQL queries. DCTV performs no hardcoded pre-processing of trace files: we model each event in a trace file as a row of the "raw events" table corresponding to that event's type. Each field in an event is a column in that event's table; users extract higher-level information from these low-level events by defining views in terms of these low-level events. By querying the views, users can extract higher-level trace events; users can also define views in terms of other views to answer more abstract questions.

Table types

DCTV's query engine provides the tables and set functions that any SQL system provides, but extends these facilities with a set of operators and functions dedicated to working with heterogeneous time series. Tables in DCTV are first-class typed objects: tables are either regular tables, span tables, or event tables. Each type of table has a set of query operations that it supports; DCTV provides functions to convert one type of table to another as needed.

It's always possible to "view" one of DCTV's special table types as a regular table by just using regular table operations (like the non-SPAN variant of SELECT) on it. The result of any of these non-special operations is itself a regular table.

This table summarizes the special operations DCTV supports. Don't worry if you don't recognize some of these terms (like "partitioned span table"): they're defined below.

Operation	Left operand	Right operand	Result
SELECT	Regular table	N/A	Regular table
SELECT	Span table	N/A	Regular table
SELECT SPAN	Span table	N/A	Span table
SPAN JOIN	Unpartitioned span table	Unpartitioned span table	Unpartitioned span table
SPAN BROADCAST INTO	Unpartitioned span table	Partitioned span table	Partitioned span table
SPAN BROADCAST FROM	Partitioned span table	Unpartitioned span table	Partitioned span table
GROUP USING PARTITION	Partitioned span table	N/A	Unpartitioned span table
GROUP USING SPANS FROM	Partitioned span table	Unpartitioned span table	Partitioned span table
GROUP USING SPANS FROM	Unpartitioned span table	Unpartitioned span table	Unpartitioned span table

A regular SQL table is essentially a list of points in high-dimensional space, with each column in the table representing one dimension along which a point can vary.

A span table represents data that vary over the time dimension. An interval of time over which the data in a span table remain the same is called a span. The collection of time-varying data described by a span table is the payload of that span table.

All span tables have two special columns: _ts and _duration. _ts is an INT64 timestamp, in nanoseconds since the start of the trace. _duration is a non-zero INT64 number of nanoseconds that the span covers. (That is, the span describes the region of time [_ts, _ts + _duration].)

_ts and _duration are always non-NULL, and a span table is always ordered by increasing values of _ts. Spans in a span table cannot "overlap": a span must end either before or at exactly the same time as the next span begins. (Spans from different partitions may overlap, however: see immediately below.) A span table need not be contiguous: that is, it's legal for gaps to exist between spans.

For example, imagine that you're looking at a Christmas tree light that changes color in time with music. We might describe the color of the light using spans. The following diagram depicts how we might use spans to describe the light's state. Each pair of numbers (one above the table, one below) indicates the time corresponding to the vertical line connecting them.

Light color
Time ▶	1	2	3	45
Color	Red			Green
Time ▶	1	2	3	45

Here, the light was red from time one to time three and then green from time four to time five, inclusive. (From time three to time four, the light was off; we're choosing to represent "off" as the absence of a span, but an equally valid choice would be to make a span with a special "Off" value for the color.)

It's useful to look at the physical table representation of the above set of spans.

Light color (span table representation)
_ts	_duration	color
1	2	red
4	1	green

Note that one row in the physical table representation of a span table corresponds to one logical span.
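
To make the span table rules concrete, here is a minimal Python sketch that checks them against the light table above. It is not part of DCTV: Span and is_valid_span_table are hypothetical names, and the sketch assumes that durations are positive and that a span covers the time from _ts up to, but not including, _ts + _duration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    ts: int        # start time (the _ts column)
    duration: int  # length of the span (the _duration column)
    color: str     # payload


def is_valid_span_table(spans):
    """Return True if spans are in _ts order, non-empty and non-overlapping."""
    prev_end = None
    for span in spans:
        if span.duration <= 0:
            return False  # assumption: a span must cover some time
        if prev_end is not None and span.ts < prev_end:
            return False  # starts before the previous span ended
        prev_end = span.ts + span.duration
    return True


light = [Span(1, 2, "red"), Span(4, 1, "green")]
print(is_valid_span_table(light))
```

A span that begins exactly where the previous one ends passes the check; a span that begins earlier than that fails it. Gaps, like the one between the red and green spans, are allowed.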

It's because span tables are always ordered by _ts that DCTV disallows queries of the form SELECT SPAN ... ORDER BY .... Re-ordering a span table makes no sense. If you don't want to SELECT from a span table and make the result a span table, you can choose to instead view the span table as a regular table by using the non-SPAN variant of select (SELECT * FROM my_span_table), and in this mode, SELECT will let you order the result set by whatever you want.

An event table is like a span table, but without the _duration column. It represents a sequence of "points" in time. The advantage of using an event table over a regular SQL table to represent points is automatic integration of the event table into time-based operations on spans.

Partitions

A span table is either a partitioned span table or a non-partitioned span table. A non-partitioned span table is just the kind of span table described above. A partitioned span table, by contrast, has an additional special column, the partition column. A partitioned span table is basically a bundle of logical partition tables all combined into a single table under a single name. Each distinct value of the partition column, which is called a partition, defines one independent sequence of spans.

All of DCTV's operations on span tables know about partitioned span tables (the partition column is part of the span table's type) and operate on each partition within a span table independently. There are also operations that transform a partitioned span table into a non-partitioned span table through the use of SQL grouping operators.

It's useful to store sequences of spans this way instead of putting each in its own table: using a partitioned span table, we can operate on groups of related time series uniformly without having to change our queries depending on how many different time series we have. For example, a CPU-related query should look the same on any system no matter how many CPUs it has!

DCTV currently allows a span table to have either zero or one partition column, but not more. This limit is just an implementation limit, and in the future, DCTV will allow partitioning by more than one column.

Let's look at our Christmas tree light example, but with partitions. Here, we're looking at two lights, one called "light#0" and another called "light#1". We use a sequence of spans to describe each light's state. It's critical to understand that each light has a distinct state history, but that we store all of these histories in the same physical table, using a column to describe the specific light that a specific row describes.

Colors of two lights
Time ▶	1	2	3	45
Light#0	Red			Green
Light#1	Green	Red
Time ▶	1	2	3	45

Here's the physical partitioned span table representation of the logical spans from the above diagram.

Colors of two lights (span table representation)
_ts	_duration	lightno	color
1	2	0	red
1	1	1	green
2	3	1	red
4	1	0	green

Like an unpartitioned span table, a partitioned span table is ordered strictly by increasing _ts. If spans from two different partitions begin at the same time, the ordering of those with the same _ts value is unspecified.

Span operations

While we can apply normal SQL querying operations to span tables, we can answer certain questions much more conveniently by using DCTV's special span operations, which are designed to make it easy to work with real-world time series data.

Span join

The span join family of operations merges spans together in a timewise-correct way and generates new spans divided on the common boundaries of the spans that flow as input into the span join.

It's easiest to demonstrate a span join visually.

Time ▶	1	2	34
Size	tiny		giant
Species	fish	squirrel
SPAN JOIN
Phenotype	tiny fish	tiny squirrel	giant squirrel
Time ▶	1	2	34

Here, we're joining two hypothetical time series (as represented by span tables), a time series of sizes and a time series of animal types. (Imagine we're trying to reconstruct the state of an animal given a record of the transmutation spells some novice sorcerer's apprentice might have haphazardly cast.)

In this trace, the "make the animal tiny" spell was in effect from timestamp one to timestamp three (inclusive), and the "make the animal giant" spell was in effect from timestamp three onward. Likewise, the "make the animal a fish" spell was in effect from timestamp one to timestamp two (inclusive) and the "make the animal a squirrel" spell was in effect from timestamp two onward. The first row depicts the result of the size spells, and the second row depicts the effect of the animal-type spell. (We imagine that each spell cancels the effect of the last spell of the same type.)

The last row, "phenotype", represents a span table giving the type of animal that we observe at each moment, inferred from the effects of the previous two rows. Note that the result span table has a span division wherever any of the inputs has a span division. We ensure that all the properties of any of the input spans stay constant "within" any of the output spans, allowing for correct future computation involving these values.

It may be informative to look at the row-wise representation of the above span tables:

Size
_ts	_duration	size
1	2	tiny
3	1	giant

Species
_ts	_duration	species
1	1	fish
2	2	squirrel

Phenotype
_ts	_duration	size	species
1	1	tiny	fish
2	1	tiny	squirrel
3	1	giant	squirrel
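
The join can also be modeled in a few lines of Python. This is a sketch of the semantics only, not DCTV's implementation: span_inner_join is a hypothetical name, each span is a (ts, duration, payload) tuple, and we assume a span covers the time from ts up to, but not including, ts + duration. The inputs follow the diagram and prose above (tiny until timestamp three, giant afterward).

```python
def span_inner_join(left, right):
    """Join two unpartitioned span lists, splitting at every input boundary.

    An output span is emitted only where a span from each input is present.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        l_ts, l_dur, l_val = left[i]
        r_ts, r_dur, r_val = right[j]
        l_end, r_end = l_ts + l_dur, r_ts + r_dur
        start, end = max(l_ts, r_ts), min(l_end, r_end)
        if start < end:
            out.append((start, end - start, l_val, r_val))
        # Advance whichever input span finishes first (both on a tie).
        if l_end <= r_end:
            i += 1
        if r_end <= l_end:
            j += 1
    return out


size = [(1, 2, "tiny"), (3, 1, "giant")]
species = [(1, 1, "fish"), (2, 2, "squirrel")]
for row in span_inner_join(size, species):
    print(row)
```

The three rows printed correspond to the three rows of the Phenotype table.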

Span join: inner and outer

What happens when spans don't line up exactly?

Span joins come in two varieties, named after the varieties of regular SQL joins: inner span join and outer span join. When all the inputs to a span join cover the same period of time, the difference doesn't matter. But when there are gaps in one sequence or another, the difference becomes important. Just as in the previous section, we'll start with a diagram.

Sample inputs
Time ▶	1	2	34
Breath	fire		ice
Color	red	green
Time ▶	1	2	34

Here, we see that there is no magic breath spell in effect from time two to time three, inclusive. What happens when we perform a span join on these span tables? It depends on the kind of span join.

Span inner join
Time ▶	1	2	34
Breath	fire		ice
Color	red	green
Span inner join
Phenotype	fire-breathing red		ice-breathing green
Time ▶	1	2	34

Span outer join
Time ▶	1	2	34
Breath	fire		ice
Color	red	green
Span outer join
Phenotype	fire-breathing red	`NULL`-breathing green	ice-breathing green
Time ▶	1	2	34

In the span inner join case, we emit an output span only when all input spans cover a time interval. In the span outer join case, we emit an output span when any input span covers a specific time region, providing NULL for the value of any payload column not provided by a span for that region.

The table representations of the two result span tables may make the result more clear.

Span inner join result (table view)
_ts	_duration	breath	color
1	1	fire	red
3	1	ice	green

Span outer join result (table view)
_ts	_duration	breath	color
1	1	fire	red
2	1	`NULL`	green
3	1	ice	green

Note that even a span outer join won't produce a result span that covers a period of time that no input span covered, as the following diagram indicates.

Holes in span outer join
Time ▶	1	2	34
Breath			ice
Color		red	green
Span outer join
Phenotype		`NULL`-breathing red	ice-breathing green
Time ▶	1	2	34
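
The difference between the two varieties fits in a short Python model. As before, this is a sketch under stated assumptions rather than DCTV's implementation: span_join and covering are hypothetical names, spans are (ts, duration, payload) tuples covering ts up to but not including ts + duration, payloads are assumed to be non-NULL, and Python's None stands in for SQL NULL.

```python
def covering(spans, start, end):
    """Return the payload of the span covering [start, end), or None."""
    for ts, duration, value in spans:
        if ts <= start and end <= ts + duration:
            return value
    return None


def span_join(left, right, outer=False):
    """Inner span join by default; outer span join when outer=True."""
    # Every input boundary is a potential output boundary.
    cuts = sorted({t for ts, d, _ in left + right for t in (ts, ts + d)})
    out = []
    for start, end in zip(cuts, cuts[1:]):
        l_val = covering(left, start, end)
        r_val = covering(right, start, end)
        present = (l_val is not None, r_val is not None)
        if all(present) or (outer and any(present)):
            out.append((start, end - start, l_val, r_val))
    return out


breath = [(1, 1, "fire"), (3, 1, "ice")]
color = [(1, 1, "red"), (2, 2, "green")]
print(span_join(breath, color))              # inner
print(span_join(breath, color, outer=True))  # outer
```

Regions that no input covers never satisfy either condition, which is why even the outer variant leaves holes where all inputs have holes.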

Span broadcast

A span broadcast is a special kind of span join that operates on two span tables, one partitioned and one not. Normally, DCTV treats each partition within a partitioned span table as a separate time series and operates on each independently; DCTV refuses to perform span operations on span tables partitioned by different columns or between partitioned and non-partitioned span tables, since the desired operation isn't obvious.

With a span broadcast, we can tell DCTV to perform a special kind of span join between a partitioned and non-partitioned table, "broadcasting" the non-partitioned span into every partition in the partitioned span table in such a way that the result has useful properties.

The overall result is almost as if we copied the non-partitioned span table N times, once for each of the N partitions, into a new partitioned span table, and then joined that new partitioned span table with the other partitioned span table that we had when we started. The difference between this hypothetical operation and span broadcast is that span broadcast doesn't generate any output spans for regions not covered by any span in the partitioned span table, even if that region is covered by the non-partitioned span table.

Another way to think of it is that span broadcast "labels" each span in a partitioned span table with the payload of the non-partitioned table. The output of a span broadcast operation is partitioned in the same way as its partitioned input.

As usual, a diagram may be illustrative. Here, "Size#0" and "Size#1" indicate two partitions of the same span table, "Size" (let's suppose animals 0 and 1 have different size spells cast on them). "Color" is the input non-partitioned span table (let's suppose color spells affect all animals at the same time).

Sample inputs
Time ▶	1	2	3	45
Size #0	tiny	giant
Size #1	tiny
Color	red		green
Time ▶	1	2	3	45

Just like regular span joins, span broadcasts come in span inner broadcast and span outer broadcast varieties, depicted below. Note that the time period from four to five doesn't appear in the result span tables, since from time four to time five, we had a color span from the non-partitioned span table, but no spans from size, the partitioned span table.

Inner broadcast of color into size
Time ▶	1	2	3	45
Size #0	tiny	giant
Size #1	tiny
Color	red		green
Inner broadcast
Result#0	tiny red		giant green
Result#1	tiny red		tiny green
Time ▶	1	2	3	45

Outer broadcast of color into size
Time ▶	1	2	3	45
Size #0	tiny	giant
Size #1	tiny
Color	red		green
Outer broadcast
Result#0	tiny red	`NULL`-colored giant	giant green
Result#1	tiny red	`NULL`-colored tiny	tiny green
Time ▶	1	2	3	45

In general, we use a span broadcast when we have a number of different things happening at the same time (each represented by one partition of a span table) and we want to "mix into" this span table knowledge of something that affects the environment as a whole.
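
Here is a small Python model of a broadcast, again a sketch rather than DCTV's implementation. span_broadcast is a hypothetical name, None stands in for NULL, and the timestamps are our reading of the diagrams above (size #0 is tiny from one to two and giant from two to four, size #1 is tiny from one to four, color is red from one to two and green from three to five), so treat them as assumptions.

```python
from collections import defaultdict


def span_broadcast(partitioned, unpartitioned, outer=False):
    """Label each partitioned span with the unpartitioned table's payload.

    partitioned rows:   (ts, duration, partition, payload)
    unpartitioned rows: (ts, duration, payload)
    """
    by_partition = defaultdict(list)
    for ts, dur, part, val in partitioned:
        by_partition[part].append((ts, dur, val))
    out = []
    for part, spans in by_partition.items():
        for p_ts, p_dur, p_val in spans:
            p_end = p_ts + p_dur
            cursor = p_ts  # start of the not-yet-emitted part of this span
            for u_ts, u_dur, u_val in unpartitioned:
                start, end = max(p_ts, u_ts), min(p_end, u_ts + u_dur)
                if start >= end:
                    continue  # no overlap with this unpartitioned span
                if outer and cursor < start:
                    out.append((cursor, start - cursor, part, p_val, None))
                out.append((start, end - start, part, p_val, u_val))
                cursor = end
            if outer and cursor < p_end:
                out.append((cursor, p_end - cursor, part, p_val, None))
    # Output is never produced outside a partitioned span.
    return sorted(out, key=lambda row: row[0])


size = [(1, 1, 0, "tiny"), (1, 3, 1, "tiny"), (2, 2, 0, "giant")]
color = [(1, 1, "red"), (3, 2, "green")]
for row in span_broadcast(size, color, outer=True):
    print(row)
```

Because the loop only ever walks the extent of each partitioned span, the color span's tail from four to five produces no output in either variety.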

Span group

A span group operation is the opposite of a span join, in a sense. It merges spans together and applies SQL set functions (like MAX and SUM) to the payloads of the merged spans, forming for each payload a combined value determined through the usual SQL aggregation operation.

Here's a diagram.

Time ▶	1	2	3	4	5	6	7	89
Number arms	2	5	0	7	2	4	9	0
Periods	A		B		C		D
Span group
`MAX(arms)`	5		7		4		9
`MIN(arms)`	2		0		2		0
Time ▶	1	2	3	4	5	6	7	89

Here, our hapless sorcerer repeatedly changed the number of arms that our poor animal had at any time. We want to determine, based on the record of arm-number changes, for each relatively broad interval A, B, C, and D, the minimum and maximum number of arms our animal had during that interval.

A span group operation involves two span tables: the grouped table and the grouper table. The grouped table ("number of arms", in our example) supplies the source data for the grouping operations; the grouper table (here, "periods") supplies spans describing the groups that form the output value. The grouped table may or may not be partitioned; if it is partitioned, DCTV applies grouping to each partition individually. The grouper table may not currently be partitioned.

A span group operation always emits one output span for each span in its grouper input span table. If no grouped span overlaps with a given grouper span, all its aggregate values end up being NULL. An example follows.

Illustration of span group behavior with missing grouped values
Time ▶	1	2	3	4	5	6	7	89
Number arms	2	5	0				9	0
Periods	A		B		C		D
Span group
`MAX(arms)`	5		0		`NULL`		9
`MIN(arms)`	2		0		`NULL`		0
Time ▶	1	2	3	4	5	6	7	89
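
For the unpartitioned case, the grouping rule can be sketched in Python as follows. span_group is a hypothetical name rather than a DCTV function, spans are (ts, duration, payload) tuples, None stands in for NULL, and the timestamps are read off the diagrams above (each arm count lasts one time unit, each period two).

```python
def span_group(grouped, grouper, aggregate):
    """Emit one output span per grouper span, aggregating overlapping values."""
    out = []
    for g_ts, g_dur, _label in grouper:
        g_end = g_ts + g_dur
        values = [val for ts, dur, val in grouped
                  if ts < g_end and g_ts < ts + dur]  # overlap test
        out.append((g_ts, g_dur, aggregate(values) if values else None))
    return out


periods = [(1, 2, "A"), (3, 2, "B"), (5, 2, "C"), (7, 2, "D")]
arms = [(1, 1, 2), (2, 1, 5), (3, 1, 0), (7, 1, 9), (8, 1, 0)]  # gap at 4..7
print(span_group(arms, periods, max))
print(span_group(arms, periods, min))
```

Period C overlaps no arm-count span, so its aggregate is None, mirroring the NULL cells in the diagram.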

Span group operations have two flavors: span group and intersect and span group and union. The difference matters only when multiple partitions are involved. In the former case, we include payloads from the grouped span table only when all partitions are present in a given interval; in the latter case, we include the grouped span table in the output spans when any input grouped partition is present.

Span departition

A span departition operation transforms a partitioned span table into a non-partitioned span table by combining the partition payloads with SQL set functions. This operation is useful mainly when we have a "split up" view of activity on the system and want to derive a whole-system view by matching up all the partitions.

To return to our magical forensics example, imagine our apprentice cast some very expensive add-arms-to-animals spells on a number of different animals. We're billed for arms based on the total number we're using at any one time (there's a license server and everything), so we want to reconstruct, based on a record of each animal's arm count, the number of arms we were using in total at a particular moment. In the following table, "Arms#0", "Arms#1", and so on denote the partitions of a single "Arms" span table.

Time ▶	1	2	3	4	5	6	7	89
Arms#0	2			7		4	9	0
Arms#1	2					4
Departition
`SUM(arms)`	4			9		8	13	4
Time ▶	1	2	3	4	5	6	7	89

A span departition operation resembles a span join followed by a span group operation, but it's specified separately so that we can work with partitioned span tables without knowing in advance how many partitions we have or having to expand our queries to work with each partition separately.

Span departitions come in two varieties, the span departition and union and span departition and intersect operations, with the difference concerning the treatment of missing data. The following table gives the differences between these approaches.

Arm history with missing data
Time ▶	1	2	3	4	5	6	7	89
Arms#0				7		4	9	0
Arms#1	2
Time ▶	1	2	3	4	5	6	7	89

In intersect mode, we generate an output span for a region of time only when all partitions have a span covering that period.

Departition intersect result
Time ▶	1	2	3	4	5	6	7	89
Arms#0				7		4	9	0
Arms#1	2
Departition intersect
`SUM(arms)`				9
Time ▶	1	2	3	4	5	6	7	89

By contrast, in union mode, we generate an output span when any partition covers a region of time. We treat any missing partitions as contributing NULL to the output aggregation for each span. Note that SQL aggregation functions just skip NULL values, so the sums below are correct.

Departition union result
Time ▶	1	2	3	4	5	6	7	89
Arms#0				7		4	9	0
Arms#1	2
Departition union
`SUM(arms)`	2			9		4	9	0
Time ▶	1	2	3	4	5	6	7	89
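
Both modes can be modeled with one short Python function. This is a sketch of the behavior described above, not DCTV's implementation: span_departition is a hypothetical name, rows are (ts, duration, partition, value) tuples, "all partitions" is taken to mean every partition that appears in the input, and the timestamps are our reading of the "missing data" diagram.

```python
def span_departition(rows, aggregate, mode="union"):
    """Collapse a partitioned span table into an unpartitioned one."""
    partitions = {part for _, _, part, _ in rows}
    # Every span boundary in any partition is a potential output boundary.
    cuts = sorted({t for ts, dur, _, _ in rows for t in (ts, ts + dur)})
    out = []
    for start, end in zip(cuts, cuts[1:]):
        values = [val for ts, dur, _, val in rows
                  if ts <= start and end <= ts + dur]
        if not values:
            continue  # no partition covers this region
        if mode == "intersect" and len(values) < len(partitions):
            continue  # some partition is missing here
        out.append((start, end - start, aggregate(values)))
    return out


# Arms#0 has no data before time four; Arms#1 has none after time six.
arms = [(1, 5, 1, 2), (4, 2, 0, 7), (6, 1, 0, 4), (7, 1, 0, 9), (8, 1, 0, 0)]
print(span_departition(arms, sum, mode="union"))
print(span_departition(arms, sum, mode="intersect"))
```

In union mode a missing partition simply contributes nothing to the aggregate, which matches how SQL aggregates skip NULL.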

Trace processing intrinsic functions

DCTV aims to be a general-purpose time series analysis program, one that just happens to be especially useful for processing Android system traces. Its general approach is to avoid system- and metric-specific data processing routines and provide general-purpose operators that users can combine to analyze data in particular situations.

The previous section describes operations that DCTV provides in the form of query operators. DCTV also provides some operations, usually less common ones, in the form of table-valued functions.

Time series to span conversion

Recall that DCTV exposes events from trace files as raw data points, in event tables. We have to build span tables from these raw data somehow, and the time_series_to_spans table-valued function does exactly that.

time_series_to_spans takes as input a set of event sources and a set of output column descriptors and produces a span table as output. Logically, it consumes events from the given sources, in time order, and constructs spans by watching for "start" and "stop" events as denoted by the input sources. Payload values attached to the event sources become payload columns of the output span table according to each output column's specification.

Each source is either a "start-start" source or a "stop" source. The former case models a set of events that divide a timeline up into discrete chunks.

Returning for a moment to our hypothetical wizardly apprentice, we recall that an animal's size might change as our apprentice casts various "change size" spells on it. The raw, event-by-event, record of spells cast by our apprentice might look like this.

Raw size spell record
_ts	size
1	tiny
3	huge
4	large
6	huge

Processing this raw event table into spans using time_series_to_spans, we end up with a span table that looks like this. (The time scale goes to seven for easier comparison with the next example.)

Time ▶	1	2	3	4	5	67
Size	tiny		huge	large
Time ▶	1	2	3	4	5	67

The final "huge" spell isn't reflected in the output span table, because time_series_to_spans ignores spans left "open" (i.e., unclosed) at the end of processing. The intent of this feature is to work with span inner join operations to automatically ignore noisy partial-data "junk intervals" at the beginning and end of traces. If a need arises, time_series_to_spans could be extended in the future to automatically close open spans.

time_series_to_spans also supports "stop" events. These events don't start new spans, but do indicate that any open span active at the time of the stop event should be finished. In an operating system context, if sched_switch is a start-start event, a CPU hotplug off event might be a "stop" event, since it would indicate that a CPU has stopped processing traces without producing any new ones.

To return to our unfortunate apprentice example, suppose we have an additional table of "size reset" spells that we know were cast during the sequence of size change spells. A size reset spell just returns a creature to whatever size it had without any magical augmentation. The raw table might look something like this.

Raw size-reset spell record
_ts
5
7

If we feed both our original size spell record event table and our size-reset spell table into time_series_to_spans, we end up with a span table that looks like this.

Time ▶	1	2	3	4	5	67
Size	tiny		huge	large		huge
Time ▶	1	2	3	4	5	67

Note the differences: first, we now have a "hole" between times five and six, because the stop table told us that we stopped changing our poor confused creature's size at time five and didn't start changing it again until time six. Second, we have a "huge" span from time six to seven, because the span beginning at time six is no longer left open after time_series_to_spans ends.
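
The two examples above can be reproduced with a simplified Python model. To be clear about what is assumed: events_to_spans is a hypothetical stand-in, not the real time_series_to_spans signature; it handles one unpartitioned start-start source plus one stop source, and it always takes payloads from the event that opens a span.

```python
def events_to_spans(starts, stops=()):
    """Build spans from start-start events and optional stop events.

    starts: (ts, payload) events; each one closes any open span and opens
            a new one.
    stops:  timestamps; each one closes the open span without opening another.
    A span still open when the events run out is dropped.
    """
    events = sorted(
        [(ts, "start", val) for ts, val in starts]
        + [(ts, "stop", None) for ts in stops],
        key=lambda event: event[0],
    )
    spans = []
    open_span = None  # (ts, payload) of the span currently being built
    for ts, kind, val in events:
        if open_span is not None and ts > open_span[0]:
            spans.append((open_span[0], ts - open_span[0], open_span[1]))
        open_span = (ts, val) if kind == "start" else None
    return spans


size_spells = [(1, "tiny"), (3, "huge"), (4, "large"), (6, "huge")]
print(events_to_spans(size_spells))
print(events_to_spans(size_spells, stops=[5, 7]))
```

Without stops, the final "huge" span never closes and is dropped; with the stops at five and seven, "large" is cut short and "huge" gets an end.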

If you want a span table that substitutes a concrete value (say, "normal") for the hole, you can combine a span outer join of the whole-trace span with COALESCE on the payload column to make one.

Each payload column that time_series_to_spans generates is described by a "source specification". The specification describes, for each output column, the source event table from which we get the column's value and the "edge" from which we draw the value. (The edge defaults to "rising".) Using the "rising" edge means that we draw the output payload column for a span from the event that started the span; using "falling" instead tells time_series_to_spans to draw the payload column value from the closing event. We typically stick with "rising" except in special cases.

time_series_to_spans supports creating partitioned span tables as well; each source specification can be associated with a partition column in that source table. All sources for a given call to time_series_to_spans must be partitioned the same way.

Stackification

Not all raw input events look like a series of starts and stops on a timeline. Another common pattern in raw input is the "start-stop stack", in which a series of nested and balanced start and stop events describes the erection and demolition of a stack of some kind of thing.

Stacks can be anything: examples include procedure call stacks, Android synchronous atrace regions, and nested interrupt handlers. To keep with our hapless-apprentice example theme, we'll imagine that spells are prepared by simultaneous chanting, waving, and stirring, and that we have distinct "start" and "stop" records for each activity.

Suppose we know at what time our apprentice starts a given activity and know at what time an activity ends. Suppose also that our apprentice at least paid enough attention in class to understand that one always stops the magical activity one most recently started.

(Note that at time five, a second chant begins even though a chant was already ongoing. A friend must have joined in.)

Spell starts
_ts	activity
1	stir
3	wave
4	chant
5	chant

Spell stops
_ts
2
7
7
7

What happens if we rearrange these data into spans?

Notional stackified spells
Time ▶	1	2	3	4	5	67
Effects	[stir]		[wave]	[wave, chant]	[wave, chant, chant]
Time ▶	1	2	3	4	5	67

This arrangement makes logical sense, but it isn't quite compatible with DCTV's data model. Note that the value of each cell is actually a list! Unlike some databases, DCTV does not support composite (multi-part) values as column values. But here we apparently have composite values in the cells. How do we represent these spans as tables? By normalization.

Stack contents
stack_id	depth	token
1	0	stir
2	0	wave
3	0	wave
3	1	chant
4	0	wave
4	1	chant
4	2	chant

Normalized stackified spells
Time ▶	1	2	3	4	5	67
Stack Id	1		2	3	4
Time ▶	1	2	3	4	5	67

Now, we can look up the stack corresponding to each span by looking at that span's stack id payload and joining it against the stack contents table. The stackify DCTV intrinsic processes any kind of stack into these two tables (the stack contents regular table and the "stack history" span table).
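
As a rough illustration of the normalization, the following Python sketch turns the spell start and stop records above into the two tables. It is a model, not the stackify intrinsic itself: the function name and argument shapes here are hypothetical, the events are assumed to be balanced, stops are processed before starts that share a timestamp, and a stack that recurs later would reuse its earlier id.

```python
def stackify_sketch(starts, stops):
    """Return (stack_contents, stack_history) for balanced start/stop events.

    starts: (ts, token) events that push a token.
    stops:  timestamps that pop the most recently pushed token.
    """
    events = sorted(
        [(ts, 1, token) for ts, token in starts]
        + [(ts, 0, None) for ts in stops],
        key=lambda event: (event[0], event[1]),
    )
    stack, stack_ids, contents, history = [], {}, [], []
    prev_ts = None
    for ts, is_start, token in events:
        if stack and ts > prev_ts:
            key = tuple(stack)
            if key not in stack_ids:
                stack_ids[key] = len(stack_ids) + 1
                contents.extend(
                    (stack_ids[key], depth, tok) for depth, tok in enumerate(key)
                )
            history.append((prev_ts, ts - prev_ts, stack_ids[key]))
        if is_start:
            stack.append(token)
        else:
            stack.pop()
        prev_ts = ts
    return contents, history


spell_starts = [(1, "stir"), (3, "wave"), (4, "chant"), (5, "chant")]
spell_stops = [2, 7, 7, 7]
contents, history = stackify_sketch(spell_starts, spell_stops)
print(contents)  # rows of (stack_id, depth, token)
print(history)   # rows of (_ts, _duration, stack_id)
```

The empty stack between times two and three is represented by the absence of a span, just as "off" was for the Christmas tree light.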

Generating span tables from thin air

There are general utility functions to generate specialized span tables useful for composing with others. The generate_sequential_spans table-valued function generates a sequence of spans according to the start time, stop time, and duration specified in the call. It's useful for generating spans to quantize the timeline into discrete intervals and for generating "whole trace" spans that act as inputs to span joins.
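
The idea is simple enough to sketch in a couple of lines of Python. sequential_spans below is a hypothetical model, not the signature of generate_sequential_spans, and truncating the last span at the stop time is our assumption about how a partial final interval might be handled.

```python
def sequential_spans(start, stop, duration):
    """Return back-to-back (ts, duration) spans covering [start, stop)."""
    return [(ts, min(duration, stop - ts)) for ts in range(start, stop, duration)]


print(sequential_spans(0, 10, 4))   # quantize the timeline into chunks
print(sequential_spans(0, 10, 10))  # a single "whole trace" span
```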

Each trace namespace has a few convenience functions for succinctly generating, using generate_sequential_spans, certain kinds of span tables. See the "standard library" reference below.

Dimensional analysis

DCTV provides a dimensional analysis feature to make it easy and natural to query traces using naturally-specified values and to avoid errors that can arise from accidental nonsensical combinations of incompatible units. Each quantity in a query is associated with a unit, and these units propagate through the query as it is processed. Quantities with different units combine according to the rules of dimensional analysis. DCTV also knows how to convert from one compatible unit to another, and it signals an error rather than produce a result that is dimensional nonsense.

Unit specification

Units come from two sources:

intrinsic tagging of quantities with units during trace parsing, and
explicit tagging of quantities with units in query syntax.

The syntax for specifying a unit is just adding the name of the unit after a numeric literal. For simple alphanumeric unit names, a bare word is sufficient, e.g., 4ns. For more complicated units that contain operators that SQL would otherwise interpret as part of expressions, the unit name needs to be quoted with backticks, as in 4`miles/hour`. Without the backticks, SQL would interpret 4miles/hour as an attempt to divide the quantity 4 by the column hour, which is probably not what we want.

Unit names

DCTV understands both common and abbreviated names for units. This document will eventually list all understood unit names; for the moment, see units.txt in the DCTV source code.

Unit conversion

Queries can explicitly convert units from one type to another using the IN operator.

In the DCTV REPL, column headers that denote a quantity with a unit list that unit in square brackets after the column name. For example, a value of 10.16 under a column whose name ends in [cm] is a quantity of 10.16 centimeters.

DCTV's unit analysis also understands rates. Multiplying a rate in miles per hour by a quantity of time yields a result in miles. The time unit need not be the literal unit used in the rate: DCTV will convert units as needed.
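The arithmetic behind these conversions can be sketched in Python. The conversion table and the convert function below are illustrative assumptions, not DCTV's units.txt or its IN operator. The four-inch input is chosen only because it reproduces the 10.16 cm figure mentioned above.

```python
# Conversion factors to a base unit per dimension (metres and seconds).
# Illustrative only; this is not DCTV's unit database.
TO_BASE = {
    "in": ("length", 0.0254),
    "cm": ("length", 0.01),
    "mile": ("length", 1609.344),
    "s": ("time", 1.0),
    "min": ("time", 60.0),
    "hour": ("time", 3600.0),
}


def convert(value: float, src: str, dst: str) -> float:
    """Convert value from unit src to unit dst, rejecting mismatched dimensions."""
    src_dim, src_factor = TO_BASE[src]
    dst_dim, dst_factor = TO_BASE[dst]
    if src_dim != dst_dim:
        raise ValueError(f"cannot convert {src} ({src_dim}) to {dst} ({dst_dim})")
    return value * src_factor / dst_factor


# Length conversion: four inches is about 10.16 centimeters.
four_inches_in_cm = convert(4, "in", "cm")

# Rate times time: 60 miles/hour sustained for 30 minutes.
# The minutes are converted to hours before multiplying, so hours cancel.
miles_travelled = 60 * convert(30, "min", "hour")
```

The second calculation shows why the time operand does not have to be expressed in hours: it is converted into the rate's time unit first, which leaves a result in miles.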

Differences from standard SQL

Nested namespaces

Standard SQL provides a two-level namespace for tables: each table is named by an optional schema (followed by a dot), and then a table name. DCTV, by contrast, allows for arbitrarily deep nesting of namespaces, with each namespace component separated by a period. (SQL's standard syntax is a special case.) We use the nested namespace syntax to talk about specific tables and views embedded in a "trace sub-namespace", which we form when we mount a trace into the global SQL namespace.

Keyword arguments

Normal SQL allows only positional arguments to function calls. DCTV allows Python-style keyword arguments as well, with each keyword separated from its argument value by the "=>" token. See the syntax reference for details.

Extended table-valued-function-call syntax

DCTV exposes some facilities as table-valued functions. The arguments to these functions are evaluated in a context different from normal SQL expression evaluation, and in this context, DCTV supports extended syntax, including the use of list and dictionary literals. (Table-valued functions are Python functions and these list and dictionary literals become list and dict values inside calls.) See the syntax reference for details.

Miscellaneous syntax extensions

DCTV's syntax is designed to be forgiving so that users spend less time fighting it. Wherever SQL requires a list of something to be comma-separated, DCTV allows and ignores a trailing comma. Where SQL requires a list terminator (e.g., semicolons after each query statement), DCTV allows users to omit the list terminator.

DCTV recognizes <> and the C-style != operators as equivalent.

DCTV provides the "spaceship" and "anti-spaceship" operators <=> and <!=>, respectively, which act like == and !=, except that they treat NULL as being equal to itself. (MySQL calls these operators "null safe comparison operators".)
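The difference from ordinary SQL equality is easiest to see as a truth table. The Python sketch below models NULL as None. The function names are hypothetical, and the rule that NULL compared null-safely with a non-NULL value is false is an assumption borrowed from MySQL's behavior for the same operator.

```python
from typing import Optional


def sql_eq(a, b) -> Optional[bool]:
    """Standard SQL '=': any comparison involving NULL yields NULL (None)."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a, b) -> bool:
    """Model of '<=>': NULL equals NULL and differs from any non-NULL value."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def null_safe_ne(a, b) -> bool:
    """Model of '<!=>': the negation of '<=>'."""
    return not null_safe_eq(a, b)


# Compare the two notions of equality over a few operand pairs.
pairs = [(1, 1), (1, 2), (None, 1), (None, None)]
table = [(a, b, sql_eq(a, b), null_safe_eq(a, b)) for a, b in pairs]
```

The null-safe forms always return a definite true or false, which makes them convenient in join conditions and filters where NULL rows would otherwise drop out silently.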

In addition to the standard SQL -- comment prefix, DCTV allows the use of # as a Python-style comment prefix and the use of /* and */ for C-style block comments.

Missing features

DCTV does not implement some features of more traditional databases. The following table summarizes the features not provided, whether we plan to provide them, and any additional relevant information.

Feature	Status	Comment
INSERT/UPDATE/DELETE	Not planned	DCTV is immutable
SQL1999 window functions	Planned
SQL/PL	Planned	Will be accelerated
Recursive CTEs	Planned
Correlated subqueries	Planned

Syntax reference

SQL Statement list

The REPL accepts statement lists as top-level input.

Operator	Description
*	Multiply
/	True division (yield float)
%	Modulus
//	Floor division (truncates toward zero)
+	Addition
-	Subtraction
<<	Left shift
>>	Right shift
&	Bitwise AND
|	Bitwise OR
BETWEEN	Standard SQL
= ==	Equality
<=> IS	NULL-safe equality
<!=> IS NOT	NULL-safe inequality
>=	Greater than or equal
>	Greater than
<=	Less than or equal
<	Less than
!= <>	Inequality
NOT	Logical negation
AND	Logical conjunction
OR	Logical disjunction
