rev-cache
=========

The revision cache API ('rev-cache') provides a method for efficiently storing
and accessing commit branch sections.  Such branch slices are defined by a
series of start/top (interesting) and end/bottom (uninteresting) commits.  Each
slice contains information on commits in topological order.  Recorded with each
commit is:

* All intra-slice topological relations, encoded into path "channels" (see
  'Mechanics' for full explanation).
* Object meta-data: type, SHA-1, size, date (for commits).
* Objects introduced by that commit, not present in the its cached parents.

In addition to the API, basic structures are exported for the possibility of
direct access.

The API
-------
You can find the function prototypes in `revision.h`.

Data Structures
~~~~~~~~~~~~~~~
The `rev_cache_info` struct holds all the options and flags for the API.

----
struct rev_cache_info {
	/* generation flags */
	unsigned objects : 1,
		legs : 1,
		make_index : 1,
		fuse_me : 1;

	/* index inclusion */
	unsigned overwrite_all : 1;

	/* traversal flags */
	unsigned add_to_pending : 1;

	/* fuse options */
	unsigned int ignore_size;

	/* reserved */
	struct rev_cache_slice_map *maps,
		*last_map;
};
----

The fields:

`objects`::
	Add non-commit objects to slice.

`legs`::
	Ensure end/bottom commits have no children.

`make_index`::
	Integrate newly-made slice into index.

`fuse_me`::
	This is specified if a fuse is occuring, and slices are to be reused.
	This option requires `maps` and `last_maps` to be initialized.

`overwrite_all`::
	When a cache slice is added to the index, sometimes overlap occures
	between it and other slices.  Normally, original index entries are kept
	unless the new entry represents a start commit (older entries are more
	likely to lead to greater in-slice traversals).  This options overrides
	that, and updates all entries of the new slice.

`add_to_pending`::
	Append unique non-commit objects to the `pending` object list in the
	passed `rev_info` instance.

`add_names`::
	Include non-commit object names in the pending object entries if
	`add_to_pending` is set.

`ignore_size`::
	If non-zero, ignore slices with size greater or equal to this during
fusion.

`maps`/`last_map`::
	An array of slice mappings, indexed by their id in the slice index
	header, to be re-used with `fuse_me`.  `last_map` points to the last
	mapping used, and should be initialized to 0.

Functions
~~~~~~~~~

init_rev_cache
^^^^^^^^^^^^^^
----
void init_rev_cache_info(
	struct rev_cache_info *rci OUT
)
----

Initialize `rci` to default options.

make_cache_slice
^^^^^^^^^^^^^^^^
----
int make_cache_slice(
	struct rev_cache_info *rci IN,
	struct rev_info *revs IN,
	struct commit_list **starts IN/OUT,
	struct commit_list **ends IN/OUT,
	unsigned char *cache_sha1 OUT
)
----

Create a cache slice based on either `revs` (if non-NULL) *or* the `starts` and
`ends` lists.  The actual list of start and end commits of the slice may be
different from the parameters, based on what defines the branch segment, and
this actual list is passed back through `starts` and `ends`.

The cache slice is identified via a SHA-1 generated from the actual start/end
commit lists.  `cache_sha1`, if non-NULL, can recieve the cache slice name.
`rci` is used to specify generation options, but can be NULL if you want
`make_cache_slice` to fall back on defaults.  Returns 0 on success, non-zero on
failure.

make_cache_index
^^^^^^^^^^^^^^^^
----
int make_cache_index(
	struct rev_cache_info *rci IN,
	unsigned char *cache_sha1 IN,
	int fd IN,
	unsigned int size IN
)
----

Add a slice to the rev-cache index.  `cache_sha1` is the identity hash of the
cache slice; `fd` is a file descriptor of the cache slice opened with
read/write privileges (the slice is not actually modified); `size` is the size
of the cache slice.  Although there are currently no options for index
updating, `rci` is a placeholder in case of future options.  Note that this
function is normally called by `make_cache_slice`.  Returns 0 on success,
non-zero on failure.

open_cache_slice
^^^^^^^^^^^^^^^^
----
int open_cache_slice(
	unsigned char *sha1 IN,
	int flags IN
)
----

Returns a file descriptor to a cache slice described by `sha1` hash, using
`flags` as the access mode.  This will follow cache slice pointers to one level
of indirection.

get_cache_slice
^^^^^^^^^^^^^^^
----
unsigned char *get_cache_slice(
	struct commit *commit IN
)
----

Given a commit object `get_cache_slice` will search the revision cache index
and return, if found, the cache slice SHA-1.

traverse_cache_slice
^^^^^^^^^^^^^^^^^^^^
----
int traverse_cache_slice(
	struct rev_info *revs IN/OUT,
	unsigned char *cache_sha1 IN,
	struct commit *commit IN,
	unsigned long *date_so_far IN/OUT,
	int *slop_so_far IN/OUT,
	struct commit_list ***queue OUT,
	struct commit_list **work IN/OUT
)
----

Traverse a specified cache slice.  An explanation of the each field:

`revs`::
	The revision walk instance.  `traverse_cache_slice` uses this for
	general options (e.g. which objects are included) and slice traversal
	options (in the `rev_cache_info` field).  If the `add_to_pending`
	option is specified, non-commit objects are appended to the `pending`
	object list field.

`cache_sha1`::
	SHA-1 identifying the cache slice to use.  This can be taken directly
	from `get_cache_slice`.

`commit`::
	The current commit object in the revision walk, i.e. the commit which
	inspired this slice traversal.  Although theoretically redundant in
	view of the `work` list, this simplifies interaction with normal
	revision walks, which pop commits from `work` before analyzing them.

`date_so_far`::
	The date of the oldest encountered interesting commit.  Passing NULL
	will let `traverse_cache_slice` use defaults.

`slop_so_far`::
	The `slop` value, a la revision.c.  This is a counter used to determine
	when to stop traversing, based on how many extra uninteresting commits
	should be encountered.  NULL will enable defaults, as above.

`queue`::
	Refers to a pointer to the head of a FIFO commit list, recieving the
	commits we've seen and added.

`work`::
	A date-ordered list of commits that have yet to be processed (i.e. seen
	but not added).  Commits from here present in the slice are removed
	(and, obviously, used as starting places for traversal), and any end
	commits encountered are inserted.

starts_from_slices
^^^^^^^^^^^^^^^^^^
----
void starts_from_slices(
	struct rev_info *revs OUT,
	unsigned int flags IN,
	unsigned char *which IN,
	int n IN
)
----

Will mark start-commits in certain rev-cache slices with `flag`, and added them
to the pending list of `revs`.  If `n` is zero, `starts_from_slices` will use
all slices.  Otherwise `which` will specify an *unseperated* list of cache
SHA-1s to use (20 bytes each), and `n` will contain the number of slices (i.e.
20 * `n` = size of `which`).

fuse_cache_slices
^^^^^^^^^^^^^^^^^
----
int fuse_cache_slices(
	struct rev_cache_info *rci IN,
	struct rev_info *revs IN
)
----

Generate a slice based on `revs`, replacing all encountered slices with one
(larger) slice.  The `ignore_size` field in `rci`, if non-zero, will dictate
which cache slice sizes to ignore in both traversal and replacement.

regenerate_cache_index
^^^^^^^^^^^^^^^^^^^^^^
----
int regenerate_cache_index(
	struct rev_cache_info *rci IN
)
----

Remake the revision cache index, including all the slices.  Currently no
options in `rci` exist for index (re)generation, but some may develop in the
future.

to/from_disked_rc_object/index_entry
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
----
struct rc_object/index_entry *from_disked_rc_object/index_entry(
	struct rc_object/index_entry_ondisk *src IN,
	struct rc_object/index_entry *dst OUT
)

struct rc_object/index_entry_ondisk *to_disked_rc_object/index_entry(
	struct rc_object/index_entry *src IN,
	struct rc_object/index_entry_ondisk *dst OUT
)
----

Functions to convert between the internal and storage (`_ondisk`) versions of
object and index entry structures.  These are necessary for direct access to
the cache slices.  If NULL is provided for `dst` a statically allocated
structure is used, and a pointer to the struct is returned.  Otherwise the
functions return `dst`.

Example Usage
-------------

A few examples to demonstrate usage:

.Creating a slice
----
/* pretend you're a porcelain for rev-cache reading from the command line */
struct rev_info revs;
struct rev_cache_info rci;

init_revisions(&revs, 0);
init_rci(&rci);

flags = 0;
for (i = 1; i < argc; i++) {
        if (!strcmp(argv[i], "--not"))
                flags ^= UNINTERESTING;
        else if(!strcmp(argv[i], "--fresh"))
                starts_from_slices(&revs, UNINTERESTING, 0, 0);
        else
                handle_revision_arg(argv[i], &revs, flags, 1);
}

/* we want to explicitly set certain options */
rci.objects = 0;

if (!make_cache_slice(&rci, &revs, 0, 0, cache_sha1))
        printf("made slice!  it's called %s\n", sha1_to_hex(cache_sha1));
----

.Traversing a slice
----
/* let's say you're walking the tree with a 'work' list of current heads and a
 * FILO output list 'out' */
out = 0;
outp = &out;

while (work) {
        struct commit *commit = pop_commit(&work);
        struct object *object = &commit->object;
        unsigned char *cache_sha1;

        if (cache_sha1 = get_cache_slice(object->sha1)) {
                /* note that this will instatiate any topo-relations
                 * as it goes */
                if (traverse_cache_slice(&revs, cache_sha1,
                        commit, 0, 0, /* use defaults */
                        &outp, &work) < 0)
                        die("I'm overreacting to a non-fatal cache error");
        } else {
                struct commit_list *parents = commit->parents;

                while (parents) {
                        struct commit *p = parents->item;
                        struct object *po = &p->object;

                        parents = parents->next;
                        if (po->flags & UNINTERESTING)
                                continue;

                        if (object->flags & UNINTERESTING)
                                po->flags |= UNINTERESTING;
                        else if (po->flags & SEEN)
                                continue;

                        if (!po->parsed)
                                parse_commit(p);
                        insert_by_date(p, &work);
                }

                if (object->flags & (SEEN | UNINTERESTING) == 0)
                        outp = &commit_list_insert(commit, outp)->next;
                object->flags |= SEEN;
        }
}
----

Some Internals
--------------
For more advanced usage, the slice and index file(s) may be accessed directly.
Relavant structures are availabe in `rev-cache.h`.

File Formats
~~~~~~~~~~~~

Cache Slices
^^^^^^^^^^^^
A slice has a basic fixed-size header, followed by a certain number of object
entries, then a NULL-seperated list of object names.  Commits are sorted in
topo-order, and each commit entry is followed by the objects added in that
commit.

----
         -- +--------------------------------+
header      | object number, etc...          |
         -- +--------------------------------+
commit      | commit info                    |
entry       | path data                      |
            +--------------------------------+
            | tree/blob info                 |
            +--------------------------------+
            | tree/blob info                 |
            +--------------------------------+
            | ...                            |
         -- +--------------------------------+
commit      | commit info                    |
entry       | path data                      |
            +--------------------------------+
            | tree/blob info                 |
            +--------------------------------+
            | ...                            |
         -- +--------------------------------+
...         ...
         -- +--------------------------------+
name list   | \0some_file_name\0             |
(note       +--------------------------------+
preceeding  | another_file\0                 |
null)       ...                              |
            +--------------------------------+
----

Here is the header:

----
struct rc_cache_slice_header {
	char signature[8]; /* REVCACHE */
	unsigned char version;
	uint32_t ofs_objects;

	uint32_t object_nr;
	uint16_t path_nr;
	uint32_t size;

	unsigned char sha1[20];

	uint32_t names_size;
};
----

Explanations:

`signature`::
	The identifying signature of cache slice file.  Always "REVCACHE".
`version`::
	The version number, currently 1.
`ofs_objects`::
	The byte offset at which the commit/object listing starts.  Always
	present at the 10th byte, regardless of file version.
`object_nr`::
	The total number of objects (commit + non-commit objects) present in
	the slice.
`path_nr`::
	The total number of paths/channels used in encoding the topological
	data.  Note that paths are reused (see 'Mechanics'), so there will
	never be more than a few hundred paths (if that) used.
`size`::
	The size of the slice *excluding* the name list.  In other words, the
	size of the portion mapped to memory.
`sha1`::
	The cache slice SHA-1.
`names_size`::
	The size of the name list.  `size` + `names_size` = size of slice

Revision Cache Index
^^^^^^^^^^^^^^^^^^^^
The index is a single file that associates SHA-1s with cache slices and file
positions.  It is somewhat similar to pack-file indexes, containing a fanout
table and a list of index entries sorted by hash.

----
         -- +--------------------------------+
header      | object #, cache #, etc.        |
         -- +--------------------------------+
sha1s of    | SHA-1                          |
slices      | ...                            |
         -- +--------------------------------+
fanout      | fanout[0x00]                   |
table       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
            | fanout[0xff]                   |
         -- +--------------------------------+
index       | SHA-1 of object                |
entries     | index of cache slice SHA-1     |
            | position in cache slice        |
            +--------------------------------+
            |                                |
            ...
            +--------------------------------+
----

The header:

----
struct rc_index_header {
	char signature[8]; /* REVINDEX */
	unsigned char version;
	uint32_t ofs_objects;

	uint32_t object_nr;
	unsigned char cache_nr;

	uint32_t max_date;
};
----

Explanations:

`signature`::
	Always "REVINDEX".
`version`::
	Version number, currently 1.
`ofs_objects`::
	Offset at which the entry objects begin.  This is more obviously useful
	in the index because the list of slice SHA-1s is variably-sized.
`object_nr`::
	Number of index entry objects present.
`cache_nr`::
	Number of cache slices to which the index maps, and hence the number of
slice SHA-1s listed.
`max_date`::
	The oldest commit represented in the index.  This is used to help speed
up lookup times by knowing what range of commits we definitely don't have
cached.  Normal usage of 'rev-cache' would leave no "holes" in its coverage of
commit history -- once a commit is cached, everything reachable from it should
be cached as well.  Most of the time refs are added to rev-cache simultaneous
as well.  This means that in most situations almost everything <= `max_date`
will be cached.

Mechanics
~~~~~~~~~

The most important part of rev-cache is its method of encoding topological
relations.  To ensure fluid traversal and reconstruction, commits are related
through high-level "streams"/"channels" rather than individual
interconnections.  Intuitively, rev-cache stores history the way gitk shows it:
commits strung up on lines, which interconnect at merges and branches.

Each commit is associated to a given channel/path via a 'path id', and
variable-length fields govern which paths (if any) are closed or opened at that
object.  This means that topo-data can be preserved in only a few bytes extra
per object entry.  Other information stored per entry is the sha-1 hash, type,
date, size, name, and status in cache slice.  Here is format of an object
entry, both on-disk and in-memory:

----
struct object_entry {
        unsigned type : 3;
        unsigned is_end : 1;
        unsigned is_start : 1;
        unsigned uninteresting : 1;
        unsigned include : 1;
        unsigned flags : 1;
        unsigned char sha1[20];

        unsigned char merge_nr;
        unsigned char split_nr;
        unsigned size_size : 3;
        unsigned name_size : 3;

        uint32_t date;
        uint16_t path;

        /* merge paths */
        /* split paths */
        /* size */
        /* name index */
};
----

An explanation of each field:

`type`::
	Object type
`is_end`::
	The commit has some parents outside the cache slice (all if slice has
	legs)
`is_start`::
	The commit has no children in cache slice
`uninteresting`::
	Run-time flag, used in traversal
`include`::
	Run-time flag, used in traversal (initialization)
`flags`::
	Currently unused, extra bit
`sha1`::
	Object SHA-1 hash

`merge_nr`::
	The number of paths the current channel diverges into; the current path
	ends upon any merge.
`split_nr`::
	The number of paths this commit ends; used on both merging and
	branching.
`size_size`::
	Number of bytes the object size takes up.
`name_size`::
	Number of bytes the name index takes up.

`date`::
	The date of the commit.
`path`::
	The path ID of the channel with which this commit is associated.

merge paths::
	The path IDs (16-bit) that are to be created.  Overflow is not a
	problem as path IDs are reused, leaving even complicated projects to
	consume no more than a few hundred IDs.
split paths::
	The path IDs (16-bit) that are to be ended.
size::
	The size split into the minimum number of bytes.  That is, 1-8 bytes
	representing the size, least-significant byte first.
name index::
	An offset for the null-seperated, object name list at the end of the
	cache slice.  Also split into the minimum number of bytes.

Each path ID refers to an index in a 'path array', which stores the current
status (eg. active, interestingness) of each channel.

Due to topo-relations and boundary tracking, all of a commit's parents must be
encountered before the path is reallocated.  This is achieved by using a
counter system per merge: starting at the parent number, the counter is
decremented as each parent is encountered (dictated by 'split paths'); at 0 the
path is cleared.

Boundary tracking is necessary because non-commits are stored relative to the
commit in which they were introduced.  If a series of commits is not included
in the output, the last interesting commit must be parsed manually to ensure
all objects are accounted for.

To prevent list-objects from recursing into trees that we've already taken care
of, the flag `FACE_VALUE` is introduced.  An object with this flag is not
explored (= "taken at face value"), significantly reducing I/O and processing
time.

Notes
~~~~~

Due to rev-cache's internal storage format, walking may lead to some
discrepencies between cached and uncached repositories.  Although noticeable to
users directly calling rev-list, these are unused or corner cases and
internally a non-issue.

First note that rev-cache records commits in topological order.  Large portions
of commit history will already be sorted topologically in the revision walk,
yielding a different output for unsorted calls to rev-list.  A more obscure
consquence occurs when two objects of the same SHA-1, but different name, are
introduced seperately in parallel branches: different names might be shown for
that object depending on which object entry was encountered first.

A similar disparity arises when two objects of same SHA-1/different name are
present in the same tree structure.  rev-cache, walking objects as they were
introduced, lists the youngest file's name; rev-list, walking the full trees
each commit, shows the first file encountered.