   1 Git index format
   2 ================
   3
   4 == The Git index file has the following format
   5
   6   All binary numbers are in network byte order.
   7   In a repository using the traditional SHA-1, checksums and object IDs
   8   (object names) mentioned below are all computed using SHA-1.  Similarly,
   9   in SHA-256 repositories, these values are computed using SHA-256.
  10   Version 2 is described here unless stated otherwise.
  11
  12    - A 12-byte header consisting of
  13
  14      4-byte signature:
  15        The signature is { 'D', 'I', 'R', 'C' } (stands for "dircache")
  16
  17      4-byte version number:
  18        The currently supported versions are 2, 3 and 4.
  19
  20      32-bit number of index entries.
  21
  22    - A number of sorted index entries (see below).
  23
  24    - Extensions
  25
  26      Extensions are identified by signature. Optional extensions can
  27      be ignored if Git does not understand them.
  28
  29      Git currently supports cache tree and resolve undo extensions.
  30
  31      4-byte extension signature. If the first byte is 'A'..'Z' the
  32      extension is optional and can be ignored.
  33
  34      32-bit size of the extension
  35
  36      Extension data
  37
  38    - Hash checksum over the content of the index file before this checksum.
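
The overall layout above can be checked with a few lines of Python. This is a minimal sketch, not Git's own code: the helper names `parse_header` and `checksum_ok` are hypothetical, and the hash defaults to SHA-1 as in a traditional repository.

```python
import hashlib
import struct

def parse_header(data: bytes):
    """Return (version, entry_count) from the 12-byte index header."""
    # Network byte order: 4-byte signature, 32-bit version, 32-bit count.
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"DIRC":
        raise ValueError("not a Git index file")
    if version not in (2, 3, 4):
        raise ValueError(f"unsupported index version {version}")
    return version, count

def checksum_ok(data: bytes, hash_name: str = "sha1") -> bool:
    """Compare the trailing hash with a hash of everything before it."""
    size = hashlib.new(hash_name).digest_size
    body, trailer = data[:-size], data[-size:]
    return hashlib.new(hash_name, body).digest() == trailer

# Build an empty version 2 index in memory (header plus checksum only).
body = struct.pack(">4sII", b"DIRC", 2, 0)
index = body + hashlib.sha1(body).digest()
print(parse_header(index), checksum_ok(index))
```

For a SHA-256 repository, the same sketch would be called with `hash_name="sha256"`.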
  39
  40 == Index entry
  41
  42   Index entries are sorted in ascending order on the name field,
  43   interpreted as a string of unsigned bytes (i.e. memcmp() order, no
  44   localization, no special casing of directory separator '/'). Entries
  45   with the same name are sorted by their stage field.
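
As an illustration of this ordering, the sketch below sorts (name, stage) pairs the way the paragraph describes. The tuple layout and the `sort_key` helper are assumptions made for the example. Python compares `bytes` objects by unsigned byte value, which matches memcmp() order.

```python
def sort_key(entry):
    """Order by raw name bytes first, then by merge stage."""
    name, stage = entry
    return (name, stage)

entries = [
    (b"a/b", 0),
    (b"a.c", 0),
    (b"a", 0),
    (b"conflict", 3),
    (b"conflict", 1),
    (b"a-b", 0),
]

# '-' (0x2d) < '.' (0x2e) < '/' (0x2f): the separator gets no special casing.
for name, stage in sorted(entries, key=sort_key):
    print(name.decode(), stage)
```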
  46
  47   An index entry typically represents a file. However, if sparse-checkout
  48   is enabled in cone mode (`core.sparseCheckoutCone` is enabled) and the
  49   `extensions.sparseIndex` extension is enabled, then the index may
  50   contain entries for directories outside of the sparse-checkout definition.
  51   These entries have mode `040000`, include the `SKIP_WORKTREE` bit, and
  52   the path ends in a directory separator.
  53
  54   32-bit ctime seconds, the last time a file's metadata changed
  55     this is stat(2) data
  56
  57   32-bit ctime nanosecond fractions
  58     this is stat(2) data
  59
  60   32-bit mtime seconds, the last time a file's data changed
  61     this is stat(2) data
  62
  63   32-bit mtime nanosecond fractions
  64     this is stat(2) data
  65
  66   32-bit dev
  67     this is stat(2) data
  68
  69   32-bit ino
  70     this is stat(2) data
  71
  72   32-bit mode; the high 16 bits are unused (zero) and the low 16 bits are split into (high to low bits)
  73
  74     4-bit object type
  75       valid values in binary are 1000 (regular file), 1010 (symbolic link)
  76       and 1110 (gitlink)
  77
  78     3-bit unused
  79
  80     9-bit unix permission. Only 0755 and 0644 are valid for regular files.
  81     Symbolic links and gitlinks have value 0 in this field.
  82
  83   32-bit uid
  84     this is stat(2) data
  85
  86   32-bit gid
  87     this is stat(2) data
  88
  89   32-bit file size
  90     This is the on-disk size from stat(2), truncated to 32-bit.
  91
  92   Object name for the represented object
  93
  94   A 16-bit 'flags' field split into (high to low bits)
  95
  96     1-bit assume-valid flag
  97
  98     1-bit extended flag (must be zero in version 2)
  99
 100     2-bit stage (during merge)
 101
 102     12-bit name length if the length is less than 0xFFF; otherwise 0xFFF
 103     is stored in this field.
 104
 105   (Version 3 or later) A 16-bit field, only applicable if the
 106   "extended flag" above is 1, split into (high to low bits).
 107
 108     1-bit reserved for future use
 109
 110     1-bit skip-worktree flag (used by sparse checkout)
 111
 112     1-bit intent-to-add flag (used by "git add -N")
 113
 114     13-bit unused, must be zero
 115
 116   Entry path name (variable length) relative to top level directory
 117     (without leading slash). '/' is used as path separator. The special
 118     path components ".", ".." and ".git" (without quotes) are disallowed.
 119     Trailing slash is also disallowed.
 120
 121     The exact encoding is undefined, but the '.' and '/' characters
 122     are encoded in 7-bit ASCII and the encoding cannot contain a NUL
 123     byte (iow, this is a UNIX pathname).
 124
 125   (Version 4) In version 4, the entry path name is prefix-compressed
 126     relative to the path name for the previous entry (the very first
 127     entry is encoded as if the path name for the previous entry is an
 128     empty string).  At the beginning of an entry, an integer N in the
 129     variable width encoding (the same encoding as the offset is encoded
 130     for OFS_DELTA pack entries; see pack-format.txt) is stored, followed
 131     by a NUL-terminated string S.  Removing N bytes from the end of the
 132     path name for the previous entry, and replacing it with the string S
 133     yields the path name for this entry.
 134
 135   1-8 nul bytes as necessary to pad the entry to a multiple of eight bytes
 136   while keeping the name NUL-terminated.
 137
 138   (Version 4) In version 4, the padding after the pathname does not
 139   exist.
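
The field list above can be turned into a small parser for version 2 and 3 entries. This is a sketch under stated assumptions: `parse_entry` and the dictionary keys are hypothetical names, the object name is taken to be 20 bytes (SHA-1), and version 4 path compression is not handled.

```python
import struct

# ctime sec/nsec, mtime sec/nsec, dev, ino, mode, uid, gid, file size
FIXED = struct.Struct(">10I")

def parse_entry(data: bytes, offset: int, hash_len: int = 20):
    """Parse one version 2/3 entry; return (fields, offset of next entry)."""
    start = offset
    stat = FIXED.unpack_from(data, offset)
    offset += FIXED.size
    oid = data[offset:offset + hash_len]
    offset += hash_len
    (flags,) = struct.unpack_from(">H", data, offset)
    offset += 2
    mode = stat[6]
    entry = {
        "mode": mode,
        "object_type": (mode >> 12) & 0xF,   # 0b1000, 0b1010 or 0b1110
        "permissions": mode & 0o777,
        "size": stat[9],
        "oid": oid,
        "assume_valid": bool(flags & 0x8000),
        "extended": bool(flags & 0x4000),
        "stage": (flags >> 12) & 0x3,
    }
    if entry["extended"]:                    # version 3 or later only
        (extra,) = struct.unpack_from(">H", data, offset)
        offset += 2
        entry["skip_worktree"] = bool(extra & 0x4000)
        entry["intent_to_add"] = bool(extra & 0x2000)
    # The 12-bit length saturates at 0xFFF, so scan for the NUL instead.
    end = data.index(b"\0", offset)
    entry["path"] = data[offset:end]
    # 1-8 NUL bytes pad the whole entry to a multiple of eight bytes.
    length = end - start
    return entry, start + (length + 8) // 8 * 8

# Hand-built entry for a regular file "hello.txt" with an all-zero object name.
raw = FIXED.pack(0, 0, 0, 0, 0, 0, 0o100644, 0, 0, 5) + bytes(20)
raw += struct.pack(">H", len(b"hello.txt")) + b"hello.txt"
raw += b"\0" * (8 - len(raw) % 8)
entry, next_offset = parse_entry(raw, 0)
print(entry["path"], oct(entry["mode"]), entry["stage"], next_offset)
```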
 140
 141   Interpretation of index entries in split index mode is completely
 142   different. See below for details.
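
The version 4 prefix compression described earlier in this section can be sketched as follows. `decode_varint` and `decode_path` are hypothetical helper names. The integer decoding follows the offset encoding used for OFS_DELTA pack entries, where each continuation byte adds one before shifting.

```python
def decode_varint(data: bytes, offset: int):
    """Decode an OFS_DELTA-style variable width integer at offset."""
    byte = data[offset]
    offset += 1
    value = byte & 0x7F
    while byte & 0x80:                       # high bit set: more bytes follow
        byte = data[offset]
        offset += 1
        value = ((value + 1) << 7) | (byte & 0x7F)
    return value, offset

def decode_path(previous: bytes, data: bytes, offset: int):
    """Rebuild one path from the previous path, N, and the string S."""
    n, offset = decode_varint(data, offset)
    end = data.index(b"\0", offset)
    suffix = data[offset:end]
    # Drop N bytes from the end of the previous path, then append S.
    return previous[:len(previous) - n] + suffix, end + 1

# Three consecutive path fields as they might appear in version 4.
stream = (bytes([0]) + b"Documentation/a.txt\0"
          + bytes([5]) + b"b.txt\0"
          + bytes([19]) + b"Makefile\0")

paths, previous, position = [], b"", 0
while position < len(stream):
    previous, position = decode_path(previous, stream, position)
    paths.append(previous)
print(paths)
```

Only the path fields are shown. In a real entry the stat data, object name and flags still come before each path.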
 143
 144 == Extensions
 145
 146 === Cache tree
 147
 148   Since the index does not record entries for directories, the cache
 149   entries cannot describe tree objects that already exist in the object
 150   database for regions of the index that are unchanged from an existing
 151   commit. The cache tree extension stores a recursive tree structure that
 152   describes the trees that already exist and completely match sections of
 153   the cache entries. This speeds up tree object generation from the index
 154   for a new commit by only computing the trees that are "new" to that
 155   commit. It also assists when comparing the index to another tree, such
 156   as `HEAD^{tree}`, since sections of the index can be skipped when a tree
 157   comparison demonstrates equality.
 158
 159   The recursive tree structure uses nodes that store a number of cache
 160   entries, a list of subnodes, and an object ID (OID). The OID references
 161   the existing tree for that node, if it is known to exist. The subnodes
 162   correspond to subdirectories that themselves have cache tree nodes. The
 163   number of cache entries corresponds to the number of cache entries in
 164   the index that describe paths within that tree's directory.
 165
 166   The cache tree extension tracks the full directory structure, but this
 167   is generally smaller than the full cache entry list.
 168
 169   When a path is updated in the index, Git invalidates all nodes of the
 170   recursive cache tree corresponding to the parent directories of that
 171   path. We store these tree nodes as being "invalid" by using "-1" as the
 172   number of cache entries. Invalid nodes still store a span of index
 173   entries, allowing Git to focus its efforts when reconstructing a full
 174   cache tree.
 175
 176   The signature for this extension is { 'T', 'R', 'E', 'E' }.
 177
 178   A series of entries fills the entire extension; each entry
 179   consists of:
 180
 181   - NUL-terminated path component (relative to its parent directory);
 182
 183   - ASCII decimal number of entries in the index that are covered by the
 184     tree this entry represents (entry_count);
 185
 186   - A space (ASCII 32);
 187
 188   - ASCII decimal number that represents the number of subtrees this
 189     tree has;
 190
 191   - A newline (ASCII 10); and
 192
 193   - Object name for the object that would result from writing this span
 194     of index as a tree.
 195
 196   An entry can be in an invalidated state and is represented by having
 197   a negative number in the entry_count field. In this case, there is no
 198   object name and the next entry starts immediately after the newline.
 199   When writing an invalid entry, -1 should always be used as entry_count.
 200
 201   The entries are written out in the top-down, depth-first order.  The
 202   first entry represents the root level of the repository, followed by the
 203   first subtree--let's call this A--of the root level (with its name
 204   relative to the root level), followed by the first subtree of A (with
 205   its name relative to A), and so on. The specified number of subtrees
 206   indicates when the current level of the recursive stack is complete.
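
A recursive reader for the entry layout above might look like the sketch below. `parse_cache_tree` and the dictionary keys are hypothetical names, and a 20-byte (SHA-1) object name is assumed. The input is the extension data only, without the signature and size.

```python
def parse_cache_tree(data: bytes, offset: int = 0, hash_len: int = 20):
    """Parse one cache tree entry and, depth-first, all of its subtrees."""
    end = data.index(b"\0", offset)
    name = data[offset:end]                  # empty for the root entry
    offset = end + 1
    end = data.index(b" ", offset)
    entry_count = int(data[offset:end])      # ASCII decimal, may be negative
    offset = end + 1
    end = data.index(b"\n", offset)
    subtree_count = int(data[offset:end])
    offset = end + 1
    oid = None
    if entry_count >= 0:                     # invalidated entries have no OID
        oid = data[offset:offset + hash_len]
        offset += hash_len
    children = []
    for _ in range(subtree_count):
        child, offset = parse_cache_tree(data, offset, hash_len)
        children.append(child)
    node = {"name": name, "entry_count": entry_count,
            "oid": oid, "children": children}
    return node, offset

# Root covering 3 index entries with one invalidated subtree "lib".
payload = b"\0" + b"3 1\n" + b"\x11" * 20 + b"lib\0" + b"-1 0\n"
root, consumed = parse_cache_tree(payload)
print(root["entry_count"], [c["name"] for c in root["children"]])
```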
 207
 208 === Resolve undo
 209
 210   A conflict is represented in the index as a set of higher stage entries.
 211   When a conflict is resolved (e.g. with "git add path"), these higher
 212   stage entries will be removed and a stage-0 entry with proper resolution
 213   is added.
 214
 215   When these higher stage entries are removed, they are saved in the
 216   resolve undo extension, so that conflicts can be recreated (e.g. with
 217   "git checkout -m"), in case users want to redo a conflict resolution
 218   from scratch.
 219
 220   The signature for this extension is { 'R', 'E', 'U', 'C' }.
 221
 222   A series of entries fills the entire extension; each entry
 223   consists of:
 224
 225   - NUL-terminated pathname the entry describes (relative to the root of
 226     the repository, i.e. full pathname);
 227
 228   - Three NUL-terminated ASCII octal numbers, entry mode of entries in
 229     stage 1 to 3 (a missing stage is represented by "0" in this field);
 230     and
 231
 232   - At most three object names of the entry in stages from 1 to 3
 233     (nothing is written for a missing stage).
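
The resolve undo entries can be read back with a loop like the one sketched here. `parse_resolve_undo` is a hypothetical name, object names are assumed to be 20 bytes (SHA-1), and the input is the extension data without its signature and size.

```python
def parse_resolve_undo(data: bytes, hash_len: int = 20):
    """Return a list of (path, [(mode, oid) or None for stages 1 to 3])."""
    records = []
    offset = 0
    while offset < len(data):
        end = data.index(b"\0", offset)
        path = data[offset:end]
        offset = end + 1
        modes = []
        for _ in range(3):                   # three ASCII octal modes
            end = data.index(b"\0", offset)
            modes.append(int(data[offset:end], 8))
            offset = end + 1
        stages = []
        for mode in modes:
            if mode == 0:                    # missing stage: no object name
                stages.append(None)
            else:
                stages.append((mode, data[offset:offset + hash_len]))
                offset += hash_len
        records.append((path, stages))
    return records

# One conflicted path that had stages 1 and 3 but no stage 2.
payload = (b"file.txt\0" + b"100644\0" + b"0\0" + b"100644\0"
           + b"\xaa" * 20 + b"\xbb" * 20)
print(parse_resolve_undo(payload))
```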
 234
 235 === Split index
 236
 237   In split index mode, the majority of index entries could be stored
 238   in a separate file. This extension records the changes to be made on
 239   top of that to produce the final index.
 240
 241   The signature for this extension is { 'l', 'i', 'n', 'k' }.
 242
 243   The extension consists of:
 244
 245   - Hash of the shared index file. The shared index file path
 246     is $GIT_DIR/sharedindex.<hash>. If all bits are zero, the
 247     index does not require a shared index file.
 248
 249   - An ewah-encoded delete bitmap, each bit represents an entry in the
 250     shared index. If a bit is set, its corresponding entry in the
 251     shared index will be removed from the final index.  Note: because
 252     a delete operation changes index entry positions, and the replace
 253     phase needs the original positions, it is best to only mark
 254     entries for removal, then do a mass deletion after replacement.
 255
 256   - An ewah-encoded replace bitmap, each bit represents an entry in
 257     the shared index. If a bit is set, its corresponding entry in the
 258     shared index will be replaced with an entry in this index
 259     file. All replaced entries are stored in sorted order in this
 260     index. The first "1" bit in the replace bitmap corresponds to the
 261     first index entry, the second "1" bit to the second entry and so
 262     on. Replaced entries may have empty path names to save space.
 263
 264   The remaining index entries after replaced ones will be added to the
 265   final index. These added entries are also sorted by entry name then
 266   stage.
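
The replace-then-delete order can be illustrated with a simplified model. In this sketch the two bitmaps are already decoded into Python sets of positions (EWAH decoding is out of scope here), entries are `(name, stage, payload)` tuples, and `merge_split_index` is a hypothetical name. Reusing the shared entry's name for a replacement with an empty path is an assumption drawn from the note about empty path names.

```python
def merge_split_index(shared, delete_bits, replace_bits, own_entries):
    """Combine a shared index with the entries stored in this index file."""
    result = list(shared)
    replaced = sorted(replace_bits)
    # The k-th set bit of the replace bitmap pairs with the k-th own entry.
    for position, (name, stage, payload) in zip(replaced, own_entries):
        if not name:                         # empty path: keep the old name
            name = result[position][0]
        result[position] = (name, stage, payload)
    # Delete only now, so the positions above referred to the shared index.
    result = [e for i, e in enumerate(result) if i not in delete_bits]
    # Entries after the replacements are additions to the final index.
    result.extend(own_entries[len(replaced):])
    # A plain sort stands in for merging two already sorted lists.
    result.sort(key=lambda e: (e[0], e[1]))
    return result

shared = [(b"a", 0, "old-a"), (b"b", 0, "old-b"), (b"c", 0, "old-c")]
own = [(b"", 0, "new-a"), (b"bb", 0, "new-bb")]
final = merge_split_index(shared, delete_bits={2}, replace_bits={0},
                          own_entries=own)
print(final)
```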
 267
 268 == Untracked cache
 269
 270   Untracked cache saves the untracked file list and necessary data to
 271   verify the cache. The signature for this extension is { 'U', 'N',
 272   'T', 'R' }.
 273
 274   The extension starts with
 275
 276   - A sequence of NUL-terminated strings, preceded by the size of the
 277     sequence in variable width encoding. Each string describes the
 278     environment where the cache can be used.
 279
 280   - Stat data of $GIT_DIR/info/exclude. See "Index entry" section from
 281     ctime field until "file size".
 282
 283   - Stat data of core.excludesFile
 284
 285   - 32-bit dir_flags (see struct dir_struct)
 286
 287   - Hash of $GIT_DIR/info/exclude. A null hash means the file
 288     does not exist.
 289
 290   - Hash of core.excludesFile. A null hash means the file does
 291     not exist.
 292
 293   - NUL-terminated string of per-dir exclude file name. This usually
 294     is ".gitignore".
 295
 296   - The number of following directory blocks, variable width
 297     encoding. If this number is zero, the extension ends here with a
 298     following NUL.
 299
 300   - A number of directory blocks in depth-first-search order, each
 301     consists of
 302
 303     - The number of untracked entries, variable width encoding.
 304
 305     - The number of sub-directory blocks, variable width encoding.
 306
 307     - The directory name terminated by NUL.
 308
 309     - A number of untracked file/dir names terminated by NUL.
 310
 311 The remaining data of each directory block is grouped by type:
 312
 313   - An ewah bitmap, the n-th bit marks whether the n-th directory has
 314     valid untracked cache entries.
 315
 316   - An ewah bitmap, the n-th bit records "check-only" bit of
 317     read_directory_recursive() for the n-th directory.
 318
 319   - An ewah bitmap, the n-th bit indicates whether hash and stat data
 320     is valid for the n-th directory and exists in the next data.
 321
 322   - An array of stat data. The n-th data corresponds with the n-th
 323     "one" bit in the previous ewah bitmap.
 324
 325   - An array of hashes. The n-th hash corresponds with the n-th "one" bit
 326     in the previous ewah bitmap.
 327
 328   - One NUL.
 329
 330 == File System Monitor cache
 331
 332   The file system monitor cache tracks files for which the core.fsmonitor
 333   hook has told us about changes.  The signature for this extension is
 334   { 'F', 'S', 'M', 'N' }.
 335
 336   The extension starts with
 337
 338   - 32-bit version number: the currently supported versions are 1 and 2.
 339
 340   - (Version 1)
 341     64-bit time: the extension data reflects all changes through the given
 342         time which is stored as the nanoseconds elapsed since midnight,
 343         January 1, 1970.
 344
 345   - (Version 2)
 346     A null terminated string: an opaque token defined by the file system
 347     monitor application.  The extension data reflects all changes relative
 348     to that token.
 349
 350   - 32-bit bitmap size: the size of the CE_FSMONITOR_VALID bitmap.
 351
 352   - An ewah bitmap, the n-th bit indicates whether the n-th index entry
 353     is not CE_FSMONITOR_VALID.
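
The two header variants can be told apart as in the sketch below. `parse_fsmonitor` is a hypothetical name, the bitmap size is assumed to count bytes, and the EWAH bitmap itself is returned undecoded.

```python
import struct

def parse_fsmonitor(data: bytes):
    """Return (version, token, raw_bitmap) from FSMN extension data."""
    (version,) = struct.unpack_from(">I", data, 0)
    offset = 4
    if version == 1:
        # 64-bit timestamp in nanoseconds since the epoch
        (token,) = struct.unpack_from(">Q", data, offset)
        offset += 8
    elif version == 2:
        # opaque NUL-terminated token chosen by the monitor application
        end = data.index(b"\0", offset)
        token = data[offset:end]
        offset = end + 1
    else:
        raise ValueError(f"unsupported fsmonitor version {version}")
    (bitmap_size,) = struct.unpack_from(">I", data, offset)
    offset += 4
    return version, token, data[offset:offset + bitmap_size]

# Version 2 data with a made-up token and a 4-byte placeholder bitmap.
payload = (struct.pack(">I", 2) + b"example-token\0"
           + struct.pack(">I", 4) + b"\x00\x00\x00\x00")
print(parse_fsmonitor(payload))
```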
 354
 355 == End of Index Entry
 356
 357   The End of Index Entry (EOIE) is used to locate the end of the variable
 358   length index entries and the beginning of the extensions. Code can take
 359   advantage of this to quickly locate the index extensions without having
 360   to parse through all of the index entries.
 361
 362   Because it must be loadable before the variable length cache
 363   entries and other index extensions, this extension must be written last.
 364   The signature for this extension is { 'E', 'O', 'I', 'E' }.
 365
 366   The extension consists of:
 367
 368   - 32-bit offset to the end of the index entries
 369
 370   - Hash over the extension types and their sizes (but not
 371         their contents).  E.g. if we have "TREE" extension that is N-bytes
 372         long, "REUC" extension that is M-bytes long, followed by "EOIE",
 373         then the hash would be:
 374
 375         Hash("TREE" + <binary representation of N> +
 376                 "REUC" + <binary representation of M>)
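
The hash in this example can be reproduced by walking the extension headers. This is a sketch with hypothetical helper names (`walk_extensions`, `eoie_hash`). It assumes that the "binary representation" of a size is the same 4-byte network-order value stored in the extension header, and it uses SHA-1.

```python
import hashlib
import struct

def walk_extensions(data: bytes, offset: int, end: int):
    """Yield (signature, size, payload_offset) for extensions in data."""
    while offset < end:
        signature, size = struct.unpack_from(">4sI", data, offset)
        yield signature, size, offset + 8
        offset += 8 + size

def eoie_hash(data: bytes, offset: int, end: int, hash_name: str = "sha1"):
    """Hash the signature and size of every extension preceding EOIE."""
    h = hashlib.new(hash_name)
    for signature, size, _ in walk_extensions(data, offset, end):
        if signature == b"EOIE":
            break
        h.update(signature + struct.pack(">I", size))
    return h.digest()

# Two placeholder extensions: "TREE" with 3 bytes, "REUC" with 5 bytes.
extensions = (b"TREE" + struct.pack(">I", 3) + b"abc"
              + b"REUC" + struct.pack(">I", 5) + b"12345")
digest = eoie_hash(extensions, 0, len(extensions))
print(digest.hex())
```

In a real index, `offset` would be the 32-bit value stored in the EOIE extension, and `end` would exclude the trailing checksum.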
 377
 378 == Index Entry Offset Table
 379
 380   The Index Entry Offset Table (IEOT) is used to help address the CPU
 381   cost of loading the index by enabling multi-threading the process of
 382   converting cache entries from the on-disk format to the in-memory format.
 383   The signature for this extension is { 'I', 'E', 'O', 'T' }.
 384
 385   The extension consists of:
 386
 387   - 32-bit version (currently 1)
 388
 389   - A number of index offset entries each consisting of:
 390
 391     - 32-bit offset from the beginning of the file to the first cache entry
 392         in this block of entries.
 393
 394     - 32-bit count of cache entries in this block
 395
 396 == Sparse Directory Entries
 397
 398   When using sparse-checkout in cone mode, some entire directories within
 399   the index can be summarized by pointing to a tree object instead of the
 400   entire expanded list of paths within that tree. An index containing such
 401   entries is a "sparse index". Index format versions 4 and lower were not
 402   implemented with such entries in mind. Thus, for these versions, an
 403   index containing sparse directory entries will include this extension
 404   with signature { 's', 'd', 'i', 'r' }. As with the split-index extension,
 405   tools should avoid interacting with a sparse index unless they understand
 406   this extension.