Commit | Line | Data |
---|---|---|
e83c5163 | 1 | |
6ad6d3d3 LT |
2 | |
3 | ||
e83c5163 LT |
4 | GIT - the stupid content tracker |
5 | ||
6ad6d3d3 | 6 | |
e83c5163 LT |
7 | "git" can mean anything, depending on your mood. |
8 | ||
9 | - random three-letter combination that is pronounceable, and not | |
10 | actually used by any common UNIX command. The fact that it is a | |
90c4851b | 11 | mispronunciation of "get" may or may not be relevant. |
e83c5163 LT |
12 | - stupid. contemptible and despicable. simple. Take your pick from the |
13 | dictionary of slang. | |
14 | - "global information tracker": you're in a good mood, and it actually | |
15 | works for you. Angels sing, and a light suddenly fills the room. | |
16 | - "goddamn idiotic truckload of sh*t": when it breaks | |
17 | ||
18 | This is a stupid (but extremely fast) directory content manager. It | |
19 | doesn't do a whole lot, but what it _does_ do is track directory | |
20 | contents efficiently. | |
21 | ||
22 | There are two object abstractions: the "object database", and the | |
6ad6d3d3 LT |
23 | "current directory cache" aka "index". |
24 | ||
25 | ||
e83c5163 | 26 | |
d19938ab | 27 | The Object Database (GIT_OBJECT_DIRECTORY) |
e83c5163 | 28 | |
6ad6d3d3 | 29 | |
e83c5163 LT |
30 | The object database is literally just a content-addressable collection |
31 | of objects. All objects are named by their content, which is | |
32 | approximated by the SHA1 hash of the object itself. Objects may refer | |
33 | to other objects (by referencing their SHA1 hash), and so you can build | |
34 | up a hierarchy of objects. | |
35 | ||
6ad6d3d3 LT |
36 | All objects have a statically determined "type" aka "tag", which is |
37 | determined at object creation time, and which identifies the format of | |
90c4851b | 38 | the object (i.e. how it is used, and how it can refer to other objects). |
6ad6d3d3 LT |
39 | There are currently three different object types: "blob", "tree" and |
40 | "commit". | |
41 | ||
42 | A "blob" object cannot refer to any other object, and is, like the tag | |
43 | implies, a pure storage object containing some user data. It is used to | |
90c4851b | 44 | actually store the file data, i.e. a blob object is associated with some |
6ad6d3d3 LT |
45 | particular version of some file. |
46 | ||
47 | A "tree" object is an object that ties one or more "blob" objects into a | |
48 | directory structure. In addition, a tree object can refer to other tree | |
49 | objects, thus creating a directory hierarchy. | |
50 | ||
51 | Finally, a "commit" object ties such directory hierarchies together into | |
52 | a DAG of revisions - each "commit" is associated with exactly one tree | |
53 | (the directory hierarchy at the time of the commit). In addition, a | |
54 | "commit" refers to one or more "parent" commit objects that describe the | |
55 | history of how we arrived at that directory hierarchy. | |
56 | ||
57 | As a special case, a commit object with no parents is called the "root" | |
58 | object, and is the point of an initial project commit. Each project | |
59 | must have at least one root, and while you can tie several different | |
60 | root objects together into one project by creating a commit object which | |
61 | has two or more separate roots as its ultimate parents, that's probably | |
62 | just going to confuse people. So aim for the notion of "one root object | |
63 | per project", even if git itself does not enforce that. | |
64 | ||
65 | Regardless of object type, all objects share the following | |
66 | characteristics: they are all deflated with zlib, and have a header | |
67 | that not only specifies their tag, but also size information about the | |
68 | data in the object. It's worth noting that the SHA1 hash that is used | |
69 | to name the object is always the hash of this _compressed_ object, not | |
70 | the original data. | |
71 | ||
72 | As a result, the general consistency of an object can always be tested | |
e83c5163 LT |
73 | independently of the contents or the type of the object: all objects can |
74 | be validated by verifying that (a) their hashes match the content of the | |
75 | file and (b) the object successfully inflates to a stream of bytes that | |
76 | forms a sequence of <ascii tag without space> + <space> + <ascii decimal | |
77 | size> + <byte\0> + <binary object data>. | |
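The layout and the two checks above can be sketched in a few lines of Python. This is only an illustration of the described format, not the actual implementation: the function names are made up for the example, and the name is computed over the compressed bytes because that is what the text above states.

```python
import hashlib
import zlib


def make_object(tag, data):
    """Build an object; return (name, stored_bytes)."""
    # <ascii tag> + <space> + <ascii decimal size> + <byte\0> + <data>
    header = f"{tag} {len(data)}".encode("ascii") + b"\0"
    stored = zlib.compress(header + data)
    # Per the text above, the name is the SHA1 of the *compressed* object.
    name = hashlib.sha1(stored).hexdigest()
    return name, stored


def check_object(name, stored):
    """Run checks (a) and (b); return (tag, data) or raise ValueError."""
    # (a) the hash must match the content of the file
    if hashlib.sha1(stored).hexdigest() != name:
        raise ValueError("hash does not match file content")
    # (b) the object must inflate to "<tag> <size>\0<data>"
    raw = zlib.decompress(stored)
    header, nul, data = raw.partition(b"\0")
    if not nul:
        raise ValueError("missing NUL byte after header")
    tag, space, size = header.partition(b" ")
    if not tag or not space or not size.isdigit():
        raise ValueError("malformed header")
    if int(size) != len(data):
        raise ValueError("size field does not match data length")
    return tag.decode("ascii"), data
```

Note that neither check needs to know what a "blob", "tree" or "commit" looks like inside, which is the point of the paragraph above.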
78 | ||
6ad6d3d3 LT |
79 | The structured objects can further have their structure and connectivity |
80 | to other objects verified. This is generally done with the "fsck-cache" | |
81 | program, which generates a full dependency graph of all objects, and | |
82 | verifies their internal consistency (in addition to just verifying their | |
83 | superficial consistency through the hash). | |
84 | ||
85 | The object types in some more detail: | |
86 | ||
87 | BLOB: A "blob" object is nothing but a binary blob of data, and | |
88 | doesn't refer to anything else. There is no signature or any | |
89 | other verification of the data, so while the object is | |
90 | consistent (it _is_ indexed by its sha1 hash, so the data itself | |
91 | is certainly correct), it has absolutely no other attributes. | |
92 | No name associations, no permissions. It is purely a blob of | |
90c4851b | 93 | data (i.e. normally "file contents"). |
6ad6d3d3 LT |
94 | |
95 | In particular, since the blob is entirely defined by its data, | |
96 | if two files in a directory tree (or in multiple different | |
97 | versions of the repository) have the same contents, they will | |
bebc5c61 | 98 | share the same blob object. The object is totally independent |
6ad6d3d3 LT |
99 | of its location in the directory tree, and renaming a file does |
100 | not change the object that file is associated with in any way. | |
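A toy content-addressable store shows why identical files end up sharing one blob. This is a sketch under simplifying assumptions: the `ObjectStore` class is hypothetical, and it hashes the raw data directly (no header, no compression) to keep the example short.

```python
import hashlib


class ObjectStore:
    """Hypothetical in-memory store keyed by the SHA1 of the content."""

    def __init__(self):
        self.objects = {}

    def put(self, data):
        name = hashlib.sha1(data).hexdigest()
        # Storing the same content twice is a no-op: same name, same object.
        self.objects.setdefault(name, data)
        return name


store = ObjectStore()
first = store.put(b"int main(void) { return 0; }\n")   # say, src/main.c
second = store.put(b"int main(void) { return 0; }\n")  # say, old/copy.c
```

Here `first` and `second` are the same name and the store holds a single object, no matter what the two files were called.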
101 | ||
102 | TREE: The next hierarchical object type is the "tree" object. A tree | |
103 | object is a list of mode/name/blob data, sorted by name. | |
104 | Alternatively, the mode data may specify a directory mode, in | |
105 | which case instead of naming a blob, that name is associated | |
106 | with another TREE object. | |
107 | ||
108 | Like the "blob" object, a tree object is uniquely determined by | |
109 | the set contents, and so two separate but identical trees will | |
110 | always share the exact same object. This is true at all levels, | |
90c4851b | 111 | i.e. it's true for a "leaf" tree (which does not refer to any |
6ad6d3d3 LT |
112 | other trees, only blobs) as well as for a whole subdirectory. |
113 | ||
114 | For that reason a "tree" object is just a pure data abstraction: | |
115 | it has no history, no signatures, no verification of validity, | |
116 | except that since the contents are again protected by the hash | |
117 | itself, we can trust that the tree is immutable and its contents | |
118 | never change. | |
119 | ||
120 | So you can trust the contents of a tree to be valid, the same | |
121 | way you can trust the contents of a blob, but you don't know | |
122 | where those contents _came_ from. | |
123 | ||
124 | Side note on trees: since a "tree" object is a sorted list of | |
125 | "filename+content", you can create a diff between two trees | |
126 | without actually having to unpack two trees. Just ignore all | |
127 | common parts, and your diff will look right. In other words, | |
128 | you can effectively (and efficiently) tell the difference | |
129 | between any two random trees by O(n) where "n" is the size of | |
130 | the difference, rather than the size of the tree. | |
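The tree-diff idea in the side note can be sketched as a merge walk over two sorted entry lists that skips any subtree whose name is identical on both sides. The data layout here is an assumption made for the example: `store` maps a tree name to a sorted list of (name, mode, object) tuples, and `DIR_MODE` is a stand-in marker for "this entry is another tree".

```python
DIR_MODE = "40000"  # assumed marker for entries that name another tree


def diff_trees(store, a, b, prefix=""):
    """Yield (path, old, new) for each entry that differs between trees a and b.

    old/new are (mode, object) tuples, or None if the entry is absent.
    """
    if a == b:
        return  # same name means same contents: ignore the whole subtree
    ea, eb = store[a], store[b]
    i = j = 0
    while i < len(ea) or j < len(eb):
        if j == len(eb) or (i < len(ea) and ea[i][0] < eb[j][0]):
            name, old, new = ea[i][0], ea[i][1:], None   # only in a
            i += 1
        elif i == len(ea) or eb[j][0] < ea[i][0]:
            name, old, new = eb[j][0], None, eb[j][1:]   # only in b
            j += 1
        else:
            name, old, new = ea[i][0], ea[i][1:], eb[j][1:]
            i += 1
            j += 1
        if old == new:
            continue  # common part, nothing to report
        if old and new and old[0] == new[0] == DIR_MODE:
            # Both sides are directories: descend into the two subtrees.
            yield from diff_trees(store, old[1], new[1], prefix + name + "/")
        else:
            yield prefix + name, old, new
```

Unchanged subdirectories are never opened, so the work done tracks the size of the difference rather than the size of the whole hierarchy.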
131 | ||
132 | Side note 2 on trees: since the name of a "blob" depends | |
90c4851b | 133 | entirely and exclusively on its contents (i.e. there are no names |
6ad6d3d3 LT |
134 | or permissions involved), you can see trivial renames or |
135 | permission changes by noticing that the blob stayed the same. | |
136 | However, renames with data changes need a smarter "diff" implementation. | |
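Spotting a trivial rename then comes down to pairing a path that disappeared with a path that appeared when both name the same blob. A minimal sketch, assuming each tree has been flattened into a dict of path to blob name (the function and its inputs are hypothetical):

```python
def trivial_renames(old_tree, new_tree):
    """Return [(old_path, new_path)] for paths whose blob merely moved."""
    removed = {p: b for p, b in old_tree.items() if p not in new_tree}
    added = {p: b for p, b in new_tree.items() if p not in old_tree}

    # Index the removed paths by blob name so each added path is one lookup.
    by_blob = {}
    for path, blob in sorted(removed.items()):
        by_blob.setdefault(blob, []).append(path)

    renames = []
    for path, blob in sorted(added.items()):
        candidates = by_blob.get(blob)
        if candidates:
            renames.append((candidates.pop(0), path))
    return renames
```

A rename that also edits the file produces a new blob name, so it shows up here as an unrelated removal plus addition; that is the case that needs the smarter diff.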
e83c5163 LT |
137 | |
138 | CHANGESET: The "changeset" object (called "commit" above) is an object that introduces the |
6ad6d3d3 LT |
139 | notion of history into the picture. In contrast to the other |
140 | objects, it doesn't just describe the physical state of a tree, | |
141 | it describes how we got there, and why. | |
142 | ||
143 | A "changeset" is defined by the tree-object that it results in, | |
144 | the parent changesets (zero, one or more) that led up to that | |
145 | point, and a comment on what happened. Again, a changeset is | |
146 | not trusted per se: the contents are well-defined and "safe" due | |
147 | to the cryptographically strong signatures at all levels, but | |
148 | there is no reason to believe that the tree is "good" or that | |
149 | the merge information makes sense. The parents do not have to | |
150 | actually have any relationship with the result, for example. | |
151 | ||
152 | Note on changesets: unlike real SCMs, changesets do not contain | |
90c4851b | 153 | rename information or file mode change information. All of that |
6ad6d3d3 LT |
154 | is implicit in the trees involved (the result tree, and the |
155 | result trees of the parents), and describing that makes no sense | |
156 | in this idiotic file manager. | |
e83c5163 LT |
157 | |
158 | TRUST: The notion of "trust" is really outside the scope of "git", but | |
6ad6d3d3 LT |
159 | it's worth noting a few things. First off, since everything is |
160 | hashed with SHA1, you _can_ trust that an object is intact and | |
161 | has not been messed with by external sources. So the name of an | |
162 | object uniquely identifies a known state - just not a state that | |
163 | you may want to trust. | |
164 | ||
165 | Furthermore, since the SHA1 signature of a changeset refers to | |
166 | the SHA1 signatures of the tree it is associated with and the | |
167 | signatures of the parent, a single named changeset specifies | |
168 | uniquely a whole set of history, with full contents. You can't | |
169 | later fake any step of the way once you have the name of a | |
170 | changeset. | |
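Why a single changeset name pins down the whole history can be seen with a toy hash chain. The serialization below is a simplified stand-in invented for the example (no compression, no author data), not the real object format:

```python
import hashlib


def changeset_name(tree, parents, comment):
    """Name a changeset by hashing its tree, its parents and its comment."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += ["", comment]
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()


# Honest history: root -> head
root = changeset_name("tree-0", [], "initial")
head = changeset_name("tree-1", [root], "second change")

# Tampered history: same head contents, but the root tree was swapped.
fake_root = changeset_name("tree-X", [], "initial")
fake_head = changeset_name("tree-1", [fake_root], "second change")
```

Because the parent's name feeds into the child's name, `fake_head` cannot equal `head`: changing any earlier step changes every name after it.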
171 | ||
172 | So to introduce some real trust in the system, the only thing | |
173 | you need to do is to digitally sign just _one_ special note, | |
174 | which includes the name of a top-level changeset. Your digital | |
175 | signature shows others that you trust that changeset, and the | |
176 | immutability of the history of changesets tells others that they | |
177 | can trust the whole history. | |
178 | ||
179 | In other words, you can easily validate a whole archive by just | |
180 | sending out a single email that tells the people the name (SHA1 | |
181 | hash) of the top changeset, and digitally sign that email using | |
182 | something like GPG/PGP. | |
183 | ||
184 | In particular, you can also have a separate archive of "trust | |
185 | points" or tags, which document your (and other people's) trust. | |
186 | You may, of course, archive these "certificates of trust" using | |
187 | "git" itself, but it's not something "git" does for you. | |
188 | ||
189 | Another way of saying the last point: "git" itself only handles content | |
e83c5163 LT |
190 | integrity, the trust has to come from outside. |
191 | ||
e83c5163 | 192 | |
e83c5163 | 193 | |
6ad6d3d3 LT |
194 | The "index" aka "Current Directory Cache" (".git/index") |
195 | ||
196 | ||
197 | The index is a simple binary file, which contains an efficient | |
198 | representation of a virtual directory content at some random time. It | |
199 | does so by a simple array that associates a set of names, dates, | |
200 | permissions and content (aka "blob") objects together. The cache is | |
201 | always kept ordered by name, and names are unique (with a few very | |
202 | specific rules) at any point in time, but the cache has no long-term | |
203 | meaning, and can be partially updated at any time. | |
204 | ||
205 | In particular, the index certainly does not need to be consistent with | |
206 | the current directory contents (in fact, most operations will depend on | |
207 | different ways to make the index _not_ be consistent with the directory | |
208 | hierarchy), but it has three very important attributes: | |
e83c5163 LT |
209 | |
210 | (a) it can re-generate the full state it caches (not just the directory | |
6ad6d3d3 LT |
211 | structure: it contains pointers to the "blob" objects so that it |
212 | can regenerate the data too) | |
e83c5163 LT |
213 | |
214 | As a special case, there is a clear and unambiguous one-way mapping | |
215 | from a current directory cache to a "tree object", which can be | |
216 | efficiently created from just the current directory cache without | |
217 | actually looking at any other data. So a directory cache at any | |
218 | one time uniquely specifies one and only one "tree" object (but | |
219 | has additional data to make it easy to match up that tree object | |
220 | with what has happened in the directory) | |
e83c5163 LT |
221 | |
222 | (b) it has efficient methods for finding inconsistencies between that | |
223 | cached state ("tree object waiting to be instantiated") and the | |
224 | current state. | |
225 | ||
6ad6d3d3 LT |
226 | (c) it can additionally efficiently represent information about merge |
227 | conflicts between different tree objects, allowing each pathname to | |
228 | be associated with sufficient information about the trees involved | |
229 | that you can create a three-way merge between them. | |
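The one-way mapping from a directory cache to a tree object in (a) can be sketched as follows. Everything about the representation is an assumption for the example: the index is reduced to a dict of path to (mode, blob name), trees are serialized as plain text, and `DIR_MODE` is a stand-in directory marker. The real index also carries stat data, which is not needed for this direction.

```python
import hashlib

DIR_MODE = "40000"  # assumed marker for entries that name another tree


def write_tree(entries, store):
    """Turn {path: (mode, blob)} into nested trees; return the top tree name."""
    listing, subdirs = [], {}
    for path, (mode, blob) in entries.items():
        head, slash, rest = path.partition("/")
        if slash:
            # The path lives in a subdirectory: collect it for a subtree.
            subdirs.setdefault(head, {})[rest] = (mode, blob)
        else:
            listing.append((head, mode, blob))
    for name, children in subdirs.items():
        listing.append((name, DIR_MODE, write_tree(children, store)))
    listing.sort()  # trees are sorted by name
    body = "".join(f"{mode} {name} {obj}\n" for name, mode, obj in listing)
    tree_name = hashlib.sha1(body.encode("utf-8")).hexdigest()
    store[tree_name] = listing
    return tree_name
```

Only the cache itself is consulted; no file in the working directory is read, which is why the mapping is cheap and unambiguous.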
230 | ||
231 | Those are the three ONLY things that the directory cache does. It's a | |
e83c5163 LT |
232 | cache, and the normal operation is to re-generate it completely from a |
233 | known tree object, or update/compare it with a live tree that is being | |
6ad6d3d3 LT |
234 | developed. If you blow the directory cache away entirely, you generally |
235 | haven't lost any information as long as you have the name of the tree | |
236 | that it described. | |
237 | ||
238 | At the same time, the directory index is also the | |
239 | staging area for creating new trees, and creating a new tree always | |
240 | involves a controlled modification of the index file. In particular, | |
241 | the index file can have the representation of an intermediate tree that | |
242 | has not yet been instantiated. So the index can be thought of as a | |
243 | write-back cache, which can contain dirty information that has not yet | |
244 | been written back to the backing store. | |
245 | ||
246 | ||
247 | ||
248 | The Workflow | |
249 | ||
250 | ||
251 | Generally, all "git" operations work on the index file. Some operations | |
252 | work _purely_ on the index file (showing the current state of the | |
253 | index), but most operations move data to and from the index file, either | |
254 | from the database or from the working directory. Thus there are four | |
255 | main combinations: | |
256 | ||
257 | 1) working directory -> index | |
258 | ||
259 | You update the index with information from the working directory | |
260 | with the "update-cache" command. You generally update the index | |
261 | information by just specifying the filename you want to update, | |
262 | like so: | |
263 | ||
264 | update-cache filename | |
265 | ||
266 | but to avoid common mistakes with filename globbing etc, the | |
267 | command will not normally add totally new entries or remove old | |
f7ec43ae | 268 | entries, i.e. it will normally just update existing cache entries. |
6ad6d3d3 LT |
269 | |
270 | To tell git that yes, you really do realize that certain files | |
271 | no longer exist in the archive, or that new files should be | |
272 | added, you should use the "--remove" and "--add" flags | |
273 | respectively. | |
274 | ||
275 | NOTE! A "--remove" flag does _not_ mean that subsequent | |
276 | filenames will necessarily be removed: if the files still exist | |
277 | in your directory structure, the index will be updated with | |
278 | their new status, not removed. The only thing "--remove" means | |
279 | is that update-cache will be considering a removed file to be a | |
280 | valid thing, and if the file really does not exist any more, it | |
281 | will update the index accordingly. | |
282 | ||
283 | As a special case, you can also do "update-cache --refresh", | |
284 | which will refresh the "stat" information of each index entry to match | |
285 | the current stat information. It will _not_ update the object | |
f7ec43ae | 286 | status itself, and it will only update the fields that are used |
6ad6d3d3 LT |
287 | to quickly test whether an object still matches its old backing |
288 | store object. | |
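The quick test that "--refresh" keeps up to date can be pictured as comparing a few cached stat fields against a fresh stat() call, without ever reading the file. Which fields are actually compared is not stated here, so the choice of mtime, size and mode below is an assumption, as are the function names.

```python
import os


def stat_signature(path):
    """Return the stat fields this sketch treats as the cached signature."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size, st.st_mode)


def possibly_changed(path, cached_signature):
    """True if the file no longer matches what was recorded for it."""
    return stat_signature(path) != cached_signature
```

A mismatch only says the file needs a closer look; a refresh just re-records the signature and leaves the object name in the entry alone.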
289 | ||
290 | 2) index -> object database | |
291 | ||
292 | You write your current index file to a "tree" object with the | |
293 | program | |
294 | ||
295 | write-tree | |
296 | ||
297 | that doesn't come with any options - it will just write out the | |
298 | current index into the set of tree objects that describe that | |
299 | state, and it will return the name of the resulting top-level | |
300 | tree. You can use that tree to re-generate the index at any time | |
301 | by going in the other direction: | |
302 | ||
303 | 3) object database -> index | |
304 | ||
305 | You read a "tree" file from the object database, and use that to | |
306 | populate (and overwrite - don't do this if your index contains | |
307 | any unsaved state that you might want to restore later!) your | |
308 | current index. Normal operation is just | |
309 | ||
310 | read-tree <sha1 of tree> | |
311 | ||
312 | and your index file will now be equivalent to the tree that you | |
313 | saved earlier. However, that is only your _index_ file: your | |
314 | working directory contents have not been modified. | |
315 | ||
316 | 4) index -> working directory | |
317 | ||
318 | You update your working directory from the index by "checking | |
319 | out" files. This is not a very common operation, since normally | |
320 | you'd just keep your files updated, and rather than write to | |
321 | your working directory, you'd tell the index files about the | |
90c4851b | 322 | changes in your working directory (i.e. "update-cache"). |
6ad6d3d3 LT |
323 | |
324 | However, if you decide to jump to a new version, or check out | |
bebc5c61 | 325 | somebody else's version, or just restore a previous tree, you'd |
6ad6d3d3 LT |
326 | populate your index file with read-tree, and then you need to |
327 | check out the result with | |
328 | ||
329 | checkout-cache filename | |
330 | ||
331 | or, if you want to check out all of the index, use "-a". | |
332 | ||
333 | NOTE! checkout-cache normally refuses to overwrite old files, so | |
334 | if you have an old version of the tree already checked out, you | |
335 | will need to use the "-f" flag (_before_ the "-a" flag or the | |
336 | filename) to _force_ the checkout. | |
337 | ||
338 | ||
339 | Finally, there are a few odds and ends which are not purely moving from | |
340 | one representation to the other: | |
341 | ||
342 | 5) Tying it all together | |
343 | ||
344 | To commit a tree you have instantiated with "write-tree", you'd | |
345 | create a "commit" object that refers to that tree and the | |
346 | history behind it - most notably the "parent" commits that | |
347 | preceded it in history. | |
348 | ||
349 | Normally a "commit" has one parent: the previous state of the | |
350 | tree before a certain change was made. However, sometimes it can | |
351 | have two or more parent commits, in which case we call it a | |
352 | "merge", due to the fact that such a commit brings together | |
353 | ("merges") two or more previous states represented by other | |
354 | commits. | |
355 | ||
356 | In other words, while a "tree" represents a particular directory | |
357 | state of a working directory, a "commit" represents that state | |
358 | in "time", and explains how we got there. | |
359 | ||
360 | You create a commit object by giving it the tree that describes | |
361 | the state at the time of the commit, and a list of parents: | |
362 | ||
363 | commit-tree <tree> -p <parent> [-p <parent2> ..] | |
364 | ||
365 | and then giving the reason for the commit on stdin (either | |
366 | through redirection from a pipe or file, or by just typing it at | |
367 | the tty). | |
368 | ||
369 | commit-tree will return the name of the object that represents | |
370 | that commit, and you should save it away for later use. | |
371 | Normally, you'd commit a new "HEAD" state, and while git doesn't | |
372 | care where you save the note about that state, in practice we | |
373 | tend to just write the result to the file ".git/HEAD", so that | |
374 | we can always see what the last committed state was. | |
375 | ||
376 | 6) Examining the data | |
377 | ||
378 | You can examine the data represented in the object database and | |
379 | the index with various helper tools. For every object, you can | |
380 | use "cat-file" to examine details about the object: | |
381 | ||
382 | cat-file -t <objectname> | |
383 | ||
384 | shows the type of the object, and once you have the type (which | |
385 | is usually implicit in where you find the object), you can use | |
386 | ||
387 | cat-file blob|tree|commit <objectname> | |
388 | ||
389 | to show its contents. NOTE! Trees have binary content, and as a | |
390 | result there is a special helper for showing that content, | |
391 | called "ls-tree", which turns the binary content into a more | |
392 | easily readable form. | |
393 | ||
394 | It's especially instructive to look at "commit" objects, since | |
395 | those tend to be small and fairly self-explanatory. In | |
396 | particular, if you follow the convention of having the top | |
397 | commit name in ".git/HEAD", you can do | |
398 | ||
399 | cat-file commit $(cat .git/HEAD) | |
400 | ||
401 | to see what the top commit was. | |
402 | ||
403 | 7) Merging multiple trees | |
404 | ||
405 | Git helps you do a three-way merge, which you can expand to | |
406 | n-way by repeating the merge procedure an arbitrary number of times until you | |
407 | finally "commit" the state. The normal situation is that you'd | |
408 | only do one three-way merge (two parents), and commit it, but if | |
409 | you like to, you can do multiple parents in one go. | |
410 | ||
411 | To do a three-way merge, you need the two sets of "commit" | |
412 | objects that you want to merge, use those to find the closest | |
413 | common parent (a third "commit" object), and then use those | |
414 | commit objects to find the state of the directory ("tree" | |
415 | object) at these points. | |
416 | ||
417 | To get the "base" for the merge, you first look up the common | |
418 | parent of two commits with | |
419 | ||
420 | merge-base <commit1> <commit2> | |
421 | ||
422 | which will return you the commit they are both based on. You | |
423 | should now look up the "tree" objects of those commits, which | |
424 | you can easily do with (for example) | |
425 | ||
426 | cat-file commit <commitname> | head -1 | |
427 | ||
428 | since the tree object information is always the first line in a | |
429 | commit object. | |
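Both lookups can be sketched in Python. The history is modelled as a dict from commit name to its list of parents, and "closest" is taken to mean the first common ancestor reached by a breadth-first walk; both are assumptions for illustration, not a description of what merge-base does internally. The "tree <name>" shape of the first line is likewise assumed.

```python
from collections import deque


def ancestors(parents, start):
    """Return the set of commits reachable from start, including start."""
    seen, todo = {start}, [start]
    while todo:
        for p in parents.get(todo.pop(), []):
            if p not in seen:
                seen.add(p)
                todo.append(p)
    return seen


def merge_base(parents, commit1, commit2):
    """Return a nearest common ancestor of the two commits, or None."""
    reachable = ancestors(parents, commit1)
    seen, queue = {commit2}, deque([commit2])
    while queue:
        commit = queue.popleft()
        if commit in reachable:
            return commit
        for p in parents.get(commit, []):
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return None


def tree_of_commit(commit_text):
    """Pull the tree name out of the first line of a commit object."""
    keyword, _, name = commit_text.split("\n", 1)[0].partition(" ")
    if keyword != "tree" or not name:
        raise ValueError("commit does not start with a tree line")
    return name
```

With the base commit and the two branch commits in hand, `tree_of_commit` plays the role of the `head -1` pipeline above.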
430 | ||
431 | Once you know the three trees you are going to merge (the one | |
432 | "original" tree, aka the common tree, and the two "result" trees, | |
433 | aka the branches you want to merge), you do a "merge" read into | |
434 | the index. This will throw away your old index contents, so you | |
435 | should make sure that you've committed those - in fact you would | |
436 | normally always do a merge against your last commit (which | |
437 | should thus match what you have in your current index anyway). | |
438 | To do the merge, do | |
439 | ||
440 | read-tree -m <origtree> <target1tree> <target2tree> | |
441 | ||
442 | which will do all trivial merge operations for you directly in | |
443 | the index file, and you can just write the result out with | |
444 | "write-tree". | |
445 | ||
446 | NOTE! Because the merge is done in the index file, and not in | |
447 | your working directory, your working directory will no longer | |
448 | match your index. You can use "checkout-cache -f -a" to make the | |
449 | effect of the merge be seen in your working directory. | |
450 | ||
451 | NOTE2! Sadly, many merges aren't trivial. If there are files | |
452 | that have been added, moved or removed, or if both branches have | |
453 | modified the same file, you will be left with an index tree that | |
454 | contains "merge entries" in it. Such an index tree can _NOT_ be | |
455 | written out to a tree object, and you will have to resolve any | |
456 | such merge clashes using other tools before you can write out | |
457 | the result. | |
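The split between trivial merges and leftover "merge entries" can be sketched per path. The rules below are the usual three-way simplification and follow the note above in leaving added or removed paths unresolved; they are an illustration, not a statement of exactly which cases "read-tree -m" handles. Trees are flattened to dicts of path to blob name for the example.

```python
def merge_entry(base, ours, theirs):
    """Merge one path; each argument is a blob name or None if absent.

    Returns (True, blob) when trivially resolved, else (False, stages).
    """
    if base is None or ours is None or theirs is None:
        return False, (base, ours, theirs)  # added or removed somewhere
    if ours == theirs:
        return True, ours        # both sides agree
    if base == ours:
        return True, theirs      # only their side changed
    if base == theirs:
        return True, ours        # only our side changed
    return False, (base, ours, theirs)  # both changed, differently


def read_tree_merge(base, ours, theirs):
    """Return (index, unresolved) for three flattened trees."""
    index, unresolved = {}, {}
    for path in sorted(set(base) | set(ours) | set(theirs)):
        ok, result = merge_entry(base.get(path), ours.get(path), theirs.get(path))
        (index if ok else unresolved)[path] = result
    return index, unresolved
```

Only when `unresolved` comes back empty does the result correspond to an index that could be written out as a tree.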
458 | ||
459 | [ fixme: talk about resolving merges here ] | |
460 |