Git for Computer Scientists

2021-06-13 10:30:06

In simplified form, git object storage is "just" a DAG of objects, with a handful of different object types. Every object is stored compressed and identified by a SHA-1 hash. Incidentally, that hash is not the SHA-1 of the contents of the file the object represents, but of the object's representation in git.

blob: The simplest object, just a bunch of bytes. This is often a file, but can be a symlink or pretty much anything else. The object that points to the blob determines the semantics.
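To make the point about hashing concrete, here is a minimal Python sketch of how a blob's identifier can be computed. It assumes the loose-object layout in which git prepends a short header (`<type> <size>` followed by a NUL byte) to the payload before hashing and compressing. The function names `git_object_id` and `loose_object_bytes` are made up for this illustration.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """Return the SHA-1 of git's representation: header + NUL + payload."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def loose_object_bytes(obj_type: str, payload: bytes) -> bytes:
    """Return the zlib-compressed bytes, as a loose object would be stored."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return zlib.compress(header + payload)


content = b"hello world\n"
blob_id = git_object_id("blob", content)
plain_sha1 = hashlib.sha1(content).hexdigest()

print(blob_id)                 # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(blob_id == plain_sha1)   # False: the header changes the hash
```

The printed identifier should match what `git hash-object` reports for a file with the same contents. Note that nothing in the blob records a filename or a mode, which is why the object pointing at the blob decides what it means.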

tree: Directories are represented by tree objects. A tree refers to blobs that hold the contents of files (the filename, access mode, etc. are all stored in the tree), and to other trees for subdirectories.
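As a rough illustration of where filenames and modes live, the following Python sketch serializes a tree. It assumes the usual entry layout of `<mode> <name>`, a NUL byte, then the 20 raw bytes of the referenced object's id, with entries sorted by name and directory names compared as if they ended in `/`. The helper names and the example file names are hypothetical.

```python
import hashlib

EMPTY_BLOB = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"  # id of a zero-byte blob
TREE_MODE = "40000"  # mode used for subdirectory entries


def git_object_id(obj_type: str, payload: bytes) -> str:
    """SHA-1 over '<type> <size>\\0' followed by the payload."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def tree_payload(entries) -> bytes:
    """Serialize (mode, name, hex_id) entries into a tree object's payload."""
    def sort_key(entry):
        mode, name, _ = entry
        raw = name.encode("utf-8")
        # Subdirectories sort as if their name had a trailing slash.
        return raw + b"/" if mode == TREE_MODE else raw

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8")
        body += b"\x00" + bytes.fromhex(hex_id)
    return body


# A directory with one regular file and one executable, both empty.
entries = [
    ("100755", "run.sh", EMPTY_BLOB),
    ("100644", "README", EMPTY_BLOB),
]
print(git_object_id("tree", tree_payload(entries)))
```

Two files with identical contents share one blob here; only the tree entries differ. A subdirectory would appear as one more entry whose id is that of another tree.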

When a node points to another node in the DAG, it depends on that node: it cannot exist without it. Nodes that nothing points to can be garbage collected with git gc, or rescued with git fsck --lost-found, much like filesystem inodes that have no filenames pointing to them.
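The dependency rule can be sketched as a small reachability walk in Python. This is a toy model, not how git implements garbage collection: the object names are invented, and the `roots` list is an assumption standing in for whatever names git keeps alive from outside the DAG.

```python
def reachable(objects: dict, roots: list) -> set:
    """Return every object id that can be reached by following pointers."""
    seen = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        stack.extend(objects[oid])  # everything this object depends on
    return seen


def unreachable(objects: dict, roots: list) -> set:
    """Objects a collector could discard: nothing live depends on them."""
    return set(objects) - reachable(objects, roots)


def dangling(objects: dict, roots: list) -> set:
    """Unreachable objects that no other object points to at all."""
    pointed_to = {child for children in objects.values() for child in children}
    return unreachable(objects, roots) - pointed_to


# Toy DAG: each key points to the objects it depends on.
objects = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["blob_a", "blob_b"],
    "tree1": ["blob_a"],
    "blob_a": [],
    "blob_b": [],
    "orphan_commit": ["orphan_tree"],
    "orphan_tree": ["blob_a", "orphan_blob"],
    "orphan_blob": [],
}
roots = ["commit2"]

print(sorted(unreachable(objects, roots)))  # ['orphan_blob', 'orphan_commit', 'orphan_tree']
print(sorted(dangling(objects, roots)))     # ['orphan_commit']
```

In this model, rescuing `orphan_commit` would bring back `orphan_tree` and `orphan_blob` as well, because the commit depends on them. `blob_a` is never at risk, since a live tree still points to it.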

commit: A commit refers to a tree that represents the state of the files at the time of the commit. It also refers to 0..n other commits that are its parents. More than one parent means the commit is a merge; no parents means it is an initial commit. Interestingly, there can be more than one initial commit, which usually means two separate projects were merged. The body of the commit object is the commit message.
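A commit can be modelled in a few lines of Python. The sketch below assumes the textual layout of a commit object: a `tree` line, zero or more `parent` lines, `author` and `committer` lines, a blank line, then the message. The `Commit` class, the `kind` helper, the identity string, and the placeholder parent ids are all hypothetical and exist only for illustration.

```python
import hashlib
from dataclasses import dataclass

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # id of a tree with no entries
WHO = "A U Thor <author@example.com> 0 +0000"  # hypothetical identity and timestamp


@dataclass(frozen=True)
class Commit:
    tree: str            # id of the tree holding the snapshot
    parents: tuple = ()  # ids of 0..n parent commits
    message: str = ""    # the body of the commit object

    def payload(self) -> bytes:
        """Serialize headers, a blank line, then the message."""
        lines = [f"tree {self.tree}"]
        lines += [f"parent {p}" for p in self.parents]
        lines += [f"author {WHO}", f"committer {WHO}", "", self.message]
        return "\n".join(lines).encode("utf-8")

    def object_id(self) -> str:
        """SHA-1 over 'commit <size>\\0' followed by the payload."""
        data = self.payload()
        header = f"commit {len(data)}".encode("ascii") + b"\x00"
        return hashlib.sha1(header + data).hexdigest()


def kind(commit: Commit) -> str:
    """Classify a commit by how many parents it has."""
    if not commit.parents:
        return "initial"
    return "merge" if len(commit.parents) > 1 else "ordinary"


root_a = Commit(EMPTY_TREE, (), "Start project A\n")
root_b = Commit(EMPTY_TREE, (), "Start project B\n")
merged = Commit(EMPTY_TREE, (root_a.object_id(), root_b.object_id()), "Merge B into A\n")

print(kind(root_a), kind(root_b), kind(merged))  # initial initial merge
```

Here `merged` joins two histories that each have their own initial commit, which is the "two separate projects merged" case. Because parent ids are part of the hashed payload, changing any ancestor changes the id of every descendant.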