Learn how Git stores files internally using snapshots, blobs, trees, and hashing to avoid duplication and save repository space efficiently.
Git is the most widely used version control system in the world, and one of the key reasons for its popularity is its highly efficient storage model. At first glance, Git appears to store a complete copy of your project every time you commit. Surprisingly, repositories remain compact even after thousands of commits.
So how does Git duplicate files while still saving disk space?
In this article, we will explore how Git stores files internally, how it avoids unnecessary duplication, and why its storage mechanism is both fast and space-efficient. By the end, you will clearly understand how Git manages file data under the hood and why it scales so well for large projects.
Unlike traditional version control systems such as Subversion (SVN), which store file differences between versions, Git takes a fundamentally different approach.
Git stores snapshots of the entire project state at every commit.
However, Git is smart enough not to duplicate unchanged data. If a file has not changed between commits, Git simply reuses the previously stored version instead of saving a new copy. This design enables Git to deliver:
Many older version control systems store each revision as a set of changes against the previous one. Git’s object model does not.
Every time you create a commit, Git records a snapshot of the entire file structure at that moment.
If a file remains unchanged between commits, the new snapshot simply points to the object Git already stored for it; no new copy is written.
This means Git behaves like a content-addressable filesystem, where identical content is stored once and referenced many times.
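The idea of a content-addressable store can be illustrated with a small sketch. This is a toy model, not Git's implementation: the `ContentStore` class and its method names are hypothetical, and it hashes the raw bytes directly rather than using Git's real object format.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is the hash of the value."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        # Storing identical bytes again is a no-op because the key already exists.
        self._objects.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


store = ContentStore()
first = store.put(b"print('hello')\n")
second = store.put(b"print('hello')\n")   # same content, e.g. in the next commit
third = store.put(b"print('goodbye')\n")  # changed content

print(first == second)  # True
print(len(store))       # 2
```

Because the key is derived from the content, asking the store to save the same bytes twice cannot produce a second copy.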
This snapshot model allows Git to:
Git stores all repository data as objects inside the .git/objects directory. Each object is identified by a cryptographic hash based on its content.

There are four primary object types in Git: blobs, trees, commits, and annotated tags. The first three are the ones that make up a snapshot.
A blob (Binary Large Object) represents the raw content of a file.
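Key characteristics of blobs: a blob holds only file content, not the file’s name or permissions, and it is identified by the hash of that content.
If two files (or the same file across commits) have identical content, they hash to the same value, so Git stores a single blob and references it from every place it is needed.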
This is the foundation of Git’s space-saving mechanism.
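As a minimal sketch of how a blob's identifier is derived, assuming the default SHA-1 object format: Git hashes a short header (`blob`, the content length, and a NUL byte) followed by the file content. The function name `git_blob_id` is hypothetical.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git would assign to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


readme = b"hello\n"
copy_of_readme = b"hello\n"  # a different file with the same bytes

print(git_blob_id(readme))  # ce013625030ba8dba906f756967f9e9ca394464a
print(git_blob_id(readme) == git_blob_id(copy_of_readme))  # True
```

The file name plays no part in the hash, which is why two files with the same bytes share one blob. The printed id should match what `git hash-object` reports for a file containing the same bytes.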
You can list the blobs that a commit’s root tree references using:
    git ls-tree <commit-hash>
A tree object represents a directory in your project.
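It contains one entry per file or subdirectory: the entry’s name, its mode, and the hash of the blob or tree it points to.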
Each directory in your project maps to a tree object, allowing Git to recreate the complete filesystem structure for any commit.
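The following sketch shows roughly how a tree object could be serialized and hashed, assuming the SHA-1 format. Each entry is written as the mode, a space, the name, a NUL byte, and the 20-byte binary id. The helper names and the file names are hypothetical, and the sort is simplified to plain byte order on the name, which is only an approximation of Git's rule for directories.

```python
import hashlib


def object_id(kind: bytes, body: bytes) -> bytes:
    """Hash '<kind> <size>\\0<body>' and return the raw 20-byte id."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).digest()


def tree_body(entries):
    """Serialize (mode, name, raw_id) entries, sorted by name (simplified)."""
    body = b""
    for mode, name, raw_id in sorted(entries, key=lambda entry: entry[1]):
        body += mode + b" " + name + b"\0" + raw_id
    return body


readme_id = object_id(b"blob", b"hello\n")
main_id = object_id(b"blob", b"print('hi')\n")

# The subdirectory is its own tree object; the root tree points to it.
src_tree_id = object_id(b"tree", tree_body([(b"100644", b"main.py", main_id)]))
root_body = tree_body([
    (b"100644", b"README", readme_id),
    (b"40000", b"src", src_tree_id),
])
root_tree_id = object_id(b"tree", root_body)

print(root_tree_id.hex())
```

If `main.py` changes, the `src` tree and the root tree get new ids, while the `README` blob keeps its id and is shared with the previous snapshot.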
A commit object ties everything together.
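It contains the hash of the root tree, the hashes of any parent commits, the author and committer with timestamps, and the commit message.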
Commit Structure Example
    Commit
    └── Tree (Root Directory)
        ├── Blob (File 1)
        ├── Blob (File 2)
        └── Tree (Subdirectory)
            ├── Blob (File 3)
            └── Blob (File 4)
Each commit represents a complete snapshot, but most data is reused from earlier commits.
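To make the reuse concrete, here is a small sketch that models two snapshots as dictionaries of path to content and counts how many distinct blobs they need. The file names and contents are invented for illustration, and trees and commits are left out to keep the example short.

```python
import hashlib


def blob_id(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two snapshots: only app.py changes in the second one.
commit_1 = {"README": b"hello\n", "app.py": b"print(1)\n", "lib.py": b"x = 1\n"}
commit_2 = {**commit_1, "app.py": b"print(2)\n"}

ids_1 = {path: blob_id(data) for path, data in commit_1.items()}
ids_2 = {path: blob_id(data) for path, data in commit_2.items()}

stored = set(ids_1.values()) | set(ids_2.values())
reused = set(ids_1.values()) & set(ids_2.values())

print(len(ids_1) + len(ids_2), len(stored), len(reused))  # 6 4 2
```

Six file entries across the two snapshots need only four blobs, because the two unchanged files resolve to ids that already exist.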
.git Directory: Git’s Internal Storage and Control System
The .git directory is the core of every Git repository. It stores all metadata, objects, and references.
.git/objects/
This directory stores all Git objects (blobs, trees, commits) in compressed form. Objects are named using their hash values.
.git/refs/
References to branches and tags live here. Each branch is simply a pointer to a commit.
.git/index (Staging Area)
The index tracks what will be included in the next commit. It bridges the gap between your working directory and the repository.
.git/HEAD
The HEAD file points to the currently checked-out branch or commit.
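The relationship between HEAD, a branch reference, and a commit hash can be sketched by reading the files directly. This example builds a throwaway directory that imitates the layout rather than touching a real repository; the function `resolve_head`, the branch name `main`, and the placeholder hash are assumptions for illustration. It also ignores packed refs, which real repositories may use.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path) -> str:
    """Follow HEAD to a commit hash (loose refs only, simplified)."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # HEAD names a branch; the branch file holds the commit hash.
        ref_path = git_dir / head[len("ref: "):]
        return ref_path.read_text().strip()
    return head  # detached HEAD: the file holds the hash itself


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)

    fake_commit = "a" * 40  # placeholder, not a real commit id
    (git_dir / "refs" / "heads" / "main").write_text(fake_commit + "\n")
    (git_dir / "HEAD").write_text("ref: refs/heads/main\n")

    resolved = resolve_head(git_dir)

print(resolved == fake_commit)  # True
```

A branch here is a small text file containing one hash, so creating or moving a branch does not copy any file data.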
Git’s efficiency comes from three core techniques.
Git computes a hash (SHA-1 by default, SHA-256 supported) for every object based on its content.
This guarantees data integrity and prevents duplication.

Git compresses objects using zlib, reducing disk usage while maintaining fast access.
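A rough sketch of the hash-then-compress step for a single loose object, assuming the SHA-1 format and zlib's default compression level (Git's configured level may differ). The sample content is invented and deliberately repetitive so the effect is visible.

```python
import hashlib
import zlib

content = b"the same line repeated\n" * 200
raw_object = b"blob " + str(len(content)).encode("ascii") + b"\0" + content

object_id = hashlib.sha1(raw_object).hexdigest()  # name of the object
compressed = zlib.compress(raw_object)            # bytes that would go on disk

restored = zlib.decompress(compressed)

print(len(raw_object), len(compressed))
print(restored == raw_object)                             # True
# Re-hashing after decompression doubles as an integrity check.
print(hashlib.sha1(restored).hexdigest() == object_id)    # True
```

The hash is computed before compression, so the object's name depends only on its content and not on the compression settings.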
Git never stores the same content twice. If a file hasn’t changed, the new commit’s tree references the existing blob instead of writing a new one.
This is how Git duplicates files logically without duplicating data physically.
To fully understand how Git duplicates files while saving space, it is essential to understand the three logical areas through which every change flows: the working directory, the staging area, and the commit history. These are not just conceptual layers — they directly influence how Git creates objects and reuses existing data.

The working directory is the actual project folder on your local machine. It contains real files that you edit using your editor or IDE.
Key characteristics:
When you modify a file in the working directory, Git does not create any new objects; nothing is written to .git/objects until you stage the change. This design allows Git to remain fast and lightweight while you experiment with changes.
The staging area, also called the index, is where Git begins its internal storage optimization.
When you run:
    git add <file>
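Git performs the following actions: it hashes the file’s content, writes a compressed blob to .git/objects if no object with that hash exists yet, and records the file’s path and blob hash in .git/index.
Important details: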
This is where Git’s de-duplication logic begins to take effect.
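The staging step can be approximated in a few lines. This is a sketch under stated assumptions, not Git's code: the function `stage` is hypothetical, the index is modeled as a plain dictionary (the real .git/index is a binary file with more fields), and objects are written to a temporary directory using the loose-object layout of a two-character subdirectory plus the remaining 38 characters of the hash.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def stage(objects_dir: Path, index: dict, path: str, content: bytes) -> bool:
    """Hash content, write the blob if it is new, record it in the index.

    Returns True when a new object file was written.
    """
    raw = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    oid = hashlib.sha1(raw).hexdigest()
    target = objects_dir / oid[:2] / oid[2:]

    created = not target.exists()
    if created:
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(zlib.compress(raw))

    index[path] = oid  # the index maps a path to the blob it should contain
    return created


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"
    index = {}

    wrote_first = stage(objects_dir, index, "a.txt", b"hello\n")
    wrote_copy = stage(objects_dir, index, "b.txt", b"hello\n")  # same bytes
    file_count = len([p for p in objects_dir.rglob("*") if p.is_file()])

print(wrote_first, wrote_copy, file_count)  # True False 1
print(index["a.txt"] == index["b.txt"])     # True
```

Staging the second file adds an index entry but no new object, since a blob with that hash is already on disk.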
When you run:
    git commit
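Git creates a commit object, which includes a reference to the root tree, references to any parent commits, author and committer details, and the commit message.
Crucially, files that did not change are not stored again; the new tree objects point to the blobs that already exist.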
Each commit represents a complete snapshot, but internally, most data is shared across commits. This allows Git to maintain a full project history without ballooning repository size.
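A commit object itself is a short piece of text. The sketch below assembles one by hand, assuming the SHA-1 format; the author identity, timestamps, fixed `+0000` timezone, and the function name `commit_object` are all invented for illustration, and the tree id used is the well-known id of an empty tree.

```python
import hashlib


def commit_object(tree_hex, parent_hex, author, timestamp, message):
    """Build a commit body and return (object id, body bytes)."""
    lines = [f"tree {tree_hex}"]
    if parent_hex:
        lines.append(f"parent {parent_hex}")
    lines.append(f"author {author} {timestamp} +0000")
    lines.append(f"committer {author} {timestamp} +0000")
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest(), body


empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
who = "Example Dev <dev@example.com>"  # hypothetical identity

first_id, first_body = commit_object(empty_tree, None, who, 1700000000, "Initial commit")
second_id, second_body = commit_object(empty_tree, first_id, who, 1700000060, "Second commit")

print(second_body.decode("utf-8"))
```

The commit holds hashes, not file data, so its size stays in the range of a few hundred bytes regardless of how large the project is.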
One of Git’s strengths is transparency. Git provides low-level commands that allow you to inspect its internal object database, making it easier to understand how files are stored and reused.
These commands are especially valuable for developers who want to understand Git beyond everyday workflows.
git cat-file: Viewing Raw Git Objects
The git cat-file command allows you to inspect any Git object directly.
To view a commit object:
    git cat-file -p <object-hash>
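For a commit, this displays the tree hash, any parent hashes, the author and committer lines, and the commit message.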
You can also inspect blob objects to see file content exactly as Git stores it, confirming that identical content is reused across commits.
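What such an inspection involves for a loose object can be sketched as follows: decompress the stored bytes, split off the header, and check the recorded size. The function `parse_object` is a hypothetical helper, and the sample object is built in memory rather than read from a repository.

```python
import zlib


def parse_object(compressed: bytes):
    """Decompress a loose object and return (type, body)."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\0")
    kind, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind.decode("ascii"), body


# Stand-in for the bytes of a file under .git/objects/
stored = zlib.compress(b"blob 6\0hello\n")

kind, body = parse_object(stored)
print(kind, body)  # blob b'hello\n'
```

The body of a blob is exactly the file content, with no file name attached.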
git ls-tree: Exploring Tree Structures
The git ls-tree command shows how a commit or tree maps to files and directories.
    git ls-tree <commit-hash>
Output includes the mode, object type, hash, and name of each entry.
This command clearly demonstrates how Git builds directory snapshots using tree objects that reference blob objects, without duplicating data.
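A simplified sketch of how a tree object's body can be decoded into that kind of listing, assuming 20-byte SHA-1 ids. The function `parse_tree` is hypothetical, the sample entries are built by hand, and the type is inferred from the mode in a way that only distinguishes directories from everything else (submodule entries, for example, are not handled).

```python
def parse_tree(body: bytes):
    """Decode '<mode> <name>\\0<20-byte id>' records into tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        oid = body[nul + 1:nul + 21].hex()
        kind = "tree" if mode == "40000" else "blob"  # simplification
        entries.append((mode, kind, oid, name))
        pos = nul + 21
    return entries


sample_body = (
    b"100644 README\0" + bytes.fromhex("ce013625030ba8dba906f756967f9e9ca394464a")
    + b"40000 src\0" + bytes.fromhex("4b825dc642cb6eb9a060e54bf8d69288fbee4904")
)

entries = parse_tree(sample_body)
for mode, kind, oid, name in entries:
    # Pad the mode to six digits, as the listing format does for directories.
    print(f"{mode:0>6} {kind} {oid}\t{name}")
```

Each line of the listing is a name paired with a hash, which is how a directory snapshot refers to existing blobs and subtrees without copying them.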
git rev-parse: Resolving References to Hashes
The git rev-parse command helps resolve symbolic references into their actual object hashes.
    git rev-parse HEAD
Use cases include:
This reinforces the idea that branches and tags are lightweight pointers, not copies of data.
Git’s ability to duplicate files logically without duplicating data physically is the cornerstone of its performance and scalability. By storing content as immutable, hashed objects and reusing them across commits, Git ensures that repositories remain fast and space-efficient — even with extensive histories.
The .git directory contains all of this internal data.
Understanding Git’s internal storage model gives you deeper confidence when working with branches, rebases, merges, and large repositories. It also helps explain why Git stays fast and reliable as a project’s history grows.