The common code to deal with "chunked file format" that is shared
by the multi-pack-index and commit-graph files has been factored
out, to help codepaths for both filetypes to become more robust.

* ds/chunked-file-api:
  commit-graph.c: display correct number of chunks when writing
  chunk-format: add technical docs
  chunk-format: restore duplicate chunk checks
  midx: use 64-bit multiplication for chunk sizes
  midx: use chunk-format read API
  commit-graph: use chunk-format read API
  chunk-format: create read chunk API
  midx: use chunk-format API in write_midx_internal()
  midx: drop chunk progress during write
  midx: return success/failure in chunk write methods
  midx: add num_large_offsets to write_midx_context
  midx: add pack_perm to write_midx_context
  midx: add entries to write_midx_context
  midx: use context in write_midx_pack_names()
  midx: rename pack_info to write_midx_context
  commit-graph: use chunk-format write API
  chunk-format: create chunk format write API
  commit-graph: anonymize data in chunk_write_fn
gitster@pobox.com
10 changed files with 655 additions and 468 deletions
@@ -0,0 +1,116 @@
Chunk-based file formats
========================

Some file formats in Git use a common concept of "chunks" to describe
sections of the file. This allows structured access to a large file by
scanning a small "table of contents" for the remaining data. This common
format is used by the `commit-graph` and `multi-pack-index` files. See
link:technical/pack-format.html[the `multi-pack-index` format] and
link:technical/commit-graph-format.html[the `commit-graph` format] for
how they use the chunks to describe structured data.

A chunk-based file format begins with some header information custom to
that format. That header should include enough information to identify
the file type, format version, and number of chunks in the file. From this
information, a reader can determine the start of the chunk-based region.

The chunk-based region starts with a table of contents describing where
each chunk starts and ends. This consists of (C+1) rows of 12 bytes each,
where C is the number of chunks. Consider the following table:

  | Chunk ID (4 bytes) | Chunk Offset (8 bytes) |
  |--------------------|------------------------|
  | ID[0]              | OFFSET[0]              |
  | ...                | ...                    |
  | ID[C-1]            | OFFSET[C-1]            |
  | 0x00000000         | OFFSET[C]              |

Each row consists of a 4-byte chunk identifier (ID) and an 8-byte offset.
Each integer is stored in network byte order.

The chunk identifier `ID[i]` is a label for the data stored within this
file from `OFFSET[i]` (inclusive) to `OFFSET[i+1]` (exclusive). Thus, the
size of the `i`th chunk is equal to the difference between `OFFSET[i+1]`
and `OFFSET[i]`. This requires that the chunk data appears contiguously
in the same order as the table of contents.

The final entry in the table of contents must have a chunk ID of four
zero bytes. This confirms that the table of contents is ending and
provides the offset for the end of the chunk-based data.

Note: The chunk-based format expects that the file contains _at least_ a
trailing hash after `OFFSET[C]`.
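
As an illustration of the layout described above, the following is a
minimal standalone sketch of decoding a table of contents from a byte
buffer. It is not the code Git uses; `read_be32()`, `read_be64()`,
`struct toc_row` and `decode_toc()` are hypothetical names chosen for
this example, and it checks only the terminating ID and that offsets do
not decrease.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_ENTRY_SIZE 12 /* 4-byte chunk ID + 8-byte offset */

/* Hypothetical helper: read a 32-bit integer in network byte order. */
static uint32_t read_be32(const unsigned char *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Hypothetical helper: read a 64-bit integer in network byte order. */
static uint64_t read_be64(const unsigned char *p)
{
	return ((uint64_t)read_be32(p) << 32) | read_be32(p + 4);
}

struct toc_row {
	uint32_t id;
	uint64_t offset;
	uint64_t size;
};

/*
 * Decode 'nr' chunk rows followed by one terminating row from 'toc'.
 * The size of chunk i is OFFSET[i+1] - OFFSET[i]. Returns 0 on success,
 * or -1 if a chunk row has a zero ID, offsets decrease, or the
 * terminating row has a non-zero ID.
 */
static int decode_toc(const unsigned char *toc, size_t nr,
		      struct toc_row *out)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		const unsigned char *row = toc + i * TOC_ENTRY_SIZE;
		uint64_t next = read_be64(row + TOC_ENTRY_SIZE + 4);

		out[i].id = read_be32(row);
		out[i].offset = read_be64(row + 4);
		if (!out[i].id || next < out[i].offset)
			return -1;
		out[i].size = next - out[i].offset;
	}

	return read_be32(toc + nr * TOC_ENTRY_SIZE) ? -1 : 0;
}
```

For a file with two chunks, `toc` would point at 36 bytes (three rows),
and the last row supplies the end offset used to size the second chunk.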

Functions for working with chunk-based file formats are declared in
`chunk-format.h`. Using these methods provides extra checks that assist
developers when creating new file formats.

Writing chunk-based file formats
--------------------------------

To write a chunk-based file format, create a `struct chunkfile` by
calling `init_chunkfile()` and pass a `struct hashfile` pointer. The
caller is responsible for opening the `hashfile` and writing header
information so the file format is identifiable before the chunk-based
format begins.

Then, call `add_chunk()` for each chunk to be written. This
populates the `chunkfile` with information about the order and size of
each chunk to write. Provide a `chunk_write_fn` function pointer to
perform the write of the chunk data upon request.

Call `write_chunkfile()` to write the table of contents to the `hashfile`
followed by each of the chunks. This will verify that each chunk wrote
the expected amount of data so the table of contents is correct.

Finally, call `free_chunkfile()` to clear the `struct chunkfile` data. The
caller is responsible for finalizing the `hashfile` by writing the trailing
hash and closing the file.
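
The offset bookkeeping performed while writing the table of contents can
be sketched in isolation. The example below is a simplified model that
fills an in-memory buffer rather than a `hashfile`; `put_be32()`,
`put_be64()`, `struct planned_chunk` and `build_toc()` are hypothetical
names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOC_ENTRY_SIZE 12 /* 4-byte chunk ID + 8-byte offset */

/* Hypothetical helper: store a 32-bit integer in network byte order. */
static void put_be32(unsigned char *p, uint32_t v)
{
	p[0] = (unsigned char)(v >> 24);
	p[1] = (unsigned char)(v >> 16);
	p[2] = (unsigned char)(v >> 8);
	p[3] = (unsigned char)v;
}

/* Hypothetical helper: store a 64-bit integer in network byte order. */
static void put_be64(unsigned char *p, uint64_t v)
{
	put_be32(p, (uint32_t)(v >> 32));
	put_be32(p + 4, (uint32_t)v);
}

struct planned_chunk {
	uint32_t id;
	uint64_t size; /* expected number of bytes in the chunk */
};

/*
 * Fill 'toc', which must hold (nr + 1) * TOC_ENTRY_SIZE bytes, for
 * chunks written contiguously right after the table. 'toc_start' is
 * the file offset at which the table begins (the header size).
 * Returns the offset at which the chunk data ends.
 */
static uint64_t build_toc(unsigned char *toc, uint64_t toc_start,
			  const struct planned_chunk *chunks, size_t nr)
{
	uint64_t cur = toc_start + (uint64_t)(nr + 1) * TOC_ENTRY_SIZE;
	size_t i;

	for (i = 0; i < nr; i++) {
		put_be32(toc, chunks[i].id);
		put_be64(toc + 4, cur);
		toc += TOC_ENTRY_SIZE;
		cur += chunks[i].size;
	}

	/* The terminating row has a zero ID and the end offset. */
	put_be32(toc, 0);
	put_be64(toc + 4, cur);
	return cur;
}
```

Because every offset is derived from the declared sizes, a chunk writer
that emits a different number of bytes than it declared would make the
table wrong, which is why the written size is verified per chunk.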

Reading chunk-based file formats
--------------------------------

To read a chunk-based file format, the file must be opened as a
memory-mapped region. The chunk-format API expects that the entire file
is mapped as a contiguous memory region.

Initialize a `struct chunkfile` pointer with `init_chunkfile(NULL)`.

After reading the header information from the beginning of the file,
including the chunk count, call `read_table_of_contents()` to populate
the `struct chunkfile` with the list of chunks, their offsets, and their
sizes.

Extract the data for each chunk using `pair_chunk()` or
`read_chunk()`:

* `pair_chunk()` assigns to a given pointer the location inside the
  memory-mapped file corresponding to that chunk's offset. If the chunk
  does not exist, then the pointer is not modified.

* `read_chunk()` takes a `chunk_read_fn` function pointer and calls it
  with the appropriate initial pointer and size information. The function
  is not called if the chunk does not exist. Use this method to read chunks
  if you need to perform immediate parsing or if you need to execute logic
  based on the size of the chunk.

After calling these methods, call `free_chunkfile()` to clear the
`struct chunkfile` data. This will not close the memory-mapped region.
Callers are expected to own that data for as long as the pointers into
the region are needed.
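
The callback-based lookup can be modeled without any Git internals. The
sketch below assumes an already-populated chunk list;
`struct chunk_entry`, `lookup_chunk()`, `count_u32_entries()` and
`NOT_FOUND` are hypothetical stand-ins for illustration and are not part
of `chunk-format.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NOT_FOUND (-2) /* hypothetical "no such chunk" return value */

struct chunk_entry {
	uint32_t id;
	const unsigned char *start; /* points into the mapped file */
	size_t size;
};

typedef int (*read_fn)(const unsigned char *start, size_t size,
		       void *data);

/*
 * Find 'id' in the list and hand its location and size to 'fn'.
 * The callback is not invoked when the chunk is absent.
 */
static int lookup_chunk(const struct chunk_entry *chunks, size_t nr,
			uint32_t id, read_fn fn, void *data)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (chunks[i].id == id)
			return fn(chunks[i].start, chunks[i].size, data);
	}

	return NOT_FOUND;
}

/*
 * Example callback with size-dependent logic: treat the chunk as an
 * array of 4-byte entries and reject sizes that are not a multiple
 * of 4. On success, stores the entry count through 'data'.
 */
static int count_u32_entries(const unsigned char *start, size_t size,
			     void *data)
{
	size_t *count = data;

	(void)start; /* only the size is inspected in this sketch */
	if (size % 4)
		return -1;
	*count = size / 4;
	return 0;
}
```

A caller can tell three outcomes apart this way: the chunk was parsed,
the chunk was present but rejected by the callback, or the chunk was
missing and the caller's data was left untouched.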

Examples
--------

These file formats use the chunk-format API, and can be used as examples
for future formats:

* *commit-graph:* see `write_commit_graph_file()` and `parse_commit_graph()`
  in `commit-graph.c` for how the chunk-format API is used to write and
  parse the commit-graph file format documented in
  link:technical/commit-graph-format.html[the commit-graph file format].

* *multi-pack-index:* see `write_midx_internal()` and `load_multi_pack_index()`
  in `midx.c` for how the chunk-format API is used to write and
  parse the multi-pack-index file format documented in
  link:technical/pack-format.html[the multi-pack-index file format].
@@ -0,0 +1,179 @@
#include "cache.h"
#include "chunk-format.h"
#include "csum-file.h"

/*
 * When writing a chunk-based file format, collect the chunks in
 * an array of chunk_info structs. The size stores the _expected_
 * amount of data that will be written by write_fn.
 */
struct chunk_info {
	uint32_t id;
	uint64_t size;
	chunk_write_fn write_fn;

	const void *start;
};

struct chunkfile {
	struct hashfile *f;

	struct chunk_info *chunks;
	size_t chunks_nr;
	size_t chunks_alloc;
};

struct chunkfile *init_chunkfile(struct hashfile *f)
{
	struct chunkfile *cf = xcalloc(1, sizeof(*cf));
	cf->f = f;
	return cf;
}

void free_chunkfile(struct chunkfile *cf)
{
	if (!cf)
		return;
	free(cf->chunks);
	free(cf);
}

int get_num_chunks(struct chunkfile *cf)
{
	return cf->chunks_nr;
}

void add_chunk(struct chunkfile *cf,
	       uint32_t id,
	       size_t size,
	       chunk_write_fn fn)
{
	ALLOC_GROW(cf->chunks, cf->chunks_nr + 1, cf->chunks_alloc);

	cf->chunks[cf->chunks_nr].id = id;
	cf->chunks[cf->chunks_nr].write_fn = fn;
	cf->chunks[cf->chunks_nr].size = size;
	cf->chunks_nr++;
}

int write_chunkfile(struct chunkfile *cf, void *data)
{
	int i;
	uint64_t cur_offset = hashfile_total(cf->f);

	/* Add the table of contents to the current offset */
	cur_offset += (cf->chunks_nr + 1) * CHUNK_TOC_ENTRY_SIZE;

	for (i = 0; i < cf->chunks_nr; i++) {
		hashwrite_be32(cf->f, cf->chunks[i].id);
		hashwrite_be64(cf->f, cur_offset);

		cur_offset += cf->chunks[i].size;
	}

	/* Trailing entry marks the end of the chunks */
	hashwrite_be32(cf->f, 0);
	hashwrite_be64(cf->f, cur_offset);

	for (i = 0; i < cf->chunks_nr; i++) {
		off_t start_offset = hashfile_total(cf->f);
		int result = cf->chunks[i].write_fn(cf->f, data);

		if (result)
			return result;

		if (hashfile_total(cf->f) - start_offset != cf->chunks[i].size)
			BUG("expected to write %"PRId64" bytes to chunk %"PRIx32", but wrote %"PRId64" instead",
			    cf->chunks[i].size, cf->chunks[i].id,
			    hashfile_total(cf->f) - start_offset);
	}

	return 0;
}

int read_table_of_contents(struct chunkfile *cf,
			   const unsigned char *mfile,
			   size_t mfile_size,
			   uint64_t toc_offset,
			   int toc_length)
{
	int i;
	uint32_t chunk_id;
	const unsigned char *table_of_contents = mfile + toc_offset;

	ALLOC_GROW(cf->chunks, toc_length, cf->chunks_alloc);

	while (toc_length--) {
		uint64_t chunk_offset, next_chunk_offset;

		chunk_id = get_be32(table_of_contents);
		chunk_offset = get_be64(table_of_contents + 4);

		if (!chunk_id) {
			error(_("terminating chunk id appears earlier than expected"));
			return 1;
		}

		table_of_contents += CHUNK_TOC_ENTRY_SIZE;
		next_chunk_offset = get_be64(table_of_contents + 4);

		if (next_chunk_offset < chunk_offset ||
		    next_chunk_offset > mfile_size - the_hash_algo->rawsz) {
			error(_("improper chunk offset(s) %"PRIx64" and %"PRIx64""),
			      chunk_offset, next_chunk_offset);
			return -1;
		}

		for (i = 0; i < cf->chunks_nr; i++) {
			if (cf->chunks[i].id == chunk_id) {
				error(_("duplicate chunk ID %"PRIx32" found"),
				      chunk_id);
				return -1;
			}
		}

		cf->chunks[cf->chunks_nr].id = chunk_id;
		cf->chunks[cf->chunks_nr].start = mfile + chunk_offset;
		cf->chunks[cf->chunks_nr].size = next_chunk_offset - chunk_offset;
		cf->chunks_nr++;
	}

	chunk_id = get_be32(table_of_contents);
	if (chunk_id) {
		error(_("final chunk has non-zero id %"PRIx32""), chunk_id);
		return -1;
	}

	return 0;
}

static int pair_chunk_fn(const unsigned char *chunk_start,
			 size_t chunk_size,
			 void *data)
{
	const unsigned char **p = data;
	*p = chunk_start;
	return 0;
}

int pair_chunk(struct chunkfile *cf,
	       uint32_t chunk_id,
	       const unsigned char **p)
{
	return read_chunk(cf, chunk_id, pair_chunk_fn, p);
}

int read_chunk(struct chunkfile *cf,
	       uint32_t chunk_id,
	       chunk_read_fn fn,
	       void *data)
{
	int i;

	for (i = 0; i < cf->chunks_nr; i++) {
		if (cf->chunks[i].id == chunk_id)
			return fn(cf->chunks[i].start, cf->chunks[i].size, data);
	}

	return CHUNK_NOT_FOUND;
}
@@ -0,0 +1,68 @@
#ifndef CHUNK_FORMAT_H
#define CHUNK_FORMAT_H

#include "git-compat-util.h"

struct hashfile;
struct chunkfile;

#define CHUNK_TOC_ENTRY_SIZE (sizeof(uint32_t) + sizeof(uint64_t))

/*
 * Initialize a 'struct chunkfile' for writing _or_ reading a file
 * with the chunk format.
 *
 * If writing a file, supply a non-NULL 'struct hashfile *' that will
 * be used to write.
 *
 * If reading a file, use a NULL 'struct hashfile *' and then call
 * read_table_of_contents(). Supply the memory-mapped data to the
 * pair_chunk() or read_chunk() methods, as appropriate.
 *
 * DO NOT MIX THESE MODES. Use different 'struct chunkfile' instances
 * for reading and writing.
 */
struct chunkfile *init_chunkfile(struct hashfile *f);
void free_chunkfile(struct chunkfile *cf);
int get_num_chunks(struct chunkfile *cf);
typedef int (*chunk_write_fn)(struct hashfile *f, void *data);
void add_chunk(struct chunkfile *cf,
	       uint32_t id,
	       size_t size,
	       chunk_write_fn fn);
int write_chunkfile(struct chunkfile *cf, void *data);

int read_table_of_contents(struct chunkfile *cf,
			   const unsigned char *mfile,
			   size_t mfile_size,
			   uint64_t toc_offset,
			   int toc_length);

#define CHUNK_NOT_FOUND (-2)

/*
 * Find 'chunk_id' in the given chunkfile and assign the
 * given pointer to the position in the mmap'd file where
 * that chunk begins.
 *
 * Returns CHUNK_NOT_FOUND if the chunk does not exist.
 */
int pair_chunk(struct chunkfile *cf,
	       uint32_t chunk_id,
	       const unsigned char **p);

typedef int (*chunk_read_fn)(const unsigned char *chunk_start,
			     size_t chunk_size, void *data);
/*
 * Find 'chunk_id' in the given chunkfile and call the
 * given chunk_read_fn method with the information for
 * that chunk.
 *
 * Returns CHUNK_NOT_FOUND if the chunk does not exist.
 */
int read_chunk(struct chunkfile *cf,
	       uint32_t chunk_id,
	       chunk_read_fn fn,
	       void *data);

#endif