Commit Graph

28 Commits (3cf773e4264ecc6e9f603a24aeb72cc68b372f96)

Author SHA1 Message Date
Torsten Bögershausen 76759c7dff git on Mac OS and precomposed unicode
Mac OS X mangles file names containing unicode on file systems HFS+,
VFAT or SAMBA.  When a file using unicode code points outside ASCII
is created on a HFS+ drive, the file name is converted into
decomposed unicode and written to disk. No conversion is done if
the file name is already decomposed unicode.

Calling open("\xc3\x84", ...) with a precomposed "Ä" yields the same
result as open("\x41\xcc\x88",...) with a decomposed "Ä".

As a consequence, readdir() returns the file names in decomposed
unicode, even if the user expects precomposed unicode.  Unlike on
HFS+, Mac OS X stores files on a VFAT drive (e.g. an USB drive) in
precomposed unicode, but readdir() still returns file names in
decomposed unicode.  When a git repository is stored on a network
share using SAMBA, file names are send over the wire and written to
disk on the remote system in precomposed unicode, but Mac OS X
readdir() returns decomposed unicode to be compatible with its
behaviour on HFS+ and VFAT.

The unicode decomposition causes many problems:

- The names "git add" and other commands get from the end user may
  often be precomposed form (the decomposed form is not easily input
  from the keyboard), but when the commands read from the filesystem
  to see what it is going to update the index with already is on the
  filesystem, readdir() will give decomposed form, which is different.

- Similarly "git log", "git mv" and all other commands that need to
  compare pathnames found on the command line (often but not always
  precomposed form; a command line input resulting from globbing may
  be in decomposed) with pathnames found in the tree objects (should
  be precomposed form to be compatible with other systems and for
  consistency in general).

- The same for names stored in the index, which should be
  precomposed, that may need to be compared with the names read from
  readdir().

NFS mounted from Linux is fully transparent and does not suffer from
the above.

As Mac OS X treats precomposed and decomposed file names as equal,
we can

 - wrap readdir() on Mac OS X to return the precomposed form, and

 - normalize decomposed form given from the command line also to the
   precomposed form,

to ensure that all pathnames used in Git are always in the
precomposed form.  This behaviour can be requested by setting
"core.precomposedunicode" configuration variable to true.

The code in compat/precomposed_utf8.c implements basically 4 new
functions: precomposed_utf8_opendir(), precomposed_utf8_readdir(),
precomposed_utf8_closedir() and precompose_argv().  The first three
are to wrap opendir(3), readdir(3), and closedir(3) functions.

The argv[] conversion allows to use the TAB filename completion done
by the shell on command line.  It tolerates other tools which use
readdir() to feed decomposed file names into git.

When creating a new git repository with "git init" or "git clone",
"core.precomposedunicode" will be set "false".

The user needs to activate this feature manually.  She typically
sets core.precomposedunicode to "true" on HFS and VFAT, or file
systems mounted via SAMBA.

Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Torsten Bögershausen <tboegi@web.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2012-07-08 22:03:46 -07:00
Jeff King 98acc837a1 strbuf: add fixed-length version of add_wrapped_text
The function strbuf_add_wrapped_text takes a NUL-terminated
string. This makes it annoying to wrap strings we have as a
pointer and a length.

Refactoring strbuf_add_wrapped_text and all of its
sub-functions to handle fixed-length strings turned out to
be really ugly. So this implementation is lame; it just
strdups the text and operates on the NUL-terminated version.
This should be fine as the strings we are wrapping are
generally pretty short.  If it becomes a problem, we can
optimize later.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2011-02-23 13:44:36 -08:00
Junio C Hamano 32ae5b3425 Merge branch 'rs/optim-text-wrap'
* rs/optim-text-wrap:
  utf8.c: speculatively assume utf-8 in strbuf_add_wrapped_text()
  utf8.c: remove strbuf_write()
  utf8.c: remove print_spaces()
  utf8.c: remove print_wrapped_text()
2010-03-02 12:44:10 -08:00
René Scharfe 462749b728 utf8.c: speculatively assume utf-8 in strbuf_add_wrapped_text()
is_utf8() works by calling utf8_width() for each character at the
supplied location.  In strbuf_add_wrapped_text(), we do that anyway
while wrapping the lines.  So instead of checking the encoding
beforehand, optimistically assume that it's utf-8 and wrap along
until an invalid character is hit, and when that happens start over.

This pays off if the text consists only of valid utf-8 characters.
The following command was run against the Linux kernel repo with
git 1.7.0:

	$ time git log --format='%b' v2.6.32 >/dev/null

	real	0m2.679s
	user	0m2.580s
	sys	0m0.100s

	$ time git log --format='%w(60,4,8)%b' >/dev/null

	real	0m4.342s
	user	0m4.230s
	sys	0m0.110s

And with this patch series:

	$ time git log --format='%w(60,4,8)%b' >/dev/null

	real	0m3.741s
	user	0m3.630s
	sys	0m0.110s

So the cost of wrapping is reduced to 70% in this case.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-02-20 09:22:44 -08:00
René Scharfe 68ad5e1e9c utf8.c: remove strbuf_write()
The patch before the previous one made sure that all callers of
strbuf_add_wrapped_text() supply a strbuf.  Replace all calls of
strbuf_write() with regular strbuf functions and remove it.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-02-20 09:19:35 -08:00
René Scharfe 3c0ff44a1e utf8.c: remove print_spaces()
The previous patch made sure that strbuf_add_wrapped_text() (and thus
strbuf_add_indented_text(), too) always get a strbuf.  Make use of
this fact by adding strbuf_addchars(), a small helper that adds a
char the specified number of times to a strbuf, and use it to replace
print_spaces().

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-02-20 09:19:06 -08:00
René Scharfe bb96a2c900 utf8.c: remove print_wrapped_text()
strbuf_add_wrapped_text() is called only from print_wrapped_text()
without a strbuf (in which case it writes its results to stdout).

At its only callsite, supply a strbuf, call strbuf_add_wrapped_text()
directly and remove the wrapper function.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-02-20 09:18:04 -08:00
Junio C Hamano 5e133b8cf9 utf8.c: mark file-local function static
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2010-01-12 01:06:09 -08:00
René Scharfe 8a3c63e01d strbuf_add_wrapped_text(): skip over colour codes
Ignore display mode escape sequences (colour codes) for the purpose of
text wrapping because they don't have a visible width.

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-11-23 15:36:07 -08:00
René Scharfe 37bb5d7443 strbuf_add_wrapped_text(): factor out strbuf_add_indented_text()
Add a new helper function, strbuf_add_indented_text(), to indent text
without a width limit, and call it from strbuf_add_wrapped_text().  It
respects both indent (applied to the first line) and indent2 (applied to
the rest of the lines); indent2 was ignored by the indent-only path of
strbuf_add_wrapped_text() before the patch.

Two simple test cases are added, one exercising strbuf_add_wrapped_text()
and the other strbuf_add_indented_text().

Signed-off-by: Rene Scharfe <rene.scharfe@lsrfire.ath.cx>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-11-22 16:22:02 -08:00
Junio C Hamano 00d3947366 Teach --wrap to only indent without wrapping
When a zero or negative width is given to "shortlog -w<width>,<in1>,<in2>"
and --format=%[wrap(w,in1,in2)...%], just indent the text by in1 without
wrapping.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-10-22 23:20:16 -07:00
Johannes Schindelin a94410c813 Add strbuf_add_wrapped_text() to utf8.[ch]
The newly added function can rewrap text according to a given first-line
indent, other-indent and text width.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2009-10-19 00:57:29 -07:00
Johannes Schindelin ae0b270230 print_wrapped_text(): allow hard newlines
print_wrapped_text() will insert its own newlines. Up until now, if the
text passed to it contained newlines, they would not be handled properly
(the wrapping got confused after that).

The strategy is to replace a single new-line with a space, but keep double
new-lines so that already-wrapped text with empty lines between paragraphs
will be handled properly.

However, single new-line characters are only handled this way if the
character after it is an alphanumeric character, as per Linus' suggestion.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
2009-10-19 00:57:29 -07:00
Brandon Casey 309dbc82e3 On Solaris choose the OLD_ICONV iconv() declaration based on the UNIX spec
OLD_ICONV is only necessary on Solaris until UNIX03.  This is indicated
by the private macro _XPG6 which is set in /usr/include/sys/feature_tests.h.

Signed-off-by: Brandon Casey <drafnel@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-06-06 13:21:05 -07:00
Geoffrey Thomas 8a9391e944 utf8: add utf8_strwidth()
I'm about to use this pattern more than once, so make it a common function.

Signed-off-by: Geoffrey Thomas <geofft@mit.edu>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2009-02-04 16:30:43 -08:00
Junio C Hamano 44b25b872f utf8_width(): allow non NUL-terminated input
The original interface assumed that the input string is
always terminated with a NUL, but that wasn't too useful.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-06 20:53:46 -08:00
Junio C Hamano 396ccf1fcb utf8: pick_one_utf8_char()
utf8_width() function was doing two different things.  To pick a
valid character from UTF-8 stream, and compute the display width of
that character.  This splits the former to a separate function
pick_one_utf8_char().

Signed-off-by: Junio C Hamano <gitster@pobox.com>
2008-01-06 20:27:35 -08:00
Guido Ostkamp a777e9ca54 Remove unreachable statements
Solaris Workshop Compiler found a few unreachable statements.

Signed-off-by: Guido Ostkamp <git@ostkamp.fastmail.fm>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-11-15 21:23:47 -08:00
Junio C Hamano f3fa183802 Style: place opening brace of a function definition at column 1
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2007-11-08 15:35:32 -08:00
Amos Waterland b51be13c9c wcwidth redeclaration
Build fails for git 1.5.1.3 on AIX, with the message:

utf8.c:66: error: conflicting types for 'wcwidth'
/.../lib/gcc/powerpc-ibm-aix5.3.0.0/4.0.3/include/string.h:266: error: previous declaration of 'wcwidth' was here

Fix this by renaming our static variant to our own name.

Signed-off-by: Amos Waterland <apw@us.ibm.com>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-05-07 22:02:40 -07:00
Junio C Hamano 253e772ede Merge branch 'maint'
* maint:
  Unset NO_C99_FORMAT on Cygwin.
  Fix a "pointer type missmatch" warning.
  Fix some "comparison is always true/false" warnings.
  Fix an "implicit function definition" warning.
  Fix a "label defined but unreferenced" warning.
  Document the config variable format.suffix
  git-merge: fail correctly when we cannot fast forward.
  builtin-archive: use RUN_SETUP
  Fix git-gc usage note
2007-03-03 19:47:46 -08:00
Ramsay Jones fd547a972a Fix a "pointer type missmatch" warning.
In particular, the second parameter in the call to iconv() will
cause this warning if your library declares iconv() with the
second (input buffer pointer) parameter of type const char **.
This is the old prototype, which is none-the-less used by the
current version of newlib on Cygwin. (It appears in old versions
of glibc too).

Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-03 18:55:17 -08:00
Ramsay Jones 2832114532 Fix some "comparison is always true/false" warnings.
On Cygwin the wchar_t type is an unsigned short (16-bit) int.
This results in the above warnings from the return statement in
the wcwidth() function (in particular, the expressions involving
constants with values larger than 0xffff). Simply replace the
use of wchar_t with an unsigned int, typedef-ed as ucs_char_t.

Signed-off-by: Ramsay Jones <ramsay@ramsay1.demon.co.uk>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-03 18:55:10 -08:00
Johannes Schindelin 62273826fe print_wrapped_text: fix output for negative indent
When providing a negative indent, it means that -indent columns were
already printed. Fix a bug where the function ate the first character
if already the first word did not fit into the first line.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-03-02 15:14:48 -08:00
Johannes Schindelin 094e03b039 Actually make print_wrapped_text() useful
Now, it returns the current column, does not add a newline, and you can
pass a negative indent, to indicate that the indent was already printed.

With this, you can actually continue in the middle of a paragraph, not
having to print everything into a buffer first.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <junkio@cox.net>
2007-02-27 17:29:02 -08:00
Junio C Hamano 677cfed56a commit-tree: cope with different ways "utf-8" can be spelled.
People can spell config.commitencoding differently from what we
internally have ("utf-8") to mean UTF-8.  Try to accept them and
treat them equally.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-12-30 15:58:43 -08:00
Junio C Hamano b45974a655 Move encoding conversion routine out of mailinfo to utf8.c
This moves the body of convert_to_utf8() routine used in mailinfo
to the utf8.c i18n library.

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-12-26 00:22:39 -08:00
Johannes Schindelin 9e83266525 commit-tree: encourage UTF-8 commit messages.
Introduce is_utf() to check if a text looks like it is encoded
in UTF-8, utf8_width() to count display width, and implements
print_wrapped_text() using them.

git-commit-tree warns if the commit message does not minimally
conform to the UTF-8 encoding when i18n.commitencoding is either
unset, or set to "utf-8".

Signed-off-by: Junio C Hamano <junkio@cox.net>
2006-12-24 00:32:49 -08:00