I'm not looking for a shell script here, just
suggestions on an algorithm and/or tool(s) ....
I have two directories, OLD and NEW. each contains
c.6000 files(!). in each directory, >4000 files
are "interesting". the interesting files can be
readily identified by name. for each interesting
file X, it is either in both directories (with the
same name), or else in only one directory (which
can be _either_ OLD or NEW).
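A minimal sketch of that three-way split, in Python for illustration. The function name `classify` and the `is_interesting` predicate are hypothetical; the real naming rule is not given here.

```python
def classify(old_names, new_names, is_interesting):
    """Split interesting names into (in both, only in OLD, only in NEW).

    `is_interesting` is a caller-supplied predicate on the file name;
    the actual rule is an assumption left to the caller.
    """
    old = {n for n in old_names if is_interesting(n)}
    new = {n for n in new_names if is_interesting(n)}
    # Sorting gives a canonical, repeatable ordering of the files.
    return sorted(old & new), sorted(old - new), sorted(new - old)
```

It could be fed with something like `classify(os.listdir("OLD"), os.listdir("NEW"), my_rule)`, where `my_rule` stands in for whatever identifies an interesting name.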
the contents of these interesting files is text,
nominally US-ASCII, albeit there is some UTF-8
(mostly CJK (Han) characters encoded as UTF-8).
in general, the files are relatively small, less
than one hundred lines, but some have a few
thousand lines and there are four outliers of
about 1 million (10^6) lines each.
just to make life more confusing, not every line
in the interesting files is relevant. however,
the not-relevant lines are trivial to identify.
but the positioning and number of not-relevant
lines vary and cannot be predicted.
I need to compare the differences in gory detail:
(1) if X is only in either OLD or NEW, I need to
see all the relevant lines (this is a must!).
(2) if X is in both OLD and NEW, I want to see the
differences (in relevant lines) in as clear a
fashion as possible, down to approximately
the character level. for example, if the only
difference between the "same" line in the two
versions of X is the 5th character on the line,
I want to see that 5th character highlighted
somehow (i.e., it is made very obvious that that
is what the difference is). this also means
that lines which are the same can be folded
(not shown), and that it should be obvious
when an entire line is added/deleted.
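To make the character-level requirement concrete, here is a rough sketch using Python's standard `difflib`. The function name `mark_changes` and the `[-...-]` / `{+...+}` markers are my own illustrative choices, not the output of any existing tool.

```python
import difflib

def mark_changes(old_line, new_line):
    """Return (old, new) with the differing character runs bracketed.

    Removed or changed text in the old line is wrapped in [-...-],
    added or changed text in the new line in {+...+}.
    """
    sm = difflib.SequenceMatcher(None, old_line, new_line, autojunk=False)
    old_out, new_out = [], []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        a, b = old_line[i1:i2], new_line[j1:j2]
        if tag == "equal":
            old_out.append(a)
            new_out.append(b)
        else:
            # 'replace', 'delete' or 'insert': mark whichever side has text.
            if a:
                old_out.append("[-" + a + "-]")
            if b:
                new_out.append("{+" + b + "+}")
    return "".join(old_out), "".join(new_out)
```

For two lines differing only in the 5th character, e.g. `abcdXfg` and `abcdYfg`, this gives `abcd[-X-]fg` and `abcd{+Y+}fg`.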
and I need to be able to do this multiple times,
as newer NEWs are generated. (oh, and relevant
lines cannot be re-ordered if displayed.) most
of the time, there is a relatively small number
of differences, but in the current situation,
there are thousands of differences --- however,
almost all the differences can be evaluated with
only a cursory glance _provided_ the difference
is immediately obvious (see case 2).
thanks to this list, I have found the vimdiff(1)
command does case 2 very nicely --- a quick glance
at the screen of (in essence):
    vimdiff <(munch OLD/X) <(munch NEW/X)
is near-ideal. the `munch' used above is a trivial
script that filters out the not-relevant lines.
hence, `vimdiff' only compares the relevant lines.
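For readers without the script, a stand-in for `munch' might look like the sketch below. The `is_relevant` predicate is hypothetical, since the real rule for spotting not-relevant lines is not spelled out here. Decoding as UTF-8 with `errors="replace"` is likewise an assumption, made so that a stray bad byte does not abort the run.

```python
def munch(path, is_relevant):
    """Return the relevant lines of `path`, in their original order.

    `is_relevant` is a caller-supplied predicate on one line of text;
    the real filtering rule is an assumption left to the caller.
    """
    with open(path, encoding="utf-8", errors="replace") as fh:
        # Order is preserved; only not-relevant lines are dropped.
        return [line.rstrip("\n") for line in fh if is_relevant(line)]
```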
with some tweaking to handle the only in OLD or
NEW situation, the above also works for case 1.
however, doing the above for >4000 files is a
non-starter! so what I am currently doing is,
    generate a list of all files in both dirs
    create empty files "old.lines" and "new.lines"
    for each file X in the generated list:
        append a separator line identifying X to both .lines files
        if X is in both dirs,
            munch OLD/X >>old.lines
            munch NEW/X >>new.lines
        else
            munch (whichever dir)/X >>(the appropriate .lines)
            append that many filler lines to the other .lines
    vimdiff old.lines new.lines
in other words, generate two _huge_ files, one
for OLD and one for NEW, each containing _all_
relevant lines for a canonical ordering of the
files (plus separator lines to mark where each
file X begins). if a file X does not exist,
filler lines are present so that corresponding
lines in the two huge files do not get too far
out of sync. finally, `vimdiff' those two
resultant huge files.
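The procedure just described can be sketched in Python roughly as follows. The name `build_lines`, the `=== X ===` separator format and the empty-string filler are illustrative assumptions. `old` and `new` are assumed to map each file name to its already-munched lines, with a missing key meaning the file does not exist on that side.

```python
def build_lines(names, old, new, filler=""):
    """Build the two big line lists (for old.lines and new.lines).

    Every name in `names` is assumed to be present in at least one
    of the two mappings.
    """
    old_out, new_out = [], []
    for name in names:
        sep = "=== " + name + " ==="   # separator marking where X begins
        old_out.append(sep)
        new_out.append(sep)
        if name in old and name in new:
            old_out.extend(old[name])
            new_out.extend(new[name])
        elif name in old:
            # Only in OLD: pad NEW with the same number of filler lines.
            old_out.extend(old[name])
            new_out.extend([filler] * len(old[name]))
        else:
            # Only in NEW: pad OLD instead.
            new_out.extend(new[name])
            old_out.extend([filler] * len(new[name]))
    return old_out, new_out
```

Note that when X is in both directories no padding is done, so the two lists can drift apart by however much the two versions differ in length.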
this works rather well, but goes badly wrong in
at least one case: when X is in both directories,
but one version has vastly more lines than the
other version. then diff(1) does not "correctly"
identify the added/deleted/changed line sets, and
so the `vimdiff' display is utterly confusing.
it winds up showing file OLD/X compared with the
different file NEW/Y (at least until the `diff'
output gets itself back into sync).
any ideas how to solve this? or, other ideas on
how to do the visual comparison (cases 1 and 2)?
I'm thinking of modifying the algorithm to also
append filler lines in the case when OLD/X and
NEW/X have a different number of relevant lines,
so that the start of each `munch'ed X (i.e., the
separator lines) will be on the same line in both
of the two resultant huge files. but I'm not too
convinced that will work, so thought I'd ask if
anyone has any suggestions?
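A sketch of that proposed modification, again in Python with hypothetical names (`build_padded`, the `=== X ===` separator, empty-string filler). It treats a missing file as zero relevant lines and pads the shorter side after each X. Whether diff(1) then lines things up any better is exactly the open question; the sketch only guarantees that the separators land on the same line numbers.

```python
def build_padded(names, old, new, filler=""):
    """Build the two line lists so each separator sits at the same index.

    `old` and `new` map file name -> munched lines; a missing key is
    treated as a file with no relevant lines.
    """
    old_out, new_out = [], []
    for name in names:
        a = old.get(name, [])
        b = new.get(name, [])
        n = max(len(a), len(b))
        sep = "=== " + name + " ==="
        # Pad the shorter version so both blocks span n + 1 lines.
        old_out += [sep] + a + [filler] * (n - len(a))
        new_out += [sep] + b + [filler] * (n - len(b))
    return old_out, new_out
```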
Experienced (20+ yrs) kernel/software Eng: | Brian Foster Montpellier,
• Unix, embedded, &tc; • Linux; • doc; | blf at utvinternet.ie FRANCE
• IDL, automated testing, process, &tc. | Stop E$$o (ExxonMobile)!
Résumé (CV) http://www.blf.utvinternet.ie | http://www.stopesso.com