LINUX.IE, website of the Irish Linux Users' Group

[ILUG] vimdiff(1)ing (subsets of) two directories

Brian Foster blf at blf.utvinternet.ie
Thu Mar 9 20:04:15 GMT 2006


 I'm not looking for a shell script here, just
 suggestions on an algorithm and/or tool(s) ....

 I have two directories, OLD and NEW.  each contains
 c.6000 files(!).  in each directory, >4000 files
 are "interesting".  the interesting files can be
 readily identified by name.  for each interesting
 file X, it is either in both directories (with the
 same name), or else in only one directory (which
 can be _either_ OLD or NEW).
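
 a minimal Python sketch of that three-way split
 (in both, only in OLD, only in NEW).  the function
 name `classify' and the `is_interesting' predicate
 are hypothetical; the real rule for recognising an
 interesting name is not given here.

```python
import os

def classify(old_dir, new_dir, is_interesting):
    """Return (both, only_old, only_new) as sorted lists of file names.

    is_interesting is a caller-supplied predicate on the file name;
    it stands in for whatever naming rule identifies interesting files.
    """
    old = {n for n in os.listdir(old_dir) if is_interesting(n)}
    new = {n for n in os.listdir(new_dir) if is_interesting(n)}
    return sorted(old & new), sorted(old - new), sorted(new - old)
```

 sorting gives the canonical ordering of names that
 any later per-file processing can rely on.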

 the contents of these interesting files is text,
 nominally US-ASCII, albeit there is some UTF-8
 (mostly CJK (Han) characters encoded as UTF-8).
 in general, the files are relatively small, less
 than one hundred lines, but some have a few
 thousand lines and there are four outliers of
 about 1 million (10^6) lines each.

 just to make life more confusing, not every line
 in the interesting files is relevant.  however,
 the not-relevant lines are trivial to identify.
 but the positioning and number of not-relevant
 lines varies and cannot be predicted.

 I need to compare the differences in gory detail:

 (1) if X is in only one of OLD and NEW, I need to
     see all the relevant lines (this is a must!).

 (2) if X is in both OLD and NEW, I want to see the
     differences (in relevant lines) in as clear a
     fashion as possible, down to approximately
     the character level.  for example, if the only
     difference between the "same" line in the two
     versions of X is the 5th character on the line,
     I want to see that 5th character highlighted
     somehow (i.e., it is made very obvious that that
     is what the difference is).  this also means
     that lines which are the same can be folded
     (not shown), and that it should be obvious
     when an entire line is added/deleted.
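
 to make "down to the character level" concrete,
 here is a small Python sketch using the standard
 difflib module.  it only illustrates what case 2
 asks for on a single pair of lines; the function
 name `char_diff' is hypothetical.

```python
import difflib

def char_diff(old_line, new_line):
    """List the character-level changes between two versions of a line.

    Each entry is (tag, old_fragment, new_fragment), where tag is one of
    'replace', 'delete' or 'insert' as reported by SequenceMatcher.
    Unchanged stretches are omitted.
    """
    sm = difflib.SequenceMatcher(None, old_line, new_line, autojunk=False)
    return [(tag, old_line[i1:i2], new_line[j1:j2])
            for tag, i1, i2, j1, j2 in sm.get_opcodes()
            if tag != "equal"]

# only the 5th character differs between the two versions
print(char_diff("abcdefg", "abcdXfg"))   # [('replace', 'e', 'X')]
```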

 and I need to be able to do this multiple times,
 as newer NEWs are generated.  (oh, and relevant
 lines cannot be re-ordered if displayed.)  most
 of the time, there is a relatively small number
 of differences, but in the current situation,
 there are thousands of differences --- however,
 almost all the differences can be evaluated with
 only a cursory glance _provided_ the difference
 is immediately obvious (see case 2).

 thanks to this list, I have found the vimdiff(1)
 command does case 2 very nicely --- a quick glance
 at the screen of (in essence):

   vimdiff <(munch OLD/X) <(munch NEW/X)

 is near-ideal.  the `munch' used above is a trivial
 script that filters out the not-relevant lines.
 hence, `vimdiff' only compares the relevant lines.
 with some tweaking to handle the only in OLD or
 NEW situation, the above also works for case 1.
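
 a sketch of what such a filter might look like in
 Python.  the rule used here (lines starting with
 "#" are not relevant) is purely an assumption for
 illustration, as is yielding nothing for a missing
 file as one way to handle the only-in-one-directory
 case.

```python
import os

def munch(path, is_relevant=lambda line: not line.startswith("#")):
    """Yield the relevant lines of a file, in their original order.

    The default is_relevant rule is a placeholder assumption; the real
    test for a not-relevant line would replace it.
    A missing file yields no lines at all.
    """
    if not os.path.exists(path):
        return
    # errors="replace" keeps going if a file has stray non-UTF-8 bytes
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if is_relevant(line):
                yield line
```

 because it reads line by line, the million-line
 outliers are never held in memory all at once.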

 however, doing the above for >4000 files is a
 non-starter!  so what I am currently doing is,
 in effect:

    generate a list of all files in both dirs
    create empty files "old.lines" and "new.lines"
    for each file X in the generated list:
       append a separator line identifying X to both .lines files
       if X is in both dirs,
          munch OLD/X  >>old.lines
          munch NEW/X  >>new.lines
       otherwise
          munch (whichever dir)/X  >>(the appropriate .lines)
          append that many filler lines to the other .lines
    vimdiff old.lines new.lines

 in other words, generate two _huge_ files, one
 for OLD and one for NEW, each containing _all_
 relevant lines for a canonical ordering of the
 files (plus separator lines to mark where each
 file X begins).  if a file X does not exist,
 filler lines are present so that corresponding
 lines in the two huge files do not get too far
 out of sync.  finally, `vimdiff' those two
 resultant huge files.
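
 the steps above can be sketched in Python roughly
 as follows.  the name `build_lines', the separator
 format "==== X ====" and the use of an empty line
 as filler are all assumptions; `munch' is passed in
 as any callable that returns the relevant lines of
 a file.

```python
import os

def build_lines(old_dir, new_dir, names, munch, old_out, new_out,
                filler="\n"):
    """Write the two combined files that are then given to vimdiff.

    names  -- file names in canonical order (union of both directories)
    munch  -- callable(path) returning an iterable of relevant lines
    filler -- line written in place of each line of a missing file
    """
    def relevant(directory, name):
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            return None
        # make sure every line ends in exactly one newline
        return [line.rstrip("\n") + "\n" for line in munch(path)]

    with open(old_out, "w", encoding="utf-8") as fo, \
         open(new_out, "w", encoding="utf-8") as fn:
        for name in names:
            sep = f"==== {name} ====\n"   # separator identifying X
            fo.write(sep)
            fn.write(sep)
            old = relevant(old_dir, name)
            new = relevant(new_dir, name)
            if old is not None:
                fo.writelines(old)
            if new is not None:
                fn.writelines(new)
            # X is in one directory only: pad the other side
            if old is None and new is not None:
                fo.write(filler * len(new))
            elif new is None and old is not None:
                fn.write(filler * len(old))
```

 this version holds one file's relevant lines in
 memory at a time, which is simple but not ideal for
 the million-line outliers.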

 this works rather well, but goes badly wrong in
 at least one case:  when X is in both directories,
 but one version has vastly more lines than the
 other version.  then diff(1) does not "correctly"
 identify the added/deleted/changed line sets, and
 so the `vimdiff' display is utterly confusing.
 it winds up showing file OLD/X compared with the
 different file NEW/Y (at least until the `diff'
 output gets itself back into sync).

 any ideas how to solve this?  or, other ideas on
 how to do the visual comparison (1 and 2) that
 I want?

 I'm thinking of modifying the algorithm to also
 append filler lines in the case when OLD/X and
 NEW/X have a different number of relevant lines,
 so that the start of each `munch'ed X (i.e., the
 separator lines) will be on the same line in both
 of the two resultant huge files.  but I'm not too
 convinced that will work, so thought I'd ask if
 anyone has any suggestions?
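
 the padding idea could be sketched in Python like
 this (the name `pad_pair' and the empty-line filler
 are assumptions).  it only shows the mechanics; it
 does not settle whether the result helps, because
 diff(1) matches lines by content rather than by
 line number.

```python
def pad_pair(old_lines, new_lines, filler="\n"):
    """Pad the shorter list with filler lines so both have equal length.

    With every file X padded this way, the separator line for the next
    file starts on the same line number in both combined files.
    """
    n = max(len(old_lines), len(new_lines))
    return (old_lines + [filler] * (n - len(old_lines)),
            new_lines + [filler] * (n - len(new_lines)))
```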

cheers!
	-blf-
-- 
Experienced (20+ yrs) kernel/software Eng: | Brian Foster   Montpellier,
 • Unix, embedded, &tc;  • Linux;  • doc;  | blf at utvinternet.ie   FRANCE
 • IDL, automated testing, process, &tc.   |  Stop E$$o (ExxonMobile)!
Résumé (CV) http://www.blf.utvinternet.ie  |     http://www.stopesso.com


