LINUX.IE, website of the Irish Linux Users' Group

[ILUG] Gawk query

Brian Foster blf at utvinternet.ie
Thu Aug 23 20:07:45 IST 2007


  | From: "Paschal Nee" <pnee at toombeola.com>
  | Date: Thu, 23 Aug 2007 17:03:34 +0100
  |[ ... ]
  | Seems like a standard enough query but is there a way to get gawk to
  | recognise fields that are enclosed by a delimiter as single fields.
  | 
  | An example:
  | 
  | 10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hh HTTP/1.1" 200 1763
  | 10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hy HTTP/1.1" 200 1763
  | 10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hz HTTP/1.1" 200 1763
  | 10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hf HTTP/1.1" 200 1763
  | 
  | Running gawk '{for (i=0;i<NF;i++) print $i}' on the above would get you

 no, it wouldn't: that prints $0, $1, ..., $(NF-1);
 i.e., (on separate lines) the entire line,
 followed by all but the last field.
 to print each field (on separate lines),
 you probably meant:

            gawk '{for (i=1;i<=NF;i++) print $i}'
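
 the difference between the two loops is easy to see on a
 three-field line.  a minimal sketch; the function names
 are made up for illustration, and plain awk is used since
 nothing here is gawk-specific:

```shell
# hypothetical helper names, not from the original post
loop_from_zero() { awk '{ for (i = 0; i < NF; i++) print $i }'; }   # loop as posted
loop_from_one()  { awk '{ for (i = 1; i <= NF; i++) print $i }'; }  # corrected loop

printf 'a b c\n' | loop_from_zero   # prints "a b c", then "a", then "b"
printf 'a b c\n' | loop_from_one    # prints "a", then "b", then "c"
```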

  |[ ... ] what I'm looking for would be
  | 
  | 10.129.141.1
  | -
  | -
  | 23/Aug/2007:16:06:14 +0100
  | GET /hh HTTP/1.1
  | 200
  | 1763
  | 
  | i.e. recognising that [] and "" enclose single fields.

 right.  recognising [, ], and " as additional field
 separators (which is not exactly the same thing) is
 easy.  the problem is grokking that whilst spaces
 outside [...] and "..." are field separators, spaces
 inside either are not field separators.
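
 to illustrate why separators alone are not enough, here is
 a sketch that treats any run of space, [, ], and " as one
 field separator.  the function name and the sample variable
 are assumptions for the example:

```shell
# hypothetical helper: every run of space, [, ] or " is one separator
sample='10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hh HTTP/1.1" 200 1763'

naive_split() { awk -F '[]" []+' '{ for (i = 1; i <= NF; i++) print $i }'; }

# the date and the request come out in pieces (10 fields, not 7)
printf '%s\n' "$sample" | naive_split
```

 the brackets and quotes are gone, but the spaces inside
 them still split the date and the request.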

 after a bit of head-scratching, the easiest approach
 seems to be a bit of pre-processing; that is, make
 the two types of spaces unique.

 one way to do this is to turn all field separators
 (space, [, ], and ") into something else (I use ¶)
 whilst leaving spaces not separating fields alone.
 with several caveats (below) this can be done by:

     sed -e 's/ /¶/g
             :again
             s/\(¶\[[^]]*\)¶/\1 /g
             s/\(¶"[^"]*\)¶/\1 /g
             tagain
             s/[]"[]//g'  |
         gawk -F¶ '{ for (i = 1; i <= NF; i++) print $i }'

 another way is to turn all spaces which are not field
 separators into something else (I use §), and just to
 make it easier, all field separators into spaces (and
 with similar caveats):

     sed -e ':again
             s/\( \[[^]]*\) /\1§/g
             s/\( "[^"]*\) /\1§/g
             tagain
             s/[]"[]/ /g'  |
         gawk '{ for (i = 1; i <= NF; i++) print gensub(/§/, " ", "g", $i) }'

 both approaches, as written, assume each opening [
 or " is preceded by a space (and may also assume no
 field is empty?).  (all true for the example input.)

 the second approach only works with GNU awk(1), since
 gensub() is a gawk extension; the first might work with
 other awks?
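
 if GNU awk can be assumed anyway, one alternative worth
 naming is FPAT, which exists in gawk 4.0 and later (so it
 is newer than this post and is not one of the approaches
 above).  FPAT describes what a field looks like instead of
 what separates fields.  a minimal sketch, with a made-up
 function name, assuming fields are either [...], "...", or
 a run of non-spaces:

```shell
sample='10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hh HTTP/1.1" 200 1763'

# hypothetical helper; needs gawk >= 4.0 for FPAT
fpat_split() {
  gawk 'BEGIN { FPAT = "\\[[^]]*\\]|\"[^\"]*\"|[^ ]+" }
        {
          for (i = 1; i <= NF; i++) {
            f = $i
            # strip the enclosing [] or "" only when both ends are present
            if (f ~ /^\[.*\]$/ || f ~ /^".*"$/)
              f = substr(f, 2, length(f) - 2)
            print f
          }
        }'
}

printf '%s\n' "$sample" | fpat_split
```

 no placeholder characters are involved, so the ¶ and §
 caveats do not apply to this variant.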

 I suspect goofy things will happen if an input [...]
 field contains one or more _" or _[, or if an input
 "..." field contains one or more _[ (where _ means
 space).  and if the input does happen to contain ¶
 or § (as appropriate), things won't be quite right.
 there might be other fsckups as well?
 yer kiloage will vary!
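
 the last caveat is cheap to check for up front.  a minimal
 sketch of a guard (the function name is made up) that
 reports whether the input already contains either
 placeholder character:

```shell
# hypothetical helper: succeeds if stdin contains ¶ or § anywhere
# (-F matches fixed strings, so the locale does not matter much here)
has_placeholder() { grep -qF -e '¶' -e '§'; }

if printf '%s\n' 'GET /a¶b HTTP/1.1' | has_placeholder; then
    echo 'input already contains a placeholder; pick another character' >&2
fi
```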

 yer input looks sufficiently regular you ought to
 be able to play games with a (large) regexp and sed(1),
 something along the lines of:

     sed -e 's/^\([^ ]*\) \([^ ]*\) \([^ ]*\) \[\([^]]*\)] "\([^"]*\)" \(.*\)$/\1\n\2\n\3\n\4\n\5\n\6/'

 the above is a TRUNCATED and UNTESTED illustration
 (of a possible alternative approach)!  there's an
 important variant to avoid the huge regexp, something
 like:

     sed -n -e 's/ /\n/
                P
                s/^.*\n\([^ ]*\) /\1\n/
                P
                s/^.*\n\([^ ]*\) /\1\n/
                P
                s/^.*\n\[\([^]]*\)] /\1\n/
                P
                s/^.*\n"\([^"]*\)" /\1\n/
                p'

 (pay careful attention to P vs. p!)
 again, a TRUNCATED and UNTESTED illustration!

  | There are some kludgey workarounds on the web - if it could be done
  | "right" I'd rather do it that way.  [ ... ]

 IMHO, all of the above are kludgey (albeit I can
 imagine rewriting the last sed (with P and p) as
 a more general loop avoiding specific knowledge
 of the order of the field formats?).
 are they any better or worse than what you've found?
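
 the "more general loop" idea can also be sketched in plain
 awk rather than sed (that swap, and the function name, are
 assumptions for the example).  it peels one field at a time
 off the front of the line and decides by the first
 character whether the field ends at ], at ", or at the next
 space:

```shell
sample='10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hh HTTP/1.1" 200 1763'

# hypothetical helper: print each field on its own line, in any field order
split_fields() {
  awk '{
    s = $0
    while (s != "") {
      sub(/^ +/, "", s)                      # skip separating spaces
      if (s == "") break
      c = substr(s, 1, 1)
      if (c == "[" && (n = index(s, "]")) > 0) {
        print substr(s, 2, n - 2)            # text between [ and ]
        s = substr(s, n + 1)
      } else if (c == "\"" && (n = index(substr(s, 2), "\"")) > 0) {
        print substr(s, 2, n - 1)            # text between the quotes
        s = substr(s, n + 2)
      } else {
        n = index(s, " ")                    # bare field: up to next space
        if (n == 0) { print s; s = "" }
        else        { print substr(s, 1, n - 1); s = substr(s, n + 1) }
      }
    }
  }'
}

printf '%s\n' "$sample" | split_fields
```

 an unterminated [ or " falls through to the bare-field
 case, so such a field is split at its spaces.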

cheers!
	-blf-
-- 
▶ ▶  I AM CURRENTLY LOOKING FOR A JOB!  ◀ ◀ | Brian Foster
Experienced (>25 yrs) software engineer:    |        Montpellier, FRANCE
 • Unix, Linux, embedded, design-for-test;  | Stop E$$o (ExxonMobile)!
 • Software/hardware co-design, debugging;  |     http:/www.stopesso.com
 • Kernels, drivers, filesystems, &tc;    Résumé (CV) & contact details:
 • IDL, automated testing, process, &tc.   http://www.blf.utvinternet.ie


