| From: "Paschal Nee" <pnee at toombeola.com>
| Date: Thu, 23 Aug 2007 17:03:34 +0100
|[ ... ]
| Seems like a standard enough query but is there a way to get gawk to
| recognise fields that are enclosed by a delimiter as single fields.
|
| An example:
|
| 10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hh HTTP/1.1" 200 1763
| 10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hy HTTP/1.1" 200 1763
| 10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hz HTTP/1.1" 200 1763
| 10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hf HTTP/1.1" 200 1763
|
| Running gawk '{for (i=0;i<NF;i++) print $i}' on the above would get you
no, it wouldn't; that prints $0, $1, ..., $(NF-1),
i.e., (on separate lines) the entire line,
followed by all but the last field.
to print each field (on separate lines),
you probably meant:
gawk '{for (i=1;i<=NF;i++) print $i}'
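the difference is easy to see on a three-field line.
a small sketch (the function names are made up for
illustration; plain awk is used since nothing here is
gawk-specific):

```shell
# loop_from_zero and loop_from_one are hypothetical helper names.
# $0 is the whole record, so starting at 0 prints the line itself
# first, and stopping before NF drops the last field.
loop_from_zero() { awk '{ for (i = 0; i < NF; i++) print $i }'; }
loop_from_one()  { awk '{ for (i = 1; i <= NF; i++) print $i }'; }

echo 'a b c' | loop_from_zero   # prints "a b c", then "a", then "b"
echo 'a b c' | loop_from_one    # prints "a", then "b", then "c"
```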
|[ ... ] what I'm looking for would be
|
| 10.129.141.1
| -
| -
| 23/Aug/2007:16:06:14 +0100
| GET /hh HTTP/1.1
| 200
| 1763
|
| i.e. recognising that [] and "" enclose single fields.
right. recognising [, ], and " as additional field
separators (which is not exactly the same thing) is
easy. the problem is grokking that whilst spaces
outside [...] and "..." are field separators, spaces
inside either are not field separators.
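to see why the naive separator set is not enough, here
is a sketch that splits one of the sample lines on runs
of space, [, ], and " (count_fields is a made-up name;
plain awk is assumed to be sufficient here):

```shell
line='10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hh HTTP/1.1" 200 1763'

# count_fields is a hypothetical helper: FS is any run of ] [ " or space.
count_fields() { awk -F'[][" ]+' '{ print NF }'; }

# the date and the request are each broken up at their inner spaces,
# so this reports 10 fields instead of the wanted 7.
printf '%s\n' "$line" | count_fields
```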
after a bit of head-scratching, the easiest approach
seems to be a bit of pre-processing; that is, make
the two types of spaces unique.
one way to do this is to turn all field separators
(space, [, ], and ") into something else (I use ¶)
whilst leaving spaces not separating fields alone.
with several caveats (below) this can be done by:
sed -e 's/ /¶/g
:again
s/\(¶\[[^]]*\)¶/\1 /g
s/\(¶"[^"]*\)¶/\1 /g
tagain
s/[]"[]//g' |
gawk -F¶ '{ for (i = 1; i <= NF; i++) print $i }'
another way is to turn all spaces which are not field
separators into something else (I use §), and just to
make it easier, all field separators into spaces (and
with similar caveats):
sed -e ':again
s/\( \[[^]]*\) /\1§/g
s/\( "[^"]*\) /\1§/g
tagain
s/[]"[]/ /g' |
gawk '{ for (i = 1; i <= NF; i++) print gensub(/§/, " ", "g", $i) }'
both approaches, as written, assume each opening [
or " is preceded by a space (and may also assume no
field is empty?). (all true for the example input.)
the second approach needs GNU awk(1), as gensub() is
a gawk extension; the first might work with other awks?
I suspect goofy things will happen if an input [...]
field contains one or more _" or _[, or if an input
"..." field contains one or more _[ (where _ means
space). and if the input does happen to contain ¶
or § (as appropriate), things won't be quite right.
there might be other fsckups as well?
yer kiloage will vary!
yer input looks sufficiently regular you ought to
be able to play games with a (large) RE and sed(1),
something along the lines of:
sed -e 's/^\([^ ]*\) \([^ ]*\) \[\([^]]*\)] "\([^"]*\)" \(.*\)$/\1\n\2\n\3\n\4\n\5/'
the above is a TRUNCATED and UNTESTED illustration
(of a possible alternative approach)! there's an
important variant to avoid the huge RE, something
like:
sed -n -e 's/ /\n/
P
s/^.*\n\([^ ]*\) /\1\n/
P
s/^.*\n\[\([^]]*\)] /\1\n/
P
s/^.*\n"\([^"]*\)"/\1\n/
p'
(pay careful attention to P vs. p!)
again, a TRUNCATED and UNTESTED illustration!
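for the sample lines specifically, the single-substitution
idea could be completed roughly as below. this is only a
sketch: it assumes GNU sed (for \n in the replacement) and
exactly the seven-field layout of the example, and
split_clf is a made-up helper name.

```shell
# split_clf is a hypothetical helper: one s/// per line, one capture
# group per field, in the fixed order of the sample log lines:
#   host, two plain fields, [date], "request", status, size.
# assumes GNU sed, which expands \n in the replacement to a newline.
split_clf() {
  sed -e 's/^\([^ ]*\) \([^ ]*\) \([^ ]*\) \[\([^]]*\)] "\([^"]*\)" \([^ ]*\) \(.*\)$/\1\n\2\n\3\n\4\n\5\n\6\n\7/'
}

printf '%s\n' '10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hh HTTP/1.1" 200 1763' |
  split_clf
```

lines that do not match the layout pass through unchanged,
which at least makes them easy to spot.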
| There are some kludgey workarounds on the web - if could be done
| "right" I'd rather do it that way. [ ... ]
IMHO, all of the above are kludgey (albeit I can
imagine rewriting the last sed (with P and p) as
a more general loop avoiding specific knowledge
of the order of the field formats?).
are they any better or worse than what you've found?
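a different technique from any of the above: gawk 4.0
and later has FPAT, which describes what a field looks
like instead of what separates fields. a minimal sketch,
assuming such a gawk is installed, that [...] and "..."
never nest or contain their own closing character, and
with split_fields as a made-up helper name:

```shell
# split_fields is a hypothetical helper; it needs gawk 4.0 or later.
# FPAT says a field is a [...] group, a "..." group, or a run of
# non-spaces; the longest match at each position wins.
split_fields() {
  gawk 'BEGIN { FPAT = "\\[[^]]*\\]|\"[^\"]*\"|[^ ]+" }
  {
    for (i = 1; i <= NF; i++) {
      f = $i
      # strip the enclosing [ ] or " " when the field has them
      if (f ~ /^\[.*\]$/ || f ~ /^".*"$/)
        f = substr(f, 2, length(f) - 2)
      print f
    }
  }'
}

printf '%s\n' '10.129.141.1 - - [23/Aug/2007:16:06:14 +0100] "GET /hh HTTP/1.1" 200 1763' |
  split_fields
```

no placeholder character is needed, so input that
happens to contain ¶ or § is not a problem here.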
cheers!
-blf-
--
▶ ▶ I AM CURRENTLY LOOKING FOR A JOB! ◀ ◀ | Brian Foster
Experienced (>25 yrs) software engineer: | Montpellier, FRANCE
• Unix, Linux, embedded, design-for-test; | Stop E$$o (ExxonMobile)!
• Software/hardware co-design, debugging; | http:/www.stopesso.com
• Kernels, drivers, filesystems, &tc; Résumé (CV) & contact details:
• IDL, automated testing, process, &tc. http://www.blf.utvinternet.ie