On 22/10/2008, Kae Verens <kae at verens.com> wrote:
> Niall O Broin wrote:
>> I'd like to be able to extract 'Joe Bloggs' from this:
>>
>> <span class="label"><span class="highlight"><span
>> class="highlight-inner"></span></span>
>> <span title="offline" class="name presence-offline">Joe
>> Bloggs</span><ul class="profile-links">
>> <li class="view-profile first"><a href="/users/joebloggs" title="View
>> user profile">
>> View profile</a></li>
>> <li class="view-blog"><a href="/blog/88" title="View blog">View
>> blog</a></li>
>> <li class="add-contact"><a
>> href="/relationship/88/request?destination=dashboard%2Flatest-activity"
>> title="Add to your contacts"> Add associate</a></li>
>> <li class="send-message last"><a href="/" title="Initiate a chat
>> conversation with Joe Bloggs"
>> onclick="javascript:Drupal.xmppclient.message_chat('joe.blogs at whatever.com');;return
>> false;">Initiate chat</a></li>
>> </ul></span></span>
>>
>> i.e. I need to extract the value from the first tag which DOESN'T have
>> any enclosed tags.
>>
>> Now, I could write some code to do this in Perl or Ruby, but I'd like
>> to be able to do it with a pure RE if it can be done.
> s/.*class="name presence-offline">\([^<]*\)<.*/\1/m
>
> or am I missing something here?
That'll work in this case, but may not be as general as wanted.
Using the "i.e." explanation, it looks like what is wanted is:
a >
one or more not-<
a <
(and that last bit isn't really needed).
Note that you can't parse HTML with a regexp, but if you're willing to
accept some limitations in the input, you might be able to get close
enough to what you want.
(Limitations include "the only angles delimit tags", and "matching
over newlines is up to you to do right", and "whitespace is also your
business", for example.)
How is "grep -o '>[^<]\+'" as a starting point?
Strip the leading '>' and you're done.
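
For anyone without grep to hand, here is a rough Python sketch of the same
idea. It works line by line, as grep does. SAMPLE is a shortened stand-in for
the quoted markup, and grep_o is a name made up for illustration, not an
existing tool.

```python
import re

# Shortened stand-in for the quoted markup (an assumption for illustration);
# the line break inside the name is kept on purpose.
SAMPLE = (
    '<span class="label"><span class="highlight"><span\n'
    'class="highlight-inner"></span></span>\n'
    '<span title="offline" class="name presence-offline">Joe\n'
    'Bloggs</span><ul class="profile-links">\n'
    '<li class="view-blog"><a href="/blog/88" title="View blog">View\n'
    'blog</a></li>\n'
    '</ul></span></span>\n'
)


def grep_o(text):
    # Hypothetical helper: apply '>' followed by one or more non-'<'
    # characters to each line separately, then strip the leading '>'.
    hits = []
    for line in text.splitlines():
        for match in re.findall(r'>[^<]+', line):
            hits.append(match[1:])
    return hits


print(grep_o(SAMPLE))
```

On this sample the first hit is only 'Joe', because the name is split over two
lines; that is the "matching over newlines is up to you" limitation in
practice.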
Your regexp engine might include a way to match only the first time
(perl's "^.*?", for example) or might include a way to match across
newlines (in which case, you might want something like '>[^<\n][^<]*'
instead), or might include a way to group the bit you want and make
only that bit available.
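
As a sketch of those three options together, here is how it might look in
Python, where re.search returns only the leftmost match, a character class
like [^<] matches newlines, and a group picks out the wanted bit. The sample
string and the name first_text are assumptions for illustration only.

```python
import re


def first_text(html):
    # Hypothetical helper: find the first '>' followed by text that does not
    # start with '<' or a newline, and capture everything up to the next '<'.
    match = re.search(r'>([^<\n][^<]*)', html)
    if match is None:
        return None
    # Whitespace is our business: collapse newlines and runs of spaces.
    return ' '.join(match.group(1).split())


# Made-up sample with the same shape as the quoted markup.
sample = (
    '<span class="a"><span class="b"></span>\n'
    '<span class="name">Joe\n'
    'Bloggs</span><ul>\n'
    '<li><a href="/x">View profile</a></li></ul></span>'
)

print(first_text(sample))
```

The same limitations apply: a stray angle bracket in an attribute value or a
comment would throw this off.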
Good luck,
f