Learning regexpr

Post by **ppslim** » Tue May 06, 2003 5:57 am

While, again, not directly Tcl related, this subject is well worth talking about. Regular expressions will help in a hell of a lot of ways, and they make no fewer than 3 commands far more fun, 2 of which require them, full stop.

This entry actually comes from our very own "stdragon". Posted back in Sep 02 (available here), it is more worthwhile in a Tcl FAQ than at the bottom of a failed search or a long and boring trail through the forum's history.

It was originally written in answer to the usual questions ("Tutorials please", "How do I use them?", "What can I use them for?"), and replies like it are what make this forum as a whole such a powerful learning and help tool.

Regexps?

1. I learned to use them by experimentation in tclsh. If you're in a hurry and don't want to 'learn as you go' then sit down for an hour and play with every special character you see in re_syntax until you know what it does.

2. You can use them for complex string matching and substitution. That's about it, but that encompasses a lot, since most things in life can be represented as a string. Usually people use them for syntax checking (e.g. "Is the input a valid email address?" or "Is this sentence 'bad' as defined by this list of user rules?"). Other common uses are getting rid of color/control codes, extracting parts of a line into variables, and performing substitutions into kick messages, etc. (like the person's nick replacing %nick in the kick message).

3. Regexp returns 0 for no match and 1 for a match. Regsub, when given a variable to store the result in, returns the number of substitutions, i.e. 0 for no match and non-zero for a match.
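To make points 2 and 3 concrete, here is a small sketch you can paste into tclsh. The sample strings, the kick message template and the variable names are made up for illustration.

```tcl
# regexp returns 1 on a match and 0 otherwise.
set text "hello world"
set found   [regexp {wor} $text]
set missing [regexp {xyz} $text]

# With a result variable, regsub returns the number of substitutions made.
# -all replaces every occurrence instead of only the first one.
set count [regsub -all {o} $text {0} changed]
# count is now 2 and changed is "hell0 w0rld"

# Substituting a nick into a kick message (hypothetical template and nick).
# Note that & and \ are special in the replacement string, so this simple
# form assumes the nick contains neither.
set template "Goodbye %nick, and stay out, %nick!"
regsub -all {%nick} $template "stdragon" kickmsg
puts $kickmsg
```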

Looking through re_syntax is good, but it's better to find some example scripts. Depending on what you want to do, most regular expressions are very simple. There are only a few special chars you have to escape when you want them literally (like ( ), |, ., *, { }, +, and also [ ], ?, ^, $ and \ itself).

Here's a quick mini tutorial:

( ) is used for match reporting. regexp lets you specify 'match variables' that get filled in with exactly what matched. The matches within parentheses are what get reported. Also, just like in math, they allow you to group other operations. An example of match reporting would be: regexp {(.*)!(.*)@(.*)} $from match nick user host
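Here is that same call as a runnable sketch. The value of $from is invented; in a real script it would be whatever nick!user@host string you were handed.

```tcl
# Hypothetical input in nick!user@host form.
set from "stdragon!sd@example.org"

# The first variable receives the whole match, the rest receive the
# text matched by each pair of parentheses, left to right.
if {[regexp {(.*)!(.*)@(.*)} $from match nick user host]} {
    puts "whole match: $match"
    puts "nick=$nick user=$user host=$host"
}
```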

| means "or". It lets your regexp match more than one thing. For instance, "hello there|hi there" would match either "hello there" or "hi there". Using ( ) for grouping, you could say "(hello|hi) there" and it would be the same thing.
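A quick way to try this in tclsh. The ^ and $ used here are anchors (start and end of the string), which are not covered in this tutorial; they are only there so that the whole greeting has to match.

```tcl
# Collect the regexp result for each sample greeting.
set results {}
foreach greeting {"hello there" "hi there" "hey there"} {
    lappend results [regexp {^(hello|hi) there$} $greeting]
}
puts $results   ;# the first two greetings match, the third does not
```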

. means "any single character". So the regex "..." would match any 3 characters (including spaces). To match an actual period, escape it with a backslash: \.
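A short sketch of the difference between a bare dot and an escaped one. The ^ and $ anchors are added so that the pattern has to cover the whole string; the test strings are arbitrary.

```tcl
set threeChars [regexp {^...$} "a b"]    ;# 1: three characters, space included
set twoChars   [regexp {^...$} "ab"]     ;# 0: only two characters

set anyChar    [regexp {^a.c$} "abc"]    ;# 1: . stands in for the "b"
set literalNo  [regexp {^a\.c$} "abc"]   ;# 0: \. only matches a real period
set literalYes [regexp {^a\.c$} "a.c"]   ;# 1
```

Putting the pattern in braces keeps Tcl itself from touching the backslash before regexp sees it.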

* means "any number of the previous thing, including 0." So if you have "baa*", what is the previous thing? "a". So that would match "ba" (0), "baa", "baaaaaaaaa", etc. However, using grouping ( ) you can match multiple things: "(baa)*" would match "baabaabaa". If you notice, .* will match anything, because . means "any char" and * means "any number". So an easy way to translate between DOS-style wildcards and regexps is to replace each ? with a single dot, and each * with ".*"
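The examples above, plus a rough sketch of the wildcard translation. The proc name wild2re is made up, and it only handles ., ? and *; a mask containing other special characters would need more escaping. It also adds ^ and $ anchors, since a wildcard mask is normally meant to cover the whole string.

```tcl
# "baa*" is "ba" followed by zero or more extra a's.
set star0 [regexp {^baa*$} "ba"]             ;# 1
set star3 [regexp {^baa*$} "baaaa"]          ;# 1
set group [regexp {^(baa)*$} "baabaabaa"]    ;# 1
set nope  [regexp {^(baa)*$} "baab"]         ;# 0

# Hypothetical helper: turn a DOS-style wildcard mask into a regexp.
proc wild2re {mask} {
    # Escape literal dots, then map ? to . and * to .*
    return "^[string map {. \\. ? . * .*} $mask]\$"
}

set re [wild2re "*.exe"]
puts $re                                     ;# ^.*\.exe$
set isExe  [regexp $re "setup.exe"]          ;# 1
set notExe [regexp $re "setup.exec"]         ;# 0
```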

+ is exactly the same as *, but it requires at least 1. So "baa+" would not match "ba" anymore.

{ } is a range operator. It's just like * and + but it lets you specify how many repeats are acceptable. For instance, "ba{2,10}" would match from "baa" (2 a's) to "baaaaaaaaaa" (10 a's).
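Both + and { } can be checked the same way in tclsh. One thing to keep in mind: without the ^ and $ anchors, regexp is happy to find the pattern anywhere inside a longer string, so the upper limit only really bites when the pattern is anchored.

```tcl
# + needs at least one repeat.
set plusNo  [regexp {^baa+$} "ba"]       ;# 0
set plusYes [regexp {^baa+$} "baa"]      ;# 1

# {2,10} accepts between 2 and 10 repeats of the previous thing.
set tooFew  [regexp {^ba{2,10}$} "ba"]   ;# 0: only one a
set inRange [regexp {^ba{2,10}$} "baa"]  ;# 1: two a's

set eleven "b[string repeat a 11]"       ;# "b" followed by 11 a's
set tooMany  [regexp {^ba{2,10}$} $eleven]   ;# 0: over the limit
set unanchor [regexp {ba{2,10}} $eleven]     ;# 1: matches the first part only
```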

There are a few more that are either less useful or way more complicated. For instance, the section on negative lookaheads meant very little to me until the other day. It's a silly name; what it really is, is an "and not" operator. For instance, if you want to match something that "contains an a and no b" you could use a negative lookahead. (Yes, for that example you can use the [ ] operator with ^b to match anything but b, but [ ] can't contain a full regular expression, whereas negative lookaheads can.)

So those are the basics. Anything more advanced would require a great deal more text, I think. If you have specific questions feel free to ask.
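A possible way to write "contains an a and no b" both ways, as a sketch. In re_syntax a negative lookahead is written (?!re); the sample words and the word list in the last pattern are invented.

```tcl
set viaLookahead {}
set viaBrackets  {}
foreach word {cat cab dog} {
    # (?!.*b) at the start says: from here on, no b anywhere.
    lappend viaLookahead [regexp {^(?!.*b).*a} $word]
    # Same test using [^b], which means "any char except b".
    lappend viaBrackets  [regexp {^[^b]*a[^b]*$} $word]
}
puts "lookahead: $viaLookahead"
puts "brackets:  $viaBrackets"

# Where the lookahead pays off: "and not" with a whole sub-expression,
# here "contains hello and neither spam nor flood".
set ok  [regexp {^(?!.*(spam|flood)).*hello} "hello everyone"]
set bad [regexp {^(?!.*(spam|flood)).*hello} "hello, want some spam?"]
```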

As the man said, post any further questions as you see fit.

Post by **user** » Tue May 06, 2003 3:10 pm

People often stick with regexp (once they learn how to use it) even when there are better (faster) methods available to deal with a certain problem. 'scan' and 'string map' are the commands most commonly ignored by regexp fanatics.
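As an illustration of the 'string map' side of that, here is a sketch of filling in a kick message without any regexp at all. The template, the %reason placeholder and the values are made up.

```tcl
# Hypothetical kick message template with two placeholders.
set template "Goodbye %nick (%reason)"

# string map takes a list of search/replace pairs and does plain text
# replacement, so nothing in the nick or reason needs escaping.
set kickmsg [string map [list %nick "stdragon" %reason "flooding"] $template]
puts $kickmsg
```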

Example using scan to chop up eggdrop's $botname:
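A minimal sketch of what such a scan call could look like, assuming $botname holds the usual nick!user@host form. The value below is invented so the snippet runs outside eggdrop.

```tcl
# Hypothetical stand-in for eggdrop's $botname.
set botname "mybot!bot@host.example.org"

# %[^!] reads everything up to the "!", %[^@] everything up to the "@",
# and %s reads the rest. scan returns the number of fields it filled in.
if {[scan $botname {%[^!]!%[^@]@%s} nick user host] == 3} {
    puts "nick=$nick user=$user host=$host"
}
```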


A note to future regexp lunatics :)