Learning regexpr

 
ppslim
Revered One


Joined: 23 Sep 2001
Posts: 3914
Location: Liverpool, England

Posted: Tue May 06, 2003 5:57 am    Post subject: Learning regexpr

While, again, not directly Tcl related, this subject is well worth talking about. It will help in a hell of a lot of ways, and makes no fewer than 3 commands far more fun, 2 of which require it full stop.

This entry actually comes from our very own "stdragon". Posted back in Sep 02 (available here), it is more worthwhile in a Tcl FAQ than at the bottom of a failed search, or a long and boring trail through the forum's history.

It was originally a reply to the questions "Tutorials please", "How do I use them" and "What can I use them for". Replies like this are what make this forum as a whole such a powerful learning and help tool.

Regexps?

1. I learned to use them by experimentation in tclsh. If you're in a hurry and don't want to 'learn as you go' then sit down for an hour and play with every special character you see in re_syntax until you know what it does.

2. You can use them for complex string matching and substitution. That's about it, but that encompasses a lot, since most things in life can be represented as a string. Usually people use it for syntax checking (e.g. "Is the input a valid email address?" or "Is this sentence 'bad' as defined by this list of user rules?"). Other common uses are getting rid of color/control codes, extracting parts of a line into variables, and performing substitutions into kick messages, etc. (like the person's nick replaces %nick in the kick message).

3. Regexp returns 0 for no match and 1 for a match. Regsub returns the number of substitutions (when you give it a variable to store the result in), e.g. 0 for no match and non-zero for a match.
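
A small sketch of the return values from point 3 and the %nick substitution from point 2. The variable names and sample strings are made up for illustration; try it in tclsh.

```tcl
# regexp returns 1 on a match and 0 otherwise
set line "hello there"
set hit  [regexp {hello} $line]
set miss [regexp {bye} $line]

# regsub, given a variable name, stores the new string in that variable
# and returns how many substitutions it made
set template "%nick: no flooding, %nick!"
set count [regsub -all {%nick} $template "lamer" kickmsg]
puts "$hit $miss $count"   ;# 1 0 2
puts $kickmsg              ;# lamer: no flooding, lamer!

# For a fixed string like %nick, string map does the same job without a regexp
set kickmsg2 [string map [list %nick "lamer"] $template]
```

Both kickmsg and kickmsg2 end up holding the same text; the regsub form only becomes necessary once the thing being replaced is a pattern rather than a fixed string.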

Looking through re_syntax is good, but it's better to find some example scripts. Depending on what you want to do, most regular expressions are very simple. There are only a few special chars you have to escape (like (), |, ., *, {}, +, and also ?, [], ^, $ and \).

Here's a quick mini tutorial:

( ) is used for match reporting. regexp lets you specify 'match variables' that get filled in with exactly what matched. The matches within parentheses are what get reported. Also, just like in math, they allow you to group other operations. An example of match reporting would be: regexp {(.*)!(.*)@(.*)} $from match nick user host
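
To see what ends up in each match variable, the example can be run as a short script. The hostmask below is a made-up sample; in eggdrop the value would come from a bind or from $botname.

```tcl
# Sample nick!user@host mask (hypothetical value)
set from "MyBot!ident@some.host.example"

# The first variable receives the whole match, the rest one ( ) group each
if {[regexp {(.*)!(.*)@(.*)} $from match nick user host]} {
    puts "match=$match"
    puts "nick=$nick user=$user host=$host"
} else {
    puts "not a nick!user@host mask"
}
```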

| means "or". It lets your regexp match more than one thing. For instance, "hello there|hi there" would match either "hello there" or "hi there". Using ( ) for grouping, you could say "(hello|hi) there" and it would be the same thing.

. means "any single character". So the regex "..." would match any 3 characters (including spaces). To match an actual period, escape it with a backslash: \.

* means "any number of the previous thing, including 0." So if you have "baa*", what is the previous thing? "a". So that would match "ba" (0), "baa", "baaaaaaaaa", etc. However, using grouping ( ) you can match multiple things: "(baa)*" would match "baabaabaa". If you notice, .* will match anything, because . means "any char" and * means "any number". So an easy way to translate between dos-style wildcards and regexps is to replace ?'s with single dots, and * with ".*"
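
The wildcard translation can be sketched as a small proc. The name wild2re is hypothetical, and the sketch assumes every non-alphanumeric character other than ? and * should be taken literally, so it escapes those first. (For plain wildcard matching Tcl already has 'string match'; this is only to show the idea.)

```tcl
# Hypothetical helper: turn a DOS-style wildcard mask into an anchored regexp
proc wild2re {mask} {
    # Put a backslash in front of every character that is not a letter,
    # digit or underscore, so characters like . lose their special meaning
    regsub -all {\W} $mask {\\&} re
    # The wildcards were escaped too; swap them for their regexp equivalents
    set re [string map [list {\*} {.*} {\?} {.}] $re]
    return "^$re\$"
}

puts [wild2re "*.txt"]                        ;# ^.*\.txt$
puts [regexp [wild2re "*.txt"] "notes.txt"]   ;# 1
puts [regexp [wild2re "a?c"] "abbc"]          ;# 0
```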

+ is exactly the same as *, but it requires at least 1. So "baa+" would not match "ba" anymore.

{ } is a range operator. It's just like * and + but it lets you specify how many repeats are acceptable. For instance, "ba{2,10}" would match from "baa" (2 a's) to "baaaaaaaaaa" (10 a's).
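
The three repeat operators can be compared side by side in tclsh. The patterns below are the ones from the text with ^ and $ added, so that the whole string has to match rather than just a part of it.

```tcl
set results {}
foreach {re str} {
    {^baa*$}      ba
    {^baa+$}      ba
    {^baa+$}      baaaa
    {^(baa)*$}    baabaabaa
    {^ba{2,10}$}  ba
    {^ba{2,10}$}  baaa
} {
    # 1 when $str matches $re, 0 when it does not
    lappend results [regexp $re $str]
}
puts $results   ;# 1 0 1 1 0 1
```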

There are a few more that are either less useful or way more complicated. For instance, the section on negative lookaheads meant very little to me until the other day. It's a silly name; what it really is is an "and not" operator. For instance, if you want to match something that "contains an a and no b" you could use a negative lookahead. (Yes, for that example you can use the [ ] operator with [^b], but [ ] can't contain a full regular expression, whereas negative lookaheads can.)
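
A sketch of the "contains an a and no b" example, written both ways. The test words are made up; (?!...) is the negative lookahead syntax listed in re_syntax.

```tcl
# Negative lookahead: from the start, "b" must not appear anywhere ahead,
# and an "a" must appear somewhere
set lookahead {^(?!.*b).*a}
# The same test using only [ ] bracket expressions
set brackets  {^[^b]*a[^b]*$}

set results {}
foreach s {cat banana xyz} {
    lappend results [list $s [regexp $lookahead $s] [regexp $brackets $s]]
}
puts $results   ;# {cat 1 1} {banana 0 0} {xyz 0 0}

# A lookahead can hold a whole regular expression, which [ ] cannot:
# "contains an a, but not the word bad"
set noBad {^(?!.*bad).*a}
puts [regexp $noBad "a sad day"]   ;# 1
puts [regexp $noBad "a bad day"]   ;# 0
```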

So those are the basics. Anything more advanced would require exponentially more amounts of text I think. If you have specific questions feel free to ask.

As the man said, post any further questions if you feel fit.
_________________
PlusNet Supported Customer - Low cost UK ISP services
user
 


Joined: 18 Mar 2003
Posts: 1452
Location: Norway

Posted: Tue May 06, 2003 3:10 pm    Post subject: A note to future regexp lunatics :)

People often stick with regexp (once they learn how to use it) even when there are better (faster) methods available to deal with a certain problem. 'scan' and 'string map' are the commands most commonly ignored by regexp fanatics.

Example using scan to chop up eggdrop's $botname:

Code:
scan $botname %\[^!\]!%\[^@\]@%s nick user host


which is ~13.5 times faster than

Code:
regexp {(.*)!(.*)@(.*)} $botname match nick user host


in my tclsh (8.3)
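
To repeat the comparison on your own machine, Tcl's 'time' command can run each version many times. This is only a sketch with a made-up $botname value; the ratio you get will depend on your Tcl version and hardware, so no figures are shown here.

```tcl
# Sample value; in eggdrop, $botname is set by the bot itself
set botname "MyBot!ident@some.host.example"

# Each call prints the average number of microseconds per iteration
puts [time {scan $botname {%[^!]!%[^@]@%s} nick user host} 10000]
puts [time {regexp {(.*)!(.*)@(.*)} $botname match nick2 user2 host2} 10000]

# Both approaches should extract the same three parts
puts "$nick $user $host"
puts "$nick2 $user2 $host2"
```

The scan format is written in curly brackets here instead of with backslashes; it is the same format string as in the post above.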

/rant :P
ppslim
Revered One


Joined: 23 Sep 2001
Posts: 3914
Location: Liverpool, England

Posted: Sun May 11, 2003 7:59 pm

Thanks to a kind suggestion from pgpkeys (#egghelp@efnet), there is also a PDF document for you to download.

I haven't had a look at it myself yet, but I'm sure it may be of more use to somebody here.

The document is called Mastering Regular Expressions
Sir_Fz
Revered One


Joined: 27 Apr 2003
Posts: 3793
Location: Lebanon

Posted: Fri Jul 11, 2003 1:56 pm

well guys :), I was about to ask how to learn regexp, but I found this post and it's really handy :)

nice idea Ppslim.
_________________
Follow me on GitHub

- Opposing

Public Tcl scripts
caesar
Ass Kicker


Joined: 14 Oct 2001
Posts: 3401
Location: Area 51

Posted: Sat Aug 28, 2004 1:01 pm

The Mastering Regular Expressions link no longer seems to be working, does anyone have a good link to it?
_________________
You may say anything about me, but don't misspell my name.
awyeah
Revered One


Joined: 26 Apr 2004
Posts: 1580
Location: Switzerland

Posted: Sat Aug 28, 2004 3:53 pm

Here is a good website to learn regular expressions (regexp) from; it includes tutorials and examples:
http://www.regular-expressions.info/

Here are some tools made specifically for testing and using regular expressions, with their respective examples:
http://www.regular-expressions.info/tools.html
_________________
·­awyeah·

==================================
Facebook: jawad@idsia.ch (Jay Dee)
PS: Guys, I don't accept script helps or requests personally anymore.
==================================
awyeah
Revered One


Joined: 26 Apr 2004
Posts: 1580
Location: Switzerland

Posted: Thu Jul 07, 2005 8:19 am

I have been dealing with regexps a lot these days. Here are a few common examples which can help you in eggdrop scripts.

Suppose you want to count the number of A's (the letter) in a string:
Code:

regexp -all {A} $string


Suppose you want to count the number of the's (word) in a string:
Code:

regexp -all {the} $string


Suppose you want to count occurrences of any of several characters:
Code:

regexp -all {[abcd]} $string
#This counts every a, b, c and d found and returns the total.


Suppose you want to count the characters that are NOT in a set:
Code:

regexp -all {[^abcd]} $string
#This counts every character that is not an a, b, c or d.
#To run code only when none of a, b, c or d is present, test that {[abcd]} does NOT match: if {![regexp {[abcd]} $string]} { ... }


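Sometimes while matching with regexps you can use:

Code:

regexp "string" $string
regexp \[string\] $string
regexp {string} $string


I would advise you to use the curly brackets. Note that the second form is not the same pattern as the other two: \[string\] reaches regexp as [string], a bracket expression that matches any single one of the characters s, t, r, i, n or g.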

Counting special characters:
Code:

#To count the number of ['s:
regexp -all \[\\\[\] $string

#To count the number of \'s:
regexp -all \[\\\\\] $string

#To count the number of {'s:
regexp -all \[{\] $string

NOTE: In an unbraced pattern you will generally need 3 escapes in front of each [, ] or \ special character. For others you mostly need none. Wrapping the pattern in curly brackets avoids most of this, e.g. {\[} matches a literal [.


Note: regexp has a -nocase switch, which can be used for ignoring cases while doing matching.

Code:

regexp -nocase -all {abc} $string
#and
regexp -nocase -all {ABC} $string
#will be considered the same then


Matching range of characters:

Code:

#To match a character in the range a, b, c, d, ..., z:
regexp {[a-z]} $string ;# returns 1 for MATCH, 0 for NO-MATCH (lower case match)
regexp -all {[a-z]} $string ;# returns total number of MATCHES (lower case match)
regexp -nocase -all {[a-z]} $string ;# returns total number of MATCHES (case ignored)

#To match a character in the range 0, 1, 2, 3, ..., 9:
regexp {[0-9]} $string ;# returns 1 for MATCH, 0 for NO-MATCH
regexp -all {[0-9]} $string ;# returns total number of MATCHES

#Note: The -nocase switch for [0-9] would be redundant.

#To match a character in the ranges a, b, c, ..., z and 0, 1, 2, ..., 9:
regexp {[a-z0-9]} $string ;# returns 1 for MATCH, 0 for NO-MATCH
regexp -all {[a-z0-9]} $string ;# returns total number of MATCHES


Note also that it is necessary to anchor the pattern when the whole string must match:

Code:

regexp {^string_here$} $string
^ = Assert position at the start of the string (with the -line switch, also after each line break)
$ = Assert position at the end of the string (with the -line switch, also before each line break)
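
A quick tclsh sketch of what the anchors change. The sample strings are made up; -line is the regexp switch that makes ^ and $ also match at line breaks.

```tcl
set r1 [regexp {bot}   "eggdrop bots"]      ;# "bot" appears somewhere
set r2 [regexp {^bot}  "eggdrop bots"]      ;# string does not start with "bot"
set r3 [regexp {^bot$} "bot"]               ;# whole string is exactly "bot"
set r4 [regexp {^bot$} "bot\nnet"]          ;# $ only matches at the very end
set r5 [regexp -line {^bot$} "bot\nnet"]    ;# with -line, $ also matches before the newline
puts "$r1 $r2 $r3 $r4 $r5"   ;# 1 0 1 0 1
```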


The | operator is used as a LOGICAL "OR".

Code:

regexp {abc|efg|hij} $string
#This will try to match "abc" or "efg" or "hij". If none is found it returns 0; if any one is found it returns 1.


The ^ operator, used as the first character inside a [ ] bracket expression, acts as a LOGICAL "NOT".

Code:

regexp {^[^abcd]*$} $string
#If none of a, b, c and d is present, returns 1; if any one of them is, returns 0.
#(Without the *, the pattern would only match a string that is exactly one character long.)


Note: if you want the matched text itself rather than 0 or 1, use the -inline switch, which makes regexp return the matches as a list. In most cases you should use it together with -all.
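
A short sketch of -inline together with -all, using made-up sample text. When the pattern contains ( ) groups, each match contributes the whole match followed by its groups.

```tcl
# Every run of digits in the text, returned as a list
set text "ports 6667 and 7000, also 6697"
set nums [regexp -all -inline {[0-9]+} $text]
puts $nums    ;# 6667 7000 6697

# With groups: whole match, group 1, group 2, then the same for the next match
set pairs [regexp -all -inline {([a-z]+)=([0-9]+)} "a=1 bc=22"]
puts $pairs   ;# a=1 a 1 bc=22 bc 22
```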

Other examples:
If you want to match certain patterns:
Code:

regexp {^[a-z]{3,}[0-9]{2,}$} $string
#This will match only if 3 or more characters are present in the range [a-z] and 2 or more characters in the range [0-9] of the string.

#Example:
abgfg452 > will match
as456342 > will not match
abc12 > will match


Other examples:
Code:

regexp {^[a-z]{3,5}[0-9]{2,8}$} $string
#This will match only if 3, 4 or 5 characters are present in the range [a-z] and 2, 3, 4, 5, 6, 7 or 8 characters in the range [0-9] of the string.

#Examples:
adfsd3463 > will match
adsfsdgfs325 > will not match
wer436234 > will match
gdtweer436322512 > will not match


Other examples:
Code:

regexp {^[a-z]{4}[0-9]{3}$} $string
#This will match only if 4 characters are present in the range [a-z] and 3 characters in the range [0-9] of the string.

#Examples:
wrew364 > will match
we436 > will not match
whg6743 > will not match
wga63 > will not match
se65 > will not match


Other matchings:
Code:

regexp {https?} $string
#This will return 1 if "http" or "https" is present in the string, else return 0.


Here are some advanced examples:

Code:

regexp {([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})} $string
#IP Address -- Matches 0.0.0.0 through 999.999.999.999

regexp {(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)} $string
#IP Address -- Matches 0.0.0.0 through 255.255.255.255
#(Not anchored: wrap the pattern in ^ and $ to reject strings that merely contain an address.)

regexp -nocase {(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]} $string
#Matching a url (-nocase is needed because the ranges only list A-Z)

regexp {[0-9]{5}(?:-[0-9]{4})?} $string
#US Zipcode

regexp -nocase {[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}} $string
#Email address (-nocase for the same reason)

regexp {(0[1-9]|[12][0-9]|3[01])[-/.](0[1-9]|1[012])[- /.](19|20)[0-9]{2}} $string
#Date in formats: dd-mm-yyyy, dd.mm.yyyy, dd/mm/yyyy (years 1900 to 2099)

regexp {^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6011[0-9]{14}|3(?:0[0-5]|[68][0-9])[0-9]{11}|3[47][0-9]{13})$} $string
#Matching all major credit cards


More of these examples can be found by downloading and installing the software "RegexBuddy".

Download link: http://www.regexbuddy.com/download.html

1) After downloading, install the trial version of the software.
2) After installation, run the software and click on the Library tab.
3) In the long search list on the right panel, highlight any matching pattern of your choice; the window on the left will then show the regular expression for it.
4) This is good software to learn regexp from.


Last edited by awyeah on Fri Jul 08, 2005 4:40 am; edited 2 times in total
awyeah
Revered One


Joined: 26 Apr 2004
Posts: 1580
Location: Switzerland

Posted: Fri Jul 08, 2005 4:15 am

Here are some quick and easy examples of substitutions. We can use 'regsub' (regular expression substitution) or 'string map'.

Regular expression substitutions are slower than string map, but more advanced and flexible. For simple fixed-string replacements, both can be used to accomplish the same thing.

If you want to remove a character from a string:
Code:

#regsub
regsub -all {a} $data "" data
#Will remove all occurrences of character "a" in the string $data.

regsub -all {a} $data "b" data
#Will replace all occurrences of character "a" in the string by "b".

Similarly,

#string map
string map {"a" ""} $data
#Will return a copy of $data with all occurrences of character "a" removed.

#string map
string map {a b} $data
string map {"a" "b"} $data
#Will return a copy of $data with all occurrences of character "a" replaced by "b".
#Note: string map does not change $data itself; to keep the result use: set data [string map {a b} $data]


Mostly, regsub and string map are used in filters, to filter out certain parts, characters or words in texts.

Similar to regexp, regsub patterns can use a [ ] bracket expression to match each listed character individually.

Code:

#This will remove all occurrences of a, b, c and d in the string and store the result in data.
regsub -all {[abcd]} "sgfdszasbdgds" "" data
#Note: We are using the -all switch here, so it will return '5', the number of matches; data will hold "sgfszsgs".

You can also strip control codes (colors, bold, underline etc.) from strings using regsub or string map filters, as you might have seen in many posts on the forum. Here are some I found on the forum:

#For removing colors
regsub -all {\003([0-9]{1,2}(,[0-9]{1,2})?)?} $str "" str

#For removing control codes
regsub -all {\017|\037|\002|\026|\006|\007} $str "" str

#For removing control codes
set str [string map {"\017" "" "\037" "" "\002" "" "\026" "" "\006" "" "\007" ""} $str]


You might have noticed that removing colors needs a regsub pattern, because the digits that follow \003 vary; a fixed string map like the one above can't do that.

Note: It is best to write control codes by their ASCII codes (the octal escapes used above).
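
The color pattern and a string map for the other codes can be wrapped in one small proc. This is only a sketch: the name strip_ctrl is made up, and it handles just the color, bold, underline, reverse and plain codes.

```tcl
# Hypothetical helper combining the color regsub with a string map
proc strip_ctrl {str} {
    # \003 optionally followed by a foreground and ,background color number
    regsub -all {\003([0-9]{1,2}(,[0-9]{1,2})?)?} $str "" str
    # Fixed single-character codes: plain, underline, bold, reverse
    return [string map [list \017 "" \037 "" \002 "" \026 ""] $str]
}

set plain [strip_ctrl "\0034,1red on black\003 and \002bold\002"]
puts $plain   ;# red on black and bold
```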

Then normally, string map and regsub can be used as filters to strip out certain special characters or to escape them with extra \'s.

Here are common examples I found to escape special characters by creating small filters.

Code:

#regsub
proc filter {data} {
regsub -all -- \\\\ $data \\\\\\\\ data
regsub -all -- \\\[ $data \\\\\[ data
regsub -all -- \\\] $data \\\\\] data
regsub -all -- \\\} $data \\\\\} data
regsub -all -- \\\{ $data \\\\\{ data
regsub -all -- \\\$ $data \\\\\$ data
regsub -all -- \\\" $data \\\\\" data
return $data
}
#Taken from: http://www.peterre.com/characters.html

#string map
proc filter {data} {
 set data [string map {\\ \\\\ [ \\\[ ] \\\] \{ \\\{ \} \\\} $ \\\$ \" \\\"} $data]
}
#Taken from: spambuster.tcl


A list of all special characters that can choke scripts if not used properly:
Code:

\, [, ], {, }, $, "


Note: regsub can be used in similar format as regexp:
Code:

regsub -all "\002|\003|\017|\026|\037" $text "" text
regsub -all {\002|\003|\017|\026|\037} $text "" text


For example:
Code:

#To count and remove the capital letters in a string:
set counted [regsub -all {[A-Z]} $text "" stripped]

#The total number of capital letters in $text will be placed in $counted, and $stripped will hold $text stripped of the capital letters. $text itself is left unchanged.

The same goes for numbers, [0-9], or both, [a-z0-9].
Also, the -nocase switch is available in regsub if you want to ignore case while matching -- it only affects letters.


String map does not have an -all switch and does not report how many replacements it made, hence it is harder to count characters with it, so string map does have limitations.

For example:
Code:

set counted [regsub -all {a} $text "" stripped]

#is similar to:

set counted 0
for {set count 0} {$count < [string length $text]} {incr count} {
 if {[string equal "a" [string index $text $count]]} {
  incr counted
 }
}

(Adv: As you can see, regsub is simpler, easier and takes less code)
(Disadv: regsub is slower than string map)
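
If all you need is the count, two other routes are worth knowing. This is a sketch with a made-up sample string: regexp -all returns the count directly, and string map can still be used by comparing string lengths before and after removal.

```tcl
set text "banana bandana"

# regexp -all returns the number of matches, no substitution needed
set counted [regexp -all {a} $text]

# string map route: the count is how much shorter the string gets
# (divide by the length of the search string if it is longer than 1 char)
set viaMap [expr {[string length $text] - [string length [string map {a ""} $text]]}]

puts "$counted $viaMap"   ;# 6 6
```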

demond
Revered One


Joined: 12 Jun 2004
Posts: 3073
Location: San Francisco, CA

Posted: Wed Jul 13, 2005 11:50 pm

attention should be paid to some subtle aspects of regexps, for example so-called "greedy matching"

by default, regexp characters '+' and '*' will match as much as possible ("greedy matching"), which might mean rather unexpected results, most likely in HTML parsing constructs like this:
Code:

% set str "<tag>foo</tag>some text<tag>bar</tag>"
% regsub -all {<tag>(.*)</tag>} $str ""
%

here, we need to strip the tags and their contents, leaving the text in-between; however, because of the greedy matching, we end up with all of the characters between the first opening tag and the last closing tag stripped, effectively leaving us with an empty string - definitely not what we wanted!

the solution is to add '?' specifier after the asterisk, to avert the greedy matching and force '+'/'*' to match as little as possible:
Code:

% set str "<tag>foo</tag>some text<tag>bar</tag>"
% regsub -all {<tag>(.*?)</tag>} $str ""
some text
%
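
For this kind of input there is another common way to avoid the greedy match: replace the dot with a bracket expression that cannot run past the next tag. The sketch below compares the three patterns; it assumes the tag contents never contain a < character.

```tcl
set str "<tag>foo</tag>some text<tag>bar</tag>"

# Greedy: one match from the first <tag> to the last </tag>
regsub -all {<tag>.*</tag>} $str "" greedy
# Non-greedy: each tag pair is matched separately
regsub -all {<tag>.*?</tag>} $str "" lazy
# Bracket expression: [^<]* stops at the next <, so it cannot overshoot
regsub -all {<tag>[^<]*</tag>} $str "" negated

puts "greedy='$greedy' lazy='$lazy' negated='$negated'"
# greedy='' lazy='some text' negated='some text'
```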
All times are GMT - 4 Hours