ppslim Revered One
Joined: 23 Sep 2001 Posts: 3914 Location: Liverpool, England
|
Posted: Tue May 06, 2003 5:57 am Post subject: Learning regexpr |
|
|
While, again, not directly Tcl related, this subject is well worth talking about. It will help in a hell of a lot of ways, and makes no fewer than 3 commands far more fun, 2 of which require it full stop.
This entry actually comes from our very own "stdragon". Posted back in Sep 02, it is more worthwhile in a Tcl FAQ than at the bottom of a failed search or a long and boring trail through the forum's history.
Originally prompted by the questions "Tutorials please", "How do I use them" and "What can I use them for", this was the reply, and replies like it are what make this forum as a whole such a powerful learning and help tool.
Regexps?
1. I learned to use them by experimentation in tclsh. If you're in a hurry and don't want to 'learn as you go', then sit down for an hour and play with every special character you see in re_syntax until you know what it does.
2. You can use them for complex string matching and substitution. That's about it, but that encompasses a lot, since most things in life can be represented as a string. Usually people use them for syntax checking (e.g. "Is the input a valid email address?" or "Is this sentence 'bad' as defined by this list of user rules?"). Other common uses are getting rid of color/control codes, extracting parts of a line into variables, and performing substitutions in kick messages, etc. (like replacing %nick in the kick message with the person's nick).
3. Regexp returns 0 for no match and 1 for a match (with -all it returns the number of matches). Regsub, when you give it a variable to store the result in, returns the number of substitutions, i.e. 0 for no match and non-zero for a match.
Looking through re_syntax is good, but it's better to find some example scripts. Depending on what you want to do, most regular expressions are very simple. There are only a few special chars you have to escape when you want them literally (like ( ), |, ., *, { }, +, and also ?, [ ], ^, $ and \ itself).
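To make point 3 concrete, here is a small sketch you can paste into tclsh. The hostmask and the variable names are made up for illustration; they are not from the original post.

```tcl
# Hypothetical sample data, only for illustration.
set from "stdragon!ident@some.host.example"

# regexp returns 1 on a match and 0 otherwise.
set hasBang [regexp {!} $from]
set hasHash [regexp {#} $from]

# With -all, regexp returns the number of matches instead.
set dots [regexp -all {\.} $from]

# regsub with a result variable returns the number of substitutions
# and stores the new string in that variable.
set count [regsub -all {\.} $from "_" changed]

puts "$hasBang $hasHash $dots $count $changed"
```

The count from regsub is what you would test in an if, while the changed string lives in the variable you named.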
Here's a quick mini tutorial:
( ) is used for match reporting. regexp lets you specify 'match variables' that get filled in with exactly what matched. The matches within parentheses are what get reported. Also, just like in math, they let you group other operations. An example of match reporting would be: regexp {(.*)!(.*)@(.*)} $from match nick user host
| means "or". It lets your regexp match more than one thing. For instance, "hello there|hi there" would match either "hello there" or "hi there". Using ( ) for grouping, you could say "(hello|hi) there" and it would be the same thing.
. means "any single character". So the regex "..." would match any 3 characters (including spaces). To match an actual period, escape it with a backslash: \.
* means "any number of the previous thing, including 0." So if you have "baa*", what is the previous thing? "a". So that would match "ba" (0), "baa", "baaaaaaaaa", etc. However, using grouping ( ) you can repeat multiple things: "(baa)*" would match "baabaabaa". If you notice, .* will match anything, because . means "any char" and * means "any number". So an easy way to translate between DOS-style wildcards and regexps is to replace ?'s with single dots and *'s with ".*" (and to put ^ in front and $ at the end, since a regexp otherwise matches anywhere in the string).
+ is exactly the same as *, but it requires at least 1. So "baa+" would not match "ba" anymore; it needs at least "baa".
{ } is a range operator. It's just like * and + but it lets you specify how many repeats are acceptable. For instance, "ba{2,10}" would match from "baa" (2 a's) to "baaaaaaaaaa" (10 a's).
There are a few more that are either less useful or way more complicated. For instance, the section on negative lookaheads meant very little to me until the other day. It's a silly name; what it really is is an "and not" operator. For instance, if you want to match something that "contains an a and no b" you could use a negative lookahead. (Yes, for that example you can use the [ ] operator with ^b, but [ ] can't contain a full regular expression, whereas negative lookaheads can.)
So those are the basics. Anything more advanced would require exponentially more text, I think. If you have specific questions feel free to ask.
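The operators from the mini tutorial can be tried out in tclsh with something like the sketch below. The sample strings are invented, and the ^ and $ anchors are added here only so that the difference between *, + and { } is visible on short inputs.

```tcl
# Match reporting: the parenthesised parts land in nick, user and host.
set from "nick!user@host.example"   ;# made-up hostmask
regexp {(.*)!(.*)@(.*)} $from match nick user host
puts "$nick $user $host"

# Alternation with grouping.
puts [regexp {(hello|hi) there} "oh, hi there"]

# * allows zero repeats of the previous thing, + needs at least one.
puts [regexp {^baa*$} "ba"]
puts [regexp {^baa+$} "ba"]

# {m,n} bounds the number of repeats.
puts [regexp {^ba{2,10}$} "baaa"]
puts [regexp {^ba{2,10}$} "ba"]

# Negative lookahead used as "and not": contains an a, but no b anywhere.
puts [regexp {^(?!.*b).*a} "cat"]
puts [regexp {^(?!.*b).*a} "cab"]
```

Each puts prints 1 for a match and 0 for no match, so you can change a pattern or a string and watch what flips.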
As the man said, post any further questions if you feel fit. _________________ PlusNet Supported Customer - Low cost UK ISP services |
user

Joined: 18 Mar 2003 Posts: 1452 Location: Norway
|
Posted: Tue May 06, 2003 3:10 pm Post subject: A note to future regexp lunatics :) |
|
|
People often stick with regexp (once they learn how to use it) even when there are better (faster) methods available to deal with a certain problem. 'scan' and 'string map' are the commands most commonly ignored by regexp fanatics.
Example using scan to chop up eggdrop's $botname:
Code: | scan $botname %\[^!\]!%\[^@\]@%s nick user host |
which is ~13.5 times faster than
Code: | regexp {(.*)!(.*)@(.*)} $botname match nick user host |
in my tclsh (8.3)
/rant  |
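If you want to check the speed difference on your own machine, a rough sketch using Tcl's built-in 'time' command could look like this. The hostmask is a made-up stand-in for eggdrop's $botname, and the numbers you get will depend on your Tcl version and hardware, so don't expect the same ratio as quoted above.

```tcl
# Made-up stand-in for eggdrop's $botname (nick!user@host).
set botname "Lamestbot!ident@some.host.example"

# Both commands split the hostmask into the same three parts.
scan $botname {%[^!]!%[^@]@%s} nick1 user1 host1
regexp {(.*)!(.*)@(.*)} $botname match nick2 user2 host2
puts [list $nick1 $user1 $host1]
puts [list $nick2 $user2 $host2]

# 'time' runs a script N times and reports the average time per iteration.
set tScan   [time {scan $botname {%[^!]!%[^@]@%s} n u h} 10000]
set tRegexp [time {regexp {(.*)!(.*)@(.*)} $botname m n u h} 10000]
puts "scan:   $tScan"
puts "regexp: $tRegexp"
```

The scan format is written in braces here instead of with backslashes; it is the same format string either way.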
ppslim Revered One
Joined: 23 Sep 2001 Posts: 3914 Location: Liverpool, England
Sir_Fz Revered One

Joined: 27 Apr 2003 Posts: 3793 Location: Lebanon
|
Posted: Fri Jul 11, 2003 1:56 pm Post subject: |
|
|
Well guys, I was about to ask how to learn regexp, but I found this post and it's really handy.
nice idea Ppslim. _________________ Follow me on GitHub
- Opposing
Public Tcl scripts |
caesar Mint Rubber

Joined: 14 Oct 2001 Posts: 3767 Location: Mint Factory
|
Posted: Sat Aug 28, 2004 1:01 pm Post subject: |
|
|
The Mastering Regular Expressions link no longer seems to be working. Does anyone have a good link to it? _________________ Once the game is over, the king and the pawn go back in the same box. |
awyeah Revered One

Joined: 26 Apr 2004 Posts: 1580 Location: Switzerland
|
Posted: Sat Aug 28, 2004 3:53 pm Post subject: |
|
|
Here is a good website to learn regular expressions (regexp) from, it includes tutorials and examples:
http://www.regular-expressions.info/
Here are some tools made specifically for testing and using regular expressions, with their respective examples:
http://www.regular-expressions.info/tools.html _________________ ·awyeah·
==================================
Facebook: jawad@idsia.ch (Jay Dee)
PS: Guys, I don't accept script helps or requests personally anymore.
================================== |
awyeah Revered One

Joined: 26 Apr 2004 Posts: 1580 Location: Switzerland
|
Posted: Thu Jul 07, 2005 8:19 am Post subject: |
|
|
I have been dealing with regexps a lot these days. Here are a few common examples which can help you in eggdrop scripts.
Suppose you want to count the number of A's (the letter) in a string:
Code: |
regexp -all {A} $string
|
Suppose you want to count the number of the's (word) in a string:
Code: |
regexp -all {the} $string
|
Suppose you want to count more than one character:
Code: |
regexp -all {[abcd]} $string
#This counts every a, b, c and d found and returns the total
|
Suppose you want the script to execute only if none of these characters are present:
Code: |
regexp -all {[^abcd]} $string
#Careful: [^abcd] counts the characters that are NOT a, b, c or d. A result of 0 means the string contains nothing but those four characters.
#To test that none of a, b, c or d is present, negate a plain match instead (gives 1 when none is found):
expr {![regexp {[abcd]} $string]}
|
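When matching a fixed piece of text, you can write the pattern in more than one way:
Code: |
regexp "string" $string
regexp {string} $string
#Both look for the text "string". Note that
regexp \[string\] $string
#is different: [string] is a bracket expression that matches any single one of the characters s, t, r, i, n or g.
|
I would advise you to use the curly braces, since they stop Tcl from substituting $, [ ] and \ before regexp sees the pattern.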
Counting special characters:
Code: |
#To count the number of ['s use:
regexp -all \[\\\[\] $string
#To count the number of \'s use:
regexp -all \[\\\\\] $string
#To count the number of {'s use:
regexp -all \[{\] $string
#NOTE: In an unbraced pattern like these, a literal [, ] or \ generally ends up with 3 backslashes in front of it, because Tcl processes the backslashes once before regexp sees the pattern. Most other characters need none.
|
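To see why the backslash counts differ, it may help to compare the unbraced patterns above with braced ones side by side. This is only a sketch; the test string and the variable names are made up.

```tcl
# Made-up test string with two ['s, one { and one backslash.
set s {[one] [two] {three} back\slash}

# Unbraced: Tcl strips one level of backslashes before regexp sees the pattern.
set n1 [regexp -all \[\\\[\] $s]
# Braced: the pattern reaches regexp untouched, so one backslash is enough.
set n2 [regexp -all {\[} $s]

# Counting backslashes, unbraced and braced.
set b1 [regexp -all \[\\\\\] $s]
set b2 [regexp -all {\\} $s]

# Counting {'s, unbraced and braced.
set c1 [regexp -all \[{\] $s]
set c2 [regexp -all {\{} $s]

puts "$n1 $n2 $b1 $b2 $c1 $c2"
```

Both styles count the same things; the braced form just needs less escaping.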
Note: regexp has a -nocase switch, which can be used for ignoring cases while doing matching.
Code: |
regexp -nocase -all {abc} $string
#and
regexp -nocase -all {ABC} $string
#will be considered the same then
|
Matching range of characters:
Code: |
#To match a character in the range a, b, c, d, ... z:
regexp {[a-z]} $string ;# returns 1 for MATCH, 0 for NO-MATCH (lower case match)
regexp -all {[a-z]} $string ;# returns the total number of MATCHES (lower case match)
regexp -nocase -all {[a-z]} $string ;# returns the total number of MATCHES (case ignored)
#To match a character in the range 0, 1, 2, 3, ... 9:
regexp {[0-9]} $string ;# returns 1 for MATCH, 0 for NO-MATCH
regexp -all {[0-9]} $string ;# returns the total number of MATCHES
#Note: The nocase switch for the [0-9] would be redundant.
#To match a character in the range a, b, c, ... z or 0, 1, 2, ... 9:
regexp {[a-z0-9]} $string ;# returns 1 for MATCH, 0 for NO-MATCH
regexp -all {[a-z0-9]} $string ;# returns the total number of MATCHES
|
Note also that it is necessary to anchor the pattern when the whole string has to match and not just a part of it:
Code: |
regexp {^string_here$} $string
#^ = Assert position at the start of the string
#$ = Assert position at the end of the string
#With the -line switch, ^ and $ also match at the start and end of each line.
|
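A quick way to see what the anchors do is to run the same pattern with and without them. The strings below are made up for the purpose.

```tcl
# Without anchors the pattern may match anywhere inside the string.
puts [regexp {abc} "xxabcxx"]

# With ^ and $ the whole string has to be exactly "abc".
puts [regexp {^abc$} "xxabcxx"]
puts [regexp {^abc$} "abc"]

# By default ^ and $ only match at the very start and end of the string.
puts [regexp {^abc$} "xx\nabc\nyy"]

# With -line they also match at the start and end of each line.
puts [regexp -line {^abc$} "xx\nabc\nyy"]
```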
The | operator is used as a LOGICAL "OR".
Code: |
regexp {abc|efg|hij} $string
#This will try to match "abc" or "efg" or "hij". If none is found it returns 0; if any one of them is found it returns 1.
|
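The ^ operator used inside a [bracket expression] before the first element acts as a LOGICAL "NOT".
Code: |
regexp {^[^abcd]*$} $string
#Returns 1 if none of a, b, c and d is present anywhere in the string; if any one of them is, returns 0. The * is needed so that the bracket expression covers every character of the string and not just a single one.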
|
Note: if you want regexp to return the matched text itself rather than 0 or 1, you can use the -inline switch. In most cases you should combine it with -all to get every match.
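As a small illustration of -inline (the sample string is made up):

```tcl
set text "a1b22c333"

# -all -inline returns a list with every match.
set numbers [regexp -all -inline {[0-9]+} $text]
puts $numbers

# Without -all you get only the first match, followed by its ( ) submatches.
set first [regexp -inline {([a-z])([0-9]+)} $text]
puts $first
```

Since the result is an ordinary Tcl list, you can feed it straight into foreach, llength or lindex.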
Other examples:
If you want to match certain patterns:
Code: |
regexp {^[a-z]{3,}[0-9]{2,}$} $string
#This will match only if the string consists of 3 or more characters in the range [a-z] followed by 2 or more characters in the range [0-9].
#Examples:
#abgfg452 > will match
#as456342 > will not match (only 2 letters)
#abc12 > will match
|
Other examples:
Code: |
regexp {^[a-z]{3,5}[0-9]{2,8}$} $string
#This will match only if the string consists of 3, 4 or 5 characters in the range [a-z] followed by 2, 3, 4, 5, 6, 7 or 8 characters in the range [0-9].
#Examples:
#adfsd3463 > will match
#adsfsdgfs325 > will not match (too many letters)
#wer436234 > will match
#gdtweer436322512 > will not match (too many letters and digits)
|
Other examples:
Code: |
regexp {^[a-z]{4}[0-9]{3}$} $string
#This will match only if the string consists of exactly 4 characters in the range [a-z] followed by exactly 3 characters in the range [0-9].
#Examples:
#wrew364 > will match
#we436 > will not match
#whg6743 > will not match
#wga63 > will not match
#se65 > will not match
|
Other matchings:
Code: |
regexp {https?} $string
#This will return 1 if "http" or "https" is present in the string, else return 0.
|
Here are some advanced examples:
Code: |
regexp {([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})} $string
#IP Address -- Matches 0.0.0.0 through 999.999.999.999
regexp {(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)} $string
#IP Address -- Matches 0.0.0.0 through 255.255.255.255
regexp -nocase {(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]} $string
#Matching a url (-nocase is needed because the pattern only lists A-Z)
regexp {[0-9]{5}(?:-[0-9]{4})?} $string
#US Zipcode
regexp -nocase {[A-Z0-9._%-]+@[A-Z0-9._%-]+\.[A-Z]{2,4}} $string
#Email address (-nocase is needed for the same reason)
regexp {(0[1-9]|[12][0-9]|3[01])[-/.](0[1-9]|1[012])[- /.](19|20)[0-9]{2}} $string
#Date in formats: dd-mm-yyyy, dd.mm.yyyy, dd/mm/yyyy (the year must be 19xx or 20xx)
regexp {^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6011[0-9]{14}|3(?:0[0-5]|[68][0-9])[0-9]{11}|3[47][0-9]{13})$} $string
#Matching all major credit cards
#Note: apart from the last one, none of these patterns is anchored with ^ and $, so they also match when the item appears inside a longer string.
|
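Since the IP patterns above are not anchored, they also match inside longer strings such as "1.2.3.4.5". A possible way to validate a whole string is sketched below; the proc name isIPv4 is made up for this example and is not a built-in or an eggdrop command.

```tcl
# Hypothetical helper: returns 1 if $s is exactly a dotted quad with
# every part in the range 0-255, else 0.
proc isIPv4 {s} {
    # One octet, same alternation as in the pattern above.
    set octet {(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)}
    # Build the anchored pattern: ^octet\.octet\.octet\.octet$
    return [regexp "^$octet\\.$octet\\.$octet\\.$octet\$" $s]
}

puts [isIPv4 "192.168.0.1"]
puts [isIPv4 "256.1.1.1"]
puts [isIPv4 "1.2.3.4.5"]
```

Building the pattern from one octet variable keeps the long alternation in a single place, at the cost of some backslash juggling inside the double quotes.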
More of these examples can be found by downloading and installing the software "RegexBuddy".
Download link: http://www.regexbuddy.com/download.html
1) After downloading, install the trial version of the software.
2) After installation, run the software and click on the Library tab.
3) In the long search list in the right panel, highlight any matching pattern of your choice; the window on the left will then show the regular expression for it.
4) This is good software to learn regexp from.
Last edited by awyeah on Fri Jul 08, 2005 4:40 am; edited 2 times in total |
awyeah Revered One

Joined: 26 Apr 2004 Posts: 1580 Location: Switzerland
|
Posted: Fri Jul 08, 2005 4:15 am Post subject: |
|
|
Here are some quick and easy examples of substitutions. We can use 'regsub' (regular expression substitution) or 'string map'.
regsub is slower than string map, but it is more powerful because it matches patterns rather than fixed strings. For simple fixed-string replacements, both can be used to accomplish the same thing.
If you want to remove a character from a string:
Code: |
#regsub
regsub -all {a} $data "" data
#Will remove all occurrences of character "a" in the string $data.
regsub -all {a} $data "b" data
#Will replace all occurrences of character "a" in the string by "b".
#Similarly, with string map (which returns the new string, so store it with set):
set data [string map {"a" ""} $data]
#Will remove all occurrences of character "a" in the string $data.
set data [string map {a b} $data]
set data [string map {"a" "b"} $data]
#Both forms will replace all occurrences of character "a" in the string by "b".
|
Mostly, regsub and string map are used in filters, to filter out certain parts, characters or words in texts.
Similar to regexp, regsub patterns can use a [bracket expression] to match each listed character individually.
Code: |
#This will remove all occurrences of a, b, c and d from the string and store the result in $data.
regsub -all {[abcd]} "sgfdszasbdgds" "" data
#Note: We are using the -all switch here, so it will return '5', the number of substitutions made.
#You can also strip control codes (colors, bold, underline etc.) from strings using regsub or string map filters, as you might have seen in most posts on the forum. Here are some I found on the forum:
#For removing colors
regsub -all {\003([0-9]{1,2}(,[0-9]{1,2})?)?} $str "" str
#For removing control codes
regsub -all {\017|\037|\002|\026|\006|\007} $str "" str
#For removing control codes
set str [string map {"\017" "" "\037" "" "\002" "" "\026" "" "\006" "" "\007" ""} $str]
|
As you might have noticed, removing colors needs a real pattern (the color code is followed by optional digits), which string map cannot express since it only replaces fixed strings.
Note: It is best to write control codes as their octal ASCII codes.
Beyond that, string map and regsub are commonly used as filters to strip out certain special characters or to escape them with extra \'s.
Here are common examples I found that escape special characters by creating small filters.
Code: |
#regsub
proc filter {data} {
    regsub -all -- \\\\ $data \\\\\\\\ data
    regsub -all -- \\\[ $data \\\\\[ data
    regsub -all -- \\\] $data \\\\\] data
    regsub -all -- \\\} $data \\\\\} data
    regsub -all -- \\\{ $data \\\\\{ data
    regsub -all -- \\\$ $data \\\\\$ data
    regsub -all -- \\\" $data \\\\\" data
    return $data
}
#Taken from: http://www.peterre.com/characters.html
#string map (an alternative definition of the same proc)
proc filter {data} {
    set data [string map {\\ \\\\ [ \\\[ ] \\\] \{ \\\{ \} \\\} $ \\\$ \" \\\"} $data]
}
#Taken from: spambuster.tcl
|
A list of all special characters that can choke scripts if not used properly:
Code: |
\, [, ], {, }, $, "
|
Note: regsub can be used in similar format as regexp:
Code: |
regsub -all "\002|\003|\017|\026|\037" $text "" text
regsub -all {\002|\003|\017|\026|\037} $text "" text
|
For example:
Code: |
#To strip all capital letters from a string and count them:
set counted [regsub -all {[A-Z]} $text "" stripped]
#The total number of capital letters in $text is returned by regsub and stored in $counted here; the copy of $text without capital letters is placed in $stripped ($text itself is unchanged).
#The same goes for numbers, [0-9], or both, [a-z0-9].
#The -nocase switch is also available in regsub if you want to ignore case while matching (it only matters for letters).
|
String map always replaces every occurrence, but it does not report how many replacements it made, so counting characters with it is less direct. In that respect string map does have limitations.
For example:
Code: |
set counted [regsub -all {a} $text "" stripped]
#gives the same count as:
set counted 0
for {set count 0} {$count < [string length $text]} {incr count} {
    if {[string equal "a" [string index $text $count]]} {
        incr counted
    }
}
#(Adv: As you can see, regsub is simpler and needs less code)
#(Disadv: regsub is slower than string map)
|
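For what it's worth, counting a fixed character does not strictly need regsub or a loop. Two other ways are sketched here; the sample text and variable names are made up.

```tcl
set text "banana bandana"   ;# made-up sample

# regexp -all returns the number of matches directly.
set viaRegexp [regexp -all {a} $text]

# With string map, remove the character and compare the lengths
# before and after.
set viaMap [expr {[string length $text] - [string length [string map {a ""} $text]]}]

puts "$viaRegexp $viaMap"
```

The length trick only works as written for single characters; for a longer fixed string you would divide the difference by that string's length.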
demond Revered One

Joined: 12 Jun 2004 Posts: 3073 Location: San Francisco, CA
|
Posted: Wed Jul 13, 2005 11:50 pm Post subject: |
|
|
attention should be paid to some subtle aspects of regexps, for example so-called "greedy matching"
by default, regexp characters '+' and '*' will match as much as possible ("greedy matching"), which might mean rather unexpected results, most likely in HTML parsing constructs like this:
Code: |
% set str "<tag>foo</tag>some text<tag>bar</tag>"
% regsub -all {<tag>(.*)</tag>} $str ""
%
|
here, we need to strip the tags and their contents, leaving the text in-between; however, because of the greedy matching, we end up with all of the characters between the first opening tag and the last closing tag stripped, effectively leaving us with an empty string - definitely not what we wanted!
the solution is to add '?' specifier after the asterisk, to avert the greedy matching and force '+'/'*' to match as little as possible:
Code: |
% set str "<tag>foo</tag>some text<tag>bar</tag>"
% regsub -all {<tag>(.*?)</tag>} $str ""
some text
%
|
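Another common way around greedy matching, which does not need the '?' specifier at all, is to forbid the delimiter inside the match with a negated bracket expression. The sketch below assumes the text between the tags never contains a "<" of its own.

```tcl
set str "<tag>foo</tag>some text<tag>bar</tag>"

# [^<]* can never run past the next "<", so each tag pair is matched
# on its own even though * is still greedy.
set result [regsub -all {<tag>[^<]*</tag>} $str ""]
puts $result
```

If the content between the tags can contain other tags, this assumption breaks and the non-greedy form above is the safer choice.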