| Author |
Message |
madpinger Voice
Joined: 03 Oct 2010 Posts: 12
|
Posted: Sat Oct 30, 2010 12:01 pm Post subject: Release: SA_urltitle.tcl |
|
|
The intended purpose of this eggdrop script is to relay the title of a URL
posted to an IRC channel by IRC users, while attempting to identify the
correct character encoding to preserve the information and to replace
HTML entities with their Unicode counterparts.
http://github.com/madpinger/Eggdrop-URL-title-script
Bash me, use it, abuse it, whatever works. ^.^
Just felt like doing it.
First url is utf-8, second url is euc-jp.
example of iso8859-1 compiled bot:
example of utf-8 compiled bot:
As you can see, it handles different encodings, though with limits depending on the system's encoding and the encoding the bot was compiled with.
Updates:
Added Speechles's new proc with some notes, as you will need to make changes in order for it to work, depending on your system and how your bot is compiled. Eventually I'll get around to it, or more simply put, figure out how to account for all the different configurations.
Fixed white space issues as pointed out by spithash. Just never occurred to me as an issue, lol.
Changed how I use http cleanup, so it should not lose any tokens.
Last edited by madpinger on Mon Nov 01, 2010 1:58 pm; edited 5 times in total |
|
speechles Revered One

Joined: 26 Aug 2006 Posts: 1398 Location: emerald triangle, california (coastal redwoods)
|
Posted: Sat Oct 30, 2010 12:22 pm Post subject: |
|
|
MOAR scripts are a good thing
This might help your script with completeness and compatibility (patched utf-8 vs not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.
| Code: |
proc decode_entities {text {char "utf-8"} } {
# code below is necessary to prevent numerous html markups
# from appearing in the output (ie, &quot;, &#5671;, etc)
# stolen (borrowed is a better term) from tcllib's htmlparse ;)
# works unpatched utf-8 or not, unlike htmlparse::mapEscapes
# which will only work properly patched....
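set escapes {
&nbsp; \xa0 &iexcl; \xa1 &cent; \xa2 &pound; \xa3 &curren; \xa4
&yen; \xa5 &brvbar; \xa6 &sect; \xa7 &uml; \xa8 &copy; \xa9
&ordf; \xaa &laquo; \xab &not; \xac &shy; \xad &reg; \xae
&macr; \xaf &deg; \xb0 &plusmn; \xb1 &sup2; \xb2 &sup3; \xb3
&acute; \xb4 &micro; \xb5 &para; \xb6 &middot; \xb7 &cedil; \xb8
&sup1; \xb9 &ordm; \xba &raquo; \xbb &frac14; \xbc &frac12; \xbd
&frac34; \xbe &iquest; \xbf &Agrave; \xc0 &Aacute; \xc1 &Acirc; \xc2
&Atilde; \xc3 &Auml; \xc4 &Aring; \xc5 &AElig; \xc6 &Ccedil; \xc7
&Egrave; \xc8 &Eacute; \xc9 &Ecirc; \xca &Euml; \xcb &Igrave; \xcc
&Iacute; \xcd &Icirc; \xce &Iuml; \xcf &ETH; \xd0 &Ntilde; \xd1
&Ograve; \xd2 &Oacute; \xd3 &Ocirc; \xd4 &Otilde; \xd5 &Ouml; \xd6
&times; \xd7 &Oslash; \xd8 &Ugrave; \xd9 &Uacute; \xda &Ucirc; \xdb
&Uuml; \xdc &Yacute; \xdd &THORN; \xde &szlig; \xdf &agrave; \xe0
&aacute; \xe1 &acirc; \xe2 &atilde; \xe3 &auml; \xe4 &aring; \xe5
&aelig; \xe6 &ccedil; \xe7 &egrave; \xe8 &eacute; \xe9 &ecirc; \xea
&euml; \xeb &igrave; \xec &iacute; \xed &icirc; \xee &iuml; \xef
&eth; \xf0 &ntilde; \xf1 &ograve; \xf2 &oacute; \xf3 &ocirc; \xf4
&otilde; \xf5 &ouml; \xf6 &divide; \xf7 &oslash; \xf8 &ugrave; \xf9
&uacute; \xfa &ucirc; \xfb &uuml; \xfc &yacute; \xfd &thorn; \xfe
&yuml; \xff &fnof; \u192 &Alpha; \u391 &Beta; \u392 &Gamma; \u393 &Delta; \u394
&Epsilon; \u395 &Zeta; \u396 &Eta; \u397 &Theta; \u398 &Iota; \u399
&Kappa; \u39A &Lambda; \u39B &Mu; \u39C &Nu; \u39D &Xi; \u39E
&Omicron; \u39F &Pi; \u3A0 &Rho; \u3A1 &Sigma; \u3A3 &Tau; \u3A4
&Upsilon; \u3A5 &Phi; \u3A6 &Chi; \u3A7 &Psi; \u3A8 &Omega; \u3A9
&alpha; \u3B1 &beta; \u3B2 &gamma; \u3B3 &delta; \u3B4 &epsilon; \u3B5
&zeta; \u3B6 &eta; \u3B7 &theta; \u3B8 &iota; \u3B9 &kappa; \u3BA
&lambda; \u3BB &mu; \u3BC &nu; \u3BD &xi; \u3BE &omicron; \u3BF
&pi; \u3C0 &rho; \u3C1 &sigmaf; \u3C2 &sigma; \u3C3 &tau; \u3C4
&upsilon; \u3C5 &phi; \u3C6 &chi; \u3C7 &psi; \u3C8 &omega; \u3C9
&thetasym; \u3D1 &upsih; \u3D2 &piv; \u3D6 &bull; \u2022
&hellip; \u2026 &prime; \u2032 &Prime; \u2033 &oline; \u203E
&frasl; \u2044 &weierp; \u2118 &image; \u2111 &real; \u211C
&trade; \u2122 &alefsym; \u2135 &larr; \u2190 &uarr; \u2191
&rarr; \u2192 &darr; \u2193 &harr; \u2194 &crarr; \u21B5
&lArr; \u21D0 &uArr; \u21D1 &rArr; \u21D2 &dArr; \u21D3 &hArr; \u21D4
&forall; \u2200 &part; \u2202 &exist; \u2203 &empty; \u2205
&nabla; \u2207 &isin; \u2208 &notin; \u2209 &ni; \u220B &prod; \u220F
&sum; \u2211 &minus; \u2212 &lowast; \u2217 &radic; \u221A
&prop; \u221D &infin; \u221E &ang; \u2220 &and; \u2227 &or; \u2228
&cap; \u2229 &cup; \u222A &int; \u222B &there4; \u2234 &sim; \u223C
&cong; \u2245 &asymp; \u2248 &ne; \u2260 &equiv; \u2261 &le; \u2264
&ge; \u2265 &sub; \u2282 &sup; \u2283 &nsub; \u2284 &sube; \u2286
&supe; \u2287 &oplus; \u2295 &otimes; \u2297 &perp; \u22A5
&sdot; \u22C5 &lceil; \u2308 &rceil; \u2309 &lfloor; \u230A
&rfloor; \u230B &lang; \u2329 &rang; \u232A &loz; \u25CA
&spades; \u2660 &clubs; \u2663 &hearts; \u2665 &diams; \u2666
&quot; \x22 &amp; \x26 &lt; \x3C &gt; \x3E &OElig; \u152 &oelig; \u153
&Scaron; \u160 &scaron; \u161 &Yuml; \u178 &circ; \u2C6
&tilde; \u2DC &ensp; \u2002 &emsp; \u2003 &thinsp; \u2009
&zwnj; \u200C &zwj; \u200D &lrm; \u200E &rlm; \u200F &ndash; \u2013
&mdash; \u2014 &lsquo; \u2018 &rsquo; \u2019 &sbquo; \u201A
&ldquo; \u201C &rdquo; \u201D &bdquo; \u201E &dagger; \u2020
&Dagger; \u2021 &permil; \u2030 &lsaquo; \u2039 &rsaquo; \u203A
&euro; \u20AC &apos; \u0027 &lrm; "" &rlm; ""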
};
if {![string equal $char [encoding system]]} { set text [encoding convertfrom $char $text] }
set text [string map [list "\]" "\\\]" "\[" "\\\[" "\$" "\\\$" "\"" "\\\"" "\\" "\\\\"] [string map $escapes $text]]
regsub -all -- {&#([[:digit:]]{1,5});} $text {[format %c [string trimleft "\1" "0"]]} text
regsub -all -- {&#x([[:xdigit:]]{1,4});} $text {[format %c [scan "\1" %x]]} text
catch { set text "[subst "$text"]" }
if {![string equal $char [encoding system]]} { set text [encoding convertto $char $text] }
return "$text"
} |
Feel free to steal (borrow) this..
Last edited by speechles on Sat May 28, 2011 8:44 pm; edited 2 times in total |
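For anyone wondering why the table above can sit inside braces and still produce real characters: string map reads its mapping as a Tcl list, and list parsing applies backslash substitution to bare elements such as \xe9 or \u20AC. A minimal sketch, using a hypothetical three-entry table (mini_escapes is not part of the script) in the same "entity replacement" layout:

```tcl
# Hypothetical three-entry table, laid out like $escapes in the proc above.
set mini_escapes {
    &amp; \x26 &eacute; \xe9 &euro; \u20AC
}

# string map parses the table as a list; list parsing turns \x26, \xe9 and
# \u20AC into the characters they name before any matching takes place.
set decoded [string map $mini_escapes "caf&eacute; &amp; bar: 5 &euro;"]
puts $decoded
```

Each position in the input is matched at most once, so the "&" produced from &amp; is not scanned again for further entities.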
|
madpinger Voice
Joined: 03 Oct 2010 Posts: 12
|
Posted: Sat Oct 30, 2010 12:36 pm Post subject: |
|
|
| speechles wrote: | MOAR scripts are a good thing
This might help you script as for completeness and compatibility (patched utf-8 vs not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.
....
Feel free to steal (borrow) this..  |
Thanks, I'll review its changes for inclusion. Tho, I think I already have the encoding covered with the convertfrom, which changes the encoding to the system default?
I'm developing on 1.8 cvs patched to be utf-8, tho I did a quick test on 1.6.20 without any mod.
*EDIT*
Oh, IC what you did there.  |
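As a side note on the convertfrom question: in stock Tcl, encoding convertfrom turns raw bytes in a named encoding into Tcl's internal Unicode string, and encoding convertto goes the other way. A minimal sketch, assuming the euc-jp encoding that ships with a standard Tcl install; the byte string is a made-up two-character title, not output from the script:

```tcl
# Raw EUC-JP bytes for the two characters U+65E5 U+672C (hypothetical title).
set raw [binary format H* c6fccbdc]

# Bytes in a known encoding -> Tcl's internal Unicode string.
set title [encoding convertfrom euc-jp $raw]

# Unicode string -> UTF-8 bytes, e.g. for a bot that sends raw UTF-8.
set utf8 [encoding convertto utf-8 $title]

puts "characters: [string length $title], utf-8 bytes: [string length $utf8]"
```

Whether the second step is needed depends on how the bot was compiled, which is the patched vs unpatched difference discussed above.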
|
spithash Master

Joined: 12 Jul 2007 Posts: 248 Location: Libera
|
Posted: Sun Oct 31, 2010 3:05 pm Post subject: |
|
|
| Code: | [20:56:51] <@spithash> http://www.youtube.com/user/spithash
[20:56:55] <@nagger> [Url title:] YouTube - spithash's Channel
|
Can anyone tell me why this white space appears there? I have the same problem with another title grab tcl as well. |
|
madpinger Voice
Joined: 03 Oct 2010 Posts: 12
|
Posted: Mon Nov 01, 2010 12:17 pm Post subject: |
|
|
| spithash wrote: | | Code: | [20:56:51] <@spithash> http://www.youtube.com/user/spithash
[20:56:55] <@nagger> [Url title:] YouTube - spithash's Channel
|
can anyone tell me why this white space appears there? I have the same problem with another title grab tcl aswell |
Basically, it's because the title spans more than one line in the HTML that is parsed.
| Code: |
<title>
YouTube
- spithash's Channel
</title>
|
I merge multi-line titles in the regexp to deal with this. If it's a real bother, it would be simple enough to add white space stripping to it.
Tho, that's the reason in a nutshell.
*EDIT*
Ok, fixed that for you. This is the change to make
| Code: |
[12:31] <madpinger> http://www.youtube.com/user/spithash
[12:31] <Belkar> [Url title:] YouTube - spithash's Channel
|
find:
| Code: |
foreach line [split $data \n] {
    if {[regexp -nocase {<meta.*charset.(.*?)".*>} $line match charset]} {
        set charenc $charset
    }
    append newdata $line
}
|
Change append newdata $line to append newdata " [string trim $line]":
| Code: |
foreach line [split $data \n] {
    if {[regexp -nocase {<meta.*charset.(.*?)".*>} $line match charset]} {
        set charenc $charset
    }
    append newdata " [string trim $line]"
} |
This keeps at least one space between the two lines, so words don't get joined. Updated github's copy with a token cleanup fix. Forgive me for some of the silly stuff I've messed up, I do this half asleep or drunk most times.  |
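For a title grabber that does not go line by line, the same effect can be had by collapsing whitespace after the title is extracted. A minimal sketch, not the script's actual code; squeeze_title is a hypothetical helper name and the regexp assumes a plain title element without nested tags:

```tcl
# Hypothetical helper: pull the <title> text out of a page and collapse any
# run of spaces, tabs or newlines into one space.
proc squeeze_title {html} {
    if {![regexp -nocase {<title[^>]*>([^<]*)</title>} $html -> title]} {
        return ""
    }
    regsub -all {\s+} $title " " title
    return [string trim $title]
}

puts [squeeze_title "<title>\n   YouTube\n   - spithash's Channel\n</title>"]
```

This keeps one space between words that were on different lines while dropping the leading and trailing padding.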
|
SVD Voice
Joined: 13 Mar 2006 Posts: 9
|
Posted: Tue Jan 11, 2011 5:28 pm Post subject: |
|
|
| Great script! However, it doesn't pick up when someone omits the http://. For example, if I type in www.youtube.com, I would like it to catch that and display the title. Any chance you could add that feature? Thanks in advance. |
|
madpinger Voice
Joined: 03 Oct 2010 Posts: 12
|
Posted: Fri Jan 14, 2011 6:44 am Post subject: |
|
|
| Stan wrote: | | Great script! However, it doesn't pick up when someone omits the http://. For example, if I type in www.youtube.com, I would like it to catch that and display the title. Any chance you could add that feature? Thanks in advance. |
Hmm, sure. I'd tell you what to change here, but you have to prefix it with http:// before using the URI, or it has issues. I'll add that in with another fix/feature a user requested on github in a few days ^.^ |
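Until that lands, the prefixing step could look roughly like this. It is only a sketch of the idea, not the script's code; normalize_url is a hypothetical helper, and treating any word that starts with "www." as a URL is an assumption that may be too eager for some channels:

```tcl
# Hypothetical helper: return a fetchable URL for a chat word, or "" if the
# word does not look like a URL at all.
proc normalize_url {word} {
    # Already has a scheme: use it unchanged.
    if {[regexp -nocase {^https?://} $word]} {
        return $word
    }
    # Bare www. host: add the scheme the http package expects.
    if {[regexp -nocase {^www\.} $word]} {
        return "http://$word"
    }
    return ""
}

puts [normalize_url "www.youtube.com"]
```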
|
cubemon Voice
Joined: 20 May 2011 Posts: 1
|
Posted: Fri May 20, 2011 12:50 pm Post subject: |
|
|
| speechles wrote: | MOAR scripts are a good thing
This might help you script as for completeness and compatibility (patched utf-8 vs not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.
| Code: |
[string map [b]-nocase[/b] $escapes $text]
|
Feel free to steal (borrow) this..  |
I admit nicking your script and using it successfully with my bot!
However, if you want &Auml; to correspond to "Ä" and &auml; to "ä" (and make other capital and lowercase umlauts work), you need to remove the -nocase option from the string map clause.
Thanks for a great conversion script! |
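A small sketch of why -nocase causes this, using a hypothetical two-entry table rather than the full one: with -nocase, &auml; already matches the first key &Auml;, so the lowercase entry is never reached.

```tcl
# Hypothetical two-entry table in the same layout as $escapes.
set umlauts {&Auml; \xc4 &auml; \xe4}
set sample "&Auml;pfel und &auml;hnliches"

# Case-sensitive: each entity maps to its own character.
set exact [string map $umlauts $sample]

# Case-insensitive: &auml; matches the earlier &Auml; key, so both come out
# as the capital letter.
set folded [string map -nocase $umlauts $sample]

puts $exact
puts $folded
```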
|
kenh83 Halfop
Joined: 08 Sep 2010 Posts: 61
|
Posted: Sat May 28, 2011 1:22 am Post subject: |
|
|
This script is no longer on GitHub.. lame.  |
|
SVD Voice
Joined: 13 Mar 2006 Posts: 9
|
Posted: Tue Oct 18, 2011 11:02 am Post subject: |
|
|
| I often see the error "Tcl error [pub_url]: can't read "tok": no such variable" when URLs are posted from certain websites. Is there an update or fix to this script? It's a great script otherwise. |
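Without the script source at hand the real cause can't be confirmed, but an error like this usually means the fetch failed before the token variable was set and later code read it anyway. A minimal sketch of a guarded fetch with Tcl's http package, assuming that is what happens; fetch_page is a hypothetical helper, not a proc from SA_urltitle.tcl:

```tcl
package require http

# Hypothetical helper: return the page body, or "" when the fetch fails.
proc fetch_page {url} {
    # If geturl throws (unsupported URL, refused connection, ...), no token
    # exists, so return early instead of reading an unset variable later.
    if {[catch {http::geturl $url -timeout 5000} tok]} {
        return ""
    }
    set body ""
    if {[http::status $tok] eq "ok"} {
        set body [http::data $tok]
    }
    # Always release the token once it has been created.
    http::cleanup $tok
    return $body
}

puts "got [string length [fetch_page "bogus://example.invalid"]] bytes"
```

The caller can then skip the title announcement whenever the result is empty.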
|