egghelp.org community
Discussion of eggdrop bots, shell accounts and tcl scripts.
 

Release: SA_urltitle.tcl

 
madpinger
Voice


Joined: 03 Oct 2010
Posts: 12

Posted: Sat Oct 30, 2010 12:01 pm    Post subject: Release: SA_urltitle.tcl

This eggdrop script relays the title of a URL that a user posts to an
IRC channel. It attempts to identify the correct character encoding so
the title is preserved, and it replaces HTML entities with their
Unicode counterparts.


http://github.com/madpinger/Eggdrop-URL-title-script

Bash me, use it, abuse it, whatever works. ^.^

Just felt like doing it.


The first URL is UTF-8, the second is EUC-JP.
Example from a bot compiled with iso8859-1:


Example from a bot compiled with utf-8:


As you can see, it handles different encodings tho, with limits depending on the system's encoding and the encoding the bot was compiled with.
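
A minimal sketch of the encoding step described above, not the script's actual code: the helper names charset_to_tcl and decode_page are hypothetical, and the alias list is an assumption that would need extending. It maps a charset label taken from a page to a Tcl encoding name, and falls back to a default when the running Tcl does not know the encoding.

```tcl
# Hypothetical helper: map an HTTP/HTML charset label to a Tcl encoding name.
# Returns $fallback when this Tcl build has no matching encoding.
proc charset_to_tcl {charset {fallback "iso8859-1"}} {
    set name [string tolower [string trim $charset]]
    # Tcl spells the ISO Latin family without the first hyphen (iso8859-1).
    regsub {^iso-8859-} $name {iso8859-} name
    # A few common aliases (an assumption, extend as needed).
    set name [string map {"shift_jis" "shiftjis" "windows-" "cp"} $name]
    if {[lsearch -exact [encoding names] $name] >= 0} {
        return $name
    }
    return $fallback
}

# Hypothetical helper: decode raw page bytes using the detected charset.
proc decode_page {bytes charset} {
    return [encoding convertfrom [charset_to_tcl $charset] $bytes]
}

puts [charset_to_tcl "EUC-JP"]
```

Which encodings [encoding names] reports depends on the Tcl installation, which is one reason results differ between bots.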

Updates:
Added speechles's new proc with some notes, as you will need to make changes for it to work, depending on your system and how your bot is compiled. Eventually I'll get around to figuring out how to account for all the different configurations.

Fixed the white space issue pointed out by spithash. It just never occurred to me as an issue, lol.
Changed how I use http cleanup, so it should not lose any tokens.


Last edited by madpinger on Mon Nov 01, 2010 1:58 pm; edited 5 times in total
speechles
Revered One


Joined: 26 Aug 2006
Posts: 1398
Location: emerald triangle, california (coastal redwoods)

Posted: Sat Oct 30, 2010 12:22 pm    Post subject:

MOAR scripts are a good thing ;)

This might help your script with completeness and compatibility (patched utf-8 vs. not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.
Code:
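proc decode_entities {text {char "utf-8"} } {
   # code below is necessary to prevent numerous html markups
   # from appearing in the output (ie, &quot;, &#5671;, etc)
   # stolen (borrowed is a better term) from tcllib's htmlparse ;)
   # works unpatched utf-8 or not, unlike htmlparse::mapEscapes
   # which will only work properly patched....
   set escapes {
      &nbsp; \xa0 &iexcl; \xa1 &cent; \xa2 &pound; \xa3 &curren; \xa4
      &yen; \xa5 &brvbar; \xa6 &sect; \xa7 &uml; \xa8 &copy; \xa9
      &ordf; \xaa &laquo; \xab &not; \xac &shy; \xad &reg; \xae
      &macr; \xaf &deg; \xb0 &plusmn; \xb1 &sup2; \xb2 &sup3; \xb3
      &acute; \xb4 &micro; \xb5 &para; \xb6 &middot; \xb7 &cedil; \xb8
      &sup1; \xb9 &ordm; \xba &raquo; \xbb &frac14; \xbc &frac12; \xbd
      &frac34; \xbe &iquest; \xbf &Agrave; \xc0 &Aacute; \xc1 &Acirc; \xc2
      &Atilde; \xc3 &Auml; \xc4 &Aring; \xc5 &AElig; \xc6 &Ccedil; \xc7
      &Egrave; \xc8 &Eacute; \xc9 &Ecirc; \xca &Euml; \xcb &Igrave; \xcc
      &Iacute; \xcd &Icirc; \xce &Iuml; \xcf &ETH; \xd0 &Ntilde; \xd1
      &Ograve; \xd2 &Oacute; \xd3 &Ocirc; \xd4 &Otilde; \xd5 &Ouml; \xd6
      &times; \xd7 &Oslash; \xd8 &Ugrave; \xd9 &Uacute; \xda &Ucirc; \xdb
      &Uuml; \xdc &Yacute; \xdd &THORN; \xde &szlig; \xdf &agrave; \xe0
      &aacute; \xe1 &acirc; \xe2 &atilde; \xe3 &auml; \xe4 &aring; \xe5
      &aelig; \xe6 &ccedil; \xe7 &egrave; \xe8 &eacute; \xe9 &ecirc; \xea
      &euml; \xeb &igrave; \xec &iacute; \xed &icirc; \xee &iuml; \xef
      &eth; \xf0 &ntilde; \xf1 &ograve; \xf2 &oacute; \xf3 &ocirc; \xf4
      &otilde; \xf5 &ouml; \xf6 &divide; \xf7 &oslash; \xf8 &ugrave; \xf9
      &uacute; \xfa &ucirc; \xfb &uuml; \xfc &yacute; \xfd &thorn; \xfe
      &yuml; \xff &fnof; \u192 &Alpha; \u391 &Beta; \u392 &Gamma; \u393 &Delta; \u394
      &Epsilon; \u395 &Zeta; \u396 &Eta; \u397 &Theta; \u398 &Iota; \u399
      &Kappa; \u39A &Lambda; \u39B &Mu; \u39C &Nu; \u39D &Xi; \u39E
      &Omicron; \u39F &Pi; \u3A0 &Rho; \u3A1 &Sigma; \u3A3 &Tau; \u3A4
      &Upsilon; \u3A5 &Phi; \u3A6 &Chi; \u3A7 &Psi; \u3A8 &Omega; \u3A9
      &alpha; \u3B1 &beta; \u3B2 &gamma; \u3B3 &delta; \u3B4 &epsilon; \u3B5
      &zeta; \u3B6 &eta; \u3B7 &theta; \u3B8 &iota; \u3B9 &kappa; \u3BA
      &lambda; \u3BB &mu; \u3BC &nu; \u3BD &xi; \u3BE &omicron; \u3BF
      &pi; \u3C0 &rho; \u3C1 &sigmaf; \u3C2 &sigma; \u3C3 &tau; \u3C4
      &upsilon; \u3C5 &phi; \u3C6 &chi; \u3C7 &psi; \u3C8 &omega; \u3C9
      &thetasym; \u3D1 &upsih; \u3D2 &piv; \u3D6 &bull; \u2022
      &hellip; \u2026 &prime; \u2032 &Prime; \u2033 &oline; \u203E
      &frasl; \u2044 &weierp; \u2118 &image; \u2111 &real; \u211C
      &trade; \u2122 &alefsym; \u2135 &larr; \u2190 &uarr; \u2191
      &rarr; \u2192 &darr; \u2193 &harr; \u2194 &crarr; \u21B5
      &lArr; \u21D0 &uArr; \u21D1 &rArr; \u21D2 &dArr; \u21D3 &hArr; \u21D4
      &forall; \u2200 &part; \u2202 &exist; \u2203 &empty; \u2205
      &nabla; \u2207 &isin; \u2208 &notin; \u2209 &ni; \u220B &prod; \u220F
      &sum; \u2211 &minus; \u2212 &lowast; \u2217 &radic; \u221A
      &prop; \u221D &infin; \u221E &ang; \u2220 &and; \u2227 &or; \u2228
      &cap; \u2229 &cup; \u222A &int; \u222B &there4; \u2234 &sim; \u223C
      &cong; \u2245 &asymp; \u2248 &ne; \u2260 &equiv; \u2261 &le; \u2264
      &ge; \u2265 &sub; \u2282 &sup; \u2283 &nsub; \u2284 &sube; \u2286
      &supe; \u2287 &oplus; \u2295 &otimes; \u2297 &perp; \u22A5
      &sdot; \u22C5 &lceil; \u2308 &rceil; \u2309 &lfloor; \u230A
      &rfloor; \u230B &lang; \u2329 &rang; \u232A &loz; \u25CA
      &spades; \u2660 &clubs; \u2663 &hearts; \u2665 &diams; \u2666
      &quot; \x22 &amp; \x26 &lt; \x3C &gt; \x3E &OElig; \u152 &oelig; \u153
      &Scaron; \u160 &scaron; \u161 &Yuml; \u178 &circ; \u2C6
      &tilde; \u2DC &ensp; \u2002 &emsp; \u2003 &thinsp; \u2009
      &zwnj; \u200C &zwj; \u200D &lrm; \u200E &rlm; \u200F &ndash; \u2013
      &mdash; \u2014 &lsquo; \u2018 &rsquo; \u2019 &sbquo; \u201A
      &ldquo; \u201C &rdquo; \u201D &bdquo; \u201E &dagger; \u2020
      &Dagger; \u2021 &permil; \u2030 &lsaquo; \u2039 &rsaquo; \u203A
      &euro; \u20AC &apos; \u0027 &lrm; "" &rlm; ""
   };
   # bring the text into tcl's internal form unless it already matches the system encoding
   if {![string equal $char [encoding system]]} { set text [encoding convertfrom $char $text] }
   # replace named entities, then escape everything subst would otherwise act on
   set text [string map [list "\]" "\\\]" "\[" "\\\[" "\$" "\\\$" "\"" "\\\"" "\\" "\\\\"] [string map $escapes $text]]
   # turn decimal (&#NNN;) and hex (&#xHHH;) references into [format %c ...] calls
   regsub -all -- {&#([[:digit:]]{1,5});} $text {[format %c [string trimleft "\1" "0"]]} text
   regsub -all -- {&#x([[:xdigit:]]{1,4});} $text {[format %c [scan "\1" %x]]} text
   # run those calls; on error the text is left as it was
   catch { set text "[subst "$text"]" }
   # convert back to the requested encoding for output
   if {![string equal $char [encoding system]]} { set text [encoding convertto $char $text] }
   return "$text"
}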

Feel free to steal (borrow) this.. :)
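
For readers who want to see the numeric-reference step on its own: the proc above builds [format %c ...] calls and runs them through subst, which is why it has to escape brackets, dollar signs, quotes, and backslashes first. The following is an alternative sketch that skips subst and walks the matches directly. It is not what the proc above does, and the name decode_numeric_entities is made up for this example.

```tcl
# Hypothetical helper: decode &#NNN; and &#xHHH; references without subst.
proc decode_numeric_entities {text} {
    set out ""
    set pos 0
    # Same digit limits as the regsub patterns in the proc above.
    set re {&#(x[[:xdigit:]]{1,4}|[[:digit:]]{1,5});}
    while {[regexp -indices -start $pos -- $re $text whole ref]} {
        foreach {ws we} $whole {rs re2} $ref break
        # Copy the untouched text that precedes this reference.
        append out [string range $text $pos [expr {$ws - 1}]]
        set r [string range $text $rs $re2]
        if {[string index $r 0] eq "x"} {
            scan [string range $r 1 end] %x code
        } else {
            # %d reads plain decimal, so leading zeros are harmless.
            scan $r %d code
        }
        append out [format %c $code]
        set pos [expr {$we + 1}]
    }
    # Copy whatever follows the last reference.
    append out [string range $text $pos end]
    return $out
}

puts [decode_numeric_entities "caf&#233; &#0065;&#x42;"]
```

Because nothing is evaluated, text containing [ ] or $ needs no escaping in this variant.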


Last edited by speechles on Sat May 28, 2011 8:44 pm; edited 2 times in total
madpinger
Voice


Joined: 03 Oct 2010
Posts: 12

Posted: Sat Oct 30, 2010 12:36 pm    Post subject:

speechles wrote:
MOAR scripts are a good thing ;)

This might help your script with completeness and compatibility (patched utf-8 vs. not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.
....
Feel free to steal (borrow) this.. :)


Thanks, I'll review its changes for inclusion. Tho, I think that I have the encoding covered with the convertfrom, which changes the encoding to the system default?

I'm developing on 1.8 cvs patched to be utf-8, tho I did a quick test on 1.6.20 without any mod.

*EDIT*
Oh, IC what you did there. :D
spithash
Master


Joined: 12 Jul 2007
Posts: 248
Location: Libera

Posted: Sun Oct 31, 2010 3:05 pm    Post subject:

Code:
[20:56:51] <@spithash> http://www.youtube.com/user/spithash
[20:56:55] <@nagger> [Url title:] YouTube        - spithash's Channel


Can anyone tell me why this white space appears there? I have the same problem with another title grab tcl as well.
madpinger
Voice


Joined: 03 Oct 2010
Posts: 12

Posted: Mon Nov 01, 2010 12:17 pm    Post subject:

spithash wrote:
Code:
[20:56:51] <@spithash> http://www.youtube.com/user/spithash
[20:56:55] <@nagger> [Url title:] YouTube        - spithash's Channel


Can anyone tell me why this white space appears there? I have the same problem with another title grab tcl as well.


Basically, it's because the title is on more than one line in the HTML that is parsed.

Code:

    <title>
    YouTube
        - spithash's Channel
  </title>


I merge multi-line titles in the regexp to deal with this. If it's a real bother, it would be simple enough to add white space stripping to it.

Tho, that's the reason in a nutshell.

*EDIT*
Ok, fixed that for you. Here is the result, followed by the change to make:
Code:

[12:31] <madpinger> http://www.youtube.com/user/spithash
[12:31] <Belkar> [Url title:] YouTube - spithash's Channel

find:
Code:

                        foreach line [split $data \n] {
                            if {[regexp -nocase {<meta.*charset.(.*?)".*>} $line match charset]} {
                                set charenc $charset
                            }
                            append newdata $line
                        }

Change append newdata $line to append newdata " [string trim $line]" (note the leading space):
Code:

                        foreach line [split $data \n] {
                            if {[regexp -nocase {<meta.*charset.(.*?)".*>} $line match charset]} {
                                set charenc $charset
                            }
                            append newdata " [string trim $line]"
                        }

This keeps at least one space between the two lines, so words don't get joined. Updated GitHub's copy with a token cleanup fix. Forgive me for some of the silly stuff I've messed up, I do this half asleep or drunk most times. ;)
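
The same clean-up can also be applied to the extracted title itself rather than line by line. A minimal sketch, assuming the title text has already been pulled out of the page; clean_title is a hypothetical name and is not part of the script.

```tcl
# Hypothetical helper: flatten a title that spans several lines of HTML.
proc clean_title {title} {
    # Collapse every run of spaces, tabs, and newlines into one space.
    regsub -all {\s+} $title " " title
    # Drop the space left over at either end.
    return [string trim $title]
}

set raw "\n    YouTube\n        - spithash's Channel\n  "
puts [clean_title $raw]   ;# prints: YouTube - spithash's Channel
```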
SVD
Voice


Joined: 13 Mar 2006
Posts: 9

Posted: Tue Jan 11, 2011 5:28 pm    Post subject:

Great script! However, it doesn't pick up when someone omits the http://. For example, if I type in www.youtube.com, I would like it to catch that and display the title. Any chance you could add that feature? Thanks in advance.
madpinger
Voice


Joined: 03 Oct 2010
Posts: 12

Posted: Fri Jan 14, 2011 6:44 am    Post subject:

Stan wrote:
Great script! However, it doesn't pick up when someone omits the http://. For example, if I type in www.youtube.com, I would like it to catch that and display the title. Any chance you could add that feature? Thanks in advance.


Hmm, sure. I'd tell you what to change here, but the uri has to be prefixed with http:// before it is used, or it has issues. I'll add that in along with another fix/feature a user requested on github in a few days ^.^
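
A rough sketch of the prefixing step mentioned above, in case anyone wants to try it before the update lands. The name normalize_url is hypothetical, and how the result would be wired into the script's own URL matching is left open since that code isn't shown here.

```tcl
# Hypothetical helper: turn a word from a channel message into a fetchable URL.
# Returns "" when the word does not look like a web address.
proc normalize_url {word} {
    # Already has a scheme: use it unchanged.
    if {[regexp -nocase {^https?://} $word]} {
        return $word
    }
    # Bare www. address: add the scheme the http package expects.
    if {[regexp -nocase {^www\.} $word]} {
        return "http://$word"
    }
    return ""
}

puts [normalize_url "www.youtube.com"]   ;# prints: http://www.youtube.com
```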
cubemon
Voice


Joined: 20 May 2011
Posts: 1

Posted: Fri May 20, 2011 12:50 pm    Post subject:

speechles wrote:
MOAR scripts are a good thing ;)

This might help your script with completeness and compatibility (patched utf-8 vs. not). This procedure is what I presently use within my twitter script. It is a more evolved version of the same procedure within incith-google.

Code:

[string map -nocase $escapes $text]


Feel free to steal (borrow) this.. :)


I admit nicking your script and using it successfully with my bot! :)

However, if you want &Auml; to correspond to "Ä" and &auml; to "ä" (and make other capital and lowercase umlauts work), you need to remove the -nocase option from the string map clause.

Thanks for a great conversion script!
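
To illustrate the point about -nocase, here is a cut-down sketch with a two-entry table (the full table behaves the same way; the sample text is made up). With -nocase, string map compares keys without regard to case, so &auml; already matches the earlier &Auml; key and comes out as the capital letter.

```tcl
# Two entries from the escapes table: capital and lowercase a-umlaut.
set escapes {&Auml; \xc4 &auml; \xe4}
set text "&Auml;rger und &auml;rger"

# Case-sensitive: each entity maps to its own character.
set sensitive [string map $escapes $text]

# Case-insensitive: &auml; is caught by the &Auml; key first.
set insensitive [string map -nocase $escapes $text]

puts $sensitive     ;# Ärger und ärger
puts $insensitive   ;# Ärger und Ärger
```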
kenh83
Halfop


Joined: 08 Sep 2010
Posts: 61

Posted: Sat May 28, 2011 1:22 am    Post subject:

This script is no longer on GitHub.. lame. :(
SVD
Voice


Joined: 13 Mar 2006
Posts: 9

Posted: Tue Oct 18, 2011 11:02 am    Post subject:

I often see the error "Tcl error [pub_url]: can't read "tok": no such variable" when URLs are posted from certain websites. Is there an update or fix to this script? It's a great script otherwise.
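
No fix was posted in this thread. One plausible cause, offered as an assumption since the script's source is not shown here, is that the http request fails for some sites and the token variable is never set, yet later code still reads it. A defensive sketch using Tcl's standard http package; fetch_body is a hypothetical name, not a proc from the script.

```tcl
package require http

# Hypothetical helper: return the page body, or "" if the request failed.
proc fetch_body {url} {
    # geturl raises an error for unsupported or unreachable URLs. When it
    # does, tok is never set, so return early instead of reading tok.
    if {[catch {set tok [::http::geturl $url -timeout 5000]} err]} {
        return ""
    }
    set body ""
    if {[::http::status $tok] eq "ok"} {
        set body [::http::data $tok]
    }
    # Always release the token once it exists.
    ::http::cleanup $tok
    return $body
}
```

With this shape, every later read of tok happens only after geturl has returned successfully.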
All times are GMT - 4 Hours

 