URL Title grabber

rosc2112
Revered One
Posts: 1454
Joined: Sun Feb 19, 2006 8:36 pm
Location: Northeast Pennsylvania

Post by rosc2112 »

Just a little script to grab url titles when a url is posted in channel:

http://members.dandy.net/~fbn/urltitle.tcl.txt

Uploaded to archive as well.
danzigrules
Voice
Posts: 17
Joined: Thu Aug 02, 2007 6:06 am

Post by danzigrules »

It is working great. Haven't had one problem with it at all.


Thanks again, rosc


danzigrules
rosc2112
Revered One
Posts: 1454
Joined: Sun Feb 19, 2006 8:36 pm
Location: Northeast Pennsylvania

Post by rosc2112 »

cool, I was worried it would choke on tcl special chars in the urls.. I'm still a little concerned this script might possibly be breakable, making it produce an error, if someone makes a really screwed up webpage title then makes the script grab it..

I tried on my own test page with some tcl special chars and it didn't give an error but it did spit out {} chars which is how tcl protects/splits chars..

But I guess there's sufficient protection between the delay time config and the user flag permissions, so you don't get clobbered by idiots spamming url's in a chan.
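The {} characters mentioned above usually show up when a plain string gets treated as a Tcl list somewhere along the way (for example through list, lappend, or a proc's args parameter). A minimal sketch in plain Tcl, not taken from urltitle.tcl; the sample title is made up:

```tcl
# Hypothetical page title containing Tcl special characters.
set title {Cheap [stuff] for $5 {really}}

# Wrapping the string in a list makes Tcl quote it so it can be
# parsed back safely; the quoting characters are what gets printed.
set asList [list $title]

# Pulling the element back out (or joining the list) gives the
# original text again, with no added quoting.
set plain [lindex $asList 0]

puts "as list: $asList"
puts "plain:   $plain"
```

If the braces bother anyone, making sure the title is handled as a string (or run through join) before it is sent to the channel should get rid of them.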
danzigrules
Voice
Posts: 17
Joined: Thu Aug 02, 2007 6:06 am

Post by danzigrules »

ok, may have found the first problem, well not really a problem, but is this the type of special characters you are talking about?

[danzigrules]http://www.break.com/rushhour3/man-vs-k ... e-war.html
[GodBot] Man Vs Kids Karate War Video

Not a big deal to me, but thought I would post it here.
rosc2112
Revered One
Posts: 1454
Joined: Sun Feb 19, 2006 8:36 pm
Location: Northeast Pennsylvania

Post by rosc2112 »

Those are just html codes, which you can add to the [string map] line near the bottom. You'll prolly end up adding a hundred or more eventually (my dictionary script has about as many). Just put

"&nbsp;" " "

inside the braces at the end of the line with "return [string map {", and so on for any other codes; that pair will replace &nbsp; with a space. Read the manpage for 'string' to know the proper syntax if you run into probs adding to the string map.
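To try this outside the bot, here is a small sketch of how such a string map could look. The proc name and the handful of entities are only illustrative; this is not the actual map from urltitle.tcl:

```tcl
# Sketch of an entity-decoding helper; the proc name and the choice of
# entities are illustrative, not the list used in urltitle.tcl.
proc decode_entities {text} {
    # string map takes a flat list of "search replacement" pairs.
    # It scans the input once, so text that was already replaced is
    # not mapped a second time.
    return [string map {
        "&nbsp;" " "
        "&quot;" "\""
        "&#039;" "'"
        "&lt;"   "<"
        "&gt;"   ">"
        "&amp;"  "&"
    } $text]
}

puts [decode_entities "Tom &amp; Jerry&nbsp;&quot;Best Of&quot;"]
```

New codes go in as extra pairs anywhere inside the braces.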
flashy
Voice
Posts: 24
Joined: Mon May 01, 2006 3:38 am

Post by flashy »

can u make it ignore images posted on a channel ie .jpg's please.
..jpg - No title found.
rosc2112
Revered One
Posts: 1454
Joined: Sun Feb 19, 2006 8:36 pm
Location: Northeast Pennsylvania

Post by rosc2112 »

Change the pubm bind's mask to be more specific, eg:

bind pubm $urltitle(pubmflags) {*://*.htm?} pubm:urltitle

should work.
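An alternative to tightening the bind mask would be to check the URL inside the handler and return early for image links. A rough sketch in plain Tcl; the proc name and the extension list are assumptions, not part of urltitle.tcl:

```tcl
# Returns 1 if the URL's path ends in a common image extension.
# The extension list is only an example; adjust to taste.
proc is_image_url {url} {
    # Drop any query string or fragment before looking at the extension.
    regsub {[?#].*$} $url "" path
    return [regexp -nocase {\.(jpe?g|png|gif|bmp)$} $path]
}

foreach url {
    http://example.com/pic.JPG
    http://example.com/page.html
    http://example.com/photo.png?size=large
} {
    puts "$url -> [is_image_url $url]"
}
```

In the handler, a check along the lines of "if the URL is an image, return" placed before the fetch would then skip those links without affecting other page types.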
flashy
Voice
Posts: 24
Joined: Mon May 01, 2006 3:38 am

Post by flashy »

thank you will try.
cruxing
Voice
Posts: 9
Joined: Wed Sep 05, 2007 1:56 am

Post by cruxing »

I actually edited this by simply changing the error message output for "No title found" to "". This puts it under the character limit, so it simply doesn't get reported since the string is too short. Seemed like the easiest/best way, even if a bit hacky, since it allows html/xml/xhtml/php/asp etc. to all report back without having to get real complex with more regex or whatever.

I do have another question though -- if a webpage is lagged, there seems to be a long delay, which is natural, but the title ends up getting pasted 2-3 times. Usually 3, simply because the lag is rarely in that special in between point for 2.

I'm very novice, and I can't for the life of me figure out why this is occurring. As a temporary workaround I've just shortened the timeout to about 2000, which causes it to just error out on the slow sites instead of reporting it 3 times a few seconds later, but I'd really appreciate it if someone could take a look and see if they catch something simple.

Thanks!
rosc2112
Revered One
Posts: 1454
Joined: Sun Feb 19, 2006 8:36 pm
Location: Northeast Pennsylvania

Post by rosc2112 »

Are people triggering it multiple times in their impatience?
cruxing
Voice
Posts: 9
Joined: Wed Sep 05, 2007 1:56 am

Post by cruxing »

Ah, no, def. not. It's occurring just fine with myself as the only person triggering it. I added some echos to watch the process and try and figure out where it was looping, and to the best of my observation the entire cycle is getting rerun.

It only happens on slow pages, the slower the site, the more it echos. So of course if I try to test it now nothing seems wrong since it's the middle of the night... :)

Essentially, the proc pubm:urltitle occurs, followed right away by the proc urltitle, then a brief pause (perhaps .5-1 second), then the catch and cleanup, where it then rolls right back into the proc pubm:urltitle instantly and the cycle repeats.

The cycle seems to 'end' whenever the very first http get comes back successfully, and I've had it report 2 and 3 times, depending on the lag of the site. I've yet to pull off 4, but I'm not sure if I've found a site slow enough to accomplish that yet.

I'm not that coding inclined, unfortunately, but a friend and I have dug through it and we can't for the life of us figure out what would cause it to repeat like that.
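Whatever the cause turns out to be, one possible guard is to remember which URLs are currently being fetched and ignore a second request for the same one until the first has finished. This is only a sketch in plain Tcl with a stand-in for the fetch; none of these names come from urltitle.tcl:

```tcl
# Tracks URLs whose fetch is still in progress (hypothetical helper state).
array set inflight {}

# Stand-in for the real HTTP fetch so the sketch runs without a network.
proc fake_fetch {url} {
    return "Title of $url"
}

# Returns the title, or "" if the same URL is already being handled.
proc guarded_title {url} {
    global inflight
    if {[info exists inflight($url)]} {
        return ""
    }
    set inflight($url) [clock seconds]
    # catch makes sure the marker is cleared even if the fetch errors out.
    set failed [catch {fake_fetch $url} result]
    unset inflight($url)
    if {$failed} {
        return ""
    }
    return $result
}

puts [guarded_title "http://example.com/slow-page"]
```

This would not explain why the cycle restarts, but it should keep the title from being posted more than once per trigger.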
rosc2112
Revered One
Posts: 1454
Joined: Sun Feb 19, 2006 8:36 pm
Location: Northeast Pennsylvania

Post by rosc2112 »

Give me a url to duplicate the problem.
cruxing
Voice
Posts: 9
Joined: Wed Sep 05, 2007 1:56 am

Post by cruxing »

http://www.webware.com/8301-1_109-9848317-2.html was the first one I noticed it with and was using it last night to test, although it seems faster now. Doesn't appear to be duplicating it as of this moment.

4chan links during the day usually work as well. As would anything recently slashdotted/wanged/farked, likely.

[10:30] <@a> http://www.webware.com/8301-1_109-9848317-2.html
[10:30] <@bot> URL: Bloggers behaving badly: Gizmodo messes with CES flat screens | Webware : Cool Web apps for everyone
[10:30] <@bot> URL: Bloggers behaving badly: Gizmodo messes with CES flat screens | Webware : Cool Web apps for everyone

[10:40] <@c> http://www.webware.com/8301-1_109-9848317-2.html
[10:40] <@bot> URL: Bloggers behaving badly: Gizmodo messes with CES flat screens | Webware : Cool Web apps for everyone
[10:40] <@a> weird

[13:58] <@c> hm hm hm
[13:59] <@c> http://www.dieselsweeties.com/archive.php?s=1924
[13:59] <@bot> URL: diesel sweeties: pixelated robot romance web comic
[13:59] <@c> called only once that time?
[13:59] <@a> yup
[13:59] <@bot> http://www.webware.com/8301-1_109-9848317-2.html
[13:59] <@c> so what is the order of the logs
[13:59] <@bot> URL: Bloggers behaving badly: Gizmodo messes with CES flat screens | Webware : Cool Web apps for everyone
[13:59] <@a> heh this one called 3 times
[13:59] <@bot> URL: Bloggers behaving badly: Gizmodo messes with CES flat screens | Webware : Cool Web apps for everyone
[14:00] <@bot> URL: Bloggers behaving badly: Gizmodo messes with CES flat screens | Webware : Cool Web apps for everyone

Any link that has done this has been notably slow for all of us when loading it, so I'm certain it's not limited to the shell the bot resides on.
rosc2112
Revered One
Posts: 1454
Joined: Sun Feb 19, 2006 8:36 pm
Location: Northeast Pennsylvania

Post by rosc2112 »

I am not able to reproduce the problem with any of those urls.

Have you modified the script?
cruxing
Voice
Posts: 9
Joined: Wed Sep 05, 2007 1:56 am

Post by cruxing »

The only modifications I've made have been output ones. The shortened URL: and squelching the error messages by making them "". While I can't imagine they'd impact it, hey, who knows. I certainly don't!

As I mentioned, none of those links are currently running slow enough to duplicate it. They're all only returning 1 hit for me as well.

It is somewhat difficult to test until there's a website sufficiently slow, which makes it a little trickier, heh.

Modifications made:

puthelp "PRIVMSG $chan :URL: $urtitle"

if {[string match -nocase "*couldn't open socket*" $error]} {
	return "wtf, srsly."
}

if { [::http::status $http] == "timeout" } {
	return ""
}

and...

if {[regexp -nocase {<title>(.*?)</title>} $data match title]} {
	return [string map { {href=} "" \" "" } $title]
} else {
	return ""
}
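For anyone wanting to poke at the title-matching part on its own, it can be run outside the bot. A small sketch in plain Tcl; the proc name and sample HTML are made up, and the whitespace cleanup is an addition that is not in the snippet above:

```tcl
# Extracts the <title> text from an HTML string, or returns "" if absent.
proc extract_title {data} {
    # (.*?) is non-greedy, so matching stops at the first </title>.
    if {[regexp -nocase {<title>(.*?)</title>} $data match title]} {
        # Collapse runs of whitespace/newlines into single spaces.
        regsub -all {\s+} $title " " title
        return [string trim $title]
    }
    return ""
}

set html "<html><head><TITLE>\n  Example   Page\n</TITLE></head></html>"
puts [extract_title $html]
puts [string length [extract_title "<html>no title here</html>"]]
```

Returning "" for the no-title case matches the squelching approach described earlier, where an empty result simply never gets posted.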
