UNOFFICIAL incith-google 2.1x (Nov 30, 2012)

speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

bosto wrote:No one got an idea about how to modify the torrent section as asked above? :(

Just too bad; it might be useful for a lot of people, I think.
Of course it's possible to do this. But the url you gave above isn't valid; you used a placeholder. So in short, not much can be done, since you're asking for modifications with no way to tell what is to be scraped. Also, the mininova portion of this script is a grey area: it can be used to pirate copyrighted material just as easily as to find legitimate, uncopyrighted material. I'm sure a lot of people would find it useful, more so for the illegal aspect than anything else. If you have some html examples of what is sent using that query, it might be possible to give you some hints on what you need to change, but without them there is no way to tell what to do at this point...

There are several areas you would need to modify in order to do this. You would need to change the template markers within the regexps in the main mininova proc (as well as the corresponding regsubs, which remove the scraped templates before the next iteration). Those template markers designate what is to be scraped from the webpage. You would also need to change the mininova url within the fetch_html procedure to your own; this is the trivial and easiest part, as no other investment of time is needed, just change the url. The html data is then sent back to the mininova procedure, and the template markers set within each regexp go to work. There are usually four template markers set for each procedure: total results, the no-results-found message, data for output, and lastly errors. So it's not exactly as simple as typing some new things in and getting it working without seeing any of the html.. A rough sketch of the shape of those pieces follows.
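To be clear, this is only a minimal sketch, not the script's actual code: the proc name fetch_html matches what the script uses, but the url, pattern, and variable names below are hypothetical placeholders standing in for the real template markers.

Code: Select all

# minimal sketch: a fetch proc where only the url would change for your site
package require http

proc fetch_html {query} {
  set token [::http::geturl "http://www.example.com/search/$query" -timeout 10000]
  set html [::http::data $token]
  ::http::cleanup $token
  return $html
}

# hypothetical template markers: each (.+?) capture is what gets scraped,
# and the matching regsub removes that span before the next iteration
set html [fetch_html "futurama"]
regexp -- {<h1>Results\s\((.+?)\s(.+?)\)</h1>} $html - total word
regsub -- {<h1>Results\s\(.+?\s.+?\)</h1>} $html "" html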

You can use Webby to grab your site's html and see exactly how the bot is going to see it. You can also craft regexps within your query and see their results in your channel. Webby is simply a tool meant to help you modify regexps (template/context markers) on your own (with your own time investment) and solve these issues without outside intervention.

In any event, hopefully everything works out.. If not, post some html and, like I said, I can give some hints (overly verbose instructions) on what you need to change.
bosto
Voice
Posts: 3
Joined: Wed Apr 08, 2009 7:22 pm

Post by bosto »

Thanks for your answer, speechles!

Yes, I used a placeholder, mainly because the site is private and needs registration to view. If you're willing I can pm it to you, but I won't post it here, for the moment at least.

I know it's not an easy job to modify this. I've looked around the code and it reads a bit like Chinese to me, but I'll give it a try, looking especially at the areas you told me about and at your Webby tool.

Thanks :)
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

bosto wrote:I know it's not an easy job to modify this. I've looked around the code and it reads a bit like Chinese to me, but I'll give it a try, looking especially at the areas you told me about and at your Webby tool.

Thanks :)
Webby can show you what those regexps do. Take the url from the debug query the google script gives you when you have set debugnick to your-nick-here:
!webby http://www.mininova.org/search/<your_search_terms_here>/seeds
This is what you would type for webby to pull up that same url. Now you can also see how the google script actually scrapes html, which will make it easier to modify the mininova procedure to scrape your site instead. The first regexp within the mininova procedure is this:

Code: Select all

      # give results an output header with result tally.
      regexp -- {<h1>(?!No).*?\((.+?)\s(.+?)\)} $html - match match2
So there appear to be 2 captures, since match and match2 indicate that. You can supply fewer capture variables than the pattern has groups, but you cannot go over 9 (the first 9 captures are the maximum you will ever get), and obviously you need to use at least 1. The query to use to see how this regexp works would be:
<speechles> !webby http://www.mininova.org/search/futurama/seeds --regexp <h1>(?!No).*?\((.+?)\s(.+?)\)--2--
<sp33chy> regexp capture1 ( 500 )
<sp33chy> regexp capture2 ( torrents )
As the comment above it indicates, this scrapes the total number of results, as well as the word used to describe that total. Now onto the second regexp:
Code: Select all

regexp -nocase {<td>(.+?)</td><td><a href="/cat.+?>(.+?)</a>.+?<a href="/get/(.*?)".+?">.+?<a href="/tor.+?">(.+?)</a>} $html - ebcU ebcI ebcBid ebcPR
The variable names used here are confusing, mostly because this code was copied from the.. ebay procedure, because that was easier at the time, suffice it to say... Anyway, to let webby illustrate this regexp just use:
<speechles> !webby http://www.mininova.org/search/futurama/seeds --regexp <td>(.+?)</td><td><a href="/cat.+?>(.+?)</a>.+?<a href="/get/(.*?)".+?">.+?<a href="/tor.+?">(.+?)</a>--4--
<sp33chy> regexp capture1 ( 11 Feb 09 )
<sp33chy> regexp capture2 ( Movies )
<sp33chy> regexp capture3 ( 2271590 )
<sp33chy> regexp capture4 ( Futurama .Into.The.Wild.Green.Yonder.DVDRip.XviD-KareemAmir-WBB-NoRar? )
As you can see by trying this yourself, the script is taking the date, type, url-snippet and name; hence there are four variables. You can now customize both regexps shown above to fit similar things within the html of your url, thereby hacking the google script with those changes. For the regsubs that look similar to a regexp above them, you simply copy the body of your regexp into the regsub:
Code: Select all

regexp -nocase -- {the part in here} $html - blah blah
...some other code may or may not be here...
regsub -nocase {copy to the part here as well} $html "" html
If there is no regsub for a particular regexp, then it probably isn't recursed/reiterated and won't become a problem for any regexps beyond it. If there is a regsub quite similar or identical to the regexp, it is likely there to avoid an infinite loop: the html is scraped knowing the regsub will remove what was scraped afterwards, so the next pass through the loop grabs new html, and this continues until there is no more to match. If you forget to change the regsub as well, your bot will grab the same html over and over, unable to stop itself, because the regsub isn't removing the html it has already scraped; this makes your bot unstable and unusable, and unfortunately the only thing to do when it happens is kill the process. This will never happen using webby and its simple regexp parser, but it will happen if you forget these things when hacking your regexp/regsub changes into procedures. The sketch below shows the loop pattern in miniature.
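This is a sketch with a hypothetical pattern, not one of the script's own; the point is that the regsub removes exactly the span the regexp matched, which is what lets the loop terminate:

Code: Select all

set html "<td>one</td><td>two</td>"
set results {}
# keep scraping rows until no more match
while {[regexp -nocase {<td>(.+?)</td>} $html - cell]} {
  lappend results $cell
  # this regsub MUST mirror the regexp body; if it fails to remove
  # the matched html, the while condition stays true forever
  regsub -nocase {<td>.+?</td>} $html "" html
}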

In this way, you can use webby to modify practically any script which relies on regexp/regsub and tailor it to work for your site of choice, simply by learning regexp and using webby as a friend. If you literally need double-hyphens (--) in your regexp you can use them, but at least one of them needs to be escaped (\--) as shown, or problems will arise. When using the pattern in a real script you of course wouldn't need to escape that first hyphen; this is only done within webby because of how it parses the regular expression itself... :wink:

Don't forget as well: if you don't use the regular expression engine (--regexp) within your query, webby acts like a simple web information script..

If you have any problems with the webby script please post about them Here..
</webby>
tmyoungjr
Voice
Posts: 14
Joined: Fri Aug 24, 2007 3:30 pm

Post by tmyoungjr »

Hey there,

I've an odd problem: !wiki just stops functioning after a day or so. I can get it fixed if I kill the bot and restart (rehash doesn't cut it).

Any thoughts - or has anyone seen this happen at all?
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

tmyoungjr wrote:Hey there,

I've an odd problem: !wiki just stops functioning after a day or so. I can get it fixed if I kill the bot and restart (rehash doesn't cut it).

Any thoughts - or has anyone seen this happen at all?
I know what is happening. But for me to confirm it, I would need to see what your !w or !wm query is.

You can pretty much skip this part, only read it for understanding
What I believe is happening: the html from the site is massive; it takes even a regular browser on a 100mbit fiber line 45 seconds just to get all of it (assuming you mean wikimedia, !wm, fails). Or the site is run from a low-bandwidth connection, or your bot is run on low bandwidth (this assumes you meant either wikipedia or wikimedia). In that case the html is not being retrieved entirely by your bot. When parsing specific types of actions, specifically redirectToFragments or a user using an #anchor (subtag) in their query, the script uses dynamic regular expressions so the same while loop works for every copy of each <a name>. The script builds a dynamic regexp containing the anchor/redirectToFragment, which is used to get the value of each <a name>. If that value matches what we're after, all is well; if not, we remove the <a name> and loop to retrieve the next one. In this way we can go through all the <a name>s in a row, and all should be fine whether a match is found or not. But... if the regular substitution (regsub) which does the removing cannot complete the removal, you've got problems: the while loop becomes "stuck", grabbing the same <a name> over and over and generating humongous output, eating both your bot's cpu and memory. You can check whether this is the cause by raising the -timeout to a higher number. Search for "proc fetch_html" within the script, then below it search for "-timeout"; within the fetch_html procedure, raise the timeout by 1000 for every extra second you want. Keep in mind this is also the delay the bot waits for a connection, so timeout errors will take this long to appear as well. If this solves it, there is really nothing I can do, as fixing the error properly would require checking whether certain dynamic templates match, with so much overhead that it would bloat this already bloated script even more. If it doesn't solve it, post the query you're using and I can easily tell what is causing it.
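For reference, here is roughly where that -timeout lives; this is a paraphrased sketch using the standard Tcl http package, not the script's exact fetch_html body. The option is in milliseconds, so add 1000 per extra second:

Code: Select all

package require http

proc fetch_html {url} {
  # -timeout is in milliseconds: 30000 = wait up to 30 seconds
  # (this same value is how long a dead connection takes to error out)
  set token [::http::geturl $url -timeout 30000]
  set html [::http::data $token]
  ::http::cleanup $token
  return $html
}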

This is the important part, waay down here. Hope you don't miss it :P
And since you mentioned it (I wasn't following you 100% before): being able to .rehash or .restart during this event indicates your bot is obedient in that regard, so the above is now entirely irrelevant. If the bot were stuck in the while loop, it would be quite unresponsive to everyone, madly looping its brains out. Since it isn't, I can only wonder if your channel becomes +m while your bot isn't voiced, or some similar ircd phenomenon is causing the odd behavior. Say someone has banned your bot's host without enforcing the ban; on most ircds this silences that user.
cleaner
Voice
Posts: 15
Joined: Mon Apr 13, 2009 6:35 pm

Post by cleaner »

Code: Select all

21:30:38 <+X> !g weather ohio
21:30:43 <@Y> Weather for Ohio: 20������C, Current: Clear, Wind: NE at 11 km/h, Humidity: 29%; Forecast: Mon, Mostly sunny (19������C|6������C); Tue, Mostly sunny (19��                          Thunderstorm (19������C|12������C)


What is wrong?
shadrach
Halfop
Posts: 74
Joined: Fri Dec 14, 2007 6:29 pm

Post by shadrach »

local seems not to work :(
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

shadrach wrote:local seems not to work :(
The new version has better !local support than before; it still includes both special and actual locations. It will be easier to fix in the future and shouldn't generate the false matches it may have produced before with certain queries.
z0rc
Voice
Posts: 7
Joined: Mon Oct 13, 2008 9:41 am

Post by z0rc »

I have a question about codepages. How do I make this script use the utf-8 codepage for all input and output? ".tcl encoding system" returns "utf-8" on my bot, and everyone on the chan also uses this codepage. Google search (and images) works well with encoding conversion input and output set to 0, but google translate always fails with any possible configuration:

Code: Select all

<z0rc> !tr @ja anime
(privmsg) <Kuro-sama> url: http://www.google.com/translate_t?text=anime&sl=auto&tl=ja charset: shiftjis
<Kuro-sama> Google says: (auto->ja) Translation: Italian (automatically detected) » Japanese >> ƒAƒjƒ�

<z0rc> !tr ja@en アニメ
(privmsg) <Kuro-sama> url: http://www.google.com/translate_t?text=%e3%82%a2%e3%83%8b%e3%83%a1&sl=ja&tl=en charset: shiftjis
<Kuro-sama> Google says: (ja->en) Translation: Japanese » English >> ã‚"ニメ
There are also problems with wiki and wikimedia. They work well only with english, but fail in many ways with, for example, russian or japanese:

Code: Select all

<z0rc> !wiki アニメ
(privmsg) <Kuro-sama> url: http://en.wikipedia.org/wiki/%C3%8B charset: utf-8 encode_string: 
<Kuro-sama> Ë | Ë, ë (e-umlaut or diaeresis) is a letter in the Albanian and Kashubian languages. This letter also appears in Afrikaans, Dutch, French, Abruzzese dialect, and Luxembourgish language as a variant of letter "e". The letter also appears in Turoyo and Taiwanese Minnan when written in Latin script. @ http://en.wikipedia.org/wiki/%C3%8B

<z0rc> !wiki .ja アニメ
(privmsg) <Kuro-sama> url: http://ja.wikipedia.org/wiki/%C3%8B charset: utf-8 encode_string: 
(Bot doesn't return anything on channel)

<z0rc[w0rk]> !wiki .ru atom
(privmsg) <Kuro-sama> url: http://ru.wikipedia.org/wiki/Atom charset: utf-8 encode_string: 
<Kuro-sama> учитывал многие недостатки упомянутого формата. Сейчас активно поддерживается компанией Google во многих их проектах. @ http://ru.wikipedia.org/wiki/Atom
(the answer is cut off in the wrong place)

<z0rc[w0rk]> !wiki .ru курчатов
(privmsg) <Kuro-sama> url: http://ru.wikipedia.org#column-one charset: utf-8 encode_string: 
<Kuro-sama> Wikipedia Error: Sorry, no search results found.
So I'm asking for full utf-8 support for everything.
tmyoungjr
Voice
Posts: 14
Joined: Fri Aug 24, 2007 3:30 pm

Post by tmyoungjr »

speechles wrote:>snip<

Since it isn't, I can only wonder if your channel becomes +m while your bot isn't voiced, or some similar ircd phenomenon is causing the odd behavior. Say someone has banned your bot's host without enforcing the ban; on most ircds this silences that user.
Yeah, I've gone over that - this particular server is run just for one channel, nothing more, and nothing else is going on either. We do have 2 bots running in the channel, and the other bot is the only user in the channel with ops. I have seen no evidence of it silencing my bot, as my bot responds to just about any other query that's part of incith-google after !wiki stops working. It's an odd phenomenon for sure. I've got debugging going also and nothing silly jumps out, but I'll have to dig a bit deeper.
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

z0rc wrote:>snip<
So I'm asking for full utf-8 support for everything.
Same here. But until eggdrop itself correctly supports utf-8, the best this script can do is juggle other encodings in and out to "work around" the problem. Russian uses the symbols < and > within its regular text, and my parsers unfortunately hit the text before it is encoded: the script parses the page first, and only after the pieces are parsed out does it encode them to the correct encoding. This is why your russian line is cut incorrectly. I can have this fixed in the next update (once something else breaks).

I've been hearing talk that eggdrop 1.6.20 will fully support utf-8 and all the normal bells/whistles associated with it. Meaning: as long as I am making "work-around" hacks, this can never be perfect. You need to accept that fact and live with it for a while. Read back in this thread; others have asked for utf-8 support as well, and this isn't my problem to fix. I can only help you achieve a "quasi utf-8" look with my work-arounds. If you want "full utf-8" support for everything, you're going to need to wait for eggdrop 1.6.20. The eggdrop utf-8 patch is a hack that destroys iso8859-1 behavior, and as such I will not support it in the script; the script needs both iso8859-1 and quasi/real utf-8 for its query/parsing to work correctly.
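The ordering problem above, in miniature; this is a sketch, not the script's code. If you parse raw bytes and convert afterwards, multi-byte utf-8 text such as russian can be split mid-character; converting the whole page first, then parsing, avoids that:

Code: Select all

# utf-8 bytes for the russian word "Атом" wrapped in a tag
set raw "<p>\xd0\x90\xd1\x82\xd0\xbe\xd0\xbc</p>"
# convert the whole page to the correct encoding BEFORE parsing...
set html [encoding convertfrom utf-8 $raw]
# ...so the parse can never cut a character in half
if {[regexp {<p>(.+?)</p>} $html - text]} {
  puts $text ;# prints: Атом
}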

Google translations has been problematic since day one. It is unpredictable in the charset returned for certain queries, and it doesn't return any utf-8 replies; getting those would require sending a proper utf-8 query, which eggdrop can't do for the reasons we've already discussed. There is no way to work around this, so the only answer at the moment is a big fat question mark. Once eggdrop 1.6.20 comes out, this script will of course work perfectly, as all my "work-arounds" will no longer be necessary. It won't be tomorrow, next week, or next month, but I hear eggdrop 1.6.20 is due in 2010, first quarter.
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

tmyoungjr wrote:yeah ive gone over that - this particular server is run just for one channel. nothing more. nothing else is going on either. we do have 2 bots running in the channel and the other bot is the only user in the channel with ops. i have seen no evidence of it silencing my bot - as my bot responds to just about any other query thats part of incith-google after !wiki stops working. its an odd phenomenon for sure. i've got debugging going also and nothing silly jumps out - but ill hafta dig a bit deeper.
If you have rebound tcl/set in your eggdrop.conf by commenting out the unbinds, and have restarted your bot:
eggdrop.conf wrote:# Comment these two lines if you wish to enable the .tcl and .set commands.
# If you select your owners wisely, you should be okay enabling these.
#unbind dcc n tcl *dcc:tcl
#unbind dcc n set *dcc:set
Log into your bot via dcc chat, supply the password, and gain partyline rights. Now issue a !wiki command and wait, to verify that no reply is given. Then, immediately on the partyline, type .set errorInfo (the I must be uppercase, the rest lowercase) and simply paste the results here. This is the only other thing I can think of: the tcl event engine is encountering an error and stops execution of the script that errored, which would indeed prevent a proper reply from showing up on !wiki.
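For reference, errorInfo is standard Tcl, not something specific to this script: whenever a command errors, the interpreter stores a full stack traceback in the global errorInfo variable, and .set errorInfo on the partyline just reads that back. A minimal illustration in plain Tcl:

Code: Select all

if {[catch {expr {1/0}} err]} {
  puts $err          ;# the short message: divide by zero
  puts $::errorInfo  ;# the full traceback showing where it happened
}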
tmyoungjr
Voice
Posts: 14
Joined: Fri Aug 24, 2007 3:30 pm

Post by tmyoungjr »

speechles wrote:>snip<

Now immediately on the partyline, type .set errorInfo (the I must be uppercase, the rest lowercase) and simply paste the results here.
There it is.

When I set googlefight to disabled using: variable google_fight 0

it doesn't cause the bot to ignore !fight requests (the bot just replies with the proper syntax for the !fight command).

We have a 2nd bot in our channel that has a script for !fight, and it gets rather annoying seeing my bot and the other bot both respond to the same command.

So I'd hacked out the code for gfight, and apparently doing so was making !wiki fail. So far so good - I'll just change the triggers for fight to something obscure.
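For what it's worth, that "replies with syntax instead of going quiet" behavior is consistent with the trigger staying bound while the feature flag only changes the reply. Here is a guess at the shape of that logic, with hypothetical proc and trigger names, not the script's actual code:

Code: Select all

bind pub - !fight pub:fight

proc pub:fight {nick uhost hand chan text} {
  global google_fight
  # the bind still fires even with the flag off, so a syntax/usage
  # reply goes out instead of silence
  if {!$google_fight || $text eq ""} {
    putserv "PRIVMSG $chan :usage: !fight <term1> vs <term2>"
    return
  }
  # ... actual googlefight handling would go here ...
}

# to silence the trigger entirely, remove the bind instead:
# unbind pub - !fight pub:fight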
Joori
Voice
Posts: 34
Joined: Fri Mar 24, 2006 8:53 pm
Location: Sydney
Contact:

Post by Joori »

I don't know if it's me or the actual script in question, but I've noticed that sometimes when querying things like google weather for any location, it doesn't print the results to the channel, although I do get the debug output, which loads fine in my browser. I've adjusted "variable split_length" and "variable description_length" down from 400 to 200 and even 100 each, but the problem is still there. I've also discovered that if I issue a rehash to the bot, the results display fine in the channel... I have no clue what could be causing this, but it means I have to dcc chat my eggdrop more often to issue the rehash in order for some weather results to start displaying.

Just thought I'd let you know, speechles :D
No Haters, No Spies, Just the love 'tween my thighs!
inz
Voice
Posts: 3
Joined: Sun Jan 25, 2009 2:52 pm

Post by inz »

I have the same problem as Joori: the query shows up in the debug output file, but not in the channel :cry:

It doesn't seem to depend on the query; sometimes it works, sometimes it doesn't.

Hope you can help out, thanks.