UNOFFICIAL incith-google 2.1x (Nov30,2o12)
speechles
Revered One


Joined: 26 Aug 2006
Posts: 1398
Location: emerald triangle, california (coastal redwoods)

Posted: Sun Aug 17, 2008 12:10 am

pwner wrote:
hmm the script is great, but I have a little problem; out of all the features, only a few work for me (google search is gone, wiki, ebay and basically all the good ones :( ).

Could this be the fault of my shell provider, or the tcl version I'm currently using?

I'm using incith-google-v1.98s, someone please help...

Let me explain why with a slight ethics lesson. Websites wish their content to be viewed on their medium. They sometimes take countermeasures to discourage scraping, which is the method of data retrieval this script uses.

For ebay this happens:
Quote:
<bot> redirected: http://search.ebay.com/dog_W0QQpqryZdog -> http://shop.ebay.com/items/_W0QQ_nkwZdogQQ_armrsZ1QQ_fromZQQ_mdoZ
<bot> url: http://shop.ebay.com/items/_W0QQ_nkwZdogQQ_armrsZ1QQ_fromZQQ_mdoZ charset: iso8859-1 encode_string: iso8859-1

I haven't written parsers for the template this new server gives. Notice that search.ebay.com becomes shop.ebay.com; this server uses a new template design that is not supported at the moment. Only the search.ebay.com template is supported presently.

For google, you may see this:
Quote:
<bot> redirected: http://www.google.com/search?hl=&q=anything&safe=off&btnG=Search&lr=lang_all&num=1 -> http://sorry.google.com/sorry/?continue=http://www.google.com/search%3Fhl%3D%26q%3Danything%26safe%3Doff%26btnG%3DSearch%26lr%3Dlang_all%26num%3D1
<bot> url: http://sorry.google.com/sorry/?continue=http://www.google.com/search%3Fhl%3D%26q%3Danything%26safe%3Doff%26btnG%3DSearch%26lr%3Dlang_all%26num%3D1 charset: utf-8 encode_string:

This means google will only allow you to use its services if you complete the captcha on the sorry.google.com page. This is a problem between you and google. The other google-based sites may not work for you either, possibly because something identifies you as malicious. This is beyond my control; contact google.
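A minimal sketch of how a script could spot this case from the redirect target seen in the debug output above. The proc name `is_google_captcha_redirect` is made up for illustration and is not part of incith-google; it only compares the host of the URL against sorry.google.com.

```tcl
# Hypothetical helper (not from incith-google): returns 1 when a redirect
# target points at Google's captcha page on sorry.google.com, 0 otherwise.
proc is_google_captcha_redirect {url} {
    # Pull the host part out of an http/https URL.
    if {![regexp -nocase {^https?://([^/:?#]+)} $url -> host]} {
        return 0
    }
    return [string equal -nocase $host "sorry.google.com"]
}

# Example with the redirect target style shown in the debug output.
puts [is_google_captcha_redirect "http://sorry.google.com/sorry/?continue=http://www.google.com/search"]
puts [is_google_captcha_redirect "http://www.google.com/search?q=anything"]
```

A check like this would let the bot report "captcha required" instead of silently returning no results.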

For you to even begin to see this debug output you MUST change the debugnick in the config section from 'speechles' to the nickname of your debug admin, your nickname perhaps?

As for the other functions of the script, they should all work except for gamespot. Gamespot uses a new server as well, one that requires cookie/referrer fields to dissuade scraping. Expect a new version soon with better redirect support for wiki(pedia/media) and a few other things too...
_________________
speechles' eggdrop tcl archive
Phyxion
Voice


Joined: 30 Jul 2008
Posts: 7

Posted: Sun Aug 17, 2008 9:23 am

speechles wrote:
Phyxion wrote:
GameSpot ain't working anymore speechles. They updated their code once again.

They did quite a bit more than update their html templates. They changed the entire query. What it does now is use a php backend to retrieve the search results using cookie and referrer fields, which presently I'd need to investigate how those even work (although I do remember reading a post by user concerning this exact issue) before I could add something to fix it. If you leave any of these details out, your returned html merely contains a "searching..." where normally the results appeared (you can test this yourself, do a !game anything. Now check your eggdrop root for a file named ig-debug.txt, contained within is the html with 'searching...' instead of usable results). I would need to question why gamespot would do something to prevent potential free advertising from any and all index/scrape bots? Gamespot must not be getting enough click-through impressions from people scraping their pages :(. I've always had direct links to gamespot and every other site scraped appearing within the given results so it isn't blatant theft, it's helping advertise for them imo...

If you can tell me what you think, it would help. Is it immoral and wrong to scrape a website, when it is obvious that website is trying to eradicate scraping? If so, then it wouldn't be just of me to turn this script into something illicit (like heroin) where it's traded more for what it does wrong than what it does right... If we all are damned and going to hell anyways, then we can soullessly and callously scrape them to death and update to a cookie/referrer approach rather than a simple query. Depends on what the object of this script is, which I leave solely up to each and every one of you. The people using the script.

The search url still works and the info is also in the page, it is just built up differently (you can check using Firefox -> View page source). But since I don't understand much TCL (regexp etc., I don't understand any of it unfortunately) I can't help.
speechles
Revered One


Joined: 26 Aug 2006
Posts: 1398
Location: emerald triangle, california (coastal redwoods)

Posted: Sun Aug 17, 2008 12:52 pm

Phyxion wrote:
The search url still works and the info is also in the page, it is just built up differently (you can check using Firefox -> View page source). But since I don't understand much TCL (regexp etc., I don't understand any of it unfortunately) I can't help.

I can't believe you just said that...You fail to understand how eggdrop works. Sure, it works on firefox because firefox can supply the cookie and referrer fields. It DOES NOT work on eggdrop until I supply those requirements. There IS NO search data to search for. There IS ONLY a static "searching..." message. Don't believe me? Check this out! Now where are the results to parse? There aren't any. Do you see what I've been saying all along now?
_________________
speechles' eggdrop tcl archive
testebr
Halfop


Joined: 01 Dec 2005
Posts: 86

Posted: Sun Aug 17, 2008 1:56 pm

Test -> Max Payne

The problem is not with referrer, but with the javascript ajax result :]

Try disabling javascript in your browser and test it.


Last edited by testebr on Sun Aug 17, 2008 2:01 pm; edited 1 time in total
speechles
Revered One


Joined: 26 Aug 2006
Posts: 1398
Location: emerald triangle, california (coastal redwoods)

Posted: Sun Aug 17, 2008 2:01 pm

testebr wrote:
Test -> Max Payne


Quote:
{"search_results":"<div class=\"sort_results\">\n <select class=\"{'term':'max payne','type':'game','offset':false,'track':true}\">\n <option selected=\"selected\" value=\"rank\">Sort By Rank<\/option>\n <option value=\"date\">Sort By Date<\/option>\n \n <option value=\"score\">Sort By Score<\/option>\n <\/se.....


The above comes from:
http://www.gamespot.com/pages/search/search_ajax.php?q=max%20payne&type=game&offset=0&tags_only=false&sort=rank

When communicating with gamespot, it will send you html data along with a cookie session ID. That html data will be incomplete because the page is actually waiting on a php backend server to fill in the results through the ajax GET above. Notice the 'search_results' appearing at the front?
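A rough sketch of pulling the html back out of a response shaped like the quote above, assuming the body really is a JSON object whose "search_results" value is an escaped html string. The proc name `extract_search_results` is hypothetical, and this handles only the escapes visible in the quoted output (`\"`, `\/`, `\n`, plus `\\`); it is not a full JSON parser.

```tcl
# Hypothetical helper: extract and unescape the "search_results" string
# from a JSON body like the one quoted above. Returns "" if not found.
proc extract_search_results {json} {
    # Capture everything up to the first unescaped double quote.
    if {![regexp {"search_results":"((?:[^"\\]|\\.)*)"} $json -> raw]} {
        return ""
    }
    # Undo the JSON escapes seen in the sample: \\ \" \/ and \n.
    return [string map [list "\\\\" "\\" "\\\"" "\"" "\\/" "/" "\\n" "\n"] $raw]
}

# Small made-up body in the same shape as the quoted response.
set body {{"search_results":"<div class=\"sort_results\">\n<\/div>"}}
puts [extract_search_results $body]
```

The unescaped html could then go through the same kind of regexp parsing the script uses for ordinary pages.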

This means that it is silly to assume that since you can visit the website with a normal web browser and see all the html, your bot will be able to do the same. Websites go out of their way to discourage bots, so these cookie sessions and other such nonsense are hurdles put up in our way that we must jump over in order to continue scraping them. Hopefully you understand what I mean.
_________________
speechles' eggdrop tcl archive


Last edited by speechles on Sun Aug 17, 2008 2:15 pm; edited 1 time in total
testebr
Halfop


Joined: 01 Dec 2005
Posts: 86

Posted: Sun Aug 17, 2008 2:11 pm

Read my reply above (I edited).
speechles
Revered One


Joined: 26 Aug 2006
Posts: 1398
Location: emerald triangle, california (coastal redwoods)

Posted: Sun Aug 17, 2008 2:16 pm

testebr wrote:
Read my reply above (I edited).

Read my reply. I already know this...
gamespot wrote:
Response Headers
Date: Sun, 17 Aug 2008 18:08:12 GMT
Server: Apache
Accept-Ranges: bytes
X-Powered-By: PHP/5.2.5
Set-Cookie: gspot_side_081708=4; expires=Wed, 20-Aug-2008 18:08:12 GMT; path=/; domain=.gamespot.com
Keep-Alive: timeout=300, max=990
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=ISO-8859-1

Request Headers
Host: www.gamespot.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://forum.egghelp.org/viewtopic.php?p=84640
Cookie: gspot_side_081408=100; geolocn=NzAuMTMyLjAuOTE6ODQw; XCLGFbrowser=Cg8ILkh0Qr9HAAAAXg8; mbox=PC#1216060154750-11875#1280814507|session#1217742433671-451299#1217744367|check#true#1217742567; __qca=4869b91b-5b1c2-cf30b-ab8d7; MADCAPP=083B3d:1; __utmz=14953632.1217742436.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utma=14953632.3523376941471426000.1217742436.1217742436.1217742436.1; gspot_promo_081408=1; gspot_promo_081608=1; gspot_side_081608=2; u_srv_0_0=-1; __qcb=1709914989; gspot_side_081708=3
Cache-Control: max-age=0

See the problem? The script merely does a single page load, which can get the http headers. The script will need to do a second request to the search_ajax.php url, filling in the request headers correctly, to retrieve any search results. The cookie session is all that matters: notice the referring site is egghelp and I still got successful search data in the browser.
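The two-request idea could be sketched roughly like this with Tcl's standard http package. This is not the code that ended up in the script; `cookies_from_meta` and `two_step_fetch` are made-up names, both URLs are left as parameters, and whether gamespot accepts the session cookie alone is only what the header dump above suggests.

```tcl
package require http

# Hypothetical helper: collect the name=value part of every Set-Cookie
# response header into a single Cookie request header value.
proc cookies_from_meta {meta} {
    set pairs {}
    foreach {name value} $meta {
        if {[string equal -nocase $name "Set-Cookie"]} {
            # Keep only "name=value"; drop expires/path/domain attributes.
            lappend pairs [string trim [lindex [split $value ";"] 0]]
        }
    }
    return [join $pairs "; "]
}

# Hypothetical two-step fetch: load the page to get a cookie session,
# then request the ajax url while sending that cookie back.
proc two_step_fetch {page_url ajax_url} {
    set tok [http::geturl $page_url -timeout 10000]
    set cookie [cookies_from_meta [http::meta $tok]]
    http::cleanup $tok

    set tok [http::geturl $ajax_url -headers [list Cookie $cookie] -timeout 10000]
    set body [http::data $tok]
    http::cleanup $tok
    return $body
}
```

The second argument would be the search_ajax.php url quoted earlier, with the query filled in.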

Firefox + firebug will allow you to see http headers as shown above (firebug is buggy though so disable it afterwards or it may crash firefox eventually).
_________________
speechles' eggdrop tcl archive
Phyxion
Voice


Joined: 30 Jul 2008
Posts: 7

Posted: Sun Aug 17, 2008 2:54 pm

testebr wrote:
Test -> Max Payne

The problem is not with referrer, but with javascript ajax result :]

Try disabling javascript in your browser and test it.
That's what I meant too, speechles.

But after I checked again I see you are right.

My bad ;)
Phyxion
Voice


Joined: 30 Jul 2008
Posts: 7

Posted: Thu Aug 21, 2008 7:20 am

Google stopped working. I tried all versions posted here and none of them are working. Google must have changed its code once again :(
madwoota
Halfop


Joined: 09 Aug 2005
Posts: 53

Posted: Fri Aug 22, 2008 10:08 pm

Phyxion wrote:
Google stopped working. I tried all versions posted here and none of them are working. Google must have changed its code once again :(


Yeh, they changed <div class=g> to <li class=g>, so it's a 3 character regex fix from "div class=g>" to " class=g>" :)
Phyxion
Voice


Joined: 30 Jul 2008
Posts: 7

Posted: Sat Aug 23, 2008 2:15 am

madwoota wrote:
Phyxion wrote:
Google stopped working. I tried all versions posted here and none of them are working. Google must have changed its code once again :(


Yeh, they changed <div class=g> to <li class=g>, so it's a 3 character regex fix from "div class=g>" to " class=g>" :)
I didn't know exactly what to change, so I also changed the class=e things (maybe they changed that too :P) and it works now. Thanks.
speechles
Revered One


Joined: 26 Aug 2006
Posts: 1398
Location: emerald triangle, california (coastal redwoods)

Posted: Sat Aug 23, 2008 7:45 am

Phyxion wrote:
I didn't know exactly what to change, so I also changed the class=e things (maybe they changed that too :P) and it works now. Thanks.


Wow, you've broken onebox results if you touched the class=e sections. Just change what madwoota said and you will be fine: once in the regexp and once in the regsub below it, both found under the #normal search comment... If you go nuts changing things that don't need changing, expect those things not to work any longer. The rule is, if it isn't broken, DON'T FIX IT... LMAO

http://ereader.kiczek.com/incith-google-v1.98t.tcl

Public once again. Yeah, I fixed my own version as soon as the problem appeared, sorry it took so long for the public version to get the fix too... The fix madwoota mentions is exactly all you need to do. Google changed normal search results from a <div class=g into <li class=g. They like list items instead of page divisions now I guess...
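To see why dropping the tag name from the pattern is enough, here is a small illustration. The html snippets and the proc name `count_results` are made up for the example, and the script's real regexp is longer than this; only the " class=g>" anchor is taken from the fix described above.

```tcl
# Hypothetical helper: count how many times a pattern matches in some html.
proc count_results {pattern html} {
    return [regexp -all -- $pattern $html]
}

# Made-up snippets in the old (div) and new (li) result markup.
set old_html {<div class=g>one</div><div class=g>two</div>}
set new_html {<ol><li class=g>one</li><li class=g>two</li></ol>}

puts [count_results {<div class=g>} $new_html]  ;# old pattern finds nothing in the new markup
puts [count_results { class=g>} $new_html]      ;# shortened pattern matches the li items
puts [count_results { class=g>} $old_html]      ;# and still matches the old div markup
```

Anchoring on the class attribute rather than the tag keeps the match working whichever element google wraps a result in.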
_________________
speechles' eggdrop tcl archive
Phyxion
Voice


Joined: 30 Jul 2008
Posts: 7

Posted: Sun Aug 24, 2008 10:24 am

I see, didn't know. Changed it back :o
superjet
Voice


Joined: 03 Aug 2008
Posts: 8

Posted: Thu Aug 28, 2008 10:14 am

incith-google-v1.98t.tcl sends the wrong encoding to google (eggdrop-1.6.19 with utf-8 patch)
Code:

!g 时间
38,300  Results | Acrylic Jewelry Displayers: Earrings @
http://www.displayit-info.com/acrylic/jewelry/acrylic6_ear4pair.html | Hb Toulon
                 [alpha77(EF6)Pro-->His]: a n @
                 http://www.ncbi.nlm.nih.gov/pubmed/10569726 | [PDF] ¢¡¤£¦¥¨§ © £ §
                 ¢!"£# $ % £ £& '©(§ @
                 http://eprints.biblio.unitn.it/archive/00000779/01/PhDTS38.pdf |
                 [PDF] The Informant at ChessCafe.com @


while incith-google-v1.98s.tcl works with the correct encoding

which may be due to the many changes like the one below (in a utf-8 chatroom the charset conversion turns utf-8 into ???, so it only works without the encoding conversion):
Code:

--- incith-google-v1.98s.tcl
+++ incith-google-v1.98t.tcl
...
@@ -1015,7 +1020,7 @@
         if {$incith::google::bold_descriptions == 0} {
           regsub -all -- "\002" $no_search {} no_search
         }
-        set no_search [string trim $no_search]
+        set no_search [incithencode [string trim $no_search]]
       }
 
       # give results an output header with result tally.
...
speechles
Revered One


Joined: 26 Aug 2006
Posts: 1398
Location: emerald triangle, california (coastal redwoods)

Posted: Thu Aug 28, 2008 7:56 pm

Stop posting code and guessing at stuff. You have no idea how this script works, so why post code you have no idea about...
You have no idea what changed? Well let me tell you: I changed the query to iso8859-1 instead of utf-8. That is why. If you want to hack eggdrop to utf-8 and use this script, well... yeah... you cannot, because that hack destroys iso8859-1 support... I want to make it standardized, not support hacks. So until then... stop posting things about the utf-8 patch in this thread. I couldn't care less, it is a hack.
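For anyone wondering why the charset of the query matters, a small illustration using Tcl's built-in `encoding convertto`. The proc name `encoded_length` is made up and this is not code from the script; it only shows that the same text becomes different bytes under iso8859-1 and utf-8, so the sender and the receiver have to agree on one of them.

```tcl
# Hypothetical helper: number of bytes a string occupies in a given charset.
proc encoded_length {charset text} {
    # encoding convertto returns the byte sequence for that charset.
    return [string length [encoding convertto $charset $text]]
}

# "café", written with an escape so the example does not depend on the
# encoding of the script file itself.
set query "caf\u00e9"
puts "iso8859-1: [encoded_length iso8859-1 $query] bytes"
puts "utf-8:     [encoded_length utf-8 $query] bytes"
```

Text outside iso8859-1, such as the Chinese query in the report above, has no byte representation in that charset at all.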
_________________
speechles' eggdrop tcl archive
All times are GMT - 4 Hours