| Author |
Message |
speechles Revered One

Joined: 26 Aug 2006 Posts: 1398 Location: emerald triangle, california (coastal redwoods)
|
Posted: Sun Aug 17, 2008 12:10 am Post subject: |
|
|
| pwner wrote: | hmm the script is great, but I have a little problem; out of all the features, only a few work for me (google search is gone, wiki, ebay and basically all the good ones ).
Could this be the fault of my shell provider, or of the Tcl version I'm currently using?
I'm using incith-google-v1.98s, someone please help... |
Let me explain why with a slight ethics lesson. Websites want their content viewed on their own medium, so they sometimes take countermeasures to discourage scraping, which is the method of data retrieval this script uses.
For ebay this happens:
I haven't written parsers for the template this new server gives. Notice that search.ebay.com becomes shop.ebay.com; this server uses a new template design that is not supported at the moment. Only the search.ebay.com template is supported presently.
For google, you may see this:
This means google will only let you use its services if you complete the captcha given on the sorry.google.com page. This is a problem between you and google. The other google-based sites may not work for you either, possibly because something identifies you as malicious. This is beyond my control; contact google.
To even begin to see this debug output you MUST change the debugnick in the config section from 'speechles' to the nickname of your debug admin, your nickname perhaps?
As for the other functions of the script, they should all work except for gamespot. Gamespot uses a new server as well, one that requires cookie/referrer fields to dissuade scraping. Expect a new version soon with better redirect support for wiki(pedia/media) and a few other things too... _________________ speechles' eggdrop tcl archive |
 |
Phyxion Voice
Joined: 30 Jul 2008 Posts: 7
|
Posted: Sun Aug 17, 2008 9:23 am Post subject: |
|
|
| speechles wrote: | | Phyxion wrote: | | GameSpot ain't working anymore speechles. They updated their code once again. |
They did quite a bit more than update their html templates. They changed the entire query. What it does now is use a php backend to retrieve the search results using cookie and referrer fields, and I'd need to investigate how those even work (although I do remember reading a post by user concerning this exact issue) before I could add something to fix it. If you leave any of these details out, your returned html merely contains a "searching..." where the results normally appeared (you can test this yourself: do a !game anything, then check your eggdrop root for a file named ig-debug.txt; it contains the html with 'searching...' instead of usable results). I have to question why gamespot would do something to prevent potential free advertising from any and all index/scrape bots. Gamespot must not be getting enough click-through impressions from people scraping their pages. I've always had direct links to gamespot and every other scraped site appearing within the given results, so it isn't blatant theft; it's helping advertise for them imo...
If you can tell me what you think, it would help. Is it immoral and wrong to scrape a website when it is obvious that the website is trying to eradicate scraping? If so, then it wouldn't be just of me to turn this script into something illicit (like heroin) where it's traded more for what it does wrong than for what it does right... If we all are damned and going to hell anyway, then we can soullessly and callously scrape them to death and update to a cookie/referrer approach rather than a simple query. It depends on what the object of this script is, which I leave solely up to each and every one of you, the people using the script. |
The search url still works and the info is also in the page, just built up differently (you can check using Firefox -> View page source). But since I don't understand much Tcl (regexp etc., I don't understand any of it unfortunately) I can't help. |
 |
speechles Revered One

Joined: 26 Aug 2006 Posts: 1398 Location: emerald triangle, california (coastal redwoods)
|
Posted: Sun Aug 17, 2008 12:52 pm Post subject: |
|
|
| Phyxion wrote: | The search url still works and the info is also in the page, just built up differently (you can check using Firefox -> View page source). But since I don't understand much Tcl (regexp etc., I don't understand any of it unfortunately) I can't help. |
I can't believe you just said that... You fail to understand how eggdrop works. Sure, it works in firefox, because firefox can supply the cookie and referrer fields. It DOES NOT work on eggdrop until I supply those requirements. There IS NO search data to search for. There IS ONLY a static "searching..." message. Don't believe me? Check this out! Now where are the results to parse? There aren't any. Do you see what I've been saying all along now? |
 |
testebr Halfop
Joined: 01 Dec 2005 Posts: 86
|
Posted: Sun Aug 17, 2008 1:56 pm Post subject: |
|
|
Test -> Max Payne
The problem is not with referrer, but with javascript ajax result :]
Try disabling javascript in your browser and test it.
Last edited by testebr on Sun Aug 17, 2008 2:01 pm; edited 1 time in total |
 |
speechles Revered One

Joined: 26 Aug 2006 Posts: 1398 Location: emerald triangle, california (coastal redwoods)
|
Posted: Sun Aug 17, 2008 2:01 pm Post subject: |
|
|
| Quote: | {"search_results":"<div class=\"sort_results\">\n <select class=\"{'term':'max payne','type':'game'
,'offset':false,'track':true}\">\n <option selected=\"selected\" value=\"rank\">Sort By Rank<
\/option>\n <option value=\"date\">Sort By Date<\/option>\n \n <option value
=\"score\">Sort By Score<\/option>\n <\/se..... |
The above comes from:
http://www.gamespot.com/pages/search/search_ajax.php?q=max%20payne&type=game&offset=0&tags_only=false&sort=rank
When you communicate with gamespot, it sends you html data along with a cookie session ID. That html data is incomplete because it is actually waiting on a php backend server to fill in the html using the ajax GET above. Notice the 'search_results' key appearing at the front?
This means it is silly to assume that, because you can visit the website with a normal web browser and see all the html, your bot will be able to do the same. Websites go out of their way to discourage bots, so cookie sessions and other such hurdles are put in our way, and we must jump over them in order to continue scraping. Hopefully you understand what I mean. |
Last edited by speechles on Sun Aug 17, 2008 2:15 pm; edited 1 time in total |
 |
testebr Halfop
Joined: 01 Dec 2005 Posts: 86
|
Posted: Sun Aug 17, 2008 2:11 pm Post subject: |
|
|
| Read my reply above (I edited). |
 |
speechles Revered One

Joined: 26 Aug 2006 Posts: 1398 Location: emerald triangle, california (coastal redwoods)
|
Posted: Sun Aug 17, 2008 2:16 pm Post subject: |
|
|
| testebr wrote: | | Read my reply above (I edited). |
Read my reply. I already know this... | gamespot wrote: | Response Headers
Date Sun, 17 Aug 2008 18:08:12 GMT
Server Apache
Accept-Ranges bytes
X-Powered-By PHP/5.2.5
Set-Cookie gspot_side_081708=4; expires=Wed, 20-Aug-2008 18:08:12 GMT; path=/; domain=.gamespot.com
Keep-Alive timeout=300, max=990
Connection Keep-Alive
Transfer-Encoding chunked
Content-Type text/html; charset=ISO-8859-1
Request Headers
Host www.gamespot.com
User-Agent Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.16) Gecko/20080702 Firefox/2.0.0.16
Accept text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language en-us,en;q=0.5
Accept-Encoding gzip,deflate
Accept-Charset ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive 300
Connection keep-alive
Referer http://forum.egghelp.org/viewtopic.php?p=84640
Cookie gspot_side_081408=100; geolocn=NzAuMTMyLjAuOTE6ODQw; XCLGFbrowser=Cg8ILkh0Qr9HAAAAXg8; mbox=PC#1216060154750-11875#1280814507|session#1217742433671-451299#1217744367|check#true#1217742567; __qca=4869b91b-5b1c2-cf30b-ab8d7; MADCAPP=083B3d:1; __utmz=14953632.1217742436.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utma=14953632.3523376941471426000.1217742436.1217742436.1217742436.1; gspot_promo_081408=1; gspot_promo_081608=1; gspot_side_081608=2; u_srv_0_0=-1; __qcb=1709914989; gspot_side_081708=3
Cache-Control max-age=0 |
See the problem? The script merely does a single page load, which can get the http headers (including the session cookie). The script will need to make a second request to the search_ajax.php url, filling in the request headers correctly, to retrieve any search results. The cookie session is all that matters: notice the referring site is egghelp and I still got successful search data in the browser.
Firefox + firebug will let you see the http headers as shown above (firebug is buggy though, so disable it afterwards or it may eventually crash firefox). |
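The cookie hand-off described above could be sketched roughly as follows. This is only an illustration, not code from incith-google: `cookies_from_meta` is a hypothetical helper name, and the metadata list is a trimmed copy of the response headers shown above, standing in for what Tcl's `::http::meta` would return after the first page load.

```tcl
# Hypothetical helper (not part of incith-google): build a Cookie request
# header value from the Set-Cookie entries of a response metadata list.
proc cookies_from_meta {meta} {
    set pairs {}
    foreach {name value} $meta {
        if {[string equal -nocase $name "Set-Cookie"]} {
            # Keep only the leading name=value; drop expires/path/domain.
            lappend pairs [string trim [lindex [split $value ";"] 0]]
        }
    }
    return [join $pairs "; "]
}

# Sample metadata, trimmed from the response headers quoted above.
set meta [list \
    Server Apache \
    Set-Cookie "gspot_side_081708=4; expires=Wed, 20-Aug-2008 18:08:12 GMT; path=/; domain=.gamespot.com"]

puts [cookies_from_meta $meta]

# Untested outline of the two requests (assumes the stock http package;
# $search_url and $ajax_url are placeholders):
#   package require http
#   set tok [::http::geturl $search_url]
#   set cookie [cookies_from_meta [::http::meta $tok]]
#   ::http::cleanup $tok
#   set tok [::http::geturl $ajax_url -headers [list Cookie $cookie]]
```

Whether gamespot accepts a cookie replayed this way is an assumption; the sketch only shows where the session cookie would be carried from the first response into the second request.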
 |
Phyxion Voice
Joined: 30 Jul 2008 Posts: 7
|
Posted: Sun Aug 17, 2008 2:54 pm Post subject: |
|
|
| testebr wrote: | Test -> Max Payne
The problem is not with referrer, but with javascript ajax result :]
Try disable javascript in your browser and test it. | That's what I meant too speechles.
But after I checked again I see you are right.
My bad  |
 |
Phyxion Voice
Joined: 30 Jul 2008 Posts: 7
|
Posted: Thu Aug 21, 2008 7:20 am Post subject: |
|
|
Google stopped working. I tried all versions posted here and none of them are working. Google must have changed its code once again  |
 |
madwoota Halfop
Joined: 09 Aug 2005 Posts: 53
|
Posted: Fri Aug 22, 2008 10:08 pm Post subject: |
|
|
| Phyxion wrote: | Google stopped working. I tried all versions posted here and none of them are working. Google must have changed its code once again  |
Yeah, they changed <div class=g> to <li class=g>, so it's a 3 character regex fix from "div class=g>" to " class=g>"  |
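A small illustration of why dropping the three characters works, assuming nothing about the script's real expressions (which are longer and not shown in this thread). The two html strings are made-up samples of the old and new result markup.

```tcl
# Made-up samples of the old and new markup for one search result.
set old_html {<div class=g><a href="http://example.com/">Example</a></div>}
set new_html {<li class=g><a href="http://example.com/">Example</a></li>}

# The old fragment is tied to the div tag, so it misses the new markup.
set old_pattern {div class=g>}
# Without the tag name, the fragment matches whichever tag carries class=g.
set new_pattern { class=g>}

puts [regexp -- $old_pattern $old_html]
puts [regexp -- $old_pattern $new_html]
puts [regexp -- $new_pattern $old_html]
puts [regexp -- $new_pattern $new_html]
```

Only the second line prints 0: the old fragment fails on the new markup, while the shortened fragment matches both.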
 |
Phyxion Voice
Joined: 30 Jul 2008 Posts: 7
|
Posted: Sat Aug 23, 2008 2:15 am Post subject: |
|
|
| madwoota wrote: | | Phyxion wrote: | Google stopped working. I tried all versions posted here and none of them are working. Google must have changed its code once again  |
Yeah, they changed <div class=g> to <li class=g>, so it's a 3 character regex fix from "div class=g>" to " class=g>"  | I didn't know exactly what to change, so I also changed the class=e things (maybe they changed that too) and it works now. Thanks. |
 |
speechles Revered One

Joined: 26 Aug 2006 Posts: 1398 Location: emerald triangle, california (coastal redwoods)
|
Posted: Sat Aug 23, 2008 7:45 am Post subject: |
|
|
| Phyxion wrote: | I didn't know exactly what to change, so I also changed the class=e things (Maybe they changed that too ) and it works now. Thanks. |
Wow, you've broken onebox results if you touched the class=e sections. Just change what madwoota said and you would be fine: once in the regexp and once in the regsub below it, both found under the #normal search comment... If you go nuts changing things that don't need changing, expect those things not to work any longer. The rule is, if it isn't broken, DON'T FIX IT... LMAO
http://ereader.kiczek.com/incith-google-v1.98t.tcl
Public once again. Yeah, I fixed my own version as soon as the problem appeared; sorry it took so long for the public version to get the fix too. The fix madwoota mentions is exactly all you need to do. Google changed normal search results from a <div class=g into a <li class=g. They like list items instead of page divisions now, I guess... |
 |
Phyxion Voice
Joined: 30 Jul 2008 Posts: 7
|
Posted: Sun Aug 24, 2008 10:24 am Post subject: |
|
|
| speechles wrote: | Wow, you've broken onebox results if you touched the class=e sections. Just change what madwoota said and you would be fine, once in the regexp and once in the regsub below it, both found under the #normal search comment... | I see, I didn't know. Changed it back  |
 |
superjet Voice
Joined: 03 Aug 2008 Posts: 8
|
Posted: Thu Aug 28, 2008 10:14 am Post subject: |
|
|
incith-google-v1.98t.tcl sends the wrong encoding to google (eggdrop-1.6.19 with utf-8 patch):
| Code: |
!g 时间
38,300 Results | Acrylic Jewelry Displayers: Earrings @
http://www.displayit-info.com/acrylic/jewelry/acrylic6_ear4pair.html | Hb Toulon
[alpha77(EF6)Pro-->His]: a n @
http://www.ncbi.nlm.nih.gov/pubmed/10569726 | [PDF] ¢¡¤£¦¥¨§ © £ §
¢!"£# $ % £ £& '©(§ @
http://eprints.biblio.unitn.it/archive/00000779/01/PhDTS38.pdf |
[PDF] The Informant at ChessCafe.com @
|
whereas incith-google-v1.98s.tcl works with the correct encoding.
This may be due to the many changes similar to the one below (in a utf-8 chatroom the charset conversion turns utf-8 into ???, so it only works without the encoding conversion):
| Code: |
--- incith-google-v1.98s.tcl
+++ incith-google-v1.98t.tcl
...
@@ -1015,7 +1020,7 @@
if {$incith::google::bold_descriptions == 0} {
regsub -all -- "\002" $no_search {} no_search
}
- set no_search [string trim $no_search]
+ set no_search [incithencode [string trim $no_search]]
}
# give results an output header with result tally.
...
|
|
 |
speechles Revered One

Joined: 26 Aug 2006 Posts: 1398 Location: emerald triangle, california (coastal redwoods)
|
Posted: Thu Aug 28, 2008 7:56 pm Post subject: |
|
|
| superjet wrote: | incith-google-v1.98t.tcl sends the wrong encoding to google (eggdrop-1.6.19 with utf-8 patch), while incith-google-v1.98s.tcl works with the correct encoding |
Stop posting code and guessing at stuff. You have no idea how this script works, so why post code you have no idea about...
You have no idea what changed? Well, let me tell you: I changed the query to iso8859-1 instead of utf-8. That is why. If you want to hack eggdrop to utf-8 and use this script, well... yeah... you cannot, because that hack destroys iso8859-1 support... I want to make it standardized, not support hacks. So until then, stop posting things about the utf-8 patch in this thread. I couldn't care less, it is a hack. |
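To illustrate why the charset of the query matters, here is a minimal sketch using Tcl's built-in `encoding convertto`. It is not taken from the script; `hex_bytes` is a hypothetical helper name, and the sample character is é (U+00E9), chosen because it exists in both charsets.

```tcl
# Hypothetical helper: show the bytes a string becomes under a given encoding.
proc hex_bytes {text enc} {
    binary scan [encoding convertto $enc $text] H* hex
    return $hex
}

# The same character is two bytes in utf-8 but a single byte in iso8859-1.
puts [hex_bytes "\u00e9" utf-8]      ;# c3a9
puts [hex_bytes "\u00e9" iso8859-1]  ;# e9
```

A server that decodes those bytes with the other charset sees different characters than the ones typed. Characters such as 时间 have no iso8859-1 representation at all, which is consistent with the unrelated results in the report above.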
 |