egghelp.org community
Discussion of eggdrop bots, shell accounts and tcl scripts.
 

Parsing an entire HTML source page

 
ComputerTech
Master


Joined: 22 Feb 2020
Posts: 374

Posted: Tue Mar 30, 2021 12:57 am    Post subject: Parsing an entire HTML source page

So I am trying to retrieve the entire HTML source of this page: https://google.com/search?q=lego

Code:

bind PUB - "!test" the:test

package require http
package require tls

proc the:test {nick host hand chan text} {
http::register https 443 [list ::tls::socket]
set url "https://www.google.com/search?q=lego"
set data [::http::data [::http::geturl "$url" -timeout 10000]]
::http::config -useragent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"
foreach lines2 $data {putserv "PRIVMSG $chan :$lines2"}
http::unregister https
}


And I am getting this:
Code:

<Tech> <HTML><HEAD><meta
<Tech> http-equiv="content-type"
<Tech> content="text/html;charset=utf-8">
<Tech> <TITLE>302
<Tech> Moved</TITLE></HEAD><BODY>
<Tech> <H1>302
<Tech> Moved</H1>
<Tech> The
<Tech> document
<Tech> has
<Tech> moved
<Tech> <A
<Tech> HREF="https://www.google.com/sorry/index?continue=https://www.google.com/search%3Fq%3Dlego&amp;q=EhAmB1MAAGEA2QAMAAAAAAAAGIDuioMGIhkA8aeDS7Cl4MTYJvxJOGvj5SyvlN0tmGEIMgFy">here</A>.
<Tech> </BODY></HTML>

CrazyCat
Owner


Joined: 13 Jan 2002
Posts: 848
Location: France

Posted: Tue Mar 30, 2021 1:55 am

This is because you didn't account for possible redirections (such as 301 or 302), and you don't check the status code.
Your line:
Code:
set data [::http::data [::http::geturl "$url" -timeout 10000]]


A better way (not the best):
Code:
set tok [::http::geturl $url]
set code [::http::ncode $tok]
if {$code == 301 || $code == 302} {
   # this is a redirection
} else {
   set data [::http::data $tok]
}


You can also use ::http::status and the other accessors to check whether you got the right page.

Have a look at https://www.tcl.tk/man/tcl8.4/TclCmd/http.htm
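To make the status check concrete, here is a minimal sketch of how the result of ::http::status and ::http::ncode could be turned into one label to branch on. The proc name classify_response and the labels it returns are made up for illustration, and the extra redirect codes (303, 307, 308) go beyond the 301/302 mentioned above.

```tcl
package require http

# Hypothetical helper (not part of the http package): map the transfer
# status and the numeric HTTP code to a short label.
proc classify_response {status ncode} {
    # ::http::status is "ok" when the transaction completed; other
    # values such as "timeout", "eof" or "error" mean it did not.
    if {$status ne "ok"} {
        return "failed"
    }
    # Assumption: treat the common redirect codes alike.
    if {$ncode in {301 302 303 307 308}} {
        return "redirect"
    }
    if {$ncode >= 200 && $ncode < 300} {
        return "success"
    }
    return "http-error"
}

# Typical use after a request (needs network access, so left commented):
#   set tok [::http::geturl $url -timeout 10000]
#   set kind [classify_response [::http::status $tok] [::http::ncode $tok]]
#   ::http::cleanup $tok
```

Only read ::http::data when the label is "success"; for "redirect" the body is just the short "302 Moved" page shown in the first post.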
_________________
https://www.eggdrop.fr
Offer me a coffee - Do not ask me help in PM, we are a community.
ComputerTech
Master


Joined: 22 Feb 2020
Posts: 374

Posted: Tue Mar 30, 2021 2:27 am

Thanks CrazyCat, will try that ;)
ComputerTech
Master


Joined: 22 Feb 2020
Posts: 374

Posted: Tue Mar 30, 2021 4:16 pm

I tried your suggestion, CrazyCat:
Code:

bind PUB - "!test" the:test

package require http
package require tls

proc the:test {nick host hand chan text} {
http::register https 443 [list ::tls::socket]
set url "https://www.google.com/search?q=lego+ninjago"
set tok [::http::geturl $url]
if {[::http::ncode $tok]==301 || [::http::ncode $tok]==302} {
  putserv "PRIVMSG $chan :FAIL"
} else {
   set data [::http::data $tok]
}
::http::config -useragent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"
foreach lines2 $data {putserv "PRIVMSG $chan :$lines2"}
http::unregister https
}

Results
Code:

<ComputerTech> !test
<Tech> FAIL


Google still thinks I am a bot. Any ideas on how to bypass this?
CrazyCat
Owner


Joined: 13 Jan 2002
Posts: 848
Location: France

Posted: Tue Mar 30, 2021 5:46 pm

Google doesn't think you're a bot; Google redirects you to a version you can read (without JavaScript).
Code:
set tok [::http::geturl $url]
set code [::http::ncode $tok]
if {$code == 301 || $code == 302} {
   # ::http::meta returns the response headers as a key/value list
   array set meta [::http::meta $tok]
   set data [::http::data [::http::geturl $meta(Location)]]
} else {
   set data [::http::data $tok]
}

Note that this approach only works if there is a single redirection.
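If there can be several redirections in a row, one possible approach is a loop with a hop limit. This is only a sketch under some assumptions: fetch_once and follow_redirects are made-up names, the limit of 5 hops and the list of redirect codes are arbitrary choices, relative Location headers are not resolved, and errors thrown by ::http::geturl (for example a refused connection) are not caught.

```tcl
package require http

# Hypothetical fetcher: performs one request and returns a dict with the
# numeric code, the Location header (empty if absent) and the body.
# For https URLs, ::http::register must already have been called.
proc fetch_once {url} {
    set tok [::http::geturl $url -timeout 10000]
    set location ""
    # Match the header name case-insensitively.
    foreach {name value} [::http::meta $tok] {
        if {[string equal -nocase $name "location"]} {
            set location $value
        }
    }
    set result [dict create ncode [::http::ncode $tok] \
        location $location body [::http::data $tok]]
    ::http::cleanup $tok
    return $result
}

# Follow redirects until a non-redirect response arrives, or give up
# after maxHops redirects. fetchCmd is a command prefix, so the network
# call can be replaced (for example when testing offline).
proc follow_redirects {url {maxHops 5} {fetchCmd fetch_once}} {
    for {set hop 0} {$hop <= $maxHops} {incr hop} {
        set resp [{*}$fetchCmd $url]
        set code [dict get $resp ncode]
        set next [dict get $resp location]
        if {$code ni {301 302 303 307 308} || $next eq ""} {
            return [dict get $resp body]
        }
        set url $next
    }
    error "too many redirects (more than $maxHops)"
}
```

In the bot script this would be called as set data [follow_redirects $url], wrapped in catch so that a redirect loop does not abort the proc.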

Also, I don't understand why you call ::http::config -useragent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0" after you have already used ::http. The ::http::config call must come before the request, when you initialise ::http.
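As a rough sketch of that ordering, the setup could live at the top of the script, outside the proc, so it runs once when the script is loaded. The fallback when the tls package is missing is an assumption of this sketch, not something from the script above.

```tcl
package require http

# Set the User-Agent once at load time, so it applies to every later
# ::http::geturl call. The string is the one used in this thread.
::http::config -useragent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"

# Register https once as well, instead of registering and unregistering
# inside the proc on every trigger. This needs the tls package.
if {[catch {package require tls}]} {
    # Assumption for this sketch: keep going with plain http only.
    puts stderr "tls package not available, https URLs will fail"
} else {
    ::http::register https 443 [list ::tls::socket]
}
```

The proc bound to !test then only has to call ::http::geturl and handle the result.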
All times are GMT - 4 Hours