| Author |
Message |
ComputerTech Master

Joined: 22 Feb 2020 Posts: 393
|
Posted: Tue Mar 30, 2021 12:57 am Post subject: Parsing an entire HTML source page |
|
|
So I am trying to retrieve the entire source of this page: https://google.com/search?q=lego
| Code: |
bind PUB - "!test" the:test
package require http
package require tls
proc the:test {nick host hand chan text} {
http::register https 443 [list ::tls::socket]
set url "https://www.google.com/search?q=lego"
set data [::http::data [::http::geturl "$url" -timeout 10000]]
::http::config -useragent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"
foreach lines2 $data {putserv "PRIVMSG $chan :$lines2"}
http::unregister https
}
|
And I am getting this:
| Code: |
<Tech> <HTML><HEAD><meta
<Tech> http-equiv="content-type"
<Tech> content="text/html;charset=utf-8">
<Tech> <TITLE>302
<Tech> Moved</TITLE></HEAD><BODY>
<Tech> <H1>302
<Tech> Moved</H1>
<Tech> The
<Tech> document
<Tech> has
<Tech> moved
<Tech> <A
<Tech> HREF="https://www.google.com/sorry/index?continue=https://www.google.com/search%3Fq%3Dlego&q=EhAmB1MAAGEA2QAMAAAAAAAAGIDuioMGIhkA8aeDS7Cl4MTYJvxJOGvj5SyvlN0tmGEIMgFy">here</A>.
<Tech> </BODY></HTML>
|
_________________ ComputerTech |
CrazyCat Revered One

Joined: 13 Jan 2002 Posts: 1032 Location: France
|
Posted: Tue Mar 30, 2021 1:55 am Post subject: |
|
|
This happens because your code doesn't handle redirects (such as 301 or 302) and doesn't check the response status.
Your line:
| Code: | | set data [::http::data [::http::geturl "$url" -timeout 10000]] |
The better way (not the best):
| Code: | set tok [::http::geturl $url]
if {[::http::ncode $tok]==301 || [::http::ncode $tok]==302} {
# this is a redirection
} else {
set data [::http::data $tok]
} |
You can also use ::http::status and the other accessors to check whether you got the page you expected.
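As a rough illustration of checking both values together, here is a minimal sketch. The names classify_response and fetch_page are made up for this example and are not part of the http package; the 10 second timeout is taken from the original script. It assumes https has already been registered as in the script above.

```tcl
package require http

# Hypothetical helper (not part of the http package): classify a finished
# request from the values of ::http::status and ::http::ncode.
proc classify_response {status code} {
    if {$status ne "ok"} {
        return "failed"      ;# timeout, eof, error or reset: no usable reply
    }
    if {$code >= 300 && $code < 400} {
        return "redirect"
    }
    if {$code >= 200 && $code < 300} {
        return "ok"
    }
    return "error"
}

# Hypothetical wrapper: return the body only for a 2xx reply.
proc fetch_page {url} {
    set tok [::http::geturl $url -timeout 10000]
    set kind [classify_response [::http::status $tok] [::http::ncode $tok]]
    set body [::http::data $tok]
    ::http::cleanup $tok     ;# free the state array behind the token
    if {$kind ne "ok"} {
        error "cannot use reply from $url: $kind"
    }
    return $body
}
```

With this, a 302 shows up as an error from fetch_page instead of being printed to the channel as if it were the page.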
Have a look at https://www.tcl.tk/man/tcl8.4/TclCmd/http.htm _________________ https://www.eggdrop.fr - French IRC network
Offer me a coffee - Do not ask me help in PM, we are a community. |
ComputerTech Master

Joined: 22 Feb 2020 Posts: 393
|
Posted: Tue Mar 30, 2021 2:27 am Post subject: |
|
|
Thanks CrazyCat, will try that. _________________ ComputerTech |
ComputerTech Master

Joined: 22 Feb 2020 Posts: 393
|
Posted: Tue Mar 30, 2021 4:16 pm Post subject: |
|
|
Tried your suggestion CrazyCat,
| Code: |
bind PUB - "!test" the:test
package require http
package require tls
proc the:test {nick host hand chan text} {
http::register https 443 [list ::tls::socket]
set url "https://www.google.com/search?q=lego+ninjago"
set tok [::http::geturl $url]
if {[::http::ncode $tok]==301 || [::http::ncode $tok]==302} {
putserv "PRIVMSG $chan :FAIL"
} else {
set data [::http::data $tok]
}
::http::config -useragent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"
foreach lines2 $data {putserv "PRIVMSG $chan :$lines2"}
http::unregister https
}
|
Results
| Code: |
<ComputerTech> !test
<Tech> FAIL
|
Google still thinks I am a bot. Any ideas on how to get around this? _________________ ComputerTech |
CrazyCat Revered One

Joined: 13 Jan 2002 Posts: 1032 Location: France
|
Posted: Tue Mar 30, 2021 5:46 pm Post subject: |
|
|
Google doesn't think you're a bot; it redirects you to a version you can read (without JavaScript).
| Code: | set tok [::http::geturl $url]
if {[::http::ncode $tok]==301 || [::http::ncode $tok]==302} {
# ::http::meta returns the response headers as a key/value list
array set meta [::http::meta $tok]
set data [::http::data [::http::geturl $meta(Location)]]
} else {
set data [::http::data $tok]
} |
Note that this approach only works if there is a single redirect.
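If more than one redirect is possible, a loop with a hop limit is one way to handle it. This is only a sketch: redirect_target and fetch_following_redirects are hypothetical names, the limit of 5 hops is arbitrary, and it assumes every Location header holds an absolute URL and that https is already registered.

```tcl
package require http

# Hypothetical helper: find the Location header whatever its letter case.
proc redirect_target {headers} {
    foreach {name value} $headers {
        if {[string equal -nocase $name "Location"]} {
            return $value
        }
    }
    return ""
}

# Hypothetical helper: fetch a URL, following at most $max redirects.
# Assumes each Location header is an absolute URL.
proc fetch_following_redirects {url {max 5}} {
    for {set hops 0} {$hops <= $max} {incr hops} {
        set tok [::http::geturl $url -timeout 10000]
        set status  [::http::status $tok]
        set code    [::http::ncode $tok]
        set body    [::http::data $tok]
        set headers [::http::meta $tok]
        ::http::cleanup $tok
        if {$status ne "ok"} {
            error "request for $url failed: $status"
        }
        # Anything that is not a redirect code is returned as the final body.
        if {[lsearch -exact {301 302 303 307 308} $code] < 0} {
            return $body
        }
        set url [redirect_target $headers]
        if {$url eq ""} {
            error "redirect without a Location header"
        }
    }
    error "too many redirects"
}
```

Following a redirect only helps if the target is the page you want. The Location shown earlier in this thread points at /sorry/index, so it is worth looking at what actually comes back.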
Also, I don't understand why you call ::http::config -useragent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0" after you have already used ::http. The ::http::config call must come first, when ::http is initialised. _________________ https://www.eggdrop.fr - French IRC network
Offer me a coffee - Do not ask me help in PM, we are a community. |
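A small sketch of that ordering: the user agent is set once when the script loads, before any ::http::geturl call. The helper body_lines is a hypothetical addition, not something from the posts above. It is there because running foreach directly over the body treats it as a Tcl list and splits on every space, which matches the one-word-per-line output in the first post.

```tcl
package require http

# Set the user agent once, at load time, before any ::http::geturl call.
::http::config -useragent "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0"

# Hypothetical helper: split a response body into non-empty lines.
# Splitting on "\n" keeps each line whole; string trim also drops a
# trailing "\r" from CRLF line endings.
proc body_lines {body} {
    set lines {}
    foreach line [split $body "\n"] {
        set line [string trim $line]
        if {$line ne ""} {
            lappend lines $line
        }
    }
    return $lines
}
```

Inside the bind proc you would then loop with foreach line [body_lines $data] {putserv "PRIVMSG $chan :$line"} instead of iterating over $data directly.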