egghelp.org community
Discussion of eggdrop bots, shell accounts and tcl scripts.
 

Parsing webpages made easy
egghelp.org community Forum Index -> Tcl FAQ
demond
Revered One


Joined: 12 Jun 2004
Posts: 3073
Location: San Francisco, CA

Posted: Thu Jul 07, 2005 1:14 am    Post subject: Parsing webpages made easy

lots of folks have been asking for this lately, namely for scripts that parse webpages and filter out particular information; here is an elegant, easy and flexible (can be easily adapted for any webpage) way of doing just that

instead of digging through the HTML page structure in a messy, ugly, hard-coded style with [regsub] (like virtually all eggdrop scripts do), we will use the tDOM package, which implements several important XML technologies that make our lives much easier:
Code:

#!/bin/sh
# This line continues for Tcl, but is a single line for 'sh' \
exec tclsh8.4 "$0" ${1+"$@"}
package require tdom
package require http
set url "http://news.bbc.co.uk/sport"
# fetch the page and release the http token once we have the body
set token [::http::geturl $url]
set page [::http::data $token]
::http::cleanup $token
# build a DOM tree from the HTML
set doc [dom parse -html $page]
set root [$doc documentElement]
# XPath: table of width 416, first row, third cell, second div
set node [$root selectNodes {//table[@width=416]/tr[1]/td[3]/div[2]}]
set text [[[lindex $node 0] childNodes] nodeValue]
puts "Latest sport news: [string trim $text]"


here, we fetch the latest sport headline from the BBC's news site, after having determined that the text we are interested in is located within the TABLE element with width=416, in the first TR row, in the third TD column, in the second DIV element (the argument to selectNodes is a standard XPath query); a handy visual tool for working out such paths is the DOM Inspector of Mozilla/Firefox (on Windows, you must have installed Firefox with Development Tools), which displays the HTML document structure as a tree

this little script's output is:
Quote:

[demond@whitepine demond]$ ./tdom.sh
Latest sport news: Lord Coe flies to London on Thursday determined to make an immediate start to preparations for the 2012 Olympics.
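
To try the XPath query without depending on the live BBC page, here is a minimal sketch that runs the same kind of query against an inline HTML string. The markup and the helper name first_match_text are made up for illustration; only dom parse, documentElement, selectNodes and the node's text method come from tDOM. It also returns an empty string when nothing matches, instead of failing on an empty node list.

```tcl
package require tdom

# Hypothetical stand-in for a fetched page; this markup is invented for the example.
set page {<html><body>
<table width="416">
<tr><td>a</td><td>b</td><td><div>section</div><div> Headline text </div></td></tr>
<tr><td>c</td><td>d</td><td><div>x</div><div>y</div></td></tr>
</table>
</body></html>}

# Return the trimmed text of the first node matching xpath, or "" if none matches.
proc first_match_text {html xpath} {
    set doc [dom parse -html $html]
    set nodes [[$doc documentElement] selectNodes $xpath]
    set text ""
    if {[llength $nodes] > 0} {
        # "text" concatenates the text node children of the element
        set text [string trim [[lindex $nodes 0] text]]
    }
    # free the DOM tree; tDOM documents are not garbage collected
    $doc delete
    return $text
}

set headline [first_match_text $page {//table[@width=416]/tr[1]/td[3]/div[2]}]
puts "Headline: $headline"
```

Keeping the XPath as an argument means that only one string has to change when the page layout does.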
kanibus
Halfop


Joined: 03 May 2005
Posts: 44

Posted: Thu Jul 14, 2005 3:14 am

this looks to be a much simpler method than how I used the http package; I would like to rewrite some of my code, but I cannot get the tDOM package installed, as I am getting C compiler errors :cry:
demond
Revered One


Joined: 12 Jun 2004
Posts: 3073
Location: San Francisco, CA

Posted: Thu Jul 14, 2005 4:09 am

what errors? I've been getting some too, but managed to fix the source
kanibus
Halfop


Joined: 03 May 2005
Posts: 44

Posted: Thu Jul 14, 2005 1:03 pm

when I run ./configure I get
Quote:

C preprocessor "/lib/cpp" fails sanity check
demond
Revered One


Joined: 12 Jun 2004
Posts: 3073
Location: San Francisco, CA

Posted: Thu Jul 14, 2005 1:14 pm

that's not tDOM's problem, it's a problem with your compiler installation - most likely you are unable to compile any software from source, not only tDOM
kanibus
Halfop


Joined: 03 May 2005
Posts: 44

Posted: Thu Jul 14, 2005 3:15 pm

well, it's actually installed in /usr/bin/cpp, but I can't find where in the Makefile to change the dir
demond
Revered One


Joined: 12 Jun 2004
Posts: 3073
Location: San Francisco, CA

Posted: Thu Jul 14, 2005 3:37 pm

you don't have to change anything in the Makefile, it should have found the C preprocessor (cpp) in the correct location, and /usr/bin is correct, so it's not that causing the problem you experience

anyway, if you need more help on that, address your questions on the subject to the main forum; here we should discuss Tcl FAQ matters only
tonyrayo
Voice


Joined: 31 Jul 2003
Posts: 20
Location: Waldorf, MD

Posted: Sun Aug 21, 2005 4:50 pm

Thanks demond. I have 0 experience with Tcl, but I believe with general knowledge and the info you have provided I'll be able to come up with the script I need (just simple parsing of a webpage, then outputting data... once that works, adding a function to store a var so it knows when the webpage has been updated, instead of constantly flooding).
phab
Voice


Joined: 22 Aug 2005
Posts: 12

Posted: Sun Sep 04, 2005 11:01 am

Doesn't work here... I copied and pasted the example:

[phab@debian ~]$ ./tdom.sh
Latest sport news:
[phab@debian ~]$

What's wrong? ;)
demond
Revered One


Joined: 12 Jun 2004
Posts: 3073
Location: San Francisco, CA

Posted: Sun Sep 04, 2005 1:26 pm

most likely the BBC page has changed its structure and you need to correct your XPath
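
One way to work out a corrected XPath after a layout change is to search the parsed page for a phrase you can see in the browser and let tDOM report where it sits. The sketch below assumes that approach; the sample markup and the helper name paths_containing are hypothetical, while selectNodes, nodeValue, parentNode and toXPath are tDOM node methods.

```tcl
package require tdom

# Hypothetical page whose layout no longer matches the old table-based XPath.
set sample {<html><body><div id="main"><ul><li>Cricket update</li><li>Lord Coe flies to London</li></ul></div></body></html>}

# Return an XPath for every element that directly contains the given phrase.
proc paths_containing {html needle} {
    set doc [dom parse -html $html]
    set root [$doc documentElement]
    set paths {}
    foreach t [$root selectNodes {//text()}] {
        # plain substring test, so the phrase needs no XPath quoting
        if {[string first $needle [$t nodeValue]] >= 0} {
            lappend paths [[$t parentNode] toXPath]
        }
    }
    $doc delete
    return $paths
}

set paths [paths_containing $sample "Lord Coe"]
foreach p $paths {
    puts "found in: $p"
}
```

The reported path is tied to the page as it looks today, so it is best treated as a starting point and then loosened by hand (for example, by anchoring on an attribute rather than on positions alone).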
rix
Halfop


Joined: 21 Sep 2005
Posts: 42
Location: Estonia

Posted: Wed Sep 21, 2005 1:32 pm

Does it work in shell? And how do I edit it so it actually answers to !command? :oops:
demond
Revered One


Joined: 12 Jun 2004
Posts: 3073
Location: San Francisco, CA

Posted: Wed Sep 21, 2005 1:46 pm

irrelevant dude, you are on the wrong forum

moderators: maybe this forum should be locked for postings from people who don't actually have anything to share about using Tcl?
_________________
connection, sharing, dcc problems? click <here>
before asking for scripting help, read <this>
use [code] tag when posting logs, code
domme
Voice


Joined: 20 Feb 2006
Posts: 1

Posted: Mon Feb 20, 2006 5:38 am

How can I do this with egghttp? I'd like to parse some webpages for a proxy-like cache database. But the standard http package is kind of lame. It tends to freeze sometimes and doesn't do async connections.

Does somebody know how to load the core commands of eggdrop into a normal Tcl script to then use them with egghttp.tcl (which needs the eggdrop core functions)?

greets
Domme
demond
Revered One


Joined: 12 Jun 2004
Posts: 3073
Location: San Francisco, CA

Posted: Mon Feb 20, 2006 9:15 pm

domme wrote:
How can I do this with egghttp?

don't use egghttp, it is severely outdated; it had its use a long time ago, when Tcl still didn't have the built-in http package, which is superior in every way to egghttp
Quote:
I'd like to parse some webpages for a proxy-like cache database. But the standard http package is kind of lame.

no it's not
Quote:
It tends to freeze sometimes and doesn't do async connections.

yes it does; check out my rssnews script
Quote:
Does somebody know how to load the core commands of eggdrop into a normal Tcl script to then use them with egghttp.tcl (which needs the eggdrop core functions)?

what are you talking about?

and post to the appropriate forum, this one is for FAQ contributions only, not for questions
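
On the async point: this is not the rssnews script, just a minimal sketch of how the standard http package can fetch a page without blocking, using the -command and -timeout options of ::http::geturl. The proc names on_page_done and fetch_async and the variable ::last_fetch are made up for the example.

```tcl
package require http

# Callback invoked by the http package when the transfer finishes or times out.
proc on_page_done {token} {
    set status [::http::status $token]   ;# ok, timeout, error, eof or reset
    set ncode  [::http::ncode $token]    ;# numeric HTTP code, empty if none was received
    set body   [::http::data $token]
    # always release the token, otherwise its state array leaks
    ::http::cleanup $token
    if {$status eq "ok" && $ncode == 200} {
        puts "fetched [string length $body] bytes"
    } else {
        puts "fetch failed: status=$status code=$ncode"
    }
    # hypothetical result variable; a real script would parse $body here instead
    set ::last_fetch [list $status $ncode $body]
}

# Start a non-blocking request; returns immediately.
proc fetch_async {url} {
    if {[catch {::http::geturl $url -command on_page_done -timeout 10000} err]} {
        puts "could not start request: $err"
        set ::last_fetch [list error "" ""]
    }
}
```

The callback only fires while the Tcl event loop is running. Inside eggdrop that is already the case (and putlog would replace puts); in a plain tclsh script you would call fetch_async with a URL and then vwait ::last_fetch.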
De Kus
Revered One


Joined: 15 Dec 2002
Posts: 1361
Location: Germany

Posted: Sat Feb 25, 2006 8:12 am

domme wrote:
It tends to freeze sometimes and doesn't do async connections.

It froze only on the outdated Tcl builds provided by windrop.sourceforge.net (I believe it was up until around 8.4.7). Get a recent one if you still encounter that error. That was the original reason why my scripts used egghttp instead of http, but I would no longer use it for a new script either. If you compare egghttp and http closely, you will notice that http uses much more flexible and much stricter asynchronous code than egghttp.
_________________
De Kus
StarZ|De_Kus, De_Kus or DeKus on IRC
Copyright © 2005-2009 by De Kus - published under The MIT License
Love hurts, love strengthens...
All times are GMT - 4 Hours