This is the new home of the egghelp.org community forum.

Parsing webpages made easy

Issues often discussed about Tcl scripting. Check before posting a scripting question.
demond
Revered One
Posts: 3073
Joined: Sat Jun 12, 2004 9:58 am
Location: San Francisco, CA

Post by demond »

lots of folks have been asking for this lately, namely for scripts that parse webpages and filter out particular information; here is an elegant, easy and flexible (can be easily adapted for any webpage) way of doing just that

instead of digging up the HTML page structure in a messy, ugly hard-coded style with [regsub] (like virtually all eggdrop scripts do), we will use the tDOM package, which implements several important XML technologies that make our lives much easier:

Code:

#!/bin/sh
# This line continues for Tcl, but is a single line for 'sh' \
exec tclsh8.4 "$0" ${1+"$@"}
package require tdom
package require http

# fetch the page and release the http token once we have the body
set url "http://news.bbc.co.uk/sport"
set token [::http::geturl $url]
set page [::http::data $token]
::http::cleanup $token

# parse the HTML into a DOM tree
set doc [dom parse -html $page]
set root [$doc documentElement]

# XPath query; selectNodes returns a list of matching nodes
set node [$root selectNodes {//table[@width=416]/tr[1]/td[3]/div[2]}]
set text [[[lindex $node 0] childNodes] nodeValue]
puts "Latest sport news: [string trim $text]"
here, we fetch the latest sport headline from BBC's news site, after having determined that the text we are interested in is located within the TABLE element with width=416, in the first TR row, at the third TD column, in the second DIV element (selectNodes runs a standard XPath query); a handy visual tool for working out such paths is the DOM Inspector of Mozilla/Firefox (on Windows, you must have installed Firefox with Development Tools), which displays the HTML document structure as a tree

this little script's output is:
[demond@whitepine demond]$ ./tdom.sh
Latest sport news: Lord Coe flies to London on Thursday determined to make an immediate start to preparations for the 2012 Olympics.
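
To see how the positional steps in that XPath select a node without depending on a live page, here is a minimal sketch against an inline HTML string. The markup is a made-up miniature that only loosely mirrors the layout described above, and the variable names are illustrative assumptions, not part of the original script.

```tcl
package require tdom

# Hypothetical miniature page; NOT the real BBC markup.
set html {
<html><body>
<table width="416">
  <tr>
    <td>left</td>
    <td>middle</td>
    <td><div>section title</div><div> Example headline text </div></td>
  </tr>
</table>
</body></html>
}

set doc  [dom parse -html [string trim $html]]
set root [$doc documentElement]

# table with width=416 -> first row -> third cell -> second div
set nodes [$root selectNodes {//table[@width=416]/tr[1]/td[3]/div[2]}]

# "text" returns the combined text-node children of the element
set headline [string trim [[lindex $nodes 0] text]]
puts "Headline: $headline"

# free the DOM tree when done
$doc delete
```

Changing any index in the path (for example td[2] instead of td[3]) selects a different cell, which is the kind of adjustment needed when adapting the script to another page.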
kanibus
Halfop
Posts: 44
Joined: Tue May 03, 2005 7:22 am

Post by kanibus »

this looks to be a much simpler method than how i used the http package, i would like to rewrite some of my code but i cannot get the tDOM package installed as i am getting c compiler errors :cry:

Post by demond »

what errors? I've been getting some too, but managed to fix the source

Post by kanibus »

when i ./configure i get
C preprocessor "/lib/cpp" fails sanity check

Post by demond »

that's not tDOM's problem, it's a problem with your compiler installation - most likely you are unable to compile any software from source, not only tDOM

Post by kanibus »

well, it's actually installed in /usr/bin/cpp but i can't find where in the Makefile to change the dir

Post by demond »

you don't have to change anything in the Makefile, it should have found the C preprocessor (cpp) in the correct location, and /usr/bin is correct, so it's not that causing the problem you experience

anyway, if you need more help on that, address your questions on the subject to the main forum; here we should discuss Tcl FAQ matters only
tonyrayo
Voice
Posts: 20
Joined: Thu Jul 31, 2003 3:29 pm
Location: Waldorf, MD

Post by tonyrayo »

Thanks demond. I have 0 experience with Tcl but I believe with general knowledge and the info you have provided I'll be able to come up with the script I need (just simple parsing of a webpage then outputting data... once that works, adding a function to store the last value in a variable so it knows when the webpage has been updated, instead of constantly flooding).
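
One way the "store the last value" idea could look, as a rough sketch only: keep the previously seen headline in a global variable and report a change only when the new text differs. The names lastHeadline and headline_changed are made up for illustration and do not come from any script in this thread.

```tcl
# Hypothetical state: the last headline that was announced.
set lastHeadline ""

# Hypothetical helper: returns 1 only when the (trimmed) text is non-empty
# and differs from the last one seen, and remembers it in that case.
proc headline_changed {text} {
    global lastHeadline
    set text [string trim $text]
    if {$text eq "" || $text eq $lastHeadline} {
        return 0
    }
    set lastHeadline $text
    return 1
}

# Usage: only the first and third items are reported as new.
foreach h {"First story" "First story" "Second story"} {
    if {[headline_changed $h]} {
        puts "new headline: $h"
    }
}
```

The value lives only in memory here, so it is lost on restart; writing it to a file would be a separate step.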
phab
Voice
Posts: 12
Joined: Mon Aug 22, 2005 6:34 am

Post by phab »

Doesn't work here... I copied and pasted the example:

[phab@debian ~]$ ./tdom.sh
Latest sport news:
[phab@debian ~]$

What's wrong? ;-)

Post by demond »

most likely, the BBC page has changed its structure and you need to correct your xpath
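
A small guard makes that situation easier to spot. The sketch below is an assumption about how one might wrap the query, not part of the original script: first_match_text is a hypothetical helper that returns an empty string when the XPath matches nothing, so the caller can print a hint instead of silently showing no headline.

```tcl
package require tdom

# Hypothetical helper: trimmed text of the first node matching the XPath
# expression, or an empty string when nothing matches.
proc first_match_text {root xpath} {
    set nodes [$root selectNodes $xpath]
    if {[llength $nodes] == 0} {
        return ""
    }
    # asText gives the text of all descendant text nodes
    return [string trim [[lindex $nodes 0] asText]]
}

# Made-up one-line page standing in for a site whose layout has changed.
set doc  [dom parse -html {<html><body><div id="a">hello</div></body></html>}]
set root [$doc documentElement]

set found   [first_match_text $root {//div[@id='a']}]
set missing [first_match_text $root {//table[@width=416]/tr[1]/td[3]/div[2]}]

if {$missing eq ""} {
    puts "XPath matched nothing; the page layout has probably changed"
}
$doc delete
```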
rix
Halfop
Posts: 42
Joined: Wed Sep 21, 2005 1:04 pm
Location: Estonia

Post by rix »

Does it work in shell? And how do I edit it so it actually answers to !command? :oops:

Post by demond »

irrelevant dude, you are on the wrong forum

moderators: maybe this forum should be locked for postings from people who don't actually have anything to share about using Tcl?
domme
Voice
Posts: 1
Joined: Mon Feb 20, 2006 5:31 am

Post by domme »

How can I do this with egghttp? I'd like to parse some webpages for a proxy-like cache database. But the standard http package is kind of lame. It tends to freeze sometimes and doesn't do async connections.

Does somebody know how to load the core commands of eggdrop into a normal Tcl script to then use them with egghttp.tcl (which needs the eggdrop core functions)?

greets
Domme

Post by demond »

domme wrote: How can I do this with egghttp?
don't use egghttp, it is severely outdated; it had its use a long time ago, when Tcl still didn't have the built-in http package, which is superior in every way to egghttp
domme wrote: I'd like to parse some webpages for a proxy-like cache database. But the standard http package is kind of lame.
no it's not
domme wrote: It tends to freeze sometimes and doesn't do async connections.
yes it does; check out my rssnews script
domme wrote: Does somebody know how to load the core commands of eggdrop into a normal Tcl script to then use them with egghttp.tcl (which needs the eggdrop core functions)?
what are you talking about?

and post to the appropriate forum, this one is for FAQ contributions only, not for questions
De Kus
Revered One
Posts: 1361
Joined: Sun Dec 15, 2002 11:41 am
Location: Germany

Post by De Kus »

domme wrote: It tends to freeze sometimes and doesn't do async connections.
It froze only on outdated Tcl builds provided by windrop.sourceforge.net (I believe it was up until around 8.4.7). Get a recent one if you still encounter that error. That was the original reason why my scripts used egghttp instead of http, but I would no longer use it for a new script either. If you compare egghttp and http closely, you will notice that http uses much more flexible and much stricter asynchronous code than egghttp.
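
For reference, a non-blocking fetch with the stock http package might look roughly like the sketch below. It is not taken from rssnews or any other script mentioned here; on_page and fetch_async are hypothetical names, and the 10 second timeout is an arbitrary assumption.

```tcl
package require http

# Hypothetical callback: the http package calls this with the token
# once the transfer has finished, failed or timed out.
proc on_page {token} {
    if {[::http::status $token] eq "ok" && [::http::ncode $token] == 200} {
        puts "got [string length [::http::data $token]] bytes"
    } else {
        puts "fetch failed: [::http::status $token]"
    }
    # always release the token
    ::http::cleanup $token
}

# Hypothetical wrapper: starts the request and returns right away;
# on_page runs later from the event loop.
proc fetch_async {url} {
    return [::http::geturl $url -command on_page -timeout 10000]
}
```

A call such as fetch_async http://news.bbc.co.uk/sport returns immediately. Under plain tclsh the event loop has to be entered (for example with vwait) before the callback can fire, and geturl can still raise an error while setting up the connection, so wrapping the call in catch is a reasonable precaution.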