demond Revered One

Joined: 12 Jun 2004 Posts: 3073 Location: San Francisco, CA
|
Posted: Thu Jul 07, 2005 1:14 am Post subject: Parsing webpages made easy |
|
|
lots of folks have been asking lately for scripts that parse webpages and filter out particular information; here is an elegant, easy and flexible way of doing just that (it can be easily adapted for any webpage)
instead of digging through the HTML page structure in a messy, ugly, hard-coded style with [regsub] (like virtually all eggdrop scripts do), we will use the tDOM package, which implements several important XML technologies that make our lives much easier:
Code: |
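#!/bin/sh
# This line continues for Tcl, but is a single line for 'sh' \
exec tclsh8.4 "$0" ${1+"$@"}
package require tdom
package require http
set url "http://news.bbc.co.uk/sport"
# fetch the page, then free the http token once we have the body
set token [::http::geturl $url]
set page [::http::data $token]
::http::cleanup $token
# parse the (possibly malformed) HTML into a DOM tree
set doc [dom parse -html $page]
set root [$doc documentElement]
# XPath: 2nd DIV in the 3rd TD of the 1st TR of the table with width=416
set node [$root selectNodes {//table[@width=416]/tr[1]/td[3]/div[2]}]
if {[llength $node] == 0} {
    puts "no match - the page structure has probably changed"
} else {
    # combined text of the first matching element
    puts "Latest sport news: [string trim [[lindex $node 0] text]]"
}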
|
here, we fetch the latest sport headlines from BBC's news site, after having determined that the text we are interested in is located within the TABLE element with width=416, in the first TR row, at the third TD column, in the second DIV element (selectNodes takes a standard XPath query); a handy visual tool for working out such paths is the DOM Inspector of Mozilla/Firefox (on Windows, you must have installed Firefox with Development Tools), which displays the HTML document structure as a tree
this little script's output is:
Quote: |
[demond@whitepine demond]$ ./tdom.sh
Latest sport news: Lord Coe flies to London on Thursday determined to make an immediate start to preparations for the 2012 Olympics.
|
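to see how each XPath step maps onto the markup without hitting the network, here is a minimal sketch that runs the same query against a made-up HTML fragment; the fragment and its text are assumptions for illustration only, not the real BBC markup

```tcl
package require tdom

# hypothetical fragment that mimics the layout described above
set html {
<html><body>
<table width="416">
<tr>
<td>one</td>
<td>two</td>
<td><div>section title</div><div> Sample headline text. </div></td>
</tr>
</table>
</body></html>
}

set doc  [dom parse -html $html]
set root [$doc documentElement]

# tr[1] = first row, td[3] = third cell, div[2] = second DIV inside that cell
set nodes [$root selectNodes {//table[@width=416]/tr[1]/td[3]/div[2]}]
set headline [string trim [[lindex $nodes 0] text]]
puts "headline: $headline"

# release the DOM tree when done
$doc delete
```

changing any index in the path (say td[2]) and re-running is a quick way to check which element a given step actually selects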
|
|
kanibus Halfop
Joined: 03 May 2005 Posts: 44
|
Posted: Thu Jul 14, 2005 3:14 am Post subject: |
|
|
this looks to be a much simpler method than how i used the http package; i would like to rewrite some of my code, but i cannot get the tDOM package installed as i am getting C compiler errors |
|
demond Revered One

Joined: 12 Jun 2004 Posts: 3073 Location: San Francisco, CA
|
Posted: Thu Jul 14, 2005 4:09 am Post subject: |
|
|
what errors? I've been getting some too, but managed to fix the source |
|
kanibus Halfop
Joined: 03 May 2005 Posts: 44
|
Posted: Thu Jul 14, 2005 1:03 pm Post subject: |
|
|
when i ./configure i get
Quote: |
C preprocessor "/lib/cpp" fails sanity check
|
|
|
demond Revered One

Joined: 12 Jun 2004 Posts: 3073 Location: San Francisco, CA
|
Posted: Thu Jul 14, 2005 1:14 pm Post subject: |
|
|
that's not tDOM's problem, it's a problem with your compiler installation - most likely you are unable to compile any software from source, not only tDOM |
|
kanibus Halfop
Joined: 03 May 2005 Posts: 44
|
Posted: Thu Jul 14, 2005 3:15 pm Post subject: |
|
|
well it's actually installed in /usr/bin/cpp but i can't find where in the Makefile to change the dir |
|
demond Revered One

Joined: 12 Jun 2004 Posts: 3073 Location: San Francisco, CA
|
Posted: Thu Jul 14, 2005 3:37 pm Post subject: |
|
|
you don't have to change anything in the Makefile, it should have found the C preprocessor (cpp) in the correct location, and /usr/bin is correct, so it's not that causing the problem you experience
anyway, if you need more help on that, address your questions on the subject to the main forum; here we should discuss Tcl FAQ matters only |
|
tonyrayo Voice
Joined: 31 Jul 2003 Posts: 20 Location: Waldorf, MD
|
Posted: Sun Aug 21, 2005 4:50 pm Post subject: |
|
|
Thanks demond. I have 0 experience with Tcl, but I believe that with general knowledge and the info you have provided I'll be able to come up with the script I need (just simple parsing of a webpage, then outputting the data... once that works, adding a function that stores the last value in a var so it knows when the webpage has been updated, instead of constantly flooding). |
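The "store a var so it knows when the page has been updated" idea could look roughly like the sketch below. It is only one possible approach; the names lastHeadline and headlineChanged are hypothetical, and the sample headlines stand in for whatever the parsing step returns.

```tcl
# last headline that was announced; empty until the first check
set lastHeadline ""

# hypothetical helper: returns 1 (and remembers the headline) only when
# it differs from the one seen on the previous call, otherwise returns 0
proc headlineChanged {headline} {
    global lastHeadline
    if {$headline eq $lastHeadline} {
        return 0
    }
    set lastHeadline $headline
    return 1
}

# simulate three periodic checks; the second one sees no change
foreach h {"First story" "First story" "Second story"} {
    if {[headlineChanged $h]} {
        puts "new: $h"
    }
}
```

Only changed headlines get printed, so a periodic check does not repeat the same line over and over.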
|
phab Voice
Joined: 22 Aug 2005 Posts: 12
|
Posted: Sun Sep 04, 2005 11:01 am Post subject: |
|
|
Doesn't work here... I copied and pasted the example:
[phab@debian ~]$ ./tdom.sh
Latest sport news:
[phab@debian ~]$
What's wrong?  |
|
demond Revered One

Joined: 12 Jun 2004 Posts: 3073 Location: San Francisco, CA
|
Posted: Sun Sep 04, 2005 1:26 pm Post subject: |
|
|
most likely, the BBC page has changed its structure and you need to correct your xpath |
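a rough sketch of how to notice that an xpath has gone stale and find a replacement; the helper name firstText, the sample page and the new width value 420 are all made up for illustration

```tcl
package require tdom

# hypothetical helper: trimmed text of the first node matching xpath,
# or an empty string when nothing matches
proc firstText {root xpath} {
    set nodes [$root selectNodes $xpath]
    if {[llength $nodes] == 0} {
        return ""
    }
    return [string trim [[lindex $nodes 0] text]]
}

# made-up page on which the table width changed from 416 to 420
set html {<html><body><table width="420"><tr><td>a</td><td>b</td><td><div>x</div><div>New headline</div></td></tr></table></body></html>}
set doc  [dom parse -html $html]
set root [$doc documentElement]

set oldText [firstText $root {//table[@width=416]/tr[1]/td[3]/div[2]}]
set newText [firstText $root {//table[@width=420]/tr[1]/td[3]/div[2]}]
puts "old xpath: '$oldText'"
puts "new xpath: '$newText'"

# list the width attribute of every table to see what the page offers now
foreach t [$root selectNodes {//table}] {
    puts "table width=[$t getAttribute width {}]"
}
```

an empty result from the old path is the hint that the page layout moved; dumping a few attributes like this is one way to work out the corrected path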
|
rix Halfop
Joined: 21 Sep 2005 Posts: 42 Location: Estonia
|
Posted: Wed Sep 21, 2005 1:32 pm Post subject: |
|
|
Does it work in shell? And how do I edit it so it actually answers to !command?  |
|
demond Revered One

Joined: 12 Jun 2004 Posts: 3073 Location: San Francisco, CA
|
Posted: Wed Sep 21, 2005 1:46 pm Post subject: |
|
|
irrelevant dude, you are on the wrong forum
moderators: maybe this forum should be locked for postings from people who don't actually have anything to share about using Tcl? _________________ connection, sharing, dcc problems? click <here>
before asking for scripting help, read <this>
use [code] tag when posting logs, code |
|
domme Voice
Joined: 20 Feb 2006 Posts: 1
|
Posted: Mon Feb 20, 2006 5:38 am Post subject: |
|
|
How can I do this with egghttp? I'd like to parse some webpages for a proxy-like cache database. But the standard http package is kind of lame. It tends to freeze sometimes and doesn't do async connections.
Does somebody know how to load the core commands of eggdrop into a normal Tcl script to then use them with egghttp.tcl (which needs the eggdrop core functions)?
greets
Domme |
|
demond Revered One

Joined: 12 Jun 2004 Posts: 3073 Location: San Francisco, CA
|
Posted: Mon Feb 20, 2006 9:15 pm Post subject: |
|
|
domme wrote: | How can I do this with egghttp? |
don't use egghttp, it is severely outdated; it had its use a long time ago, when Tcl still didn't have the built-in http package, which is superior in every way to egghttp
Quote: | I'd like to parse some webpages for a proxy-like cache database. But the standard http package is kind of lame. |
no it's not
Quote: | It tends to freeze sometimes and doesn't do async connections. |
yes it does; check out my rssnews script
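for reference, a minimal sketch of an asynchronous fetch with the built-in http package, using the -command and -timeout options of ::http::geturl; this is not the rssnews code, and the tiny local server plus the proc names accept, respond and done are assumptions added only so the example runs without network access

```tcl
package require http

# stand-in local server (demo assumption); a real script would fetch a remote URL
proc accept {chan addr port} {
    fconfigure $chan -blocking 0 -translation crlf
    fileevent $chan readable [list respond $chan]
}
proc respond {chan} {
    if {[gets $chan line] < 0} {
        if {[eof $chan]} { close $chan }
        return
    }
    # keep reading until the blank line that ends the request headers
    if {$line ne ""} { return }
    puts -nonewline $chan "HTTP/1.0 200 OK\nContent-Type: text/plain\nContent-Length: 5\n\nhello"
    close $chan
}
set server [socket -server accept -myaddr 127.0.0.1 0]
set port [lindex [fconfigure $server -sockname] 2]

# callback invoked by the http package when the transfer finishes or times out
proc done {token} {
    global result
    set result [list [::http::status $token] [::http::ncode $token] [::http::data $token]]
    ::http::cleanup $token
}

# returns immediately; the transfer is driven by the event loop
::http::geturl "http://127.0.0.1:$port/" -command done -timeout 5000

# a plain tclsh needs vwait to run the event loop; eggdrop already runs one
vwait result
close $server
puts "status=[lindex $result 0] code=[lindex $result 1] body=[lindex $result 2]"
```

with -command the call does not block while the page downloads, and -timeout makes the callback fire with a timeout status instead of hanging forever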
Quote: | Does somebody know how to load the core commands of eggdrop into a normal Tcl script to then use them with egghttp.tcl (which needs the eggdrop core functions)? |
what are you talking about?
and post to the appropriate forum, this one is for FAQ contributions only, not for questions |
|
De Kus Revered One

Joined: 15 Dec 2002 Posts: 1361 Location: Germany
|
Posted: Sat Feb 25, 2006 8:12 am Post subject: |
|
|
domme wrote: | It tends to freeze sometimes and doesn't do async connections. |
It froze only on outdated Tcl builds provided by windrop.sourceforge.net (I believe the problem existed up until around 8.4.7). If you still encounter that error, get a recent build. That was the original reason why my scripts used egghttp instead of http, but I would no longer use it for a new script either. If you compare egghttp and http closely, you will notice that http uses much more flexible and much stricter asynchronous code than egghttp. _________________ De Kus
StarZ|De_Kus, De_Kus or DeKus on IRC
Copyright © 2005-2009 by De Kus - published under The MIT License
Love hurts, love strengthens... |
|