Help about a link

cerberus_gr
Halfop
Posts: 97
Joined: Fri Feb 07, 2003 8:57 am
Location: 127.0.0.1

Post by cerberus_gr »

Hello,

I have some code that collects all the links from a webpage. A link can appear in any of these formats:

1) http://www.domain/folder/file.htm
2) www.domain/folder/file.htm
3) http://domain/folder/file.htm
4) /folder/file.htm
5) file.htm (relative)

I want to write a procedure that takes a link and the URL of the HTML page it was parsed from, and returns the link in one of these formats:

1) http://domain/folder/file.htm or
2) http://www.domain/folder/file.htm


Example:

Code:

proc format_url { link parent } {
   ...
}

Thanks
SaPrOuZy
Halfop
Posts: 75
Joined: Wed Mar 24, 2004 7:38 am
Location: Lebanon

Post by SaPrOuZy »

try to be clearer...
cerberus_gr
Halfop
Posts: 97
Joined: Fri Feb 07, 2003 8:57 am
Location: 127.0.0.1

Post by cerberus_gr »

Let's try again :)

I have a webpage in HTML format with 100 links inside. The links don't all have the same format. For the file file.htm, the possible formats are:

1) <a href="http://www.domain/folder/file.htm">
2) <a href="www.domain/folder/file.htm">
3) <a href="http://domain/folder/file.htm">
4) <a href="/folder/file.htm">
5) <a href="file.htm"> (relative)


I have written some code that extracts all the links from the webpage and adds them to a list. So I have a list like the following:

Code:

(bin) 49 % echo $links
{http://www.domain/folder/file.htm www.domain/folder/file.htm http://domain/folder/file.htm /folder/file.htm file.htm}
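
The extraction code itself is not shown in the thread. As a rough sketch only, a list like this could be built with `regexp -all -inline`. The proc name `extract_links` and the sample HTML are hypothetical, and the pattern assumes double-quoted href values:

```tcl
# Hypothetical helper: collect the href value of every <a ...> tag in the html text.
# Assumption: href values are double-quoted. A real HTML parser is more robust.
proc extract_links {html} {
    set links {}
    # -all -inline returns a flat list: full match, capture, full match, capture, ...
    foreach {match href} [regexp -all -inline -nocase {<a\s[^>]*href="([^"]*)"} $html] {
        lappend links $href
    }
    return $links
}

# Made-up sample input, for illustration only.
set html {<a href="http://www.domain/folder/file.htm">one</a> <a href="/folder/file.htm">two</a>}
puts [extract_links $html]
```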

Now I want to create a procedure that takes each of the links and returns it in the format:

http://www.domain/folder/file.htm or
http://domain/folder/file.htm

Example:

Code:

proc format_url { link_found parent_link } {

}

(bin) 50 % set a [format_url "http://www.domain/folder/file.htm" "http://www.domain/lala"]
http://www.domain/folder/file.htm

(bin) 51 % set a [format_url "www.domain/folder/file.htm" "http://www.domain/lala"]
http://www.domain/folder/file.htm

(bin) 52 % set a [format_url "http://domain/folder/file.htm" "http://www.domain/lala"]
http://www.domain/folder/file.htm

(bin) 53 % set a [format_url "/folder/file.htm" "http://www.domain/lala"]
http://www.domain/folder/file.htm

(bin) 54 % set a [format_url "file.htm" "http://www.domain/lala"]
http://www.domain/lala/file.htm


I'm not so good with regular expressions, so I need some help with this.
Last edited by cerberus_gr on Thu Jun 22, 2006 7:46 am, edited 1 time in total.
user
Posts: 1452
Joined: Tue Mar 18, 2003 9:58 pm
Location: Norway

Post by user »

Your request is weird.

1) "www.domain" != "domain"
2) links not starting with a protocol are relative, so the absolute version of "www.domain" would be "http://base.href/www.domain"
3) your last example doesn't make any sense to me at all
Have you ever read "The Manual"?
cerberus_gr
Halfop
Posts: 97
Joined: Fri Feb 07, 2003 8:57 am
Location: 127.0.0.1

Post by cerberus_gr »

Most of the time www.domain is the same as domain, since www is the default subdomain.

You are correct about 2; I hadn't thought of it that way.


I'll describe exactly what I want to do:

I want to create a package that extracts data from webpages. I'm going to give it an initial webpage, and the script will follow every link and check each page for data. I'll keep a list of all the links the script has found, and I'm going to visit every one.

My problem is that a lot of pages have links in different formats. A page could contain two links to the same file ("http://domain/hello.htm" and "/hello.htm"), and I want my code to be clever enough to understand that these links are the same.

That's why I want to add links to the list in the format "http://(subdomain.)domain/file.htm": so I can check whether a link already exists in the list and not lose time parsing it again.
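
A minimal sketch of that "already seen" check, assuming every link has already been turned into an absolute URL. The names `normalize_url`, `should_visit` and the global array `visited` are hypothetical. The sketch lowercases only the scheme and host and drops any #fragment. It does not treat www.domain and domain as the same host, since that is not guaranteed.

```tcl
# Hypothetical helper: reduce an absolute URL to a canonical key.
proc normalize_url {url} {
    # Assumption: the fragment part never changes which page is fetched.
    regsub {#.*$} $url {} url
    if {[regexp {^([a-zA-Z][a-zA-Z0-9+.-]*://[^/]*)(.*)$} $url -> origin path]} {
        # An empty path and "/" name the same resource.
        if {$path eq ""} { set path / }
        # Scheme and host are case-insensitive, the path is not.
        return [string tolower $origin]$path
    }
    return $url
}

# Returns 1 the first time a URL is seen and 0 on every later call.
proc should_visit {url} {
    global visited
    set key [normalize_url $url]
    if {[info exists visited($key)]} { return 0 }
    set visited($key) 1
    return 1
}

puts [should_visit "http://Domain/hello.htm"]      ;# first sighting
puts [should_visit "http://domain/hello.htm#top"]  ;# same page, already seen
```

An array lookup keeps the check cheap even when the list of links grows large, which a linear search with lsearch would not.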

So I need a procedure that returns a link in this format (like a web browser does with links).
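
One possible shape for such a procedure, as a sketch only: it follows a simplified form of the relative-reference rules in RFC 3986 rather than anything confirmed in this thread. It assumes the parent is an absolute URL with a scheme and host, and it does not treat query strings or fragments in the link specially. Note two consequences of browser-style resolution: a link such as "www.domain/folder/file.htm" has no scheme and is therefore treated as a relative path, and "file.htm" resolves against the directory of the parent, so the parent needs a trailing slash ("http://www.domain/lala/") for the result to land under lala/.

```tcl
# Sketch of browser-style link resolution (simplified from RFC 3986).
proc format_url {link parent} {
    # Already absolute: the link carries its own scheme.
    if {[regexp {^[a-zA-Z][a-zA-Z0-9+.-]*:} $link]} {
        return $link
    }
    # Split the parent into scheme, host and path.
    if {![regexp {^([a-zA-Z][a-zA-Z0-9+.-]*)://([^/?#]*)([^?#]*)} $parent -> scheme host path]} {
        error "parent is not an absolute URL: $parent"
    }
    if {[string match //* $link]} {
        # Protocol-relative link: reuse only the scheme of the parent.
        return ${scheme}:$link
    }
    if {[string match /* $link]} {
        # Host-relative link: replaces the whole parent path.
        set path $link
    } else {
        # Relative link: replaces everything after the last slash of the parent path.
        set dir [string range $path 0 [string last / $path]]
        if {$dir eq ""} { set dir / }
        set path $dir$link
    }
    # Collapse "." and ".." segments without climbing above the root.
    set out {}
    foreach seg [split $path /] {
        switch -- $seg {
            .       { }
            ..      { if {[llength $out] > 1} { set out [lrange $out 0 end-1] } }
            default { lappend out $seg }
        }
    }
    # A trailing "." or ".." names a directory, so keep the trailing slash.
    if {$seg in {. ..}} { lappend out "" }
    return ${scheme}://$host[join $out /]
}

puts [format_url "/folder/file.htm" "http://www.domain/lala"]   ;# http://www.domain/folder/file.htm
puts [format_url "file.htm" "http://www.domain/lala/"]          ;# http://www.domain/lala/file.htm
puts [format_url "file.htm" "http://www.domain/lala"]           ;# http://www.domain/file.htm
```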