| Author |
Message |
cerberus_gr Halfop
Joined: 07 Feb 2003 Posts: 97 Location: 127.0.0.1
SaPrOuZy Halfop

Joined: 24 Mar 2004 Posts: 75 Location: Lebanon
|
Posted: Wed Jun 21, 2006 8:55 am Post subject: |
|
|
| try to be clearer... |
|
cerberus_gr Halfop
Joined: 07 Feb 2003 Posts: 97 Location: 127.0.0.1
|
Posted: Wed Jun 21, 2006 10:30 am Post subject: |
|
|
Let's try again.
I have a webpage in HTML format with 100 links in it. The links don't all have the same format. For the file file.htm, a link can take any of these forms:
1) <a href="http://www.domain/folder/file.htm">
2) <a href="www.domain/folder/file.htm">
3) <a href="http://domain/folder/file.htm">
4) <a href="/folder/file.htm">
5) <a href="file.htm"> (relative)
I have written code that extracts all the links from the webpage and adds them to a list, so I end up with a list like the following:
| Code: |
(bin) 49 % echo $links
{http://www.domain/folder/file.htm www.domain/folder/file.htm http://domain/folder/file.htm /folder/file.htm file.htm}
|
Now I want to create a procedure that takes each of the links and returns it in one of these formats:
http://www.domain/folder/file.htm or
http://domain/folder/file.htm
Example:
| Code: |
proc format_url { link_found parent_link } {
}
|
(bin) 50 % set a [format_url "http://www.domain/folder/file.htm" "http://www.domain/lala"]
http://www.domain/folder/file.htm
(bin) 51 % set a [format_url "www.domain/folder/file.htm" "http://www.domain/lala"]
http://www.domain/folder/file.htm
(bin) 52 % set a [format_url "http://domain/folder/file.htm" "http://www.domain/lala"]
http://www.domain/folder/file.htm
(bin) 53 % set a [format_url "/folder/file.htm" "http://www.domain/lala"]
http://www.domain/folder/file.htm
(bin) 54 % set a [format_url "file.htm" "http://www.domain/lala"]
http://www.domain/lala/file.htm
I'm not so good with regular expressions, so I need some help with this.
Last edited by cerberus_gr on Thu Jun 22, 2006 7:46 am; edited 1 time in total |
|
user

Joined: 18 Mar 2003 Posts: 1452 Location: Norway
|
Posted: Thu Jun 22, 2006 7:19 am Post subject: |
|
|
Your request is weird.
1) "www.domain" != "domain"
2) links not starting with a protocol are relative, so the absolute version of "www.domain" would be "http://base.href/www.domain"
3) your last example doesn't make any sense to me at all
_________________
Have you ever read "The Manual"? |
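To illustrate point 2, here is a minimal Tcl sketch of how a browser-style parser might tell the link forms apart. The proc name link_kind and the labels it returns are made up for this example. It only looks at the start of the string, so "www.domain/folder/file.htm" comes out as an ordinary relative path because it has no scheme.

```tcl
# Hypothetical helper (not from any library): classify a link by how it starts.
proc link_kind {link} {
    if {[regexp {^[A-Za-z][A-Za-z0-9+.-]*:} $link]} {
        # Has a scheme such as http: or ftp:
        return absolute
    } elseif {[string match "//*" $link]} {
        # //host/path: reuses the scheme of the current page
        return scheme-relative
    } elseif {[string match "/*" $link]} {
        # /path: relative to the root of the current host
        return root-relative
    }
    # Everything else is relative to the directory of the current page
    return path-relative
}

foreach link {http://www.domain/folder/file.htm www.domain/folder/file.htm /folder/file.htm file.htm} {
    puts "[link_kind $link]: $link"
}
```

Under these rules only the first link is absolute; the second and fourth are both path-relative.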
|
cerberus_gr Halfop
Joined: 07 Feb 2003 Posts: 97 Location: 127.0.0.1
|
Posted: Thu Jun 22, 2006 7:56 am Post subject: |
|
|
Most of the time www.domain is the same as domain, since www is the default subdomain.
You are correct about 2; I hadn't thought of it like that.
I'll describe exactly what I want to do:
I want to create a package that extracts data from webpages. I'm going to give it an initial webpage, and the script is going to follow every link and check each page for data. I'll keep a list of all the links the script has found, and I'm going to visit every one of them.
My problem is that a lot of pages contain links in different formats. A page could have two links to the same file ("http://domain/hello.htm" and "/hello.htm"), and I want my code to be clever enough to understand that these links are the same.
That's why I want to add links to the list in the format "http://(subdomain.)domain/file.htm": so that I can check whether a link already exists in the list and not lose time parsing it again.
So I need a procedure that returns a link in this format (like a web browser does with links). |
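One possible way to approach this is sketched below in plain Tcl. It is only a sketch under some assumptions. The names absolute_url, add_link, seen and queue are invented for the example. The base URL is assumed to be an absolute http-style URL. Links that already have a scheme are returned as they are, apart from dropping any #fragment. "www.domain" and "domain" are kept as two different hosts, since nothing guarantees they serve the same pages. Resolution follows the usual browser rule that a relative link replaces the last path segment of the base, so "file.htm" found on "http://www.domain/lala" becomes "http://www.domain/file.htm", and it only becomes ".../lala/file.htm" when the base ends in a slash.

```tcl
# Hypothetical helper: resolve a link against the URL of the page it was found on.
proc absolute_url {link base} {
    # A fragment (#section) does not change which page is fetched; drop it.
    regsub {#.*$} $link "" link
    # Links that already carry a scheme (http:, ftp:, mailto:) are left alone.
    if {[regexp {^[A-Za-z][A-Za-z0-9+.-]*:} $link]} {
        return $link
    }
    # Split the base URL into scheme, host and path (query and fragment ignored).
    if {![regexp {^([A-Za-z][A-Za-z0-9+.-]*)://([^/?#]*)([^?#]*)} $base -> scheme host path]} {
        error "base URL must be absolute: $base"
    }
    set scheme [string tolower $scheme]
    set host [string tolower $host]
    # //host/path keeps only the scheme of the base.
    if {[string match "//*" $link]} {
        return "${scheme}:$link"
    }
    # Set the query string aside so the path handling below does not touch it.
    regexp {^([^?]*)(.*)$} $link -> lpath query
    if {$lpath eq ""} {
        set full $path
    } elseif {[string index $lpath 0] eq "/"} {
        set full $lpath
    } else {
        # Relative path: append to the directory part of the base path.
        set full [string range $path 0 [string last "/" $path]]$lpath
    }
    if {[string index $full 0] ne "/"} {
        set full /$full
    }
    # Collapse "." and ".." segments; a trailing one leaves a trailing slash.
    set segs [split $full "/"]
    set last [lindex $segs end]
    if {$last eq "." || $last eq ".."} {
        lappend segs ""
    }
    set out {}
    foreach seg $segs {
        if {$seg eq "."} {
            continue
        } elseif {$seg eq ".."} {
            # Never pop the leading empty element that stands for the root.
            if {[llength $out] > 1} {
                set out [lrange $out 0 end-1]
            }
        } else {
            lappend out $seg
        }
    }
    return "${scheme}://${host}[join $out /]$query"
}

# Hypothetical helper: queue a link only if its absolute form is new.
# Returns 1 when the URL was added, 0 when it was already known.
proc add_link {link base} {
    global seen queue
    set url [absolute_url $link $base]
    if {[info exists seen($url)]} {
        return 0
    }
    set seen($url) 1
    lappend queue $url
    return 1
}

set base "http://www.domain/folder/index.htm"
foreach link {http://www.domain/folder/file.htm /folder/file.htm file.htm ./file.htm} {
    add_link $link $base
}
puts $queue
```

All four links in the usage example resolve to the same absolute URL, so the queue ends up with a single entry. Using an array keyed by URL keeps the "already in the list?" check cheap compared with searching the list each time.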
|