egghelp.org community
Discussion of eggdrop bots, shell accounts and tcl scripts.
 

How to compare a phrase with a .txt

 
Madalin
Master


Joined: 24 Jun 2005
Posts: 310
Location: Constanta, Romania

Posted: Sat Apr 19, 2014 8:40 am    Post subject: How to compare a phrase with a .txt

I want to make a statistics script that counts the words, lines and smilies written in a channel, but I want it to count only words from my language. I have found a list of over 70,000 words in my language, and I want to know the best way to compare, let's say,

"I want to compare this phrase with name.txt file"

with the .txt file that contains all the words in my language.

Thanks
_________________
https://github.com/MadaliNTCL - To chat with me: https://tawk.to/MadaliNTCL

dj-zath
Op


Joined: 15 Nov 2008
Posts: 134

Posted: Sun May 04, 2014 8:46 pm

hi there!

I think a good approach would be to load said txt file into a list and then use the "lsearch" command... but with 70,000 entries, it's going to be a BIG list!

example:

Code:

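# Open the word file (one word per line)
set VarA [open "/path/to/txt/file" r]
# A single read already returns the whole file, so no eof loop is needed.
# split turns the text into a real list with one element per line;
# [list $TxtList] would only wrap the whole file in a one-element list.
set TxtList [split [string trim [read $VarA]] "\n"]
close $VarA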


This should create the initial variable called $TxtList, which will contain a list of all the words you can compare against.

Then you can use a "foreach" loop to do the comparisons.

(Mind you, I'm just "plopping this together" as I write this post; it was just to give you an idea. For actual WORKING code, it would have to be hashed out by some of the gurus here on the forums.)
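A minimal sketch of the list approach, assuming a word file with one word per line. The file name words_demo.txt, the sample words and the proc name count_known are made up for illustration, and the demo writes its own tiny file so it can run on its own. Sorting the list once lets lsearch do a binary search (-sorted) instead of scanning every entry.

```tcl
# Hypothetical demo file: one word per line (stands in for the real word list)
set path [file join [pwd] words_demo.txt]
set fh [open $path w]
puts $fh "compare\nfile\nphrase\nthis\nwant\nwith"
close $fh

# Load the file and split it into a sorted list, one element per line
set fh [open $path r]
set TxtList [lsort [split [string trim [read $fh]] "\n"]]
close $fh
file delete $path

# Count how many words of a phrase appear in the list.
# -sorted makes lsearch do a binary search on the sorted list.
proc count_known {phrase wordlist} {
    set known 0
    foreach word [split [string tolower $phrase]] {
        if {$word eq ""} { continue }
        if {[lsearch -exact -sorted $wordlist $word] >= 0} {
            incr known
        }
    }
    return $known
}

set known [count_known "I want to compare this phrase with name.txt file" $TxtList]
puts "known words: $known"
```

This sketch splits on whitespace only and does not strip punctuation, so "file." would not match "file".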

I hope this helps, at least a little :)

-DjZ-
:) :)

Madalin
Posted: Mon May 05, 2014 2:56 am

Sorry for the late reply :D Yes, an array was the only way I could make that script work. I had to set 610,696 array elements, and no, you don't need to foreach through it; you can use if {[info exists var(word)]} instead. I also noticed that under tcl8.4 it uses around 150 MB of RAM, versus around 70 MB under tcl8.5, when an eggdrop normally uses something like 6-8 MB of RAM :D

caesar
Mint Rubber


Joined: 14 Oct 2001
Posts: 3741
Location: Mint Factory

Posted: Mon May 05, 2014 5:15 am

The best approach would be to read one line at a time and compare whatever you want, rather than reading the entire file into a buffer and wasting precious system resources.
Code:

set fo [open "file.txt"]
while {[gets $fo line] >= 0} {
# work with $line here …
}
close $fo

and use break if you want to end the while loop.
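As a rough illustration of the loop above with break, here is a sketch that checks whether a single word appears in the file. The proc name word_in_file and the demo file name are hypothetical, and the case-insensitive comparison is an assumption, not something stated in the thread.

```tcl
# Return 1 if $word appears as a line in the file at $path, else 0.
# Reads one line at a time, so only the current line is held in memory.
proc word_in_file {path word} {
    set found 0
    set fo [open $path r]
    while {[gets $fo line] >= 0} {
        if {[string equal -nocase [string trim $line] $word]} {
            set found 1
            break   ;# stop reading as soon as we have a match
        }
    }
    close $fo
    return $found
}

# Demo with a tiny made-up word file
set path [file join [pwd] words_demo.txt]
set fo [open $path w]
puts $fo "want\ncompare\nphrase"
close $fo

set hit  [word_in_file $path "compare"]
set miss [word_in_file $path "name.txt"]
file delete $path
puts "compare: $hit, name.txt: $miss"
```

Memory use stays flat, but every lookup rescans the file from the top, and a word that is not in the file costs a full pass.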
_________________
Once the game is over, the king and the pawn go back in the same box.

Madalin
Posted: Mon May 05, 2014 6:42 am

Yes, caesar, but I have 610,696 words, so if I have to compare a line of 30 words against that file, I don't think reading the file is the best way. Yes, using an array and loading everything into the bot costs resources, but that's the fastest way.

speechles
Revered One


Joined: 26 Aug 2006
Posts: 1398
Location: emerald triangle, california (coastal redwoods)

Posted: Sat May 24, 2014 3:59 pm    Post subject: Re: How to compare a phrase with a .txt

Madalin wrote:
I want to make a statistics script that counts the words, lines and smilies written in a channel, but I want it to count only words from my language. I have found a list of over 70,000 words in my language, and I want to know the best way to compare, let's say,

"I want to compare this phrase with name.txt file"

with the .txt file that contains all the words in my language.

Thanks


You wouldn't keep it in a file. You would use structured query language (SQL) to handle storing and recalling it. This offloads part of the work from the eggdrop, so it doesn't have to do it all.

You would need to have your list of words in one SQL db. The second db would hold the list of usernames and track how often what they've typed fell within the first db. Hopefully you follow what I mean. Using eggdrop, you would populate the second db with queries against the first, rather than running lsearch against a huge array in memory. If you follow all that... hopefully ;)
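A hedged sketch of that idea, assuming SQLite through its Tcl binding (package sqlite3); the post only says "sql", so the engine is a stand-in. The two "dbs" are modelled here as two tables in one database, and the table names, column names and the proc record_line are all invented for illustration.

```tcl
package require sqlite3

# In-memory database for the demo; a real bot would pass a file path instead
sqlite3 db :memory:

db eval {
    CREATE TABLE words (word TEXT PRIMARY KEY);
    CREATE TABLE stats (nick TEXT PRIMARY KEY,
                        known INTEGER DEFAULT 0,
                        unknown INTEGER DEFAULT 0);
}

# Load the dictionary inside one transaction (tiny stand-in word list)
db transaction {
    foreach w {compare file phrase this want with} {
        db eval {INSERT OR IGNORE INTO words (word) VALUES ($w)}
    }
}

# Look up each word of a line and add the counts to the user's row
proc record_line {nick text} {
    set known 0
    set unknown 0
    foreach w [split [string tolower $text]] {
        if {$w eq ""} { continue }
        if {[db exists {SELECT 1 FROM words WHERE word = $w}]} {
            incr known
        } else {
            incr unknown
        }
    }
    db eval {INSERT OR IGNORE INTO stats (nick) VALUES ($nick)}
    db eval {UPDATE stats SET known = known + $known,
                              unknown = unknown + $unknown
             WHERE nick = $nick}
}

record_line Madalin "I want to compare this phrase with name.txt file"
set row [db eval {SELECT known, unknown FROM stats WHERE nick = 'Madalin'}]
db close
puts "known/unknown: $row"
```

The $w and $nick inside the braces are bound as SQL parameters by the binding rather than pasted into the query text, so channel text cannot inject SQL.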
_________________
speechles' eggdrop tcl archive


Last edited by speechles on Sat May 24, 2014 4:02 pm; edited 1 time in total

Madalin
Posted: Sat May 24, 2014 4:02 pm

I used an array. It takes almost 160 MB of RAM (on tcl8.4) and about 60 MB (on tcl8.5), but it does the job as it should, fast and reliably. I don't want to do this with SQL or anything else because I don't find that approach useful. The main channel has a lot of traffic.

speechles
Posted: Sat May 24, 2014 4:12 pm

Madalin wrote:
I used an array. It takes almost 160 MB of RAM (on tcl8.4) and about 60 MB (on tcl8.5), but it does the job as it should, fast and reliably. I don't want to do this with SQL or anything else because I don't find that approach useful. The main channel has a lot of traffic.


Using eggdrop, and only eggdrop, won't let you publish statistics outside of IRC. With an SQL database handling it, you can populate a website, an ftp login screen, and so on. There is no limit to where this data can be put to use.

Storing it in arrays on eggdrop and using flat text files makes it less appealing to import elsewhere, since you have to do so much legwork to make your format (however you chose to store the array) loadable into the other medium.

How much do you tie your channel into other mediums, like a forum or a website for users? If these exist for users, then the natural thing to do is to let the bot populate an SQL database for IRC, so they can see the statistics of their IRC ventures there as well. IRC will then appear more important, since its presence shows up on mediums other than strictly itself.

It fully depends on how you want to do things in the end. Eventually the array/txt file method will reach theoretical limits somewhere and break. A failure will likely occur during the saving of that giant database, who knows, right? You can try to force your files into fixed byte-chunk sizes so you can gain "random access" to them. This would alleviate a lot of what you are doing, but it does not allow for expansion in the future like an SQL database would: each field is limited to a fixed byte size with a top limit. Look into the [seek] and [tell] commands if you are still going with the text file saving method. Make it random access for the sake of your eggdrop. :)

i.e.:
name: 20 bytes (padded with spaces for any unused bytes, so you can string trim them off)
words: 5 bytes (99999 max as text; packing the value as little/big-endian binary would allow bigger numbers)
lines: 5 bytes (99999 max as text; packing the value as little/big-endian binary would allow bigger numbers)
smilies: 5 bytes (99999 max as text; packing the value as little/big-endian binary would allow bigger numbers)

My records are 35 bytes wide, so I can easily use seek and tell to move about the file. To create this file initially, you need to make sure all your values stay within record bounds. Make sure your puts always use the -nonewline option to suppress the trailing newline, unless you account for newlines in the record width. I did not in my example above.
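One way to build such a file, sketched under the assumption that nicks are plain ASCII (so one character is one byte). The proc names pack_record and write_records and the demo file name are hypothetical; the field widths are the ones listed above.

```tcl
# Pack one user record into exactly 35 bytes:
# nick (20, left-aligned) + words (5) + lines (5) + smilies (5)
proc pack_record {nick words lines smilies} {
    if {[string length $nick] > 20} {
        error "nick longer than 20 characters: $nick"
    }
    foreach n [list $words $lines $smilies] {
        if {![string is integer -strict $n] || $n < 0 || $n > 99999} {
            error "counter out of range 0..99999: $n"
        }
    }
    return [format "%-20s%5d%5d%5d" $nick $words $lines $smilies]
}

# Write records back to back with no newlines,
# so record number N starts at byte offset N*35.
# $records is a flat list: nick words lines smilies nick words ...
proc write_records {path records} {
    set fh [open $path w]
    fconfigure $fh -translation binary
    foreach {nick words lines smilies} $records {
        puts -nonewline $fh [pack_record $nick $words $lines $smilies]
    }
    close $fh
}

# Demo with two made-up users
set path [file join [pwd] stats_demo.dat]
write_records $path {Madalin 5775 1041 621 caesar 10 2 0}
set size [file size $path]
file delete $path
puts "file size: $size bytes"
```

Rejecting out-of-range values up front is what keeps every record the same width; a single six-digit counter would shift all later records by one byte.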

Code:
set count 0
set fh [open $file r]
while {1} {
   set line [read $fh 35]
   # a short read means end of file; without this check the loop
   # would store one bogus empty record after the last real one
   if {[string length $line] < 35} { break }
   set ::nickindex([set n [string trim [string range $line 0 19]]]) $count
   set ::nickwords($n) [string trim [string range $line 20 24]]
   set ::nicklines($n) [string trim [string range $line 25 29]]
   set ::nicksmilies($n) [string trim [string range $line 30 34]]
   incr count
}
close $fh


The above code loads the file into RAM simply to learn the placement of each nickname, so it knows which position in our random access file belongs to this person. That way we can quickly go to that point to write, instead of having to write the entire file over and over each time.

You understand?

Code:
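#write a nick that already exists to the file
#new nicks would of course go at the end
set index $::nickindex($nick)
# r+ opens for read/write without truncating (w would wipe the file)
set fh [open $file r+]
# jump to this record and skip the nick (20 bytes) we dont need to overwrite;
# seek offsets are absolute by default, so both steps go into one offset
seek $fh [expr {$index*35 + 20}]
# -nonewline keeps the record at exactly 35 bytes
puts -nonewline $fh [format %5s $::nickwords($nick)][format %5s $::nicklines($nick)][format %5s $::nicksmilies($nick)]
close $fh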


Random access makes it much less susceptible to failure. The weak point will be your file becoming corrupt when it reaches an abominable size. Hope this helps. ;)
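To make the random-access idea concrete, here is a small round trip: write two 35-byte records, overwrite the counters of the second one in place, and read single records back. The proc names update_record and read_record and the demo file are hypothetical, and ASCII nicks are assumed.

```tcl
# Overwrite the three counters of record number $index in place
proc update_record {path index words lines smilies} {
    set fh [open $path r+]          ;# read/write, does not truncate
    fconfigure $fh -translation binary
    seek $fh [expr {$index * 35 + 20}] start
    puts -nonewline $fh [format "%5d%5d%5d" $words $lines $smilies]
    close $fh
}

# Read one record back without loading the rest of the file
proc read_record {path index} {
    set fh [open $path r]
    fconfigure $fh -translation binary
    seek $fh [expr {$index * 35}] start
    set rec [read $fh 35]
    close $fh
    return [list [string trim [string range $rec 0 19]] \
                 [string trim [string range $rec 20 24]] \
                 [string trim [string range $rec 25 29]] \
                 [string trim [string range $rec 30 34]]]
}

# Demo file with two made-up records, written without newlines
set path [file join [pwd] stats_demo.dat]
set fh [open $path w]
fconfigure $fh -translation binary
puts -nonewline $fh [format "%-20s%5d%5d%5d" Madalin 5775 1041 621]
puts -nonewline $fh [format "%-20s%5d%5d%5d" caesar 10 2 0]
close $fh

update_record $path 1 12 3 1
set first  [read_record $path 0]
set second [read_record $path 1]
set size   [file size $path]
file delete $path
puts "$first / $second / $size bytes"
```

Only 15 bytes are written per update, and the file size stays the same, which is the point of fixed-width records.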
Last edited by speechles on Sat May 24, 2014 5:12 pm; edited 6 times in total

Madalin
Posted: Sat May 24, 2014 4:44 pm

I don't think I made myself clear. I have loaded all the words I want to match into an array (610,696 words), so when a user writes a huge line on a specific channel, instead of reading a file and doing a foreach for every word, I only do [info exists word(word)].
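The lookup described here can be sketched roughly as follows. The array name word comes from the post; the proc name classify_line and the six sample words are stand-ins, and splitting on whitespace without stripping punctuation is a simplification.

```tcl
# Tiny stand-in for the real word list: one array key per word
foreach w {compare file phrase this want with} {
    set word($w) 1
}

# Return a two-element list: known words, unknown words.
# Each lookup is a single hash probe, independent of the list size.
proc classify_line {text} {
    global word
    set known 0
    set unknown 0
    foreach w [split [string tolower $text]] {
        if {$w eq ""} { continue }
        if {[info exists word($w)]} {
            incr known
        } else {
            incr unknown
        }
    }
    return [list $known $unknown]
}

set result [classify_line "I want to compare this phrase with name.txt file"]
puts "known/unknown: $result"
```

The array trades memory for speed: every word is held in RAM as a hash key, which matches the RAM figures reported earlier in the thread.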

The stats look like this:

Quote:
<+ SRI> Statistics for _MaDaLiN_ are as follows: 1041 written lines containing 5775 words (5.5 per sentence) / 22710 points (3.9 per word) / 621 Smiles (0.6 per sentence) and 1932 words that do not belong to the Romanian language or was misspelled.


There are also TOP commands for lines/words/smilies.

With the array, everything works as fast as I need without harming the server or the eggdrop in any way. And yes, you could be right about the limitations of arrays, but I never planned, and never will plan, to build a website with statistics for users and a lot more things, because I simply don't need to.

The channel has 100+ users, and let's say 5-15 users are talking most of the time, with long lines.

The array was recommended by thommey after I posted this topic here.

The only "problem" would be the RAM usage, but since I have a place to host the eggdrop without limitations, that is OK.

speechles
Posted: Sat May 24, 2014 4:51 pm

Read my post again. You are NOT creating your file the way I suggest. Your database will become corrupt after a certain point. If you use random access to write new values into the file, it becomes far less likely that your entire file gets corrupted. If you don't, corruption is far more likely. Do you rewrite the entire array to save it? Is that what you are doing? Then yeah, look at what I wrote above again... The RAM usage would also go down if you did not read the entire file in to populate your array to begin with. Look at how I do it, reading in 35-byte chunks. How are you doing this now? The problem with your RAM is that once you load your entire file in, even ONCE, it has eaten the RAM, and Tcl will not release that RAM. You are just out of luck. So you have to load it in by line, or by chunk, and populate your array with these chunks. Think of it as records stored in a file cabinet: I don't dump the file cabinet on your desk; you flip through my neatly organized folders (aka chunks).

Last edited by speechles on Sat May 24, 2014 4:53 pm; edited 1 time in total

Madalin
Posted: Sat May 24, 2014 4:53 pm

I never modify the file that contains the words, and whenever I restart I just load that file to set the words again. The userfile is different. As I said, everything works OK so far.

The userfile uses the same system (loading everything into an array and restoring it after a restart), but as I said, modifying that file is no problem; it doesn't corrupt anything (yet), and I don't think I will ever get to the point of corrupting the file when modifying it. I also wrote an AUTO REMOVAL proc to remove old users, so the file will be self-sustaining.

speechles
Posted: Sat May 24, 2014 4:58 pm

Madalin wrote:
I never modify the file that contains the words, and whenever I restart I just load that file to set the words again. The userfile is different. As I said, everything works OK so far.


The problem you have with eating RAM is obvious.

Code:
# This eats extra RAM equal to the size of the whole file.
# Eventually there are two full copies of the same data in memory:
# the raw file contents and the arrays built from them.
# Doing it this way, you must rewrite the entire file every time
# you want to back up the arrays in memory. None of this is
# memory efficient. The example is lousy.
#
# initialize and load the user stats file
set fh [open $file r]
# keep the contents in their own variable instead of clobbering $file
set data [read $fh]
close $fh
foreach line [split $data \n] {
   # fields are separated by spaces; the line was already split on \n
   set line [split $line]
   set ::nickwords([set n [lindex $line 0]]) [lindex $line 1]
   set ::nicklines($n) [lindex $line 2]
   set ::nicksmilies($n) [lindex $line 3]
}


# This eats only about 35 bytes of extra RAM.
# There is only a single record in memory at a time.
# Doing it this way I can minimize my rewrites when I need
# to save any part of the file. I can also save the entire file
# at any time just as easily. This example r0x.
#
# initialize and load the user stats file
set count 0
set fh [open $file r]
while {1} {
   # read accepts either -nonewline or a byte count, not both
   set line [read $fh 35]
   # a short read means end of file (or a truncated last record)
   if {[string length $line] < 35} { close $fh ; break }
   set ::nickindex([set n [string trim [string range $line 0 19]]]) $count
   set ::nickwords($n) [string trim [string range $line 20 24]]
   set ::nicklines($n) [string trim [string range $line 25 29]]
   set ::nicksmilies($n) [string trim [string range $line 30 34]]
   incr count
}



You will not get back the RAM lost to reading this file, even if you unset file and unset fh. So do not use read at all unless it is with a <chunk> size. Then it behaves like [gets], in that you control how much of your RAM it eats. You must adapt your code to use [seek] and [tell], and either use [read $channel <byte-count>] or iterate [gets] over the lines. Until you do, everything else is moot. The RAM problem is caused by keeping two copies of everything in RAM at once while creating the initial arrays the bot keeps: you hold the entire file in RAM as well as the entire array as you build it. This is crazy. Instead, read one line at a time, or one chunk, and build your array like that. caesar said it first; time to listen to advice, eh?
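For a newline-delimited stats file, the line-at-a-time variant might look like the sketch below. It assumes each line holds "nick words lines smilies" separated by single spaces, which is a guess at the format; the proc name load_stats and the demo file are hypothetical.

```tcl
# Load user stats one line at a time; only the current line is buffered
# in addition to the arrays being built. Returns the number of users loaded.
proc load_stats {path} {
    set count 0
    set fh [open $path r]
    while {[gets $fh line] >= 0} {
        set fields [split [string trim $line]]
        # skip blank or malformed lines instead of storing garbage
        if {[llength $fields] != 4} { continue }
        foreach {n words lines smilies} $fields break
        set ::nickwords($n) $words
        set ::nicklines($n) $lines
        set ::nicksmilies($n) $smilies
        incr count
    }
    close $fh
    return $count
}

# Demo file with two made-up users and one blank line
set path [file join [pwd] stats_demo.txt]
set fh [open $path w]
puts $fh "Madalin 5775 1041 621\n\ncaesar 10 2 0"
close $fh

set loaded [load_stats $path]
file delete $path
puts "$loaded users, Madalin wrote $::nickwords(Madalin) words"
```

The foreach ... break line is the pre-8.5 idiom for unpacking a list into variables, used here so the sketch also runs on the tcl8.4 bots mentioned in the thread.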

http://docs.activestate.com/activetcl/8.4/tcllib/struct/record.html

Simpsons already did it. I mean, tcllib has a nice clean record module (struct::record) available to do what you are doing and keep it memory efficient.

All times are GMT - 4 Hours