
How to compare a phrase with a .txt

Help for those learning Tcl or writing their own scripts.
User avatar
Madalin
Master
Posts: 310
Joined: Fri Jun 24, 2005 11:36 am
Location: Constanta, Romania
Contact:

How to compare a phrase with a .txt

Post by Madalin »

I want to make a statistics script that counts words/lines/smilies written in a channel, but I want it to count only words from my language. I have found a list of over 70,000 words in my language, and I want to know the best way to compare, let's say,

"I want to compare this phrase with name.txt file"

with the .txt file that contains all the words in my language.

Thanks
dj-zath
Op
Posts: 134
Joined: Sat Nov 15, 2008 6:49 am
Contact:

Post by dj-zath »

hi there!

I think a good approach would be to load said txt file into a list and then use the "lsearch" command... but with 70,000 entries, it's going to be a BIG list!

example:

Code: Select all

set VarA [open "/path/to/txt/file" r]
set TxtList [split [read $VarA] "\n"]
close $VarA
this should create a variable called $TxtList containing a list of all the words you can compare against

then you can use a foreach loop to do the comparisons..
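A rough sketch of that comparison loop (the phrase and the example dictionary words are placeholders, and $TxtList is assumed to have been built as above):

```tcl
# Hypothetical tiny dictionary; in practice $TxtList holds the 70,000 words.
set TxtList {casa masa apa soare}

set phrase "apa trece pe langa casa"
set matches 0
foreach word [split $phrase " "] {
    # lsearch returns the word's index in the list, or -1 if absent
    if {[lsearch -exact $TxtList $word] >= 0} {
        incr matches
    }
}
puts "$matches of [llength [split $phrase]] words matched"
# -> 2 of 5 words matched
```

Note that lsearch is a linear scan, so with 70,000 entries this gets slow; an array lookup (discussed below in the thread) is much faster.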

(mind you, I'm just "plopping this together" as I write this post.. it was just to give ya an idea; for actual WORKING code, it would have to be hashed out by some of the gurus here on the forums.. )

I hope this helps, at least a little :)

-DjZ-
:) :)
User avatar
Madalin
Master
Posts: 310
Joined: Fri Jun 24, 2005 11:36 am
Location: Constanta, Romania
Contact:

Post by Madalin »

Sorry for the late post :D Yes, an array was the only way I could make that script work. I had to set 610,696 array entries, and no, you don't need to foreach through it; just use if {[info exists var(word)]}... Also, I noticed that on Tcl 8.4 it uses around 150 MB of RAM, versus roughly 70 MB on Tcl 8.5, when a normal eggdrop uses something like 6-8 MB :D
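The constant-time lookup described here might look roughly like this (a sketch; the file path and array name are placeholders):

```tcl
# Build the word array once at load time: one array entry per word.
# Assumes words.txt holds one dictionary word per line.
set fh [open "words.txt" r]
foreach w [split [read $fh] "\n"] {
    if {$w ne ""} { set ::word($w) 1 }
}
close $fh

# Checking a word is then an O(1) [info exists] test,
# with no foreach over the whole dictionary.
proc isKnown {w} {
    return [info exists ::word($w)]
}
```

The trade-off is exactly the one mentioned above: the whole dictionary lives in RAM, but each lookup is a single hash probe.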
User avatar
caesar
Mint Rubber
Posts: 3776
Joined: Sun Oct 14, 2001 8:00 pm
Location: Mint Factory

Post by caesar »

The best approach would be to read one line at a time and compare whatever you want, rather than reading the entire file into a buffer and wasting precious system resources.

Code: Select all

set fo [open "file.txt" r]
while {[gets $fo line] >= 0} {
    # work with $line here ...
}
close $fo
and use break if you want to end the while loop early.
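For example, to stop reading as soon as one word is found (a sketch; the file name and search word are placeholders):

```tcl
set needle "soare"   ;# hypothetical word to look for
set found 0
set fo [open "file.txt" r]
while {[gets $fo line] >= 0} {
    if {$line eq $needle} {
        set found 1
        break    ;# match found, no need to read the rest of the file
    }
}
close $fo
```

On average this reads only half the file per lookup, which is why the thread later moves toward arrays and databases for many lookups per channel line.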
Once the game is over, the king and the pawn go back in the same box.
User avatar
Madalin
Master
Posts: 310
Joined: Fri Jun 24, 2005 11:36 am
Location: Constanta, Romania
Contact:

Post by Madalin »

Yes, caesar, but I have 610,696 words, so if I have to compare a line of 30 words against that file, I don't think re-reading the file each time is the best way. Yes, loading everything into an array costs resources, but it's the fastest way.
User avatar
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Re: How to compare a phrase with a .txt

Post by speechles »

Madalin wrote:I want to make a statistic script for words/lines/smilies written in a channel but i want to make it count only words from my language. I have found a list of over 70.000 words in my language and i want to know what the best way to compare lets say

"I want to compare this phrase with name.txt file"

with the .txt file that contains all the words in my language.

Thanks
You wouldn't keep it in a flat file. You would use the power of structured query language (SQL) to handle storing and recalling it. This offloads part of the work the eggdrop would otherwise have to do itself.

You would need your word list in one SQL table. A second table would hold the usernames and track how often what they typed fell within the word table. Using eggdrop you would populate the second table with queries against the first, rather than doing lsearch against a huge array in memory. Hopefully you follow all that.. ;)
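With Tcl's sqlite3 package (one concrete option; the thread doesn't name a specific database, and the table and column names here are made up), the two tables might be sketched like this:

```tcl
package require sqlite3
sqlite3 db "stats.db"

# word-list table, populated once from the 70,000-word file
db eval {CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)}

# per-user counters, updated as the channel is watched
db eval {CREATE TABLE IF NOT EXISTS stats (
    nick TEXT PRIMARY KEY,
    words INTEGER DEFAULT 0,
    lines INTEGER DEFAULT 0,
    smilies INTEGER DEFAULT 0)}

# check a single word against the dictionary
proc isKnown {w} {
    return [db exists {SELECT 1 FROM words WHERE word = $w}]
}
```

The same database file can then be read by a website or any other tool, which is the portability argument being made here.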
Last edited by speechles on Sat May 24, 2014 4:02 pm, edited 1 time in total.
User avatar
Madalin
Master
Posts: 310
Joined: Fri Jun 24, 2005 11:36 am
Location: Constanta, Romania
Contact:

Post by Madalin »

I used an array.. it takes almost 160 MB of RAM (on Tcl 8.4) and around 60 MB (on Tcl 8.5), but it does the job as it should, fast and reliable. I don't want to do this with SQL or anything else, because I don't find that way useful. The main channel has a lot of traffic.
User avatar
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

Madalin wrote:I used array.. it takes almost 160 RAM (on tcl8.4) and like 60 RAM (tcl8.5) but it does the job as it should fast and reliable i dont want to make this using sql or anything else because i dont find that way usefull. The main channel has alot of traffic
Using just eggdrop won't let you generate statistics off-medium. With an SQL database handling it, you can populate a website, an FTP login screen, etc. etc... The data can be put to use almost anywhere.

Storing it in arrays on eggdrop with flat text files makes it less appealing to import elsewhere, since you have to do a lot of legwork to make your format (however you chose to store the array) loadable by the other medium.

How much do you tie your channel into other mediums, like a forum or a website for users? If those exist, the natural thing to do is let them populate an SQL database for IRC, so users can see the statistics of their IRC ventures there as well. IRC then appears more important, since its presence shows up on mediums other than strictly itself.

It fully depends on how you want to do things in the end. Eventually the array/text-file method will hit a limit somewhere and break; a failure during the save of that giant database is likely sooner or later, who knows, right? You can force your records into fixed byte-size chunks so you gain "random access" to the file. That would alleviate a lot of what you are doing, though it does not allow future expansion the way an SQL database would: you are limited to a fixed record size with a hard top limit. Look into the [seek] and [tell] commands if you are staying with the text-file method. Make it random access for the sake of your eggdrop. :)

i.e., one record per user:
name:    20 bytes (padded with spaces, so unused bytes can be string-trimmed off)
words:    5 bytes (99999 max; could use little/big-endian packing to go bigger)
lines:    5 bytes (99999 max; same note)
smilies:  5 bytes (99999 max; same note)

My records are 35 bytes wide, so I can easily use seek and tell to move around the file. To create this file initially, you need to make sure all your values stay within record bounds, and make sure your puts uses the -nonewline option so no newline byte is appended, unless you account for it. I did not in my layout above.

Code: Select all

set count 0
set fh [open $file r]
while {1} {
    set line [read $fh 35]
    if {[eof $fh]} { break }
    set ::nickindex([set n [string trim [string range $line 0 19]]]) $count
    set ::nickwords($n)   [string trim [string range $line 20 24]]
    set ::nicklines($n)   [string trim [string range $line 25 29]]
    set ::nicksmilies($n) [string trim [string range $line 30 34]]
    incr count
}
close $fh
The above code loads the index into RAM simply to record where in our random-access file each nickname lives. That way we can seek straight to that point to write, instead of rewriting the entire file over and over each time.

You understand?

Code: Select all

# write a nick that already exists to the file
# (new nicks would of course go at the end)
set index $::nickindex($nick)
set fh [open $file r+]
seek $fh [expr {$index*35}]
# skip the nick (20 bytes); we don't need to overwrite it
seek $fh 20 current
puts -nonewline $fh [format %5s $::nickwords($nick)][format %5s $::nicklines($nick)][format %5s $::nicksmilies($nick)]
close $fh
Random access makes the whole thing much less susceptible to failure. The remaining weak point is the file becoming corrupt once it reaches an abominable size. Hope this helps. ;)
Last edited by speechles on Sat May 24, 2014 5:12 pm, edited 6 times in total.
User avatar
Madalin
Master
Posts: 310
Joined: Fri Jun 24, 2005 11:36 am
Location: Constanta, Romania
Contact:

Post by Madalin »

I don't think I made myself clear: I have loaded all the words I want to match into an array, 610,696 words, so when a user writes a huge line on the channel, instead of reading a file and doing a foreach for every word, I only do [info exists word($word)].

The stats look like this:
<+ SRI> Statistics for _MaDaLiN_ are as follows: 1041 written lines containing 5775 words (5.5 per sentence) / 22710 points (3.9 per word) / 621 Smiles (0.6 per sentence) and 1932 words that do not belong to the Romanian language or was misspelled.
There are also TOP commands for lines/words/smilies.

With the array, everything works as fast as I need without harming the server or the eggdrop in any way. And yes, what you said about the limitations of the array may be right, but I never planned, and never will plan, to make website statistics or anything more, because I simply don't need to.

The channel has 100+ users, and let's say 5-15 of them are talking most of the time, with long lines.

The array approach was recommended by thommey after I posted the topic here.

The only "problem" would be the RAM usage, but since I host the eggdrop somewhere without limitations, it's OK.
User avatar
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

Read my post again. You are NOT creating your file the way I suggest, so your database will become corrupt after a certain point. If you use random access to write new records into the file, corrupting the entire file becomes far less likely. If instead you rewrite the entire array to save (that is what you are doing, right?), then yeah, look at what I wrote above again.

The RAM usage would also go down if you stopped reading the entire file just to populate your array. Look at how I read it in 35-byte chunks. The problem with your RAM is that once you load the entire file in ONE read, Tcl eats that RAM and will not release it. So load it in by line, or by chunk, and populate your array from those chunks. Think of it as records in a file cabinet: I don't dump the cabinet on your desk; you flip through my neatly organized folders (aka chunks).
Last edited by speechles on Sat May 24, 2014 4:53 pm, edited 1 time in total.
User avatar
Madalin
Master
Posts: 310
Joined: Fri Jun 24, 2005 11:36 am
Location: Constanta, Romania
Contact:

Post by Madalin »

I never modify the file where the words are contained; whenever I restart, I just load that file to set the words again. The userfile is different. As I said, everything works OK so far.

The userfile uses the same system (loading everything into an array and writing it back on restart), but as I said, modifying that file is no problem; it doesn't corrupt anything (yet), and I don't think I will ever reach the point of corrupting it. I also wrote an AUTO REMOVAL proc to purge old users, so the file is self-sustaining.
User avatar
speechles
Revered One
Posts: 1398
Joined: Sat Aug 26, 2006 10:19 pm
Location: emerald triangle, california (coastal redwoods)

Post by speechles »

Madalin wrote:I never modify the file where the words are contained and whenever i restart i just load that file to set again the words. The userfile is different. As i said everything works ok so far.
The problem you have with eating RAM is obvious.

Code: Select all

# This will eat extra RAM equal to the size of the file:
# there are temporarily two full copies of the same data,
# the raw file contents and the array built from them.
# Doing it this way you must also rewrite the entire file
# every time you want to back up the array in memory.
# None of this is memory efficient. This example is lousy.
#
# initialize and load the user stats file
set fh [open $file r]
set data [read $fh]
close $fh
foreach line [split $data \n] {
   set line [split $line " "]
   set ::nickwords([set n [lindex $line 0]]) [lindex $line 1]
   set ::nicklines($n) [lindex $line 2]
   set ::nicksmilies($n) [lindex $line 3]
}


# This will eat at most 35 bytes of extra RAM:
# only a single record is in memory at a time while loading.
# Doing it this way I can minimize my rewrites when needing
# to save any part of the file, and I can still save the
# entire file at any time just as easily. This example r0x.
#
# initialize and load the user stats file
set count 0
set fh [open $file r]
while {1} {
   set line [read $fh 35]
   if {[eof $fh]} { close $fh ; break }
   set ::nickindex([set n [string trim [string range $line 0 19]]]) $count
   set ::nickwords($n) [string trim [string range $line 20 24]]
   set ::nicklines($n) [string trim [string range $line 25 29]]
   set ::nicksmilies($n) [string trim [string range $line 30 34]]
   incr count
}

You will not get back the RAM lost to reading the whole file at once, even if you unset the variables afterward. So do not use [read] on an entire file; give it a <chunk> size, so it behaves like [gets] in that you control how much RAM it eats, or iterate [gets] over the lines. Adapt your code to use [seek]/[tell] and either [read $fh <byte-count>] or [gets]. Everything else you do is moot: your RAM problem is caused by keeping two copies of everything in RAM at once while building your initial arrays, the entire file plus the entire array as you build it. That is crazy. Instead, read one line, or one chunk, at a time and build your array from that. caesar said it first; time to listen to advice, eh?

http://docs.activestate.com/activetcl/8 ... ecord.html

Simpsons already did it. I mean, tcllib has a nice clean record structure (struct::record) available to do what you are doing and keep it memory efficient.
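A minimal sketch of the tcllib approach (the exact calls follow my reading of the struct::record documentation; the record and instance names are made up for illustration):

```tcl
package require struct::record

# Define a record type with the per-user counters from this thread;
# words/lines/smilies default to 0.
::struct::record define userstats {
    nick
    {words 0}
    {lines 0}
    {smilies 0}
}

# Create an instance and update/read its fields.
userstats madalinStats -nick "Madalin" -words 5775
madalinStats configure -lines 1041
puts [madalinStats cget -words]
```

This keeps one in-memory object per user with named fields, instead of parallel global arrays indexed by nick.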