Search
Before you can search the cache archives with the WebAssistant, the archives must be indexed.
The search is available:
- at the URL http://127.0.0.1:8080/search/
- with the command ls
- in the search bar of the browser (Firefox 2, Internet Explorer 7)
To use the search bar, add the search plug-in offered on the search page to your browser.
First, select a cache archive. Note that only indexed archives are available. You can search for words, domains, and URLs. Multiple search criteria are combined with AND.
Search Terms
Search for words
Input | Search for pages which … the word. |
---|---|
archiv | Exactly contain: archiv |
webpage | Start with: webpage |
 | End with: website |
 | Contain: mirror |
Search for a domain
Additionally specify the keyword: site
Input | Search for pages which … the characters in the domain name. |
---|---|
site:mm3tools | Exactly contain: mm3tools |
site:proxy | Start with: proxy |
site: | End with: browser |
site: | Contain: offline |
Search in a part of URL
Additionally specify the keyword: url
Input | Search for pages which … the characters in the URL. |
---|---|
url:download | Contain: download |
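How several criteria narrow a result can be sketched in Python. This is only an illustration, not the WebAssistant's implementation: the `Page` record, the `matches` and `page_matches` helpers, and the mode names are hypothetical, and the match mode is passed explicitly instead of being parsed from the search input. The sketch also assumes that a domain criterion is tested against each dot-separated label of the host name.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Page:
    """Hypothetical record for one archived page."""
    url: str
    words: frozenset


def matches(value: str, term: str, mode: str) -> bool:
    """Compare value with term using one of the four match modes from the tables."""
    if mode == "exact":
        return value == term
    if mode == "start":
        return value.startswith(term)
    if mode == "end":
        return value.endswith(term)
    if mode == "contain":
        return term in value
    raise ValueError(f"unknown mode: {mode}")


def page_matches(page: Page, word=None, site=None, url=None) -> bool:
    """Each criterion is a (term, mode) pair or None. All given criteria must hold (AND)."""
    checks = []
    if word is not None:
        term, mode = word
        checks.append(any(matches(w, term, mode) for w in page.words))
    if site is not None:
        term, mode = site
        # Assumption: the domain test applies to each label of the host name.
        labels = (urlparse(page.url).hostname or "").split(".")
        checks.append(any(matches(label, term, mode) for label in labels))
    if url is not None:
        term, mode = url
        checks.append(matches(page.url, term, mode))
    return all(checks)
```

A page is a hit only if every criterion that was entered matches. A call without criteria matches every page.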
Output of a Search
The result of a search is displayed as a hit list.
The files (pages) are listed with their URL, size, and archiving date as well as 200 characters of text.
Text files are additionally marked with TXT.
For HTML files, the title and the description are also shown.
The files are sorted alphabetically by URL.
Multiple files from the same domain are listed intentionally.
Files with a red archiving date were updated after their index was built.
The Marker link displays the page with the search words highlighted.
Highlighting is not possible for all files.
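The rules for the hit list can be summarized in a short Python sketch. The `Hit` record and the `build_hit_list` function are hypothetical names chosen for this illustration. The sketch assumes that the 200 characters are taken from the start of the text and that URLs are sorted in plain string order.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Hit:
    """Hypothetical record for one file found by the search."""
    url: str
    size: int
    archived: datetime
    text: str


def build_hit_list(hits, index_built: datetime, excerpt_len: int = 200):
    """Return display rows sorted by URL, each with a short text excerpt."""
    rows = []
    for hit in sorted(hits, key=lambda h: h.url):
        rows.append({
            "url": hit.url,
            "size": hit.size,
            "archived": hit.archived,
            "excerpt": hit.text[:excerpt_len],
            # Shown in red: the file was archived again after the index was built.
            "stale": hit.archived > index_built,
        })
    return rows
```

A row marked as stale corresponds to a red archiving date, which suggests that the archive should be indexed again.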
Information about the Index
Word Histogram
The histogram lists the words in sorted order, each with the number of files in which it occurs.
For alphabetical sorting, use the keyword: wordAlphabetical
Input | Histogram of words which … the characters. |
---|---|
wordAlphabetical:archiv | Exactly contain: archiv |
wordAlphabetical:webpage | Start with: webpage |
wordAlphabetical: | End with: website |
wordAlphabetical: | Contain: mirror |
wordAlphabetical: | Contain arbitrary characters (all words) |
To sort by frequency, use the keyword: wordFrequency
To sort by word length, use the keyword: wordLength
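The three sort orders can be illustrated with a small Python sketch. The function `word_histogram` and its `order` values are hypothetical. The sort directions (most frequent first, shortest first) are assumptions, since the text above does not state them.

```python
from collections import Counter


def word_histogram(files: dict, order: str = "alphabetical"):
    """files maps a file name to the list of words found in that file.

    Returns (word, number_of_files) pairs in the requested order.
    """
    counts = Counter()
    for words in files.values():
        counts.update(set(words))  # count each word once per file
    items = list(counts.items())
    if order == "alphabetical":
        items.sort(key=lambda item: item[0])
    elif order == "frequency":
        # Assumption: most frequent words first, ties in alphabetical order.
        items.sort(key=lambda item: (-item[1], item[0]))
    elif order == "length":
        # Assumption: shortest words first, ties in alphabetical order.
        items.sort(key=lambda item: (len(item[0]), item[0]))
    else:
        raise ValueError(f"unknown order: {order}")
    return items
```

Note that the count is the number of files containing the word, not the total number of occurrences.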
Domain Histogram
The histogram lists the domains alphabetically, each with the number of files that belong to the domain.
For this, use the keyword: siteAlphabetical
Input | Histogram of domains whose names … the characters. |
---|---|
siteAlphabetical:mm3tools | Exactly contain: mm3tools |
siteAlphabetical:proxy | Start with: proxy |
siteAlphabetical: | End with: browser |
siteAlphabetical: | Contain: offline |
siteAlphabetical: | Contain arbitrary characters (all domain names) |
To sort by frequency, use the keyword: siteFrequency
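Counting files per domain can be sketched in Python as follows. The function `domain_histogram` is a hypothetical name. The sketch assumes that the domain is the host name of the archived URL and that frequency sorting puts the largest domains first.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_histogram(urls, order: str = "alphabetical"):
    """Return (domain, number_of_files) pairs for the given file URLs."""
    counts = Counter((urlparse(u).hostname or "") for u in urls)
    items = list(counts.items())
    if order == "alphabetical":
        items.sort(key=lambda item: item[0])
    elif order == "frequency":
        # Assumption: domains with the most files first.
        items.sort(key=lambda item: (-item[1], item[0]))
    else:
        raise ValueError(f"unknown order: {order}")
    return items
```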
Indexing
Searching the cache archives with the WebAssistant requires an index.
Text and HTML files (pages) are indexed. The Indexer's algorithm is essentially language-independent.
Uppercase characters are always converted to the corresponding lowercase characters.
The Latin and Russian alphabets as well as some special characters of European languages are supported.
Please inform Tools if you need another language.
Script file
Start the utility program with one of the following script files in the folder: MM3-WebAssistantProfessional/script/
Script | Operating System |
---|---|
MM3-Utility.bat | Microsoft Windows |
MM3-Utility.sh | Linux and UNIX |
MM3-Utility.command | Apple Mac OS X |
The first dialog lists all utilities.
Select: Create an index for the search about an cache archive
Click Next to open the configuration dialog: Indexer
Configuration of the Indexer
You can configure the following settings for indexing:
- Select the cache archive to be indexed.
- Specify the minimal word length.
Only words that reach the minimal word length are indexed. Put simply, the word length is the number of characters in a word.
- Display the positive and negative word lists.
  - Negative word list
  These words are not included in the index.
  Stop words for English, Russian, and German are provided.
  If you have created additional stop words, please let us know.
  - Positive word list
  These words are indexed even if they are shorter than the minimal word length.
The word lists are stored in the files positive.*.txt and negative.*.txt in the folder MM3-WebAssistantProfessional/config/search/. You can adapt the word lists to your needs. The character * stands for a language-specific word list, e.g. en for English and de for German. All files whose names follow this pattern are used. We recommend using the ISO language codes (ISO 639) to identify the language.
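How the minimal word length and the two word lists interact can be sketched in Python. The function `select_index_words` is hypothetical and is not taken from the Indexer. The sketch assumes that the word lists contain lowercase words and that the negative list takes precedence when a word appears in both lists.

```python
def select_index_words(words, min_word_length, negative=frozenset(), positive=frozenset()):
    """Return the words that would be included in the index."""
    selected = []
    for word in words:
        word = word.lower()  # uppercase characters are treated as lowercase
        if word in negative:
            continue  # stop words are never indexed (assumed to win over the positive list)
        if len(word) >= min_word_length or word in positive:
            selected.append(word)
    return selected
```

With a minimal word length of 3, a two-letter word such as "it" is only indexed if it is on the positive word list.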
Start the indexing once you have made the settings. The time required depends on the size of the archives, and indexing can take some time. Please close the WebAssistant before indexing.
Log output
The output of the Indexer shows:
- Indexed cache archives
- Number of files still to be indexed
- Domain currently being indexed
- Time elapsed so far
- Progress bar
- Summary statistics for the indexing run
Starting from the command line
You can also start the Indexer with the following command line:
java -jar MM3-WebAssistant.jar Indexer cacheActive=D:\CacheArchiv\ minWordLength=2 withNumber=yes start
Out of Memory
The memory required depends on the size of the archives and the chosen minimal word length. If indexing needs more memory, you can increase the memory available to the Indexer in the script file. Alternatively, you can split the cache archive into several archives or increase the minimal word length.
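The contents of the script files are not shown here, so the following Python sketch only illustrates the general idea. It builds the Indexer command line given above and optionally adds `-Xmx`, the standard option of the java launcher for the maximum heap size. The function `indexer_command` and its parameter names are hypothetical, and whether the script files set the memory in exactly this way is an assumption.

```python
def indexer_command(cache_dir, min_word_length=2, with_number="yes", max_heap=None):
    """Build the argument list for starting the Indexer.

    max_heap is a JVM size such as "1024m" (hypothetical example value).
    """
    cmd = ["java"]
    if max_heap is not None:
        cmd.append(f"-Xmx{max_heap}")  # standard JVM option: maximum heap size
    cmd += [
        "-jar", "MM3-WebAssistant.jar", "Indexer",
        f"cacheActive={cache_dir}",
        f"minWordLength={min_word_length}",
        f"withNumber={with_number}",
        "start",
    ]
    return cmd
```

The resulting list can be passed to `subprocess.run` from the folder that contains MM3-WebAssistant.jar. Raising `min_word_length` in the same call is the alternative mentioned above for reducing the memory needed.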