Lucene

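Lucene is a project of the Apache Software Foundation. It is an open-source full-text search engine toolkit that provides the architecture for a full-text search engine: a complete query engine and indexing engine, plus part of a text analysis engine.

It is a full-text search library. Although it is related to search engines, the two should not be confused.

Official site: https://lucene.apache.org/

Documentation: https://lucene.apache.org/core/4_9_1/index.html

Official description: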

Lucene is a Java full-text search engine. Lucene is not a complete application, but rather a code library and API that can easily be used to add search capabilities to applications.

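Inverted index

To understand Lucene, you need to understand the inverted index.

Put simply: we normally locate a file by its path and name, and then look through its contents. An inverted index works the other way around: starting from the contents, it finds the file's location and name.

An inverted index is an indexing method that stores, for full-text search, a mapping from each word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems. With an inverted index, the list of documents containing a given word can be retrieved quickly.

The inverted index is also the core of Lucene's index.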
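To make the word-to-documents mapping concrete, here is a minimal sketch in plain Java. It is a toy, not Lucene's actual index format: the class name `InvertedIndexSketch`, the method `buildIndex`, the sample file names, and the whitespace tokenization are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not Lucene's on-disk format.
class InvertedIndexSketch {

    // Maps each word to the sorted set of document names that contain it.
    static Map<String, Set<String>> buildIndex(Map<String, String> docs) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            // Naive tokenization (an assumption): lowercase, split on whitespace.
            String[] words = entry.getValue().toLowerCase(Locale.ROOT).split("\\s+");
            for (String word : words) {
                if (word.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "hello lucene");
        docs.put("b.txt", "hello world");

        Map<String, Set<String>> index = buildIndex(docs);
        System.out.println(index.get("hello"));  // prints [a.txt, b.txt]
        System.out.println(index.get("lucene")); // prints [a.txt]
    }
}
```

A lookup is a single map access on the word, which is why retrieving the document list is fast compared with scanning every file.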

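A file's contents can be represented as one field and its name as another. The text of a field is tokenized, and the index is built from the resulting tokens, each of which becomes a term.

For example, take the file contents as a field named "contents" and tokenize them. If the contents are "A B", tokenization yields "A" and "B", so there are two terms: both belong to the field "contents", and their texts are "A" and "B" respectively.

Looking up a given term then returns the docs that contain it, along with the information stored for each doc.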
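The field/term relationship described above can be sketched with plain collections. This is only a model of the idea, not Lucene's API: `FieldTermSketch`, `addDocument`, and `filenamesFor` are hypothetical names, and a term is represented simply as the string "field:text".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: models "field + text = term" with plain collections.
class FieldTermSketch {

    // Stored file name per document, addressed by document number.
    private final List<String> storedFilenames = new ArrayList<>();
    // Term key "field:text" -> document numbers that contain the term.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Adds a document: "filename" is kept as one term, "contents" is tokenized.
    int addDocument(String filename, String contents) {
        int docId = storedFilenames.size();
        storedFilenames.add(filename);
        addTerm("filename", filename, docId);
        for (String token : contents.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                addTerm("contents", token, docId);
            }
        }
        return docId;
    }

    private void addTerm(String field, String text, int docId) {
        List<Integer> docs = postings.computeIfAbsent(field + ":" + text, k -> new ArrayList<>());
        // Skip the entry if this document was already recorded for the term.
        if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
            docs.add(docId);
        }
    }

    // Resolves a term to the stored file names of the documents containing it.
    List<String> filenamesFor(String field, String text) {
        List<String> result = new ArrayList<>();
        for (int docId : postings.getOrDefault(field + ":" + text, Collections.emptyList())) {
            result.add(storedFilenames.get(docId));
        }
        return result;
    }

    public static void main(String[] args) {
        FieldTermSketch index = new FieldTermSketch();
        index.addDocument("a.txt", "A B");
        index.addDocument("b.txt", "B C");
        System.out.println(index.filenamesFor("contents", "A")); // prints [a.txt]
        System.out.println(index.filenamesFor("contents", "B")); // prints [a.txt, b.txt]
    }
}
```

Note that only the file name is kept as a stored value here; the contents are searchable through their terms but cannot be read back from the index, which mirrors the stored versus index-only distinction.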

Demo code

For a reference example, see the official demo; package: org.apache.lucene.demo

I wrote a demo of my own:

MyIndexFiles.java

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

import java.io.File;
import java.io.IOException;

/**
 * Created by Edward on 2016/7/25.
 */
public class MyIndexFiles {

    public static void main(String[] args) throws IOException {

        // Store the index as files on disk
        FSDirectory directory = FSDirectory.open(new File("D:\\documents\\Lucene\\MyDemo\\index"));

        // Text analyzer (tokenizer)
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);

        // Index writer configuration; requires the version and the analyzer
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);

        // Create the index writer
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

        // Directory containing the files to index
        File path = new File("D:\\documents\\Lucene\\MyDemo\\docs");
        // List of files; listFiles() returns null if path is not a readable directory
        File[] listFile = path.listFiles();
        if (listFile == null) {
            indexWriter.close();
            throw new IOException("Not a readable directory: " + path);
        }
        for (File file : listFile) {
            // Skip subdirectories; only regular files can be read as text
            if (!file.isFile()) {
                continue;
            }

            // Create the doc
            Document doc = new Document();

            // Read the file attributes
            String filename = file.getName();
            long lastModified = file.lastModified();

            // Read the file contents into a String with FileUtils from commons-io-2.4.jar
            String fileContents = FileUtils.readFileToString(file);

            // Add the fields to the doc.
            // A StringField is not tokenized; the whole value is indexed as a single term.
            // A Field has both an index property and a store property:
            //   Field.Store.NO: the value is not stored, only indexed. Typically used for
            //     body text; get the file name from the hit and open the file to read it.
            //   Field.Store.YES: the value is stored. Typically used for values that are
            //     read back directly, such as the file path.
            doc.add(new StringField("filename", filename, Field.Store.YES));
            doc.add(new LongField("modify", lastModified, Field.Store.YES));
            doc.add(new TextField("contents", fileContents, Field.Store.NO));

            // Option 1: plain add
            //indexWriter.addDocument(doc);

            // Option 2: update, replacing the docs that match the term
            indexWriter.updateDocument(new Term("filename", filename), doc);
        }
        indexWriter.close();
    }


    @Test
    public void search() throws IOException, ParseException {

        // Index files on local disk
        Directory directory = FSDirectory.open(new File("D:\\documents\\Lucene\\MyDemo\\index"));

        // Open a reader on the index directory
        IndexReader indexReader = DirectoryReader.open(directory);

        // Create the index searcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_4_9);

        // Query parser: version, default field to search, analyzer
        QueryParser queryParser = new QueryParser(Version.LUCENE_4_9, "contents", analyzer);

        // Text to search for
        Query query = queryParser.parse("111");

        // Maximum number of hits to return
        int num = 6;
        TopDocs topDocs = indexSearcher.search(query, num);

        // The hits
        ScoreDoc[] docs = topDocs.scoreDocs;

        for (ScoreDoc doc : docs) {

            // Get the doc number
            int i = doc.doc;

            // Load the document by its number
            Document d = indexSearcher.doc(i);

            // Print the document fields
            System.out.println(d.get("filename"));
            System.out.println(d.get("modify"));
            // "contents" was indexed with Field.Store.NO, so this prints null
            System.out.println(d.get("contents"));
        }
        indexReader.close();
    }

}
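The demo calls `updateDocument` with a term on "filename" rather than `addDocument`. Lucene's `IndexWriter.updateDocument` deletes the documents containing the given term and then adds the new one, so re-running the indexer does not pile up duplicates of the same file. The sketch below mimics that behavior on an in-memory list; it does not use Lucene, and `UpdateByTermSketch` and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: mimics delete-then-add on an in-memory list of documents.
class UpdateByTermSketch {

    // Builds a document as a field-name -> value map (hypothetical helper).
    static Map<String, String> doc(String filename, String modify) {
        Map<String, String> fields = new HashMap<>();
        fields.put("filename", filename);
        fields.put("modify", modify);
        return fields;
    }

    // Plain append: indexing the same file twice leaves two copies.
    static void addDocument(List<Map<String, String>> index, Map<String, String> newDoc) {
        index.add(newDoc);
    }

    // Removes every document whose field equals value, then appends the new one.
    static void updateDocument(List<Map<String, String>> index,
                               String field, String value, Map<String, String> newDoc) {
        index.removeIf(existing -> value.equals(existing.get(field)));
        index.add(newDoc);
    }

    public static void main(String[] args) {
        List<Map<String, String>> added = new ArrayList<>();
        addDocument(added, doc("a.txt", "1"));
        addDocument(added, doc("a.txt", "2"));
        System.out.println(added.size()); // prints 2

        List<Map<String, String>> updated = new ArrayList<>();
        updateDocument(updated, "filename", "a.txt", doc("a.txt", "1"));
        updateDocument(updated, "filename", "a.txt", doc("a.txt", "2"));
        System.out.println(updated.size()); // prints 1
    }
}
```

This only works as a key because "filename" is an untokenized StringField in the demo, so the whole file name is a single term that can be matched exactly.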
