有朋友需求一個問題,就是處理一張表中某一欄位,從這個欄位中去截取內容中最後一個中文詞語。 ID SourceText Result 1 張達:U:1楊英蘋:U:1,周忱:U:1,;苗橋:U:1,章瑋:U:1,; 2 gaoying,高穎:U; 3 gaoying,高穎:U; 4 mq,苗橋;ding ...
有朋友需求一個問題,就是處理一張表中某一欄位,從這個欄位中去截取內容中最後一個中文詞語。
ID SourceText Result 1 張達:U:1楊英蘋:U:1,周忱:U:1,;苗橋:U:1,章瑋:U:1,; 2 gaoying,高穎:U; 3 gaoying,高穎:U; 4 mq,苗橋;dingjian,丁健:U;zhangwei,章瑋;zc,周忱; 5 xwj,向文傑; 6 dingjian,丁健; 7 mq;chendeyong; 8 gy,郭穎; 9 houwenjun,侯文君;lj,李軍;sunle,孫樂; 10 dingjian,丁健:U; 11 dingjian,丁健:U;zhangwei,章瑋; 12 wwm,王文明;zkl,張康亮;jiangyuan,蔣遠;fyj,範雲軍; 13 dingjian,丁健; 14 fyj,範雲軍;wwm,王文明;zkl,張康亮; 15 lww,陸維巍;Source Code
SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO CREATE TABLE [dbo].[DataSource]( [ID] [int] IDENTITY(1,1) NOT NULL, [SourceText] [nvarchar](100) NULL, --原始值 [Result] [nvarchar](100) NULL --處理結果 ) ON [PRIMARY] GOSource Code
另外,你還要創建另外一張表,用來存儲所有字元串中,分隔符號:
SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER ON GO CREATE TABLE [dbo].[Punctuation]( [ID] [int] IDENTITY(1,1) NOT NULL, [Name] [nvarchar](2) NULL ) ON [PRIMARY] GOSource Code
越來越接近我們需求的結果了:
此時,我需要對處理的結果,再次處理,得需要瞭解下麵二個函數:
《使用XQuery的nodes()方法實現字元拆分》http://www.cnblogs.com/insus/archive/2012/02/26/2368283.html
或
《MS SQL Server字元拆分函數》http://www.cnblogs.com/insus/p/3163564.html
其實二個函數,最終處理結果是一樣的。就是分割字元串,然後放在一張表中
回到剛纔的代碼中,我們只管添加一行代碼即可實現了我們的要求:
DECLARE @r INT = 1,@rs INT = 0 SELECT @rs = MAX([ID]) FROM [dbo].[DataSource] WHILE @r <= @rs BEGIN DECLARE @Text NVARCHAR(100) SELECT @Text = [SourceText] FROM [dbo].[DataSource] WHERE [ID] = @r WHILE PATINDEX('%[A-Za-z0-9]%',@Text) > 0 SET @Text = STUFF(@Text,PATINDEX('%[A-Za-z0-9]%',@Text),1,' ') DECLARE @x INT = 1,@xs INT = 0 SELECT @xs = MAX([ID]) FROM [dbo].[Punctuation] WHILE @x <= @xs BEGIN DECLARE @p NVARCHAR(2) SELECT @p = [Name] FROM [dbo].[Punctuation] WHERE [ID] = @x SET @Text = RTRIM(LTRIM(REPLACE(@Text,@p,' '))) SET @x = @x + 1 END SELECT TOP 1 @Text = [WORD] FROM [dbo].[udf_Split](@Text,' ') WHERE LEN(ISNULL([WORD],''))> 0 ORDER BY [ID] DESC UPDATE [dbo].[DataSource] SET [Result] = @Text WHERE [ID] = @r SET @r = @r + 1 END GOSource Code
OK,這就是實現的全部過程。但是,我們應該不滿足上面的代碼。既然都使用正則來去除字母,數字,那標點符號可以使用正則來去除對吧。
所以說,我們不必再創建一個表來存儲標點符號了。
DECLARE @r INT = 1,@rs INT = 0 SELECT @rs = MAX([ID]) FROM [dbo].[DataSource] WHILE @r <= @rs BEGIN DECLARE @Text NVARCHAR(100) SELECT @Text = [SourceText] FROM [dbo].[DataSource] WHERE [ID] = @r WHILE PATINDEX('%[A-Za-z0-9:,;]%',@Text) > 0 SET @Text = STUFF(@Text,PATINDEX('%[A-Za-z0-9:,;]%',@Text),1,' ') SET @Text = LTRIM(RTRIM(@Text)) SELECT TOP 1 @Text = [WORD] FROM [dbo].[udf_Split](@Text,' ') WHERE LEN(ISNULL([WORD],''))> 0 ORDER BY [ID] DESC UPDATE [dbo].[DataSource] SET [Result] = @Text WHERE [ID] = @r SET @r = @r + 1 ENDSource Code
最終的結果一樣,代碼很了不少!!!