April 13, 2007

[Namazu-devel-ja 1553] Re: Reducing the load of hash

This is Teranishi.

I ran the equivalent test on stable-2-0.


[stable-A1]
Before the fix, index written out 3 times
User+System Time = 308.4399 Seconds

[stable-A2]
Before the fix, index written out 1 time
User+System Time = 281.7702 Seconds

[stable-D1]
Fix 3, index written out 3 times
User+System Time = 261.3492 Seconds

[stable-D2]
Fix 3, index written out 1 time
User+System Time = 231.9596 Seconds

Those were the results. Taking the stable-A1 time as 1.0, the runs
compare as follows.

stable-A1: 1.00
stable-A2: 0.91
stable-D1: 0.85
stable-D2: 0.75 (0.89 relative to stable-D1, 0.82 relative to stable-A2)

- Comparing stable-A2 with stable-D2, stable-D2 takes 0.82 of the
stable-A2 time. (the effect of the fix)
- Comparing stable-D1 with stable-D2, stable-D2 takes 0.89 of the
stable-D1 time. (the effect of the number of index write-outs)


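The fix reduces User+System time by about 15% (stable-A1 vs. stable-D1).
If the index is also written out only once, the reduction is about 25%
(stable-A1 vs. stable-D2).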
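
As a rough check of the ratios above, the following Perl sketch recomputes
them from the four User+System times. The %time hash and the ratio() helper
are names made up for this illustration; only the numbers come from the
measurements.

```perl
use strict;
use warnings;

# User+System times in seconds, copied from the four runs reported above.
my %time = (
    'stable-A1' => 308.4399,
    'stable-A2' => 281.7702,
    'stable-D1' => 261.3492,
    'stable-D2' => 231.9596,
);

# Time of $run divided by time of $base, rounded to two decimals.
sub ratio {
    my ($run, $base) = @_;
    return sprintf('%.2f', $time{$run} / $time{$base});
}

# Every run relative to stable-A1.
for my $run (sort keys %time) {
    printf "%s: %s\n", $run, ratio($run, 'stable-A1');
}

# stable-D2 relative to the two runs it differs from by a single factor.
printf "stable-D2 relative to stable-D1: %s\n", ratio('stable-D2', 'stable-D1');
printf "stable-D2 relative to stable-A2: %s\n", ratio('stable-D2', 'stable-A2');
```

The "about 15%" and "about 25%" figures are one minus the stable-D1 and
stable-D2 ratios relative to stable-A1.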

[stable-A1]

Total Elapsed Time = 582.2299 Seconds
User+System Time = 308.4399 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
17.7 54.78 52.967 733289 0.0001 0.0001 mknmz::hash
13.7 42.54 42.488 24816 0.0017 0.0017 mknmz::wordcount_sub
9.17 28.27 80.563 4136 0.0068 0.0195 mknmz::make_phrase_hash
7.06 21.78 86.173 4136 0.0053 0.0208 mknmz::count_words
4.16 12.82 12.533 118804 0.0001 0.0001 Text::Kakasi::xs_do_kakasi
3.74 11.53 11.530 4136 0.0028 0.0028 File::MMagic::checktype_data
3.60 11.09 19.175 3 3.6976 6.3916 mknmz::write_index_sub
3.54 10.91 38.297 4136 0.0026 0.0093 mknmz::put_field_index
2.81 8.655 8.694 4136 0.0021 0.0021 mailnews::mailnews_citation_filter
2.76 8.500 16.789 3 2.8333 5.5965 mknmz::write_phrase_hash_sub
2.53 7.808 8.369 4136 0.0019 0.0020 mailnews::mailnews_filter
2.44 7.528 7.268 104675 0.0001 0.0001 mknmz::get_last_docid
2.42 7.478 7.459 8486 0.0009 0.0009 NKF::nkf
2.37 7.318 7.040 111848 0.0001 0.0001 IO::File::open
2.28 7.028 7.020 4136 0.0017 0.0017 mailnews::uuencode_filter


[stable-A2]

Total Elapsed Time = 523.7002 Seconds
User+System Time = 281.7702 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
19.0 53.65 52.203 733289 0.0001 0.0001 mknmz::hash
15.1 42.69 42.650 24816 0.0017 0.0017 mknmz::wordcount_sub
9.63 27.14 78.788 4136 0.0066 0.0190 mknmz::make_phrase_hash
7.41 20.87 85.521 4136 0.0050 0.0207 mknmz::count_words
4.52 12.74 12.512 118804 0.0001 0.0001 Text::Kakasi::xs_do_kakasi
4.06 11.44 11.442 4136 0.0028 0.0028 File::MMagic::checktype_data
3.74 10.53 39.579 4136 0.0025 0.0096 mknmz::put_field_index
3.11 8.756 8.771 4136 0.0021 0.0021 mailnews::mailnews_citation_filter
2.79 7.868 8.413 4136 0.0019 0.0020 mailnews::mailnews_filter
2.60 7.329 7.106 111828 0.0001 0.0001 IO::File::open
2.58 7.269 7.253 8486 0.0009 0.0009 NKF::nkf
2.31 6.499 6.483 8272 0.0008 0.0008 gfilter::line_adjust_filter
2.26 6.359 6.352 4136 0.0015 0.0015 mailnews::uuencode_filter
1.93 5.449 5.442 4136 0.0013 0.0013 File::MMagic::checktype_byfilename
1.86 5.239 5.016 111815 0.0000 0.0000 IO::Handle::close


[stable-D1]

Total Elapsed Time = 653.0292 Seconds
User+System Time = 261.3492 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
11.2 29.49 46.747 4136 0.0071 0.0113 mknmz::make_phrase_hash
10.7 27.98 27.928 24816 0.0011 0.0011 mknmz::wordcount_sub
8.21 21.45 71.373 4136 0.0052 0.0173 mknmz::count_words
7.56 19.74 17.829 768357 0.0000 0.0000 mknmz::hash
4.97 12.97 12.683 118804 0.0001 0.0001 Text::Kakasi::xs_do_kakasi
4.28 11.18 19.715 3 3.7276 6.5716 mknmz::write_index_sub
4.27 11.16 11.160 4136 0.0027 0.0027 File::MMagic::checktype_data
4.17 10.88 39.007 4136 0.0026 0.0094 mknmz::put_field_index
3.40 8.885 8.894 4136 0.0021 0.0022 mailnews::mailnews_citation_filter
3.26 8.510 16.969 3 2.8366 5.6565 mknmz::write_phrase_hash_sub
3.15 8.238 7.978 104675 0.0001 0.0001 mknmz::get_last_docid
2.95 7.718 8.279 4136 0.0019 0.0020 mailnews::mailnews_filter
2.91 7.598 7.579 8486 0.0009 0.0009 NKF::nkf
2.90 7.588 7.310 111848 0.0001 0.0001 IO::File::open
2.70 7.048 7.040 4136 0.0017 0.0017 mailnews::uuencode_filter


[stable-D2]

Total Elapsed Time = 658.9697 Seconds
User+System Time = 231.9596 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
12.2 28.36 45.817 4136 0.0069 0.0111 mknmz::make_phrase_hash
11.8 27.44 27.388 24816 0.0011 0.0011 mknmz::wordcount_sub
9.15 21.21 71.023 4136 0.0051 0.0172 mknmz::count_words
8.61 19.96 18.049 768357 0.0000 0.0000 mknmz::hash
5.82 13.50 13.213 118804 0.0001 0.0001 Text::Kakasi::xs_do_kakasi
4.75 11.00 11.000 4136 0.0027 0.0027 File::MMagic::checktype_data
4.69 10.87 39.767 4136 0.0026 0.0096 mknmz::put_field_index
3.93 9.115 9.144 4136 0.0022 0.0022 mailnews::mailnews_citation_filter
3.25 7.548 8.069 4136 0.0018 0.0020 mailnews::mailnews_filter
3.14 7.288 7.010 111828 0.0001 0.0001 IO::File::open
3.13 7.268 7.260 4136 0.0018 0.0018 mailnews::uuencode_filter
2.96 6.868 6.849 8486 0.0008 0.0008 NKF::nkf
2.65 6.158 6.139 8272 0.0007 0.0007 gfilter::line_adjust_filter
2.37 5.489 5.480 4136 0.0013 0.0013 File::MMagic::checktype_byfilename
2.28 5.297 9.641 111809 0.0000 0.0001 util::fclose
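
To see where the time went, the listings can be compared function by
function. The sketch below takes the ExclSec values of the four most
expensive mknmz functions from the stable-A1 and stable-D1 listings and
prints the change for each. The %excl hash and the saved() helper are names
introduced only for this illustration.

```perl
use strict;
use warnings;

# Exclusive seconds (ExclSec) copied from the stable-A1 and stable-D1 listings.
my %excl = (
    'mknmz::hash'             => { A1 => 54.78, D1 => 19.74 },
    'mknmz::wordcount_sub'    => { A1 => 42.54, D1 => 27.98 },
    'mknmz::make_phrase_hash' => { A1 => 28.27, D1 => 29.49 },
    'mknmz::count_words'      => { A1 => 21.78, D1 => 21.45 },
);

# Seconds saved by the fix for one function (negative means it got slower).
sub saved {
    my ($name) = @_;
    return $excl{$name}{A1} - $excl{$name}{D1};
}

# List the functions, largest saving first.
for my $name (sort { saved($b) <=> saved($a) } keys %excl) {
    printf "%-24s %6.2f -> %6.2f (%+.2f s)\n",
        $name, $excl{$name}{A1}, $excl{$name}{D1}, -saved($name);
}
```

Sorted this way, mknmz::hash and mknmz::wordcount_sub come first, which
matches the two functions the patch touches most directly.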


--- mknmz.in.org 2007-04-12 23:15:04.000000000 +0900
+++ mknmz.in 2007-04-13 00:56:08.000000000 +0900
@@ -71,6 +71,8 @@ my $Magic = new File::MMagic;

my $ReceiveTERM = 0;

+my @CacheHash = ( {}, {}, {}, {}, );
+
STDOUT->autoflush(1);
STDERR->autoflush(1);
main();
@@ -106,6 +108,8 @@ sub main {
my $flist_ptr = 0;
my $processed_files_size = 0;

+ @CacheHash = ( {}, {}, {}, {}, );
+
if ($CheckPoint{'continue'}) {
# Restore variables
eval util::readfile($var::NMZ{'_checkpoint'}) ;
@@ -2243,13 +2247,17 @@ sub make_phrase_hash ($$$) {
my %tmp = ();
$$contref =~ s!\x7f */? *\d+ *\x7f!!g; # remove tags of weight
$$contref =~ tr/\xa1-\xfea-z0-9 \n//cd; # remove all symbols
+ $$contref =~ s/^\s+//;
+ $$contref =~ s/\s+$//;
my @words = split(/\s+/, $$contref);
- @words = grep {$_ ne ""} @words; # remove empty words
+# @words = grep {$_ ne ""} @words; # remove empty words
my $word_b = shift @words;
my $docid = $docid_count + $docid_base;
for my $word (@words) {
- next if ($word eq "" || length($word) > $conf::WORD_LENG_MAX);
- my $hash = hash($word_b . $word);
+ next if (length($word) == 0 || length($word) > $conf::WORD_LENG_MAX);
+ $CacheHash[0]{$word_b} = hash(0, 0, $word_b)
+ if (!exists($CacheHash[0]{$word_b}));
+ my $hash = hash($CacheHash[0]{$word_b}, length($word_b), $word);
unless (defined $tmp{$hash}) {
$tmp{$hash} = 1;
$PhraseHashLast{$hash} = 0 unless defined $PhraseHashLast{$hash};
@@ -2345,17 +2353,30 @@ sub write_phrase_hash_sub () {
%PhraseHashLast = ();
}

-# Dr. Knuth's ``hash'' from (UNIX MAGAZINE May 1998)
-sub hash ($) {
- my ($word) = @_;
+sub hash ($$$) {
+ my ($prev, $start, $word) = @_;
+
+ my $offset = $start & 0x03;
+ return ($prev ^ $CacheHash[$offset]{$word}) & 65535
+ if (exists($CacheHash[$offset]{$word}));

my $hash = 0;
- for (my $i = 0; $word ne ""; $i++) {
- $hash ^= $Seed[$i & 0x03][ord($word)];
- $word = substr $word, 1;
- # $word =~ s/^.//; is slower
+ my $rword = reverse($word);
+ my $i = $offset;
+ while(length($rword)) {
+ $hash ^= $Seed[$i++ & 0x03][ord(chop($rword))];
}
- return $hash & 65535;
+ $CacheHash[$offset]{$word} = $hash & 65535;
+ return ($prev ^ $hash) & 65535;
+
+# Dr. Knuth's ``hash'' from (UNIX MAGAZINE May 1998)
+# my $hash = 0;
+# for (my $i = 0; $word ne ""; $i++) {
+# $hash ^= $Seed[$i & 0x03][ord($word)];
+# $word = substr $word, 1;
+# # $word =~ s/^.//; is slower
+# }
+# return $hash & 65535;
}

# Count frequencies of words.
@@ -2443,6 +2464,7 @@ sub wordcount_sub ($$\%) {
$word_count->{$word} = 0 unless defined($word_count->{$word});
$word_count->{$word} += $weight;
unless ($var::Opt{'nosymbol'}) {
+ next if $word !~ /[^\xa1-\xfea-z_0-9]/;
if ($word =~ /^[^\xa1-\xfea-z_0-9](.+)[^\xa1-\xfea-z_0-9]$/) {
$word_count->{$1} = 0 unless defined($word_count->{$1});
$word_count->{$1} += $weight;
@@ -2456,8 +2478,7 @@ sub wordcount_sub ($$\%) {
$word_count->{$1} += $weight;
next unless $1 =~ /[^\xa1-\xfea-z_0-9]/;
}
- my @words_ = split(/[^\xa1-\xfea-z_0-9]+/, $word)
- if $word =~ /[^\xa1-\xfea-z_0-9]/;
+ my @words_ = split(/[^\xa1-\xfea-z_0-9]+/, $word);
for my $tmp (@words_) {
next if $tmp eq "";
$word_count->{$tmp} = 0 unless defined($word_count->{$tmp});
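
The reworked hash() depends on one property of the original algorithm:
because it only XORs one seed value per byte, the hash of a concatenation
equals the hash of the first part XORed with the hash of the second part
computed from the right starting offset. The sketch below checks that
property in isolation. It is not the mknmz code itself: the @Seed contents
are hypothetical (the real table is not shown in this post), and hash_orig,
hash_incr and @Cache are names chosen for the illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for mknmz's @Seed: four tables of 256 16-bit values.
srand(42);
my @Seed = map { [ map { int(rand(65536)) } 0 .. 255 ] } 0 .. 3;

# Original scheme: XOR one seed per byte, cycling through the four tables.
sub hash_orig {
    my ($word) = @_;
    my $hash = 0;
    my $i = 0;
    for my $byte (unpack('C*', $word)) {
        $hash ^= $Seed[$i++ & 0x03][$byte];
    }
    return $hash & 65535;
}

# One cache per starting offset (0..3), as in the patch.
my @Cache = ({}, {}, {}, {});

# Incremental scheme: hash $word as if it started at byte position $start,
# cache that partial hash, and fold it into the hash of the preceding text.
sub hash_incr {
    my ($prev, $start, $word) = @_;
    my $offset = $start & 0x03;
    unless (exists $Cache[$offset]{$word}) {
        my $hash = 0;
        my $i = $offset;
        for my $byte (unpack('C*', $word)) {
            $hash ^= $Seed[$i++ & 0x03][$byte];
        }
        $Cache[$offset]{$word} = $hash & 65535;
    }
    return ($prev ^ $Cache[$offset]{$word}) & 65535;
}

# Compare both schemes on a few word pairs.
my @pairs = (['namazu', 'index'], ['a', 'bc'], ['hello', 'world'], ['abc', 'abc']);
my $mismatches = 0;
for my $pair (@pairs) {
    my ($first, $second) = @$pair;
    my $whole    = hash_orig($first . $second);
    my $prefix   = hash_incr(0, 0, $first);
    my $combined = hash_incr($prefix, length($first), $second);
    $mismatches++ if $whole != $combined;
}
print "mismatches: $mismatches\n";
```

Both schemes agree for any seed table, so the script reports zero
mismatches. The gain in the patch comes from the cache: a word that recurs
at the same offset modulo 4 costs a single hash lookup instead of a
byte-by-byte loop.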
--
=====================================================================
寺西 忠勝(TADAMASA TERANISHI) yw3t-trns@xxxxx
http://www.asahi-net.or.jp/~yw3t-trns/index.htm
Key fingerprint = 474E 4D93 8E97 11F6 662D 8A42 17F5 52F4 10E7 D14E

_______________________________________________
Namazu-devel-ja mailing list
Namazu-devel-ja@xxxxx
http://www.namazu.org/cgi-bin/mailman/listinfo/namazu-devel-ja

Posted by xml-rpc: April 13, 2007 01:53