Dataset statistics
| Number of variables | 4 |
|---|---|
| Number of observations | 55125 |
| Missing cells | 0 |
| Missing cells (%) | 0.0% |
| Duplicate rows | 0 |
| Duplicate rows (%) | 0.0% |
| Total size in memory | 1.7 MiB |
| Average record size in memory | 32.0 B |
Variable types
| Text | 2 |
|---|---|
| Categorical | 1 |
| Numeric | 1 |
Alerts
| pos is highly imbalanced (72.7%) | Imbalance |
Reproduction
| Analysis started | 2024-08-20 08:04:28.787013 |
|---|---|
| Analysis finished | 2024-08-20 08:04:30.377965 |
| Duration | 1.59 seconds |
| Software version | ydata-profiling v4.8.3 |
| Download configuration | config.json |
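The reported duration can be checked against the two timestamps. A minimal sketch using only the standard library, with the timestamps copied from the table above:

```python
from datetime import datetime

# Timestamps copied from the Reproduction table.
started = datetime.fromisoformat("2024-08-20 08:04:28.787013")
finished = datetime.fromisoformat("2024-08-20 08:04:30.377965")

# Elapsed wall-clock time of the analysis, in seconds.
duration = (finished - started).total_seconds()
print(f"{duration:.2f} seconds")
```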
lemma
Text
| Distinct | 52671 |
|---|---|
| Distinct (%) | 95.5% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 430.8 KiB |
Length
| Max length | 18 |
|---|---|
| Median length | 2 |
| Mean length | 2.5282902 |
| Min length | 1 |
Characters and Unicode
| Total characters | 139372 |
|---|---|
| Distinct characters | 4403 |
| Distinct categories | 1 |
| Distinct scripts | 1 |
| Distinct blocks | 1 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
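As an illustration of such character properties, the sketch below uses Python's standard `unicodedata` module to look up the general category and name of characters from the sample lemmas. It is not how ydata-profiling computes its categories, scripts, or blocks (the report lists all three as "(unknown)"), and the standard library exposes neither script nor block, so only the category and name are shown.

```python
import unicodedata
from collections import Counter

# The five sample lemmas listed in this report.
lemmas = ["優れる", "良い", "喜ぶ", "褒める", "めでたい"]

def category_counts(words):
    """Count Unicode general categories over every character in words."""
    return Counter(unicodedata.category(ch) for word in words for ch in word)

counts = category_counts(lemmas)
print(counts)

# Character names hint at the script for most Japanese code points.
for ch in "るー優":
    print(ch, unicodedata.category(ch), unicodedata.name(ch))
```

Kanji and kana letters share the general category `Lo` (other letter), while the long-vowel mark `ー` is `Lm` (modifier letter), so the general category alone does not separate hiragana, katakana, and kanji.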
Unique
| Unique | 50621 |
|---|---|
| Unique (%) | 91.8% |
Sample
| 1st row | 優れる |
|---|---|
| 2nd row | 良い |
| 3rd row | 喜ぶ |
| 4th row | 褒める |
| 5th row | めでたい |
| Value | Count | Frequency (%) |
| ホーム | 12 | < 0.1% |
| 大和 | 11 | < 0.1% |
| 太刀 | 9 | < 0.1% |
| 頭 | 8 | < 0.1% |
| ダンス | 8 | < 0.1% |
| アップ | 8 | < 0.1% |
| ペーパー | 7 | < 0.1% |
| 大人 | 7 | < 0.1% |
| 一人 | 7 | < 0.1% |
| 端 | 7 | < 0.1% |
| Other values (52648) | 55211 | 99.8% |
Most occurring characters
| Value | Count | Frequency (%) |
| る | 2568 | 1.8% |
| ー | 1747 | 1.3% |
| り | 1488 | 1.1% |
| ン | 1329 | 1.0% |
| い | 1313 | 0.9% |
| し | 1073 | 0.8% |
| す | 1028 | 0.7% |
| ス | 793 | 0.6% |
| ト | 704 | 0.5% |
| く | 638 | 0.5% |
| Other values (4393) | 126691 | 90.9% |
Most occurring categories
| Value | Count | Frequency (%) |
| (unknown) | 139372 | 100.0% |
Most occurring scripts
| Value | Count | Frequency (%) |
| (unknown) | 139372 | 100.0% |
Most occurring blocks
| Value | Count | Frequency (%) |
| (unknown) | 139372 | 100.0% |
reading
Text
| Distinct | 43079 |
|---|---|
| Distinct (%) | 78.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 430.8 KiB |
Length
| Max length | 18 |
|---|---|
| Median length | 4 |
| Mean length | 4.1593288 |
| Min length | 1 |
Characters and Unicode
| Total characters | 229283 |
|---|---|
| Distinct characters | 154 |
| Distinct categories | 1 |
| Distinct scripts | 1 |
| Distinct blocks | 1 |
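The mean length reported for each text-like column should equal its total character count divided by the number of observations. A short consistency check, with all figures copied from this report:

```python
# Values copied from the report tables.
n_rows = 55125
total_chars = {"lemma": 139372, "reading": 229283, "pos": 110917}

# Mean length per column = total characters / number of observations.
mean_length = {col: chars / n_rows for col, chars in total_chars.items()}
for col, value in mean_length.items():
    print(f"{col}: {value:.7f}")
```

The results agree with the reported mean lengths of 2.5282902, 4.1593288, and 2.0120998.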
Unique
| Unique | 37016 |
|---|---|
| Unique (%) | 67.1% |
Sample
| 1st row | すぐれる |
|---|---|
| 2nd row | よい |
| 3rd row | よろこぶ |
| 4th row | ほめる |
| 5th row | めでたい |
| Value | Count | Frequency (%) |
| しょう | 45 | 0.1% |
| し | 40 | 0.1% |
| かん | 37 | 0.1% |
| き | 34 | 0.1% |
| そう | 33 | 0.1% |
| とう | 30 | 0.1% |
| せい | 29 | 0.1% |
| けい | 28 | 0.1% |
| けん | 27 | < 0.1% |
| かい | 25 | < 0.1% |
| Other values (43055) | 54975 | 99.4% |
Most occurring characters
| Value | Count | Frequency (%) |
| う | 17636 | 7.7% |
| ん | 16437 | 7.2% |
| い | 13587 | 5.9% |
| し | 10682 | 4.7% |
| く | 8244 | 3.6% |
| き | 7118 | 3.1% |
| ょ | 6936 | 3.0% |
| か | 6878 | 3.0% |
| り | 4794 | 2.1% |
| こ | 4738 | 2.1% |
| Other values (144) | 132233 | 57.7% |
Most occurring categories
| Value | Count | Frequency (%) |
| (unknown) | 229283 | 100.0% |
Most occurring scripts
| Value | Count | Frequency (%) |
| (unknown) | 229283 | 100.0% |
Most occurring blocks
| Value | Count | Frequency (%) |
| (unknown) | 229283 | 100.0% |
pos
Categorical
IMBALANCE 
| Distinct | 5 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 430.8 KiB |
| 名詞 | 48999 |
|---|---|
| 動詞 | 4252 |
| 副詞 | 1207 |
| 形容詞 | 665 |
| 助動詞 | 2 |
Length
| Max length | 3 |
|---|---|
| Median length | 2 |
| Mean length | 2.0120998 |
| Min length | 2 |
Characters and Unicode
| Total characters | 110917 |
|---|---|
| Distinct characters | 7 |
| Distinct categories | 1 |
| Distinct scripts | 1 |
| Distinct blocks | 1 |
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Sample
| 1st row | 動詞 |
|---|---|
| 2nd row | 形容詞 |
| 3rd row | 動詞 |
| 4th row | 動詞 |
| 5th row | 形容詞 |
Common Values
| Value | Count | Frequency (%) |
| 名詞 | 48999 | 88.9% |
| 動詞 | 4252 | 7.7% |
| 副詞 | 1207 | 2.2% |
| 形容詞 | 665 | 1.2% |
| 助動詞 | 2 | < 0.1% |
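The 72.7% imbalance figure for pos can be reproduced from these counts if the score is taken to be one minus the Shannon entropy of the class distribution, normalised by the maximum entropy log2(k) for k classes. That definition is an assumption here (the report itself does not state the formula), and `imbalance_score` is an illustrative name, not a ydata-profiling function.

```python
import math

# Class counts from the Common Values table for pos.
pos_counts = {"名詞": 48999, "動詞": 4252, "副詞": 1207, "形容詞": 665, "助動詞": 2}

def imbalance_score(counts):
    """Return 1 - H / log2(k): 0 for a uniform split, near 1 when one class dominates."""
    total = sum(counts)
    k = len(counts)
    if k < 2 or total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts if c > 0)
    return 1.0 - entropy / math.log2(k)

imbalance = imbalance_score(list(pos_counts.values()))
n_rows = sum(pos_counts.values())
shares = {pos: n / n_rows for pos, n in pos_counts.items()}

print(f"imbalance = {imbalance:.1%}")
print({pos: f"{share:.1%}" for pos, share in shares.items()})
```

Under this assumption the score comes out at 72.7%, matching the alert, and 名詞 accounts for 88.9% of the rows.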
Length
Histogram of lengths of the category
Most occurring characters
| Value | Count | Frequency (%) |
| 詞 | 55125 | 49.7% |
| 名 | 48999 | 44.2% |
| 動 | 4254 | 3.8% |
| 副 | 1207 | 1.1% |
| 形 | 665 | 0.6% |
| 容 | 665 | 0.6% |
| 助 | 2 | < 0.1% |
Most occurring categories
| Value | Count | Frequency (%) |
| (unknown) | 110917 | 100.0% |
Most occurring scripts
| Value | Count | Frequency (%) |
| (unknown) | 110917 | 100.0% |
Most occurring blocks
| Value | Count | Frequency (%) |
| (unknown) | 110917 | 100.0% |
score
Real number (ℝ)
| Distinct | 53034 |
|---|---|
| Distinct (%) | 96.2% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | -0.31976353 |
| Minimum | -1 |
| Maximum | 1 |
| Zeros | 20 |
| Zeros (%) | < 0.1% |
| Negative | 49983 |
| Negative (%) | 90.7% |
| Memory size | 430.8 KiB |
Quantile statistics
| Minimum | -1 |
|---|---|
| 5-th percentile | -0.9785214 |
| Q1 | -0.522353 |
| median | -0.339964 |
| Q3 | -0.176277 |
| 95-th percentile | 0.3779788 |
| Maximum | 1 |
| Range | 2 |
| Interquartile range (IQR) | 0.346076 |
Descriptive statistics
| Standard deviation | 0.38273805 |
|---|---|
| Coefficient of variation (CV) | -1.1969409 |
| Kurtosis | 3.6076668 |
| Mean | -0.31976353 |
| Median Absolute Deviation (MAD) | 0.171778 |
| Skewness | 1.3628983 |
| Sum | -17626.964 |
| Variance | 0.14648841 |
| Monotonicity | Decreasing |
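Several of these statistics are derived from one another, which allows a quick consistency check. The sketch below recomputes the variance, coefficient of variation, interquartile range, and sum from the other reported values; every input is copied from the score tables.

```python
# Values copied from the score tables.
mean = -0.31976353
std = 0.38273805
q1, q3 = -0.522353, -0.176277
n_rows = 55125

variance = std ** 2      # reported: 0.14648841
cv = std / mean          # reported: -1.1969409
iqr = q3 - q1            # reported: 0.346076
total = mean * n_rows    # reported Sum: -17626.964

print(f"variance={variance:.8f} cv={cv:.7f} iqr={iqr:.6f} sum={total:.3f}")
```

The coefficient of variation is negative only because the mean is negative; its magnitude, about 1.2, indicates that the spread is larger than the absolute mean.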
Histogram with fixed size bins (bins=50)
Common Values
| Value | Count | Frequency (%) |
| -0.130089 | 22 | < 0.1% |
| 0 | 20 | < 0.1% |
| -0.127868 | 10 | < 0.1% |
| 0.996701 | 10 | < 0.1% |
| -0.0199173 | 9 | < 0.1% |
| 0.96851 | 7 | < 0.1% |
| -0.133944 | 7 | < 0.1% |
| -0.0752967 | 6 | < 0.1% |
| -0.170712 | 6 | < 0.1% |
| 0.996789 | 5 | < 0.1% |
| Other values (53024) | 55023 | 99.8% |
Minimum 10 values
| Value | Count | Frequency (%) |
| -1 | 1 | < 0.1% |
| -0.999999 | 1 | < 0.1% |
| -0.999998 | 1 | < 0.1% |
| -0.999997 | 2 | < 0.1% |
| -0.999961 | 1 | < 0.1% |
| -0.999947 | 1 | < 0.1% |
| -0.999882 | 1 | < 0.1% |
| -0.99986 | 1 | < 0.1% |
| -0.999831 | 1 | < 0.1% |
| -0.999805 | 1 | < 0.1% |
Maximum 10 values
| Value | Count | Frequency (%) |
| 1 | 1 | < 0.1% |
| 0.999995 | 1 | < 0.1% |
| 0.999979 | 2 | < 0.1% |
| 0.999645 | 1 | < 0.1% |
| 0.999486 | 1 | < 0.1% |
| 0.999314 | 1 | < 0.1% |
| 0.999295 | 1 | < 0.1% |
| 0.999267 | 1 | < 0.1% |
| 0.999122 | 1 | < 0.1% |
| 0.999104 | 1 | < 0.1% |
Correlations
| | pos | score |
|---|---|---|
| pos | 1.000 | 0.092 |
| score | 0.092 | 1.000 |
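The report does not say which measure produces the 0.092 between pos and score. One common choice for pairing a categorical variable with a binned numeric one is Cramér's V; the sketch below assumes that measure, without bias correction, and runs it on hypothetical counts rather than on this dataset. The function name `cramers_v` and both example tables are illustrative only.

```python
import math

def cramers_v(table):
    """Uncorrected Cramér's V for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    min_dim = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or min_dim <= 0:
        return 0.0
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * min_dim))

# Hypothetical counts (not from the dataset): rows are categories, columns are score bins.
independent = [[20, 20], [10, 10]]
dependent = [[30, 0], [0, 30]]
print(cramers_v(independent), cramers_v(dependent))
```

A value near 0, such as the 0.092 reported here, would indicate that the part of speech says little about which score bin a word falls into.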
A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly spot patterns in data completion.
First rows
| | lemma | reading | pos | score |
|---|---|---|---|---|
| 0 | 優れる | すぐれる | 動詞 | 1.000000 |
| 1 | 良い | よい | 形容詞 | 0.999995 |
| 2 | 喜ぶ | よろこぶ | 動詞 | 0.999979 |
| 3 | 褒める | ほめる | 動詞 | 0.999979 |
| 4 | めでたい | めでたい | 形容詞 | 0.999645 |
| 5 | 賢い | かしこい | 形容詞 | 0.999486 |
| 6 | 善い | いい | 形容詞 | 0.999314 |
| 7 | 適す | てきす | 動詞 | 0.999295 |
| 8 | 天晴 | あっぱれ | 名詞 | 0.999267 |
| 9 | 祝う | いわう | 動詞 | 0.999122 |
Last rows
| | lemma | reading | pos | score |
|---|---|---|---|---|
| 55115 | 下手 | へた | 名詞 | -0.999831 |
| 55116 | 卑しい | いやしい | 形容詞 | -0.999860 |
| 55117 | ない | ない | 形容詞 | -0.999882 |
| 55118 | 浸ける | つける | 動詞 | -0.999947 |
| 55119 | 罵る | ののしる | 動詞 | -0.999961 |
| 55120 | ない | ない | 助動詞 | -0.999997 |
| 55121 | 酷い | ひどい | 形容詞 | -0.999997 |
| 55122 | 病気 | びょうき | 名詞 | -0.999998 |
| 55123 | 死ぬ | しぬ | 動詞 | -0.999999 |
| 55124 | 悪い | わるい | 形容詞 | -1.000000 |
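The rows above show the same lemma, ない, appearing twice with different parts of speech and different scores, which is consistent with lemma being less than 100% distinct. A minimal sketch of why a lookup built from this table might key on (lemma, pos) rather than lemma alone; the rows are copied from the sample tables, and the dictionary names are illustrative.

```python
# Rows copied from the first and last rows of the report: (lemma, reading, pos, score).
rows = [
    ("優れる", "すぐれる", "動詞", 1.000000),
    ("良い", "よい", "形容詞", 0.999995),
    ("ない", "ない", "形容詞", -0.999882),
    ("ない", "ない", "助動詞", -0.999997),
    ("悪い", "わるい", "形容詞", -1.000000),
]

# Keying on lemma alone silently keeps only the last ない row.
by_lemma = {lemma: score for lemma, _, _, score in rows}

# Keying on (lemma, pos) keeps both entries.
by_lemma_pos = {(lemma, pos): score for lemma, _, pos, score in rows}

print(len(by_lemma), len(by_lemma_pos))
print(by_lemma_pos[("ない", "形容詞")], by_lemma_pos[("ない", "助動詞")])
```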