Overview

Dataset statistics

Number of variables: 4
Number of observations: 55125
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 1.7 MiB
Average record size in memory: 32.0 B

Variable types

Text: 2
Categorical: 1
Numeric: 1

Alerts

Imbalance: pos is highly imbalanced (72.7%)
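
The 72.7% figure is consistent with an imbalance score defined as one minus the normalized Shannon entropy of the class distribution. The report does not state that definition, so the sketch below treats it as an assumption and checks it against the pos counts listed under Common Values; the function name `imbalance_score` is hypothetical.

```python
import math

# Class counts for `pos`, copied from the Common Values table of this report.
pos_counts = {"名詞": 48999, "動詞": 4252, "副詞": 1207, "形容詞": 665, "助動詞": 2}


def imbalance_score(counts):
    """Return 1 - H / log(k): 0 for uniform classes, approaching 1 for one dominant class.

    Assumed definition; the report itself only prints the resulting percentage.
    """
    total = sum(counts.values())
    k = len(counts)
    if k < 2:
        return 0.0  # assumed convention: a single class has nothing to normalise
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)
    return 1.0 - entropy / math.log(k)


score = imbalance_score(pos_counts)
print(f"{score:.1%}")
```

Under this assumed definition the five counts reproduce the reported value to one decimal place, which suggests the alert is driven almost entirely by the 名詞 class holding 88.9% of the rows.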

Reproduction

Analysis started: 2024-08-20 08:04:28.787013
Analysis finished: 2024-08-20 08:04:30.377965
Duration: 1.59 seconds
Software version: ydata-profiling v4.8.3
Download configuration: config.json
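
A report like this one can be regenerated along the following lines. This is a minimal sketch, not the original invocation: the input path, the headerless CSV layout, the report title, and the helper name `profile_lexicon` are all assumptions, and only the four column names come from the Sample section of this report.

```python
# Column names as shown in the Sample section of this report.
COLUMNS = ["lemma", "reading", "pos", "score"]


def profile_lexicon(csv_path, html_path="report.html"):
    """Profile a lemma/reading/pos/score table and write an HTML report.

    Hypothetical helper: the original configuration (config.json) is not
    reproduced here, so library defaults are used.
    """
    # Imported lazily so this module still loads where the libraries are absent.
    import pandas as pd
    from ydata_profiling import ProfileReport

    # Assumed layout: a headerless CSV with the four columns in this order.
    df = pd.read_csv(csv_path, names=COLUMNS, header=None)
    report = ProfileReport(df, title="Lexicon profile")
    report.to_file(html_path)
    return report
```

Calling `profile_lexicon("lexicon.csv")` with a real file path would write `report.html`; exact numbers can differ from this report if the library version or configuration differs.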

Variables

lemma
Text

Distinct: 52671
Distinct (%): 95.5%
Missing: 0
Missing (%): 0.0%
Memory size: 430.8 KiB

Length

Max length: 18
Median length: 2
Mean length: 2.5282902
Min length: 1

Characters and Unicode

Total characters: 139372
Distinct characters: 4403
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
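
The category, script, and block tables in this report all show a single "(unknown)" bucket, so the character properties were apparently not resolved. As an illustration of what such an analysis looks like, the sketch below counts Unicode general categories with Python's standard `unicodedata` module. The standard library does not expose script or block properties, so `coarse_script` falls back on the first word of the character name; that is a rough heuristic and an assumption of this sketch, not how the report computes scripts.

```python
import unicodedata
from collections import Counter


def category_counts(values):
    """Count Unicode general categories over every character in `values`."""
    return Counter(unicodedata.category(ch) for value in values for ch in value)


def coarse_script(ch):
    """Heuristic script label: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


# The five sample lemmas from this report, plus one katakana entry (ホーム).
samples = ["優れる", "良い", "喜ぶ", "褒める", "めでたい", "ホーム"]

counts = category_counts(samples)
scripts = Counter(coarse_script(ch) for value in samples for ch in value)
print(counts)
print(scripts)
```

Kanji and kana fall under the "Lo" (other letter) category, while the long-vowel mark ー is a modifier letter ("Lm"), so even this small sample yields more than one category.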

Unique

Unique: 50621
Unique (%): 91.8%

Sample

1st row: 優れる
2nd row: 良い
3rd row: 喜ぶ
4th row: 褒める
5th row: めでたい

Value  Count  Frequency (%)
ホーム  12  < 0.1%
大和  11  < 0.1%
太刀  9  < 0.1%
[not preserved]  8  < 0.1%
ダンス  8  < 0.1%
アップ  8  < 0.1%
ペーパー  7  < 0.1%
大人  7  < 0.1%
一人  7  < 0.1%
[not preserved]  7  < 0.1%
Other values (52648)  55211  99.8%

Most occurring characters

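Value  Count  Frequency (%)
[not preserved]  2568  1.8%
[not preserved]  1747  1.3%
[not preserved]  1488  1.1%
[not preserved]  1329  1.0%
[not preserved]  1313  0.9%
[not preserved]  1073  0.8%
[not preserved]  1028  0.7%
[not preserved]  793  0.6%
[not preserved]  704  0.5%
[not preserved]  638  0.5%
Other values (4393)  126691  90.9%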

Most occurring categories

Value  Count  Frequency (%)
(unknown)  139372  100.0%

Most occurring scripts

Value  Count  Frequency (%)
(unknown)  139372  100.0%

Most occurring blocks

Value  Count  Frequency (%)
(unknown)  139372  100.0%


reading
Text

Distinct: 43079
Distinct (%): 78.1%
Missing: 0
Missing (%): 0.0%
Memory size: 430.8 KiB

Length

Max length: 18
Median length: 14
Mean length: 4.1593288
Min length: 1

Characters and Unicode

Total characters: 229283
Distinct characters: 154
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 37016
Unique (%): 67.1%

Sample

1st row: すぐれる
2nd row: よい
3rd row: よろこぶ
4th row: ほめる
5th row: めでたい

Value  Count  Frequency (%)
しょう  45  0.1%
[not preserved]  40  0.1%
かん  37  0.1%
[not preserved]  34  0.1%
そう  33  0.1%
とう  30  0.1%
せい  29  0.1%
けい  28  0.1%
けん  27  < 0.1%
かい  25  < 0.1%
Other values (43055)  54975  99.4%

Most occurring characters

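Value  Count  Frequency (%)
[not preserved]  17636  7.7%
[not preserved]  16437  7.2%
[not preserved]  13587  5.9%
[not preserved]  10682  4.7%
[not preserved]  8244  3.6%
[not preserved]  7118  3.1%
[not preserved]  6936  3.0%
[not preserved]  6878  3.0%
[not preserved]  4794  2.1%
[not preserved]  4738  2.1%
Other values (144)  132233  57.7%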

Most occurring categories

Value  Count  Frequency (%)
(unknown)  229283  100.0%

Most occurring scripts

Value  Count  Frequency (%)
(unknown)  229283  100.0%

Most occurring blocks

Value  Count  Frequency (%)
(unknown)  229283  100.0%

pos
Categorical

IMBALANCE 

Distinct: 5
Distinct (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 430.8 KiB
名詞: 48999
動詞: 4252
副詞: 1207
形容詞: 665
助動詞: 2

Length

Max length: 3
Median length: 2
Mean length: 2.0120998
Min length: 2

Characters and Unicode

Total characters: 110917
Distinct characters: 7
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: 動詞
2nd row: 形容詞
3rd row: 動詞
4th row: 動詞
5th row: 形容詞

Common Values

Value  Count  Frequency (%)
名詞  48999  88.9%
動詞  4252  7.7%
副詞  1207  2.2%
形容詞  665  1.2%
助動詞  2  < 0.1%

Length

Histogram of lengths of the category

Most occurring characters

Value  Count  Frequency (%)
詞  55125  49.7%
名  48999  44.2%
動  4254  3.8%
副  1207  1.1%
形  665  0.6%
容  665  0.6%
助  2  < 0.1%
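
The character counts for pos follow directly from the tag counts under Common Values, because every tag is a short fixed string. The sketch below expands each tag into its characters and recomputes the per-character totals, the total character count, and the mean length; it uses only numbers printed in this report.

```python
from collections import Counter

# Tag counts for `pos`, copied from the Common Values table of this report.
pos_counts = {"名詞": 48999, "動詞": 4252, "副詞": 1207, "形容詞": 665, "助動詞": 2}

# Each occurrence of a tag contributes one occurrence of each of its characters.
char_counts = Counter()
for tag, n in pos_counts.items():
    for ch in tag:
        char_counts[ch] += n

total_chars = sum(char_counts.values())
mean_length = total_chars / sum(pos_counts.values())

for ch, n in char_counts.most_common():
    print(ch, n, f"{n / total_chars:.1%}")
print("total characters:", total_chars)
print("mean length:", round(mean_length, 7))
```

Every tag ends in 詞, which is why one character accounts for exactly 55125 occurrences, and 動 appears in both 動詞 and 助動詞, which explains the count of 4254 rather than 4252.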

Most occurring categories

Value  Count  Frequency (%)
(unknown)  110917  100.0%

Most occurring scripts

Value  Count  Frequency (%)
(unknown)  110917  100.0%

Most occurring blocks

Value  Count  Frequency (%)
(unknown)  110917  100.0%

score
Real number (ℝ)

Distinct: 53034
Distinct (%): 96.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: -0.31976353
Minimum: -1
Maximum: 1
Zeros: 20
Zeros (%): < 0.1%
Negative: 49983
Negative (%): 90.7%
Memory size: 430.8 KiB

Quantile statistics

Minimum: -1
5-th percentile: -0.9785214
Q1: -0.522353
Median: -0.339964
Q3: -0.176277
95-th percentile: 0.3779788
Maximum: 1
Range: 2
Interquartile range (IQR): 0.346076

Descriptive statistics

Standard deviation: 0.38273805
Coefficient of variation (CV): -1.1969409
Kurtosis: 3.6076668
Mean: -0.31976353
Median Absolute Deviation (MAD): 0.171778
Skewness: 1.3628983
Sum: -17626.964
Variance: 0.14648841
Monotonicity: Decreasing
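
Several of the figures above are derived from one another, which makes a quick consistency check possible. The sketch below recomputes variance, coefficient of variation, sum, IQR, and range from the reported mean, standard deviation, quartiles, and extremes. The relationships used (variance as the squared standard deviation, CV as standard deviation over mean, sum as mean times count) are standard definitions; that the report uses exactly these is an assumption.

```python
import math

# Values copied from the score section of this report.
n = 55125
mean = -0.31976353
std = 0.38273805
q1, q3 = -0.522353, -0.176277
minimum, maximum = -1.0, 1.0

# Quantities recomputed from the values above.
derived = {
    "variance": std ** 2,
    "cv": std / mean,
    "sum": mean * n,
    "iqr": q3 - q1,
    "range": maximum - minimum,
}

# The same quantities as printed in the report.
reported = {
    "variance": 0.14648841,
    "cv": -1.1969409,
    "sum": -17626.964,
    "iqr": 0.346076,
    "range": 2.0,
}

matches = {
    name: math.isclose(value, reported[name], rel_tol=1e-6)
    for name, value in derived.items()
}
for name, ok in matches.items():
    print(f"{name}: derived={derived[name]:.8g} reported={reported[name]} match={ok}")
```

The negative CV is simply a consequence of the negative mean: the ratio keeps the sign of the mean, so its magnitude (about 1.2) is the informative part.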
Histogram with fixed size bins (bins=50)

Common values

Value  Count  Frequency (%)
-0.130089  22  < 0.1%
0  20  < 0.1%
-0.127868  10  < 0.1%
0.996701  10  < 0.1%
-0.0199173  9  < 0.1%
0.96851  7  < 0.1%
-0.133944  7  < 0.1%
-0.0752967  6  < 0.1%
-0.170712  6  < 0.1%
0.996789  5  < 0.1%
Other values (53024)  55023  99.8%

Minimum 10 values

Value  Count  Frequency (%)
-1  1  < 0.1%
-0.999999  1  < 0.1%
-0.999998  1  < 0.1%
-0.999997  2  < 0.1%
-0.999961  1  < 0.1%
-0.999947  1  < 0.1%
-0.999882  1  < 0.1%
-0.99986  1  < 0.1%
-0.999831  1  < 0.1%
-0.999805  1  < 0.1%

Maximum 10 values

Value  Count  Frequency (%)
1  1  < 0.1%
0.999995  1  < 0.1%
0.999979  2  < 0.1%
0.999645  1  < 0.1%
0.999486  1  < 0.1%
0.999314  1  < 0.1%
0.999295  1  < 0.1%
0.999267  1  < 0.1%
0.999122  1  < 0.1%
0.999104  1  < 0.1%

Correlations

       pos    score
pos    1.000  0.092
score  0.092  1.000

Missing values

A simple visualization of nullity by column.
Nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion.

Sample

First rows

index  lemma  reading  pos  score
0  優れる  すぐれる  動詞  1.000000
1  良い  よい  形容詞  0.999995
2  喜ぶ  よろこぶ  動詞  0.999979
3  褒める  ほめる  動詞  0.999979
4  めでたい  めでたい  形容詞  0.999645
5  賢い  かしこい  形容詞  0.999486
6  善い  いい  形容詞  0.999314
7  適す  てきす  動詞  0.999295
8  天晴  あっぱれ  名詞  0.999267
9  祝う  いわう  動詞  0.999122

Last rows

index  lemma  reading  pos  score
55115  下手  へた  名詞  -0.999831
55116  卑しい  いやしい  形容詞  -0.999860
55117  ない  ない  形容詞  -0.999882
55118  浸ける  つける  動詞  -0.999947
55119  罵る  ののしる  動詞  -0.999961
55120  ない  ない  助動詞  -0.999997
55121  酷い  ひどい  形容詞  -0.999997
55122  病気  びょうき  名詞  -0.999998
55123  死ぬ  しぬ  動詞  -0.999999
55124  悪い  わるい  形容詞  -1.000000