All times are UTC


 Post subject: Irish words by frequency
PostPosted: Thu 01 Jul 2021 5:10 pm 
Offline

Joined: Thu 27 May 2021 3:22 am
Posts: 426
From the Nua-Chorpas. Michal Boleslav Měchura sent it to me years ago, but it is under the Open Database Licence.

I believe column 1 is the frequency rank (1 = most frequent).
Column 2 is the word (or lemma).
Column 3 is the frequency in the corpus (the number of times it occurs in the whole corpus of 30m words).
Column 4 is the window size (I think a window size of 25 means the word occurs on average once in every 25 words of Irish).
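
The reading of column 4 can be checked against the figures themselves. Here is a minimal Python sketch, assuming the window size is the corpus token count divided by the word's count and rounded down. The token count is not stated anywhere in the list: it is inferred here from the polla entry (101 occurrences, window 340180), and window_size is a hypothetical helper name.

```python
# Assumption: window = corpus tokens // occurrences (integer division).
# The corpus size is inferred, not given by the source: 101 * 340180.
ASSUMED_CORPUS_TOKENS = 101 * 340180  # 34,358,180


def window_size(count: int, corpus_tokens: int = ASSUMED_CORPUS_TOKENS) -> int:
    """Average number of running words between occurrences, rounded down."""
    return corpus_tokens // count


# (word, count, window) rows copied from the list
rows = [
    ("an", 1338874, 25),
    ("bí", 1194301, 28),
    ("ar", 898707, 38),
    ("comhairle", 26702, 1286),
]

# Rows whose listed window differs from the recomputed one
mismatches = [(w, c, win) for w, c, win in rows if window_size(c) != win]
print(mismatches)
```

If the assumption holds, the token count behind column 4 would be nearer 34.4m than a round 30m, though that is only an inference from the figures.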

I became interested in word frequency when studying Russian. Nick Brown's Learner's Frequency Dictionary of Russian arranges the 10,000 most frequent words by frequency (note: this always depends on what the corpus consists of, so the frequency is only approximate). He added that there were words like the Russian for "woodpecker" (дятел) that native speakers would know, but that were not very helpful to learners, as you could read Russian every day for 10 years and not come across them. He considered only words occurring at least 10 times per 1m words worth learning. There are 8,000 such words in Russian, but he rounded the book out to 10,000 words, which takes in all words occurring at least 8 times per million.

Now Irish is periphrastic, so the raw number of words will be smaller. E.g. "eviction" is a single word in English, but in Irish cur ó sheilbh uses three frequent words to make a new meaning. In fact, in the Nua-Chorpas, there are only 4,122 Irish words that occur at least 8 times per 1m words (i.e. with a window size of 125,000 or less). There are many anomalies: 6343rd is polla, occurring 101 times, or once every 340,180 words. Surely the word is more common than that? This may reflect the type of works fed into the corpus. Here are the 100 most common Irish words. If people are interested, we could gradually work through the list of 6,450 words that Michal sent me from the Nua-Chorpas.
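
The link between "times per million" and window size is simple arithmetic, and a short Python sketch may make it concrete. It assumes a window size means "one occurrence per that many words", as guessed above; the function names (per_million, window_for_rate, meets_cutoff) are made up for illustration.

```python
def per_million(window: int) -> float:
    """Occurrences per million words for a given window size."""
    return 1_000_000 / window


def window_for_rate(rate_per_million: float) -> float:
    """Window size that corresponds to a given per-million rate."""
    return 1_000_000 / rate_per_million


# 8 occurrences per million is the cut-off used for the 10,000-word dictionary
CUTOFF_WINDOW = window_for_rate(8)


def meets_cutoff(window: int, cutoff: float = CUTOFF_WINDOW) -> bool:
    # A smaller window means a more frequent word
    return window <= cutoff


print(CUTOFF_WINDOW)                   # 125000.0
print(round(per_million(340180), 2))   # polla: 2.94 per million
print(meets_cutoff(1286), meets_cutoff(340180))
```

So polla, at roughly 3 occurrences per million, falls well below the 8-per-million cut-off, while comhairle (window 1286) is comfortably above it.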

1 an 1338874 25
2 bí 1194301 28
3 ar 898707 38
4 agus 856233 40
5 is 678055 50
6 ag 673684 51
7 le 663052 51
8 na 660024 52
9 do 526579 65
10 go 458180 74
11 de 304296 112
12 sé 295900 116
13 sin 243901 140
14 ó 240522 142
15 é 212565 161
16 seo 186776 183
17 cuir 181783 189
18 mar 181317 189
19 ach 174944 196
20 déan 169196 203 [=dein in Cork]
21 faoi 150173 228 [=fé in Cork]
22 nó 142220 241 [pronounced nú in Cork]
23 duine 139569 246
24 tabhair 123139 279
25 féin 114602 299
26 ní 104620 328
27 aon 100018 343
28 as 98622 348
29 chun 96082 357
30 eile 94831 362
31 abair 94140 364
32 mé 91318 376 [usually pronounced me as an object pronoun in Cork]
33 tar 91096 377 [=tair in Cork]
34 cuid 87857 391
35 maith 87286 393
36 faigh 86973 395
37 sí 81913 419
38 ná 79199 433
39 bliain 75787 453
40 siad 75348 455
41 téigh 74714 459
42 nuair 73679 466
43 iad 67270 510
44 amach 63887 537
45 mo 63778 538
46 cé 62903 546
47 nach 61995 554 [=nách in Cork]
48 bain 60640 566
49 ceann 58819 584
50 gach 55191 622
51 tú 54337 632 [usually tu or thu when an object pronoun in Cork]
52 rud 54179 634
53 í 53027 647
54 caith 52901 649
55 Gaeilge 52339 656 [=Gaelainn in Cork]
56 trí 52004 660
57 gan 51455 667
58 féidir 50408 681
59 lá 48892 702
60 chomh 47797 718
61 fear 45850 749
62 isteach 45573 753
63 fad 45242 759
64 áit 44613 770
65 beag 44314 775
66 am 43223 794
67 chuig 41165 834 [a variant of chun, so not used in Cork]
68 Éire 41141 835
69 obair 41108 835
70 céad 40394 850
71 amháin 40383 850
72 taobh 39944 860
73 anois 39654 866
74 céile 38960 881
75 mac 38875 883
76 feic 38852 884
77 níos 38529 891
78 má 37692 911
79 teach 37246 922 [=tigh in Cork]
80 ceart 36986 928
81 gur 36788 933
82 idir 36440 942
83 scéal 35691 962
84 tír 35130 978
85 saol 34478 996
86 bith 34266 1002 [only really in 'ar bith']
87 roimh 33297 1031 [usually roim in Cork]
88 féad 32801 1047
89 ceist 32045 1072
90 ansin 31686 1084 [=ansan in Cork]
91 deireadh 30577 1123
92 bean 29714 1156
93 dóigh 29194 1176 [pronounced dó in Cork]
94 dá 28842 1191
95 fios 28504 1205
96 uair 28084 1223
97 alt 27940 1229
98 te 27935 1229
99 pobal 27643 1242
100 comhairle 26702 1286
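
For anyone who wants to load the rows above into a script, here is a rough Python sketch of a parser. It assumes each row is "rank lemma count window", optionally followed by a note in square brackets, and that the lemma is a single token with no spaces; Entry and parse_row are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    rank: int
    lemma: str
    count: int
    window: int
    note: Optional[str]  # the bracketed dialect note, if any


# rank, lemma, count, window, then an optional [note]
ROW = re.compile(r"^(\d+)\s+(\S+)\s+(\d+)\s+(\d+)(?:\s+\[(.*)\])?\s*$")


def parse_row(line: str) -> Entry:
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised row: {line!r}")
    rank, lemma, count, window, note = m.groups()
    return Entry(int(rank), lemma, int(count), int(window), note)


# Three rows copied from the list above
sample = """\
20 déan 169196 203 [=dein in Cork]
23 duine 139569 246
86 bith 34266 1002 [only really in 'ar bith']
"""

entries = [parse_row(line) for line in sample.splitlines()]
for e in entries:
    print(e)
```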


PostPosted: Sun 29 May 2022 8:05 pm 
Offline

Joined: Thu 26 May 2022 4:04 pm
Posts: 11
This is old, but I would be interested in seeing that list if you have it as a PDF, or are willing to post more of the words.

thanks,


PostPosted: Sun 29 May 2022 8:29 pm 
Offline

Joined: Thu 27 May 2021 3:22 am
Posts: 426
gilmo789 wrote:
this is old, but I would be interested in seeing that list if you have it as a pdf, or are willing to post more of the words.

thanks,


Yes, I have the whole list, but you can't post PDFs to this site. You can email me at foghlamthoir@gmail.com and I will send it to you. I'll have to find it, but I have it in a folder.


PostPosted: Mon 30 May 2022 3:54 pm 
Offline

Joined: Thu 26 May 2022 4:04 pm
Posts: 11
emailed


PostPosted: Mon 30 May 2022 11:22 pm 
Offline

Joined: Thu 27 May 2021 3:22 am
Posts: 426
Hi, did you notice that there are 4,122 words in that list that have a window size of no more than 125,000 (i.e. are found at least 8 times per million)? Yet in the Russian frequency dictionary, there are 10,000 Russian words with that frequency.

This reflects the way in which Irish makes phrases out of existing words (Irish is periphrastic). E.g. in English "evict" is a separate word, but in Irish "cur ó sheilbh", using three words that occur in the list, can be brought together to mean "evict". So the 4122 headwords will make enough phrases to cover the same ground as 10,000 Russian ones.

The list is lemmatised, meaning that deinim, deineann, dhein sé, déanfad are not all separately listed. They occur just as déan.
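
To show what lemmatised counting amounts to, here is a toy Python sketch. The lookup table is hand-written for illustration only (a real corpus would use a proper lemmatiser), and FORM_TO_LEMMA and lemma_counts are made-up names.

```python
from collections import Counter

# Hypothetical hand-written lookup: inflected form -> headword (lemma)
FORM_TO_LEMMA = {
    "deinim": "déan",
    "deineann": "déan",
    "dhein": "déan",
    "déanfad": "déan",
    "bhí": "bí",
}


def lemma_counts(tokens):
    """Count tokens under their lemma; unknown forms count as themselves."""
    return Counter(FORM_TO_LEMMA.get(t.lower(), t.lower()) for t in tokens)


tokens = ["Deinim", "dhein", "sé", "déanfad", "deineann", "bhí", "sé"]
counts = lemma_counts(tokens)
print(counts)
```

All four forms of the verb are added to the single entry déan, which is why the list shows one count per headword rather than one per inflected form.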

As you can see, proficiency in Irish should be easy in terms of a pretty low vocabulary load. It is the different way of phrasing things that adds to the difficulties.


PostPosted: Mon 30 May 2022 11:54 pm 
Offline

Joined: Thu 26 May 2022 4:04 pm
Posts: 11
I hadn't noticed the first point. That's encouraging.

I guessed the second point. Are the words all in the nominative or verbal noun form, or are they given in their most common incarnation?

I can see this isn't the case, but if, say, bhí were more common than bí, would bí still be given?

thanks again


PostPosted: Tue 31 May 2022 12:32 am 
Offline

Joined: Thu 27 May 2021 3:22 am
Posts: 426
I think verbs are listed under the imperative only. This is the point of lemmatisation.


PostPosted: Wed 01 Jun 2022 9:50 am 
Offline

Joined: Mon 01 Sep 2014 10:03 pm
Posts: 496
Location: SAM
djwebb2021 wrote:
This reflects the way in which Irish makes phrases out of existing words (Irish is periphrastic). E.g. in English "evict" is a separate word, but in Irish "cur ó sheilbh", using three words that occur in the list, can be brought together to mean "evict". So the 4122 headwords will make enough phrases to cover the same ground as 10,000 Russian ones.


Does it, or does it just reflect the weakness of the Irish speakers in the corpus? I'm not sure which corpus it pulls from, but if it's sufficiently modern (post-1960), I would wager it reflects more the declining usage of words among younger, weaker speakers. Or, probably, a combination of the two.


PostPosted: Wed 01 Jun 2022 10:50 am 
Offline

Joined: Thu 27 May 2021 3:22 am
Posts: 426
galaxyrocker wrote:
djwebb2021 wrote:
This reflects the way in which Irish makes phrases out of existing words (Irish is periphrastic). E.g. in English "evict" is a separate word, but in Irish "cur ó sheilbh", using three words that occur in the list, can be brought together to mean "evict". So the 4122 headwords will make enough phrases to cover the same ground as 10,000 Russian ones.


Does it, or does it just reflect the weakness of the Irish speakers in the corpus? I'm not sure which corpus it pulls from, but if it's sufficiently modern (post-1960), I would wager it reflects more the declining usage of words among younger, weaker speakers. Or, probably, a combination of the two.


I think it's from the modern Nua-Chorpas, with 30m words in it. As you say, probably a combination of the two, but Irish is a lot more periphrastic. Also "tabhairt" is a word in the list, but actually "tabhairt fé", "tabhairt suas" and things like that should ideally be separate entries.

