1 INTRODUCTION
Searching for accurate algorithms to optimize today's highly computational real-world problems is a critical and challenging task that has attracted a great deal of effort in the last decade. For instance, Barshandeh and Haghzadeh1 proposed a novel hybrid physics-based, nature-inspired meta-heuristic algorithm named the proposed hybrid optimization algorithm (PHOA). They integrated the atom search optimization (ASO) and tree-seed algorithm (TSA) to improve on conventional meta-heuristic algorithms; moreover, PHOA was also tested on seven real-life engineering problems, and its results were superior to those of conventional algorithms. In addition, Barshandeh et al.2 proposed a novel hybrid multipopulation algorithm (HMPA) that combined the artificial ecosystem-based optimization (AEO) and Harris hawks optimization (HHO) algorithms and then adopted the Levy-flight strategy, a local search mechanism, quasi-oppositional learning, and chaos theory to maximize the efficiency of HMPA. In their research, HMPA was tested on seven constrained/unconstrained real-life engineering problems, and its calculation results were compared with those of similar advanced algorithms. The results indicated that HMPA significantly outperformed the other competitor algorithms. Extending the concepts of Barshandeh and Haghzadeh1 and Barshandeh et al.,2 it is essential to seek optimization algorithms for handling real-life corpus analysis issues, especially in this era of information explosion.
In this modern digital era, corpus building has evolved from manual to automatic collection of textual data. To manage its massive textual data, a corpus is usually combined with statistics, machine learning algorithms, or artificial intelligence (AI) techniques; this facilitates the efficiency of data collection, information processing, information retrieval (IR), and so on. Natural language is one of the most ubiquitous formats of information flow among people. Analyzing, integrating, and reproducing textual data inevitably require importing highly accurate algorithms to process the semantics and syntax of natural languages. Corpus-based approaches that embed statistical algorithms, such as frequency calculation and the log-likelihood test, are commonly adopted by linguists and data analysts for interpreting linguistic patterns and extracting domain knowledge.3, 4 In addition, in corpus-based approaches, word ranking is a crucial technique used to define words' importance level and to retrieve essential words from large textual data; this especially helps uncover semantic relationships between lexical items.5, 6
In the face of novel diseases, it is essential to build specialized medical corpora for integrating, managing, and retrieving the vast information related to those diseases; such corpora help analyze, react to, and prevent the diseases more effectively. For example, COVID-19, a novel disease that broke out in December 2019, is genetically close to the SARS coronavirus (SARS-CoV) and had caused over 40 million confirmed cases and 1 million deaths by the end of October 2020 (less than a year).7-12 Leading researchers from various countries are attempting to unveil the mystery of the novel disease. As of the end of October 2020, Web of Science (WOS), an internationally renowned academic database, had published more than 35,000 COVID-19-related research articles (RAs); this number keeps growing. No doubt, governments around the world are seeking direct and effective measures to mitigate the pandemic and speed up the treatment of confirmed cases.13, 14 With massive textual data about COVID-19 being rapidly distributed, it is essential for people to rely on machine algorithms to compute important semantic information, thereby filtering and retrieving essential messages.15, 16 Hence, adopting corpus-based approaches to process and integrate COVID-19-related, English-mediated textual data will improve frontline medical personnel's efficiency of information acquisition and perception.
Since the advent of computer technology, the practicality of corpus-based approaches has received widespread attention and adoption in textual information analysis fields. The frequency criterion is considered one of the core analytical techniques in corpus-based approaches. However, merely relying on tokens' frequency values to determine their importance may be insufficient; tokens' dispersion and concentration conditions must also be taken into account. For example, in terms of importance, a word occurring 100 times in one RA is not equal to a word occurring 10 times each in 10 RAs, because the two words' dispersion and concentration conditions differ. A potential solution that adopts the Hirsch index (H-index) algorithm to integrate and compute the criteria of dispersion and concentration is required to address this issue. The H-index algorithm was originally used to quantify the cumulative impact and relevance of a researcher's scientific research achievements.17-23 Nevertheless, the algorithm has not been limited to evaluating academic achievements; it has also seen applications in fields such as risk assessment22 and medicine,24 among others.
Handling essential word-ranking issues with traditional frequency-based approaches may cause distortion and bias because these approaches neither refine the corpus data nor simultaneously compute words' frequency dispersion and concentration criteria; hence, the allegedly important high-frequency words can be challenged. Thus, this paper proposes a novel corpus-based approach that integrates corpus software and the H-index algorithm as a computation method and evaluation metric that can improve the accuracy of word ranking, compensate for the deficiency of traditional frequency-based approaches, and further boost the efficacy of corpus-based analysis. To verify the proposed approach, 100 COVID-19-related medical RAs indexed in the Science Citation Index (SCI) were retrieved from WOS and compiled as the massive textual data and the empirical example embedded into the proposed approach. The main reason the researchers adopted this empirical example is that SCI journals represent high-quality academic publications. In addition, understanding the specific linguistic pragmatics of medical RAs can assist frontline healthcare personnel in processing and acquiring important COVID-19 medical messages.
The remainder of this paper is organized as follows: Section 2 describes the preliminaries, explains the theoretical framework, and introduces the recent novel disease, COVID-19. Section 3 describes the detailed steps of the proposed approach. Section 4 uses COVID-19-related RAs from WOS as the massive textual data (i.e., the target corpus) and as an empirical example to verify the proposed approach. Section 5 concludes this study.
2 PRELIMINARIES
2.1 Conventional frequency-based corpus analysis
With the advance of computer technology, corpus development has enabled people to establish algorithms to integrate, manage, and process natural languages drawn from massive textual data, thereby driving the progress of natural language processing (NLP) and AI-related industries. O'Keeffe et al.25 noted that information on the frequency counts of tokens is the basis for understanding the core vocabularies that native speakers use frequently and the common combinations of vocabulary usage. Gathering large data (corpora) from native speakers' written texts and discourse transcripts provides strong evidence for understanding their linguistic patterns. Moreover, ranking words by their frequency reveals the words that are adopted by the majority and the words that are used in day-to-day communication.26, 27 Hence, frequency-based corpus analytical approaches have been widely adopted by linguists, sociologists, text analysts, and others for extracting strong linguistic evidence for interpreting cultural phenomena, jargon, genre types, and so on.28, 29 For example, Le and Miller6 adopted Sketch Engine, a corpus software, to cross-examine four medical corpus sources and extract the most frequently occurring medical morphemes in medical RAs. The resulting data indicated 136 specialized medical morphemes that account for 8.5% of the lexical items in the Medical Web Corpus, and the results provided English as a Foreign Language (EFL) medical students with a useful academic resource for enhancing their comprehension of English medical vocabulary. Grabowski5 used WordSmith Tools 5.0, a corpus software, to present a corpus-driven description of the use and functions of the top-50 keywords (i.e., based on keyness values), complemented by a similar description of the top-50 lexical bundles (LBs; based on frequency values), in the analysis of a specialized corpus containing patients' prescriptions, outlines of product introductions, clinical trial protocols, and pharmacological RAs. The results offered significant pedagogical value for English for specific purposes (ESP) students and EFL practitioners in the pharmaceutical field.
The traditional corpus-based approach was designed to effectively clarify, categorize, and interpret the patterns of natural languages. Computing word frequency is thus a critical technique that corpus software is capable of (see Equation 1).
Definition 1. (Anthony30 and Scott31) Let $F$ represent the accumulated value of a token's overall frequency, where $i$ denotes the sequence number of a subcorpus, $n$ the number of subcorpora, $f$ a token's frequency, and $f_i$ the token's frequency counted in the $i$-th subcorpus. Then

$$F = \sum_{i=1}^{n} f_i. \tag{1}$$
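As a minimal sketch (not part of the original definition), Equation (1) can be expressed in Python by summing a token's counts over hypothetical subcorpus counters:

```python
from collections import Counter

# Hypothetical token counts for three subcorpora; each counter holds the f_i of Equation (1)
subcorpora = [
    Counter({"covid": 30, "health": 12, "the": 95}),
    Counter({"covid": 18, "pandemic": 9, "the": 80}),
    Counter({"covid": 25, "health": 7, "the": 60}),
]

# Equation (1): a token's overall frequency F is the sum of its frequency in every subcorpus
overall = sum(subcorpora, Counter())
print(overall["covid"])  # -> 73
```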
2.2 H‐index algorithm
The H-index algorithm was proposed in 2005 by Jorge E. Hirsch,19 a physicist and professor at the University of California, San Diego. The H-index is an evaluation mechanism used to measure a researcher's academic productivity and the citation rate of published articles; the index h represents the number of papers with at least h citations each, and it is a useful index for quantifying the academic achievements of a researcher. Nowadays, this mechanism has been widely adopted in several academic databases, such as WOS, Google Scholar, and Scopus, and even in other research fields.18, 20, 22 The algorithm computes the interrelationships between publication quantities and numbers of citations and defines a researcher's academic influence in a certain field. For example, Li et al.22 adopted the H-index algorithm to assess the importance of the urban railroad network structure, taking the topology, passenger quantity, and passenger flow correlation of the Beijing urban railroad network into consideration to refine the rail network structure and reduce operational risks. Gao et al.17 proposed a weighted H-index (hw) by constructing an operator H on weighted edges; moreover, the accumulation of the weighted H-index (sh) in a node's neighborhood defines the spreading influence. They then applied the susceptible-infected-recovered (SIR) model to analyze an epidemic spreading process on 12 real-world networks and to further identify the most influential spreaders. Hanna et al.24 developed a novel metric for quantifying patient-level utilization of emergency department (ED) imaging. In their research, the H-index was adopted to measure a patient's annual ED imaging volume, and the resulting patient H-index values were used as referential data for mitigating imaging-related costs and improving throughput in the ED. In summary, the H-index algorithm integrates multiple considerations to evaluate and assign importance values to research objects; the definition of Hirsch's H-index algorithm is as follows:
Definition 2. (Hirsch19) Let the function $f$ give the citation count of each paper, ranked in descending order (see Equation 2); the H-index is then the largest $n$ for which $f(b_n)$ is equal to or larger than $n$ (see Equation 3):

$$f(b_1) \ge f(b_2) \ge \cdots \ge f(b_n) \ge \cdots, \tag{2}$$

$$H\text{-index} = \max\{\, n : f(b_n) \ge n \,\}, \tag{3}$$

where $n$ is the paper number, $f(b_n)$ is the citation count of the $n$-th ranked paper, and $f(b_1), f(b_2), \ldots$ represent the citation counts of the papers ranked from most to least cited.
To understand this algorithm, two examples are given as follows:
Example 1. Suppose a researcher has 10 published articles (n = 10) identified as A1, A2, ..., A10, and the citation numbers are randomly given as 9, 5, 50, 20, 6, 8, 6, 4, 1, 0; thus, f(A1) = 9, f(A2) = 5, f(A3) = 50, f(A4) = 20, f(A5) = 6, f(A6) = 8, f(A7) = 6, f(A8) = 4, f(A9) = 1, f(A10) = 0. Then, rerank the citation numbers in descending order, so that f(b1) = 50, f(b2) = 20, f(b3) = 9, f(b4) = 8, f(b5) = 6, f(b6) = 6, f(b7) = 5, f(b8) = 4, f(b9) = 1, f(b10) = 0. The results indicate that b6 satisfies the criterion of Equation (3), where f(b6) ≥ 6; thus, H-index = 6 (see Table 1).
Example 2. The illustrative diagram (see Figure 1) also explains the H-index algorithm; there is a reference line on the diagram (i.e., it represents that the n-th paper must have at least n citations), and a paper's citations have to be on or above the reference line to be counted toward the H-index. In this case, f(b6) is the sixth paper and also the last paper on the reference line. Its citation count is six, so it satisfies Equation (3), f(b6) ≥ 6; thus, the value of the H-index equals 6.
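To make the computation concrete, the following minimal Python sketch (ours, not part of Hirsch's original formulation) implements Equations (2) and (3) and reproduces Example 1:

```python
def h_index(citations):
    """Return the H-index of a list of citation counts."""
    ranked = sorted(citations, reverse=True)      # Equation (2): rank in descending order
    h = 0
    for n, c in enumerate(ranked, start=1):       # Equation (3): largest n with f(b_n) >= n
        if c >= n:
            h = n
        else:
            break
    return h

# Example 1: ten papers with the citation counts given above
print(h_index([9, 5, 50, 20, 6, 8, 6, 4, 1, 0]))  # -> 6
```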
In summary, the H-index algorithm estimates the significance, importance, and broad influence of a researcher's cumulative academic contributions. It has become a standard measurement and an unbiased criterion for comparing and evaluating the academic achievements of researchers competing in the same research fields.19
Table 1. H-index computing process
| Original data | | Computing process | | H-index result |
|---|---|---|---|---|
| Research paper | Citation count | Research paper (ranked) | Citation count (ranked) | |
| 1 | 9 | 3 | 50 | 6 |
| 2 | 5 | 4 | 20 | |
| 3 | 50 | 1 | 9 | |
| 4 | 20 | 6 | 8 | |
| 5 | 6 | 5 | 6 | |
| 6 | 8 | 7 | 6 | |
| 7 | 6 | 2 | 5 | |
| 8 | 4 | 8 | 4 | |
| 9 | 1 | 9 | 1 | |
| 10 | 0 | 10 | 0 | |
2.3 COVID-19
COVID-19, whose original nomenclature was SARS-CoV-2, was renamed by the WHO in February 2020. The clusters of the first cases of the virus were discovered in Wuhan city, Hubei province, China.7 Epidemiologists currently propose the possibility that the virus, initially carried by wild animals, entered human-to-human transmission routes because locals in the city have a preference for "Yeh-Wei", the meat of wild animals such as bats, birds, and rodents.8, 10 Upon visiting the possible source location of COVID-19, the Huanan market, medical experts found plenty of contaminated carcasses of wild animals stocked and piled for sale. Thus, medical and biological experts speculated that the novel coronavirus may constantly mutate in animal hosts (e.g., bats, pangolins, etc.) and then become capable of infecting humans, especially when people process animal carcasses or eat uncooked food ingredients that host the virus.8 Indeed, many studies have indicated that bats were the initial hosts of COVID-19 because it has over 90% similarity to two SARS-like coronaviruses from bats, bat-SL-CoVZX45 and bat-SL-CoVZX21.9, 12 In terms of etiology, COVID-19 has a genetic form similar to SARS-CoV (i.e., the acute respiratory syndrome coronavirus that broke out in 2002) and MERS-CoV (i.e., the Middle East respiratory syndrome coronavirus that broke out in 2012),12, 32 but its spike (S) protein has mutated, enabling it to attack the host's immune system and making the host too weak to resist the virus.33 A comparison of COVID-19 and the two prior coronaviruses shows that COVID-19 causes a low fatality rate but has extremely high infectious capability.34 Yi et al.12 also pointed out that the majority of the human population lacks immunity to COVID-19 and is thus susceptible to the novel coronavirus.
Reverse transcriptase polymerase chain reaction (RT-PCR) was initially adopted as the primary criterion for diagnosing COVID-19. However, the RT-PCR test has a high probability of misdiagnosis, which may accelerate the pandemic; thus, several diagnostic approaches were integrated with investigations of travel history, disease records, clinical symptoms (see Figure 2), lab tests, and X-ray or computed tomography (CT) to make effective diagnoses.35 Following the intensification of the COVID-19 pandemic, rapid test kits were invented to quickly detect the RNA, antigen, or antibody of SARS-CoV-2, giving frontline healthcare personnel more time to respond to and treat confirmed cases. In addition, prior studies pointed out that, without protective measures (i.e., surgical masks, respiratory filtration, etc.), the three major transmission routes of inhalation, droplet, and contact cause 57%, 35%, and 8.2% of COVID-19 infection probability, respectively.36 For frontline healthcare personnel in particular, who treat confirmed cases and have prolonged exposure to the virus emission environment and inhalation of droplets (<10 μm) containing the virus, the risk of infection may reach over 80%.37 Prior research also showed that social distancing (1.5–2 m) may not be effective if the virus emission source does not wear any protective gear, because the virus can be spread at least 6 m away via patients' coughing and sneezing.38, 39 Hence, although the fatality rate of COVID-19 is not extremely high, the high infection rate causes difficulties in pandemic response and prevention.
According to the WHO, as of October 31, 2020, there were 45,408,704 confirmed COVID-19 cases and 1,179,363 COVID-19 deaths (see Figure 3). Because targeted therapeutic medicines are still being developed, governments can currently only rely on quarantine policies and existing indirect medical therapies; thus, they make residents pay attention to personal hygiene, implement border control measures, and encourage social distancing, online shopping, and so on, to decrease close contact between people and control the COVID-19 pandemic.40-42
COVID-19, at the time of this writing, is still a semi-unknown novel disease for medical experts and continues to be explored. To effectively manage the massive medical textual information about it, it is necessary to create a COVID-19-specialized corpus, integrating appropriate algorithms for information processing and mining.
3 METHODOLOGY
Traditional corpus-based computing methods for essential word ranking mainly calculate words' frequency values and rank them. Prior studies believed that high-frequency words may reflect specific linguistic patterns in certain domains, which can benefit EFL speakers in acquiring domain knowledge more effectively when reading English texts.3, 5, 6, 43, 44 Thus, with the rapid information flow of COVID-19, establishing a COVID-19-specialized corpus for the timely acquisition of up-to-date medical information is highly critical for medical care personnel.7, 9, 11, 14, 32 In fact, as of the end of October 2020, more than 38,000 RAs on COVID-19-related topics had been published in the WOS database; this phenomenon indicates that a large number of research results have been produced by leading researchers globally. To effectively integrate and decipher the English-mediated professional textual information and to further improve the efficiency of knowledge acquisition, importing algorithms to compute key natural language semantics is quite essential. Corpus-based and NLP technology hence plays a critical role today in helping people efficiently process the massive textual information available.25, 45
However, taking existing corpus software such as AntConc 3.5.830 and WordSmith Tools 5.0 as examples, their built-in algorithms are still unable to compute these two conditions (i.e., frequency dispersion and concentration) simultaneously. Their word-ranking results can only be based on frequency values or range values, respectively; hence, the evaluation of words' importance level is biased. Therefore, to compensate for the biased word-ranking results of the traditional methods, the researchers propose a novel corpus-based approach that integrates AntConc 3.5.830 and the H-index algorithm19 to compute and evaluate the importance of tokens.
The steps are as follows. In the initial stage of the proposed approach, sample and compile the textual data as the target corpus in a way that suits the H-index algorithm. Then, adopt Chen et al.'s46 corpus-based optimizing approach to refine the target corpus. In the middle part of the proposed approach, use AntConc 3.5.830 to compute tokens' frequency values and ranges; then, adopt the H-index algorithm to integrally compute tokens' dispersion and concentration conditions and to obtain their H-index values. Next, rank tokens based on their H-index and frequency values. The post-ranking results clarify the significance of the proposed approach and imply possible future applications in corpus-based and NLP fields. There are six steps in total in the proposed approach; detailed descriptions are given as follows (see Figure 4):
Figure 4. Flowchart of the proposed approach
Step 1. Compiling a suitable categorization of the massive textual data for H-index analysis.
The H-index algorithm is mainly used to explore the citation rate of research papers. In this study, the authors adopt it to explore the usage rate of tokens. In this step, the target corpus (i.e., the massive textual data) should be segmented into its basic components, treating each article as a unit instead of compiling all files into one big file (see Figure 5). In this way, the H-index of tokens can be computed successfully.
Figure 5. Ideal corpus compilation method for the H-index algorithm
Step 2. Extracting tokens from the massive textual data.
Using AntConc 3.5.8 as the corpus software to calculate and unveil the composition of the massive textual data, the quantitative data are retrieved and all tokens are labeled with numbers in this step.
Step 3. Optimizing the massive textual data.
Function and meaningless words decrease the efficiency of corpus-based approaches; hence, to retrieve the substantive words that best reflect domain knowledge, a refining process is inevitable. In this step, the function wordlist and machine optimizing process are adopted to refine the massive textual data,46 and the remaining content words are processed in the subsequent steps.
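A minimal sketch of this refining idea is shown below; the stopword set here is only a small illustrative subset, not Chen et al.'s actual function wordlist:

```python
from collections import Counter

# Illustrative subset of function words; the study relies on Chen et al.'s function wordlist instead
FUNCTION_WORDS = {"a", "an", "the", "it", "is", "of", "and", "to", "in", "that", "was", "were"}

def refine(tokens):
    """Drop function words so that only content words remain for ranking."""
    return [t for t in tokens if t.lower() not in FUNCTION_WORDS]

text = "the patients in the study were admitted to the hospital".split()
print(Counter(refine(text)).most_common(3))
# -> [('patients', 1), ('study', 1), ('admitted', 1)]
```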
Step 4. Ranking tokens based on the individual overall frequency criterion.
After each token's overall frequency is calculated by the corpus software based on Equation (1), the wordlist in this step is ranked on the frequency criterion, from the highest to the lowest frequency.
Step 5. Ranking tokens based on the H-index algorithm.
In this step, the researchers adopt the H-index algorithm to compute the importance of tokens. Here, citation counts are treated as the tokens' adoption counts (i.e., frequencies); thus, a token's H-index is the largest n such that the token appears n or more times in each of n RAs. First, based on Equation (2), rank the token's frequencies across the RAs in descending order. Then, based on Equation (3), find the token's H-index value that satisfies the criterion.
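A minimal sketch of this step is given below, assuming per-RA token counts are already available (the data and helper names are illustrative, not AntConc output):

```python
from collections import Counter

def token_h_index(per_ra_freqs):
    """Largest n such that the token occurs at least n times in each of n RAs (Equations 2 and 3)."""
    ranked = sorted(per_ra_freqs, reverse=True)
    return max((n for n, f in enumerate(ranked, start=1) if f >= n), default=0)

# Hypothetical per-RA token counts for a tiny three-article corpus
ra_counts = [
    Counter({"covid": 5, "mortality": 2}),
    Counter({"covid": 3, "vaccine": 4}),
    Counter({"covid": 1, "mortality": 6}),
]

vocabulary = set().union(*ra_counts)
h_values = {tok: token_h_index([c[tok] for c in ra_counts]) for tok in vocabulary}
print(h_values)  # e.g., {'covid': 2, 'mortality': 2, 'vaccine': 1} (order may vary)
```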
Step 6. Integrating tokens' ranking information for future extended applications.
1. Rank tokens based on their H-index values in descending order.
2. If tokens have the same H-index values, then rank them by their frequency values in descending order.
The proposed approach uses the H-index algorithm to compute a token's degree of importance, simultaneously taking the criteria of dispersion and concentration into account. In addition, when identical H-index values occur, tokens' frequency values are used to define their ranks, which avoids the hesitation that arises when defining tokens' degree of importance.
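A minimal sketch of this combined ranking rule, assuming each token's H-index and overall frequency have already been computed (the sample values are taken loosely from Table 8):

```python
# Hypothetical (token, H-index, overall frequency) triples
tokens = [("health", 28, 2325), ("cases", 17, 1148), ("patients", 18, 1109), ("during", 17, 871)]

# Step 6: sort by H-index (descending), then by frequency (descending) to break ties
ranked = sorted(tokens, key=lambda t: (-t[1], -t[2]))
for rank, (tok, h, freq) in enumerate(ranked, start=1):
    print(rank, tok, h, freq)
# 1 health 28 2325
# 2 patients 18 1109
# 3 cases 17 1148   <- same H-index as "during"; the higher frequency ranks first
# 4 during 17 871
```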
4 EMPIRICAL STUDY
4.1 Overview of the compiled massive textual data
The massive textual data in this paper are 100 RAs collected from WOS. This choice was made because WOS is one of the largest, best-known, and leading databases in the world. Moreover, many academic big-textual-data analysis studies and NLP studies in scientific fields have adopted RAs from WOS as test data.47-49 Hence, in this study, the researchers chose Medicine, General, and Internal, a category defined by the Journal Citation Reports (JCR) for WOS, and then focused on its open access (OA) journals (N = 24). To process these 24 journals, the authors first calculated their respective annual publications (data retrieved from 2019.9.1 to 2020.8.31) and then calculated the number of papers related to the COVID-19 topic. Finally, they sampled the most recent articles from each journal based on this ratio and compiled the massive textual data (see Table 2). The research fields of the sampled journals comprise (1) environmental sciences; (2) public, environmental, and occupational health; (3) infectious diseases; (4) tropical medicine; (5) microbiology; (6) toxicology; (7) healthcare sciences and services; and (8) health policy and services. Moreover, the collected RAs all had COVID-19 in their titles, and they discussed problems and solutions during the COVID-19 pandemic in line with their research fields. The paper collection method in this study tried to balance domain and genre type as much as possible, so that native and EFL healthcare personnel can understand the most important and widely used tokens in medical RAs.
Table 2. The composition of the massive textual data
| Topic | Category | Journal | Annual publications | COVID-19-related RAs | Actually collected articles |
|---|---|---|---|---|---|
| COVID-19 | Medicine, General, and Internal | International Journal of Environmental Research and Public Health | 7683 | 253 | 41 |
| | | Frontiers in Public Health | 539 | 94 | 15 |
| | | Journal of Global Health | 228 | 45 | 7 |
| | | Lancet Global Health | 399 | 43 | 7 |
| | | Lancet Public Health | 173 | 41 | 7 |
| | | Journal of Infection and Public Health | 252 | 27 | 4 |
| | | Asian Pacific Journal of Tropical Medicine | 102 | 22 | 4 |
| | | BMJ Global Health | 327 | 13 | 2 |
| | | Annals of Global Health | 97 | 13 | 2 |
| | | Globalization and Health | 108 | 12 | 2 |
| | | Journal of Nepal Medical Association | 172 | 11 | 2 |
| | | BMC Public Health | 1817 | 8 | 1 |
| | | Journal of Epidemiology | 79 | 5 | 1 |
| | | Antimicrobial Resistance and Infection Control | 195 | 5 | 1 |
| | | Reproductive Health | 180 | 5 | 1 |
| | | Australian and New Zealand Journal of Public Health | 114 | 5 | 1 |
| | | Archives of Public Health | 91 | 4 | 1 |
| | | Environmental Health Perspectives | 175 | 3 | 1 |
| | | Health Expectations | 185 | 2 | 0 |
| | | Conflict and Health | 79 | 2 | 0 |
| | | Tobacco Induced Diseases | 65 | 2 | 0 |
| | | Environmental Health and Preventive Medicine | 70 | 1 | 0 |
| | | Safety and Health at Work | 68 | 1 | 0 |
| | | Gaceta Sanitaria | 116 | 1 | 0 |
| | | Total | 13,314 | 618 | 100 |
- Abbreviation: RA, research article.
4.2 Traditional corpus-based computing method for handling essential word-ranking issues
AntConc 3.5.830 works like other corpus software; based on Equation (1), it accumulates the sum of words' occurrence counts (i.e., frequency values) in the corpus and ranks the words. Using the compiled corpus as an example, the traditional method for handling essential word-ranking issues causes the following problems: (1) function and meaningless words are not eliminated, hence content words are ranked behind them and analytical efficiency decreases; (2) the dispersion condition of frequency is not taken into account; and (3) the concentration condition of frequency is not taken into account. The word-ranking results in Figure 6 indicate that the wordlist is based on words' overall frequency values and ranked in descending order.
4.3 The proposed approach
In this section, the compiled massive textual data are embedded into the proposed novel corpus-based approach to calculate its actual results. A detailed description is given as follows:
Step 1. Compiling a suitable categorization of the massive textual data for H-index analysis.
To effectively compute the H-index values of each token, the composition of the corpus should treat each article as a unit. To manage the massive textual data, the researchers first gave each journal a codename. For example, Annals of Global Health was coded as AGH. The purpose of coding journal names was to retrieve the sources of tokens rapidly and effectively, hence increasing the efficiency of text analysis and mining. Second, the file name of each article is given based on a specific rule; for instance, in 01. AGH-01, 01 is the RA's serial number (i.e., from the perspective of the entire massive textual data), AGH is the journal codename, and -01 represents the RA's serial number within the current journal (see Table 3).
Table 3. Journal codenames and data management of RAs
| Journal name | Codename | Data management of RAs |
|---|---|---|
| Annals of Global Health | AGH | 01. AGH-01, 02. AGH-02 |
| Australian and New Zealand Journal of Public Health | ANZJPH | 03. ANZJPH-01 |
| Archives of Public Health | APH | 04. APH-01 |
| Asian Pacific Journal of Tropical Medicine | APJTM | 05. APJTM-01, 06. APJTM-02, 07. APJTM-03, 08. APJTM-04 |
| Antimicrobial Resistance and Infection Control | ARIC | 09. ARIC-01 |
| BMC Public Health | BMCPH | 10. BMCPH-01 |
| BMJ Global Health | BMJGH | 11. BMJGH-01, 12. BMJGH-02 |
| Environmental Health Perspectives | EHP | 13. EHP-01 |
| Frontiers in Public Health | FPH | 14. FPH-01, 15. FPH-02, 16. FPH-03, 17. FPH-04, 18. FPH-05, 19. FPH-06, 20. FPH-07, 21. FPH-08, 22. FPH-09, 23. FPH-10, 24. FPH-11, 25. FPH-12, 26. FPH-13, 27. FPH-14, 28. FPH-15 |
| Globalization and Health | GAH | 29. GAH-01, 30. GAH-02 |
| International Journal of Environmental Research and Public Health | IJERPH | 31. IJERPH-01, 32. IJERPH-02, 33. IJERPH-03, 34. IJERPH-04, 35. IJERPH-05, 36. IJERPH-06, 37. IJERPH-07, 38. IJERPH-08, 39. IJERPH-09, 40. IJERPH-10, 41. IJERPH-11, 42. IJERPH-12, 43. IJERPH-13, 44. IJERPH-14, 45. IJERPH-15, 46. IJERPH-16, 47. IJERPH-17, 48. IJERPH-18, 49. IJERPH-19, 50. IJERPH-20, 51. IJERPH-21, 52. IJERPH-22, 53. IJERPH-23, 54. IJERPH-24, 55. IJERPH-25, 56. IJERPH-26, 57. IJERPH-27, 58. IJERPH-28, 59. IJERPH-29, 60. IJERPH-30, 61. IJERPH-31, 62. IJERPH-32, 63. IJERPH-33, 64. IJERPH-34, 65. IJERPH-35, 66. IJERPH-36, 67. IJERPH-37, 68. IJERPH-38, 69. IJERPH-39, 70. IJERPH-40, 71. IJERPH-41 |
| Journal of Global Health | JGH | 72. JGH-01, 73. JGH-02, 74. JGH-03, 75. JGH-04, 76. JGH-05, 77. JGH-06, 78. JGH-07 |
| Journal of Infection and Public Health | JIPH | 79. JIPH-01, 80. JIPH-02, 81. JIPH-03, 82. JIPH-04 |
| Journal of Nepal Medical Association | JNMA | 83. JNMA-01, 84. JNMA-02 |
| Journal of Epidemiology | JOE | 85. JOE-01 |
| Lancet Global Health | LGH | 86. LGH-01, 87. LGH-02, 88. LGH-03, 89. LGH-04, 90. LGH-05, 91. LGH-06, 92. LGH-07 |
| Lancet Public Health | LPH | 93. LPH-01, 94. LPH-02, 95. LPH-03, 96. LPH-04, 97. LPH-05, 98. LPH-06, 99. LPH-07 |
| Reproductive Health | RH | 100. RH-01 |

- Abbreviation: RA, research article.
Step 2. Extracting tokens from the massive textual data.
The data management of the first step shows that the coding principle provides great convenience when launching AntConc 3.5.8 to process the corpus data. The corpus software analyzed all RAs' word types, tokens, and lexical diversity (i.e., types and tokens ratio, TTR; see Table 4). The lexical results of the compiled massive textual data indicated that the authors of the 100 RAs adopted 13,062 word types and that the whole corpus consists of 366,866 running words. Its TTR is therefore approximately 0.0356 (i.e., 13,062/366,866; see Table 4).
Table 4. Lexical data of the compiled massive textual data
| Data codename | Number of papers | Word types | Tokens | TTR |
|---|---|---|---|---|
| AGH | 2 | 1543 | 7647 | 0.2018 |
| ANZJPH | 1 | 683 | 1907 | 0.3582 |
| APH | 1 | 695 | 3153 | 0.2204 |
| APJTM | 4 | 1680 | 9062 | 0.1854 |
| ARIC | 1 | 394 | 989 | 0.3984 |
| BMCPH | 1 | 731 | 3108 | 0.2352 |
| BMJGH | 2 | 2130 | 10,730 | 0.1985 |
| EHP | 1 | 868 | 3333 | 0.2604 |
| FPH | 15 | 5352 | 50,993 | 0.1050 |
| GAH | 2 | 1304 | 6548 | 0.1991 |
| IJERPH | 41 | 9124 | 184,639 | 0.0494 |
| JGH | 7 | 3263 | 26,739 | 0.1220 |
| JIPH | 4 | 1699 | 9554 | 0.1778 |
| JNMA | 2 | 973 | 2763 | 0.3522 |
| JOE | 1 | 865 | 3773 | 0.2293 |
| LGH | 7 | 2905 | 20,091 | 0.1446 |
| LPH | 7 | 2411 | 19,153 | 0.1259 |
| RH | 1 | 857 | 2720 | 0.3151 |
| Whole corpus | 100 | 13,062 | 366,866 | 0.0356 |
- Abbreviation: TTR, types and tokens ratio.
Step 3. Optimizing the massive textual data.
On the basis of Chen et al.'s46 research, function words, such as a, an, the, it, and is, decrease the efficiency of text mining and IR. Indeed, no matter which algorithm is used to calculate the importance of tokens, the irreplaceability of function words in constructing meaningful sentences causes them to appear in the resulting data and even to be ranked very high, which directly decreases the accuracy and efficiency of data processing. Thus, the researchers adopted Chen et al.'s46 massive-textual-data refining approach to optimize the compiled massive textual data; the refined wordlist in the corpus software shows that meaningful words are ranked to the front (see Figure 7). In addition, the data discrepancy showed that the word types of the refined data decreased by 238 (i.e., function words), while the tokens of the refined data decreased by 157,911 words, a 43% downsizing of the corpus. Moreover, the lexical diversity was enhanced to 0.0614 (see Table 5). Unexpectedly, even in highly specialized medical RAs, function words occupied more than 40% of the corpus. To avoid information distortion, the procedure for eliminating function words is inevitable.
Table 5. Data discrepancy between the original and refined data
| Lexical feature | Original data | Refined data | Data discrepancy |
|---|---|---|---|
| Word types | 13,062 | 12,824 | −238 (−1.8%) |
| Tokens | 366,866 | 208,955 | −157,911 (−43%) |
| TTR | 0.0356 | 0.0614 | |
- Abbreviation: TTR, types and tokens ratio.
Step 4. Ranking tokens based on the individual overall frequency criterion.
After optimizing the compiled massive textual data, the authors adopted the refined traditional corpus-based computing method30 to compute the sum of the frequency values of each token (see Figure 7) and to find each token's frequency values in each RA using the Concordance Plot function of the corpus software. In the Concordance Plot, Concordance Hit represents a token's overall frequency value, and Total Plot (with hits) represents how many RAs adopted the token. Take COVID as an example: its Concordance Hit is 3520 (i.e., its overall frequency value) and its Total Plot (with hits) is 100, which means COVID was adopted by the authors of all 100 RAs (see Figure 8). Hence, in this step, the authors obtained three important elements: the overall frequency values, the frequency values in each RA, and how many RAs adopted a token. These elements are essential and are used by the H-index algorithm in the following step.
Step 5. Ranking tokens based on the H-index algorithm.
In this step, the researchers used the wordlist to compute the tokens (N = 420) whose frequency values exceeded 100. Take mortality as an example: the authors recorded the frequency values of mortality in each RA as the original data and sorted the frequencies from highest to lowest; it was then found that f(b9) = 9 ≥ 9 satisfied the criterion of Equation (3), so the H-index value was given as 9 (see Table 6). This computing approach calculates a token's overall adoption rate and evaluates its importance level more accurately. The researchers then recorded the tokens' H-index values in Excel for the ranking process.
Table 6. An example of a token's H-index computing process
| Token | Original data | | Computing process | | H-index result |
|---|---|---|---|---|---|
| | Article | Frequency | Article (ranked) | Frequency (ranked) | |
| Mortality | 1 | 2 | 6 | 90 | 9 |
| | 2 | 1 | 38 | 36 | |
| | 3 | 1 | 20 | 31 | |
| | 4 | 3 | 25 | 30 | |
| | 5 | 2 | 37 | 21 | |
| | 6 | 90 | 30 | 13 | |
| | 7 | 1 | 33 | 12 | |
| | 8 | 3 | 22 | 10 | |
| | 9 | 4 | 15 | 9 | |
| | 10 | 5 | 18 | 6 | |
| | 11 | 1 | 23 | 6 | |
| | 12 | 2 | 10 | 5 | |
| | 13 | 1 | 39 | 5 | |
| | 14 | 2 | 9 | 4 | |
| | 15 | 9 | 31 | 4 | |
| | 16 | 2 | 4 | 3 | |
| | 17 | 1 | 8 | 3 | |
| | 18 | 6 | 26 | 3 | |
| | 19 | 1 | 1 | 2 | |
| | 20 | 31 | 5 | 2 | |
| | 21 | 1 | 12 | 2 | |
| | 22 | 10 | 14 | 2 | |
| | 23 | 6 | 16 | 2 | |
| | 24 | 1 | 2 | 1 | |
| | 25 | 30 | 3 | 1 | |
| | 26 | 3 | 7 | 1 | |
| | 27 | 1 | 11 | 1 | |
| | 28 | 1 | 13 | 1 | |
| | 29 | 1 | 17 | 1 | |
| | 30 | 13 | 19 | 1 | |
| | 31 | 4 | 21 | 1 | |
| | 32 | 1 | 24 | 1 | |
| | 33 | 12 | 27 | 1 | |
| | 34 | 1 | 28 | 1 | |
| | 35 | 1 | 29 | 1 | |
| | 36 | 1 | 32 | 1 | |
| | 37 | 21 | 34 | 1 | |
| | 38 | 36 | 35 | 1 | |
| | 39 | 5 | 36 | 1 | |
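As a cross-check (not part of the original workflow), feeding the per-RA frequencies of mortality from Table 6 into the h_index sketch given in Section 2.2 reproduces the same result:

```python
# Per-RA frequencies of "mortality" from the original-data column of Table 6
mortality = [2, 1, 1, 3, 2, 90, 1, 3, 4, 5, 1, 2, 1, 2, 9, 2, 1, 6, 1, 31,
             1, 10, 6, 1, 30, 3, 1, 1, 1, 13, 4, 1, 12, 1, 1, 1, 21, 36, 5]
print(h_index(mortality))  # -> 9, matching the H-index result in Table 6
```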
It was found that, after the tokens were ranked by their H-index values, the sequence of the wordlist changed considerably, because the H-index accounts for how widely authors adopted a token across the RAs and thus reinterprets the importance of tokens. However, tokens often shared the same H-index value. When identical H-index values were encountered, the authors sorted those tokens by their frequency values again. That is, this paper considers H-index and frequency values simultaneously to make the importance calculation of tokens more accurate.
Step 6. Integrating tokens' ranking information for future extended applications.
The wordlist of Step 5 shows the combinations of the tokens' H-index and frequency values. The token-ranking issue handled by the proposed approach redefines their importance level; hence, these data provide important referential indicators for future applications, such as IR, NLP, big data analysis, machine learning, and deep learning. Through this study, the authors propose a novel corpus-based approach that integrates corpus software and the H-index algorithm to calculate which tokens are important in medical RAs. The resulting data will improve native and EFL medical researchers' reading and processing efficiency of medical RAs.
4.4 Comparison and discussion
1. Refining corpus data
According to Table 8, the raw data contain many function and meaningless tokens, such as the, of, and, to, and in. The traditional frequency-based approach30 calculated all tokens' frequency values but was unable to identify which tokens carry more substantial meaning for humans. To enable corpus-based approaches to rank essential words with substantial meanings, the refined traditional frequency-based approach46 and the proposed approach eliminate function and meaningless words. Hence, as Table 8 shows, the refined data present content words that have general or domain-oriented functions. This makes the corpus analytical results more meaningful and enhances the efficiency of retrieving essential words.
2. Calculating frequency dispersion criteria
Table 7. A comparison of corpus-based approaches
| Methods | Refining corpus data | Calculating frequency dispersion criteria | Calculating frequency concentration criteria |
|---|---|---|---|
| The traditional frequency-based approach30 | No | No | No |
| The refined traditional frequency-based approach46 | Yes | No | No |
| The proposed approach | Yes | Yes | Yes |
Table 8. The top 50 tokens of the three compared approaches (partial data)
| Raw data | | | Refined data | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| The traditional frequency-based approach30 | | | The refined traditional frequency-based approach46 | | | The proposed approach | | | |
| Rank | Frequency | Token | Rank | Frequency | Token | Rank | H-index | Frequency | Token |
| 1 | 23,079 | the | 1 | 3520 | COVID | 1 | 39 | 3520 | COVID |
| 2 | 14,660 | of | 2 | 2325 | health | 2 | 28 | 2325 | health |
| 3 | 13,258 | and | 3 | 1247 | study | 3 | 21 | 1247 | study |
| 4 | 9577 | to | 4 | 1162 | pandemic | 4 | 20 | 1162 | pandemic |
| 5 | 9218 | in | 5 | 1148 | cases | 5 | 18 | 1109 | patients |
| 6 | 5721 | a | 6 | 1109 | patients | 6 | 17 | 1148 | cases |
| 7 | 3891 | with | 7 | 999 | data | 7 | 17 | 871 | during |
| 8 | 3699 | for | 8 | 871 | during | 8 | 16 | 999 | data |
| 9 | 3520 | COVID | 9 | 779 | social | 9 | 15 | 702 | people |
| 10 | 3279 | that | 10 | 714 | public | 10 | 14 | 779 | social |
| 11 | 2857 | is | 11 | 711 | SARS | 11 | 14 | 701 | number |
| 12 | 2631 | as | 12 | 702 | people | 12 | 14 | 660 | risk |
| 13 | 2544 | was | 13 | 701 | number | 13 | 14 | 645 | time |
| 14 | 2417 | were | 14 | 660 | risk | 14 | 14 | 642 | disease |
| 15 | 2353 | on | 15 | 645 | time | 15 | 14 | 599 | care |
| 16 | 2325 | health | 16 | 642 | disease | 16 | 13 | 714 | public |
| 17 | 2084 | be | 17 | 626 | table | 17 | 13 | 711 | SARS |
| 18 | 1968 | by | 18 | 619 | reported | 18 | 13 | 619 | reported |
| 19 | 1882 | this | 19 | 615 | clinical | 19 | 13 | 614 | CoV |
| 20 | 1873 | are | 20 | 614 | CoV | 20 | 13 | 594 | symptoms |
| 21 | 1783 | from | 21 | 599 | care | 21 | 13 | 593 | countries |
| 22 | 1699 | or | 22 | 594 | symptoms | 22 | 13 | 541 | one |
| 23 | 1404 | have | 23 | 593 | countries | 23 | 13 | 491 | transmission |
| 24 | 1383 | we | 24 | 582 | infection | 24 | 12 | 582 | infection |
| 25 | 1275 | not | 25 | 570 | population | 25 | 12 | 570 | population |
| 26 | 1247 | study | 26 | 561 | individuals | 26 | 12 | 561 | individuals |
| 27 | 1213 | at | 27 | 546 | first | 27 | 12 | 533 | high |
| 28 | 1177 | their | 28 | 541 | one | 28 | 12 | 499 | analysis |
| 29 | 1162 | pandemic | 29 | 533 | high | 29 | 12 | 439 | medical |
| 30 | 1148 | cases | 30 | 531 | control | 30 | 11 | 626 | table |
| 31 | 1112 | it | 31 | 527 | used | 31 | 11 | 615 | clinical |
| 32 | 1111 | an | 32 | 526 | results | 32 | 11 | 546 | first |
| 33 | 1109 | patients | 33 | 506 | based | 33 | 11 | 526 | results |
| 34 | 999 | data | 34 | 499 | analysis | 34 | 11 | 506 | based |
| 35 | 957 | more | 35 | 498 | case | 35 | 11 | 498 | case |
| 36 | 921 | which | 36 | 491 | transmission | 36 | 11 | 471 | information |
| 37 | 871 | during | 37 | 471 | information | 37 | 11 | 470 | research |
| 38 | 861 | can | 38 | 470 | research | 38 | 11 | 466 | associated |
| 39 | 834 | has | 39 | 466 | associated | 39 | 11 | 465 | higher |
| 40 | 814 | also | 40 | 465 | higher | 40 | 11 | 456 | virus |
| 41 | 805 | these | 41 | 456 | virus | 41 | 11 | 404 | age |
| 42 | 792 | p | 42 | 452 | studies | 42 | 11 | 404 | related |
| 43 | 788 | may | 43 | 451 | use | 43 | 11 | 404 | confirmed |
| 44 | 779 | social | 44 | 446 | two | 44 | 11 | 355 | factors |
| 45 | 762 | been | 45 | 442 | coronavirus | 45 | 11 | 340 | model |
| 46 | 752 | they | 46 | 439 | medical | 46 | 10 | 531 | control |
| 47 | 746 | had | 47 | 432 | outbreak | 47 | 10 | 527 | used |
| 48 | 737 | who | 48 | 431 | measures | 48 | 10 | 451 | use |
| 49 | 731 | all | 49 | 420 | CHINA | 49 | 10 | 446 | two |
| 50 | 731 | other | 50 | 406 | new | 50 | 10 | 442 | coronavirus |
The authors adopted the proposed approach to compute the top 420 tokens, whose frequency values each exceeded 100, from the wordlist of the refined data. According to Table 8, there were significant differences in token ranking between the traditional corpus-based computing approaches30, 46 and the proposed approach. The traditional corpus-based computing approaches30, 46 only calculated a token's total frequency values to define its rank and importance; however, the frequency dispersion criterion was not taken into account. That is, a token with high frequency may not be widely adopted by the RA authors; it may be concentrated in only a few RAs or may even occur in just one RA. In contrast, the proposed approach not only used the H-index to compute the dispersion and concentration criteria of frequency simultaneously but also used frequency values to distinguish tokens that had the same H-index values. Therefore, after all criteria are taken into consideration, the proposed approach is more rigorous and accurate. Interestingly, tokens such as COVID, health, study, pandemic, reported, infection, population, individuals, and case still remain at their original ranks when the refined traditional frequency-based approach and the proposed approach are compared; that is, after calculation by the two approaches, their frequency and H-index values were both extremely high, so these tokens' importance was unquestionable.
The calculation results of the proposed approach redefine the importance of the tokens (N = 420) compared with the traditional corpus-based computing approaches.30, 46 Specifically, the authors found that only 11 tokens (2.6%) remained at their original ranks, and only 9 tokens (2.1%) among them appear in the top-50 wordlists (see Table 8); 15 tokens (3.5%) moved forward by more than 100 ranks, 196 tokens (46.6%) moved forward by 1 to 99 ranks, 14 tokens (3.3%) moved backward by more than 100 ranks, and 184 tokens (43.8%) moved backward by 1 to 99 ranks. In other words, the proposed approach successfully re-evaluates the importance of tokens and changes more than 97% of the ranks by adopting the H-index algorithm, which simultaneously takes the dispersion and concentration criteria of frequency into account (see Table 9).
Table 9. Changes of token ranks (N = 420)
| Data discrepancy | Token numbers | Proportion |
|---|---|---|
| Tokens stay at their original ranks | 11 | 0.0262 |
| Tokens move forward more than 100 ranks | 15 | 0.0357 |
| Tokens move forward from 1 to 99 ranks | 196 | 0.4667 |
| Tokens move backward more than 100 ranks | 14 | 0.0333 |
| Tokens move backward from 1 to 99 ranks | 184 | 0.4381 |
| Tokens' H-index value equal to 1 | 2 | 0.0048 |
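For reference, a minimal sketch (with hypothetical rank data) of how the rank-change categories in Table 9 could be tallied, given each token's rank under the refined frequency-based wordlist and under the proposed approach:

```python
from collections import Counter

def shift_category(freq_rank, proposed_rank):
    """Classify a token's rank change between the two wordlists, as in Table 9."""
    delta = freq_rank - proposed_rank       # positive: the token moved forward
    if delta == 0:
        return "same rank"
    if delta > 100:
        return "forward > 100"
    if delta > 0:
        return "forward 1-99"
    if delta < -100:
        return "backward > 100"
    return "backward 1-99"

# Hypothetical ranks: token -> (rank by frequency, rank by the proposed approach)
ranks = {"covid": (1, 1), "table": (17, 30), "hyponatremia": (231, 419)}
print(Counter(shift_category(f, p) for f, p in ranks.values()))
# -> Counter({'same rank': 1, 'backward 1-99': 1, 'backward > 100': 1})
```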
3. Calculating frequency concentration criteria

The proposed approach can also handle tokens' frequency concentration criterion. For example, hyponatremia was ranked at 231 by the traditional corpus-based computing approaches30, 46 (frequency = 153), and tobacco was ranked at 391 (frequency = 104). However, after computation by the proposed approach, both words' H-index values equaled 1 (see Table 9); hence, their post ranks moved backward to 419 and 420, respectively (i.e., they became the two least important words among the 420 tokens), a backward move of 188 and 29 positions, respectively. Even though hyponatremia and tobacco each occurred more than 100 times in the compiled massive textual data, each was adopted by only one RA. In other words, their importance is almost negligible because there is an extremely low probability that people will encounter these two words in future COVID-19-related RAs. Therefore, the traditional corpus-based computing approaches30, 46 again overestimated these tokens' importance level.
To conclude this section, the computation of tokens' importance level affects the analysis and development of big data management and processing, search engines, and other related AI industries. If the frequency value is the only criterion for ranking tokens' importance level, the assessment of their importance will be inaccurate and distorted. Hence, we propose the novel corpus-based approach of this paper, which integrates corpus software and the H-index algorithm to take tokens' frequency dispersion and concentration criteria into account simultaneously, thus handling the token-ranking issue accurately and comprehensively.
5 CONCLUSION
Traditional corpus-based computing methods still present some analytical doubts during corpus processing, for example, in refining corpus data, computing the frequency dispersion criterion, and computing the frequency concentration criterion. These may decrease corpus data processing efficiency and, more seriously, bias the evaluation of tokens' importance level, because the frequency value is the only indicator used for handling word-ranking issues in traditional corpus-based computing methods. Thus, to compensate for the blind side of the traditional methods, this paper proposed a novel corpus-based approach that integrates corpus software and the H-index algorithm to refine corpus data, to calculate tokens' frequency dispersion and concentration criteria, and further to handle word-ranking issues.
The significant contributions of the proposed approach are as follows: (1) the proposed approach is able to refine corpus data via machine processing to eliminate function and meaningless words; (2) the proposed approach is able to compute tokens' frequency dispersion criterion; moreover, when tokens share the same H-index values, their frequency values serve as the second ranking criterion, which makes the word-ranking process more accurate and avoids hesitation situations in the ranking process; and (3) the proposed approach is able to compute tokens' frequency concentration criterion, such as in cases where a token has high frequency values but is overconcentrated in certain RAs; here, H-index = 1 indicates that the H-index algorithm precisely evaluates a token's importance level, whereas frequency values overestimate a token's importance level and distort the ranking results. Moreover, in relation to textual analysis of COVID-19-related RAs, the proposed approach also helps native and EFL frontline healthcare personnel integrate and retrieve professional medical information and further improve their information processing efficiency.
This paper has a major limitation that awaits future research to overcome: without the assistance of existing software, the H-index computing process still relies on human processing; once the data become too voluminous, this places a great burden on data analysts. Hence, in terms of future perspectives, this paper suggests that future corpus-based and NLP research import the H-index algorithm into corpus programs (i.e., software) for processing massive textual data. This will enhance accuracy and efficiency in handling word-ranking issues and aid the accurate retrieval of essential words from massive textual data.
ACKNOWLEDGMENTS
The authors would like to thank the Ministry of Science and Technology, Taiwan, for financially supporting this study under Contract Nos. MOST 108-2410-H-145-001 and MOST 109-2410-H-145-002.
CONFLICT OF INTERESTS
The authors declare that there is no conflict of interest.
REFERENCES