Networked information flows in Asia:
the research uses of the Altavista search engine
and "weblinksurvey" software

by
Dr T. Matthew Ciolek,
Research School of Pacific and Asian Studies,
Australian National University, Canberra ACT 0200, Australia
tmciolek@coombs.anu.edu.au

a paper for the panel on
"Internet Research: Methodological Considerations in Assessing the Impact of the Internet in Asia",
"Internet Political Economy Forum 2001: Internet and Development in Asia",
The National University of Singapore, Singapore, September 14-15, 2001.

Document created: 21 Jul 2001. Last revised: 8 Jan 2002
Please see the Appendix added on 4 Jan 2002

0. Abstract

This methodological study deals with web resources located in ten Asian countries, namely China, Hong Kong, Indonesia, Japan, Malaysia, Philippines, Singapore, South Korea, Taiwan, and Thailand. The study shows how the Altavista search engine can be quickly queried by means of innovative, general-purpose (and easily modifiable) public-domain software to reveal trends in the flow of networked information. The paper discovers and describes a series of patterns of geographic preference/avoidance among the hypertext connections between the ten countries.

1. Introduction

This paper will not venture beyond the description of a new and effective method by which a large number of relationships between a large number of components of the global cyberspace can be tracked and measured. Also, it will not venture beyond a simple presentation of the collected statistical data describing East and Southeast Asian Internet. In view of the still experimental nature of this investigation, to proceed with the more advanced data analyses and discussions of results would be too hasty a decision.

We shall start this methodological inquiry by describing the cyberspace. The total body of web-based publicly accessible information can be viewed as a constellation of interlinked nodes. These informational nodes can be small and precise. They can be seen as individually addressed paragraphs of a document (i.e. "#" type html addresses within a particular web page), or the documents themselves, or whole bundles and clusters of documents stored in a thematic subdirectory or site. On the other hand, these nodes can just as easily be conceptualised as comprehensive and wide-ranging ones. In that case they would refer to the whole body of digital information published on one machine, or an entire network, or on an address defining a whole country. The country-wide addresses are standardised (RIPE 1997) and easily identifiable. For instance, Australian informational resources are those with an address "au", whereas Bangladeshi sites are distinguished by a code "bd". Naturally, the cyberspace is a borderless environment and, in principle, "Argentinean" resources, i.e. the ones which are published (or set up, owned, controlled etc.) by an Argentinean agency (whether an individual or an organisation) may be, in fact, situated in, say, Norway, whereas many of the "Norwegian" sites might be physically operating from France, New Zealand, Upper Volta and so forth. However, it is safe to assume that the bulk of the Internet material addressed by a particular country code is situated within the geographical boundaries of that country.

The nodes are connected with each other by means of hypertext links. Sometimes these connections are irregular and sparse; sometimes they are plentiful and extensive. Some of the connections are overt, in that they are established on publicly accessible web pages. Other connections can be hidden, residing on web pages kept on a machine (or cluster of machines) situated behind a firewall. Also, they might be established on people's individual web browsers in the form of personal collections of "bookmarks". However, since only the public material can be subject to direct and public inquiry, the existence (or non-existence) of these hidden nodes and links cannot be ascertained. Therefore, for the rest of this paper, the privately stored bookmarks and web pages residing on various "intranets" and behind firewalls will be consciously disregarded.

The hypertext links which span public online resources always have a direction. Links which lead to a node are the incoming links. The links which leave a node, i.e. point to some place in the cyberspace, form the outgoing links. Nodes which are linked to, form "targets" of electronic attention. Conversely, nodes which generate outgoing links form "sources" of such an attention. Naturally, any node can act simultaneously as the source and the target of the hypertext connections. Finally, nodes with more incoming links than the outgoing ones can be dubbed as the "information exporters." This is so because they represent places from where online information is siphoned. By the same token, nodes which act more as the sources of hypertext connections than as the targets can be considered as "information importers."
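The importer/exporter distinction drawn above can be expressed as a simple comparison of two counts. The following is a minimal sketch only: the function name `role`, the label "balanced" for the case of equal counts, and the figures used are illustrative assumptions and do not come from the study.

```python
def role(incoming_links, outgoing_links):
    """Classify a node by the balance of its hypertext links.

    A node that is linked to more often than it links out is treated as an
    "information exporter"; the reverse case is an "information importer".
    The "balanced" label for equal counts is an assumption of this sketch.
    """
    if incoming_links > outgoing_links:
        return "information exporter"
    if outgoing_links > incoming_links:
        return "information importer"
    return "balanced"


# Invented figures, for illustration only.
print(role(incoming_links=900, outgoing_links=150))
print(role(incoming_links=40, outgoing_links=700))
```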

It is obvious that, for any informed discussion of political and economic aspects of electronic communications in Asia, we need to know how intensively and in what manner web resources of various countries interact with one another. It is vital that we have this information in addition to the data on numbers of networked computers (Internet Software Consortium 2001) installed in various countries; the number of web pages world-wide containing various "Asia-related" keywords (Ciolek 1998); as well as the numbers of phone lines, mobile phones, radios, TV-sets and Internet Service Providers (ISP) per capita, or per country (CIA Central Intelligence Agency 2000, World Bank nd.). In other words, in addition to the economic, demographic and electronic vital-statistics of various places and regions, we also need to have the means to determine who is watching whom, and in which direction the networked data tends to flow.

However, at first glance a dependable answer to the last two questions is difficult to secure. The universe of potential data to be sampled and analysed is not only very large but also volatile and very complex. Indeed, in January 2001 Asia as a whole comprised nearly 7.2 million networked computers (Internet Software Consortium 2001). At the same time, Internet users from all parts of the world generated at least 1,345 million web pages (Google 2001). Certainly, systematic research into that massive tangle of hypertext nodes and links appears to be a major operation.

Fortunately, as this paper will show, there is a simple yet effective method of gathering extensive statistics on the relationships between any group of Web nodes, whether in Asia or anywhere else in the world.

2. The Method

2.1. The research uses of Altavista search engine

The Altavista search engine (www.altavista.com) is one of the biggest (Bharat and Broder 1998; Compaq Co. 2001), most powerful and most widely known WWW search engines. In addition to being very fast and fairly comprehensive, Altavista offers a free online tool (Ciolek 2000) for identification of the informational relationships between a diverse range of networked nodes. This tool is unique to Altavista and is not available from other search engines such as AllTheWeb (www.alltheweb.com), Excite (www.excite.com), Google (www.google.com), HotBot (hotbot.lycos.com) or Lycos (search.lycos.com).

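The Altavista facility consists of two simultaneous commands, both of which are entered into the search engine's query box. These commands are, in their most generic form: "link:argument -host:argument". The above expression means: find all pages which contain a hypertext link to the address named by the first argument, but leave out those pages which reside on the host named by the second argument.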

The Altavista's "link:argument -host:argument" command can take at least thirty-six (36) specific forms. This is because an address (URL) of an online resource (such as a web page, web site, an archive, or a database) has at least six component parts (or seven, if we take into account the "#" category of addresses within a web page itself, and more if the full range of nested subdirectory addresses is considered). For instance, the computer address: observes, like all other computer addresses, the following global convention:

This means that the Altavista's "link:argument -host:argument" command may deal with questions about very specific and precise targets and sources, or very broad targets and sources, or any other of the 36 combinations of the nodes. The most common application of the Altavista's "link: -host:" command is a query concerning the number and details of web pages with a hypertext link to a particular online resource. Such a query measures the degree of "online presence" exercised by the node in question. The query helps, therefore, with comparisons of online visibility of various web resources. Secondly, the query can be used to determine the identities of sources making a link to a target node under one's jurisdiction, so that they can be notified (if necessary) about planned changes to the site's address or to policies governing access to its contents.

However, in addition to these two everyday uses of the Altavista's commands, there are also ones which include a timeless pair of geopolitical queries: (a) "who is watching whom", and (b) "how intensively this watching takes place".

2.2. An algorithm for collection of Altavista's data

In order to determine which of the two or more nodes (or sets of nodes), acts as an importer (or exporter) of networked information, one needs to carry out a sequence of distinct but intertwined operations. This is because raw answers provided by the Altavista need to be recalculated before they are fully usable for our purposes.

For instance, the following four steps determine the number of Singapore-based pages with links to web resources based in Singapore itself:

For studies involving several nodes, the range of the operations needs to be greatly widened. Firstly, two lists of countries have to be declared: one for the "targets"; and another for the "sources." Secondly, one needs to run through all the permutations between the contents of these two lists. Thirdly, results have to be sorted, annotated and printed out. Once such an output is created and saved in a file (see Section 3.3), the data can be put into a spreadsheet and subjected to further processing and, finally, analysis and assessment.
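The multi-node procedure (two lists, all permutations, sorted output) can be sketched as follows. This is an illustration in Python and not the WEBLINKSURVEY perl script itself. The callable `count_pages` is a hypothetical stand-in for whatever routine submits a query and reads off the number of pages found, and the recalculation used here (all pages linking to the target, minus those remaining once the source's hosts are excluded) is an assumption about how the raw answers are combined.

```python
from itertools import product


def survey(targets, sources, count_pages):
    """Return sorted rows of (target, source, pages) for every pair.

    count_pages(query) is a hypothetical callable which returns the number
    of pages reported by the search engine for the given query string.
    """
    rows = []
    for target, source in product(targets, sources):
        # Assumed recalculation: pages linking to the target, minus the
        # pages left after those hosted at the source are excluded.
        all_pages = count_pages(f"link:{target}")
        not_from_source = count_pages(f"link:{target} -host:{source}")
        rows.append((target, source, all_pages - not_from_source))
    return sorted(rows)


# Canned answers standing in for live queries (invented numbers).
fake = {
    "link:.sg": 500, "link:.sg -host:.sg": 100, "link:.sg -host:.my": 480,
    "link:.my": 300, "link:.my -host:.sg": 240, "link:.my -host:.my": 50,
}

for row in survey([".sg", ".my"], [".sg", ".my"], fake.__getitem__):
    print(*row, sep="\t")
```

Passing the query routine in as an argument keeps the bookkeeping separate from the network access, so that the permutation logic can be checked against canned answers before any live polling takes place.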

The number of individual steps to be taken is a function of the number of "targets" multiplied by the number of "sources". Thus a tiny, say 3x3, investigation involves about 36 operations (i.e. 3x3x4 steps), whereas a bigger one, say 12x12, means at least 576 operations (i.e. 12x12x4 steps). Assuming that, on average, one manual operation takes exactly one minute, a hypothetical 3x3 study would take about 36 minutes. On the same assumption a bigger study would require at least 576 minutes, or 9 hours and 36 minutes. Obviously, the more extensive the study, the greater the benefit from automation of the whole procedure, especially if the study is to be conducted on a repetitive basis.
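The arithmetic behind these estimates is easily reproduced. In the sketch below the function names are illustrative, while the four steps per pair and the one minute per manual operation are the working assumptions stated in the text.

```python
def operations(n_targets, n_sources, steps_per_pair=4):
    """Number of individual operations for a targets x sources study."""
    return n_targets * n_sources * steps_per_pair


def as_hours_minutes(minutes):
    """Split a duration in minutes into whole hours and remaining minutes."""
    return divmod(minutes, 60)


print(operations(3, 3))                        # 36
print(operations(12, 12))                      # 576
print(as_hours_minutes(operations(12, 12)))    # (9, 36)
```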

2.3. Data collection schedules

The first dataset which used this algorithm was gathered by a manual procedure on 15th December 2000. The pilot sample covered China, HK, Japan, South Korea and Taiwan, as well as Australia, New Zealand, Indonesia, Malaysia, Singapore, Thailand, and the Philippines. The data obtained formed a 12x12 matrix with 144 cells. From that matrix, a smaller one (10x10) was extracted for the purpose of this paper. This sample provided 3,445,426 datapoints describing relationships between all countries mentioned in this pilot sample, except Australia and New Zealand.

A tabulation of contents of the December 2000 sample made clear that the Altavista database is worth regular tapping. However, manual investigations proved to be a time consuming and labour intensive enterprise. It was important therefore that the process of data gathering and sorting was automated as much as possible. Accordingly, in February and March 2001 this author designed software which was then written by Mr Brian Collins of the ANU Research School of Pacific and Asian Studies (RSPAS) IT Services (see Section 3.1).

Following trial runs, the first bug-free, automated data-collection session was run on 8th June 2001. It dealt with all 50 Asian countries. From that large dataset, a smaller one was extracted. The small set formed a 10x10 matrix of 6,262,142 datapoints describing relationships between all ten Asian countries previously stated.

2.4. Characteristics of the Altavista database

The differences in the number of hypertext links existing at various times between the same ten countries (December 2000 - 3,445,426; June 2001 - 6,262,142) showed that the Altavista database is a dynamic environment. Its contents keep changing. The number of pages with links known to the database grows and shrinks as a result of ongoing transformations in the world's cyberspace. Inevitably, questions arise as to how often and how extensively this database changes.

Fortunately, Altavista can be queried to provide a numeric answer to this question. Three consecutive measurements were taken on the 8th, 9th and 19th of June 2001, with each looking at the same set of 100 country-to-country relationships. Two comparisons (each involving 100 pairs of countries) showed that there was no difference between the data collected on the 8th and 9th of June, but a 100% difference between the data from the 9th and 19th of June. In other words, during this 11-day span, the data recorded by Altavista changed 100 out of 200 possible times. This would suggest an average rate of change of 4.5% of the data holdings per day. Also, out of the 100 recorded variations, 48% involved a drop in the number of hypertext links between pairs of countries, and 52% involved an increase in the volume of links. Overall, the number of pages with links recorded for the ten studied countries had grown by approximately 1%: from 6,265,330 to 6,330,301.

A second comparison, between the data collected on the 19th and 25th of June 2001, showed, again, a 100% difference between the two datasets, with 58% of the cases indicating a decrease in the number of pages with links, and 42% showing an increase. These developments took place over a period of six days and suggest a 16.7% average daily rate of change. Also in the second comparison, the number of pages with links recorded for the studied countries decreased by 9%, from 6,330,301 to 5,751,622.
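The rates quoted in this section follow from two simple formulas, sketched below. The function names are illustrative; the input figures are those reported above.

```python
def daily_rate(changed, possible, days):
    """Average percentage of observed relationships that changed per day."""
    return 100.0 * changed / possible / days


def percent_change(before, after):
    """Relative change between two totals, in per cent."""
    return 100.0 * (after - before) / before


# 8th to 19th of June: 100 changes out of 200 possible, over 11 days.
print(round(daily_rate(100, 200, 11), 1))
# 19th to 25th of June: 100 changes out of 100 possible, over 6 days.
print(round(daily_rate(100, 100, 6), 1))
# Total number of pages with links on the 19th and the 25th of June.
print(round(percent_change(6330301, 5751622), 1))
```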

Therefore, we may conclude that any study of the cyber-relationships between a given set of nodes needs to be conducted (a) as a series of repeated observations rather than as a one-off measurement; (b) over a substantial period of time; and (c) with the data collected at various times averaged (unless changes over time are themselves the object of study), so that a generalised, more synthetic picture can be obtained.

This is the course adopted by this paper.

3. The software for automated data collection

3.1. Software

The WEBLINKSURVEY program runs on a Unix machine connected to the Internet. It is a short, plain-text (ASCII) file with a perl script kept in a subdirectory on a user's account. Within the Unix environment the WEBLINKSURVEY file(s) can be easily copied (using 'cp' command) and edited (using 'vi' command). Naturally, all program files need to be in the executable (i.e. -rwxr----- ) mode.

The software is archived on the Coombspapers anonymous FTP site at ftp://coombs.anu.edu.au/coombspapers/otherarchives/soc-science-software/.
The WEBLINKSURVEY program can be freely copied and used for private research purposes as long as the associated copyright and credits notes are retained. For any other uses please first contact the copyright holders.

The following is the listing of the WEBLINKSURVEY program.

3.2. File management issues

The program file can be given any name. However, for ease of management, the file names should always accurately reflect the exact purpose of each variant copy of the software. For instance, a file named "weblinksurvey-fullasia" may be used to denote a program for the study of all permutations existing between all countries of Asia, whereas a file named "weblinksurvey-aejpkr" may be a good name for a program dealing with the relationships between three countries such as the United Arab Emirates ("ae"), Japan ("jp") and South Korea ("kr").

Equally important is an adequate naming convention for the output files. For instance, a file named "fullasia-011002.txt" indicates that it contains an ASCII output from the program "weblinksurvey-fullasia" run on the 2nd of October 2001. Likewise, a file named "aejpkr-030712.txt" should contain results of the software "weblinksurvey-aejpkr" activated on the 12th of July 2003.
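The naming convention for output files can be generated mechanically from the program name and the date of the run. The helper below is a hypothetical illustration and is not part of the WEBLINKSURVEY software.

```python
from datetime import date


def output_name(program_name, run_date):
    """Build an output file name such as "fullasia-011002.txt".

    The study label is whatever follows the "weblinksurvey-" prefix of the
    program name; the date is written as two-digit year, month and day.
    """
    prefix = "weblinksurvey-"
    if program_name.startswith(prefix):
        study = program_name[len(prefix):]
    else:
        study = program_name
    return f"{study}-{run_date:%y%m%d}.txt"


print(output_name("weblinksurvey-fullasia", date(2001, 10, 2)))
print(output_name("weblinksurvey-aejpkr", date(2003, 7, 12)))
```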

The output files are generated by running a specific copy of our software, e.g.

The time needed for completing the software run depends on a number of factors. They are: the overall processing power of the Unix machine in question, the congestion of the network, the overall busyness of the www.altavista.com site, and the scope of operations requested by our program. Naturally, short analyses run quicker (taking no more than a few minutes) than the lengthy ones. For example, a run of the "weblinksurvey-fullasia" program which looks at all 50x50 permutations of links between Asian countries takes between 20 and 40 minutes.

Occasionally, a previously tried and well-behaving program can fail by not producing any output at all. If this is the case, then the most likely reason is that changes were made to the format of the Altavista's results page itself. That page tends to undergo sporadic minor modifications. For example, within a short period between mid-March and mid-June 2001 the Altavista results were re-phrased in three different ways: "pages found X", "We found about X results:" and "We found X results:". When this happens, the remedy is simple. To make the WEBLINKSURVEY program operational again, identify the exact wording currently in use by Altavista and modify the appropriate line in the perl script accordingly.
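One way of softening this problem is to keep all the known wordings in a single list, so that a new phrasing requires only one additional entry. The sketch below is written in Python rather than perl and is not the WEBLINKSURVEY code. It assumes that the count follows the quoted wording as a run of digits, possibly with comma separators, and the sample page fragments are invented.

```python
import re

# One pattern per wording of the results line seen so far.
PATTERNS = [
    re.compile(r"pages found\s+([\d,]+)", re.IGNORECASE),
    re.compile(r"We found about\s+([\d,]+)\s+results", re.IGNORECASE),
    re.compile(r"We found\s+([\d,]+)\s+results", re.IGNORECASE),
]


def extract_count(page_text):
    """Return the reported number of pages, or None if no wording matches."""
    for pattern in PATTERNS:
        match = pattern.search(page_text)
        if match:
            return int(match.group(1).replace(",", ""))
    return None


print(extract_count("We found about 6,262 results:"))
print(extract_count("an unrecognised results page"))
```

Returning None for an unrecognised page, instead of failing silently, makes it obvious that the wording has changed again.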

A word of caution, though. If these small but crucial changes to the format of Altavista's results page are, indeed, defence mechanisms protecting Altavista from mass automated polling, it is important that the scholarly community which uses the WEBLINKSURVEY program does so sparingly and tactfully. It is important that our method of using the Altavista does not lead to the closure of the database's "link:argument -host:argument" research facility.

3.3. Output

The content and length of the output file vary from one research question to another. However, all output files follow the same format (see below for a fragment of the "tenasia-010608.txt" file). The first column lists the identity, i.e. the whole URL (or the key fragment of it), of the target node. The second column lists the identity of the source node. The third column lists the number of web pages with hypertext links originating in the source location and leading to the target in question.

An example of the "tenasia-010608.txt" output file:

4. Networked information flows in Asia

4.1. The WEBLINKSURVEY data

Results listed below are based on 4.8 million data points averaged from two consecutive studies conducted in December 2000 and June 2001.

Table 1 provides a detailed breakdown on the direction and volume of hypertext links connecting web sites in ten Eastern Asian countries. Table 2 summarises these data for each of the countries involved into two categories: pages with "domestic" vs. "international" links. Table 3 groups and analyses data from Table 1 in order to determine the relative intensities with which the ten studied Asian countries monitor each other by means of "international" hypertext links. Finally, Tables 4 and 5 express values from Table 1 as percentages.
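The derivation of the summary tables from the raw matrix can be sketched as follows. The function names are illustrative, and the example uses only a three-country fragment of Table 1 (Japan, Singapore, Malaysia), so its percentages differ from those in Tables 2, 4 and 5, which are computed over all ten countries.

```python
def domestic_and_international(matrix):
    """For each source, pages with domestic and with international links."""
    result = {}
    for source, row in matrix.items():
        international = sum(n for target, n in row.items() if target != source)
        result[source] = (row[source], international)
    return result


def percent_by_destination(matrix):
    """Row percentages (as in Table 5): where a source's international links lead."""
    out = {}
    for source, row in matrix.items():
        total = sum(n for target, n in row.items() if target != source)
        out[source] = {target: round(100 * n / total)
                       for target, n in row.items() if target != source}
    return out


def percent_by_origin(matrix):
    """Column percentages (as in Table 4): where a target's incoming links come from."""
    out = {}
    for target in matrix:
        total = sum(row[target] for source, row in matrix.items() if source != target)
        out[target] = {source: round(100 * row[target] / total)
                       for source, row in matrix.items() if source != target}
    return out


# links[source][target]: a three-country fragment of Table 1.
links = {
    "JP": {"JP": 1641358, "SG": 16770, "MY": 14242},
    "SG": {"JP": 13008, "SG": 495523, "MY": 63974},
    "MY": {"JP": 3106, "SG": 7893, "MY": 256484},
}

print(domestic_and_international(links)["SG"])
print(percent_by_destination(links)["SG"])
print(percent_by_origin(links)["MY"])
```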

Table 1. Origins and destinations of WWW links in ten Asian countries
("domestic" links lie on the diagonal *)

WWW pages   Pages with links to:
from:               JP        TW        KR        HK        SG        CN        MY        TH        ID        PH     TOTAL
Japan        1,641,358    12,066    11,612    11,853    16,770     8,840    14,242    24,334    38,740    11,959 1,791,771
Taiwan           3,637   193,271     4,668     1,827     2,224     4,240     2,392     1,591    10,604       968   225,420
Sth Korea        9,811     3,819   443,699     5,506     3,772     4,516     4,892     3,258    26,107     4,073   509,449
HK               9,066     8,452     4,001   457,496    34,300     7,111    10,989     1,934    10,527     2,072   545,945
Singapore       13,008     8,188     6,494    13,137   495,523     9,021    63,974    11,578    48,880     8,594   678,395
China            4,330     4,432     3,021     5,741     1,107   101,772     2,128       462    12,521       325   135,837
Malaysia         3,106     1,457     4,433     1,216     7,893     2,728   256,484     1,337    33,690     1,493   313,836
Thailand         3,984     1,025       825       922     3,846     1,059     2,085   217,430    15,585     2,594   249,352
Indonesia        3,196       464       319       349     4,673     1,875     2,227     4,875   226,164     2,829   246,969
Philippines      1,268       791       988       756     2,722     1,919     1,136     3,287    10,465   136,283   159,612
TOTAL        1,692,762   233,962   480,056   498,799   572,828   143,079   360,548   270,085   433,281   171,186 4,856,584

* "Domestic" links span nodes situated in the same country

Table 2. Proportion of "domestic" and "international" WWW links in ten Asian countries
(based on Table 1, values rounded to the nearest percent) *

WWW pages   Pages with     %      Pages with     %       Pages      %
from:         domestic         international             TOTAL
                 links                 links
Japan        1,641,358   92%         150,414    8%   1,791,771   100%
Taiwan         193,271   86%          32,149   14%     225,420   100%
Sth Korea      443,699   87%          65,751   13%     509,449   100%
HK             457,496   84%          88,449   16%     545,945   100%
Singapore      495,523   73%         182,872   27%     678,395   100%
China          101,772   75%          34,065   25%     135,837   100%
Malaysia       256,484   82%          57,352   18%     313,836   100%
Thailand       217,430   87%          31,922   13%     249,352   100%
Indonesia      226,164   92%          20,805    8%     246,969   100%
Philippines    136,283   85%          23,330   15%     159,612   100%
TOTAL        4,169,478   86%         687,107   14%   4,856,584   100%

* "International" links span nodes situated in two different countries

Table 3. Destinations of "international" WWW links in ten Asian countries
(based on Table 1, values rounded to the nearest percent) *

Destinations:                                  JP       TW       KR       HK       SG       CN       MY       TH       ID       PH    TOTAL
WWW pages with international links to:     51,405   40,691   36,358   41,304   77,305   41,307  104,064   52,655  207,117   34,904  687,107
% of TOTAL                                     7%       6%       5%       6%      11%       6%      15%       8%      30%       5%     100%

* "International" links span nodes situated in two different countries

Table 4. The volume of "international" WWW links by their origin and destination in ten Asian countries
(based on Table 1, values rounded to the nearest percent) *

WWW pages   Target of links to:
from:           JP    TW    KR    HK    SG    CN    MY    TH    ID    PH
Japan            x   30%   32%   29%   22%   21%   14%   46%   19%   34%
Taiwan          7%     x   13%    4%    3%   10%    2%    3%    5%    3%
Sth Korea      19%    9%     x   13%    5%   11%    5%    6%   13%   12%
HK             18%   21%   11%     x   44%   17%   11%    4%    5%    6%
Singapore      25%   20%   18%   32%     x   22%   61%   22%   24%   25%
China           8%   11%    8%   14%    1%     x    2%    1%    6%    1%
Malaysia        6%    4%   12%    3%   10%    7%     x    3%   16%    4%
Thailand        8%    3%    2%    2%    5%    3%    2%     x    8%    7%
Indonesia       6%    1%    1%    1%    6%    5%    2%    9%     x    8%
Philippines     3%    2%    3%    2%    4%    5%    1%    6%    5%     x
Intl. TOTAL   100%  100%  100%  100%  100%  100%  100%  100%  100%  100%

* "International" links span nodes situated in two different countries

Table 5. The volume of "international" WWW links by their destination and origin in ten Asian countries
(based on Table 1, values rounded to the nearest percent) *

WWW pages   Target of links to:                                        Intl.
from:           JP    TW    KR    HK    SG    CN    MY    TH    ID    PH TOTAL
Japan            x    8%    8%    8%   11%    6%    9%   16%   26%    8%  100%
Taiwan         11%     x   15%    6%    7%   13%    7%    5%   33%    3%  100%
Sth Korea      15%    6%     x    8%    6%    7%    7%    5%   40%    6%  100%
HK             10%   10%    5%     x   39%    8%   12%    2%   12%    2%  100%
Singapore       7%    4%    4%    7%     x    5%   35%    6%   27%    5%  100%
China          13%   13%    9%   17%    3%     x    6%    1%   37%    1%  100%
Malaysia        5%    3%    8%    2%   14%    5%     x    2%   59%    3%  100%
Thailand       12%    3%    3%    3%   12%    3%    7%     x   49%    8%  100%
Indonesia      15%    2%    2%    2%   22%    9%   11%   23%     x   14%  100%
Philippines     5%    3%    4%    3%   12%    8%    5%   14%   45%     x  100%

* "International" links span nodes situated in two different countries

4.2. A summary of uncovered patterns

The collected data are preliminary, but even at this early stage they already deserve attention.

Table 1 shows that electronic information residing on web servers is neither uniformly distributed across the countries of Eastern Asia, nor is it uniformly linked to. Japan produces the largest volume of hypertext links (1,791,771 or 37% of the total sample). It is followed by Singapore (678,395 or 14% of the total) and Hong Kong (545,945 or 11% of the total). Among the ten studied countries, the least prolific producers (within the studied sample, of course) of web links are: China (135,837 or 3% of the total), Philippines (159,612 or 3% of the total) and Taiwan (225,420 or 5% of the total).

Table 2 shows that there are distinct differences in the way Eastern Asian countries treat cyberspace. Some countries tend to be more self-centred than others. The most self-sufficient countries (in terms of online information) are: Japan (international hypertext connections form only 8% of the total links), Indonesia (8%), Thailand (13%) and South Korea (13%). At the same time there are countries which are less concerned with their own online resources and which monitor the neighbourhood's cyberspace with zest. These are: Singapore (information-importing links form as much as 27% of the total links), China (25%) and Malaysia (18%).

Table 3 shows that the object of countries' electronic attention can be surprisingly narrow. Among the ten studied countries Indonesia attracts 30% of all pages with international (i.e. information-gathering) links generated by the other nine countries. The respective figure for Malaysia was 15%, and Singapore 11%. The least watched, the least linked-to countries in our sample were: Philippines (5%), South Korea (5%) and Taiwan, HK and China (6% each).

The fine-grain details of the above trends are presented in Tables 4 and 5. These reveal a chain of interactions between countries of the region. For instance, Table 4 shows that the largest importer of Thailand's online information is Japan (46% of regional hypertext linkages to Thailand originate on web sites from Japan). At the same time, the largest importer of Japanese online information is Singapore (25% of regional hypertext links to Japan originate in Singapore) while the largest importer of the Singaporean web material is Hong Kong (44% of all pages from the nine Eastern Asian countries with hypertext links to Singapore reside in HK).

An additional perspective on the above pattern is cast by Table 5. There, one can see that Thailand constitutes 16% of all international destinations originating in Japan. In turn, Japan attracts only 7% of international links established on Singaporean pages, whereas Singapore is the target of a hefty 39% of web pages from Hong Kong (all of it happening while Singapore directs only 7% electronic links towards Hong Kong).

Further patterns also exist. Table 5 summarises the way countries interact electronically with China: Taiwan is more attentive to China's information resources (13%) than is Hong Kong (8%), South Korea (7%), Japan (6%) or Singapore (5%). Also, for reasons currently unknown, Indonesia points only 9% of its outgoing links in the direction of China, whereas 37% of China's web pages with international links point to Indonesia.

In other words (assuming that informational resources in both countries are of comparable size, and accessed with similar frequencies), online information seems to flow from Indonesia to China four times more strongly than in the reverse direction.

4.3. Conclusions

The political, economic or anthropological ramifications of the patterns present in Tables 1-5 remain outside the scope of this paper. Similarly, it is still too early to attempt to calculate correlations between our data and the existing cross-national social, economic or political indices such as those used (or described) by the World Bank (nd), Thede (nd), the Human Development Report Office (2001), or Saltman (2001).

Prior to any such investigations further evaluation and refining of the WEBLINKSURVEY methodology needs to take place. Nevertheless, the work conducted so far indicates that:

5. Appendix of 04 Jan 2002

Weblink surveys carried out since completion of this paper reveal that the Altavista's "link:argument -host:argument" command, when applied to the country code Top Level Domains (ccTLD), generates a certain amount of spurious data. This is due to the fact that the command is not sensitive enough. Such a command, say, "host:.id" (= list all the hosts associated with the Indonesian country code) lists:
  1. hosts associated with the Top Level Domain ".id" (e.g. "bandung.linux.or.id", "cyberjob.cbn.net.id");
  2. hosts in other Top Level Domains bearing a name commencing with letters "id" (e.g. "id.zaigen.co.kr", "id.lycosasia.com", or "id.dorfschenke.de");
  3. hosts in other Top Level Domains bearing a name containing letters "id" (e.g. "www-id.imag.fr");
  4. hosts residing on networks whose name commences with letters "id" (e.g. "www2.state.id.us", "submarine.id.ru").
The first set of data is correct, the next three are not. The ratio between the correct and incorrect sets of information depends on the actual combination of letters forming the country code. Certain combinations, such as "ID" (Indonesia), "TV" (for Tuvalu), or "OM" (for Oman) are more often used as a part of a host-name or network-name than such codes as "JP" (for Japan) or "CN" (for China). This is illustrated by Table 6 below.
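The distinction between the first, correct category and the remaining three can be tested mechanically, by checking that the host name ends with the country code as its final label. The sketch below uses the example host names quoted above; the function name is illustrative, and the check is a stricter test applied after retrieval, not a description of Altavista's own matching.

```python
def belongs_to_cctld(host, code):
    """True if the host name ends with the given country code as its last label."""
    return host.lower().rstrip(".").endswith("." + code.lower())


hosts = [
    "bandung.linux.or.id", "cyberjob.cbn.net.id",                # category 1
    "id.zaigen.co.kr", "id.lycosasia.com", "id.dorfschenke.de",  # category 2
    "www-id.imag.fr",                                            # category 3
    "www2.state.id.us", "submarine.id.ru",                       # category 4
]

for host in hosts:
    print(host, belongs_to_cctld(host, "id"))
```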

Table 6. Percentage of Asian hosts retrieved via a given country code*
found to be associated with that country

Country / Country code / % of hosts with 'correct' association
Japan jp 100
Kazakhstan kz 98
China cn 96
Pakistan pk 88
Turkey tr 88
Kyrgyzstan kg 86
Cambodia kh 86
Thailand th 76
Lebanon lb 68
Indonesia id 66
Israel il 64
South Korea kr 64
Philippines ph 58
Taiwan tw 58
Uzbekistan uz 58
Yemen ye 56
Brunei bn 56
Jordan jo 54
Tajikistan tj 52
Bahrain bh 50
Vietnam vn 48
Singapore sg 48
Hong Kong hk 44
Sri Lanka lk 42
Armenia am 36
Malaysia my 32
Kuwait kw 30
Qatar qa 26
Iran ir 20
United Arab Emirates ae 20
Nepal np 16
Georgia ge 12
India in 10
Azerbaijan az 10
Maldives mv 6
Oman om 6
East Timor tp 4
Bhutan bt 4
Turkmenistan tm 2
Saudi Arabia sa 2
Burma mm 2
Mongolia mn 0.5
Macao mo 0.5
Bangladesh bd 0.5
Syria sy 0.5
Laos la 0.5
Indian Ocean Territories io 0.5
Iraq iq 0.5
North Korea kp 0.5
Afghanistan af 0.5

* Values are rounded to the nearest per cent; values of less than 1 per cent are rounded up to 0.5%.
Sample: five sets of the ten URLs listed by Altavista (www.altavista.com) on its output pages number 1, 5, 9, 13 and 17 in response to a query "host:ccTLD", e.g. "host:.kr" or "host:.jp". The survey of details of the 50 URLs for each of the 50 Asian countries/territories was carried out on 26 Dec 2001.

Moreover, there is good reason to suspect that similar percentages apply also to measurements obtained via the "link:argument" command. This in turn means that all originally reported values in Tables 1 to 5 need to be re-calculated to reflect:

This is so because the final, corrected numbers are a function of both specific sources AND specific destinations. For example, in a hypothetical case of 400 hypertext links found in an Altavista query "link:argument host:argument" for two countries, say, China and Japan

From/To CN JP
CN 100 100
JP 100 100

the final values will be a result of the multiplication of the original score by the percentage of China's CN sites as well as by that of Japan's JP sites. This means that the adjusted values, using data from Table 6, will read:

From/To CN JP
CN 92 (=100*0.96*0.96) 96 (=100*0.96*1)
JP 96 (=100*1*0.96) 100 (=100*1*1)
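The adjustment illustrated above can be written as a short routine. The function and variable names are illustrative; the shares of 'correct' hosts are the Table 6 values for China (96%) and Japan (100%), and the raw counts are the hypothetical ones used in the example.

```python
def adjust(raw_counts, correct_share):
    """Scale each source/target count by both countries' 'correct' shares."""
    return {
        (source, target): round(n * correct_share[source] * correct_share[target])
        for (source, target), n in raw_counts.items()
    }


correct_share = {"cn": 0.96, "jp": 1.00}   # from Table 6
raw_counts = {
    ("cn", "cn"): 100, ("cn", "jp"): 100,
    ("jp", "cn"): 100, ("jp", "jp"): 100,
}

adjusted = adjust(raw_counts, correct_share)
print(adjusted[("cn", "cn")])   # 92
print(adjusted[("cn", "jp")])   # 96
print(adjusted[("jp", "cn")])   # 96
print(adjusted[("jp", "jp")])   # 100
```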

Finally, it appears that the above problems stemming from the use of the "link:argument -host:argument" command at the level of whole countries can be partially alleviated by the use of a command "link:argument -domain:argument."

When the new query syntax is used, we need to worry only about calculating the reduced size of the target area.

6. Acknowledgements

My thanks are due to Ms Ann Andrews for her useful comments on the first draft of this paper.

7. References

[The great volatility of online information means that some of the URLs listed below may change by the time this article is printed. The date in round brackets indicates the version of the document in question. For current pointers please consult the online copy of this paper at http://www.ciolek.com/PAPERS/weblinksurvey2001.html]

8. Version and Change History



Maintainer: Dr T.Matthew Ciolek (tmciolek@ciolek.com)

Copyright (c) 2001 by T.Matthew Ciolek. All rights reserved. This Web page may be freely linked to other Web pages. Contents may not be republished, altered or plagiarized.


URL http://www.ciolek.com/PAPERS/weblinksurvey2001.html
