Networked information flows in Asia:
the research uses of the Altavista search engine
and "weblinksurvey" software
by
Dr T. Matthew Ciolek,
Research School of Pacific and Asian Studies,
Australian National University, Canberra ACT 0200, Australia
tmciolek@coombs.anu.edu.au
a paper for the panel on
"Internet Research: Methodological
Considerations in Assessing the Impact of the Internet in Asia",
"Internet Political Economy Forum 2001: Internet and
Development in Asia",
The National University of Singapore, Singapore, September
14-15, 2001.
Document created: 21 Jul 2001. Last revised: 8 Jan 2002
Please see the Appendix added on 4 Jan 2002
0. Abstract
This methodological study deals with web resources located in ten
Asian countries, namely China, Hong Kong, Indonesia, Japan, Malaysia,
Philippines, Singapore, South Korea, Taiwan, and Thailand. The study
shows how the Altavista search engine can be quickly queried by means of
innovative, general-purpose (and easily modifiable) public-domain
software to reveal trends in the flow of networked information. The
paper discovers and describes a series of patterns of geographic
preference/avoidance among the hypertext connections between the ten
countries.
1. Introduction
This paper will not venture beyond the description of a new and
effective method by which a large number of relationships between a
large number of components of the global cyberspace can be tracked and
measured. Nor will it venture beyond a simple presentation of
the collected statistical data describing the East and Southeast Asian
Internet. In view of the still experimental nature of this
investigation, it would be premature to proceed with more advanced data
analyses and discussions of results.
We shall start this methodological inquiry by describing the cyberspace.
The total body of web-based publicly accessible information can be viewed as a
constellation of interlinked nodes. These informational
nodes can be small and precise. They can be seen as individually addressed
paragraphs of a document (i.e. "#" type html addresses within a
particular web page), or the documents themselves, or whole bundles
and clusters of documents stored in a thematic subdirectory or site.
On the other hand, these nodes can be easily conceptualised as
comprehensive and wide-ranging ones. In that case they would refer to
the whole body of digital information published on
one machine, or an entire network, or on an address defining a whole
country. The country-wide addresses are standardised (RIPE 1997) and easily identifiable.
For instance, Australian informational resources are those with
an address "au",
whereas Bangladeshi sites are distinguished by a code "bd".
Naturally, the cyberspace is a borderless environment and, in
principle, "Argentinean" resources i.e. the ones which
are published (or set up, owned, controlled etc.) by an Argentinean
agency (whether an individual or an organisation) may be, in fact,
situated in, say, Norway, whereas many of the "Norwegian" sites
might be physically operating from France, New Zealand, Upper Volta and so
forth. However, it is safe to assume that the bulk of the Internet
material addressed by a particular country code is situated within
the geographical boundaries of that country.
The nodes are connected with each other by means of hypertext
links. Sometimes these connections are irregular and sparse;
sometimes they are plentiful and extensive. Some of the connections
are overt, in that they are established on publicly accessible web
pages. Other connections can be hidden, residing on web pages kept on
a machine (or cluster of machines) situated behind a firewall. Also,
they might be established on people's individual web browsers in the
form of personal collections of "bookmarks". However, since only the
public material can be subject to direct and public inquiry, the
existence (or non-existence) of these hidden nodes and links cannot be
ascertained. Therefore, for the rest of this paper, the privately
stored bookmarks and web pages residing on various "intranets" and
behind firewalls will be consciously disregarded.
The hypertext links which span public online resources always have a
direction. Links which lead to a node are the incoming links. The
links which leave a node, i.e. point to some place in the cyberspace,
form the outgoing links. Nodes which are linked to, form "targets" of
electronic attention. Conversely, nodes which generate outgoing links
form "sources" of such an attention. Naturally, any node can act
simultaneously as the source and the target of the hypertext
connections. Finally, nodes with more incoming than outgoing links
can be dubbed "information exporters", because they represent places
from which online information is siphoned. By the same token, nodes
which act more often as sources of hypertext connections than as
targets can be considered "information importers."
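The distinction between the two kinds of nodes can be illustrated with a minimal Python sketch. The function name `classify_node` and the "balanced" label for equal counts are assumptions introduced here for illustration only. The sample figures are Japan's and Indonesia's international link counts as reported in Tables 2 and 3 of this paper.

```python
def classify_node(incoming: int, outgoing: int) -> str:
    """Label a node by comparing its incoming and outgoing link counts."""
    if incoming > outgoing:
        return "information exporter"
    if outgoing > incoming:
        return "information importer"
    return "balanced"  # assumption: the paper does not name the tie case


# Japan: 51,405 pages abroad link to .jp; 150,414 .jp pages link abroad.
print(classify_node(incoming=51_405, outgoing=150_414))
# Indonesia: 207,117 pages abroad link to .id; 20,805 .id pages link abroad.
print(classify_node(incoming=207_117, outgoing=20_805))
```

On these figures Japan comes out as an importer and Indonesia as an exporter of networked information.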
It is obvious that, for any informed discussion of political and
economic aspects of electronic communications in Asia, we need to
know how intensively and in what manner web resources of various
countries interact with one another. It is vital that we have this
information in addition to the data on numbers of networked computers
(Internet Software Consortium 2001) installed in various countries;
the number of web pages world-wide containing various "Asia-related"
keywords (Ciolek 1998); as well as the numbers of phone lines, mobile
phones, radios, TV-sets and Internet Service Providers (ISP) per
capita, or per country (CIA Central Intelligence Agency 2000, World
Bank nd.). In other words, in addition to the economic, demographic
and electronic vital-statistics of various places and regions, we also
need to have the means to determine who is watching whom, and in which
direction the networked data tends to flow.
However, at first glance a dependable answer to the last two questions
is difficult to secure. The universe of potential data to be sampled
and analysed is not only very large but also volatile and very complex.
Indeed, in January 2001 Asia as a whole comprised nearly 7.2 million
networked computers (Internet Software Consortium 2001). At the same
time, Internet users from all parts of the world generated at least
1,345 million web pages (Google 2001). Certainly, systematic research
into that massive tangle of hypertext nodes and links appears to be a
major operation.
Fortunately, as this paper will show, there is a simple yet effective
method of gathering extensive statistics on the relationships between
any group of Web nodes, whether in Asia or anywhere else in the
world.
2. The Method
2.1. The research uses of Altavista search engine
The Altavista search engine (www.altavista.com) is one of the biggest
(Bharat and Broder 1998; Compaq Co. 2001), most powerful and most
widely known WWW search engines. In addition to being very fast and
fairly comprehensive, Altavista offers a free online tool (Ciolek
2000) for identification of the informational relationships between a
diverse range of networked nodes. This tool is unique to Altavista and
is not available from other search engines such as AllTheWeb
(www.alltheweb.com), Excite (www.excite.com), Google (www.google.com),
HotBot (hotbot.lycos.com) or Lycos (search.lycos.com).
The Altavista facility consists of two simultaneous commands, both of
which are directed to the search engine's query box.
These commands are, in their most generic form:
link:computer-address-A -host:computer-address-B
The above expression means:
within the realm of all web pages currently known to Altavista's database,
give me the number (and also display a detailed list) of all the pages
with a hypertext link to computer address A.
At the same time, eliminate from the report any links from pages residing at
address B.
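A minimal Python sketch of how the generic expression might be assembled and URL-encoded before it is submitted to a search engine. The helper name `build_query` is hypothetical; the encoded form corresponds to the query string used in the WEBLINKSURVEY listing in Section 3.1.

```python
from urllib.parse import quote_plus


def build_query(target: str, excluded_source: str) -> str:
    """Return the generic 'link:A -host:B' expression."""
    return f"link:{target} -host:{excluded_source}"


query = build_query(".sg", ".sg")
print(query)              # link:.sg -host:.sg
print(quote_plus(query))  # link%3A.sg+-host%3A.sg
```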
Altavista's "link:argument -host:argument" command can take at
least thirty-six (36) specific forms. This is because an address (URL)
of an online resource (such as a web page, a web site, an archive, or a
database) has at least six component parts (or seven, if we take into
account the "#" category of addresses within a web page itself, and more
if the full range of nested subdirectory addresses is considered), and
each of the six can serve as either the target or the source (6 x 6 = 36). For
instance, the computer address:
www.gu.edu.au/school/ais/asaapubs/eastasia.html
observes, like all other computer addresses, the following global convention:
machine.organization.networkcode.countrycode/subdirectory/file-name
- machine = www (i.e. name of the particular web server)
- organization = gu (i.e. Griffith University)
- networkcode = edu (i.e. academic network)
- countrycode = au (i.e. Australia)
- subdirectory = school/ais/asaapubs/ (i.e. area where the files reside)
- file-name = eastasia.html (i.e. name of the hypertext document)
This means that Altavista's "link:argument -host:argument" command
may deal with questions about very specific and precise targets
and sources, or very broad targets and sources, or any
other of the 36 combinations of the nodes.
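The six-part address convention can be sketched in Python as follows. The function name `split_address` is an assumption, as is the simplification that the host name always has exactly four dot-separated labels, as in the Griffith University example; real addresses often have more or fewer.

```python
def split_address(address: str) -> dict:
    """Split 'machine.organization.networkcode.countrycode/subdir/file'."""
    host, _, path = address.partition("/")
    # Assumption: the host has exactly four labels, as in the example.
    machine, organization, networkcode, countrycode = host.split(".")
    subdirectory, _, file_name = path.rpartition("/")
    return {
        "machine": machine,
        "organization": organization,
        "networkcode": networkcode,
        "countrycode": countrycode,
        "subdirectory": subdirectory + "/",
        "file-name": file_name,
    }


parts = split_address("www.gu.edu.au/school/ais/asaapubs/eastasia.html")
for name, value in parts.items():
    print(f"{name} = {value}")

# Six levels for the target times six levels for the source.
print(len(parts) ** 2)  # 36
```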
- For instance, the command
link:coombs.anu.edu.au/ASAA/ -host:www.gu.edu.au/school/ais/asaapubs/eastasia.html
locates all references to the target subdirectory "ASAA" on the
machine "coombs", of the institution "anu", of the network "edu", of
the country "au", minus any references originating from a web page
"eastasia.html". The source page in question resides on another
machine, and at another institution, but still within the same network
and within the same country.
- Whereas the command
link:www.nla.gov.au -host:www.nla.gov.au
locates all references to resources at the address "www.nla.gov.au", minus references
originating from that very same address, i.e. minus any self-references.
- Whereas the command
link:kaladarshan.arts.ohio-state.edu/exhib/kaney/pgs/kaneintr.html -host:.jp
locates all references to a page on Japanese calligraphy ("/kaneintr.html") at
the Huntington Archive of Buddhist and Related Art, Ohio State
University, USA, minus any references to the target page originating in Japan.
The commonest application of Altavista's "link: -host:" command is
a query concerning the number and details of web pages with a
hypertext link to a particular online resource. Such a query measures
the degree of "online presence" exercised by the node in question. The
query helps, therefore, with comparisons of the online visibility of
various web resources. Secondly, the query can be used to determine
the identities of sources making a link to a target node under one's
jurisdiction, so that they can be notified (if necessary) about
planned changes to the site's address or to policies governing access
to its contents.
However, in addition to these two everyday uses, Altavista's commands
can also answer a timeless pair of geopolitical questions: (a) "who
is watching whom", and (b) "how intensively this watching takes place".
2.2. An algorithm for collection of Altavista's data
In order to determine which of two or more nodes (or sets of nodes) acts as
an importer (or exporter) of networked information, one needs to carry out a sequence
of distinct but interlinked operations. This is because the raw answers provided by
Altavista need to be recalculated before they are fully usable for our purposes.
For instance, the following four steps determine the number of
Singapore-based pages with links to web resources based in Singapore
itself:
Step 1.
Operation: an Altavista query "link:.sg -host:."
Result: "w", i.e. the number of all pages world-wide with links
pointing to the web servers in the country domain ".sg"
Step 2.
Operation: an Altavista query "link:.sg -host:.sg"
Result: "x", i.e. the number of all pages world-wide with links
pointing to the web servers in the country domain ".sg",
less those which originate in ".sg"
Step 3.
Operation: subtract "x" from "w"
Result: "z", i.e. the number of pages with links which originate in the
country domain ".sg" and point to the web servers in ".sg"
Step 4.
Operation: print the result "z"
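Steps 3 and 4 amount to a single subtraction, which can be sketched in Python as below. The function name `domestic_pages` is hypothetical. Since no raw ".sg" answers are quoted in this paper, the example uses the ".cn" figures from the sample output in Section 3.3: "w" is the reported WORLD count, while "x" is back-calculated here from the reported domestic count and is therefore an assumption rather than a recorded Altavista answer.

```python
def domestic_pages(w: int, x: int) -> int:
    """Steps 3 and 4: z = w - x."""
    if x > w:
        raise ValueError("x (links from elsewhere) cannot exceed w (all links)")
    return w - x


w = 1_625_035  # all pages world-wide with links to .cn
x = 1_519_130  # the same, less pages residing in .cn (back-calculated)
print(domestic_pages(w, x))  # 105905
```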
For studies involving several nodes, the range of the
operations needs to be greatly widened. Firstly, two lists of
countries have to be declared: one for the "targets"; and another for
the "sources." Secondly, one needs to run through all the permutations
between the contents of these two lists. Thirdly, results have to be
sorted, annotated and printed out. Once such an output is created and
saved in a file (see Section 3.3), the data can be put into a
spreadsheet and subjected to further processing and, finally, analysis
and assessment.
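The enumeration of all target and source pairings can be sketched in Python with `itertools.product`. The two short lists below are hypothetical placeholders; the "." entry in the sources list plays the same role as in the WEBLINKSURVEY listing, where it yields the world-wide total for each target.

```python
from itertools import product

targets = [".jp", ".kr", ".sg"]       # hypothetical "targets" list
sources = [".", ".jp", ".kr", ".sg"]  # "." yields the world-wide total

queries = [f"link:{t} -host:{s}" for t, s in product(targets, sources)]

print(len(queries))  # 12
for query in queries[:4]:
    print(query)
```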
The number of individual steps to be taken is a function of the number
of "targets" multiplied by the number of "sources". Thus a tiny, say
3x3, investigation involves about 36 operations (i.e. 3x3x4 steps),
whereas a bigger one, say 12x12, means at least 576 operations (i.e.
12x12x4 steps). Assuming that one manual operation takes, on average,
one minute, a hypothetical 3x3 study would take about 36
minutes. On the same assumption the bigger study would require at least
576 minutes, or 9:36 hours. Obviously, the more extensive the studies,
the greater the benefit from automating the whole procedure, especially if
they are to be conducted on a repetitive basis.
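The arithmetic behind these estimates can be checked with a short Python sketch. The function name `manual_effort` and its parameter names are assumptions; the four steps per pair and the one-minute average per operation come from the text above.

```python
def manual_effort(n_targets, n_sources, steps_per_pair=4, minutes_per_step=1):
    """Return (operations, hours, minutes) for a manual survey."""
    operations = n_targets * n_sources * steps_per_pair
    hours, minutes = divmod(operations * minutes_per_step, 60)
    return operations, hours, minutes


print(manual_effort(3, 3))    # (36, 0, 36)
print(manual_effort(12, 12))  # (576, 9, 36)
```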
2.3. Data collection schedules
The first dataset which used this algorithm was gathered by a manual
procedure on 15th December 2000. The pilot sample covered China, HK,
Japan, South Korea and Taiwan, as well as Australia, New Zealand,
Indonesia, Malaysia, Singapore, Thailand, and the Philippines. The
data obtained formed a 12x12 matrix with 144 cells. From that matrix,
a smaller one (10x10) was extracted for the purpose of this paper.
This sample provided 3,445,426 datapoints describing relationships
between all countries mentioned in this pilot sample, except Australia
and New Zealand.
A tabulation of contents of the December 2000 sample made clear that
the Altavista database is worth regular tapping. However, manual
investigations proved to be a time consuming and labour intensive
enterprise. It was important therefore that the
process of data gathering and sorting was automated as much as
possible. Accordingly, in February and March 2001 this author designed
software which was then written by Mr Bryan Collins of the ANU
Research School of Pacific and Asian Studies (RSPAS) IT Services (see
Section 3.1).
Following trial runs, the first bug-free, automated data-collection
session was run on 8th June 2001. It dealt with all 50 Asian
countries. From that large dataset, a smaller one was extracted. The
small set formed a 10x10 matrix of 6,262,142 datapoints describing
relationships between the ten Asian countries listed above.
2.4. Characteristics of the Altavista database
The differences in the number of hypertext links existing at various
times between the same ten countries (December 2000 - 3,445,426; June
2001 - 6,262,142) showed that the Altavista database is a dynamic
environment. Its contents keep changing. The number of pages with
links known to the database enlarges and shrinks as a result of
ongoing transformations in the world's cyberspace. Inevitably, the question
arises as to how often and how extensively this database changes.
Fortunately, Altavista can be queried to provide a numeric answer to
this question. Three consecutive measurements were taken on the 8th,
9th and 19th of June 2001, with each looking at the same set of 100
country-to-country relationships. Two comparisons (each involving 100
pairs of countries) showed that there was no difference between the
data collected on the 8th and 9th of June. However, they showed that
all 100 pairs (100%) differed between the 9th and the
19th of June. In other words, during this 11-day span, the data
recorded by Altavista changed 100 out of 200 possible times. This
would suggest an average rate of change of 4.5% of data holdings per day
(50% over 11 days).
Also, during that time, out of the 100 recorded variations, 48%
involved a drop in the number of hypertext links between pairs of
countries, and 52% involved an increase in the volume of links.
Overall, the number of pages with links recorded for the ten studied
countries had grown by approximately 1%: from 6,265,330 to 6,330,301.
A second comparison, between the data collected on the 19th and 25th of
June 2001, showed, again, a 100% difference between the two
datasets, with 58% of the cases indicating a decrease in the number of
pages with links, and 42% showing an increase. These developments took
place over a period of six days and suggest a 16.7% average daily rate
of change. Also in the second comparison, the 6,330,301 links recorded
for the studied countries were seen to decrease by 9% to 5,751,622.
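The rates quoted in this section follow from simple ratios, sketched here in Python. The function names `daily_change_rate` and `percent_change` are assumptions made for illustration; the input figures are those reported above.

```python
def daily_change_rate(changed: int, possible: int, days: int) -> float:
    """Average share of observations (in percent) that changed per day."""
    return 100.0 * changed / possible / days


def percent_change(before: int, after: int) -> float:
    """Relative change (in percent) between two totals."""
    return 100.0 * (after - before) / before


# 8th to 19th June: 100 changes out of 200 comparisons over 11 days.
print(round(daily_change_rate(100, 200, 11), 1))
# 19th to 25th June: 100 changes out of 100 comparisons over 6 days.
print(round(daily_change_rate(100, 100, 6), 1))
# Total pages with links on 19th June versus 25th June.
print(round(percent_change(6_330_301, 5_751_622)))
```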
Therefore, we may conclude that any study of the cyber-relationships
between a given set of nodes needs to be conducted (a) as a series
of repeated observations rather than as a one-off measurement; and (b) over a
substantial period of time. Moreover, (c) unless changes over time are
themselves being studied, data collected at various times should be averaged, so that a
generalised, more synthetic picture can be obtained.
This is the course adopted by this paper.
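A minimal Python sketch of such averaging, assuming that every run is stored as a dictionary with the same keys. The function name `average_runs` and the key label are hypothetical; the two totals are the December 2000 and June 2001 ten-country figures reported in Section 2.3.

```python
def average_runs(runs):
    """Average several {label: count} dictionaries key by key."""
    labels = runs[0].keys()  # assumption: all runs share the same keys
    return {label: sum(run[label] for run in runs) / len(runs)
            for label in labels}


december_2000 = {"ten-country total": 3_445_426}
june_2001 = {"ten-country total": 6_262_142}
print(average_runs([december_2000, june_2001]))
```

In a full study the same function would be applied to dictionaries holding one entry per pair of countries rather than a single total.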
3. The software for automated data collection
3.1. Software
The WEBLINKSURVEY program runs on a Unix machine connected to the
Internet. It is a short, plain-text (ASCII) file containing a Perl script,
kept in a subdirectory of a user's account. Within the Unix environment
the WEBLINKSURVEY file(s) can be
easily copied (using the 'cp' command) and edited (using the 'vi' command).
Naturally, all program files need to be in the executable (e.g. -rwxr----- )
mode.
The software is archived on the Coombspapers anonymous FTP site at
ftp://coombs.anu.edu.au/coombspapers/otherarchives/soc-science-software/.
The WEBLINKSURVEY program can be freely copied and used for private research purposes as long
as the associated copyright and credit notes are retained.
For any other uses please first contact the copyright holders.
The following is the listing of the WEBLINKSURVEY program.
#!/usr/bin/perl
#
# Amend the path above acc. to the placement of
# the perl library on your mainframe machine
#
# -------------------- about the weblinksurvey --------------------------
# Name: WEBLINKSURVEY
# Version: 1.1 June 2001
#
# This 'weblinksurvey' software can be acquired free of charge from
# ftp://coombs.anu.edu.au/coombspapers/otherarchives/soc-science-software/
# file: weblinksurvey.perl
#
# Function: A general purpose, perl-based tool to study patterns
# of interconnections between various elements, sets
# and subsets of the global (or regional) web-space.
#
# Method: The program runs in a Unix environment. It (1) queries
# the online altavista [www.altavista.com] database
# of web pages (an altavista command "link:URL -host:URL");
# (2) counts the number of pages from one set of URLs which
# point to web pages from another set of URLs.
#
# The program implements an algorithm first described in
# www.ciolek.com/PAPERS/easian-info-flows.html
#
# Authors: Algorithm by T.Matthew Ciolek (tmciolek@coombs.anu.edu.au)
# Programming by Bryan Collins (bryan@coombs.anu.edu.au)
# Testing and additional programming by T.Matthew Ciolek
# Research School of Pacific and Asian Studies [rspas.anu.edu.au]
# The Australian National University, Canberra, Australia
#
# Copyright (c) 2001 by T.M.Ciolek & B.Collins
# This program can be freely copied and used for private
# research purposes as long as this and all above notes are
# retained. For any other uses please contact the copyright
# holders.
# -----------------------------------------------------------------------
#
# ---------------------
# do the data gathering
# ---------------------
#
use LWP::UserAgent;
@list1=('insert here individual, comma separated URLs');
@list2=('.','insert here individual, comma separated URLs');
#list1: targets (pages, servers, subdomains or domains) pointed at (linked to)
#list2: sources, locations from which the hyperlinks originate
foreach $one (@list1) {
    foreach $two (@list2) {
        $want="link:$one -host:$two";
        $ua = new LWP::UserAgent;
        $ua->agent("CoombsCustom/0.1 " . $ua->agent);
        #customize your software's name and location;
        #print "Doing link:$one -host:$two....";
        $req = new HTTP::Request GET => "http://altavista.com/cgi-bin/query?q=link%3A$one+-host%3A$two&kl=XX";
        $res = $ua->request($req);
        if ($res->is_success) {
            foreach $line (${$res->content_ref}) {
                #print "checking $line\n";
                #if ($line =~ /([0-9,]+) pages found/) {
                #(an expression used by Altavista till May 2001)
                #if ($line =~ /We found about ([0-9,]+) results:/) {
                #(an expression used by Altavista till 13 Jun 2001)
                if ($line =~ /We found ([0-9,]+) results:/) {
                    $found=$1;
                    #print "\t$found pages\n";
                    $found =~ s/,//g;
                    $hits{$one}{$two}=$found;
                }
            }
        } else {
            print "error!\n";
        }
    }
}
#
# -----------------------------------------------
# do the calculations, sort and format the output
# -----------------------------------------------
#
foreach $k (sort keys(%hits)) {
    %x=%{$hits{$k}};
    $country_total=$x{'.'};
    foreach $l (sort keys (%x)) {
        $cnt=$x{$l};
        $diff=$country_total - $cnt;
        if ($l eq '.') {
            print "$k <-- WORLD\t$country_total\n";
        } else {
            print "$k <-- $l\t$diff\n";
        }
    }
    print "\n";
}
# -------------------- end of the weblinksurvey --------------------------
3.2. File management issues
The program file can be given any name. However, for ease of
management, the file name should always accurately
reflect the purpose of each variant copy of the software. For instance,
a file named "weblinksurvey-fullasia" may be used to denote a program for the
study of all permutations existing between all countries of Asia,
whereas "weblinksurvey-aejpkr" may be a good name
for a program dealing with
the relationships between three countries such as the United Arab Emirates
("ae"), Japan ("jp") and South Korea ("kr").
Similarly important is an adequate naming convention for the output
files. For instance, a file named "fullasia-011002.txt" indicates that
it contains ASCII output from the program "weblinksurvey-fullasia"
run on the 2nd of October 2001. Likewise, a file named
"aejpkr-030712.txt" should contain the results of the software
"weblinksurvey-aejpkr" run on the 12th of July 2003.
The output files are generated by running a specific copy of our
software, e.g.
a site-specific UNIX PROMPT >weblinksurvey-tenasia > tenasia-010608.txt
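The naming convention for output files (study label, then the run date as yymmdd) can be sketched in Python. The helper name `output_name` is hypothetical, and the sketch assumes that every program file name begins with "weblinksurvey-" followed by the study label.

```python
from datetime import date


def output_name(program: str, run_date: date) -> str:
    """Derive e.g. 'fullasia-011002.txt' from the program name and run date."""
    label = program.split("-", 1)[1]  # drop the 'weblinksurvey-' prefix
    return f"{label}-{run_date:%y%m%d}.txt"


print(output_name("weblinksurvey-fullasia", date(2001, 10, 2)))  # fullasia-011002.txt
print(output_name("weblinksurvey-aejpkr", date(2003, 7, 12)))    # aejpkr-030712.txt
```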
The time needed to complete a software run depends
on a number of factors. They are: the overall processing power of the Unix
machine in question, the congestion of the network, the overall
load on the www.altavista.com site, and the scope of
operations requested by our program. Naturally, short
analyses run quicker (taking no more than a few minutes) than
lengthy ones. For example, a run of the "weblinksurvey-fullasia"
program which looks at all 50x50 permutations of links between Asian
countries takes between 20 and 40 minutes.
Occasionally, a previously tried and well-behaved program can fail
by not producing any output at all. If this is the case, then the
most likely reason is that changes were made to
the format of Altavista's results page itself. That page tends to undergo
sporadic minor modifications. For example, within a short period between mid-March
and mid-June 2001 the Altavista results were phrased in three different ways:
"X pages found", "We found about X results:" and "We found X results:". When this
happens, the remedy is simple. To make the WEBLINKSURVEY program
operational again, identify the exact wording currently in use by
Altavista and modify the appropriate line in the Perl script accordingly.
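One way to soften this fragility, sketched here in Python rather than in the Perl of the listing, is to try each known wording in turn. This is not how WEBLINKSURVEY itself behaves: the program matches a single expression at a time. The names `RESULT_PATTERNS` and `extract_count` are assumptions, and the sample strings are made-up fragments that merely follow the three wordings quoted above.

```python
import re

# The three wordings reported for the Altavista results page in 2001.
RESULT_PATTERNS = [
    re.compile(r"([0-9,]+) pages found"),
    re.compile(r"We found about ([0-9,]+) results:"),
    re.compile(r"We found ([0-9,]+) results:"),
]


def extract_count(page_text: str):
    """Return the reported number of pages, or None if no wording matches."""
    for pattern in RESULT_PATTERNS:
        match = pattern.search(page_text)
        if match:
            return int(match.group(1).replace(",", ""))
    return None


print(extract_count("We found 1,625,035 results:"))
print(extract_count("no recognisable wording here"))
```

A return value of None signals that the wording has changed again and that a new pattern needs to be added.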
A word of caution, though. If these small but crucial changes to the format of
Altavista's results page are, indeed, defence mechanisms protecting Altavista from
mass automated polling, it is important that the scholarly community
which uses the WEBLINKSURVEY program does so sparingly and tactfully. It is
important that our method of using Altavista does not lead to the closure of the database's
"link:argument -host:argument" research facility.
3.3. Output
The content and length of the output file vary from one research question to
another. However, all output files follow the same format (see below for a
fragment of the "tenasia-010608.txt" file). The first column lists the
identity, i.e. the whole URL (or the key fragment of it), of the target node. The second
column lists the identity of the source node. The third
column lists the number of web pages with hypertext links originating in the source
location and leading to the target in question.
An example of the "tenasia-010608.txt" output file:
.cn <-- WORLD 1625035
.cn <-- .cn 105905
.cn <-- .hk 8690
.cn <-- .id 477
.cn <-- .jp 10943
.cn <-- .kr 3751
.cn <-- .my 1988
.cn <-- .ph 609
.cn <-- .sg 11878
.cn <-- .th 1670
.cn <-- .tw 6789
....................
....................
.tw <-- WORLD 1284651
.tw <-- .cn 6812
.tw <-- .hk 12120
.tw <-- .id 685
.tw <-- .jp 14580
.tw <-- .kr 4809
.tw <-- .my 1810
.tw <-- .ph 1035
.tw <-- .sg 13341
.tw <-- .th 1486
.tw <-- .tw 115494
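Before being placed in a spreadsheet, an output file of this kind can be read back into a lookup table. The following Python sketch assumes whitespace-separated fields (the program itself writes a tab before the count); the function name `parse_output` is hypothetical, and the sample lines are taken from the fragment above.

```python
def parse_output(text: str) -> dict:
    """Map (target, source) pairs to page counts."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or fields[1] != "<--":
            continue  # skip blank lines and dotted filler rows
        target, _, source, count = fields
        table[(target, source)] = int(count)
    return table


sample = """.cn <-- WORLD 1625035
.cn <-- .cn 105905
.cn <-- .hk 8690
....................
.tw <-- .tw 115494
"""

links = parse_output(sample)
print(links[(".cn", ".hk")])  # 8690
print(len(links))             # 4
```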
4. Networked information flows in Asia
4.1. The WEBLINKSURVEY data
Results listed below are based on 4.8 million data points averaged
from two consecutive studies conducted in December 2000 and June 2001.
Table 1 provides a detailed breakdown of the direction and volume of
hypertext links connecting web sites in ten Eastern Asian countries.
Table 2 divides these data, for each of the countries involved, into
two categories: pages with "domestic" vs. "international" links.
Table 3 groups and analyses data from Table 1 in order to determine
the relative intensities with which the ten studied Asian countries
monitor each other by means of "international" hypertext links.
Finally, Tables 4 and 5 express values from Table 1 as
percentages.
Table 1. Origins and destinations of WWW links in ten Asian countries
("domestic" links lie on the diagonal *)
                | Pages with links to:
WWW pages from: |        JP |      TW |      KR |      HK |      SG |      CN |      MY |      TH |      ID |      PH |     TOTAL
Japan           | 1,641,358 |  12,066 |  11,612 |  11,853 |  16,770 |   8,840 |  14,242 |  24,334 |  38,740 |  11,959 | 1,791,771
Taiwan          |     3,637 | 193,271 |   4,668 |   1,827 |   2,224 |   4,240 |   2,392 |   1,591 |  10,604 |     968 |   225,420
Sth Korea       |     9,811 |   3,819 | 443,699 |   5,506 |   3,772 |   4,516 |   4,892 |   3,258 |  26,107 |   4,073 |   509,449
HK              |     9,066 |   8,452 |   4,001 | 457,496 |  34,300 |   7,111 |  10,989 |   1,934 |  10,527 |   2,072 |   545,945
Singapore       |    13,008 |   8,188 |   6,494 |  13,137 | 495,523 |   9,021 |  63,974 |  11,578 |  48,880 |   8,594 |   678,395
China           |     4,330 |   4,432 |   3,021 |   5,741 |   1,107 | 101,772 |   2,128 |     462 |  12,521 |     325 |   135,837
Malaysia        |     3,106 |   1,457 |   4,433 |   1,216 |   7,893 |   2,728 | 256,484 |   1,337 |  33,690 |   1,493 |   313,836
Thailand        |     3,984 |   1,025 |     825 |     922 |   3,846 |   1,059 |   2,085 | 217,430 |  15,585 |   2,594 |   249,352
Indonesia       |     3,196 |     464 |     319 |     349 |   4,673 |   1,875 |   2,227 |   4,875 | 226,164 |   2,829 |   246,969
Philippines     |     1,268 |     791 |     988 |     756 |   2,722 |   1,919 |   1,136 |   3,287 |  10,465 | 136,283 |   159,612
TOTAL           | 1,692,762 | 233,962 | 480,056 | 498,799 | 572,828 | 143,079 | 360,548 | 270,085 | 433,281 | 171,186 | 4,856,584
* "Domestic" links span nodes situated in the same country
Table 2. Proportion of "domestic" and "international" WWW links in ten Asian countries
(based on Table 1, values rounded to the nearest percent) *
WWW pages from: | Pages with domestic links |    % | Pages with international links |    % | Pages TOTAL |    %
Japan           |                 1,641,358 |  92% |                        150,414 |   8% |   1,791,771 | 100%
Taiwan          |                   193,271 |  86% |                         32,149 |  14% |     225,420 | 100%
Sth Korea       |                   443,699 |  87% |                         65,751 |  13% |     509,449 | 100%
HK              |                   457,496 |  84% |                         88,449 |  16% |     545,945 | 100%
Singapore       |                   495,523 |  73% |                        182,872 |  27% |     678,395 | 100%
China           |                   101,772 |  75% |                         34,065 |  25% |     135,837 | 100%
Malaysia        |                   256,484 |  82% |                         57,352 |  18% |     313,836 | 100%
Thailand        |                   217,430 |  87% |                         31,922 |  13% |     249,352 | 100%
Indonesia       |                   226,164 |  92% |                         20,805 |   8% |     246,969 | 100%
Philippines     |                   136,283 |  85% |                         23,330 |  15% |     159,612 | 100%
TOTAL           |                 4,169,478 |  86% |                        687,107 |  14% |   4,856,584 | 100%
* "International" links span nodes situated in two different countries
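The percentages in Table 2 can be re-derived from the row totals of Table 1, as in the following Python sketch. The function name `link_shares` is an assumption; the figures are the Japanese and Singaporean rows of the table.

```python
def link_shares(domestic: int, total: int):
    """Return (domestic %, international %) rounded to the nearest percent."""
    international = total - domestic
    return round(100 * domestic / total), round(100 * international / total)


print(link_shares(1_641_358, 1_791_771))  # Japan: (92, 8)
print(link_shares(495_523, 678_395))      # Singapore: (73, 27)
```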
Table 3. Destinations of "international" WWW links in ten Asian countries
(based on Table 1, values rounded to the nearest percent) *
Destinations:                          |      JP |      TW |      KR |      HK |      SG |      CN |      MY |      TH |      ID |      PH | Pages with Intl. links TOTAL
WWW pages with international links to: |  51,405 |  40,691 |  36,358 |  41,304 |  77,305 |  41,307 | 104,064 |  52,655 | 207,117 |  34,904 | 687,107
% of TOTAL                             |      7% |      6% |      5% |      6% |     11% |      6% |     15% |      8% |     30% |      5% | 100%
* "International" links span nodes situated in two different countries
Table 4. The volume of "international" WWW links by their origin and destination in ten Asian countries
(based on Table 1, values rounded to the nearest percent) *
                             | Target of links to:
WWW pages from:              |   JP |   TW |   KR |   HK |   SG |   CN |   MY |   TH |   ID |   PH
Japan                        |    x |  30% |  32% |  29% |  22% |  21% |  14% |  46% |  19% |  34%
Taiwan                       |   7% |    x |  13% |   4% |   3% |  10% |   2% |   3% |   5% |   3%
Sth Korea                    |  19% |   9% |    x |  13% |   5% |  11% |   5% |   6% |  13% |  12%
HK                           |  18% |  21% |  11% |    x |  44% |  17% |  11% |   4% |   5% |   6%
Singapore                    |  25% |  20% |  18% |  32% |    x |  22% |  61% |  22% |  24% |  25%
China                        |   8% |  11% |   8% |  14% |   1% |    x |   2% |   1% |   6% |   1%
Malaysia                     |   6% |   4% |  12% |   3% |  10% |   7% |    x |   3% |  16% |   4%
Thailand                     |   8% |   3% |   2% |   2% |   5% |   3% |   2% |    x |   8% |   7%
Indonesia                    |   6% |   1% |   1% |   1% |   6% |   5% |   2% |   9% |    x |   8%
Philippines                  |   3% |   2% |   3% |   2% |   4% |   5% |   1% |   6% |   5% |    x
Pages with Intl. links TOTAL | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100%
* "International" links span nodes situated in two different countries
Table 5. The volume of "international" WWW links by their destination and origin in ten Asian countries
(based on Table 1, values rounded to the nearest percent) *
                | Target of links to:
WWW pages from: |   JP |   TW |   KR |   HK |   SG |   CN |   MY |   TH |   ID |   PH | Pages with Intl. links TOTAL
Japan           |    x |   8% |   8% |   8% |  11% |   6% |   9% |  16% |  26% |   8% | 100%
Taiwan          |  11% |    x |  15% |   6% |   7% |  13% |   7% |   5% |  33% |   3% | 100%
Sth Korea       |  15% |   6% |    x |   8% |   6% |   7% |   7% |   5% |  40% |   6% | 100%
HK              |  10% |  10% |   5% |    x |  39% |   8% |  12% |   2% |  12% |   2% | 100%
Singapore       |   7% |   4% |   4% |   7% |    x |   5% |  35% |   6% |  27% |   5% | 100%
China           |  13% |  13% |   9% |  17% |   3% |    x |   6% |   1% |  37% |   1% | 100%
Malaysia        |   5% |   3% |   8% |   2% |  14% |   5% |    x |   2% |  59% |   3% | 100%
Thailand        |  12% |   3% |   3% |   3% |  12% |   3% |   7% |    x |  49% |   8% | 100%
Indonesia       |  15% |   2% |   2% |   2% |  22% |   9% |  11% |  23% |    x |  14% | 100%
Philippines     |   5% |   3% |   4% |   3% |  12% |   8% |   5% |  14% |  45% |    x | 100%
* "International" links span nodes situated in two different countries
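Tables 4 and 5 express the same Table 1 cell as a share of two different totals: the international links received by the target country (Table 4) and the international links produced by the source country (Table 5). A minimal Python sketch, using the Japan-to-Thailand cell and the totals from Tables 2 and 3; the function and variable names are hypothetical.

```python
def share(part: int, whole: int) -> int:
    """Percentage of 'whole' represented by 'part', to the nearest percent."""
    return round(100 * part / whole)


japan_to_thailand = 24_334       # Table 1: pages from Japan with links to TH
intl_links_to_thailand = 52_655  # Table 3: all international links to TH
intl_links_from_japan = 150_414  # Table 2: all international links from Japan

print(share(japan_to_thailand, intl_links_to_thailand))  # Table 4 view: 46
print(share(japan_to_thailand, intl_links_from_japan))   # Table 5 view: 16
```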
4.2. A summary of uncovered patterns
The collected data are preliminary, but even at this early stage they already deserve attention.
Table 1 shows that electronic information residing on web servers is
neither uniformly distributed across the countries of Eastern
Asia, nor uniformly linked to. Japan produces the largest
volume of hypertext links (1,791,771 or 37% of the total sample). It is
followed by Singapore (678,395 or 14% of the total) and Hong Kong
(545,945 or 11% of the total). Among the ten studied countries, the
least prolific producers (within the studied sample, of course) of web
links are: China (135,837 or 3% of the total), Philippines
(159,612 or 3% of the total) and Taiwan (225,420 or 5% of the total).
Table 2 shows that there are distinct differences in the way East
Asian countries treat cyberspace. Some countries tend to be more
self-centred than others. The most self-sufficient countries (in
terms of online information) were: Japan (international hypertext
connections form only 8% of the total links), Indonesia (8%), Thailand
(13%) and South Korea (13%). At the same time, there were countries which
were less concerned with their own online resources and which
monitored the neighbourhood's cyberspace with zest. These were:
Singapore (information-importing links form as much as 27% of the
total links), China (25%) and Malaysia (18%).
Table 3 shows that the object of countries' electronic attention can be
surprisingly narrow. Among the ten studied countries,
Indonesia attracted 30% of all pages with international (i.e.
information-gathering) links generated by the other nine countries.
The respective figure for Malaysia was 15%, and for Singapore
11%. The least watched, least linked-to countries in our sample
were: the Philippines (5%), South Korea (5%), and Taiwan, HK and China (6%
each).
The fine-grain details of the above trends are presented in Tables 4 and 5.
These reveal a chain of interactions between countries
of the region. For instance, Table 4 shows that the largest importer of Thailand's
online information is Japan (46% of regional hypertext linkages to
Thailand originate on web sites from Japan). At the same time, the
largest importer of Japanese online information is Singapore (25%
of regional hypertext links to Japan originate in Singapore) while the largest
importer of the Singaporean web material is Hong Kong (44% of all
pages from the nine Eastern Asian countries with hypertext links to
Singapore reside in HK).
An additional perspective on the above pattern is provided by Table 5.
There, one can see that Thailand accounts for 16% of all international
link destinations of pages originating in Japan. In turn, Japan attracts only 7%
of the international links established on Singaporean pages,
whereas Singapore is the target of a hefty 39% of the international links on web pages from
Hong Kong (all of this happening while Singapore directs only 7% of its
electronic links towards Hong Kong).
Further patterns also exist. Table 5 summarises the
way countries interact electronically with China: Taiwan
is more attentive to China's information resources
(13%) than is Hong Kong (8%), South Korea (7%), Japan (6%) or
Singapore (5%). Also, for reasons currently unknown, Indonesia points
only 9% of its outgoing links in the direction of China, whereas 37%
of China's web pages with international links point to Indonesia.
In other words (assuming that informational resources in both
countries are of comparable size, and accessed with similar
frequencies), online information seems to flow from Indonesia to China
four times more strongly than in the reverse direction.
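The "four times" figure above follows directly from the two Table 5 percentages. The arithmetic can be sketched in Python as below. This is only an illustration: the name `link_asymmetry` is hypothetical and not part of the WEBLINKSURVEY software, and the ratio is meaningful only under the assumption stated above, namely that the informational resources of both countries are of comparable size and are accessed with similar frequencies.

```python
# Shares of international links, copied from Table 5 (percent of the
# source country's pages with international links).
TABLE5_SHARE = {
    ("CN", "ID"): 37,  # China's pages pointing to Indonesia
    ("ID", "CN"): 9,   # Indonesia's pages pointing to China
    ("HK", "SG"): 39,  # Hong Kong's pages pointing to Singapore
    ("SG", "HK"): 7,   # Singapore's pages pointing to Hong Kong
}


def link_asymmetry(a, b, shares=TABLE5_SHARE):
    """Ratio of a's share of links to b over b's share of links to a."""
    return shares[(a, b)] / shares[(b, a)]


print(round(link_asymmetry("CN", "ID"), 1))  # 4.1
print(round(link_asymmetry("HK", "SG"), 1))  # 5.6
```

Since a link from China to Indonesia imports Indonesian information into China, a ratio above 1 for the pair (CN, ID) corresponds to a net flow of information from Indonesia to China.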
4.3. Conclusions
The political, economic or anthropological ramifications of the
patterns present in Tables 1-5 remain outside
the scope of this paper. Similarly, it is still too early to attempt
to calculate correlations between our data and the
existing cross-national social, economic or political indices such as
those used (or described) by the World Bank (nd), Thede (nd), the Human
Development Report Office (2001), or Saltman (2001).
Prior to any such investigations, further evaluation and
refinement of the WEBLINKSURVEY methodology need to take place.
Nevertheless, the work conducted so far indicates that:
(a) a large scale automated study of distribution and direction of hypertext links (and
hence patterns of flows of electronic information) is feasible;
(b) data collected for an initial sample of ten countries of East and South-East Asia
reveal several distinct styles (or strategies) in which these
countries relate to each other in cyberspace;
(c) such a study can be done quickly, easily, inexpensively and systematically by
anyone with access to a networked Unix computer;
(d) ideally, such studies should be performed as a series of long-term
data-gathering runs, all conducted at regular, say 3 or 6 monthly, intervals;
(e) ideally, in order to protect the unrestricted access to the Altavista
"link:argument -host:argument" command, studies involving WEBLINKSURVEY software
should be done on a coordinated (i.e. without replication of effort) and
cooperative (i.e. data-exchange) basis;
(f) finally, raw data from individual studies should be systematically
archived, preferably on a single, authoritative site specializing in
geography, demography and sociology of the Internet. Establishment of
such a data archive is vital if unique measurements of the Internet
are to be protected from loss, dispersal and alterations, and if
emerging spatial/temporal trends in Asian and global cyberspace are to
be adequately monitored and documented.
5. Appendix of 04 Jan 2002
Weblink surveys carried out since the completion of this paper reveal that Altavista's "link:argument -host:argument" command, when applied to country code Top Level Domains (ccTLDs), generates a certain amount of spurious data. This is because the command is not sensitive enough. A query such as "host:.id" (= list all the hosts associated with the Indonesian country code) lists:
- hosts associated with the Top Level Domain ".id" (e.g.
"bandung.linux.or.id", "cyberjob.cbn.net.id");
- hosts in other Top Level Domains bearing a name commencing with letters "id" (e.g. "id.zaigen.co.kr", "id.lycosasia.com", or "id.dorfschenke.de");
- hosts in other Top Level Domains bearing a name containing letters "id" (e.g. "www-id.imag.fr");
- hosts residing on networks whose name commences with letters "id"
(e.g. "www2.state.id.us", "submarine.id.ru").
The first set of data is correct, the next three are not. The ratio between the correct
and incorrect sets of information depends on the actual combination of letters
forming the country code. Certain combinations, such as "ID" (Indonesia), "TV" (for Tuvalu), or "OM" (for Oman) are more often used as a part of a host-name or network-name than such codes as "JP" (for Japan) or "CN" (for China). This is illustrated by Table 6 below.
Table 6. Percentage of Asian hosts retrieved via a given country code*
found to be associated with that country
Country | Country code | % of hosts with 'correct' association |
Japan | jp | 100 |
Kazakhstan | kz | 98 |
China | cn | 96 |
Pakistan | pk | 88 |
Turkey | tr | 88 |
Kyrgyzstan | kg | 86 |
Cambodia | kh | 86 |
Thailand | th | 76 |
Lebanon | lb | 68 |
Indonesia | id | 66 |
Israel | il | 64 |
South Korea | kr | 64 |
Philippines | ph | 58 |
Taiwan | tw | 58 |
Uzbekistan | uz | 58 |
Yemen | ye | 56 |
Brunei | bn | 56 |
Jordan | jo | 54 |
Tajikistan | tj | 52 |
Bahrain | bh | 50 |
Vietnam | vn | 48 |
Singapore | sg | 48 |
Hong Kong | hk | 44 |
Sri Lanka | lk | 42 |
Armenia | am | 36 |
Malaysia | my | 32 |
Kuwait | kw | 30 |
Qatar | qa | 26 |
Iran | ir | 20 |
United Arab Emirates | ae | 20 |
Nepal | np | 16 |
Georgia | ge | 12 |
India | in | 10 |
Azerbaijan | az | 10 |
Maldives | mv | 6 |
Oman | om | 6 |
East Timor | tp | 4 |
Bhutan | bt | 4 |
Turkmenistan | tm | 2 |
Saudi Arabia | sa | 2 |
Burma | mm | 2 |
Mongolia | mn | 0.5 |
Macao | mo | 0.5 |
Bangladesh | bd | 0.5 |
Syria | sy | 0.5 |
Laos | la | 0.5 |
Indian Ocean Territories | io | 0.5 |
Iraq | iq | 0.5 |
North Korea | kp | 0.5 |
Afghanistan | af | 0.5 |
* Values rounded to the nearest per cent; where values are less than 1 per cent, they are rounded up to 0.5%.
Sample: for each country code, the ten URLs listed by Altavista (www.altavista.com) on each of its output pages number 1, 5, 9, 13 and 17 (five pages, i.e. 50 URLs) in response to a query "host:ccTLD", e.g. "host:.kr" or "host:.jp". The survey of details of the 50 URLs for each of the 50 Asian countries/territories was carried out on 26 Dec 2001.
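The distinction between 'correct' and 'incorrect' hosts that underlies Table 6 can be sketched as a simple test on the last label of a host name. The Python fragment below is a minimal illustration only, not the procedure actually used for the survey: the function names `host_in_cctld` and `percent_correct` are hypothetical, and the sample consists merely of the eight example hosts quoted earlier in this Appendix, not of the 50 URLs surveyed per country.

```python
def host_in_cctld(host, cc):
    """True only if the final DNS label of the host is the country code."""
    labels = host.strip().lower().rstrip(".").split(".")
    return len(labels) > 1 and labels[-1] == cc.lower()


def percent_correct(hosts, cc):
    """Percentage of sampled hosts that really belong to the given ccTLD."""
    if not hosts:
        raise ValueError("empty sample")
    hits = sum(host_in_cctld(h, cc) for h in hosts)
    return 100.0 * hits / len(hosts)


# Example hosts quoted in this Appendix for the query "host:.id".
sample = [
    "bandung.linux.or.id",
    "cyberjob.cbn.net.id",
    "id.zaigen.co.kr",
    "id.lycosasia.com",
    "id.dorfschenke.de",
    "www-id.imag.fr",
    "www2.state.id.us",
    "submarine.id.ru",
]
print(percent_correct(sample, "id"))  # 25.0
```

Only the two hosts whose name ends in ".id" pass the test; the remaining six merely contain the letters "id" somewhere else in the host or network name.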
Moreover, there is good reason to suspect that similar percentages also apply to measurements obtained via the "link:argument" command. This in turn means that all originally reported values in Tables 1 to 5 need to be re-calculated to reflect:
- the reduced size of the destinations of hypertext links (overestimated by the Altavista's "link:argument" command);
- the reduced size of the sources of hypertext links (overestimated by the Altavista's "host:argument" command).
This is so because the final, corrected numbers are a function of both specific sources AND specific destinations. For example, in a hypothetical case of 400 hypertext links
found by an Altavista query "link:argument host:argument" for two countries, say, China and Japan
From/To | CN | JP |
CN | 100 | 100 |
JP | 100 | 100 |
the final values will result from multiplying each original score by the percentage of correctly identified hosts for the source country and by that for the target country (here, China's CN sites and Japan's
JP sites). This means that the adjusted values, using data from Table 6, will read:
From/To | CN | JP |
CN | 92 (=100*0.96*0.96) | 96 (=100*0.96*1) |
JP | 96 (=100*1*0.96) | 100 (=100*1*1) |
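The adjustment illustrated above can be written as a short Python sketch. It is offered as an illustration of the arithmetic only: the names `CORRECT_SHARE` and `adjust_links` are hypothetical, the raw counts are the hypothetical values of the example, and the two shares are the Table 6 figures for China (96%) and Japan (100%).

```python
# Share of hosts correctly associated with each country code (Table 6).
CORRECT_SHARE = {"CN": 0.96, "JP": 1.00}


def adjust_links(raw, share=CORRECT_SHARE):
    """Scale each raw link count by the source share and the target share."""
    return {
        (src, dst): round(count * share[src] * share[dst])
        for (src, dst), count in raw.items()
    }


# Hypothetical raw counts: 100 links for every source/target combination.
raw = {(s, d): 100 for s in ("CN", "JP") for d in ("CN", "JP")}
adjusted = adjust_links(raw)
for pair in sorted(adjusted):
    print(pair, adjusted[pair])
```

Because each count is multiplied by two shares, the correction grows quickly for country codes with a low percentage in Table 6: two codes at 50% each would leave only a quarter of the originally reported links.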
Finally, it appears that the above problems stemming from the use of the
"link:argument -host:argument" command at the level of whole countries can be partially alleviated by the use of the command "link:argument -domain:argument".
When the new query syntax is used, we need to worry only about calculating the reduced size of the target area.
6. Acknowledgements
My thanks are due to Ms Ann Andrews for her useful comments on the first draft of this paper.
7. References
[The great volatility of online information means that some of the
URLs listed below may change by the time this article is printed. The
date in round brackets indicates the version of the document in
question. For current pointers please consult the online copy of this
paper at http://www.ciolek.com/PAPERS/weblinksurvey2001.html]
- Bharat, Krishna and Andrei Broder. 1998. A technique for measuring the relative size and overlap of
public Web search engines (v. Apr 1998).
www7.scu.edu.au/programme/fullpapers/1937/com1937.htm
- CIA Central Intelligence Agency. 2000. The
World Factbook 2000 (v. 1 Jun 2001).
www.cia.gov/cia/publications/factbook/index.html
- Ciolek, T. Matthew. 1998. Exploring the
Digital Annapurna: On Monitoring and Mapping of Asian Cyberspace.
Paper presented at the International
Convention of Asia Scholars (ICAS), Noordwijkerhout, Leiden, The
Netherlands, 25-28 June 1998
www.ciolek.com/PAPERS/leiden-98.html
Also, republished in: AsianDOC
Electronic Newsletter, 1:3 (Oct 1998).
asiandoc.lib.ohio-state.edu/v1n3/tech/annapurna.html
- Ciolek, T. Matthew. 2000. Networked
information flows in East Asia: a pilot study on research uses of the
Altavista search engine RSPAS, Australian National University, Canberra, Australia.
www.ciolek.com/PAPERS/easian-info-flows.html
- Compaq Corporation. 2001. Altavista - The Internet's Home Page (v. Jun 2001).
www.altavista.com
- Human Development Report Office. 2001. Human Development Report.
New York: United Nations Development Programme (UNDP). (v. 3 Jul 2001).
www.undp.org/hdro/
- Google Inc. 2001. Google Inc. (v. 21 Jun 2001).
www.google.com
- ISC Internet Software Consortium. 2001. Internet Domain Survey (v. 28 May
2001).
www.isc.org/ds/
- RIPE Network Coordination Centre. 1997. ISO 3166 Countrycodes (v. 1 Jun 2001).
ftp://ftp.ripe.net/iso3166-countrycodes
- Saltman, Alex. 2001. Measure of Success - The Latest Economic Indicator: Creativity.
Wired, 9.05, May 2001, pp. 83-86.
- Thede, Nancy. nd. Human Rights and Statistics
- some reflections on the no-man's-land between concept and
indicator. International Centre for Human Rights and Democratic Development. (v. 3 Jul 2001).
www.ichrdd.ca/111/english/commdoc/publications/demDev/statisticsIndicators.html
- The World Bank Group. nd. World Development Indicators (v. 2 Jul 2001).
www.worldbank.org/data/wdi2000/
8. Version and Change History
- Other revisions incorporate minor editorial and markup fixes.
- 04 Jan 2002 - added an Appendix
Maintainer: Dr T.Matthew Ciolek (tmciolek@ciolek.com)
Copyright (c) 2001 by T.Matthew Ciolek. All rights reserved. This Web page may be freely linked
to other Web pages. Contents may not be republished, altered or plagiarized.
URL http://www.ciolek.com/PAPERS/weblinksurvey2001.html