Hello, Ohio Internet Genealogists,
You may have heard some buzz and even outrage recently about Ancestry.com's new
Internet search engine, called the "Internet Biographical Collection" (IBC). It
has the following characteristics:
1) It has an Internet web crawler or robot that scans the web for biographical information
that might be of interest to its customers;
2) The pages it finds are processed to discard irrelevant material;
3) The pages are indexed, so that their content can be searched at the same time as other
Ancestry databases;
4) The pages are cached, so even if you take your originals down, they will remain
available at Ancestry (and updated materials are apparently slow to replace the previous
versions in their cache -- some pages are four months old!);
5) When the pages are loaded, they are inside an Ancestry-controlled frame.
Item #4 is felt, in some circles, to be a copyright violation, as it removes an
individual's control over how their copyrighted material is published (i.e. Ancestry
inserts itself into the middle of the transaction without permission).
Item #5 bothers more people because it tends to obscure the true ownership of the web
pages and keeps users in the Ancestry world rather than letting them learn more about the
originating site, e.g. USGenWeb.
It didn't help that, initially, the IBC was only available to paying Ancestry
subscribers, which made the page owners feel like Ancestry was making money at the expense
of their efforts to provide free access to genealogical materials.
Ancestry quickly made the IBC generally available; however, there is still a great deal
of uproar over the matter, and Ancestry hasn't made any promises to properly address user
grievances. They seem to be adjusting their systems, but it's unclear what the final
form will be.
******************************************
In any case, the point of this message is to remind everyone of the control that we do
have over these issues.
First, you can prevent Internet robots from scanning your pages at all, if you don't
want them to (provided they play fair). This involves creating a file named robots.txt in the
root directory of your server, i.e. the file
www.yourserver.com/robots.txt. To do this you
must have access to the root level of your site, which means that you either control your
own domain, or you can prevail upon your system administrator to write this file for you.
The file robots.txt can include the following text in it:
User-agent: *
Disallow: /
The slash refers to the root of your web space, which you would use if you control the
server. If you have a subdirectory such as
www.yourserver.com/ohio/yourcounty, then the
file would look like this:
User-agent: *
Disallow: /ohio/yourcounty/
The * means that this applies to any robot. If you just want to prevent Ancestry from
scanning your pages, use this text:
User-agent: MyFamilyBot
Disallow: /
or
User-agent: MyFamilyBot
Disallow: /ohio/yourcounty/
as appropriate.
Note: those of you with web sites on Rootsweb probably won't be able to use this
technique.
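If you want to double-check that a robots.txt rule really does what you intend, Python's
standard-library urllib.robotparser applies the same matching rules a well-behaved robot
would. This is just a sanity-check sketch using the example rule and server name from
above (www.yourserver.com is a placeholder, not a real site):

```python
from urllib import robotparser

# The robots.txt rule aimed only at Ancestry's crawler, as shown above.
rules = """\
User-agent: MyFamilyBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# MyFamilyBot is shut out entirely...
print(rp.can_fetch("MyFamilyBot", "http://www.yourserver.com/ohio/yourcounty/"))   # False
# ...while any other robot may still crawl the same page.
print(rp.can_fetch("SomeOtherBot", "http://www.yourserver.com/ohio/yourcounty/"))  # True
```

Note that this only tells you what a polite robot should do; a robot that ignores
robots.txt will never consult the file at all.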
******************************************
Second, you can exercise more control by putting <meta> elements into your pages,
between <head> and </head>. They affect only that page, but they let you be
more specific about the robot behaviors you will allow. Three in particular matter:
indexing a page's content, following its links, and archiving (caching) the page in its
entirety.
To let the robots index your site and crawl from page to page, but not archive it, use
this element:
<meta name="robots" content="noarchive">
To prevent the robots from any of indexing, following, or archiving, use this element:
<meta name="robots" content="noindex,nofollow,noarchive">
(Unlike the robots.txt file, this even prevents your page from being indexed when its
link is obtained from another website.)
To prevent the Ancestry bot from indexing, following, or archiving your page, while
only preventing other robots from archiving it, use these elements together:
<meta name="robots" content="noarchive">
<meta name="MyFamilyBot" content="noindex,nofollow,noarchive">
General info can be found at:
http://www.google.com/support/webmasters/bin/answer.py?answer=35301
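As a sketch of how a compliant robot reads these elements, the Python standard library's
html.parser can pull out the directives a given bot would see. The page string and the
RobotsMetaParser class below are purely illustrative (they are not part of any real
crawler); the example feeds in the two meta elements shown above:

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and bot-specific meta elements."""
    def __init__(self, bot_name="MyFamilyBot"):
        super().__init__()
        self.bot_name = bot_name.lower()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        # A bot honors both the generic "robots" element and one naming it specifically.
        if name in ("robots", self.bot_name):
            content = (attrs.get("content") or "").lower()
            self.directives.update(d.strip() for d in content.split(","))

page = """<html><head>
<meta name="robots" content="noarchive">
<meta name="MyFamilyBot" content="noindex,nofollow,noarchive">
</head><body>...</body></html>"""

parser = RobotsMetaParser("MyFamilyBot")
parser.feed(page)
print(sorted(parser.directives))  # → ['noarchive', 'nofollow', 'noindex']
```

Any other robot, looking only at the generic "robots" element, would see just
noarchive.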
******************************************
Third, you can prevent other sites from framing your pages inside of theirs by including
the following code in your pages, between <head> and </head>:
<script type="text/javascript">
function breakFree() {
  if (parent.frames.length > 0) parent.location.href = location.href;
}
</script>
And then change the body element to read as follows:
<body onload="breakFree();">
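Putting the pieces together, a minimal page with the frame-breaking script in place
would look like this (the title and body content are placeholders for your own):

```html
<html>
<head>
<title>Your County Page</title>
<script type="text/javascript">
function breakFree() {
  if (parent.frames.length > 0) parent.location.href = location.href;
}
</script>
</head>
<body onload="breakFree();">
...your page content...
</body>
</html>
```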
Thanks to Linda Lewis for much of this information.
******************************************
Scott
Guernsey County