mamamusings: search

elizabeth lane lawley's thoughts on technology, academia, family, and tangential topics

Monday, 2 October 2006

nick carr on "algorithmic integrity"

Nick Carr has a stunningly good post up today about search engine rankings, which can easily be manipulated by determined parties (a process known as “Google bombing”).

He quotes spokespeople from both Google and Microsoft defending the fact that the number one result for a search on Martin Luther King is a white supremacist site. Google’s spokesperson said that they “can’t tweak the results because of that automation and the need to maintain the integrity of the results,” while Microsoft’s representative said that they “always work to maintain the integrity of [their] results to ensure that they are not editorialized.”

Here’s how Carr responds to those positions:

By “editorialized,” [the Microsoft spokesperson] seems to mean “subjected to the exercise of human judgment.” And human judgment, it seems, is an unfit substitute for the mindless, automated calculations of an algorithm. We are not worthy to question the machine we have made. It is so pure that even its corruption is a sign of its integrity.
Posted at 2:28 PM | Permalink | Comments (0) | TrackBack (0)
categories: microsoft | search | technology

Friday, 27 January 2006

edge cases and early adopters

This week was the fourth version of Microsoft’s “search champ” program, and the first one where I’ve been heavily involved in the planning (rather than simply being an attendee). It was a great meeting, with some amazing people providing input into new product development in MSN/WindowsLive. I got to see old friends (like Cindy and Walt), and be a fangirl (hi, Merlin!).

During the wrap-up session, when Robert Scoble was talking about designing tools that would optimize everyone’s syndication experience so that they, too, could read 840 feeds, I called him an “edge case.” He didn’t like that. Not one bit. But his defense was, to me, unconvincing.

Robert’s an “edge case” to me in this context because very few people will ever have the time or the inclination—regardless of how good the tools are—to read that many sources. Robert does this not because he’s some freak of nature, but because he’s got a job that requires him to monitor activity in the technology community. When I worked at the Library of Congress, I had a job that required me to read dozens of newspapers and magazines every single day, looking for articles related to governmental initiatives. That made me an edge case. Most people don’t read dozens of news publications every day, and it’s not that they want to but simply haven’t found the tools to do it. It’s that they don’t have a need for that much diffuse information.

He felt I used the term derisively, which I didn’t. He’s right that edge cases often push us in new directions, and I’ve got a long-standing interest in liminal spaces (the fancy academic term for those in-between spaces where contexts overlap and new ways of thinking and acting often emerge). But in his reaction, he confused what I see as two very different things—edge cases and early adopters. In this case he’s both. But his response focused much more on how his early adoption of new technologies—from macs to blogs—foreshadowed broader adoption. That’s about being an early adopter, which is not synonymous with being an edge case.

So what’s the difference? To me, an early adopter is someone who recognizes the value of a new technology or tool before it becomes widely used or accepted, and often evangelizes it to others. They recognize trends before they’re trends, and are the ones who are always acquiring the latest-and-greatest technical toys. An edge case is someone who’s on the extreme edge of an activity, regardless of whether they’re an early adopter. Someone who reads 840 blogs is an edge case. But so is someone who reads dozens of daily newspapers, or runs 10 miles every morning. Their choices may influence our behavior—those edge cases are great at recommending things to others—but most people will be far more moderate in their behavior.

There’s a story I cite a lot when I’m talking to people about diffusion of technological innovation. Back in my early days as a librarian in the 1980s, online searching didn’t mean launching a web browser and going to Google. Instead, it meant connecting via dial-up to an online database and doing searches with complex boolean operators. Librarians loved this, and decided that the whole world needed to learn the “joy of searching.” It was that whole “teach a man to fish and he eats for a lifetime” mentality. One day at a library conference, I heard a wonderful speech by Herb White in which he scolded librarians for this mentality. “I have no joy of searching,” he told the audience. “I have joy of finding!”
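For readers who never used those dial-up services: a boolean query combines search terms with AND, OR, and NOT. As a minimal sketch (the query and the tiny index here are invented for illustration, not any particular vendor’s syntax), the same logic can be expressed in Python as set operations over an inverted index that maps each term to the documents containing it:

```python
# Each term maps to the set of document IDs containing it.
index = {
    "library":    {1, 2, 4},
    "librarian":  {2, 3},
    "automation": {2, 3, 4},
    "academic":   {3},
}

# The boolean query "(library OR librarian) AND automation NOT academic"
# becomes union, intersection, and difference on those sets.
hits = (index["library"] | index["librarian"]) & index["automation"] - index["academic"]
print(sorted(hits))  # → [2, 4]
```

The point of the Herb White story stands: this is exactly the kind of precision most searchers never wanted to learn, even though it’s conceptually just three set operations.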

In that context of online searching, librarians were both edge cases and early adopters—much like Robert is with blogs and syndicated feeds. They’re edge cases because they do in fact love to search as much as they love to find. They find it hard to believe that not everyone would want to learn arcane search syntax in order to improve their online search experience. But they’re also early adopters—they were finding things online before the web was born, and they continue to push the limits on how you can use online search tools (one of my most popular posts ever was a transcription of Mary Ellen Bates’ fabulous “30 Search Tips in 40 Minutes” talk from the 2003 Internet Librarian conference).

Anyone who’s looked at aggregated query logs from a search engine knows that most of the people doing online searching these days aren’t masters of the boolean query. They didn’t become like the edge cases. But they did follow the early adopters—just in a more limited way.

So, Robert, my point wasn’t that because you’re an edge case nothing you do is relevant to other users. Nor do I think being an edge case is bad (I consider myself to be one, too). But the people who follow your lead as an early adopter won’t do it the way you do. They’re simply not going to want or need to read 840 syndicated feeds. And to try to optimize the user experience based on the needs of edge cases isn’t, I think, in anyone’s best interest.

Posted at 11:03 AM | Permalink | Comments (5) | TrackBack (4)
categories: microsoft | research | search | social software

Wednesday, 26 October 2005

internet librarian 05: search engine choices

Greg Notess and Gary Price, two genuine experts on search engines and our choices.

Greg and Gary both start out by saying “Google’s not the only answer.” It’s the job of information professionals to know all of the options, not just the most popular one. Gary notes how hard it is for anybody but Google to get the word out about their products.

Current web search engines with unique databases
* AskJeeves
* Google
* MSN (says librarians really should pay more attention to this!)
* Yahoo

meta engines
* A9
* clusty/vivisimo
* dogpile (one of the few that hits all 4)

vertical

Greg says that he doesn’t like to start his searches with Google. As a reference librarian, if he starts with something other than Google it boosts his credibility with patrons—he’s not just doing the same thing that they do! :) Shows the example of a discussion list posting that was only available on Yahoo (not on Google or MSN). If you care about comprehensiveness, you have to be willing to use multiple sources.

AskJeeves gives you a different kind of relevance view. Says they’ve come the farthest on “quick info” on a search. Shows a search on “Chicago” as an example. He and Gary then also show a search on “the Beatles,” which gives you a variety of useful “expand your search” options. They note that AskJeeves has reduced the number of ads on their pages, which many people don’t realize. (In contrast to other engines…)

MSN Search is up next. Acknowledges that not all Microsoft products are best of breed. BUT…MSN search is no longer powered by other people’s indexes, and right now they’re doing a better job than anyone else of keeping things fresh. They also mention that MSN Search gives you free access to Encarta content. You get two hours of access each time you do a search leading to Encarta (can limit to Encarta only, or let it be part of the overall results). They haven’t promoted it, but it’s a feature that librarians should be promoting—particularly as a comparison to wikipedia.

Shows MSN’s search builder, which is great for showing people how to build complex searches—uses drop-down boxes and sliders for ranking. They don’t show start.com; will have to ping them about that, because I suspect they may not be aware of it.

Next up is Yahoo; they recommend that people use search.yahoo.com rather than yahoo.com, to avoid clutter. Shows that you can edit the tabs (there’s a tiny “edit” link up there…) to the kinds of vertical/specialized searches you want. (That’s cool! I didn’t know that!) If you’re logged into Yahoo, the settings will follow you. In advanced search, they show off the creative commons option, as well as their “subscriptions” search, which is extremely interesting (Mary Ellen mentioned this on Monday, too). He shows the blog search stuff that’s been added (that’s another post that’s brewing for me; I’m extremely unimpressed by their implementation of blog search). Then they show Mindset, as well—again, I don’t love that shopping/research is the only axis. Shows the shift from “did you mean”…

Complains about lack of transparency in how search engines (especially Google) work.

Damn. I need to go to the airport, and will miss the metasearch and vertical search discussion. Hopefully someone else will blog it…I’m outta here!

Posted at 11:07 AM | Permalink | Comments (2) | TrackBack (0)
categories: conferences | librarianship | search

internet librarian 05: google debate

Rich Wiggins squares off against Roy Tennant in a debate over “Google: Catalyst for Digitization or Library Destruction?”

Rich starts off, and is utterly charming. Some funny starting slides, hard to capture in print because of their visual impact.

Starts by talking about a similar debate they had 4 years ago. (The slides are dense with bullet points now, and I’m sitting where it’s hard for me to see the screen, so I’m not going to try to transcribe them. Later I’ll look for a pointer to the presentation online.)

How many bytes are in the Library of Congress? This is a non-trivial question, with lots of technical aspects. You can’t gloss those aspects (resolution, color, etc.) because you’ll end up wasting effort. Rich cites Brewster Kahle’s estimate of 20 terabytes.

Rich says it’s becoming so inexpensive to capture full-text and images that complete digitization is becoming realistic. Disk space is cheap, scanning technology has improved. He asked Google what they’re using, and they wouldn’t answer. (Color me shocked…) I wonder whether Microsoft will be more forthcoming, considering their partnership with OCA. I hope so.

Refers to the comment last night by Stephen Abrams that we spend more money getting a book through ILL than we do to buy it. (That’s a really interesting thing to think about.)

There are a bunch of straw man arguments here. He dismisses the preservation argument—we have better access, since you can still get the stuff online after a fire. (But what happens when the power goes out? That happens a lot more often…) Doesn’t address the question of what happens when data is stored in proprietary formats—do we know what format Google will store this information in?

His bottom line, “Google Print has taught us to ‘think big.’” (hmmm. does the period go before or between the single and double quotes there?)

Argues that this vision of digitization will have to be done by a forward-thinking company — not by government. It has to be a company. (He claims that Google invented Ajax!!!!) Mocks Microsoft, saying they’re playing catchup, and not very well. “Hmmm…Google’s going to digitize millions of books? We’ll digitize 150,000!”

Now it’s Roy’s turn. Starts out by saying that his bottom line is “more access is better.” He thinks it’s great that Google’s digitizing stuff, that OCA is doing it, that libraries have been doing it for decades. There’s a lot of room for everyone to be involved. Says he’s going to try to be provocative, and starts out a halloween-themed slide that reads “Google: Devil? or Merely Evil?” (I didn’t get a photo of this, but would love to get the slide from him.) Says he’s going to talk about the scary monsters that he sees lurking in this project.

The first monster: the fair use problem. He’s concerned about Google trying to shield themselves with fair use. Because this has pulled the issue into the courts, it has the potential to result in restriction of fair use rights for everyone, including libraries.

The second monster: Closed access to open material. For example, there are many copies of Call of the Wild that are freely available. But when you go to Google Print, you won’t know that—you’ll see the reprinted, proprietary version from a publisher, without an indication that it’s in the public domain and can be found from other sources. “And to add insult to injury, they give you links to buy the book, but no links to libraries.” He’s been assured this will change, but it hasn’t happened yet, and there’s no guarantee that it will.

The third monster: Blind, wholesale digitization. He’s not so sure this is a good thing. Large collections in research libraries are choked with out-of-date crap, so that their collection numbers are high enough to keep them in their “tier.” Also, because copyrighted information is more difficult to get to, people will rely on old, out-of-date information because it’s free and easy to get to. Is this a good thing? (This is a great point that I haven’t heard mentioned before.) OCA is more focused on selective digitization—for example, American literature.

The fourth monster: advertising. How long before we see ads for antidepressant medication next to Hamlet? Google’s window of opportunity to do “good things” will be constricted by their responsibility to stockholders.

The fifth monster: secrecy
The agreements between Google and libraries have been largely kept secret. Before the announcement, the Google libraries could not even talk to each other. Michigan revealed theirs (but not until a Freedom of Info Act request forced it, and months after the project was announced). Rumor has it that UM has the best agreement from the library perspective, and that other libraries are agreeing to much less onerous terms. This is a hot button for me. One of the things that I really like about Microsoft is the extent to which its researchers regularly collaborate, publish, and present outside of the company. If Google’s intent is purely philanthropic, why does the commitment to “provide access to the world’s information” stop at their front door?

The sixth monster: longevity.

Now Adam Smith gets a chance to respond. Flashes a charming grin, and says “I’m not that dangerous, am I?” :) (This is what scares me most about Google. Their people and their products are indeed so seductively charming, it’s easy to take their claims of purely philanthropic motivation seriously.)

He encourages feedback and criticism—says that’s how they make their products better. They launch things quickly so they can get feedback quickly. They walk a difficult path in trying to make many parties happy. Their goal is to make information more accessible, not hidden in library stacks. Says he’ll be here to answer questions.

He’s asked about the scanning process—they’ve developed a proprietary non-destructive scanning process, but are not at liberty to disclose that. Someone asks about privacy, Adam refers them to Google’s privacy policy. Someone else asks if it’s true that one of the libraries requested that only manual page turning be part of the scanning, and he again invokes “no comment.”

I ask about the disjoint between the stated policy of helping the world by making information accessible and the veil of secrecy surrounding everything they do, and he’s unable to respond—says he’s only been there two years, and isn’t really familiar with the reasoning behind their policies on disclosure. I express surprise that he hasn’t asked for clarification, since I would think he’s asked this fairly often, and he says he’s never been challenged on this in a public forum before. I’d love to think that’s not true, but I suspect that the Google mystique, which they cultivate so very well, has a lot to do with that.

Lots of discussion, not all of which I capture mentally (let alone here on the screen).

Posted at 10:31 AM | Permalink | Comments (1) | TrackBack (1)
categories: conferences | librarianship | search

Tuesday, 25 October 2005

internet librarian: the googlebrary

Tonight’s panel is moderated by Stephen Abrams, with a number of library pundits and Adam Smith from Google Print. Before the presentation even begins, a young man circulates around the room handing out a glossy sheet with the Google logo at the top entitled “The Facts About Google Print.” Gotta love their ability to spin things. It’s not an “FAQ,” it’s not “information”—it’s Facts.

I’ve spent a lot of time over the past few days talking with librarians who are openly enthusiastic about Google’s digitization project—not because they love Google, but because they desperately want this information in searchable form. This evening at the speaker’s reception, someone said to me “the only question is when this will happen.” I looked at him in surprise, and responded that I thought that an equally important question was “who.”

So, the panel’s about to start…and the first thing I notice is that I seem to have been transported into a web 2.0 panel: all white men, all the time. The only difference is that all of these men are over 40. <sigh> I don’t mean to denigrate any of the panel members—they’re all smart, accomplished guys. Rich Wiggins from MSU, Steve Arnold from Arnold Info Systems, Roy Tennant from Cal Dig Lib, Mark Sandler of Univ Mich, and Adam Smith from Google Print.

Oh…wait! Barbara Quint, editor of Searcher Magazine, is here, virtually (via speaker phone). A truly invisible woman in this case.

Stephen Abrams is a great moderator—energetic, funny, engaging. Notes that Google’s under fire from publishers and authors, and now the threat of congressional hearings. “I’m sorry, I’m from Canada. We think your congressional hearings are great entertainment.”

Starts with Adam. “I’m Adam, I’m from Google, and I’m here to give you the TRUTH about Google, and dispel the misinformation that’s out there about Google.” (Heh…”I’m from the government Google and I’m here to help you.”)

“We’re doing this out of necessity, not desire.” (They’re hitting this line hard in a lot of contexts these days; I rather liked Nicholas Carr’s comment on this approach last week.)

Shows the three “user experiences” they intend: the publisher program, public domain books, and copyrighted books. The last is the one that’s most contentious. Smith says: “This is allowed under fair use.” Huh. Judge and jury, case closed? If it were that clear cut, would there be this much controversy surrounding it? While they may well be right, to present opinion as fact is troubling.

Abrams takes over again, and says that we’re going to move fifteen years into the future. We’ve built the megalibrary, and we’re looking back: what did we do right? Or…what did we do wrong? How did we get here?

Rich Wiggins starts out. He appears to have fallen under the Google spell… “Looking back, the leading search engine company, worth billions, has digitized the world’s culture.” A truly utopian vision. (I like Rich, and he’ll probably read this, so I’ll apologize in advance—Rich, I’m criticizing the ideas and tone, not the person. :)

Roy Tennant totally takes the other end: Google is bankrupt due to mismanagement, and the rest of the world has figured out how to do digitization well. (Adam, he says, has cleverly cashed out in 2009.) The MARC format is dead, libraries have discovered that systems don’t integrate well, and have come to grips with how to change them. I like this Utopian vision a lot better than the last one! (He and Rich are debating tomorrow morning; I’ll definitely have to attend that keynote!)

Mark Sandler: In 2020, Internet Librarian has become the Librarian conference; ALA in turn has become the American Print Library Assn. (Much laughter…) Google may or may not be there—he doesn’t know what the life span of a 7-year-old multi-billion dollar company is. But in Billings MT and Berea KY there are now libraries with 50 million, 100 million volumes available to their readers (from the speakerphone, Barbara’s voice cries “Yes! Yes!”).

Barbara looks back from 2020 to 2006, when Google launched “Google Press” (I can’t make sense of what she’s saying—the voice cuts in and out…) Five years later, it is renamed the “Google Full Court Press.” (wish I could hear all of this)

Steve talks about his book, “The Google Legacy.” Says he’s the only person in the room whom Sergey Brin has said is stupid. (Anybody have the cite to that? I couldn’t find it in a quick search…) He says he’s not interested in Google Print or Google Scholar, he’s more interested in GoogleBase, which allows Google to become world’s largest publisher of scientific information. Abrams asks him to explain GoogleBase, and he responds: “I’m not explaining Google Base. It’s not my job. Sergey thinks I’m stupid, and we have someone here from Google that Sergey thinks is smart. Let him explain it.” Heh.

He makes a critical point here, though. Microsoft’s products don’t delight. Google’s products do delight. (Quick round of Microsoft bashing ensues, during which I’m glad I’m not on stage. :)

Adam gets to have his futuring moment. Says 2006 was a turning point year, where “we all worked together to do the right thing.” We freed ourselves from the worries of digitization and formats. In 2020 everyone is an author, everyone is a publisher, everyone is an archivist, everyone is involved in the creative process. (He should read danah’s post from nearly two years ago… “Consumption and production are fundamentally different and there are different forms of pressure when engaging with either. There is no way that one can possibly say that the threshold for consumption is equivalent to the threshold for production.”)

(Roy suggests a round of Kumbaya at this point. I nearly fall off my chair. You go, Roy!)

Stephen asks “what will happen to the librarians in 2020?”

Mark says that some of them will be gone. Why would we need “local providers” when they have the WalMart of libraries? (He says this with a straight face…at least Roy seems to raise his eyebrows.) Local libraries are going to have to change their mission. It has to be about access, about pampering users and adding real value to their lives. They’re going to be like “cosmetic counters”. WTF?!? Apparently he’s serious here—he keeps going on this tack, as I become increasingly astonished.

Barbara weighs in over her spotty audio feed. (I have to ask…why are they using a telephone line run through the sound system rather than a high-quality IP solution with a direct audio line out of the computer? Skype gives far better quality than what we’re hearing.) She says readers are more tightly connected to authors, and authors are building books out of Google’s content. Book prices are dropping, open access keeps increasing. Librarians are helping to discriminate between good, bad, lousy and lousier materials. “When everything is digital, you’re paying people to help you not read bad stuff.” Librarians become censors. (Why the choice of that extraordinarily loaded word rather than the less judgmental and polarizing term “filters”?)

Roy says he wants to jump into this “digital lovefest.” Digital won’t make print go away—it never will. Putting digital materials online increases book circulation. Libraries have never been just about “stuff.” They’re about service. That doesn’t change when collections are digital. (Yay!)

Rich says the cloudy part of the crystal ball is about how we’ll be accessing this information. Display technology will change a lot about how we access things. If we have “e-paper” widely available by 2020, it changes this discussion.

Steve says everyone in this room needs to wake up the associations and get them more engaged in the role of the library as an institution. Unless that happens, we’ll have a repeat of what happened in Salinas, where the library was shut down. This is a job for everyone here to carry back to the associations and be militant about it, so we don’t become marginalized. Also, the library is an institution about learning and information, not limited to a type of material. It is a manifestation of how to organize and access information, whether it works with digital or print artifacts. Having said that, he thinks there will be a “pushing down” of librarianship into some institutions (like schools), and a pushing up into businesses—but the pain will be in the middle. That’s where the impact of Google will be.

Abrams breaks in, and says Adam is an “immigrant” into the world of libraries. What does Adam think?

Adam responds by saying that just because everything is digital doesn’t mean everything is good. (Um, yeah. This isn’t news to anyone in this room.) Editorial control will still be relevant and important. How do we communicate what’s good, when everyone’s “good” is a little different. Hopefully the “truly good” will rise to the top.

Stephen points out that Google has two new patents for determining the “quality” of information. Asks Adam what the impact of that will be on libraries. Smith doesn’t seem to really answer the question directly.

Audience questioner takes the room to task about the fact that we’re taking this very lightly; also points out that many of the panel members have a vested interest in Google’s success in this space. Barbara responds (again nearly unintelligible, but seems to be focused on serials).

Librarian from a small library says that his life isn’t long enough to read what they already have, let alone adding so much more. How do we evaluate all that information? (I’d like to see more discussion of collaborative filtering here…) Mark responds that as a collection dev officer, they try to buy “all but the very worst books.” Says in research libraries they’ve always operated on the “long tail” model—you can’t anticipate what researchers might want, so you collect broadly to try to cover all the bases. Maintaining that physical collection is tremendously difficult, and makes it harder and harder to move forward.

An audience member asks about preservation…Adam quite appropriately points to the work being done by academic researchers in this area.

A couple of questions about digital rights management. One commenter says Michigan’s agreement with Google is quite impressive in this regard. (I’m starting to feel a little bad for him; the audience wants him to answer all of their questions about what they think is wrong with Google, and of course that’s not fair for him.)

I ask about the fear of a single source—Steve responds that there will be at least three companies that will do this, that the market will force this to happen. Google will be one, obviously. Yahoo is looking at this as well. MSFT will probably be in that space. There will not be a single source, no matter how hard anyone tries. That will be emergent—the market will accomplish that. (Barbara says we have three: open content alliance with Yahoo and whoever else joins, and Amazon, and Google.) Steve disagrees—he believes there will be three, and the only one we know for sure at this point is Google. Barbara responds that right now we do have three—digitization is coming from three players, not one. Roy points out that Yahoo is only one of many players in OCA.

And then, as if on cue…

Big Announcement: The Open Content Alliance tonight had an official inaugural event in San Francisco—and at the reception it was announced that Microsoft is joining the alliance, and is funding the digitization of 150K books over the next year. Microsoft’s contribution will be known as MSN Book Search.

Smith’s response: Google absolutely welcomes Microsoft’s participation in OCA, because it’s all about making the world a better place.

Some discussion about what will happen to the physical artifacts: who will take responsibility for ensuring that the books themselves continue to exist? Will they be lost in the digital shuffle?

Roy: Librarians still have a lot to learn about Google. And Google still has a lot to learn about libraries. (he gets some applause on this)

[Oy. I’m tired. There are other things being said, but I’m no longer able to listen and process and type. Sorry.]

Posted at 9:47 PM | Permalink | Comments (2) | TrackBack (0)
categories: conferences | librarianship | microsoft | search

Monday, 24 October 2005

internet librarian 05: expert panel on searching

This is a two-part session, so it will go for nearly two hours. We’ll see how long I last. But I feel some sense of obligation to go to the search-related sessions so that I can go back and ask MSR or MSN to reimburse me for the extra day here that the conference organizers didn’t cover (I get two nights in the hotel as a speaker, but if I’d only stayed for that I would have missed a lot of the most interesting search presentations on either Monday or Wednesday).

Genie Tyburski starts out by talking about “setting limits” on time, sources, email, etc…makes me wonder if this is going to be somewhat like a “lifehacking for librarians” session. (If not, that would be a great session for a future IL panel, I think. Jane, you reading this? What do you think? :) She says email is unreliable, unproductive, and distracting. (Well, you could say the same thing about people, couldn’t you?) She talks about disposable addresses for logging into websites (I prefer the BugMeNot approach, when possible). Yes, this is sounding a lot like a lifehacker kind of talk. Not sure I’m going to get a lot out of it, since I’m already a faithful reader of 43 Folders and Lifehacker, and a recent convert to the GTD approach. She pushes RSS, but I see this as a false dichotomy. It’s not an alternative to email, unless most of your email comes from distribution lists. RSS is great for one-to-many, but lousy for one-to-one or many-to-many.

She talks about a tool called “WebSite-Watcher,” which she runs as a desktop application to monitor websites for changes. (Ah, shades of the infamous Winer-Watcher…) I’d prefer to lean on publishers to provide RSS rather than using this approach (I assume this is basically screen-scraping to generate the equivalent of RSS updates). Also mentions one called TrackEngine—she describes it as a similar approach, but a quick look at their site makes me wonder. They describe themselves as an “active bookmark manager”—will have to spend a little time with it to see what it involves.
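If my assumption above is right—that tools like WebSite-Watcher work by periodically fetching a page and diffing it against the last copy—the core of that approach can be sketched in a few lines. This is purely an illustrative guess at the mechanism (the function names and the hash-the-whole-page strategy are mine, and a real tool would presumably filter out noise like timestamps and rotating ads before comparing):

```python
import hashlib

def content_fingerprint(html: str) -> str:
    # Hash the fetched page so we can cheaply detect any change
    # between polls without storing the full previous copy.
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def page_changed(old_fingerprint: str, new_html: str) -> tuple:
    # Compare the stored fingerprint against freshly fetched HTML;
    # return (changed?, new fingerprint to store for the next poll).
    new_fp = content_fingerprint(new_html)
    return (new_fp != old_fingerprint, new_fp)

# Simulated polling cycle (a real monitor would fetch over HTTP
# on a schedule and raise a notification when changed is True):
fp = content_fingerprint("<p>old version</p>")
changed, fp = page_changed(fp, "<p>new version</p>")
print(changed)  # → True
```

Which is part of why I’d rather lean on publishers for RSS: the feed tells you *what* changed, while a diff like this can only tell you *that* something changed.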

Next up is Gary Price, from ResourceShelf.com and SearchEngineWatch.com. Can’t read the stuff on his screen, but it’s online. He reminds us of how few people have actually heard of RSS—the Yahoo survey said 12%. Points out how important explaining and describing this to end users is. He talks about a couple of bookmarking/clipping sites: Furl, eClips, filangy (huh…haven’t heard of this last one. worth exploring). He also demos WebSite-Watcher, and recommends it highly. My first impression is that it’s so ugly—but clearly it has devoted users.

Whoa—he gives the first mention of MSN I’ve heard, and a plug for start.com. Nice to hear someone talk about a site other than Yahoo.

Shows indeed.com, a metasearch engine for job sites—not just compilation sites, but also job postings on corporate sites—here’s a search for Microsoft jobs in Redmond. Points out that monitoring job openings can give you insight into what companies are up to.

Recommends Whois Source for good domain name searching/monitoring. Provides some nice tools; will have to start using this one.

Shows a couple of useful special-purpose research and news sites:
* Diplomacy Monitor for government documents from all over the world
* Paper Chase for legal documents
* iHealth beat for health technology
* SmartBrief: targeted newsfeeds on industry topics (subscription required, but it’s free)
* Topix.net: he calls this his service of the year for 2005, the best news service he knows of—better than Yahoo or Google
* NewsNow.co.uk: awful search, but great sources and topic organization

He’s reeling off more stuff, but I’m burning out here. :/ Think I’m going to skip out on the last section, which is Steven Cohen’s riff on RSS, followed by Q&A. I need the mental break more than I need more links…

Posted at 4:06 PM | Permalink | Comments (0) | TrackBack (0)
categories: conferences | search

internet librarian 05: 30 search tips in 40 minutes

Mary Ellen Bates’ annual search tips talk. This was a great talk two years ago, and I’ve been looking forward to this year’s version. I just hope I can keep up!

  1. Mine the Creative Commons for images, audio, web site tools (commoncontent.org is a hierarchical catalog; creativecommons.org/find/ is the CC search tool; Yahoo CC search is more comprehensive [search.yahoo.com/cc])
  2. Use MyYahoo’s MyWeb 2.0 feature to search “my and my friends’ sites” [Argh! Another one of my topics for tomorrow!] She focuses on the “search my sites” rather than the “search my community,” though, so she leaves me a nice window.
  3. See also Ask Jeeves’ myjeeves.ask.com, which allows you to click “save” on the search results page for the items you want. Allows you to create an “annotated webliography” (great term!)
  4. Google’s Personalized Search searches pages you’ve visited before. You can turn search history on and off at www.google.com/psearch. (Calls this and the previous two “the rise of the truly customized electronic ready-reference shelf.”)
  5. Start searching podcasts — podcast.net includes tags; blinkx.com searches a voice-recognition transcript; podscope.com; podcasts.yahoo.com. (Podcasts are more important now that “professional” content producers have regular podcasts—news, analysis, etc. Some content is only in audio form.)
  6. Furl It (I suspect she selects this rather than del.icio.us because it archives the full text of the page—which is a really nice aspect.) Mentions this is great for training, because it can sidestep firewalls. While she was talking, I found this nice piece on the copyright issues on Furl.
  7. See how others search the web—browse answers.google.com, and look at the search strategies listed at the end of the answers. You can learn from their approaches.
  8. Consider Wikipedia (hits standard talking points; not a bad overview considering the time crunch; the fact that it’s even being included in this talk is evidence of how much this conference has changed in 2 years)
  9. Use “squishy Boolean”; it’s a relevance ranking. Dialog’s TARGET command (target hybrid green clean car? ? automobile?); LexisNexis’ SmartIndexing relevance threshold ( subject(cybercrime 9*%) ). She has an article in the March/April 2005 Online magazine on this.
  10. Use blogs to search hidden web content. A site may not be spidered by a search engine, but someone may well find and blog it. Use BlogDigger, BlogLines, Blogdex.net, or blogsearch.google.com to find things indirectly—you’re leveraging the blog experts’ ability to find obscure content. (No time to dig up URLs…)
  11. Try a new search engine once a month. Groowe.com helps—it’s a toolbar that lets you pick from a wide range of search engines. Also NeedleSearch, for Firefox, and the “Super Search” Konfabulator widget.
  12. Yahoo’s Mindset feature (I don’t care for this because it assumes everything fits on a research/shopping slider, but I do see the value in being able to reduce ecommerce sites in search results)
  13. Watch for video-search capabilities. video.search.yahoo.com, video.google.com, blinkxtv.com, etc
  14. Use search engine “hybrids” - Scirus.com (science related web sites and fee-based services), Yahoo’s search subscriptions (get bibliographic info on subscription-protected material), ZoomInfo.com, RedLightGreen.com (search library catalogs around the world).
  15. Use BlogPulse’s Trend Search to track blog buzz over time, and see the relative popularity of terms in blogs.
  16. Search for words likely to be mentioned on a web site (looking for information on infertility drugs, she searched for the names of three different drugs from different manufacturers—this helped eliminate single-company web sites).
  17. What works best for the professional online services doesn’t work well with web searching. Complex searches don’t work on the web. Order of search terms matters in a web search. Forget precision and go for what will likely float to the top.
  18. Compare search engines. A Dogpile study found 85% of results on the first page of search results to be UNIQUE to one engine [snurl.com/gyLg]. See missingpieces.dogpile.com (shows which results are unique to each service) and [eeek! missed the next one, but it does side-by-side Google and Yahoo…which I thought was a violation of Google’s ToS]
  19. Check out Exalead.com, a new-ish search engine with great advanced search features. Supports proximity, phonetic, and “approximate spelling” searches.
  20. Collect examples of site spoofing for those “a-ha” moments in educating your clients. mypyramid.gov vs mypyramid.org; wto.org vs gatt.org (which one is the anti-WTO site?); dhmo.
  21. Watch for new applications of Google Map images. For example, housingmaps.com (she attributes this to Craigslist, not realizing that it’s a mashup between the two sites, not a craigslist feature); traffic (traffic.poly9.com)
  22. Check out newer data visualization tools. Grokker.com has a demo showing data viz for Yahoo searching. This is a big change for librarians, who are used to text results.
  23. Other visualizations — shows the treemap version of Google News. www.marumushi.com/apps/newsmap
  24. Use incominglinks.com to find specialized portals and directories. Intended for web managers to find places to get linked, but it’s valuable as a list of specialized sources.
  25. Y!Q from Yahoo yq.search.yahoo.com. Contextual searching—lets you highlight text on the page to refine your search. Can search from any web page. Requires IE toolbar or plug-in for best performance.
  26. Yahoo’s Site Explorer: siteexplorer.search.yahoo.com. Lets you search pages within a domain or subdomain. Can also generate a list of all outgoing links from a web page.
  27. Consider Amazon’s SIPs and CAPs; a great way to brainstorm search terms from a book on a topic. Also their new Text Stats for a book (“Fog Index,” average syllables per word, words per dollar and per ounce).
  28. Use phrase search to find specialized directories of information. Google and Yahoo syntax: intitle:”directory of” {subject word}
  29. Try Konfabulator widgets (yet another Yahoo property…). Includes a number of search widgets, but she talks about a wider range.
  30. Her favorite way to kill time when customer “service” puts her on hold: 20Q.net, or buy a self-contained version from amazon.com. This kind of machine learning will start to inform search tools.
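The Dogpile comparison in tip 18 comes down to simple set arithmetic over each engine’s first-page results. A toy sketch of that calculation (the URLs are invented, and a real study would normalize URLs before comparing):

```python
def compare_first_pages(results_a, results_b):
    """Partition two engines' first-page URL lists into shared and engine-unique results."""
    a, b = set(results_a), set(results_b)
    shared = a & b
    unique = (a - b) | (b - a)
    # Fraction of the combined result pool that only one engine returned.
    unique_pct = 100 * len(unique) / len(a | b)
    return shared, unique, unique_pct

engine_a = ["example.com/1", "example.com/2", "example.com/3"]
engine_b = ["example.com/3", "example.com/4", "example.com/5"]
shared, unique, pct = compare_first_pages(engine_a, engine_b)
print(sorted(shared))  # ['example.com/3']
print(round(pct, 1))   # 80.0
```

With only one URL in common out of five distinct results, 80% of this toy pool is unique to a single engine—the same kind of number the Dogpile study reported across real engines.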
Posted at 12:05 PM | Permalink | Comments (4) | TrackBack (0)
categories: conferences | search

Tuesday, 11 October 2005

search c[h]amps: striking a balance

I’m in the midst of another Search Champ meeting at Microsoft, and this one has taken a very different form from the first two. They listened to our feedback from the earlier meetings, in which we complained about (a) the fact that we were too diverse a group covering too many topics in too short a time, and (b) the highly structured, PowerPoint-driven, classroom-style format.

This time we’ve got a small (nine-person) group of blog-focused “champs,” and a very unstructured day in a couch-filled room with a few key discussion topics. Seems like a winning formula.

But…

It turns out that if you put a bunch of opinionated geeks in a room, they spend a lot of time talking over each other. And there’s definitely a gender divide in this behavior. (Also a newcomer versus returnee divide—the people who haven’t attended a search champ event before appear significantly less willing to shout to be heard.)

There’s clearly a balance that needs to be struck, and it’s one that I’m well familiar with from classroom settings. One-to-many, top-down, bullet-point-driven meetings are stultifying; free-for-all discussions end up marginalizing those unwilling to jump into the fray, and a lot of valuable things don’t get said. (Plus I’ve got a killer headache from people who seem to share my 11-year-old’s sense of what the appropriate decibel level for a small room is.)

All in all, I prefer the freewheeling to the overly structured, in large part because I’m one of those people willing to jump in, speak loudly, and demand attention when I feel I’ve got something important to say. But neither extreme is ideal, and my hope is that MSN will keep learning from these sessions and find a happier medium for future meetings.

Posted at 3:51 PM | Permalink | Comments (7) | TrackBack (0)
categories: search

Wednesday, 10 August 2005

mindshare, market share, and monopolies

I talked on the phone today (why yes, I do still use analog communication media…) with danah boyd, who took me to task for my last post. Her concern wasn’t with my negativity about Google, but with the extent to which the post made it seem that I’d become an unapologetic supporter of the Microsoft culture (or cult). Her argument was that in fact, Google doesn’t dominate search, it only dominates among the technocrats—much like PowerBooks are the toy of choice for social software geeks, but not for the world at large.

I was a little taken aback by this, because I’d been fully convinced that Google’s dominant mindshare (when was the last time you heard someone use MSN or Yahoo as a verb meaning “search”?) reflected an equally dominant market share. My interest in seeing MSN succeed was never (and still isn’t) about having a Microsoft monopoly replace a Google monopoly—it was, and still is, about there being legitimate competition in this space. I don’t want anybody having a chokehold on online information access. So I set out to do some fact-checking. (I assume that the MSN Search folks have very detailed numbers, but I didn’t want to ask for anything that I couldn’t blog about.)

I started at the Pew Internet & American Life site, since they’re generally my favorite source of solid stats on Internet use. In May and June of 2004, they conducted a survey on search engine usage. They reported on the results in both a memo from August 2004 and a more detailed report in January of 2005—the relevant piece of this survey found that when asked “Which search engine do you use MOST OFTEN,” 47% of respondents replied Google, followed by Yahoo at 26%. MSN trailed well behind both at 7%.

In an attempt to find something more recent, I did some broader searches on search engine statistics and market share, and found a Business Week article from last month entitled “Google’s Leap May Slow Rival’s Growth.” The article opens with this paragraph:

Nearly a year after Google’s (GOOG ) IPO marked the start of a new phase in Web search competition, the upstart is making industry giants Microsoft’s (MSFT ) MSN and Yahoo! (YHOO ) look like also-rans. Google’s share of U.S. searches hit 52% in June, up from 45% a year ago, according to Web analytics firm WebSideStory Inc. By contrast, Yahoo’s and MSN’s share slipped to 25% and 10% respectively. Says Mark S. Mahaney, an analyst at Smith Barney Citigroup (C ): “People haven’t been given a good reason to switch from Google.”

I also found an article from February 2005 on SearchEngineWatch by Danny Sullivan, in which he cites data received from comScore. The results he cites show Google with a 35% share, Yahoo with a 32% share, and MSN with a 16% share. Here’s how Danny describes that data:

The comScore Media Metrix qSearch service measures search-specific traffic on the internet. qSearch data is gathered by monitoring the web activities of 1.5 million English-speakers worldwide (1 million in the United States) via proxy metering.

Proxy metering allows comScore to see exactly how those within its panel have surfed the web. From this data, the company then extracts activity that’s considered to be specifically search-related.

[…] The qSearch figures are search-specific but not necessarily web-search specific. For example, a search performed at Yahoo Sports would count toward Yahoo’s overall total. That’s important to understand.

So, what am I missing? I can’t find any evidence that my perception of Google as the dominant player in this market is incorrect. If you know of research that contradicts this conclusion, I’d really love to know about it—please add a comment with a cite!

Right now, I don’t think that Microsoft’s search product is as good as Google’s. And I think that what Yahoo is doing with MyWeb is in fact the killer app of search. My working with MSN for a year isn’t going to suddenly catapult the company into a monopoly on web search (although it is giving me a fascinating view into how corporate culture influences the direction of products, not always in a good way). But I do think there’s value in leveling the playing field. Microsoft is going after search market share—that’s a given. If I’m here, I can try to help them do it in a way that benefits the users of their service. If I’m not here, the only thing that changes is that my input into the product disappears. My presence has no impact on Microsoft’s business practices or goals. But it might well result in some influence on the direction of their product development, and I’m okay with that.

At the end of the day, I still harbor a healthy distrust of most corporations and their cultures, regardless of how much I like the people that work there, or the products they produce.


Update, 2:51pm

SearchEngineWatch has a few other articles on market share. This one provides the May 2005 Nielsen NetRatings figures, showing Google with 48%, Yahoo with 21.2%, and MSN with 12.4%.

Posted at 10:59 AM | Permalink | Comments (5) | TrackBack (1)
categories: microsoft | search | technology