December 30th, 2008 by Moti Karmona | מוטי קרמונה · 2 Comments
I have been analyzing, dreaming, monitoring, crawling, debugging, reading, breathing, cursing, scaling, visualizing and learning the social graph for the last couple of months, and I thought it might be a good idea to write a little something about The Social Graph Challenge, with a pragmatic twist on a few other common concepts.
——— Blitz Introduction to The Social Graph ———
The social graph is just a simplified mathematical abstraction in which nodes are people and edges are the relations between them.
In the last decade the internet has become more social than anyone expected, with the rapid growth and adoption of social networks, social media, and user-generated contributions and interactions.
Nowadays, there is a growing feeling that it is feasible to model and map the social web into a faithful replica of the real-life social graph.
——— Pragmatic Overview on The Social Graph Challenge ———
Modeling | Building | Processing | Size | Architecture
(1) Modeling the Social Graph
To better understand how complicated it is to create a vocabulary for expressing metadata about people, their interests, relationships and activities, simply pay a quick visit to the FOAF Project technical specification page.
The FOAF ("Friend of a Friend") Project has the most comprehensive model available today, and it still lacks some basic modeling granularity, e.g. time-awareness metadata, a privacy model, and a richer relationship model.
*** The Social Cloud
It is a common mistake to forget that people are more than flat internet identities (e.g. a LinkedIn profile); to complete the profile modeling we must add all of their content to the graph, e.g. personal blog, Flickr images, YouTube videos, Delicious bookmarks, tweets, blog comments etc.
Modeling all these content and consumption types yields a broader definition (a.k.a. The Social Cloud) with even more complex modeling challenges.
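To make the modeling gaps concrete, here is a minimal sketch (my own illustration, not FOAF's vocabulary) of the three entity types the post keeps coming back to: profiles, content items, and relations, including the time-awareness and privacy fields FOAF lacks:

```python
# Toy social-cloud model: all names and fields here are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Profile:
    id: str
    name: str
    network: str                       # e.g. "linkedin", "flickr"
    visibility: str = "public"         # crude stand-in for the missing privacy model

@dataclass
class ContentItem:
    id: str
    kind: str                          # "blog-post", "photo", "bookmark", ...
    author_id: str
    published: Optional[datetime] = None

@dataclass
class Relation:
    source_id: str
    target_id: str
    kind: str                          # "friend", "colleague", "follows", ...
    since: Optional[datetime] = None   # the time-awareness metadata FOAF lacks

me = Profile("p1", "Moti Karmona", "linkedin")
post = ContentItem("c1", "blog-post", me.id, datetime(2008, 12, 30))
knows = Relation(me.id, "p2", "colleague", datetime(2008, 1, 1))
print(post.author_id, knows.kind)  # p1 colleague
```

Even this toy version hints at the hard part: every field above needs a cross-network definition before two sites' data can live in one graph.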
(2) Building the Social Graph
*** The Paradigm Shift
While conventional internet crawlers follow hyperlinks within web pages and treat pages as plain text, social crawlers should have social "awareness":
- Identify and extract identity fragments (e.g. social network profiles, blog authors)
- Identify relationships (e.g. social networks connections, blog-roll fans)
- Identify relations between content and people (author, bookmark, reference etc.)
*** The Standards Dilemma – No Silver Bullet
Besides FOAF, there are several open standards, like RSS and ATOM for content syndication, and microformats like hCard and XFN for profile and network discovery, that seem promising and can help with the identification quest. But although these are being pushed by giants (e.g. the Google Social Graph API), adoption is still low and plagued by correctness and corruption issues – e.g. all these people claimed to be WordPress.com using the XFN (rel="me") microformat.
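For flavor, extracting XFN rel="me" identity claims takes only a few lines with the standard library; the HTML snippet and URL below are made up, and a real crawler would also need to handle encodings, relative URLs and broken markup:

```python
# Minimal XFN rel="me" extractor using only the stdlib HTML parser.
from html.parser import HTMLParser

class RelMeParser(HTMLParser):
    """Collects the href of every <a>/<link> element carrying rel="me"."""
    def __init__(self):
        super().__init__()
        self.claims = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        if "me" in rels and attrs.get("href"):
            self.claims.append(attrs["href"])

page = '<a rel="me" href="http://twitter.com/example">me on twitter</a>'
parser = RelMeParser()
parser.feed(page)
print(parser.claims)  # ['http://twitter.com/example']
```

The catch, as the WordPress.com example shows, is that anyone can publish a rel="me" claim; parsing it is easy, trusting it is the hard part.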
*** The Promise of Structured Sources (a.k.a. The structure myth)
The Myth: Most social media sites (e.g. Facebook, LinkedIn, MySpace, Flickr etc.) have publicly available, structured profile pages, so in principle all that needs to be done is some XPath magic on the HTML DOM to finish the parsing task.
But… most of the work isn't parsing but data modeling, which requires a deep understanding of each site's user model and usage.
- Many social media sites have EULA restrictions which prohibit any access to or use of the site's content, but if you are lucky you will get some official APIs instead.
- Social media sites make frequent (~weekly) structural changes to their CSS/HTML.
*** A Few More Challenges with Social Crawling:
- Privacy-Ownership-Control – The data is the property of the users
- Unstructured Sources – It isn't a trivial task to extract social entities from unstructured sources (e.g. blogs), and it might require offline semantic processing on your collected data.
- Cross-Network Relations – How to find those important hidden cross-network relations, e.g. between the biggest reliable network graph (e.g. Facebook) and the richest content contributions (e.g. the blogosphere, YouTube, Flickr etc.)
- Identifying Social Signs (e.g. social widgets, comments, blogrolls etc.)
- Social Graph update mechanism and crawler distribution
- Profile Canonization
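Profile canonization deserves a tiny illustration: many different URLs point at the same profile, and they should map to one canonical key before becoming a graph node. This is a hypothetical sketch, not any particular crawler's rules:

```python
# Toy profile-URL canonization: normalize host and path into one key.
from urllib.parse import urlsplit

def canonical_profile_key(url: str) -> str:
    """Lowercase the host, drop scheme/query/fragment/"www." and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")

print(canonical_profile_key("HTTP://www.example.com/in/moti/?src=feed"))
# example.com/in/moti
```

Real rules are messier (per-site vanity URLs, numeric ids vs. usernames), but without some canonical key every re-crawl creates duplicate nodes.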
(3) Processing the Social Graph
*** The Identity Crisis
- Filtering impersonation, e.g. all these sites use XFN (rel="me") to "say" they are TechCrunch
- Identifying non-individual identities (groups, shared authorship) and modeling them differently, e.g. a knitters' blog with 629 knitting contributors :)
- Striving to merge identities (a.k.a. profile fusion) when possible, e.g. Moti Karmona on LinkedIn and Moti Karmona on Facebook could be two instances (/profiles) of the same person, and merging these profiles will enable:
- Cross-network connectedness => bridging between network richness (e.g. Facebook) and content richness (e.g. the blogosphere)
- Richer people representation through identity aggregation => richer networks
- The Fusion Challenge: You can pay a short visit to the nearest social-aggregator directory, but you can't get away from more complex algorithms for disambiguating web appearances of people with common names, like a James Smith who doesn't "play" in the social-aggregation playground (like 98.7% of the graph).
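A toy scoring heuristic shows the shape of the fusion problem (this is my own illustration, not the actual disambiguation algorithm): require agreeing names, then score overlapping secondary attributes and linked URLs.

```python
# Toy profile-fusion score: name must match; location/employer/URL overlap adds confidence.
def fusion_score(p1: dict, p2: dict) -> float:
    if p1["name"].lower() != p2["name"].lower():
        return 0.0
    keys = ("location", "employer")
    hits = sum(1 for k in keys if p1.get(k) and p1.get(k) == p2.get(k))
    url_overlap = len(set(p1.get("urls", [])) & set(p2.get("urls", [])))
    return hits / len(keys) + min(url_overlap, 1)

a = {"name": "Moti Karmona", "employer": "Delver", "urls": ["http://karmona.com"]}
b = {"name": "Moti Karmona", "employer": "Delver", "urls": ["http://karmona.com"]}
print(fusion_score(a, b))  # 1.5: same name, same employer, one shared URL
```

For a rare name plus a shared URL this works well; for the James Smiths of the graph, name agreement carries almost no signal and everything rests on the secondary evidence.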
*** Graph Enrichment
- Implicit Relations – Enrich the network with "implicit" relationships (colleagues, graduates, neighbors). E.g. I have a LinkedIn profile and all my connections are hidden from public crawlers, but the fact that I work at Delver is public; if Delver is a startup with fewer than ~50 people, then there is a good chance I know all the other Delver employees => this simple heuristic rule can create an implicit relation between me and my co-workers at Delver without me explicitly claiming that I know them (as I did on Facebook).
- Generating the inverse relations when needed (Followed vs. Follower)
- Deeper, semantic extraction of social entities from unstructured content
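The small-company heuristic above can be sketched in a few lines (all names and thresholds are illustrative assumptions):

```python
# Toy implicit-colleague enrichment: people sharing a small public employer
# get a weighted "implicit" edge, even though neither declared the connection.
SMALL_COMPANY = 50

def implicit_colleague_edges(profiles):
    by_employer = {}
    for p in profiles:
        if p.get("employer"):
            by_employer.setdefault(p["employer"], []).append(p["id"])
    edges = []
    for ids in by_employer.values():
        if len(ids) < SMALL_COMPANY:
            edges += [(a, b, "implicit-colleague")
                      for i, a in enumerate(ids) for b in ids[i + 1:]]
    return edges

people = [{"id": 1, "employer": "Delver"}, {"id": 2, "employer": "Delver"}]
print(implicit_colleague_edges(people))  # [(1, 2, 'implicit-colleague')]
```

The design choice worth noting: implicit edges should carry a lower weight (or a distinct kind, as here) so downstream ranking can treat them as weaker evidence than an explicit Facebook connection.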
(4) The Social Graph Size
Let’s have some quick (and very dirty) guesstimates:
World population is approx. ~6.7 billion × 22% internet penetration => ~1.5 billion internet users
Let's say 65% of these users have some kind of presence in social media (~20% have more than one) => ~1 billion profiles × ~10 content items per profile
+ 1 billion profile nodes × ~100 network relations per profile => ~110 billion graph edges + ~10 billion graph nodes
It is highly dependent on the graph implementation, but with these numbers you can easily find yourself with ~1-2 terabytes of graph metadata alone (without content and profiles*)
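The guesstimate above, reproduced as arithmetic (the 16-bytes-per-edge figure is my own assumption: two 8-byte node ids and no edge payload):

```python
# Back-of-envelope social-graph sizing, following the post's numbers.
world_population = 6.7e9
internet_users = world_population * 0.22             # ~1.5 billion
profiles = 1e9                                       # ~65% of them, rounded as in the post

content_per_profile = 10
relations_per_profile = 100

content_nodes = profiles * content_per_profile       # ~10 billion content nodes
nodes = profiles + content_nodes                     # ~11 billion graph nodes

# explicit relations + profile->content ownership edges
edges = profiles * relations_per_profile + content_nodes   # ~110 billion edges

bytes_per_edge = 16   # assumption: two 8-byte node ids, no payload
terabytes = edges * bytes_per_edge / 1e12
print(round(nodes / 1e9), round(edges / 1e9), round(terabytes, 2))  # 11 110 1.76
```

So even the barest edge list, before any content, profiles or indexes, lands squarely in the post's ~1-2 TB range.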
(5) Two Cents on Social Graph Architecture
Updating and querying a gigantic, dynamic, distributed, directed, cyclic, colored, weighted graph has "some" algorithmic and computational complexity – a little more than a blog post could cover… ;-)
You can take a quick look at LinkedIn's tiny 15 GB, 25-million-node graph implementation to get a glimpse of the technological challenge…
* Note: Indexing content and profile data (e.g. for building a social search engine) is an architectural challenge equivalent to any modern search engine with a ~10-billion-document index.
This is only the tip of the iceberg but it is more than enough for one blog post ;)
Credit: All the images are taken from Tamar Hak's amazing artwork, including The Delver Kid image.
Tags: Delver · Disruptive Technology · Search · Social Network
December 14th, 2008 by Moti Karmona | מוטי קרמונה · 4 Comments
Google, Yahoo, Ask and Lycos have released* their top search terms for the past year (2008), and I have aggregated them for your convenience in one happy table below.
I don’t have anything smart to say about it but I did manage to pull out five intriguing insights.
My Five Cents:
- As it did last year, it seems like Y! has removed all the navigational queries from its report (I wonder why ;)
- "Poker" is the "Top Search Term of the Year" for the 3rd consecutive year on Lycos… (what is Lycos? :)
- Though she didn't make it to the White House, US vice-presidential candidate Sarah Palin captured the zeitgeist of internet users in 2008, while Obama came in 6th place.
- IMHO, Ask.com is just being too honest in its report – 50% of Ask's search terms are navigational queries, and the rest are boring.
- Britney Spears has been the most popular search term at Yahoo for seven of the past eight years!
Top Search Terms | 2008

| Rank | Google | Yahoo | Ask | Lycos |
|------|--------|-------|-----|-------|
| 1 | sarah palin | Britney Spears | Dictionary | Poker |
| 2 | beijing 2008 | WWE | MySpace | Paris Hilton |
| 3 | facebook login | Barack Obama | Google | YouTube |
| 5 | heath ledger | RuneScape | Facebook | Sarah Palin |
| 6 | obama | Jessica Alba | Coupons | Britney Spears |
| 7 | nasza klasa | Naruto | Cars | Clay Aiken |
| 8 | wer kennt wen | Lindsay Lohan | Craigslist | Pamela Anderson |
| 9 | euro 2008 | Angelina Jolie | Online degrees | Facebook |
| 10 | jonas brothers | American Idol | Credit score | Holly Madison |
Update (18 Dec. 2008): The top 10 search queries people used on Delicious in 2008 are: news, blogs, reference, wiki, restaurants, hotels, css, web 2.0, artists, music… I think it is loud and clear that the biggest bookmarking site isn't fulfilling its search potential (!)
* Note: Microsoft (Live) hasn't released its updated list yet, and AOL didn't break out overall terms, so they weren't included here.
Tags: Google · Internet · Search
December 12th, 2008 by Moti Karmona | מוטי קרמונה · 2 Comments
Warning: This post is interesting reading material only if you have Windows system-file corruption, and as a real alternative to the Experts Exchange conspiracy ;)
This small Vista saga started when I found myself unable to access domain assets (Exchange, domain servers, shared storage etc.)
Browsing quickly through the Event Viewer system logs, I found out that the Workstation, Netlogon and Computer Browser services were down due to a rather long and frustrating chain of service-dependency failures:
- The Netlogon service depends on the Workstation service which failed to start because of the following error: The dependency service or group failed to start. (Event ID 7001)
- The Computer Browser service depends on the Workstation service which failed to start because of the following error: The dependency service or group failed to start. (Event ID 7001)
- The Workstation service depends on the SMB 2.0 MiniRedirector service which failed to start because of the following error: The dependency service or group failed to start. (Event ID 7001)
- The SMB 2.0 MiniRedirector service depends on the SMB MiniRedirector Wrapper and Engine service which failed to start because of the following error: The dependency service or group failed to start. (Event ID 7001)
- The SMB 1.x MiniRedirector service depends on the SMB MiniRedirector Wrapper and Engine service which failed to start because of the following error: The dependency service or group failed to start. (Event ID 7001)
- The SMB MiniRedirector Wrapper and Engine service depends on the Redirected Buffering Sub System service which failed to start because of the following error: SMB MiniRedirector Wrapper and Engine is not a valid Win32 application. (Event ID 7001)
- The Redirected Buffering Sub System service failed to start due to the following error: Redirected Buffering Sub System is not a valid Win32 application. (Event ID 7000)
- The following boot-start or system-start driver(s) failed to load: CSC rdbss (Event ID 7026)
As a real IT expert, I tried 5 restarts before trying anything else ;)
So… to resolve this unfortunate issue, I had to use the notorious System File Checker tool (SFC.exe).
This poorly documented Windows utility scans all protected system files and replaces incorrect (corrupted, changed or missing) versions with correct Microsoft versions, and running it from the command prompt is much easier than booting off the DVD into repair mode.
Once you have an administrator command prompt open (click Start, click All Programs, click Accessories, right-click Command Prompt, and then click Run as administrator), you can run the utility by using the following syntax:
SFC [/SCANNOW] [/VERIFYONLY] [/SCANFILE=<file>] [/VERIFYFILE=<file>]
[/OFFWINDIR=<offline windows directory> /OFFBOOTDIR=<offline boot directory>]
The most useful option is to scan immediately, attempting to repair any files that are changed or corrupted:
SFC /SCANNOW
The scan replaced the corrupted system file rdbss.sys, and I was back in the domain-browsing business right after :)
Note: If SFC complains that it can't repair the corrupted files, then you will have to drill down into CBS.log to find what is corrupted and replace it yourself.
By the way, Marissa Mayer promised that the Chrome browser would be leaving beta (while Gmail is still in beta…) – it just did, yesterday – and that Google SearchWiki would soon have a toggle button that allows people to turn it off ("early Q1") – I can't wait… :)
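When drilling into CBS.log, Microsoft's documented trick is to filter for the SFC-specific `[SR]` entries rather than reading the whole log; a sketch of that (run in the same elevated prompt):

```shell
:: Extract only the System File Checker entries from CBS.log
:: and dump them to a readable file on the desktop.
findstr /c:"[SR]" %windir%\Logs\CBS\CBS.log > "%userprofile%\Desktop\sfcdetails.txt"
```

Lines mentioning "Cannot repair member file" point at the exact file you will need to replace by hand.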
Tags: Conspiracy · Tools
December 11th, 2008 by Moti Karmona | מוטי קרמונה · 3 Comments
WordPress 2.7 – "Coltrane" (named after John Coltrane) was released today, and I have just finished a 25-minute (including this post :) smooth upgrade using the WordPress Automatic Upgrade plugin.
Coltrane has a very cool and much faster new interface (contributed to by 150 people), and I haven't had any issues so far…
Much more details (+ visual introduction) on the WordPress.org blog.
Tags: Simplicity · Tools · WordPress
December 3rd, 2008 by Moti Karmona | מוטי קרמונה · 2 Comments
This post is a weird collection of three Internet Conspiracies from the last 24 hours.
Note: I do realize that this post is a creepy testimony to the fact that I might be building a search engine and reading too many blogs, and to the effect it might be having on my sense of judgment…
Conspiracy I (Google thinks Facebook is a dangerous Phishing Site)
Google Chrome browsers around the world have claimed today that Facebook is a dangerous Phishing Site (read more on Facebook Developer Forums)
Conspiracy II (Searching for a competitor 'search engine' with Yahoo)
Searching for 'Google' on Yahoo results in a suggestion to use Yahoo search: "You could go to Google. Or you could stay here and get straight to your answers." (it also works with 'ask', 'aol' and 'live' :)
Conspiracy III (Apple Anti-Virus or not)
Two weeks ago, Apple updated a technical note on its Support Web site that says:
“Apple encourages the widespread use of multiple antivirus utilities so that virus programmers have more than one application to circumvent, thus making the whole virus writing process more difficult.”
Yesterday, Apple removed the KnowledgeBase article from its support site (KBase Article HT2550 now points to a bare error page – see below) and Apple spokesperson Bill Evans explained:
“We have removed the KnowledgeBase article because it was old and inaccurate… The Mac is designed with built-in technologies that provide protection against malicious software and security threats right out of the box… However, since no system can be 100 percent immune from every threat, running antivirus software may offer additional protection.”
Tags: Conspiracy · Google · Internet