Jul 122012
 

Oh how I wish there was a out of the box solution to do this where you could just enter the source and destination urls and voila! the entire blog was migrated. Unfortunately there isnt πŸ™ (or atleast I am not aware of one).

So when I set out to migrate around a 1000 posts from 16 different Sharepoint blogs to WordPress, there were quite a few challenges on the way. The first step is to read the Sharepoint data.This would have been far easier with the proper database access to sharepoint db but unfortunately there was way too much red tape wrapped around it to get the access quick enough to start immediately, hence I took the dirty way out – to scrape the HTML pages and retrieving the data out of it. Scraping HTML is the usually the worst possible way to extract data from any source since the code is rarely ever reusable, bloated to handle nested tags and often filled with branching statements to handle special cases. Only go for the scraping approach if you don’t have any other way to access the data. What makes scraping bearable is the wonderful HTML Agility pack which I had blogged about earlier. Its XMLish approach to traverse HTML makes this activity quite easy.

This is the object that I used to represent each blog to be imported. It contains the source URL, destination blog, author Id. Each object represents the source site as well as the destination information needed to write to WordPress.

   class BlogsToBeMigrated
    {
        public string Url { get; set; }
        public int SiteID { get; set; }
        public int AuthorId { get; set; }
        public string DestinationBlogName { get; set; }
        public BlogsToBeMigrated(string url, int siteId, int authorID,string destination)
        {
            Url = url;
            SiteID = siteId;
            AuthorId = authorID;
            DestinationBlogName = destination;
        }
    }

This is the code to read each blog entry. Note that it takes in each blog object and extracts information out of it. The concatenation step would depend on how your sharepoint site is structured. Just make sure it points to the list view of the blog where all posts are listed in a tabular form. This helps us to take out each link and get the content to add in WordPress. To get the various xpath to navigate to the nodes, I used the chrome extension Xpath helper. This could be different for every sharepoint site. Just play around with the xpath till you get the required information

private static void MigrateSingleBlog(BlogsToBeMigrated blog)
{
    var siteUrl = String.Concat("YOUR SITE URL HERE", blog.Url, "/Lists/Posts/AllPosts.aspx");
    string siteList = ReadWebPage(siteUrl);
    DeletedBlogData(blog.SiteID);
    var listDoc = new HtmlDocument();
    listDoc.LoadHtml(siteList);
    var siteListNodes = listDoc.DocumentNode.SelectNodes("//td[@class='ms-vb-title']/table/tr/td/a");

    foreach (var site in siteListNodes)
    {
        var postUrl = String.Concat("YOUR SITE URL HERE", site.GetAttributeValue("href", "href"));
        var PageDump = ReadWebPage(postUrl);
        var postDoc = new HtmlDocument();
        postDoc.LoadHtml(PageDump);
        var siteContent = postDoc.DocumentNode.SelectSingleNode("//div[@class='ms-PostWrapper']");

        string postDate = siteContent.ChildNodes[0].InnerText;
        string postTitle = siteContent.ChildNodes[1].ChildNodes[0].ChildNodes[0].InnerText.Trim();
        var postContent = postDoc.DocumentNode.SelectSingleNode("//div[@class='ms-PostWrapper']/div[@class='ms-PostBody']/div");

        var postHtml = postContent.InnerHtml;

        MoveToWordPress(postDate, postTitle, postHtml, blog.Url,blog.SiteID,blog.AuthorId,blog.DestinationBlogName);
        Console.WriteLine(postDate + postTitle);
    }

}

Also remember to put this line in the constructor of your class before any of the HTML agility pack code is executed. This is needed because forms can be tricky elements in HTML due to their overlapping between tags which makes it difficult to parse the markup. This makes HTML Agility pack parse form tags as empty elements and the below line allows you to look inside them.

HtmlNode.ElementsFlags.Remove("form");

Now that I had access to all the sharepoint data the challenge was to enter this is in the right WordPress blogs. I took a look at CSV importer which allows bulk import of posts from CSV files. This step didnt work properly at all since the post content was way too large for a CSV file to be parsed properly. After numerous attempts to sanitize the CSV and escape each linebreak and comma, this step still ignored many valid posts and also filled gibberish in others. Then I thought of directly inserting the data in the WordPress database. Initially I was skeptical since wordpress might fill some related tables when a post was published, but found that there were no such problems. Directly inserting the data worked like a charm. It also allowed me to migrate the data multiple times each time I noticed an issue with improperly rendered markup

This is the method that enters in the WordPress blog table directly. Note that the multisite installation means different post tables which have a number in the table name. e.g. WP_2_posts, WP_3_posts etc. The index for the table name was in the BlogToBeMigrated object as I had manually created each wordpress blog which corresponded to a sharepoint blog. This step is fairly simple. It just creates a connection to the MySql database using the connector dlls, gets the maximum ID, increments it and uses that to insert a new post. The code isnt production standard but this isnt really something that I am looking to maintain for a long time. Till the migration is done right, we can just keep repeating with the required fixes and once its finished – you have a functioning site with no need to migrate anymore. Pragmatism wins.

private static void MoveToWordPress(string postDate,string postTitle,string postContent,string postUrl, int blogID, int authorID,string destinationBlogName)
{
   //Remember to include ConvertZeroDateTime=true in the connection string
    MySqlConnection wordpressConn = new MySqlConnection("DATABASE_CONNECTION STRING; ConvertZeroDateTime=true");

    using (wordpressConn)
    {
        wordpressConn.Open();
        int maxID = 0;
        var id = new MySqlCommand(String.Format("Select Max(ID) from WP_{0}_Posts", blogID), wordpressConn).ExecuteScalar();
        if (id.GetType() == typeof(System.DBNull))
            maxID = 1;
        else
            maxID = Convert.ToInt32(id);

        string SQLCommandText = "Insert into wp_{0}_posts (id,post_author,post_date,post_content,post_title,post_excerpt,post_status,comment_status,ping_status,post_name,to_ping,pinged,post_modified,post_modified_gmt,post_content_filtered,post_parent,guid,menu_order,post_type,comment_count)";
        SQLCommandText += " values (?ID,{1},?postDate,?postBody,?postTitle,'','publish','open','open',?postName,'','',?postDate,?postDate,'',0,?postGuid,0,'post',0)";
        HtmlDocument span = new HtmlDocument();

        MySqlCommand insertPost = new MySqlCommand(String.Format(SQLCommandText, blogID, authorID), wordpressConn);
        span.LoadHtml(postContent);

        var siteName = "YOUR_BLOG_URL_HERE/{0}/files/{1}";
        var imagenodes = span.DocumentNode.SelectNodes("//a/img");
        if (imagenodes != null)
        {
            foreach (var image in imagenodes)
            {
                var imageUrl = image.ParentNode.GetAttributeValue("href", "href");
                var imageThumbUrl = image.GetAttributeValue("src", "src");
                if (imageUrl.Contains("/sites"))
                {
                    var migratedFileName = imageUrl.Replace(String.Concat(postUrl, "/Lists/Posts/Attachments/"), string.Empty);
                    postContent = postContent.Replace(string.Format("href=\"{0}", imageUrl), String.Format("href=\"{0}", String.Format(siteName, destinationBlogName, migratedFileName)));
                           
                    postContent = postContent.Replace(string.Format("src=\"{0}", imageThumbUrl), String.Format("src=\"{0}", String.Format(siteName, destinationBlogName, migratedFileName)));
                }
                else
                    postContent = postContent.Replace(string.Format("src=\"{0}", imageThumbUrl), String.Format("src=\"{0}", imageUrl));
            }
        }
            span = new HtmlDocument();
            span.LoadHtml(postContent);
            var htmlNode = span.DocumentNode.SelectSingleNode("//span[@class='erte_embed']");
            if (htmlNode != null)
            {
                string urlValue = htmlNode.GetAttributeValue("id", "id");
                urlValue = HttpUtility.UrlDecode(urlValue);
                //   urlValue = urlValue;
                postContent = postContent.Replace(htmlNode.OuterHtml, urlValue);
            }

            maxID++;
            insertPost.Parameters.AddWithValue("?ID", maxID);
            insertPost.Parameters.AddWithValue("?postDate", DateTime.Parse(postDate));
            insertPost.Parameters.AddWithValue("?postBody", postContent);
            insertPost.Parameters.AddWithValue("?postTitle", postTitle);
            insertPost.Parameters.AddWithValue("?postName", postTitle.Replace('#', '-').Replace(' ', '-'));
            insertPost.Parameters.AddWithValue("?postGuid", "YOUR_BLOG_URL_HERE/?p=" + maxID);
            insertPost.ExecuteNonQuery();

        }
            
}

Note the additional processing around the image tags. This was because I had migrated the image files separately and wanted to update the image tag’s src attributes to reflect to the new path. If you plan on keeping your previous sharepoint installations up, then this step is optional since the attachments will be loaded from the sharepoint site anyway. But I would recommend migrating the images as well just for easier maintenance of the content.

The code to migrate a single image is below. The way to get all the image tags is very similar to how each blog content was retrieved. The only difference is that instead of entering in the database, we just use Agility pack to extract all image tags in the post contennt and call the below method to download it. The files are then saved in the wp-content directory and the path is updated in the migration logic.

  private static void DownloadImage(string url,string postUrl,string destination)
        {
            string saveDir = String.Format("C:\\Wordpress_Images\\{0}\\", destination);
            string filename = string.Concat(saveDir, url.Replace(string.Concat("YOUR_SHAREPOINT_URL", postUrl, "/Lists/Posts/Attachments/"), string.Empty)).Replace("/","\\");
            FileInfo fi = new FileInfo(filename);
            if (!fi.Directory.Exists)
                Directory.CreateDirectory(fi.Directory.FullName);
            //DirectoryInfo info = new DirectoryInfo(
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.UseDefaultCredentials = true;
            HttpWebResponse response = (HttpWebResponse)request.GetResponse();

            if ((response.StatusCode == HttpStatusCode.OK ||
                response.StatusCode == HttpStatusCode.Moved ||
                response.StatusCode == HttpStatusCode.Redirect) &&
                response.ContentType.StartsWith("image", StringComparison.OrdinalIgnoreCase))
            {

                using (Stream inputStream = response.GetResponseStream())
                using (Stream outputStream = File.OpenWrite(filename))
                {
                    byte[] buffer = new byte[4096];
                    int bytesRead;
                    do
                    {
                        bytesRead = inputStream.Read(buffer, 0, buffer.Length);
                        outputStream.Write(buffer, 0, bytesRead);
                    } while (bytesRead != 0);
                }
            }
        }