The general logic for executing the OneNote data ingest:

  1. Get an OAuth access token from Azure
  2. Call the Graph API to get all sections of the notebook
  3. Iterate over all the sections and pages in the notebook and append the section name, section ID, page title, page ID, and HTML content to a local data list
  4. Load the data into a Spark DataFrame and write it to a new Delta table
  5. Note that the Graph API paginates its results; you can raise the page size with the $top query parameter and follow the @odata.nextLink returned in each response to collect any remaining items (see the pagination sketch below)
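
A minimal sketch of that pagination pattern, using a hypothetical get_all_pages helper on top of the same requests/headers setup used in the ingest script below:

import requests

def get_all_pages(section_pages_url, headers, page_size=100):
    """Collect every page under a URL by following @odata.nextLink (sketch)."""
    results = []
    # page_size is passed via $top; the exact maximum depends on the endpoint.
    url = f"{section_pages_url}?$top={page_size}"
    while url:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
        body = response.json()
        results.extend(body["value"])
        # Graph includes @odata.nextLink only when more results remain.
        url = body.get("@odata.nextLink")
    return results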

OneNote data ingest PySpark code:

import requests
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Initialize Spark session
spark = SparkSession.builder.appName("OneNoteDataPipeline").getOrCreate()

# Authentication
def get_access_token():
    url = "https://login.microsoftonline.com/982d56ce-f6e7-4334-a1c4-5d6779c789a6/oauth2/v2.0/token"
    payload = {
        "grant_type": "client_credentials",
        "client_id": "[client_id]",
        "client_secret": "[client_secret]",
        "scope": "https://graph.microsoft.com/.default",
    }
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    response = requests.post(url, data=payload, headers=headers)
    if response.status_code == 200:
        return response.json()["access_token"]
    else:
        raise Exception(f"Failed to get token: {response.status_code}, {response.text}")

access_token = get_access_token()
headers = {"Authorization": f"Bearer {access_token}"}

# 1. Get sections of the notebook
sections_url = "https://graph.microsoft.com/v1.0/users/[user]/onenote/notebooks/[notebook]/sections"
sections_response = requests.get(sections_url, headers=headers)
if sections_response.status_code != 200:
    raise Exception(f"Failed to get sections: {sections_response.status_code}, {sections_response.text}")
sections = sections_response.json()["value"]

# 2. Iterate over sections to get pages
data = []
for section in sections:
    section_id = section["id"]
    section_name = section["displayName"]

    pages_url = f"https://graph.microsoft.com/v1.0/users/[user]/onenote/sections/{section_id}/pages"
    pages_response = requests.get(pages_url, headers=headers)
    if pages_response.status_code != 200:
        raise Exception(f"Failed to get pages for section {section_id}: {pages_response.status_code}, {pages_response.text}")
    pages = pages_response.json()["value"]

    # 3. Iterate over pages to get HTML content
    for page in pages:
        page_id = page["id"]
        page_title = page["title"]

        content_url = f"https://graph.microsoft.com/v1.0/users/[user]/onenote/pages/{page_id}/content"
        content_response = requests.get(content_url, headers=headers)
        if content_response.status_code != 200:
            raise Exception(f"Failed to get content for page {page_id}: {content_response.status_code}, {content_response.text}")
        html_content = content_response.text

        # Append the data to the list
        data.append((section_name, section_id, page_title, page_id, html_content))

# 4. Store the data in a Delta table in Fabric's Lakehouse
# Define schema for Spark DataFrame
schema = StructType([
    StructField("section_name", StringType(), True),
    StructField("section_id", StringType(), True),
    StructField("page_title", StringType(), True),
    StructField("page_id", StringType(), True),
    StructField("html_content", StringType(), True),
])

# Create Spark DataFrame
df = spark.createDataFrame(data, schema)

# Write to Fabric's Lakehouse (assuming a Lakehouse connection is set up)
lakehouse_path = "abfss://[email protected]/OneNoteContent.Lakehouse/Tables/onenote_content"
df.write.format("delta").mode("overwrite").save(lakehouse_path)

print(f"Data saved successfully to {lakehouse_path}")

Running this lands the OneNote data in the Lakehouse as Parquet files (the underlying storage for Delta), which can then be registered as a Delta table:
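
A minimal sketch of that registration step, assuming the Lakehouse path above and a table name of onenote_content (matching what the Confluence script below reads):

# Register the Delta files written above as a queryable table.
# The path and table name are assumptions that mirror the read in the next script.
delta_path = "abfss://[email protected]/OneNoteContent.Lakehouse/Tables/onenote_content"

# Read the Delta files back into a DataFrame
onenote_df = spark.read.format("delta").load(delta_path)

# Register them as a named table for SQL access
spark.sql(f"CREATE TABLE IF NOT EXISTS onenote_content USING DELTA LOCATION '{delta_path}'")

# Quick sanity check on the ingested rows
spark.sql("SELECT section_name, page_title FROM onenote_content LIMIT 10").show(truncate=False)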

The general logic for the Confluence page creation is:

  1. Create all the custom folders in the appropriate space (the folder list can be derived from the Delta table's section_name column, as sketched below)
  2. For each record, look up the folder that matches its section name and create a new page there containing the record's HTML content
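
One way to build that folder list without hard-coding it is to pull the distinct section names from the Delta table; a sketch, assuming the table written by the ingest script above:

# Derive the folder names from the distinct section names in the Delta table.
sections_df = spark.read.format("delta").load(
    "abfss://[email protected]/OneNoteContent.Lakehouse/Tables/onenote_content"
)

folders = [
    row["section_name"].strip()
    for row in sections_df.select("section_name").distinct().collect()
]
print(f"{len(folders)} folders to create: {folders}")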

import requests

# Load lakehouse data
lakehouse_df = spark.read.format("delta").load(
    "abfss://[email protected]/OneNoteContent.Lakehouse/Tables/onenote_content"
)

# Confluence API details
confluence_url = "https://gscs.atlassian.net/wiki/rest/api/content"
username = "[user_name]"
api_token = "[api_token]"
auth = (username, api_token)

# Parent folder ID where folders will be created
parent_folder_id = "25231644"  # Replace with your actual parent folder ID

# List of folders to create
folders = [
   [list of folders]
]

# Create folders in Confluence
def create_folder(folder_name):
    payload = {
        "type": "folder",
        "title": folder_name,
        "space": {"key": "IT"},
        "status": "current",
        "ancestors": [{"id": parent_folder_id}]
    }
    response = requests.post(confluence_url, auth=auth, headers={"Content-Type": "application/json"}, json=payload)
    if response.status_code == 200:
        print(f"Folder '{folder_name}' created successfully.")
        return response.json()["id"]
    else:
        print(f"Failed to create folder '{folder_name}': {response.text}")
        return None

# Create folders and store their IDs
folder_ids = {}
for folder in folders:
    folder_id = create_folder(folder)
    if folder_id:
        folder_ids[folder] = folder_id

print("Folder creation completed. Folder IDs:")
print(folder_ids)

# Create pages in Confluence
def create_page(page_title, html_content, ancestor_id):
    payload = {
        "type": "page",
        "title": page_title,
        "space": {"key": "IT"},
        "status": "current",
        "body": {"storage": {"value": html_content, "representation": "storage"}},
        "ancestors": [{"id": ancestor_id}]
    }
    response = requests.post(confluence_url, auth=auth, headers={"Content-Type": "application/json"}, json=payload)
    if response.status_code == 200:
        print(f"Page '{page_title}' created successfully.")
    else:
        print(f"Failed to create page '{page_title}': {response.text}")

# Process records in the Delta table and create pages in corresponding folders
for row in lakehouse_df.collect():
    section_name = row["section_name"].strip()
    page_title = row["page_title"].strip()
    html_content = row["html_content"].strip()

    # Retrieve the ancestor folder ID for the section, defaulting to a specific folder if not found
    ancestor_id = folder_ids.get(section_name, "[default_section_id]")  # Default folder ID
    create_page(page_title, html_content, ancestor_id)

Conclusion:

By leveraging the PySpark script, I was able to bulk-load all of the OneNote notes into a Delta table in the Lakehouse, keyed by page ID and section ID alongside the HTML content. That Delta table can then be referenced in a second PySpark script that creates the new Confluence pages from the HTML content. This approach automatically migrated more than 300 OneNote notes into Confluence; however, page attachments did not carry over correctly, and that problem needs further research (one possible direction is sketched below).
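
As a possible starting point for the attachment gap, a hedged sketch of uploading a local file to an existing Confluence page via the child attachment endpoint; the page ID, file path, and extraction of attachments from OneNote are placeholders, not part of the working pipeline above:

import requests

def upload_attachment(page_id, file_path, auth):
    """Attach a local file to an existing Confluence page (sketch)."""
    url = f"https://gscs.atlassian.net/wiki/rest/api/content/{page_id}/child/attachment"
    # Confluence requires this header for multipart attachment uploads.
    headers = {"X-Atlassian-Token": "nocheck"}
    with open(file_path, "rb") as f:
        files = {"file": (file_path.split("/")[-1], f)}
        response = requests.post(url, auth=auth, headers=headers, files=files)
    if response.status_code == 200:
        print(f"Attachment '{file_path}' uploaded to page {page_id}.")
    else:
        print(f"Failed to upload '{file_path}': {response.status_code}, {response.text}")

# Example usage (placeholder values)
# upload_attachment("123456", "/lakehouse/default/Files/diagram.png", auth)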

To automate this in the background:

We can leverage Fabric's scheduling tool, which runs the automation notebook based on a repeat interval, start time, end time, and time zone.
