This document is taken from a presentation Mark O'Neil delivered at the 2017 Teaching & Learning Conference in Milan, Italy. You can download the presentation and script files at the end of this document.
Complete Refresh vs. Store/Delete
Complete Refresh | Explicit Store and Delete
---|---
Complete Refresh takes a single file and manages updates and deletes by comparing the data provided against what is stored in Learn. This carries substantial overhead: every record must be compared to determine whether it has changed (update), is new (store), or exists in Learn but not in the data file (delete). Complete Refresh is often easier from an SIS extraction, as you simply pull all the relevant data and create the feed file. | Explicit Store and Delete gives you granular control over exactly which records are acted upon, and is substantially faster because only the data present in the file is processed. Store and Delete files are harder to generate: the SIS must compare successive pulls and emit only the differences between them. Even so, the efficiency gained in processing often makes Store and Delete the more worthwhile method.
Example
Given:
Old.txt: A Complete Refresh feed from yesterday
EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2006|ddishez001|none|Dan|Dishez|ddishez@sununiversity.edu|Student
2000|jsmith001|none|John|Smith1|jsmith@sununiversity.edu|Student
2007|rrondelle001|none|Ronda|Rondelle|rrondella@sununiversity.edu|Student
2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student
2003|lgonzales001|none|Lois|Gonzales|lgonzales@sununiversity.edu|Student
2004|mmacneil001|none|Megan|MacNeil|mmacneil@sununiversity.edu|Student
2001|kthomas001|none|Kyle|Thomas2|kthomas@sununiversity.edu|Student
2005|mmacneil002|none|Mark|MacNeil|mmacneil1@sununiversity.edu|Student
And New.txt: A Complete Refresh feed from today
EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2006|ddishez001|none|Daniel|Dishez|ddishez@sununiversity.edu|Student
2005|mmacneil002|none|Mark|MacNeil|mmacneil1@sununiversity.edu|Student
2007|rrondelle001|none|Ronda|Rondelle|rrondella@sununiversity.edu|Student
2003|lgonzales001|none|Lois|Gonzales|lgonzales@sununiversity.edu|Student
2004|mmacneil001|none|Megan|MacNeil|mmacneil@sununiversity.edu|Student
2000|jsmith001|none|John|Smith1|jsmith@sununiversity.edu|Student
2020|bbaggins|none|bilbo|baggins|bbaggins@sununiversity.edu|Student
Process
- We need to compare the new and old Complete Refresh feeds to find the differences: sort both files and diff them
- Our DISABLE document should contain only records that are in the old feed but not in the new one
- Our STORE document should contain only records that are CHANGED or ADDED in the new feed
- ADDED means the record is only in the new feed; these are discovered via a second diff between old and new
- CHANGED means the record's key is in both the old and new feeds but the data differs, so it is a special case: the old version lands in the DISABLE file after the initial diff and must be removed from it
This may look like a lot of costly steps to run against a large set of refresh data, but I guarantee that this prep plus the Snapshot DELETE/STORE run will be significantly faster than a Complete Refresh.
Create Sorted files:
- $ sort New.txt > sortedNew.txt
- $ sort Old.txt > sortedOld.txt
Create DISABLE.txt file and write the proper header:
- $ echo 'EXTERNAL_PERSON_KEY|…|LASTNAME|EMAIL|INSTITUTION_ROLE' > DISABLE.txt
Generate the DISABLEd users (those not in the new file):
- $ diff sortedNew.txt sortedOld.txt | grep '> ' | sed 's/> //g' >> DISABLE.txt
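The diff step above can be exercised standalone. The sketch below uses the article's file names but trims the sample feeds down to three users; lines that diff prefixes with `> ` exist only in the old feed:

```shell
# Build trimmed versions of the sorted feeds from the example above.
cat > sortedNew.txt <<'EOF'
2003|lgonzales001|none|Lois|Gonzales|lgonzales@sununiversity.edu|Student
2006|ddishez001|none|Daniel|Dishez|ddishez@sununiversity.edu|Student
EOF
cat > sortedOld.txt <<'EOF'
2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student
2003|lgonzales001|none|Lois|Gonzales|lgonzales@sununiversity.edu|Student
2006|ddishez001|none|Dan|Dishez|ddishez@sununiversity.edu|Student
EOF

# Write the header, then append every record that is only in the old feed.
echo 'EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE' > DISABLE.txt
diff sortedNew.txt sortedOld.txt | grep '^> ' | sed 's/^> //' >> DISABLE.txt
cat DISABLE.txt
```

Note that the 2006 record lands in DISABLE.txt at this stage even though it is really an update (Dan became Daniel); that is exactly the special case the next step cleans up.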
Remove the updated records from the DISABLE file:
- $ ./sedRemoveUpdates.sh
DISABLE.txt contents before sedRemoveUpdates:
EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student
2001|kthomas001|none|Kyle|Thomas2|kthomas@sununiversity.edu|Student
2006|ddishez001|none|Daniel|Dishez|ddishez@sununiversity.edu|Student
DISABLE.txt contents after sedRemoveUpdates:
EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student
2001|kthomas001|none|Kyle|Thomas2|kthomas@sununiversity.edu|Student
Running sedRemoveUpdates.sh removes the updates (which were matched in the diff), where sedRemoveUpdates.sh is:
#!/bin/bash
# Remove updated records from DISABLE.txt: any key that still appears in
# sortedNew.txt is an update, not a delete, and must not be disabled.
while read -r line; do
    matchMe=$(cut -d'|' -f1 <<< "$line")
    # Skip the header line.
    [ "$matchMe" = "EXTERNAL_PERSON_KEY" ] && continue
    # Anchor the match on the key field so 200 cannot match 2000.
    if grep -q "^$matchMe|" sortedNew.txt; then
        sed -i.bak "/^$matchMe|/d" ./DISABLE.txt
    fi
done < DISABLE.txt
exit 0
Now DISABLE.txt contains only those EXTERNAL_PERSON_KEY records that are not in the sortedNew.txt file.
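If the per-line grep/sed loop proves slow on very large feeds, the same filtering can be sketched as a single awk pass. This is an alternative I'm substituting for sedRemoveUpdates.sh, not part of the original presentation; it uses the same file names, with trimmed sample data so it runs standalone:

```shell
# Trimmed sample inputs (cut down from the article's example data).
cat > sortedNew.txt <<'EOF'
2003|lgonzales001|none|Lois|Gonzales|lgonzales@sununiversity.edu|Student
2006|ddishez001|none|Daniel|Dishez|ddishez@sununiversity.edu|Student
EOF
cat > DISABLE.txt <<'EOF'
EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student
2006|ddishez001|none|Dan|Dishez|ddishez@sununiversity.edu|Student
EOF

# First pass (NR==FNR) records every key in the new feed; the second pass keeps
# the header (FNR==1) plus any record whose key is absent from the new feed.
awk -F'|' 'NR==FNR { seen[$1] = 1; next } FNR == 1 || !($1 in seen)' \
    sortedNew.txt DISABLE.txt > DISABLE.tmp && mv DISABLE.tmp DISABLE.txt
cat DISABLE.txt
```

The awk version reads each file exactly once instead of running grep and sed per DISABLE record, so its cost grows linearly with feed size.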
We can then use a similar process, reversing the diff, to find the added and updated users. This produces a smaller STORE file that reflects only the additions and changed lines in sortedNew.txt, built by copying the new and updated entries from a slightly altered sortedNew file into a STORE.txt file.
Copy sortedNew to storeNew:
$ cp sortedNew.txt storeNew.txt
Create the STORE.txt file and write the proper header:
$ echo 'EXTERNAL_PERSON_KEY|…|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE' > STORE.txt
Add the missing old data by appending DISABLE.txt to storeNew.txt and re-sorting:
$ cat DISABLE.txt >> storeNew.txt
$ sort storeNew.txt > sortedStoreNew.txt
Diff the combined data against the old file and append the differences to STORE.txt:
$ diff sortedStoreNew.txt sortedOld.txt | grep '< ' | sed 's/< //g' >> STORE.txt
STORE.txt:
EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2006|ddishez001|none|Daniel|Dishez|ddishez@sununiversity.edu|Student
2020|bbaggins|none|bilbo|baggins|bbaggins@sununiversity.edu|Student
DISABLE.txt:
EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE
2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student
2001|kthomas001|none|Kyle|Thomas2|kthomas@sununiversity.edu|Student
Archive Old.txt and save New.txt as Old.txt for the next update.
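Putting it all together, the whole delta generation can be collected into one sketch script. This is my consolidation, not a script from the presentation: it keeps the article's file names, substitutes a single awk pass for the sedRemoveUpdates.sh loop, and builds trimmed sample feeds so it runs standalone:

```shell
#!/bin/bash
# End-to-end sketch: turn two Complete Refresh feeds into STORE/DISABLE deltas.
HEADER='EXTERNAL_PERSON_KEY|USER_ID|SYSTEM_ROLE|FIRSTNAME|LASTNAME|EMAIL|INSTITUTION_ROLE'

# Trimmed sample feeds (one update, one delete, one add).
cat > Old.txt <<EOF
$HEADER
2002|ttsai001|none|Tevis|Tsai|ttsai@sununiversity.edu|Student
2006|ddishez001|none|Dan|Dishez|ddishez@sununiversity.edu|Student
EOF
cat > New.txt <<EOF
$HEADER
2006|ddishez001|none|Daniel|Dishez|ddishez@sununiversity.edu|Student
2020|bbaggins|none|bilbo|baggins|bbaggins@sununiversity.edu|Student
EOF

sort New.txt > sortedNew.txt
sort Old.txt > sortedOld.txt

# DISABLE: records only in the old feed, minus keys still present in the new
# feed (those are updates, filtered out by the awk pass).
echo "$HEADER" > DISABLE.txt
diff sortedNew.txt sortedOld.txt | grep '^> ' | sed 's/^> //' |
    awk -F'|' 'NR==FNR { seen[$1] = 1; next } !($1 in seen)' sortedNew.txt - >> DISABLE.txt

# STORE: records added or changed in the new feed. Appending the disabled
# records first keeps them out of the diff's "new side".
echo "$HEADER" > STORE.txt
cat sortedNew.txt DISABLE.txt | sort > sortedStoreNew.txt
diff sortedStoreNew.txt sortedOld.txt | grep '^< ' | sed 's/^< //' |
    grep -v '^EXTERNAL_PERSON_KEY' >> STORE.txt
```

With the sample data above, DISABLE.txt ends up holding only the departed user (2002) and STORE.txt holds the changed record (2006, Daniel) plus the new one (2020).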
Additional thoughts
- Faster by orders of magnitude
- Easy to insert into Snapshot SIS Framework flow
- Less prone to “where’d my data go?” scenarios
- Timestamp and archive your files, and prep for the next iteration:
- archive_dir="./archives"
- now=$(date +%Y-%m-%d.%H:%M:%S)
- mv Old.txt "$archive_dir/archive.$now"
- mv New.txt Old.txt
Downloads
- Scripts required to provide SIS deltas
- 'Complete Refresh is Evil!' Presentation
Brilliant. Thanks!